
Application Resiliency Using Netflix Hystrix

Resilience is the ability of the network to provide and maintain an acceptable level of service in the face of various faults and challenges to normal operation.
-Wikipedia
Ever since the term services, and more recently microservices, came into usage, application developers have been converting monolithic APIs into simple, single-function microservices. However, such conversions come with the cost of ensuring consistent response times and resiliency when certain dependencies become unavailable. For example, a monolithic web application that performs a retry for every call is potentially resilient to some extent, as it can recover when certain dependencies (such as databases or other services) are unavailable. This resilience comes without any additional network or code complexity.

For a service that orchestrates numerous dependencies, each invocation is costly, and a failure can lead to diminished user experience as well as to higher stress on the underlying system that is attempting to recover from the failure.

Circuit breaker pattern

Consider a typical use case: an e-commerce site is overloaded with requests on Black Friday, and the vendor providing the payment operations goes offline for a few seconds due to heavy traffic. Users begin to see long wait times for their checkouts due to the high concurrency of requests. These conditions also keep all of the application servers clogged with threads waiting for a response from the vendor. After a long wait, the eventual result is a failure.

Such events lead to abandoned carts and to users refreshing or retrying their checkouts, which adds load to application servers that already have long-waiting threads and leads to network congestion.

A circuit breaker is a simple structure that remains constantly vigilant, monitoring for faults. In the scenario above, the circuit breaker identifies the long wait times among the calls to the vendor and fails fast, returning an error response to the user instead of making threads wait. This spares users from severely degraded response times.

The basic idea behind the circuit breaker is very simple. You wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all. Usually you’ll also want some kind of monitor alert if the circuit breaker trips.

-Martin Fowler

Recovery time is crucial for the underlying resource, and having a circuit breaker that fails fast without overloading the system ensures that the vendor can recover quickly.

A circuit breaker is an always-on component that keeps watch over dependency invocations. When the failure rate is high, it stops calls from going through for a short period of time, responding immediately with a standard error instead.
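To make the idea concrete, here is a minimal, illustrative sketch of wrapping a protected call so that it fails fast once a failure threshold is reached. This is not Hystrix and not eBay's implementation; the class name and threshold handling are invented purely for illustration.

    import java.util.concurrent.Callable;
    import java.util.concurrent.atomic.AtomicInteger;

    // Illustrative only: a minimal fail-fast wrapper, not the Hystrix implementation.
    public class SimpleCircuitBreaker<T> {

        private final int failureThreshold;               // failures tolerated before the circuit opens
        private final AtomicInteger failureCount = new AtomicInteger();
        private volatile boolean open = false;            // true = fail fast, no calls go through

        public SimpleCircuitBreaker(int failureThreshold) {
            this.failureThreshold = failureThreshold;
        }

        public T call(Callable<T> protectedCall) throws Exception {
            if (open) {
                // Fail fast instead of tying up a thread on a struggling dependency.
                throw new IllegalStateException("Circuit is open; dependency call rejected");
            }
            try {
                T result = protectedCall.call();
                failureCount.set(0);                      // a success resets the failure count
                return result;
            } catch (Exception e) {
                if (failureCount.incrementAndGet() >= failureThreshold) {
                    open = true;                          // trip the circuit
                }
                throw e;
            }
        }
    }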

Circuit breakers in eBay

In earlier times, we used a simple setup called AUTO_MARK_DOWN, which prevented such long-wait problems by short-circuiting invocations of a dependency until it was brought back up via MARK_UP. A bot periodically checked each machine for dependencies in AUTO_MARK_DOWN and performed the MARK_UP.

However, the bot and MARK_UP infrastructure was not embedded within the system, but located externally. Due to the absence of live, continuous feedback about request volume and failure rates, the MARK_UP of a failing dependency would occur without verifying its availability. Relying on this setup also led to false positives, as the bot sat outside the client and could not evaluate the continuity of failures.

Another major flaw of the setup was the absence of comprehensive, real-time monitoring for all of an application's dependencies. The old system was slow and erratic, had no ongoing telemetry, and blindly marked all systems back up on the assumption that the application would AUTO_MARK_DOWN a dependency again in the event of further failures. The result was unpredictable behavior and incorrect evaluation.

Recovery in a circuit breaker

A circuit breaker takes care of tripping the circuit for a dependency at the appropriate time. However, a more sophisticated system needs to remain vigilant, determine whether the dependency has become available again, and if so close the circuit to let dependent calls go through.

This behavior can be achieved in two ways:

  1. Allow all calls to go through during a regular time interval and check for errors.
  2. Allow a single call to go through at a more frequent interval to gauge availability.

AUTO_MARK_DOWN was a variant of Type 1, where the circuit is closed without any proof of recovery, relying on errors to identify an issue.

Type 2 is the more sophisticated mechanism: it does not allow multiple calls to go through, because those calls might take a long time to execute and still fail. Allowing only a single probe call instead permits more frequent checks, enabling faster closure of the circuit and quicker revival of the system.
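A minimal sketch of this single-probe recovery check might look like the following; the class name, sleep window, and probe bookkeeping are assumptions made purely for illustration and are not how Hystrix structures its internals.

    import java.util.concurrent.atomic.AtomicBoolean;

    // Illustrative sketch of the single-probe recovery check; all names are invented.
    public class HalfOpenGate {

        private final long sleepWindowMillis;             // how long to stay open before probing
        private volatile long openedAt;                   // when the circuit was last tripped
        private final AtomicBoolean probeInFlight = new AtomicBoolean(false);

        public HalfOpenGate(long sleepWindowMillis) {
            this.sleepWindowMillis = sleepWindowMillis;
            this.openedAt = System.currentTimeMillis();
        }

        /** Returns true for exactly one caller once the sleep window has elapsed. */
        public boolean allowSingleProbe() {
            boolean windowElapsed = System.currentTimeMillis() - openedAt >= sleepWindowMillis;
            return windowElapsed && probeInFlight.compareAndSet(false, true);
        }

        /** Report the probe's outcome: failure restarts the sleep window; success frees the gate
            so the owning breaker can close the circuit. */
        public void onProbeResult(boolean success) {
            if (!success) {
                openedAt = System.currentTimeMillis();    // stay open and wait for the next window
            }
            probeInFlight.set(false);
        }
    }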

Ideal circuit breaker

A harmonious system is one with an ideal circuit breaker, real-time monitoring, and a fast recovery setup, making the application truly resilient.

Circuit Breaker + Real-time Monitoring + Recovery = Resiliency

– Anonymous

Returning to the e-commerce example above, with a resilient system in place the circuit breaker continuously evaluates faults from the payment vendor and identifies long wait times or errors. On such occurrences it breaks the circuit, failing fast. As a result, users are notified of the problem quickly, and the vendor has enough time to recover.

In the meantime, the circuit breaker keeps sending a single request at regular intervals to evaluate whether the vendor system is back. If so, it closes the circuit immediately, allowing the remaining calls to go through successfully and effectively removing the problems of network congestion and long wait times.

Netflix Hystrix

Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.

-Netflix

Since its inception in 2012, Hystrix has become the go-to solution for many systems seeking to improve their ability to sustain failures. It has a fairly mature API and a highly tunable configuration system, enabling application developers to make optimal use of their underlying service dependencies.

Circuit breaker (CB) states for Hystrix

The following state diagram and narrative depict how a resilient system functions during the various states of a circuit breaker's lifecycle.

Circuit Breaker State Diagram

Normal function (Closed)

When a system is functioning smoothly, its resiliency is measured by the state of its success counters, while failures are tracked using failure gauges. This design ensures that when the failure threshold is reached, the circuit breaker opens the circuit to prevent further calls to the dependent resource.

Failure state (Open)

At this juncture, every call to the dependency is short-circuited with a HystrixRuntimeException and a FailureType of SHORTCIRCUIT, giving a clear indication of the cause. Once the configured sleep window passes, the Hystrix circuit breaker moves into a half-open state.

Half-open state

In this state, Hystrix lets a single request through to check system availability, letting other requests fail fast until the response is obtained. If the call is successful, the circuit breaker is reset to Closed; in case of failure, the circuit goes back to the Open state, and the cycle continues.
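These thresholds and the sleep window are tunable through HystrixCommandProperties. The sketch below shows roughly how a command might configure them; the command name, group key, placeholder vendor call, and the specific values are illustrative, not recommendations.

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;
    import com.netflix.hystrix.HystrixCommandProperties;

    public class PaymentCommand extends HystrixCommand<String> {

        public PaymentCommand() {
            super(Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("PaymentGroup"))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                    // minimum requests in the rolling window before the circuit can trip
                    .withCircuitBreakerRequestVolumeThreshold(20)
                    // error percentage at or above which the circuit opens
                    .withCircuitBreakerErrorThresholdPercentage(50)
                    // how long to stay open before allowing the half-open probe request
                    .withCircuitBreakerSleepWindowInMilliseconds(5000)));
        }

        @Override
        protected String run() {
            return callPaymentVendor();                   // placeholder for the real dependency call
        }

        private String callPaymentVendor() {
            return "OK";
        }
    }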

How to use Hystrix

The Hystrix GitHub wiki has comprehensive documentation on how to use the library. Getting started is as simple as creating a class that extends HystrixCommand and wraps the call to the service dependency.

    public class CommandHelloWorld extends HystrixCommand<String> {

        private final String name;

        public CommandHelloWorld(String name) {
            super(HystrixCommandGroupKey.Factory.asKey("ExampleGroup"));
            this.name = name;
        }

        @Override
        protected String run() {
            // a real example would do work like a network call here
            return "Hello " + name + "!";
        }
    }

Reference: https://github.com/Netflix/Hystrix/wiki/Getting-Started

Internally, this class utilizes the RxJava library to perform asynchronous invocation of the service dependencies. This design helps the application manage its resources intelligently by using the application threads to their maximum potential. For application developers who perform parallel processing and manage their dependencies using lazy invocations, Hystrix also exposes Future<?> and Observable<?>.
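As a rough usage sketch building on the CommandHelloWorld class above, the same command can be invoked synchronously with execute(), asynchronously with queue(), or reactively with observe(). The wrapper class below is invented purely for illustration.

    import java.util.concurrent.Future;

    import rx.Observable;

    public class CommandHelloWorldUsage {

        public static void main(String[] args) throws Exception {
            // Synchronous: blocks until run() (or the fallback) completes.
            String sync = new CommandHelloWorld("World").execute();

            // Asynchronous: queue() returns a Future backed by the same execution.
            Future<String> future = new CommandHelloWorld("Bob").queue();
            String async = future.get();

            // Reactive: observe() starts execution immediately and returns an RxJava Observable;
            // toObservable() would defer execution until subscription.
            Observable<String> observable = new CommandHelloWorld("Alice").observe();
            observable.subscribe(System.out::println);

            System.out.println(sync + " / " + async);
        }
    }

Note that a HystrixCommand instance can be executed only once, which is why each invocation above constructs a new command.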

How we use Hystrix in eBay

Multiple applications within eBay have started using Hystrix, either as a standalone library or with our platform wrappers. Our platform wrappers expose the Hystrix configurations in the form of JMX beans for centralized management. Our wrappers also inject custom Hystrix plugin implementations to capture the real-time metrics being published and to feed them to the site monitoring systems for critical applications.
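As a hedged sketch of what such a plugin might look like (not our actual wrapper), a custom HystrixEventNotifier can be registered through HystrixPlugins to forward command events to a monitoring system; the MonitoringSink class here is hypothetical.

    import com.netflix.hystrix.HystrixCommandKey;
    import com.netflix.hystrix.HystrixEventType;
    import com.netflix.hystrix.strategy.HystrixPlugins;
    import com.netflix.hystrix.strategy.eventnotifier.HystrixEventNotifier;

    public class MetricsBootstrap {

        public static void registerNotifier() {
            // Plugins must be registered before any HystrixCommand is executed.
            HystrixPlugins.getInstance().registerEventNotifier(new HystrixEventNotifier() {
                @Override
                public void markEvent(HystrixEventType eventType, HystrixCommandKey key) {
                    // Forward each command event (SUCCESS, FAILURE, SHORT_CIRCUITED, ...) to monitoring.
                    MonitoringSink.record(key.name(), eventType.name());
                }
            });
        }

        // Hypothetical stand-in for a real metrics pipeline.
        static class MonitoringSink {
            static void record(String command, String event) {
                System.out.println(command + " -> " + event);
            }
        }
    }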

The Hystrix dashboard is integrated as part of the core server-monitoring systems, enabling teams to view how their application dependencies are performing during various times of the day.

The execution hook provided by Hystrix is a critical component of this integration, as it helps monitor and alert on various failures in real time, especially errors and fallback failures, thereby helping teams investigate and resolve issues more quickly with little to no user impact.
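For context on the fallback failures mentioned above, a command declares its fallback by overriding getFallback(). The sketch below is illustrative only; the command name and fallback value are not taken from our codebase.

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    public class GetUserPreferencesCommand extends HystrixCommand<String> {

        private final String userId;

        public GetUserPreferencesCommand(String userId) {
            super(HystrixCommandGroupKey.Factory.asKey("UserPreferencesGroup"));
            this.userId = userId;
        }

        @Override
        protected String run() {
            // A real implementation would call the preferences service here;
            // throwing forces the fallback for illustration.
            throw new RuntimeException("dependency unavailable");
        }

        @Override
        protected String getFallback() {
            // Served when run() fails, times out, or is short-circuited;
            // an exception thrown here is what surfaces as a "fallback failure".
            return "default-preferences-for-" + userId;
        }
    }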

eBay example: Secure Token service

eBay hosts a slew of service APIs for both internal and external consumption. All of these services are authenticated via tokens, with the Secure Token service acting as the issuer and validator of these tokens. The Guards in all of the services are now upgraded with the Hystrix-based circuit breaker, which enables the Secure Token service to be highly available. In times of heavy traffic from one of the services, the circuit breaker for that service trips and opens the circuit, failing calls only to that specific service while allowing the other services to function normally.

Secure Token Service protected using Hystrix
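A simplified sketch of how a token-validation call might be keyed per calling service, so that each consumer gets its own circuit, is shown below. The class, keys, and validation stub are assumptions for illustration and not the actual Guard implementation.

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;
    import com.netflix.hystrix.HystrixCommandKey;

    public class ValidateTokenCommand extends HystrixCommand<Boolean> {

        private final String token;

        public ValidateTokenCommand(String callingService, String token) {
            // One command key per calling service, so each consumer gets its own circuit
            // and a spike from one service trips only that service's circuit.
            super(Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("SecureTokenService"))
                .andCommandKey(HystrixCommandKey.Factory.asKey("ValidateToken-" + callingService)));
            this.token = token;
        }

        @Override
        protected Boolean run() {
            return validateWithTokenService(token);       // placeholder for the real validation call
        }

        private Boolean validateWithTokenService(String token) {
            return Boolean.TRUE;
        }
    }

Because the set of calling services is small and known, the number of command keys (and therefore circuits) stays bounded.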

The circuit breaker is the default one available through the Hystrix library. Its functioning can be summarized as follows (a short snippet after the list shows how to inspect the circuit's state at runtime):

  1. Every incoming call is verified against the current state of the circuit breaker.
  2. A Closed state allows requests to be sent through.
  3. An Open state fails all requests.
  4. A Half-Open state (entered once the sleep window has elapsed) allows one request to go through and, based on its success or failure, moves the circuit to the Closed or Open state as appropriate.
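For completeness, the outcome of this state machine can be inspected on the command itself, for example via isCircuitBreakerOpen() and isResponseShortCircuited(). The small snippet below reuses the CommandHelloWorld example from earlier and is only illustrative.

    public class CircuitStateCheck {

        public static void main(String[] args) {
            CommandHelloWorld command = new CommandHelloWorld("World");
            String result = command.execute();

            // After execution, the command reports how the circuit handled it.
            System.out.println("result: " + result);
            System.out.println("circuit open? " + command.isCircuitBreakerOpen());
            System.out.println("short-circuited? " + command.isResponseShortCircuited());
        }
    }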

Conclusion

Hystrix is not just a circuit breaker, but a complete library with extensive monitoring capabilities that can be easily plugged into existing systems. We have started exploring the library's Request Collapsing and Request Caching abilities for future use cases. There are a few other Java-based implementations available, such as the Akka and Spring circuit breakers, but Hystrix has proven to be a sound and mature library for maintaining a resilient environment for our critical applications, providing high availability at all times.


Comments

  1. Kishore Senji

    Circuit breakers protect an application or service from the failures of its dependent services. Even though some functionality might be unavailable, the upstream service remains available while a downstream service is down or having problems. A side effect is that the downstream services naturally get some breathing space while recovering, without additional load exacerbating the problem.

    If we view it this way, the eBay example seems to be more like rate limiting on the Secure Token Service than a typical circuit breaker pattern. You would still need a circuit breaker pattern on the service providers (1 to N) to safeguard themselves when the Secure Token Service is down or having problems meeting its SLA. Because there could always be a service N+1 that does not have this code, or there could be a bug that requires an upgrade, it is better to have rate limiting done on the downstream service itself.

    In my view, CB should be used to protect an upstream service from a downstream service and rate limiting to protect a downstream service from an upstream service.

    1. Senthilkumar Gopal (post author)

      Hi Kishore,
      Thanks for reading the post and for your comment. I definitely agree that the Secure Token service should be protected by a rate-limiting service, and it is indeed protected against irregular spikes of traffic. However, the nature of the STS is to serve all service calls as performantly as possible and to work at higher call rates, as it is a critical blocking call given its function of verifying authentication.

      As you correctly noted, the circuit breaker usually helps the upstream service maintain functionality even when downstream services are unavailable. Here, however, the CB also helps ensure that the availability of a critical downstream service is not brought down by a sudden spike in activity on an upstream service.

      For example, if we see a heavy spike in listings (such as the infamous Flappy Bird incident), we have to maintain the availability of the downstream shared service irrespective of heavy traffic from one of the upstream services (in this case the listing service), to ensure functionality for other services such as checkout, sign-in, and so on. Rate limiting on the downstream service would reject the spike of calls, resulting in failures for the upstream service; however, we would like the upstream services to function successfully, as a failure in calling the STS results in a failure of authentication.

      In such scenarios, CBs allow us to manage the calls to extract maximum potential out of the upstream services without harming the downstream ones.

      I do agree that this is not how a traditional CB is usually viewed, but it is a slightly modified way of using it to safeguard the STS and maintain maximum functioning of the upstream services.


  2. Kishore Senji

    Rate limiting on the downstream service is better, as the downstream service has the full picture and different algorithms to play with. For example, suppose you have 5 consumers for a service, each with a quota of 20%. Instead of a consumer making the myopic decision not to make the call once it reaches its 20%, the downstream service can reject calls once they exceed the quota. It can return a special response code, based on which the consumer can behave the same way it would with a CB that skips the call. The benefit of having the downstream service do the rate limiting is the following: based on current usage, it can allow a specific consumer to go over its quota. Say the current usage of all the other consumers is just 30%; it can allow one consumer to exceed its 20% quota (as the combined usage would still be lower than 100%). It can also make priority-based decisions, for example giving priority to one consumer over others. If other consumers are not that important, the service can take away their quota and give it to the privileged consumer. It can make heuristic or ML-based decisions. All of these things are only possible when rate limiting is done on the downstream service, and not when the consumer blocks itself from making calls.

  3. Senthilkumar Gopal (post author)

    Absolutely correct. Rate limiting is definitely crucial for the downstream service, and we do have rate limiting with the suggested features, such as consumer-based and time-based limits. However, that alone is not sufficient, as clients would still suffer if there were a sudden spike in traffic, and we need to allow that consumer to call the downstream service to accommodate those requests.

    Rate limiting is all about pre-calculation; our usage of Hystrix, in contrast, is about ongoing telemetry and the ability to extract as much as possible from the downstream service.

