Application Resiliency Using Netflix Hystrix


Resilience is the ability of the network to provide and maintain an acceptable level of service in the face of various faults and challenges to normal operation.
-Wikipedia
Ever since the terms services and, more recently, microservices came into use, application developers have been converting monolithic APIs into simple, single-function microservices. However, such conversions come with the cost of ensuring consistent response times and resiliency when certain dependencies become unavailable. For example, a monolithic web application that performs a retry for every call is potentially resilient to some extent, as it can recover when certain dependencies (such as databases or other services) become unavailable, and this resilience comes without any additional network or code complexity.

For a service that orchestrates numerous dependencies, each invocation is costly, and a failure can lead to diminished user experience as well as to higher stress on the underlying system that is attempting to recover from the failure.

Circuit breaker pattern

Consider a typical use case: an e-commerce site is overloaded with requests on Black Friday, and the vendor providing the payment operations goes offline for a few seconds due to heavy traffic. Users begin to see long wait times for their checkouts due to the high concurrency of requests. These conditions also keep all of the application servers clogged with threads that are waiting for a response from the vendor. After a long wait, the eventual result is a failure.

Such events lead to abandoned carts, or to users refreshing or retrying their checkouts, which further increases the load on application servers that already have long-waiting threads and leads to network congestion.

A circuit breaker is a simple structure that constantly remains vigilant, monitoring for faults. In the scenario described above, the circuit breaker identifies the long wait times on calls to the vendor and fails fast, returning an error response to the user instead of making the threads wait. Thus, the circuit breaker spares users a severely degraded response time.

The basic idea behind the circuit breaker is very simple. You wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all. Usually you’ll also want some kind of monitor alert if the circuit breaker trips.

-Martin Fowler

Recovery time is crucial for the underlying resource, and having a circuit breaker that fails fast without overloading the system ensures that the vendor can recover quickly.

A circuit breaker is an always-live system keeping watch over dependency invocations. In case of a high failure rate, the circuit breaker stops the calls from going through for a short period of time, responding instead with a standard error.

Circuit breakers in eBay

In earlier times, we used a simple setup called AUTO_MARK_DOWN, which prevented such long-wait problems by short-circuiting invocations of a dependency until it was brought back up via MARK_UP. A bot periodically checked the various machines for AUTO_MARK_DOWN on each of their dependencies and performed the MARK_UP.

However, the bot and MARK_UP infrastructure was not embedded within the system, but rather located externally. Due to the absence of live, constant feedback about request volume and failure rates, the MARK_UP of a failing dependency would occur without verifying its availability. Relying on this setup also led to false positives, as the bot sat outside the client and could not evaluate whether failures were continuing.

Another major flaw of the setup was the absence of a comprehensive, real-time monitoring structure covering all of an application's dependencies. The old system was slow and erratic, had no ongoing telemetry, and blindly marked all systems back up on the assumption that the application would AUTO_MARK_DOWN a dependency again in the event of further failures. The result was unpredictable behavior and incorrect evaluation.

Recovery in a circuit breaker

A circuit breaker takes care of tripping the circuit for a failing dependency at the appropriate time. However, a more sophisticated system needs to remain vigilant, determine whether the dependency has become available again, and if so close the circuit to let dependent calls go through.

This behavior can be achieved in two ways:

  1. Allow all calls to go through during a regular time interval and check for errors.
  2. Allow one single call to go through at a more frequent rate to gauge the availability.

AUTO_MARK_DOWN was a variant of Type 1, where the circuit is closed without any proof of recovery, relying on errors to identify an issue.

Type 2 is the more sophisticated mechanism: it does not allow multiple calls to go through, because those calls might take a long time to execute and still fail. Allowing only a single probe call instead permits more frequent checks, enabling faster closure of the circuit and faster revival of the system.
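
To make the distinction concrete, here is a minimal, hypothetical sketch of a Type 2 breaker in Java. It is not the eBay or Hystrix implementation, and it omits the stricter thread-safety a production breaker would need around the probe; it only illustrates the fail-fast and single-probe behavior described above.

    import java.util.concurrent.Callable;
    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.atomic.AtomicLong;

    /**
     * Simplified circuit breaker: after the breaker opens, roughly one probe
     * call per sleep interval is allowed through; its outcome decides whether
     * the circuit closes again.
     */
    public class SimpleCircuitBreaker {

        private final int failureThreshold;
        private final long sleepIntervalMillis;

        private final AtomicInteger consecutiveFailures = new AtomicInteger();
        private final AtomicLong openedAt = new AtomicLong(-1); // -1 means the circuit is closed

        public SimpleCircuitBreaker(int failureThreshold, long sleepIntervalMillis) {
            this.failureThreshold = failureThreshold;
            this.sleepIntervalMillis = sleepIntervalMillis;
        }

        public <T> T call(Callable<T> protectedCall, T fallback) {
            long openedTimestamp = openedAt.get();
            boolean open = openedTimestamp >= 0;
            boolean probeAllowed =
                    open && System.currentTimeMillis() - openedTimestamp >= sleepIntervalMillis;

            if (open && !probeAllowed) {
                return fallback; // fail fast while the circuit is open
            }
            try {
                T result = protectedCall.call();
                // success: reset the failure count and close the circuit
                consecutiveFailures.set(0);
                openedAt.set(-1);
                return result;
            } catch (Exception e) {
                if (probeAllowed) {
                    // the probe failed: stay open and restart the sleep interval
                    openedAt.set(System.currentTimeMillis());
                } else if (consecutiveFailures.incrementAndGet() >= failureThreshold) {
                    openedAt.set(System.currentTimeMillis()); // trip the circuit
                }
                return fallback;
            }
        }
    }

A checkout call could then be wrapped as, for example, breaker.call(() -> vendorClient.charge(order), fallbackResponse), where the vendor client and fallback are whatever the application already provides.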

Ideal circuit breaker

A harmonious system is one that combines an ideal circuit breaker, real-time monitoring, and a fast recovery mechanism, making the application truly resilient.

Circuit Breaker + Real-time Monitoring + Recovery = Resiliency

– Anonymous

Using the example of the e-commerce site from above, with a resilient system in place, the circuit breaker keeps an ongoing evaluation of the faults from the payments processor and identifies long wait times or errors from the vendor. On such occurrences, it breaks the circuit, failing fast. As a result, users are notified of the problem and the vendor has enough time to recover.

In the meantime, the circuit breaker also keeps sending one request at regular intervals to evaluate if the vendor system is back again. If so, the circuit breaker closes the circuit immediately, allowing the rest of the calls to go through successfully, thereby effectively removing the problem of network congestion and long wait times.

Netflix Hystrix

Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.

-Netflix

Since its inception in 2012, Hystrix has become the go-to solution for many systems attempting to improve their ability to sustain failures. It has a fairly mature API and a highly tunable configuration system, enabling application developers to make optimal use of their underlying service dependencies.

Circuit breaker (CB) states for Hystrix

The following state diagram and narrative depict how a resilient system functions during the various states of a circuit breaker’s lifecycle.

Circuit Breaker State Diagram

Normal function (Closed)

When the system is functioning smoothly, successes are tracked by the success counters while failures are tracked by the failure gauges. This design ensures that when the failure threshold is reached, the circuit breaker opens the circuit to prevent further calls to the dependent resource.

Failure state (Open)

At this juncture, every call to the dependency is short-circuited with a HystrixRuntimeException carrying a FailureType of SHORTCIRCUIT, giving a clear indication of the cause. Once the sleep interval passes, the Hystrix circuit breaker moves into a half-open state.
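
For example, a caller can distinguish a short-circuited call from other failures by inspecting the exception. The brief sketch below uses the CommandHelloWorld command shown in the Getting Started example later in this post; the client class name is hypothetical.

    import com.netflix.hystrix.exception.HystrixRuntimeException;
    import com.netflix.hystrix.exception.HystrixRuntimeException.FailureType;

    public class ShortCircuitAwareClient {
        public static void main(String[] args) {
            try {
                // CommandHelloWorld has no fallback, so an open circuit surfaces as an exception
                String greeting = new CommandHelloWorld("World").execute();
                System.out.println(greeting);
            } catch (HystrixRuntimeException e) {
                if (e.getFailureType() == FailureType.SHORTCIRCUIT) {
                    // the circuit is open: the protected call was never attempted,
                    // so the caller can degrade gracefully or surface a clear error
                }
            }
        }
    }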

Half-open state

In this state, Hystrix takes care of sending the first request to check system availability, letting all other requests fail fast until the response is obtained. If the call is successful, the circuit breaker is reset to Closed; in case of failure, the system goes back to the Open state, and the cycle continues.
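
The thresholds and sleep window that drive these state transitions are configurable per command through the standard Hystrix property setters. The following is a brief sketch; the command name, property values, and vendor call are illustrative only, not recommendations.

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;
    import com.netflix.hystrix.HystrixCommandProperties;

    public class PaymentCommand extends HystrixCommand<String> {

        protected PaymentCommand() {
            super(Setter
                    .withGroupKey(HystrixCommandGroupKey.Factory.asKey("PaymentGroup"))
                    .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                            // minimum requests in the rolling window before the breaker may open
                            .withCircuitBreakerRequestVolumeThreshold(20)
                            // error percentage at which the circuit opens
                            .withCircuitBreakerErrorThresholdPercentage(50)
                            // how long to stay open before the half-open probe
                            .withCircuitBreakerSleepWindowInMilliseconds(5000)));
        }

        @Override
        protected String run() {
            // the call to the payment vendor would go here
            return "OK";
        }

        @Override
        protected String getFallback() {
            // returned when the call fails or the circuit is open
            return "PAYMENT_UNAVAILABLE";
        }
    }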

How to use Hystrix

The Hystrix GitHub wiki has comprehensive documentation on how to use the library. It is as simple as creating a command class that wraps the call to the service dependency.

    public class CommandHelloWorld extends HystrixCommand<String> {

        private final String name;

        public CommandHelloWorld(String name) {
            super(HystrixCommandGroupKey.Factory.asKey("ExampleGroup"));
            this.name = name;
        }

        @Override
        protected String run() {
            // a real example would do work like a network call here
            return "Hello " + name + "!";
        }
    }

Reference: https://github.com/Netflix/Hystrix/wiki/Getting-Started

Internally, this class uses the RxJava library to perform asynchronous invocation of the service dependencies. This design helps the application manage its resources intelligently by using the application threads to their maximum potential. For application developers who perform parallel processing and manage their dependencies using lazy invocations, Hystrix also exposes Future<?> and Observable<?>.
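
For instance, the command above can be run synchronously, queued as a Future, or observed reactively. This is a short sketch; the client class name is hypothetical, and each command instance can be executed only once, hence the separate instances.

    import java.util.concurrent.Future;
    import rx.Observable;

    public class CommandHelloWorldClient {
        public static void main(String[] args) throws Exception {
            // synchronous execution: blocks until the result (or fallback) is available
            String blocking = new CommandHelloWorld("World").execute();
            System.out.println(blocking);

            // asynchronous execution: queue() returns a Future backed by a Hystrix thread pool
            Future<String> future = new CommandHelloWorld("Future").queue();
            System.out.println(future.get());

            // reactive execution: observe() starts the command immediately and emits the result
            // through an RxJava Observable (toObservable() defers execution until subscription)
            Observable<String> observable = new CommandHelloWorld("Observable").observe();
            observable.subscribe(System.out::println);
        }
    }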

How we use Hystrix in eBay

Multiple applications within eBay have started using Hystrix, either as a standalone library or with our platform wrappers. Our platform wrappers expose the Hystrix configurations as JMX beans for centralized management. The wrappers also inject custom Hystrix plugin implementations to capture the real-time metrics being published and feed them to the site monitoring systems for critical applications.

The Hystrix dashboard is integrated as part of the core server-monitoring systems, enabling teams to view how their application dependencies are performing during various times of the day.

The execution hook provided by Hystrix is a critical component of this integration, as it helps monitor and alert on various failures in real time, especially errors and fallback failures, thereby helping teams investigate and resolve issues more quickly with little to no user impact.
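
The following is a minimal sketch of such a hook, assuming Hystrix 1.4 or later; the logging statement stands in for the actual monitoring integration, which is not shown here.

    import com.netflix.hystrix.HystrixInvokable;
    import com.netflix.hystrix.exception.HystrixRuntimeException.FailureType;
    import com.netflix.hystrix.strategy.HystrixPlugins;
    import com.netflix.hystrix.strategy.executionhook.HystrixCommandExecutionHook;

    public class MonitoringExecutionHook extends HystrixCommandExecutionHook {

        @Override
        public <T> Exception onError(HystrixInvokable<T> commandInstance,
                                     FailureType failureType, Exception e) {
            // forward to the site monitoring system here (placeholder logging only)
            System.err.printf("Hystrix failure %s in %s: %s%n",
                    failureType, commandInstance.getClass().getSimpleName(), e.getMessage());
            return e; // return the exception unchanged
        }

        /** Call once at application startup to install the hook. */
        public static void register() {
            HystrixPlugins.getInstance().registerCommandExecutionHook(new MonitoringExecutionHook());
        }
    }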

eBay example: Secure Token service

eBay hosts a slew of service APIs for both internal and external consumption. All of these services are authenticated via tokens, with the Secure Token service acting as the issuer and validator of these tokens. The Guards in all of the services are now upgraded with the Hystrix-based circuit breaker, which enables the Secure Token service to be highly available. In times of heavy traffic from one of the services, the circuit breaker for that service trips and opens the circuit, failing calls only to that specific service while allowing the other services to function normally.
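
One way to achieve this kind of per-service isolation is to give each calling service its own Hystrix command key, so that every caller gets its own circuit breaker. The sketch below is illustrative only; the command name, key scheme, and validation logic are hypothetical and not the actual Secure Token service implementation.

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;
    import com.netflix.hystrix.HystrixCommandKey;

    public class TokenValidationCommand extends HystrixCommand<Boolean> {

        private final String token;

        public TokenValidationCommand(String callingService, String token) {
            // one command key (and therefore one circuit breaker) per calling service,
            // so a misbehaving caller trips only its own circuit
            super(Setter
                    .withGroupKey(HystrixCommandGroupKey.Factory.asKey("SecureTokenService"))
                    .andCommandKey(HystrixCommandKey.Factory.asKey("validateToken-" + callingService)));
            this.token = token;
        }

        @Override
        protected Boolean run() {
            // the actual token validation call would go here
            return !token.isEmpty();
        }

        @Override
        protected Boolean getFallback() {
            // fail fast when the circuit for this caller is open
            return Boolean.FALSE;
        }
    }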

Secure Token Service protected using Hystrix

The circuit breaker used here is the default one available in the Hystrix library. Its functioning can be summarized as follows:

  1. Every incoming call is verified against the current state of the circuit breaker.
  2. A Closed state allows requests to be sent through.
  3. An Open state fails all requests.
  4. A Half-Open state (entered once the sleep time has elapsed) allows one request to go through; based on its success or failure, the circuit moves to the Closed or Open state as appropriate.

Conclusion

Hystrix is not just a circuit breaker, but also a complete library with extensive monitoring capabilities that can be easily plugged into existing systems. We have started exploring the library’s Request Collapsing and Request Caching abilities for future use cases. There are a few other Java-based implementations available, such as the Akka and Spring circuit breakers, but Hystrix has proven to be a sound and mature library for maintaining a resilient environment for our critical applications, providing high availability at all times.

References