Author Archives: Senthilkumar Gopal

Finite-State Machine for Single-Use Code Authentication

Introduction

eBay strives to excel at security and to identify new and improved mechanisms that let users access their accounts seamlessly while keeping fraudulent and malicious users at bay. This is a balancing act that every internet platform player, major and minor, performs every day. Passwords are one such mechanism used to secure a user’s account, but with limited success. They have always been a major pain point for users, as they must be built from a unique combination of hard-to-guess and hard-to-remember numbers, letters, and special characters. Ironically, this leads users to create passwords that are hard to remember but easy for hackers to brute force.

With new websites and platforms cropping up each day, remembering multiple passwords has become a daily struggle for users. Passwords are one of the weakest links in our attempt to secure user accounts, because many users either choose a weak password or reuse the same password across multiple services, making themselves vulnerable to phishing and other types of attacks.

To relieve users from ever needing to remember their password and to move towards the utopia of a “password-free” world, eBay has released the ability to log in with a one-time code, which can be delivered to the user’s phone, a personal and important device that they own. Along with the ease of getting disposable one-time codes, this also helps while traveling or when using a public computer or network, where users can log in with a one-time code rather than running the risk of having their regular passwords hijacked via a key logger, malware, or even a compromised network. More details about the feature are available in the release statement.

How does it work?

Any user can now use the “Sign in with a single-use code” link on the log-in page to request that a one-time code be delivered to their registered phone number via text messaging. Then the user can type in the code from the text message into the input field, securely getting into the account without the hassle of remembering or exposing the original password.

[Figures: sign-in window; get single-use code; next button]

The one-time codes are short-lived and cannot be transferred between sessions, making them highly isolated and secure in comparison to regular passwords, which can be used across any number of devices.
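The two properties described above, a short lifetime and binding to a single session, can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not eBay's implementation: the class name `OneTimeCode`, the six-digit length, and the caller-supplied time-to-live are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.time.Duration;
import java.time.Instant;

/** Hypothetical one-time code tied to one session and valid only until expiresAt. */
final class OneTimeCode {
    private static final SecureRandom RANDOM = new SecureRandom();

    final String code;
    final String sessionId;
    final Instant expiresAt;

    private OneTimeCode(String code, String sessionId, Instant expiresAt) {
        this.code = code;
        this.sessionId = sessionId;
        this.expiresAt = expiresAt;
    }

    /** Issues a random six-digit code (the length is an assumption) for the given session. */
    static OneTimeCode issue(String sessionId, Instant now, Duration timeToLive) {
        String code = String.format("%06d", RANDOM.nextInt(1_000_000));
        return new OneTimeCode(code, sessionId, now.plus(timeToLive));
    }

    /** Accepts only an unexpired code submitted from the session that requested it. */
    boolean accepts(String submittedCode, String submittingSessionId, Instant now) {
        // MessageDigest.isEqual compares in constant time, avoiding a timing side channel.
        boolean sameCode = MessageDigest.isEqual(
                code.getBytes(StandardCharsets.UTF_8),
                submittedCode.getBytes(StandardCharsets.UTF_8));
        return now.isBefore(expiresAt) && sessionId.equals(submittingSessionId) && sameCode;
    }
}
```

A code issued with `OneTimeCode.issue("session-1", Instant.now(), Duration.ofMinutes(5))` is rejected once five minutes have passed, and is also rejected when it is submitted from any session other than `session-1`.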

Behind the scenes

Given its criticality, the application architecture needs to be robust and secure, but also manageable and configurable. The structure should make the code more readable, which in turn helps ensure high-quality code in the production environment. One of the important characteristics of the application architecture is that the finite state machine used for generating and validating the one-time code must conform to all the rules that were set up, such as expiry and retry attempts.

[Figure: state machine]

Finite state machine

A finite-state machine (FSM) is a mathematical model of computation used to design both computer programs and sequential logic circuits. It is conceived as an abstract machine that can be in one of a finite number of states. The machine is in only one state at a time; the state it is in at any given time is called the current state. It can change from one state to another when initiated by a triggering event or condition; this is called a transition. A particular FSM is defined by a list of its states, and the triggering condition for each transition. Wikipedia contributors, “Finite-state machine,” Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/w/index.php?title=Finite-state_machine&oldid=731346054 (accessed August 18, 2016).

A simple representation of a Turnstile state machine is shown below.

[Figure: turnstile state machine]
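The turnstile can also be written down as a small transition table. The following is a minimal sketch in plain Java, not tied to any framework; the class name `Turnstile` and its method names are chosen for illustration only.

```java
import java.util.EnumMap;
import java.util.Map;

/** Minimal turnstile FSM: two states, two events, one transition table. */
class Turnstile {
    enum State { LOCKED, UNLOCKED }
    enum Event { COIN, PUSH }

    // Transition table: current state -> (event -> next state).
    private static final Map<State, Map<Event, State>> TRANSITIONS = new EnumMap<>(State.class);

    static {
        Map<Event, State> fromLocked = new EnumMap<>(Event.class);
        fromLocked.put(Event.COIN, State.UNLOCKED); // paying unlocks the arm
        fromLocked.put(Event.PUSH, State.LOCKED);   // pushing a locked arm does nothing
        TRANSITIONS.put(State.LOCKED, fromLocked);

        Map<Event, State> fromUnlocked = new EnumMap<>(Event.class);
        fromUnlocked.put(Event.COIN, State.UNLOCKED); // extra coins change nothing
        fromUnlocked.put(Event.PUSH, State.LOCKED);   // passing through locks it again
        TRANSITIONS.put(State.UNLOCKED, fromUnlocked);
    }

    private State current = State.LOCKED;

    /** Applies the event to the current state and returns the new current state. */
    State fire(Event event) {
        current = TRANSITIONS.get(current).get(event);
        return current;
    }

    State current() {
        return current;
    }

    public static void main(String[] args) {
        Turnstile turnstile = new Turnstile();
        System.out.println(turnstile.fire(Event.PUSH)); // prints LOCKED
        System.out.println(turnstile.fire(Event.COIN)); // prints UNLOCKED
        System.out.println(turnstile.fire(Event.PUSH)); // prints LOCKED
    }
}
```

Because every behavior lives in the table, adding a state or an event is a data change rather than a new branch in the control flow.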

Advantages

An FSM provides many advantages over traditional rule-based systems.

  • States, transitions, events, and their conditions maintained in a modular and declarative fashion
  • Pluggable nature of transitions and conditions
  • Abstraction of states, actions, and their roles
  • Clarity of behavior during firing of events
  • Most importantly, ease of ensuring secure state management

To illustrate, imagine a user has requested a one-time code and received it on their phone. However, while entering the code, they mistyped it more times than the allowed number of attempts. In the case of a rule engine or a simple if/else sequence, there is a high probability of a bug being introduced, as there is no state maintenance, only rule/logic evaluations, which can allow the user to provide the correct code and log in successfully even after exceeding the allowed number of attempts.

However, on a state machine, once the user attempts exceed the limits, the state machine transitions to an AttemptExceeded/Failed state, making it impossible for the user to attempt a code validation even with the correct code. This structure guarantees that the shortcomings of a sequence model execution are not possible in a state machine.

If-else code for validation

public int validateCode(long transId, String code) {
    int currentStatus = getPersistence().getCurrentStatus(transId);
    // Only a transaction whose code has been sent is eligible for validation.
    if (currentStatus == SENT) {
        boolean status = getSecureModule().validateCode(transId, code);
        if (status) {
            currentStatus = SUCCESS;
        } else {
            currentStatus = INVALID;
            // The attempt limit is enforced only by this extra check.
            if (getPersistence().verifyAttemptsExceeded(transId)) {
                currentStatus = ATTEMPT_EXCEEDED;
            }
        }
    }
    return currentStatus;
}
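For contrast with the if/else version above, here is a minimal sketch of the same validation written so that the state itself blocks further attempts. It is an illustration only: the class `CodeTransaction`, its in-memory `expectedCode` field, and the attempt counter are hypothetical stand-ins for the persistence and secure modules.

```java
/** Sketch of code validation in which terminal states absorb every later attempt. */
final class CodeTransaction {
    enum State { SENT, SUCCESS, ATTEMPT_EXCEEDED }

    private final String expectedCode; // hypothetical: held in memory only for the sketch
    private final int maxAttempts;
    private int failedAttempts;
    private State state = State.SENT;

    CodeTransaction(String expectedCode, int maxAttempts) {
        this.expectedCode = expectedCode;
        this.maxAttempts = maxAttempts;
    }

    /** Validates a submitted code; once a terminal state is reached, nothing is compared. */
    State validate(String submittedCode) {
        if (state != State.SENT) {
            return state; // terminal state: even the correct code is ignored
        }
        if (expectedCode.equals(submittedCode)) {
            state = State.SUCCESS;
        } else if (++failedAttempts >= maxAttempts) {
            state = State.ATTEMPT_EXCEEDED;
        }
        return state;
    }
}
```

With `new CodeTransaction("123456", 3)`, three wrong submissions move the transaction to `ATTEMPT_EXCEEDED`, and a later call with `"123456"` still returns `ATTEMPT_EXCEEDED`.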

Open source frameworks

Rather than writing an FSM from scratch, we evaluated two widely used frameworks based on maturity and support: Spring StateMachine and squirrel-foundation. Spring StateMachine seemed more suited to a daemon application, such as Zookeeper, and was not ready for a multi-threaded environment, such as a web application [Issue#170]. Moreover, squirrel-foundation provided the ability to add custom event types and actions, which are explained in more detail below. These factors led to the decision to use the squirrel-foundation framework to model the finite state machine.

State machine for one-time code

A simple representation of the state machine used to generate and validate the one-time code for phone is provided below.

[Figure: state machine for one-time code]

As illustrated, the one-time code validation moves between different states of the machine based upon the action of the user, the current state of the machine, and the conditions configured between them. For instance, when a user requests a code for the first time and the send fails for some reason, the state machine moves the transaction to a final FAILED state, rendering the transaction inert and effectively terminated.

Elaborating on the use case discussed earlier, where the user exceeds the number of incorrect code attempts, if the user performs the action of Requests Code retry, the retry-limit-exceeded condition fires and the state machine moves the current state to FAILED, terminating the transaction and preventing the user from re-using the session or the code further.

Set-up and configuration

To explain the usage and set-up better, the following snippets show some of the important parts of the state machine modeling code. They are neither complete nor compilable as is, and they are truncated for brevity.

Builder

The squirrel-foundation framework allows configuring the State Machine once and then creating a new instance of the state machine for each thread without incurring the expensive creation time. The newSM() method is invoked with the beginning state of the current transaction to get the State Machine ready, and when the events are fired, the State Machine takes care of identifying the next state to transfer to.

StateMachine Builder

public class PhoneSMBuilder {
	private static StateMachineBuilder<PhoneStateMachine, PhoneStateId, PhoneEvent, PhoneContext> stateMachineBuilder;

	static {
		buildSM();
	}

	// Configures the builder once; later calls return immediately.
	private static synchronized void buildSM() {
		if (null != stateMachineBuilder) {
			return;
		}
		stateMachineBuilder = StateMachineBuilderFactory.create(PhoneStateMachine.class, PhoneStateId.class, PhoneEvent.class, PhoneContext.class);
		stateMachineBuilder.setStateMachineConfiguration(StateMachineConfiguration.create());
	}

	....

	// Creates a started state machine positioned at the transaction's current state.
	public static PhoneStateMachine newSM(PhoneStateId initialState) {
		PhoneStateMachine stateMachine = stateMachineBuilder.newStateMachine(initialState);
		stateMachine.start();
		return stateMachine;
	}
}

Transitions and conditions

Each of the transitions from one state to another state is managed as an external transition and guarded by conditions. For the Phone StateMachine, the positive transitions are configured similar to below.

buildSM() – adding transitions

private static synchronized void buildSM() {
	.....
	stateMachineBuilder.externalTransition()
			.from(PhoneStateId.INITIAL).to(PhoneStateId.DELIVERED)
			.on(PhoneEvent.SEND_CODE)
			.when(checkEventActionExecutionResult());

	stateMachineBuilder.externalTransition()
			.from(PhoneStateId.DELIVERED).to(PhoneStateId.SUCCESS)
			.on(PhoneEvent.VALIDATE_CODE)
			.when(checkEventActionExecutionResult());
	....
}

private static Condition<PhoneContext> checkEventActionExecutionResult() {
	return new AnonymousCondition<PhoneContext>() {

		@Override
		public boolean isSatisfied(PhoneContext context) {
			return context.isActionSuccess();
		}
	};
}

As illustrated, Condition<T> is an interface provided by the squirrel-foundation framework for configuring the conditions specific to a state-to-state transition. When isSatisfied() returns true, the transition guarded by that condition fires.

NOTE: If more than one Condition fires for a transition from one initial state to two different end states, an exception is thrown. It is imperative that each condition is mutually exclusive with other conditions for the same initial state.
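One way to guard against overlapping conditions is a small check, run in a unit test, that evaluates every guard leaving a state against sample contexts and confirms that exactly one is satisfied. The sketch below uses plain `java.util.function.Predicate` rather than the framework's Condition type. The class `GuardCheck`, the `Ctx` stand-in for PhoneContext, and the exact form of the FAILED guard are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

final class GuardCheck {
    /** Hypothetical stand-in for PhoneContext, reduced to the two fields the guards read. */
    static final class Ctx {
        final boolean actionSuccess;
        final String error; // null when the action succeeded

        Ctx(boolean actionSuccess, String error) {
            this.actionSuccess = actionSuccess;
            this.error = error;
        }
    }

    /** Guards for VALIDATE_CODE leaving DELIVERED, keyed by target state (FAILED guard is assumed). */
    static Map<String, Predicate<Ctx>> deliveredGuards() {
        Map<String, Predicate<Ctx>> guards = new LinkedHashMap<>();
        guards.put("SUCCESS", c -> c.actionSuccess);
        guards.put("EXPIRED", c -> !c.actionSuccess && "ExpiredCode".equals(c.error));
        guards.put("FAILED", c -> !c.actionSuccess && !"ExpiredCode".equals(c.error));
        return guards;
    }

    /** Returns the target states whose guard is satisfied; more than one means the guards overlap. */
    static List<String> satisfiedTargets(Map<String, Predicate<Ctx>> guards, Ctx context) {
        List<String> targets = new ArrayList<>();
        for (Map.Entry<String, Predicate<Ctx>> entry : guards.entrySet()) {
            if (entry.getValue().test(context)) {
                targets.add(entry.getKey());
            }
        }
        return targets;
    }
}
```

If `satisfiedTargets` ever returns more than one entry for a sample context, two guards overlap and need to be tightened before the transitions are configured.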

Even for error scenarios, such as expired or attempts exceeded, the state transition is simple to configure, maintain, and change, ensuring the isolation of the change and easy testability.

buildSM() – adding transitions and conditions

private static synchronized void buildSM() {
	.....

	stateMachineBuilder.externalTransition()
			.from(PhoneStateId.DELIVERED).to(PhoneStateId.EXPIRED)
			.on(PhoneEvent.VALIDATE_CODE)
			.when(codeExpiredCondition());

	stateMachineBuilder.externalTransition()
			.from(PhoneStateId.DELIVERED).to(PhoneStateId.FAILED)
			.on(PhoneEvent.VALIDATE_CODE)
			.when(failureCheckCondition());
}

private static Condition<PhoneContext> codeExpiredCondition() {
	return new AnonymousCondition<PhoneContext>() {

		@Override
		public boolean isSatisfied(PhoneContext context) {
			return (!context.isActionSuccess() && context.getError() == Errors.ExpiredCode);
		}
	};
}

Persistence

For persistence and initialization, the State Machine is backed by a database. The database helps in reading the initial state, which helps in starting the State Machine and also in storing the resolved state after firing of the event in the State Machine. The framework provides appropriate hooks such as afterTransitionCompleted and afterTransitionDeclined for persisting the states. Squirrel-foundation also provides a mechanism to identify other unchecked exceptions using afterTransitionCausedException, which is useful for alerting and monitoring purposes.

PhoneStateMachine – adding afterTransitions

public class PhoneStateMachine extends AbstractStateMachine<PhoneStateMachine, PhoneStateId, PhoneEvent, PhoneContext> {

	@Override
	protected void afterTransitionCompleted(PhoneStateId fromState, PhoneStateId toState, PhoneEvent event, PhoneContext context) {
		persistStateTransition(fromState, toState, event, context);
	}

	@Override
	protected void afterTransitionDeclined(PhoneStateId fromState, PhoneEvent event, PhoneContext context) {
		persistStateTransition(fromState, null, event, context);
	}

	@Override
	protected void afterTransitionCausedException(PhoneStateId fromState, PhoneStateId toState, PhoneEvent event, PhoneContext context) {
	    logger.error("Exception during SM transition", getLastException().getTargetException());
	}
}

Custom state machine modification

Even though the squirrel-foundation framework satisfied all the needs, there was no available structure to bind a pre-Action to an Event. For example, the code has to be sent to the user before the state machine is triggered for INITIAL state. Similarly, the code should be validated against the database and relevant context results should be properly populated before firing the state machine. This was achieved by creating a custom state machine and overriding the fire() method to perform an associated Action and then firing the StateMachine.

PhoneStateMachine – Adding Pre-Action

public class PhoneStateMachine extends AbstractStateMachine<PhoneStateMachine, PhoneStateId, PhoneEvent, PhoneContext> {

    @Override
	public void fire(PhoneEvent event, PhoneContext context) {
		try {
			event.getEventAction().execute(context);
			if (context.getError() != null) {
				logger.error("Event execution failed in SM due to:" + context.getError().name());
			}
		} catch (Exception e) {
			logger.error("Exception in firing event: " + event, e);
			context.setError(Errors.UnknownError);
			return;
		}
		super.fire(event, context);
	}
}

Squirrel-foundation provides the freedom of defining custom types for all necessary parameters such as Events, Actions, Context etc., which makes it possible for each PhoneEvent to be configured with an associated Action.

PhoneEvent

public enum PhoneEvent {
	SEND_CODE(PhoneEventAction.sendCodeAction),
	VALIDATE_CODE(PhoneEventAction.validateCodeAction);

	private final PhoneEventAction<PhoneContext> eventAction;

	private PhoneEvent(PhoneEventAction<PhoneContext> eventAction) {
		this.eventAction = eventAction;
	}

	public PhoneEventAction<PhoneContext> getEventAction() {
		return eventAction;
	}
}


public interface PhoneEventAction<T> {

	void execute(T context);

	static final PhoneEventAction<PhoneContext> sendCodeAction = new PhoneEventAction<PhoneContext>() {

		@Override
		public void execute(PhoneContext context) {
			logger.debug("sendCodeAction: send code");
		}
	};

	static final PhoneEventAction<PhoneContext> validateCodeAction = new PhoneEventAction<PhoneContext>() {

		@Override
		public void execute(PhoneContext context) {
			logger.debug("validateCodeAction: verify code");
		}
	};
}

Conclusion

The state machine is a well-known structure for managing processing, and this is just one example of how a complex logical structure can be represented in a simple but effective and maintainable manner. This structure allows developers to manage changes and configure values effectively and with far fewer bugs. As future enhancements, the above state machine is being extended with better listeners for effective logging, self-healing mechanisms in case of failures, and a changeover to RxJava for state persistence.

Application Resiliency Using Netflix Hystrix

Resilience is the ability of the network to provide and maintain an acceptable level of service in the face of various faults and challenges to normal operation.
-Wikipedia

Ever since the term services, and more recently microservices, came into usage, application developers have been converting monolithic APIs into simple and single-function microservices. However, such conversions come with the cost of ensuring consistent response times and resiliency when certain dependencies become unavailable. For example, a monolithic web application that performs a retry for every call is potentially resilient to some extent, as it can recover when certain dependencies (such as databases or other services) are unavailable. This resilience comes without any additional network or code complexity.

For a service that orchestrates numerous dependencies, each invocation is costly, and a failure can lead to diminished user experience as well as to higher stress on the underlying system that is attempting to recover from the failure.

Circuit breaker pattern

Consider a typical use case: an e-commerce site is overloaded with requests on Black Friday, and the vendor providing the payment operations goes offline for a few seconds due to heavy traffic. The users begin to see long wait times for their checkouts due to the high concurrency of requests. These conditions also keep all of the application servers clogged with the threads that are waiting to receive a response from the vendor. After a long wait time, the eventual result is a failure.

Such events lead to abandoned carts or users trying to refresh or retry their checkouts, increasing the load on the application servers—which already have long-waiting threads, leading to network congestion.

A circuit breaker is a simple structure that constantly remains vigilant, monitoring for faults. In the above-mentioned scenario, the circuit breaker identifies long waiting times among the calls to the vendor and fails fast, returning an error response to the user instead of making the threads wait. Thus, the circuit breaker spares users from very poor response times.

The basic idea behind the circuit breaker is very simple. You wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all. Usually you’ll also want some kind of monitor alert if the circuit breaker trips.

-Martin Fowler

Recovery time is crucial for the underlying resource, and having a circuit breaker that fails fast without overloading the system ensures that the vendor can recover quickly.

A circuit breaker is an always-live system keeping watch over dependency invocations. In case of a high failure rate, the circuit breaker stops the calls from going through for a small amount of time and instead responds immediately with a standard error.

Circuit breakers in eBay

In earlier times, we used a simple setup called AUTO_MARK_DOWN, which prevented such long-wait problems in dependencies by short-circuiting the invocations until they were brought back up via MARK_UP. A bot periodically checked for AUTO_MARK_DOWN on various machines for each of their dependencies and performed MARK_UP.

However, the bot and MARK_UP infrastructure is not embedded within the system, but rather located externally. Due to the absence of live and constant feedback about the request volume and failure rates, the MARK_UP of a failing system dependency would occur without verifying its availability. Relying on this setup also leads to false positives, as the bot system is outside the client and cannot evaluate the continuity of failures.

Another major flaw with the setup is the absence of a comprehensive and real-time monitoring structure for all of the dependencies of any application. This old system is slow and erratic, does not have ongoing telemetry, and blindly marks all systems back up on the assumption that the application will AUTO_MARK_DOWN a dependency in the event of further failures. The result is unpredictable behavior and incorrect evaluation.

Recovery in a circuit breaker

A circuit breaker takes care of tripping the dependencies at the appropriate time. However, a more sophisticated system needs to continue the vigilance to determine if the dependency is available, and if so to close the circuit again to let dependent calls go through.

This behavior can be achieved in two ways:

  1. Allow all calls to go through during a regular time interval and check for errors.
  2. Allow one single call to go through at a more frequent rate to gauge the availability.

AUTO_MARK_DOWN was a variant of Type 1, where the circuit is closed without any proof of recovery, relying on errors to identify an issue.

Type 2 is a more sophisticated mechanism, as it does not allow multiple calls to go through: those calls might take a long time to execute and still fail. Allowing only a single call makes more frequent probing affordable, enabling faster closure of the circuit and revival of the system.

Ideal circuit breaker

A harmonious system is one where we have an ideal circuit breaker, real-time monitoring, and a fast recovery variable setup, making the application truly resilient.

Circuit Breaker + Real-time Monitoring + Recovery = Resiliency

– Anonymous

Using the example of the e-commerce site from above, with a resilient system in place, the circuit breaker keeps an ongoing evaluation of the faults from the payments processor and identifies long wait times or errors from the vendor. On such occurrences, it breaks the circuit, failing fast. As a result, users are notified of the problem and the vendor has enough time to recover.

In the meantime, the circuit breaker also keeps sending one request at regular intervals to evaluate if the vendor system is back again. If so, the circuit breaker closes the circuit immediately, allowing the rest of the calls to go through successfully, thereby effectively removing the problem of network congestion and long wait times.

Netflix Hystrix

Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.

-Netflix

From its inception in 2012, Hystrix has become the go-to solution for many systems attempting to improve their capabilities to sustain failures. It has a fairly mature API and a highly tunable configuration system, enabling application developers to provide optimal utilization of their underlying service dependencies.

Circuit breaker (CB) states for Hystrix

The following state diagram and narrative depicts how a resilient system functions during various states of a circuit breaker’s lifecycle.

[Figure: circuit breaker state diagram]

Normal function (Closed)

When a system is functioning smoothly, the resiliency is measured by the state of its success counters, while any failures are tracked using the failure gauges. This design ensures that when the threshold for failures is reached, the circuit breaker opens the circuit to prevent further calls to the dependent resource.
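The trip decision in the Closed state can be sketched as a comparison of success and failure counts against two thresholds. The sketch below is a simplification in plain Java, not Hystrix code: Hystrix evaluates these counts over a rolling time window, whereas the hypothetical `TripPolicy` class here simply receives the two totals. The parameter names mirror the Hystrix properties circuitBreaker.requestVolumeThreshold and circuitBreaker.errorThresholdPercentage.

```java
/** Simplified trip decision: open the circuit only with enough traffic and enough errors. */
final class TripPolicy {
    private final int requestVolumeThreshold;   // minimum calls before the circuit may trip
    private final int errorThresholdPercentage; // error percentage at or above which it trips

    TripPolicy(int requestVolumeThreshold, int errorThresholdPercentage) {
        this.requestVolumeThreshold = requestVolumeThreshold;
        this.errorThresholdPercentage = errorThresholdPercentage;
    }

    /** Returns true when the counted successes and failures justify opening the circuit. */
    boolean shouldOpen(long successes, long failures) {
        long total = successes + failures;
        if (total < requestVolumeThreshold) {
            return false; // too little traffic to judge the dependency
        }
        long errorPercentage = failures * 100 / total;
        return errorPercentage >= errorThresholdPercentage;
    }
}
```

With `new TripPolicy(20, 50)`, ten failures out of twenty calls open the circuit, while four failures out of nineteen calls do not, because the volume threshold has not been reached.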

Failure state (Open)

At this juncture, every call to the dependency is short-circuited with a HystrixRuntimeException and FailureType of SHORTCIRCUIT, giving clear indication of its cause. Once the sleep window (configured in Hystrix through circuitBreaker.sleepWindowInMilliseconds) passes, the Hystrix circuit breaker moves into a half-open state.

Half-open state

In this state, Hystrix takes care of sending the first request to check system availability, letting other requests fail-fast until the response is obtained. If the call is successful, the circuit breaker is reset to Closed; in case of failure, the system goes back to the Open state, and the cycle continues.

How to use Hystrix

The Hystrix GitHub wiki has comprehensive documentation of how to use the library. Using it is as simple as creating a command class that wraps the call to the service being consumed.

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class CommandHelloWorld extends HystrixCommand<String> {

    private final String name;

    public CommandHelloWorld(String name) {
        super(HystrixCommandGroupKey.Factory.asKey("ExampleGroup"));
        this.name = name;
    }

    @Override
    protected String run() {
        // a real example would do work like a network call here
        return "Hello " + name + "!";
    }
}

Reference: https://github.com/Netflix/Hystrix/wiki/Getting-Started

Internally, this class uses the RxJava library to invoke the service dependencies asynchronously. This design helps the application manage its resources intelligently by using the application threads to their maximum potential. For application developers who perform parallel processing and manage their dependencies using lazy invocations, Hystrix also exposes Future<?> through queue() and Observable<?> through observe() and toObservable(), in addition to the blocking execute().

How we use Hystrix in eBay

Multiple applications within eBay have started using Hystrix either as a standalone library or with our platform wrappers. Our platform wrappers expose the Hystrix configurations in the form of JMX beans for centralized management. Our wrappers also inject custom Hystrix plugin implementations to capture the real-time metrics being published and to feed them to the site monitoring systems for critical applications.

The Hystrix dashboard is integrated as part of the core server-monitoring systems, enabling teams to view how their application dependencies are performing during various times of the day.

The execution hook provided by Hystrix is a critical component of this integration, as it helps monitor and alert on various failures in real time, especially errors and fallback failures, thereby helping investigate and resolve issues more quickly with little to no user impact.

eBay example: Secure Token service

eBay hosts a slew of service APIs for both internal and external consumption. All of these services are authenticated via tokens, with the Secure Token service acting as the issuer and validator of these tokens. The Guards in all of the services are now upgraded with the Hystrix-based circuit breaker, which enables the Secure Token service to be highly available. In times of heavy traffic from one of the services, the circuit breaker for that service trips and opens the circuit, failing calls only to that specific service while allowing the other services to function normally.

[Figure: Secure Token Service protected using Hystrix]

[Figure: circuit breaker]

The circuit breaker is the default one available through the Hystrix library. The functioning of the circuit breaker can be summarized as follows:

  1. Every incoming call is verified against the current state of the circuit breaker.
  2. A Closed state of the Circuit allows the requests to be sent through.
  3. An Open state fails all requests.
  4. A Half-Open state (which occurs when the sleep time is completed) allows one request to go through, and on success or failure moves the circuit to the Closed or Open state as appropriate.
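
The four steps above can be sketched as a small state machine in plain Java. This is an illustration under simplifying assumptions, not the Hystrix implementation: the hypothetical `SimpleCircuitBreaker` trips on a count of consecutive failures instead of a rolling error percentage, returns a caller-supplied fallback instead of throwing, and takes its clock as a parameter so the sleep window can be exercised without waiting.

```java
import java.util.concurrent.Callable;
import java.util.function.LongSupplier;

/** Minimal circuit breaker with Closed, Open, and Half-Open states. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures that open the circuit
    private final long sleepWindowMillis; // time to stay Open before one trial call
    private final LongSupplier clock;     // millisecond clock, injectable for tests

    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAt;

    SimpleCircuitBreaker(int failureThreshold, long sleepWindowMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    /** Runs the protected call if the circuit allows it; otherwise fails fast with the fallback. */
    <T> T call(Callable<T> protectedCall, T fallback) {
        if (!tryAcquire()) {
            return fallback; // Open, or a Half-Open trial is already in flight
        }
        try {
            T result = protectedCall.call();
            onSuccess();
            return result;
        } catch (Exception e) {
            onFailure();
            return fallback;
        }
    }

    synchronized State state() {
        return state;
    }

    // Step 1: every call is checked against the current state.
    private synchronized boolean tryAcquire() {
        switch (state) {
            case CLOSED:
                return true; // step 2: Closed lets requests through
            case OPEN:
                if (clock.getAsLong() - openedAt >= sleepWindowMillis) {
                    state = State.HALF_OPEN; // step 4: this caller becomes the single trial
                    return true;
                }
                return false; // step 3: Open fails all requests
            default:
                return false; // Half-Open: only the trial call is allowed
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }
}
```

A caller wraps each dependency invocation, for example `breaker.call(() -> paymentClient.charge(order), fallbackResponse)`, where `paymentClient`, `order`, and `fallbackResponse` are placeholders. While the circuit is Open, the lambda is never executed and the fallback is returned immediately.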

Conclusion

Hystrix is not just a circuit breaker, but also a complete library with extensive monitoring capabilities, which can be easily plugged into existing systems. We have started exploring the usage of the library’s Request Collapsing and Request Caching abilities for future use cases. There are a few other Java-based implementations available, such as Akka and Spring circuit breakers; but Hystrix has proven to be a sound and mature library for maintaining a resilient environment for our critical applications, providing high availability during any time period.

References