A Glimpse into Experimentation Reporting at eBay

Around 1,500 A/B tests are run at eBay across different sites and devices every year. Experimentation is a key business process at eBay and plays an important role in the continual improvement of business performance through optimization of the user experience. Insights from these tests enable teams to answer important questions such as “How will this new product feature benefit eBay?” or “Does this new page layout improve user engagement and increase GMB?”

Testing allows business units to explore new ideas with respect to page content and style, search algorithms, product features, and more, ranging from subtle tweaks to radical variations. Test variations can easily be targeted to segments of the total customer population based on the desired traffic percentage/ramp-up and contextual criteria (geographic, system, or app-specific), providing a level of assurance before launching to a broader audience.

Experiment Lifecycle

Lifecycle of an experiment at eBay

All experiments begin with an idea. The first step is to prepare a test proposal document, which summarizes what is being tested, why it is being tested, the amount of traffic assigned, and what action will be taken once the results are published. This document is reviewed and approved in weekly TPS council meetings.

Next, the test operations team works with the product development team to find the right slot in the test schedule and to understand the impact of interactions with other tests. The team then sets up the experiment, assigns the necessary traffic to treatment and control, and launches it once smoke testing (a minimal amount of traffic is assigned to make sure everything is working as expected) succeeds and the necessary validation steps are complete.

The next step is the launch of the experiment. Tracking begins immediately, and data is collected. Reports providing the necessary insights are generated daily, along with cumulative reports covering the data collection period to date. The final results are published to a wider audience after the experiment is complete. This completes the life cycle of an experiment.

Experimentation reporting

This post will provide a quick overview of the reporting process. Before going further, let’s define some basic terms related to experimentation.

Definitions

  • GUID: A visitor is uniquely identified by a GUID (Global Unique ID). This is the fundamental unit of our traffic; it represents a browser on a machine (PC or handheld) visiting the site and is identified from the cookies that an eBay site drops on the user’s browser.
  • UserId: A unique ID assigned to each registered user on the site.
  • Event: Every activity of the user captured on the site.
  • Session: All the activity of a user until 30 minutes of inactivity elapses, within a day. The aggregate of many events constitutes a session.
  • GUID MOD: 100% of the eBay population is divided into 100 buckets. A Java hash function converts the GUID into a 10-digit hash, and the modulo of this hash determines the bucket to which the GUID is assigned. A given GUID will never fall into two different GUID MODs (see the sketch after this list).
  • Treatment and control: The feature being tested is referred to as the “treatment,” while the “control” is the default behavior.
  • Versions: Any change in the experiment during the active state will create a new version of the experiment. Major and Minor versions are created based on the change’s impact on the experiment.
  • Classifier: A classifier is one of the primary dimensions on which we slice the data and report for different dimensions and metrics under it:

    • Total GUID Inclusive (TGI) — All the GUIDS that are qualified for a particular treatment or control
    • Treated — All the GUIDS that have seen the experience of the treatment
    • Untreated — All the GUIDS that are qualified but have not seen the experience of the treatment
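
To make the GUID MOD definition concrete, here is a minimal sketch of the bucketing idea in Scala, assuming a simple hash-modulo scheme (the actual hash function used at eBay is not shown here):

object GuidMod {
  val NumBuckets = 100

  // Hash the GUID string and take the result modulo 100. The mapping is
  // deterministic, so a given GUID always lands in the same bucket and can
  // never fall into two different GUID MODs.
  def bucketOf(guid: String): Int =
    math.abs(guid.hashCode.toLong % NumBuckets).toInt
}

Because the bucket assignment is stable, a treatment can be pinned to a fixed slice of traffic, for example GUID MODs 0 through 9 for a 10% ramp.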

Overview

The following figure shows a simplified view of the reporting process.

[Figure: simplified view of the reporting process]

Upstream data sets

Let us outline the upstream data sets that the process depends on. The data is stored in Hadoop and Teradata systems.

  • User data: Event-level raw data of the user activity on the site, updated every hour.
  • Transaction and activity data: Data sets that capture the metric-level activity of the user such as bid, offer, watch, and many more.
  • Experiment metadata: Metadata tables that provide information about the experiments, treatments, GUID MOD, and various other parameters.

Stage 1

Every day, the process first checks that the upstream data sets have loaded; stage 1 is triggered once all the data sets are available. In this stage, detail data sets at the GUID and session levels are generated from the event-level data.
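
As a rough illustration of the session definition given earlier (30 minutes of inactivity closes a session), the following Scala sketch groups a GUID’s event timestamps into sessions. It is illustrative only, not the pipeline’s actual code:

// Split a GUID's sorted event timestamps (milliseconds) into sessions
// whenever the gap to the previous event exceeds 30 minutes.
def sessionize(eventTimes: Seq[Long]): Seq[Seq[Long]] = {
  val gapMs = 30 * 60 * 1000L
  eventTimes.sorted.foldLeft(Vector.empty[Vector[Long]]) { (sessions, t) =>
    sessions.lastOption match {
      case Some(current) if t - current.last <= gapMs =>
        sessions.init :+ (current :+ t) // within 30 minutes: same session
      case _ =>
        sessions :+ Vector(t)           // first event or long gap: new session
    }
  }
}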

Treatment session: This is one of the primary data sets; it holds GUID- and session-level data at the treatment and version levels, along with various dimension indicators that we will not cover in this post.

Transaction detail data set: All GUID- and session-level activity related to transaction metrics, such as revenue, is captured here. This data set does not contain any treatment-level data.

Activity detail data set: The same as the transaction detail data set, but capturing activity-level metrics such as bid, offer, BIN, and watch.

There are around six more data sets that we generate on a daily basis; we will not go into detail about them in this post. All the processing happens on Hadoop, and the data is copied to Teradata for analysts to access.

Stage 2 and Outlier Capping

The data sets generated in stage 1 act as upstream data sets for stage 2, where most of the heavy data transformation takes place. Data is kept at the GUID, treatment, and dimension levels and stored in Hadoop. This data is not moved to Teradata, because this stage is an intermediate step in our process. Outlier capping is applied to the metrics in the data sets populated from stage 2 to handle extreme values.
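
The exact capping rule is not covered in this post, but as a sketch of the idea, a common approach is percentile-based capping (winsorization), where each metric value is clipped at a high percentile of its distribution:

// Clip each metric value at the given upper percentile so that a handful
// of extreme GUIDs cannot dominate the treatment-vs-control comparison.
def capAtPercentile(values: Seq[Double], percentile: Double = 0.999): Seq[Double] =
  if (values.isEmpty) values
  else {
    val sorted = values.sorted
    val cap = sorted(math.min(sorted.length - 1, (percentile * sorted.length).toInt))
    values.map(math.min(_, cap))
  }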

Stage 3

The output from stage 2 is fed into stage 3, the summary process. The data is aggregated at the treatment, version, and dimension levels, and all the summary statistics are calculated in this step. The data is stored in Hadoop and copied over to Teradata, where MicroStrategy accesses it to publish the various reports.

Stratification

Post-stratification is an adjustment method in data analysis used to reduce the variance of estimates. Subjects are randomized to treatment and control at the beginning of the experiment; after data collection, they are stratified according to pre-experiment features, so that subjects are more similar within a stratum than across strata.

The overall treatment effect is then estimated by the weighted average of treatment effects within individual strata. Because the variance of the overall treatment effect estimation consists of variance due to noise and variance due to differences across strata, stratifying experiment subjects removes variance due to strata difference, and thus variance of the estimated overall treatment effect is reduced.
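
As a small worked example (with illustrative names, not the platform’s actual code), the post-stratified estimate is simply the stratum-weighted average of per-stratum effects:

// Each stratum carries its share of the population and the observed
// treatment and control means within that stratum.
case class Stratum(weight: Double, treatmentMean: Double, controlMean: Double)

// Overall effect = sum over strata of weight * (treatment mean - control mean).
def postStratifiedEffect(strata: Seq[Stratum]): Double =
  strata.map(s => s.weight * (s.treatmentMean - s.controlMean)).sum

// Two strata holding 60% and 40% of subjects:
// postStratifiedEffect(Seq(Stratum(0.6, 10.2, 10.0), Stratum(0.4, 5.5, 5.0)))
//   == 0.6 * 0.2 + 0.4 * 0.5 == 0.32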

This process runs in parallel and generates stratified transactional metrics. The processing happens in Hadoop, and the data is copied over to Teradata for access by the different reports.

Scala, Hive, SQL, SAS, R, and MicroStrategy are some of the technologies and statistical packages we use throughout the process. Most of the processing happens in Hadoop, and minor manipulations occur in Teradata.

This concludes the main topic of this post. One of the critical aspects of this process is data quality, as inaccurate results can affect the decisions being made. In the next post, we will talk about different data quality initiatives and how we are tackling them.

Happy Testing!

Scalable and Nimble Continuous Integration for Hadoop Projects

Experimentation

The Experimentation Platform at eBay runs around 1,500 experiments a year, processing hundreds of terabytes of reporting data contained in millions of files on a 2,500+ node Hadoop infrastructure and consuming thousands of computing resources. The entire report-generation process covers well over 200 metrics. It enables millions of customers to experience small and large innovations that help them buy and sell products in various countries, in diverse currencies, and using diverse payment mechanisms a little better every day.

The Experimentation Reporting Platform at eBay is developed using Scala, Scoobi, Apache Hive, Teradata, MicroStrategy, InfluxDB, and Grafana, submitting hundreds of Map/Reduce (M/R) jobs to a Hadoop infrastructure. The platform contains well over 35,000 statements and over 25,000 lines of code in around 300 classes.

Problem

We use Jenkins for continuous integration (CI), and one of the challenges for humongous projects involving Hadoop technologies is slow-running unit and integration tests. These test cases run one or several M/R jobs in a local JVM, which involves a considerable amount of setup and teardown time. As additional automated test cases are added to increase code coverage, overall build completion time suffers. One solution that improves CI run time is to run the automated test cases in a distributed and concurrent manner.

This technique helped improve CI running time of the Experimentation Reporting Platform at eBay from 90 minutes to ~10 minutes, thus paving the way for a truly scalable CI solution. This CI involves more than 1,800 unit test cases written in Scala using ScalaTest and Mockito.

Solution

Jenkins provides support for multi-configuration build jobs. A multi-configuration build job can be thought of as a parameterized build job that can be automatically run on multiple Jenkins nodes with all the possible permutations of parameters that it can accept. They are particularly useful for tests where you can test your application using a single build job but under a wide variety of conditions (different browsers, databases, and so forth). A multi-configuration job allows you to configure a standard Jenkins job and specify a set of slave servers for this job to be executed on. Jenkins is capable of running an instance of the job on each of the specified slaves in parallel, passing each slave ID as a build parameter and aggregating JUnit test results into a single report.

The problem boils down to this. There are a number of Jenkins slave nodes, and we have to split all JUnit tests into batches, run all batches in parallel using the available slaves, and aggregate all the test results into a single report. The last two tasks (parallel execution and aggregation) can be solved using built-in Jenkins functionality, namely, multi-configuration jobs (also known as matrix builds).

Setting up a multi-configuration project on Jenkins

There are a number of different ways to set up a distributed build farm using Jenkins, depending on your operating systems and network architecture. In all cases, the fact that a build job is run on a slave (and how that slave is managed) is transparent to the end user: the build results and artifacts always end up on the master server. It is assumed here that the Jenkins master server has multiple slave nodes configured and ready for use. A new multi-configuration build is created as shown below.

[Screenshot: creating a new multi-configuration project]

Project configuration

This set-up results in the creation of a multi-configuration project on Jenkins that requires additional configuration before it can be functional.

[Screenshot: configuring the project]

Source code management

Assuming the project is set up on Git, you can provide the Git SSH URL and build trigger settings as shown here.

[Screenshot: source code management settings]

Configuration matrix

Now comes the important part that allows you to choose the list of slave machines on which an individual batch of test cases can be executed. In this example, five machines are selected (slave4 is not visible) on which the build will be triggered.

[Screenshot: configuration matrix of slave machines]

Build

In this set-up, the master machine (the part of the distributed CI job that runs on the master node) dictates the entire run. A multi-configuration build runs as-is on every machine, master and slaves alike. Every build receives $slaveId as a build parameter, which allows the script to branch appropriately. The build configuration part of CI invokes a shell script that performs the following activities.

  • Determine the list of test case classes that need to be executed, then shuffle it. This occurs only on the master.
  • Send the complete list to all the slaves.
  • Split the complete list of test cases into batches, one per CI node (master and slaves).
  • Execute each batch on a node (slave or master).
  • Have the master wait for each slave node to complete execution.
  • Count the results on the master. Each part of the distributed CI job runs on a slave or the master, but every node’s console log is available on the master, so the total number of tests is computed there.

The following shell script performs the above listed tasks.

#!/bin/bash
function determine_list_of_test_cases() {
# Collect all Scala test class names (pattern quoted so the shell does not expand the glob) and shuffle them.
find . -name '*Test*.scala' | rev | cut -d '/' -f1 | rev | cut -d '.' -f1 | sort -R > alltests.txt
}
function copy_list_to_slaves() {
scp -i /home/username/.ssh/id_rsa alltests.txt username@epci-slave1-ebay.com:/usr/local/jenkins-ci/workspace/epr-staging-distributed-tests-only/slaveId/slave1/experimentation-reporting-platform
scp -i /home/username/.ssh/id_rsa alltests.txt username@epci-slave2-ebay.com:/usr/local/jenkins-ci/workspace/epr-staging-distributed-tests-only/slaveId/slave2/experimentation-reporting-platform
scp -i /home/username/.ssh/id_rsa alltests.txt username@epci-slave3-ebay.com:/usr/local/jenkins-ci/workspace/epr-staging-distributed-tests-only/slaveId/slave3/experimentation-reporting-platform
scp -i /home/username/.ssh/id_rsa alltests.txt username@epci-slave4-ebay.com:/usr/local/jenkins-ci/workspace/epr-staging-distributed-tests-only/slaveId/slave4/experimentation-reporting-platform/
echo "Copied list to slaves"
}
function split_tests_into_batches() {
counts=`wc -l alltests.txt | cut -d ' ' -f1`
total_ci_nodes=5
# Ceiling division: batch size so that all tests fit into total_ci_nodes batches.
batch=$((($counts+$total_ci_nodes-1)/$total_ci_nodes))
split -l $batch alltests.txt split.
counter=0
# Turn each batch file into a single comma-separated line (the format mvn -Dtest expects).
for f in split.*; do
awk '{print $0","}' $f | perl -ne 'chomp and print' > $f.$counter
counter=$((counter+1))
done
}
function wait_until_test_list_comes_from_master() {
while [ ! -f alltests.txt ]
do
sleep 2
done
}
function wait_until_build_completes_on_slaves() {
while [[ ! -f slave1.complete ]] || [[ ! -f slave2.complete ]] || [[ ! -f slave3.complete ]] || [[ ! -f slave4.complete ]]
do
sleep 2
done
}
function cleanup() {
if [ -f alltests.txt ];
then
rm alltests.txt
fi
if ls *.complete 1> /dev/null 2>&1;
then
rm *.complete
fi
if ls split.a* 1> /dev/null 2>&1;
then
rm split.a*
fi
}
function count_test_cases_on_master() {
# Parse Maven's "Results :" summary ("Tests run: X, Failures: Y, Errors: Z, Skipped: W")
# from this build's console log and sum the counts across surefire runs.
totalPass=`cat ../../../../configurations/axis-slaveId/$1/builds/$BUILD_NUMBER/log | grep -A 2 "Results :" | grep "Tests" | cut -d " " -f3 | cut -d "," -f1 | awk '{ sum += $1 } END { print sum }'`
totalFailures=`cat ../../../../configurations/axis-slaveId/$1/builds/$BUILD_NUMBER/log | grep -A 2 "Results :" | grep "Tests" | cut -d " " -f5 | cut -d "," -f1 | awk '{ sum += $1 } END { print sum }'`
totalErrors=`cat ../../../../configurations/axis-slaveId/$1/builds/$BUILD_NUMBER/log | grep -A 2 "Results :" | grep "Tests" | cut -d " " -f7 | cut -d "," -f1 | awk '{ sum += $1 } END { print sum }'`
totalSkipped=`cat ../../../../configurations/axis-slaveId/$1/builds/$BUILD_NUMBER/log | grep -A 2 "Results :" | grep "Tests" | cut -d " " -f9 | cut -d "," -f1 | awk '{ sum += $1 } END { print sum }'`
echo "**************************** $1 ********************************"
echo "Number of unit tests executed successfully: $totalPass"
echo "Number of unit tests with failures: $totalFailures"
echo "Number of unit tests with errors: $totalErrors"
echo "Number of unit tests skipped: $totalSkipped"
case $1 in
master)
tests_master=$totalPass
;;
slave1)
tests_slave1=$totalPass
;;
slave2)
tests_slave2=$totalPass
;;
slave3)
tests_slave3=$totalPass
;;
slave4)
tests_slave4=$totalPass
;;
esac
}

function count_test_cases_on_slave() {
# Same parsing as count_test_cases_on_master, but reading the slave's lastStable log.
totalPass=`cat ../../../../configurations/axis-slaveId/$1/lastStable/log | grep -A 2 "Results :" | grep "Tests" | cut -d " " -f3 | cut -d "," -f1 | awk '{ sum += $1 } END { print sum }'`
totalFailures=`cat ../../../../configurations/axis-slaveId/$1/lastStable/log | grep -A 2 "Results :" | grep "Tests" | cut -d " " -f5 | cut -d "," -f1 | awk '{ sum += $1 } END { print sum }'`
totalErrors=`cat ../../../../configurations/axis-slaveId/$1/lastStable/log | grep -A 2 "Results :" | grep "Tests" | cut -d " " -f7 | cut -d "," -f1 | awk '{ sum += $1 } END { print sum }'`
totalSkipped=`cat ../../../../configurations/axis-slaveId/$1/lastStable/log | grep -A 2 "Results :" | grep "Tests" | cut -d " " -f9 | cut -d "," -f1 | awk '{ sum += $1 } END { print sum }'`
echo "**************************** $1 ********************************"
echo "Number of unit tests executed successfully: $totalPass"
echo "Number of unit tests with failures: $totalFailures"
echo "Number of unit tests with errors: $totalErrors"
echo "Number of unit tests skipped: $totalSkipped"
case $1 in
master)
tests_master=$totalPass
;;
slave1)
tests_slave1=$totalPass
;;
slave2)
tests_slave2=$totalPass
;;
slave3)
tests_slave3=$totalPass
;;
slave4)
tests_slave4=$totalPass
;;
esac
}

function execute_tests() {
echo "Executing a batch of $batch test classes, each with multiple test cases"
buildCommand="mvn clean -U -DfailIfNoTests=false -Dtest=`cat $my_batch` test"
echo $buildCommand
eval $buildCommand
}
function report_build_complete_to_master() {
touch $1.complete
scp -i /home/username/.ssh/id_rsa $1.complete username@epci-master-ebay.com:/usr/local/jenkins-ci/.jenkins/jobs/epr-staging-distributed-tests-only/workspace/slaveId/master/experimentation-reporting-platform
}
cleanup
export MAVEN_OPTS="-Xms700m -Xmx4g -XX:MaxPermSize=2g"
cd experimentation-reporting-platform
ls -l
my_batch=split.aa.0
case $slaveId in
master)
determine_list_of_test_cases
copy_list_to_slaves
split_tests_into_batches
execute_tests
wait_until_build_completes_on_slaves
count_test_cases_on_master "master"
count_test_cases_on_slave "slave1"
count_test_cases_on_slave "slave2"
count_test_cases_on_slave "slave3"
count_test_cases_on_slave "slave4"
totalTests=$(($tests_master+$tests_slave1+$tests_slave2+$tests_slave3+$tests_slave4))
echo "*****************************************************************************"
echo " Total number of unit tests executed successfully across: $totalTests"
echo "*****************************************************************************"
;;
slave1)
wait_until_test_list_comes_from_master
split_tests_into_batches
my_batch=split.ab.1
execute_tests
report_build_complete_to_master $slaveId
;;
slave2)
wait_until_test_list_comes_from_master
split_tests_into_batches
my_batch=split.ac.2
execute_tests
report_build_complete_to_master $slaveId
;;
slave3)
wait_until_test_list_comes_from_master
split_tests_into_batches
my_batch=split.ad.3
execute_tests
report_build_complete_to_master $slaveId
;;
slave4)
wait_until_test_list_comes_from_master
split_tests_into_batches
my_batch=split.ae.4
execute_tests
report_build_complete_to_master $slaveId
;;
esac
cleanup

Run

Once the multi-configuration project is created, it can be run as follows.

[Screenshots: running the multi-configuration project]

Each configuration runs a subset of automated test cases on a separate CI node (machine). This allows the entire CI job to complete execution in a distributed manner. This solution is scalable, and as additional automated test cases are added, the execution speed can be maintained by simply adding additional CI slave nodes.

Conclusion

Currently, the batches of unit test classes are distributed randomly across CI machines. A batch may happen to contain slower-running test cases, slowing down the CI build as a whole. Better batch allocation can be achieved by recording test-class execution times in a relational database, analyzing them, and constructing batches of unit test classes with uniform execution times.
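
As a sketch of what that duration-aware batching could look like (illustrative only; the names and data source are assumed), a greedy longest-processing-time heuristic assigns each test class, longest first, to the node with the lightest load so far:

// durations: recorded runtime in seconds per test class; nodes: number of CI machines.
def balanceBatches(durations: Map[String, Double], nodes: Int): Vector[List[String]] = {
  val batches = Array.fill(nodes)(List.empty[String])
  val loads = Array.fill(nodes)(0.0)
  // Place the longest-running classes first, each on the least-loaded node.
  for ((testClass, secs) <- durations.toSeq.sortBy(-_._2)) {
    val i = loads.indexOf(loads.min)
    batches(i) = testClass :: batches(i)
    loads(i) += secs
  }
  batches.toVector
}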

Distributed and concurrent execution of unit test cases has allowed the Experimentation Reporting Platform at eBay to build a CI solution with 1,800+ unit test cases and more than 70% statement and branch coverage. The test cases are a mix of Hadoop tests and non-Hadoop unit tests. Each commit triggers a distributed build that finishes in approximately 10 minutes, allowing the committer to verify it quickly.

One-Time Password

Finite-State Machine for Single-Use Code Authentication

Introduction

eBay strives to excel at security, identifying new and improved mechanisms that allow users to seamlessly access their accounts while ensuring that fraudulent and malicious users are kept at bay. This is a balancing act that every internet platform, major and minor, performs every day. Passwords are one such mechanism used to secure a user’s account, though not with much success. They have always been a major pain point for users, who must maintain a unique combination of hard-to-guess and hard-to-remember numbers, letters, and special characters. Ironically, this leads users to create passwords that are hard to remember but easy for hackers to brute-force.

With new websites and platforms cropping up every day, remembering multiple passwords has become a daily struggle. Passwords are one of the weakest links in our attempt to secure user accounts: many users don’t choose a strong, complex password, or they reuse the same password across multiple services, making themselves vulnerable to phishing and other types of attacks.

To relieve users from ever needing to remember their password, and to move toward the utopia of a “password-free” world, eBay has released the ability to log in with a one-time code delivered to the user’s phone, a personal and important device that they own. Beyond the convenience of disposable one-time codes, this also helps while traveling or when using a public computer or network: users can log in with a one-time code rather than run the risk of having their regular password hijacked via a key logger, malware, or even a compromised network. More details about the feature are available in this release statement.

How does it work?

Any user can now use the “Sign in with a single-use code” link on the log-in page to request that a one-time code be delivered to their registered phone number via text message. The user then types the code from the text message into the input field, securely getting into the account without the hassle of remembering or exposing the original password.

[Screenshots: sign-in window and “Get single-use code” button]

The one-time codes are short-lived and cannot be transferred between sessions, making them highly isolated and secure in comparison to regular passwords, which can be used across any number of devices.

Behind the scenes

Given its criticality, the application architecture needs to be robust and secure, but also manageable and configurable. The structure should provide good code readability, which in turn helps ensure high-quality code in the production environment. One important requirement is that the finite-state machine used for generating and validating the one-time code must conform to all the configured rules, such as expiry and retry attempts.


Finite state machine

A finite-state machine (FSM) is a mathematical model of computation used to design both computer programs and sequential logic circuits. It is conceived as an abstract machine that can be in one of a finite number of states. The machine is in only one state at a time; the state it is in at any given time is called the current state. It can change from one state to another when initiated by a triggering event or condition; this is called a transition. A particular FSM is defined by a list of its states and the triggering condition for each transition. (Wikipedia contributors, “Finite-state machine,” Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/w/index.php?title=Finite-state_machine&oldid=731346054, accessed August 18, 2016.)

A simple representation of a Turnstile state machine is shown below.

[Figure: turnstile state machine]
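
For illustration, here is a minimal sketch of the turnstile machine in Scala, with two states (Locked, Unlocked) and two events (Coin, Push):

sealed trait State
case object Locked extends State
case object Unlocked extends State

sealed trait Event
case object Coin extends Event
case object Push extends Event

// The whole machine is its transition function: (current state, event) => next state.
def transition(state: State, event: Event): State = (state, event) match {
  case (Locked, Coin)   => Unlocked // inserting a coin unlocks the turnstile
  case (Unlocked, Push) => Locked   // pushing through relocks it
  case (s, _)           => s        // any other event leaves the state unchanged
}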

Advantages

FSM provides many advantages over traditional rule-based systems.

  • States, transitions, events, and their conditions maintained in a modular and declarative fashion
  • Pluggable nature of transitions and conditions
  • Abstraction of states, actions, and their roles
  • Clarity of behavior during firing of events
  • Most importantly, ease of ensuring secure state management

To illustrate, imagine a user has requested a one-time code and received it on their phone, but then mistyped the code more times than the allowed number of attempts. With a rule engine or a simple if/else sequence, there is a high probability of a bug being introduced, because there is no state maintenance, only rule/logic evaluations, which can allow the user to provide the correct code and log in successfully even after exceeding the allowed number of attempts.

On a state machine, however, once the user’s attempts exceed the limit, the machine transitions to an AttemptExceeded/Failed state, making it impossible for the user to attempt code validation even with the correct code. This structure guarantees that the shortcomings of sequential rule evaluation cannot occur.

If-else code for validation

public int validateCode(long transId, String code) {
    int currentStatus = getPersistence().getCurrentStatus(transId);
    boolean status = false;
    if (currentStatus == SENT) {
        status = getSecureModule().validateCode(transId, code);
        if (status) {
            currentStatus = SUCCESS;
        } else {
            currentStatus = INVALID;
            if (getPersistence().verifyAttemptsExceeded(transId)) {
                currentStatus = ATTEMPT_EXCEEDED;
            }
        }
    }
    return currentStatus;
}

Open source frameworks

Rather than writing an FSM from scratch, we evaluated two widely used frameworks based on maturity and support: Spring StateMachine and squirrel-foundation. Spring StateMachine seemed more suited to a daemon application, such as ZooKeeper, and was not ready for a multi-threaded environment such as a web application [Issue#170]. Moreover, squirrel-foundation provided the ability to add custom event types and actions, explained in more detail below, which led to the decision to use squirrel-foundation to model the finite-state machine.

State machine for one-time code

A simple representation of the state machine used to generate and validate the one-time code for the phone channel is provided below.

State Machine for one-time code

As illustrated, the one-time code validation moves between different states of the machine based upon the Action of the user, the current state of the machine, and the conditions configured between them. For instance, when a user requests a code for the first time, if the Send failed for some reason, the state machine moves the transaction to a final FAILED state, rendering the transaction inert and effectively terminated.

Elaborating on the use case discussed earlier, where the user exceeds the number of incorrect code attempts: if the user then requests a code retry, the retry-limit-exceeded condition fires and the state machine moves the current state to FAILED, terminating the transaction and preventing the user from re-using the session or the code.

Set-up and configuration

To better explain the usage and set-up, the following are some important parts of the state machine modeling code. The snippets are neither complete nor compilable as-is; they are truncated for brevity.

Builder

The squirrel-foundation framework allows the state machine to be configured once, after which new instances can be created for each thread without incurring the expensive construction time again. The newSM() method is invoked with the beginning state of the current transaction to get the state machine ready; when events are fired, the state machine takes care of identifying the next state to transition to.

StateMachine Builder

public class PhoneSMBuilder {
    private static StateMachineBuilder<PhoneStateMachine, PhoneStateId, PhoneEvent, PhoneContext> stateMachineBuilder;

    static {
        buildSM();
    }

    private static synchronized void buildSM() {
        if (null != stateMachineBuilder) {
            return;
        }
        stateMachineBuilder = StateMachineBuilderFactory.create(PhoneStateMachine.class, PhoneStateId.class, PhoneEvent.class, PhoneContext.class);
        stateMachineBuilder.setStateMachineConfiguration(StateMachineConfiguration.create());
    }

    ....

    public static PhoneStateMachine newSM(StateId initialState) {
        PhoneStateMachine stateMachine = stateMachineBuilder.newStateMachine(initialState);
        stateMachine.start();
        return stateMachine;
    }
}

Transitions and conditions

Each transition from one state to another is managed as an external transition and guarded by conditions. For the phone state machine, the positive transitions are configured as shown below.

buildSM() – adding transitions

private static synchronized void buildSM() {
    .....
    stateMachineBuilder.externalTransition()
            .from(PhoneStateId.INITIAL).to(PhoneStateId.DELIVERED)
            .on(PhoneEvent.SEND_CODE)
            .when(checkEventActionExecutionResult());

    stateMachineBuilder.externalTransition()
            .from(PhoneStateId.DELIVERED).to(PhoneStateId.SUCCESS)
            .on(PhoneEvent.VALIDATE_CODE)
            .when(checkEventActionExecutionResult());
    ....
}

private static Condition<PhoneContext> checkEventActionExecutionResult() {
    return new AnonymousCondition<PhoneContext>() {

        @Override
        public boolean isSatisfied(PhoneContext context) {
            return context.isActionSuccess();
        }
    };
}

As illustrated, Condition<T> is an interface provided by the squirrel-foundation framework for configuring the conditions specific to a state-to-state transition. A true response from the isSatisfied() method fires the transition that the condition guards.

NOTE: If more than one Condition fires for a transition from one initial state to two different end states, an exception is thrown. It is imperative that each condition is mutually exclusive with other conditions for the same initial state.

Even for error scenarios, such as expired or attempts exceeded, the state transition is simple to configure, maintain, and change, ensuring the isolation of the change and easy testability.

buildSM() – adding transitions and conditions

private static synchronized void buildSM() {
    .....

    stateMachineBuilder.externalTransition()
            .from(PhoneStateId.DELIVERED).to(PhoneStateId.EXPIRED)
            .on(PhoneEvent.VALIDATE_CODE)
            .when(codeExpiredCondition());

    stateMachineBuilder.externalTransition()
            .from(PhoneStateId.DELIVERED).to(PhoneStateId.FAILED)
            .on(PhoneEvent.VALIDATE_CODE)
            .when(failureCheckCondition());
}

private static Condition<PhoneContext> codeExpiredCondition() {
    return new AnonymousCondition<PhoneContext>() {

        @Override
        public boolean isSatisfied(PhoneContext context) {
            return (!context.isActionSuccess() && context.getError() == Errors.ExpiredCode);
        }
    };
}

Persistence

For persistence and initialization, the state machine is backed by a database. The database provides the initial state used to start the state machine and stores the resolved state after an event fires. The framework provides appropriate hooks, such as afterTransitionCompleted and afterTransitionDeclined, for persisting the states. Squirrel-foundation also provides a mechanism to catch other unchecked exceptions using afterTransitionCausedException, which is useful for alerting and monitoring purposes.

PhoneStateMachine – adding afterTransitions

public class PhoneStateMachine extends AbstractStateMachine<PhoneStateMachine, PhoneStateId, PhoneEvent, PhoneContext> {

    @Override
    protected void afterTransitionCompleted(PhoneStateId fromState, PhoneStateId toState, PhoneEvent event, PhoneContext context) {
        persistStateTransition(fromState, toState, event, context);
    }

    @Override
    protected void afterTransitionDeclined(PhoneStateId fromState, PhoneEvent event, PhoneContext context) {
        persistStateTransition(fromState, null, event, context);
    }

    @Override
    protected void afterTransitionCausedException(PhoneStateId fromState, PhoneStateId toState, PhoneEvent event, PhoneContext context) {
        logger.error("Exception during SM transition", getLastException().getTargetException());
    }
}

Custom state machine modification

Even though the squirrel-foundation framework satisfied all our needs, there was no built-in way to bind a pre-action to an event. For example, the code has to be sent to the user before the state machine is triggered from the INITIAL state. Similarly, the code should be validated against the database, and the relevant context results populated, before the state machine is fired. This was achieved by creating a custom state machine and overriding the fire() method to perform the associated action before firing the state machine.

PhoneStateMachine – Adding Pre-Action

public class PhoneStateMachine extends AbstractStateMachine<PhoneStateMachine, PhoneStateId, PhoneEvent, PhoneContext> {

    @Override
    public void fire(PhoneEvent event, PhoneContext context) {
        try {
            // Pre-action bound to the event (for example, sending or validating the code).
            event.getEventAction().execute(context);
            if (context.getError() != null) {
                logger.error("Event execution failed in SM due to: " + context.getError().name());
            }
        } catch (Exception e) {
            logger.error("Exception in firing event: " + event, e);
            context.setError(Errors.UnknownError);
            return;
        }
        super.fire(event, context);
    }
}

Squirrel-foundation provides the freedom to define custom types for all the necessary parameters, such as events, actions, and context, which makes it possible to configure each PhoneEvent with an associated action.

PhoneEvent

public enum PhoneEvent {
    SEND_CODE(PhoneEventAction.sendCodeAction),
    VALIDATE_CODE(PhoneEventAction.validateCodeAction),
    ;

    private PhoneEventAction<PhoneContext> eventAction;

    private PhoneEvent(PhoneEventAction<PhoneContext> eventAction) {
        this.eventAction = eventAction;
    }

    public PhoneEventAction<PhoneContext> getEventAction() {
        return eventAction;
    }
}


public interface PhoneEventAction<T> {

    void execute(T context);

    static final PhoneEventAction<PhoneContext> sendCodeAction = new PhoneEventAction<PhoneContext>() {

        @Override
        public void execute(PhoneContext context) {
            logger.debug("sendCodeAction: send code");
        }
    };

    static final PhoneEventAction<PhoneContext> validateCodeAction = new PhoneEventAction<PhoneContext>() {

        @Override
        public void execute(PhoneContext context) {
            logger.debug("validateCodeAction: verify code");
        }
    };
}

Conclusion

The state machine is a well-known structure for managing processing, and this is just one example of how complex logic can be represented in a simple, effective, and maintainable manner. This structure allows developers to manage changes and configure values effectively and almost bug-free. Future enhancements to the state machine above include better listeners for more effective logging, self-healing mechanisms in case of failures, and a changeover to RxJava for state persistence.