Monitoring Anomalies in the Experimentation Platform


The Experimentation platform at eBay runs around 1500 experiments that process hundreds of terabytes of reporting data contained in millions of files on Hadoop infrastructure, consuming thousands of computing resources. The report generation process covers well over 200 metrics, and it enables millions of customers to experience small and large innovations that help them buy and sell products across countries, currencies, and payment mechanisms a little better every day.

The Experimentation reporting platform at eBay is developed using Scala, Scoobi, Apache Hive, Teradata, MicroStrategy, InfluxDB, and Grafana.


Our user-behavior tracking platform enables us to gain insights into how customers behave and how products are used, and to unlock the information needed to build the right strategies for improving conversion, deepening engagement, and maximizing retention.

The eBay platform contains hundreds of applications that enable users to search for products, view specific products, and engage in commerce. These applications run on numerous servers in data centers across the world, and they log details of every event that occurs between a user and eBay (in a specific application), such as activities (view product, perform search, add to cart, and ask questions, to name a few) and transactions (BID, BIN, and Buyer Offer, for example), including the list of experiments that the user is qualified for and has experienced during that event. Tracking data is moved from application servers to distributed systems like Hadoop and Teradata for post-processing, analytics, and archival.


Any experiment that runs on the Experimentation platform can experience anomalies that need to be identified, monitored, and rectified in order to achieve the goal of that experiment.

  • Traffic corruption. An experiment is set up so that the experiment (the new experience) and the control (the default experience) each receive an approximately equal share of unique visitors, identified by GUID (global unique identifier) or UID (signed-in user ID), throughout the experiment's life cycle. At times, this traffic share becomes significantly skewed between experiment and control, potentially resulting in incorrect computation of bankable and non-bankable metrics. This is one of the most critical anomalies and is monitored carefully (a simple skew check is sketched after this list).
  • Tag corruption. The vast amounts of user activity collected by eBay application servers include information (tags) about the list of experiments that a user is qualified for. Any corruption or loss of these tags can significantly hamper the metrics computed for an experiment.
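
As a rough illustration of how traffic corruption can be flagged, the hypothetical check below compares the daily split of unique visitors against a tolerance. The class name and threshold are illustrative only, not the platform's actual rule.

/** Hypothetical check: flag an experiment whose traffic split drifts too far from 50/50. */
public class TrafficSkewCheck {
    // Illustrative tolerance: allow at most a 5-percentage-point deviation from an even split.
    private static final double MAX_DEVIATION = 0.05;

    public static boolean isSkewed(long treatmentVisitors, long controlVisitors) {
        long total = treatmentVisitors + controlVisitors;
        if (total == 0) {
            return false; // no traffic yet, nothing to flag
        }
        double treatmentShare = (double) treatmentVisitors / total;
        return Math.abs(treatmentShare - 0.5) > MAX_DEVIATION;
    }

    public static void main(String[] args) {
        // 56,000 vs. 44,000 unique visitors is a 6-point skew, so the experiment is flagged.
        System.out.println(isSkewed(56_000, 44_000)); // prints true
    }
}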

Here are some typical reasons for these anomalies:

  • GUID reset: GUIDs are stored on browser cookies. Any kind of application error or mishandling of browser upgrades can cause GUID resets against either the experiment or the control, resulting in traffic corruption.
  • Cache refresh: eBay application servers maintain caches of experiment configurations. A software or hardware glitch can cause the caches on these servers to go out of sync. This problem can lead to both traffic and tag corruption.
  • Application anomalies: Web pages are served by application servers. These application servers invoke several experimentation services to determine the list of experiments that a user is qualified for, based on several factors. Application servers can log this information incorrectly, corrupting essential tags through incorrect encoding, truncation, or application errors. This problem results in both traffic and tag corruption.

flow chart showing the logical flow between users and the experimentation back end

Monitoring anomalies

Anomalies in experiments are detected daily, ingested into InfluxDB, an open-source time-series database, and visualized with Grafana.

InfluxDB is designed to handle time-series data with high-availability and high-performance requirements. It installs in minutes without external dependencies, yet is flexible and scalable enough for complex deployments. InfluxDB offers these features, among many others:

  • InfluxDB possesses on-the-fly computational capabilities that allow data to become available within milliseconds of its capture.
  • InfluxDB can store billions of data points for historical analysis.
  • InfluxDB aggregates and precomputes time-series data before it is written to disk.

Grafana provides a powerful and elegant way to create, explore, and share dashboards and data with your team. Grafana includes these features among many others:

  • Fast and flexible client-side graphs with a multitude of options
  • Drag-and-drop panels, where you can change row and panel widths easily
  • Support for several back-end time-series databases, such as InfluxDB, Prometheus, Graphite, and Elasticsearch, with the capability to plug in custom databases
  • Shareable links to dashboards or full-screen panels

The Experimentation reporting platform leverages both InfluxDB and Grafana to monitor anomalies in experiments. It supports the following features.

Home page

The home page gives a bird’s-eye view of all anomalies, broken down at various levels such as channel, business (application), and country. Every anomaly has a certain threshold beyond which it needs to be analyzed further. The Gauge panel in Grafana enables us to do just that.

animated gif showing an example of the home page with a dashboard for multiple anomalies

Drill-down view

Any anomaly can be further analyzed in a drill-down view that shows details of that anomaly, which is again broken down at various levels.

animated gif showing an example of selecting the graph for an anomaly and then displaying its enhanced drill-down view

Grafana allows quick duplication of each panel with a view that can be easily modified. The user can select either an SQL or a drop-down interface to edit queries.

animated gif showing an example of duplicating a single panel and modifying the query for the duplicate


There are several occasions during the triaging process when we need to quickly check whether a given experiment, channel, or country is experiencing any anomalies. The search feature provided by Grafana (through templating) allows us to do just that. The user can type or select from a drop-down to view details of all anomalies for a specific combination of filters.
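
Behind each templated panel is a query against InfluxDB in which Grafana substitutes the selected filter values. The query below is only a sketch: the measurement and tag names (traffic_corruption, channel, country, business) are hypothetical, while $channel, $country, and $timeFilter are standard Grafana template variables.

SELECT mean("skew") FROM "traffic_corruption"
WHERE "channel" =~ /^$channel$/ AND "country" =~ /^$country$/ AND $timeFilter
GROUP BY time(1d), "business"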

animated gif showing an example of entering a search string into a search field

Every dashboard can be customized and shared across the organization.

animated gif showing an example of sharing a dashboard

Setup and scale

InfluxDB (v0.11-1) is installed on a single node, and so is Grafana (v3.0.2). Each is hosted on the eBay cloud with 45 GB of memory, 60 GB of disk space, and Ubuntu 14.04. Each day, around 2000 points are ingested into InfluxDB using a Scala client, with an ingestion time of a few seconds. Currently, the system contains seven months of historical anomaly data, taking around 1.5 GB of disk space in InfluxDB and consuming approximately 19 GB of RAM. Anomaly data is archived on HDFS for recovery in case of system failure.
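
For illustration, a single anomaly point can be written over InfluxDB’s HTTP line protocol as in the sketch below. The database, measurement, and tag names are hypothetical, and the production pipeline uses a Scala client rather than this minimal Java example.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Minimal sketch: write one anomaly data point to InfluxDB over its HTTP line protocol. */
public class AnomalyIngestion {
    public static void main(String[] args) throws Exception {
        // Hypothetical database name; adjust host, port, and credentials for a real deployment.
        URL url = new URL("http://localhost:8086/write?db=experiment_anomalies&precision=s");
        String point = "traffic_corruption,channel=web,country=US,experiment=12345 skew=0.07 "
                + (System.currentTimeMillis() / 1000L);

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(point.getBytes(StandardCharsets.UTF_8));
        }
        // InfluxDB returns 204 No Content when the write succeeds.
        System.out.println("HTTP status: " + conn.getResponseCode());
    }
}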

This dataset is minuscule compared to the vast amounts of data that InfluxDB can handle, especially when it is set up as a cluster for fault tolerance, a capability that unfortunately is not supported beyond v0.11-1.


The anomaly monitoring platform is the cornerstone for monitoring anomalies in experiments at eBay. It is becoming a single place for monitoring, sharing, and searching for anomalies for anyone in the company who runs experiments on the Experimentation platform. Its self-service nature (thanks to Grafana) when creating new dashboards for new datasets is what makes it stand out.

There are several measures and metrics that determine whether an experiment is experiencing an anomaly. If the thresholds are breached, the experiment is flagged and a consolidated email notification is sent out. Grafana circles have long discussed when alerting would arrive (Winter has come, so will alerting), and it seems that alerting is indeed coming to Grafana, enabling users to set alert thresholds for every monitored metric right from the dashboard.



Visualizing Machine Translation Quality Data — Part I

There is no knowledge that is not power

We are, no doubt, living in some of the most exciting days of the Information Age. Computers keep getting faster, and smartphones are ubiquitous. Huge amounts of data are created daily by amazingly diverse sources. It is definitely easier than ever for language services buyers and providers to gather data, but it looks like the localization industry is really not doing much to harness all this information. Overall, and with a few exceptions, of course, the industry seems to be missing out and is not fully leveraging, or at least trying to understand, all these wonderful bits and pieces of information that are generated every day.

Perhaps the issue is that too much information can be hard to make sense out of and may even feel overwhelming. That is, precisely, the advantage of data visualization.

In this series of articles, I will cover three different tools you can use to visualize your translation quality data: Tableau, TAUS DQF, and Excel. This article is part 1 and will only focus on general information and Tableau.

The case for data visualization

Perhaps the single most important point of data visualization is that it allows you to assimilate information in a very natural way. An enormous amount of information that is difficult to take in when in a table suddenly makes sense when presented and summarized in a nice chart. Patterns and trends may become easier to spot and, sometimes, even obvious. Correlations may pop up and give you much-needed business or strategic advantages, allowing you to effectively act on your information.

How does this apply to translation and localization practices? Well, there simply is a lot of information you can measure and analyze, for example:

  • Productivity
  • Vendor performance
  • MT system performance
  • Tool performance
  • Financials
  • Process efficiency, etc.

At eBay, we use data visualization to track our vendors’ performance, the quality of our MT output for different language combinations, details on the types of issues found in our MT output, what types of issues we are finding in our vendors’ deliverables, and more.

The Keys

Let’s take a minute to examine what is necessary to make a visualization effective. I’m by no means an expert on this subject, you’ll notice, but based on my experience and research, these are the key points to consider:

First of all, be clear. What are you trying to find out with a chart? What do you want to bring attention to? What are you trying to say? Transmitting a clear message is a priority.

Be selective: don’t cram columns and lines into a visualization just because you can. Carefully plan the data points you want to include, assessing whether or not they contribute to the main purpose of your message. This can be difficult, especially if you have too much information; you may feel tempted to add data that adds no value at all.

Keep your audience in mind, and be relevant. Shape your message to answer the questions they may have, and discard any information they may find unnecessary. Project managers may be interested in financials and the percentage of on-time deliveries, engineers in process efficiency, while language managers may be focused on quality and language performance.

Put some thought into the best way to represent the information and into how you can make the most important pieces stand out. It’s usually a good idea to include trends, highlight patterns, and make meaningful correlations obvious.


Tableau is perhaps one of the most popular visualization programs available. The concept is simple: Tableau can read your data, from a simple Excel file or a database (among several other options), parse it, and turn the information into dimensions and measures. And here’s the best part: you can simply drag and drop those dimensions and measures onto columns and rows, and Tableau will generate charts (or views, as they like to call them) for you. Automatically. Effortlessly.

And it comes with an amazing range of chart options and customization options that may seem overwhelming when you start using the software but, once you get the hang of it, make total sense.

Let’s look at some examples:

  • This chart shows in a very simple way how vendors are performing for each of the two content types we are working with at the moment, that is, titles and descriptions. It becomes evident that Vendor 2 may be the best option for descriptions while Vendor 5 is underperforming when it comes to titles.


  • Now, let’s imagine we want to analyze how post-editors for the different languages are doing, again based on the content type. We can take a look at how many errors reviewers found for each of them.

    Here it becomes evident that German post-editors are doing great with descriptions, but they are struggling with titles, as there’s a big difference in the position of the blue columns. We can also see that Spanish and French seem to be above the error average. Italian, Portuguese and Russian don’t show major changes from one content type to the other.


  • Now we want to dig deeper into the errors our reviewers are finding, and for that, we are going to look at the different types of errors by language. Looking at this chart, it seems that the biggest problem is mistranslations. This is a good hint to try to find out why this is happening: Is the source too complex? Are post-editors not doing enough research? Are we providing the right reference material? On the other hand, the data seems to indicate that terminology is not really a big problem. We could infer that our glossaries are probably good, our tool is showing the right glossary matches, and our translators are subject matter experts.

    We can also see that French has many more issues than Italian, for example.


    Tableau will easily let you swap your columns and rows to change the way the data is presented. In the example below, the focus is now on error categories and not on the number of errors found. However, what I don’t like in this view is that the names of the error categories are vertical and are hard to read — it is possible to rotate them, but that will make the chart wider.

    There are plenty of options you can try to create a view that shows exactly what you want, in the best possible way.


  • Here’s a very simple one to quickly see which are the busiest months, based on the number of words processed.


  • Now we want to look at edit distance. Analyzing this information can help us figure out, for example, MT performance by language, considering that a low edit distance indicates less post-editing effort (a sketch of how edit distance is typically computed follows these examples). I’m going to include the wordcount as well, to see the edit distance in context.

    I can ask Tableau to display the average edit distance for all languages by placing a line in the chart.

    The blue dots indicate that German is the language with the lowest edit distance, with an average of 21.15. This may be an indication that my DE output is good, or at least better than the rest of the languages. The red dots for Italian are all over the place, which may indicate that the quality of my IT MT output is inconsistent — just the opposite of Portuguese, with most purple dots concentrated in the center of the chart.


  • In this final example, let’s assume we want to see how much content our reviewers are covering; ideally, they should be reviewing 50% of the total wordcount. Here we can see, by language, how many words we’ve processed and how many were reviewed. You can quickly see that the wordcount for French is double the Russian wordcount. You can also easily notice that the FR reviewer is not covering as much as the rest. This may indicate that you need another reviewer or that the current reviewer is underperforming. Compare this to Portuguese, where the difference between total words and reviewed words is minimal. If we only need to review 50% of the content, the PT reviewer is covering too much.
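
A note on the edit distance charted earlier: it is typically a Levenshtein-style measure, the number of character insertions, deletions, and substitutions needed to turn the raw MT output into the post-edited translation, often normalized by segment length. The sketch below illustrates the idea and is not the exact formula used by any particular tool.

/** Sketch of a Levenshtein edit distance between MT output and its post-edited version. */
public class EditDistance {
    public static int levenshtein(String a, String b) {
        // d[i][j] holds the distance between the first i chars of a and the first j chars of b.
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // One substitution ("man" -> "men"): distance 1, i.e. very little post-editing effort.
        System.out.println(levenshtein("blue cotton shirt for man", "blue cotton shirt for men"));
    }
}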


Customizing Spring Security with Multiple Authentications


Spring Boot offers an easier way to create new web applications or web services. The Security module in the Spring framework enables us to plug in different authentication mechanisms. In some cases, we needed to provide multiple authentication mechanisms for our web service. These authentication mechanisms can be standard or custom.


We had a similar requirement while working on an in-house project to develop a web service for distributing and renewing Kerberos keytabs. The project is named Kite, and it is a web service built on Spring Boot. We initially added SPNEGO to authenticate users of our Kite service.

Enabling SPNEGO Authentication using Spring Security

To enable security and add SPNEGO, we needed to make changes to our pom.xml. The relevant POM changes are shown here:
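
A sketch of the dependencies is shown below; we relied on the Spring Security Kerberos extension, and the version shown is illustrative (use whichever release matches your Spring Security line).

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-security</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.security.kerberos</groupId>
    <artifactId>spring-security-kerberos-web</artifactId>
    <version>1.0.1.RELEASE</version>
</dependency>
<dependency>
    <groupId>org.springframework.security.kerberos</groupId>
    <artifactId>spring-security-kerberos-core</artifactId>
    <version>1.0.1.RELEASE</version>
</dependency>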


We also needed to add Java code to configure SPNEGO authentication. The relevant parts of the Java code for hooking up SPNEGO authentication are shown below:

@Configuration
@EnableWebSecurity
public class SecurityConfig extends WebSecurityConfigurerAdapter {

  // kiteConfiguration and dummyUserDetailsService() are Kite-internal beans defined elsewhere.

  @Override
  protected void configure(HttpSecurity http) throws Exception {
    http.exceptionHandling().authenticationEntryPoint(spnegoEntryPoint())
        .and().authorizeRequests().anyRequest().authenticated()
        .and().addFilterBefore(spnegoAuthenticationProcessingFilter(authenticationManagerBean()),
            BasicAuthenticationFilter.class);
  }

  @Override
  protected void configure(AuthenticationManagerBuilder auth) throws Exception {
    auth.authenticationProvider(kerberosAuthenticationProvider())
        .authenticationProvider(kerberosServiceAuthenticationProvider());
  }

  @Bean
  public KerberosAuthenticationProvider kerberosAuthenticationProvider() {
    KerberosAuthenticationProvider provider = new KerberosAuthenticationProvider();
    SunJaasKerberosClient client = new SunJaasKerberosClient();
    provider.setKerberosClient(client);
    provider.setUserDetailsService(dummyUserDetailsService());
    return provider;
  }

  @Bean
  public SpnegoEntryPoint spnegoEntryPoint() {
    return new SpnegoEntryPoint();
  }

  @Bean
  public SpnegoAuthenticationProcessingFilter spnegoAuthenticationProcessingFilter(AuthenticationManager authenticationManager) {
    SpnegoAuthenticationProcessingFilter filter = new SpnegoAuthenticationProcessingFilter();
    filter.setAuthenticationManager(authenticationManager);
    return filter;
  }

  @Bean
  public KerberosServiceAuthenticationProvider kerberosServiceAuthenticationProvider() {
    KerberosServiceAuthenticationProvider provider = new KerberosServiceAuthenticationProvider();
    provider.setTicketValidator(sunJaasKerberosTicketValidator());
    provider.setUserDetailsService(dummyUserDetailsService());
    return provider;
  }

  @Bean
  public SunJaasKerberosTicketValidator sunJaasKerberosTicketValidator() {
    SunJaasKerberosTicketValidator ticketValidator = new SunJaasKerberosTicketValidator();
    ticketValidator.setKeyTabLocation(new FileSystemResource(kiteConfiguration.getKeytab()));
    ticketValidator.setDebug(true);
    return ticketValidator;
  }
}

These POM and source code changes enable users to authenticate via Kerberos. Some of our users needed an alternate form of authentication, where the users present a one-time-use token. To hook up our custom token authentication, we took the following steps:

  1. Implement token authentication logic as TokenAuthenticationFilter by extending AbstractAuthenticationProcessingFilter.
  2. Plug in TokenAuthenticationFilter via FilterRegistrationBean.

Custom TokenAuthenticationFilter

The custom token-based authentication filter extends AbstractAuthenticationProcessingFilter. Here we override the doFilter to implement our custom authentication logic.

public class TokenAuthenticationFilter extends AbstractAuthenticationProcessingFilter {
  private static final String SECURITY_TOKEN_KEY = "token";
  private TokenManager tokenManager;

  public TokenAuthenticationFilter(TokenManager tm) {
    // The parent class requires a processing URL; "/**" is used here because doFilter below
    // decides for itself whether a token is present on the request.
    super("/**");
    tokenManager = tm;
  }

  @Override
  public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
      throws IOException, ServletException {
    HttpServletRequest request = (HttpServletRequest) req;
    HttpServletResponse response = (HttpServletResponse) res;
    String token = request.getParameter(SECURITY_TOKEN_KEY);
    if (token != null) {
      Authentication authResult;
      try {
        authResult = attemptAuthentication(request, response, token);
        if (authResult == null) {
          // The token did not resolve to a user; reject the request.
          response.sendError(HttpServletResponse.SC_UNAUTHORIZED);
          return;
        }
      } catch (AuthenticationException failed) {
        response.sendError(HttpServletResponse.SC_UNAUTHORIZED);
        return;
      }
      try {
        // The token was valid: populate the security context and let the request proceed.
        successfulAuthentication(request, response, chain, authResult);
        return;
      } catch (Exception e) {
        logger.error(e.getMessage(), e);
        if (e.getCause() instanceof AccessDeniedException) {
          response.sendError(HttpServletResponse.SC_FORBIDDEN);
          return;
        }
      }
    }
    chain.doFilter(request, response); // no token present; return to the other Spring Security filters
  }

  // Required by AbstractAuthenticationProcessingFilter; delegates to the token-aware overload.
  @Override
  public Authentication attemptAuthentication(HttpServletRequest request, HttpServletResponse response)
      throws AuthenticationException, IOException, ServletException {
    return attemptAuthentication(request, response, request.getParameter(SECURITY_TOKEN_KEY));
  }

  public Authentication attemptAuthentication(HttpServletRequest request, HttpServletResponse response, String token)
      throws AuthenticationException, IOException, ServletException {
    AbstractAuthenticationToken userAuthenticationToken = authUserByToken(token);
    if (userAuthenticationToken == null) {
      throw new AuthenticationServiceException(MessageFormat.format("Error | {0}", "Bad Token"));
    }
    return userAuthenticationToken;
  }

  private AbstractAuthenticationToken authUserByToken(String tokenRaw) {
    AbstractAuthenticationToken authToken = null;
    try {
      String user = tokenManager.verifyAndExtractUser(tokenRaw);
      if (user != null) {
        user = user + "@REALM";
        Principal securityUser = new SecurityUser(user);
        authToken = new PreAuthenticatedAuthenticationToken(securityUser, null, null);
      }
    } catch (Exception e) {
      logger.error("Error during authUserByToken", e);
    }
    return authToken;
  }
}

Note that in authUserByToken, we created a SecurityUser object, and this needs to be of the format user@REALM.
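
TokenManager and SecurityUser are Kite-internal classes. For completeness, a hypothetical minimal SecurityUser is just a java.security.Principal wrapper around the user@REALM string:

import java.security.Principal;

/** Hypothetical minimal stand-in for Kite's internal SecurityUser: a Principal wrapping "user@REALM". */
public class SecurityUser implements Principal {
  private final String name;

  public SecurityUser(String name) {
    this.name = name;
  }

  @Override
  public String getName() {
    return name;
  }
}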

Adding TokenAuthenticationFilter

To plug in the new authentication mechanism, we can use the FilterRegistrationBean.

The code snippet below is added to the SecurityConfig above:

  @Bean
  public FilterRegistrationBean filterRegistrationBean() {
    FilterRegistrationBean filterRegistrationBean = new FilterRegistrationBean();
    // tokenManager is the Kite-internal bean that validates one-time-use tokens.
    TokenAuthenticationFilter tokenAuthenticationFilter = new TokenAuthenticationFilter(tokenManager);
    filterRegistrationBean.setFilter(tokenAuthenticationFilter);
    return filterRegistrationBean;
  }

With these code changes, we can add our custom authentication logic in Spring in addition to the existing authentication mechanisms.