eBay Tech Blog http://www.ebaytechblog.com Where e-commerce meets world-class technology Thu, 10 Aug 2017 20:54:04 +0000 en-US hourly 1 https://wordpress.org/?v=4.8.1 Introducing Regressr – An Open Source Command Line Tool to Regression Test HTTP Services http://www.ebaytechblog.com/2017/08/10/introducing-regressr-an-open-source-command-line-tool-to-regression-test-http-services/ http://www.ebaytechblog.com/2017/08/10/introducing-regressr-an-open-source-command-line-tool-to-regression-test-http-services/#respond Thu, 10 Aug 2017 16:36:37 +0000 http://www.ebaytechblog.com/?p=7463
Continue Reading »

In the Artificial Intelligence-Human Language Technologies team at eBay, we work on software that powers eBay’s conversational bot, ShopBot. We ship software daily to production that makes our bot intelligent, smarter, and more human. As a crucial part of this effort, we have to make sure any regressions are caught quickly and fixed to help keep our customers doing what they love – making purchases on ShopBot.

ShopBot’s backend is built on a polyglot suite of Scala, Java, and Python-based microservices that work in unison to provide ShopBot’s functionality. Hence, many of the crucial services need to be regression tested before we can release a new version to production.

To help with that effort, we built and are open sourcing our regression testing tool, Regressr.

Why Regressr

We looked at the common approaches that are widely used in the industry today to build an automated regression testing suite. In no particular order, they are listed below.

  • Comprehensive JUnit suite that calls two versions (old and new) of the service and compares the minutiae of the responses – JSON elements, their values and the like.
  • Using SOAP UI’s Test Runner to run functional tests and catch regressions as a result of catching functionality failures.
  • No regression tests. Wait for the front-end to fail as a result of front-end regression tests in dev or test, and trace the failure to the backend.

We also looked at Diffy and were inspired by how simple it was to use for catching regressions.

We had some very unique requirements for testing eBay ShopBot and found out that none of these tools provided the features we wanted:

  1. Super-low ceremony: Must quickly be able to productionize and gain significant value without too much coding or process.
  2. Low conceptual surface area: An engineer should be able to grok what the tool does and use it quickly without going through bricks of manuals and frameworks.
  3. Configurability of comparisons: We want to able to specify how the response should be compared. Do we want to ignore JSON values? Do we want to ignore certain elements? What about comparing floating point numbers, precision, etc.?
  4. Compare at high levels of abstraction: We want to capture high-level metrics of the responses and then perform regression testing on them. For example, we would like to be able to say the number of search results in this response were 5 and then use that number to compare against future deployments.
  5. Low maintenance overhead: We want maintenance of the regression suite to have low or negligible coding effort. Once every deployment is approved for release, we just want the suite to automatically capture the current state of the deployment and use that as a reference for future deployments.
  6. CI/CD Integration: Finally, we wanted this suite to be hooked into our CI/CD build.

We built Regressr specifically to solve these requirements, so that the team can focus on the important stuff, which is serving great experiences and value to our customers who use ShopBot.

Regressr is a Scala-based command line tool that tests HTTP services, plain and simple. We built it to be really good at what it does. With Regressr, you can use the out-of-the-box components to get a basic regression test for your service up and running quickly and gain instantaneous value, while coding regression tests that will cover close to 100% of the functionality in a more delayed fashion as time permits. Finally, Regressr doesn’t even need the two services to be up and running at the same time, as it uses a datastore to capture the detail of the baseline.

Regressr works in two modes:

  1. Record – Use Record when you want to capture the current state of a deployment to be compared as the baseline for later deployments. A strategy file is specified that contains the specifics of what needs to be recorded.
  2. Compare/Replay – Compares the current state of a deployment with a baseline and generates a comparison report.

The image below captures what is done in these two flows.

The Strategy File

The strategy file is the configuration that drives what happens during a record and a compareWith execution.

An example strategy file that posts two requests and performs regression testing is specified below:

  baseURL       : http://localhost:9882/endpoint

  Content-Type    : application/json


  - requestName: say_hello
    path: /say_hello
    method: GET
    recorder: org.ebayopensource.regression.internal.components.recorder.SimpleHTTPJSONRecorder
    comparator: org.ebayopensource.regression.internal.components.comparator.SimpleHTTPJsonComparator

  - requestName: shop for a pair of shoes
    path: /shopping
    method: POST
    requestBuilder: org.ebayopensource.regression.example.ExampleRequestBuilder
      conversationId     : 12345
      keyword            : Show me a pair of shoes
      mission_start      : yes
    recorder: org.ebayopensource.regression.internal.components.recorder.SimpleHTTPJSONRecorder
    comparator: org.ebayopensource.regression.internal.components.comparator.SimpleHTTPJsonComparator

  - requestName: say goodbye
    path: /goodbye
    method: POST
    requestBuilder: org.ebayopensource.regression.internal.components.requestBuilder.StringHTTPBuilder
      payload            : '{"mission" : "12345", "keyword" : "good bye", "mission_start" : "no" }'
    recorder: org.ebayopensource.regression.internal.components.recorder.SimpleHTTPJSONRecorder
    comparator: org.ebayopensource.regression.internal.components.comparator.SimpleHTTPJsonComparator

The Components

The important parts of the strategy file are the different components, RequestBuilder, Recorder, and Comparator.

RequestBuilder is used to specify how the request should be built in case of a POST or a PUT request.

The interface for RequestBuilder accepts a Map of Strings and outputs the payload that will be sent in the request.

abstract class RequestPayloadBuilder {

  def buildRequest(dataInput: Map[String, String]): Try[String]


Recorder is used to specify what parts of the response should be recorded for future comparison. Regressr injects all parts of the response to the Recorder during this time.

The interface for Recorder accepts a List of HTTPResponses (most of the time this will be one) and return a RequestRecordingEntry.

The RequestRecordingEntry is a holder for a value that will be recorded in Regressr’s datastore. The response code can be stored in a RequestRecordingEntry. Similarly a JSON response can be stored in a RequestRecordingEntry. You can also do some computation on the JSON and store a number (like the number of search results).

The interface for Recorder looks like the below.

protected def record(responses: Seq[HTTPResponse]) : Try[RequestRecordingEntry]

Finally, the Comparator is used to specify the details of comparison during the compareWith mode. How do you want to compare JSON’s? What about strings?

The interface for Comparator looks like the below. It accepts both the recorded RequestRecordingEntry and the current one and returns a List of CompareMessages which will be included in the comparison report.

abstract class Comparator {

  def compare(recorded: RequestRecordingEntry, replayed: RequestRecordingEntry): Try[Seq[CompareMessage]]


Regressr comes with out-of-the-box components that can be plugged in to provide significant value instantaneously for many common types of services. However, you can write your own components implementing these interfaces and include them into Regressr (Use ./regressr.sh -r to build everything)

The comparison report is generated at the end of the compareWith lifecycle and looks like this:

Testing HATEOAS services

HATEOAS (Hypermedia As The Engine Of Application State) is where some classes of RESTful services tend to go to, especially when there are lightweight GUIs in front of them that mimic the conversation which happens to the service. Regressr also supports simple and efficient breadth first traversal of HATEOAS resources for regression testing.

We support this through the use of a new component class called as Continuations.

Let’s imagine you have a shopping cart service exposed at a URL such as /shoppingCart/items.

When issued a GET request on this URL, if the services is modeled on HATEOAS principles the results will be similar to:

    "items": [
        "item1": "http://<host>/shoppingCart/items/<item-id>/",
        "item2": "http://<host>/shoppingCart/items/<item-id>/",
        "item3": "http://<host>/shoppingCart/items/<item-id>/"

As you can imagine, these new requests are non-deterministic and cannot be modeled with the help of Regressr’s configuration files, because the data may change over time.

That is where Continuations come in. With continuations, the tester can specify how many new requests should be created programmatically based on the response of a previous service call.

This allows the tester to write a continuation generically that creates new requests based on how many items were present in the response of the /items call.

An example of continuations is here.

What’s Next

  1. Maven plugin that attaches to Regressr that can be used in a CI/CD build.
  2. Jenkins plugin for Regressr report.
  3. Global comparators that can be used to capture global metrics across requests and compare them.

Conclusion and Credits

We have found Regressr to be a very useful regression testing tool for lean and low ceremony engineering teams that wish to minimize effort when it comes to regression testing of their services.

There were many people involved in the design, build and testing of Regressr without which this could not have been possible. Recognizing them: Ashwanth FernandoAlex ZhangRobert EnyediAjinkya Kale and our director Amit Srivastava.

Comments and PRs are welcome with open hands at https://github.com/eBay/regressr.

http://www.ebaytechblog.com/2017/08/10/introducing-regressr-an-open-source-command-line-tool-to-regression-test-http-services/feed/ 0
Cube Planner – Build an Apache Kylin OLAP Cube Efficiently and Intelligently http://www.ebaytechblog.com/2017/08/02/cube-planner-build-an-apache-kylin-olap-cube-efficiently-and-intelligently/ http://www.ebaytechblog.com/2017/08/02/cube-planner-build-an-apache-kylin-olap-cube-efficiently-and-intelligently/#comments Wed, 02 Aug 2017 17:33:43 +0000 http://www.ebaytechblog.com/?p=7422
Continue Reading »
Life is about carefully calculating daily necessities and the same is true for technology. Frugal people spend money on things that are needed most, while programmers are always seeking to reduce the resources cost of their code. Cube Planner is a new feature created by eBay’s programmers that helps you spend resources on building cost-effective dimension combinations.


Apache Kylin is an open source Distributed Analytics Engine designed to provide multi-dimensional analysis (OLAP) on Hadoop, originally contributed from eBay to the Apache Software Foundation in 2014. Kylin became a Apache top-level project in 2015. With the rapid development of the Apache Kylin community since 2015, Apche Kylin has been deployed by many companies worldwide.

There has also been a crescendo of data analysis applications deployed on the Apache Kylin platform within eBay. There are currently 144 cubes in the production environment of eBay, and this number continues to grow on an average of 10 per month. The largest cube is about 100T. The average query amount is between 40,000 and 50,000 per week day. During last month, the average query latency is 0.49 seconds.

Why is Cube Planner needed

In 2016, eBay Data Services and Solutions (DSS) department began to introduce self-service reports. Data analysts are able to drag and drop dimensions and measures to create their own reports. But the self-service report generally has many dimensions, measures, and it is difficult to define the query pattern, which requires Kylin to pre-calculate a large amount of dimension combinations.

In addition, in companies that reach a certain scale, data providers and data users don’t belong to the same department, and data providers may not be particularly familiar to the query pattern, which leads to a gap between the capability of cube design and actual use. Some combinations of dimensions are pre-calculated, but are not queried, while some are often queried, but are not pre-calculated. These gaps have effect on the resource utilization and query performance of a cube.

We decided to make a cube build planner, Cube Planner, with the goal of efficiently using computing and storage resources to build cost-effective dimension combinations intelligently.

What is Cube Planner?

Cube Planner checks the costs and benefits of each dimension combination, and selects cost-effective dimension combination sets to improve cube build efficiency and query performance. The Cube Planner recommendation algorithm can be run multiple times throughout the life cycle of the cube. For example, after the cube is created for the first time, or after a certain amount of query statistics is collected, the cube is periodically optimized.

Definitions and Formulas

Here are some definitions and formulas used in the Cube Planner algorithm.

  • Dimensions and Measures are the basic concepts of data warehousing.
  • A dimension combination (Cuboid) refers to any combination of multiple dimensions of the data. Suppose the data has three dimensions (time, location, category), then it has eight kinds of dimension combinations: [ ], [time], [location], [category], [time, location], [time, category], [location, category], [time, location, category].
  • The basic dimension combination (Basic Cuboid) is a combination of all dimensions, with the most detailed aggregation information, such as the combination of the dimensions above [time, place, category].
  • The symbol≼ represents a parent-child relationship between dimension combinations,
    w ≼ i, indicates that a query involving dimension combination w can be aggregated from the data of the dimension combination i, the combination w is a child combination, and the combination i is the parent combination. For example, dimension combination w [time, category] and dimension combination i [time, location, category]. Each parent combination may have multiple child combinations, and the basic dimension combination is the parent of all other dimension combinations.
  • Costs of a dimension combination
    The costs of a dimension combination is divided into two aspects. One is the build cost, which is defined by the number of rows of data combined by itself, and the other is the query cost, which is defined by the number of rows scanned by the query. Since the build is often one-time while the query is repetitive, we ignore the build cost and only use the query cost to calculate the cost of a dimension combination. C(i) = Cost of dimension combination(i).
  • Benefits of dimension combination
    The benefits of a dimension combination can be understood as reduced costs, that is, the cost of all queries on this cube that can be reduced by pre-calculating this dimension combination.B(i, S) refers to the benefit of creating a dimension combination (i) in the dimension combination set S, calculated by summing up the benefits of all the subsets of i:B(i,S) = ∑w≼i Bw
    Among the set, the benefit of each subset Bw=max{ C(w) – C(i) ,0}
  • Benefit ratio of dimension combination
    Dimension combination benefit ratio = (cost of the dimension combination/benefit of dimension combination) * 100%:
    Br(i, S) = (B(i, S) / C(i)) * 100%

How to choose a cost-effective dimension combination?

We use greedy algorithms to choose a cost-effective dimension combination, and the process of selection is iterative. For each iteration, the system will go through the candidate set, select a dimension combination with the highest benefit ratios at the current state, and add it to the combination result set that needs to be built.

Let’s now look at a simple example to understand the process of the iterative selection.

Cube Planner Select a dimension combination process example

Let’s assume there is a cube that has the following dimension combination.

Dimension Combination ID Dimension
a time, place, category
b time, place
c time, category

The parent-child relationship between each dimension combination is shown in the following figure. Each dimension combination has a set of numbers (i, j) in addition to ID.

i: the number of rows in this dimension combination

j: the cost of querying the dimension combination

In the following figure, dimension combination a is the basic dimension combination (Basic Cuboid), and the data has 100 rows. In the Kylin cube, the basic dimension combination will be pre-calculated by default. Initially, the result set starts with only the basic dimension combination. Therefore, queries on other dimension combinations need to answered by scanning the basic dimension combination data then doing the aggregation, so the cost of these queries is the number of rows, 100.

First iteration of Cube Planner greedy algorithm

The green dimension combinations are the combinations that have been selected into the result set during the previous rounds, initially with only the basic dimension combination. The red ones are the combinations that may be selected into the result set during the this round. The blue ones indicate that these combinations are selected and will be added to the result set during this round.

What is the benefit of selecting dimension combination b, that is, the reduced cost of the entire cube? If b is selected, the cost of b and its children (d, e, g, h) will be reduced from 100 to 50, so that the benefit ratio of the dimension combination b is (100 – 50) * 5 / 50 = 5.

If you choose the dimension combination C, the benefit ratio is 1.67.

If other dimension combinations are selected, the benefit ratio is as follows. It is obvious that g is a combination with the highest benefit ratio, so dimension combination g is added to the result set during this round.

Second round selection of the Cube Planner

At this point, the result set has become (a, g). At this state, if b is selected, although the children of b are d, e, g and h, only query over b, d and e will be effected because g has been pre-calculated and g has fewer row counts than b. When the query against dimension combination g arrives, the system will still choose g to answer the query. This is exactly why the benefit of each child combinations mentioned above is Bw=max{ QC(w) – QC(i) ,0}. So, the benefit ratio of combination b is 4.

If you pick the combination c, the benefit ratio is 1.34.

The benefit of other combinations is as follows. Combination h is selected eventually and is added to the result set during this round.

When do we stop the iteration selection?

When any one of the following three conditions is fulfilled, the system will stop the selection:

  • The expansion rate of combination dimensions in the result set reaches the established expansion rate of the cube.
  • The efficiency ratio of the selected dimension combination this round is lower than the established minimum value. This indicates that the newly added dimension is not cost-effective, so there is no need to select more dimension combinations.
  • The run time of the selection algorithm reaches the established ceiling.

A Cube Planner algorithm based on weighted query statistics

In the previous algorithm, the benefit weights of other dimension combinations are the same when calculating the benefit that one dimension combination will bring for the entire cube. But in the actual application, the probability of querying dimension combination varies. Therefore, when calculating the benefit, we use the probability of querying as the weight of each dimension combination.

Weighted dimension combination benefit WB(i,S) = ∑w≼i (Bw * Pw)

Pw is the probability of querying this dimension combination. As for those dimension combinations that are not queried, we set the probability to a small non-zero value to avoid overfitting.

As for the newly created cube, the probability is set to the same because there are no query statistics in the system. There is room for improvement. We want to use the platform-level statistics data and some data-mining algorithms to estimate the probability of the querying dimension combinations of the new cube in the future. For the cubes that are already online, we collect the query statistics, and it is easy to calculate the probability of querying each dimension combination.

In fact, the real algorithm based on weighted query statistics is more complex than this. Many practical problems need to be solved, for example, how to estimate the cost of a dimension combination that is not pre-calculated, because Kylin only stores statistics of the pre-calculated dimension combinations.

Greedy algorithm VS Genetic algorithm

Greedy algorithms have some limitations when there are too many dimensions. Because the candidate dimension combination set is very large, the algorithm needs to go through each dimension combination, so it takes some time to run. Thus, we implemented a genetic algorithm as an alternative. Using the iterative and evolutionary model of the gene to simulate the selection of the dimension combination set, the optimal combination set of dimensions is finally evolved (selected) through a number of iterations. There is no refinement. If anyone is interested, we can write a detailed description of the genetic algorithm used in Cube Planner.

Cube Planner one-click optimization

For the cube that is already on the production site, Cube Planner can use the collected query statistics, and intelligently recommend a dimension combination set for pre-calculation based on cost and benefit algorithm. Users can make a one-click optimization based on this recommendation. The following figure is the one-click optimization process. We use two jobs, Optimizejob and Checkpointjob, not only to ensure the optimization process concurrency, but also to ensure the atomicity of the data switching after optimizing. In order to save the resource of optimizing the cube, we will not recalculate all the recommended dimension combinations in Optimizejob, but only compute those new combinations in the recommended dimension combinations set. If some recommended dimension combinations are pre-calculated in the cube, we reuse them.

Application of Cube Planner in eBay

Cube Planner’s development was basically completed at the end of 2016. After a variety of testing, it was formally launched in the production environment in February 2017. After the launch, it was highly praised and unanimously recognized by the users. Here is an example of the use of Cube Planner on eBay’s production line. The customer optimized the cube on April 19th. The optimized cube not only had a significant boost in building performance, but also improved significantly in query performance.

Follow-up development of Cube Planner

In addition to enhancing resource utilization, Cube Planner has another goal of reducing the cube design complexity. At present, the design of a Kylin cube depends on the understanding of data. Cube Planner has made Kylin able to recommend and optimize based on data distribution and data usages, but this is not enough. We want to make the Kylin cube design a smart recommendation system based on query statistics across cubes within same domain.

http://www.ebaytechblog.com/2017/08/02/cube-planner-build-an-apache-kylin-olap-cube-efficiently-and-intelligently/feed/ 4
eBay’s New Homepage http://www.ebaytechblog.com/2017/07/26/ebays-new-homepage/ http://www.ebaytechblog.com/2017/07/26/ebays-new-homepage/#respond Wed, 26 Jul 2017 16:53:36 +0000 http://www.ebaytechblog.com/?p=7486
Continue Reading »

Building and releasing a modern homepage for eBay’s hundreds of millions of annual and tens of millions daily users was the team’s greatest challenge to date. We were given the mission to develop a personalized buying experience by mashing merchandised best offers with tailored ads, topping it off with the hottest trends and relevant curated content. Any such virtual front door had to be robust to changes, be easily configurable for any site, and had to load really, really fast!

There’s no second chance to make a first impression, so we had to make it count. But how quickly can you go from 0 to 100 (100% deployed for 100% of traffic to the homepage)?

The team commenced this project in late 2016, and we rushed to get an early flavor available for an internal beta in mid-February. Then, traffic was slowly ramped until reaching a 100% mark for the Big 4 sites (US, UK, DE, AU), delivering it less than three months after the beta, while simultaneously adding more scope and fresh content to the sites.

Our new homepage consists of three main components, spread across three large pools:

  • Node Application – The homepage front-end web server handles incoming user requests and acts as the presentation layer for the application, boasting responsive design.
  • Experience Service – The experience service tier is the primary application server that interacts with the page template and customization service and orchestrates the request handling to various data providers. It performs response post-processing that includes tracking. This layer returns a render-ready format that is consumable by all screen types. The experience layer is invoked by Node for desktop and mobile web traffic and directly by the native applications.
  • Data Provider Service – The data provider service is facade that massages responses from datastores into a semantic format. Providers, such as Merchandise, Ads, Marketing, Collections, and Trends, are interfacing with this service. JSON transformation, done using Jolt, produces the response datagrams.

Additional auxiliary components help keep the system configurable and simple to reorganize or to publish new content to site within seconds.

The Experience service was built on a shared platform, called Answers, developed jointly with the Browse and Search domains.

Due to the hierarchical nature of the system, its complex compositeness (16 unique modules), and its multiple dependencies (9 different services, to be exact), special consideration has to be given to the rollout and deployment process.

We adopted a bottom-up strategy by deploying in the reverse order of dependencies. That is, we start with datastores, data providers, experience services and lastly deploy the node application server. This approach helps us bubble up issues early on in the process and tackle misalignment in an effective manner.

All service-to-service calls are made via clearly defined, internally reviewed APIs. Interface changes are only allowed to be additive to maintain backwards compatibility. This is essential for the incremental bottom-up rollout scheme.

When developing and deploying a new build, as you would delicately change a layer in a house of cards, one must act with great care and much thought about the inherited risks. Test strategies like “mock-first” were used as a working proof-of-concept and allowed testing to begin early on, where caching techniques quickly took the place of the very same mocks in the final product. Unit tests, hand-in-hand with functional and nonfunctional tests, had to be done with rigor, to make regression easy, fast, and resilient to change.

The team always deployed to Staging and Pre-production before conducting any smoke-testing on three boxes in Production, one VM instance per Data Center. Specific networking issues were exposed and quickly remediated, as a result.

Some engineering-related challenges that the team was facing had to do with Experimentation Platform complexities, where a selective pool channel redirect appeared to have unexpected behavior with traffic rerouting, while gradually diverting from the current Feed home page to the new carousel-based one.

Internal Beta testing, done by Buyer Experience teams, and external in-the-wild testing conducted by a crowd-testing company, took place with a zero-traffic experiment. This process was followed by a slow 1% traffic ramp, to help control the increments, as gradual 5% and 25% steps were introduced to the pools, spanning days to weeks between the phases. Each ramp involved multiple daily Load & Performance cycles, while fine-tuning timeout and retry variables.

To summarize our key learnings from the homepage experience:

  • Relying on clearly-defined APIs was vital in achieving an iterative, incremental development and rollout strategy.
  • Co-evolving our logging and monitoring infrastructure and the team’s debugging expertise was essential to achieving an aggressive rollout approach on a distributed stack.
  • Tracking and Experimentation setups, as well as baking in Accessibility, proved to be major variables that should get well-scoped into a product’s development lifecycle and launch timelines.
  • Introducing gradual functional increments along with a design for testability in mind proved to be a valuable quality control measure and allowed absorbing change while moving forward.
  • A meticulous development and release process maintained the associated risks and mitigation actions, allowing the team to meet their milestones, as defined and on time.
http://www.ebaytechblog.com/2017/07/26/ebays-new-homepage/feed/ 0
Enhancing the User Experience of the Hadoop Ecosystem http://www.ebaytechblog.com/2017/05/12/enhancing-the-user-experience-of-the-hadoop-ecosystem/ http://www.ebaytechblog.com/2017/05/12/enhancing-the-user-experience-of-the-hadoop-ecosystem/#respond Fri, 12 May 2017 19:24:55 +0000 http://www.ebaytechblog.com/?p=7223
Continue Reading »

At eBay, we have multiple large, multi-tenant clusters. Each of these clusters stores hundreds of petabytes of data. These clusters offer tens of thousands of cores to run computations on the data. We have thousands of internal users who use Hadoop in their roles, including data analysts, data scientists, engineers, and product managers. These users use multiple technologies like MapReduce, Hive, and Spark to process data. There are thousands of applications that push and pull data from Hadoop and run computations.

Figure 1: Hadoop clusters, auxiliary components, users, applications, and services


The users normally interact with the cluster via the command line by SSHing to specialized gateway machines that reside in the same network zone as the cluster. To transfer job files and scripts, the users need to SCP over multiple hops.

Figure 2: Old way of accessing a Hadoop cluster

The need to traverse multiple hops as well as the command-line-only usage was a major hindrance to the productivity of our data users.

On the other side, our website applications and services need to access data and perform compute. These applications and services reside in a different network zone and hence need to set up network rules to access various services like HDFS, YARN, and Oozie. Since our clusters are secured with Kerberos, the applications need to be able to use Kerberos to authenticate to the Hadoop services. This was causing an extra burden for our application developers.

In this post, I will share the work in progress to facilitate access to our Hadoop clusters for data and compute resources by users and applications.


We need better ways to achieve the following goals:

  • Our engineers and other users need to use multiple clusters and related components.
  • Data Analysts and other users need to run interactive queries and create shareable reports.
  • Developers need to be able to develop applications and services without spending time on connectivity problems or Kerberos authentication.
  • We can afford no compromise on security.


To improve user experience and productivity, we added three open-source components:

Hue — to perform operations on Hadoop and related components.

Apache Zeppelin — to develop interactive notebooks with queries, programs, and reports.

Apache Knox — to serve as a single point for applications to access HDFS, Oozie, and other Hadoop services.

Figure 3: Enhanced user experience with Hue, Zeppelin, and Knox

We will describe each product, the main use cases, a list of our customizations, and the architecture.


Hue is a user interface to the Hadoop ecosystem. It provides user interfaces to several components including HDFS, Oozie, Resource Manager, Hive, and HBase. It is a 100% open-source product, actively supported by Cloudera, and stored at the Hue GitHub site.

Similar products

Apache Airflow allows users to specify workflows in Python. Since we did not want a Python learning curve for our users, we chose Hue instead of Airflow. But we may find Airflow compelling enough to deploy it in future so that it can be used by people who prefer Airflow.

Use cases of Hue

Hue allows a user to work with multiple components of the Hadoop ecosystem. A few common use cases are listed below:

  • To browse, manage, upload, and download HDFS files and directories
  • To specify workflows comprising MapReduce, Hive, Pig, Spark, Java, and shell actions
  • Schedule workflows and track SLAs
  • To manage Hive metadata, run Hive queries, and share the queries with other users
  • To manage HBase metadata and interact with HBase
  • To view YARN applications and terminate applications if needed


Two-factor authentication — To ensure that the same security level is maintained as that of command-line access, we needed to integrate our custom SAML-based two-factor authentication in Hue. Hue supports plugging in new authentication mechanisms, using which we were able to plug in our two-factor authentication.

Ability to impersonate other users — At eBay, users sometimes operate on behalf of a team account. We added capability in Hue so that users can impersonate as another account as long as they are authorized to do so. The authorization is controlled by LDAP group memberships. The users can switch back between multiple accounts.

Working with multiple clusters — Since we have multiple clusters, we wanted to provide single Hue instance serving multiple Hadoop clusters and components. This enhancement required changes in HDFS File Browser, Job Browser, Hive Metastore Managers, Hive query editors, and work flow submissions.


Figure 4: Hue architecture at eBay


A lot of our users, especially data scientists, want to run interactive queries on the data stored on Hadoop clusters. They run one query, check its results, and, based on the results, form the next query. Big data frameworks like Spark, Presto, Kylin, and to some extent, HiveServer2 provide this kind of interactive query support.

Apache Zeppelin (GitHub repo) is a user interface that integrates well with products like Spark, Presto, Kylin, among others. In addition, Zeppelin provides an interface where users can develop data notebooks. The notebooks can express data processing logic in SQL or Scala or Python or R. Zeppelin also supports data visualization in notebooks in the form of tables and charts.

Zeppelin is an Apache project and is 100% open source.

Use cases

Zeppelin allows a user to develop visually appealing interactive notebooks using multiple components of the Hadoop ecosystem. A few common use cases are listed below:

  • Run a quick Select statement on a Hive table using Presto.
  • Develop a report based on a dataset by reading files from HDFS and persisting them in memory as Spark data frames.
  • Create an interactive dashboard that allows users to search through a specific set of log files with custom format and schema.
  • Inspect the schema of a Hive table.


Two-factor authentication — To maintain security parity with that command-line access, we plugged in our custom two-factor authentication mechanism in Zeppelin. Zeppelin uses Shiro for security, and Shiro allows one to plug in a custom authentication with some difficulty.

Support for multiple clusters — We have multiple clusters and multiple instances of components like Hive. To support multiple instances in one Zeppelin server, we created different interpreters for different clusters or server instances.

Capability to override interpreter settings at the user level — Some of the interpreter settings, such as job queues and memory values, among others, need to be customized by users for their specific use cases. To support this, we added a feature in Zeppelin so that users can override certain Interpreter settings by setting properties. This is described in detail in this Apache JIRA ticket ZEPPELIN-1625


Figure 5: Zeppelin Architecture at eBay


Apache Knox (GitHub repo) is an HTTP reverse proxy, and it provides a single endpoint for applications to invoke Hadoop operations. It supports multiple clusters and multiple components like webHDFS, Oozie, WebHCat, etc. It can also support multiple authentication mechanisms so that we can hook up custom authentication along with Kerberos authentication

It is an Apache top-level project and is 100% open source.

Use cases

Knox allows an application to talk to multiple Hadoop clusters and related components through a single entry point using any application-friendly non-Kerberos authentication mechanism. A few common use cases are listed below:

  • To authenticate using an application token and put/get files to/from HDFS on a specific cluster
  • To authenticate using an application token and trigger an Oozie job
  • To run a Hive script using WebHCat


Authentication using application tokens — The applications and services in the eBay backend use a custom token-based authentication mechanism. To take advantage of the existing application credentials, we enhanced Knox to support our application token-based authentication mechanism in addition to Kerberos. Knox utilizes the Hadoop Authentication framework, which is flexible enough to plug in new authentication mechanisms. The steps to plug in an authentication mechanism on Hadoop’s HTTP interface is described in Multiple Authentication Mechanisms for Hadoop Web Interfaces


Figure 6: Knox Architecture at eBay


In this blog post, we describe the approach taken to improve user experience and developer productivity in using our multiple Hadoop clusters and related components. We illustrate the use of three open-source products to make Hadoop users’ life a lot simpler. The products are Hue, Zeppelin, and Knox. We evaluated these products, customized them for eBay’s purpose, and made them available for our users to carry out their projects efficiently.

http://www.ebaytechblog.com/2017/05/12/enhancing-the-user-experience-of-the-hadoop-ecosystem/feed/ 0
A Creative Visualization of OLAP Cuboids http://www.ebaytechblog.com/2017/05/09/a-creative-visualization-of-olap-cuboids/ http://www.ebaytechblog.com/2017/05/09/a-creative-visualization-of-olap-cuboids/#respond Tue, 09 May 2017 18:50:16 +0000 http://www.ebaytechblog.com/?p=6994
Continue Reading »

eBay is one of the world’s largest and most vibrant marketplaces with 1.1 billion live listings every day, 169 million active buyers, and trillions of rows of datasets ranging from terabytes to petabytes. Analyzing such volumes of data required eBay’s Analytics Data Infrastructure (ADI) team to create a fast data analytics platform for this big data using Apache Kylin, which is an open-source Distributed Analytics Engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets.

Apache Kylin creatively applies data warehouse OLAP technology to the Hadoop domain, which makes a query return within milliseconds to seconds against datasets in the PBs. The magic of Apache Kylin is that it pre-calculates the metrics against defined dimensions. So when a query is launched, it doesn’t need to scan PBs worth of source data, but instead it scans the pre-calculated metrics, which is much smaller than the source data to accelerate the query speed.

Currently there are hundreds of cubes running on the Kylin production environment within eBay, serving dozens of Business domains in Inventory Healthy Analysis, User Behavior Analysis, eBay APIs Usage Analysis and eBay Business Partner Channel Performance Analysis, etc.

This post showcases the creative visualization of OLAP cuboids that has been implemented in the Cube Planner feature built on Apache Kylin by eBay’s ADI team to solve the challenge of showing huge OLAP cuboids in a fixed space. To better understand the challenge as well as the value of the visualization of sunburst charts that have been introduced into OLAP cuboids, some basic concepts need to be covered.

Basic Concepts

An OLAP cube is a term that typically refers to multi-dimensional array of data [1][2]. OLAP is an acronym for online analytical processing [3], which is a computer-based technique of analyzing data to look for insights. The term cube here refers to a multi-dimensional dataset, which is also sometimes called a hypercube if the number of dimensions is greater than 3.

A cuboid is one combination of dimensions.

For example, if a cube has 4 dimensions — time, item, location, and supplier — it has 16 cuboids, as shown here.

A basic cuboid has the most detailed data, except for the source data itself; it is composed of all dimensions, like (time, item, location, supplier). It can be rolled up to all the other cuboids. For example, a user can roll up the basic cuboid (time, item, location, supplier) along dimension “supplier” to cuboid (time, item, location). And in this case, the basic cuboid is the Parent Cuboid, and a 3-D cuboid (time, item, location) is a Child Cuboid.

OLAP Cuboids Visualization Challenges

The OLAP cuboids visualization has the following characteristics:

  • All the cuboids have a root parent — the basic cuboid.
  • The relationship “rollup to” between two cuboids is directed, from parent to child.
  • The relationship “rollup to” is m:n mapping. One parent cuboid can be rolled up to multiple child cuboids, while one child cuboid can be rolled up from multiple parent cuboids.

So the visualization of cuboids is typically a directed graph. But, in the real OLAP world, not all the relationships are kept. The m:n mappings are simplified to 1:n mappings. Every child cuboid would have just one parent. Usually we keep the relationship with the parent, who has lowest row count, and eliminate others, because the lowest row count of parent cuboid means the lowest cost of aggregation to the child cuboid. Hence, the visualization of cuboids is simplified to a tree, where the basic cuboid is the root, as shown below.

Even with the simplified relationships between cuboids, there can still be some challenges to cuboids layout with a tree:

  • The tree must be collapsed to fit in a fixed space.
  • It is impossible to have an overall view of all cuboids.
  • Multiple clicks are needed to the target node from root layer by layer.
  • It’s hard to focus on the whole view of all the child cuboids from a given cuboid.

Cuboid Visualization Solution in Cube Planner

Cube Planner makes OLAP cubes more resource-efficient. It intelligently builds a partial cube to minimize the cost of building a cube and at the same time maximizes the benefit of serving end user queries. Besides, it learns patterns from queries at runtime and dynamically recommends cuboids accordingly.

In Cube Planner, we want to show query usage down to the cuboid level, which enables the cube owner to gain more insight into their cubes. We want to have a color-coded heat map with all cuboids in one view to give the cube owner an overall feeling of its cube design. Furthermore, when the user hovers over each cuboid, details of individual cuboid — cuboid ID, query count, row count, rollup rate, etc. — would be displayed. We will also recommend a new cube design with recommended cuboids based on the query statistics, thus we will need to put the old cuboids and new cuboids together in one page to show the benefit of cube optimization.

We are not able to use any tree or directed graph component to meet all our requirements above. Luckily, our GUI engineer discovered a means to produce sunburst charts, which greatly meet our expectations.

What is a Sunburst Chart?

Our sunburst charts are created with Angular-nvD3, an AngularJS directive for NVD3 re-usable charting library (based on D3). Users can easily customize their charts via a JSON API. Go to the Angular-nvD3 quick start page if you want to know more about how to include these fancy charts in your GUI.

How Are Sunburst Charts Used for Cube Cuboids Visualization?

Basically, at the back end we create a REST API to return the cuboid tree with the necessary information, and at the front end a JavaScript controller parses the REST response to a relative JSON format and then renders the sunburst chart. Below are samples of code from these two layers.

REST Service returns the Cuboid Tree response

@RequestMapping(value = "/{cubeName}/cuboids/current", method = RequestMethod.GET)
    public CuboidTreeResponse getCurrentCuboids(@PathVariable String cubeName) {
        CubeInstance cube = cubeService.getCubeManager().getCube(cubeName);
        if (cube == null) {
            logger.error("Get cube: [" + cubeName + "] failed when get current cuboids");
            throw new BadRequestException("Get cube: [" + cubeName + "] failed when get current cuboids");
        Map<Long, Long> cuboidList = Maps.newLinkedHashMap();
        try {
            cuboidList = cubeService.getCurrentCuboidStatistics(cube);
        } catch (IOException e) {
            logger.error("Get cuboids list failed.", e);
            throw new InternalErrorException("Get cuboids failed.", e);
        CuboidTree currentCuboidTree = CuboidTreeManager.getCuboidTree(cuboidList);
        Map<Long, Long> hitFrequencyMap = getCuboidHitFrequency(cubeName);
        Map<Long, Long> queryMatchMap = getCuboidQueryMatchCount(cubeName);
        if (currentCuboidTree == null) {
            logger.warn("Get current cuboid tree failed.");
            return null;
        return cubeService.getCuboidTreeResponse(currentCuboidTree, cuboidList, hitFrequencyMap, queryMatchMap, null);

The JavaScript controller parses the REST response to create the sunburst chart

// transform chart data and customized options.
    $scope.createChart = function(data, type) {
        var chartData = data.treeNode;
        var baseNodeInfo = _.find(data.nodeInfos, function(o) { return o.cuboid_id == data.treeNode.cuboid_id; });
        $scope.formatChartData(chartData, data, baseNodeInfo.row_count);
        } else if ('recommend' === type) {
            $scope.recommendData = [chartData];
            $scope.recommendOptions = angular.copy(cubeConfig.chartOptions);
            $scope.recommendOptions.caption = {
                enable: true,
                html: '
Existed:  Hottest '
                    + ' Hot '
                    + ' Warm '
                    + ' Cold
                    + '
New:  Hottest '
                    + ' Hot '
                    + ' Warm '
                    + ' Cold
                    + '
                css: {
                    position: 'relative',
                    top: '-25px'
            $scope.recommendOptions.chart.color = function(d) {
                var cuboid = _.find(data.nodeInfos, function(o) { return o.name == d; });
                if (cuboid.row_count < 0) {
                    return d3.scale.category20c().range()[5];
                } else {
                    var baseRate = 1/data.nodeInfos.length;
                    var colorIndex = 0;
                    if (!cuboid.existed) {
                        colorIndex = 8;
                    if (cuboid.query_rate > (3 * baseRate)) {
                        return d3.scale.category20c().range()[colorIndex];
                    } else if (cuboid.query_rate > (2 * baseRate)) {
                        return d3.scale.category20c().range()[colorIndex+1];
                    } else if (cuboid.query_rate > baseRate) {
                        return d3.scale.category20c().range()[colorIndex+2];
                    } else {
                        return d3.scale.category20c().range()[colorIndex+3];
            $scope.recommendOptions.title.text = 'Recommend Cuboid Distribution';
            $scope.recommendOptions.subtitle.text = 'Cuboid Count: ' + data.nodeInfos.length;

    // transform chart data
    $scope.formatChartData= function(treeNode, orgData, parentRowCount) {
        var nodeInfo = _.find(orgData.nodeInfos, function(o) { return o.cuboid_id == treeNode.cuboid_id; });
        $.extend(true, treeNode, nodeInfo);
        treeNode.parent_row_count = parentRowCount;
        if(treeNode.children.length > 0) {
            angular.forEach(treeNode.children, function(child) {
                $scope.formatChartData(child, orgData, nodeInfo.row_count);

Screenshots and Interaction

Below are some screenshots from the eBay Kylin production environment. With a sunburst chart, cube owners can easily understand their overall cube design with a color-coded cuboid usage heat map. The greater number of dark blue elements in the sunburst chart, the more resource efficient this cube is. The greater number of light blue elements, the more room there is for efficiency.

Putting two sunburst charts of both current and recommended cuboids together, the changes become obvious. The cuboids that are recommended to be removed are marked with gray, and greens are recommended to be added. A popup window with more details of the cuboid will be shown when mouse hover over the cuboid element in a sunburst chart. The value of Cube Planner is now apparent.

Interaction with a sunburst chart is fast and convenient. The user is able to focus on any cuboid and its children with just one click, and the view changes automatically, like from the left chart to the right chart below.


If you want to specify the parent of a leaf, click on the center circle (the part marked yellow).


In short, leveraging sunburst charts as OLAP cuboid visualizations introduces a creative way to discover cube insights down to the cuboid level. With these insights, the cube owner is able to have a resource-efficient cube, thus make Apache Kylin more competitive as an OLAP engine on Hadoop.


Some graphics copyright Apache Kylin

http://www.ebaytechblog.com/2017/05/09/a-creative-visualization-of-olap-cuboids/feed/ 0
A Surprising Pitfall of Human Judgement and How to Correct It http://www.ebaytechblog.com/2017/05/04/a-surprising-pitfall-of-human-judgement-and-how-to-correct-it/ http://www.ebaytechblog.com/2017/05/04/a-surprising-pitfall-of-human-judgement-and-how-to-correct-it/#comments Thu, 04 May 2017 17:00:10 +0000 http://www.ebaytechblog.com/?p=7156
Continue Reading »

Algorithms based on machine learning, deep learning, and AI are in the news these days. Evaluating the quality of these algorithms is usually done using human judgment. For example, if an algorithm claims to detect whether an image contains a pet, the claim can be checked by selecting a sample of images, using human judges to detect if there is a pet, and then comparing this to the results to the algorithm. This post discusses a pitfall in using human judgment that has been mostly overlooked until now.

In real life, human judges are imperfect. This is especially true if the judges are crowdsourced. This is not a new observation. Many proposals to process raw judgment scores and improve their accuracy have been made. They almost always involve having multiple judges score each item and combining the scores in some fashion. The simplest (and probably most common) method of combination is majority vote: for example, if the judges are rating yes/no (for example, is there a pet), you could report the rating given by the majority of the judges.

But even after such processing, some errors will remain. Judge errors are often correlated, and so multiple judgments can correct fewer errors than you might expect. For example, most of the judges might be unclear on the meaning of “pet”, incorrectly assume that a chicken cannot be a pet, and wrongly label a photo of a suburban home showing a child playing with chickens in the backyard. Nonetheless, any judgment errors remaining after processing are usually ignored, and the processed judgments are treated as if they were perfect.

Ignoring judge errors is a pitfall. Here is a simulation illustrating this. I’ll use the pet example again. Suppose there are 1000 images sent to the judges and that 70% of the images contain a pet. Suppose that when there is no pet, the judges (or the final processed judgment) correctly detect this 95% of the time. This is actually higher than typical for realistic industrial applications. And suppose that some of the pet images are subtle, and so when there is a pet, judges are correct only 90% of the time. I now present a simulation to show what happens.

Here’s how one round of the simulation works. I assume that there are millions of possible images and that 70% of them contain a pet. I draw a random sample of 1000. I assume that when the image has a pet, the judgment process has a 90% chance reporting “pet”, and when there is no pet, the judgment process reports “no pet” 95% of the time. I get 1000 judgments (one for each image), and I record the fraction of those judgments that were “pet”.

That’s one round of the simulation. I do the simulation many times and plot the results as a histogram. I actually did 100,000 simulations, so the histogram has 100,000 values. The results are below in red. Since this is a simulation, I know the true fraction of images with pets: it’s 70%. The estimated fraction from the judges in each simulation is almost always lower than the true fraction. Averaged over all simulations, the estimated fraction is 64%, noticeably lower than the true value of 70%.

This illustrates the pitfall: by ignoring judge error, you get the wrong answer. You might wonder if this has any practical implications. How often do you need a precise estimate for the fraction of pets in a collection of images? But what if you’re using the judgments to determine the accuracy of an algorithm that detects pets? And suppose the algorithm has an accuracy rate of 70%, but the judges say it is 64%. In the machine learning/AI community, the difference between an algorithm that is 70% accurate vs 64% is a big deal.

But maybe things aren’t as bad as they seem. When statistically savvy experimenters report the results of human judgment, they return error bars. The error bars (I give details below) in this case are

    \[       0.641 \pm 1.96 \sqrt{\frac{p (1-p)}{n}} \approx 0.641 \pm 0.030 \qquad (p = 0.64) \]

So even error bars don’t help: the actual accuracy rate of 0.7 is not included inside the error bars.

The purpose of this post is to show that you can compensate for judge error. In the image above, the blue bars are the simulation of the compensating algorithm I will present. The blue bars do a much better job of estimating the fraction of pets than the naive algorithm. This post is just a summary, but full details are in Drawing Sound Conclusions from Noisy Judgments from the WWW17 conference.

The histogram shows that the traditional naive algorithm (red) has a strong bias. But it also seems to have a smaller variance, so you might wonder if it has a smaller mean square error (MSE) than the blue algorithm. It does not. The naive algorithm has MSE 0.0037, the improved algorithm 0.0007. The smaller variance does not compensate for the large bias.

Finally, I can explain why judge error is so important. For a typical problem requiring ML/AI, many different algorithms are tried. Then human judgment can be used to detect if one is better than the other. Current practice is to use error bars as above, which do not take into account errors in the human judgment process. But this can lead the algorithm developer astray and suggest that a new algorithm is better (or worse) when in fact the difference is just noise.

The setup

I’m going to assume there are both an algorithm that takes an input and outputs a label (not necessarily a binary label) and also a judge that decides if the output is correct. So the judge is performing a binary task: determining if the algorithm is correct or not. From these judgments, you can compute the fraction of times (an estimate of the probability p) that the algorithm is correct. I will show that if you can estimate the error in the judgment process, then you can compensate for it and get a better estimate of the accuracy of the algorithm. This applies if you use raw judge scores or if you use a judgment process (for example, majority vote) to improve judge accuracy. In the latter case, you need to estimate the accuracy of the judgment process.

A simple example of this setup is an information retrieval algorithm. The algorithm is given a query and returns a document. A judge decides if the document is relevant or not. A subset of those judgments is reevaluated by gold judge experts, giving an estimate of the judges’ accuracy. Of course, if you are rich enough to be able to afford gold judges to perform all your labeling, then you don’t need to use the method presented here.

A slightly more subtle example is the pet detection algorithm mentioned above. Here it is likely that there would be a labeled set of images (a test set) used to evaluate the algorithm. You want to know how often the algorithm agrees with the labels, and you need to correct for errors in the judgment about agreement. To estimate the error rate, pick a subset of images and have them rejudged by experts (gold judges). The judges were detecting if an image had a pet or not. However, what I am really interested in is the error rate in judging whether the algorithm gave the correct answer or not. But that is easily computed from the gold judgments of the images.

The first formula

In the introduction, I mentioned that there are formulas that can correct for judge error. To explain the formula, recall that there is a task that requires labeling items into two classes, which I’ll call positive and negative. In the motivating example, positive means the algorithm gives the correct output for the item, negative that it is incorrect. I have a set of items, and judges perform the task on each item, getting an estimate of how many items are in the positive class. I’d like to use information on the accuracy of the judges to adjust this estimate. The symbols used in the formula are:

Rendered by QuickLaTeX.com

The first formula is

    \[ p = \frac{p_J + q_- - 1}{q_+ + q_- -1} \]

This formula is not difficult to derive. See the aforementioned WWW paper for details. Here are some checks that the formula is plausible. If the judges are prefect then q_+ = q_- = 1, and the formula reduces to p = p_J. In other words, the judges’ opinion of p is correct. Next, suppose the judges are useless and guess randomly. Then q_+ = q_- = 1/2, and the formula makes no sense because the denominator is infinity. So that’s also consistent.

Notice the formula is asymmetric: the numerator has q_- but not q_+. To see why this makes sense, first consider the case when the judges are perfect on negative items so that q_- = 1. The judges’ only error is to take correct answers and claim they are negative. So an estimate of p by the judges is always too pessimistic. On the other hand, if q_+ = 1, then the judges are optimistic, because they sometimes judge incorrect items as correct.

I will now show that the formula achieves these requirements. In particular, if q_- = 1 then I expect p_J < p and if q_+ = 1 then p_J > p. To verify this, note that if q_- = 1 the formula becomes p = p_J/q_+ > p_J. And if q_+ = 1 then

    \begin{eqnarray*}   p  &=& \frac{p_J + q_- - 1}{q_-} \\ &=& \frac{p_J - 1}{q_-} + 1 \\ &=& \frac{p_J - 1}{q_-} + (1-p_J) + p_J \\   &= & (p_J - 1)\left(\frac{1}{q_-} - 1\right) + p_J \\   & < & p_J \end{eqnarray*}

the last inequality because p_J - 1 < 0 and 1/q_- -1 > 0.

Up until now the formula is theoretical, because the precise values of p_J, q_+ and q_- are not known. I introduce some symbols for the estimates.

Rendered by QuickLaTeX.com

The practical formula is

    \[ \widehat{p} = \frac{\widehat{p}_J + \widehat{q}_\text{--} - 1}{\widehat{q}_+ + \widehat{q}_\text{--} -1} \]

The second formula

In the introduction, I motivated the concern with judgment error using the problem of determining the accuracy of an ML algorithm and, in particular, comparing two algorithms to determine which is better. If you had perfect judges and an extremely large set of labeled data, you could measure the accuracy of each algorithm on the large labeled set, and the one with the highest accuracy would clearly be best. But the labeled set is often limited in size. That leads to some uncertainty: if you had picked a different labeled set you might get a difference answer. That is the purpose of error bars: they quantify the uncertainty. Traditionally, the only uncertainty taken into account is due to the finite sample size. In this case, the traditional method gives 95% error bars of \widehat{p}_J  \pm 2 \sqrt{v_{\widehat{p}_J}} where

    \begin{eqnarray*}       \widehat{p}_J  &=&  \frac{k}{n} \\       v_{\widehat{p}_J} &= &  \frac{\widehat{p}_J(1-\widehat{p}_J)}{n} \end{eqnarray*}

The judge gave a positive label k times out of n, \widehat{p}_J is the estimated mean and v_{\widehat{p}_J} the estimate of the variance of \widehat{p}_J. I showed via simulation that if there are judgment errors, these intervals are too small. The corrected formula is

    \[ v_{\widehat{p}} =  \frac{v_{\widehat{p}_J}}{(\widehat{q}_+ + \widehat{q}_\text{--} - 1)^2} + v_{\widehat{q}_+}\frac{(\widehat{p}_J - 1 + \widehat{q}_\text{--})^2}{(\widehat{q}_+ + \widehat{q}_\text{--} - 1)^4} + v_{\widehat{q}_\text{--}}\frac{(\widehat{p}_J - \widehat{q}_+)^2}{(\widehat{q}_+ + \widehat{q}_\text{--} - 1)^4} \]


Rendered by QuickLaTeX.com

The formula shows that when taking judge error into account, the variance increases, that is, v_{\widehat{p}} >  v_{\widehat{p}_J}. The first term of the equation for v_{\widehat{p}} is already larger than v_{\widehat{p}_J}, since is it v_{\widehat{p}_J} divided by a number less than one. The next two terms add additional error, due to the uncertainty in estimating the judge errors \widehat{q}_+ and \widehat{q}_-.


The theme of this post is correcting for judge errors, but I have only discussed the case when the judge gives each item a binary rating, as a way to estimate what fraction of items are in each of the two classes. For example, the judge reports if an algorithm has given the correct answer, and you want to know how often the algorithm is correct. Or the judge examines a query and a document, and you want to know how often the document is relevant to the query.

But the methods presented so far extend to other situations. The WWW paper referred to earlier gives a number of examples in the domain of document retrieval. It explains how to correct for judge error in the case when a judge gives documents a relevance rating rather than just a binary relevant or not. And it shows how to extend these ideas beyond the metric “fraction in each class” to the metric DCG, which is a common metric for document retrieval.


Analyses of the type presented here always have assumptions. The assumption of this work is that there is a task (for example, “is the algorithm correct on this item?”) with a correct answer and that there are gold judges who are expert enough and have enough time to reliably obtain that correct answer. Industrial uses of human judgment often use detailed guidelines explaining how the judgment should be done, in which case these assumptions are reasonable.

But what about the general case? What if different audiences have different ideas about the task? For example, there may differences from country to country about what constitutes a pet. The analysis presented here is still relevant. You only need to compute the error rate of the judges relative to the appropriate audience.

I treat the judgment process as a black box, where only the overall accuracy of the judgments is known. This is sometimes very realistic, for example, when the judgment process is outsourced. But sometimes details are known about the judgment process would lead you to conclude that the judgments of some items are more reliable than others. For example, you might know which judges judged which items, and so have reliable information on the relative accuracy of the judges. The approach I use here can be extended to apply in these cases.

The simulation

I close this post with the R code that was used to generate the simulation given earlier.

# sim() returns a (n.iter x 4) array, where each row contains:
#   the naive estimate for p,
#   the corrected estimate,
#   whether the true value of p is in a 95% confidence interval (0/1)
#        for the naive standard error
#   the same value for the corrected standard error

sim = function(
  q.pos,  # probability a judge is correct on positive items
  q.neg,  # probability a judge is correct on negative items
  p,      # fraction of positive items
  n,      # number of items judged
  n.sim,  # number of simulations
  n.gold.neg = 200, # number of items the gold judge deems negative
  n.gold.pos = 200  # number of items the gold judge deems positive
   # number of positive items in sample
   n.pos = rbinom(n.sim, n, p)

   # estimate of judge accuracy
   q.pos.hat = rbinom(n.sim, n.gold.pos, q.pos)/n.gold.pos
   q.neg.hat = rbinom(n.sim, n.gold.neg, q.neg)/n.gold.neg

   # what the judges say (.jdg)
   n.pos.jdg = rbinom(n.sim, n.pos, q.pos) +
               rbinom(1, n - n.pos, 1 - q.neg)

   # estimate of fraction of positive items
   p.jdg.hat = n.pos.jdg/n    # naive formula
   # corrected formula
   p.hat = (p.jdg.hat - 1 + q.neg.hat)/
                      (q.neg.hat + q.pos.hat - 1)

   # is p.jdg.hat within 1.96*sigma of p?
   s.hat =  sqrt(p.jdg.hat*(1 - p.jdg.hat)/n)
   b.jdg = abs(p.jdg.hat - p) < 1.96*s.hat

   # is p.hat within 1.96*sigma.hat of p ?
   v.b = q.neg.hat*(1 - q.neg.hat)/n.gold.neg   # variance q.neg
   v.g = q.pos.hat*(1 - q.pos.hat)/n.gold.pos
   denom = (q.neg.hat + q.pos.hat - 1)^2
   v.cor = s.hat^2/denom +
           v.g*(p.jdg.hat - 1 + q.neg.hat)^2/denom^2 +
           v.b*(p.jdg.hat - q.pos)^2/denom^2
   s.cor.hat = sqrt(v.cor)
   # is negative.jdg.rate.cor within 1.96*sigma of p?
   b.cor = abs(p.hat - p) < 1.96*s.cor.hat

   rbind(p.jdg.hat, p.hat, b.jdg, b.cor)

r = sim(n.sim = 100000, q.pos = 0.90,  q.neg = 0.95, p = 0.7, n=1000)

# plot v1 and v2 as two separate histograms, but on the same ggplot
hist2 = function(v1, v2, name1, name2, labelsz=12)
  df1 = data.frame(x = v1, y = rep(name1, length(v1)))
  df2 = data.frame(x = v2, y = rep(name2, length(v2)))
  df = rbind(df1, df2)

  freq = aes(y = ..count..)
  print(ggplot(df, aes(x=x, fill=y)) + geom_histogram(freq, position='dodge') +
      theme(axis.text.x = element_text(size=labelsz),
      axis.text.y = element_text(size=labelsz),
      axis.title.x = element_text(size=labelsz),
      axis.title.y = element_text(size=labelsz),
      legend.text = element_text(size=labelsz))

hist2(r[1,], r[2,], "naive", "corrected", labelsz=16)

Powered by QuickLaTeX.

http://www.ebaytechblog.com/2017/05/04/a-surprising-pitfall-of-human-judgement-and-how-to-correct-it/feed/ 1
Building a UI Component in 2017 and Beyond http://www.ebaytechblog.com/2017/05/03/building-a-ui-component-in-2017-and-beyond/ http://www.ebaytechblog.com/2017/05/03/building-a-ui-component-in-2017-and-beyond/#comments Wed, 03 May 2017 14:00:56 +0000 http://www.ebaytechblog.com/?p=7230
Continue Reading »
As web developers, we have seen the guidelines for building UI components evolve over the years. Starting from jQuery UI to the current Custom Elements, various patterns have emerged. To top it off, there are numerous libraries and frameworks, each advocating their own style on how a component should be built. So in today’s world, what would be the best approach in terms of thinking about a UI component interface? That is the essence of this blog. Huge thanks to folks mentioned in the bottom Acknowledgments section. Also, this post leverages a lot of learnings from the articles and projects listed in the Reference section at the end.

Setting the context

Before we get started, let’s set the context on what this post covers.

  • By UI components, we mean the core, standalone UI patterns that apply to any web page in general. Examples would be Button, Dropdown Menu, Carousel, Dialog, Tab, etc. Organizations in general maintain a pattern library for these core components. We are NOT talking about application-specific components here, like a Photo Viewer used in social apps or an Item Card used in eCommerce apps. They are designed to solve application-specific (social, eCommerce, etc.) use cases. Application components are usually built using core UI components and are tied with a JavaScript framework (Vue.js, Marko, React, etc.) to create a fully featured web page.

  • We will only be talking about the interface (that is, API) of the component and how to communicate. We will not go over the implementation details in depth, just a quick overview.

Declarative HTML-based interface

Our fundamental principle behind building a UI component API is to make it agnostic in regard to any library or framework. This means the API should be able to work in any context and be framework-interoperable. Any popular framework can just leverage the interface out of the box and augment it with their own implementation. With this principle in mind, what would be the best way to get started? The answer is pretty simple — make the component API look like how any other HTML element would look like. Just like how a <div>, <img>, or a <video> tag would work. This is the idea behind having a declarative HTML-based interface/API.

So what does a component look like? Let’s take carousel as an example. This is what the component API will look like.

<carousel index="2" controls aria-label-next="next" aria-label-previous="previous" autoplay>
    <div>Markup for Item 1</div>
    <div>Markup for Item 2</div>

What does this mean? We are declaratively telling the component that the rendered markup should do the following.

  • Start at the second item in the carousel.
  • Display the left and right arrow key controls.
  • Use "previous" and "next" as the aria-label attribute values for the left and right arrow keys.
  • Also, autoplay the carousel after it is loaded.

For a consumer, this is the only information they need to know to include this component. It is exactly like how you include a <button> or <canvas> HTML element. Component names are always lowercase. They should also be hyphenated to distinguish them from native HTML elements. A good suggestion would be to prefix them with a namespace, usually your organization or project name, for example ebay-, core-, git-, etc.

Attributes form the base on how you will pass the initial state (data and configuration) to a component. Let’s talk about them.


  • An attribute is a name-value pair where the value is always a string. Now the question may arise that anything can be serialized as a string and the component can de-serialize the string to the associated data type (JSON for example). While that is true, the guideline is to NOT do that. A component can only interpret the value as a String (which is default) or a Number (similar to tabindex) or a JavaScript event handler (similar to DOM on-event handlers). Again, at the end of the day, this is exactly how an HTML element works.

  • Attributes can also be boolean. As per the HTML5 spec, “The presence of a boolean attribute on an element represents the true value, and the absence of the attribute represents the false value.” This means that as a component developer, when you need a boolean attribute, just check for the presence of it on the element and ignore the value. Having a value for it has no significance; both creator and consumer of a component should follow the same. For example, <button disabled="false"> will still disable the button, even if it is set to false, just because the boolean attribute disabled is present.

  • All attribute names should be lowercase. Camel case or Pascal case is NOT allowed. For certain multiword attributes, hyphenated names like accept-charset, data-*, etc. can be used, but that should be a rare occurrence. Even for multiwords, try your best to keep them as one lowercase name, for example, crossorigin, contenteditable, etc. Check out the HTML attribute reference for tips on how the native elements are doing it.

We can correlate the above attribute rules with our <carousel> example.

  • aria-label-next and aria-label-previous as string attributes. We hyphenate them as they are multiwords, very similar to the HTML aria-label attribute.
  • index attribute will be deserialized as a number, to indicate the position of the item to be displayed.
  • controls and autoplay will be considered as boolean attributes.

A common pattern that used to exist (or still exists) is to pass configuration and data as JSON strings. For our carousel, it would be something like the following example.

<!-- This is not recommended -->
    data-config='{"controls": true, "index": 2, "autoplay": true, "ariaLabelNext": "next", "ariaLabelPrevious": "previous"}' 
    data-items='[{"title":"Item 1", ..}, {"title": "Item 2", ...}]'>

This is not recommended.

Here the component developer reads the data attribute data-config, does a JSON parse, and then initializes the component with the provided configuration. They also build the items of the carousel using data-items. This may not be intuitive, and it works against a natural HTML-based approach. Instead consider a declarative API as proposed above, which is easy to understand and aligns with the HTML spec. Finally, in the case of a carousel, give the component consumers the flexibility to build the carousel items however they want. This decouples a core component from the context in which it is going to be used, which is usually application-specific.


There will be scenarios where you really need to pass an array of items to a core component, for example, a dropdown menu. How to do this declaratively? Let’s see how HTML does it. Whenever any input is a list, HTML uses the <option> element to represent an item in that list. As a reference, check out how the <select> and <datalist> elements leverage the <option> element to list out an array of items. Our component API can use the same technique. So in the case of a dropdown menu, the component API would look like the following.

<dropdown-menu list="options" index="0">
    <option value="0" selected>--Select--</option>
    <option value="1">Option 1</option>
    <option value="2">Option 2</option>
    <option value="3">Option 3</option>
    <option value="4">Option 4</option>
    <option value="5">Option 5</option>

It is not necessary that we should always use the <option> element here. We could create our own element, something like <dropdown-option>, which is a child of the <dropdown-menu> component, and customize it however we want. For example, if you have an array of objects, you can represent each object ({"userName": "jdoe", "score": 99, "displayName": "John Doe"}) declaratively in the markup as <dropdown-option value="jdoe" score="99">John Doe</dropdown-option>. Hopefully you do not need a complex object for a core component.


You may also argue that there is a scenario where I need to pass a JSON config for it to work or else usability becomes painful. Although this is a rare scenario for core components, a use case I can think about will be a core analytics component. This component may not have a UI, but it does all tracking related stuff, where you need to pass in a complex JSON object. What do we do? The AMP Project has a good solution for this. The component would look like the following.

    <script type="application/json">
      "requests": {
        "pageview": "https://example.com/analytics?pid=",
        "event": "https://example.com/analytics?eid="
      "vars": {
        "account": "ABC123"
      "triggers": {
        "trackPageview": {
          "on": "visible",
          "request": "pageview"
        "trackAnchorClicks": {
          "on": "click",
          "selector": "a",
          "request": "event",
          "vars": {
            "eventId": "42",
            "eventLabel": "clicked on a link"

Here again we piggyback the interface based on how we would do it in simple HTML. We use a <script> element inside the component and set the type to application/json, which is exactly what we want. This brings back the declarative approach and makes it look natural.


Till now we talked only about the initial component API. This enables consumers to include a component in a page and set the initial state. Once the component is rendered in the browser, how do you interact with it? This is where the communication mechanism comes into play. And for this, the golden rule comes from the reactive principles of

Data in via attributes and properties, data out via events

This means that attributes and properties can be used to send data to a component and events send the data out. If you take a closer look, this is exactly how any normal HTML element (input, button, etc.) behaves. We already discussed attributes in detail. To summarize, attributes set the initial state of a component, whereas properties update or reflect the state of a component. Let’s dive into properties a bit more.


At any point in time, properties are your source of truth. After setting the initial state, some attributes do not get updated as the component changes over time. For example, typing in a new phrase in an input text box and then calling element.getAttribute('value') will produce the previous (stale) value. But doing element.value will always produce the current typed-in phrase. Certain attributes, like disabled, do get reflected when the corresponding property is changed. There has always been some confusion around this topic, partly due to legacy reasons. It would be ideal for attributes and properties to be in sync, as the usability benefits are undeniable.

If you are using Custom Elements, implementing properties is quite straightforward. For a carousel, we could do this.

class Carousel extends HTMLElement {  
    static get observedAttributes() {
        return ['index'];
    // Called anytime the 'index' attribute is changed
    attributeChangedCallback(attrName, oldVal, newVal) {
        this[attrName] = newVal;
    // Takes an index value
    set index(idx) {
        // First check if it is numeric
        const numericIndex = parseInt(idx, 10);
        if (isNaN(numericIndex)) {
        // Update the internal state
        this._index = numericIndex;
        /* Perform the associated DOM operations */
    get index() {
        return this._index;

Here the index property gets all its associated characteristics. If you are doing carouselElement.index=4, it will update the internal state and then perform the corresponding DOM operations to move the carousel to the fourth item. Additionally, even if you directly update the attribute carouselElement.setAttribute('index', 4), the component will still update the index property, the internal state and perform the exact DOM operations to move the carousel to the fourth item.

However, until Custom Elements gain massive browser adoption and have a good server-side rendering story, we need to come up with other mechanisms to implement properties. And one way would be to use the Object.defineProperty() API.

const carouselElement = document.querySelector('#carousel1');
Object.defineProperty(carouselElement, 'index', {
    set(idx) {
        // First check if it is numeric
        const numericIndex = parseInt(idx, 10);
        if (isNaN(numericIndex)) {
        // Update the internal state
        this._index = numericIndex;
        /* Perform the associated DOM operations */
    get() {
        return this._index;

Here we are augmenting the carousel element DOM node with the index property. When you do carouselElement.index=4, it gives us the same functionality as the Custom Element implementation. But directly updating an attribute with carouselElement.setAttribute('index', 4) will do nothing. This is the tradeoff in this approach. (Technically we could still use a MutationObserver to achieve the missing functionality, but that would be an overkill.) Hopefully as a team, if you can standardize that state updates should only happen through properties, then it should be less of a concern.

With respect to naming conventions, since properties are accessed programmatically, they should always be camel-cased. All exposed attributes (an exception would be ARIA attributes) should have a corresponding camel-cased property, very similar to native DOM elements.


When the state of a component has changed, either programmatically or due to user interaction, it has to communicate the change to the outside world. And the best way to do it is by dispatching events, very similar to click or touchstart events dispatched by a native HTML element. The good news is that the DOM comes with a built-in custom eventing mechanism through the CustomEvent constructor. So in the case of a carousel, we can tell the outside world that the carousel transition has been completed by dispatching a transitionend event as shown below.

const carouselElement = document.querySelector('#carousel1');

// Dispatching 'transitionend' event
carouselElement.dispatchEvent(new CustomEvent('transitionend', {
    detail: {index: this._index}

// Listening to 'transitionend' event
carouselElement.addEventListener('transitionend', event => {
    alert(`User has moved to item number ${event.detail.index}`);

By doing this, we get all the benefits of DOM events like bubbling, capture etc. and also the event APIs like event.stopPropagation(), event.preventDefault(), etc. Another added advantage is that it makes the component framework-agnostic, as most frameworks already have built-in mechanisms for listening to DOM events. Check out Rob Dodson’s post on how this works with major frameworks.

Regarding a naming convention for events, I would go with the same guidelines that we listed above for attribute names. Again, when in doubt, look at how the native DOM does it.


Let me briefly touch upon the implementation details, as they give the full picture. We have been only talking about the component API and communication patterns till now. But the critical missing part is that we still need JavaScript to provide the desired functionality and encapsulation. Some components can be purely markup- and CSS-based, but in reality, most of them will require some amount of JavaScript. How do we implement this JavaScript? Well, there a couple of ways.

  • Use vanilla JavaScript. Here the developer builds their own JavaScript logic for each component. But you will soon see a common pattern across components, and the need for abstraction arises. This abstraction library will pretty much be similar to those numerous frameworks out in the wild. So why reinvent the wheel? We can just choose one of them.

  • Usually in organizations, web pages are built with a particular library or framework (Angular, Ember, Preact, etc.). You can piggyback on that library to implement the functionality and provide encapsulation. The drawback here is that your core components are also tied with the page framework. So in case you decide to move to a different framework or do a major version upgrade, the core components should also change with it. This can cause a lot of inconvenience.

  • You can use Custom Elements. That would be ideal, as it comes default in the browsers, and the browser makers recommend it. But you need a polyfill to make it work across all of them. You can try a Progressive Enhancement technique as described here, but you would lose functionality in non-supportive browsers. Moreover, until we have a solid and performant server-side rendering mechanism, Custom Elements would lack mass adoption.

And yes, all options are open-ended. It all boils down to choices, and software engineering is all about the right tradeoffs. My recommendation would be to go with either Option 2 or 3, based on your use cases.


Though the title mentions the year “2017”, this is more about building an interface that works not only today but also in the future. We are making the component API-agnostic of the underlying implementation. This enables developers to use a library or framework of their choice, and it gives them the flexibility to switch in the future (based on what is popular at that point in time). The key takeaway is that the component API and the principles behind it always stay the same. I believe Custom Elements will become the default implementation mechanism for core UI components as soon as they gain mainstream browser adoption.

The ideal state is when a UI component can be used in any page, without the need of a library or polyfill and it can work with the page owner’s framework of choice. We need to design our component APIs with that ideal state in mind and this is a step towards it. Finally, worth repeating, when in doubt, check how HTML does it, and you will probably have an answer.


Many thanks to Rob Dodson and Lea Verou for their technical reviews and valuable suggestions. Also huge thanks to my colleagues Ian McBurnie, Arun Selvaraj, Tony Topper, and Andrew Wooldridge for their valuable feedback.


http://www.ebaytechblog.com/2017/05/03/building-a-ui-component-in-2017-and-beyond/feed/ 2
Elasticsearch Cluster Lifecycle at eBay http://www.ebaytechblog.com/2017/04/12/elasticsearch-cluster-lifecycle-at-ebay/ http://www.ebaytechblog.com/2017/04/12/elasticsearch-cluster-lifecycle-at-ebay/#respond Wed, 12 Apr 2017 15:20:37 +0000 http://www.ebaytechblog.com/?p=6821
Continue Reading »
Defining an Elasticsearch cluster lifecycle

eBay’s Pronto, our implementation of the “Elasticsearch as service” (ES-AAS) platform, provides fully managed Elasticsearch clusters for various search use cases. Our ES-AAS platform is hosted in a private internal cloud environment based on OpenStack. The platform currently manages around 35+ clusters and supports multiple data center deployments. This blog provides guidelines on all the different pieces for creating a cluster lifecycle to allow streamlined management of Elasticsearch clusters. All Elasticsearch clusters deployed within the eBay infrastructure follow our defined Elasticsearch lifecycle depicted in the figure below.

Cluster preparation

This lifecycle stage begins when a new use case is being onboarded onto our ES-AAS platform.

On-boarding information

Customers’ requirements are captured onto an onboarding template that contains information such as document size, retention policy, and read/write throughput requirement. Based on the inputs provided by the customer, infrastructure sizing is performed. The sizing uses historic learnings from our benchmarking exercises. On-boarding information has helped us in cluster planning and defining SLA for customer commitments.

We collect the following information from customers before any use case is onboarded:

  • Use case details: Consists of queries relating to use case description and significance.
  • Sizing Information: Captures the number of documents, their average document size, and year-on-year growth estimation.
  • Data read/write information: Consists of expected indexing/search rate, mode of ingestion (batch mode or individual documents), data freshness, average number of users, and specific search queries containing any aggregation, pagination, or sorting operations.
  • Data source/retention: Original data source information (such as Oracle, MySQL, etc.) is captured on an onboarding template. If the indices are time-based, then an index purge strategy is logged. Typically, we do not use Elasticsearch as the source of data for critical applications.

Benchmarking strategy

Before undertaking any benchmarking exercise, it’s really important to understand the underlying infrastructure that hosts your VMs. This is especially true in a cloud-based environment where such information is usually abstracted from end users. Be aware of different potential noisy-neighbors issues, especially on a multi-tenant-based infrastructure.

Like most folks, we have also performed extensive benchmarking exercise on existing hardware infrastructure and image flavors. Data stored in Elasticsearch clusters are specific to customer use cases. It is near to impossible to perform benchmarking runs on all data schemas used by different customers. Therefore, we made assumptions before embarking on any benchmarking exercise, and the following assumptions were key.

  • Clients will use a REST path for any data access on our provisioned Elasticsearch clusters. (No transport client)
  • To start with, we kept a mapping of 1GB RAM to 32GB disk space ratio. (This was later refined as we learnt from benchmarking)
  • Indexing numbers were carefully profiled for different numbers of replicas (1, 2, and 3 replicas).
  • Search benchmarking was done always on GetById queries (as search queries are custom and profiling different custom search queries was not viable).
  • We used fixed-size 1KB, 2KB, 5KB, and 10 KB documents

Working from these assumptions, we derived at a maximum shard size for performance (around 22GB), right payload size for _bulk requests (~5MB), etc. We used our own custom JMeter scripts to perform benchmarking. Recently Elasticsearch has developed and open-sourced the Rally benchmarking tool, which can be used as well. Additionally, based on our benchmarking learnings, we created a capacity-estimation calculator tool that can take in customer requirement inputs and calculate the infrastructure requirement for a use case. We avoided a lot of conversation with our customers on infrastructure cost by sharing this tool directly with end users.

VM cache pool

Our ES clusters are deployed by leveraging an intelligent warm-cache layer. The warm-cache layer consists of ready-to-use VM nodes that are prepared over a period of time based on some predefined rules. This ensures that VMs are distributed across different underlying hardware uniformly. This layer has allowed us to quickly spawn large clusters within seconds. Additionally, our remediation engine leverages this layer to flex up nodes on existing clusters without errors or any manual intervention. More details on our cache pool are available in another eBay tech blog at Ready-to-use Virtual-machine Pool Store via warm-cache

Cluster deployment

Cluster deployment is fully automated via a Puppet/Foreman infrastructure. We will not talk in detail about how Elasticsearch Puppet module was leveraged for provisioning Elasticsearch clusters. This is well documented at Elasticsearch puppet module. Along with every release of Elasticsearch, a corresponding version of the Puppet module is generally made publically available. We have made minor modifications to these Puppet scripts to suit eBay-specific needs. Different configuration settings for Elasticsearch are customized based on our benchmarking learnings. As a general guideline, we do not set the JVM heap memory size to more than 28 GB (because doing so leads to long garbage collection cycles), and we always disable in-memory swapping for the Elasticsearch JVM process. Independent clusters are deployed across data centers, and load balancing VIPs (Virtual IP addresses) are provisioned for data access.

Typically, with each cluster provisioned we give out two VIPs, one for data read operations and another one for write operations. Read VIPs are always created over client nodes (or coordinating nodes), while write VIPs are configured over data nodes. We have observed improved throughput from our clusters with such a configuration.

Deployment diagram


We use a lot of open source on our platform such as OpenStack, MongoDB, Airflow, Grafana, InfluxDB (open version), openTSDB, etc. Our internal services, such as cluster provisioning, cluster management, and customer management services, allow REST API-driven management for deployment and configuration. They also help in tracking clusters as assets against different customers. Our cluster provisioning service relies heavily on OpenStack. For example, we use NOVA for managing compute resources (nodes), Neutron APIs for load balancer provisioning, and Keystone for internal authentication and authorization of our APIs.

We do not use federated or cross-region deployments for an Elasticsearch cluster. Network latency limits us from having such a deployment strategy. Instead, we host independent clusters for use cases across multiple regions. Clients will have to perform dual writes when clusters are deployed in multiple regions. We also do not use Tribe nodes.

Cluster onboarding

We create cluster topology during customer onboarding. This helps to track resources and cost associated with cluster infrastructure. The metadata stored as part of a cluster topology maintains region deployment information, SLA agreements, cluster owner information, etc. We use eBay’s internal configuration management system (CMS) to track cluster information in form of a directed graph. There are external tools that hook onto this topology. Such external integrations allow easy monitoring of our clusters from centralized eBay-specific systems.

Cluster topology example

Cluster management

Cluster security

Security is provided on our clusters via a custom security plug-in that provides a mechanism to both authenticate and authorize the use of Elasticsearch clusters. Our security plug-in intercepts messages and then performs context-based authorization and authentication using an internal authentication framework. Explicit whitelisting based on client IP is supported. This is useful for configuring Kibana or other external UI dashboards. Admin (Dev-ops) are configured to have complete access to Elasticsearch cluster. We encourage using HTTPS (based on TLS 1.2) for securing communication between client and Elasticsearch clusters.

The following is a sample simple security rule that can configure be configured on our platform of provisioned clusters.

sample json code implementing a security rule

In the above sample rule, the enabled field controls if the security feature is enabled or not. whitelisted_ip_list is an array attribute for providing all whitelisted Client IPs. Any Open/Close index operations or delete index operations can be performed only by admin users.

Cluster monitoring

Cluster monitoring is done by custom monitoring plug-in that pushes 70+ metrics from each Elasticsearch node to a back-end TSDB-based data store. The plug-in works on a push-based design. External dashboards using Grafana consume the data on TSDB store. Custom templates are created on a Grafana dashboard, which allows easy centralized monitoring of our own clusters.



We leverage an internal alert system that can be used to configure threshold-based alerts on data stored on OpenTSDB. Currently, we have 500+ active alerts configured on our clusters with varying severity. Alerts are classified as ‘Errors’ or ‘Warnings’. Error alerts, when raised, are immediately attended to either by DevOps or by our internal auto-remediation engine, based on the alert rule configured.

Alerts are created during cluster provisioning based on various thresholds. For Example, if a cluster status turns RED, an ‘Error’ alert is raised or if CPU utilization of node exceeds 80% a ‘Warning’ alert is raised.

Cluster remediation

Our ES-AAS platform can perform an auto-remediation action on receiving any cluster anomaly event. Such actions are enabled via our custom Lights-Out-Management (LOM) module. Any auto-remediation module can significantly reduce manual intervention for DevOps. Our LOM module uses a rule-based engine which listens to all alerts raised on our cluster. The reactor instance maintains a context of the alerts raised and, based on cluster topology state (AUTO ON/OFF), takes remediation actions. For example, if a cluster loses a node and if this node does not return to its cluster within the next 15 minutes, the remediation engine replaces that node via our internal cluster management services. Optionally, alerts can be sent to the team instead of taking a cluster remediation action. The actions of the LOM module are tracked as stateful jobs that are persisted on a back-end MongoDB store. Due to the stateful nature of these jobs, they can be retried or rolled back as required. Audit logs are also maintained to capture the history or timeline of all remediation actions that were initiated by our custom LOM module.

Cluster logging

Along with the standard Elasticsearch distribution, we also ship our custom logging library. This library pushes all Elasticsearch application logs onto a back-end Hadoop store via an internal system called Sherlock. All centralized application logs can be viewed at both cluster and node levels. Once Elasticsearch log data is available on Hadoop, we run daily PIG jobs on our log store to generate reports for error log or slow log counts. We generally have our logging settings as INFO, and whenever we need to triage issues, we use transient a logging setting of DEBUG, which collects detailed logs onto our back-end Hadoop store.

Cluster decommissioning

We follow a cluster decommissioning process for major version upgrades of Elasticsearch. For major upgrades for Elasticsearch clusters, we spawn a new cluster with our latest offering of the Elasticsearch version. We replay all documents from old or existing version of Elasticsearch clusters to the newly created cluster. Client (user applications) starts using both cluster endpoints for all future ingestion until data catches up on the new cluster. Once data parity is achieved, we decommission the old cluster. In addition to freeing up infrastructure resources, we also clean up the associated cluster topology. Elasticsearch also provides a migration plug-in that can be used to check if direct, in-place upgrades can be done on major Elasticsearch versions. Minor Elasticsearch upgrades are done on an as-needed basis and are usually done in-place.

http://www.ebaytechblog.com/2017/04/12/elasticsearch-cluster-lifecycle-at-ebay/feed/ 0
Healthy Team Backlogs http://www.ebaytechblog.com/2017/03/30/healthy-team-backlogs/ http://www.ebaytechblog.com/2017/03/30/healthy-team-backlogs/#comments Thu, 30 Mar 2017 13:50:32 +0000 http://www.ebaytechblog.com/?p=6731
Continue Reading »

What is a backlog?

Agile product owners use a backlog to organize and communicate the requirements for a team’s work. Product backlogs are deceptively simple, which can sometimes make them challenging to adopt for product owners who may be used to working with lengthy PRDs (“project requirement documents” or similar).

Scrum most commonly uses the term product backlog. However, many product owners who are new to Scrum are confused by this term. Reasonable questions arise: Does this suggest that a team working on multiple products would have multiple backlogs? If so, how do we prioritize between them? Where do bugs get recorded? What happens if work needs to be done, but it isn’t associated with a product; do we create a placeholder?

Therefore, we prefer the term team backlog. Our working definition of team backlog is “the maintained, ordered list of work that the team plans to do now or in the future.” This is a dense description, so let’s unpack it a little.

“The” and “Team”

  • We say the and team because each team needs a single source of truth to track their current and future work.
  • If a team is working on multiple projects or products, all of the work for those stories should appear on a single, unified, team backlog.
  • Teams do not generally share backlogs.


  • Work includes almost everything that the development team needs to do.
  • Features, bugs, technical debt, research, improvements, and even user experience work all appear on the same backlog.
  • Generally speaking, recurring team meetings and similar events do not appear on the backlog.


  • We say maintained because the backlog is a “living” artifact.
  • The product owner and team must continually update and refine their backlog. Otherwise, the team will waste time doing useless work and chasing requirements.
  • This requires several hours per week for the product owner and 1–2 hours per week for the team. It involves adding, ordering, discussing, describing, justifying, deleting, and splitting work.


  • We say ordered list rather than prioritized list because the backlog is ordered, not just prioritized.
  • If the backlog is only prioritized, there can be multiple items that are all “very high priority.”
  • If the backlog is ordered, we communicate exactly in what order those “very high priority” tasks should be worked on.

“Plans to Do”

  • We say plans to do because we regularly delete everything from the backlog that we no longer plan to work on.
  • Deleting unnecessary work is essential. Unnecessary work clutters up our backlog and distracts from the actual work.

What makes a backlog healthy?

Now that we know what a backlog is, what makes a backlog healthy or not? While what makes for a good backlog is somewhat subjective — in the same way that what makes a good PRD could be subjective — there are 10 characteristics that we’ve found to be particularly important.

Would you like to know if your backlog is healthy? Download this handy PDF checklist, print it out, then open up your backlog and follow along. For each criterion, take note of whether your backlog currently does, doesn’t, or only somewhat meets the criterion. In exchange for less than half an hour of your time, you’ll have good sense as to the health of your backlog and a few ideas for improvement.

  1. Focused, ordered by priority, and the team follows the order diligently

    • At all times, anyone can look at the backlog and know what needs to be worked on next without ambiguity.
    • Even if you have several “P1” issues, the team needs to know which P1 issue needs to be addressed next. Simply saying “they’re all important” will paralyze the team.
    • Although the PO is responsible for the product backlog order and makes the final call, the PO should be willing to negotiate the order with their team. The team often has good insights that can mitigate dependencies or help the PO deliver more value.
    • Stay focused on one thing at a time when possible to deliver value earlier and reduce context switching waste.

  3. Higher-value items towards the top, lower-value items towards the bottom

    • In general, do high-value, low-cost work first (“lowest hanging fruit”).
    • Next, do high-value, high-cost work because it is usually more strategic.
    • Then, do low-value, low-cost work.
    • Finally, eliminate low-value, high-cost work. You will almost always find something better to do with your time and resources, so don’t waste your time tracking it. It will be obvious if and when that work becomes valuable.
    • Hint: You can use Weighted Shortest Job First or a similar technique if you’re having difficulty prioritizing.

  5. Granular, ready-to-work items towards the top, loosely-defined epics towards the bottom

    • Items that are at the top of the backlog will be worked on next, so we want to ensure that they are the right size to work on.
    • The typical team’s Definition of Ready recommends that items take ≤ ½ of a sprint to complete.
    • Delay decision-making and commitments — manifested as small, detailed, team-ready items — until the last responsible moment.
    • There is little value specifying work in detail if you will not work on it soon. Due to learning and changing customer/company/competitive conditions, your requirements may change or you may cancel the work altogether.

    What is an Epic?

    • An “epic” is simply a user story that is too large to complete in one sprint. It gets prioritized in the backlog like every other item.
    • JIRA Tip: “Epics” in JIRA do not appear in the backlog for Scrum boards. As a result, they behave more like organizing themes than epics. Therefore, we suggest using JIRA’s epic functionality to indicate themes and user stories with the prefix “Epic: ”  to indicate actual epics.

  7. Solutions towards the top, statements of need towards the bottom

    • Teams can decide to start working on an item as soon as they know what customer needs they hope to solve. However, collaborating between product, design, development, and stakeholders to translate customer needs into solutions takes time.
    • As with other commitments, defer solutioning decisions until the last responsible moment:
      • Your ideal solution may change through learning or changing conditions such as customer, competitors, company, or even technology options.
      • You may decide not to work on the problem after all.

  9. 1½ to 2 sprints worth of work that’s obviously ready to work on at the top

    • Teams sometimes surprise the product owner by having more capacity by expected.
    • Having enough ready stories ensures that the team is:
      • Unlikely to run out of work to pull into their sprint backlog during sprint planning.
      • Able to pull in additional work during the sprint if they complete the rest of the work on their sprint backlog.
    • It should be obvious what work is and isn’t ready to work on so that the team doesn’t have to waste time figuring it out each time they look at the backlog.
      • Some teams prefix a story title with a “* ” to indicate a ready story (or a story that isn’t ready).

  11. The value of each piece of work is clearly articulated

    • Your team should be able to understand why the work is important to work on.
    • There are three primary sources of value (and you can define your own):
      • User/Business Value: Increase revenue, reduce costs, make users happy
      • Time Criticality: Must it happen soon due to competition, risk, etc.?
      • Opportunity Enablement/Risk Reduction/Learning: Is it strategic? Is it necessary to enable another valuable item (for example, a dependency)?
    • You won’t usually need a complex financial projection, just a reasonable justification as to why the item should be worked on next relative to all other known possibilities. Time previously spent with complex projections can instead be used to talk to customers and identify other opportunities.

  13. The customer persona for the work is clearly articulated

    • The “As a” part of the “As a ____, I can ___, so that ____” user story isn’t a mere formality; it’s an essential part of user-centered product development.
    • Who is the customer? Who are you completing this work for? Even if you’re on a “back-end” team, keep the end-user in mind.
    • Partner with your designer to identify your personas and reference them whenever possible. Is this feature for “Serious Seller Sally?” Can you imagine her personality and needs just as well as any of your friends?
      • Example: “As Serious Seller Sally, I can list items using a ‘advanced’ flow so that I can get the options I need without the guidance for casual-sellers that only slows me down.”
    • Tool Tip: Most teams and POs find it best to put just the “I can” part the user story (for example, “List items using a ‘advanced’ flow”) in the planning tool’s title field. Otherwise it can be harder to read the backlog. Put the entire user story at the top of your tool’s description field.

  15. ≤ 100 items (a rule of thumb), and contains no work that — realistically — will never be done

    • This is a general rule. If your team works on many very small items or has considerable work that you must track, your backlog could be longer.
    • Assuming that each backlog item takes a minute to read and understand, 100 items alone would take over an hour and a half to process. Keeping our backlog limited like this makes it easier and faster to fully understand.
    • A longer backlog is more likely to contain features that will never be built or bugs that will never be fixed. Keeping a short backlog helps us ensure that we triage effectively and delete items that we are unlikely to work on.

  17. The team backlog is not a commitment

    • A Scrum team cannot make a realistic, firm commitment on an entire team backlog because:
      • It has not been through high-level design (for example, tasking at end of Sprint planning).
      • The risk of missed dependencies and unexpected requests/impediments is too great.
      • “Locking in” a plan that far into the future considerably restricts flexibility
    • A Scrum team can make a valid commitment on a sprint backlog if there are no mid-sprint scope changes and few unexpected requests and impediments.

  19. Backlog reflects the release plan if available

    • If the team has conducted release planning, create pro forma sprints with items in your planning tool to reflect the release plan.
    • If there are production release, moratorium, or similar dates, communicate those too.
    • Update the release plan at end of each sprint as you learn.

What does a healthy team backlog look like in JIRA?

Glad you asked. Here are four sample “sprints” that take good advantage of JIRA’s built-in functionality.

Sprint 1 (active sprint)

Sprint 2 (next sprint)

Sprint 3 (future sprint)

Sprint 4 (future sprint)


Now you know what a healthy team backlog looks like. If you’ve filled out our printable checklist, mark off up to three items that you’ll work to improve over the next week or two with your teams. We hope this is of use to you!

http://www.ebaytechblog.com/2017/03/30/healthy-team-backlogs/feed/ 3
Email Tech Is Now Ad Tech http://www.ebaytechblog.com/2017/03/28/email-tech-is-now-ad-tech/ http://www.ebaytechblog.com/2017/03/28/email-tech-is-now-ad-tech/#comments Tue, 28 Mar 2017 17:39:26 +0000 http://www.ebaytechblog.com/?p=6831
Continue Reading »


eBay has come a long way in our CRM and email marketing in the past two years. Personalization is a relatively easy task when you’re dealing with just one region and one vertical and a hundred thousand customers. With 167M active buyers across the globe, eBay’s journey to help each of our buyers find their version of perfect was quite complex.

Like many in our industry, we’ve had to deal with legacy systems, scalability, and engineering resource constraints. And yet, we’ve made email marketing a point of pride — instead of the “check mark” that we started from. Here’s our story.

Our starting point was a batch-and-blast approach. Our outbound communications very much reflected our organizational structure: as a customer, I’d get a fashion email on Monday, a tech email on Tuesday, and a motors email on Wednesday. This of course wasn’t the kind of an experience we wanted to create.

Additionally, for each of our marketing campaigns, we hand-authored targeting criteria — just as many of our industry colleagues do today in tools like Marketo and ExactTarget. This approach worked OK, but the resulting segment size was too large — in hundreds of thousands. This meant that we were missing out on the opportunity to treat customers individually. It also didn’t scale well — as our business grew internationally, we needed to add more and more business analysts; and the complexity of our contact strategy was becoming unmanageable.

We wanted to create a structurally better experience for our customers — and our big bet was to go after 1:1 personalization using real-time data. We wanted to use machine learning to do the targeting, with real-time feedback loops powering our models.

Since email is such a powerful driver for eCommerce, we committed to a differentiated experience in this channel. After evaluating multiple off-the-shelf solutions, we settled on building an in-house targeting and personalization system — as the size of the eBay marketplace is astounding, and many opportunities and issues are quite unique. We set a high bar: every time we show an offer to a customer, it has to be driven by our up-to-the minute understanding of what the customer did and how other customers are responding to this offer.

Here are some examples of the scenarios we targeted:

  • eBay has many amazing deals, and our community is very active. Deals quickly run out of inventory. We can’t send an offer to a customer and direct them to an expired deal. Thus, our approach involved open-time rendering of offers in email.
  • Some of our retail events turn out to be much more popular than we anticipate. We want to respond to this real-time engagement feedback by adjusting our recommendations quickly. We thus built a feedback loop that shows an offer to a subset of customers; then, if an event is getting a much higher click-through rate than we expected, we show it to more customers. If it for some reason isn’t doing well — for example, if the creative is underperforming — the real-time “bandit” approach reduces its visibility.

Both of these scenarios required us to have real-time CRM and engagement streams. That is, we needed to know when a customer opens an email or clicks on it, and based on this knowledge, instantaneously adjust our recommendations for other customers. This of course is miles away from a typical multi-step batch system based on ETL pipelines that most retailers have today. We were no different — we had to reengineer our delivery and data collection pipes to be real-time. The payoff, however, is quite powerful — this real-time capability is foundational to our ability to personalize better: both today and in the years to come.

The resulting solution transformed email marketing at eBay: instead of hundreds of small, uncoordinated campaigns each month, we now have a small set of “flagship” campaigns, each of which is comprised of one or more offers. Each of the offers is selected at the open time of the email, and the selection is made based upon machine-learned model which uses real-time data. As a result, we saw significant growth in both engagement and sales driven by our emails.

You’ll notice that this component-level personalization approach is all about treating email content as an ad canvas. The problem is fundamentally similar: once you’ve captured the customers attention — be it via a winning bid on an ad auction, or by having that customer open your email — you need to find the most relevant offer to show. Each email slot can be thought of as a first-party ad slot. This realization allowed us to unify our approaches between display advertising and email: the same stack now powers both.

We extended this approach to scenarios like paid social campaigns, where Facebook would want to retrieve the offer from us a priori to manage their customer experience. We built a real-time push framework, where, whenever we find a deal that is better than what we previously apprised Facebook of, we immediately push that offer to Facebook.

This creates a powerful cross-channel multiplier: if we happen to see the customer on the display channel, the same ad-serving pipeline is engaged — and our flagship deal-finding campaign can be served to that customer, too. This means that evolving our flagship campaigns — adding more sophisticated machine learning, improving our creatives — contributes to all channels that are powered by this pipeline, not just email.

Orchestration across channels too becomes possible: we can choose to send an email to a customer with a relevant offer; if they don’t open it, we can then target them with a display ad, and an onsite banner; then, after showing the offer a set number of times across all channels, we can choose to stop the ad — implementing an effective cross-channel impression cap. And each condition for state transitions in this flow can itself be powered by a machine-learning model.

eBay’s scale creates an admirable engineering challenge for a true CRM. By putting our customers, and their behavioral signals, at the top of our priority list, we were able to create an asset in our CRM platform that positions us well towards In this journey towards 1:1 personalization. A single real-time, event-driven pipeline we’ve built allows for coordinated, up-to-the-minute offers to be served — wherever we happen to see the customer.

Alex Weinstein (@alexweinstein) is the Director of Marketing Technologies and CRM at eBay and the author of the Technology + Entrepreneurship blog, where he explores data-driven decision making in the face of uncertainty. Prior to eBay, Alex was the head of product development at Wetpaint, a personalization tech startup.

Graphic: Rahul Rodriguez

http://www.ebaytechblog.com/2017/03/28/email-tech-is-now-ad-tech/feed/ 1