eBay Tech Blog: Where e-commerce meets world-class technology

Enhancing the User Experience of the Hadoop Ecosystem (May 12, 2017)

At eBay, we have multiple large, multi-tenant clusters. Each of these clusters stores hundreds of petabytes of data and offers tens of thousands of cores to run computations on that data. We have thousands of internal users who use Hadoop in their roles, including data analysts, data scientists, engineers, and product managers. They use multiple technologies, such as MapReduce, Hive, and Spark, to process data. There are also thousands of applications that push and pull data from Hadoop and run computations.

Figure 1: Hadoop clusters, auxiliary components, users, applications, and services

Pains

The users normally interact with the cluster via the command line by SSHing to specialized gateway machines that reside in the same network zone as the cluster. To transfer job files and scripts, the users need to SCP over multiple hops.

Figure 2: Old way of accessing a Hadoop cluster

The need to traverse multiple hops as well as the command-line-only usage was a major hindrance to the productivity of our data users.

On the application side, our website applications and services need to access data and run computations. These applications and services reside in a different network zone and hence need network rules set up to access various services like HDFS, YARN, and Oozie. Since our clusters are secured with Kerberos, the applications also need to be able to use Kerberos to authenticate to the Hadoop services. This placed an extra burden on our application developers.

In this post, I will share our work in progress to make it easier for users and applications to access data and compute resources on our Hadoop clusters.

Requirements

We need better ways to achieve the following goals:

  • Our engineers and other users need to use multiple clusters and related components.
  • Data analysts and other users need to run interactive queries and create shareable reports.
  • Developers need to be able to develop applications and services without spending time on connectivity problems or Kerberos authentication.
  • We can afford no compromise on security.

Solutions

To improve user experience and productivity, we added three open-source components:

Hue — to perform operations on Hadoop and related components.

Apache Zeppelin — to develop interactive notebooks with queries, programs, and reports.

Apache Knox — to serve as a single point for applications to access HDFS, Oozie, and other Hadoop services.

Figure 3: Enhanced user experience with Hue, Zeppelin, and Knox

We will describe each product, the main use cases, a list of our customizations, and the architecture.

Hue

Hue is a user interface to the Hadoop ecosystem. It provides user interfaces to several components, including HDFS, Oozie, the Resource Manager, Hive, and HBase. It is a 100% open-source product, actively supported by Cloudera, and hosted on the Hue GitHub site.

Similar products

Apache Airflow allows users to specify workflows in Python. Since we did not want to impose a Python learning curve on our users, we chose Hue over Airflow. We may still find Airflow compelling enough to deploy in the future for people who prefer it.

Use cases of Hue

Hue allows a user to work with multiple components of the Hadoop ecosystem. A few common use cases are listed below:

  • To browse, manage, upload, and download HDFS files and directories
  • To specify workflows comprising MapReduce, Hive, Pig, Spark, Java, and shell actions
  • To schedule workflows and track SLAs
  • To manage Hive metadata, run Hive queries, and share the queries with other users
  • To manage HBase metadata and interact with HBase
  • To view YARN applications and terminate applications if needed

Enhancements

Two-factor authentication — To ensure that the same security level is maintained as with command-line access, we needed to integrate our custom SAML-based two-factor authentication into Hue. Hue supports plugging in new authentication mechanisms, which allowed us to plug in our two-factor authentication.

Ability to impersonate other users — At eBay, users sometimes operate on behalf of a team account. We added a capability in Hue so that users can impersonate another account as long as they are authorized to do so; the authorization is controlled by LDAP group memberships. Users can switch back and forth between multiple accounts.

Working with multiple clusters — Since we have multiple clusters, we wanted a single Hue instance to serve multiple Hadoop clusters and components. This enhancement required changes in the HDFS File Browser, Job Browser, Hive Metastore Managers, Hive query editors, and workflow submissions.

Architecture

Figure 4: Hue architecture at eBay

Zeppelin

A lot of our users, especially data scientists, want to run interactive queries on the data stored on Hadoop clusters. They run one query, check its results, and, based on the results, form the next query. Big data frameworks like Spark, Presto, Kylin, and to some extent, HiveServer2 provide this kind of interactive query support.

Apache Zeppelin (GitHub repo) is a user interface that integrates well with products such as Spark, Presto, and Kylin, among others. In addition, Zeppelin provides an interface where users can develop data notebooks. The notebooks can express data processing logic in SQL, Scala, Python, or R. Zeppelin also supports data visualization in notebooks in the form of tables and charts.

Zeppelin is an Apache project and is 100% open source.

Use cases

Zeppelin allows a user to develop visually appealing interactive notebooks using multiple components of the Hadoop ecosystem. A few common use cases are listed below:

  • Run a quick Select statement on a Hive table using Presto.
  • Develop a report based on a dataset by reading files from HDFS and persisting them in memory as Spark data frames.
  • Create an interactive dashboard that allows users to search through a specific set of log files with custom format and schema.
  • Inspect the schema of a Hive table.

Enhancements

Two-factor authentication — To maintain security parity with command-line access, we plugged our custom two-factor authentication mechanism into Zeppelin. Zeppelin uses Apache Shiro for security, and Shiro allows plugging in a custom authentication mechanism, though with some difficulty.

Support for multiple clusters — We have multiple clusters and multiple instances of components like Hive. To support multiple instances in one Zeppelin server, we created different interpreters for different clusters or server instances.

Capability to override interpreter settings at the user level — Some interpreter settings, such as job queues and memory values, need to be customized by users for their specific use cases. To support this, we added a feature in Zeppelin so that users can override certain interpreter settings by setting properties. This is described in detail in the Apache JIRA ticket ZEPPELIN-1625.

Architecture

Figure 5: Zeppelin Architecture at eBay

Knox

Apache Knox (GitHub repo) is an HTTP reverse proxy that provides a single endpoint for applications to invoke Hadoop operations. It supports multiple clusters and multiple components such as WebHDFS, Oozie, and WebHCat. It can also support multiple authentication mechanisms, so that we can hook up custom authentication alongside Kerberos authentication.

It is an Apache top-level project and is 100% open source.

Use cases

Knox allows an application to talk to multiple Hadoop clusters and related components through a single entry point using any application-friendly non-Kerberos authentication mechanism. A few common use cases are listed below:

  • To authenticate using an application token and put/get files to/from HDFS on a specific cluster (see the sketch after this list)
  • To authenticate using an application token and trigger an Oozie job
  • To run a Hive script using WebHCat
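
As an illustration of the first use case, the sketch below uses Python's requests library to upload a file to HDFS through a Knox gateway via the standard WebHDFS two-step CREATE flow. The gateway URL, topology name, and token header are hypothetical placeholders; eBay's actual application-token mechanism is custom and not shown here.

import requests

# Hypothetical Knox endpoint and credentials -- replace with real values.
KNOX_BASE = "https://knox.example.com:8443/gateway/default"
HEADERS = {"Authorization": "Bearer <application-token>"}  # placeholder for the custom app token

def put_file_to_hdfs(local_path, hdfs_path):
    """Upload a local file to HDFS through Knox's WebHDFS endpoint."""
    url = KNOX_BASE + "/webhdfs/v1" + hdfs_path + "?op=CREATE&overwrite=true"
    # Step 1: WebHDFS answers with a redirect; Knox proxies and rewrites the Location header.
    resp = requests.put(url, headers=HEADERS, allow_redirects=False)
    location = resp.headers["Location"]
    # Step 2: send the file content to the redirect location.
    with open(local_path, "rb") as f:
        requests.put(location, headers=HEADERS, data=f).raise_for_status()

put_file_to_hdfs("report.csv", "/user/myteam/report.csv")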

Enhancements

Authentication using application tokens — The applications and services in the eBay backend use a custom token-based authentication mechanism. To take advantage of the existing application credentials, we enhanced Knox to support our application token-based authentication mechanism in addition to Kerberos. Knox utilizes the Hadoop Authentication framework, which is flexible enough to plug in new authentication mechanisms. The steps to plug in an authentication mechanism on Hadoop’s HTTP interface are described in Multiple Authentication Mechanisms for Hadoop Web Interfaces.

Architecture

Figure 6: Knox Architecture at eBay

Summary

In this blog post, we described the approach taken to improve user experience and developer productivity when using our multiple Hadoop clusters and related components. We illustrated the use of three open-source products, Hue, Zeppelin, and Knox, to make our Hadoop users’ lives a lot simpler. We evaluated these products, customized them for eBay’s purposes, and made them available so that our users can carry out their projects efficiently.

A Creative Visualization of OLAP Cuboids (May 9, 2017)

Background

eBay is one of the world’s largest and most vibrant marketplaces, with 1.1 billion live listings every day, 169 million active buyers, and trillions of rows across datasets ranging from terabytes to petabytes. Analyzing such volumes of data required eBay’s Analytics Data Infrastructure (ADI) team to create a fast analytics platform for this big data using Apache Kylin, an open-source distributed analytics engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop for extremely large datasets.

Apache Kylin creatively applies data warehouse OLAP technology to the Hadoop domain, which allows a query against petabyte-scale datasets to return within milliseconds to seconds. The magic of Apache Kylin is that it pre-calculates the metrics against defined dimensions. So when a query is launched, it doesn’t need to scan petabytes of source data; instead it scans the pre-calculated metrics, which are much smaller than the source data, greatly accelerating the query.

Currently there are hundreds of cubes running in the Kylin production environment within eBay, serving dozens of business domains such as Inventory Health Analysis, User Behavior Analysis, eBay API Usage Analysis, and eBay Business Partner Channel Performance Analysis.

This post showcases the creative visualization of OLAP cuboids implemented in the Cube Planner feature, built on Apache Kylin by eBay’s ADI team, to solve the challenge of showing a huge number of OLAP cuboids in a fixed space. To better understand the challenge, as well as the value of the sunburst charts introduced to visualize OLAP cuboids, some basic concepts need to be covered first.

Basic Concepts

An OLAP cube is a multi-dimensional array of data [1][2]. OLAP is an acronym for online analytical processing [3], a computer-based technique of analyzing data to look for insights. The term cube here refers to a multi-dimensional dataset, which is also sometimes called a hypercube when the number of dimensions is greater than 3.

A cuboid is one combination of dimensions.

For example, if a cube has 4 dimensions — time, item, location, and supplier — it has 16 cuboids, as shown here.

A basic cuboid has the most detailed data, except for the source data itself; it is composed of all dimensions, like (time, item, location, supplier). It can be rolled up to all the other cuboids. For example, a user can roll up the basic cuboid (time, item, location, supplier) along dimension “supplier” to cuboid (time, item, location). And in this case, the basic cuboid is the Parent Cuboid, and a 3-D cuboid (time, item, location) is a Child Cuboid.
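
To make the counting concrete, a cube with n dimensions has 2^n cuboids, one for each subset of the dimensions. The short Python sketch below (illustrative only, unrelated to Kylin's internal cuboid encoding) enumerates the 16 cuboids of the 4-dimension example and derives the basic cuboid's children by removing one dimension at a time.

from itertools import combinations

dimensions = ("time", "item", "location", "supplier")

# Every subset of the dimension set is a cuboid: 2^4 = 16 in total.
cuboids = [c for r in range(len(dimensions), -1, -1)
           for c in combinations(dimensions, r)]
print(len(cuboids))   # 16

# Rolling up the basic cuboid along one dimension yields a child cuboid.
basic = cuboids[0]    # ('time', 'item', 'location', 'supplier')
children = [tuple(d for d in basic if d != removed) for removed in basic]
print(children[0])    # ('item', 'location', 'supplier'), a 3-D child cuboid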

OLAP Cuboids Visualization Challenges

The OLAP cuboids visualization has the following characteristics:

  • All the cuboids have a root parent — the basic cuboid.
  • The relationship “rollup to” between two cuboids is directed, from parent to child.
  • The relationship “rollup to” is an m:n mapping. One parent cuboid can be rolled up to multiple child cuboids, while one child cuboid can be rolled up from multiple parent cuboids.

So the visualization of cuboids is naturally a directed graph. In the real OLAP world, however, not all the relationships are kept: the m:n mappings are simplified to 1:n mappings, so that every child cuboid has just one parent. Usually we keep the relationship with the parent that has the lowest row count and eliminate the others, because the lowest parent row count means the lowest cost of aggregating to the child cuboid. Hence, the visualization of cuboids is simplified to a tree, where the basic cuboid is the root, as shown below.
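
The parent-selection rule just described can be sketched in a few lines of Python. This is a toy illustration under the assumption that each cuboid's row count is known, not Kylin's actual implementation.

def choose_parent(candidate_parents, row_count):
    """Keep only the candidate parent with the lowest row count (cheapest to aggregate from)."""
    return min(candidate_parents, key=lambda parent: row_count[parent])

# Toy example: two 2-D cuboids compete to be the parent of the 1-D cuboid ('time',).
row_count = {("time", "item"): 9_000_000, ("time", "location"): 400_000}
print(choose_parent([("time", "item"), ("time", "location")], row_count))
# ('time', 'location') -- the parent with the lower row count wins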

Even with the simplified relationships between cuboids, there can still be some challenges to cuboids layout with a tree:

  • The tree must be collapsed to fit in a fixed space.
  • It is impossible to have an overall view of all cuboids.
  • Multiple clicks are needed to reach a target node from the root, layer by layer.
  • It’s hard to get a whole view of all the child cuboids of a given cuboid.

Cuboid Visualization Solution in Cube Planner

Cube Planner makes OLAP cubes more resource-efficient. It intelligently builds a partial cube to minimize the cost of building the cube while maximizing the benefit of serving end-user queries. In addition, it learns patterns from queries at runtime and dynamically recommends cuboids accordingly.

In Cube Planner, we want to show query usage down to the cuboid level, which enables cube owners to gain more insight into their cubes. We want a color-coded heat map with all cuboids in one view to give the cube owner an overall feel for the cube design. Furthermore, when the user hovers over a cuboid, details of that individual cuboid (cuboid ID, query count, row count, rollup rate, etc.) are displayed. We also recommend a new cube design with recommended cuboids based on the query statistics, so we need to put the old cuboids and new cuboids together on one page to show the benefit of cube optimization.

We were not able to find any tree or directed-graph component that met all the requirements above. Luckily, our GUI engineer discovered a means to produce sunburst charts, which meet our expectations nicely.

What is a Sunburst Chart?

Our sunburst charts are created with Angular-nvD3, an AngularJS directive for NVD3 re-usable charting library (based on D3). Users can easily customize their charts via a JSON API. Go to the Angular-nvD3 quick start page if you want to know more about how to include these fancy charts in your GUI.

How Are Sunburst Charts Used for Cube Cuboids Visualization?

Basically, at the back end we create a REST API that returns the cuboid tree with the necessary information, and at the front end a JavaScript controller parses the REST response into the required JSON format and then renders the sunburst chart. Below are code samples from these two layers.

REST Service returns the Cuboid Tree response

@RequestMapping(value = "/{cubeName}/cuboids/current", method = RequestMethod.GET)
    @ResponseBody
    public CuboidTreeResponse getCurrentCuboids(@PathVariable String cubeName) {
        CubeInstance cube = cubeService.getCubeManager().getCube(cubeName);
        if (cube == null) {
            logger.error("Get cube: [" + cubeName + "] failed when get current cuboids");
            throw new BadRequestException("Get cube: [" + cubeName + "] failed when get current cuboids");
        }
        Map<Long, Long> cuboidList = Maps.newLinkedHashMap();
        try {
            cuboidList = cubeService.getCurrentCuboidStatistics(cube);
        } catch (IOException e) {
            logger.error("Get cuboids list failed.", e);
            throw new InternalErrorException("Get cuboids failed.", e);
        }
        CuboidTree currentCuboidTree = CuboidTreeManager.getCuboidTree(cuboidList);
        Map<Long, Long> hitFrequencyMap = getCuboidHitFrequency(cubeName);
        Map<Long, Long> queryMatchMap = getCuboidQueryMatchCount(cubeName);
        if (currentCuboidTree == null) {
            logger.warn("Get current cuboid tree failed.");
            return null;
        }
        return cubeService.getCuboidTreeResponse(currentCuboidTree, cuboidList, hitFrequencyMap, queryMatchMap, null);
    }

The JavaScript controller parses the REST response to create the sunburst chart

// transform chart data and customized options.
    $scope.createChart = function(data, type) {
        var chartData = data.treeNode;
        var baseNodeInfo = _.find(data.nodeInfos, function(o) { return o.cuboid_id == data.treeNode.cuboid_id; });
        $scope.formatChartData(chartData, data, baseNodeInfo.row_count);
           ......
        } else if ('recommend' === type) {
            $scope.recommendData = [chartData];
            $scope.recommendOptions = angular.copy(cubeConfig.chartOptions);
            $scope.recommendOptions.caption = {
                enable: true,
                // Note: the caption HTML in the original post includes color-swatch markup
                // for the legend; it is abbreviated to plain text here.
                html: 'Existed: Hottest / Hot / Warm / Cold  '
                    + 'New: Hottest / Hot / Warm / Cold  '
                    + 'Mandatory',
                css: {
                    position: 'relative',
                    top: '-25px'
                }
            };
            $scope.recommendOptions.chart.color = function(d) {
                var cuboid = _.find(data.nodeInfos, function(o) { return o.name == d; });
                if (cuboid.row_count < 0) {
                    return d3.scale.category20c().range()[5];
                } else {
                    var baseRate = 1/data.nodeInfos.length;
                    var colorIndex = 0;
                    if (!cuboid.existed) {
                        colorIndex = 8;
                    }
                    if (cuboid.query_rate > (3 * baseRate)) {
                        return d3.scale.category20c().range()[colorIndex];
                    } else if (cuboid.query_rate > (2 * baseRate)) {
                        return d3.scale.category20c().range()[colorIndex+1];
                    } else if (cuboid.query_rate > baseRate) {
                        return d3.scale.category20c().range()[colorIndex+2];
                    } else {
                        return d3.scale.category20c().range()[colorIndex+3];
                    }
                }
            };
            $scope.recommendOptions.title.text = 'Recommend Cuboid Distribution';
            $scope.recommendOptions.subtitle.text = 'Cuboid Count: ' + data.nodeInfos.length;
        }
    };

    // transform chart data
    $scope.formatChartData= function(treeNode, orgData, parentRowCount) {
        var nodeInfo = _.find(orgData.nodeInfos, function(o) { return o.cuboid_id == treeNode.cuboid_id; });
        $.extend(true, treeNode, nodeInfo);
        treeNode.parent_row_count = parentRowCount;
        if(treeNode.children.length > 0) {
            angular.forEach(treeNode.children, function(child) {
                $scope.formatChartData(child, orgData, nodeInfo.row_count);
            });
        }
    };

Screenshots and Interaction

Below are some screenshots from the eBay Kylin production environment. With a sunburst chart, cube owners can easily understand their overall cube design with a color-coded cuboid usage heat map. The more dark blue elements in the sunburst chart, the more resource-efficient the cube is; the more light blue elements, the more room there is for improvement.

Putting the two sunburst charts of the current and recommended cuboids together makes the changes obvious: the cuboids recommended for removal are marked in gray, and those recommended to be added are marked in green. A popup window with more details of a cuboid is shown when the mouse hovers over that cuboid element in a sunburst chart. The value of Cube Planner is now apparent.

Interaction with a sunburst chart is fast and convenient. The user is able to focus on any cuboid and its children with just one click, and the view changes automatically, like from the left chart to the right chart below.

      

If you want to move back up to the parent of the current view, click on the center circle (the part marked yellow).

Summary

In short, leveraging sunburst charts for OLAP cuboid visualization introduces a creative way to discover cube insights down to the cuboid level. With these insights, the cube owner is able to build a resource-efficient cube, thus making Apache Kylin more competitive as an OLAP engine on Hadoop.

References

Some graphics copyright Apache Kylin

A Surprising Pitfall of Human Judgement and How to Correct It (May 4, 2017)

Introduction

Algorithms based on machine learning, deep learning, and AI are in the news these days. Evaluating the quality of these algorithms is usually done using human judgment. For example, if an algorithm claims to detect whether an image contains a pet, the claim can be checked by selecting a sample of images, using human judges to detect if there is a pet, and then comparing their results to the algorithm’s. This post discusses a pitfall in using human judgment that has mostly been overlooked until now.

In real life, human judges are imperfect. This is especially true if the judges are crowdsourced. This is not a new observation. Many proposals have been made to process raw judgment scores and improve their accuracy. They almost always involve having multiple judges score each item and combining the scores in some fashion. The simplest (and probably most common) method of combination is majority vote: if the judges are rating yes/no (for example, is there a pet?), you report the rating given by the majority of the judges.
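
As a minimal illustration of that combination step (assuming an odd number of judges so that ties cannot occur), a yes/no majority vote might look like this:

from collections import Counter

def majority_vote(labels):
    """Combine the per-judge yes/no labels for one item into a single label."""
    return Counter(labels).most_common(1)[0][0]

print(majority_vote(["pet", "pet", "no pet"]))   # 'pet'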

But even after such processing, some errors will remain. Judge errors are often correlated, and so multiple judgments can correct fewer errors than you might expect. For example, most of the judges might be unclear on the meaning of “pet”, incorrectly assume that a chicken cannot be a pet, and wrongly label a photo of a suburban home showing a child playing with chickens in the backyard. Nonetheless, any judgment errors remaining after processing are usually ignored, and the processed judgments are treated as if they were perfect.

Ignoring judge errors is a pitfall. Here is a simulation illustrating this. I’ll use the pet example again. Suppose there are 1000 images sent to the judges and that 70% of the images contain a pet. Suppose that when there is no pet, the judges (or the final processed judgment) correctly detect this 95% of the time. This is actually higher than typical for realistic industrial applications. And suppose that some of the pet images are subtle, and so when there is a pet, judges are correct only 90% of the time. I now present a simulation to show what happens.

Here’s how one round of the simulation works. I assume that there are millions of possible images and that 70% of them contain a pet. I draw a random sample of 1000. I assume that when the image has a pet, the judgment process has a 90% chance of reporting “pet”, and when there is no pet, the judgment process reports “no pet” 95% of the time. I get 1000 judgments (one for each image), and I record the fraction of those judgments that were “pet”.

That’s one round of the simulation. I do the simulation many times and plot the results as a histogram. I actually did 100,000 simulations, so the histogram has 100,000 values. The results are below in red. Since this is a simulation, I know the true fraction of images with pets: it’s 70%. The estimated fraction from the judges in each simulation is almost always lower than the true fraction. Averaged over all simulations, the estimated fraction is 64%, noticeably lower than the true value of 70%.
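
The size of the shift is easy to predict from the error rates alone: the expected fraction of “pet” judgments is the true pet rate times the judges’ accuracy on pet images, plus the contribution from non-pet images that are judged incorrectly,

    \[ 0.7 \times 0.9 + 0.3 \times (1 - 0.95) = 0.645, \]

which matches the roughly 64% average observed in the red histogram.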

This illustrates the pitfall: by ignoring judge error, you get the wrong answer. You might wonder if this has any practical implications. How often do you need a precise estimate for the fraction of pets in a collection of images? But what if you’re using the judgments to determine the accuracy of an algorithm that detects pets? And suppose the algorithm has an accuracy rate of 70%, but the judges say it is 64%. In the machine learning/AI community, the difference between an algorithm that is 70% accurate vs 64% is a big deal.

But maybe things aren’t as bad as they seem. When statistically savvy experimenters report the results of human judgment, they include error bars. The error bars (I give details below) in this case are

    \[       0.641 \pm 1.96 \sqrt{\frac{p (1-p)}{n}} \approx 0.641 \pm 0.030 \qquad (p = 0.64) \]

So even error bars don’t help: the actual accuracy rate of 0.7 is not included inside the error bars.

The purpose of this post is to show that you can compensate for judge error. In the image above, the blue bars are the simulation of the compensating algorithm I will present. The blue bars do a much better job of estimating the fraction of pets than the naive algorithm. This post is just a summary, but full details are in Drawing Sound Conclusions from Noisy Judgments from the WWW17 conference.

The histogram shows that the traditional naive algorithm (red) has a strong bias. But it also seems to have a smaller variance, so you might wonder if it has a smaller mean square error (MSE) than the blue algorithm. It does not. The naive algorithm has MSE 0.0037, the improved algorithm 0.0007. The smaller variance does not compensate for the large bias.

Finally, I can explain why judge error is so important. For a typical problem requiring ML/AI, many different algorithms are tried. Then human judgment can be used to detect if one is better than the other. Current practice is to use error bars as above, which do not take into account errors in the human judgment process. But this can lead the algorithm developer astray and suggest that a new algorithm is better (or worse) when in fact the difference is just noise.

The setup

I’m going to assume there are both an algorithm that takes an input and outputs a label (not necessarily a binary label) and also a judge that decides if the output is correct. So the judge is performing a binary task: determining if the algorithm is correct or not. From these judgments, you can compute the fraction of times (an estimate of the probability p) that the algorithm is correct. I will show that if you can estimate the error in the judgment process, then you can compensate for it and get a better estimate of the accuracy of the algorithm. This applies if you use raw judge scores or if you use a judgment process (for example, majority vote) to improve judge accuracy. In the latter case, you need to estimate the accuracy of the judgment process.

A simple example of this setup is an information retrieval algorithm. The algorithm is given a query and returns a document. A judge decides if the document is relevant or not. A subset of those judgments is reevaluated by gold judge experts, giving an estimate of the judges’ accuracy. Of course, if you are rich enough to be able to afford gold judges to perform all your labeling, then you don’t need to use the method presented here.

A slightly more subtle example is the pet detection algorithm mentioned above. Here it is likely that there would be a labeled set of images (a test set) used to evaluate the algorithm. You want to know how often the algorithm agrees with the labels, and you need to correct for errors in the judgment about agreement. To estimate the error rate, pick a subset of images and have them rejudged by experts (gold judges). The judges were detecting if an image had a pet or not. However, what I am really interested in is the error rate in judging whether the algorithm gave the correct answer or not. But that is easily computed from the gold judgments of the images.

The first formula

In the introduction, I mentioned that there are formulas that can correct for judge error. To explain the formula, recall that there is a task that requires labeling items into two classes, which I’ll call positive and negative. In the motivating example, positive means the algorithm gives the correct output for the item, negative that it is incorrect. I have a set of items, and judges perform the task on each item, getting an estimate of how many items are in the positive class. I’d like to use information on the accuracy of the judges to adjust this estimate. The symbols used in the formula are:

  • p: the true fraction of items in the positive class
  • p_J: the fraction of items labeled positive by the judgment process
  • q_+: the probability that the judgment process correctly labels a positive item
  • q_-: the probability that the judgment process correctly labels a negative item

The first formula is

    \[ p = \frac{p_J + q_- - 1}{q_+ + q_- -1} \]

This formula is not difficult to derive; see the aforementioned WWW paper for details. Here are some checks that the formula is plausible. If the judges are perfect, then q_+ = q_- = 1, and the formula reduces to p = p_J. In other words, the judges’ opinion of p is correct. Next, suppose the judges are useless and guess randomly. Then q_+ = q_- = 1/2, and the formula makes no sense because the denominator is zero. So that’s also consistent.

Notice the formula is asymmetric: the numerator has q_- but not q_+. To see why this makes sense, first consider the case when the judges are perfect on negative items so that q_- = 1. The judges’ only error is to take correct answers and claim they are negative. So an estimate of p by the judges is always too pessimistic. On the other hand, if q_+ = 1, then the judges are optimistic, because they sometimes judge incorrect items as correct.

I will now show that the formula achieves these requirements. In particular, if q_- = 1 then I expect p_J < p and if q_+ = 1 then p_J > p. To verify this, note that if q_- = 1 the formula becomes p = p_J/q_+ > p_J. And if q_+ = 1 then

    \begin{eqnarray*}   p  &=& \frac{p_J + q_- - 1}{q_-} \\ &=& \frac{p_J - 1}{q_-} + 1 \\ &=& \frac{p_J - 1}{q_-} + (1-p_J) + p_J \\   &= & (p_J - 1)\left(\frac{1}{q_-} - 1\right) + p_J \\   & < & p_J \end{eqnarray*}

the last inequality because p_J - 1 < 0 and 1/q_- -1 > 0.

Up until now the formula is theoretical, because the precise values of p_J, q_+ and q_- are not known. I introduce some symbols for the estimates.

  • \widehat{p}_J: the estimate of p_J computed from the n judged items
  • \widehat{q}_+: the estimate of q_+ computed from a gold-judged sample of positive items
  • \widehat{q}_-: the estimate of q_- computed from a gold-judged sample of negative items
  • \widehat{p}: the corresponding corrected estimate of p

The practical formula is

    \[ \widehat{p} = \frac{\widehat{p}_J + \widehat{q}_\text{--} - 1}{\widehat{q}_+ + \widehat{q}_\text{--} -1} \]
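
As a quick sanity check with the numbers from the simulation in the introduction (judged fraction 0.641, judge accuracies 0.90 and 0.95), the practical formula recovers an estimate close to the true value of 0.7. A minimal sketch:

def corrected_estimate(p_j_hat, q_pos_hat, q_neg_hat):
    """Corrected estimate of the positive fraction, from the judged fraction and judge accuracies."""
    return (p_j_hat + q_neg_hat - 1.0) / (q_pos_hat + q_neg_hat - 1.0)

print(corrected_estimate(0.641, 0.90, 0.95))   # ~0.695, close to the true value 0.7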

The second formula

In the introduction, I motivated the concern with judgment error using the problem of determining the accuracy of an ML algorithm and, in particular, comparing two algorithms to determine which is better. If you had perfect judges and an extremely large set of labeled data, you could measure the accuracy of each algorithm on the large labeled set, and the one with the highest accuracy would clearly be best. But the labeled set is often limited in size. That leads to some uncertainty: if you had picked a different labeled set you might get a difference answer. That is the purpose of error bars: they quantify the uncertainty. Traditionally, the only uncertainty taken into account is due to the finite sample size. In this case, the traditional method gives 95% error bars of \widehat{p}_J  \pm 2 \sqrt{v_{\widehat{p}_J}} where

    \begin{eqnarray*}       \widehat{p}_J  &=&  \frac{k}{n} \\       v_{\widehat{p}_J} &= &  \frac{\widehat{p}_J(1-\widehat{p}_J)}{n} \end{eqnarray*}

The judge gave a positive label k times out of n, \widehat{p}_J is the estimated mean and v_{\widehat{p}_J} the estimate of the variance of \widehat{p}_J. I showed via simulation that if there are judgment errors, these intervals are too small. The corrected formula is

    \[ v_{\widehat{p}} =  \frac{v_{\widehat{p}_J}}{(\widehat{q}_+ + \widehat{q}_\text{--} - 1)^2} + v_{\widehat{q}_+}\frac{(\widehat{p}_J - 1 + \widehat{q}_\text{--})^2}{(\widehat{q}_+ + \widehat{q}_\text{--} - 1)^4} + v_{\widehat{q}_\text{--}}\frac{(\widehat{p}_J - \widehat{q}_+)^2}{(\widehat{q}_+ + \widehat{q}_\text{--} - 1)^4} \]

where

  • v_{\widehat{q}_+} = \widehat{q}_+(1-\widehat{q}_+)/n_+, the estimated variance of \widehat{q}_+, where n_+ is the number of gold-judged positive items
  • v_{\widehat{q}_-} = \widehat{q}_-(1-\widehat{q}_-)/n_-, the estimated variance of \widehat{q}_-, where n_- is the number of gold-judged negative items

The formula shows that when taking judge error into account, the variance increases; that is, v_{\widehat{p}} > v_{\widehat{p}_J}. The first term of the equation for v_{\widehat{p}} is already larger than v_{\widehat{p}_J}, since it is v_{\widehat{p}_J} divided by a number less than one. The next two terms add additional error, due to the uncertainty in estimating the judge errors \widehat{q}_+ and \widehat{q}_-.
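
Continuing the numeric example, with n = 1000 judged items and 200 gold-judged items in each class (the defaults in the simulation code at the end of this post), the corrected 95% interval roughly doubles in width and now covers the true value of 0.7. A sketch of the calculation:

import math

def corrected_variance(p_j_hat, q_pos_hat, q_neg_hat, n, n_gold_pos, n_gold_neg):
    """Variance of the corrected estimate, including uncertainty in the judge accuracies."""
    v_pj = p_j_hat * (1 - p_j_hat) / n
    v_qpos = q_pos_hat * (1 - q_pos_hat) / n_gold_pos
    v_qneg = q_neg_hat * (1 - q_neg_hat) / n_gold_neg
    denom = (q_pos_hat + q_neg_hat - 1) ** 2
    return (v_pj / denom
            + v_qpos * (p_j_hat - 1 + q_neg_hat) ** 2 / denom ** 2
            + v_qneg * (p_j_hat - q_pos_hat) ** 2 / denom ** 2)

v = corrected_variance(0.641, 0.90, 0.95, n=1000, n_gold_pos=200, n_gold_neg=200)
print(1.96 * math.sqrt(v))   # ~0.05, so the interval 0.695 +/- 0.05 contains 0.7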

Extensions

The theme of this post is correcting for judge errors, but I have only discussed the case when the judge gives each item a binary rating, as a way to estimate what fraction of items are in each of the two classes. For example, the judge reports if an algorithm has given the correct answer, and you want to know how often the algorithm is correct. Or the judge examines a query and a document, and you want to know how often the document is relevant to the query.

But the methods presented so far extend to other situations. The WWW paper referred to earlier gives a number of examples in the domain of document retrieval. It explains how to correct for judge error in the case when a judge gives documents a relevance rating rather than just a binary relevant or not. And it shows how to extend these ideas beyond the metric “fraction in each class” to the metric DCG, which is a common metric for document retrieval.

Assumptions

Analyses of the type presented here always have assumptions. The assumption of this work is that there is a task (for example, “is the algorithm correct on this item?”) with a correct answer and that there are gold judges who are expert enough and have enough time to reliably obtain that correct answer. Industrial uses of human judgment often use detailed guidelines explaining how the judgment should be done, in which case these assumptions are reasonable.

But what about the general case? What if different audiences have different ideas about the task? For example, there may be differences from country to country about what constitutes a pet. The analysis presented here is still relevant. You only need to compute the error rate of the judges relative to the appropriate audience.

I treat the judgment process as a black box, where only the overall accuracy of the judgments is known. This is sometimes very realistic, for example, when the judgment process is outsourced. But sometimes details are known about the judgment process that would lead you to conclude that the judgments of some items are more reliable than others. For example, you might know which judges judged which items, and so have reliable information on the relative accuracy of the judges. The approach I use here can be extended to apply in these cases.

The simulation

I close this post with the R code that was used to generate the simulation given earlier.

# sim() returns a (4 x n.sim) array, where the rows are:
#   the naive estimate for p,
#   the corrected estimate,
#   whether the true value of p is in a 95% confidence interval (0/1)
#        for the naive standard error
#   the same value for the corrected standard error

sim = function(
  q.pos,  # probability a judge is correct on positive items
  q.neg,  # probability a judge is correct on negative items
  p,      # fraction of positive items
  n,      # number of items judged
  n.sim,  # number of simulations
  n.gold.neg = 200, # number of items the gold judge deems negative
  n.gold.pos = 200  # number of items the gold judge deems positive
  )
{
   # number of positive items in sample
   n.pos = rbinom(n.sim, n, p)

   # estimate of judge accuracy
   q.pos.hat = rbinom(n.sim, n.gold.pos, q.pos)/n.gold.pos
   q.neg.hat = rbinom(n.sim, n.gold.neg, q.neg)/n.gold.neg

   # what the judges say (.jdg)
   n.pos.jdg = rbinom(n.sim, n.pos, q.pos) +
               rbinom(n.sim, n - n.pos, 1 - q.neg)

   # estimate of fraction of positive items
   p.jdg.hat = n.pos.jdg/n    # naive formula
   # corrected formula
   p.hat = (p.jdg.hat - 1 + q.neg.hat)/
                      (q.neg.hat + q.pos.hat - 1)

   # is p.jdg.hat within 1.96*sigma of p?
   s.hat =  sqrt(p.jdg.hat*(1 - p.jdg.hat)/n)
   b.jdg = abs(p.jdg.hat - p) < 1.96*s.hat

   # is p.hat within 1.96*sigma.hat of p ?
   v.b = q.neg.hat*(1 - q.neg.hat)/n.gold.neg   # variance q.neg
   v.g = q.pos.hat*(1 - q.pos.hat)/n.gold.pos
   denom = (q.neg.hat + q.pos.hat - 1)^2
   v.cor = s.hat^2/denom +
           v.g*(p.jdg.hat - 1 + q.neg.hat)^2/denom^2 +
           v.b*(p.jdg.hat - q.pos.hat)^2/denom^2
   s.cor.hat = sqrt(v.cor)
   # is p.hat within 1.96*s.cor.hat of p?
   b.cor = abs(p.hat - p) < 1.96*s.cor.hat

   rbind(p.jdg.hat, p.hat, b.jdg, b.cor)
}

set.seed(13)
r = sim(n.sim = 100000, q.pos = 0.90,  q.neg = 0.95, p = 0.7, n=1000)

library(ggplot2)

# plot v1 and v2 as two separate histograms, but on the same ggplot
hist2 = function(v1, v2, name1, name2, labelsz=12)
{
  df1 = data.frame(x = v1, y = rep(name1, length(v1)))
  df2 = data.frame(x = v2, y = rep(name2, length(v2)))
  df = rbind(df1, df2)

  freq = aes(y = ..count..)
  print(ggplot(df, aes(x=x, fill=y)) + geom_histogram(freq, position='dodge') +
      theme(axis.text.x = element_text(size=labelsz),
      axis.text.y = element_text(size=labelsz),
      axis.title.x = element_text(size=labelsz),
      axis.title.y = element_text(size=labelsz),
      legend.text = element_text(size=labelsz))
  )
}

hist2(r[1,], r[2,], "naive", "corrected", labelsz=16)


Building a UI Component in 2017 and Beyond (May 3, 2017)

As web developers, we have seen the guidelines for building UI components evolve over the years. Starting from jQuery UI to the current Custom Elements, various patterns have emerged. To top it off, there are numerous libraries and frameworks, each advocating their own style on how a component should be built. So in today’s world, what is the best way to think about a UI component interface? That is the essence of this blog post. Huge thanks to the folks mentioned in the Acknowledgments section at the bottom. This post also leverages a lot of learnings from the articles and projects listed in the References section at the end.

Setting the context

Before we get started, let’s set the context on what this post covers.

  • By UI components, we mean the core, standalone UI patterns that apply to any web page in general. Examples would be Button, Dropdown Menu, Carousel, Dialog, Tab, etc. Organizations in general maintain a pattern library for these core components. We are NOT talking about application-specific components here, like a Photo Viewer used in social apps or an Item Card used in eCommerce apps. They are designed to solve application-specific (social, eCommerce, etc.) use cases. Application components are usually built using core UI components and are tied with a JavaScript framework (Vue.js, Marko, React, etc.) to create a fully featured web page.

  • We will only be talking about the interface (that is, the API) of a component and how to communicate with it. We will not go over the implementation details in depth, just give a quick overview.

Declarative HTML-based interface

Our fundamental principle behind building a UI component API is to make it agnostic with regard to any library or framework. This means the API should be able to work in any context and be framework-interoperable. Any popular framework can just leverage the interface out of the box and augment it with its own implementation. With this principle in mind, what would be the best way to get started? The answer is pretty simple — make the component API look like any other HTML element, exactly like how a <div>, <img>, or <video> tag works. This is the idea behind having a declarative HTML-based interface/API.

So what does a component look like? Let’s take carousel as an example. This is what the component API will look like.

<carousel index="2" controls aria-label-next="next" aria-label-previous="previous" autoplay>
    <div>Markup for Item 1</div>
    <div>Markup for Item 2</div>
    ...
</carousel>

What does this mean? We are declaratively telling the component that the rendered markup should do the following.

  • Start at the second item in the carousel.
  • Display the left and right arrow key controls.
  • Use "previous" and "next" as the aria-label attribute values for the left and right arrow keys.
  • Also, autoplay the carousel after it is loaded.

For a consumer, this is the only information they need to know to include this component. It is exactly like how you include a <button> or <canvas> HTML element. Component names are always lowercase. They should also be hyphenated to distinguish them from native HTML elements. A good suggestion would be to prefix them with a namespace, usually your organization or project name, for example ebay-, core-, git-, etc.

Attributes form the base on how you will pass the initial state (data and configuration) to a component. Let’s talk about them.

Attributes

  • An attribute is a name-value pair where the value is always a string. Now the question may arise: anything can be serialized as a string, and the component can de-serialize the string into the associated data type (JSON, for example). While that is true, the guideline is NOT to do that. A component should only interpret the value as a String (which is the default), a Number (similar to tabindex), or a JavaScript event handler (similar to DOM on-event handlers). Again, at the end of the day, this is exactly how an HTML element works.

  • Attributes can also be boolean. As per the HTML5 spec, “The presence of a boolean attribute on an element represents the true value, and the absence of the attribute represents the false value.” This means that as a component developer, when you need a boolean attribute, just check for the presence of it on the element and ignore the value. Having a value for it has no significance; both creator and consumer of a component should follow the same. For example, <button disabled="false"> will still disable the button, even if it is set to false, just because the boolean attribute disabled is present.

  • All attribute names should be lowercase. Camel case or Pascal case is NOT allowed. For certain multiword attributes, hyphenated names like accept-charset, data-*, etc. can be used, but that should be a rare occurrence. Even for multiwords, try your best to keep them as one lowercase name, for example, crossorigin, contenteditable, etc. Check out the HTML attribute reference for tips on how the native elements are doing it.

We can correlate the above attribute rules with our <carousel> example.

  • aria-label-next and aria-label-previous as string attributes. We hyphenate them as they are multiwords, very similar to the HTML aria-label attribute.
  • index attribute will be deserialized as a number, to indicate the position of the item to be displayed.
  • controls and autoplay will be considered as boolean attributes.

A common pattern that used to exist (or still exists) is to pass configuration and data as JSON strings. For our carousel, it would be something like the following example.

<!-- This is not recommended -->
<carousel 
    data-config='{"controls": true, "index": 2, "autoplay": true, "ariaLabelNext": "next", "ariaLabelPrevious": "previous"}' 
    data-items='[{"title":"Item 1", ..}, {"title": "Item 2", ...}]'>
</carousel>

This is not recommended.

Here the component developer reads the data attribute data-config, does a JSON parse, and then initializes the component with the provided configuration. They also build the items of the carousel using data-items. This may not be intuitive, and it works against a natural HTML-based approach. Instead consider a declarative API as proposed above, which is easy to understand and aligns with the HTML spec. Finally, in the case of a carousel, give the component consumers the flexibility to build the carousel items however they want. This decouples a core component from the context in which it is going to be used, which is usually application-specific.

Array-based

There will be scenarios where you really need to pass an array of items to a core component, for example, a dropdown menu. How to do this declaratively? Let’s see how HTML does it. Whenever any input is a list, HTML uses the <option> element to represent an item in that list. As a reference, check out how the <select> and <datalist> elements leverage the <option> element to list out an array of items. Our component API can use the same technique. So in the case of a dropdown menu, the component API would look like the following.

<dropdown-menu list="options" index="0">
    <option value="0" selected>--Select--</option>
    <option value="1">Option 1</option>
    <option value="2">Option 2</option>
    <option value="3">Option 3</option>
    <option value="4">Option 4</option>
    <option value="5">Option 5</option>
</dropdown-menu>

It is not necessary that we should always use the <option> element here. We could create our own element, something like <dropdown-option>, which is a child of the <dropdown-menu> component, and customize it however we want. For example, if you have an array of objects, you can represent each object ({"userName": "jdoe", "score": 99, "displayName": "John Doe"}) declaratively in the markup as <dropdown-option value="jdoe" score="99">John Doe</dropdown-option>. Hopefully you do not need a complex object for a core component.

Config-based

You may also argue that there is a scenario where you need to pass a JSON config for a component to work, or else usability becomes painful. Although this is a rare scenario for core components, a use case I can think of is a core analytics component. This component may not have a UI, but it does all the tracking-related work, and you need to pass in a complex JSON object. What do we do? The AMP Project has a good solution for this. The component would look like the following.

<analytics>
    <script type="application/json">
    {
      "requests": {
        "pageview": "https://example.com/analytics?pid=",
        "event": "https://example.com/analytics?eid="
      },
      "vars": {
        "account": "ABC123"
      },
      "triggers": {
        "trackPageview": {
          "on": "visible",
          "request": "pageview"
        },
        "trackAnchorClicks": {
          "on": "click",
          "selector": "a",
          "request": "event",
          "vars": {
            "eventId": "42",
            "eventLabel": "clicked on a link"
          }
        }
      }
    }
    </script>
</analytics>

Here again we piggyback the interface based on how we would do it in simple HTML. We use a <script> element inside the component and set the type to application/json, which is exactly what we want. This brings back the declarative approach and makes it look natural.

Communication

Till now we talked only about the initial component API. This enables consumers to include a component in a page and set the initial state. Once the component is rendered in the browser, how do you interact with it? This is where the communication mechanism comes into play. And for this, the golden rule comes from the reactive principles of

Data in via attributes and properties, data out via events

This means that attributes and properties can be used to send data to a component and events send the data out. If you take a closer look, this is exactly how any normal HTML element (input, button, etc.) behaves. We already discussed attributes in detail. To summarize, attributes set the initial state of a component, whereas properties update or reflect the state of a component. Let’s dive into properties a bit more.

Properties

At any point in time, properties are your source of truth. After setting the initial state, some attributes do not get updated as the component changes over time. For example, typing in a new phrase in an input text box and then calling element.getAttribute('value') will produce the previous (stale) value. But doing element.value will always produce the current typed-in phrase. Certain attributes, like disabled, do get reflected when the corresponding property is changed. There has always been some confusion around this topic, partly due to legacy reasons. It would be ideal for attributes and properties to be in sync, as the usability benefits are undeniable.

If you are using Custom Elements, implementing properties is quite straightforward. For a carousel, we could do this.

class Carousel extends HTMLElement {  
    static get observedAttributes() {
        return ['index'];
    }
    // Called anytime the 'index' attribute is changed
    attributeChangedCallback(attrName, oldVal, newVal) {
        this[attrName] = newVal;
    }
    // Takes an index value
    set index(idx) {
        // First check if it is numeric
        const numericIndex = parseInt(idx, 10);
        if (isNaN(numericIndex)) {
            return;   
        }
        // Update the internal state
        this._index = numericIndex;
        /* Perform the associated DOM operations */
        moveCarousel();
    }
    get index() {
        return this._index;
    }
}

Here the index property gets all its associated characteristics. If you do carouselElement.index=4, it will update the internal state and then perform the corresponding DOM operations to move the carousel to the fourth item. Additionally, even if you directly update the attribute with carouselElement.setAttribute('index', 4), the component will still update the index property and the internal state, and perform the same DOM operations to move the carousel to the fourth item.

However, until Custom Elements gain massive browser adoption and have a good server-side rendering story, we need to come up with other mechanisms to implement properties. And one way would be to use the Object.defineProperty() API.

const carouselElement = document.querySelector('#carousel1');
Object.defineProperty(carouselElement, 'index', {
    set(idx) {
        // First check if it is numeric
        const numericIndex = parseInt(idx, 10);
        if (isNaN(numericIndex)) {
            return;   
        }
        // Update the internal state
        this._index = numericIndex;
        /* Perform the associated DOM operations */
        moveCarousel();        
    },
    get() {
        return this._index;
    }
});

Here we are augmenting the carousel element DOM node with the index property. When you do carouselElement.index=4, it gives us the same functionality as the Custom Element implementation. But directly updating an attribute with carouselElement.setAttribute('index', 4) will do nothing. This is the tradeoff in this approach. (Technically we could still use a MutationObserver to achieve the missing functionality, but that would be overkill.) If, as a team, you can standardize on state updates happening only through properties, this should be less of a concern.

With respect to naming conventions, since properties are accessed programmatically, they should always be camel-cased. All exposed attributes (an exception would be ARIA attributes) should have a corresponding camel-cased property, very similar to native DOM elements.

Events

When the state of a component has changed, either programmatically or due to user interaction, it has to communicate the change to the outside world. And the best way to do it is by dispatching events, very similar to click or touchstart events dispatched by a native HTML element. The good news is that the DOM comes with a built-in custom eventing mechanism through the CustomEvent constructor. So in the case of a carousel, we can tell the outside world that the carousel transition has been completed by dispatching a transitionend event as shown below.

const carouselElement = document.querySelector('#carousel1');

// Dispatching 'transitionend' event
carouselElement.dispatchEvent(new CustomEvent('transitionend', {
    detail: {index: this._index}
}));

// Listening to 'transitionend' event
carouselElement.addEventListener('transitionend', event => {
    alert(`User has moved to item number ${event.detail.index}`);
});

By doing this, we get all the benefits of DOM events like bubbling, capture etc. and also the event APIs like event.stopPropagation(), event.preventDefault(), etc. Another added advantage is that it makes the component framework-agnostic, as most frameworks already have built-in mechanisms for listening to DOM events. Check out Rob Dodson’s post on how this works with major frameworks.

Regarding a naming convention for events, I would go with the same guidelines that we listed above for attribute names. Again, when in doubt, look at how the native DOM does it.

Implementation

Let me briefly touch upon the implementation details, as they give the full picture. We have only been talking about the component API and communication patterns until now. But the critical missing part is that we still need JavaScript to provide the desired functionality and encapsulation. Some components can be purely markup- and CSS-based, but in reality, most of them will require some amount of JavaScript. How do we implement this JavaScript? Well, there are a couple of ways.

  • Use vanilla JavaScript. Here the developer builds their own JavaScript logic for each component. But you will soon see a common pattern across components, and the need for abstraction arises. This abstraction library will pretty much be similar to those numerous frameworks out in the wild. So why reinvent the wheel? We can just choose one of them.

  • Usually in organizations, web pages are built with a particular library or framework (Angular, Ember, Preact, etc.). You can piggyback on that library to implement the functionality and provide encapsulation. The drawback here is that your core components become tied to the page framework. So if you decide to move to a different framework or do a major version upgrade, the core components have to change with it. This can cause a lot of inconvenience.

  • You can use Custom Elements. That would be ideal, as they come built into browsers, and the browser makers recommend them. But you need a polyfill to make them work across all browsers. You can try a Progressive Enhancement technique as described here, but you would lose functionality in non-supporting browsers. Moreover, until we have a solid and performant server-side rendering mechanism, Custom Elements will lack mass adoption.

And yes, all options are open-ended. It all boils down to choices, and software engineering is all about the right tradeoffs. My recommendation would be to go with either Option 2 or 3, based on your use cases.

Conclusion

Though the title mentions the year “2017”, this is more about building an interface that works not only today but also in the future. We are making the component API agnostic of the underlying implementation. This enables developers to use a library or framework of their choice, and it gives them the flexibility to switch in the future (based on what is popular at that point in time). The key takeaway is that the component API and the principles behind it always stay the same. I believe Custom Elements will become the default implementation mechanism for core UI components as soon as they gain mainstream browser adoption.

The ideal state is when a UI component can be used in any page, without the need for a library or polyfill, and it can work with the page owner’s framework of choice. We need to design our component APIs with that ideal state in mind, and this is a step towards it. Finally, it is worth repeating: when in doubt, check how HTML does it, and you will probably have an answer.

Acknowledgments

Many thanks to Rob Dodson and Lea Verou for their technical reviews and valuable suggestions. Also huge thanks to my colleagues Ian McBurnie, Arun Selvaraj, Tony Topper, and Andrew Wooldridge for their valuable feedback.

Elasticsearch Cluster Lifecycle at eBay http://www.ebaytechblog.com/2017/04/12/elasticsearch-cluster-lifecycle-at-ebay/ http://www.ebaytechblog.com/2017/04/12/elasticsearch-cluster-lifecycle-at-ebay/#respond Wed, 12 Apr 2017 15:20:37 +0000 http://www.ebaytechblog.com/?p=6821
Defining an Elasticsearch cluster lifecycle

eBay’s Pronto, our implementation of the “Elasticsearch as a service” (ES-AAS) platform, provides fully managed Elasticsearch clusters for various search use cases. Our ES-AAS platform is hosted in a private internal cloud environment based on OpenStack. The platform currently manages more than 35 clusters and supports multiple data center deployments. This blog provides guidelines on all the different pieces involved in creating a cluster lifecycle that allows streamlined management of Elasticsearch clusters. All Elasticsearch clusters deployed within the eBay infrastructure follow our defined Elasticsearch lifecycle depicted in the figure below.

Cluster preparation

This lifecycle stage begins when a new use case is being onboarded onto our ES-AAS platform.

On-boarding information

Customers’ requirements are captured in an onboarding template that contains information such as document size, retention policy, and read/write throughput requirements. Based on the inputs provided by the customer, infrastructure sizing is performed. The sizing uses historic learnings from our benchmarking exercises. Onboarding information has helped us in cluster planning and in defining SLAs for customer commitments.

We collect the following information from customers before any use case is onboarded:

  • Use case details: Consists of queries relating to use case description and significance.
  • Sizing Information: Captures the number of documents, their average document size, and year-on-year growth estimation.
  • Data read/write information: Consists of expected indexing/search rate, mode of ingestion (batch mode or individual documents), data freshness, average number of users, and specific search queries containing any aggregation, pagination, or sorting operations.
  • Data source/retention: Original data source information (such as Oracle, MySQL, etc.) is captured on an onboarding template. If the indices are time-based, then an index purge strategy is logged. Typically, we do not use Elasticsearch as the source of data for critical applications.
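Putting the items above together, a simplified onboarding record might look like the sketch below. The field names and values are illustrative assumptions, not eBay's actual template.

// Illustrative onboarding record capturing the inputs listed above.
// Field names and values are hypothetical, not eBay's actual template.
const onboardingRequest = {
    useCase: 'item-title-search',            // use case details
    sizing: {
        documentCount: 200e6,                // expected number of documents
        avgDocumentSizeKB: 2,                // average document size
        yearOnYearGrowthPct: 30              // growth estimation
    },
    readWrite: {
        indexingRatePerSec: 1500,            // expected indexing rate
        searchRatePerSec: 800,               // expected search rate
        ingestionMode: 'batch',              // batch mode or individual documents
        usesAggregations: true               // aggregation/pagination/sorting needs
    },
    dataSource: 'Oracle',                    // original source of truth
    retention: {timeBasedIndices: true, purgeAfterDays: 90}
};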

Benchmarking strategy

Before undertaking any benchmarking exercise, it’s really important to understand the underlying infrastructure that hosts your VMs. This is especially true in a cloud-based environment where such information is usually abstracted from end users. Be aware of potential noisy-neighbor issues, especially on a multi-tenant infrastructure.

Like most folks, we have performed extensive benchmarking exercises on our existing hardware infrastructure and image flavors. Data stored in Elasticsearch clusters is specific to customer use cases, and it is nearly impossible to perform benchmarking runs on all data schemas used by different customers. Therefore, we made assumptions before embarking on any benchmarking exercise, and the following assumptions were key.

  • Clients will use a REST path for any data access on our provisioned Elasticsearch clusters. (No transport client)
  • To start with, we kept a ratio of 1GB RAM to 32GB of disk space. (This was later refined as we learnt from benchmarking.)
  • Indexing numbers were carefully profiled for different numbers of replicas (1, 2, and 3 replicas).
  • Search benchmarking was done always on GetById queries (as search queries are custom and profiling different custom search queries was not viable).
  • We used fixed-size 1KB, 2KB, 5KB, and 10KB documents.

Working from these assumptions, we arrived at a maximum shard size for performance (around 22GB), the right payload size for _bulk requests (~5MB), and so on. We used our own custom JMeter scripts to perform benchmarking. Recently, Elasticsearch has developed and open-sourced the Rally benchmarking tool, which can be used as well. Additionally, based on our benchmarking learnings, we created a capacity-estimation calculator tool that takes customer requirement inputs and calculates the infrastructure requirements for a use case. We avoided a lot of conversation with our customers on infrastructure cost by sharing this tool directly with end users.
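A very rough sketch of such a calculator is shown below. It encodes only the rules of thumb mentioned in this post (the 1GB RAM to 32GB disk ratio and the ~22GB shard ceiling); the real tool is based on far more benchmarking data.

// Rough capacity estimate from the rules of thumb above.
// Real sizing uses many more benchmarked inputs; this is illustrative only.
function estimateCapacity({documentCount, avgDocSizeKB, replicaCount}) {
    const MAX_SHARD_GB = 22;      // max shard size observed for good performance
    const DISK_PER_GB_RAM = 32;   // 1GB RAM : 32GB disk rule of thumb

    const primaryDataGB = (documentCount * avgDocSizeKB) / (1024 * 1024);
    const totalDataGB = primaryDataGB * (1 + replicaCount);
    const primaryShards = Math.ceil(primaryDataGB / MAX_SHARD_GB);
    const ramGB = Math.ceil(totalDataGB / DISK_PER_GB_RAM);

    return {primaryDataGB, totalDataGB, primaryShards, ramGB};
}

console.log(estimateCapacity({documentCount: 200e6, avgDocSizeKB: 2, replicaCount: 1}));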

VM cache pool

Our ES clusters are deployed by leveraging an intelligent warm-cache layer. The warm-cache layer consists of ready-to-use VM nodes that are prepared over a period of time based on some predefined rules. This ensures that VMs are distributed across different underlying hardware uniformly. This layer has allowed us to quickly spawn large clusters within seconds. Additionally, our remediation engine leverages this layer to flex up nodes on existing clusters without errors or any manual intervention. More details on our cache pool are available in another eBay tech blog at Ready-to-use Virtual-machine Pool Store via warm-cache

Cluster deployment

Cluster deployment is fully automated via a Puppet/Foreman infrastructure. We will not talk in detail about how the Elasticsearch Puppet module was leveraged for provisioning Elasticsearch clusters; this is well documented at Elasticsearch puppet module. Along with every release of Elasticsearch, a corresponding version of the Puppet module is generally made publicly available. We have made minor modifications to these Puppet scripts to suit eBay-specific needs. Different configuration settings for Elasticsearch are customized based on our benchmarking learnings. As a general guideline, we do not set the JVM heap size to more than 28GB (because doing so leads to long garbage-collection cycles), and we always disable in-memory swapping for the Elasticsearch JVM process. Independent clusters are deployed across data centers, and load-balancing VIPs (Virtual IP addresses) are provisioned for data access.

Typically, with each cluster provisioned we give out two VIPs, one for data read operations and another one for write operations. Read VIPs are always created over client nodes (or coordinating nodes), while write VIPs are configured over data nodes. We have observed improved throughput from our clusters with such a configuration.

Deployment diagram

 

We use a lot of open-source software on our platform, such as OpenStack, MongoDB, Airflow, Grafana, InfluxDB (open version), OpenTSDB, etc. Our internal services, such as cluster provisioning, cluster management, and customer management services, allow REST API-driven management for deployment and configuration. They also help in tracking clusters as assets against different customers. Our cluster provisioning service relies heavily on OpenStack. For example, we use Nova for managing compute resources (nodes), Neutron APIs for load balancer provisioning, and Keystone for internal authentication and authorization of our APIs.

We do not use federated or cross-region deployments for an Elasticsearch cluster. Network latency limits us from having such a deployment strategy. Instead, we host independent clusters for use cases across multiple regions. Clients will have to perform dual writes when clusters are deployed in multiple regions. We also do not use Tribe nodes.

Cluster onboarding

We create a cluster topology during customer onboarding. This helps us track resources and the cost associated with the cluster infrastructure. The metadata stored as part of a cluster topology maintains region deployment information, SLA agreements, cluster owner information, etc. We use eBay’s internal configuration management system (CMS) to track cluster information in the form of a directed graph. External tools hook onto this topology; such integrations allow easy monitoring of our clusters from centralized eBay-specific systems.

Cluster topology example

Cluster management

Cluster security

Security is provided on our clusters via a custom security plug-in that provides a mechanism to both authenticate and authorize the use of Elasticsearch clusters. Our security plug-in intercepts messages and then performs context-based authorization and authentication using an internal authentication framework. Explicit whitelisting based on client IP is supported; this is useful for configuring Kibana or other external UI dashboards. Admin (DevOps) users are configured to have complete access to the Elasticsearch cluster. We encourage using HTTPS (based on TLS 1.2) to secure communication between clients and Elasticsearch clusters.

The following is a simple sample security rule that can be configured on clusters provisioned on our platform.

sample json code implementing a security rule

In the above sample rule, the enabled field controls whether the security feature is turned on. whitelisted_ip_list is an array attribute listing all whitelisted client IPs. Any open/close index or delete index operations can be performed only by admin users.
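Because the screenshot of the rule is not reproduced here, the sketch below approximates its shape from the description above; the exact field names in our plug-in may differ.

// Approximate shape of a security rule, reconstructed from the description above.
// Exact field names in the actual plug-in may differ.
const securityRule = {
    enabled: true,                                     // turns the security feature on or off
    whitelisted_ip_list: ['10.10.1.21', '10.10.1.22'], // client IPs allowed through (e.g., Kibana hosts)
    admin_users: ['devops_admin'],                     // full access, including open/close/delete index
    https_only: true                                   // prefer TLS 1.2 between clients and the cluster
};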

Cluster monitoring

Cluster monitoring is done by a custom monitoring plug-in that pushes 70+ metrics from each Elasticsearch node to a back-end TSDB-based data store. The plug-in works on a push-based design. External dashboards built with Grafana consume the data from the TSDB store. Custom templates are created on a Grafana dashboard, which allows easy, centralized monitoring of our clusters.

 

 

We leverage an internal alert system that can be used to configure threshold-based alerts on data stored on OpenTSDB. Currently, we have 500+ active alerts configured on our clusters with varying severity. Alerts are classified as ‘Errors’ or ‘Warnings’. Error alerts, when raised, are immediately attended to either by DevOps or by our internal auto-remediation engine, based on the alert rule configured.

Alerts are created during cluster provisioning based on various thresholds. For example, if a cluster status turns RED, an ‘Error’ alert is raised, and if the CPU utilization of a node exceeds 80%, a ‘Warning’ alert is raised.
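As an illustration of how such threshold rules could be evaluated, consider the toy function below; the real definitions live in our internal alert system, not in client code.

// Illustrative threshold evaluation mirroring the examples above.
// The actual rules live in the internal alert system.
function classifyAlerts(metrics) {
    const alerts = [];
    if (metrics.clusterStatus === 'RED') {
        alerts.push({severity: 'Error', message: 'Cluster status is RED'});
    }
    if (metrics.cpuUtilizationPct > 80) {
        alerts.push({severity: 'Warning', message: `CPU at ${metrics.cpuUtilizationPct}%`});
    }
    return alerts;
}

console.log(classifyAlerts({clusterStatus: 'RED', cpuUtilizationPct: 85}));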

Cluster remediation

Our ES-AAS platform can perform an auto-remediation action on receiving any cluster anomaly event. Such actions are enabled via our custom Lights-Out-Management (LOM) module. Any auto-remediation module can significantly reduce manual intervention for DevOps. Our LOM module uses a rule-based engine which listens to all alerts raised on our cluster. The reactor instance maintains a context of the alerts raised and, based on cluster topology state (AUTO ON/OFF), takes remediation actions. For example, if a cluster loses a node and if this node does not return to its cluster within the next 15 minutes, the remediation engine replaces that node via our internal cluster management services. Optionally, alerts can be sent to the team instead of taking a cluster remediation action. The actions of the LOM module are tracked as stateful jobs that are persisted on a back-end MongoDB store. Due to the stateful nature of these jobs, they can be retried or rolled back as required. Audit logs are also maintained to capture the history or timeline of all remediation actions that were initiated by our custom LOM module.

Cluster logging

Along with the standard Elasticsearch distribution, we also ship our custom logging library. This library pushes all Elasticsearch application logs onto a back-end Hadoop store via an internal system called Sherlock. All centralized application logs can be viewed at both the cluster and node levels. Once Elasticsearch log data is available on Hadoop, we run daily Pig jobs on our log store to generate reports for error-log and slow-log counts. We generally keep our logging level at INFO, and whenever we need to triage issues, we use a transient logging setting of DEBUG, which collects detailed logs onto our back-end Hadoop store.

Cluster decommissioning

We follow a cluster decommissioning process for major version upgrades of Elasticsearch. For major upgrades, we spawn a new cluster running our latest offered Elasticsearch version. We replay all documents from the old or existing cluster to the newly created cluster. Clients (user applications) start using both cluster endpoints for all future ingestion until data catches up on the new cluster. Once data parity is achieved, we decommission the old cluster. In addition to freeing up infrastructure resources, we also clean up the associated cluster topology. Elasticsearch also provides a migration plug-in that can be used to check whether direct, in-place upgrades can be done between major Elasticsearch versions. Minor Elasticsearch upgrades are done on an as-needed basis and are usually performed in place.

Healthy Team Backlogs http://www.ebaytechblog.com/2017/03/30/healthy-team-backlogs/ http://www.ebaytechblog.com/2017/03/30/healthy-team-backlogs/#comments Thu, 30 Mar 2017 13:50:32 +0000 http://www.ebaytechblog.com/?p=6731
 

What is a backlog?

Agile product owners use a backlog to organize and communicate the requirements for a team’s work. Product backlogs are deceptively simple, which can sometimes make them challenging to adopt for product owners who may be used to working with lengthy PRDs (“project requirement documents” or similar).

Scrum most commonly uses the term product backlog. However, many product owners who are new to Scrum are confused by this term. Reasonable questions arise: Does this suggest that a team working on multiple products would have multiple backlogs? If so, how do we prioritize between them? Where do bugs get recorded? What happens if work needs to be done, but it isn’t associated with a product; do we create a placeholder?

Therefore, we prefer the term team backlog. Our working definition of team backlog is “the maintained, ordered list of work that the team plans to do now or in the future.” This is a dense description, so let’s unpack it a little.

“The” and “Team”

  • We say the and team because each team needs a single source of truth to track their current and future work.
  • If a team is working on multiple projects or products, all of the work for them should appear on a single, unified team backlog.
  • Teams do not generally share backlogs.

“Work”

  • Work includes almost everything that the development team needs to do.
  • Features, bugs, technical debt, research, improvements, and even user experience work all appear on the same backlog.
  • Generally speaking, recurring team meetings and similar events do not appear on the backlog.

“Maintained”

  • We say maintained because the backlog is a “living” artifact.
  • The product owner and team must continually update and refine their backlog. Otherwise, the team will waste time doing useless work and chasing requirements.
  • This requires several hours per week for the product owner and 1–2 hours per week for the team. It involves adding, ordering, discussing, describing, justifying, deleting, and splitting work.

“Ordered”

  • We say ordered list rather than prioritized list because the backlog is ordered, not just prioritized.
  • If the backlog is only prioritized, there can be multiple items that are all “very high priority.”
  • If the backlog is ordered, we communicate exactly in what order those “very high priority” tasks should be worked on.

“Plans to Do”

  • We say plans to do because we regularly delete everything from the backlog that we no longer plan to work on.
  • Deleting unnecessary work is essential. Unnecessary work clutters up our backlog and distracts from the actual work.

What makes a backlog healthy?

Now that we know what a backlog is, what makes a backlog healthy or not? While what makes for a good backlog is somewhat subjective — in the same way that what makes a good PRD could be subjective — there are 10 characteristics that we’ve found to be particularly important.

Would you like to know if your backlog is healthy? Download this handy PDF checklist, print it out, then open up your backlog and follow along. For each criterion, take note of whether your backlog currently does, doesn’t, or only somewhat meets the criterion. In exchange for less than half an hour of your time, you’ll have a good sense of the health of your backlog and a few ideas for improvement.

  1. Focused, ordered by priority, and the team follows the order diligently

    • At all times, anyone can look at the backlog and know what needs to be worked on next without ambiguity.
    • Even if you have several “P1” issues, the team needs to know which P1 issue needs to be addressed next. Simply saying “they’re all important” will paralyze the team.
    • Although the PO is responsible for the product backlog order and makes the final call, the PO should be willing to negotiate the order with their team. The team often has good insights that can mitigate dependencies or help the PO deliver more value.
    • Stay focused on one thing at a time when possible to deliver value earlier and reduce context switching waste.

  2. Higher-value items towards the top, lower-value items towards the bottom

    • In general, do high-value, low-cost work first (“lowest hanging fruit”).
    • Next, do high-value, high-cost work because it is usually more strategic.
    • Then, do low-value, low-cost work.
    • Finally, eliminate low-value, high-cost work. You will almost always find something better to do with your time and resources, so don’t waste your time tracking it. It will be obvious if and when that work becomes valuable.
    • Hint: You can use Weighted Shortest Job First (WSJF) or a similar technique if you’re having difficulty prioritizing; a small worked example follows this list.

  3. Granular, ready-to-work items towards the top, loosely-defined epics towards the bottom

    • Items that are at the top of the backlog will be worked on next, so we want to ensure that they are the right size to work on.
    • The typical team’s Definition of Ready recommends that items take ≤ ½ of a sprint to complete.
    • Delay decision-making and commitments — manifested as small, detailed, team-ready items — until the last responsible moment.
    • There is little value specifying work in detail if you will not work on it soon. Due to learning and changing customer/company/competitive conditions, your requirements may change or you may cancel the work altogether.

     
    What is an Epic?

    • An “epic” is simply a user story that is too large to complete in one sprint. It gets prioritized in the backlog like every other item.
    • JIRA Tip: “Epics” in JIRA do not appear in the backlog for Scrum boards. As a result, they behave more like organizing themes than epics. Therefore, we suggest using JIRA’s epic functionality to indicate themes and user stories with the prefix “Epic: ”  to indicate actual epics.

  4. Solutions towards the top, statements of need towards the bottom

    • Teams can decide to start working on an item as soon as they know what customer needs they hope to solve. However, collaborating between product, design, development, and stakeholders to translate customer needs into solutions takes time.
    • As with other commitments, defer solutioning decisions until the last responsible moment:
      • Your ideal solution may change through learning or changing conditions such as customer, competitors, company, or even technology options.
      • You may decide not to work on the problem after all.

  5. 1½ to 2 sprints worth of work that’s obviously ready to work on at the top

    • Teams sometimes surprise the product owner by having more capacity than expected.
    • Having enough ready stories ensures that the team is:
      • Unlikely to run out of work to pull into their sprint backlog during sprint planning.
      • Able to pull in additional work during the sprint if they complete the rest of the work on their sprint backlog.
    • It should be obvious what work is and isn’t ready to work on so that the team doesn’t have to waste time figuring it out each time they look at the backlog.
      • Some teams prefix a story title with a “* ” to indicate a ready story (or a story that isn’t ready).

  6. The value of each piece of work is clearly articulated

    • Your team should be able to understand why the work is important to work on.
    • There are three primary sources of value (and you can define your own):
      • User/Business Value: Increase revenue, reduce costs, make users happy
      • Time Criticality: Must it happen soon due to competition, risk, etc.?
      • Opportunity Enablement/Risk Reduction/Learning: Is it strategic? Is it necessary to enable another valuable item (for example, a dependency)?
    • You won’t usually need a complex financial projection, just a reasonable justification as to why the item should be worked on next relative to all other known possibilities. Time previously spent with complex projections can instead be used to talk to customers and identify other opportunities.

  7. The customer persona for the work is clearly articulated

    • The “As a” part of the “As a ____, I can ___, so that ____” user story isn’t a mere formality; it’s an essential part of user-centered product development.
    • Who is the customer? Who are you completing this work for? Even if you’re on a “back-end” team, keep the end-user in mind.
    • Partner with your designer to identify your personas and reference them whenever possible. Is this feature for “Serious Seller Sally?” Can you imagine her personality and needs just as well as any of your friends?
      • Example: “As Serious Seller Sally, I can list items using an ‘advanced’ flow so that I can get the options I need without the guidance for casual sellers that only slows me down.”
    • Tool Tip: Most teams and POs find it best to put just the “I can” part of the user story (for example, “List items using an ‘advanced’ flow”) in the planning tool’s title field. Otherwise it can be harder to read the backlog. Put the entire user story at the top of your tool’s description field.

  8. ≤ 100 items (a rule of thumb), and contains no work that — realistically — will never be done

    • This is a general rule. If your team works on many very small items or has considerable work that you must track, your backlog could be longer.
    • Assuming that each backlog item takes a minute to read and understand, 100 items alone would take over an hour and a half to process. Keeping our backlog limited like this makes it easier and faster to fully understand.
    • A longer backlog is more likely to contain features that will never be built or bugs that will never be fixed. Keeping a short backlog helps us ensure that we triage effectively and delete items that we are unlikely to work on.

  9. The team backlog is not a commitment

    • A Scrum team cannot make a realistic, firm commitment on an entire team backlog because:
      • It has not been through high-level design (for example, tasking at end of Sprint planning).
      • The risk of missed dependencies and unexpected requests/impediments is too great.
      • “Locking in” a plan that far into the future considerably restricts flexibility.
    • A Scrum team can make a valid commitment on a sprint backlog if there are no mid-sprint scope changes and few unexpected requests and impediments.

  10. Backlog reflects the release plan if available

    • If the team has conducted release planning, create pro forma sprints with items in your planning tool to reflect the release plan.
    • If there are production release, moratorium, or similar dates, communicate those too.
    • Update the release plan at the end of each sprint as you learn.
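As promised in the prioritization hint earlier, here is a small worked example of Weighted Shortest Job First. It scores each item by dividing the cost of delay (the three value sources listed above) by job size and orders the backlog by that score; the items and their 1–10 ratings are invented for illustration.

// Weighted Shortest Job First: cost of delay / job size, highest score first.
// The items and 1-10 scores below are invented for illustration.
const backlogItems = [
    {title: 'Advanced listing flow',  userBusinessValue: 8, timeCriticality: 3, riskOrOpportunity: 5, jobSize: 8},
    {title: 'Fix checkout bug',       userBusinessValue: 6, timeCriticality: 9, riskOrOpportunity: 2, jobSize: 2},
    {title: 'Refactor search module', userBusinessValue: 3, timeCriticality: 1, riskOrOpportunity: 7, jobSize: 5}
];

const ordered = backlogItems
    .map(item => ({
        ...item,
        wsjf: (item.userBusinessValue + item.timeCriticality + item.riskOrOpportunity) / item.jobSize
    }))
    .sort((a, b) => b.wsjf - a.wsjf);

ordered.forEach((item, i) => console.log(`${i + 1}. ${item.title} (WSJF ${item.wsjf.toFixed(1)})`));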

What does a healthy team backlog look like in JIRA?

Glad you asked. Here are four sample “sprints” that take good advantage of JIRA’s built-in functionality.

Sprint 1 (active sprint)

Sprint 2 (next sprint)

Sprint 3 (future sprint)

Sprint 4 (future sprint)

Conclusion

Now you know what a healthy team backlog looks like. If you’ve filled out our printable checklist, mark off up to three items that you’ll work to improve over the next week or two with your teams. We hope this is of use to you!

Email Tech Is Now Ad Tech http://www.ebaytechblog.com/2017/03/28/email-tech-is-now-ad-tech/ http://www.ebaytechblog.com/2017/03/28/email-tech-is-now-ad-tech/#comments Tue, 28 Mar 2017 17:39:26 +0000 http://www.ebaytechblog.com/?p=6831

 

eBay has come a long way in our CRM and email marketing in the past two years. Personalization is a relatively easy task when you’re dealing with just one region and one vertical and a hundred thousand customers. With 167M active buyers across the globe, eBay’s journey to help each of our buyers find their version of perfect was quite complex.

Like many in our industry, we’ve had to deal with legacy systems, scalability, and engineering resource constraints. And yet, we’ve made email marketing a point of pride — instead of the “check mark” that we started from. Here’s our story.

Our starting point was a batch-and-blast approach. Our outbound communications very much reflected our organizational structure: as a customer, I’d get a fashion email on Monday, a tech email on Tuesday, and a motors email on Wednesday. This, of course, wasn’t the kind of experience we wanted to create.

Additionally, for each of our marketing campaigns, we hand-authored targeting criteria — just as many of our industry colleagues do today in tools like Marketo and ExactTarget. This approach worked OK, but the resulting segment sizes were too large — in the hundreds of thousands. This meant that we were missing out on the opportunity to treat customers individually. It also didn’t scale well — as our business grew internationally, we needed to add more and more business analysts, and the complexity of our contact strategy was becoming unmanageable.

We wanted to create a structurally better experience for our customers — and our big bet was to go after 1:1 personalization using real-time data. We wanted to use machine learning to do the targeting, with real-time feedback loops powering our models.

Since email is such a powerful driver for eCommerce, we committed to a differentiated experience in this channel. After evaluating multiple off-the-shelf solutions, we settled on building an in-house targeting and personalization system — as the size of the eBay marketplace is astounding, and many of the opportunities and issues are quite unique. We set a high bar: every time we show an offer to a customer, it has to be driven by our up-to-the-minute understanding of what the customer did and how other customers are responding to this offer.

Here are some examples of the scenarios we targeted:

  • eBay has many amazing deals, and our community is very active. Deals quickly run out of inventory. We can’t send an offer to a customer and direct them to an expired deal. Thus, our approach involved open-time rendering of offers in email.
  • Some of our retail events turn out to be much more popular than we anticipate. We want to respond to this real-time engagement feedback by adjusting our recommendations quickly. We thus built a feedback loop that shows an offer to a subset of customers; then, if an event is getting a much higher click-through rate than we expected, we show it to more customers. If it for some reason isn’t doing well — for example, if the creative is underperforming — the real-time “bandit” approach reduces its visibility.
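One common way to implement the kind of feedback loop described in the second scenario is an epsilon-greedy bandit: mostly show the offer with the best observed click-through rate, but keep exploring the others. The sketch below is a toy illustration; eBay's production models are considerably more sophisticated.

// Epsilon-greedy selection over offers, illustrating the feedback loop above.
// Production systems use far richer models; this is a toy sketch.
function pickOffer(offers, epsilon = 0.1) {
    if (Math.random() < epsilon) {
        // Explore: occasionally show a random offer to keep learning.
        return offers[Math.floor(Math.random() * offers.length)];
    }
    // Exploit: otherwise show the offer with the best observed CTR.
    return offers.reduce((best, offer) =>
        (offer.clicks / Math.max(offer.impressions, 1)) >
        (best.clicks / Math.max(best.impressions, 1)) ? offer : best);
}

function recordImpression(offer, clicked) {
    offer.impressions += 1;
    if (clicked) offer.clicks += 1;
}

const offers = [
    {id: 'spring-fashion-deal', impressions: 1000, clicks: 42},
    {id: 'tech-tuesday-deal',   impressions: 1000, clicks: 87}
];
const chosen = pickOffer(offers);
recordImpression(chosen, /* clicked */ true);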

Both of these scenarios required us to have real-time CRM and engagement streams. That is, we needed to know when a customer opens an email or clicks on it, and based on this knowledge, instantaneously adjust our recommendations for other customers. This of course is miles away from a typical multi-step batch system based on ETL pipelines that most retailers have today. We were no different — we had to reengineer our delivery and data collection pipes to be real-time. The payoff, however, is quite powerful — this real-time capability is foundational to our ability to personalize better: both today and in the years to come.

The resulting solution transformed email marketing at eBay: instead of hundreds of small, uncoordinated campaigns each month, we now have a small set of “flagship” campaigns, each of which comprises one or more offers. Each offer is selected at the open time of the email, and the selection is made by a machine-learned model that uses real-time data. As a result, we saw significant growth in both engagement and sales driven by our emails.

You’ll notice that this component-level personalization approach is all about treating email content as an ad canvas. The problem is fundamentally similar: once you’ve captured the customer’s attention — be it via a winning bid in an ad auction, or by having that customer open your email — you need to find the most relevant offer to show. Each email slot can be thought of as a first-party ad slot. This realization allowed us to unify our approaches between display advertising and email: the same stack now powers both.

We extended this approach to scenarios like paid social campaigns, where Facebook would want to retrieve the offer from us a priori to manage their customer experience. We built a real-time push framework, where, whenever we find a deal that is better than what we previously apprised Facebook of, we immediately push that offer to Facebook.

This creates a powerful cross-channel multiplier: if we happen to see the customer on the display channel, the same ad-serving pipeline is engaged — and our flagship deal-finding campaign can be served to that customer, too. This means that evolving our flagship campaigns — adding more sophisticated machine learning, improving our creatives — contributes to all channels that are powered by this pipeline, not just email.

Orchestration across channels too becomes possible: we can choose to send an email to a customer with a relevant offer; if they don’t open it, we can then target them with a display ad, and an onsite banner; then, after showing the offer a set number of times across all channels, we can choose to stop the ad — implementing an effective cross-channel impression cap. And each condition for state transitions in this flow can itself be powered by a machine-learning model.

eBay’s scale creates an admirable engineering challenge for a true CRM. By putting our customers, and their behavioral signals, at the top of our priority list, we were able to create an asset in our CRM platform that positions us well in this journey towards 1:1 personalization. The single real-time, event-driven pipeline we’ve built allows coordinated, up-to-the-minute offers to be served — wherever we happen to see the customer.

Alex Weinstein (@alexweinstein) is the Director of Marketing Technologies and CRM at eBay and the author of the Technology + Entrepreneurship blog, where he explores data-driven decision making in the face of uncertainty. Prior to eBay, Alex was the head of product development at Wetpaint, a personalization tech startup.

Graphic: Rahul Rodriguez

An Approach to Achieve Scalability and Availability of Data Stores http://www.ebaytechblog.com/2017/03/23/an-approach-to-achieve-scalability-and-availability-of-data-stores/ http://www.ebaytechblog.com/2017/03/23/an-approach-to-achieve-scalability-and-availability-of-data-stores/#respond Thu, 23 Mar 2017 19:00:58 +0000 http://www.ebaytechblog.com/?p=6914
 

Today there is an explosion of the web, specifically in social networks and users of ecommerce applications, and a corresponding explosion in the sheer volume of data we must deal with. The web has become so ubiquitous that it is used by everyone, from the scientists of the 1990s, who used it for exchanging scientific documents, to five-year-olds today exchanging emoticons about kittens. Hence the need for scalability, which is the potential of a system, network, or process to be enlarged in order to accommodate that data growth. The web has virtually brought the world closer, which means there is no such thing as “down time” anymore: business hours are 24/7, with buyers shopping in disparate time zones, so a necessity for high availability of the data stores arises. This blog post provides a course of action for achieving scalability and availability of data stores.

This article covers the following methods for providing scalable and highly available data stores for applications.

  • Scalability: a distributed system with self-service scaling capability
    • Data capacity analysis
    • Review of data access patterns
    • Different techniques for sharding
    • Self-service scaling capability
  • Availability: physical deployment, rigorous operational procedures, and application resiliency
    • Multiple data center deployment
    • Self-healing tools
    • Well-defined DR tiering, RTO, RPO, and SOPs
    • Application resiliency for data stores

Scalability

With the advent of the web, especially Web 2.0 sites where millions of users may both read and write data, the scalability of simple database operations has become more important. There are two ways to scale a system: vertically and horizontally. This post focuses on horizontal scalability, where both the data and the load of simple operations are distributed/sharded over many servers, and the servers do not share RAM, CPU, or disk. Although in some implementations disk and storage can be shared, auto scaling can become a challenge in such cases.

diagram abstractly illustrating scalability measures. Image by freeimageslive.co.uk – freebie.photography

The following measures should be considered as mandatory methods in building a scalable data store.

  • Data capacity analysis: It is a very important task to understand the extreme requirements of the application in terms of peak and average transactions per second, peak number of queries, payload size, expected throughput, and backup requirements. This enables the data store scalability design in terms of how many physical servers are needed and hardware configuration of the data store with respect to memory footprint, disk size, CPU Cores, I/O throughput, and other resources.

  • Review data access patterns: The simplest course to scale an application is to start by looking for access patterns. Given the nature of distributed systems, all real-time queries to the data store must include the access key to avoid a scatter-gather problem across different servers. Data must be aligned by the access key in each of the shards of the distributed data store. In many applications, there can be more than one access key. For example, in an ecommerce application, data retrieval can be by Product ID or by User ID. In such cases, the options are to either store the data redundantly, aligned by both keys, or store the data with a reference key, depending upon the application’s requirements.

  • Different techniques for sharding: There are different ways to shard the data in a distributed data store. Two of the common mechanisms are function-based sharding and lookup-based sharding. Function-based sharding refers to a sharding scheme where a deterministic function is applied to the key to get the value of the shard. In this case, the shard key should exist in each entity stored in the distributed data store, for efficient retrieval. In addition, if the shard key is not random, it can cause hot spots in the system. Lookup-based sharding refers to a lookup table that stores the start range and end range of the key; clients can cache the lookup table to avoid a single point of failure. Many NoSQL databases implement one of these techniques to achieve scalability. A small sketch of both techniques follows this list.

  • Self-service scaling capability: Self-service scaling, or auto-scaling, can work as a jewel in the scalable system crown. Data stores are designed and architected to provide enough capacity to scale up front, but rapid elasticity and cloud services can enable vertical and horizontal scaling in the true sense. Self-service vertical scaling enables the addition of resources to an existing node to increase its capacity, while self-service horizontal scaling enables the addition or removal of nodes in the distributed data store via “scale-up” or “scale-down” functionality.
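As mentioned above, here is a small sketch of the two sharding techniques. The hash function and lookup table are illustrative; a production system must also handle resharding, replication, and failure cases.

// Function-based sharding: a deterministic hash of the key picks the shard.
const crypto = require('crypto');

function shardByFunction(key, shardCount) {
    const digest = crypto.createHash('md5').update(String(key)).digest();
    return digest.readUInt32BE(0) % shardCount;   // deterministic, roughly uniform
}

// Lookup-based sharding: key ranges map to shards via a (cacheable) lookup table.
const lookupTable = [
    {start: 0,       end: 999999,   shard: 'shard-a'},
    {start: 1000000, end: 1999999,  shard: 'shard-b'},
    {start: 2000000, end: Infinity, shard: 'shard-c'}
];

function shardByLookup(key) {
    const entry = lookupTable.find(r => key >= r.start && key <= r.end);
    return entry ? entry.shard : null;
}

console.log(shardByFunction('user-42', 8));
console.log(shardByLookup(1500000));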

Availability

Data stores need to be highly available for read and write operations. Availability refers to a system or component that is continuously operational for a desirably long length of time. Below are some of the methods to ensure that the right architectural patterns, physical deployment, and rigorous operational procedures are in place for a highly available data store.

diagram of the four availability methods discussed in this blog post

  • Multiple data center deployment: Distributed data stores must be deployed in different data centers with redundant replicas for disaster recovery. The geographical locations of data centers should be chosen cautiously to avoid network latency across the nodes. The ideal way is to deploy primary nodes equally amongst the data centers, along with local and remote replicas in each data center. Distributed data stores inherently reduce the downtime footprint by the sharding factor. In addition, equal distribution of nodes across data centers causes only 1/nth of the data to be unavailable in case of a complete data center shutdown.

  • Self-healing tools: Efficient monitoring and self-healing tools must be in place to monitor the heartbeat of the nodes in the distributed data store. In case of failures, these tools should not only monitor but also provide a way to bring the failed component alive or should provide a mechanism to bring its most recent replica up as the next primary. This self-healing mechanism should be cautiously used per the application’s requirements. Some high-write-intensive applications cannot afford inconsistent data, which can change the role of self-healing tools to monitor and alert the application for on-demand healing, instead.

  • Well-defined DR tiering, RTO, RPO, and SOPs: Rigorous operational procedures can bring the availability numbers (ratio of the expected value of the uptime of a system to the aggregate of the expected values of up and down time) to a higher value. Disaster recovery tiers must be well defined for any large-scale enterprise, with an associated expected downtime for the corresponding tiers. The Recovery Time Objective (RTO) and Recovery Point Objective (RPO) should be well tested in a simulated production environment to provide a predicted loss in availability, if any. Well-written SOPs are proven saviors in a crisis, especially in a large enterprise, where Operations can implement SOPs to recover the system as early as possible.

  • Application resiliency for data stores: Hardware fails, but systems must not die. Application resiliency is the ability of an application to react to problems in one of its components and still provide the best possible service. There are multiple approaches an application can use to achieve high availability for read and write database operations. Application resiliency for reads enables the application to read from a replica in the case of primary failure. Resiliency can also be part of a distributed data store feature, as in many of the NoSQL databases. When there is no data affinity of the newly inserted data with the existing data, a round-robin insertion approach can be taken, where new inserts can write to a node other than the primary when the primary is unavailable. Conversely, when there is data affinity of the newly inserted data with the existing data, the approach is primarily driven by the consistency requirements of the application.

The key takeaway is that in order to build a scalable and highly available data store, one must take a systematic approach to implementing the methods described in this post. This list of methods is essential but not exhaustive, and more methods can be added to it as needed. Plan to grow BIG and aim to be 24/7 UP, and with the proper scalability and availability measures in place, the sky is the limit.

References

Image by freeimageslive.co.uk – freebie.photography

Rheos http://www.ebaytechblog.com/2017/03/14/rheos/ http://www.ebaytechblog.com/2017/03/14/rheos/#comments Tue, 14 Mar 2017 13:20:35 +0000 http://www.ebaytechblog.com/?p=6845
 

Data IS the next currency.  The increased demand for real-time data across almost every business and technology platform has changed the world we live in.  It is no different at eBay.

About two years ago, I was thrilled when I was asked to lead a development team to build a real-time data platform at eBay using Kafka. Initially, it was just for our Oracle change stream. In late 2015, we decided to expand it to a fully managed, secure, and easy-to-use real-time data platform, known as Rheos. The goal of Rheos is to provide a near real-time buyer experience, seller insights, and a data-driven commerce business at eBay.

While Kafka has given us core capabilities in stream processing, managing large, distributed, highly available, real-time data pipelines running in the cloud across security zones and data centers is hard without automation and core services. Hence, Rheos was built to provide the necessary life-cycle management, monitoring, and well-architected standards and ecosystem for real-time streaming data pipelines. Currently, the pipelines consist of Kafka, Storm, and stream processing applications. Both shared and non-shared data streams can run on these pipelines.

By the end of 2016, nearly 100 billion messages flowed through the pipelines in Rheos daily. In 2017, Rheos is expected to handle 15 times the current traffic.

So, how did we get there?

Concepts

At a very high level, Rheos has these concepts:

  • Data taxonomy is a well-defined convention that classifies and catalogs events into proper namespaces for organization, ease of discovery, and management purposes.
  • Category is a top-level component in a namespace for a given stream type, for example, monitoring events, click stream events, business events, and so on.
  • Stream captures the logical data flow that leads to a consumable data point in Kafka. The data flow may cut across one or more data points and stream processing units.
  • Domain represents a shard or a group of related topics for a given stream type. Topics in the group are subject to a set of control parameters such as max partitions, max replica, max data retention period, max topic count, and service level agreement, just as examples.
  • Namespace is used to classify the different data streams in Rheos. A namespace is composed of category, stream, and domain

Automation

Lifecycle Management Service

Lifecycle Management Service is a cloud service that provisions and provides full lifecycle management (LCM) for Zookeeper, Kafka, Storm, and MirrorMaker clusters. It is built on a modular architecture with pluggable extensions and frameworks. This combination allows it to create and perform LCM on a stream pipeline running on any cloud platform (such as OpenStack, AWS, or Google Cloud). The Lifecycle Management Service allows you to provision a cluster, flex it up or down, or replace a bad node in it. In addition to its CLI, it is equipped with a RESTful API that allows the Rheos Management Service (see Core Service below) to perform simple operations on a guest instance. For example, the management service can do a rolling start on a troubled Kafka cluster via the Lifecycle Manager API.

The Lifecycle Management Service’s architectural building blocks consist of these components:

  • API Server (REST and CLI) — a thin layer that parses, validates, and forwards requests to Task Manager
  • Task Manager (RPC) — a stateful service that creates and executes orchestration workflows on a cluster of nodes
  • Conductor — a component that is responsible for receiving heartbeat information from the guest instances
  • Guest Agent — A lightweight agent that runs on the guest instance; responsible for executing a command from the Task Manager on the instance as well as sending heartbeat metrics to the Conductor
  • Message Queue — a scoped, controlled, and secured way for the communication between the API Server, Task Manager, Conductor and the Guest Agent

The pluggable extension includes these functions:

  • Workflow
  • Monitoring and metrics emitter and aggregator
  • Authentication and authorization
  • Configuration management
  • IaaS (the underlying compute, storage, resource management, etc.)

Core Service

Rheos core service consists of the following components: Kafka Proxy Server, Schema Registry Service, Stream Metadata Service, and Management Service. The following picture captures how these components interact with each other.

Rheos Kafka Proxy Server

One of Rheos’ key objectives is to provide a single point of access to the data streams for producers and consumers without hard-coding the actual broker names. This allows any open-source Kafka connectors, frameworks, and Kafka clients written in any programming language to seamlessly produce or consume in Rheos.

To do this, we created a Rheos Kafka Proxy Server that handles the Kafka TCP protocol so that the Proxy Server can intercept any initial connection requests from clients. Upon receiving an initial connection request, the Proxy Server identifies which Kafka cluster the topic resides on via the Rheos Metadata Service (described below). Then the actual broker CNAMEs are returned to the client so that the client can complete the final connection handshake with the brokers.

In addition, Rheos Kafka Proxy Server also allows operations to easily replace a bad node or move a topic from one Kafka cluster to another with very little to no impact to the clients.
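From a client's point of view, the bootstrap address is then simply the proxy rather than a broker list. Below is a hedged sketch using the open-source kafkajs client; the proxy hostname and topic name are hypothetical, and any standard Kafka client would work the same way.

// Connecting through a proxy address with kafkajs (hostname and topic are hypothetical).
const { Kafka } = require('kafkajs');

const kafka = new Kafka({
    clientId: 'rheos-example-producer',
    brokers: ['rheos-proxy.example.com:9092']   // proxy address, not actual broker names
});

async function publish() {
    const producer = kafka.producer();
    await producer.connect();                    // the proxy redirects to the right cluster
    await producer.send({
        topic: 'business-events.checkout.orders',
        messages: [{ value: JSON.stringify({orderId: 123, state: 'CREATED'}) }]
    });
    await producer.disconnect();
}

publish().catch(console.error);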

Schema Registry Service

To promote data hygiene in Rheos and ease of use for both stream producers and consumers, each event in Rheos must be identifiable with an Avro schema. Rheos has built a Schema Registry Service based on the confluent.io Schema Registry. This service hosts data format definitions and provides schema versioning and serialization information for each event type. In addition, Rheos users can view, insert, and update the schemas in the registry.
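For illustration, an Avro schema registered for a hypothetical checkout event might look like the following; the record name and fields are invented.

// Hypothetical Avro schema for a checkout event, as it might be registered.
const checkoutEventSchema = {
    type: 'record',
    name: 'CheckoutEvent',
    namespace: 'com.ebay.rheos.example',
    fields: [
        {name: 'eventId',   type: 'string'},
        {name: 'orderId',   type: 'long'},
        {name: 'state',     type: {type: 'enum', name: 'OrderState', symbols: ['CREATED', 'PAID', 'SHIPPED']}},
        {name: 'eventTime', type: {type: 'long', logicalType: 'timestamp-millis'}}
    ]
};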

Rheos Metadata Service

The Stream Metadata Service provides a system of record for each stream and the associated producer and consumer(s) that are known to the system. Prior to producing to or consuming from a stream, one must “register” the Kafka topic along with the associated schema, stream producer, and consumer with the Metadata Service. With this, Kafka topics and broker lists, along with the associated schemas, can easily be discovered or browsed via the Rheos REST API or Portal. More importantly, there is no hard-coding of broker names in the client code! In addition, the Metadata Service also makes it possible for our Management Service and Health Check System to seamlessly monitor, alert, and perform life cycle management operations on streams and the infrastructure that the streams run on.

The recorded information includes the following items:

  • The physical (cluster) location of a topic or a stream processing job/topology
  • Data durability, retention policy, partition, producer, and consumer information
  • Source and target data mirroring information
  • Default configuration for Zookeeper, Kafka, and Storm
  • Topic schema information
  • And more

Management Service

Rheos performs stream, producer, and consumer life cycle management operations with a set of predefined Standard Operating Procedures (SOPs) in the Management Service. Each SOP has a series of steps that can be performed on a guest instance via the Lifecycle Management Service. For example, Operations can initiate a rolling restart of a Kafka cluster using one of the SOPs.

Health Check System

This service monitors the health of each asset (for example, a Kafka, Zookeeper, or MirrorMaker node) that is provisioned through the Lifecycle Management Service in these aspects:

  • Node state (up or down)
  • Cluster health
  • Producer traffic, consumer lags, or data loss

It periodically samples data from Kafka topics and performs consumer lag checks and end-to-end latency checks via the Management Service. Upon anomaly or error detection, the service generates an alert via email and/or to eBay Operations. In addition, the Health Check Service records a consumer’s current offset with a timestamp in the primary and the secondary Kafka clusters.

Producer traffic

Producer traffic is closely monitored and can be viewed on the Rheos Portal. To provide a quick visual of a producer’s traffic trend or pattern, the current traffic volume of a stream domain (that is, a topic group with selected or all partitions) is overlaid on top of yesterday’s traffic pattern. This way, one can quickly detect whether there is an anomaly in the current traffic.

End-to-end latency

A question everyone wants to ask is about the end-to-end data latency and consumer lag in a stream pipeline. The Rheos Health Check System provides a stream domain’s end-to-end latency by measuring two periods of time:

  • From when an event is published to Kafka to the time when the event is consumed by a consumer
  • From when an event is published to Kafka to the time when the broker writes to disk
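Both measurements reduce to simple timestamp arithmetic once the publish, broker-write, and consume times of a sampled event are known; a minimal sketch:

// End-to-end latency from the three timestamps described above (all in milliseconds).
function latencies(sample) {
    return {
        publishToConsumeMs: sample.consumedAt - sample.publishedAt,   // producer -> consumer
        publishToDiskMs:    sample.brokerWriteAt - sample.publishedAt // producer -> broker write
    };
}

console.log(latencies({publishedAt: 1489500000000, brokerWriteAt: 1489500000120, consumedAt: 1489500000450}));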

Stream consistency check

To quickly remediate a problem in a stream, the Health Check System proactively monitors the set of in-sync replicas (ISR) for a given topic in a stream. In addition, it also ensures that the stream the topic flows through is consistent across one or more Kafka clusters.

Node status

Last but not least, our Health Check System also monitors the state of each node in Rheos. At a high level, it provides a quick overview of the cluster health by checking these conditions:

  • Whether a node is reachable or not
  • Whether the primary workload (broker, Zookeeper, etc.) is running on a reachable node
  • Whether a randomly selected node in a cluster can properly fulfil a request or not

Rheos Mirroring Service

In addition to Kafka’s cluster replication, Rheos Mirroring Service provides high data availability and integrity by mirroring data from source cluster to one or more target clusters. Built around Kafka’s MirrorMaker, the service is used to set up MirrorMaker instances and mirror a group of topics from one cluster to another via a REST API. Through the API, one can start and stop the mirroring of a topic group.

Rheos Mirroring Service consists of these key components:

  • Asset Agent is co-located on a mirroring compute node and responsible for reporting heartbeat metrics to a State Store.
  • Mirror Manager is a REST service that starts and stops the mirroring of a topic group. It is equipped with the intelligence to properly distribute the MirrorMaker instances across the cluster based on a distribution strategy.
  • Configurator is an Ansible playbook that resides on each MirrorMaker node. It is responsible for these functions:
    • Creating the required Kafka producer/consumer properties for a topic group
    • Creating the required directory structure for the instance along with the supervisor configuration
    • Starting or stopping the MirrorMaker instance based on the given source to target mirroring configuration
  • Mirror Bootstrap is a thin Java wrapper that registers and deregisters the MirrorMaker instance in the State Store prior to interacting with the underlying Mirror Maker instance. This allows us to capture the physical and the logical data mirroring activities.

Using the Mirroring Service to achieve high availability

As shown below, data can be mirrored from one region or availability zone to one or more other regions or availability zones for high availability. To do that, MirrorMaker instances are set up in the target locations to consume data from a source cluster and subsequently publish to the target clusters.

Using the Mirroring Service to move data across security zones

In addition, Data Mirroring is used to provide data movement from one security zone to another. As shown below, MirrorMaker instances are set up in the target security zone to consume data from the source security zone over a TLS connection and subsequently publish the received data to the target clusters.

How to access Kafka securely?

To acquire a broker connection, a Rheos client must be authenticated by the eBay Identity Service via the Kafka SASL mechanism. Upon authentication, the client is then further authorized through Kafka’s default pluggable Authorizer via Zookeeper.

In some cases, such as moving data across security zones, TLS is also enabled at the connection level.

Conclusion

Rheos has opened a new chapter in many aspects at eBay.  With Rheos, eBay data can now be securely extracted and moved from a data store, application, or other source to one or more locations in real time.  Stream processing has opened up new possibilities for eBay’s businesses: fraud detection, monitoring, analytics, and more.

Coding Kata Month http://www.ebaytechblog.com/2017/03/06/coding-kata-month/ http://www.ebaytechblog.com/2017/03/06/coding-kata-month/#comments Mon, 06 Mar 2017 16:00:37 +0000 http://www.ebaytechblog.com/?p=6795
 

I’m very lucky to be working at eBay with some of the most talented people I know. More fortunate still perhaps that they indulge me in my regular experiments in making our department a better place to work. I’ve been thinking about what I perceive as deficiencies in coding kata and talking to one of my colleagues at EPD (European Product Development) about this since maybe midyear 2016. As we talked about the similarities and differences in martial arts and coding kata, we began to explore what we might do in order to shift the needle on current coding kata practice.

To that end, we kicked off ‘Kata Month’ in December. It was very much an exploratory exercise to see what would happen if we solved the same kata every day for a month. Rather than do a kata until it was ‘solved’, what if we practiced it daily and with a view to deliberately practicing elements of coding? Truth be told, it very nearly did not happen, and I owe thanks to my manager Paul Hammond, who pushed me to kick off the exercise despite my not being completely prepared. My tendency is to over-engineer, and given the various pressures of our day-to-day work, I’d likely have delayed until January or February to try to have everything as I wanted it. As it turned out, we had enough in place, and so with pretty much zero notice, I sent out the following email in week 1 of December:

Hi all,

For the next four weeks in the London office, we’ll be holding Coding Kata Month. Each day between 11 – 12, you’ll have one hour in which to participate. (Instructions below for week 1)

In martial arts, constant, deliberate practice of fundamentals is key to attaining mastery. In Kendo, there are 10 kata (interestingly, they are done in pairs) — effectively 20 movements to learn. When I first started kendo, the kata were the ‘boring’ bits that I had to do in order to do the fun stuff (beating someone with a stick). The more I did them though, the more I realised there was a richness in them that I hadn’t seen (or had wilfully ignored). Yes, the movements are choreographed, but an understanding of the fundamentals ingrained in them is crucial. There is correctness of physical form, but also distance, timing, and things that are more difficult to perceive without practice — reading your opponent, their movement, their breathing, gauging their readiness.

Deliberate practice to improve these fundamentals is key. The same is true for any skill, be it a musical instrument, carpentry, ballet, or programming. For the next month, we're going to delve into deliberate practice for programming through kata.
Monday to Thursday are kata days (implementation).
Friday will be for code review and debrief: an opportunity for people to talk about what they learned.

Instructions:
Each day, from 11:00 to 12:00 sharp:
Complete the Harry Potter coding kata within the constraints set for that day/week.

  • Each time you begin, start from scratch.
    1. Go to our GitHub kata repository.
    2. Create a new repo named day1-<my initials>[-<my pair's initials>].
    3. Clone your new repo.
    4. Open your IDE of choice and create a new project in your new repo.
    5. Code…
  • Commit after each Red/Green/Refactor cycle (see the sketch below for what one cycle might look like).
  • At the conclusion of the kata:
    • Include a text file listing the participants.
    • Record any thoughts you think are relevant: learnings, assumptions, gripes, ideas, notes for next time, etc.
    • Commit the above notes along with your code.

Week 1 – Individual Practice
Monday to Thursday: code solutions.
Choose your language — you will be sticking with this language for a while, so choose carefully!
Repeat the kata each day.
Use the same language, same IDE.
Friday: code review (group).
On Friday, we’ll get together as a group and talk about what we learned and look at some different examples of your solutions.

Weeks 2–4 will change things up a little. Here’s a taste of what is to come:
Week 2 — Pairing
Week 3 — Design variation and mobbing
Week 4 — Open
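
For anyone who hasn't met the Harry Potter kata, here is a minimal sketch of what the very first Red/Green cycle might produce, assuming the commonly published formulation of the kata (a single book costs 8, with discounts only for sets of distinct titles). The class, method, and checks below are purely illustrative; they are not taken from any of our teams' repositories.

    // PotterPricerSketch.java -- an illustrative first Red/Green cycle, not production code.
    // Run with assertions enabled, e.g. `java -ea PotterPricerSketch.java` (Java 11+).
    import java.util.List;

    public class PotterPricerSketch {

        // "Green": the simplest implementation that satisfies the expectations below.
        // Later cycles would generalise this to apply discounts to sets of distinct titles.
        static double price(List<Integer> books) {
            return books.size() * 8.0;
        }

        public static void main(String[] args) {
            // "Red": these expectations are written first and fail until price() exists.
            assert price(List.of()) == 0.0 : "an empty basket should cost 0";
            assert price(List.of(1)) == 8.0 : "a single book should cost 8";
            assert price(List.of(1, 1)) == 16.0 : "two copies of the same book get no discount";
            System.out.println("First cycle is green: commit, then refactor.");
        }
    }

The point, of course, is not this particular solution but the rhythm: write a failing expectation, make it pass with the simplest thing that works, refactor, commit, and repeat.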

Honestly, I was a little taken aback at how enthusiastically the initiative was picked up by the teams. I figured they might get a kick out of it, but they grabbed the idea and ran with it. They talked about it over lunch, they talked about it across teams. After a long and challenging year, it was great to see the crew jumping in with so much energy.

I dived in with equal enthusiasm. I'd not coded in anger in well over a year, and I was painfully rusty. On day 1, I realised how much I'd forgotten about TDD and wrote an embarrassingly small amount of code. On day 2, I sort of hit my groove and worked out where I wanted to go with a solution. On day 3, I'd nailed a working solution to the problem, and by day 4, I knocked it out in about 20 minutes and started looking at how to evolve the data structures I'd chosen to make my solution extensible. I was feeling pretty good about myself.

I sat down to pair with one of our programmers in week 2. At the end of the first session, I had the humbling experience of seeing just how much I still had to learn about TDD (not to mention intentional programming and various design patterns). The session also made me realise just how rich this area of kata could be. Having an interesting problem to solve is one thing, but putting together a repeatable solution that incorporates contextually appropriate use of both fundamental and advanced programming skills has far more potential.

I won't give you a detailed rundown of the entire month; suffice it to say that some interesting things came out of it, some code-related, some not.

For example, we stipulated one hour for kata between 11:00 and 12:00 (just before most people go to lunch). After a couple of weeks, the consensus was that this was quite disruptive to the day overall. The teams had standup in the morning, then a small amount of time to work before kata started, then lunch, and then the afternoon. Productivity-wise, there was a general feeling that half the day was gone before any project work got done. For future iterations of kata month, we'll kick off the day with kata. If nothing else, that way folks start the day writing code, something you don't always get to do despite the best of intentions.

Another interesting thing that came out of our Friday review sessions was that some people were bored after 'solving' the kata. This was exactly what I wanted to address: kata are not a thing to be 'solved', but a way to practice fundamentals. To some extent this was helped by the variety from week to week (individual, pairing, mobbing, etc.), but we also discussed using the time to work on weak points, to take a different approach to the problem, or to make more effective use of the IDE to do some of the heavy lifting. In hindsight, this might have gone differently if I'd spent more time setting the scene at the beginning, explaining how kata work in martial arts and what I was expecting. It also reinforced for me the importance of having a repeatable solution in place. A repeatable solution takes the 'solving' part out of the equation and lets you focus on the practice of implementing a solution (more on that in a future post).

At the end of the month, I ran a retro and put out a survey to the participants. I’d like to share some of the responses.

What were your major take-aways from Kata Month?

(Image: survey responses to "What were your major take-aways from Kata Month?")

What changes would you like to see for the next time we run this exercise?

(Image: survey responses to "What changes would you like to see for the next time we run this exercise?")

It was interesting to see the various viewpoints of the people who participated, what their preconceptions and assumptions were, and how they changed over time. As for our Friday sessions, they were quite unstructured, and in hindsight we could have made a lot more of them. We looked through some code, but with the exception of week 3, where we did an impromptu mobbing session, we didn't really demo any writing of code. Given my views on kata as a visual teaching and learning aid, that feels like a missed opportunity.

Setting expectations early on was also a recurring theme. I think there is a place for some amount of ritual to mark the mental shift required for working on kata. It need not be elaborate, just something that puts the practitioner in the mindset of deliberate practice. In that way, the goal is clear: execute the kata in order to practice your fundamentals.

We also talked about the fact that this was a 'greenfields' kata and that it might be useful to try a similar kata built around refactoring existing code with various kinds of issues. There are refactoring kata out there, but I quite like the idea of kata that exist in pairs to exercise similar principles in both greenfields and brownfields situations, possibly even kata whose solution works for one situation but needs refactoring to suit another. There are subtly different skills involved in selecting a particular design pattern to implement a solution versus recognising when existing code should be refactored to use that pattern.

Since kata month finished, I've put together a small working group of interested folks with the aim of designing some kata of our own. We're working on that now: coming up with a problem and a solution that are representative of the skills required of an EPD programmer. My intention, once we have something that works for us, is to share it with the wider world. In the meantime, there is no shortage of kata 'problems' out there, but very few of them are accompanied by a solution. About the only one that springs to mind is Bob Martin's Bowling Kata. I think there is certainly scope for other existing kata to have repeatable solutions designed for them: not simply 'solved', but given solutions deliberately designed to exercise fundamentals and good design principles in context.
