Enhancing the User Experience of the Hadoop Ecosystem


At eBay, we have multiple large, multi-tenant clusters. Each of these clusters stores hundreds of petabytes of data. These clusters offer tens of thousands of cores to run computations on the data. We have thousands of internal users who use Hadoop in their roles, including data analysts, data scientists, engineers, and product managers. They use multiple technologies, such as MapReduce, Hive, and Spark, to process data. There are thousands of applications that push and pull data from Hadoop and run computations.

Figure 1: Hadoop clusters, auxiliary components, users, applications, and services


The users normally interact with the cluster via the command line by SSHing to specialized gateway machines that reside in the same network zone as the cluster. To transfer job files and scripts, the users need to SCP over multiple hops.

Figure 2: Old way of accessing a Hadoop cluster

The need to traverse multiple hops as well as the command-line-only usage was a major hindrance to the productivity of our data users.

On the other side, our website applications and services need to access data and perform computations. These applications and services reside in a different network zone and hence need network rules set up to access various services like HDFS, YARN, and Oozie. Since our clusters are secured with Kerberos, the applications also need to authenticate to the Hadoop services using Kerberos. This placed an extra burden on our application developers.

In this post, I will share the work in progress to facilitate access to our Hadoop clusters for data and compute resources by users and applications.


We need better ways to achieve the following goals:

  • Our engineers and other users need to use multiple clusters and related components.
  • Data Analysts and other users need to run interactive queries and create shareable reports.
  • Developers need to be able to develop applications and services without spending time on connectivity problems or Kerberos authentication.
  • We can afford no compromise on security.


To improve user experience and productivity, we added three open-source components:

Hue — to perform operations on Hadoop and related components.

Apache Zeppelin — to develop interactive notebooks with queries, programs, and reports.

Apache Knox — to serve as a single point for applications to access HDFS, Oozie, and other Hadoop services.

Figure 3: Enhanced user experience with Hue, Zeppelin, and Knox

We will describe each product, the main use cases, a list of our customizations, and the architecture.


Hue is a web-based user interface to the Hadoop ecosystem. It provides interfaces to several components, including HDFS, Oozie, Resource Manager, Hive, and HBase. It is a 100% open-source product, actively supported by Cloudera, and hosted on the Hue GitHub site.

Similar products

Apache Airflow allows users to specify workflows in Python. Since we did not want to impose a Python learning curve on our users, we chose Hue instead of Airflow. We may still find Airflow compelling enough to deploy in the future for users who prefer it.

Use cases of Hue

Hue allows a user to work with multiple components of the Hadoop ecosystem. A few common use cases are listed below:

  • To browse, manage, upload, and download HDFS files and directories
  • To specify workflows comprising MapReduce, Hive, Pig, Spark, Java, and shell actions
  • To schedule workflows and track SLAs
  • To manage Hive metadata, run Hive queries, and share the queries with other users
  • To manage HBase metadata and interact with HBase
  • To view YARN applications and terminate applications if needed


Two-factor authentication — To ensure that the same security level is maintained as that of command-line access, we needed to integrate our custom SAML-based two-factor authentication into Hue. Hue supports pluggable authentication mechanisms, which allowed us to plug in our two-factor authentication.

Ability to impersonate other users — At eBay, users sometimes operate on behalf of a team account. We added a capability in Hue so that users can impersonate another account as long as they are authorized to do so. The authorization is controlled by LDAP group memberships. Users can switch between accounts at any time.

Working with multiple clusters — Since we have multiple clusters, we wanted a single Hue instance to serve multiple Hadoop clusters and components. This enhancement required changes in the HDFS File Browser, Job Browser, Hive Metastore Managers, Hive query editors, and workflow submissions.


Figure 4: Hue architecture at eBay


A lot of our users, especially data scientists, want to run interactive queries on the data stored on Hadoop clusters. They run one query, check its results, and, based on the results, form the next query. Big data frameworks like Spark, Presto, Kylin, and to some extent, HiveServer2 provide this kind of interactive query support.

Apache Zeppelin (GitHub repo) is a user interface that integrates well with products like Spark, Presto, Kylin, among others. In addition, Zeppelin provides an interface where users can develop data notebooks. The notebooks can express data processing logic in SQL, Scala, Python, or R. Zeppelin also supports data visualization in notebooks in the form of tables and charts.

Zeppelin is an Apache project and is 100% open source.

Use cases

Zeppelin allows a user to develop visually appealing interactive notebooks using multiple components of the Hadoop ecosystem. A few common use cases are listed below:

  • Run a quick Select statement on a Hive table using Presto.
  • Develop a report based on a dataset by reading files from HDFS and persisting them in memory as Spark data frames.
  • Create an interactive dashboard that allows users to search through a specific set of log files with custom format and schema.
  • Inspect the schema of a Hive table.


Two-factor authentication — To maintain security parity with command-line access, we plugged our custom two-factor authentication mechanism into Zeppelin. Zeppelin uses Shiro for security, and Shiro allows a custom authentication mechanism to be plugged in, albeit with some difficulty.

Support for multiple clusters — We have multiple clusters and multiple instances of components like Hive. To support multiple instances in one Zeppelin server, we created different interpreters for different clusters or server instances.

Capability to override interpreter settings at the user level — Some of the interpreter settings, such as job queues and memory values, among others, need to be customized by users for their specific use cases. To support this, we added a feature in Zeppelin so that users can override certain interpreter settings by setting properties. This is described in detail in the Apache JIRA ticket ZEPPELIN-1625.


Figure 5: Zeppelin Architecture at eBay


Apache Knox (GitHub repo) is an HTTP reverse proxy, and it provides a single endpoint for applications to invoke Hadoop operations. It supports multiple clusters and multiple components like WebHDFS, Oozie, WebHCat, etc. It can also support multiple authentication mechanisms, so we can hook up custom authentication along with Kerberos authentication.

It is an Apache top-level project and is 100% open source.

Use cases

Knox allows an application to talk to multiple Hadoop clusters and related components through a single entry point using any application-friendly non-Kerberos authentication mechanism. A few common use cases are listed below:

  • To authenticate using an application token and put/get files to/from HDFS on a specific cluster
  • To authenticate using an application token and trigger an Oozie job
  • To run a Hive script using WebHCat
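From an application's point of view, the change is mostly a base URL: Hadoop REST calls go to the Knox gateway instead of to each cluster service. A minimal sketch (the gateway host, port, and topology name below are hypothetical, not our actual endpoints):

```python
import urllib.parse

# Hypothetical Knox gateway endpoint; the host, port, and topology name
# ("default") vary per deployment.
KNOX_BASE = "https://knox.example.com:8443/gateway/default"

def webhdfs_url(path, op, **params):
    """Build a WebHDFS URL that is routed through the Knox gateway
    instead of hitting the NameNode directly."""
    query = urllib.parse.urlencode({"op": op, **params})
    return "%s/webhdfs/v1%s?%s" % (KNOX_BASE, path, query)

# e.g. with the `requests` library an application would call:
#   requests.get(webhdfs_url("/tmp", "LISTSTATUS"), auth=(app_id, app_token))
print(webhdfs_url("/tmp", "LISTSTATUS"))
```

The application attaches its token-based credentials to each request, and Knox handles authenticating to the Kerberized cluster on its behalf.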


Authentication using application tokens — The applications and services in the eBay backend use a custom token-based authentication mechanism. To take advantage of the existing application credentials, we enhanced Knox to support our application token-based authentication mechanism in addition to Kerberos. Knox utilizes the Hadoop Authentication framework, which is flexible enough to plug in new authentication mechanisms. The steps to plug in an authentication mechanism on Hadoop’s HTTP interface are described in Multiple Authentication Mechanisms for Hadoop Web Interfaces.


Figure 6: Knox Architecture at eBay


In this blog post, we describe the approach taken to improve user experience and developer productivity in using our multiple Hadoop clusters and related components. We illustrate the use of three open-source products to make Hadoop users’ life a lot simpler. The products are Hue, Zeppelin, and Knox. We evaluated these products, customized them for eBay’s purpose, and made them available for our users to carry out their projects efficiently.

A Creative Visualization of OLAP Cuboids


eBay is one of the world’s largest and most vibrant marketplaces, with 1.1 billion live listings every day, 169 million active buyers, and trillions of rows of datasets ranging from terabytes to petabytes. Analyzing such volumes of data led eBay’s Analytics Data Infrastructure (ADI) team to build a fast analytics platform for this big data using Apache Kylin, an open-source distributed analytics engine that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop for extremely large datasets.

Apache Kylin creatively applies data warehouse OLAP technology to the Hadoop domain, which allows a query to return within milliseconds to seconds against petabyte-scale datasets. The magic of Apache Kylin is that it pre-calculates the metrics against defined dimensions. When a query is launched, it doesn’t need to scan petabytes of source data; instead it scans the pre-calculated metrics, which are much smaller than the source data, greatly accelerating the query.

Currently there are hundreds of cubes running in the Kylin production environment within eBay, serving dozens of business domains, including Inventory Health Analysis, User Behavior Analysis, eBay API Usage Analysis, and eBay Business Partner Channel Performance Analysis.

This post showcases the creative visualization of OLAP cuboids that has been implemented in the Cube Planner feature built on Apache Kylin by eBay’s ADI team to solve the challenge of showing huge OLAP cuboids in a fixed space. To better understand the challenge as well as the value of the visualization of sunburst charts that have been introduced into OLAP cuboids, some basic concepts need to be covered.

Basic Concepts

An OLAP cube is a term that typically refers to a multi-dimensional array of data [1][2]. OLAP is an acronym for online analytical processing [3], which is a computer-based technique of analyzing data to look for insights. The term cube here refers to a multi-dimensional dataset, which is also sometimes called a hypercube if the number of dimensions is greater than 3.

A cuboid is one combination of dimensions.

For example, if a cube has 4 dimensions — time, item, location, and supplier — it has 2^4 = 16 cuboids.

A basic cuboid has the most detailed data, except for the source data itself; it is composed of all dimensions, like (time, item, location, supplier). It can be rolled up to all the other cuboids. For example, a user can roll up the basic cuboid (time, item, location, supplier) along dimension “supplier” to cuboid (time, item, location). And in this case, the basic cuboid is the Parent Cuboid, and a 3-D cuboid (time, item, location) is a Child Cuboid.
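The combinatorics above can be sketched in a few lines of Python (a toy model, not Kylin code), with each cuboid represented as a tuple of dimension names:

```python
from itertools import combinations

dims = ("time", "item", "location", "supplier")

# Every cuboid is one combination of dimensions: 2^4 = 16 in total,
# from the basic cuboid (all 4 dimensions) down to the empty apex cuboid.
cuboids = [c for k in range(len(dims) + 1) for c in combinations(dims, k)]
print(len(cuboids))  # 16

def roll_up(parent, dim):
    """Rolling up a parent cuboid along one dimension drops that dimension."""
    return tuple(d for d in parent if d != dim)

print(roll_up(dims, "supplier"))  # ('time', 'item', 'location')
```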

OLAP Cuboids Visualization Challenges

The OLAP cuboids visualization has the following characteristics:

  • All the cuboids have a root parent — the basic cuboid.
  • The relationship “rollup to” between two cuboids is directed, from parent to child.
  • The relationship “rollup to” is an m:n mapping. One parent cuboid can roll up to multiple child cuboids, while one child cuboid can be rolled up from multiple parent cuboids.

So the visualization of cuboids is naturally a directed graph. But in the real OLAP world, not all the relationships are kept. The m:n mappings are simplified to 1:n mappings: every child cuboid has just one parent. Usually we keep the relationship with the parent that has the lowest row count and eliminate the others, because the lowest parent row count means the lowest cost of aggregating to the child cuboid. Hence, the visualization of cuboids is simplified to a tree, where the basic cuboid is the root, as shown below.
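The parent-selection rule can be illustrated with a short sketch (the row counts below are hypothetical, and cuboids are again tuples of dimension names):

```python
# Hypothetical row counts for three cuboids.
row_counts = {
    ("time", "item", "location"): 120_000,
    ("time", "item", "supplier"): 80_000,
    ("time", "item"): 20_000,
}

def choose_parent(child, candidates):
    """Keep only the parent with the lowest row count: aggregating from a
    smaller parent is the cheapest way to build the child cuboid."""
    parents = [p for p in candidates
               if set(child) < set(p) and len(p) == len(child) + 1]
    return min(parents, key=lambda p: row_counts[p])

parent = choose_parent(("time", "item"), list(row_counts))
print(parent)  # ('time', 'item', 'supplier'): 80,000 rows beats 120,000
```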

Even with the simplified relationships between cuboids, there can still be some challenges to cuboids layout with a tree:

  • The tree must be collapsed to fit in a fixed space.
  • It is impossible to have an overall view of all cuboids.
  • Multiple clicks are needed to reach the target node from the root, layer by layer.
  • It’s hard to focus on the whole view of all the child cuboids from a given cuboid.

Cuboid Visualization Solution in Cube Planner

Cube Planner makes OLAP cubes more resource-efficient. It intelligently builds a partial cube to minimize the cost of building a cube and at the same time maximizes the benefit of serving end user queries. Besides, it learns patterns from queries at runtime and dynamically recommends cuboids accordingly.

In Cube Planner, we want to show query usage down to the cuboid level, which enables the cube owner to gain more insight into their cubes. We want to have a color-coded heat map with all cuboids in one view to give the cube owner an overall feeling of its cube design. Furthermore, when the user hovers over each cuboid, details of individual cuboid — cuboid ID, query count, row count, rollup rate, etc. — would be displayed. We will also recommend a new cube design with recommended cuboids based on the query statistics, thus we will need to put the old cuboids and new cuboids together in one page to show the benefit of cube optimization.

No existing tree or directed-graph component could meet all our requirements above. Luckily, our GUI engineer discovered a way to produce sunburst charts, which meet our expectations well.

What is a Sunburst Chart?

Our sunburst charts are created with Angular-nvD3, an AngularJS directive for NVD3 re-usable charting library (based on D3). Users can easily customize their charts via a JSON API. Go to the Angular-nvD3 quick start page if you want to know more about how to include these fancy charts in your GUI.

How Are Sunburst Charts Used for Cube Cuboids Visualization?

Basically, at the back end we create a REST API to return the cuboid tree with the necessary information, and at the front end a JavaScript controller parses the REST response to a relative JSON format and then renders the sunburst chart. Below are samples of code from these two layers.

REST Service returns the Cuboid Tree response

@RequestMapping(value = "/{cubeName}/cuboids/current", method = RequestMethod.GET)
    public CuboidTreeResponse getCurrentCuboids(@PathVariable String cubeName) {
        CubeInstance cube = cubeService.getCubeManager().getCube(cubeName);
        if (cube == null) {
            logger.error("Get cube: [" + cubeName + "] failed when get current cuboids");
            throw new BadRequestException("Get cube: [" + cubeName + "] failed when get current cuboids");
        }
        Map<Long, Long> cuboidList = Maps.newLinkedHashMap();
        try {
            cuboidList = cubeService.getCurrentCuboidStatistics(cube);
        } catch (IOException e) {
            logger.error("Get cuboids list failed.", e);
            throw new InternalErrorException("Get cuboids failed.", e);
        }
        CuboidTree currentCuboidTree = CuboidTreeManager.getCuboidTree(cuboidList);
        Map<Long, Long> hitFrequencyMap = getCuboidHitFrequency(cubeName);
        Map<Long, Long> queryMatchMap = getCuboidQueryMatchCount(cubeName);
        if (currentCuboidTree == null) {
            logger.warn("Get current cuboid tree failed.");
            return null;
        }
        return cubeService.getCuboidTreeResponse(currentCuboidTree, cuboidList, hitFrequencyMap, queryMatchMap, null);
    }

The JavaScript controller parses the REST response to create the sunburst chart

// transform chart data and customized options.
    $scope.createChart = function(data, type) {
        var chartData = data.treeNode;
        var baseNodeInfo = _.find(data.nodeInfos, function(o) { return o.cuboid_id == data.treeNode.cuboid_id; });
        $scope.formatChartData(chartData, data, baseNodeInfo.row_count);
        if ('current' === type) {
            // ... analogous handling for the chart of the current cuboids ...
        } else if ('recommend' === type) {
            $scope.recommendData = [chartData];
            $scope.recommendOptions = angular.copy(cubeConfig.chartOptions);
            $scope.recommendOptions.caption = {
                enable: true,
                // The original HTML colors each legend entry; abbreviated here.
                html: 'Existed: Hottest / Hot / Warm / Cold &nbsp;&nbsp; '
                    + 'New: Hottest / Hot / Warm / Cold',
                css: {
                    position: 'relative',
                    top: '-25px'
                }
            };
            // Color each cuboid by its query rate relative to the average rate
            // (1 / cuboid count); newly recommended cuboids use a separate band.
            $scope.recommendOptions.chart.color = function(d) {
                var cuboid = _.find(data.nodeInfos, function(o) { return o.name == d; });
                if (cuboid.row_count < 0) {
                    return d3.scale.category20c().range()[5];
                } else {
                    var baseRate = 1 / data.nodeInfos.length;
                    var colorIndex = 0;
                    if (!cuboid.existed) {
                        colorIndex = 8;
                    }
                    if (cuboid.query_rate > (3 * baseRate)) {
                        return d3.scale.category20c().range()[colorIndex];
                    } else if (cuboid.query_rate > (2 * baseRate)) {
                        return d3.scale.category20c().range()[colorIndex + 1];
                    } else if (cuboid.query_rate > baseRate) {
                        return d3.scale.category20c().range()[colorIndex + 2];
                    } else {
                        return d3.scale.category20c().range()[colorIndex + 3];
                    }
                }
            };
            $scope.recommendOptions.title.text = 'Recommend Cuboid Distribution';
            $scope.recommendOptions.subtitle.text = 'Cuboid Count: ' + data.nodeInfos.length;
        }
    };

    // transform chart data
    $scope.formatChartData = function(treeNode, orgData, parentRowCount) {
        var nodeInfo = _.find(orgData.nodeInfos, function(o) { return o.cuboid_id == treeNode.cuboid_id; });
        $.extend(true, treeNode, nodeInfo);
        treeNode.parent_row_count = parentRowCount;
        if (treeNode.children.length > 0) {
            angular.forEach(treeNode.children, function(child) {
                $scope.formatChartData(child, orgData, nodeInfo.row_count);
            });
        }
    };

Screenshots and Interaction

Below are some screenshots from the eBay Kylin production environment. With a sunburst chart, cube owners can easily understand their overall cube design with a color-coded cuboid usage heat map. The more dark blue elements in the sunburst chart, the more resource-efficient the cube is; the more light blue elements, the more room there is for optimization.

Putting the two sunburst charts of the current and recommended cuboids side by side makes the changes obvious: the cuboids recommended for removal are marked in gray, and those recommended to be added are marked in green. A popup window with more details of a cuboid is shown when the mouse hovers over its element in the sunburst chart. The value of Cube Planner is now apparent.

Interaction with a sunburst chart is fast and convenient. The user is able to focus on any cuboid and its children with just one click, and the view changes automatically, like from the left chart to the right chart below.


If you want to return to the parent of a leaf, click on the center circle (the part marked yellow).


In short, leveraging sunburst charts for OLAP cuboid visualization introduces a creative way to discover cube insights down to the cuboid level. With these insights, the cube owner can build a resource-efficient cube, making Apache Kylin even more competitive as an OLAP engine on Hadoop.


Some graphics copyright Apache Kylin

A Surprising Pitfall of Human Judgment and How to Correct It


Algorithms based on machine learning, deep learning, and AI are in the news these days. Evaluating the quality of these algorithms is usually done using human judgment. For example, if an algorithm claims to detect whether an image contains a pet, the claim can be checked by selecting a sample of images, using human judges to decide whether each contains a pet, and then comparing the results to the algorithm's output. This post discusses a pitfall in using human judgment that has been mostly overlooked until now.

In real life, human judges are imperfect. This is especially true if the judges are crowdsourced. This is not a new observation. Many proposals to process raw judgment scores and improve their accuracy have been made. They almost always involve having multiple judges score each item and combining the scores in some fashion. The simplest (and probably most common) method of combination is majority vote: for example, if the judges are rating yes/no (for example, is there a pet), you could report the rating given by the majority of the judges.

But even after such processing, some errors will remain. Judge errors are often correlated, and so multiple judgments can correct fewer errors than you might expect. For example, most of the judges might be unclear on the meaning of “pet”, incorrectly assume that a chicken cannot be a pet, and wrongly label a photo of a suburban home showing a child playing with chickens in the backyard. Nonetheless, any judgment errors remaining after processing are usually ignored, and the processed judgments are treated as if they were perfect.

Ignoring judge errors is a pitfall. Here is a simulation illustrating this. I’ll use the pet example again. Suppose there are 1000 images sent to the judges and that 70% of the images contain a pet. Suppose that when there is no pet, the judges (or the final processed judgment) correctly detect this 95% of the time. This is actually higher than typical for realistic industrial applications. And suppose that some of the pet images are subtle, and so when there is a pet, judges are correct only 90% of the time. I now present a simulation to show what happens.

Here’s how one round of the simulation works. I assume that there are millions of possible images and that 70% of them contain a pet. I draw a random sample of 1000. I assume that when the image has a pet, the judgment process has a 90% chance of reporting “pet”, and when there is no pet, the judgment process reports “no pet” 95% of the time. I get 1000 judgments (one for each image), and I record the fraction of those judgments that were “pet”.

That’s one round of the simulation. I do the simulation many times and plot the results as a histogram. I actually did 100,000 simulations, so the histogram has 100,000 values. The results are below in red. Since this is a simulation, I know the true fraction of images with pets: it’s 70%. The estimated fraction from the judges in each simulation is almost always lower than the true fraction. Averaged over all simulations, the estimated fraction is 64%, noticeably lower than the true value of 70%.
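This simulation is easy to reproduce. A condensed sketch (2,000 rounds instead of 100,000, same parameters):

```python
import random

random.seed(0)

def one_round(n=1000, p=0.7, q_pos=0.9, q_neg=0.95):
    """One round: judge n images, where judges detect pets with probability
    q_pos and correctly reject non-pets with probability q_neg."""
    votes = 0
    for _ in range(n):
        has_pet = random.random() < p
        says_pet = (random.random() < q_pos) if has_pet else (random.random() >= q_neg)
        votes += says_pet
    return votes / n

estimates = [one_round() for _ in range(2000)]
mean = sum(estimates) / len(estimates)
print(round(mean, 3))  # ≈ 0.645, well below the true fraction 0.70
```

The expected judged fraction is 0.7 × 0.9 + 0.3 × 0.05 = 0.645, which matches what the simulation produces.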

This illustrates the pitfall: by ignoring judge error, you get the wrong answer. You might wonder if this has any practical implications. How often do you need a precise estimate for the fraction of pets in a collection of images? But what if you’re using the judgments to determine the accuracy of an algorithm that detects pets? And suppose the algorithm has an accuracy rate of 70%, but the judges say it is 64%. In the machine learning/AI community, the difference between an algorithm that is 70% accurate vs 64% is a big deal.

But maybe things aren’t as bad as they seem. When statistically savvy experimenters report the results of human judgment, they report error bars. The error bars (details below) in this case are

    \[       0.641 \pm 1.96 \sqrt{\frac{p (1-p)}{n}} \approx 0.641 \pm 0.030 \qquad (p = 0.64) \]

So even error bars don’t help: the actual accuracy rate of 0.7 is not included inside the error bars.
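Checking the arithmetic (a quick sketch with n = 1000, the sample size used in the simulation):

```python
import math

p_j, n = 0.641, 1000
half_width = 1.96 * math.sqrt(p_j * (1 - p_j) / n)
print(round(half_width, 3))  # ≈ 0.030

# The naive 95% interval misses the true fraction 0.7 entirely:
print(p_j - half_width <= 0.7 <= p_j + half_width)  # False
```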

The purpose of this post is to show that you can compensate for judge error. In the image above, the blue bars are the simulation of the compensating algorithm I will present. The blue bars do a much better job of estimating the fraction of pets than the naive algorithm. This post is just a summary, but full details are in Drawing Sound Conclusions from Noisy Judgments from the WWW17 conference.

The histogram shows that the traditional naive algorithm (red) has a strong bias. But it also seems to have a smaller variance, so you might wonder if it has a smaller mean square error (MSE) than the blue algorithm. It does not. The naive algorithm has MSE 0.0037, the improved algorithm 0.0007. The smaller variance does not compensate for the large bias.

Finally, I can explain why judge error is so important. For a typical problem requiring ML/AI, many different algorithms are tried. Then human judgment can be used to detect if one is better than the other. Current practice is to use error bars as above, which do not take into account errors in the human judgment process. But this can lead the algorithm developer astray and suggest that a new algorithm is better (or worse) when in fact the difference is just noise.

The setup

I’m going to assume there are two components: an algorithm that takes an input and outputs a label (not necessarily a binary label), and a judge that decides if the output is correct. So the judge is performing a binary task: determining if the algorithm is correct or not. From these judgments, you can compute the fraction of times (an estimate of the probability p) that the algorithm is correct. I will show that if you can estimate the error in the judgment process, then you can compensate for it and get a better estimate of the accuracy of the algorithm. This applies whether you use raw judge scores or a judgment process (for example, majority vote) to improve judge accuracy. In the latter case, you need to estimate the accuracy of the judgment process.

A simple example of this setup is an information retrieval algorithm. The algorithm is given a query and returns a document. A judge decides if the document is relevant or not. A subset of those judgments is reevaluated by gold judge experts, giving an estimate of the judges’ accuracy. Of course, if you are rich enough to be able to afford gold judges to perform all your labeling, then you don’t need to use the method presented here.

A slightly more subtle example is the pet detection algorithm mentioned above. Here it is likely that there would be a labeled set of images (a test set) used to evaluate the algorithm. You want to know how often the algorithm agrees with the labels, and you need to correct for errors in the judgment about agreement. To estimate the error rate, pick a subset of images and have them rejudged by experts (gold judges). The judges were detecting if an image had a pet or not. However, what I am really interested in is the error rate in judging whether the algorithm gave the correct answer or not. But that is easily computed from the gold judgments of the images.

The first formula

In the introduction, I mentioned that there are formulas that can correct for judge error. To explain the formula, recall that there is a task that requires labeling items into two classes, which I’ll call positive and negative. In the motivating example, positive means the algorithm gives the correct output for the item, negative that it is incorrect. I have a set of items, and judges perform the task on each item, getting an estimate of how many items are in the positive class. I’d like to use information on the accuracy of the judges to adjust this estimate. The symbols used in the formula are:

  • p — the true fraction of items in the positive class
  • p_J — the fraction of items that the judges label positive
  • q_+ — the probability that a judge correctly labels a positive item (judge accuracy on positives)
  • q_- — the probability that a judge correctly labels a negative item (judge accuracy on negatives)

The first formula is

    \[ p = \frac{p_J + q_- - 1}{q_+ + q_- -1} \]

This formula is not difficult to derive. See the aforementioned WWW paper for details. Here are some checks that the formula is plausible. If the judges are perfect, then q_+ = q_- = 1, and the formula reduces to p = p_J. In other words, the judges’ opinion of p is correct. Next, suppose the judges are useless and guess randomly. Then q_+ = q_- = 1/2, and the formula makes no sense because the denominator is zero. So that’s also consistent.

Notice the formula is asymmetric: the numerator has q_- but not q_+. To see why this makes sense, first consider the case when the judges are perfect on negative items so that q_- = 1. The judges’ only error is to take correct answers and claim they are negative. So an estimate of p by the judges is always too pessimistic. On the other hand, if q_+ = 1, then the judges are optimistic, because they sometimes judge incorrect items as correct.

I will now show that the formula achieves these requirements. In particular, if q_- = 1 then I expect p_J < p and if q_+ = 1 then p_J > p. To verify this, note that if q_- = 1 the formula becomes p = p_J/q_+ > p_J. And if q_+ = 1 then

    \begin{eqnarray*}   p  &=& \frac{p_J + q_- - 1}{q_-} \\ &=& \frac{p_J - 1}{q_-} + 1 \\ &=& \frac{p_J - 1}{q_-} + (1-p_J) + p_J \\   &= & (p_J - 1)\left(\frac{1}{q_-} - 1\right) + p_J \\   & < & p_J \end{eqnarray*}

the last inequality because p_J - 1 < 0 and 1/q_- -1 > 0.
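Plugging the pet-simulation numbers into the formula shows the correction at work: with q_+ = 0.9 and q_- = 0.95, the expected judged fraction is 0.7 × 0.9 + 0.3 × 0.05 = 0.645, and the formula recovers the true value:

```python
def correct(p_j, q_pos, q_neg):
    """Invert the judge-error model: recover the true positive fraction p
    from the judged fraction p_J and judge accuracies q_+ and q_-."""
    return (p_j + q_neg - 1) / (q_pos + q_neg - 1)

print(correct(0.645, 0.9, 0.95))  # 0.7 (up to floating-point rounding)
```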

Up until now the formula is theoretical, because the precise values of p_J, q_+ and q_- are not known. I introduce some symbols for the estimates.

  • \widehat{p}_J — the observed fraction of items that the judges label positive
  • \widehat{q}_+, \widehat{q}_- — estimates of q_+ and q_-, obtained, for example, from a sample rejudged by gold judges

The practical formula is

    \[ \widehat{p} = \frac{\widehat{p}_J + \widehat{q}_\text{--} - 1}{\widehat{q}_+ + \widehat{q}_\text{--} -1} \]

The second formula

In the introduction, I motivated the concern with judgment error using the problem of determining the accuracy of an ML algorithm and, in particular, comparing two algorithms to determine which is better. If you had perfect judges and an extremely large set of labeled data, you could measure the accuracy of each algorithm on the large labeled set, and the one with the highest accuracy would clearly be best. But the labeled set is often limited in size. That leads to some uncertainty: if you had picked a different labeled set, you might get a different answer. That is the purpose of error bars: they quantify the uncertainty. Traditionally, the only uncertainty taken into account is due to the finite sample size. In this case, the traditional method gives 95% error bars of \widehat{p}_J  \pm 2 \sqrt{v_{\widehat{p}_J}} where

    \begin{eqnarray*}       \widehat{p}_J  &=&  \frac{k}{n} \\       v_{\widehat{p}_J} &= &  \frac{\widehat{p}_J(1-\widehat{p}_J)}{n} \end{eqnarray*}

The judge gave a positive label k times out of n, \widehat{p}_J is the estimated mean and v_{\widehat{p}_J} the estimate of the variance of \widehat{p}_J. I showed via simulation that if there are judgment errors, these intervals are too small. The corrected formula is

    \[ v_{\widehat{p}} =  \frac{v_{\widehat{p}_J}}{(\widehat{q}_+ + \widehat{q}_\text{--} - 1)^2} + v_{\widehat{q}_+}\frac{(\widehat{p}_J - 1 + \widehat{q}_\text{--})^2}{(\widehat{q}_+ + \widehat{q}_\text{--} - 1)^4} + v_{\widehat{q}_\text{--}}\frac{(\widehat{p}_J - \widehat{q}_+)^2}{(\widehat{q}_+ + \widehat{q}_\text{--} - 1)^4} \]


  • v_{\widehat{q}_+}, v_{\widehat{q}_-} — estimates of the variances of \widehat{q}_+ and \widehat{q}_-, computed in the same way as v_{\widehat{p}_J} from the sizes of the gold-judged samples

The formula shows that when taking judge error into account, the variance increases, that is, v_{\widehat{p}} >  v_{\widehat{p}_J}. The first term of the equation for v_{\widehat{p}} is already larger than v_{\widehat{p}_J}, since it is v_{\widehat{p}_J} divided by a number less than one. The next two terms add additional error, due to the uncertainty in estimating the judge accuracies \widehat{q}_+ and \widehat{q}_-.
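A sketch of the corrected variance in code, using the simulation's values for \widehat{p}_J, \widehat{q}_+, and \widehat{q}_-, and hypothetical gold-judged sample sizes for the accuracy estimates:

```python
def corrected_variance(p_j, q_pos, q_neg, v_pj, v_qpos, v_qneg):
    """Variance of the corrected estimate, per the formula above."""
    d = q_pos + q_neg - 1
    return (v_pj / d ** 2
            + v_qpos * (p_j - 1 + q_neg) ** 2 / d ** 4
            + v_qneg * (p_j - q_pos) ** 2 / d ** 4)

p_j, q_pos, q_neg = 0.645, 0.9, 0.95
v_pj = p_j * (1 - p_j) / 1000          # naive variance, n = 1000
v_qpos = q_pos * (1 - q_pos) / 200     # accuracies estimated on 200 gold items
v_qneg = q_neg * (1 - q_neg) / 200
v_p = corrected_variance(p_j, q_pos, q_neg, v_pj, v_qpos, v_qneg)
print(v_p > v_pj)  # True: the corrected error bars are wider
```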


The theme of this post is correcting for judge errors, but I have only discussed the case when the judge gives each item a binary rating, as a way to estimate what fraction of items are in each of the two classes. For example, the judge reports if an algorithm has given the correct answer, and you want to know how often the algorithm is correct. Or the judge examines a query and a document, and you want to know how often the document is relevant to the query.

But the methods presented so far extend to other situations. The WWW paper referred to earlier gives a number of examples in the domain of document retrieval. It explains how to correct for judge error in the case when a judge gives documents a relevance rating rather than just a binary relevant or not. And it shows how to extend these ideas beyond the metric “fraction in each class” to the metric DCG, which is a common metric for document retrieval.


Analyses of the type presented here always have assumptions. The assumption of this work is that there is a task (for example, “is the algorithm correct on this item?”) with a correct answer and that there are gold judges who are expert enough and have enough time to reliably obtain that correct answer. Industrial uses of human judgment often use detailed guidelines explaining how the judgment should be done, in which case these assumptions are reasonable.

But what about the general case? What if different audiences have different ideas about the task? For example, there may be differences from country to country about what constitutes a pet. The analysis presented here is still relevant. You only need to compute the error rate of the judges relative to the appropriate audience.

I treat the judgment process as a black box, where only the overall accuracy of the judgments is known. This is sometimes very realistic, for example, when the judgment process is outsourced. But sometimes details are known about the judgment process that would lead you to conclude that the judgments of some items are more reliable than others. For example, you might know which judges judged which items, and so have reliable information on the relative accuracy of the judges. The approach I use here can be extended to apply in these cases.

The simulation

I close this post with the R code that was used to generate the simulation given earlier.

# sim() returns a (4 x n.sim) array, where the rows contain:
#   the naive estimate of p,
#   the corrected estimate,
#   whether the true value of p is in a 95% confidence interval (0/1)
#        computed from the naive standard error,
#   the same indicator for the corrected standard error

sim = function(
  q.pos,  # probability a judge is correct on positive items
  q.neg,  # probability a judge is correct on negative items
  p,      # fraction of positive items
  n,      # number of items judged
  n.sim,  # number of simulations
  n.gold.neg = 200, # number of items the gold judge deems negative
  n.gold.pos = 200  # number of items the gold judge deems positive
  )
{
   # number of positive items in sample
   n.pos = rbinom(n.sim, n, p)

   # estimate of judge accuracy
   q.pos.hat = rbinom(n.sim, n.gold.pos, q.pos)/n.gold.pos
   q.neg.hat = rbinom(n.sim, n.gold.neg, q.neg)/n.gold.neg

   # what the judges say (.jdg)
   n.pos.jdg = rbinom(n.sim, n.pos, q.pos) +
               rbinom(n.sim, n - n.pos, 1 - q.neg)

   # estimate of fraction of positive items
   p.jdg.hat = n.pos.jdg/n    # naive formula
   # corrected formula
   p.hat = (p.jdg.hat - 1 + q.neg.hat)/
                      (q.neg.hat + q.pos.hat - 1)

   # is p.jdg.hat within 1.96*sigma of p?
   s.hat =  sqrt(p.jdg.hat*(1 - p.jdg.hat)/n)
   b.jdg = abs(p.jdg.hat - p) < 1.96*s.hat

   # is p.hat within 1.96*sigma.hat of p?
   v.b = q.neg.hat*(1 - q.neg.hat)/n.gold.neg   # variance of q.neg.hat
   v.g = q.pos.hat*(1 - q.pos.hat)/n.gold.pos   # variance of q.pos.hat
   denom = (q.neg.hat + q.pos.hat - 1)^2
   v.cor = s.hat^2/denom +
           v.g*(p.jdg.hat - 1 + q.neg.hat)^2/denom^2 +
           v.b*(p.jdg.hat - q.pos.hat)^2/denom^2
   s.cor.hat = sqrt(v.cor)
   b.cor = abs(p.hat - p) < 1.96*s.cor.hat

   rbind(p.jdg.hat, p.hat, b.jdg, b.cor)
}

r = sim(n.sim = 100000, q.pos = 0.90, q.neg = 0.95, p = 0.7, n = 1000)

# plot v1 and v2 as two separate histograms, but on the same ggplot
library(ggplot2)

hist2 = function(v1, v2, name1, name2, labelsz = 12)
{
  df1 = data.frame(x = v1, y = rep(name1, length(v1)))
  df2 = data.frame(x = v2, y = rep(name2, length(v2)))
  df = rbind(df1, df2)

  freq = aes(y = ..count..)
  print(ggplot(df, aes(x = x, fill = y)) +
      geom_histogram(freq, position = 'dodge') +
      theme(axis.text.x = element_text(size = labelsz),
            axis.text.y = element_text(size = labelsz),
            axis.title.x = element_text(size = labelsz),
            axis.title.y = element_text(size = labelsz),
            legend.text = element_text(size = labelsz)))
}

hist2(r[1,], r[2,], "naive", "corrected", labelsz = 16)
