A Case Study in Empirical Bayes

Empirical Bayes is a statistical technique that is both powerful and easy to use. My goal is to illustrate that statement via a case study using eBay data. Quoting the famous statistician Brad Efron,

Empirical Bayes seems like the wave of the future to me, but it seemed that way 25 years ago and the wave still hasn’t washed in, despite the fact that it is an area of enormous potential importance.

Hopefully this post will be one small step in helping Empirical Bayes to wash in! The case study I'll present comes from ranking the items that result from a search query. One feature that is useful for ranking items is their historical popularity. On eBay, some items are available in multiple quantities. For these, popularity can be measured by the number of times an item is sold divided by the number of times it is displayed, which I will call sales/impressions (S/I). By the way, everything I say applies to any ratio of counts, not just sales and impressions.

The problem

The problem I want to discuss is what to do if the denominator is small. Suppose that items typically have 1 sale per 100 impressions. Now suppose that a particular item gets a sale just after being listed. This is a typical item that has a long-term S/I of about 0.01, but by chance it got its sale early, say after the 3rd impression. So S/I is 1/3, which is huge. It looks like an enormously popular item, until you realize that the denominator I is small: it has received only 3 impressions. One solution is to pass the problem downstream, and give the ranker both S/I and I. Let the ranker figure out how much to discount S/I when I is small. Passing the buck might make sense in some situations, but I will show that it's not necessary, and that it's possible to pass a meaningful value even when I is small.

How to do that? Informally, I want a default value of S/I, and I want to gradually move from that default to the actual S/I as I increases. Your first reaction is probably to do this by picking a number (say 100), and if I < 100 use the default, otherwise S/I. But once you start to wonder whether 100 is the right number, you might as well go all the way and do things in a principled way using probabilities.

The solution

Jumping to the bottom line: the formula will be (S + α)/(I + γ). This clearly satisfies the desire to be near S/I when S and I are large. It also implies that the default value is α/γ, since that's what you get when S=I=0. In the rest of this post I will explain two things. First, how to pick α and γ (there is a right way and a wrong way). And second, where the shape of the formula (S + α)/(I +γ) comes from. If you're familiar with Laplace smoothing then you might think of using (S+1)/(I+1), and our formula is a generalization of that. But it still begs the question — why a formula of this form, rather than, for example, a weighted sum (1 - e^{-\alpha I})(S/I) + e^{-\alpha I}(\alpha/\gamma).

The formula (S + α)/(I +γ) comes by imagining that at each impression, there is a probability of an associated sale, and then returning the best estimate of that probability instead of returning S/I. I'll start with the simplest way of implementing this idea (although it is too simple to work well).

Suppose the probability of a sale has a fixed universal value p, so that whenever a user is shown an item, there is a probability p that the item is sold. This is a hypothetical model of how users behave, and it's straightforward to test if it fits the data. Simply pick a set of items, each with an observed sale count and impression count. If the simple model is correct, then an item with n impressions will receive k sales according to the binomial formula:

\Pr(\mbox{getting } k \mbox{ sales}) = \binom{n }{ k} p^{k}(1-p)^{n - k}

Here n is the number of impressions and k the number of sales. As mentioned earlier, this whole discussion also works for other meanings of k and n, such as k is clicks and n is impressions. To test the simple model, I can compare two sets of data. The first is the observed pairs (n,k). In other words, I retrieve historical info for each item, and record n impressions and k sales. I construct the second set by following the simple model: I take the actual number of impressions n, and randomly generate the number of sales k according to the formula above. Below is a histogram of the two data sets. Red is simulated (the model), and blue is actual. The match is terrible.

Bayes plot1

Here is some more detail on the plot: Only items with a nonzero sale count are shown. In the simulation there are 21% items with S=0, but the actual data has 47%.

So we need to go to a more sophisticated model. Instead of a fixed value of p, imagine drawing p from a probability distribution and plugging it into the inset equation, which is then used to get the random k. As you can see in the plot below, the two histograms have a much more similar shape than the previous plot, and so this model does a better job of matching the actual data.

Bayes plot2

Now it all boils down to finding the distribution for p. Since 0 \leq p \leq 1, that means finding a probability distribution on the interval [0, 1]. The most common such distribution is the Beta distribution, which has two parameters, \alpha and \beta. By assuming a Beta distribution, I reduce the problem to finding \alpha and \beta (and yes, this α is the same one as in the formula (S + α)/(I +γ)). This I will do by finding the values of \alpha and \beta that best explain the observed values of k and n. Being more precise, associated to each of N historical items is a sale count k_i and an impression count n_i, with 1 \leq i \leq N.

I was perhaps a bit flip in suggesting the Beta distribution because it is commonly used. The real reason for selecting Beta is that it makes the computations presented in the Details section below much simpler. In the language of Bayesian statistics, the Beta distribution is conjugate to the binomial.

At this point you can fall into a very tempting trap. Each k_i/n_i is a number between 0 and 1, so all the values form a histogram on [0,1]. The possible values of p follow the density function for the Beta distribution and so also form a histogram on [0,1]. Thus you might think you could simply pick the values of \alpha and \beta that make the two histograms match as closely as possible. This is wrong, wrong, wrong. The values k_i/n_i are from a discrete distribution and often take on the value 0. The values of p come from a continuous distribution (Beta) and are never 0, or more precisely, the probability that p=0 is 0. The distributions of k/n and of p are incompatible.

In my model, I'm given n and I spit out k by drawing p from a Beta distribution. The Beta is invisible (latent) and indirectly defines the model. I'll give a name to the output of the model: X. Restating, fix an n and make X a random variable that produces value k with the probability controlled indirectly by the Beta distribution. I need to match the observed (empirical) values of (n_i, k_i) to X, not to Beta. This is the empirical Bayes part. I'll give an algorithm that computes \alpha and \beta later.

But first let me close the loop, and explain how all this relates to (S + α)/(I + γ). Instead of reporting S/I, I will report the probability of a sale. Think of the probability as a random variable — call it P. I will report the mean value of the random variable P. How to compute that? I heard a story about a math department that was at the top of a tall building whose windows faced the concrete wall of an adjacent structure. Someone had spray-painted on the wall "don't jump, integrate by parts." If it had been a statistics department, it might have said "don't jump, use Baye's rule."

Baye's rule implies a conditional probability. I want not the expected value of P, but the expected value of P conditional on n impressions and k sales. I can compute that from the conditional distribution \Pr(P = p \mid (n,k)). To compute this, flip the two sides of the | to get \Pr((n,k) \mid P=p). This is \Pr(\mbox{getting } k \mbox{ sales}), which is just the inset equation at the beginning of this post!

Now you probably know that in Baye's rule you can't just flip the two sides, you also have to include the prior. The formula is really \Pr(P = p \mid (n,k)) = \mbox{constant} \times \Pr((n,k) \mid P = p) \Pr(P=p). And \Pr (P=p) is what we decided to model using the Beta distribution with parameters \alpha and \beta. These are all the ingredients for Empirical Bayes. I need \Pr(P = p \mid (n,k)), I evaluate it using Baye's rule, the rule requires a prior, and I use empirical data to pick the prior. In empirical Bayes, I select the prior that best explains the empirical data. For us, the empirical data is the observed values of (n_i, k_i). When you do the calculations (below) using the Beta(\alpha, \beta) distribution as the prior, you get that the mean of P is (S + α)/(I + γ) where γ = α + β.

How does this compare with the simplistic method of using S/I when I > δ, and η otherwise? The simplistic formula involves two constants δ and η just as the principled formula involves two constants α and γ. But the principled method comes with an algorithm for computing α and γ given below. The algorithm is a few lines of R code (using the optimx package).

The details

I'll close by filling in the details. First I'll explain how to compute \alpha and \beta.

I have empirical data on N items. Associated with the i-th item (1 \leq i \leq N) is a pair (k_i, n_i), where k_i might be the number of sales and n_i the number of impressions, but the same reasoning works for clicks instead of sales. A model for generating the (k_i, n_i) is that for each impression there is a probability p that the impression results in a sale. So given n_i, the probability that k_i = j is \binom{n_i }{ j} p^{j}(1-p)^{n_i - j}. Then I add in that the probability p is itself random, drawn from a parametrized prior distribution with density function f_\theta(p). I generate the (k_i, n_i) in a series of independent steps. At step i, I draw p_i from f_\theta(p), and then generate k_i according to the binomial probability distribution on k_i:

\mbox{Prob}(k_i = j) = \binom{n_i }{ j} p_i^{j}(1-p_i)^{n_i - j}

Using this model, the probability of seeing (k_i, n_i) given n_i is computed by averaging over the different possible values of p, giving

q_i(\theta) = \int_0^1 \binom{n_i }{ k_i} p^{k_i}(1-p)^{n_i - k_i} f_\theta(p) dp

I'd like to find the parameter \theta that best explains the observed (k_i, n_i) and I can do that by maximizing the probability of seeing all those (n_i, k_i). The probability seeing (n_i, k_i) is q_i(\theta), the probability of seeing the whole set is \prod_i q_i(\theta) and the log of that probability is \sum_i \log q_i(\theta). This is a function of \theta, and I want to find the value of \theta that maximizes it. This log probability is conventionally called the log-likelihood.

Since I'm assuming f_\theta(p) is a beta distribution, with \theta = (\alpha, \beta), then q_i(\theta) becomes

q_i(\alpha, \beta) & = & \dbinom{n_i }{ k_i} \int_0^1 p^{k_i}(1-p)^{n_i - k_i}
\frac{ \Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} p^{\alpha-1}(1-p)^{\beta-1} dp \\
& = &
\dbinom{n_i }{ k_i} \frac{ \Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}
\int_0^1 p^{k_i + \alpha -1}(1-p)^{n_i +\beta - k_i - 1} dp \\
& = &
\dbinom{n_i }{ k_i} \frac{B(\alpha + k_i, n_i + \beta - k_i)}{B(\alpha, \beta)}
The calculation above uses the definition of the beta function B and the formula for the beta integral
B(\alpha,\beta) & = & \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)} \\
\int_0^1 x^{\alpha-1} (1-x)^{\beta-1} dx & = & B(\alpha, \beta)

If you don't want to check my calculations, q_i(\alpha, \beta) is just the beta-binomial distribution, and you can find its formula in many books and web pages.

Restating, to find \alpha, \beta is to maximize the log-likelihood l(\alpha, \beta) = \sum_i \log q_i(\alpha, \beta), specifically

l(\alpha, \beta) = \sum_i \left(\log \dbinom{n_i }{ k_i} + \log B(\alpha + k_i, n_i + \beta - k_i) - \log B(\alpha, \beta)\right)

And since the first term doesn't involve \alpha or \beta, you only need to maximize

\sum_{i=1}^N\log B(\alpha + k_i, n_i + \beta - k_i) - N\log B(\alpha, \beta)

The method I used to maximize that expression was the optimx package in R.

The final missing piece is why, when I replace S/I with the probability that an impression leads to a sale, the formula is (k + \alpha)/(n + \gamma).

I have an item with an unknown probability of sale p. All that I do know is that it got k sales out of n impressions. If P is the random variable representing the sale probability of an item, and F = (k,n) is a random variable representing the sale/impression of an item, I want \Pr(P = p \mid F = (k, n)), which I write as \Pr(p \mid k, n) for short. Evaluate this using Baye's rule,

\Pr(p \mid k, n) = \Pr(k,n \mid p) \Pr(p) / \Pr(k,n)

The \Pr(k,n) term can be ignored. This is not deep, but can be confusing. In fact, any factor \phi(k,n) involving only k and n (like 1/\Pr(k,n)) can be ignored. That's because \int \Pr(p \mid k, n) dp = 1, so if \Pr(p \mid k, n) =f(p,k,n)\phi(k,n) it follows that \phi can be recovered from f(p,k,n) using \phi(k,n) = 1/\int(f(p,k,n)dp. In other words, I can simply ignore a \phi and reconstruct it at the very end by making sure that \int \Pr(p \mid k, n) dp = 1.

I know that

\Pr(k,n \mid p) = \binom{n }{ k} p^k(1-p)^{n-k}

For us, the prior \Pr(p) = f_\theta(p) = f_{\alpha, \beta}(p) is a beta distribution, \mbox{Beta}_{\alpha, \beta}(p) = p^{\alpha-1}(1~-~p)^{\beta-1}/B(\alpha, \beta). Some algebra then gives

\Pr(p \mid k, n) \propto \Pr(k,n \mid p) \Pr(p)\propto \mbox{Beta}_{\alpha + k, \beta + n - k}(p)

The \propto symbol ignores constants involving only k and n. Since the rightmost term integrates to 1, the proportionality is an equality:

\Pr(p \mid k, n) = \mbox{Beta}_{\alpha + k, \beta + n - k}(p)

For an item with (k,n) I want to know the value of p, but this formula gives the probability density for p. To get a single value I take the mean, using the fact that the mean of \mbox{Beta}_{\alpha, \beta} is \alpha/(\alpha+\beta). So the estimate for p is

\mbox{Mean}(\mbox{Beta}_{\alpha + k, \beta + n - k}) = \frac{\alpha + k}{\alpha + \beta + n}

This is just (S + α)/(I + γ) with γ = α + β.

There's room for significant improvement. For each item on eBay, you have extra information like the price. The price has a big effect on S/I, and so you might account for that by dividing items into a small number of groups (perhaps low-price, medium-price and high-price), and computing \alpha, \beta for each. There's a better way, which I will discuss in a future post.

HDFS Storage Efficiency Using Tiered Storage

At eBay, we run Hadoop clusters comprised of thousands of nodes that are shared by thousands of users. We store hundreds of petabytes of data in our Hadoop clusters. In this post, we look at how to optimize big data storage based on frequency of data usage. This method helps reduce the cost in an effective manner.

Hadoop and its promise

It is now common knowledge that commodity hardware can be grouped together to create a Hadoop cluster with big data storage and computing capability. Parts of the data are stored in each individual machine, and data processing logic is also run on the same machines.


For example: A 1,000-node Hadoop cluster with storage capacity of 20 TB per node can store up to 20 petabytes (PB) of data. All these machines have sufficient computing power to fulfill Hadoop’s motto of “take compute to data.”

Temperature of data

Different types of datasets are usually stored in the clusters, which are shared by different teams running different types of workloads to crunch through the data. Each dataset is enhanced and enriched by daily and hourly feeds through the data pipelines.

A common trait of datasets is heavy initial usage. During this period the datasets are considered HOT. Based on our analysis, we found there is a definite decline in usage with time, where the stored data is accessed a few times a week and ages to being WARM data. In the next 90 days, when data usage falls to a few times a month, it is defined as COLD data.

So data can be considered HOT during its initial days, then it remains WARM in the first month. Jobs or applications use the data a few times during this period. The data’s usage goes down further; the data becomes COLD, and may be used only a handful of times in the next 90 days. Finally, when the data is very rarely used, at a frequency of once or twice per year, the “temperature” of the data is referred to as FROZEN.

 Data Age   Usage Frequency   Temperature 
Age < 7 days 20 times a day HOT
7 days > Age < 1 month 5 times a week WARM
1 month < Age < 3 months 5 times a month COLD
3 months < Age < 3 years 2 times a year FROZEN

In general, a temperature can be associated with each dataset. In this case, temperature is inversely proportional to the age of the data. Other factors can affect the temperature of a particular dataset. You can also write algorithms to determine the temperature of datasets.

Tiered storage in HDFS

HDFS supports tiered storage since Hadoop 2.3.

How does it work?

Normally, a machine is added to the cluster, and local file system directories are specified to store the block replicas. The parameter used to specify the local storage directories is dfs.datanode.data.dir. Another tier, such as ARCHIVE, can be added using an enum called StorageType. To denote that a local directory belongs to the ARCHIVE tier, the directory is prefixed in the configuration with [ARCHIVE]. In theory, multiple tiers can exist, as defined by a Hadoop cluster administrator.

For example: Let’s add 100 nodes that contain 200 TB of storage per node to an existing 1,000-node cluster having a total of 20 PB of storage. These new nodes have limited computing capability compared to the existing 1,000 nodes. Let’s prefix all the local data directories with ARCHIVE. These 100 nodes now form the ARCHIVE tier and can store 20 PB of data. The total capacity of the cluster is 40 PB, which is divided into two tiers – the DISK tier and the ARCHIVE tier. Each tier has 20 PB.

Mapping data to a storage tier based on temperature

For this example, we will store the heavily used HOT data in the DISK tier, which has nodes with better computing power.

For WARM data, we will keep most of its replicas in the DISK tier. For data with a replication factor of 3, we will keep two replicas in the DISK tier and one replica in the ARCHIVE tier.

If data is COLD, we will keep at least one replica of each block of the COLD data in the DISK tier. All the remaining replicas go to the ARCHIVE tier.

When a dataset is deemed FROZEN, which means it is almost never used, it is not optimal to store it on a node that has lots of CPU power to run many tasks or containers. We will keep it on a node that has minimal computing power. Thus, all the replicas of all blocks of FROZEN data can move to the ARCHIVE tier.

Data flow across tiers

When data is first added to the cluster, it gets stored in the default tier, DISK. Based on the temperature of the data, one or more replicas are moved to the ARCHIVE tier. Mover is used for data movement from one storage tier to another tier. Mover works similarly to Balancer except that it moves block replicas across tiers. Mover accepts an HDFS path, a replica count, and destination tier information. Then it identifies the replicas to be moved based on the tier information, and schedules the moves between source and destination data nodes.

Changes in Hadoop 2.6 to support tiered storage

Many improvements in Hadoop 2.6 further support tiered storage. You can attach a storage policy to a directory to denote it as HOT, WARM, COLD, or FROZEN. The storage policy defines the number of replicas to be located on each tier. It is possible to change the storage policy on a directory and then invoke Mover on that directory to make the policy effective.

Applications using data

Based on the data temperature, some or all replicas of data could be on either tier. But the location is transparent to applications consuming the data via HDFS.

Even though all the replicas of FROZEN data are on ARCHIVE storage, applications can still access it just like any HDFS data. Because no computing power is available on ARCHIVE nodes, mapped tasks running on DISK nodes will read the data from ARCHIVE nodes, which leads to increased network traffic for the applications. If this occurs too frequently, you can declare the data as WARM/COLD, and Mover can move one or more replicas back to DISK.

The determination of data temperature and the designated replica movement to pre-defined tiered storage can be fully automated.

Tiered storage at eBay

Tiered storage is enabled in one of the very large clusters at eBay. The cluster had 40 PB of data. We added 10 PB of additional storage with limited computing power. Each new machine could store 220 TB. We marked the additional storage as ARCHIVE. We identified a few directories as WARM, COLD, or FROZEN. Based on their temperature, we moved all or a few replicas to the ARCHIVE storage.

The price per GB of the ARCHIVE tier is four times less than the price per GB on the DISK tier. This difference is mainly because machines in the ARCHIVE tier have very limited computing power and hence lower costs.


Storage without computing is cheaper than storage with computing. We can use the temperature of the data to make sure that storage with computing is wisely used. Because each block of data is replicated a few times (the default is three), some replicas can be moved to the low-cost storage based on the temperature of the data. HDFS supports tiered storage and provides the necessary tools to move data between tiers. Tiered storage is enabled on one of the very large clusters at eBay to archive data.

Benoy Antony is an Apache Hadoop committer who focuses on HDFS and Hadoop security. Benoy works as a software engineer in the Global Data Infrastructure team at eBay.

The Power of Perceived Performance

Recent years have seen a huge influx of SPAs — Single Page Applications. Though they enhance user experience, implementing SPAs for large-scale web applications is indeed a complex task. At eBay, we faced a similar challenge when we wanted to migrate a couple of our key desktop flows (search and item pages) to an app-like experience, from the current state of full page refreshes. Some of the key challenges were

  • Server and client sync: Solving this challenge is super critical for e-commerce applications. Both the browser and the server should maintain the state of the app. At any point in time the URL should be portable, meaning it should render the page markup the same way every time. From an e-commerce perspective, three main circumstances make this point critical:   SEO, browser refreshes (especially for items ending soon), and URL sharing.
  • Code redundancy: This point follows from the previous one. In order for the app to be rendered on the server, all logic built in JavaScript for the browser should be replicated on the server. The result, however, is a maintenance nightmare, especially for large distributed teams. Although there are solutions like rendr and PhantomJS to simulate the browser environment on the server, they don’t work at scale.
  • Performance penalty for the first hit: Most SPA frameworks out there require the initial rendering on page load to happen on the browser. This is an anti-performance pattern and has proven to be a bad way to go for many (see, for example, Twitter's experience). In addition, with initial rendering on the browser we lose the huge benefits of the preload scanners in modern browsers. This point, then, is a reiteration that server-side rendering is not just an add-on, but a must.
  • Browser back/forward: This may come as a surprise to many, but from what we have observed, even the slightest deviation from the default back/forward actions of the browser has impacts on consumer behavior. Users are so accustomed to these actions (mainly in the desktop environment) that we need to make sure they work as expected. This is not much of an SPA challenge, but it is something to keep in mind.

Considering the above facts and still wanting to build a seamless experience, we decided to go the PJAX route. PJAX (pushState + AJAX) is a technique that delivers a fast browsing experience, without the SPA overhead. It has worked well for biggies like Twitter and GitHub. When looking to implement a PJAX library, we learned about YouTube’s SPF — Structured Page Fragments— from the 2014 Velocity conference (yes, SPF not SPA; we know it's confusing). A quick dive into SPF indicated it was pretty much what we wanted. Moreover, the contributors to SPF responded promptly to all our queries and enhancement requests, thus enabling us to get started quickly. So what does SPF offer?

  • Application code remains intact: For SPF, we don’t have to change the way we build applications. Also, no specialized client treatment is required. The only change needed was to add a conditional hook (based on a particular request URL param) in our server response pipeline to respond with HTML in JSON format, instead of with standard HTML. This benefit is huge to us, as development teams can build and maintain applications while being agnostic about how client-side navigations might happen.
  • Markup is on server: With SPF, markup is always rendered on the server. Along with the code maintenance benefit, this feature also removes the dependency on client hardware specifically for rendering. Although client machines are getting increasingly powerful, we still have a sizable set of our global users on the lower hardware spectrum whom we need to support. Our previous attempts at client-side rendering for these users were not fruitful. On a side note, we are very interested in trying out React for client-side rendering after initial load.

JavaScript and CSS optimizations 

Moving to SPF provided an opportunity for us to clean up JavaScript and CSS. We did two types of optimization.

  1. JavaScript events: Our pages did not have a standard way of handling events — some were handled at an individual element level and some used delegation. This situation was not ideal, as complex pages were sometimes sluggish because they had tons of events. Now with PJAX, we've brought in a standard:  to widgetize our UI modules and delegate events at the widget container level. This standard has made event handling more efficient. Furthermore, we needed to re-initialize only those widgets that were changed on page navigation.
  2. Resource bundling: Most pages were bundled in a way that there was only one JavaScript and one CSS URL per page. All library JS (jQuery, Raptor, tracking, utils, etc.) and CSS (skin) were combined with application JS and CSS, making them one URL each. While this was good for reducing HTTP requests, it also meant anti-caching. When a user navigates from one page to another, the entire CSS and JS have to be downloaded and executed; library files were the big chunk of this overhead, which was unnecessary. With SPF, this approach would fail right away, since it involves a single page context as well as executing the same library code (like jQuery), which would result in unintended behaviors. To fix this situation, we took on the task of creating two resource bundles for key pages — bundling all common JS and CSS shared across pages as one resource, and bundling the application JS and CSS per page as the second resource. This solution saves a lot of time in terms of resource parsing and execution, as only the small amount of application JS and CSS has to be processed on each navigation. Also, in SPF mode the resource processing happens only for the first navigation; for repeated views, the previously executed CSS and JS can be leveraged.

Progress indicators

Now back to why the title “The Power of Perceived Performance.” Moving to SPF measurably increased performance on each navigation. But we had a problem:  the performance gain was not visually perceptible. In an internal demo, the feedback was “yeah, it seems to be a little quicker, but nothing else is different.” We were scratching our heads about what was missing. Finally, it all came down to one thing — a progress indicator. Yes, we did not have progress indicators when users navigated pages in SPF mode.

Transitions or progress indicators mask the slowness in applications. There has been much research around this phenomenon, and we actually experienced the humongous impact it has. Close observation of all major websites that use the PJAX technique reveals they use some sort of progress indicator. For instance, Twitter navigation uses a small throbber, replacing the bird icon in the static header. GitHub replaces the icon right next to a file or folder with a throbber. YouTube shows a red progress bar at the top to indicate the progress of a user’s action.

When we were considering how to implement a transition for SPF-based navigation, a lot of fancy ideas came up. From internal testing, the feedback we received was clear :  more than the fancy stuff, customers just need an elegant transition. We ended up with a real-time progress bar similar to YouTube's. With the progress indicator in place, we did another internal launch. This time, the feedback was unanimous: “Wow! The page looks fast.”

It was surprising how a tiny progress indicator could change the perception of an entire application. The performance numbers with and without the progress indicators were the same. But just with that indicator, the application feels much faster. This is the real power of perceived performance. As a bonus, avoiding the re-parse and re-execution of large CSS and JavaScript on each navigation made our page interactions ready instantly.

Currently the PJAX-based navigation is enabled within the item page and is in production A/B testing. Search is next, and soon other key flows will follow. The ultimate goal is ONE eBay desktop experience.

Huge thanks to YouTube engineers Alex Nicksay, David Phillips, and Marcel Duran for their continuous support throughout the migration. Last but not least, thanks to my colleagues Karthik, Yaniv, Vidya, and Raghu for teaming up and making this is a reality.