Announcing Pulsar Reporting: Near-Real-Time Metrics Reporting Framework

We are excited to announce the first open-source release of Pulsar Reporting.

Earlier this year, we announced Pulsar, an open-source project that included Pulsar Pipeline, a real-time analytics platform and stream processing framework. One of the most frequently requested features for Pulsar has been integration with a metrics store for visualizing near-real-time metrics. This release provides that feature by adding the Pulsar Reporting API and the Pulsar Reporting UI Framework under the same license terms. The code is available in a public GitHub repository.

What is Pulsar Reporting?

Pulsar Reporting is an extensible data visualization and reporting framework designed to provide real-time insights from Pulsar Pipeline. The framework includes a rich set of charting widgets and a visual reporting editor for users to easily create reports. It has a robust data query engine that can be extended to support many different types of data sources. With the Pulsar Reporting Framework, users can quickly create multi-dimensional and interactive reports that include drill-down and slice-and-dice capabilities.


  • Near-real-time reports – Building reports based on near-real-time data that auto-refreshes at specified intervals
  • Visual reporting editor – Generating reports without writing any code
  • Rich charting widgets – Creating multiple chart types: line, bar, histogram, pie, stack, datatable, etc.
  • Reporting API – Querying data with human-friendly SQL or program-friendly structured JSON
  • Dynamic data source management – Adding or removing data sources with no downtime
  • Security and permissions – Managing authentication and access control
  • Druid Kafka extension – Ingesting real-time data from Kafka into Druid
  • AngularJS-based hierarchical UI framework – Easily adding and extending reports
  • Bootstrap-based responsive design – Being able to use Pulsar Reporting on different sizes of screens

Why Pulsar Reporting?

The Pulsar Reporting Framework complements Pulsar, an open-source, real-time analytics platform and stream processing framework. Pulsar generates huge amounts of data, and visualization is the best way to provide intuitive and meaningful insights into that data. However, building dashboards and reports for big data from scratch is cumbersome and error-prone. The Pulsar Reporting Framework allows users to create reports easily and quickly, without having to write complex data processing and UI logic themselves.


The raw events and session events from Pulsar Pipeline flow to Kafka using the Pulsar Kafka channel. The Druid cluster then ingests the raw events as well as the sessions from Kafka topics into two tables, one for sessions and one for events. Both tables are indexed in one-second granularity to enable real-time reporting. The Pulsar Reporting API provides an abstract layer to access the tables. The Reporting UI gets the data from the API to build different charts.


Sample API requests

    • Get session metrics using the SQL API:
      Endpoint: http://<API_Server>/prapi/v2/sql
      Method: POST
      Body: {"sql" : "SELECT (count(session) - sum(retvisitor)) * 1.0 / count(session) newSessionRate, sum(sessionDuration) * 1000 totalSessionDurations, count(session) sessions, sum(sessionDuration) totalSessions, sum(totalpagect) totalPages, country, trafficSource FROM pulsar_session WHERE site=0 and country='usa' GROUP BY country, trafficSource ORDER BY sum(totalpagect) ASC limit 20",
      "intervals": "2015-10-11 03:00:32/2015-10-18 01:00:32",
      "granularity": "day"}
    • Get page views by traffic source using the structured JSON API:
      Endpoint: http://<API_Server>/prapi/v2/realtime
      Method: POST
      Body: {"metrics" : [ "pageviews" ], "dimensions" : [ "trafficsource" ], "filter" : "site=0" }
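As a rough sketch, a client might build these two requests in Python using only the standard library. The host name `api.example.com` is a placeholder, the `build_request` helper is hypothetical, the `Content-Type` header is an assumption about what the server expects, and the SQL text is a shortened query in the style of the sample above that has not been checked against a live server. The code only constructs the requests; it does not send them.

```python
import json
import urllib.request

# Placeholder host: substitute your own Pulsar Reporting API server.
API_SERVER = "api.example.com"


def build_request(path: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the given API path."""
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{API_SERVER}/prapi/v2/{path}",
        data=data,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


# Structured JSON query: page views by traffic source, as in the sample above.
realtime_request = build_request(
    "realtime",
    {"metrics": ["pageviews"], "dimensions": ["trafficsource"], "filter": "site=0"},
)

# SQL query: a shortened, illustrative variant of the session-metrics sample.
sql_request = build_request(
    "sql",
    {
        "sql": "SELECT count(session) sessions, country FROM pulsar_session "
               "WHERE site=0 GROUP BY country limit 20",
        "intervals": "2015-10-11 03:00:32/2015-10-18 01:00:32",
        "granularity": "day",
    },
)

# To actually send a request: urllib.request.urlopen(realtime_request)
```

Keeping request construction separate from sending makes the body easy to inspect or log before anything goes over the network.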

What’s next?

We have open-sourced the Pulsar Reporting Framework, and we plan to continue developing the code in the open. We welcome your suggestions and contributions. Here are some of the features we are thinking about.

  • Pathing and funnels
  • Exporting reports
  • Expanding support to additional data sources based on community interest
  • Integrating with Pulsar.js, a client-side JavaScript library to generate Pulsar events for the web

Please visit the Pulsar Reporting GitHub repository for source code, documentation, and more information.

The Team


How Our CSS Framework Helps Enforce Accessibility


Screenshot of two visually identical 'Buy it Now' buttons

Spot the difference… You can’t! To a sighted user it appears we have two identical button elements.

A user interface control not only needs to look like a certain control, it must be described as that control too. Take for example a button, one of the simplest of controls. There are many ways you can create something that looks like a button, but unless you use the actual button tag (or button role – more on roles later), it will not be described as a button.

Why does it need to be described as a button? Users of AT (assistive technology), such as a screen reader, may not be able to see what the control looks like visually; therefore it is the job of the screen reader to describe it aurally. A screen reader, such as VoiceOver for Mac OSX and iOS, can do this job only if we, the developers, ensure the correct semantics are present in our HTML code.


In the comparison below, compare and contrast the accessibility tree attributes for each element. VoiceOver uses the accessibility tree to convey to the user what it knows about the web page. You will see that for the fake button, there is nothing in the tree to identify the span element as a button. Quite simply, VoiceOver does not know this element is intended to be a button.

Now spot the difference: this is how VoiceOver sees these two elements

  • HTML
      ‘REAL’ button: <button class="btn">Buy it Now</button>
      ‘FAKE’ button: <span class="btn">Buy it Now</span>
  • Accessibility tree attributes
      ‘REAL’ button: (image) Annotated accessibility tree of real button
      ‘FAKE’ button: (image) Annotated accessibility tree of fake button
  • VoiceOver
      ‘REAL’ button: “Buy it now, Button.” “To click this button press CTRL-OPTION-SPACE.”
      ‘FAKE’ button: “Buy it now.”

Accessibility tree screenshots taken from Mac OSX Accessibility Inspector

What’s also interesting is that if you look at the ‘Actions’ section of the tree, the real button has an ‘accessibilityPerformPress’ action, while the fake button does not. Armed with this information, VoiceOver can also describe how to interact with the element (e.g., press CTRL-OPTION-SPACE). No such information will be communicated for the fake button.

We can safely say that this fake button is not accessible, because the AT doesn’t know what it is or how to interact with it. It appears our fake button is accessible only to people who can see the screen and use a mouse. Oh dear – this fake button has excluded a large number of our users from being able to buy items!

Swiss cheese

You might be wondering, “Who on earth would use a span or div tag for a button?”

You might now also be thinking, “What on earth does Swiss cheese have to do with any of this?”

In the Swiss cheese model of accident causation, risk of a threat becoming a reality is mitigated by differing layers and types of defenses that are “layered” behind each other. For example, we might use code linting, code reviews, accessibility checkers, and manual testing to help ensure that this button is properly described. We liken these separate layers to multiple slices of Swiss cheese, stacked side by side – hence the name.

Illustration of swiss cheese model

Is there anything cheese can’t do? Although many layers of defense lie between hazards and accidents, there are flaws in each layer that, if aligned, can allow the accident to occur.
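To make the “code linting” layer a little more concrete, here is a minimal sketch of a check that flags any element carrying the btn class that is not an actual button tag. It is written in Python with the standard-library HTML parser purely for illustration; the class and function names are hypothetical, and this is not a tool described in this post.

```python
from html.parser import HTMLParser
from typing import List


class FakeButtonLinter(HTMLParser):
    """Collects elements that carry the 'btn' class but are not <button> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.problems: List[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # A valueless class attribute parses as None, hence the "or".
        classes = (attributes.get("class") or "").split()
        if "btn" in classes and tag != "button":
            line, _column = self.getpos()
            self.problems.append(f"line {line}: <{tag}> styled as a button")


def lint(html: str) -> List[str]:
    """Return one message per fake button found in the given markup."""
    linter = FakeButtonLinter()
    linter.feed(html)
    linter.close()
    return linter.problems


sample = (
    '<button class="btn">Buy it Now</button>\n'
    '<span class="btn">Buy it Now</span>\n'
)
print(lint(sample))
```

Run against the two ‘Buy it Now’ elements from earlier, the check reports only the span. Like every slice of cheese, it has holes: it sees static markup only, not elements created by JavaScript.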

What if we could also write our CSS framework in a way that acts as another layer in our line of defense? Read on to find out how!

Enforcing roles

Continuing on from our previous ‘fake button’ example, let’s suppose the developer had created the following rules to make the span element appear visually like a button:

.btn {
  background-color: #0654ba;
  border-radius: 0.25em;
  color: white;
  padding: 0.25em 1em;
}

Screenshot of fake 'Buy it Now' button

The dreaded fake button (although you still can’t tell, just by looking at it)

What we have here is the proverbial cart before the horse. The developer has styled the element before describing its purpose. One way in which we can create the necessary description (the horse) is to require a role attribute. We’ll go into more detail on the role attribute later, but here’s the interesting bit – we can leverage attribute selectors and re-write our CSS like so:

/* Styles apply only when the element is also described as a button */
[role=button].btn {
  background-color: #0654ba;
  border-radius: 0.25em;
  color: white;
  padding: 0.25em 1em;
}

Screenshot of unstyled 'Buy it Now' span element

Under the skin: our attribute selector has now exposed the fake button for the fraud that it is!

Our selector now ensures that a button will visually appear like a button only if it has first been described as a button. You can almost think of this as TDD (test-driven development). If the HTML does not pass our ‘test’, the visual style will not be applied.

Implicit roles

It’s important to know that nearly all elements have a default implicit role, and these default roles do not need to be specified in the HTML – to do so would be redundant. No prizes for guessing what the default role of a button element is. Yes, it’s button!

You might think that it was easy enough for us to convert a span into an accessible button using the button role, but in actual fact our work would not be finished there. Adding a role does not add behavior. A fully accessible button must be keyboard focusable and it must be invokable with SPACE and ENTER keys too. A button element gives this behavior for free; a span element – even with a role of button – does not, and we must implement its behavior by hand.

So please, and I really can’t emphasize this strongly enough, do everybody a favor and always use an actual button element for buttons.

The only real reason you might have for using the button role is when progressively enhancing a link into a button using JavaScript; for example, to make the link open an overlay instead of a new page – which is exactly what we do on eBay. As with spans and divs, allowing anchor tags for buttons does re-open the door to misuse and abuse (think ‘faux’ buttons); and though it is possible to enforce the correct usage with clever use of attribute selectors, it’s a little more convoluted and therefore beyond the scope of this post.

Again, we can enforce this markup requirement by rewriting our CSS selector like so:

/* Styles apply only to real button elements */
button.btn {
  background-color: #0654ba;
  border-radius: 0.25em;
  color: white;
  padding: 0.25em 1em;
}

Screenshot of our final 'Buy it Now' button

Horse, cart, & driver: the element now has the appearance, description, and interaction of a button

Finally, no more span and div tags for buttons. Our CSS framework simply does not allow it.

Enforcing states

So far we’ve looked at a simple example of how CSS selectors can force developers to put the proper semantics in place – whether implicitly or explicitly. But what about state? If an element has state (a checked checkbox for example), it is not sufficient to describe only what the element is; we must also describe what state it is in.

Developers often fall into exactly the same trap as before: they convey the state visually but not aurally.

In the following code example, the developer has used a modifier class of btn--disabled in order to alter the opacity and background-color of the button:

button.btn--disabled {
  background-color: #999;
  opacity: 0.5;
}

Screenshot of a button that appears visually disabled

Our ‘ghosted out’ button appears visually disabled

Modifier class is a BEM (Block, Element, Modifier) concept. Throughout this article we will be using a variation of BEM in order to structure and distinguish our class names.

You might be thinking that this isn’t really disabled. If so, you are quite right. This button will not be described as disabled and it will not behave as disabled.

Again, you might be thinking, “Who actually does this kind of stuff?”, but fear not, our CSS selectors can again protect us from this manner of profanity:

button[disabled] {
  background-color: #999;
  opacity: 0.5;
}

As you can see, the previous modifier class will no longer cut the mustard. It is removed from the selector entirely and the HTML disabled property takes its place. Only when this property is applied in the markup will the button be well and truly disabled for all users.

Comparing accessibility trees, we see that the button with the class name is still described as ‘Enabled’:

  • Disabled property: (image) Annotated accessibility tree of button with disabled property
  • Disabled class: (image) Annotated accessibility tree of button with disabled class name

So far, none of this is particularly earth-shattering, I’m sure you agree, but it sets the stage nicely for moving onto more complex controls and widgets, where we must start delving deeper into the world of WAI-ARIA (commonly referred to as just ARIA for short).


HTML gives us only a limited set of controls such as buttons, links, and the various form value inputs. What about menus, tabs, carousels, overlays, etc. – how do we describe those? Yes, you guessed it – ARIA comes to our rescue.

ARIA gives us many more roles beyond a simple button, and these roles, in conjunction with a multitude of states and properties, open up a whole new set of desktop-like user interface controls and widgets for us to play with. Just make sure you read the instructions before diving in. You do read the instructions don’t you?

Look out for more controls in HTML5, such as menu and dialog. In fact, you might be interested to know that both the menu and dialog tags started out life as ARIA roles before they were introduced as bona fide HTML elements. Don’t get too excited, though – neither has cross-browser support at the time of this writing.

In the next section we will look at an example of such a widget and demonstrate how we can use ARIA to influence the way we write CSS selectors in order to enforce accessible markup.


A tabs widget allows the layered stacking of two or more content panels, whereby only one panel of content can be visible at any time. A list of clickable tabs allows the user to swap out the visible panel. This all happens on the client, without a full page reload (i.e., the client is stateful). By decluttering the user interface in this way we can say that a tabs widget follows the principle of progressive disclosure.

Screenshot of eBay's tabbed interface for sign-in or register

Using tabs, the user can switch between “Sign In” or “Register” without a full page reload.

It is critical that our interface is not only visually identifiable as a tabs control (I’ve seen designs that struggle even to meet this criterion!), but also aurally. Without any tab-related HTML tags, how do we achieve this?

Faux tabs

A seasoned developer might set out initially to create the tabs as a list of clickable page anchors for the tabs, with a group of divs acting as anchor targets for the tab panels:

<div class="tabs">
  <ul>
    <li class="tabs__tab tabs__tab--selected">
      <a href="#sign-in">Sign in</a>
    </li>
    <li class="tabs__tab">
      <a href="#register">Register</a>
    </li>
  </ul>
  <div class="tabs__panel tabs__panel--active" id="sign-in">
    <!-- Sign in Content -->
  </div>
  <div class="tabs__panel" id="register">
    <!-- Register Content -->
  </div>
</div>

This is a perfectly reasonable approach to begin with. Page anchors are often well suited as the starting point for tabs, because in the case of JavaScript being unavailable they ensure at least some basic functionality when clicked (i.e., the browser will scroll to the content of the relevant panel). However, when JavaScript does become available, care must be taken to prevent the default link behavior so as to not interfere with tab semantics and behavior. Let me be very clear about this: links are not the same as tabs!

This technique of making core content and functionality available pre-CSS and pre-JavaScript is called progressive enhancement. Progressive enhancement is the safest and surest way to guard against the unknown (e.g., script timeout, script failure, scripting disabled) and to ensure your core experience remains backwards and forwards compatible in all HTML-capable browsers.

We will assume that all layout-related styling is in place for the links (i.e., they are neatly spaced out horizontally), and that by default the visible state of all panels is hidden, with only the ‘active’ panel displayed. Let’s then suppose our developer chooses to visually convey the selected ‘tab’ state using only an underline (a veritable tour de force of minimalism, I know):

.tabs__tab {
  text-decoration: none;
}
.tabs__tab--selected {
  text-decoration: underline;
}
.tabs__panel {
  display: none;
}
.tabs__panel--active {
  display: block;
}

It would now take only a small amount of JavaScript for our developer to turn this into a “functioning” tabs widget by preventing the default link action (i.e., prevent it navigating to the URL fragment) and toggling the ‘selected’ and ‘active’ modifier classes accordingly; and indeed our developer might be tempted to stop there.

But although this control looks like a tabs widget, it will currently be described only as a list of links (scroll down to see the accessibility tree). No clues are given as to the dynamic, stateful nature of the widget. Screen reader users attempting to follow one of these links are going to be surprised when nothing happens after invoking the link, and equally surprised when no navigation occurs. They are left guessing as to what type of control they might be interacting with. Not a good experience.

Let’s fix it so that if developers try to use our amazingly awesome CSS to style their tabs like ours (go on, admit it, you want that underline too), the styles will appear only if they have the correct accessible markup in place.

Real tabs

To achieve the correct markup for tabs, just as with our simple button example, we can replace class names with ARIA roles and states.

Luckily, ARIA gives us a set of tab-related roles:

  • tablist
  • tab
  • tabpanel

We can also leverage the following ARIA states and properties:

  • aria-selected
  • aria-hidden
  • aria-controls
  • aria-labelledby

Whilst it would be entirely possible to continue on with our demonstration of progressive enhancement by applying the above roles and states to override our previous link-based markup, it does add some additional complexities which might distract us from the primary topic at hand. So, rather than getting bogged down in those details, let’s drop the progressive enhancement for now and pretend we live in a magical world where JavaScript is always on, is always available, and always works.

Actually, to be honest, it’s not just a JavaScript issue. Some people would argue that by using list-based markup, we also provide for a reasonable semantic fallback in the cases where the tab & tablist roles are not supported by the user’s browser & AT combo.

It will make most sense if we show you the new HTML first this time, rather than the CSS, and hopefully, without the cognitive clutter of the list and link tags, our end goal is now a little clearer. You will quickly see that the core DOM structure remains almost identical:

<div class="tabs">
  <div role="tablist">
    <div role="tab" aria-selected="true" tabindex="0">
      <span>Sign in</span>
    </div>
    <div role="tab" aria-selected="false" tabindex="-1">
      <span>Register</span>
    </div>
  </div>
  <div role="tabpanel" id="sign-in" aria-hidden="false">
    <!-- Sign in Content -->
  </div>
  <div role="tabpanel" id="register" aria-hidden="true">
    <!-- Register Content -->
  </div>
</div>

With these new ARIA roles in place, our tabs will now actually be described as tabs by assistive technology. Likewise, when our JavaScript toggles the ARIA selected state, this state will also be conveyed to our users.

Note that AT actually requires two additional ARIA properties that are not present in our markup: aria-controls (on the tabs) and aria-labelledby (on the tabpanels). These ARIA properties are not typically used as styling hooks on tabs, so we will leave them out for the sake of code brevity; but be sure to include them when building your own tabs widget!

Okay, so we are nearing the end now, but first we must finish up our CSS. Our selectors must become a contract for the accessible HTML above. Where before we had classes for BEM blocks and elements, now we have ARIA roles. Where before we had classes for BEM modifiers, now we have ARIA states:

.tabs [role=tab][aria-selected=false][tabindex="-1"] {
  text-decoration: none;
}
.tabs [role=tab][aria-selected=true][tabindex="0"] {
  text-decoration: underline;
}
.tabs [role=tabpanel][aria-hidden=true] {
  display: none;
}
.tabs [role=tabpanel][aria-hidden=false] {
  display: block;
}

Personally, I’m a big fan of BEM, but it’s nice where possible like this to be able to replace it with something a little more real, if you know what I mean.

Finally, let us compare the accessibility tree of the first real tab with the first faux tab:

  • Real tab: (image) Annotated accessibility tree of real tabs
  • Faux tab: (image) Annotated accessibility tree of faux tabs

One other rule we have enforced in our selectors is the tabindex attribute. Keyboard accessibility for tabs must be implemented in JavaScript using a roving tabindex technique; this is because the tabs in a tablist are selected using the arrow keys, not the tab key (the tab key is actually used to exit the list of tabs). While not strictly necessary to ensure the correct description is given, this selector helps ensure that the correct attribute values are in place for roving tabindex behavior. It’s up to you whether you want to go this far, into the realm of behavior-testing, in your own selectors.
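If you would rather keep this kind of behavior-testing out of your selectors, the same contract can be checked in a test instead. The sketch below is a hypothetical Python helper, not part of our framework. It accepts only the two attribute pairs that the selectors above accept, and it additionally assumes a single-select tablist in which exactly one tab is selected.

```python
from html.parser import HTMLParser
from typing import List


class TabContractChecker(HTMLParser):
    """Records (aria-selected, tabindex) for every element with role="tab"."""

    def __init__(self) -> None:
        super().__init__()
        self.tabs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("role") == "tab":
            self.tabs.append(
                (attributes.get("aria-selected"), attributes.get("tabindex"))
            )


def check_tabs(html: str) -> List[str]:
    """Return a list of problems with the roving tabindex attribute values."""
    checker = TabContractChecker()
    checker.feed(html)
    checker.close()
    problems = []
    # The two attribute pairs that the CSS selectors accept.
    allowed = {("true", "0"), ("false", "-1")}
    for position, pair in enumerate(checker.tabs, start=1):
        if pair not in allowed:
            problems.append(f"tab {position}: unexpected state {pair}")
    # Assumption: a single-select tablist has exactly one selected tab.
    selected = sum(1 for pair in checker.tabs if pair == ("true", "0"))
    if checker.tabs and selected != 1:
        problems.append(f"expected exactly 1 selected tab, found {selected}")
    return problems


markup = (
    '<div role="tablist">'
    '<div role="tab" aria-selected="true" tabindex="0">Sign in</div>'
    '<div role="tab" aria-selected="false" tabindex="-1">Register</div>'
    '</div>'
)
print(check_tabs(markup))
```

A check like this validates attribute values only. It says nothing about whether the arrow keys actually move focus, which still has to be tested in a browser.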

Good behavior

We must always remember that correctly describing a UI control is only part of making it accessible. The user expectation is that it behaves like that control too. Therefore we must also ensure that the correct accessible behavior is in place.

For example, a button must always be ‘clickable’ with SPACE and ENTER keys. Sadly, this kind of behavior is often the first thing to go missing when developers try rolling their own buttons using span or div tags.

More complex controls such as tabs, menus, or autocomplete will typically require a more significant amount of JavaScript in order to make sure the control fully behaves according to its description.


We have seen that each layer of the web frontend has its own responsibilities in terms of creating accessible UI controls:

  • HTML provides the aural description and some built-in behavior
  • CSS provides the visual style and interaction clues
  • JS provides any missing behavior not provided by ARIA or HTML

HTML provides behavior, without the need for JavaScript, for built-in tags such as links, buttons, and form controls.

For the purpose of this blog post, our focus has been primarily HTML and CSS. HTML is fundamental in laying solid foundations for accessible UI controls and widgets, and we have shown how those foundations can be enforced by use of CSS attribute selectors.

So, the next time you find yourself creating a class name like ‘active’, ‘hidden’, ‘on’, or ‘off’ – stop and think instead how you might be able to leverage HTML properties or ARIA states in your selectors. Likewise, if you find yourself creating a class name like ‘btn’, ‘tab’, or ‘dialog’ – also stop and think about how you might be able to leverage an existing HTML tag or ARIA role.

Thank you for reading. I hope you enjoyed it. If you are interested in more accessibility-related articles in future, be sure to leave a comment below!

Finally, if you are interested in learning more about our CSS framework, watch this space for an upcoming announcement and further details. We are currently applying the finishing touches to the framework before releasing it as open source.


Peer Groups in Empirical Bayes

In a post from February, I sang the praises of Empirical Bayes, and showed how eBay uses it to judge the popularity of an item. This post discusses an important practical issue in using Empirical Bayes which I call “Peer Groups”.

First, a quick summary of the February post. The popularity of an item with multiple copies available for sale can be measured by the number sold divided by the number of times the item has been viewed, sales/impressions for short. The problem was how to interpret the ratio sales/impressions when the number of impressions is small and there might not be any sales yet. The solution was to think of the ratio as a proxy for the probability of sale (call it \pi), and use Bayes theorem to estimate \pi. Bayes theorem requires a prior probability, which I estimated using some of eBay’s voluminous sales data. This method is called Empirical Bayes because the prior is determined empirically, using the data itself.

That brings me to peer groups. When computing the prior probability that an item gets a sale, I want to base it on sales/impressions data from similar items, which I call a peer group. For example if the item is a piece of jewelry, the peer group might be all items of jewelry listed on eBay in the past month. I can get more specific. If the item is new, then the peer group might be restricted to new items. If the list price is $138, the peer group might be further restricted to items whose price was between $130 and $140, and so on.

Once you have identified a peer group and used it to estimate the prior probability, you use Bayes theorem to combine that with the observed count of sales and impressions to compute the probability of sale. This is the number you want—the probability that the next impression will result in a sale. It is called the posterior probability, to distinguish it from the prior probability.

There’s a tension in selecting the peer group. You might think that a peer group more strongly constrained to be similar to the item under consideration will result in a better prior and therefore a better estimate of the probability of a sale. But as the peer group gets smaller and smaller, the estimate of the prior based on the group becomes noisier and less reliable.

Which finally brings me to the subject of this post. In the case where the peer group is specified by a continuous variable like price, you can get the best of both worlds—a narrowly defined peer group and a lot of data (hence low noise) to estimate the prior parameters.

The idea

The idea is modeling. If the prior depends on the price p, and if there is a model for the dependence, the same data used to compute the prior can be used to find the model. Then an item of price p is assigned the prior given by the model at p, which is essentially the peer group of all items with exactly price p. Since this prior is a prediction of the model, it indirectly uses all the data, since the model depends on the entire data set.

Dependence on price

What is needed to apply Bayes theorem is not a single probability \pi, but rather a probability distribution on \pi. I assume the distribution of \pi is a Beta distribution B(\alpha, \beta), which has two parameters. Specifying the prior means providing values of \alpha and \beta.

So our idea is to see if there is a simple parametrized function that explains the dependence of \alpha(p) and \beta(p) on the price p. The beta distribution B(\alpha, \beta) has a mean of \mu = \alpha/(\alpha + \beta). As a first step, I examine the dependence of \mu (rather than \alpha and \beta) on price.


The fit to the power law \mu \propto p^{-0.67} is very good. The values of \alpha and \beta are noisier than \mu. But I do know one thing: sales/impressions is small, so \mu is small, and therefore \alpha \ll \beta and \mu \approx \alpha/\beta. It follows that if \alpha and \beta each fit a power law in p, so would \mu. Thus a power law for \alpha and \beta is consistent with the plot above.

Here are plots of \alpha(p) and \beta(p). Although somewhat noisy, their fits to power laws are reasonable. And the exponents add as expected: the exponent for \alpha is -0.32, for \beta is 0.35, and for \mu = \alpha/(\alpha + \beta) \approx \alpha/\beta is -0.32 - 0.35 = -0.67.
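To spell out why the exponents add, suppose that \alpha(p) = a p^{r} and \beta(p) = b p^{s}, where a, b, r, and s are placeholder names for the fitted constants. Since \alpha \ll \beta,

    \[ \mu(p) = \frac{\alpha(p)}{\alpha(p) + \beta(p)} \approx \frac{\alpha(p)}{\beta(p)} = \frac{a}{b} \, p^{r - s} \]

With r = -0.32 and s = 0.35, the exponent of \mu is r - s = -0.67, matching the fit.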




Once the form of the dependence of \alpha and \beta on price is known, the Empirical Bayes computations proceed as usual. Instead of having to determine two constants \alpha and \beta, I use Empirical Bayes to determine four constants c_1, c_2, c_3, and c_4, where

    \[ \alpha(p) = c_1 p^{c_2} \qquad \beta(p) = c_3 p^{c_4} \]

The details are in the February posting, so I just summarize them here. The c_i are computed using maximum likelihood as follows. The probability of seeing a sales/impressions ratio of k_i/n_i is

    \[ q_i(\alpha, \beta) = \binom{n_i}{k_i} \frac{B(\alpha + k_i, n_i + \beta - k_i)}{B(\alpha, \beta)} \]

and max likelihood maximizes the product \prod_i q_i(\alpha, \beta) or equivalently the log

    \[ l(\alpha, \beta) = \sum_i \log q_i(\alpha, \beta) \]

Instead of maximizing a function of the two variables \alpha and \beta, maximize a function of the four constants:

    \[ l(c_1, c_2, c_3, c_4) = \sum_i \log q_i(c_1 p_i^{c_2}, c_3 p_i^{c_4}) \]
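For readers who prefer code, here is a minimal Python sketch of this log-likelihood. It is an illustration under stated assumptions, not the production implementation: the function names are mine, the data and constants are made up, and the maximization itself (which would need a numerical optimizer) is not shown. It uses the log-gamma function to evaluate the beta function B and the binomial coefficient without overflow.

```python
import math


def log_beta(a: float, b: float) -> float:
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_q(k: int, n: int, alpha: float, beta: float) -> float:
    """log q_i: log-probability of k sales in n impressions."""
    log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_binom + log_beta(alpha + k, n + beta - k) - log_beta(alpha, beta)


def log_likelihood(c, items) -> float:
    """l(c_1, c_2, c_3, c_4) for items given as (price, sales, impressions)."""
    c1, c2, c3, c4 = c
    total = 0.0
    for price, k, n in items:
        alpha = c1 * price ** c2  # alpha(p) = c_1 p^{c_2}
        beta = c3 * price ** c4   # beta(p)  = c_3 p^{c_4}
        total += log_q(k, n, alpha, beta)
    return total


# Made-up data and constants, purely to exercise the function.
items = [(10.0, 2, 400), (138.0, 0, 250), (50.0, 1, 900)]
print(log_likelihood((1.0, -0.32, 500.0, 0.35), items))
```

A quick sanity check on log_q: for fixed n, \alpha, and \beta, the probabilities exp(log_q) summed over k = 0, ..., n should equal 1.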

Once you have computed c_1, c_2, c_3, c_4, an item with k sales out of n impressions at price p has a posterior probability of (\alpha + k)/(\alpha + \beta + n) = (c_1 p^{c_2} + k)/(c_1 p^{c_2} + c_3 p^{c_4} + n).
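As a small illustration, the posterior formula translates directly into Python. The function name and the constants in the example call are hypothetical; real values of c_1 through c_4 would come from the maximum likelihood fit.

```python
def posterior_sale_probability(k: int, n: int, price: float, c) -> float:
    """Posterior probability of a sale for k sales out of n impressions at a price."""
    c1, c2, c3, c4 = c
    alpha = c1 * price ** c2  # prior parameter alpha(p) = c_1 p^{c_2}
    beta = c3 * price ** c4   # prior parameter beta(p)  = c_3 p^{c_4}
    return (alpha + k) / (alpha + beta + n)


# Made-up constants: an item priced at $138 with 1 sale in 250 impressions.
example_constants = (1.0, -0.32, 500.0, 0.35)
print(posterior_sale_probability(1, 250, 138.0, example_constants))
```

With no impressions at all (k = n = 0), the result is just the prior mean \alpha/(\alpha + \beta). As n grows, the observed ratio k/n increasingly dominates.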

Beta regression

When people hear about the peer group problem with a beta distribution prior, they sometimes suggest using beta regression. This suggestion turns out not to be as promising as it first seems. In this section I will dig into beta regression, but it is somewhat of a detour so feel free to skip over it.

When we first learn about linear regression, we think of points on the (x,y) plane and drawing the line that best fits them. For example, the x-coordinate might be a person’s height, the y-coordinate the person’s weight, and the line shows how (on average) weight varies with height.

A more sophisticated way to think about linear regression is that each point represents a random variable Y_i. In the example above, x_i is a height, and Y_i represents the distribution of weights for people of height x_i. The height of the line at x_i represents the mean of Y_i. If the line is y = ax + b, then Y_i has a normal distribution with mean ax_i + b.

Beta regression is a variation in which Y_i has a beta distribution instead of a normal distribution. If the y_i satisfy 0 \leq y_i \leq 1, they are clearly not from a normal distribution, but might be from a beta distribution. In beta regression you assume that Y_i is distributed like B(\alpha_i, \beta_i), where the mean \mu_i = \alpha_i/(\alpha_i + \beta_i) is ax_i + b, or perhaps a function of ax_i + b. The theory of beta regression tells you how to take a set of (x_i, y_i) and compute the coefficients a and b.

But in our situation we are not given (x_i, y_i) from a beta distribution. The beta distribution is the unknown (latent) prior distribution. So it’s not obvious how to apply beta regression to get \alpha(p) and \beta(p).


Empirical Bayes is a technique that can work well for problems with large amounts of data. It uses a Bayesian approach that combines a prior and a particular item’s data to get a posterior probability for that item. For us the data is sales/impressions for the item, and the posterior is the probability that an impression for that item results in a sale.

The prior is based on a set of parameters, which for a beta distribution is \{\alpha, \beta\}. The parameters are chosen to make the best possible fit of the posterior to a subset of the data. But what subset?

There’s a tradeoff. If the subset is too large, it won’t be representative of the item under study. If it is too small, the estimates of the parameters will be very noisy.

If the subset is parametrized by a continuous variable (price in our example), you don’t need to decide how to make the tradeoff. You use the entire data set to build a model of how the parameters vary with the variable. Then, when computing the posterior of an item, you use the parameters given by the model. In our example, I use the data to compute constants c_1, \ldots, c_4. If the item has price p, k sales, and n impressions, then the parameters of the prior are \alpha = c_1 p^{c_2} and \beta = c_3 p^{c_4}, and the estimated probability of a sale (the posterior) is (c_1 p^{c_2} + k)/(c_1 p^{c_2} + c_3 p^{c_4} + n).