Measuring Search Relevance

How do we know when we’ve built software that has improved the search results returned by the eBay search engine? There are many ways we can understand this, and this blog post focuses on using human judges to assess the performance of the engine. In a future post, I’ll talk more about A/B testing, behavioral and business metrics we capture from the eBay site, and other ways of measuring search accuracy.

One of the most important ways we understand whether we’ve improved our search engine is by asking people what they think. We do this at large scale: we’ve asked human judges questions about our search results over two million times in the past year. We use this data to compute scores or metrics, and we use these metrics for a few purposes: first, they help us understand whether one search algorithm is better than another; second, they help us understand where a search algorithm doesn’t work and needs to be improved (for example, maybe it doesn’t work well in the collectibles category, or maybe it only works well for short one- or two-word queries); and, last, they help us understand what users want, since it’s great data for learning how our users think about searching on eBay. It’s super useful data, and it’s available to all eBay employees on our intranet site (http://qu, where QU stands for Query Understanding).

We refer to the process of gathering this data as relevance judgment, that is, collecting human judgments about the relevance of search results. The basic task goes like this: we present a judge with an eBay item and a search engine query, and we ask the judge to assess how relevant the item is to the query on a four-point scale. For example, suppose the query is ipod nano 16Gb and the item is a brand new Apple 6th generation iPod nano 16Gb. The judge might decide that this is a “great result” (which might be, say, our top rating on the four-point scale), click on a radio button to record their vote, and move on to the next task. If the item were a car that had an iPod adapter in it, the judge might decide this result is “irrelevant” (say, the lowest rating on the four-point scale). If it were a Samsung MP3 player, it might be “partially relevant” (say, the second-to-lowest rating), and if it were a used Apple iPod nano 2nd generation that’s missing its headphones, the judge might say “relevant” (it’s not perfect, but it sure is an Apple iPod).

The judgment process is subjective, and different people will make different choices. If the query is shoes and the item we show the judge is just one shoe from a pair, is it irrelevant, partially relevant, or maybe even relevant? There are many arguments you could make: a shoe isn’t shoes, so it’s irrelevant; a shoe is half of a pair of shoes, so it’s partially relevant; or maybe there’s a more elaborate argument that it’s relevant. I guess we’d all agree it isn’t a great result. There are two ways we address subjectivity: first, we ask multiple judges to assess the same results so we can average their scores; and, second, we judge thousands of queries, so that we can compute metrics and be statistically confident that the numbers we see represent true differences in performance between algorithms.

We do our judgment with crowdsourcing. We make our tasks publicly available, and pay independent people to perform the judgments. In many cases, the folks work from home, live in areas where work is scarce, or enjoy the flexibility of being able to do the work anytime. I love this part of human judgment: it’s real people, who need the money or the flexibility, and they’re helping us make eBay better. It does create some problems: some people try to game the system by writing software that randomly answers questions, or they answer fast and carelessly. We’re constantly working on detecting these problems, and on removing both the poor results and the unreliable judges from the system. To give you a flavor, one thing we do is inject questions where we already know what the relevance score should be, and then check that each judge answers most of those correctly. Another thing we do is look for judges who consistently answer differently from other judges on the same tasks.
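
To make that concrete, here’s a minimal sketch, in Python, of the kind of check that could be run over the injected questions. The function name, the data layout, and the 80% threshold are illustrative assumptions, not a description of our production system:

    # Sketch: flag judges whose accuracy on injected "gold" questions
    # (tasks with a known correct rating) falls below a chosen threshold.
    def flag_unreliable_judges(gold_answers, min_accuracy=0.8):
        """gold_answers maps judge_id -> list of (given_rating, expected_rating) pairs."""
        flagged = []
        for judge_id, answers in gold_answers.items():
            correct = sum(1 for given, expected in answers if given == expected)
            if correct / len(answers) < min_accuracy:
                flagged.append(judge_id)
        return flagged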

When we’ve got tens of answers for each query, and we’ve completed judging thousands of queries, we’re ready to compute a metric that allows us to compare algorithms. Our favorite is NDCG, or Normalized Discounted Cumulative Gain. It’s a mouthful, but it’s a common-sense measure. Suppose that on our four-point scale, we give a score of 0 for an irrelevant result, 1 for a partially relevant result, 2 for a relevant result, and 3 for a great result. Suppose also that a query is judged by one of our judges, and the first four results that the search engine returns are assessed as relevant, irrelevant, great, and relevant by the judge. The cumulative gain after four results is the sum of the scores for each result: 2 + 0 + 3 + 2 = 7.

Rank   Judgment (Gain)   Cumulative Gain
1      2                 2
2      0                 2
3      3                 5
4      2                 7
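
As a quick sketch of the arithmetic in Python (using the 0 to 3 gains above), cumulative gain is just a running sum over the judged results:

    # Cumulative gain: a running sum of the gains, in ranked order.
    # Gains for the example: relevant=2, irrelevant=0, great=3, relevant=2.
    gains = [2, 0, 3, 2]
    cumulative, total = [], 0
    for gain in gains:
        total += gain
        cumulative.append(total)
    print(cumulative)  # [2, 2, 5, 7]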

Now for the Discounted part of NDCG. We know that the first result in our search results is more important than the second, the second more important than the third, and so on. We know this because users click on result one much more than result two, and so on. Moreover, there’s plenty of research showing that users expect search engines to return great results at the top of the page, that they are unlikely to view results low on the page, and that they dislike pagination. The Discounted part of NDCG adds in this weighting: one simple way to make position one more important than position two (and so on) is to divide each score by its rank before summing. So, for example, if the third result is “great”, its contribution is 3 / 3 = 1 (since the score for “great” is 3, and the rank of the result is 3). If “great” were the first result, its contribution would be 3 / 1 = 3. In practice, the score is often divided by the log of the rank, which seems to better match how users perceive relevance. Anyway, for our example and to keep it simple, the Discounted Cumulative Gain (DCG) after four results is 2 / 1 + 0 / 2 + 3 / 3 + 2 / 4 = 3.5.

Rank   Judgment (Gain)   Discounted Gain   Discounted Cumulative Gain (DCG)
1      2                 2/1               2
2      0                 0/2               2
3      3                 3/3               3
4      2                 2/4               3.5
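
Here’s the same calculation with the positional discount applied, again as a small Python sketch. It shows both the simple 1/rank discount used in the example and a log-based discount like the one mentioned above (log2(rank + 1) is a common choice, though the exact form varies between implementations):

    import math

    # DCG with the simple 1/rank discount used in the worked example.
    def dcg_linear(gains):
        return sum(gain / rank for rank, gain in enumerate(gains, start=1))

    # A common log-based variant: discount each gain by log2(rank + 1).
    def dcg_log(gains):
        return sum(gain / math.log2(rank + 1) for rank, gain in enumerate(gains, start=1))

    print(dcg_linear([2, 0, 3, 2]))  # 3.5, matching the table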

The Normalized part of NDCG allows us to compare DCG values between different queries. It isn’t fair to compare DCG values across queries because some queries are easier than others: for example, maybe it’s easy to get four great results for the query ipod nano, and much harder to get four great results for 1968 Porsche 912 targa soft window. If the search engine gets a high score for the easy query and a poor score for the hard query, it doesn’t mean it’s worse at hard queries; it might just mean the queries have different degrees of difficulty. Normalization works like this: we figure out the best possible score given the results we’ve seen so far. In our previous example, we had 2, then 0, 3, and 2. The best arrangement of these same results would have been 3, 2, 2, 0, that is, the “great” result ranked first, followed by the two “relevant” ones, and then the “irrelevant” one. This best ranking would have a DCG score of 3 / 1 + 2 / 2 + 2 / 3 + 0 / 4 = 4.67. This is known as the “ideal DCG”, or iDCG. Our NDCG is the score we got (3.5) divided by the ideal DCG (4.67), or 3.5 / 4.67 = 0.75. Now we can compare scores across queries, since we’re comparing each query’s score to the best score possible for that query, not raw scores. NDCG has some drawbacks, but we’ll duck those for now.

Rank   Judgment (Gain)   Discounted Gain   DCG   Ideal Discounted Gain   Ideal DCG (iDCG)   NDCG
1      2                 2/1               2     3/1                     3.0                0.67
2      0                 0/2               2     2/2                     4.0                0.5
3      3                 3/3               3     2/3                     4.67               0.64
4      2                 2/4               3.5   0/4                     4.67               0.75
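
Putting the pieces together, here’s a minimal sketch of the normalization step (sticking with the simple 1/rank discount so the numbers match the worked example):

    # NDCG: the DCG of the ranking as returned, divided by the DCG of the ideal
    # (best possible) arrangement of the same judged results.
    def dcg(gains):
        return sum(gain / rank for rank, gain in enumerate(gains, start=1))

    def ndcg(gains):
        ideal = sorted(gains, reverse=True)  # best arrangement: 3, 2, 2, 0
        return dcg(gains) / dcg(ideal)

    print(round(ndcg([2, 0, 3, 2]), 2))  # 0.75, matching the table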

Once we’ve computed NDCG values for each query, we can average them across thousands of queries. We can now compare two algorithms: we take the mean NDCG for each, and check using a statistical test (such as a two-sided t-test) whether one algorithm is better than the other, and with what confidence. As I mentioned at the start, this is one important factor we consider when comparing two algorithms, along with a host of other metrics. As I’ve said in previous posts, I’m a huge fan of measuring in as many ways as possible, and making decisions with all the data at hand. It takes professional judgment to decide one algorithm is better than another, and that’s part of managing our search engineering effort.
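
As a sketch of that comparison step (the function name is mine, and the inputs are assumed to be lists of per-query NDCG values for the two algorithms), SciPy’s two-sided t-test could be used like this; if both algorithms are judged on the same query set, a paired test such as scipy.stats.ttest_rel would be another reasonable choice:

    from statistics import mean
    from scipy import stats

    # Compare two algorithms: average the per-query NDCG values for each, then
    # run a two-sided t-test on the two sets of per-query scores.
    def compare_algorithms(ndcg_a, ndcg_b):
        t_stat, p_value = stats.ttest_ind(ndcg_a, ndcg_b)  # two-sided by default
        return mean(ndcg_a), mean(ndcg_b), p_value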

Hope you found this useful – I’ll look forward to hearing your feedback and answering questions.

Hugh Williams
Vice President, Buyer Experience Engineering

20 thoughts on “Measuring Search Relevance”

  1. AuctionInsights

    Interesting post. If you find that different relevance sort algorithms are effective for different categories, can you and do you deploy those separate algorithms in the appropriate category? For example, algorithm X is evaluated as returning excellent results in the collectibles category but returns disappointing results in the fashion category. Algorithm Y does well in the fashion category while it bombs in the collectibles category. Do you deploy both algorithms in their respective categories, or settle for algorithm Z that does an adequate job everywhere? Thanks!

    1. Hugh Williams

      Thanks for the question. The answer is yes and no. We do sometimes adjust features (parts of the ranking function) on a per category basis, but we don’t have completely different ranking functions/algorithms for different parts of the site. It’s important to deliver a consistent experience for our customers, since most of our customers search in many different categories — and so we would typically not create very different search experiences. Cheers, Hugh.

  3. Tekgems

    Sellers are complaining that gtc is less advantageous than fp30. Is this true? If so, can this be fixed? It seems like gtc is to eBay’s benefit. I don’t like fp30, but if it will help with sales, I will use it again even though it wastes eBay resources.

    1. Mike Mathieson

      Hi Tekgems–thanks for your question. The purpose of this blog is to discuss the technology at eBay and selling strategies are generally outside the scope of that discussion. I’d suggest bringing this discussion up in the eBay community forums, which are actively monitored by eBay employees who can help you, such as the Seller Central forum (http://forums.ebay.com/db2/forum/Seller-Central/143).

  4. Avi Rappoport

    That’s fascinating, I had no idea that you use human judgments on these. It’s a lot like TREC testing, which I now know is harder than it looks. I like the nDCG as well, because the algorithm recognizes that there is often more than one right answer.

    Do you also do click-tracking analysis? Sometimes it’s more accurate than judges, because it’s not as influenced by self-awareness. Because you own all the data, you could take that tracking all the way to checkout, for popular items at least.

    1. Hugh Williams

      Hi Avi,

      Thanks for the comment. Yes, we collect many different metrics related to user behavior. As I mentioned in the post, I believe that having many different perspectives on our systems is key to making the right decisions.

      Cheers, Hugh.

  6. Antonio

    Interesting post. I wonder whether the external judges have a set of standard recommendations for expressing a vote. This should include freshness of the auction, authoritativeness of the seller, presence of an image, and so on and so forth. The trick is that you probably need to maintain those recommendations separately from the ranking features. I am not sure whether you have this problem or not.

  7. Hugh Williams

    Hi Antonio,

    Thanks for the comment. Training judges is important, but so is smoothing out the variation by having many judges judge the same query/item pair. I have a personal worry about overtraining judges — most people have used eBay, and have a subjective opinion about what is and isn’t a relevant item to a query. I believe it’s important to capture those nuances and understand them — rather than capture the nuances in our judging guidelines or our subjective opinion on what the judges should be doing. Hope that helps.

    Cheers, Hugh.

  10. Xuemei Liu

    Great introduction to NDCG; I learned a lot from it. I have two questions.
    1.
    “So, for example, if the third result is “great”, its contribution is 3 / 3 = 1 (since the score for “great” is 3, and the rank of the result is 3). If “great” were the first result, its contribution would be 3 / 1 = 3. In practice, the score is often divided by the log of the rank, which seems to better match how users perceive relevance.”
    Why does dividing the score by the log of the rank better match how users perceive relevance?
    2.
    “Once we’ve computed NDCG values for each query, we can average them across thousands of queries.”
    Different queries have different levels of popularity on eBay, so would it be better to give more weight to hot queries when computing the average NDCG?

  11. Maggie

    When it says, “Our NDCG is the score we got (3.5) divided by the best score we could have obtained (4.67) or 3/4.67.” is that a typo, or is there a reason we used 3 instead of 3.5?

    Also, for the NDCG calculation in the table, can someone please explain the reasoning for using 2/3, 2/(3+3/2), and so on? Are those numbers added together? That doesn’t make sense, because it would exceed .64.

    Thank you!

    1. Jon Degenhardt

      Hi Maggie,

      Thanks for the comment. You are correct, it should be 3.5/4.67. Thanks for catching this, it has been fixed in the post. Regarding the NDCG calculation, both the numerator and denominator are sums. This may have been confusing due to a second typo and the way the calculation was presented in the table. The post has been updated, hopefully it is clearer now.

      Jon Degenhardt
      eBay Search Science

  13. Hanieh

    Great post.
    My question is: nDCG at rank 1 (when only one item is retrieved) is always equal to 1, right?

  14. Harsha

    Thank you for the post. My understanding is the following:
    When you do normalization, you assume that the best possible DCG for the query is obtained by a different arrangement of the returned search results. Is this always true? This is possible only when there is no restriction on the length of the search results. In the post, the length of the search results is 4. Could you please elaborate?

    1. Harsha

      To illustrate my question, I would like to take an example. Let us say I am evaluating two algorithms, A and B, for one particular query. Let us say that Algorithm A returns 4 partially relevant results (1,1,1,1) for the query, so the NDCG is 1. Let us say that Algorithm B returns 3 perfect results and 1 relevant result in the following order (3,2,3,3) for the same query. The NDCG is less than 1. Clearly, Algorithm B is superior to Algorithm A, but NDCG does not reflect this.
