<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>eBay Tech Blog</title>
	<atom:link href="http://www.ebaytechblog.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.ebaytechblog.com</link>
	<description>Where e-commerce meets real-time search</description>
	<lastBuildDate>Mon, 16 Apr 2012 18:17:50 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Five JavaScript Tips for a Sleek User Experience</title>
		<link>http://www.ebaytechblog.com/2012/04/16/five-javascript-tips-for-a-sleek-user-experience/</link>
		<comments>http://www.ebaytechblog.com/2012/04/16/five-javascript-tips-for-a-sleek-user-experience/#comments</comments>
		<pubDate>Mon, 16 Apr 2012 18:07:42 +0000</pubDate>
		<dc:creator>Senthil Padmanabhan</dc:creator>
				<category><![CDATA[Software Engineering]]></category>
		<category><![CDATA[CSS3]]></category>
		<category><![CDATA[eBay]]></category>
		<category><![CDATA[Frontend]]></category>
		<category><![CDATA[HTML5]]></category>
		<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[Tips]]></category>
		<category><![CDATA[User Experience]]></category>

		<guid isPermaLink="false">http://www.ebaytechblog.com/?p=1183</guid>
		<description><![CDATA[The eBay Motors engineering team took up an initiative to revisit some of the legacy JavaScript code shared across various pages, optimizing them to leverage the latest advancements in HTML5. Our main focus areas were user interactions and animations in which the old JavaScript code was lacking in performance and sturdiness. When we completed the [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>The <a title="eBay Motors" href="http://www.ebay.com/motors" target="_blank">eBay Motors</a> engineering team took up an initiative to revisit some of the legacy JavaScript code shared across various pages, optimizing them to leverage the latest advancements in <a title="HTML5" href="http://en.wikipedia.org/wiki/HTML5" target="_blank">HTML5</a>. Our main focus areas were user interactions and animations in which the old JavaScript code was lacking in performance and sturdiness. When we completed the exercise and demoed the upgraded experience to our product folks, the feedback we got was &#8220;<strong>SLEEK</strong>&#8221;. This post highlights five of those JavaScript techniques that we think made a difference.</p>
<p style="color: #000000;"><span style="color: #d52a33;"><strong>1. </strong></span><span style="color: #d52a33;"><strong>requestAnimationFrame </strong><em>over</em><strong> setInterval</strong></span><strong>:</strong><strong> </strong>Using the new <a title="requestAnimationFrame" href="https://developer.mozilla.org/en/DOM/window.requestAnimationFrame" target="_blank">requestAnimationFrame</a> API for building JavaScript-based animations is a more optimized and efficient approach than the traditional timers and intervals. To quickly summarize, this API lets the browser decide when to apply the next UI/DOM style changes, rather than having the developer decide when to repaint the screen. We grepped our code base and replaced all applicable occurrences of <a title="window.setTimeout" href="https://developer.mozilla.org/en/DOM/window.setTimeout" target="_blank">setTimeout</a> and <a title="window.setInterval" href="https://developer.mozilla.org/en/DOM/window.setInterval" target="_blank">setInterval</a> with requestAnimationFrame; as a result, the animations were a lot smoother. This approach also saves battery life when content is viewed on mobile devices. As expected, the requestAnimationFrame API is not supported in all browsers, so we used the <a title="requestAnimationFrame Polyfill" href="http://paulirish.com/2011/requestanimationframe-for-smart-animating/" target="_blank">polyfill</a> provided by <a title="Paul Irish" href="https://twitter.com/#!/paul_irish" target="_blank">Paul Irish</a>, as shown below.</p>
<pre>
    // shim layer with setTimeout fallback
    window.requestAnimFrame = (function(w){
      return  w.requestAnimationFrame       ||
              w.webkitRequestAnimationFrame ||
              w.mozRequestAnimationFrame    ||
              w.oRequestAnimationFrame      ||
              w.msRequestAnimationFrame     ||
              function( callback ){
                w.setTimeout(callback, 1000 / 60);
              };
    })(window);
</pre>
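<p>A minimal sketch of how such a shim might drive an animation loop. The <code>animate</code> helper and its injectable <code>schedule</code> and <code>now</code> parameters are hypothetical names introduced for illustration; in a browser they would default to the <code>requestAnimFrame</code> shim above and to <code>Date.now</code>.</p>

```javascript
// Hypothetical helper: moves a value from `from` to `to` over `duration` ms,
// calling onFrame(value) once per scheduled frame.
function animate(from, to, duration, onFrame, schedule, now) {
  // Assumption: window.requestAnimFrame is the shim defined above.
  schedule = schedule || function (callback) {
    window.requestAnimFrame(callback);
  };
  now = now || Date.now;

  var start = now();

  function step() {
    // Progress is time-based, so the speed does not depend on the frame rate.
    var progress = Math.min((now() - start) / duration, 1);
    onFrame(from + (to - from) * progress);
    if (progress < 1) {
      schedule(step); // ask for the next frame only while still animating
    }
  }

  schedule(step);
}
```

<p>In a page, a call such as <code>animate(0, 300, 400, function (x) { box.style.left = x + "px"; })</code> would slide a (hypothetical) <code>box</code> element over 400 ms.</p>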
<p><span style="color: #d52a33;"><strong>2. insertAdjacentHTML </strong><em>over</em><strong> innerHTML</strong></span><strong>:</strong><strong> </strong> The <a title="insertAdjacentHTML" href="https://developer.mozilla.org/en/DOM/element.insertAdjacentHTML" target="_blank">insertAdjacentHTML</a> API is a fine-grained and optimized version of the super-popular <a title="innerHTML" href="https://developer.mozilla.org/en/DOM/element.innerHTML" target="_blank">innerHTML</a>. Since we specify the insert position, insertAdjacentHTML does not re-parse the element it is being used on and avoids the extra step of serialization, making it much faster than direct innerHTML manipulation. This approach is very effective in scenarios where we keep appending markup to a page or module, such as for the daily deals feed and for endless scroll. A simple JSPerf <a title="JSPerf Performance Test Result" href="http://jsperf.com/insertadjacenthtml-perf" target="_blank">test result</a> shows that insertAdjacentHTML is 100% faster than innerHTML. Surprisingly, the browser support for this API has been there for a very long time (Firefox started supporting it in version 8), and the helper function is pretty straightforward:</p>
<pre>
    // helper function to append content for a given element
    var appendContent = function(){
        // Closure to hold the insertAdjacentHTML API support
        var insertAdjacentSupported = document.createElement('div').insertAdjacentHTML;

        return function(elem, content) {
            if(insertAdjacentSupported) {
                elem.insertAdjacentHTML('beforeend', content);
            } else {
                elem.innerHTML = elem.innerHTML + content;
            }
        };
    }();
</pre>
<p><span style="color: #d52a33;"><strong>3. if-else </strong><em>over</em><strong> try-catch</strong></span><strong>:</strong><strong> </strong> Try-catch blocks provide an efficient mechanism to handle exceptions and unforeseen runtime errors. However, they impose a performance penalty in JavaScript, especially when used in iterations or recursive functions. More details can be found at <a title="Try-catch performance issue 1" href="http://dev.opera.com/articles/view/efficient-javascript/?page=2#trycatch" target="_blank">dev.opera.com</a> and <a title="Try-catch performance issue 2" href="http://oreilly.com/server-administration/excerpts/even-faster-websites/writing-efficient-javascript.html" target="_blank">O&#8217;Reilly&#8217;s site</a>. To address this issue, we scanned our code base, particularly looking for performance-critical functions. We found a few occurrences where a try-catch was inside a long-running iterator loop, and also where the probability of the thread entering the catch block was high. The try-catch was replaced with simple if-else conditions, which took care of all error handling, and the resulting code was much more efficient.</p>
<p><strong>Using try-catch:</strong></p>
<pre>
    // jsonResponse comes from some web service
    var i, l = jsonResponse.length, item, offer, binPrice, bidPrice;
    for(i = 0; i < l; i++) {
        item = jsonResponse[i];
        // Some code
        // ...
        try {
            offer = item.offer;
            binPrice = offer.bin;
            bidPrice = offer.bid;
        } catch(e) {
            offer = {err: "Offer not found"};
        }
        // Some more code
        // ...
    }
</pre>
<p><strong>Using if-else:</strong></p>
<pre>
    // jsonResponse comes from some web service
    var i, l = jsonResponse.length, item, offer, binPrice, bidPrice;
    for(i = 0; i < l; i++) {
        item = jsonResponse[i];
        // Some code
        // ...
        offer = item.offer;
        if(offer) {
            binPrice = offer.bin;
            bidPrice = offer.bid;
        } else {
            offer = {err: "Offer not found"};
        }
        // Some more code
        // ...
    }
</pre>
<p><span style="color: #d52a33;"><strong></strong><strong>4. XMLHttpRequestUpload </strong><em>over</em><strong> server polling</strong></span><strong>:</strong> Using browsers to upload files (photos, PDFs, other documents, etc.) has become a very common use case, and one of our <a title="eBay my Vehicles" href="http://www.ebay.com/motors/garage" target="_blank">applications</a> had this requirement. To keep users informed during the upload process, we wanted to show a real-time progress meter with accurate percentages. One (non-Flash) way of simulating this AJAX-based upload is to create a hidden iFrame and submit the main form, which holds the <a title="HTML input file type" href="http://www.w3.org/TR/html401/interact/forms.html#file-select" target="_blank">input file</a> element targeted to the iFrame. Following the form submit, the server has to be polled periodically to retrieve the upload percentage for the progress meter. Not only is this approach a hack, but it also consumes huge amounts of server and client resources. <br />As a savior, the <a title="XMLHttpRequest" href="https://developer.mozilla.org/en/XMLHttpRequest" target="_blank">XMLHttpRequest</a> object in modern browsers has the capability of uploading files (as byte streams) using the <a title="XMLHttpRequest send method" href="https://developer.mozilla.org/en/DOM/XMLHttpRequest#send%28%29" target="_blank">send</a> method, and also has this amazing <a title="XMLHttpRequest upload attribute" href="http://www.w3.org/TR/XMLHttpRequest/#the-upload-attribute" target="_blank">XMLHttpRequestUpload</a> attribute. The upload attribute has a couple of associated events, the most important being the <a title="XMLHttpRequest Upload Progress Event" href="https://developer.mozilla.org/En/XMLHttpRequest/Using_XMLHttpRequest#Monitoring_progress" target="_blank">progress</a> event.
The progress event handler receives the total number of bytes to transfer and the number of bytes transferred so far, from the event's total and loaded fields. With this information, end users receive updates of real-time progress in the most efficient and sturdy way. Here is a quick preview of the API usage:</p>
<pre>
    var uploadAJAX = function(serverURL) {
        var xhr = new XMLHttpRequest(),
            fileElement = document.getElementById("file"), // the file input element
            fileObj = fileElement.files[0], // For demo, taking only the first file in the file list
            progressMeter = getProgressMeterComponent(); // retrieve the progress meter UI component

        // Progress events fire on xhr.upload while the request body is being sent
        xhr.upload.onprogress = function(e){
            if (e.lengthComputable){
                var percentComplete = Math.round(e.loaded / e.total * 100);
                progressMeter.update(percentComplete); // Updates the progress meter UI component with the given percentage
            }
        };

        xhr.open("POST", serverURL, true);
        xhr.setRequestHeader("X-Requested-With", "XMLHttpRequest");
        xhr.setRequestHeader("X-File-Name", encodeURIComponent(fileObj.name));
        xhr.setRequestHeader("Content-Type", "application/octet-stream");
        xhr.send(fileObj); // a File object can be passed to send() directly
    };
</pre>
<p>We have open-sourced the HTML5 image uploader utility on <a title="HTML5 based image uploader" href="https://github.com/senthilp/picman" target="_blank">GitHub</a>. For browsers that do not support this feature, the application falls back to the hidden iFrame approach.</p>
<p><span style="color: #d52a33;"><strong>5. CSS3 </strong><em>over</em><strong> JavaScript</strong></span><strong>:</strong><strong> </strong> The final optimization is to AVOID JavaScript as much as possible for animations (JavaScript is great but...) and to leverage the modern <a title="CSS3" href="http://www.css3.info/" target="_blank">CSS3</a>-based <a title="CSS3 Transitions " href="https://developer.mozilla.org/en/CSS/CSS_transitions" target="_blank">transitions</a>. The new-age CSS comes with a ton of great <a title="Animatable CSS Properties" href="http://oli.jp/2010/css-animatable-properties/" target="_blank">animatable</a> properties that can replace most of the basic animations currently implemented in JavaScript. The main advantages are ease of use, the browser doing most of the work, the leveraging of machine <a title="Hardware accelerated CSS" href="http://www.html5rocks.com/en/tutorials/speed/html5/" target="_blank">hardware</a> if necessary, and above all an elegant and smooth visual touch to the transition (which is nearly impossible to achieve with JavaScript). 
We started changing our animations from simple tab switches (<a title="eBay Motors Homepage" href="http://www.ebay.com/motors" target="_blank">check it out</a>) to complex 3D carousels (check out the <a title="3D Carousel" href="http://www.ebay.com/autos/c3d?title=2010%20Bugatti%20Veyron&amp;h=300&amp;w=400&amp;urls=http://i.ebayimg.com/00/s/NjAwWDgwMA==/$%28KGrHqEOKkME5UYVTN!yBOd-BlQ5F!~~60_8.JPG%20http://i.ebayimg.com/00/s/NjAwWDgwMA==/$%28KGrHqUOKi8E5!KRByQEBOd-BlQwq!~~60_8.JPG%20http://i.ebayimg.com/00/s/NTM0WDgwMA==/$%28KGrHqYOKj!E5W9!lpC!BOd-BlL%29Yg~~60_8.JPG%20http://i.ebayimg.com/00/s/NjAwWDgwMA==/$%28KGrHqMOKpQE5U-snNn7BOd-BlRcbg~~60_8.JPG%20%20http://i.ebayimg.com/00/s/NTM0WDgwMA==/$%28KGrHqIOKjYE5qmUm627BOd-Bptrlg~~60_8.JPG%20http://i.ebayimg.com/00/s/NjAwWDgwMA==/$%28KGrHqYOKioE5ezELRDtBOd-Bmg-6!~~60_8.JPG%20http://i.ebayimg.com/00/s/NjAwWDgwMA==/$%28KGrHqYOKkQE5VFdmcQjBOd-BnYzU!~~60_8.JPG%20http://i.ebayimg.com/00/s/NjAwWDgwMA==/$%28KGrHqEOKnIE5t!Gmc1hBOd-BnJeug~~60_8.JPG%20http://i.ebayimg.com/00/s/NjAwWDgwMA==/$%28KGrHqEOKnIE5t!Gmc1hBOd-BnJeug~~60_8.JPG%20http://i.ebayimg.com/00/s/NjAwWDgwMA==/$%28KGrHqIOKo8E5VLzPRQYBOd-BoN80Q~~60_8.JPG" target="_blank">beta</a> version) - all with CSS, which netted a great user experience. For older browsers, we just stopped doing animations.</p>
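<p>One way to decide between a CSS transition and no animation at all is a small feature test. The sketch below is an assumption about how such a check could look, not code from our pages; the <code>transitionProperty</code> name and the list of vendor-prefixed property names are illustrative.</p>

```javascript
// Returns the name of the transition property found on a style object,
// or null when none of the known spellings is present.
function transitionProperty(style) {
  var candidates = [
    "transition",
    "WebkitTransition",
    "MozTransition",
    "OTransition",
    "msTransition"
  ];
  for (var i = 0; i < candidates.length; i++) {
    if (style[candidates[i]] !== undefined) {
      return candidates[i];
    }
  }
  return null;
}
```

<p>In a browser, the check could be run once as <code>transitionProperty(document.documentElement.style)</code>; when it returns <code>null</code>, the page would switch states immediately instead of animating.</p>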
<p>This entire re-engineering process also helped us develop a workflow for similar upgrades, thus enabling our code base to iterate at the same speed as do browser innovations. We can always hope that the need to use <a title="Polyfills" href="https://github.com/Modernizr/Modernizr/wiki/HTML5-Cross-browser-Polyfills" target="_blank">polyfills</a> will be reduced in the near future.</p>
<p><a title="Senthil Padmanabhan Twitter Profile" href="https://twitter.com/#!/senthil_hi" rel="author" target="_blank">Senthil Padmanabhan</a> <br /> Engineer @ eBay</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ebaytechblog.com/2012/04/16/five-javascript-tips-for-a-sleek-user-experience/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>German Compound Words</title>
		<link>http://www.ebaytechblog.com/2012/03/12/german-compound-words/</link>
		<comments>http://www.ebaytechblog.com/2012/03/12/german-compound-words/#comments</comments>
		<pubDate>Tue, 13 Mar 2012 00:36:17 +0000</pubDate>
		<dc:creator>Brian Johnson</dc:creator>
				<category><![CDATA[Search Science]]></category>

		<guid isPermaLink="false">http://www.ebaytechblog.com/?p=1138</guid>
		<description><![CDATA[Mark Twain is quoted as having said, “Some German words are so long that they have a perspective.” Although eBay users are unlikely to search using fearsome beasts like “rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz”, which stands for the “beef labeling supervision duties delegation law”, we do frequently see compound words in our users’ queries. While some might look for [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>Mark Twain is quoted as having said, “Some German words are so long that they have a perspective.” </p>
<p>Although eBay users are unlikely to search using fearsome beasts like <i>“rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz”</i>, which stands for the <i>“beef labeling supervision duties delegation law”</i>, we do frequently see compound words in our users’ queries. While some might look for “<a href="http://www.ebay.de/sch/i.html?_nkw=damenlederhose">damenlederhose</a>”, others might be searching for the same thing (women’s leather pants) using the decompounded forms “<a href="http://www.ebay.de/sch/i.html?_nkw=damen%20lederhose">damen lederhose</a>” or “<a href="http://www.ebay.de/sch/i.html?_nkw=damen%20leder%20hose">damen leder hose</a>”. And even though a German teacher would tell you only “damenlederhose” or “damen lederhose” are correct, the users’ expectation is to see the same results regardless of which form is used.</p>
<p>This scenario exists on the seller side as well. That is, people selling on eBay might describe their item using one or more of these forms. In such cases, what should a search engine do? While the problem might seem simple at first, German word-breaking – or decompounding, as it is also known – is not so simple.</p>
<p><strong>How to find compound synonyms?</strong></p>
<p>Writing a program to figure out how to break a German compound word for one of the world’s largest online marketplaces poses challenges:</p>
<ul>
<li>The word could have been created using a product name, and new product names can appear all the time.</li>
<li>Syntactically, the word can be split into many valid components, but not all splits are useful or make sense. For instance, “aktivkohlefiltermagnetventil”, meaning “active carbon filter magnetic valve” (a valve used in water treatment plants), can be split in at least three different ways, all of which consist of valid dictionary words:
<ul>
<li>aktiv (active) kohlefilter (carbon filter) magnetventil (magnetic valve) or</li>
<li>aktiv (active) kohle (carbon) filter (filter) magnetventil (magnetic valve) or</li>
<li>aktiv (active) kohle (carbon) filter (filter) magnet (magnetic) ventil (valve)</li>
</ul>
<p>However, the only acceptable split that is synonymous with the compound word “<a href="http://www.ebay.de/sch/i.html?_nkw=aktivkohlefiltermagnetventil">aktivkohlefiltermagnetventil</a>” is “<a href="http://www.ebay.de/sch/i.html?_nkw=aktivkohlefilter%20%C2%A0magnetventil">aktivkohlefilter magnetventil</a>”.</p></li>
<li>Some strings look like they might be compound words, but in fact are not. For instance, a tokenizer might look at the word “beiden” (“both”) and decide it’s made up of the two separate words “bei” (“at”) and “den” (“the”), but that would be incorrect.</li>
<li>There is also the problem of how small or large each of the splits can be. Consider the three forms below; which is most useful?
<ul>
<li><a href="http://www.ebay.de/sch/i.html?_nkw=granitpflastersteine">granitpflastersteine</a> (granite paving stones)</li>
<li><a href="http://www.ebay.de/sch/i.html?_nkw=granit%20pflastersteine">granit pflastersteine</a> (granite cobblestones)</li>
<li><a href="http://www.ebay.de/sch/i.html?_nkw=granit%20pflaster%20steine">granit pflaster steine</a> (granite paving stones)</li>
</ul>
</li>
<li>A smaller concern is the cost of processing possible splits online, i.e., after the user types in her search query.</li>
</ul>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2012/03/granitpflastersteine.png"><img class="size-full wp-image-1140 alignnone" title="granitpflastersteine" src="http://www.ebaytechblog.com/wp-content/uploads/2012/03/granitpflastersteine.png" alt="" width="620" height="403" /></a></p>
<p>Dictionary-based approaches tend to split German compound words into all possible morphologically valid forms. The best splits are hard to determine algorithmically.</p>
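<p>To illustrate why a purely dictionary-based splitter over-generates, here is a minimal sketch that enumerates every way to write a word as a sequence of dictionary entries. The tiny <code>dictionary</code> and the <code>allSplits</code> function are hypothetical and for illustration only; this is not the approach described in this post.</p>

```javascript
// Hypothetical mini-dictionary; a real one would hold many thousands of entries.
var dictionary = ["aktiv", "kohle", "filter", "magnet", "ventil",
                  "kohlefilter", "magnetventil", "aktivkohlefilter"];

// Returns every way to write `word` as a sequence of dictionary entries.
function allSplits(word, dict) {
  if (word === "") {
    return [[]]; // exactly one way to split the empty remainder
  }
  var results = [];
  dict.forEach(function (entry) {
    if (word.indexOf(entry) === 0) {
      // `entry` is a prefix: split the remainder recursively.
      allSplits(word.slice(entry.length), dict).forEach(function (rest) {
        results.push([entry].concat(rest));
      });
    }
  });
  return results;
}

var splits = allSplits("aktivkohlefiltermagnetventil", dictionary).map(function (parts) {
  return parts.join(" ");
});
```

<p>With this eight-word dictionary the splitter already returns six candidates, including “aktivkohlefilter magnetventil”, and nothing in the dictionary itself says which one users actually mean.</p>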
<p><strong>Wisdom of the crowds</strong></p>
<p>Faced with these challenges, we turned to our users for help. We can learn a lot simply by watching millions of German users listing millions of items and performing hundreds of millions of searches. Essentially, whenever users changed from a compound to a decompounded form of a query (or vice versa), we asked: did their subsequent actions on our site indicate that they were pleased with the change? First, we mined our logs to find instances where users changed their search terms by breaking up or joining adjacent terms. For instance, people typed in “damen schuhe” and then changed it to “damenschuhe”. We analyzed how many and what kind of results we were showing for either search, and how those users liked those results (based on whether they clicked on any of those results). Based on how frequently users were making such changes in their search terms, we were able to collect millions of candidate compound words and their decompounded forms. We did the same for item titles, as sellers often include multiple forms of compound words in order to make sure buyers can find their items. To ensure a good user experience, we then filtered out words without relevant inventory, model numbers, and single-character splits. This whittled the set of synonymous compound word pairs down to a few hundred thousand.</p>
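<p>The first mining step, spotting a query that was rewritten only by joining or breaking up adjacent terms, can be sketched roughly as follows. The <code>isCompoundRewrite</code> function is a hypothetical simplification; the filtering on clicks, inventory, and model numbers described above is not shown.</p>

```javascript
// True when two consecutive queries differ only in where the spaces are,
// e.g. "damen schuhe" followed by "damenschuhe" (or the other way around).
function isCompoundRewrite(before, after) {
  var a = before.trim().toLowerCase();
  var b = after.trim().toLowerCase();
  if (a === b) {
    return false; // identical queries are not a rewrite
  }
  // Removing all whitespace must leave the same string on both sides.
  return a.replace(/\s+/g, "") === b.replace(/\s+/g, "");
}
```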
<p>Since we now had a large number of compound words and their decompounded forms, we were able to determine if a user’s query matched items listed with any of its compounded or decompounded forms. Since the compound words are precomputed, the lookups are blazingly fast. Further, the list of words is based on real-life user behavior data, which allows us to use the most “sensible” compounds/decompounds.</p>
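<p>As a rough sketch of why precomputed forms make lookups cheap, the table can be thought of as a map from each form to its group of equivalent forms, so that expanding a query is a single key lookup. The <code>buildIndex</code> and <code>expandQuery</code> names and the sample group are assumptions made for illustration, not a description of the production search engine.</p>

```javascript
// Builds a lookup table in which every form points at its whole group.
function buildIndex(groups) {
  var index = Object.create(null);
  groups.forEach(function (forms) {
    forms.forEach(function (form) {
      index[form] = forms;
    });
  });
  return index;
}

// Returns all equivalent forms of a query, or just the query itself
// when it is not a known compound or decompounded form.
function expandQuery(index, query) {
  var key = query.trim().toLowerCase();
  return index[key] || [key];
}

// Hypothetical sample data taken from the example earlier in this post.
var index = buildIndex([
  ["damenlederhose", "damen lederhose", "damen leder hose"]
]);
```

<p>Here <code>expandQuery(index, "damen lederhose")</code> returns all three forms, while an unknown query is passed through unchanged.</p>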
<p><strong>The bottom line</strong></p>
<p>Being able to find items listed with all common compounded/decompounded word forms is valuable, and that value is easy to measure. We simply split our user population in two: we provide the existing search experience to some users, and the new experience to the others. The new compound-word synonym experience, although it went largely unnoticed by most users, was indeed an improvement: it enabled buyers to find – and subsequently buy – more of the items they were looking for. This was measurable in terms of direct revenue. In this case, buyer, seller, and eBay interests are well aligned. Buyers bought more, sellers sold more, and eBay revenue increased.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2012/03/diamantring.png"><img class="alignleft size-full wp-image-1141" title="diamantring" src="http://www.ebaytechblog.com/wp-content/uploads/2012/03/diamantring.png" alt="" width="620" height="609" /></a></p>
<p><i>Brian Johnson, Prathyusha Senthil Kumar, Ashok Mallya</i></p>
]]></content:encoded>
			<wfw:commentRss>http://www.ebaytechblog.com/2012/03/12/german-compound-words/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>BitTorrent for Package Distribution in the Enterprise</title>
		<link>http://www.ebaytechblog.com/2012/01/31/bittorrent-for-package-distribution-in-the-enterprise/</link>
		<comments>http://www.ebaytechblog.com/2012/01/31/bittorrent-for-package-distribution-in-the-enterprise/#comments</comments>
		<pubDate>Tue, 31 Jan 2012 17:00:30 +0000</pubDate>
		<dc:creator>Mahesh Somani</dc:creator>
				<category><![CDATA[Software Engineering]]></category>
		<category><![CDATA[BitTornado]]></category>
		<category><![CDATA[BitTorrent]]></category>
		<category><![CDATA[data transfer]]></category>
		<category><![CDATA[deployment]]></category>
		<category><![CDATA[GetRight]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HDFS seeding]]></category>
		<category><![CDATA[P2P]]></category>
		<category><![CDATA[rsync]]></category>
		<category><![CDATA[web seeding]]></category>
		<category><![CDATA[zsync]]></category>

		<guid isPermaLink="false">http://www.ebaytechblog.com/?p=1099</guid>
		<description><![CDATA[eBay has been exploring the potential benefits of incorporating peer-to-peer transport technologies into our network topology. We&#8217;ve begun to use BitTorrent for index distribution, and we&#8217;ll soon use it for build distribution as well. In this post, we discuss our experience with BitTorrent and the problem of package distribution within the enterprise environment. Typical enterprise [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>eBay has been exploring the potential benefits of incorporating peer-to-peer transport technologies into our network topology. We&#8217;ve begun to use BitTorrent for index distribution, and we&#8217;ll soon use it for build distribution as well. In this post, we discuss our experience with BitTorrent and the problem of package distribution within the enterprise environment. Typical enterprise requirements are for replicating packages of medium to large size (a few megabytes to several gigabytes) across a large number of target nodes, with high transfer reliability and often within stringent timelines. Use cases having such requirements include the following:</p>
<ul>
<li>Rollout of versioned software packages</li>
<li>Fan-out of data packages across distributed grids for downstream processing</li>
</ul>
<h2>About BitTorrent</h2>
<p>BitTorrent (BT) is a popular peer-to-peer (P2P) transport protocol that is commonly used across the Internet to transfer packages such as audio files, video files, and data sets. According to a comprehensive <a href="http://www.ipoque.com/sites/default/files/mediafiles/documents/internet-study-2008-2009.pdf">2008-2009 study of Internet traffic</a> in eight regions of the world, BitTorrent traffic represents not only the lion’s share of all P2P traffic, but also a significant portion (20-30% or more) of all Internet traffic. BT works well due to the distributed nature of the transfers among the nodes; since a single server, or a few servers, do not need to transfer to all nodes across the grid, BT is efficient for distributing large packages to a large number of nodes.</p>
<p>BT splits the packages into pieces that are shared between the peers. A tracker maintains and provides a list of peers that are interested in a particular package. A peer first acts as a client. Once it receives a full piece, it acts as a server for that piece. The peer discovers other peers via one or more trackers. Information about trackers, pieces, and their checksums is stored within a “.torrent” file, which is typically a small fraction of the overall package size. Before BT transfers can be initiated, the “.torrent” file is distributed to the peers using HTTP, SCP, or other client/server means. The BT specification and protocol details are described at <a href="http://wiki.theory.org/index.php/BitTorrentSpecification">theory.org</a>.</p>
<p>The P2P technique, particularly the BitTorrent protocol, is well studied. Two key strategies for efficient BT content replication have been found to be piece selection and peer selection (see the 2006 paper, <a href="http://conferences.sigcomm.org/imc/2006/papers/p20-legout.pdf">Rarest First and Choke Algorithms Are Enough</a>). BT uses rarest first as the piece selection strategy, and the choke algorithm as the peer selection strategy. The rarest first strategy, as the name suggests, involves peers targeting the rarest piece for transfer. The choke strategy is based on fairness, speed, and reciprocation of uploads.</p>
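<p>The idea behind rarest first can be sketched in a few lines. The <code>pickRarestPiece</code> function and its boolean-array representation of which pieces each peer holds are assumptions made for illustration; real clients typically also break ties at random and handle the first and last pieces specially, which this sketch omits.</p>

```javascript
// have:          array of booleans, true where we already hold the piece
// peerBitfields: one boolean array per connected peer, same length as `have`
// Returns the index of the missing piece held by the fewest peers,
// or -1 when nothing we still need is available.
function pickRarestPiece(have, peerBitfields) {
  var best = -1;
  var bestCount = Infinity;
  for (var piece = 0; piece < have.length; piece++) {
    if (have[piece]) {
      continue; // already downloaded
    }
    var count = 0;
    for (var p = 0; p < peerBitfields.length; p++) {
      if (peerBitfields[p][piece]) {
        count++;
      }
    }
    // Skip pieces nobody can serve; otherwise prefer the scarcest one.
    if (count > 0) {
      if (count < bestCount) {
        best = piece;
        bestCount = count;
      }
    }
  }
  return best;
}
```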
<p>The content distribution problem has also been studied from a theoretical perspective (see <a href="http://www.statslab.cam.ac.uk/%7Errw1/research/MunWebWei06JoS_final.pdf">Optimal Scheduling of Peer-to-Peer File Dissemination</a>). Optimal distribution choices, comparing performance using a client/server mechanism versus P2P transfers, are described below.</p>
<table border="1">
<tbody>
<tr>
<td><strong>Protocol</strong></td>
<td><strong>Transfer time</strong></td>
</tr>
<tr>
<td>Client/server</td>
<td>t * N</td>
</tr>
<tr>
<td>P2P with a single piece</td>
<td>t + t * log<sub>2</sub>(N)</td>
</tr>
<tr>
<td>P2P with multiple pieces</td>
<td>t + t * log<sub>2</sub>(N) / R</td>
</tr>
</tbody>
</table>
<p><em>where:</em></p>
<p>N = Number of nodes <br />R = Number of pieces that the package is split into <br />t = Time to transfer a complete package from one node to another</p>
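<p>To get a feel for these formulas, the following sketch evaluates all three for a given <em>t</em>, <em>N</em>, and <em>R</em>. The <code>transferTimes</code> function and the sample numbers are illustrative assumptions, not measurements.</p>

```javascript
// Evaluates the three transfer-time formulas from the table above.
// t: time to copy the whole package between two nodes
// N: number of nodes, R: number of pieces
function transferTimes(t, N, R) {
  var log2N = Math.log(N) / Math.LN2;
  return {
    clientServer: t * N,
    p2pSinglePiece: t + t * log2N,
    p2pMultiPiece: t + t * log2N / R
  };
}

// Hypothetical example: a package that takes 60 seconds to copy once,
// distributed to 1024 nodes in 100 pieces.
var example = transferTimes(60, 1024, 100);
```

<p>For these sample values the client/server time is 61,440 seconds, the single-piece P2P time is about 660 seconds, and the multi-piece P2P time is about 66 seconds, only slightly more than a single copy.</p>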
<h2>BitTorrent within the enterprise environment</h2>
<p>BitTorrent has been used to a limited extent in large-scale web services to distribute files. Facebook uses it for <a href="http://torrentfreak.com/facebook-uses-bittorrent-and-they-love-it-100625/">file sharing</a>, and Twitter for <a href="http://engineering.twitter.com/2010/07/murder-fast-datacenter-code-deploys.html">code deployments</a>.</p>
<p>The BitTorrent protocol is designed for Internet scale with slow, unreliable, far-away and greedy clients in the presence of ISP throttling. BT-style distributions are desirable in enterprise environments, where large data sets of increasing size need to be transferred to a large set of nodes. In some ways, the problem of data transfers (fan-out) within a data center is simpler with reliable, managed, fast, and low-latency nodes. However, the demands of predictability, scalability, manageability, and efficiency are higher in such environments.</p>
<p>We now turn to optimization options, on top of regular BT transfers, for improving efficiency and predictability.</p>
<p><strong>Configuration</strong></p>
<p>The general BitTorrent configuration is designed for Internet scale with slow and unreliable clients. This configuration can be adapted to work better in the enterprise environment with homogeneous clients. The upload and download parameters and piece sizes can be adapted to improve transfer rates and reliability in an enterprise environment. The tit-for-tat strategy as a part of the choke algorithm, throttling, and encryption of payload may not be needed in enterprise environments.</p>
<p><strong>Peer selection</strong></p>
<p>When distributing pieces between peers, it is efficient to pick peer nodes that are closest to each other. In enterprise environments, the server topology may be available beforehand and easily leveraged in the algorithm to pick nearer neighbors. We find that closeness of IP address can often be used as a reasonable approximation.</p>
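<p>One simple way to approximate closeness of IP address is to compare IPv4 addresses bit by bit, so that peers sharing a longer address prefix rank as nearer. The sketch below does this with an XOR distance; the function names and the choice of XOR as the metric are assumptions for illustration, and a real deployment might use known topology instead.</p>

```javascript
// Converts a dotted IPv4 address into a number.
function ipToInt(ip) {
  return ip.split(".").reduce(function (acc, octet) {
    return acc * 256 + Number(octet);
  }, 0);
}

// XOR distance: the longer the shared prefix, the smaller the result.
function ipDistance(a, b) {
  return (ipToInt(a) ^ ipToInt(b)) >>> 0; // >>> 0 keeps the value unsigned
}

// Returns the `count` peers whose addresses are closest to `self`.
function nearestPeers(self, peers, count) {
  return peers.slice().sort(function (x, y) {
    return ipDistance(self, x) - ipDistance(self, y);
  }).slice(0, count);
}
```

<p>For a node at the hypothetical address 10.1.2.5, peers at 10.1.2.7 and 10.1.2.200 would rank ahead of 10.2.0.1 and 192.168.0.1.</p>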
<p><strong>Cross-data-center transfers</strong></p>
<p>When the distribution topology extends across multiple data centers, it is useful to limit cross-data-center transfers, because bandwidth between data centers is limited and/or expensive.</p>
<p>A naïve approach to reducing cross-data-center transfers is to perform them in two steps. The first step is to transfer a package to one or a few nodes in each data center. The second step is to initiate BT transfers within each data center such that package transfer takes place between the initial nodes and the rest of the nodes in that data center—in effect, one BT transfer session per data center. This two-step scheme increases the overall transfer time. Peer selection based on closeness of IP address can be an effective way of conducting transfers in a single step, with all nodes participating in the torrent session.</p>
<p><strong>Package transfer steps</strong></p>
<p>When the source package is in HTTP, HDFS, or custom storage, BT distribution requires sequential execution of steps as follows:</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2012/01/steps1.png"><img class="alignnone size-full wp-image-1106" src="http://www.ebaytechblog.com/wp-content/uploads/2012/01/steps1.png" alt="" width="620" height="106" /></a></p>
<p>Before initiating the BT session to download a package, the additional steps that need to be executed are downloading from the package store, generating the “.torrent” file (assuming it was not previously generated), and seeding the package. Each step requires the sequential reading or writing of the data. It is common to find that the time taken by these steps is comparable to actual download time.</p>
<p>Because several of these steps must be performed in sequence, the best way to simplify the distribution is to avoid some of the steps altogether.</p>
<h2>Web seeding</h2>
<p>HTTP-based web servers are popularly used to host packages. Clients connect to one or a handful of these servers to download a package. Even when using BT, it is common for the initial “.torrent” file to be downloaded from an HTTP server. The “.torrent” file contains package meta-information, including the location of trackers and piece-by-piece checksums.</p>
<p>Web seeding is an attempt to host packages on web servers so that they can be used for seeding. This technique avoids the need for a separate BT seeding process, as web servers can double up as seeders for hosted content.</p>
<p>BitTorrent web seeding has two distinct and incompatible specifications. The <a href="http://bittornado.com/docs/webseed-spec.txt">BitTornado specification</a> is based on a custom URL scheme to download specific pieces. The <a href="http://www.getright.com/seedtorrent.html">GetRight specification</a> relies on the basic HTTP 1.1 download mechanism using byte serving (byte range queries).</p>
<p>We find that the HTTP byte-serving method is more convenient, as it simply requires hosting packages behind an HTTP 1.1-compatible server. We recommend hosting both the “.torrent” file and packages on the same server. The “.torrent” file can be created at the same time that the source package is generated, thereby avoiding the need for a separate step to create it later during BT transfer. In this scheme, BT seeders are not used; instead, BT clients use HTTP range queries to download pieces from a web server. Once one or a few peers have a piece, the distribution of that piece takes place between peers without a request to the HTTP server. Note that this scheme requires BT clients to support both the BT and HTTP protocols.</p>
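<p>To illustrate the byte-serving approach, here is a small sketch of how a client might map a piece index to an HTTP 1.1 <code>Range</code> header. The function names and the sizes in the example are hypothetical; only the <code>bytes=first-last</code> header form (with inclusive offsets) comes from HTTP itself.</p>

```python
def piece_byte_range(piece_index, piece_size, package_size):
    """Return the inclusive (first, last) byte offsets of one piece."""
    piece_count = -(-package_size // piece_size)  # ceiling division
    if piece_index not in range(piece_count):
        raise IndexError("piece index out of range")
    first = piece_index * piece_size
    # The final piece may be shorter than piece_size.
    last = min(first + piece_size, package_size) - 1
    return first, last

def range_header(piece_index, piece_size, package_size):
    """Build the request header that asks a web server for exactly one piece."""
    first, last = piece_byte_range(piece_index, piece_size, package_size)
    return {"Range": f"bytes={first}-{last}"}

# Assumed sizes for illustration: a 10 MiB package split into 4 MiB pieces.
MIB = 1024 * 1024
print(range_header(2, 4 * MIB, 10 * MIB))
```

<p>A server that supports byte serving answers such a request with only the requested bytes, which the client then verifies against the piece checksum from the “.torrent” file before sharing the piece with other peers.</p>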
<p>Here is a summary of the benefits of using web seeding:</p>
<ul>
<li>HTTP server-hosted packages can be used for both HTTP and BT transport. Web seeding allows the HTTP packages to be transferred using the BT protocol; when distributing to a few nodes, HTTP transport works well.</li>
<li>BT seeding requires the additional step(s) of downloading the package (if the seeder is different from the HTTP package store), and then seeding it before the BT transfer can be initiated. As explained above, web seeding avoids these steps and saves time.</li>
<li>When we have a large number of packages, any of which can be downloaded using BT, serving via BT seeding requires an always-on seeder—which means having a separate seeding process (a process or thread per package). Even when transfers are inactive, these processes or threads need to be running. By contrast, HTTP transport does not require any activity when package download is not occurring.</li>
<li>With traditional BT seeding, the seeder periodically communicates with the tracker on its status and progress, so that tracker logs can be mined for network utilization, the time taken by nodes, and other package transfer details. If the BT seeding session is left active to support any future BT downloads, it continues communicating with the tracker whether or not there is any download activity. The result is communication overhead and cluttered logs. Web seeding avoids these disadvantages because it does not involve communicating with the tracker.</li>
</ul>
<h2>HDFS-based seeding</h2>
<p>Hadoop and HDFS are becoming increasingly popular in enterprise environments to store large data files. In HDFS, large data files are split into several blocks, copies of which are stored on multiple nodes to provide reliability and fail-over capability (default block size is 64 MB).</p>
<p>Distribution for HDFS-hosted packages can borrow ideas from the web-seeding approach, as HDFS supports retrieval of arbitrary portions of packages. The “.torrent” file can be generated in a parallel manner, with piece checksums being computed as a map/reduce task. In order for the map/reduce task to work efficiently, the BT piece size can be chosen such that one or multiple BT pieces equate to a single HDFS block. This strategy helps with torrent generation and BT distribution, as BT pieces wouldn’t span across HDFS blocks.</p>
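<p>One way to picture the piece-size choice is the sketch below, which picks the largest piece size that does not exceed a preferred size and divides the HDFS block evenly, so that every piece lies within a single block. The helper names and the preferred sizes are assumptions for illustration; only the 64 MB default block size is taken from the text above.</p>

```python
MIB = 1024 * 1024
HDFS_BLOCK_SIZE = 64 * MIB  # default block size mentioned above

def aligned_piece_size(block_size, preferred_piece_size):
    """Largest size up to the preferred one that divides the block evenly."""
    size = min(preferred_piece_size, block_size)
    while block_size % size != 0:
        size -= 1
    return size

def pieces_in_block(block_index, block_size, piece_size):
    """Piece indexes whose bytes all live in the given HDFS block."""
    per_block = block_size // piece_size
    return range(block_index * per_block, (block_index + 1) * per_block)

piece_size = aligned_piece_size(HDFS_BLOCK_SIZE, 4 * MIB)
print(piece_size // MIB, list(pieces_in_block(1, HDFS_BLOCK_SIZE, piece_size)))
```

<p>With this alignment, a map task that processes one HDFS block can compute the checksums of all pieces in that block without reading data from a neighboring block.</p>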
<p>The use of HDFS-based seeding has the following advantages (some of which are similar to those of web seeding):</p>
<ul>
<li>The source package may be stored across HDFS nodes. Downloading it on a single seeder node requires download and aggregation of HDFS blocks before BT seeding. This step can be avoided with HDFS-based seeding.</li>
<li>BT seeding traverses through the entire package, validating checksums, before seeding. The HDFS-based seeding approach both avoids a sequential read of a large package and provides an always-on seeder.</li>
<li>Using traditional BT seeding with large HDFS packages requires keeping multiple versions of the packages on seeder nodes—and consequently extra bookkeeping, space management, and cleanup. HDFS seeding avoids this overhead.</li>
</ul>
<h2>Dealing with deltas</h2>
<p>When clients have an earlier version of a package with some overlapping content, efficiency improvements can be achieved by downloading less of that package. The rsync algorithm is one of the commonly used techniques for dealing with package deltas. This algorithm uses a combination of strong and weak checksums to determine the deltas. A modified scheme, called <a href="http://zsync.moria.org.uk/">zsync</a>, is used for delta computation in a distributed manner. The zsync scheme works nicely with HTTP 1.1 and byte range queries. We suggest a combination of BitTorrent and zsync as a way to transfer packages with some overlapping content.</p>
<p>As clients may have different versions of a package, each client identifies and reuses common parts of the package by employing small “.zsync” files that contain weak and strong checksums. When using zsync in combination with the BitTorrent protocol, a client can download the modified portion from other peers if peers contain the piece. If other peers do not contain the piece, the client can default to web seeding and use HTTP 1.1 byte range to download that piece from source.</p>
<p>The amount of overlap between package versions would depend on the package format as well as the changes compared to the earlier version of the package. The package producers are likely to know whether such overlap is expected. The presence of a “.zsync” file can indicate whether a previous version of the package should be scanned to determine overlap. In the absence of a “.zsync” file, the BT client can fall back to the case where no common parts are shared with the previous version.</p>
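<p>The weak-then-strong checksum idea can be sketched as follows. This is a simplified illustration rather than the actual zsync algorithm: Adler-32 and SHA-1 stand in for the checksums a real “.zsync” file would carry, the old version is scanned naively at every offset instead of with a rolling checksum, and all names are hypothetical.</p>

```python
import hashlib
import zlib

def block_checksums(data, block_size):
    """Per-block (weak, strong) checksums of the new package version."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return [(zlib.adler32(b), hashlib.sha1(b).hexdigest()) for b in blocks]

def blocks_to_download(old_data, new_checksums, block_size):
    """Indexes of new-version blocks that cannot be found in the old version."""
    wanted = {}
    for index, (weak, strong) in enumerate(new_checksums):
        wanted.setdefault(weak, []).append((index, strong))
    found = set()
    for offset in range(len(old_data) - block_size + 1):
        window = old_data[offset:offset + block_size]
        # Cheap weak checksum first; strong checksum only on a weak match.
        for index, strong in wanted.get(zlib.adler32(window), []):
            if hashlib.sha1(window).hexdigest() == strong:
                found.add(index)
    return [i for i in range(len(new_checksums)) if i not in found]

old = b"AAAABBBBCCCCDDDD"
new = b"AAAAXXXXCCCCDDDD"
print(blocks_to_download(old, block_checksums(new, 4), 4))
```

<p>Only the blocks reported as missing need to be fetched, either from peers that already hold the corresponding pieces or from the web server through a byte range request.</p>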
<h2>BitTorrent test results</h2>
<p>To optimize transfer time, we ran a series of experiments using a small cluster of 16 nodes. The results discussed below are captured in a data center environment with fast connectivity (1 Gbps) between nodes with low latency (less than a few milliseconds). We used BitTornado for our experiments. The peer nodes are homogeneous, running the same BT client version with identical configurations.</p>
<p>This first table shows the actual and theoretical times taken to transfer a 30 GB package as the number of nodes is varied:</p>
<table border="1">
<tbody>
<tr>
<td><strong>Number of nodes</strong></td>
<td><strong>Theoretical client/server (minutes)</strong></td>
<td><strong>Theoretical P2P best (minutes) – B</strong></td>
<td><strong>Actual transfer time (minutes) &#8211; A</strong></td>
<td><strong>Ratio (B / A)</strong></td>
</tr>
<tr>
<td>1</td>
<td>4</td>
<td>4</td>
<td>11</td>
<td>0.36</td>
</tr>
<tr>
<td>2</td>
<td>8</td>
<td>4</td>
<td>12</td>
<td>0.33</td>
</tr>
<tr>
<td>3</td>
<td>12</td>
<td>4</td>
<td>13</td>
<td>0.31</td>
</tr>
<tr>
<td>4</td>
<td>16</td>
<td>4</td>
<td>13.5</td>
<td>0.30</td>
</tr>
<tr>
<td>6</td>
<td>24</td>
<td>4</td>
<td>16</td>
<td>0.25</td>
</tr>
<tr>
<td>8</td>
<td>32</td>
<td>4</td>
<td>16</td>
<td>0.25</td>
</tr>
<tr>
<td>16</td>
<td>64</td>
<td>4</td>
<td>17</td>
<td>0.24</td>
</tr>
</tbody>
</table>
<p>The theoretical times are computed using the formulas described earlier with t = 4 minutes and R = 3840. The actual time taken is the time for all nodes to receive the full package.</p>
<p>The theoretical client/server time shows a linear increase as the number of nodes is increased. The theoretical P2P best time is constant at 4 minutes. The data shows that the actual transfer times beat the theoretical client/server times once the number of nodes reaches 4 (13.5 minutes versus 16 minutes); with 3 nodes, the client/server estimate is still slightly lower (12 minutes versus 13 minutes). While the actual time taken increases as the number of nodes increases, the rate of growth decreases.</p>
<p>The ratio between theoretical best and actual times indicates the efficiency in transfer compared to the theoretical best. The ratio, and thus efficiency, decreases as the number of nodes increases—going from 36% to 24% as the number of nodes increases from 1 to 16. The low efficiency suggests potential improvements with an optimized BitTorrent implementation or use of other P2P schemes.</p>
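<p>The figures in the first table can be reproduced with a few lines of arithmetic. The sketch below assumes that the theoretical client/server time is simply the number of nodes multiplied by t, which is what the table's second column suggests; the measured times are copied from the table, and the function names are illustrative.</p>

```python
T_MINUTES = 4  # t: theoretical best time for the package, from the text above

# Measured transfer times (minutes) by number of nodes, copied from the table.
actual_minutes = {1: 11, 2: 12, 3: 13, 4: 13.5, 6: 16, 8: 16, 16: 17}

def client_server_minutes(nodes, t=T_MINUTES):
    """Assumed model: one server sends the package to each node in turn."""
    return nodes * t

def efficiency(nodes):
    """Ratio of the theoretical P2P best time (B) to the actual time (A)."""
    return round(T_MINUTES / actual_minutes[nodes], 2)

def first_node_count_where_p2p_wins():
    """Smallest measured node count where actual BT beats client/server."""
    return min(n for n, a in actual_minutes.items() if client_server_minutes(n) > a)

for nodes in actual_minutes:
    print(nodes, client_server_minutes(nodes), efficiency(nodes))
```

<p>Under this model, the measured BitTorrent time first drops below the client/server estimate at 4 nodes, and the gap widens steadily from there.</p>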
<p>The following table shows the time taken by the different steps in downloading and transferring the 30 GB package to a 16-node cluster:</p>
<table border="1">
<tbody>
<tr>
<td><strong>Step #</strong></td>
<td><strong>Step</strong></td>
<td><strong>Time taken (minutes)</strong></td>
</tr>
<tr>
<td>1</td>
<td>HDFS download</td>
<td>6</td>
</tr>
<tr>
<td>2</td>
<td>Torrent generation</td>
<td>6</td>
</tr>
<tr>
<td>3</td>
<td>Seeding</td>
<td>6</td>
</tr>
<tr>
<td>4</td>
<td>Package transfer</td>
<td>17</td>
</tr>
</tbody>
</table>
<p>The time for actual package transfer (step 4) is comparable to the combined time of the three prior steps. Web seeding or HDFS-based seeding would avoid steps 1 and 3. Step 2 can be performed during package creation or as a map/reduce task for HDFS packages. Because steps 1 through 3 are executed sequentially, HTTP- or HDFS-based seeding can achieve substantial gains for package transfer.</p>
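<p>As a rough back-of-the-envelope check, the step timings from the table can be summed to estimate the gains. The scenario names are illustrative, and the calculation assumes that the remaining steps take the same time as measured, which may not hold in practice.</p>

```python
# Step timings in minutes, copied from the table above.
steps = {
    "HDFS download": 6,
    "Torrent generation": 6,
    "Seeding": 6,
    "Package transfer": 17,
}

def total_minutes(skipped=()):
    """End-to-end time when the named steps are avoided."""
    return sum(minutes for name, minutes in steps.items() if name not in skipped)

baseline = total_minutes()
no_download_or_seeding = total_minutes(skipped=("HDFS download", "Seeding"))
torrent_pregenerated = total_minutes(
    skipped=("HDFS download", "Seeding", "Torrent generation")
)
print(baseline, no_download_or_seeding, torrent_pregenerated)
```

<p>Under these assumptions, the end-to-end time falls from 35 minutes to 23 minutes when steps 1 and 3 are avoided, and to 17 minutes when the “.torrent” file is also generated ahead of time.</p>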
<p>The following graph shows the amount of piece sharing that happens in the swarm of 16 nodes. The data depicted is captured from tracker logs at the end of a BT transfer.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2012/01/piece_sharing.png"><img class="alignnone size-full wp-image-1103" src="http://www.ebaytechblog.com/wp-content/uploads/2012/01/piece_sharing.png" alt="graph of piece sharing with 16 nodes" width="620" height="482" /></a></p>
<p>The download figures are not very interesting, as each peer needs the entire package. Thus, download is 100% for each peer and 0% for the seeder.</p>
<p>The amount of upload varies between the seeder and the peers. The seeder needs to upload all pieces at least once, and thus its upload would be 100% or more. The results show seeder upload of slightly more than 100%, while peer upload ranged from 75% to 125%. Thus, all peers participated in uploads such that the original seeder did not upload a disproportionately high amount of the package bytes.</p>
<h2>Summary</h2>
<p>For the package fan-out problem in enterprise environments, BitTorrent-based distribution is a good solution. The problem of package transfers is simpler in enterprise environments, with their fast, reliable networks and homogeneous clients. Web seeding and HDFS-based seeding can be used to avoid some of the steps and increase the speed of package transfers. Deltas across package versions can be dealt with effectively by using an approach that combines zsync and BitTorrent mechanisms.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ebaytechblog.com/2012/01/31/bittorrent-for-package-distribution-in-the-enterprise/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Rapid Development Setup in Large Environments, part II</title>
		<link>http://www.ebaytechblog.com/2012/01/03/rapid-development-setup-in-large-environments-part-ii/</link>
		<comments>http://www.ebaytechblog.com/2012/01/03/rapid-development-setup-in-large-environments-part-ii/#comments</comments>
		<pubDate>Tue, 03 Jan 2012 16:52:05 +0000</pubDate>
		<dc:creator>Suresh Mathew</dc:creator>
				<category><![CDATA[Software Engineering]]></category>
		<category><![CDATA[agility]]></category>
		<category><![CDATA[code rollout]]></category>
		<category><![CDATA[Eclipse]]></category>
		<category><![CDATA[IDE]]></category>
		<category><![CDATA[SCM]]></category>
		<category><![CDATA[source control]]></category>

		<guid isPermaLink="false">http://www.ebaytechblog.com/?p=1037</guid>
		<description><![CDATA[In the first part of this blog post, we outlined the problem of rapid development setup as well as the approaches that we have used at eBay to address the problem. We will now describe how we set up projects in the Eclipse IDE, which is well adopted at eBay for Java applications. Let&#8217;s start [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>In the first part of this blog post, we outlined the problem of rapid development setup as well as the approaches that we have used at eBay to address the problem. We will now describe how we set up projects in the Eclipse IDE, which is well adopted at eBay for Java applications.</p>
<p>Let&#8217;s start by looking at two Eclipse project types – source and binary – before we describe dynamic projects, which convert between source and binary.</p>
<h2>Source project</h2>
<p>A source project is a standard Eclipse project created by clicking File -&gt; New Project. A source project includes these characteristics:</p>
<ul>
<li>Source elements</li>
<li>Dependencies on other projects or libraries (binary bundles)</li>
<li>Builders that convert the source elements to binaries</li>
</ul>
<p>With source projects, an SCM system typically provisions the source elements. Project metadata is created and optional pre-build steps are executed before compiling the source elements. The most time-consuming steps tend to be source downloads and compilation.</p>
<h2>Binary project</h2>
<p>The binary project is not a common Eclipse project type. Eclipse does not have a wizard to create a binary project, but it does provide this option in cases where an existing plugin or feature is imported into the workspace. A binary project has these characteristics:</p>
<ul>
<li>No source elements; instead, libraries with attached source</li>
<li>Dependencies on other projects or libraries</li>
<li>No-op builders and no source authorship</li>
</ul>
<p>In most other ways, binary projects share the capabilities of source projects, including the ability to browse, debug, and execute code. As there are no source elements to provision and compile, the setup of binary projects is generally much quicker compared to setup of source projects.</p>
<h2>Dynamic project (project level)</h2>
<p>The advantage of a binary project is that there is no source to download or compile, because we consume already-compiled binaries; the obvious disadvantage is that it doesn&#8217;t provide the source authorship capability that is available with source projects. We therefore came up with a third type of project, which we call a dynamic project, that can convert a binary project into a source project and vice versa.</p>
<p>Let’s look at the workflow for a project-level conversion. The developer first sets up the environment with binary projects to browse, run, and debug the application. When ready to make changes to a project by editing the source, the developer simply selects the appropriate binary bundle and converts it to source.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/12/Image71.png"><img class="alignnone size-full wp-image-1050" src="http://www.ebaytechblog.com/wp-content/uploads/2011/12/Image71-e1323126171526.png" alt="" width="620" height="471" /></a></p>
<p>The system fetches the source elements, compiles them, and brings them as a source project into the workspace. The system also adjusts dependencies, so that all other projects that were depending on the binary project now depend on the source project. Conversion from source to binary is accomplished in a reverse manner.</p>
<h2>Dynamic project (source level)</h2>
<p>At the project level, the dynamic project acts as either a binary or a source project. When we have a limited number of source elements, project-level conversion from binary to source works fine. In contrast, when a project contains a large number of elements but only a small set of those elements require authorship, conversion of individual source elements is useful. In this case, the project becomes a hybrid: it is neither a binary nor a source project. Only the source elements that need to be modified are converted from binary to source.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/12/Image8.png"><img class="alignnone size-full wp-image-1051" src="http://www.ebaytechblog.com/wp-content/uploads/2011/12/Image8-e1323126359604.png" alt="" width="620" height="578" /></a></p>
<p>When the first source elements are converted, a source project is created. It contains only the converted source elements, with a dependency on the binary bundle. Thanks to tooling specific to dynamic projects, Eclipse exports the compiled <code>.class</code> elements before the libraries and binary bundles. The class loader picks the first matching class, which is the one from the source element in the workspace. In this manner, any source element modifications get reflected in the application.</p>
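<p>The first-match rule can be modeled in a few lines. The sketch below is a simplified simulation of ordered classpath lookup, not Eclipse or JVM code, and the class names and labels are hypothetical.</p>

```python
def load_class(name, classpath):
    """Return (origin, payload) from the first classpath entry defining `name`."""
    for origin, classes in classpath:
        if name in classes:
            return origin, classes[name]
    raise LookupError(name)

# Converted source elements are listed ahead of the binary bundle.
classpath = [
    ("workspace source project", {"com.example.Cart": "modified"}),
    ("binary bundle", {
        "com.example.Cart": "released",
        "com.example.Order": "released",
    }),
]

print(load_class("com.example.Cart", classpath))
print(load_class("com.example.Order", classpath))
```

<p>Because the workspace entry comes first, the modified class shadows the copy in the binary bundle, while untouched classes continue to resolve from the bundle.</p>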
<h2>Summary</h2>
<p>Quick developer setup is very desirable yet challenging in environments with a large code base and a high velocity of changes. In web applications affording the flexibility of frequent rollouts, the number of feature releases is usually large (hundreds per month in eBay&#8217;s case), while the number of changes per feature is often limited (for eBay, hundreds of source elements).</p>
<p>Processes and tools help provide an environment that supports quick developer setup. We have described our journey in providing such an environment through the use of binary bundles, hierarchical repositories, and a combination of incremental source changes with binary bundles. We have described the challenges that these approaches pose as well as the approaches and ideas we have developed to overcome them.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ebaytechblog.com/2012/01/03/rapid-development-setup-in-large-environments-part-ii/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Rapid Development Setup in Large Environments</title>
		<link>http://www.ebaytechblog.com/2011/12/15/rapid-development-setup-in-large-environments/</link>
		<comments>http://www.ebaytechblog.com/2011/12/15/rapid-development-setup-in-large-environments/#comments</comments>
		<pubDate>Thu, 15 Dec 2011 22:11:00 +0000</pubDate>
		<dc:creator>Mahesh Somani</dc:creator>
				<category><![CDATA[Software Engineering]]></category>
		<category><![CDATA[agility]]></category>
		<category><![CDATA[code rollout]]></category>
		<category><![CDATA[Eclipse]]></category>
		<category><![CDATA[IDE]]></category>
		<category><![CDATA[SCM]]></category>
		<category><![CDATA[source control]]></category>

		<guid isPermaLink="false">http://www.ebaytechblog.com/?p=1033</guid>
		<description><![CDATA[Large development teams with a sizable code base have two conflicting needs: code sharing and rapid developer setup (agility). Teams who are working on different features need to modify common projects and source elements. While code ownership helps, members of the same team invariably need to touch the same components across different features – or [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>Large development teams with a sizable code base have two conflicting needs: code sharing and rapid developer setup (agility). Teams who are working on different features need to modify common projects and source elements. While code ownership helps, members of the same team invariably need to touch the same components across different features – or even the same features! Additionally, as ownership itself changes over time, developers other than the original authors need to enhance features or resolve issues.</p>
<p>The problem facing large development teams stems from more than source-level conflicts and SCM code merges or promotions – which, in our experience, are well bounded. The more fundamental problem has to do with enabling teams to set up their development environments rapidly while working with a code base that is not only large but has a high velocity of changes.</p>
<p>Between the need for code sharing and the need for developer setup agility, code sharing often receives priority as it avoids duplicating code and reinventing the wheel. In this two-part post, we discuss our experience with providing rapid developer setup.</p>
<h2>The size of the problem</h2>
<p>We will first consider quantitative aspects of the problem before looking at our journey to solutions. The problem involves both the number of changes and the velocity of the changes. The eBay code base consists of hundreds of thousands of source elements (with total lines of code in the tens of millions). The applications – which comprise web, services, messaging, and batch style – number in the thousands. It is common for an application to use tens of thousands of source elements. The elements themselves are contributed by different teams and are shared with other applications.</p>
<table>
<tbody>
<tr>
<td><strong>Code Type</strong></td>
<td><strong>Count</strong></td>
</tr>
<tr>
<td>Source elements</td>
<td>100,000’s</td>
</tr>
<tr>
<td>Source elements per application</td>
<td>10,000’s</td>
</tr>
<tr>
<td>Applications</td>
<td>1000’s</td>
</tr>
<tr>
<td>Features per month</td>
<td>100’s</td>
</tr>
<tr>
<td>Source elements changed per feature</td>
<td>100’s (sometimes 1000’s)</td>
</tr>
<tr>
<td>Projects</td>
<td>1000’s</td>
</tr>
<tr>
<td>Source elements per project</td>
<td>10’s-100’s (sometimes 1000’s)</td>
</tr>
</tbody>
</table>
<p>Code base changes take the form of new features, which require modifying, adding, and (sometimes) removing source elements. eBay is a fast-paced environment with hundreds of teams; every month, we roll out hundreds of features. The following chart shows the number of source elements that features touch.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/12/Image1.png"><img class="alignnone size-full wp-image-1040" src="http://www.ebaytechblog.com/wp-content/uploads/2011/12/Image1-e1323121132219.png" alt="" width="620" height="409" /></a></p>
<p>The X-axis is a log-like scale indicating the number of elements modified. The Y-axis shows the percentage of features matching each bucket. The data distribution is fairly normal, with a median of 100 to 200 source elements. Compared to the overall code base size, the per-feature change is small, typically less than 1% of the source elements that an application uses. However, with hundreds of features rolled out per month, these changes compound quickly.</p>
<p>In addition, the code base dynamically changes across features. As modifications for individual features are validated and rolled out, later features pick up modifications from earlier features as they themselves are validated and rolled out.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/12/Image2.png"><img class="alignnone size-full wp-image-1041" src="http://www.ebaytechblog.com/wp-content/uploads/2011/12/Image2-e1323121708944.png" alt="" width="620" height="297" /></a></p>
<h2>Setup time</h2>
<p>The time involved in setting up the development environment consists of the following activities:</p>
<p>1. Determining the target set of source elements (can be for multiple applications)</p>
<p>2. Provisioning (downloading) of the target set of elements</p>
<p>3. Setting up the development environment in an IDE</p>
<p>4. Compiling source elements such as Java source code</p>
<p>Although we find that practices like service orientation, binary dependencies, and code ownership help minimize code-level coupling, quick setup challenges still exist in large and fast-paced environments. The following sections describe eBay’s own evolution toward providing a faster and more reliable development environment.</p>
<h2>Binary bundles</h2>
<p>An application typically has several thousand source elements. Instead of provisioning and compiling all of those elements, the setup problem is simplified by consuming many of the elements in binary form. Already-compiled binary bundles diminish setup time by reducing the number of source elements that need downloading and compiling. The release stream tags the binary bundles that are released; different features can consume the bundles based on the release stream.</p>
<p>Feature teams can consume some projects as source while consuming many others in binary form. Source projects are required only where code authorship is required. As source projects take longer to set up, teams are encouraged to use them only for the projects that they expect to modify. The rest is consumed as binary bundles, to which the source is attached.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/12/Image3.png"><img class="alignnone size-full wp-image-1042" src="http://www.ebaytechblog.com/wp-content/uploads/2011/12/Image3-e1323121950343.png" alt="" width="620" height="246" /></a></p>
<p>The binary bundle approach has clear advantages in reducing the number of source elements that need to be dealt with in the development environment. With this approach, the release streams are a vehicle for contributing changes that can be consumed later by other features. This methodology has also been adopted in the open-source world – for example, with the Apache Maven build tool and with increasingly popular binary repositories.</p>
<p>One of the challenges with the binary bundle approach is managing the bundles per release stream. This approach also requires the development team to identify the set of bundles that need authorship and thus need to be consumed as source. Furthermore, storing binary bundles at the release-stream level does not provide feature-level isolation. These shortcomings are discussed next.</p>
<h2>Binary bundles with feature isolation and round-tripping</h2>
<p>Feature-level isolation is required because features are developed in parallel and have overlapping code areas. In addition, features might be in early stages of development or might be developed in an experimental manner; changes from such features, in the form of binary bundles, should not be shared with others. With feature isolation, modified bundles are not shared until they have gone through appropriate validations and rollout. In other words, we desire to limit sharing when features are actively under development, and to encourage sharing after those changes have been rolled out.</p>
<p>To provide feature isolation, a common repository with release and snapshot versions does not suffice, as that approach is meant to encourage sharing. We use a per-feature binary repository to achieve feature isolation.</p>
<p>As individual features change only a small set of bundles, a complete binary repository per feature leads to duplication of binary bundles. We use hierarchical binary repositories to avoid duplication of bundles and to improve efficiency. Only bundles that are modified by features are placed in the feature repository. The rest of the bundles are obtained from higher-level repositories such as those from the release stream. This approach is similar to hierarchical management of source code in an SCM system (trunk -&gt; branch 1 -&gt; branch 2). In addition to the hierarchical binary repository, the local cache of these bundles is stored in a hierarchical manner to encourage sharing of common bundles across different features.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/12/Image4-e1323122669222.png"><img class="alignnone size-full wp-image-1043" src="http://www.ebaytechblog.com/wp-content/uploads/2011/12/Image4-e1323123041885.png" alt="" width="620" height="570" /></a></p>
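<p>A hierarchical lookup of this kind might be sketched as follows. The class, the repository names, and the bundle versions are hypothetical and only illustrate the idea of falling back from a feature repository to its parent.</p>

```python
class Repository:
    """A binary repository that falls back to a parent for missing bundles."""

    def __init__(self, name, bundles, parent=None):
        self.name = name
        self.bundles = bundles
        self.parent = parent

    def resolve(self, bundle):
        """Return (repository name, version) from the nearest repository."""
        repo = self
        while repo is not None:
            if bundle in repo.bundles:
                return repo.name, repo.bundles[bundle]
            repo = repo.parent
        raise LookupError(bundle)

release = Repository("release-stream", {"search": "1.4", "checkout": "2.0"})
# The feature repository holds only the bundle that the feature modified.
feature = Repository("feature-123", {"checkout": "2.1-feature"}, parent=release)

print(feature.resolve("checkout"))
print(feature.resolve("search"))
```

<p>The feature repository stays small because it stores only modified bundles, and rebasing a feature onto another release stream amounts to pointing it at a different parent.</p>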
<p>Another aid to rapid developer setup is dynamic conversion of binary bundles to source and vice versa. This conversion ability enables development to start before all of the projects that need source authorship have been identified. Furthermore, it encourages teams to start with a small set of source projects and to consume most (or even all) projects in binary form. When the team realizes the need for source authorship, the appropriate projects are converted from binary to source. During conversion, the system determines source elements, provisions (downloads) them, compiles them, and appropriately adjusts project dependencies.</p>
<p>Once the authoring is completed, the projects can be converted back to binary form. This reverse conversion picks up binary bundles with the recent changes that have been made to those bundles.</p>
<p>The main challenge with this approach is managing the hierarchical repository, including timely updates as feature changes are committed. The hierarchical repository itself needs to point to different release streams as a feature is rebased. This approach still involves numerous source elements if the number or size of projects is large.</p>
<h2>Binary bundles with source element changes</h2>
<p>As discussed earlier in this post, most of our feature changes consist of hundreds of source elements. We find that these changes occur across tens of different projects. The chart below summarizes project size and usage in our environment.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/12/Image5.png"><img class="alignnone size-full wp-image-1044" src="http://www.ebaytechblog.com/wp-content/uploads/2011/12/Image5-e1323122843104.png" alt="" width="620" height="491" /></a></p>
<p>The X-axis identifies the number of source elements in a project. The green bars show the percentage of projects that match a given bucket, and the red bars show the distribution of projects that are used as source. The median project size is about 50 to 100 elements, and we have a small number of projects that have more than 1000 elements. Although few in number, the large projects are used as source disproportionately often: the largest of them account for a far greater share of source usage than their share of the project count.</p>
<p>We find that projects don’t start out being large; rather, some grow huge because many features use them and contribute to them. As these projects grow by contribution, many features consume them in source form. Due to their size, these source projects drastically increase development setup time.</p>
<p>To avoid such issues, we use various approaches: splitting large projects into several small projects, decoupling applications from common code areas, creating meta-information (DSL) instead of compiled source code, etc. These are all valid ways of addressing the problem. In our experience, it is easier to start with an approach that incorporates these concepts rather than changing to such an approach later, because of the pervasive nature of changes that touch large numbers of projects.</p>
<p>Another way of achieving rapid setup in the presence of large projects is by using source element changes in combination with binary bundles. When a project needs source authorship, instead of having all elements come in source form, we include a small set of the elements that are modified or that need authorship. The rest of the elements continue to come from binary bundles. We achieve this approach by setting the classpath such that compiled source elements take precedence over the elements coming in binary form.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/12/Image6.png"><img class="alignnone size-full wp-image-1059" src="http://www.ebaytechblog.com/wp-content/uploads/2011/12/Image6.png" alt="" width="413" height="253" /></a></p>
<p>The challenges with this approach include setting up the development environment while incorporating others&#8217; feature changes, and dealing with non-source elements that are used for prebuilds. Given the limited number of source elements that change by feature, we are starting to adopt this approach in a limited fashion and are finding that it has potential to work well.</p>
<p>In the second part of this post, we will describe how we use the Eclipse IDE to optimize working with both source elements and binary bundles.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ebaytechblog.com/2011/12/15/rapid-development-setup-in-large-environments/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Announcing ql.io</title>
		<link>http://www.ebaytechblog.com/2011/11/30/announcing-ql-io/</link>
		<comments>http://www.ebaytechblog.com/2011/11/30/announcing-ql-io/#comments</comments>
		<pubDate>Wed, 30 Nov 2011 17:28:59 +0000</pubDate>
		<dc:creator>Subbu Allamaraju</dc:creator>
				<category><![CDATA[Software Engineering]]></category>
		<category><![CDATA[node.js]]></category>
		<category><![CDATA[nodejs]]></category>
		<category><![CDATA[open source]]></category>

		<guid isPermaLink="false">http://www.ebaytechblog.com/?p=994</guid>
		<description><![CDATA[We are happy to announce ql.io – a declarative, evented, data-retrieval and aggregation gateway for HTTP APIs. Through ql.io, we want to help application developers increase engineering clock speed and improve end user experience. ql.io can reduce the number of lines of code required to call multiple HTTP APIs while simultaneously bringing down network latency [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>We are happy to announce <a href="http://ql.io">ql.io</a> – a declarative, evented, data-retrieval and aggregation gateway for HTTP APIs. Through ql.io, we want to help application developers increase engineering clock speed and improve end user experience. ql.io can reduce the number of lines of code required to call multiple HTTP APIs while simultaneously bringing down network latency and bandwidth usage in certain use cases.</p>
<p>ql.io consists of a <a href="http://ql.io/docs/reference/ql">domain-specific language</a> inspired by SQL and JSON, and a node.js-based runtime to process scripts written in that language. Check out <a href="https://github.com/ql-io/ql.io">ql.io on GitHub</a> for the source and <a href="http://ql.io">http://ql.io</a> for demos, examples, and docs.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/11/ql.io-large.png"><img src="http://www.ebaytechblog.com/wp-content/uploads/2011/11/ql.io-large-300x149.png" alt="ql.io logo" width="300" height="149" class="alignnone size-medium wp-image-1013" /></a></p>
<h2>Why ql.io</h2>
<p>HTTP-based APIs – some call them services – are an integral part of eBay’s architecture. This is true not just for eBay, but for most companies that use the Web for content and information delivery. Within eBay’s platform engineering group, we noticed several pain points for application developers attempting to get the data they need from APIs:</p>
<ul>
<li>Most use cases require accessing multiple APIs – which involves making several network round trips.</li>
<li>Often those API requests have interdependencies – which requires programmatic orchestration of HTTP requests &#8211; making some requests in parallel and some in sequence to satisfy the dependencies and yet keep the overall latency low.</li>
<li>APIs are not always consistent as they evolve based on the API producers&#8217; needs – which makes code noisier in order to normalize inconsistencies.</li>
</ul>
<p>We found that these issues have two critical impacts: engineering clock speed and end user experience.</p>
<ul>
<li>Engineering clocks slow down because developers need to account for dependencies between API calls, and to arrange those calls to optimize overall latency. Implementing orchestration logic involves multi-threaded fork-join code, leads to code bloat, and distracts from the main business use case that the developer is striving to support.</li>
<li>End user experience suffers due to high bandwidth usage as well as the latency caused by the number of requests and the processing overhead of non-optimized responses from APIs.</li>
</ul>
<p>The goal of ql.io is to ease both pain points:</p>
<ul>
<li>By using a SQL- and JSON-inspired DSL to declare API calls, their interdependencies, forks and joins, and projections, you can cut down the number of lines of code from hundreds of lines to a few, and the development time from entire sprints to mere hours. Using this language, you can create new consumer-centric interfaces that are optimized for your application&#8217;s requirements.</li>
<li>You can deploy ql.io as an HTTP gateway between client applications and API servers so that ql.io can process and condense the data to just the fields that the client needs. This helps reduce the number of requests that the client needs to make as well as the amount of data transported to clients.</li>
</ul>
<h2>A quick taste</h2>
<p>Here is one of the typical examples of ql.io usage. It shows how ql.io can transform the experience of a developer getting the data needed to paint the UI in a native application.</p>
<pre>
prodid = select ProductID[0].Value from eBay.FindProducts where
    QueryKeywords = 'macbook pro';
details = select * from eBay.ProductDetails where
    ProductID in ('{prodid}') and ProductType = 'Reference';
reviews = select * from eBay.ProductReviews where
    ProductID in ('{prodid}') and ProductType = 'Reference';

return select d.ProductID[0].Value as id, d.Title as title,
    d.ReviewCount as reviewCount, r.ReviewDetails.AverageRating as rating
    from details as d, reviews as r
    where d.ProductID[0].Value = r.ProductID.Value
    via route '/myapi' using method get;
</pre>
<p>This script uses three API calls (in this case, all offered by eBay) to get four fields of products that match a keyword. The result is provided via a new HTTP resource with URI <code>http://{host}:{port}/myapi</code>. See the <a href="http://ql.io/docs/build-an-app">guide</a> to build this example yourself, or copy and paste the above script into <a href="http://ql.io/console">ql.io&#8217;s Web Console</a> to see it in action.</p>
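<p>For comparison, here is a rough sketch of the orchestration that the script declares, written by hand. It uses Python's asyncio purely for illustration (ql.io itself runs on node.js). The three coroutines are hypothetical stand-ins for the eBay API calls and return made-up values; no network request is made.</p>

```python
import asyncio


# Hypothetical stand-ins for the three eBay API calls; no network access is made.
async def find_products(keywords):
    await asyncio.sleep(0.01)
    return "prod-123"


async def product_details(prodid):
    await asyncio.sleep(0.01)
    return {"id": prodid, "title": "MacBook Pro", "reviewCount": 42}


async def product_reviews(prodid):
    await asyncio.sleep(0.01)
    return {"id": prodid, "rating": 4.5}


async def my_api(keywords):
    # Sequential step: both later calls depend on the product ID.
    prodid = await find_products(keywords)
    # Fork: details and reviews are independent of each other, so run them together.
    details, reviews = await asyncio.gather(
        product_details(prodid), product_reviews(prodid)
    )
    # Join and project: keep only the four fields the client needs.
    return {
        "id": details["id"],
        "title": details["title"],
        "reviewCount": details["reviewCount"],
        "rating": reviews["rating"],
    }


result = asyncio.run(my_api("macbook pro"))
print(result)
```

<p>Even in this toy form, the dependency ordering, the fork, and the join are all spelled out by the developer. In the ql.io script those steps are implied by the variable references and the final select.</p>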
<p>While we are still working on various benchmarks, we want to share some early results on developer productivity and end user benefits. One of the teams at eBay recently migrated an application that relies solely on eBay’s APIs to get the data needed to paint its UI. The first diagram below shows the request-response traces before migrating to ql.io.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/11/before-ql.io_.png"><img src="http://www.ebaytechblog.com/wp-content/uploads/2011/11/before-ql.io_.png" alt="Before migrating to ql.io" width="620" height="237" class="alignnone size-full wp-image-1014" /></a> </p>
<p>The code related to these API calls was about 2800 lines long. The diagram below shows the request-response traces after migrating API access to ql.io.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/11/after-ql.io_.png"><img src="http://www.ebaytechblog.com/wp-content/uploads/2011/11/after-ql.io_.png" alt="After migrating to ql.io" width="620" height="77" class="alignnone size-full wp-image-1015" /></a></p>
<p>This effort brought the code down to about 1200 lines, in addition to reducing the number of requests from 18 to 5 and the data size from 274k to 91k. In this experiment, the drop in latency was not significant, because the client application was on a broadband connection and some of the APIs involved were slow.</p>
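<p>To put those figures in relative terms, the short calculation below works out the percentage reduction for each one. It uses only the numbers quoted above; the helper name is ours.</p>

```python
def percent_reduction(before, after):
    """Relative reduction from before to after, as a percentage."""
    return 100.0 * (before - after) / before


# Figures quoted in the paragraph above.
for label, before, after in [
    ("lines of code", 2800, 1200),
    ("requests", 18, 5),
    ("data size (k)", 274, 91),
]:
    print(f"{label}: {percent_reduction(before, after):.0f}% smaller")
```

<p>That is roughly 57% less code, 72% fewer requests, and 67% less data.</p>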
<h2>How to use ql.io</h2>
<p>ql.io is not intended to replace frameworks that are currently used to build HTTP APIs. API producers can continue to use existing frameworks to offer interfaces that are generic and broadly reusable. ql.io comes into play when a consumer of APIs wants to implement consumer-specific aggregation, orchestration, and optimizations. In other words, while existing frameworks continue to support “producer-controlled” interfaces, you can use ql.io to create “consumer-controlled” interfaces.</p>
<p>We are building ql.io with flexible deployment in mind. Depending on where the network costs are felt, you can deploy ql.io closer to API servers, closer to users on the edge, or even on front-end machines.</p>
<p><strong>Deploying closer to API servers</strong></p>
<p>The primary usage of ql.io is to run it as a gateway at the reverse proxy tier, potentially between your load balancers and API servers.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/11/ql.io-as-gateway.png"><img src="http://www.ebaytechblog.com/wp-content/uploads/2011/11/ql.io-as-gateway.png" alt="ql.io as a gateway on the reverse-proxy tier" width="449" height="267" class="alignnone size-full wp-image-1016" /></a></p>
<p><strong>Deploying closer to client applications</strong></p>
<p>A secondary usage is to deploy ql.io closer to client applications on the edge.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/11/ql.io-on-the-edge.png"><img src="http://www.ebaytechblog.com/wp-content/uploads/2011/11/ql.io-on-the-edge.png" alt="ql.io on the edge" width="512" height="267" class="alignnone size-full wp-image-1017" /></a></p>
<p>Edge-side deployment can further reduce network costs for client applications by pushing API orchestration closer to those applications. Where API servers are globally distributed and the best place for aggregation may be closer to client applications, edge-side deployment may yield significant gains. If you are a developer using third-party APIs, you can follow the same pattern and deploy ql.io on your own closer to your applications.</p>
<p><strong>Deploying on the front end</strong></p>
<p>Our choice of JavaScript and node.js for building ql.io provides an additional deployment option: front-end applications built on node.js can use ql.io programmatically.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/11/ql.io-on-node.png"><img src="http://www.ebaytechblog.com/wp-content/uploads/2011/11/ql.io-on-node.png" alt="ql.io on the node front end apps" width="512" height="267" class="alignnone size-full wp-image-1018" /></a></p>
<h2>Why node.js</h2>
<p>Early on, one of the critical choices that we had to make was the software stack. We had two choices: Should we go with the proven Java stack that has full operational support within eBay? Or should we choose a stack like node.js with its excellent support for async I/O, but which was not yet proven when we started the project? Moreover, very few companies had operational experience with node.js. This was not an easy choice to make. In our deliberations, we considered the following systemic qualities, in their order of importance:</p>
<ul>
<li><em>Performance and scalability for I/O workloads.</em> A significant percentage of the work performed during script execution is I/O bound. CPU loads are limited to in-memory tasks like joins and projections. Blocking I/O was out of the question for supporting such workloads.</li>
<li><em>Operability.</em> We need to be able to monitor the runtime, know what is going on, and react quickly when things go wrong. Furthermore, integrating with eBay’s logging and monitoring tools is a prerequisite for bringing in a new technology stack.</li>
<li><em>Low per-connection memory overhead.</em> Since script execution involves some slow and some fast APIs, we need the stack to remain stable as the number of open connections increases.</li>
<li><em>Dynamic language support.</em> This consideration had two parts. We wanted to build ql.io very quickly in a very small team with low code-to-execution turn-around times. This approach helps us iterate rapidly in the face of bugs as well as new use cases. In addition, we wanted application developers to be able to extend ql.io’s processing pipeline with small snippets of code.</li>
</ul>
<p>After some analysis and prototyping, we chose JavaScript as the language and node.js as the runtime stack. Here are some highlights of our experience so far:</p>
<ul>
<li>JavaScript and node.js allowed us to iterate very rapidly. Though we were initially concerned about finding the right tools and libraries, the node.js ecosystem proved sufficient for us to build as complex a system as ql.io.</li>
<li>We were able to tune a regular developer-quality Ubuntu workstation to handle more than 120,000 active connections per node.js process, with each connection consuming about 2 KB of memory. We knew we could go further with the number of connections; although we did not spend the time to push beyond that figure, the result gave us the confidence to proceed with node.js.</li>
<li>ql.io’s core engine does automatic fork-join of HTTP requests by using compile-time analysis of scripts. Node’s evented I/O model freed us from worrying about locking and concurrency issues that are common with multithreaded async I/O.</li>
<li>We did pay some operationalization tax while we prepared the ql.io and node.js stack for integration with eBay’s monitoring and logging systems. This was a one-time penalty.</li>
</ul>
<h2>What&#8217;s next</h2>
<p>We’re not done with ql.io yet, and we want to continue to develop ql.io in the open. Go to <a href="https://github.com/ql-io/ql.io">ql.io on GitHub</a> and <a href="http://ql.io">http://ql.io</a> to find out more about ql.io, try it out, <a href="http://groups.google.com/group/qlio">discuss</a> it, and participate.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ebaytechblog.com/2011/11/30/announcing-ql-io/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		</item>
		<item>
		<title>Click Modeling for eCommerce</title>
		<link>http://www.ebaytechblog.com/2011/11/22/click-modeling-for-ecommerce/</link>
		<comments>http://www.ebaytechblog.com/2011/11/22/click-modeling-for-ecommerce/#comments</comments>
		<pubDate>Tue, 22 Nov 2011 19:06:54 +0000</pubDate>
		<dc:creator>Halim Abbas</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Search Science]]></category>

		<guid isPermaLink="false">http://www.ebaytechblog.com/?p=953</guid>
		<description><![CDATA[Historical user click patterns on search result pages are considered a great resource for algorithms that attempt to learn to rank search results. This ranking method is a well-studied problem in the field of Information Retrieval (IR). See Mike Mathiesonʼs excellent blog post, Using Behavioral Data to Improve Search, for more background on this subject. [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>Historical user click patterns on search result pages are considered a great resource for algorithms that attempt to learn to rank search results. Learning to rank from such data is a well-studied problem in the field of Information Retrieval (IR). See Mike Mathiesonʼs excellent blog post, <a title="Using Behavioral Data to Improve Search" href="http://www.ebaytechblog.com/2011/04/13/using-behavioral-data-to-improve-search/">Using Behavioral Data to Improve Search</a>, for more background on this subject.</p>
<p>If learn-to-rank algorithms could explicitly observe which results were deemed relevant by most users and which were not, we would have open access to usersʼ minds. Alas, we live in the real world where only user clicks can be observed; their thought patterns are forever hidden, for us to infer and speculate upon.</p>
<p><strong>What are click models?</strong></p>
<p>Click models are mathematical models that attempt to do just that: Describe a typical userʼs decision process as he or she interacts with the search results page, so that we may infer said userʼs judgments on the relevance and irrelevance of specific search results.</p>
<p>Take, for example, the following scenario:</p>
<p>The user searched for “Meaning of life”. A search result page (SRP) with 50 results was served back. The user then clicked on result #2, and we never heard from him again.</p>
<p>Consider the following two explanations:</p>
<ol>
<li>The user looked at the SRP, read the snippet returned for result #1, then ignored it as irrelevant. The user then moved to result #2, read the snippet, found it attractive, clicked through to the page, found the meaning of life in there, then stopped the search, satisfied with what he found.</li>
<li>The user glanced at the SRP, chose result #2 randomly, read the snippet, found it somewhat relevant, clicked through to the page, and found it completely irrelevant. Then his phone rang and he abandoned the search.</li>
</ol>
<p>According to the first explanation, result #2 was relevant to the userʼs search. According to the second explanation, it wasnʼt. Both explanations (and many others) are possible. But are they equally likely?</p>
<p>A click model helps us assign mathematical probabilities to every such explanation, enabling us to use millions of scenarios to infer the likelihood of relevance for every search result against every search.</p>
<p><strong>Commonly Used Click Models</strong></p>
<p>The simplest class of click models is called <em>position models</em>. Position models assume that search results have a probability of being examined by the user that decays with the position of the result within the page. A click depends on a result being examined and deemed relevant, so that P(click) = P(examined) * P(relevant), and P(examined) is a decaying function of rank.</p>
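<p>As a minimal sketch of a position model, the Python snippet below multiplies an examination probability by a relevance probability. The geometric decay and the value 0.7 are assumptions chosen for illustration; the text above only requires that P(examined) decrease with rank.</p>

```python
def examination_probability(rank, decay=0.7):
    """Assumed geometric decay: rank 1 is always examined."""
    return decay ** (rank - 1)


def click_probability(rank, relevance, decay=0.7):
    """Position model: P(click) = P(examined) * P(relevant)."""
    return examination_probability(rank, decay) * relevance


# The same result (relevance 0.5) is clicked less often lower on the page.
for rank in (1, 2, 5):
    print(rank, round(click_probability(rank, 0.5), 3))
```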
<p><em>Cascade models</em> are another class of click models, where the user is assumed to examine the results sequentially: starting from the top, clicking on the first relevant result examined, and stopping the search immediately. Here the probability of a click depends on the relevance of a result, as well as the irrelevance of all previous results. This model doesnʼt account for abandoned searches, or searches with multiple clicks.</p>
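<p>Under the cascade assumption, the click probability at a position is the relevance of that result times the probability that every earlier result was judged irrelevant. A small sketch, with made-up relevance values:</p>

```python
def cascade_click_probabilities(relevances):
    """Cascade model: P(click at i) = r_i * product of (1 - r_j) for j before i."""
    probabilities = []
    all_previous_skipped = 1.0
    for r in relevances:
        probabilities.append(all_previous_skipped * r)
        all_previous_skipped *= 1.0 - r
    return probabilities


clicks = cascade_click_probabilities([0.2, 0.5, 0.5])
print(clicks)
```

<p>Note that the second and third results have the same relevance, yet the third receives half as many clicks because the user often stops at the second.</p>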
<p>Recent work has shown that using a dynamic Bayesian network (DBN) to model general web search clicks outperforms both position models and cascade models [Olivier Chapelle and Ya Zhang, <a href="http://olivier.chapelle.cc/pub/DBN_www2009.pdf">A dynamic Bayesian network click model for web search ranking</a>, <em>Proceedings of the 18th International World Wide Web Conference (WWW)</em>, 2009]. The DBN model assumes that users examine items in an SRP starting at position 1 and working downwards: skipping items that appear irrelevant and clicking on items that appear relevant, until either abandoning the search or landing on a satisfactory page following a click from the SRP. The user is assumed to leave the search as soon as the first satisfactory page is presented. SRP clicks are observed in the logs, but all other variables in the Bayesian network are hidden.</p>
<p>Figure 1 below illustrates the concept. The boolean variables inside the box relate to user behavior, whether observable or not, and the boolean variables below the box relate to intrinsic properties of URLs: <span>a<sub>u</sub></span> denotes how click-attractive a URL snippet is, and <span>s<sub>u</sub></span> denotes how satisfactory the page really is once you land on it following a search.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/11/fig1.png"><img class="alignnone size-full wp-image-987" title="DBN click model proposed by Chapelle et al" src="http://www.ebaytechblog.com/wp-content/uploads/2011/11/fig1.png" alt="" width="599" height="403" /></a></p>
<p>Here, clicks are the only observable events. All other events are to be inferred using expectation maximization (EM) guided by the causality links that the model assumes between boolean variables. The causality links are denoted in the figure by arrows.</p>
<p>A parameter of the model is the user persistence rate γ, which is the probability that a user would continue to examine results after landing on an irrelevant page.</p>
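<p>The forward pass below sketches how the DBN assumptions translate into click probabilities per position when the parameters are known. It assumes that γ is applied after every examined result that did not satisfy the user, whether or not it was clicked; that reading and all the parameter values are assumptions for illustration. Inferring the parameters from logs with EM is not shown.</p>

```python
def dbn_click_probabilities(attractiveness, satisfaction, gamma):
    """Marginal click probability per position under the DBN assumptions."""
    click_probabilities = []
    p_examine = 1.0  # the first result is always examined
    for a_u, s_u in zip(attractiveness, satisfaction):
        click_probabilities.append(p_examine * a_u)
        # The user moves on only if not satisfied (probability 1 - a_u * s_u)
        # and persistent (probability gamma).
        p_examine *= gamma * (1.0 - a_u * s_u)
    return click_probabilities


print(dbn_click_probabilities([0.5, 0.5], [0.8, 0.8], gamma=0.9))
```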
<p><strong>Why We Need Something New</strong></p>
<p>DBN and other web search click models are not well suited for explaining user behavior at eCommerce sites such as eBay. Compared to users of general web search, shoppers at eBay are unlikely to be satisfied by landing on the first relevant result. Shopping and web browsing are radically different IR exercises. For shopping, the items in the SRP are <em>competing for the shopperʼs business</em>, and the shopper is in a process of <em>arbitration</em>, where he/she must choose to award business to zero or one winning item.</p>
<p>Unlike web surfers, web shoppers (think referees) are expected to sample seemingly relevant items until they feel satisfied that they have come across a representative sample of good deals <em>as well as bad deals</em>. Only then can a shopper make an informed decision that an item is a winner. In this setting, a good deal is only good relative to other deals that are not as good. For example, if the top ten results are equally good deals then weʼd expect the shopper to keep scrolling down until a satisfactory number of bad deals are encountered. That is in sharp contrast to the general web browsing behavior assumed in DBN.</p>
<p><strong>Proposed Click Model </strong></p>
<p>The following model builds upon DBN, with a focus on eCommerce user behavior.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/11/fig2.png"><img class="alignnone size-full wp-image-988" title="DBN click model adapted for eCommerce" src="http://www.ebaytechblog.com/wp-content/uploads/2011/11/fig2.png" alt="" width="593" height="399" /></a></p>
<p>In this model, SRPs list items on sale rather than webpage URLs. Every listing L is assumed to have two independent intrinsic properties: <span>a<sub>L</sub></span> and <span>g<sub>L</sub></span>. The <span>a<sub>L</sub></span> property denotes how attractive the listing looks in the SRP, which influences clicks; and <span>g<sub>L</sub></span> denotes how much of a good shopping deal the listing really is, which depends on the price, item details, seller details, shipping details, etc. Both variables are latent – that is, cannot be observed directly; however, they are assumed to influence the behavior of users on the site, and therefore can be inferred (hopefully) by analyzing mountains of click data.</p>
<p>Notice how the model still contains <span>E<sub>i</sub></span> (examine) events, and how, just like the DBN model, the user is assumed to examine results sequentially top to bottom. The proposed model also contains <span>A<sub>i</sub></span> (attracted enough to click) events; but <span>S<sub>i</sub></span> (satisfied) events are replaced with <span>G<sub>i</sub></span> (good deal) events, which denote that the user found the listing to be a &#8220;good deal&#8221; overall. We introduce a new event S (success event) to denote a successful shopping experience. At eBay, an S event translates into a bid or buy-it-now action. Because S events are directly observable, this model enables learning from clicks, bids, and purchases simultaneously.</p>
<p>Upon finding a good deal, the user performs an S event with probability f(G), which depends not only on the goodness of the current listing, but on the goodness of all previously examined listings. This is how the model accounts for users browsing multiple good and bad deals before arriving at an S decision. In the simplest case, f(G) can be the sum of good deals encountered, but can be generalized to the form:</p>
<p>f(G) = αΣ(good deals) + βΣ(bad deals)</p>
<p>Where α and β are model parameters.</p>
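<p>A direct transcription of this formula into Python might look as follows. The function returns the raw weighted sum; how that score is mapped onto a probability is not specified here, so no normalization is attempted, and the parameter values are arbitrary.</p>

```python
def success_score(deal_is_good, alpha=1.0, beta=0.0):
    """f(G) = alpha * (number of good deals) + beta * (number of bad deals)."""
    good = sum(1 for g in deal_is_good if g)
    bad = len(deal_is_good) - good
    return alpha * good + beta * bad


# Listings examined so far: good, bad, good.
examined = [True, False, True]
print(success_score(examined))                       # simplest case: count of good deals
print(success_score(examined, alpha=0.3, beta=0.1))  # generalized form
```

<p>With the defaults (α = 1, β = 0) the function reduces to the simplest case, the count of good deals encountered.</p>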
<p>This modeling framework, if properly applied to eCommerce click logs, could greatly improve how well ranking algorithms learn from click data.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ebaytechblog.com/2011/11/22/click-modeling-for-ecommerce/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Personalized Search at eBay, part II</title>
		<link>http://www.ebaytechblog.com/2011/11/04/personalized-search-at-ebay-part-ii/</link>
		<comments>http://www.ebaytechblog.com/2011/11/04/personalized-search-at-ebay-part-ii/#comments</comments>
		<pubDate>Fri, 04 Nov 2011 22:14:28 +0000</pubDate>
		<dc:creator>David Goldberg</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Search Science]]></category>

		<guid isPermaLink="false">http://www.ebaytechblog.com/?p=780</guid>
		<description><![CDATA[In the first part of this blog posting, I talked about how to estimate a buyer’s propensity to purchase an auction over a fixed price item. I gave the Empirical Bayes formula f = (a+k)/(a + b+ n) for the auction propensity f which is a compromise between the buyer’s shopping history k/n and the [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>In the first part of this blog posting, I talked about how to estimate a buyer’s propensity to purchase an auction over a fixed-price item. I gave the Empirical Bayes formula <em>f = (a + k)/(a + b + n)</em> for the auction propensity <em>f</em>, which is a compromise between the buyer’s shopping history <em>k/n</em> and the propensity of his peer group <em>a/(a+b)</em>. In this posting I will explain where the formula comes from, and how to compute the numbers <em>a</em> and <em>b</em>.</p>
<p>Since the method is called Empirical Bayes, it won’t surprise you to learn that it uses Bayes Theorem</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/10/bayes1.png"><img class="alignnone" src="http://www.ebaytechblog.com/wp-content/uploads/2011/10/bayes1.png" alt="" width="164" height="30" /></a></p>
<p>The specific form I’ll use is</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/10/bayes22.png"><img class="alignnone" src="http://www.ebaytechblog.com/wp-content/uploads/2011/10/bayes22.png" alt="" width="189" height="30" /></a></p>
<p>In this formula, <em>p</em> is the probability of buying an auction,<em> n</em> is the number of purchases that a shopper has made, and <em>k</em> is the number of those purchases that were auctions. The left-hand side is what I’d like to know: the probability (propensity) of an auction given that the buyer has previously purchased <em>k/n</em> auctions. As usual, the reason to use Bayes formula is that it relates the unknown left-hand side to the computable right-hand side.</p>
<p>The pleasure of Bayes formula is the first term on the right-hand side. This is the textbook formula we’ve seen before:</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/10/bayes_right.png"><img class="alignnone" src="http://www.ebaytechblog.com/wp-content/uploads/2011/10/bayes_right.png" alt="" width="200" height="65" /></a></p>
<p>The pain of Bayes formula is the second term, the prior Pr(<em>p</em>). This is the estimate of <em>p</em> before (prior to) learning there were <em>k</em>/<em>n</em> auction purchases. Much time has been spent debating what value to use for the prior. This is where empirical Bayes comes in. I will use the data itself to estimate Pr(<em>p</em>). I do it by assuming that Pr(<em>p</em>) follows a Beta probability distribution</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/10/beta_dist.png"><img class="alignnone" src="http://www.ebaytechblog.com/wp-content/uploads/2011/10/beta_dist.png" alt="" width="300" height="62" /></a></p>
<p>and all I need to do is specify the two parameters <em>a</em> and <em>b</em>. Why this distribution? It’s general enough to fit most data, but simple enough to make explicit calculations possible. I’ll do those calculations in a moment. The following figure from Wikipedia shows the first point. By varying the parameters <em>a</em> and <em>b</em> (&alpha; and &beta; in the figure) you can get a large range of distributions.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/10/distributions.png"><img class="alignnone" src="http://www.ebaytechblog.com/wp-content/uploads/2011/10/distributions.png" alt="" width="620" height="388" /></a></p>
<p>Now for the details of how to compute <em>a</em> and <em>b</em>. The probability of purchasing <em>k</em>/<em>n</em> auctions is</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/10/probability1.png"><img class="alignnone" src="http://www.ebaytechblog.com/wp-content/uploads/2011/10/probability1.png" alt="" width="194" height="45" /></a></p>
<p>If I don’t know a precise value for <em>p</em>, but do know it varies via some probability distribution Prob(<em>p</em>) then <em>p<sub>k</sub></em> is an average over those <em>p</em>, expressed as the integral</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/10/integral.png"><img class="alignnone" src="http://www.ebaytechblog.com/wp-content/uploads/2011/10/integral.png" alt="" width="336" height="65" /></a></p>
<p>As I mentioned above, I’ll assume that Prob(<em>p</em>) is a beta distribution</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/10/beta_dist2.png"><img class="alignnone" src="http://www.ebaytechblog.com/wp-content/uploads/2011/10/beta_dist2.png" alt="" width="371" height="60" /></a></p>
<p>Now plug this value of Prob(<em>p</em>) into the formula for <em>p<sub>k</sub></em> which I&#8217;ve written as <em>p<sub>k</sub></em>(<em>a</em>, <em>b</em>) to emphasize the dependence on <em>a</em> and <em>b</em>.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/10/plugin.png"><img class="alignnone" src="http://www.ebaytechblog.com/wp-content/uploads/2011/10/plugin.png" alt="" width="419" height="60" /></a></p>
<p>If you’re following along in detail, you’ll see that I did some simplifications after the plug-in. The important thing is this: the value of <em>p<sub>k</sub></em> depends on the known <em>k</em>, <em>n</em> and the unknown <em>a</em>, <em>b</em>, and my job is to pick those <em>a</em>, <em>b</em> so that</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/10/minimize.png"><img class="alignnone" src="http://www.ebaytechblog.com/wp-content/uploads/2011/10/minimize.png" alt="" width="220" height="92" /></a></p>
<p>is minimized. In other words, I want to make the computed <em>p<sub>k</sub></em> as close as possible to the observed fraction, which I&#8217;ve written with a hat. For example <em>p<sub>2</sub></em> (<em>a</em>,<em>b</em>)  is the computed fraction of people in the peer group that bought 2 auctions assuming the prior is Beta(<em>a</em>, <em>b</em>), while <em>p<sub>2</sub></em> with a hat is the observed fraction of users who bought two auctions.</p>
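<p>Here is a rough sketch of that fitting step in Python. It assumes the quantity being minimized is the sum of squared differences between computed and observed fractions, and it swaps in a coarse grid search for whatever optimizer was actually used. The "observed" fractions are synthetic, generated from a known prior so that the search has a known answer.</p>

```python
from math import comb, exp, lgamma


def log_beta(x, y):
    return lgamma(x) + lgamma(y) - lgamma(x + y)


def p_k(k, n, a, b):
    """Beta-binomial probability of k auctions out of n purchases."""
    return comb(n, k) * exp(log_beta(k + a, n - k + b) - log_beta(a, b))


def squared_error(observed, n, a, b):
    return sum((p_k(k, n, a, b) - observed[k]) ** 2 for k in range(n + 1))


def fit_prior(observed, n, grid):
    """Coarse grid search for the (a, b) pair with the smallest squared error."""
    candidates = [(a, b) for a in grid for b in grid]
    return min(candidates, key=lambda ab: squared_error(observed, n, ab[0], ab[1]))


# Synthetic "observed" fractions generated from a known prior, for illustration only.
n = 21
observed = [p_k(k, n, 1.2, 2.2) for k in range(n + 1)]
grid = [round(0.2 * i, 1) for i in range(1, 21)]  # 0.2, 0.4, ..., 4.0
a_hat, b_hat = fit_prior(observed, n, grid)
print(a_hat, b_hat)
```

<p>On real data the observed fractions would come from counting buyers in the peer group, and a finer grid or a numerical optimizer would be needed to reach two-decimal precision.</p>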
<p>Here’s an illustration. Suppose the peer group is users who’ve purchased 21 items. Then the best values of <em>a</em> and <em>b</em> are 1.16 and 2.22 which gives this beta distribution:</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/10/21items1.png"><img class="alignnone" src="http://www.ebaytechblog.com/wp-content/uploads/2011/10/21items1.png" alt="" width="550" height="412" /></a></p>
<p>Once I have <em>a</em> and <em>b</em> I can compute the <em>p<sub>k</sub></em> (<em>a</em>,<em>b</em>) and see how closely they approximate <a href="http://www.ebaytechblog.com/wp-content/uploads/2011/10/pk1.png"><img class="alignnone" src="http://www.ebaytechblog.com/wp-content/uploads/2011/10/pk1.png" alt="" height="35" /></a></p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/10/pkab.png"><img class="alignnone" src="http://www.ebaytechblog.com/wp-content/uploads/2011/10/pkab.png" alt="" width="600" height="291" /></a></p>
<p>The match is excellent. The red predicted curve has only two parameters and yet it has a very good fit to all 22 points.</p>
<p>Now that I’ve explained how to compute <em>a</em> and <em>b</em>, I can explain where the formula <em>f</em> = (<em>a</em> + <em>k</em>)/(<em>a</em> + <em>b</em> + <em>n</em>) comes from. Use Bayes formula</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/10/formula2.png"><img class="alignnone" src="http://www.ebaytechblog.com/wp-content/uploads/2011/10/formula2.png" alt="" width="192" height="30" /></a></p>
<p>and substitute the beta distribution B<em><sub>a,b</sub></em> for Pr(<em>p</em>) as follows:</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/10/formula.png"><img class="alignnone" src="http://www.ebaytechblog.com/wp-content/uploads/2011/10/formula.png" alt="" width="390" height="171" /></a></p>
<p>I’ve been a little pedantic and used â to show that it is an estimate of <em>a</em> derived from the peer group (and similarly for <em>b</em>). The calculations above show that when you plug the formula for B<em><sub>a,b</sub></em>(<em>p</em>) into Bayes formula you get out another beta distribution! This is the explanation for my earlier statement that beta was chosen to make explicit calculations possible.</p>
<p>My goal is to estimate the auction propensity for a buyer who has purchased <em>k</em>/<em>n</em> auctions. What I have so far is a probability distribution for this propensity, specifically a beta distribution. If I want a single probability, the natural choice is the average of that beta distribution. The average of Beta(<em>a</em>,<em>b</em>) is <em>a</em>/(<em>a</em>+<em>b</em>). So if I want to get a single number for the auction propensity, I take the average of</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/10/propensity.png"><img class="alignnone" src="http://www.ebaytechblog.com/wp-content/uploads/2011/10/propensity.png" alt="" width="162" height="40" /></a></p>
<p>which is (<em>a</em>+<em>k</em>)/(<em>a</em>+<em>b</em>+<em>n</em>). This is the formula that I have used many times in this posting.</p>
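<p>For readers who cannot see the formula images, here is the same argument in text form. It is a sketch of the standard beta-binomial calculation, assuming a binomial likelihood for the <em>k</em> auctions out of <em>n</em> purchases; normalizing constants are dropped.</p>

```latex
\begin{align*}
\Pr(p \mid k, n) &\propto \Pr(k \mid p, n)\,\Pr(p) \\
                 &\propto p^{k}(1-p)^{n-k} \cdot p^{a-1}(1-p)^{b-1} \\
                 &= p^{(a+k)-1}(1-p)^{(b+n-k)-1}
\end{align*}
% This is the kernel of Beta(a+k, b+n-k), whose mean is
\[
\frac{a+k}{(a+k) + (b+n-k)} = \frac{a+k}{a+b+n}.
\]
```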
<p>So much for the calculations. I want to end by discussing whether personalization is desirable. Some have argued that it isn’t a good idea to show users only what they’ve seen before. I think these arguments are often attacking a straw man. A good personalization system doesn’t eliminate all variety; it merely tailors it. Suppose our Mr. X has purchased 20 items, all of them at fixed price. But his friend Ms. Y has done the opposite: all her purchases were made by auction. I find it hard to believe that anyone would seriously argue that X and Y should see the same search results. Just as bad as showing X a pile of auctions would be to never show him any auctions. A good personalization system should be nuanced. Methods like those shown in this posting provide principled estimates of a user’s preferences. But how these preferences are used requires some finesse.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ebaytechblog.com/2011/11/04/personalized-search-at-ebay-part-ii/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Personalized Search at eBay</title>
		<link>http://www.ebaytechblog.com/2011/10/24/personalized-search-at-ebay/</link>
		<comments>http://www.ebaytechblog.com/2011/10/24/personalized-search-at-ebay/#comments</comments>
		<pubDate>Mon, 24 Oct 2011 20:53:12 +0000</pubDate>
		<dc:creator>David Goldberg</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Search Science]]></category>

		<guid isPermaLink="false">http://www.ebaytechblog.com/?p=745</guid>
		<description><![CDATA[In this blog post and a succeeding one, I will discuss personalization at eBay. Specifically, I’ll talk about how search can be personalized. First I’ll investigate whether buyers actually differ in their preferences for the items they’d like to see in their search results. Next I’ll discuss what preferences have the widest variation from buyer [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>In this blog post and a succeeding one, I will discuss personalization at eBay. Specifically, I’ll talk about how search can be personalized. First I’ll investigate whether buyers actually differ in their preferences for the items they’d like to see in their search results. Next I’ll discuss what preferences have the widest variation from buyer to buyer. Then I’ll talk about how to implement personalization, and finally I’ll explain the theory that underpins the implementation. Some people argue that personalization is actually not desirable. I’ll address that at the end of my last posting.</p>
<p>To make this concrete, I’ll focus on what eBay calls <em>format</em>: is the item being offered for sale via an auction, or is it a fixed-price item with a buy-it-now button? Of course some users buy more auctions than others. But does this indicate a preference? Perhaps not. Perhaps users simply search for the best deal, and if a user buys more auctions it’s not because they prefer auctions, but simply because they happened to purchase a string of items where the best deal was an auction.</p>
<p>The following graph suggests that the “just looking for the best deal” theory is not tenable.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/09/familiarity.png"><img src="http://www.ebaytechblog.com/wp-content/uploads/2011/09/familiarity.png" alt="" width="620" height="272" /></a></p>
<p>Users who bought 1 item in the past year did not have much taste for auctions. They only purchased an auction 25% of the time. On the other hand, heavy buyers who purchased at least 600 items in the past year bought mostly auctions – over 70% of their purchases were auctions. So it seems that users do have different preferences. Infrequent buyers prefer fixed-price items, while frequent buyers prefer auctions.</p>
<p>But before discarding the “looking for best deal” hypothesis, I want to consider some alternate possibilities. First is the question you should always ask – is the data statistically significant? Given how consistently the curve moves upwards, it seems unlikely that we’re seeing an artifact due to noise. But I can check this rigorously. I’m measuring a binary variable – one that takes on two values, auction or fixed-price. Statistics textbooks tell us that the noise (as estimated by the standard error) is <span>√<span style="text-decoration: overline"> pq/n </span></span>. The leftmost point represents 14 million buyers. So <em>p</em>=.25, <em>q</em>=.75 and n≈14000000. The noise is about 0.0001. I can be confident that the 25% is very accurate.</p>
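<p>The arithmetic is easy to check. This short Python sketch evaluates the standard-error formula with the numbers quoted above; the function name is my own.</p>

```python
import math

def standard_error(p, n):
    # Standard error sqrt(p*q/n) of an observed fraction p over n observations.
    q = 1.0 - p
    return math.sqrt(p * q / n)

# Leftmost point of the graph: p = 0.25 with about 14 million buyers.
se = standard_error(0.25, 14_000_000)
print(f"{se:.6f}")
```

<p>The result is about 0.0001, tiny compared with the 25% being measured.</p>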
<p>Here’s a more subtle alternate explanation. Suppose everyone has similar auction preferences. But suppose infrequent buyers tend to purchase electronics, which has fewer auctions, and frequent buyers are more likely to buy antiques, which have a high fraction of auctions. So the curve might be consistent with buyers having similar auction preferences, and simply be a reflection of the type of items that different buyers purchase.</p>
<p>I can test this alternate explanation by focusing on a single category, for example Jewelry and Watches. This is a category with a high proportion of auction listings.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/09/experience.png"><img src="http://www.ebaytechblog.com/wp-content/uploads/2011/09/experience.png" alt="" width="620" height="276" /></a></p>
<p>The plot shows the same pattern as before: infrequent buyers prefer fixed price, heavy buyers prefer auctions. The curve is shifted upwards, because Jewelry is a high-auction category. But there’s a clear difference in preferences. The same pattern holds in a low-auction category like Computers &amp; Networking.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/09/experience2.png"><img src="http://www.ebaytechblog.com/wp-content/uploads/2011/09/experience2.png" alt="" width="620" height="275" /></a></p>
<p>The curve is shifted downward, but again more experienced buyers have a stronger preference for auctions. So to summarize, the most likely explanation of the graphs I’ve shown is that some buyers have a much stronger preference for auctions than other buyers.</p>
<p>But I want to dig deeper. Suppose I look at buyers with similar buying experience, for example all those who bought 8 items in the past year. Do they have different format preferences?</p>
<p>To answer that question, I need to be more precise about what it means for users to have similar format preferences. I’ll assume that buyers chose an auction with probability <em>p</em>. One way that might happen is that just before buyers make a purchase, they mentally throw dice in their head, and depending on the outcome say “I feel like an auction today”, or “I feel like buying fixed price today”. OK, that’s not too realistic, but there are other scenarios where it will appear that buyers are choosing an auction with probability <em>p</em>. For example their decision might depend on the inventory available, and that could vary randomly. Or more likely, their decision depends on factors that I don’t have available, such as how soon they want to receive the item (for an auction, you have to wait for bidding to end). In any case, imagining that buyers have some propensity for auctions that is captured as a probability is a simple but plausible model.</p>
<p>A consequence of this model is that I can predict how many auctions a user will buy. Recall that I’m focusing on those who bought 8 items. Assuming that all buyers have the same auction propensity <em>p</em>, the chance that <em>k</em> of the 8 purchases are an auction is the familiar formula <a href="http://www.ebaytechblog.com/wp-content/uploads/2011/08/formula1.png"><img class="aligncenter" src="http://www.ebaytechblog.com/wp-content/uploads/2011/08/formula1.png" alt="" width="134" height="35" /></a> In this formula <em>n</em> = 8. In Jewelry &amp; Watches, about 75% of the purchases are auctions, so <em>p</em> = 0.75. The top histogram below labeled <strong>Predicted</strong> was generated using this formula.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/09/auctions.png"><img src="http://www.ebaytechblog.com/wp-content/uploads/2011/09/auctions.png" alt="" width="620" height="360" /></a></p>
<p>The most likely outcome is 6 auctions out of 8 (6/8 = 0.75). But many users are predicted to buy 8 auctions. In our model they buy an auction with probability <em>p</em>, and so there’s random variation. They buy 6 on average, but some buy 8 and others only 2. The bottom histogram gives the actual numbers. They are nothing like the prediction. This suggests that the hypothesis that all users have the same auction preference is false. Here’s another example in the Computers &amp; Networking category.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/09/auctions2.png"><img src="http://www.ebaytechblog.com/wp-content/uploads/2011/09/auctions2.png" alt="" width="620" height="357" /></a></p>
<p>Again, the count of auctions vs. fixed price is nothing like what would be predicted if all users had a similar auction preference. Here’s a summary of my reasoning. I focused on users with similar buying experience by looking at those who bought 8 items in the past year. I observed how many bought 0/8 auctions, how many bought 1/8, etc. If all buyers had similar auction preferences, that preference would be 75% (in Jewelry), and I could predict how many would buy 0/8, 1/8, etc. But the predicted values and observed values are totally different, so I conclude that buyers do not have similar auction preferences.</p>
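<p>The predicted histogram can be reproduced in a few lines. This Python sketch assumes the formula in the image is the usual binomial probability, and uses <em>n</em> = 8 and <em>p</em> = 0.75 from the Jewelry &amp; Watches example; the function name is mine.</p>

```python
import math

def binomial_pmf(k, n, p):
    # Chance that exactly k of n purchases are auctions when every
    # buyer chooses an auction with the same probability p.
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Predicted fraction of buyers with 0/8, 1/8, ..., 8/8 auctions.
predicted = [binomial_pmf(k, 8, 0.75) for k in range(9)]
for k, prob in enumerate(predicted):
    print(f"{k}/8 auctions: {prob:.3f}")
```

<p>Comparing this list against the observed fractions is exactly the comparison the two histograms make.</p>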
<p>Let me move on to a new topic. Users have widely varying preferences for auctions. Are there other attributes of items that vary widely? I can repeat the analysis above to find out. I will examine 3 different attributes: item condition (new vs. used), whether the seller had an eTRS (eBay Top-Rated Seller) rating, and whether the item had free shipping. In each case I will compare the predicted and observed plots, and observe by how much they differ. First item condition:</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/09/not_new.png"><img src="http://www.ebaytechblog.com/wp-content/uploads/2011/09/not_new.png" alt="" width="620" height="397" /></a></p>
<p>The two histograms are very different, so condition appears to be a good attribute for personalization. Next is a graph for items sold by trusted sellers.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/09/etrs.png"><img src="http://www.ebaytechblog.com/wp-content/uploads/2011/09/etrs.png" alt="" width="620" height="389" /></a></p>
<p>This time the histograms are similar. True, the tails are a bit larger in the observed histogram, but there is nowhere near the variation there was for item condition. Finally, here’s free shipping.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/09/free_shipping.png"><img src="http://www.ebaytechblog.com/wp-content/uploads/2011/09/free_shipping.png" alt="" width="620" height="388" /></a></p>
<p>The observed histogram is not too different from what would be expected if everyone had a 44% preference for free shipping. So I conclude that personalizing search results on format and condition will probably be much more useful than personalizing on eTRS or free shipping.</p>
<p>Now that I’ve established that buyers have different preferences, I need to figure out how to estimate a buyer’s personal preferences. I will use a technique called <em>Empirical Bayes</em>. Like all things Bayes, it has some controversy. Brad Efron, the former chair of the Stanford Statistics department, has said:</p>
<p><em>The suggestion here, made explicit in the final section, is that after 50 years of underuse, we are poised for an avalanche of empirical Bayes applications</em></p>
<p>Meanwhile across the Bay, the late statistician David Blackwell, who was chairman of the Berkeley statistics department, has been described as taking the opposite view:</p>
<p><em>He [David Blackwell] noted that he didn’t believe in empirical Bayes and showed that it didn’t make sense when applied to a single inference.</em></p>
<p>I will ignore the controversy and plunge ahead. First, here’s an extremely simple way to implement personalization: If a user has bought at least 10 items, I will assume I have enough information to estimate their auction preference, and compute it as <em>k</em>/<em>n</em> where <em>k</em> is the number of auction purchases and <em>n</em> is the total number of purchases. Otherwise I make no assumptions. This is clearly unsatisfying. Why 10? Do I really know nothing if they’ve purchased 9 items, but suddenly at 10 items I believe their purchase history is the whole story? As a rule of thumb, hard cutoffs like this do not perform as well as a smooth transition. And empirical Bayes is a perfect tool for eliminating the hard cutoff. It gives a principled way of estimating a buyer’s propensity (probability) of purchasing an auction. In the next posting, I’ll explain how to use Empirical Bayes to derive an estimation formula. For now, I’ll just explain how the formula applies to a typical buyer, Mr. X.</p>
<p>The first step is to aggregate buyers who are similar to Mr. X. This peer group is then summarized by two numbers, <em>a</em> and <em>b</em>. I’ll explain how to compute them later. These numbers encode the <strong>aggregate</strong> estimate of the auction propensity, <em>f<sub>A</sub></em> = <em>a</em>/(<em>a</em>+<em>b</em>). I write this as <em>f</em>, since this probability estimates the fraction of auctions. The next step is to record X’s purchases. If he has made <em>n</em> purchases, and <em>k</em> were in auction format, then his <strong>personal</strong> estimate is <em>f<sub>P</sub></em> <em>= k</em>/<em>n</em>. Empirical Bayes gives a compromise between these two numbers, via the formula <em>f</em> = (<em>a</em>+<em>k</em>)/(<em>a</em> + <em>b</em>+ <em>n</em>). I’ll next check whether this formula is reasonable.</p>
<p>Suppose there were no purchases, so that <em>k</em> = <em>n</em> = 0. Then the formula becomes <em>f</em> = (<em>a</em>+<em>k</em>)/(<em>a</em> + <em>b</em>+ <em>n</em>) = <em>a</em>/(<em>a</em> + <em>b</em>) which is the aggregate estimate. This makes sense. On the other extreme, suppose X has made many purchases so that <em>k</em> and <em>n</em> are large. Then the Bayes estimate becomes <em>f</em> = (<em>a</em>+<em>k</em>)/(<em>a</em> + <em>b</em>+ <em>n</em>) ≈ <em>k</em>/<em>n</em> which is the user’s personal preference. So again the formula makes sense: it gives the aggregate value initially, and then as X’s purchases increase, it becomes more like his personal value.</p>
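<p>Both limiting cases are easy to confirm numerically. In this Python sketch the function name is mine, and the values <em>a</em> = 0.95, <em>b</em> = 4.64 are the Computers &amp; Networking numbers quoted later in this post; the purchase counts are made up for illustration.</p>

```python
def empirical_bayes_estimate(a, b, k, n):
    # Compromise between the aggregate estimate a/(a+b)
    # and the personal estimate k/n.
    return (a + k) / (a + b + n)

a, b = 0.95, 4.64

# No purchases yet: the estimate equals the aggregate a/(a+b).
no_history = empirical_bayes_estimate(a, b, 0, 0)

# Many purchases (hypothetical counts): the estimate approaches k/n = 0.7.
long_history = empirical_bayes_estimate(a, b, 7000, 10000)

print(round(no_history, 2), round(long_history, 4))
```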
<p>I’ll now work through an example. Suppose I’ve bought 8 items in the Computers &amp; Networking category in the past year. I’ll use as my peer group the buyers who have also purchased 8 items in Computers &amp; Networking. The <em>a</em>, <em>b</em> numbers are <em>a</em>=0.95, <em>b</em>=4.64. These numbers encode the behavior of my peer group, and I can use them to compute the average fraction of auctions purchased by my peers via <em>f<sub>A</sub></em> = <em>a</em>/(<em>a</em>+<em>b</em>) = .95/(.95 + 4.64) = .17. The empirical Bayes formula says that my auction propensity is <em>f</em> = (<em>a</em>+<em>k</em>)/(<em>a</em> + <em>b</em>+ <em>n</em>) = (.95+<em>k</em>)/13.59. Here’s a plot of what the formula predicts for different values of <em>k</em>, the number of purchases that were auctions.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/09/8purchases.png"><img src="http://www.ebaytechblog.com/wp-content/uploads/2011/09/8purchases.png" alt="" width="620" height="329" /></a></p>
<p>The empirical Bayes estimate is a compromise between the aggregate value and my personal value. Here’s the plot with <em>n</em><em> </em>=<em> </em>4 and <em>n</em><em> </em>=<em> </em>16 added to <em>n</em><em> </em>=<em> </em>8.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/09/n_purchases.png"><img src="http://www.ebaytechblog.com/wp-content/uploads/2011/09/n_purchases.png" alt="" width="620" height="328" /></a></p>
<p>As <em>n</em> gets larger, the Empirical Bayes estimate gets closer and closer to the personal estimate. Here’s a second example in the Jewelry &amp; Watches category. Again with 8 purchases, the numbers <em>a</em> and <em>b</em> are <em>a</em><em> </em>=<em> </em>0.75, <em>b</em><em> </em>=<em> </em>0.80 and so <em>f<sub>A</sub></em> = <em>a</em>/(<em>a</em>+<em>b</em>) = .75/(.75 + .80) = .48. The plot is below:</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/09/8purchases2.png"><img src="http://www.ebaytechblog.com/wp-content/uploads/2011/09/8purchases2.png" alt="" width="620" height="332" /></a></p>
<p>Again, the Empirical Bayes estimate is between the aggregate and personal. You might be wondering why the aggregate isn’t more simply represented by a single number <em>f<sub>A</sub></em>, rather than decomposing it into <em>f<sub>A</sub></em> = <em>a</em>/(<em>a</em>+<em>b</em>). The reason is that <em>a</em> and <em>b</em> encode not just the average number of auctions in your peer group, but also the variation in that group. When <em>a</em> and <em>b</em> are large, buyers in the peer group cluster closely near <em>f<sub>A</sub></em> (left plot below). When <em>a</em> and <em>b</em> are small, buyers have a large spread (right plot).</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/09/fA.png"><img src="http://www.ebaytechblog.com/wp-content/uploads/2011/09/fA.png" alt="" width="620" height="188" /></a></p>
<p>This is consistent with <em>f</em> = (<em>a</em>+<em>k</em>)/(<em>a</em> + <em>b</em>+ <em>n</em>). When <em>a</em> and <em>b</em> are large, the left graph shows that the peer group is very consistent in its auction preference, and so we should be skeptical of a user whose personal preference is very different from <em>f<sub>A</sub></em>. And indeed, the formula <em>f</em> = (<em>a</em>+<em>k</em>)/(<em>a</em> + <em>b</em>+ <em>n</em>) shows that <em>k</em> and <em>n</em> have small influence. When <em>a</em> and <em>b</em> are small, the right graph applies. It shows that the peer group has no strong preference, and so <em>k</em> and <em>n</em> are more important.</p>
<p>I’ve actually already shown you an example of this. In Computers &amp; Networking, <em>a</em> and <em>b</em> are large, and the Bayes compromise is roughly midway between the personal and aggregate. But in Jewelry &amp; Watches, <em>a</em> and <em>b</em> are small and the compromise is much closer to the personal line &#8211; see below.</p>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/09/a_and_b.png"><img src="http://www.ebaytechblog.com/wp-content/uploads/2011/09/a_and_b.png" alt="" width="620" height="320" /></a></p>
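<p>One way to see why the size of <em>a</em> and <em>b</em> matters is to rewrite the formula as a weighted average of the aggregate and personal estimates, with weight (<em>a</em>+<em>b</em>)/(<em>a</em>+<em>b</em>+<em>n</em>) on the aggregate. This rewriting is my own algebra rather than something stated above, and the function names in this Python sketch are illustrative.</p>

```python
def aggregate_weight(a, b, n):
    # Share of the estimate that comes from the peer group.
    return (a + b) / (a + b + n)

def blended_estimate(a, b, k, n):
    # Same value as (a+k)/(a+b+n), written as a weighted average.
    # Requires n greater than 0 so that the personal estimate k/n exists.
    w = aggregate_weight(a, b, n)
    f_aggregate = a / (a + b)
    f_personal = k / n
    return w * f_aggregate + (1 - w) * f_personal

# Peer-group numbers quoted in this post, both with n = 8 purchases.
w_computers = aggregate_weight(0.95, 4.64, 8)
w_jewelry = aggregate_weight(0.75, 0.80, 8)
print(round(w_computers, 2), round(w_jewelry, 2))
```

<p>With these numbers the aggregate carries roughly 41% of the weight in Computers &amp; Networking but only about 16% in Jewelry &amp; Watches, which is why the Jewelry compromise sits so much closer to the personal line.</p>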
<p>In my next posting I will explain where the formula <em>f</em> = (<em>a</em>+<em>k</em>)/(<em>a</em> + <em>b</em>+ <em>n</em>) comes from, and how to compute the numbers <em>a</em> and <em>b</em>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ebaytechblog.com/2011/10/24/personalized-search-at-ebay/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The New eBay Motors Homepage is 2X Faster</title>
		<link>http://www.ebaytechblog.com/2011/10/14/the-new-ebay-motors-homepage-is-2x-faster/</link>
		<comments>http://www.ebaytechblog.com/2011/10/14/the-new-ebay-motors-homepage-is-2x-faster/#comments</comments>
		<pubDate>Sat, 15 Oct 2011 01:16:00 +0000</pubDate>
		<dc:creator>Senthil Padmanabhan</dc:creator>
				<category><![CDATA[Software Engineering]]></category>
		<category><![CDATA[Big Pipe]]></category>
		<category><![CDATA[chunked transfer encoding]]></category>
		<category><![CDATA[CSS Lint]]></category>
		<category><![CDATA[eBay Motors Home Page]]></category>
		<category><![CDATA[Image Data URI scheme]]></category>
		<category><![CDATA[Progressive Rendering]]></category>
		<category><![CDATA[site speed]]></category>
		<category><![CDATA[Web Performance]]></category>

		<guid isPermaLink="false">http://www.ebaytechblog.com/?p=640</guid>
		<description><![CDATA[Our eBay Motors homepage has been around for more than 5 years, and we needed to give it a facelift. One of our goals was to make it faster: we wanted the new page to render twice as fast compared to the old one. We started by running the YSlow and PageSpeed plug-ins and implementing [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>Our <a title="eBay Motors homepage" href="http://www.motors.ebay.com/">eBay Motors homepage</a> has been around for more than 5 years, and we needed to give it a facelift. One of our goals was to make it faster: we wanted the new page to render twice as fast compared to the old one.</p>
<p>We started by running the <a title="Yahoo! YSlow Performance Analyzer" href="http://developer.yahoo.com/yslow/">YSlow</a> and <a title="Google PageSpeed Performance Analyzer" href="http://code.google.com/speed/page-speed/">PageSpeed</a> plug-ins and implementing the recommendations. Much better already, as you can see below!</p>
<p style="text-align: center"><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/09/Tools.png" target="_blank"><img class="aligncenter" src="http://www.ebaytechblog.com/wp-content/uploads/2011/09/Tools.png" alt="Perf Tools Report" width="616" height="147" /></a></p>
<p>You can run the PageSpeed tests yourself against the old and new homepages:</p>
<ul>
<li>Old: <a href="http://pagespeed.googlelabs.com/pagespeed/#url=www.motors.ebay.com&amp;mobile=false">http://pagespeed.googlelabs.com/pagespeed/#url=www.motors.ebay.com&amp;mobile=false</a></li>
<li>New: <a href="http://pagespeed.googlelabs.com/pagespeed/#url=www.ebay.com_2Fmtr&amp;mobile=false">http://pagespeed.googlelabs.com/pagespeed/#url=www.ebay.com_2Fmtr&amp;mobile=false</a></li>
</ul>
<p>We wanted to do more. Here are the three things we learned along the way:</p>
<ol start="1">
<li><strong>Progressive rendering via multiple flushes</strong> is a great way to increase the <a title="Perceived performance" href="http://en.wikipedia.org/wiki/Perceived_performance">perceived performance</a> of a page because the user starts seeing content at the earliest possible time. Using just two chunks &#8212; one comprising all content <a title="Above the fold" href="http://en.wikipedia.org/wiki/Above_the_fold">above the fold</a>, and the second everything below the fold &#8212; can give you great improvements. Our data indicates that 70% of our users are viewing our pages with a screen resolution of 1024&#215;768, so we can divide the page as shown below (click on images to see enlarged versions).</li>
</ol>
<p><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/09/AF.png" target="_blank"><img class="alignleft" style="margin-top: 20px" src="http://www.ebaytechblog.com/wp-content/uploads/2011/09/AF.png" alt="Above Fold Content" width="290" height="260" /></a><a href="http://www.ebaytechblog.com/wp-content/uploads/2011/09/BF.png" target="_blank"><img class="alignleft" style="margin-top: 20px" src="http://www.ebaytechblog.com/wp-content/uploads/2011/09/BF.png" alt="Below Fold Content" width="290" height="260" /></a></p>
<p style="padding-left: 23px">This approach makes it possible to do very specific site speed optimizations. The first chunk above the fold needs to be optimized for fast server-side processing. By adding a smart server-side caching mechanism, the number of backend HTTP calls needed to retrieve all the data for the above-fold content can be reduced significantly. In contrast, all JavaScript loading, parsing, and binding can be moved below the fold. As a result, a user can already see page content while below-the-fold content and JavaScript behavioral characteristics are still being delivered.</p>
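<p style="padding-left: 23px">As a rough illustration of the two-chunk idea, here is a minimal Python WSGI sketch. It is not the Motors implementation: the render functions and their placeholder content are hypothetical, and whether each chunk actually reaches the browser early depends on the server and on any buffering proxies in between.</p>

```python
def render_above_fold():
    # Hypothetical placeholder for the fast, cache-backed first chunk.
    return b"above-the-fold markup\n"

def render_below_fold():
    # Hypothetical placeholder for the remaining markup and script loading.
    return b"below-the-fold markup and scripts\n"

def two_chunk_app(environ, start_response):
    # WSGI application that hands the server the page in two pieces.
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    # First chunk: everything above the fold, produced as early as possible.
    yield render_above_fold()
    # Second chunk: everything else, produced while the user already sees content.
    yield render_below_fold()
```

<p style="padding-left: 23px">The design choice is simply to yield the page body in pieces rather than returning it as one string, so the server has the opportunity to send the first piece before the second is ready.</p>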
<ol start="2">
<li><strong>Data URI for Static Images</strong>: Reduce the number of HTTP requests for static images (both <a title="CSS Sprites" href="http://www.alistapart.com/articles/sprites/">sprited</a> and individual) by leveraging the <a title="Image Data URI Scheme" href="http://en.wikipedia.org/wiki/Data_URI_scheme">Data URI scheme</a>. The new Motors homepage includes all static images as base64-encoded data URIs in CSS instead of as external URLs. (For IE7 and other browsers that don’t support Data URI, you need to have a CSS fallback to include external URLs.) Because CSS files are cached in the browser, you also get the benefit of image caching &#8211; similar to using external image URLs &#8211; in addition to reduced HTTP calls.</li>
</ol>
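<p style="padding-left: 23px">Generating such a CSS rule can be sketched in a few lines of Python. The function names, the selector, and the sample bytes below are illustrative assumptions, not part of our build tooling.</p>

```python
import base64

def to_data_uri(image_bytes, mime_type="image/png"):
    # Encode raw image bytes as a base64 data URI.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

def background_rule(selector, image_bytes, mime_type="image/png"):
    # Build a CSS rule that embeds the image instead of linking to it.
    uri = to_data_uri(image_bytes, mime_type)
    return f"{selector} {{ background-image: url({uri}); }}"

# Stand-in bytes; a real build step would read them from an image file.
rule = background_rule(".logo", b"abc")
print(rule)
```

<p style="padding-left: 23px">Base64 output is roughly a third larger than the raw bytes, so this trades some extra CSS size for fewer HTTP requests.</p>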
<ol start="3">
<li><strong><a title="CSS Lint" href="http://csslint.net/">CSS Lint</a></strong> is an open-source tool similar to <a title="JavaScript Lint" href="http://www.jslint.com/">JS Lint</a> used to identify CSS coding problems. From the wide variety of configurable <a title="CSS Lint Rules" href="http://csslint.net/about.html#docs">rules</a> that CSS Lint offers, you can pick a subset based on your application&#8217;s needs. Due to the various options available in the <a title="CSS Lint Source" href="https://github.com/stubbornella/csslint/tree/master/release">CSS Lint source</a>, integration with your <a title="Continuous Integration" href="http://en.wikipedia.org/wiki/Continuous_integration">CI</a>  build system should be pretty straightforward. Look for improvement opportunities in the following areas:
<ul>
<li>removing unused and redundant code</li>
<li>using efficient CSS selectors</li>
<li>writing reusable and maintainable code</li>
</ul>
<p>After running CSS Lint on our code base and fixing the reported problems, we improved page rendering by about 200 milliseconds &#8212; a big win for us! And now that CSS Lint is a part of our build system, we are sure to catch issues before they become a problem in production. We highly recommend this tool for CSS-heavy websites.</p>
</li>
</ol>
<p><strong>What&#8217;s next?</strong> With all the performance optimizations, we were able to achieve our goal in our first release. The new page is indeed twice as fast as the old one! We are now testing ideas for the next iteration. One promising approach is displaying images using <em>div</em> tags with the background-image style, rather than using <em>img</em> tags with the src attribute. After <a title="background-img vs img src" href="http://code.flickr.com/blog/2011/07/20/lessons-learned-from-the-flickr-touch-lightbox/">reading about this technique</a> we experimented with it in one of our image-heavy projects, and the results were positive. Some tests showed page speed improved by as much as 23%; however, we didn&#8217;t see a difference for pages with few images. We are going to test the technique on our image-heavy Motors homepage in one of our upcoming releases.</p>
<p>Go ahead, try out the <a title="New eBay Motors home page" href="http://www.ebay.com/mtr">new eBay Motors homepage</a>! We would love to get your feedback.</p>
<p><a title="Twitter Handle" href="http://twitter.com/#%21/senthil_hi">Senthil Padmanabhan</a><br /> Engineering Lead &amp; Site Speed Evangelist</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ebaytechblog.com/2011/10/14/the-new-ebay-motors-homepage-is-2x-faster/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

