eBay Tech Blog

In the era of cloud and XaaS (everything as a service), REST/SOAP-based web services have become ubiquitous within eBay’s platform. We dynamically monitor and manage a large and rapidly growing number of web servers deployed on our infrastructure and systems. However, existing tools present major challenges when making REST/SOAP calls with server-specific requests to a large number of web servers, and then performing aggregated analysis on the responses.

We therefore developed REST Commander, a parallel asynchronous HTTP client as a service to monitor and manage web servers. REST Commander on a single server can send requests to thousands of servers with response aggregation in a matter of seconds. And yes, it is open-sourced at http://www.restcommander.com.

Feature highlights

REST Commander is Postman at scale: a fast, parallel asynchronous HTTP client as a service with response aggregation and string extraction based on generic regular expressions. Built in Java with Akka, Async HTTP Client, and the Play Framework, REST Commander is packed with features beyond speed and scalability:

  • Click-to-run with zero installation
  • Generic HTTP request template supporting variable-based replacement for sending server-specific requests
  • Ability to send the same request to different servers, different requests to different servers, and different requests to the same server
  • Maximum concurrency control (throttling) to accommodate server capacity

Commander itself is also “as a service”: with its powerful REST API, you can define ad-hoc target servers, an HTTP request template, variable replacement, and a regular expression all in a single call. In addition, intuitive step-by-step wizards help you achieve the same functionality through a GUI.

Usage at eBay

With REST Commander, we have enabled cost-effective monitoring and management automation for tens of thousands of web servers in production, boosting operational efficiency by at least 500%. We use REST Commander for large-scale web server updates, software deployment, config pushes, and discovery of outliers. All can be executed by both on-demand self-service wizards/APIs and scheduled auto-remediation. With a single instance of REST Commander, we can push server-specific topology configurations to 10,000 web servers within a minute (see the note about performance below). Thanks to its request template with support for target-aware variable replacement, REST Commander can also perform pool-level software deployment (e.g., deploy version 2.0 to QA pools and 1.0 to production pools).

Basic workflow

Figure 1 presents the basic REST Commander workflow. Given target servers as a “node group” and an HTTP command as the REST/SOAP API to hit, REST Commander sends the requests to the node group in parallel. The response and request for each server become a pair that is saved into an in-memory hash map. This hash map is also dumped to disk, with the timestamp, as a JSON file. From the request/response pair for each server, a regular expression is used to extract any substring from the response content.

Figure 1. REST Commander Workflow.
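To make the aggregation and extraction step concrete, here is a minimal plain-Java sketch of the idea; it is not REST Commander’s actual code, and the response bodies and regular expression are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResponseAggregationSketch {
    public static void main(String[] args) {
        // server -> response body, filled in by the parallel HTTP workers.
        Map<String, String> responses = new ConcurrentHashMap<>();
        responses.put("host1.example.com", "{\"version\":\"2.0.1\",\"status\":\"OK\"}");
        responses.put("host2.example.com", "{\"version\":\"1.9.7\",\"status\":\"OK\"}");

        // A generic, user-supplied regular expression; this one pulls out the version string.
        Pattern pattern = Pattern.compile("\"version\":\"([^\"]+)\"");

        // Apply the expression to each request/response pair and collect the extractions.
        Map<String, String> extracted = new HashMap<>();
        for (Map.Entry<String, String> entry : responses.entrySet()) {
            Matcher m = pattern.matcher(entry.getValue());
            extracted.put(entry.getKey(), m.find() ? m.group(1) : "NO_MATCH");
        }
        System.out.println(extracted); // {host1.example.com=2.0.1, host2.example.com=1.9.7}
    }
}
```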

Concurrency and throttling model with Akka

REST Commander leverages Akka and the actor model to simplify the concurrent workflows for high performance and scalability. First of all, Akka provides built-in thread pools and encapsulated low-level implementation details, so that we can fully focus on task-level development rather than on thread-level programming. Secondly, Akka provides a simple analogy of actors and messages to explain functional programming, eliminating global state, shared variables, and locks. When you need multiple threads/jobs to update the same field, simply send these results as messages to a single actor and let the actor handle the task.
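As a minimal illustration of that single-writer-actor idea, here is a sketch using Akka’s Java actor API; it is not REST Commander’s code, and the message type and names are invented for the example.

```java
import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;
import java.util.HashMap;
import java.util.Map;

public class AggregatorSketch extends AbstractActor {
    // One message type: a (server, response) pair produced by some worker.
    public static final class Result {
        final String server;
        final String response;
        Result(String server, String response) {
            this.server = server;
            this.response = response;
        }
    }

    // Only this actor ever touches the map, so no locks or shared mutable state are needed.
    private final Map<String, String> responses = new HashMap<>();

    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(Result.class, r -> responses.put(r.server, r.response))
                .build();
    }

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("demo");
        ActorRef aggregator = system.actorOf(Props.create(AggregatorSketch.class), "aggregator");
        // Many concurrent workers can safely "update the same field" by sending messages.
        aggregator.tell(new Result("host1.example.com", "200 OK"), ActorRef.noSender());
        aggregator.tell(new Result("host2.example.com", "503 Service Unavailable"), ActorRef.noSender());
    }
}
```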

Figure 2 is a simplified illustration of the concurrent HTTP request and response workflow with throttling in Akka. Throttling (concurrency control) sets the maximum number of requests that REST Commander will have in flight at once. For example, if the throttling value is 100, REST Commander will not send the nth request until it has received the (n-100)th response; the 500th request is not sent until the response to the 400th request has come back.

Figure 2. Concurrency Design with Throttling in Akka (see code)

Suppose one uniform GET /index.html HTTP request is to be sent to 10,000 target servers. The process starts with the Director, whose job is to kick off the requests. The Director is not an Akka actor, but rather a Java object that initializes the actor system and the whole job. It creates an actor called the Manager and passes it the 10,000 server names and the HTTP call. When the Manager receives the data, it creates one Assistant Manager and 10,000 Operation Workers, embedding in each Operation Worker a task consisting of a server name and the “GET /index.html” request. The Manager does not send the “go ahead” message that triggers task execution on the workers. Instead, the Assistant Manager is responsible for that part, exercising throttling control by asking only some of the workers to execute their tasks at any given time.

To better decouple the code by functionality, the Manager is only in charge of receiving responses from the workers, while the Assistant Manager is responsible for sending the “go ahead” messages that trigger workers to work. The Manager initially tells the Assistant Manager to send out the throttling number of “go ahead” messages; we’ll use 1500, the default throttling value, for this example. The Assistant Manager starts by sending a “go ahead” message to each of 1500 workers. To control throttling, the Assistant Manager maintains a sliding window of [response_received_count, request_sent_count]. The request_sent_count is the number of “go ahead” messages the Assistant Manager has sent to workers. The response_received_count comes from the Manager: whenever the Manager receives a response, it communicates the updated count to the Assistant Manager. Every half-second, the Assistant Manager sends itself a message to trigger a check of response_received_count against request_sent_count, to determine whether the sliding window has room for additional requests. If so, the Assistant Manager keeps sending “go ahead” messages until the window size (request_sent_count minus response_received_count) again reaches the throttling number (1500).
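The plain-Java sketch below mirrors that sliding-window arithmetic outside of Akka, for brevity; the 500 ms tick, the 1500 default throttle, and the two counters come from the description above, while everything else is invented for illustration.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SlidingWindowThrottleSketch {
    private static final int THROTTLE = 1500;        // default maximum concurrency
    private static final int TOTAL_TARGETS = 10000;  // number of target servers

    private final AtomicInteger requestsSent = new AtomicInteger();      // "go ahead" messages sent
    private final AtomicInteger responsesReceived = new AtomicInteger(); // updated as responses arrive

    // Called every half-second, standing in for the Assistant Manager's self-sent tick message.
    void sendMoreIfWindowHasRoom() {
        while (requestsSent.get() < TOTAL_TARGETS
                && requestsSent.get() - responsesReceived.get() < THROTTLE) {
            int next = requestsSent.incrementAndGet();
            System.out.println("go ahead, worker #" + next); // stand-in for telling a worker to fire its request
        }
    }

    // Stand-in for the Manager reporting a received response.
    void onResponseReceived() {
        responsesReceived.incrementAndGet();
    }

    public static void main(String[] args) {
        SlidingWindowThrottleSketch throttle = new SlidingWindowThrottleSketch();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(throttle::sendMoreIfWindowHasRoom, 0, 500, TimeUnit.MILLISECONDS);
        // In the real system, onResponseReceived() would be driven by the Manager as responses come back.
    }
}
```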

Each Operation Worker creates an HTTP Worker, which wraps Ning’s async HTTP client functions. When the Manager receives a response from an Operation Worker, it updates the response entry in the in-memory hash map for the associated server. If a worker fails to obtain the response or times out, it returns the exception details (e.g., a connection exception) to the Manager instead. When the Manager has received all of the responses, it returns the whole hash map to the Director. As the job completes successfully, the Director dumps the hash map to disk as a JSON file, then returns.
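For reference, this is roughly what an HTTP Worker’s call could look like with Ning’s async HTTP client (the com.ning.http.client library); it is a hedged sketch, not REST Commander’s actual worker code.

```java
import com.ning.http.client.AsyncCompletionHandler;
import com.ning.http.client.AsyncHttpClient;
import com.ning.http.client.Response;

public class HttpWorkerSketch {
    public static void main(String[] args) throws Exception {
        AsyncHttpClient client = new AsyncHttpClient();
        client.prepareGet("http://host1.example.com/index.html")
              .execute(new AsyncCompletionHandler<Response>() {
                  @Override
                  public Response onCompleted(Response response) {
                      // In REST Commander, this result would be sent as a message back to the Manager.
                      System.out.println("host1 -> HTTP " + response.getStatusCode());
                      return response;
                  }

                  @Override
                  public void onThrowable(Throwable t) {
                      // Failures and timeouts are reported back as exception details instead.
                      System.out.println("host1 -> " + t.getMessage());
                  }
              })
              .get(); // block only so this small demo does not exit before the callback fires
        client.close();
    }
}
```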

Beyond web server management – generic HTTP workflows

When modeling and abstracting today’s cloud operations and workflows (e.g., provisioning, file distribution, and software deployment), we find that most of them are similar: each step is some form of HTTP call with certain responses, which in turn trigger various operations in the next step. Using the example of monitoring cluster server health, the workflow goes like this (a sketch of the pattern follows the list):

  1. A single HTTP call to query data storage (such as database as a service) and retrieve the host names and health records of the target servers (1 call to 1 server)
  2. Massive uniform HTTP calls to check the current health of target servers (1 call to N servers); aggregating these N responses; and conducting simple analysis and extractions
  3. Data storage updates for those M servers with changed status (M calls to 1 server)
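Below is a compact plain-Java sketch of this three-step pattern; the endpoints are hypothetical and no REST Commander code is involved, it only illustrates the query, fan-out-and-aggregate, and update steps.

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Map;
import java.util.Scanner;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ClusterHealthWorkflowSketch {

    static String httpGet(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try (InputStream in = conn.getInputStream();
             Scanner s = new Scanner(in).useDelimiter("\\A")) {
            return s.hasNext() ? s.next() : "";
        }
    }

    public static void main(String[] args) throws Exception {
        // Step 1: one call to the data storage service for the target host list (hypothetical endpoint).
        List<String> hosts = List.of(httpGet("http://dbaas.example.com/hosts").split(","));

        // Step 2: N parallel health checks, aggregated into a single map.
        Map<String, String> health = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(100);
        for (String host : hosts) {
            pool.submit(() -> {
                try {
                    health.put(host, httpGet("http://" + host + "/admin/health"));
                } catch (Exception e) {
                    health.put(host, "ERROR: " + e.getMessage());
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(60, TimeUnit.SECONDS);

        // Step 3: M updates back to the data store, only for the hosts whose status changed.
        for (Map.Entry<String, String> entry : health.entrySet()) {
            if (entry.getValue().startsWith("ERROR")) {
                httpGet("http://dbaas.example.com/update?host=" + entry.getKey() + "&status=DOWN");
            }
        }
    }
}
```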

REST Commander flawlessly supports such use cases with its generic and powerful request models. It is therefore used to automate many tasks involving interactions and workflows (orchestrations) with DBaaS, LBaaS (load balancer as a service), IaaS, and PaaS.

Related work review

Of course, HTTP is a fundamental protocol to the World Wide Web, SOAP/REST-based web services, cloud computing, and many distributed systems. Efficient HTTP/REST/SOAP clients are thus critical in today’s platform and infrastructure services. Although many tools have been developed in this area, we are not aware of any existing HTTP client tools or libraries that combine the following three features:

  • High efficiency and scalability with built-in throttling control for parallel requests
  • Generic response aggregation and analysis
  • Generic (i.e., template-based) heterogeneous request generation to the same or different target servers

Postman is a popular and user-friendly REST client tool; however, it does not support efficient parallel requests or response aggregation. Apache JMeter, ApacheBench (ab), and Gatling can send parallel HTTP requests with concurrency control. However, they are designed for load/stress testing on a single target server rather than on multiple servers. They do not support generating different requests to different servers. ApacheBench and JMeter cannot conduct response aggregation or analysis, while Gatling focuses on response verification of each simulation step.

ql.io is a great Node.js-based aggregation gateway for quickly consuming HTTP APIs. However, having a different design goal, it does not offer throttling or generic response extraction (e.g., regular expressions). Also, its own language, table construction, and join query result in a higher learning curve. Furthermore, single-threaded Node.js might not effectively leverage multiple CPU cores unless running multiple instances and splitting traffic between them. 

Typhoeus is a wrapper on libcurl for parallel HTTP requests with throttling. However, it does not offer response aggregation, and more critically, its synchronous HTTP library limits scalability. Writing a simple shell script with “for” loops around “curl” or “wget” can send multiple HTTP requests, but the process is sequential and not scalable.

Ning’s Async-http-client library in Java provides high-performance, asynchronous request and response capabilities compared to the synchronous Apache HTTPClient library. A similar library in Scala is Stackmob’s (PayPal’s) Newman HTTP client with additional response caching and (de)serialization capabilities. However, these HTTP clients are designed as raw libraries without features such as parallel requests with templates, throttling, response aggregation, or analysis.

Performance note

Actual REST Commander performance varies based on network speed, the slowest servers, and Commander throttling and time-out settings. In our testing with single-instance REST Commander, for 10,000 servers across regions, 99.8% of responses were received within 33 seconds, and 100% within 48 seconds. For 20,000 servers, 100% of responses were received within 70 seconds. For a smaller scale of 1,000 servers, 100% of responses were received within 7 seconds.

Conclusion and future work

“Speaking HTTP at scale” is instrumental in today’s XaaS (everything as a service) platforms. Each step in the solution to many of our problems can be abstracted and modeled as parallel HTTP requests (to a single server or to many), response aggregation with simple (if/else) logic, and extracted data that feeds into the next step. Taking scalability and agility to heart, we (Yuanteng (Jeff) Pei, Bin Yu, and Yang (Bruce) Li) designed and built REST Commander, a generic parallel async HTTP client as a service. We will continue to add more orchestration, clustering, security, and response analysis features to it. For more details and a video demo of REST Commander, please visit http://www.restcommander.com.

Yuanteng (Jeff) Pei

Cloud Engineering, eBay Inc.

References

  • Postman: http://www.getpostman.com
  • Akka: http://akka.io
  • Async HTTP Client: https://github.com/AsyncHttpClient/async-http-client
  • Play Framework: http://www.playframework.com
  • Apache JMeter: https://jmeter.apache.org
  • ApacheBench (ab): http://httpd.apache.org/docs/2.2/programs/ab.html
  • Gatling: http://gatling-tool.org
  • ql.io: http://ql.io
  • Typhoeus: https://github.com/typhoeus/typhoeus
  • Apache HttpClient: http://hc.apache.org/httpclient-3.x
  • Stackmob’s Newman: https://github.com/stackmob/newman


Yet Another Responsive vs. Adaptive Story

by Senthil Padmanabhan on 03/05/2014

in Software Engineering

Yes, like everyone else in web development, eBay has become immersed in the mystical world of Responsive Web Design. In fact, our top priority for last year was to make key eBay pages ready for multi-screen. Engineers across the organization started brainstorming ideas and coming up with variations on implementing a multi-screen experience. We even organized a “Responsive vs. Adaptive Design” debate meetup to discuss the pros and cons of various techniques. This post summarizes some of the learnings in our multi-screen journey.

There is no one-size-fits-all solution

This is probably one of the most talked-about points in the responsive world, and we want to reiterate it. Every web page is different, and every use case is different. A solution that works for one page might not work for another – in fact, sometimes it even backfires. Considering this, we put together some general guidelines for building a page. For read-only pages or web pages where users only consume information, a purely responsive design (layout controlled by CSS) would suffice. For highly interactive pages or single-page applications, an adaptive design (different views dependent on the device type) might be the right choice. But for most cases, the RESS (Responsive Design + Server Side Components) approach would be the ideal solution. With RESS we get the best of both worlds, along with easier code maintenance and enhancements. Here the server plays a smart role by not switching to a completely new template or wireframe per device; instead, the server helps deliver the best experience by choosing the right modules and providing hints to the client.

User interaction is as important as screen size

Knowing how a user interacts with the device (keyboard, mouse, touch, pointer, TV remote, etc.) is crucial to delivering the optimal experience. Screen size is required for deciding the layout, but is not in itself sufficient. This point resonates with the previous point about RESS: the server plays a role. The first hint that our servers provide to the browser is an interaction type class (touch, no-touch, pointer, etc.) added to the root HTML or module element. This class helps CSS and JavaScript to enhance features accordingly. For instance, the CSS :hover pseudo-selector is applied only to elements that have a no-touch class as an ancestor; and in JavaScript, certain events are attached only when the touch class is present. In addition to providing hints, the server can include or exclude modules and JavaScript plugins (e.g., Fastclick) based on interaction type.

Keeping the importance of user interaction in mind, we created a lightweight jQuery Plugin called tactile just to handle gesture-based events:  tap, drag (including dragStart, dragEnd), and swipe. Instead of downloading an entire touch library, we felt that tactile was sufficient for our use cases. By including this plugin for all touch-based devices, we enhance the user interaction to a whole new level, bringing in a native feel. These results would not be possible in a purely responsive design.

Understanding the viewport is essential

At a glance the term ‘viewport’ sounds simple, referring to the section of the page that is in view. But when you dig a little deeper, you will realize that the old idiom ‘The devil is in the detail’ is indeed true. For starters, the viewport itself can have three different perspectives:  visual viewport, layout viewport, and ideal viewport. And just adding the default viewport meta tag <meta name="viewport" content="width=device-width, initial-scale=1"/> alone may not always be sufficient (for example, in a mobile-optimized web app like m.ebay.com, the user-scalable=no property should also be used). In order to deliver the right experience, a deeper understanding of the viewport is needed. Hence before implementing a page, our engineers revisit the concept of viewport and make sure they’re taking the right approach.

To get a good understanding of viewports, see these documents in order: the introduction, then viewport 1, viewport 2, and viewport 3.

Responsive components vs. responsive pages is a work in progress

Another idea that has been floating around is to build components that are responsive, instead of page layouts that are responsive. However, until element query becomes a reality, there is no clear technical solution for this. So for now, we have settled on two options:

  • The first option is to use media queries at a component level, meaning each component will have its own media queries. When included in a page, a component responds to the browser’s width and optimizes itself (based on touch/no-touch) to the current viewport and device. This approach, though, has a caveat:  It will fail if the component container has a restricted width, since media queries work only at a page level and not at a container level.
  • The second approach was suggested by some engineers in the eBay London office, where they came up with the idea of components always being 100% in width and all their children being sized in percentages. The components are agnostic of the container size; when dropped into a page, they just fit into whatever the container size is. A detailed blog about this technique can be found here.

We try to implement our components using one of the above approaches.  But the ultimate goal is to abstract the multi-screen factor from the page to the component itself. 

We can at least remove the annoyance

Finally, even if we are not able to provide the best-in-class experience on a device, at minimum we do not want to annoy our users. This means following a set of dos and don’ts.

Dos

  • Always include the viewport meta tag <meta name="viewport" content="width=device-width, initial-scale=1"/>
  • Add the interaction type class (touch, no-touch, etc.) to the root HTML or module element
  • Work closely with design to get an answer on how the page looks across various devices before even starting the project

Don’ts

  • Tiny click area, less than 40px
  • Hover state functionality on touch devices
  • Tightly cluttered design
  • Media queries based on orientation due to this issue

This post provides a quick summary of the direction in which eBay is heading to tackle ever-increasing device diversity. There is no silver bullet yet, but we are getting there.

Senthil
Engineer @ eBay


When I started writing this blog post, my original goal was to provide (as alluded to in the title) some insights into my first year as a presentation engineer at eBay – such as my day-to-day role, some of the things we build here, and how we build them. However, before I can do that, I feel I first need to step back and talk about the renaissance. “The renaissance!?”, I hear you say.

The web dev renaissance

Unless you’ve been living under a rock for the past few years, you can’t help but have noticed a renaissance of sorts in the world of web development – propelled in large part, of course, by Node.js, NPM, GitHub, and PaaS, all of which are enabling and empowering developers like never before. Combined with the rapid innovations in the browser space – and within HTML, CSS, and JavaScript – what you have is an incredibly exciting and fun time to be a web developer! And I’m glad to say that the renaissance is truly alive and well here at eBay!

Node.js

Of course the darling of this renaissance is Node.js. Discovering JavaScript on the server for me was just as exciting and liberating as the day I discovered it in the browser – and I’m sure many, many others will share that sentiment. Spinning up an HTTP server in the blink of an eye with just a few lines of JavaScript is simply audacious, and to this day it still makes me grin with delight – especially when I think of all the hours I’ve wasted in my life waiting for the likes of Apache or IIS! But it’s not just the speed and simplicity that enthralls; it’s also the feeling of utmost transparency and control.

CubeJS

But I digress. I hear you say, “What does this so-called renaissance have to do with eBay?” and “Isn’t eBay just a tired, old Java shop?” That might have been true in the past. But these days, in addition to an excellent new Java stack (we call it Raptor and, as the name correctly implies, it is anything but tired!), we now also have our very own Node.js stack (we call it CubeJS), which already powers several of our existing sites and applications. Yes, the wait is over; Node.js in the enterprise is finally a reality for developers. Since joining eBay in the spring of 2013, I have barely touched a line of Java or JSP code.

JavaScript everywhere

Why is this a big deal? Well, a common pattern for us web developers is that, more often than not, changing jobs also means changing server-side languages. Over the years I’ve used Perl/CGI, ASP Classic, JSP, ColdFusion/CFML, PHP, and ASP.NET. Now, as much as I do enjoy learning new skills (except the circus trapeze – that was ill-advised), I’d be stretching the truth if I said I knew all of those languages and their various intricacies inside out. Most of the time I will learn what I need to learn, but rarely do I feel the need or desire to specialize. It would be fair to say I wasn’t always getting the best out of the language, and the language wasn’t always getting the best out of me. Really, deep down, I wanted to be using JavaScript everywhere. And now, of course, that pipe dream has come true.

Polyglotism

Adoption of Node.js is a win-win for eBay as we seek both to embrace the flourishing worldwide community of JavaScript developers like myself and to leverage our excellent open-source ecosystem. Node.js might be only the beginning; as eBay further adopts and advocates for such polyglotism, we increasingly welcome developers from different tribes – Python, PHP, Ruby on Rails, and beyond – and eagerly anticipate the day they become integrated with our PaaS (Platform as a Service). You see, it’s all about choice and removing barriers, which empowers our developers to delight our users.

Vive la Renaissance

In this post I’ve mainly focused my attention on Node.js but, as mentioned, the renaissance at eBay doesn’t stop there. We also embrace NPM. We embrace GitHub. We embrace PaaS. We embrace modern principles, tools, and workflows (Modular JavaScript, Grunt, JSHint, Mocha, LESS, and Jenkins – to name but a few!). Yes, we embrace open source – and it’s not all take, take, take either; be sure to check out KrakenJS (a web application framework built on Node.js by our good friends over at PayPal), RaptorJS (eBay’s end-to-end JavaScript toolkit for building adaptive modules and UI components), and Skin (CSS modules designed by eBay to build an online store). And be sure to keep your eyes open for more contributions from us in the near future!

—-

Do you share our passion for JavaScript, Node.js, and the crazy, fast-paced world of front-end web development? Interested in finding out more about joining our presentation team? Please visit http://www.ebaycareers.com for current openings.


Deployment to the cloud is an evolving area. While many tools are available that deploy applications to nodes (machines) in the cloud, zero deployment downtime is rare or nonexistent. In this post, we’ll take a look at this problem and propose a solution. The focus of this post is on web applications—specifically, the server-side applications that run on a port (or a shared resource).

In traditional deployment environments, when switching a node in the cloud from the current version to a new version, there is a window of time when the node is unusable in terms of serving traffic. During that window, the node is taken out of traffic, and after the switch it is brought back into traffic.

In a production environment, this downtime is not trivial. Capacity planning in advance usually accommodates the loss of nodes by adding a few more machines. However, the problem becomes magnified where principles like continuous delivery and deployment are adopted.

To provide effective and non-disruptive deployment and rollback, a Platform as a Service (PaaS) should possess these two characteristics:

  • Best utilization of resources to minimize deployment downtime
  • Instantaneous deployment and rollback

Problem analysis

Suppose we have a node running Version1 and we are deploying Version2 to that node. This is how the lifecycle would look:

Figure: typical deployment workflow

Every machine in the pool undergoes this lifecycle. The machine stops serving traffic right after the first step and cannot resume serving traffic until the very last step. During this time, the node is effectively offline.

At eBay, the deployment lifecycle takes about 9 minutes for a reasonably sized application. Across a pool of, say, 1,000 nodes, that is roughly 150 node-hours of lost serving capacity per full deployment; for an organization of any size, many days of availability can be lost if every node must go into an offline phase during deployment.

So, the more we minimize the off-traffic time, the closer we get to instant/zero-downtime deployment/rollback.

Proposed solutions

Now let’s look into a few options for achieving this goal.

A/B switch

In this approach, we have a set of nodes standing by. We deploy the new version to those nodes and switch the traffic to them instantly. If we keep the old nodes in their original state, we could do instant rollback as well. A load balancer fronts the application and is responsible for this switch upon request.

The disadvantage of this approach is that some nodes sit idle, and unless you have true elasticity, it amplifies node wastage. When many deployments are occurring at the same time, you may end up needing to double the capacity to handle the load.

Software load balancers

In this approach, we configure the software load balancer fronting the application with more than one end point so that it can effectively route the traffic to one or another. This solution is elegant and offers much more control at the software level. However, applications will have to be designed with this approach in mind. In particular, the load balancer’s contract with the application will be very critical to successful implementation.

From a resource standpoint, this approach and the previous one are similar; both use additional resources, such as memory and CPU. The difference is that the first approach needs whole standby nodes, whereas this one is accommodated inside the same node.

Zero downtime

With this approach, we don’t keep a set of machines; rather, we delay the port binding. Shared resource acquisition is delayed until the application starts up. The ports are switched after the application starts, and the old version is also kept running (without an access point) to roll back instantly if needed.

Similar solutions exist already for common servers.

Parallel deployment – Apache Tomcat

Apache Tomcat added the parallel deployment feature in its version 7 release. It lets two versions of an application run at the same time, taking the latest version as the default, and achieves this capability through its context container. The versioning is simple and straightforward: append ‘##’ and a version number to the war name. For example, webapp##1.war and webapp##2.war can coexist within the same context; and to roll back to webapp##1, all that is required is to delete webapp##2.

Although this feature might appear to be a trivial solution, apps need to take special care with shared files, caches (as much write-through as possible), and lower-layer socket usage.

Delayed port binding

This solution is not available in web servers currently. A typical server first binds to the port, then starts its services. Tomcat lets you delay binding to some extent through the connector’s bindOnInit setting, but the binding still occurs once the connector is started.

What we propose here is the ability to start the server without binding the port and essentially without starting the connector. Later, a separate command will start and bind the connector. Version 2 of the software can be deployed while version 1 is running and already bound. When version 2 is started later, we can unbind version 1 and bind version 2. With this approach, the node is effectively offline only for a few seconds.
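A bare-bones sketch of the idea with a plain ServerSocket follows; it is not a real web server integration, and a line on stdin stands in for the separate “start the connector” command.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.util.Scanner;

public class DelayedBindServerSketch {
    public static void main(String[] args) throws IOException {
        // 1. Start the application: load config, warm caches, initialize services, and so on.
        System.out.println("Version 2 started, not yet serving traffic");

        // 2. Create the server socket but do NOT bind it yet.
        ServerSocket server = new ServerSocket();

        // 3. Wait for the external "bind now" trigger; version 1 keeps serving traffic meanwhile.
        new Scanner(System.in).nextLine();

        // 4. Bind and start accepting connections (version 1 would be unbound just before this).
        server.bind(new InetSocketAddress(8080));
        System.out.println("Bound to port 8080, now serving traffic");
        while (true) {
            server.accept().close(); // placeholder request handling
        }
    }
}
```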

The lifecycle for delayed port binding would look like this:

Figure: delayed port binding workflow

However, there is still a few-second glitch, so we will look at the next solution.

Advanced port binding

Now that we have minimized the window of unavailability to a few seconds, we will see if we can reduce it to zero. The only way to do that would be to bring version 2 up before version 1 goes down. But first:

Breaking the myth:  ‘Address already in use’

If you’ve used a server to run an application, I am sure you’ve seen this exception at least once. Let’s consider this scenario: We start the server and bind to the port. If we try to start another instance (or another server with the same port), the process fails with the error ‘Address already in use’. We kill the old server and start it again, and it works.

But have you ever given a thought as to why we cannot have two processes listening to the same port? What could be preventing it? The answer is “nothing”! It is indeed possible to have two processes listening to the same port.

SO_REUSEPORT

The reason we see this error in typical environments is because most servers bind with the SO_REUSEPORT option off. This option lets two (or more) processes bind to the same port, provided the application that bound the first process had this option set while binding. If this option is off, the OS interprets the setting to mean that the port is not to be shared, and it blocks subsequent processes from binding to that port.

The SO_REUSEPORT option also provides fair distribution of requests, which matters because thread-based distribution suffers from bottlenecks on multi-core machines. Both of the usual threading approaches (one thread listening and then dispatching, as well as multiple threads listening) suffer from under- or over-utilization of cycles. An additional advantage of SO_REUSEPORT is that it takes care of sending datagrams from the same client to the same server process. However, it has a shortcoming: packets might be dropped if new processes are added or removed on the fly. This shortcoming is being addressed.

You can find a good article about SO_REUSEPORT at this link on LWN.net. If you want to try this out yourself, see this post on the Free-Programmer’s Blog.
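As noted later in this post, the option was not exposed in Java when this was written; JDK 9 and later make it available through StandardSocketOptions.SO_REUSEPORT. The following is a minimal sketch under that assumption, on an operating system that supports the option; starting the program twice binds two processes to the same port.

```java
import java.net.InetSocketAddress;
import java.net.StandardSocketOptions;
import java.nio.channels.ServerSocketChannel;

public class ReusePortSketch {
    public static void main(String[] args) throws Exception {
        ServerSocketChannel channel = ServerSocketChannel.open();
        channel.setOption(StandardSocketOptions.SO_REUSEPORT, true); // must be set before bind()
        channel.bind(new InetSocketAddress(8080));
        System.out.println("Process " + ProcessHandle.current().pid() + " listening on port 8080");
        while (true) {
            channel.accept().close(); // placeholder: the kernel spreads incoming connections across processes
        }
    }
}
```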

The SO_REUSEPORT option addresses two issues:

  • The small glitch between the application version switching:  The node can serve traffic all the time, effectively giving us zero downtime.
  • Improved scheduling:  Data indicates (see this article on LWN.net) that thread scheduling is not fair; the ratio between the busiest thread and the one with the fewest connections is 3:1.

Figure: zero downtime workflow

Please note that SO_REUSEPORT is not the same as SO_REUSEADDR, and that at the time of writing it is not available in Java, as not all operating systems support it.

Conclusion

Applications can successfully serve traffic during deployment, if we carefully design and manage those applications to do so. Combining both late binding and port reuse, we can effectively achieve zero downtime. And if we keep the standby process around, we will be able to do an instant rollback as well.


eBay is experiencing phenomenal growth in the transactional demands on our databases, in no small part due to our being at the forefront of mobile application development. To keep up with such trends, we continually assess the design of our schemas.

Schema design is a logical representation of the structures used to store the data that applications produce and consume.  Given that database resources are finite, execution times for transactions can vary wildly as those transactions compete for the resources they require. As a result, schema design is the most essential part of any application development life cycle. This blog post covers schema design for online transaction processing applications and recommends a specific approach.

Unfortunately, there is no predefined set of rules for designing databases in an efficient manner, but there can certainly be a defined design process that achieves that outcome. Such a design process includes, but is not limited to, the following activities:

  1. determining the purpose of the database
  2. gathering the information to be recorded
  3. dividing the  information items into major entities
  4. deciding what information needs to be stored
  5. setting up relationships between entities
  6. refining the design further

Historically, OLTP systems have relied on a systematic way of ensuring that a database structure is suitable for general-purpose querying and free of certain undesirable characteristics – insertion, update, and deletion anomalies – that could lead to loss of data integrity. A highly normalized database offers benefits such as minimized redundancy, freedom from undesired insertion, update, and deletion dependencies, data consistency within the database, and a much more flexible design.

But as they say, there are no free lunches. A normalized database exacts the price of inserting into multiple tables and reading by way of joining multiple tables. Normalization involves design decisions that are likely to cause reduced database performance. Schema design requires keeping in mind that when a query or transaction request is sent to the database, multiple factors are involved, such as CPU usage, memory usage, and input/output (I/O). Depending on the use case, a normalized database may require more CPU, memory, and I/O to process transactions and database queries than does a denormalized database.

Recent developments further compound schema design challenges. With increasing competition and technological advances such as mobile web applications, transactional workload on the database has increased exponentially. As the competitor is only a click away, an online application’s valuable users must be ensured consistently good performance via QoS and transaction prioritization. Schema design for such applications cannot be merely focused on normalization; performance and scalability are no less important.

For example, at eBay we tried a denormalized approach to improve our Core Cart Service’s DB access performance specifically for writes. We switched to using a BLOB-based cart database representation, combining 14 table updates into a single BLOB column. Here are the results:

  • The response time for an “add to cart” operation improved on average by 30%. And in use cases where this same call is made against a cart that already contains several items (>20), performance at the 95th percentile improved by 40%.
  • For the “create cart” operation, total DB call time for the worst case was improved by approximately 50% due to significant reduction in SQL counts.
  • Parallel transaction DB call times improved measurably for an average use case.

These results do not imply that denormalization is an unqualified blessing. There are costs to denormalization. Data redundancy is increased in a denormalized database. This redundancy can improve performance, but it also requires extra effort to keep track of related data. Application coding can create further complications, because the data is spread across various tables and may be more difficult to locate. In addition, referential integrity is more of a chore, because related data is divided among a number of tables.

There is a happy medium between normalization and denormalization, but finding it requires a thorough knowledge of the actual data and the specific business requirements. This happy medium is what I call “the medium approach.” Denormalizing a database is the process of taking the level of normalization down a notch or two. Remember, normalization can provide data integrity (the assurance of consistent and accurate data within a database), but at the same time it can slow performance because of its frequently occurring table join operations.

There is an old proverb:  “normalize until it hurts, denormalize until it works.” One has to land in the middle ground to get all of the goodies of these two different worlds.


The series of interconnected tents are buzzing with activity. Small groups in animated discussion huddle around laptops and monitors, while some people are lost in private discovery as they interact with new apps or prototypes on their smart phones. Similar scenes repeat themselves from row to row throughout the space.

Sound like the typical software industry expo?  In this case, the venue is one of multiple showcases held on eBay campuses over the summer, and the presenters are not product vendors or seasoned exhibitors, but rather college interns demonstrating their work to peers, managers, and executives. With rare exception, the interns’ already-enthusiastic faces light up when asked if their summer at eBay had been a positive experience.

eBay’s global internship program brought more than 500 undergraduate, master’s, and PhD students to eBay campuses across the U. S. as well as to eBay Centers of Excellence in India and China. About 100 universities were represented in this year’s program. The vast majority of the interns are computer science, software engineering, applied science, or related majors. Their work covers the gamut of engineering challenges at eBay:  from unsupervised machine learning techniques and predictive forecasting models, to big data analysis and visualization; from personalization and localization, to new front-end features and development for multi-screen users; and from site security and fraud detection, to end-to-end testing and internal developer tools.

Lincoln J. Race, Computer Science graduate student at University of California at San Diego, says his internship “has been a tremendous learning experience for me, learning about how to work with a larger team to meet a project deadline.” His summer work has focused on big data. “I’ve loved working with the people around me to ‘GSD’, as David Marcus would say” (referring to the PayPal president’s shorthand for “get stuff done”), “and I can’t wait to continue doing that in the future.”

Oregon State University Computer Science major Marty Ulrich has been interning with eBay’s mobile core apps group in Portland. “The internship has so far exceeded my expectations,” he says. “I like that I’m given the freedom to work on the parts of the app that interest me. And if I have any questions, there are always knowledgeable people ready to help.” He adds, “My experience here this summer has made me want to work at eBay full time when I graduate.”

Ranjith Tellakula, a graduate student in Computer Science at San Jose State University, says he was excited when eBay offered him an internship that would combine his interests in data mining and back-end application development. Throughout the summer, he worked on developing internal metrics applications that identify gaps in the information available to eBay’s own engineers. Visual dashboards and export tools are now enabling support groups to prioritize and close those information gaps.

Like Marty, Ranjith says his internship has exceeded his expectations. “My work is actually going to make engineers’ lives better,” he says, “and I’ve gotten to learn new technologies all along the way. For example, I had no experience with shell scripting, but now shell scripts I created are running every day. And I’ve been amazed that I’m treated as a peer, even by people with years of experience.” His colleagues say he offered expertise in MongoDB that has benefited the entire team. Ranjith continues to work with eBay on a half-time internship while he completes his master’s degree.

Here is a sampling of other intern projects:

  • 3D structure estimation by augmenting a single 2D image with its depth metadata — Mohammed Haris Baig, Computer Science PhD candidate, Dartmouth
  • Write-once run-anywhere by-anyone integration testing – Greg Blaszczuk, Computer Science and Engineering undergraduate student, University of Illinois at Urbana-Champaign
  • Extraction algorithm to cluster eBay search engine output into groups that are meaningful to users – Chao Chang, Statistics PhD candidate, Washington University in St. Louis
  • Buyer recommendations for similar products that are more economic and environmentally friendly – University of California at Santa Cruz undergraduate students Trieste Devlin (Robotic Engineering), Navneet Kaur (Bioengineering), Anh Dung Phan (Bioengineering), and Alisa Prusa (Computer Science)
  • Examination of how we treat money obtained through programs like eBay Bucks differently from money we earn through normal means – Darrell Hoy, Computer Science PhD candidate, Northwestern University
  • Prototype for providing price guidance to eBay bidders—Isabella Li, Information Technology master’s student, Carnegie Mellon University
  • Use of intelligent caching in eBay API calls – Bharad Raghavan, Computer Science undergraduate student, Stanford University
  • Emulation and testing of various types of DDOS attack tools – Sree Varadharajan, Computer Science master’s student, University of Southern California
  • Dashboard framework enabling data-driven decisions without requiring coding – Jie Zha, Software Engineering master’s student, UC Irvine

Interns received onboarding orientations, goal-setting sessions with their managers, deliverables, and performance reviews, much like regular new-hires. In addition, they attended a three-day conference featuring presentations specifically tailored to the internship experience, including talks by eBay President and CEO John Donahoe and eBay Global Marketplaces President Devin Wenig.

Of course, interns had all of the fun experiences typical at a high-tech company (casino and bowling nights, sports leagues, various other competitions, barbeques, etc. etc.). But according to the interns’ feedback, what made them want to come back to eBay are the opportunities they saw for innovative research and product development, in a self-driven manner, using cutting-edge technology.

“Infusing young talent into the eBay culture is really the future of our company,” says eBay’s university engagement partner Jill Ripper. Adds her colleague Joy Osborne, “Providing access for our interns to connect with both fellow interns and the broader eBay Inc. family in a meaningful way was a key component to the success of our summer internship program.”

To learn more about eBay’s internship program, visit http://students.ebaycareers.com/jobs/interns.



Want to hear what some of the world’s most talented testers think about the changing face of software and software testing? Then come to the Conference of the Association for Software Testing (CAST), where you’ll also get a chance to talk with these testers and explore your own thoughts and ideas.

CAST is put together by testers, for testers (and those interested in testing). This year, it takes place in Madison, Wisconsin, August 26-28. The presenters are among the best practitioners from around the globe, and many attendees travel thousands of miles specifically for this conference. eBay will have a strong presence, including a keynote by Quality Engineering Director Jon Bach and a track presentation by Ilari Henrik Aegerter, manager of quality engineering in Europe.

Unlike many testing conferences, at CAST a third of each presentation is reserved for a “threaded” question-and-answer session, in which the audience uses colored cards to indicate new questions or questions related to the current thread. With this format, you can satisfy your curiosity, raise doubts, and make presenters defend their positions. That includes the keynote speakers. The conference also includes a test lab where you can get hands-on and try out ideas you might have heard about, see how other testers test, and share your own experience. You’ll find testers hanging out in the hallways having in-depth discussions and challenging each other to demonstrate their testing skills. Everything about the environment is designed to support testers who want to excel at their craft.

The theme for CAST 2013 is “Old Lessons applied and new lessons learned: advancing the practice and building a foundation for the future.”  The technology we work with changes at a rapid pace. Some testing practices stand the test of time; others become obsolete and irrelevant as the technology changes around them.

If this conference sounds like something you’d like to be a part of, then I urge you to register.

http://www.associationforsoftwaretesting.org/conference/cast-2013

I’d love to see you there.

- Ben Kelly, eBay EUQE team and a content chair for CAST 2013


As discussed in a previous post, earlier this year we unveiled the Digital Service Efficiency (DSE) methodology, our miles-per-gallon (MPG) equivalent for viewing the productivity and efficiency of our technical infrastructure across four key areas: performance, cost, environmental impact, and revenue. The goal of releasing DSE was to provide a transparent view of how the eBay engine is performing, as well as spark an industry dialogue between companies, big and small, about how they use their technical infrastructure and the impact it’s having on their business and the planet. In the past month, we’ve been excited to see – and participate in – the resulting dialogue and conversation.

When we shared DSE externally for the first time, we also set a number of performance goals for 2013 and committed to disclosing updates on our progress on a quarterly basis. Today, we’re pleased to share our first such quarterly update with year-over-year comparisons and analysis. This post provides highlights of our findings as well as links to the DSE dashboard and other details.

Figure: Q1 2013 DSE dashboard

Q1 metrics summary

Here’s where we stand on the progress toward 2013 performance and efficiency goals:

  • Our transactions per kWh increased 18% year over year. The growth of eBay Marketplaces and continued focus on driving efficiencies have contributed to this increase.
  • Our cost per transaction decreased by 23% in Q1 alone, already exceeding our initial goal.
  • Our carbon per transaction showed a net decrease of 7% for the quarter. As we’re still on track for our Utah Bloom fuel cell installation to go live this summer, we expect this number to continue to decrease and contribute significantly to our 10% carbon reduction goal for the year, even with our projected business growth.

Recognizing that our dynamic infrastructure changes each quarter, we’re confident that we’re on track for our 10% net gain across performance, cost, and environmental impact for the year.

Trends

We’ve seen a few other interesting trends as well:

The New eBay: Over the past year we’ve added numerous new features to our site. Last fall, we rolled out our feed technology, which makes the shopping experience on eBay.com more personal to each of our individual users. On the backend, with this curated site personalization, we’ve seen a jump in our transaction URLs as the eBay feed technology increases infrastructure utilization. This is one example of our customers’ site use driving a more productive work engine.

Continued Growth and Efficiency: As you can see on the dashboard, we had a significant spike in the number of physical servers powering eBay.com – a 37% increase year over year. The rise is in direct response to increased customer demand and personalization for our users. And while we added a significant number of new servers, we were able to limit our increase in power consumption to just 16% (2.69 MW). Compared to the first quarter of last year, we’ve reduced the power consumed by an average of 63 watts per server; this is a direct result of our efforts to run more energy-efficient servers and facilities.

Running a Cleaner Engine: As eBay aspires to be the leading global engine for greener commerce, we’re continually focused on integrating cleaner and renewable energy into our portfolio. In March, our Salt Lake City data center’s solar array came online and though it’s relatively small, it increased our owned clean energy powering eBay.com by 0.17%. Our active projects with fuel cells and renewable energy sourcing will continue to increase this value through the year.

Continuing to Refine Our Approach

When we first announced DSE, one of our top priorities was continued transparency into our metrics, calculations, and lessons learned. Along those lines, the greater transparency has also sparked internal conversations at eBay about how our server pools are categorized.

We organize each server into one of three categories: “buy,” “sell,” or “shared.” “Buy” and “sell” serve the pages and APIs directly for customer traffic, which count as our transactions (URLs); “shared” is the backend support equipment, which does not receive external traffic.

When we released the 2012 DSE numbers we reported 7.3 trillion URLs or transactions in the “buy” and “sell” groups. As we rolled up the Q1 2013 numbers, we found that some internally facing servers were grouped into “buy” and “sell.” We moved them to the appropriate “shared” group. While this did not change overall server counts or power consumption, it did decrease the 2012 transactions count coming from the external servers to 4.3 trillion. We also moved some server groups from “sell” to “buy” to be more consistent with our business model (unlike before, when “buy” and “sell” were relatively more equal). We felt it was important to stay strict with our methodology, and so we have retroactively updated the 2012 baseline numbers to ensure that our year-over-year results were standard and consistent.

Based on lessons learned in the Q1 2013 work, we’ve fine-tuned our methodology as follows:

  1. We’ll look only at those server pools that receive external web traffic in order to ensure that we’re accurately speaking to the year-over-year comparisons for “buy” and “sell” – all other server pools not receiving external web traffic will be considered “shared.”
  2. We’re now measuring Revenue per MW hour (as opposed to Revenue per MW), as this metric represents total consumption per quarter and year instead of quarterly averages. Likewise, we’ve decided to measure CO2e per MW hour instead of CO2e per MW.

Conclusion

With the Q1 2013 results under our belt, we’re happy with the progress we’ve made to fine-tune our technical infrastructure and the DSE metric. We still have more to do, though, and we’ll be sure to keep you updated along the way.

You can find the full Q1 2013 DSE results at http://dse.ebay.com, and click through the dashboard yourself to see the full cost, performance, and environmental impact of our customer “buy” and “sell” transactions.

Have a question or comment? Leave a note in the comments below or email dse@ebay.com and we’ll be sure to get back to you.

- Sri Shivananda, Vice President of Platform, Infrastructure and Engineering Systems, eBay Inc.


For the most part, eBay runs on a Java-based tech stack. Our entire workflow centers around Java and the JVM. Considering the scale of traffic and the stability required by a site like ebay.com, using a proven technology was an obvious choice. But we have always been open to new technologies, and Node.js has been topping the list of candidates for quite some time. This post highlights a few aspects of how we developed eBay’s first Node.js application.

Scalability

It all started when a bunch of eBay engineers (Steven, Venkat, and Senthil) wanted to bring an eBay Hackathon-winning project called “Talk” to production. When we found that Java did not seem to fit the project requirements (no offense), we began exploring the world of Node.js. Today, we have a full Node.js production stack ready to rock. 

We had two primary requirements for the project. The first was to make the application as real-time as possible, i.e., to maintain live connections with the server. The second was to orchestrate a huge number of eBay-specific services that display information on the page, i.e., to handle I/O-bound operations. We started with the basic Java infrastructure, but it consumed many more resources than expected, raising questions about scalability for production. These concerns led us to build a new mid-tier orchestrator from scratch, and Node.js seemed to be a perfect fit.

Mindset

Since eBay revolves around Java and since Java is a strongly typed static language, initially it was very difficult to convince folks to use JavaScript on the backend. The numerous questions involved ensuring type safety, handling errors, scaling, etc. In addition, JavaScript itself (being the world’s most misunderstood language) further fueled the debate. To address concerns, we created an internal wiki and invited engineers to express their questions, concerns, doubts, or anything else about Node.js.

Within a couple of days, we had an exhaustive list to work on. As expected, the most common questions centered around the reliability of the stack and the efficiency of Node.js in handling eBay-specific functionality previously implemented in Java. We answered each one of the questions, providing details with real-world examples. At times this exercise was eye-opening even for us, as we had never considered the angle that some of the questions presented. By the end of the exercise, people understood the core value of Node.js; indeed, some of the con arguments proved to be part of the beauty of the language.

Once we had passed the test of our peers’ scrutiny, we were all clear to roll.

Startup

We started from a clean slate. Our idea was to build a bare-minimum boilerplate Node.js server that scales; we did not want to bloat the application by introducing a proprietary framework. The first four node modules we added as dependencies were express, cluster, request, and async. For data persistence, we decided on MongoDB, to leverage its ease of use as well as its existing infrastructure at eBay. With this basic setup, we were able to get the server up and running on our developer boxes. The server accepted requests, orchestrated a few eBay APIs, and persisted some data.

For end-to-end testing, we configured our frontend servers to point to the Node.js server, and things seemed to work fine. Now it was time to get more serious. We started white-boarding all of our use cases, nailed down the REST end points, designed the data model and schema, identified the best node modules for the job, and started implementing each end point. The next few weeks we were heads down–coding, coding, and coding.   

Deployment

Once the application reached a stable point, it was time to move from a developer instance to a staging environment. This is when we started looking into deployment of the Node.js stack. Our objectives for deployment were simple: Automate the process, build once, and deploy everywhere. This is how Java deployment works, and we wanted Node.js deployment to be as seamless and easy as possible.

We were able to leverage our existing cloud-based deployment system. All we needed to do was write a shell script and run it through our Hudson CI job. Whenever code is checked in to the master branch, the Hudson CI job kicks off. Using the shell script, this job builds and packages the Node.js bundle, then pushes it to the deployment cloud. The cloud portal provides an easy user interface to choose the environment (QA, staging, or pre-production) and activate the application on the associated machines.

Now we had our Node.js web service running in various stable environments. This whole deployment setup was quicker and simpler than we had expected.  

Monitoring

At eBay, we have logging APIs that are well integrated with the Java thread model as well as at the JVM level. An excellent monitoring dashboard built on top of the log data can generate reports, along with real-time alerts if anything goes wrong. We achieved similar monitoring for the Node.js stack by hooking into the centralized logging system. Fortunately for us, we had logging APIs to consume. We developed a logger module and implemented three different logging APIs:

  1. Code-level logging. This level includes logging of errors/exceptions, DB queries, HTTP service calls, transaction metadata, etc.
  2. Machine-level logging. This level includes heartbeat data about CPU/memory and other OS statistics. Machine-level logging occurs at the cluster module level; we extended the npm cluster module and created an eBay-specific version.
  3. Logging at the load balancer level. All Node.js production machines are behind a load balancer, which sends periodic signals to the machines and ensures they are in good health. If a machine goes down, the load balancer fails over to a backup machine and sends alerts to the operations and engineering teams.
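
For illustration only, a stripped-down logger module covering the first two levels might look like the following; the collector host and payload format are placeholders, not eBay’s internal logging APIs.

    // logger.js -- hypothetical sketch; 'logs.example.com' stands in for the
    // real (internal) centralized logging endpoint.
    var os   = require('os');
    var http = require('http');

    function send(type, payload) {
      var body = JSON.stringify({ type: type, host: os.hostname(),
                                  ts: Date.now(), data: payload });
      var req = http.request({ host: 'logs.example.com', path: '/ingest',
                               method: 'POST',
                               headers: { 'Content-Type': 'application/json' } });
      req.on('error', function () { /* logging must never crash the app */ });
      req.end(body);
    }

    module.exports = {
      // 1. Code-level logging: errors, DB queries, service calls, transaction metadata.
      error: function (err, meta) {
        send('error', { message: err.message, stack: err.stack, meta: meta });
      },
      event: function (name, meta) {
        send('event', { name: name, meta: meta });
      },
      // 2. Machine-level logging: periodic heartbeat with CPU/memory/OS statistics.
      startHeartbeat: function (intervalMs) {
        setInterval(function () {
          send('heartbeat', { load: os.loadavg(), freemem: os.freemem(),
                              totalmem: os.totalmem(), mem: process.memoryUsage() });
        }, intervalMs || 60000);
      }
      // 3. Load-balancer-level checks live outside the app; the load balancer
      //    simply polls a health URL that the server exposes.
    };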

We made sure the log data formats exactly matched the Java-based logs, thus generating the same dashboards and reports that everyone is familiar with.

One particular logging challenge we faced was due to the asynchronous nature of the Node.js event loop: log lines from different transactions ended up interleaved, or “crossed.” To understand the problem, consider the following use case: the Node process starts a URL transaction and issues a DB query with an async callback, then moves on to the next request before the DB transaction finishes. This is normal behavior in any event-loop-based model like Node.js, but it means the logs from multiple URL transactions are interleaved, and the reporting tool shows scrambled output. We have worked out both short-term and long-term resolutions for this issue.
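
The post does not detail those resolutions, but one common short-term approach is to tag every incoming request with a unique transaction id and include that id in each log line, so interleaved lines can be re-grouped afterwards. A self-contained, hypothetical sketch:

    // Tag each request with a transaction id and carry it through the async
    // callbacks, so crossed log lines can be grouped back by transaction.
    var express = require('express');
    var crypto  = require('crypto');
    var app     = express();

    // Stand-in for the real data layer.
    var db = {
      findItem: function (id, cb) {
        setTimeout(function () { cb(null, { id: id }); }, 10);
      }
    };

    function log(txnId, event, meta) {
      console.log(JSON.stringify({ txn: txnId, event: event, ts: Date.now(), meta: meta }));
    }

    app.use(function (req, res, next) {
      req.txnId = crypto.randomBytes(8).toString('hex'); // unique per request
      next();
    });

    app.get('/item/:id', function (req, res) {
      log(req.txnId, 'db.query.start', { id: req.params.id });
      db.findItem(req.params.id, function (err, item) {
        log(req.txnId, 'db.query.end', { error: !!err });
        res.json(item);
      });
    });

    app.listen(8080);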

Conclusion

With all of the above work completed, we are ready to go live with our Hackathon project. This is indeed the first eBay application to have a backend service running on Node.js. We’ve already had an internal employee-only launch, and the feedback was very positive, particularly on the performance side. Exciting times are ahead!

A big shout-out to our in-house Node.js expert Cylus Penkar, for his guidance and contributions throughout the project. With the success of the Node.js backend stack, eBay’s platform team is now developing a full-fledged frontend stack running on Node.js. The stack will leverage most of our implementation, in addition to frontend-specific features like L10N, management of resources (JS/CSS/images), and tracking. For frontend engineers, this is a dream come true, and we can proudly say, “JavaScript is EVERYWHERE.”

Senthil Padmanabhan & Steven Luan
Engineers @ eBay


Many web performance testing service vendors, such as Keynote, offer last-mile testing data. In addition, many popular tools, such as WebPagetest, provide a convenient way to submit test cases so that you can collect large volumes of performance metrics. However, most of these services and tools are synthetic, meaning the data does not come from real users; testing agents run the tests on dedicated machines in a controlled environment.

The real world can be very different. A great number of parameters can affect the data dramatically. We can easily name a few: network connection speed (last-mile speed in particular), browser type, and computer CPU power. More importantly, the distribution of parameter values affects the results substantially. For example, we can get DSL and high-speed backbone test results from Keynote for a particular type of browser, but we don’t know exactly how many users have a network speed that’s comparable to the testing network’s speed. Other connection speeds that exist in the real world can also be missed in the tests. It’s difficult to put a weight on each of the tests and get an average that reflects the real world.

Several years back, at eBay we began building a framework called Site Speed Gauge to capture real-world, end-user site speed data. Site Speed Gauge has proven to be an effective way to understand real user experience with web performance, while also correlating site speed and the conversion of traffic to purchases. This framework is now the gold standard at eBay for monitoring site speed. Compared to synthetic tests with at most several thousand data points per day for a site page, we are getting millions of data samples for the same page. Large, statistically significant data sampling points help us to get stable trending and to identify real issues. This post provides an overview of how we use Site Speed Gauge to monitor and improve site performance as well as to collect meaningful business metrics.

The art of site speed measurement

We often hear questions such as, Do you know how fast or slow my web page is? Does it matter to end users? Depending on the source of the question, you’ll likely have a different answer. If a page takes 10 seconds to load and the person asking the question can wait that long, then the speed is fine. However, if I am an engineer, I probably would consider 10 seconds as too slow, and I’d want to know why the page is slow. Is the server response slow, or is the network connection slow? How long do I need to wait to see the first impression on the page? How long does it take to render the full page after getting the first byte of data? What is the speed on different types of browsers so I can tune the slow ones? If I am a business person, I would want to know how many users are affected by slow web performance; a distribution of performance data, such as the median or the 90th percentile, would be useful. Most importantly, everyone wants to know the correlation between site speed and the business metrics. Does site speed affect our company’s top line?

To answer these different questions, we will need various measurements and data analysis views. Throughout the design and enhancement of Site Speed Gauge, we have continually ensured that it is extensible to meet different needs. Site speed measurement is an art; you will experience its beauty when it exactly matches what any particular circumstance calls for.

How Site Speed Gauge works

The following diagram describes the 10 steps of Site Speed Gauge, from receiving a user request to reporting data:

Figure: The 10 steps of Site Speed Gauge, from receiving a user request to reporting data.

For older browser versions, where the Navigation Timing object is not available, we mainly use JavaScript to capture the timers. Client-side timer measurements using JavaScript are relatively easy and accurate, as we can start the timer at the beginning of the page, and use the client machine’s clock to measure any points during the page rendering. The difficult measurement is the total end-to-end time, from when the user’s click initiates the page request through when the page completes with the browser onload event. We can use a cookie to store the timestamp when a user leaves the previous page, and measure end-to-end time when loading of the current page is completed. However, if the previous page is not from eBay—for example, if it is a third-party search or referral page—we will miss the end-to-end time metric.

Therefore, we instead use server-side timers for end-to-end time measurement. The beginning timestamp st1 is the time when an app server gets the request, and the end timestamp st2 is the time when another metrics collection server gets the site speed beacon request. We miss the URL request time for the real user’s page, but we compensate for this fact with the beacon request time. To handle the case of the app server’s clock and the metrics collection server’s clock not being in synch, we can use a time-synch service on the two machines. To provide sufficient accuracy, the synch service should return timestamps in milliseconds. Alternatively, we can use a single database timestamp to eliminate the time synch issue.
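
As a rough illustration of that server-side measurement (the endpoint and query parameter below are invented, not the actual beacon format), the app server stamps st1 into the page’s beacon URL, and the metrics collection server computes st2 minus st1 when the beacon arrives:

    // Metrics collector side (sketch). The page's beacon URL carries st1, the
    // timestamp the app server recorded when it received the original request,
    // e.g. <img src="http://collector.example.com/beacon?st1=1370000000000">.
    // Both machines are assumed to be time-synched to millisecond accuracy.
    var express   = require('express');
    var collector = express();

    collector.get('/beacon', function (req, res) {
      var st1 = parseInt(req.query.st1, 10); // stamped by the app server
      var st2 = Date.now();                  // beacon arrival at the collector
      var endToEndMs = st2 - st1;            // approximate end-to-end page time
      // ...aggregate endToEndMs by page, browser, and region here...
      res.status(204).end();                 // nothing to render for a beacon
    });

    collector.listen(9090);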

For the latest versions of browsers, we also send back measurements from the Navigation Timing object those browsers create. These measurements give us very useful information about client-side performance, such as DNS lookup time and network connection time. Through our analysis of the data, we have identified DNS and connection times as major sources of overhead if our datacenters and our clients are not located on the same continent.
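
For example, on browsers that expose the Navigation Timing object, the in-page beacon script can derive DNS and connection times roughly as follows (the collector URL is a placeholder):

    // In-page beacon snippet (sketch): read Navigation Timing after onload and
    // report DNS lookup and TCP connection times along with total load time.
    window.addEventListener('load', function () {
      // Wait a tick so loadEventEnd is populated.
      setTimeout(function () {
        var t = window.performance && window.performance.timing;
        if (!t) { return; } // older browser: fall back to the JavaScript timers
        var metrics = {
          dns:     t.domainLookupEnd - t.domainLookupStart,
          connect: t.connectEnd - t.connectStart,
          ttfb:    t.responseStart - t.navigationStart,
          load:    t.loadEventEnd - t.navigationStart
        };
        // Fire-and-forget beacon (placeholder URL).
        new Image().src = 'http://collector.example.com/beacon?' +
          Object.keys(metrics).map(function (k) { return k + '=' + metrics[k]; }).join('&');
      }, 0);
    });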

Site Speed Gauge features

Currently, Site Speed Gauge supports these major features:

  • Key performance metrics are provided, such as total page end-to-end time, client-side rendering time, first byte time, DOM ready time, certain JavaScript execution times, server processing time, above-fold time, and graphical ads time. About 30 timers are available from the various stages of page processing.
  • Metrics are broken down by page, browser, device, international site, and client IP location. These breakdowns serve as selection dimensions when you query and view the data.
  • Data sampling is adjustable, from 100% sampling for low-traffic pages to smaller percentages for high-traffic pages. For a heavily trafficked site like eBay, with billions of page views per day, big data processing and scaling require controlling the sample size.
  • Through the gauge’s support for A/B testing, we can tie site speed to site feature changes. This ability is very useful for collecting business metrics correlation data; more on this in the next section.
  • In addition to collecting web performance data, we can plug in collection of other user behavior data: user clicking, scrolling, and browser size data. We have built heat maps on top of the data to analyze user interactions with the pages.
  • We can plug in other browser performance objects, such as the Navigation Timing object available in new browser versions. As described previously, this capability enables us to get more visibility into the network layer, such as DNS lookup and network connection times.
  • Site Speed Gauge also supports other client-side capturing, such as JavaScript error capturing. Without this capability, we would be flying blind, unaware of problems until getting complaints from end users.

Integration with A/B testing

Tracking the trending of site speed helps us to identify site feature rollout issues, and to monitor site speed improvements we can achieve from various optimizations. In addition, we can run different versions of a web page at the same time and compare the site speed of each version. The resulting data enables us to correlate business metrics with site speed in a precise way. One of the characteristics of eBay’s business is seasonality; if we can simultaneously monitor site speed and business metrics for a page with seasonal or other variations, we can build meaningful correlations.

To enable such analysis, we have integrated Site Speed Gauge with eBay’s A/B testing platform. In this platform, business metrics are collected and analyzed based on testing and control groups; a user session id identifies which group a user belongs to. We use the same session id to collect site speed as well as business metrics. Once a user is sampled for site speed, all pages viewed by the same user are sampled so that we have all site speed and business metrics data for this user.
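
One simple way to get that “once sampled for the session, always sampled” behavior is to make the sampling decision a deterministic function of the session id, so every page view in the session falls into the same bucket. A hypothetical sketch:

    // Deterministic, session-sticky sampling (sketch): hash the session id and
    // compare against the sampling percentage, so a given user is either always
    // sampled or never sampled within a session.
    var crypto = require('crypto');

    function isSampled(sessionId, samplePercent) {
      var digest = crypto.createHash('md5').update(String(sessionId)).digest();
      var bucket = digest.readUInt32BE(0) % 100; // stable bucket in [0, 99]
      return bucket < samplePercent;
    }

    // Example: sample 10% of sessions for a high-traffic page.
    console.log(isSampled('abc123-session-id', 10));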

Several years ago, we ran two versions of the eBay search page. The versions had the same features but different implementations, one our classic search and the other a new implementation. We collected data on site speed as well as buyer purchases per week (PPW, a business metric related to the conversion of traffic to purchases), and over time we found a strong correlation between site speed and PPW. As the chart below shows, the correlation is not linear: it starts at a 1% PPW change for a 10% site speed change and increases to a 5% PPW change for a 35% site speed change. Our interpretation of this result is that a small change in page speed might not have much impact, but a large page speed change can have a noticeable effect on end users; for example, a large reduction in site speed can cause user activity to drop, or even abandonment of the site, thus affecting conversions.

Figure: Correlation between site speed change and PPW change.

As we established the correlation between site speed and PPW, business people and engineers alike began vigorously emphasizing site speed when they designed and implemented features. Now, the engineering culture at eBay is that site speed and optimization are part of the design, implementation, and rollout of features. Many important changes to the site first go through A/B testing to ensure that we’re maximizing site speed as well as business impacts. In addition, the Site Operations team uses dashboards and alerts to monitor site speed 24×7.

Data processing and visualization dashboards and tools

Although we typically sample only 10% of the traffic on high-traffic pages, those pages still generate large amounts of beacon data. A batch job running at regular intervals performs site speed data processing and aggregation in ways that meet the different needs of the various consumers of site speed data.
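
The aggregation itself can be as simple as grouping the beacons by page and hour and then computing percentiles. A stripped-down sketch of the percentile step, with made-up sample numbers:

    // Compute the 50th and 90th percentile of a batch of end-to-end timings (ms).
    function percentile(values, p) {
      var sorted = values.slice().sort(function (a, b) { return a - b; });
      var index = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
      return sorted[index];
    }

    var timings = [850, 920, 1100, 1300, 2400, 980, 1050, 3100, 760, 1900];
    console.log('p50:', percentile(timings, 50), 'ms');
    console.log('p90:', percentile(timings, 90), 'ms');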

Site Speed Gauge provides dashboards and analytical tools that support data visualization in highly configurable ways. For example, we can select daily results, hourly trending on the 50th percentile, or hourly trending on the 90th percentile. We can view data on a specific eBay page, for a particular international or US site. We can see the site speed for various A/B testing groups.

Here is an example of the site speed breakdown by browser:

Figure: Site speed breakdown by browser.

And here is an example of the browser usage breakdown:

Figure: Browser usage breakdown.

Those who are interested in network latency and the impact of a CDN can view results by geographic location. We process the data based on client IP address to support this feature. The image below shows a heat map of site speed for the United States. A picture is worth a thousand words; the heat map helps people identify page speed issues visually.

Figure: Heat map of site speed across the United States.

Conclusion

At eBay, Site Speed Gauge helps us to identify user-facing web performance issues as well as to monitor site speed improvement efforts. Its ability to collect real-world data and correlate it with business metrics makes it a powerful tool for eBay’s highly trafficked, consumer-oriented web site. We built extensibility into Site Speed Gauge from the start; over the past several years we have enhanced it to support different needs, and we expect to continue enhancing it in the future.

