Monthly Archives: October 2012

Validating Hadoop Platform Releases

Intern Pralay Biswas submitted the following blog post at the conclusion of his summer with eBay. Pralay was one of 120 college interns welcomed into eBay Marketplaces this summer.

In pioneer days, they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers. — Admiral Grace Hopper (1906-1992)

Like any company where data is of primary importance, eBay needed a scalable way to store and analyze data to help shape the company’s future, formulate strategies, and comprehend customer pain points in a much better way. eBay began its Hadoop journey in 2007 — and yes, eBay has come a long way since then.

I spent the summer of 2012 with eBay’s core Hadoop Platform team, working on quality engineering. Part of eBay’s central Marketplaces business, this team is where eBay manages the shared clusters, augments public-domain Hadoop software, and validates each release of Hadoop at eBay.

The problem

  • How do we validate each new Hadoop release?
  • How do we test the software to decrease the probability that we are shipping bugs?
  • What about cluster-wide, end-to-end tests and integration tests?
  • How do we make sure there are no quality regressions?

These are difficult questions to answer, especially when it comes to Hadoop.

No silver bullet for Hadoop testing

Existing software solutions from the Apache community that orchestrate and automate Hadoop testing and validation are insufficient. JUnit and PyUnit are great for small, self-contained developer (unit) tests. AspectJ works fine for local, single-module fault injection tests. Herriot is also based on JUnit and has some badly designed, buggy extensions. EclEmma is IDE-specific and therefore inappropriate for this purpose.

Long story short: there is no existing framework or test harness that can validate a Hadoop platform release. Quick and automatic validation with minimal human intervention (ideally, zero-touch) is of primary importance to any company that needs to test with large-scale, mission-critical data. The quicker we store and analyze data, the quicker we understand our progress and ensure that we can roll out new software or configurations.

Engineer Arun Ramani developed the eBay Hadoop test framework. I was hired as a graduate summer intern to add tests to the framework that he conceived. In the following sections, I’ll elaborate on the test harness (as opposed to the tests that I wrote).

Architecture of the framework

The architecture of the test harness follows the model-view-controller (MVC) design pattern. The framework separates test cases from the test logic, so that the logic can remain almost unchanged; most changes can be made in the test XML only. The driver in the figure below is a Perl script that uses support modules (some more Perl modules) to prepare the test environment, and then pulls in the tests themselves, which reside in an XML file.

The main driver module acts as an orchestrator, and the support modules help determine the environment configuration; they restart daemons (name node, secondary name node, data node, job tracker, task tracker), send email reports of results, look for anomalous and correct exit conditions of processes, etc. The actual tests are in the XML file only. The driver parses these tests, executes them, and checks whether a test passed. Determination of a test’s pass/fail status can be achieved using Perl extended regular-expression comparisons of standard out, standard error, process termination, exit signals, log file data, or other system-level information. The use of Perl affords extraordinary visibility into end-to-end system-level test results (memory, cpu, signals, exit codes, stdout, stderr, and log files).

[Figure: architecture of the test framework — the Perl driver, its support modules, and the XML test file.]

All tests are grouped according to use cases. Tests within group tags represent one use case. Every use case (group) has a setup action and a teardown action containing commands to be executed for the purposes of those individual tests. All commands to be executed go between cmd tags, and desc tags enclose test and command descriptions.

Test case example

Take, for example, the following test case, which runs a Pi calculation job on the Hadoop cluster. The setup action prepares the environment required by the subsequent test — in this case, executing a command that deletes any existing directories so that the job can create new ones. The cmd tag would typically enclose a Hadoop command. However, this is not a restriction; the command could be a UNIX command line, a Java command to execute any jar, a shell script, another Perl script, or any arbitrary command. This flexibility enables the test framework to support tests not only for MapReduce and HDFS but for the entire Hadoop ecosystem.
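As an illustrative sketch of such a test case — the tag names (group, desc, setup, cmd, exp_stdout, not_exp_stdout_regexp, exp_rc, teardown) follow the description in this post, but the exact schema, paths, and jar name are assumptions, not the actual eBay test file:

```xml
<!-- Illustrative sketch only; the real eBay schema may differ. -->
<group>
  <desc>Run the example Pi estimation job on the cluster</desc>
  <setup>
    <!-- Delete any leftover output directory so the job can create a new one -->
    <cmd>hadoop fs -rmr /user/qa/pi-output</cmd>
  </setup>
  <cmd>hadoop jar hadoop-examples.jar pi 10 1000</cmd>
  <!-- Perl extended regular expressions checked against stdout -->
  <exp_stdout>Estimated value of Pi is 3\.1\d*</exp_stdout>
  <not_exp_stdout_regexp>Exception</not_exp_stdout_regexp>
  <exp_rc>0</exp_rc>
  <teardown>
    <cmd>hadoop fs -rmr /user/qa/pi-output</cmd>
  </teardown>
</group>
```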


When the test driver executes the command, it stores results in the variable $actual_out. The driver then compares this value to that of the exp_stdout variable specified in the test case; this is a regular expression comparison. The driver also checks that $actual_out has nothing in common with not_exp_stdout_regexp. Furthermore, the return code of the executed command (the Hadoop job) is compared against the expected return code (exp_rc). A strict test driver declares success of a test only if all the checks and assertions are positive.

Once all tests in a group are run, the commands in the teardown phase clean up any data or state that the tests created (in HDFS, HBase, or memory) during their run, leaving the cluster clean for the next test case. Once the end of the XML test file is reached, the driver calculates the total number of cases (groups) executed, passed, and failed, then sends this log to the configured email IDs.

Automation and results

This test framework is kicked off as a Jenkins job that the Hadoop deployment job runs. So, as soon as a new release is deployed and compiled on the QA cluster, this framework runs automatically to validate that the new bits are good. Modular and flexible, the framework has been used successfully to validate Cloudera’s CDH4 and Apache Hadoop 0.22. Going forward, this framework will validate all future releases of Hadoop at eBay.


Bandwidth-based Experience

When designing new web pages or working on new features for an existing page, developers commonly consider the following factors:

  • Page appearance for different layout sizes
  • Browser-specific implementations
  • Page behavior with and without JavaScript enabled

All of the above points are valid, and developers should indeed think about them. However, there is another factor that developers might not think much about: how their pages are loading on low-bandwidth connections or slower computers.

When users browse web sites, numerous factors contribute to the speed at which web pages load. For example, some users might have high-bandwidth connections but slower computers, while others might have high-speed computers with low-bandwidth connections. Due to such factors, users of your web site can have vastly different page loading experiences.

Most web sites collect real-time site speed data to see how their web pages are loading on their users’ computers. Typically, these numbers provide median, average, and 95th-percentile loading time. If you look closely into the 95th-percentile data, you might be amazed how slowly your web pages are loading. Can we do anything about these 95th-percentile users?

To analyze your users’ bandwidth experience, collect and segregate data based on the load time of your web pages. Next, determine what percentage of your users are experiencing slow load times. For this determination, it’s up to you to draw the line between fast users and slow users.
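Once load times are collected, the split can be computed directly. Here is a minimal sketch of that calculation (the function name, sample data, and threshold are illustrative, not from the original post):

```javascript
// Hypothetical sketch: given a sample of measured page-load times (in ms),
// compute what percentage of users fall above a chosen "slow" threshold.
function slowUserPercentage(loadTimesMs, thresholdMs) {
  var slow = loadTimesMs.filter(function (t) { return t > thresholdMs; });
  return (100 * slow.length) / loadTimesMs.length;
}

// With a 4-second threshold, one session out of four here counts as slow.
console.log(slowUserPercentage([1200, 2500, 3900, 6000], 4000)); // 25
```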

Suppose that your segregated data looks like this:











[Table: percentage of users in each page-load-time bucket, with buckets ranging up to 10+ seconds.]

According to the data, pages are loading in less than 4 seconds for 85% of your users. Let’s say that you determine that slow loading is anything more than 4 seconds.

Now we have a data point to distinguish between slow and fast load times. But how do we detect load times in real time, per user? One approach is to download an image of X bytes, measure the time taken to load the image, and use that time to calculate the user’s bandwidth. However, this approach might not touch all factors—such as JavaScript execution time—that can affect page loading time.
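The arithmetic behind that image-probe approach is straightforward. Here is a sketch (the probe size and timing values are illustrative assumptions):

```javascript
// Sketch of the image-probe calculation: estimate bandwidth in kilobits per
// second from the probe image's known size and its measured download time.
function estimateKbps(imageBytes, elapsedMs) {
  var kilobits = (imageBytes * 8) / 1024; // bytes -> kilobits
  return kilobits / (elapsedMs / 1000);   // kilobits per second
}

// A 50 KB probe image that takes 2 seconds to download suggests ~200 kbps.
console.log(estimateKbps(50 * 1024, 2000)); // 200
```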

Here is a different approach, which accounts for all the factors that contribute to slowness. When the user visits a web page for the first time, calculate the page's load time at the start of the session and store it in a cookie. Do this calculation only once, and use it for the rest of the session.

Calculating page load time is very simple. The following script measures purely client-side load times. (Note that it assumes that you have JavaScript functions to read and write cookies.)

<SCRIPT>
   var start = new Date().getTime(); //Take the start measurement

   window.onload = function(){ //Use appropriate event binding mechanism
      var end = new Date().getTime(); //Take the end measurement
      var loadtime = end - start; //Calculate the load time
      var loadtimefromcookie = readCookie(); //Call JS function to read cookie
      if(!loadtimefromcookie){ //If load time not yet stored
         writeCookie(loadtime); //Call JS function to set cookie
      }
   };
</SCRIPT>

From the second request onwards, you can read the cookie; based on its value, you can strip down the features for lower-bandwidth users and/or add more features for higher-bandwidth users.
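As a minimal sketch of that branching — the 4-second threshold comes from the earlier analysis, and the function name is hypothetical:

```javascript
// Minimal sketch: classify a session as "slow" or "fast" from the load time
// recorded in the cookie. The 4000 ms threshold follows the earlier analysis;
// in a real page this value would come from your own site-speed data.
var SLOW_THRESHOLD_MS = 4000;

function classifyBandwidth(loadTimeMs) {
  return loadTimeMs > SLOW_THRESHOLD_MS ? "slow" : "fast";
}

// From the second request onwards, pass in the load time read back from the
// cookie, and serve the stripped-down page when the result is "slow".
console.log(classifyBandwidth(6500)); // "slow"
console.log(classifyBandwidth(1800)); // "fast"
```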

How can you improve the loading speed for lower-bandwidth users? Here are some techniques:

  • Show fewer results.
  • Use smaller images.
  • Cut down on heavy ads, or remove all ads.
  • Remove heavy or crazy features that are not required for basic functionality.

Do A/B testing with these changes, assess the gains for the 95th-percentile user, and see whether this approach makes a difference. Once you see positive changes, you can target and fine-tune page features based on user bandwidth.

The Case Against Logic-less Templates

At a conference I recently attended, two separate topics related to templates sparked a debate off-stage about Logic vs No-logic View Templates. Folks were very passionate about the side they represented. Those discussions were the high point of the conference, since I learned much more from them than from the sessions that were presented.

While everyone involved in the debates had different opinions about the ideal templating solution, we all agreed that templates should…

  • not include business logic
  • not include a lot of logic
  • be easy to read
  • be easy to maintain

In this post I’ll describe what I prefer, setting aside rules, best practices, performance considerations, and the like. My preferences come from my experience with templates for medium-to-complex applications.

Full-of-logic, logic-less, and less-logic solutions

Before I go into the details, I would like to make one thing very clear. Although I’m trying to make a case against logic-less templates, that does not mean that I am advocating the other extreme–i.e., a templating language that allows a lot of logic. I find such templating languages, especially those that allow the host programming languages to be used inside the template, to be hard to read, hard to maintain, and simply a bad choice. A JSP template with Java code in it and an Underscore template with JavaScript in it both fall into the category of being a full-of-logic template. JSP and Underscore are not necessarily at fault here; rather, developers often abuse the additional freedom such solutions offer.

What I am advocating is “less-logic” templates in place of “logic-less” templates (thanks, Veena Basavaraj (@vybs), for the term “less-logic templates”!).

Good and great templating languages

A good templating language should offer, at a minimum, the following features:

  1. Clean and easy-to-read syntax (including the freedom to use whitespace that will not show up in output)
  2. Structural logic like conditionals, iterations/loops, recursions, etc.
  3. Text/Token replacement
  4. Includes

A great templating language should also offer the following features:

  1. Ability to be rendered on the server and the client
  2. Easy learning curve with as few new concepts as possible
  3. Simple template inheritance
  4. Easy debugging
  5. Great error reporting (line numbers, friendly messages, etc.)
  6. Extensibility and customizability
  7. Localization support
  8. Resource URL management for images, scripts, and CSS

The case against logic-less templates

I consider a logic-less template to be too rigid and hard to work with due to the restrictions it imposes. More specifically, here are the reasons I am not a big fan of logic-less templates:

  • Writing a logic-less template requires a bloated view model with comprehensive getters for the raw data. As a result, a messy and difficult-to-maintain view model usually accompanies logic-less templates.
  • Every new variation to the view requires updating both the view model and the template. This holds true even for simple variations.
  • Getting the data ready for the template requires too much data “massaging.”
  • Working around the restrictions of a logic-less template requires a lot of helper methods.
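To make the first two points concrete, here is a hypothetical sketch (not tied to any particular templating engine) of the data massaging that a strictly logic-less template forces onto the view model:

```javascript
// Hypothetical sketch: a strictly logic-less template cannot ask "is this the
// first item?", so the view model must precompute a flag for every row.
var names = ["alpha", "beta", "gamma"];

var viewModel = {
  items: names.map(function (name, index) {
    return {
      name: name,
      isFirst: index === 0 // presentation detail pushed into the model
    };
  })
};

console.log(viewModel.items[0].isFirst); // true
```

A "less-logic" template could instead test the loop index itself, leaving the view model as the raw list of names.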

Regarding the argument of better performance with logic-less templates: While they might have a simpler implementation, they require additional preprocessing/massaging of the data. You therefore have to pay a price before the data gets to the template renderer. Granted, a templating language that allows more logic might have a more complex implementation; however, if the compiler is implemented correctly, then it can still produce very efficient compiled code. For these reasons, I would argue that a solution involving templates with more logic will often outperform a similar solution based on logic-less templates.


No doubt, there are advantages like simplicity and readability associated with logic-less templates such as Mustache. And arguably, there could be some performance gains. However, in practice I do not think these tradeoffs are enough to justify logic-less templates.

Logic in a template isn’t really a bad thing, as long as it doesn’t compromise the template’s readability and maintainability.

And as an aside, I’m definitely in favor of more debates than sessions at future conferences, since we actually learn more by hearing multiple viewpoints.

I welcome your thoughts!

Credits: Many thanks to my colleague Patrick Steele-Idem (@psteeleidem) for helping me write this post. He is working on some cool solutions like a new templating language; be sure to check them out when they are open-sourced.