eBay Tech Blog

We are very excited to announce that eBay has released to the open-source community our distributed analytics engine: Kylin (http://kylin.io). Designed to accelerate analytics on Hadoop and allow the use of SQL-compatible tools, Kylin provides a SQL interface and multi-dimensional analysis (OLAP) over extremely large datasets.

Kylin is currently used in production by various business units at eBay. In addition to open-sourcing Kylin, we are proposing Kylin as an Apache Incubator project.

Background

The challenge faced at eBay is that our data volume has become bigger while our user base has become more diverse. Our users – for example, in analytics and business units – consistently ask for minimal latency but want to continue using their favorite tools, such as Tableau and Excel.

So, we worked closely with our internal analytics community and outlined requirements for a successful product at eBay:

  1. Sub-second query latency on billions of rows
  2. ANSI-standard SQL availability for those using SQL-compatible tools
  3. Full OLAP capability to offer advanced functionality
  4. Support for high cardinality and very large dimensions
  5. High concurrency for thousands of users
  6. Distributed and scale-out architecture for analysis in the TB to PB size range

We quickly realized nothing met our exact requirements externally – especially in the open-source Hadoop community. To meet our emergent business needs, we decided to build a platform from scratch. With an excellent team and several pilot customers, we have been able to bring the Kylin platform into production as well as open-source it.

Feature highlights

Kylin is a platform offering the following features for big data analytics:

  • Extremely fast OLAP engine at scale: Kylin is designed to reduce query latency on Hadoop for 10+ billion rows of data.
  • ANSI SQL on Hadoop: Kylin supports most ANSI SQL query functions in its ANSI SQL on Hadoop interface.
  • Interactive query capability: Users can interact with Hadoop data via Kylin at sub-second latency – better than Hive queries for the same dataset.
  • MOLAP cube query serving on billions of rows: Users can define a data model and pre-build cubes in Kylin from more than 10 billion raw data records.
  • Seamless integration with BI Tools: Kylin currently offers integration with business intelligence tools such as Tableau and third-party applications.
  • Open-source ODBC driver: Kylin’s ODBC driver is built from scratch and works very well with Tableau. We have open-sourced the driver to the community as well.
  • Other highlights:
      • Job management and monitoring
      • Compression and encoding to reduce storage
      • Incremental refresh of cubes
      • Leveraging of the HBase coprocessor for query latency
      • Approximate query capability for distinct counts (HyperLogLog)
      • Easy-to-use Web interface to manage, build, monitor, and query cubes
      • Security capability to set ACL at the cube/project level
      • Support for LDAP integration

The fundamental idea

The idea of Kylin is not brand new. Many technologies over the past 30 years have used the same theory to accelerate analytics. These technologies include methods to store pre-calculated results to serve analysis queries, generate each level’s cuboids with all possible combinations of dimensions, and calculate all metrics at different levels.

For reference, here is the cuboid topology:

[Figure: cuboid topology]

As data grows bigger, the pre-calculation processing becomes impossible – even with powerful hardware. However, with the benefit of Hadoop's distributed computing power, calculation jobs can leverage hundreds or thousands of nodes. This allows Kylin to perform these calculations in parallel and merge the final results, thereby significantly reducing the processing time.
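
To make the cuboid idea concrete, here is a small, self-contained Java sketch that enumerates every cuboid – every combination of dimensions – for a toy cube. It is only an illustration of the principle: the dimension names are invented, and Kylin's real build performs this work as distributed MapReduce jobs.

import java.util.ArrayList;
import java.util.List;

public class CuboidEnumerator {
    // Hypothetical dimensions; Kylin derives the real list from the cube's data model.
    private static final String[] DIMENSIONS = {"site", "category", "year"};

    public static void main(String[] args) {
        int n = DIMENSIONS.length;
        // Each bitmask from (2^n - 1) down to 0 identifies one cuboid:
        // the full cuboid first, the apex cuboid (no dimensions) last.
        for (int mask = (1 << n) - 1; mask >= 0; mask--) {
            List<String> dims = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) {
                    dims.add(DIMENSIONS[i]);
                }
            }
            System.out.println("cuboid " + mask + ": " + dims);
        }
    }
}

With n dimensions there are 2^n cuboids, which is why the pre-calculation quickly becomes a distributed-computing problem as dimensionality grows.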

From relational to key-value

As an example, suppose there are several records stored in Hive tables that represent a relational structure. When the data volume grows very large – 10+ or even 100+ billions of rows – a question like “how many units were sold in the technology category in 2010 on the US site?” will produce a query with a large table scan and a long delay to get the answer. Since the values are fixed every time the query is run, it makes sense to calculate and store those values for further usage. This technique is called Relational to Key-Value (K-V) processing. The process will generate all of the dimension combinations and measured values shown in the example below, at the right side of the diagram. The middle columns of the diagram, from left to right, show how data is calculated by leveraging MapReduce for the large-volume data processing.

[Figure: mapping relational data to key-value pairs]
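
The following sketch shows this relational-to-K-V idea in miniature using plain Java collections: raw rows are aggregated once into a key-value map keyed by the dimension combination, after which the example question becomes a single lookup. The rows and values are made up; in Kylin the aggregation runs as MapReduce jobs and the resulting key-value pairs are stored in HBase.

import java.util.HashMap;
import java.util.Map;

public class PreAggregation {
    public static void main(String[] args) {
        // Raw "fact" rows: site, category, year, units sold (values are invented).
        String[][] rows = {
                {"US", "technology", "2010", "3"},
                {"US", "technology", "2010", "5"},
                {"UK", "technology", "2010", "2"},
                {"US", "clothing",   "2010", "7"},
        };

        // Pre-compute SUM(units) for the (site, category, year) cuboid and
        // store each aggregate under a composite key, key-value style.
        Map<String, Long> cuboid = new HashMap<>();
        for (String[] r : rows) {
            String key = r[0] + "|" + r[1] + "|" + r[2];
            cuboid.merge(key, Long.parseLong(r[3]), Long::sum);
        }

        // Answering "units sold in technology in 2010 on the US site" is now
        // a single key lookup instead of a full table scan.
        System.out.println("US|technology|2010 -> " + cuboid.get("US|technology|2010"));
    }
}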

Kylin is based on this theory and leverages the Hadoop ecosystem to do the job for huge volumes of data:

  1. Read data from Hive (which is stored on HDFS)
  2. Run MapReduce jobs to pre-calculate
  3. Store cube data in HBase
  4. Leverage Zookeeper for job coordination

Architecture

The following diagram shows the high-level architecture of Kylin.

[Figure: Kylin high-level architecture]

This diagram illustrates how relational data becomes key-value data through the Cube Build Engine offline process. The yellow lines also illustrate the online analysis data flow. Data requests can originate as SQL submitted from a SQL-based tool, or from third-party applications via Kylin’s RESTful services. The RESTful services call the Query Engine, which determines whether the target dataset exists. If so, the engine directly accesses the target data and returns the result with sub-second latency. Otherwise, the engine routes queries for non-matching datasets to a SQL-on-Hadoop engine enabled on the Hadoop cluster, such as Hive.
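
As a sketch of how an application might use those RESTful services, the following Java snippet submits a SQL query to a Kylin instance over HTTP. The host, credentials, project, and table are placeholders, and the exact endpoint path and JSON fields should be verified against the Kylin REST API documentation for your version.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Scanner;

public class KylinQueryClient {
    public static void main(String[] args) throws Exception {
        // Placeholder host and project; adjust for your deployment.
        URL url = new URL("http://kylin-host:7070/kylin/api/query");
        String payload = "{\"sql\": \"select site, count(*) from sales group by site\","
                + " \"project\": \"my_project\", \"offset\": 0, \"limit\": 100}";

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        // Kylin's REST services use HTTP basic authentication.
        String auth = Base64.getEncoder()
                .encodeToString("ADMIN:KYLIN".getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + auth);
        conn.setDoOutput(true);

        try (OutputStream out = conn.getOutputStream()) {
            out.write(payload.getBytes(StandardCharsets.UTF_8));
        }
        try (Scanner in = new Scanner(conn.getInputStream(), "UTF-8")) {
            while (in.hasNextLine()) {
                System.out.println(in.nextLine());
            }
        }
    }
}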

Following are descriptions of all of the components the Kylin platform includes.

  • Metadata Manager: Kylin is a metadata-driven application. The Metadata Manager is the key component that manages all metadata stored in Kylin, including the most important cube metadata. All other components rely on the Metadata Manager.
  • Job Engine: This engine is designed to handle all of the offline jobs including shell script, Java API, and MapReduce jobs. The Job Engine manages and coordinates all of the jobs in Kylin to make sure each job executes and handles failures.
  • Storage Engine: This engine manages the underlying storage – specifically the cuboids, which are stored as key-value pairs. The Storage Engine uses HBase – the best solution from the Hadoop ecosystem for leveraging an existing K-V system. Kylin can also be extended to support other K-V systems, such as Redis.
  • REST Server: The REST Server is an entry point for applications to develop against Kylin. Applications can submit queries, get results, trigger cube build jobs, get metadata, get user privileges, and so on.
  • ODBC Driver: To support third-party tools and applications – such as Tableau – we have built and open-sourced an ODBC Driver. The goal is to make it easy for users to onboard.
  • Query Engine: Once the cube is ready, the Query Engine can receive and parse user queries. It then interacts with other components to return the results to the user.

In Kylin, we are leveraging an open-source dynamic data management framework called Apache Calcite to parse SQL and plug in our code. The Calcite architecture is illustrated below. (Calcite was previously called Optiq, which was written by Julian Hyde and is now an Apache Incubator project.)

[Figure: Apache Calcite architecture]

Kylin usage at eBay

At the time of open-sourcing Kylin, we already had several eBay business units using it in production. Our largest use case is the analysis of 12+ billion source records generating 14+ TB cubes. Its 90th-percentile query latency is less than 5 seconds. Our use cases now target analysts and business users, who can access analytics and get results through a Tableau dashboard very easily – no more Hive queries, shell commands, and so on.

What’s next

  • Support TopN on high-cardinality dimension: The current MOLAP technology is not perfect when it comes to querying on a high-cardinality dimension – such as TopN on millions of distinct values in one column. Similar to search engines (as many researchers have pointed out), an inverted index is a reasonable mechanism for pre-building such results.
  • Support Hybrid OLAP (HOLAP): MOLAP is great to serve queries on historical data, but as more and more data needs to be processed in real time, there is a growing requirement to combine real-time/near-real-time and historical results for business decisions. Many in-memory technologies already work on Relational OLAP (ROLAP) to offer such capability. Kylin’s next generation will be a Hybrid OLAP (HOLAP) to combine MOLAP and ROLAP together and offer a single entry point for front-end queries.

Open source

Kylin has already been open-sourced to the community. To develop and grow an even stronger ecosystem around Kylin, we are currently working on proposing Kylin as an Apache Incubator project. With distinguished sponsors from the Hadoop developer community supporting Kylin, such as Owen O’Malley (Hortonworks co-founder and Apache member) and Julian Hyde (original author of Apache Calcite, also with Hortonworks), we believe that the greater open-source community can take Kylin to the next level.

We welcome everyone to contribute to Kylin. Please visit the Kylin web site for more details: http://kylin.io.

To begin with, we are looking for open-source contributions not only in the core code base, but also in the following areas:

  1. Shell Client
  2. RPC Server
  3. Job Scheduler
  4. Tools

For more details and to discuss these topics further, please follow us on Twitter @KylinOLAP and join our Google group: https://groups.google.com/forum/#!forum/kylin-olap

Summary

Kylin has been deployed in production at eBay and is processing extremely large datasets. The platform has demonstrated great performance benefits and has proved to be a more convenient way for analysts to leverage data on Hadoop using their favorite tools. We are pleased to open-source Kylin. We welcome feedback and suggestions, and we look forward to the involvement of the open-source community.

 

 


NoSQL Data Modeling

by Donovan Hsieh on 10/10/2014

in Data Infrastructure and Services

Data modeling for RDBMS has been a well-defined discipline for many years. Techniques like logical to physical mapping and normalization / de-normalization have been widely practiced by professionals, including novice users. However, with the recent emergence of NoSQL databases, data modeling is facing new challenges to its relevance. Generally speaking, NoSQL practitioners focus on physical data model design rather than the traditional conceptual / logical data model process for the following reasons:

  • Developer-centric mindset – With flexible schema (or schema-free) support in NoSQL databases, application developers typically assume data model design responsibility. They have been ingrained with the notion that the database schema is an integral part of application logic.
  • High-performance queries running in massive scale-out distributed environments – Contrary to traditional, centralized scale-up systems (including the RDBMS tier), modern applications run in distributed, scale-out environments. To accomplish scale-out, application developers are driven to tackle scalability and performance first through focused physical data model design, thus abandoning the traditional conceptual, logical, and physical data model design process.
  • Big and unstructured data – With its rigidly fixed schema and limited scale-out capability, the traditional RDBMS has long been criticized for its lack of support for big and unstructured data. By comparison, NoSQL databases were conceived from the beginning with the capability to store big and unstructured data using flexible schemas running in distributed scale-out environments.

In this blog post, we explore other important mindset changes in NoSQL data modeling: development agility through flexible schemas vs. database manageability; the business and data model design process; the role of RDBMS in NoSQL data modeling; NoSQL variations that affect data modeling; and visualization approaches for NoSQL logical and physical data modeling. We end the post with a peek into the future of NoSQL data modeling.

Development agility vs. database manageability

One highly touted feature in today’s NoSQL is application development agility. Part of this agility is accomplished through flexible schemas, where developers have full control over how data is stored and organized in their NoSQL databases. Developers can create or modify database objects in application code on the fly without relying on DBA execution. The result is, indeed, increased application development and deployment agility.

However, the flexible schema is not without its challenges. For example, dynamically created database objects can cause unforeseen database management issues due to the lack of DBA oversight. Furthermore, unsupervised schema changes increase DBA challenges in diagnosing associated issues. Frequently, such troubleshooting requires the DBA to review application code written in programming languages (e.g., Java) rather than in RDBMS DDL (Data Definition Language) – a skill that most DBAs do not possess.

NoSQL business and data model design process

In old-school software engineering practice, sound business and (relational) data model designs are key to successful medium- to large-scale software projects. As NoSQL developers assume business / data model design ownership, another dilemma arises: data modeling tools. For example, traditional RDBMS logical and physical data models are governed and published by dedicated professionals using commercial tools, such as PowerDesigner or ER/Studio.

Given the nascent state of NoSQL technology, there isn’t a professional-quality data modeling tool for such tasks. It is not uncommon for stakeholders to review application source code in order to uncover data model information. This is a tall order for non-technical users such as business owners or product managers. Other approaches, like sampling actual data from production databases, can be equally laborious and tedious.

It is obvious that extensive investment in automation and tooling is required. To help alleviate this challenge, we recommend that NoSQL projects use the business and data model design process shown in the following diagram (illustrated with MongoDB’s document-centric model):

Figure 1: NoSQL business and data model design process

  • Business Requirements & Domain Model: At the high level, one can continue using database-agnostic methodologies, such as domain-driven design, to capture and define business requirements
  • Query Patterns & Application Object Model: After preliminary business requirements and the domain model are produced, one can work iteratively and in parallel to analyze top user access patterns and the application model, using UML class or object diagrams. With an RDBMS, applications can implement database access using either a declarative query (i.e., using a single SQL table join) or a navigational approach (i.e., walking individual tables embedded in application logic). The latter approach typically requires an object-relational mapping (ORM) layer to facilitate tedious plumbing work. By nature, almost all NoSQL databases belong to the latter category. MongoDB can support both approaches through the JSON Document model, SQL-subset query, and comprehensive secondary indexing capabilities.
  • JSON Document Model & MongoDB Collection / Document: This part is where native physical data modeling takes place. One has to understand the specific NoSQL product’s strengths and weaknesses in order to produce efficient schema designs and serve effective, high-performance queries. For example, modeling social network entities like followed and followers is very different from modeling online blogging applications. As such, social networking applications are best implemented using Graph NoSQL databases like Neo4j, while online blogging applications can be implemented using other flavors of NoSQL like MongoDB.

RDBMS data modeling influence on NoSQL

Interestingly enough, old-school RDBMS data modeling techniques still play a meaningful role for those who are new to NoSQL technology. Using document-centric MongoDB as an example, the following diagram illustrates how one can map a relational data model to a comparable MongoDB document-centric data model:

Figure 2: Mapping a relational data model to a comparable MongoDB document-centric data model
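
As a minimal sketch of this mapping, the snippet below builds one de-normalized blog-post document with embedded comments using the MongoDB Java driver's Document class. The collection layout and field names are illustrative only, not taken from the figure.

import java.util.Arrays;
import org.bson.Document;

public class BlogPostDocument {
    public static void main(String[] args) {
        // One de-normalized document replaces the separate post and comment
        // tables (and their join) that a relational model would require.
        Document post = new Document("title", "NoSQL Data Modeling")
                .append("tags", Arrays.asList("nosql", "mongodb", "data-modeling"))
                .append("comments", Arrays.asList(
                        new Document("user", "alice").append("text", "Great overview"),
                        new Document("user", "bob").append("text", "Looking forward to part 2")));
        System.out.println(post.toJson());
    }
}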

NoSQL data model variation

In the relational world, logical data models are reasonably portable among different RDBMS products. In a physical data model, design specifications such as storage clauses or non-standard SQL extensions might vary from vendor to vendor. Various SQL standards, such as SQL-92 and the latest SQL:2008 as defined by industry bodies like ANSI/ISO, can help application portability across different database platforms.

However, in the NoSQL world, physical data models vary dramatically among different NoSQL databases; there is no industry standard comparable to SQL-92 for RDBMS. Therefore, it helps to understand key differences in the various NoSQL database models:

  • Key-value stores – Collections composed of unique keys, each having 1-n valid values
  • Column families – Distributed data stores in which a column consists of a unique key, values for the key, and a timestamp differentiating current from stale values
  • Document databases – Systems that store and manage documents and their metadata (type, title, author, creation/modification/deletion date, etc.)
  • Graph databases – Systems that use graph theory to represent and store data as nodes (people, business, accounts, or other entities), node properties, and edges (lines connecting nodes/properties to each other)

The following diagram illustrates the comparison landscape based on model complexity and scalability:

Figure 3: NoSQL database comparison by model complexity and scalability

It is worth mentioning that for NoSQL data models, a natural evolutionary path exists from simple key-value stores to the highly complicated graph databases, as shown in the following diagram:

Figure 4: Evolution of NoSQL data models from key-value stores to graph databases

NoSQL data model visualization

For conceptual data models, diagramming techniques such as the Entity Relationship Diagram can continue to be used to model NoSQL applications. However, logical and physical NoSQL data modeling requires new thinking, due to each NoSQL product assuming a different native structure. One can intuitively use any of the following three visualization approaches, using a document-centric data model like MongoDB as an example:

  • Native visual representation of MongoDB collections with support for nested sub-documents (see Figure 2 above)

Pros – It naturally conveys a complex document model through an intuitive visual representation.
Cons – Without specialized tools support, visualization results in ad-hoc drawing using non-uniform conventions or notations.

  • Reverse engineering selected sample documents using JSON Designer (see Figure 5 below)

Pros – It can easily reverse engineer a hierarchical model into a visual representation from existing JSON documents stored in NoSQL databases like MongoDB.
Cons – As of this writing, JSON Designer is available only on iPhone / iPad. Furthermore, it does not include native DB objects, such as MongoDB indexes.

Figure 5: Reverse engineering sample JSON documents with JSON Designer

  • Traditional RDBMS data modeling tools like PowerDesigner (see Figure 6 below)

Pros – Commercial tools support is available.
Cons – It requires tedious manual preparation and diagram arrangement to represent complex and deeply nested document structures.

Figure 6: A nested document structure modeled in PowerDesigner

In a future post, we’ll cover specific data model visualization techniques for other NoSQL products such as Cassandra, which is based on the Column Family structure.

New NoSQL data modeling opportunities

Like any emerging technology, NoSQL will mature as it becomes mainstream. We envision the following new data modeling opportunities for NoSQL:

  • Reusable data model design patterns (some product-specific and some agnostic) to help reduce application development effort and cost
  • Unified NoSQL model repository to support different NoSQL products
  • Bi-directional, round-trip engineering support for (data) model-driven design processes and tools
  • Automated data model extraction from application source code
  • Automated code-model-data consistency validation and consistency conformance metrics
  • Strong control for application / data model change management, with proactive tracking and reconciliation between application code, embedded data models, and the actual data in the NoSQL databases


Don’t Build Pages, Build Modules

by Senthil Padmanabhan on 10/02/2014

in Software Engineering

We are in an interesting phase of rethinking frontend engineering across eBay Marketplaces, and this blog summarizes where we are heading.

Modular programming is a fundamental design technique that has been practiced since the dawn of software engineering. It is still the most recommended pattern for building maintainable software, and the Node.js community fully embraces this design philosophy. Most Node.js modules are created using smaller modules as building blocks to achieve the final goal. Now with Web Components gaining momentum, we decided to change our approach toward building frontend web applications.

Modularity was already in our frontend codebase, but only in the scope of a particular language. Most of our shared JavaScript (overlays, tabs, carousel, etc.) and CSS (buttons, grid, forms, icons, etc.) were written in a modular fashion. This is great, but when it comes to a page or a view, the thinking was still about building pages, as opposed to building UI modules. With this mindset, we found that as the complexity of pages grew, it became exponentially more difficult to maintain them. What we wanted was a simple way to divide a page into small and manageable pieces, and to develop each piece independently. This is when we came up with the notion, “Don’t build pages, build modules.”

Modular thinking

In general, everyone understands and agrees with the concept of frontend modules. But to make the concept a reality, we needed to deviate from our current style of web development.

Decomposition: First, we wanted to move away from the idea of directly building a page. Instead, when a requirement comes in the form of a page, we decompose it into logical UI modules. We do so recursively until a module becomes FIRST (Focused, Independent, Reusable, Small, Testable). This means a page is made up of a few top-level modules, which in turn are built from sub-modules, very similar to how JavaScript modules are built in the Node.js world. There are common styles and JavaScript (e.g., jQuery) that all modules depend on. These files together become a module of their own (e.g., a base module) and are added as dependencies of other modules. Engineers start working independently on these modules, and a page is nothing more than a grid that finally assembles them.

[Figure: a page decomposed into a grid of modules]

DOM encapsulation: We wanted all of our frontend modules to be associated with a DOM node and for that node to be the module’s root element. So we place all client-side JavaScript behavior like event binding, querying the DOM, jQuery plugin triggers, etc. within the scope of the module’s root element. This gives perfect encapsulation to our modules by making them restrictive and truly independent. Obviously we needed some sort of JavaScript abstraction to achieve this encapsulation, and we decided to go with the lightweight widgets functionality (named Marko Widgets) offered by RaptorJS. Marko widgets are a small module (~4 KB) providing a simple mechanism for instantiating widgets and binding them to DOM elements. Instantiated widgets are made into observables, and can also be destroyed when they need to be removed from the DOM. To bind a JavaScript module to a DOM node, we simply used the w-bind directive as shown below:

<div class="my-module" w-bind="./my-module-widget.js"> ... </div>

When rendered, the Marko widgets module tracks which modules are associated with which DOM nodes, and automatically binds the behavior after adding the HTML to the DOM. Some of our modules, like our tracking module, had JavaScript functionality, but no DOM node association. In those scenarios, we use the <noscript> tag to achieve encapsulation. With respect to CSS, we name-space all class names within a module with the root element’s class name, separated with ‘-’ such as gallery-title, gallery-thumbnail, etc.

Packaging: The next big question was how do we package the modules? Frontend package management has always been challenging, and is a hotly debated topic. Packaging for the same asset type is quite straightforward. For instance, in JavaScript once we nail down a module pattern (CommonJS, AMD, etc.), then packaging becomes easy with tools like browserify. The problem is when we need to bundle other asset types like CSS and markup templates. Here, our in-house Raptor Optimizer came to the rescue. The optimizer is a JavaScript module bundler very similar to browserify or webpack, but with a few differences that make it ideal for our module ecosystem. All it needs is an optimizer.json file in the module directory, to list out the CSS and markup template (dust or marko) dependencies. For JavaScript dependencies, the optimizer scans the source code in the current directory, and resolves them recursively. Finally, an ordered, de-duped bundle is inserted in the CSS and JavaScript slot of the page – for example:

[
    "./base",
    "gallery.less",
    "gallery.html" 
]

Note that the markup templates will be included only when rendering on the client side. Including them otherwise will unnecessarily increase JavaScript file size.

File organization

Going modular also meant changing the way files were structured. Before applying modularity to the frontend codebase, teams would typically create separate top-level directories for JavaScript, CSS, images, fonts, etc. But with the new approach it made sense to group all files associated with a module under the same directory, and to use the module name as the directory name. This practice raised some concerns initially, mainly around violating a proven file structuring scheme, and around tooling changes related to bundling and CDN pushing. But engineers quickly came to an agreement, as the advantages clearly outweighed the disadvantages. The biggest benefit is that the new structure truly promoted module-level encapsulation:  all module-associated files live together and can be packaged.  In addition, any action on a module (deleting, renaming, refactoring, etc., which happen frequently in large codebases) becomes super easy.
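
For illustration, a hypothetical gallery module (reusing the names from the packaging example above) might be laid out like this:

gallery
├── gallery.less
├── gallery.html
├── gallery-widget.js
├── optimizer.json
└── img
    └── gallery-sprite.png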

[Figure: file organization before and after modularization]

Module communication

We wanted all of our modules to follow the Law of Demeter – meaning two independent modules cannot directly talk to each other. The obvious solution was to use an event bus for communication between client-side modules. We evaluated various eventing mechanisms, with the goal of having it centralized and also not introducing a large library dependency. Surprisingly, we settled on the eventing model that comes with jQuery itself. jQuery’s trigger, on, and off APIs do a fantastic job of abstracting out all eventing complexities, for both DOM and custom events. We wrote a small dispatcher wrapper, which handles interactions between modules by triggering and listening to events on the document element:

(function($) {
    'use strict';
    var $document = $(document.documentElement);

    // Create the dispatcher
    $.dispatcher = $.dispatcher || {};

    var dispatcherMethods = {
        trigger: function(event, data, elem) {
            // If element is provided trigger from element
            if(elem) {
                // Wrap in jQuery and call trigger                
                return $(elem).trigger(event, data);
            } else {
                return $document.trigger(event, data);
            }
        },

        on: function(event, callback, scope) {
            return $document.on(event, $.proxy(callback, scope || $document));
        },

        off: function(event) {
            return $document.off(event);
        }
    }; // dispatcherMethods end

    // Attach the dispatcher methods to $.dispatcher
    $.extend(true, $.dispatcher, dispatcherMethods);
})(jQuery);

Modules can now use the $.dispatcher to trigger and listen to custom events without having any knowledge about other modules. Another advantage of using the jQuery DOM-based eventing model is that we get all event dynamics (propagation and name-spacing) for free.

// Module 1 firing a custom event 'sliderSwiped'
$.dispatcher.trigger('sliderSwiped', {
    activeItemId: 1234
});

// Module 2 listening on 'sliderSwiped' and performing an action
$.dispatcher.on('sliderSwiped', function(evt, data) {
    fetchItem(data.activeItemId);
});

Some teams prefer to create a centralized mediator module to handle the communication. We leave that to engineers’ discretion.

Multiscreen and view model standardization

One of the biggest advantages of frontend modules is they perfectly fit in the multiscreen world. Flows change based on device dimensions, and making a page work either responsively or adaptively on all devices is not practical. But with modules, things fall in place. When engineers finalize the modules in a view, they also evaluate how they look and behave across various screen sizes. Based on this evaluation, the module name and associated view model JSON schema are agreed upon. But the implementation of the module is based upon the device. For some modules, just a responsive implementation is sufficient to work across all screens. For others, the design and interactions (touch or no-touch) would be completely different, thus requiring different implementations. However different the implementations may be, the module name and the view model powering it would be the same.

We indeed extended this concept to the native world, where iOS and Android apps also needed the same modules and view models. But the implementation is still native (Objective-C or Java) to the platform. All clients talk to the frontend servers, which are cognizant of the modules that a particular user agent needs and respond with the appropriate view model chunks. This approach gave us a perfect balance in terms of consistency and good user experience (by not compromising on the implementation). Companies like LinkedIn have already implemented a view-based JSON model that has proved successful. The granularity of the view model is decided by engineers and product managers together, depending on how much control they need over the module. The general guideline is to make the view model’s JSON as smart as possible and the modules dumb (or thin), thus providing a central place to control all clients.
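
As a hypothetical example of such a view model chunk, a gallery module might be powered by JSON along these lines (the field names are invented for illustration):

{
    "gallery": {
        "title": "Similar items",
        "items": [
            { "itemId": 1234, "imageUrl": "http://example.com/1234.jpg", "price": "$10.99" },
            { "itemId": 5678, "imageUrl": "http://example.com/5678.jpg", "price": "$24.50" }
        ]
    }
}

Every client – web, iOS, or Android – would render its own implementation of the gallery module from the same chunk, keeping the module thin and the view model in control.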

Associated benefits

All of the other benefits of modular programming come for free:

  • Developer productivity – engineers can work in parallel on small contained pieces, resulting in a faster pace.
  • Unit testing – it has never been easier.
  • Debugging – it’s easy to nail down the problem, and even if one module is faulty others are still intact.

Finally, this whole approach takes us closer to the way the web is evolving. Our idea of bundling HTML, CSS, and JS to create an encapsulated UI module puts us on a fast track to adoption. We envision an ideal future where all of our views, across devices, are a bunch of web components.

Conclusion

As mentioned earlier, we are indeed in the process of rethinking frontend engineering at eBay, and modularization is one of the first steps resulting from that rethinking. Thanks to my colleagues Mahdi Pedramrazi and Patrick Steele-Idem for teaming up and pioneering this effort across the organization.

Senthil
Frontend Engineer


Last year, our Gumtree Australia team started to use behavior-driven development in our agile development process. Initially, we searched for an open-source BDD framework to use for our BDD testing. Since we work on a web application, we wanted our automated BDD tests to exercise the application from the customer’s point of view, via the web interface. Of course, many web testing libraries do just that, including open-source tools such as Selenium and WebDriver (Selenium 2). However, we discovered that no existing BDD framework supports web automation testing directly.

We decided to simply integrate the open-source BDD tool JBehave with Selenium, but soon learned that this solution doesn’t completely fulfill the potential of automated BDD tests. It can’t drive the project development and documentation processes as much as we expected.

Different stakeholders need to view BDD tests at different levels. For instance, QA needs results to show how the application behaved under test. Upper managers are not so interested in the finer details, but rather want to see the number and complexity of the features defined and implemented so far, and if the project is still on track.

We needed a testing framework that could let us express and report on BDD tests at different levels, manage the stories and their scenarios effectively, and then drill down into the details as required. And so we started to develop a product that matches our needs. We named this product the Behavior Automation Framework (Beaf) and designed it to make the practice of behavior-driven development easier. Based on JBehave as well as more traditional tools like TestNG, Beaf includes a host of features to simplify writing automated BDD tests and interpreting the results.

[Figure: Beaf overview]

Beaf’s extensions and utilities improve web testing on WebDriver/Selenium2 in four ways:

  • Supplying a Story web console, where PMs/QAs can manage all the existing stories, divide stories into different categories or groups, and create/edit stories using the correct BDD format.

[Screenshot: Beaf Story web console]

  • Organizing web tests into reusable steps, and mapping them back to their original requirements and user stories.

[Screenshot: reusable web test steps mapped to stories]

  • Generating reports and documentation about BDD tests. Whenever a test case is executed, Beaf generates a report. Each report contains a narrative description of the test, including a short comment and screenshot for each step. The report provides not only information about the test results, but also documentation about how the scenarios under test have been implemented. Reports serve as living documentation, illustrating how an application implements the specified requirements.

[Screenshot: a Beaf test report with per-step comments and screenshots]

  • Incorporating high-level summaries and aggregations of test results into reports. These overviews include how many stories and their scenarios have been tested by automated BDD tests. Taken as a set of objective metrics, test results show the relative size and progress of each feature being implemented.

[Screenshot: aggregated summary of automated BDD test results]

When executing a test, whether it be with JUnit, jMock, or another framework, Beaf handles many of the Selenium 2/WebDriver infrastructure details. For example, Beaf can run cases cross-platform, including desktop (Firefox, Chrome), mobile (iPhone), and tablet (iPad). Testers can opt to open a new browser for each test, or use the same browser session for all of the tests in a class. The browser/device to be used can also be set in the Beaf configuration.

[Screenshot: Beaf browser/device configuration]

Using common steps provides a layer of abstraction between the behavior being tested and the way the web application implements that behavior. This level of abstraction makes it easier to manage changes in implementation details, because a desired behavior will generally change less frequently than the details of how it is to be implemented. Abstraction also allows implementation details to be centralized in one place.

@When("posting an Ad in the \"$category\" category")
public void postingAd(@Named("category") String category) throws Throwable {
    // 1. PageFactory will generate ad post pages under specified category.
    // 2. Page status will be sent to next step by ThreadLocal.
    postAdPage.set(PostAdPageFactory.createPostAdPage(category, AdType.OFFER.name()));
}

In addition to test reports, Beaf supplies a very useful web module called the Beaf Dashboard, which provides a higher-level view of current status. The dashboard shows the state of all of the stories, both in terms of their relative priorities and in terms of how many P1/P2/P3 stories and scenarios are fully, partially, or not automated. This information gives a good idea of the amount of work involved in implementing different parts of the project. The dashboard also keeps track of test results over time, so that users can visualize in concrete terms the amount of work done so far versus the estimated amount of work remaining to be done.

[Screenshot: the Beaf Dashboard]

Beaf makes it easier for QA engineers to join projects at different stages. It provides PD/QA with the detailed information required to test and update code, while giving business managers and PMs the more high-level views and reports that suit their needs. However, the greater potential of Beaf is the ability to turn automated web tests into automated web acceptance testing, in the true spirit of BDD.


At eBay we run Hadoop clusters comprising thousands of nodes that are shared by thousands of users. We analyze data on these clusters to gain insights for improved customer experience. In this post, we look at distributing RPC resources fairly between heavy and light users, as well as mitigating denial of service attacks within Hadoop. By providing appropriate response times and increasing system availability, we offer a better Hadoop experience.

Problem: namenode slowdown

In our clusters, we frequently deal with slowness caused by heavy users, to the point of namenode latency increasing from less than a millisecond to more than half a second. In the past, we fixed this latency by finding and terminating the offending job. However, this reactive approach meant that damage had already been done—in extreme cases, we lost cluster operation for hours.

This slowness is a consequence of the original design of Hadoop. In Hadoop, the namenode is a single machine that coordinates HDFS operations in its namespace. These operations include getting block locations, listing directories, and creating files. The namenode receives HDFS operations as RPC calls and puts them in a FIFO call queue for execution by reader threads. The dataflow looks like this:

[Figure: the namenode’s FIFO call queue]

Though FIFO is fair in the sense of first-come-first-serve, it is unfair in the sense that users who perform more I/O operations on the namenode will be served more than users who perform less I/O. The result is the aforementioned latency increase.

We can see the effect of heavy users in the namenode auditlogs on days where we get support emails complaining about HDFS slowness:

[Figure: namenode audit log showing call volume by user]

Each color is a different user, and the area indicates call volume. Single users monopolizing cluster resources are a frequent cause of slowdown. With only one namenode and thousands of datanodes, any poorly written MapReduce job is a potential distributed denial-of-service attack.

Solution: quality of service

Taking inspiration from routers—some of which include QoS (quality of service) capabilities—we replaced the FIFO queue with a new type of queue, which we call the FairCallQueue.

[Figure: the FairCallQueue, with scheduler, multi-level queue, and multiplexer]

The scheduler places incoming RPC calls into a number of queues based on the call volume of the user who made the call. The scheduler keeps track of recent calls, and prioritizes calls from lighter users over calls from heavy users.

The multiplexer controls the penalty of being in a low-priority queue versus a high-priority queue. It reads calls in a weighted round-robin fashion, preferring to read from high-priority queues and infrequently reading from the lowest-priority queues. This ensures that high-priority requests are served first, and prevents starvation of low-priority RPCs.
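
To make the multiplexer’s behavior concrete, here is a simplified, stand-alone Java sketch of weighted round-robin draining across priority queues. It is not the actual Hadoop implementation – just an illustration of the scheduling idea, with made-up weights.

import java.util.List;
import java.util.Queue;

public class WeightedRoundRobinMux<T> {
    private final List<Queue<T>> queues; // index 0 = highest priority
    private final int[] weights;         // e.g. {8, 4, 2, 1}
    private int currentQueue = 0;
    private int drainedFromCurrent = 0;

    public WeightedRoundRobinMux(List<Queue<T>> queues, int[] weights) {
        this.queues = queues;
        this.weights = weights;
    }

    // Returns the next call to serve, or null if every queue is empty.
    public synchronized T take() {
        for (int attempts = 0; attempts < queues.size(); attempts++) {
            // Move on once this queue has used up its weight for the round.
            if (drainedFromCurrent >= weights[currentQueue]) {
                advance();
            }
            T call = queues.get(currentQueue).poll();
            if (call != null) {
                drainedFromCurrent++;
                return call;
            }
            advance(); // nothing waiting at this priority; try the next one
        }
        return null;
    }

    private void advance() {
        currentQueue = (currentQueue + 1) % queues.size();
        drainedFromCurrent = 0;
    }
}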

The multiplexer and scheduler are connected by a multi-level queue; together, these three form the FairCallQueue. In our tests at scale, we’ve found the queue is effective at preserving low latencies even in the face of overwhelming denial-of-service attacks on the namenode.

This plot shows the latency of a minority user during three runs of a FIFO queue (QoS disabled) and the FairCallQueue (QoS enabled). As expected, the latency is much lower when the FairCallQueue is active. (Note: spikes are caused by garbage collection pauses, which are a separate issue).

[Figure: minority-user latency with QoS disabled (FIFO) vs. enabled (FairCallQueue)]

Open source and beyond

The 2.4 release of Apache Hadoop includes the prerequisites to namenode QoS. With this release, cluster owners can modify the implementation of the RPC call queue at runtime and choose to leverage the new FairCallQueue. You can try the patches on Apache’s JIRA: HADOOP-9640.
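
For reference, switching the namenode’s RPC port over to the FairCallQueue is a configuration change along these lines; the property name follows the FairCallQueue documentation (with 8020 as the RPC port in question), but verify it against your Hadoop version before use.

<!-- core-site.xml on the namenode -->
<property>
  <name>ipc.8020.callqueue.impl</name>
  <value>org.apache.hadoop.ipc.FairCallQueue</value>
</property>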

The FairCallQueue can be customized with other schedulers and multiplexers to enable new features. We are already investigating future improvements, such as weighting different RPC types for more intelligent scheduling and allowing users to manually control which queues certain users are scheduled into. In addition, there are features submitted from the open source community that build upon QoS, such as RPC client backoff and Fair Share queuing.

With namenode QoS in place, we have improved our users’ experience of our Hadoop clusters by providing faster and more uniform response times to well-behaved users while minimizing the impact of poorly written or badly behaved jobs. This in turn allows our analysts to be more productive and focus on the things that matter, like making your eBay experience a delightful one.

- Chris Li

eBay Global Data Infrastructure Analytics Team


The Platform and Infrastructure team at eBay Inc. is happy to announce the open-sourcing of Oink – a self-service solution to Apache Pig.

Pig and Hadoop overview

Apache Pig is a platform for analyzing large data sets. It uses a high-level language for expressing data analysis programs, coupled with the infrastructure for evaluating these programs. Pig abstracts the Map/Reduce paradigm, making it very easy for users to write complex tasks using Pig’s language, called Pig Latin. Because execution of tasks can be optimized automatically, Pig Latin allows users to focus on semantics rather than efficiency. Another key benefit of Pig Latin is extensibility:  users can do special-purpose processing by creating their own functions.
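
As a small illustration of that extensibility, a user-defined function is just a Java class extending Pig’s EvalFunc. The example below is our own (it is not part of Oink); it upper-cases a chararray field:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A trivial UDF: returns the upper-cased form of its first argument.
public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return ((String) input.get(0)).toUpperCase();
    }
}

After packaging the class into a jar, a Pig Latin script simply registers the jar and calls UpperCase() like any built-in function.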

Apache Hadoop and Pig provide an excellent platform for extracting and analyzing data from very large application logs. At eBay, we on the Platform and Infrastructure team are responsible for storing TBs of logs that are generated every day from thousands of eBay application servers. Hadoop and Pig offer us an array of tools to search and view logs and to generate reports on application behavior. As the logs are available in Hadoop, engineers (users of applications) also have the ability to use Hadoop and Pig to do custom processing, such as Pig scripting to extract useful information.

The problem

Today, Pig is primarily used through the command line to spawn jobs. This model wasn’t well suited to the Platform team at eBay, as the cluster that housed the application logs was shared with other teams. This situation created a number of issues:

  • Governance – In a shared-cluster scenario, governance is critically important. Pig scripts and requests of one customer should not impact those of other customers and stakeholders of the cluster. In addition, providing CLI access would make governance difficult in terms of controlling the number of job submissions.
  • Scalability – CLI access to all Pig customers created another challenge:  scalability. Onboarding customers takes time and is a cumbersome process.
  • Change management – No easy means existed to upgrade or modify common libraries.

Hence, we needed a solution that acted as a gateway to Pig job submission, provided QoS, and abstracted the user from cluster configuration.

The solution:  Oink

Oink solves the above challenges not only by allowing execution of Pig requests through a REST interface, but also by enabling users to register jars, view the status of Pig requests, view Pig request output, and even cancel a running Pig request. With the REST interface, the user has a cleaner way to submit Pig requests compared to CLI access. Oink serves as a single point of entry for Pig requests, thereby facilitating rate limiting and QoS enforcement for different customers.

Oink runs as a servlet inside a web container and allows users to run multiple requests in parallel within a single JVM instance. This capability was not supported initially, but rather required the help of the patch found in PIG-3866. This patch provides multi-tenant environment support so that different users can share the same instance.

With Oink, eBay’s Platform and Infrastructure team has been able to onboard 100-plus different use cases onto its cluster. Currently, more than 6000 Pig jobs run every day without any manual intervention from the team.

Special thanks to Vijay Samuel, Ruchir Shah, Mahesh Somani, and Raju Kolluru for open-sourcing Oink. If you have any queries related to Oink, please submit an issue through GitHub.


Using Spark to Ignite Data Analytics

by eBay Global Data Infrastructure Analytics Team on 05/28/2014

in Data Infrastructure and Services,Machine Learning

At eBay we want our customers to have the best experience possible. We use data analytics to improve user experiences, provide relevant offers, optimize performance, and create many, many other kinds of value. One way eBay supports this value creation is by utilizing data processing frameworks that enable, accelerate, or simplify data analytics. One such framework is Apache Spark. This post describes how Apache Spark fits into eBay’s Analytic Data Infrastructure.


What is Apache Spark?

The Apache Spark web site describes Spark as “a fast and general engine for large-scale data processing.” Spark is a framework that enables parallel, distributed data processing. It offers a simple programming abstraction that provides powerful cache and persistence capabilities. The Spark framework can be deployed through Apache Mesos, Apache Hadoop via Yarn, or Spark’s own cluster manager. Developers can use the Spark framework via several programming languages including Java, Scala, and Python. Spark also serves as a foundation for additional data processing frameworks such as Shark, which provides SQL functionality for Hadoop.

Spark is an excellent tool for iterative processing of large datasets. One way Spark is suited for this type of processing is through its Resilient Distributed Dataset (RDD). In the paper titled Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, RDDs are described as “…fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.” By using RDDs,  programmers can pin their large data sets to memory, thereby supporting high-performance, iterative processing. Compared to reading a large data set from disk for every processing iteration, the in-memory solution is obviously much faster.

The diagram below shows a simple example of using Spark to read input data from HDFS, perform a series of iterative operations against that data using RDDs, and write the subsequent output back to HDFS.

[Figure: iterative Spark processing with RDDs, reading input from and writing output to HDFS]

In the case of the first map operation into RDD(1), not all of the data could fit within the memory space allowed for RDDs. In such a case, the programmer is able to specify what should happen to the data that doesn’t fit. The options include spilling the computed data to disk and recreating it upon read. We can see in this example how each processing iteration is able to leverage memory for the reading and writing of its data. This method of leveraging memory is likely to be 100X faster than other methods that rely purely on disk storage for intermediate results.
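
To show what persistence looks like in code (written in Java here to match the other sketches on this page; the eBay example later in this post uses Scala), the snippet below pins an RDD with a storage level that allows spilling to disk, then reuses it across two passes. The input path is a placeholder.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class PersistExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("persist-example");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // MEMORY_AND_DISK tells Spark to spill partitions that do not fit in
        // memory to local disk instead of recomputing them on every pass.
        JavaRDD<String> lines = sc.textFile("hdfs:///path/to/input")
                .persist(StorageLevel.MEMORY_AND_DISK());

        // Two passes over the same persisted data.
        long total = lines.count();
        long errors = lines.filter(l -> l.contains("ERROR")).count();
        System.out.println(total + " lines, " + errors + " errors");

        sc.stop();
    }
}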

Apache Spark at eBay

Today Spark is most commonly leveraged at eBay through Hadoop via Yarn. Yarn manages the Hadoop cluster’s resources and allows Hadoop to extend beyond traditional map and reduce jobs by employing Yarn containers to run generic tasks. Through the Hadoop Yarn framework, eBay’s Spark users are able to leverage clusters approaching the range of 2000 nodes, 100TB of RAM, and 20,000 cores.

The following example illustrates Spark on Hadoop via Yarn.

[Figure: Spark running on Hadoop via Yarn]

The user submits the Spark job to Hadoop. The Spark application master starts within a single Yarn container, then begins working with the Yarn resource manager to spawn Spark executors – as many as the user requested. These Spark executors will run the Spark application using the specified amount of memory and number of CPU cores. In this case, the Spark application is able to read and write to the cluster’s data residing in HDFS. This model of running Spark on Hadoop illustrates Hadoop’s growing ability to provide a singular, foundational platform for data processing over shared data.

The eBay analyst community includes a strong contingent of Scala users. Accordingly, many of eBay’s Spark users are writing their jobs in Scala. These jobs are supporting discovery through interrogation of complex data, data modelling, and data scoring, among other use cases. Below is a code snippet from a Spark Scala application. This application uses Spark’s machine learning library, MLlib, to cluster eBay’s sellers via KMeans. The seller attribute data is stored in HDFS.

/**
 * read input files and turn into usable records
 */
 var table = new SellerMetric()
 val model_data = sc.sequenceFile[Text,Text](
   input_path
  ,classOf[Text]
  ,classOf[Text]
  ,num_tasks.toInt
 ).map(
   v => parseRecord(v._2,table)
 ).filter(
   v => v != null
 ).cache

....

/**
 * build training data set from sample and summary data
 */
 val train_data = sample_data.map( v =>
   Array.tabulate[Double](field_cnt)(
     i => zscore(v._2(i),sample_mean(i),sample_stddev(i))
   )
 ).cache

/**
 * train the model
 */ 
 val model = KMeans.train(train_data,CLUSTERS,ITERATIONS)
  
/**
 * score the data
 */
 val results = grouped_model_data.map( 
   v => (
     v._1
    ,model.predict(
       Array.tabulate[Double](field_cnt)(
         i => zscore(v._2(i),sample_mean(i),sample_stddev(i))
       )
     )
   )
 ) 
 results.saveAsTextFile(output_path)

In addition to  Spark Scala users, several folks at eBay have begun using Spark with Shark to accelerate their Hadoop SQL performance. Many of these Shark queries are easily running 5X faster than their Hive counterparts. While Spark at eBay is still in its early stages, usage is in the midst of expanding from experimental to everyday as the number of Spark users at eBay continues to accelerate.

The Future of Spark at eBay

Spark is helping eBay create value from its data, and so the future is bright for Spark at eBay. Our Hadoop platform team has started gearing up to formally support Spark on Hadoop. Additionally, we’re keeping our eyes on how Hadoop continues to evolve in its support for frameworks like Spark, how the community is able to use Spark to create value from data, and how companies like Hortonworks and Cloudera are incorporating Spark into their portfolios. Some groups within eBay are looking at spinning up their own Spark clusters outside of Hadoop. These clusters would either leverage more specialized hardware or be application-specific. Other folks are working on incorporating eBay’s already strong data platform language extensions into the Spark model to make it even easier to leverage eBay’s data within Spark. In the meantime, we will continue to see adoption of Spark increase at eBay. This adoption will be driven by chats in the hall, newsletter blurbs, product announcements, industry chatter, and Spark’s own strengths and capabilities.


In part I of this post we laid out in detail how to run a large Jenkins CI farm in Mesos. In this post we explore running the builds inside Docker containers and more:

  • Explain the motivation for using Docker containers for builds.
  • Show how to handle the case where the build itself is a Docker build.
  • Peek into how the Mesos 0.19 release is going to change Docker integration.
  • Walk through a Vagrant all-in-one-box setup so you can try things out.

Overview

Jenkins follows the master-slave model and is capable of launching tasks as remote Java processes on Mesos slave machines. Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications or frameworks. We can leverage the capabilities of Jenkins and Mesos to run a Jenkins slave process within a Docker container using Mesos as the resource manager.

Why use Docker containers?

This page gives a good picture of what Docker is all about.

At eBay Inc., we have several different build clusters. They are primarily partitioned due to a number of factors: requirements to run different OS flavors (mostly RHEL and Ubuntu), software version conflicts, associated application dependencies, and special hardware. When using Mesos, we try to operate on a single cluster with heterogeneous workloads instead of having specialized clusters. Docker provides a good solution to isolate the different dependencies inside the container irrespective of the host setup where the Mesos slave is running, thereby helping us operate on a single cluster. Special hardware requirements can always be handled through slave attributes that the Jenkins plugin already supports. Overall, then, this setup scheme helps maintain consistent host images in the cluster, avoids having to run a wide combination of different flavors of Mesos slave hosts, yet handles all the varied build dependencies within a container.

Now why support Docker-in-Docker setup?

When we started experimenting with running the builds in Docker containers, some of our teammates were working on enabling Docker images for applications. They posed the question: how do we support Docker build and push/pull operations within the Docker container used for the build? A valid point! So, we will explore two ways of handling this challenge. Many thanks to Jérôme Petazzoni from the Docker team for his guidance.

Environment setup

A Vagrant development VM setup demonstrates CI using Docker containers. This VM can be used for testing other frameworks like Chronos and Aurora; however, we will focus on the CI use of it with Marathon. The screenshots shown below have been taken from the Vagrant development environment setup, which runs a cluster of three Mesos masters, three Mesos slave instances, and one Marathon instance. (Marathon is a Mesos framework for long-running services. It provides a REST API for starting, stopping, and scaling services.)

192.168.56.101 mesos1 marathon1
192.168.56.102 mesos2
192.168.56.103 mesos3

Running Jenkins slaves inside Mesos Docker containers requires the following ecosystem:

  1. Jenkins master server with the Mesos scheduler plugin installed (used for building Docker containers via CI jobs).
  2. Apache Mesos master server with at least one slave server.
  3. Mesos Docker Executor installed on all Mesos slave servers. Mesos slaves delegate execution of tasks within Docker containers to the Docker executor. (Note that integration with Docker changes with the Mesos 0.19 release, as explained in the miscellaneous section at the end of this post.)
  4. Docker installed on all slave servers (to automate the deployment of any application as a lightweight, portable, self-sufficient container that will run virtually anywhere).
  5. Docker build container image in the Docker registry.
  6. Marathon framework.

1. Creating the Jenkins master instance

We needed to first launch a standalone Jenkins master instance in Mesos via the Marathon framework.  We placed Jenkins plugins in the plugins directory, and included a default config.xml file with pre-configured settings. Jenkins was then launched by executing the jenkins.war file. Here is the directory structure that we used for launching the Jenkins master:

.
├── README.md
├── config.xml
├── hudson.model.UpdateCenter.xml
├── jenkins.war
├── jobs
├── nodeMonitors.xml
├── plugins
│   ├── mesos.hpi
│   └── saferestart.jpi
└── userContent
    └── readme.txt
3 directories, 8 files

2. Launching the Jenkins master instance

Marathon launched the Jenkins master instance using the following command, also shown in the Marathon UI screenshots below. We zipped our Jenkins files and downloaded them for the job by using the URIs field in the UI; however, for demonstration purposes, below we show using a Git repository to achieve the same goal.

git clone https://github.com/ahunnargikar/jenkins-standalone && cd jenkins-standalone;
export JENKINS_HOME=$(pwd);
java -jar jenkins.war \
  --webroot=war \
  --httpPort=$PORT0 \
  --ajp13Port=-1 \
  --httpListenAddress=0.0.0.0 \
  --ajp13ListenAddress=127.0.0.1 \
  --preferredClassLoader=java.net.URLClassLoader \
  --logfile=../jenkins.log
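
The same launch can also be submitted directly to Marathon’s REST API rather than through the UI. The following is a minimal sketch, assuming Marathon’s default port 8080 on marathon1 and placeholder CPU/memory values; it is not the exact app definition used here.

curl -X POST http://192.168.56.101:8080/v2/apps \
  -H "Content-Type: application/json" \
  -d '{
    "id": "jenkins-master",
    "cmd": "git clone https://github.com/ahunnargikar/jenkins-standalone && cd jenkins-standalone; export JENKINS_HOME=$(pwd); java -jar jenkins.war --webroot=war --httpPort=$PORT0 --ajp13Port=-1 --httpListenAddress=0.0.0.0",
    "cpus": 1,
    "mem": 1024,
    "instances": 1
  }'

Marathon substitutes $PORT0 with an assigned port at task launch time, exactly as in the command above.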

jenkins_marathon1

jenkins_marathon2

jenkins_marathon3

jenkins_marathon4

jenkins_marathon5

3. Launching Jenkins slaves using the Mesos Docker executor

ahunnargikar_cloud4

Here’s a sample supervisord startup configuration for a Docker image capable of executing Jenkins slave jobs:

[supervisord]
nodaemon=true

[program:java-jenkins-slave]
command=/bin/bash -c "eval $JENKINS_COMMAND"

As you can see, Jenkins passed its slave launch command as an environment variable to the Docker container. The container then initialized the Jenkins slave process, which fulfilled the basic requirement for kicking off the Jenkins slave job.

This configuration was sufficient to launch regular builds within the Docker container of choice. Now let’s walk through the two options that we explored to run Docker operations for a CI build inside a Docker container. Strategy #1 required use of supervisord to control the Docker daemon process. For the default case (regular non-Docker builds) and strategy #2, supervisord was not required; one could simply pass the command directly to the Docker container.

3.1 Strategy #1 – Using an individual Docker-in-Docker (dind) setup on each Mesos slave

This strategy, inspired by this blog post, involved running a dedicated Docker daemon inside the Docker container. The advantage of this approach was that we didn’t have a single Docker daemon handling a large number of container builds. On the flip side, each container absorbed the I/O overhead of downloading and duplicating all the AUFS file system layers.

docker_caching_multiple

The Docker-in-Docker container had to be launched in privileged mode (by including the “-privileged” option in the Mesos Docker executor code); otherwise, nested Docker containers wouldn’t work. Using this strategy, we ended up having two Docker executors:  one for launching Docker containers in non-privileged mode (/var/lib/mesos/executors/docker) and the other for launching Docker-in-Docker containers in privileged mode (/var/lib/mesos/executors/docker2). The supervisord process manager configuration was updated to run the Docker daemon process in addition to the Jenkins slave job process.

[program:docker] 
command=/usr/local/bin/wrapdocker 

The following Docker-in-Docker image has been provided for demonstration purposes for testing out the multi-Docker setup:

ahunnargikar/jenkins-dind/multiple-docker

In real life, the actual build container image would capture the build dependencies and base image flavor, in addition to the contents of the above dind image. The actual command that the Docker executor ran looked similar to this one:

docker run \
  -cidfile /tmp/docker_cid.6c6bba3db72b7483 \
  -privileged \
  -c 51 -m 302365697638 \
  -e JENKINS_COMMAND="wget -O slave.jar http://192.168.56.101:9000/jnlpJars/slave.jar && java -DHUDSON_HOME=jenkins -server -Xmx256m -Xms16m -XX:+UseConcMarkSweepGC -Djava.net.preferIPv4Stack=true -jar slave.jar -jnlpUrl http://192.168.56.101:9000/computer/mesos-jenkins-beb3a8ae-3de7-4117-8c4e-efe50b37fbb4/slave-agent.jnlp" \
  hashish/jenkins-dind

3.2 Strategy #2 – Using a shared Docker Setup on each Mesos slave

All of the Jenkins slaves running on a Mesos slave host could simply use a single Docker daemon for running their Docker containers, which was the default standard setup. This approach eliminated redundant network and disk I/O involved with downloading the AUFS file system layers. For example, all Java application projects could now reuse the same AUFS file system layers that contained the JDK, Tomcat, and other static Linux package dependencies. We lost isolation as far as the Docker daemon was concerned, but we gained a massive reduction in I/O and were able to leverage caching of build layers. This was the optimal strategy for our use case.

docker_caching_single

The Docker container mounted the host’s /var/run/docker.sock file descriptor as a shared volume so that its native Docker binary, located at /usr/local/bin/docker, could now communicate with the host server’s Docker daemon. So all Docker commands were now directly being executed by the host server’s Docker daemon. This eliminated the need for running individual Docker daemon processes on the Docker containers that were running on a Mesos slave server.
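
A quick way to see this in action is the sketch below, which mounts the host’s Docker socket into a container and lists the host’s containers from inside it. The image name is a placeholder; as described above, the image is assumed to ship a Docker client binary at /usr/local/bin/docker.

# Any docker command inside the container talks to the host's daemon through
# the shared socket, so this prints the host's running containers.
docker run \
  -v /var/run/docker.sock:/var/run/docker.sock \
  your-build-image \
  docker ps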

The following Docker image has been provided for demonstration purposes for a shared Docker setup. The actual build Docker container image of choice essentially just needed to execute the Docker binary via its CLI. We could even have mounted the Docker binary from the host server itself to the same end.

ahunnargikar/jenkins-dind/single-docker

The actual command that the Docker executor ran looked similar to this:

docker run \
  -cidfile /tmp/docker_cid.6c6bba3db72b7483 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -c 51 -m 302365697638 \
  -e JENKINS_COMMAND="wget -O slave.jar http://192.168.56.101:9000/jnlpJars/slave.jar && java -DHUDSON_HOME=jenkins -server -Xmx256m -Xms16m -XX:+UseConcMarkSweepGC -Djava.net.preferIPv4Stack=true -jar slave.jar -jnlpUrl http://192.168.56.101:9000/computer/mesos-jenkins-beb3a8ae-3de7-4117-8c4e-efe50b37fbb4/slave-agent.jnlp" \
  hashish/jenkins-dind-single

4. Specifying the cloud configuration for the Jenkins master

We then needed to configure the Jenkins master so that it would connect to the Mesos master server and start receiving resource offers, after which it could begin launching tasks on Mesos. The following screenshots illustrate how we configured the Jenkins master via its web administration UI.

jenkins_cloud1

jenkins_cloud2

jenkins_cloud3

jenkins_cloud4

jenkins_cloud5

Note: The Docker-specific configuration options above are not available in the stable release of the Mesos plugin. Major changes are underway in the upcoming Mesos 0.19.0 release, which will introduce the pluggable containerizer functionality. We decided to wait for 0.19.0 to be released before making a pull request for this feature. Instead, a modified .hpi plugin file was created from this Jenkins Mesos plugin branch and has been included in the Vagrant dev setup.

jenkins_cloud6

jenkins_cloud7

5. Creating the Jenkins Mesos Docker job

Now that the Jenkins scheduler had registered as a framework in Mesos, it started receiving resource offers from the Mesos master. The next step was to create a Jenkins job that would be launched on a Mesos slave whose resource offer satisfied the cloud configuration requirements.

5.1 Creating a Docker Tomcat 7 application container image

Jenkins first needed a Docker container base image that packaged the application code and dependencies as well as a web server. For demonstration purposes, here’s a sample Docker Tomcat 7 image created from this GitHub repository:

hashish/tomcat7

Every application’s Git repository would be expected to have its own Dockerfile with whatever combination of Java/PHP/Node.js it needs pre-installed in a base container. In the case of our Java apps, we simply built the .war file using Maven and then inserted it into the Docker image at build time. The Docker image was then tagged with the application name, version, and timestamp, and pushed to our private Docker registry.
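
The per-application build steps are not reproduced in this post, but their general shape, with hypothetical names for the registry host, application, and tag, would be:

# Build the .war with Maven, bake it into the image via the app's Dockerfile,
# then push the tagged image to the private registry.
mvn clean package
docker build -t registry.example.com/myapp:1.0.0-20140501 .
docker push registry.example.com/myapp:1.0.0-20140501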

5.2 Running a Jenkins Docker job

For demonstration purposes, the following example assumes that we are building a basic Java web application.

jenkins_job1

jenkins_job2

jenkins_job3

jenkins_job4

jenkins_job5

jenkins_job6

Once Jenkins built and uploaded the new application’s Docker image containing the .war file, dependencies, and other packages, this Docker image was launched in Mesos and scaled up or down to as many instances as required via the Marathon APIs.
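
Scaling the deployed application is then a single REST call to Marathon. A minimal sketch, assuming Marathon’s default port and a hypothetical app id:

# Scale the application to 5 instances; Marathon launches or kills tasks on
# the Mesos slaves to converge on that count.
curl -X PUT http://192.168.56.101:8080/v2/apps/myapp \
  -H "Content-Type: application/json" \
  -d '{"instances": 5}'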

Miscellaneous points

Our Docker integration with Mesos is going to be outdated soon with the 0.19 release. Our setup was against Mesos 0.17 and Docker 0.9.  You can read about the Mesos pluggable containerizer feature in this blog and in this ticket. The Mesosphere team is also working on the deimos project to integrate Docker with the external containerization approach. There is an old pull request against the Mesos Jenkins plugin to integrate containerization once it’s released. We will update our setup accordingly when this feature is rolled out. We’d like to add a disclaimer that the Docker integration in the above post hasn’t been tested at scale yet; we will do our due diligence once Mesos 0.19 and deimos are out.

To handle different build dependencies, you can define a separate build label for each set. A merged pull request already allows slave attributes to be specified per label, so a Docker container image of choice can be added per build label; the sketch below shows how such attributes are advertised by a slave.
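
The addresses and attribute names below are placeholders, and the flag syntax should be checked against your Mesos version:

# Advertise build-related attributes when the Mesos slave starts; the Jenkins
# Mesos plugin can then match a build label (e.g., "ubuntu-jdk7") to slaves
# carrying these attributes.
mesos-slave --master=zk://192.168.56.101:2181/mesos \
  --attributes='os:ubuntu;jdk:7'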

Conclusion

That concludes our journey: a distributed CI solution running on top of Mesos, using cluster resources efficiently and isolating build dependencies through Docker.


Problem statement

In eBay’s existing CI model, each developer gets a personal CI/Jenkins Master instance. This Jenkins instance runs within a dedicated VM, and over time the result has been VM sprawl and poor resource utilization. We started looking at solutions to maximize our resource utilization and reduce the VM footprint while still preserving the individual CI instance model. After much deliberation, we chose Apache Mesos for a POC. This post shares the journey of how we approached this challenge and accomplished our goal.

Jenkins framework’s Mesos plugin

The Mesos plugin is Jenkins’ gateway into the world of Mesos, so it made perfect sense to bring the plugin code in sync with our requirements. This video explains the plugin. The eBay PaaS team made several pull requests to the Mesos plugin code, adding both new features and bug fixes. We are grateful to the Twitter engineering team (especially Vinod) for their input and cooperation in quickly getting these features validated and rolled out. Here are all the contributions that we made to the recently released 0.2 Jenkins Mesos plugin version. We are adding more features as we proceed.

Mesos cluster setup

docker_caching_single

Our new Mesos cluster is set up on top of our existing OpenStack deployment. In the model we are pursuing, we would necessarily have lots of Jenkins Mesos frameworks (each Jenkins Master is essentially a Mesos framework), and we did not want to run those outside of the Mesos cluster so that we would not have to separately provision and manage them. We therefore decided to use the Marathon framework as the Mesos meta framework; we launched the Jenkins master (and the Mesos framework) in Mesos itself. We additionally wanted to collocate the Jenkins masters in a special set of VMs in the cluster, using the placement constraint feature of Marathon that leverages slave attributes. Thus we separated Mesos slave nodes into a group of Jenkins masters and another group of Jenkins slave nodes. For backup purposes, we associated special block storage with the VMs running the CI master. Special thanks to the Mesosphere.io team for quickly resolving all of our queries related to Marathon and Mesos in general.
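
In Marathon, such a placement constraint is expressed in the app definition. A minimal excerpt, assuming a hypothetical slave attribute named jenkins_master set to true on the VMs reserved for CI masters:

"constraints": [
  ["jenkins_master", "CLUSTER", "true"]
]

With the CLUSTER operator, Marathon only accepts offers from slaves whose jenkins_master attribute matches the given value.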

Basic testing succeeded as expected. The Jenkins master would launch through Marathon with a preconfigured Jenkins config.xml, and it would automatically register as a Mesos framework without needing any manual configuration. Builds were correctly launched in a Jenkins slave on one of the distributed Mesos slave nodes. Configuring slave attributes allowed the Mesos plugin to pick the nodes on which slave jobs could be scheduled. Checkpointing was enabled so that build jobs were not lost if the slave process temporarily disconnected and later reconnected to the Mesos master (the slave recovery feature). If a new Mesos master leader was elected, the plugin used the Zookeeper endpoints to locate the new master (more on this a little later).

docker_caching_single

We decided to simulate a large deployment and wrote a Jenkins load-test driver (mammoth). As the tests progressed, we started uncovering use cases that did not work as expected. Here is a discussion of each problem and how we addressed it.

Frameworks stopped receiving offers after a while

One of the first things we noticed occurred after we used Marathon to create the initial set of CI masters. As those CI masters started registering themselves as frameworks, Marathon stopped receiving any offers from Mesos; essentially, no new CI masters could be launched. The other thing we noticed was that, of the Jenkins frameworks that were registered, only a few would receive offers. At that point, it was evident that we needed a very thorough understanding of the resource allocation algorithm of Mesos – we had to read the code. Here is an overview of the code’s setup and the dominant resource fairness algorithm.

Let’s start with Marathon. In the DRF model, it was unfair to treat Marathon in the same bucket/role alongside hundreds of connected Jenkins frameworks. After launching all these Jenkins frameworks, Marathon had a large resource share and Mesos would aggressively offer resources to frameworks that were using little or no resources. Marathon was placed last in priority and got starved out.

We decided to define a dedicated Mesos role for Marathon and to have all of the Mesos slaves that were reserved for Jenkins master instances support that Mesos role. Jenkins frameworks were left with the default role “*”. This solved the problem – Mesos offered resources per role and hence Marathon never got starved out. A framework with a special role will get resource offers from both slaves supporting that special role and also from the default role “*”. However, since we were using placement constraints, Marathon accepted resource offers only from slaves that supported both the role and the placement constraints.
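
In configuration terms, this roughly amounts to the following. The addresses are placeholders, and the flag names are those of the Mesos and Marathon releases of that era, so they are worth double-checking against your versions:

# On the Mesos masters: whitelist a dedicated role for Marathon.
mesos-master --zk=zk://192.168.56.101:2181/mesos --roles=marathon

# On the slaves reserved for Jenkins master instances: offer their resources
# under that role instead of the default "*" role.
mesos-slave --master=zk://192.168.56.101:2181/mesos --default_role=marathon

# Start Marathon (however it is launched in your environment) with the
# dedicated role; the Jenkins frameworks keep the default "*" role.
marathon --master zk://192.168.56.101:2181/mesos --mesos_role marathon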

Certain Jenkins frameworks were still getting starved

Our next task was to find out why certain Jenkins frameworks were getting zero offers even when they were in the same role and not running any jobs. Also, certain Jenkins frameworks always received offers. Mesos makes offers to frameworks and frameworks have to decline them if they don’t use them.

The important point to remember is that offers are made to specific frameworks. Frameworks that had not received offers, but whose resource share was equal to that of frameworks that had received and declined offers, should also receive offers from Mesos; in other words, past offer history has to be accounted for. The situation is more likely to arise when fewer resources are available in the cluster. Ben Hindman quickly proposed and implemented a fix for this issue, so that fair sharing happens among all of the Jenkins frameworks.

Mesos delayed offers to frameworks

We uncovered two more situations where frameworks would get starved out for some time but not indefinitely. No developer wants to wait that long for a build to get scheduled. In the first situation, the allocation counter to remember past resource offers (refer to the fix in the previous paragraph) for older frameworks (frameworks that joined earlier) would be much greater than for the new frameworks that just joined. New frameworks would continue to receive more offers, even if they were not running jobs; when their allocation counter reached the level of older frameworks, they would be treated equally. We addressed this issue by specifying that whenever a new framework joins, the allocation counter is reset for all frameworks, thereby bringing them to a level playing field to compete for resources. We are exploring an alternative that would normalize the counter values instead of setting the counter to zero. (See this commit.)

Secondly, we found that once a framework finished running jobs/Mesos tasks, its user share – representing the currently active resources in use – never came back down to zero. Floating-point (double) arithmetic left a residual, vanishingly small value (e.g., 4.44089e-16) instead of exactly zero, which unfairly put frameworks that had just finished builds behind frameworks whose user share was 0. As a quick fix, we treated values below a precision threshold of 0.000001 as 0 in the comparator. Ben Hindman suggested an alternative: once a framework has no tasks and executors, it’s safe to set its resource share explicitly to zero. We are exploring that alternative as well in the Mesos bug-fix review process.

Final hurdle

After making all of the above changes, we were in reasonably good shape. However, we discussed scenarios where certain frameworks running active jobs and finishing them always get ahead of inactive frameworks (due to the allocation counter); a build started in one of those inactive frameworks would waste some time in scheduling. It didn’t make sense for a bunch of connected frameworks to be sitting idle and competing for resource offers when they had nothing to build. So we came up with a new enhancement to the Jenkins Mesos plugin:  to register as a Mesos framework only when there was something in the build queue. The Jenkins framework would unregister as a Mesos framework as soon as there were no active or pending builds (see this pull request). This is an optional feature, not required in a shared CI master that’s running jobs for all developers. We also didn’t need to use the slave attribute feature any more in the plugin, as it was getting resource offers from slaves with the default role.

Our load tests were finally running with predictable results! No starvation, and quick launch of builds.

Cluster management

When we say “cluster” we mean a group of servers running the PaaS core framework services. Our cluster is built on virtual servers in our OpenStack environment. Specifically, the cluster consists of virtual servers running Apache Zookeeper, Apache Mesos (masters and slaves), and Marathon services. This combination of services was chosen to provide a fault-tolerant and high-availability (HA) redundant solution. Our cluster consists of at least three servers running each of the above services.

By design, the Mesos and Marathon services do not operate in the traditional redundant HA mode with active-active or active-passive servers. Instead, they are separate, independent servers that are all active, although for HA implementation they use the Zookeeper service as a communication bus among the servers. This bus provides a leader election mechanism to determine a single leader.

Zookeeper performs this function by keeping track of the leader within each group of servers. For example, we currently provision three Mesos masters running independently. On startup, each master connects to the Zookeeper service to first register itself and then to elect a leader among themselves. Once a leader is elected, the other two Mesos masters will redirect all client requests to the leader. In this way, we can have any number of Mesos masters for an HA setup and still have only a single leader at any one time.
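
Concretely, each master registers with the same ZooKeeper ensemble and participates in leader election over a shared znode. A sketch with placeholder hostnames and ZooKeeper’s default port:

# Each of the three Mesos masters points at the same ZooKeeper ensemble and
# znode; ZooKeeper elects a single leader and the other masters redirect
# client requests to it.
mesos-master --zk=zk://zk1:2181,zk2:2181,zk3:2181/mesos

# Slaves (and frameworks such as Marathon or the Jenkins plugin) resolve the
# current leader through the same znode rather than a fixed master address.
mesos-slave --master=zk://zk1:2181,zk2:2181,zk3:2181/mesos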

The Mesos slaves do not require HA, because they are treated as workers that provide resources (CPU, memory, disk space) enabling the Mesos masters to execute tasks. Slaves can be added and removed dynamically. The Marathon servers use a similar leader election mechanism to allow any number of Marathon servers in an HA cluster. In our case we also chose to deploy three Marathon servers for our cluster. The Zookeeper service does have a built-in mechanism for HA, so we chose to deploy three Zookeeper servers in our cluster to take advantage of its HA capabilities.

Building the Zookeeper-Mesos-Marathon cluster

Of course we always have the option of building each server in the cluster manually, but that quickly becomes time-intensive and prone to inconsistencies. Instead of provisioning manually, we created a single base image with all the necessary software. We utilize the OpenStack cloud-init post-install to convert the base image into either a Zookeeper, Mesos Master, Mesos Slave, or Marathon server.

We maintain the cloud-init scripts and customization in GitHub. Instead of using the nova client directly or the web GUI to provision, we added another layer of automation: Python scripts that call python-novaclient and pass in the necessary cloud-init and post-install instructions to build a new server. This combines all the necessary steps into a single command. The command provisions the VM, instructs the VM to download the cloud-init post-install script from GitHub, activates the selected service, and joins the new VM to a cluster. As a result, we can easily add servers to an existing cluster as well as create new clusters.
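
We drive this through python-novaclient; the equivalent nova CLI call, with placeholder image, flavor, and file names, looks roughly like this:

# Provision a VM from the shared base image and hand it the cloud-init
# post-install instructions that turn it into, say, a Mesos slave and
# join it to the cluster.
nova boot \
  --image mesos-base-image \
  --flavor m1.large \
  --user-data cloud-init-mesos-slave.sh \
  mesos-slave-04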

Cluster management with Ansible

Ansible is a distributed systems management tool that helps to ease the management of many servers. It is not too difficult or time-consuming to make changes on one, two, or even a dozen servers, but making changes to hundreds or thousands of servers becomes a non-trivial task. Not only do such changes take a lot of time, but they have a high chance of introducing an inconsistency or error that would cause unforeseen problems.

Ansible is similar to CFEngine, Puppet, Chef, Salt, and many other systems-management tools. Each tool has its own strengths and weaknesses. One of the reasons we decided to use Ansible is its ability to execute remote commands over SSH without needing any Ansible agent to run on the managed servers.

Ansible can serve as a configuration management tool, a software deployment tool, and a general do-anything-you-want tool. It employs a plug-and-play concept, where modules have already been written for many functions. For example, there are modules for connecting to hosts with a shell, for AWS EC2 automation, for networking, and for user management.

Since we have a large Mesos cluster and several of the servers are for experimentation, we use Ansible extensively to manage the cluster and make consistent changes when necessary across all of the servers.
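
As a small illustration (the inventory group and service name are hypothetical), an ad-hoc run over SSH against every Mesos slave looks like this:

# Check connectivity to all Mesos slaves in the inventory, then restart the
# mesos-slave service on them, 20 hosts at a time.
ansible mesos-slaves -i hosts -m ping
ansible mesos-slaves -i hosts -m service -a "name=mesos-slave state=restarted" -f 20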

Conclusion

Depending on the situation and use case, many different models for running CI on Mesos can be tried out; the model we outlined above is only one of them. Another variation is a shared master using the plugin, with temporary masters running the builds directly. In Part II of this blog post, we introduce an advanced use case: running builds in Docker containers on Mesos.


In the era of cloud and XaaS (everything as a service), REST/SOAP-based web services have become ubiquitous within eBay’s platform. We dynamically monitor and manage a large and rapidly growing number of web servers deployed on our infrastructure and systems. However, existing tools present major challenges when making REST/SOAP calls with server-specific requests to a large number of web servers, and then performing aggregated analysis on the responses.

We therefore developed REST Commander, a parallel asynchronous HTTP client as a service to monitor and manage web servers. REST Commander on a single server can send requests to thousands of servers with response aggregation in a matter of seconds. And yes, it is open-sourced at http://www.restcommander.com.

Feature highlights

REST Commander is Postman at scale: a fast, parallel asynchronous HTTP client as a service with response aggregation and string extraction based on generic regular expressions. Built in Java with Akka, Async HTTP Client, and the Play Framework, REST Commander is packed with features beyond speed and scalability:

  • Click-to-run with zero installation
  • Generic HTTP request template supporting variable-based replacement for sending server-specific requests
  • Ability to send the same request to different servers, different requests to different servers, and different requests to the same server
  • Maximum concurrency control (throttling) to accommodate server capacity

Commander itself is also “as a service”: with its powerful REST API, you can define ad-hoc target servers, an HTTP request template, variable replacement, and a regular expression all in a single call. In addition, intuitive step-by-step wizards help you achieve the same functionality through a GUI.

Usage at eBay

With REST Commander, we have enabled cost-effective monitoring and management automation for tens of thousands of web servers in production, boosting operational efficiency by at least 500%. We use REST Commander for large-scale web server updates, software deployment, config pushes, and discovery of outliers. All can be executed by both on-demand self-service wizards/APIs and scheduled auto-remediation. With a single instance of REST Commander, we can push server-specific topology configurations to 10,000 web servers within a minute (see the note about performance below). Thanks to its request template with support for target-aware variable replacement, REST Commander can also perform pool-level software deployment (e.g., deploy version 2.0 to QA pools and 1.0 to production pools).

Basic workflow

Figure 1 presents the basic REST Commander workflow. Given target servers as a “node group” and an HTTP command as the REST/SOAP API to hit, REST Commander sends the requests to the node group in parallel. The response and request for each server become a pair that is saved into an in-memory hash map. This hash map is also dumped to disk, with the timestamp, as a JSON file. From the request/response pair for each server, a regular expression is used to extract any substring from the response content.

workflow

 Figure 1. REST Commander Workflow.

Concurrency and throttling model with Akka

REST Commander leverages Akka and the actor model to simplify concurrent workflows for high performance and scalability. First of all, Akka provides built-in thread pools and encapsulates low-level implementation details, so that we can focus fully on task-level development rather than thread-level programming. Secondly, Akka’s simple model of actors and messages, in the spirit of functional programming, eliminates global state, shared variables, and locks. When multiple threads/jobs need to update the same field, they simply send their results as messages to a single actor and let that actor handle the update.

Figure 2 is a simplified illustration of the concurrent HTTP request and response workflow with throttling in Akka. Throttling (concurrency control) sets the maximum number of concurrent requests that REST Commander will perform. For example, if the throttling value is 100, REST Commander will not send the nth request until it has received the (n-100)th response; so the 500th request will not be sent until the response to the 400th request has been received.

concurrency 

Figure 2. Concurrency Design with Throttling in Akka (see code)

Suppose one uniform GET /index.html HTTP request is to be sent to 10,000 target servers. The process starts with the Director having the job of sending the requests. Director is not an Akka actor, but rather a Java object that initializes the Actor system and the whole job. It creates an actor called Manager, and passes to it the 10,000 server names and the HTTP call. When the Manager receives the data, it creates one Assistant Manager and 10,000 Operation Workers. The Manager also embeds a task of “server name” and the “GET index.html HTTP request” in each Operation Worker. The Manager does not give the “go ahead” message for triggering task execution on the workers. Instead, the Assistant Manager is responsible for this part: exercising throttling control by asking only some workers to execute tasks.

To better decouple the code based on functionality, the Manager is only in charge of receiving responses from the workers, and the Assistant Manager is responsible for sending the “go ahead” message to trigger workers to work. The Manager initially sends the Assistant Manager a message to send the throttling number of messages; we’ll use 1500, the default throttling number, for this example. The Assistant Manager starts sending a “go ahead” message to each of 1500 workers. To control throttling, the Assistant Manager maintains a sliding window of [response_received_count, request_sent_count]. The request_sent_count is the number of “go ahead” messages the Assistant Manager has sent to the workers. The response_received_count comes from the Manager; when the Manager receives a response, it communicates the updated count to the Assistant Manager. Every half-second, the Assistant Manager sends itself a message to trigger a check of response_received_count and request_sent_count to determine whether the sliding window has room for sending additional messages. If so, the Assistant Manager sends messages until the sliding window is greater than or equal to the throttling number (1500).

Each Operation Worker creates an HTTP Worker, which wraps Ning’s async HTTP client functions. When the Manager receives a response from an Operation Worker, it updates the response part of the in-memory hash map entry (the request/response pair) for the associated server. If the response cannot be obtained or times out, the worker returns the exception details (e.g., a connection exception) to the Manager instead. When the Manager has received all of the responses, it returns the whole hash map back to the Director. As the job successfully completes, the Director dumps the hash map to disk as a JSON file, then returns.

Beyond web server management – generic HTTP workflows

When modeling and abstracting today’s cloud operations and workflows – e.g., provisioning, file distribution, and software deployment – we find that most of them are similar: each step is some form of HTTP call whose responses trigger the operations in the next step. Using the example of monitoring cluster server health, the workflow goes like this (a shell sketch with hypothetical endpoints follows the list):

  1. A single HTTP call to query data storage (such as database as a service) and retrieve the host names and health records of the target servers (1 call to 1 server)
  2. Massive uniform HTTP calls to check the current health of target servers (1 call to N servers); aggregating these N responses; and conducting simple analysis and extractions
  3. Data storage updates for those M servers with changed status (M calls to 1 server)
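
All endpoints, paths, and response formats below are hypothetical (the host list is assumed to come back one hostname per line); the point is only that every step reduces to plain HTTP calls:

# 1. One call to the data store (e.g., a database-as-a-service API) to fetch
#    the target host list and health records.
curl -s "http://dbaas.example.com/v1/hosts?pool=web" -o hosts.txt

# 2. The same health-check request fanned out to every host. REST Commander
#    performs this step as one parallel, throttled fan-out with response
#    aggregation; a plain shell loop is shown here only for shape.
while read -r host; do
  curl -s "http://$host:8080/admin/health" &
done < hosts.txt
wait

# 3. One update call per server whose status changed.
curl -X PUT "http://dbaas.example.com/v1/hosts/web01" -d '{"status": "DOWN"}'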

REST Commander flawlessly supports such use cases with its generic and powerful request models. It is therefore used to automate many tasks involving interactions and workflows (orchestrations) with DBaaS, LBaaS (load balancer as a service), IaaS, and PaaS.

Related work review

Of course, HTTP is a fundamental protocol to the World Wide Web, SOAP/REST-based web services, cloud computing, and many distributed systems. Efficient HTTP/REST/SOAP clients are thus critical in today’s platform and infrastructure services. Although many tools have been developed in this area, we are not aware of any existing tools or libraries on HTTP clients that combine the following three features:

  • High efficiency and scalability with built-in throttling control for parallel requests
  • Generic response aggregation and analysis
  • Generic (i.e., template-based) heterogeneous request generation to the same or different target servers

Postman is a popular and user-friendly REST client tool; however, it does not support efficient parallel requests or response aggregation. Apache JMeter, ApacheBench (ab), and Gatling can send parallel HTTP requests with concurrency control. However, they are designed for load/stress testing on a single target server rather than on multiple servers. They do not support generating different requests to different servers. ApacheBench and JMeter cannot conduct response aggregation or analysis, while Gatling focuses on response verification of each simulation step.

ql.io is a great Node.js-based aggregation gateway for quickly consuming HTTP APIs. However, having a different design goal, it does not offer throttling or generic response extraction (e.g., regular expressions). Also, its own language, table construction, and join query result in a higher learning curve. Furthermore, single-threaded Node.js might not effectively leverage multiple CPU cores unless running multiple instances and splitting traffic between them. 

Typhoeus is a wrapper around libcurl for parallel HTTP requests with throttling. However, it does not offer response aggregation, and more critically, its synchronous HTTP model limits scalability. Writing a simple shell script with “for” loops around “curl” or “wget” can send multiple HTTP requests, but the process is sequential and does not scale.

Ning’s Async-http-client library in Java provides high-performance, asynchronous request and response capabilities compared to the synchronous Apache HTTPClient library. A similar library in Scala is Stackmob’s (PayPal’s) Newman HTTP client with additional response caching and (de)serialization capabilities. However, these HTTP clients are designed as raw libraries without features such as parallel requests with templates, throttling, response aggregation, or analysis.

Performance note

Actual REST Commander performance varies based on network speed, the slowest servers, and Commander throttling and time-out settings. In our testing with single-instance REST Commander, for 10,000 servers across regions, 99.8% of responses were received within 33 seconds, and 100% within 48 seconds. For 20,000 servers, 100% of responses were received within 70 seconds. For a smaller scale of 1,000 servers, 100% of responses were received within 7 seconds.

Conclusion and future work

“Speaking HTTP at scale” is instrumental in today’s platform with XaaS (everything as a service).  Each step in the solution for many of our problems can be abstracted and modeled by parallel HTTP requests (to a single or multiple servers), response aggregation with simple (if/else) logic, and extracted data that feeds into the next step. Taking scalability and agility to heart, we (Yuanteng (Jeff) Pei, Bin Yu, and Yang (Bruce) Li) designed and built REST Commander, a generic parallel async HTTP client as a service. We will continue to add more orchestration, clustering, security, and response analysis features to it. For more details and the video demo of REST Commander, please visit http://www.restcommander.com.  

Yuanteng (Jeff) Pei

Cloud Engineering, eBay Inc.

References

Postman: http://www.getpostman.com
Akka: http://akka.io
Async HTTP Client: https://github.com/AsyncHttpClient/async-http-client
Play Framework: http://www.playframework.com
Apache JMeter: https://jmeter.apache.org
ApacheBench (ab): http://httpd.apache.org/docs/2.2/programs/ab.html
Gatling: http://gatling-tool.org
ql.io: http://ql.io
Typhoeus: https://github.com/typhoeus/typhoeus
Apache HttpClient: http://hc.apache.org/httpclient-3.x
Stackmob’s Newman: https://github.com/stackmob/newman

