Cube Planner – Build an Apache Kylin OLAP Cube Efficiently and Intelligently

Life is about carefully calculating daily necessities, and the same is true for technology. Frugal people spend money on the things they need most, while programmers are always seeking to reduce the resource cost of their code. Cube Planner is a new feature created by eBay’s programmers that helps you spend resources on building cost-effective dimension combinations.

Background

Apache Kylin is an open source Distributed Analytics Engine designed to provide multi-dimensional analysis (OLAP) on Hadoop. It was originally contributed by eBay to the Apache Software Foundation in 2014 and became an Apache top-level project in 2015. With the rapid development of the community since then, Apache Kylin has been deployed by many companies worldwide.

There has also been a crescendo of data analysis applications deployed on the Apache Kylin platform within eBay. There are currently 144 cubes in eBay’s production environment, and this number continues to grow by an average of 10 per month. The largest cube is about 100 TB. The average query volume is between 40,000 and 50,000 per weekday, and over the last month the average query latency was 0.49 seconds.

Why is Cube Planner needed?

In 2016, eBay’s Data Services and Solutions (DSS) department began to introduce self-service reports. Data analysts are able to drag and drop dimensions and measures to create their own reports. But a self-service report generally has many dimensions and measures, and its query pattern is difficult to define in advance, which requires Kylin to pre-calculate a large number of dimension combinations.

In addition, in companies that reach a certain scale, data providers and data users don’t belong to the same department, and data providers may not be particularly familiar with the query patterns, which leads to a gap between the cube design and actual use. Some combinations of dimensions are pre-calculated but never queried, while others are queried often but not pre-calculated. These gaps hurt both the resource utilization and the query performance of a cube.

We decided to make a cube build planner, Cube Planner, with the goal of efficiently using computing and storage resources to build cost-effective dimension combinations intelligently.

What is Cube Planner?

Cube Planner checks the costs and benefits of each dimension combination, and selects cost-effective sets of dimension combinations to improve cube build efficiency and query performance. The Cube Planner recommendation algorithm can be run multiple times throughout the life cycle of a cube: for example, after the cube is created for the first time, or periodically once a certain amount of query statistics has been collected.

Definitions and Formulas

Here are some definitions and formulas used in the Cube Planner algorithm.

  • Dimensions and Measures are the basic concepts of data warehousing.
  • A dimension combination (Cuboid) refers to any combination of multiple dimensions of the data. Suppose the data has three dimensions (time, location, category), then it has eight kinds of dimension combinations: [ ], [time], [location], [category], [time, location], [time, category], [location, category], [time, location, category].
  • The basic dimension combination (Basic Cuboid) is the combination of all dimensions and carries the most detailed aggregation information, such as [time, location, category] in the example above.
  • The symbol ≼ represents a parent-child relationship between dimension combinations.
    w ≼ i indicates that a query involving dimension combination w can be aggregated from the data of dimension combination i; w is the child combination and i is the parent combination. For example, w = [time, category] is a child of i = [time, location, category]. Each parent combination may have multiple child combinations, and the basic dimension combination is the parent of all other dimension combinations.
  • Cost of a dimension combination
    The cost of a dimension combination has two aspects. One is the build cost, defined by the number of rows in the combination itself; the other is the query cost, defined by the number of rows scanned to answer a query. Since the build is usually one-time while queries are repetitive, we ignore the build cost and use only the query cost as the cost of a dimension combination: C(i) = cost of dimension combination i.
  • Benefit of a dimension combination
    The benefit of a dimension combination can be understood as reduced cost, that is, the total query cost on this cube that can be saved by pre-calculating this dimension combination. B(i, S) denotes the benefit of adding dimension combination i to the dimension combination set S, calculated by summing up the benefits over all of its child combinations:
    B(i, S) = ∑w≼i Bw
    where the benefit for each child combination is Bw = max{C(w) − C(i), 0}
  • Benefit ratio of a dimension combination
    Dimension combination benefit ratio = (benefit of the dimension combination / cost of the dimension combination) * 100%:
    Br(i, S) = (B(i, S) / C(i)) * 100%
    (A small code sketch after this list illustrates these definitions.)
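To make these definitions concrete, here is a minimal sketch in Python. It is not Kylin’s actual implementation; the row counts and the parent-child relation are illustrative assumptions only (the toy lattice has just the three combinations a, b, and c, so the numbers differ from the larger example later in this post).

```python
# Toy lattice: row counts of the pre-calculated dimension combinations
# (hypothetical values; "a" is the basic cuboid [time, location, category]).
row_count = {"a": 100, "b": 50, "c": 70}

# children[i] lists every combination w with w ≼ i, including i itself.
children = {
    "a": {"a", "b", "c"},
    "b": {"b"},
    "c": {"c"},
}

def query_cost(w, built):
    """C(w): rows scanned to answer a query on w, i.e. the row count of the
    smallest already-built parent of w."""
    return min(row_count[p] for p in built if w in children[p])

def benefit(i, built):
    """B(i, S): total query cost saved by also building i, summed over all
    child combinations w of i, each contributing max{C(w) - C(i), 0}."""
    return sum(max(query_cost(w, built) - row_count[i], 0) for w in children[i])

def benefit_ratio(i, built):
    """Br(i, S) = B(i, S) / C(i)."""
    return benefit(i, built) / row_count[i]

built = {"a"}                     # the basic cuboid is always pre-calculated
print(benefit_ratio("b", built))  # (100 - 50) * 1 / 50 = 1.0 in this toy lattice
```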

How to choose a cost-effective dimension combination?

We use a greedy algorithm to choose cost-effective dimension combinations, and the selection process is iterative. In each iteration, the system goes through the candidate set, selects the dimension combination with the highest benefit ratio in the current state, and adds it to the result set of combinations to be built.
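Building on the functions sketched above, the iterative selection could look roughly like the following. This is only a sketch: the candidate set and the minimum benefit-ratio threshold are hypothetical, and the real Cube Planner applies the additional stopping conditions described later in this post.

```python
def greedy_select(candidates, built, min_benefit_ratio=0.1):
    """Iteratively pick the candidate with the highest benefit ratio until no
    candidate is cost-effective any more (other stop conditions omitted)."""
    selected = set(built)
    while True:
        remaining = [i for i in candidates if i not in selected]
        if not remaining:
            break
        best = max(remaining, key=lambda i: benefit_ratio(i, selected))
        if benefit_ratio(best, selected) < min_benefit_ratio:
            break
        selected.add(best)
    return selected

# Example: start from the basic cuboid and consider b and c as candidates.
print(greedy_select({"b", "c"}, {"a"}))
```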

Let’s now look at a simple example to understand the process of the iterative selection.

Example of the Cube Planner dimension combination selection process

Let’s assume there is a cube that has the following dimension combinations.

Dimension Combination ID    Dimensions
a                           time, location, category
b                           time, location
c                           time, category

The parent-child relationships between the dimension combinations are shown in the following figure. Each dimension combination is labeled with a pair of numbers (i, j) in addition to its ID.

i: the number of rows in this dimension combination

j: the cost of querying the dimension combination

In the following figure, dimension combination a is the basic dimension combination (Basic Cuboid), and its data has 100 rows. In a Kylin cube, the basic dimension combination is pre-calculated by default, so initially the result set contains only the basic dimension combination. Queries on any other dimension combination therefore need to be answered by scanning the basic dimension combination data and then aggregating, so the cost of these queries is its row count, 100.

First iteration of Cube Planner greedy algorithm

The green dimension combinations are those that have been selected into the result set during previous rounds, initially only the basic dimension combination. The red ones are the combinations that may be selected into the result set during this round. The blue ones are the combinations that are selected and added to the result set in this round.

What is the benefit of selecting dimension combination b, that is, how much does it reduce the cost of the entire cube? If b is selected, the query cost of b and its children (d, e, g, h) is reduced from 100 to 50, so the benefit ratio of dimension combination b is (100 – 50) * 5 / 50 = 5.

If you choose dimension combination c instead, the benefit ratio is 1.67.

If other dimension combinations are selected, the benefit ratios are as follows. It is obvious that g is the combination with the highest benefit ratio, so dimension combination g is added to the result set in this round.

Second round selection of the Cube Planner

At this point, the result set has become (a, g). In this state, if b is selected, although the children of b are d, e, g, and h, only queries over b, d, and e will be affected, because g has already been pre-calculated and has a smaller row count than b. When a query against dimension combination g arrives, the system will still choose g to answer it. This is exactly why the benefit of each child combination mentioned above is Bw = max{C(w) − C(i), 0}. So the benefit ratio of combination b is now 4.

If you pick the combination c, the benefit ratio is 1.34.

The benefits of the other combinations are as follows. Combination h is eventually selected and added to the result set in this round.

When do we stop the iterative selection?

When any one of the following three conditions is fulfilled, the system will stop the selection (a rough sketch of this check follows the list):

  • The expansion rate of the dimension combinations in the result set reaches the expansion rate established for the cube.
  • The benefit ratio of the dimension combination selected this round is lower than the established minimum value. This indicates that the newly added dimension combination is not cost-effective, so there is no need to select more dimension combinations.
  • The run time of the selection algorithm reaches the established ceiling.
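A rough sketch of such a check is shown below. The threshold names and the expansion-rate formula are assumptions for illustration, not Kylin’s actual configuration keys.

```python
import time

def should_stop(selected_row_counts, base_rows, latest_benefit_ratio, start_time,
                max_expansion_rate=10.0, min_benefit_ratio=0.1,
                max_runtime_seconds=60):
    """Return True if any of the three stopping conditions is met."""
    # 1. Expansion rate: total pre-calculated rows relative to the basic cuboid.
    expansion_rate = sum(selected_row_counts) / base_rows
    if expansion_rate >= max_expansion_rate:
        return True
    # 2. The combination selected this round is no longer cost-effective.
    if latest_benefit_ratio < min_benefit_ratio:
        return True
    # 3. The selection algorithm has exhausted its time budget.
    if time.time() - start_time >= max_runtime_seconds:
        return True
    return False
```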

A Cube Planner algorithm based on weighted query statistics

In the previous algorithm, every dimension combination carries the same weight when calculating the benefit that one dimension combination brings to the entire cube. In actual applications, however, the probability of each dimension combination being queried varies. Therefore, when calculating the benefit, we use the query probability as the weight of each dimension combination.

Weighted dimension combination benefit WB(i,S) = ∑w≼i (Bw * Pw)

Pw is the probability that dimension combination w is queried. For dimension combinations that have never been queried, we set the probability to a small non-zero value to avoid overfitting.
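Reusing the row_count, children, and query_cost pieces from the first sketch, the weighted benefit might be computed as follows; the probability values are hypothetical and would normally come from collected query statistics.

```python
def weighted_benefit(i, built, query_prob, default_prob=0.01):
    """WB(i, S) = sum over w ≼ i of Bw * Pw, where Pw is the probability that
    combination w is queried (a small default for never-queried ones)."""
    return sum(
        max(query_cost(w, built) - row_count[i], 0) * query_prob.get(w, default_prob)
        for w in children[i]
    )

# Hypothetical query statistics: b is queried far more often than c.
query_prob = {"b": 0.6, "c": 0.05}
print(weighted_benefit("b", {"a"}, query_prob))  # (100 - 50) * 0.6 = 30.0
```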

For a newly created cube, all probabilities are set to the same value because there are no query statistics in the system yet. There is room for improvement here: in the future, we want to use platform-level statistics and data-mining algorithms to estimate the query probabilities of the dimension combinations of a new cube. For cubes that are already online, we collect query statistics, so it is easy to calculate the probability of each dimension combination being queried.

In fact, the real algorithm based on weighted query statistics is more complex than this. Many practical problems need to be solved, for example, how to estimate the cost of a dimension combination that is not pre-calculated, because Kylin only stores statistics of the pre-calculated dimension combinations.

Greedy algorithm vs. genetic algorithm

Greedy algorithms have limitations when there are too many dimensions: the candidate set of dimension combinations becomes very large, and since the algorithm must go through every candidate in each iteration, it takes a long time to run. Thus, we implemented a genetic algorithm as an alternative. It uses a gene-based iterative, evolutionary model to simulate the selection of the dimension combination set, and the optimal set of dimension combinations is evolved (selected) over a number of iterations. We will not go into the details here; if anyone is interested, we can write a detailed description of the genetic algorithm used in Cube Planner.

Cube Planner one-click optimization

For a cube that is already in production, Cube Planner can use the collected query statistics to intelligently recommend a set of dimension combinations for pre-calculation based on the cost and benefit algorithm. Users can then apply this recommendation with a single click. The following figure shows the one-click optimization process. We use two jobs, Optimizejob and Checkpointjob, both to allow the optimization to run concurrently and to guarantee the atomicity of the data switch after optimization. To save resources, Optimizejob does not recalculate all of the recommended dimension combinations; it computes only the new combinations in the recommended set and reuses any that are already pre-calculated in the cube.

Application of Cube Planner in eBay

Cube Planner’s development was essentially complete at the end of 2016. After extensive testing, it was formally launched in the production environment in February 2017, where it has been highly praised and widely recognized by users. Here is an example of Cube Planner in use in eBay’s production environment. The customer optimized the cube on April 19th; the optimized cube not only built significantly faster, but also improved significantly in query performance.

Follow-up development of Cube Planner

In addition to enhancing resource utilization, Cube Planner has another goal: reducing cube design complexity. At present, designing a Kylin cube depends on a good understanding of the data. Cube Planner enables Kylin to recommend and optimize based on data distribution and data usage, but this is not enough. We want to turn Kylin cube design into a smart recommendation system based on query statistics across cubes within the same domain.

eBay’s New Homepage


Building and releasing a modern homepage for eBay’s hundreds of millions of annual users and tens of millions of daily users was the team’s greatest challenge to date. We were given the mission to develop a personalized buying experience by mashing merchandised best offers with tailored ads, topping it off with the hottest trends and relevant curated content. Any such virtual front door had to be robust to change, easily configurable for any site, and able to load really, really fast!

There’s no second chance to make a first impression, so we had to make it count. But how quickly can you go from 0 to 100 (100% deployed for 100% of traffic to the homepage)?

The team commenced this project in late 2016, and we rushed to get an early flavor available for an internal beta in mid-February. Traffic was then slowly ramped until reaching the 100% mark for the Big 4 sites (US, UK, DE, AU), delivered less than three months after the beta, while simultaneously adding more scope and fresh content to the sites.

Our new homepage consists of three main components, spread across three large pools:

  • Node Application – The homepage front-end web server handles incoming user requests and acts as the presentation layer for the application, boasting responsive design.
  • Experience Service – The experience service tier is the primary application server that interacts with the page template and customization service and orchestrates the request handling to various data providers. It performs response post-processing that includes tracking. This layer returns a render-ready format that is consumable by all screen types. The experience layer is invoked by Node for desktop and mobile web traffic and directly by the native applications.
  • Data Provider Service – The data provider service is a facade that massages responses from datastores into a semantic format. Providers such as Merchandise, Ads, Marketing, Collections, and Trends interface with this service. JSON transformation, done using Jolt, produces the response datagrams.

Additional auxiliary components help keep the system configurable and simple to reorganize, and allow new content to be published to the site within seconds.

The Experience service was built on a shared platform, called Answers, developed jointly with the Browse and Search domains.

Due to the hierarchical nature of the system, its complex composition (16 unique modules), and its multiple dependencies (9 different services, to be exact), special consideration had to be given to the rollout and deployment process.

We adopted a bottom-up strategy by deploying in the reverse order of dependencies. That is, we start with the datastores, then the data providers and experience services, and lastly deploy the Node application server. This approach helps us bubble up issues early in the process and tackle misalignment effectively.

All service-to-service calls are made via clearly defined, internally reviewed APIs. Interface changes are only allowed to be additive to maintain backwards compatibility. This is essential for the incremental bottom-up rollout scheme.

When developing and deploying a new build, as when delicately changing a layer in a house of cards, one must act with great care and much thought about the inherent risks. Test strategies like “mock-first” served as a working proof of concept and allowed testing to begin early on, with caching techniques quickly taking the place of those same mocks in the final product. Unit tests, hand in hand with functional and nonfunctional tests, had to be done with rigor, to make regression testing easy, fast, and resilient to change.

The team always deployed to Staging and Pre-production before conducting any smoke-testing on three boxes in Production, one VM instance per Data Center. Specific networking issues were exposed and quickly remediated, as a result.

Some of the engineering challenges the team faced had to do with Experimentation Platform complexities, where a selective pool channel redirect showed unexpected behavior with traffic rerouting while gradually diverting traffic from the current Feed-based homepage to the new carousel-based one.

Internal beta testing, done by Buyer Experience teams, and external in-the-wild testing, conducted by a crowd-testing company, took place with a zero-traffic experiment. This process was followed by a slow 1% traffic ramp to help control the increments, after which gradual 5% and 25% steps were introduced to the pools, spanning days to weeks between phases. Each ramp involved multiple daily Load & Performance cycles, while fine-tuning timeout and retry variables.

To summarize our key learnings from the homepage experience:

  • Relying on clearly-defined APIs was vital in achieving an iterative, incremental development and rollout strategy.
  • Co-evolving our logging and monitoring infrastructure and the team’s debugging expertise was essential to achieving an aggressive rollout approach on a distributed stack.
  • Tracking and Experimentation setups, as well as baking in Accessibility, proved to be major variables that should get well-scoped into a product’s development lifecycle and launch timelines.
  • Introducing gradual functional increments along with a design for testability in mind proved to be a valuable quality control measure and allowed absorbing change while moving forward.
  • A meticulous development and release process tracked the associated risks and mitigation actions, allowing the team to meet its milestones, as defined and on time.

Enhancing the User Experience of the Hadoop Ecosystem


At eBay, we have multiple large, multi-tenant clusters. Each of these clusters stores hundreds of petabytes of data. These clusters offer tens of thousands of cores to run computations on the data. We have thousands of internal users who use Hadoop in their roles, including data analysts, data scientists, engineers, and product managers. These users use multiple technologies like MapReduce, Hive, and Spark to process data. There are thousands of applications that push and pull data from Hadoop and run computations.

Figure 1: Hadoop clusters, auxiliary components, users, applications, and services

Pains

The users normally interact with the cluster via the command line by SSHing to specialized gateway machines that reside in the same network zone as the cluster. To transfer job files and scripts, the users need to SCP over multiple hops.

Figure 2: Old way of accessing a Hadoop cluster

The need to traverse multiple hops as well as the command-line-only usage was a major hindrance to the productivity of our data users.

On the other side, our website applications and services need to access data and perform computation. These applications and services reside in a different network zone and hence need network rules to be set up to access services like HDFS, YARN, and Oozie. Since our clusters are secured with Kerberos, the applications need to be able to authenticate to the Hadoop services using Kerberos. This placed an extra burden on our application developers.

In this post, I will share the work in progress to facilitate access to our Hadoop clusters for data and compute resources by users and applications.

Requirements

We need better ways to achieve the following goals:

  • Our engineers and other users need to use multiple clusters and related components.
  • Data Analysts and other users need to run interactive queries and create shareable reports.
  • Developers need to be able to develop applications and services without spending time on connectivity problems or Kerberos authentication.
  • We can afford no compromise on security.

Solutions

To improve user experience and productivity, we added three open-source components:

Hue — to perform operations on Hadoop and related components.

Apache Zeppelin — to develop interactive notebooks with queries, programs, and reports.

Apache Knox — to serve as a single point for applications to access HDFS, Oozie, and other Hadoop services.

Figure 3: Enhanced user experience with Hue, Zeppelin, and Knox

We will describe each product, the main use cases, a list of our customizations, and the architecture.

Hue

Hue is a user interface to the Hadoop ecosystem. It provides user interfaces to several components including HDFS, Oozie, Resource Manager, Hive, and HBase. It is a 100% open-source product, actively supported by Cloudera, and stored at the Hue GitHub site.

Similar products

Apache Airflow allows users to specify workflows in Python. Since we did not want to impose a Python learning curve on our users, we chose Hue over Airflow. However, we may find Airflow compelling enough to deploy in the future for people who prefer it.

Use cases of Hue

Hue allows a user to work with multiple components of the Hadoop ecosystem. A few common use cases are listed below:

  • To browse, manage, upload, and download HDFS files and directories
  • To specify workflows comprising MapReduce, Hive, Pig, Spark, Java, and shell actions
  • To schedule workflows and track SLAs
  • To manage Hive metadata, run Hive queries, and share the queries with other users
  • To manage HBase metadata and interact with HBase
  • To view YARN applications and terminate applications if needed

Enhancements

Two-factor authentication — To ensure that the same security level is maintained as for command-line access, we needed to integrate our custom SAML-based two-factor authentication into Hue. Hue supports plugging in new authentication mechanisms, which allowed us to plug in our two-factor authentication.
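As an illustration of the mechanism, the sketch below shows a Django-style authentication backend. Hue is a Django application and lets you configure the backend class in hue.ini; the class, module, and helper names here are hypothetical, Hue’s backend base class can vary between versions, and the real eBay backend validates a SAML-based second factor rather than the placeholders shown.

```python
from django.contrib.auth.models import User


class TwoFactorSamlBackend(object):
    """Hypothetical two-factor backend sketch for Hue (a Django app)."""

    def authenticate(self, request=None, username=None, password=None, **kwargs):
        # First factor: primary credentials; second factor: SAML assertion /
        # one-time code. Both checks are placeholders in this sketch.
        if not self._check_password(username, password):
            return None
        if not self._check_second_factor(request):
            return None
        user, _created = User.objects.get_or_create(username=username)
        return user

    def get_user(self, user_id):
        try:
            return User.objects.get(pk=user_id)
        except User.DoesNotExist:
            return None

    def _check_password(self, username, password):
        raise NotImplementedError  # delegate to LDAP/SSO in a real deployment

    def _check_second_factor(self, request):
        raise NotImplementedError  # validate the SAML two-factor response
```

The backend would then be referenced from the [desktop] [[auth]] backend setting in hue.ini (the exact configuration depends on the Hue version).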

Ability to impersonate other users — At eBay, users sometimes operate on behalf of a team account. We added a capability to Hue so that users can impersonate another account as long as they are authorized to do so. The authorization is controlled by LDAP group memberships. Users can switch back and forth between multiple accounts.

Working with multiple clusters — Since we have multiple clusters, we wanted a single Hue instance to serve multiple Hadoop clusters and components. This enhancement required changes in the HDFS File Browser, Job Browser, Hive Metastore Managers, Hive query editors, and workflow submission.

Architecture

Figure 4: Hue architecture at eBay

Zeppelin

A lot of our users, especially data scientists, want to run interactive queries on the data stored on Hadoop clusters. They run one query, check its results, and, based on the results, form the next query. Big data frameworks like Spark, Presto, Kylin, and to some extent, HiveServer2 provide this kind of interactive query support.

Apache Zeppelin (GitHub repo) is a user interface that integrates well with products like Spark, Presto, and Kylin, among others. In addition, Zeppelin provides an interface where users can develop data notebooks. The notebooks can express data processing logic in SQL, Scala, Python, or R. Zeppelin also supports data visualization in notebooks in the form of tables and charts.

Zeppelin is an Apache project and is 100% open source.

Use cases

Zeppelin allows a user to develop visually appealing interactive notebooks using multiple components of the Hadoop ecosystem. A few common use cases are listed below:

  • Run a quick Select statement on a Hive table using Presto.
  • Develop a report based on a dataset by reading files from HDFS and persisting them in memory as Spark data frames (see the sketch after this list).
  • Create an interactive dashboard that allows users to search through a specific set of log files with custom format and schema.
  • Inspect the schema of a Hive table.
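For instance, a notebook paragraph for the HDFS-to-Spark use case above might look roughly like the sketch below. The interpreter name depends on how Zeppelin is configured (at eBay there are per-cluster interpreters), and the path and column names are hypothetical; z.show is Zeppelin’s helper for rendering a DataFrame as a table or chart.

```python
%spark.pyspark
# Read raw files from HDFS and register them as a temporary view.
df = spark.read.json("/data/homepage/clickstream/2017-05-01")   # hypothetical path
df.createOrReplaceTempView("clicks")

# Aggregate with Spark SQL and let Zeppelin render the result as a chart.
report = spark.sql("SELECT module, count(*) AS views FROM clicks GROUP BY module")
z.show(report)
```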

Enhancements

Two-factor authentication — To maintain security parity with command-line access, we plugged our custom two-factor authentication mechanism into Zeppelin. Zeppelin uses Shiro for security, and Shiro allows one to plug in a custom authentication with some difficulty.

Support for multiple clusters — We have multiple clusters and multiple instances of components like Hive. To support multiple instances in one Zeppelin server, we created different interpreters for different clusters or server instances.

Capability to override interpreter settings at the user level — Some interpreter settings, such as job queues and memory values, need to be customized by users for their specific use cases. To support this, we added a feature in Zeppelin so that users can override certain interpreter settings by setting properties. This is described in detail in the Apache JIRA ticket ZEPPELIN-1625.

Architecture

Figure 5: Zeppelin Architecture at eBay

Knox

Apache Knox (GitHub repo) is an HTTP reverse proxy that provides a single endpoint for applications to invoke Hadoop operations. It supports multiple clusters and multiple components such as WebHDFS, Oozie, and WebHCat. It can also support multiple authentication mechanisms, so we can hook up custom authentication alongside Kerberos authentication.

It is an Apache top-level project and is 100% open source.

Use cases

Knox allows an application to talk to multiple Hadoop clusters and related components through a single entry point using any application-friendly non-Kerberos authentication mechanism. A few common use cases are listed below:

  • To authenticate using an application token and put/get files to/from HDFS on a specific cluster (a sketch of this follows the list)
  • To authenticate using an application token and trigger an Oozie job
  • To run a Hive script using WebHCat
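For example, the HDFS use case above could look roughly like the following sketch, which calls the standard WebHDFS REST API through a Knox gateway. The gateway host, topology name (prod-cluster), credentials, and paths are hypothetical, and eBay’s deployment uses its custom token-based authentication rather than the basic auth shown here.

```python
import requests

KNOX = "https://knox.example.com:8443/gateway/prod-cluster"   # hypothetical gateway
AUTH = ("app_user", "app_password")                           # placeholder credentials

# List a directory on HDFS of the target cluster via WebHDFS.
resp = requests.get(KNOX + "/webhdfs/v1/tmp/reports",
                    params={"op": "LISTSTATUS"}, auth=AUTH)
resp.raise_for_status()
for entry in resp.json()["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["type"])

# Read a file; Knox proxies the WebHDFS redirect to the datanode for us.
data = requests.get(KNOX + "/webhdfs/v1/tmp/reports/summary.csv",
                    params={"op": "OPEN"}, auth=AUTH).content
```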

Enhancements

Authentication using application tokens — The applications and services in the eBay backend use a custom token-based authentication mechanism. To take advantage of the existing application credentials, we enhanced Knox to support our token-based authentication mechanism in addition to Kerberos. Knox utilizes the Hadoop Authentication framework, which is flexible enough to plug in new authentication mechanisms. The steps to plug an authentication mechanism into Hadoop’s HTTP interface are described in Multiple Authentication Mechanisms for Hadoop Web Interfaces.

Architecture

Figure 6: Knox Architecture at eBay

Summary

In this blog post, we described the approach taken to improve user experience and developer productivity in using our multiple Hadoop clusters and related components. We illustrated the use of three open-source products, Hue, Zeppelin, and Knox, to make Hadoop users’ lives a lot simpler. We evaluated these products, customized them for eBay’s purposes, and made them available to our users so they can carry out their projects efficiently.