
Enhancing the User Experience of the Hadoop Ecosystem

 

At eBay, we have multiple large, multi-tenant clusters. Each of these clusters stores hundreds of petabytes of data and offers tens of thousands of cores for running computations on that data. We have thousands of internal users, including data analysts, data scientists, engineers, and product managers, who use Hadoop in their roles. They process data with technologies such as MapReduce, Hive, and Spark. In addition, thousands of applications push data to Hadoop, pull data from it, and run computations.

Figure 1: Hadoop clusters, auxiliary components, users, applications, and services

Pains

The users normally interact with the cluster via the command line by SSHing to specialized gateway machines that reside in the same network zone as the cluster. To transfer job files and scripts, the users need to SCP over multiple hops.

Figure 2: Old way of accessing a Hadoop cluster

The need to traverse multiple hops as well as the command-line-only usage was a major hindrance to the productivity of our data users.

On the other side, our website applications and services need to access data and run computations. These applications and services reside in a different network zone and hence need network rules to be set up before they can reach services like HDFS, YARN, and Oozie. Since our clusters are secured with Kerberos, the applications must also be able to authenticate to the Hadoop services via Kerberos. This placed an extra burden on our application developers.

In this post, I will share the work in progress to facilitate access to our Hadoop clusters for data and compute resources by users and applications.

Requirements

We need better ways to achieve the following goals:

  • Our engineers and other users need to use multiple clusters and related components.
  • Data Analysts and other users need to run interactive queries and create shareable reports.
  • Developers need to be able to develop applications and services without spending time on connectivity problems or Kerberos authentication.
  • We can afford no compromise on security.

Solutions

To improve user experience and productivity, we added three open-source components:

Hue — to perform operations on Hadoop and related components.

Apache Zeppelin — to develop interactive notebooks with queries, programs, and reports.

Apache Knox — to serve as a single point for applications to access HDFS, Oozie, and other Hadoop services.

Figure 3: Enhanced user experience with Hue, Zeppelin, and Knox

For each product, we describe the main use cases, our customizations, and the architecture.

Hue

Hue is a user interface to the Hadoop ecosystem. It provides user interfaces to several components including HDFS, Oozie, Resource Manager, Hive, and HBase. It is a 100% open-source product, actively supported by Cloudera, and hosted in the Hue GitHub repository.

Similar products

Apache Airflow allows users to specify workflows in Python. Since we did not want to impose a Python learning curve on our users, we chose Hue instead of Airflow. We may still deploy Airflow in the future for users who prefer it.

Use cases of Hue

Hue allows a user to work with multiple components of the Hadoop ecosystem. A few common use cases are listed below:

  • To browse, manage, upload, and download HDFS files and directories
  • To specify workflows comprising MapReduce, Hive, Pig, Spark, Java, and shell actions
  • To schedule workflows and track SLAs
  • To manage Hive metadata, run Hive queries, and share the queries with other users
  • To manage HBase metadata and interact with HBase
  • To view YARN applications and terminate applications if needed

Enhancements

Two-factor authentication — To maintain the same security level as command-line access, we needed to integrate our custom SAML-based two-factor authentication into Hue. Hue supports pluggable authentication backends, which allowed us to plug in our two-factor authentication.

Ability to impersonate other users — At eBay, users sometimes operate on behalf of a team account. We added a capability in Hue so that users can impersonate another account as long as they are authorized to do so. The authorization is controlled by LDAP group memberships. Users can also switch back and forth between their accounts.

Working with multiple clusters — Since we have multiple clusters, we wanted a single Hue instance to serve multiple Hadoop clusters and components. This enhancement required changes in the HDFS File Browser, Job Browser, Hive Metastore Managers, Hive query editors, and workflow submissions.

Architecture

Figure 4: Hue architecture at eBay

Zeppelin

A lot of our users, especially data scientists, want to run interactive queries on the data stored on Hadoop clusters. They run one query, check its results, and, based on the results, form the next query. Big data frameworks like Spark, Presto, Kylin, and to some extent, HiveServer2 provide this kind of interactive query support.

Apache Zeppelin (GitHub repo) is a user interface that integrates well with products like Spark, Presto, and Kylin, among others. In addition, Zeppelin provides an interface where users can develop data notebooks. The notebooks can express data processing logic in SQL, Scala, Python, or R. Zeppelin also supports data visualization in notebooks in the form of tables and charts.

Zeppelin is an Apache project and is 100% open source.

Use cases

Zeppelin allows a user to develop visually appealing interactive notebooks using multiple components of the Hadoop ecosystem. A few common use cases are listed below:

  • Run a quick Select statement on a Hive table using Presto.
  • Develop a report based on a dataset by reading files from HDFS and persisting them in memory as Spark data frames.
  • Create an interactive dashboard that allows users to search through a specific set of log files with custom format and schema.
  • Inspect the schema of a Hive table.

Enhancements

Two-factor authentication — To maintain security parity with command-line access, we plugged our custom two-factor authentication mechanism into Zeppelin. Zeppelin uses Apache Shiro for security, and Shiro allows a custom authentication mechanism to be plugged in, albeit with some difficulty.

Support for multiple clusters — We have multiple clusters and multiple instances of components like Hive. To support multiple instances in one Zeppelin server, we created different interpreters for different clusters or server instances.

Capability to override interpreter settings at the user level — Some interpreter settings, such as job queues and memory values, need to be customized by users for their specific use cases. To support this, we added a feature in Zeppelin so that users can override certain interpreter settings by setting properties. This is described in detail in the Apache JIRA ticket ZEPPELIN-1625.

Architecture

Figure 5: Zeppelin Architecture at eBay

Knox

Apache Knox (GitHub repo) is an HTTP reverse proxy that provides a single endpoint for applications to invoke Hadoop operations. It supports multiple clusters and multiple components such as WebHDFS, Oozie, and WebHCat. It can also support multiple authentication mechanisms, so we can hook up custom authentication alongside Kerberos authentication.

It is an Apache top-level project and is 100% open source.

Use cases

Knox allows an application to talk to multiple Hadoop clusters and related components through a single entry point using any application-friendly non-Kerberos authentication mechanism. A few common use cases are listed below:

  • To authenticate using an application token and put/get files to/from HDFS on a specific cluster
  • To authenticate using an application token and trigger an Oozie job
  • To run a Hive script using WebHCat
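As an illustration, the sketch below shows how an application might read a file from HDFS through Knox over HTTPS. The gateway host, the topology name (default), and the X-App-Token header are assumptions made for this example; the REST path follows the usual Knox pattern of /gateway/&lt;topology&gt;/webhdfs/v1/&lt;path&gt;, and a stock Knox topology would typically use HTTP Basic authentication instead of an application token.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: read an HDFS file through the Knox gateway using WebHDFS.
public class KnoxWebHdfsExample {
    public static void main(String[] args) throws Exception {
        // OPEN streams file content; other WebHDFS operations (LISTSTATUS, MKDIRS, ...) work the same way.
        URL url = new URL("https://knox.example.com:8443/gateway/default"
                + "/webhdfs/v1/tmp/sample.txt?op=OPEN");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Hypothetical application-token header, supplied via the environment in this sketch.
        String appToken = System.getenv("APP_TOKEN");
        conn.setRequestProperty("X-App-Token", appToken == null ? "" : appToken);
        conn.setInstanceFollowRedirects(true); // WebHDFS may redirect OPEN to a data endpoint
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}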

Enhancements

Authentication using application tokens — The applications and services in the eBay backend use a custom token-based authentication mechanism. To take advantage of the existing application credentials, we enhanced Knox to support our application token-based authentication mechanism in addition to Kerberos. Knox utilizes the Hadoop Authentication framework, which is flexible enough to plug in new authentication mechanisms. The steps to plug an authentication mechanism into Hadoop's HTTP interface are described in Multiple Authentication Mechanisms for Hadoop Web Interfaces.

Architecture

Figure 6: Knox Architecture at eBay

Summary

In this blog post, we describe the approach taken to improve user experience and developer productivity in using our multiple Hadoop clusters and related components. We illustrate how three open-source products, Hue, Zeppelin, and Knox, make life a lot simpler for Hadoop users. We evaluated these products, customized them for eBay's purposes, and made them available to our users so that they can carry out their projects efficiently.

Secure Communication in Hadoop without Hurting Performance

 

Apache Hadoop is used for processing big data at many enterprises. A Hadoop cluster is formed by assembling a large number of commodity machines, and it enables the distributed processing of data. Enterprises store lots of important data on the cluster, and different users and teams process this data to obtain summaries, generate insights, and extract other useful information.

Figure 1. Typical Hadoop cluster with clients communicating from all sides

Depending on the size of the enterprise and the domain, there can be lots of data stored on the cluster and a lot of clients working on the data. Depending on the type of client, its network location, and the data in transit, the security requirements of the communication channel between the client and the servers can vary. Some of the interactions need to be authenticated. Some of the interactions need to be authenticated and protected from eavesdropping. How to enable different qualities of protection for different cluster interactions forms the core of this blog post.

The requirement for different qualities of protection

At eBay, we have multiple large clusters that hold hundreds of petabytes of data and are growing by many terabytes daily. We have different types of data. We have thousands of clients interacting with the cluster. Our clusters are kerberized, which means clients have to authenticate via Kerberos.

Some clients are within the same network zone and interact directly with the cluster. In most cases, there is no requirement to encrypt the communication for these clients. In some cases, it is required to encrypt the communication because of the sensitive nature of the data. Some clients are outside the firewall, and it is required to encrypt all communication between these external clients and Hadoop servers. Thus we have the requirement for different qualities of protection.

Figure 2. Cluster with clients having different quality of protection requirements (lines in red indicate communication channels requiring confidentiality)

Also note that there is communication between the Hadoop servers within the cluster. Those servers, too, fall into the category of clients within the same network zone, that is, internal clients.

Table 1. Quality of protection based on client location and data sensitivity

  # | Client   | Data                     | QOP
  1 | Internal | Normal data              | Authentication
  2 | Internal | Sensitive data           | Authentication + Encryption + Integrity
  3 | External | Normal or sensitive data | Authentication + Encryption + Integrity

Hadoop protocols

All Hadoop servers support both RPC and HTTP protocols. In addition, Datanodes support the Data Transfer Protocol.

Figure 3. Hadoop Resource Manager with multiple protocols on different ports

RPC protocol

Hadoop servers mainly interact via the RPC protocol. As an example, file system operations like listing and renaming happen over the RPC protocol.

HTTP

All Hadoop servers expose their status, JMX metrics, and other information over HTTP. Some Hadoop servers provide additional functionality over HTTP to offer a better user experience. For example, the Resource Manager provides a web interface for browsing applications.

Data Transfer Protocol

Datanodes store and serve the blocks via the Data Transfer Protocol.

Figure 4. A Hadoop Datanode with multiple protocols on different ports

Quality of Protection for the RPC Protocol

The quality of protection for the RPC protocol is specified via the configuration property hadoop.rpc.protection. The default value is authentication on a kerberized cluster. Note that hadoop.rpc.protection is effective only in a cluster where hadoop.security.authentication is set to kerberos.

Hadoop supports Kerberos authentication and quality protection via the SASL (Simple Authentication and Security Layer) Java library. The Java SASL library supports the three levels of Quality of protection shown in Table 2.

Table 2: SASL and Hadoop QOP values

  # | SASL QOP  | Description                                                | hadoop.rpc.protection
  1 | auth      | Authentication only                                        | authentication
  2 | auth-int  | Authentication + integrity protection                      | integrity
  3 | auth-conf | Authentication + integrity protection + privacy protection | privacy

If privacy is chosen for hadoop.rpc.protection, data transmitted between client and server will be encrypted. The algorithm used for encryption will be 3DES.
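As a quick sketch, the corresponding core-site.xml entries might look like the following (the values shown are examples; hadoop.rpc.protection is honored only when Kerberos authentication is enabled via hadoop.security.authentication):

<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.rpc.protection</name>
  <!-- one of: authentication | integrity | privacy -->
  <value>privacy</value>
</property>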

Figure 5: Resource Manager supporting privacy on its RPC port

Quality of protection for HTTP

The quality of protection for HTTP can be controlled by the policies dfs.http.policy and yarn.http.policy. The policy values can be one of the following:

  • HTTP_ONLY. The interface is served only on HTTP
  • HTTPS_ONLY. The interface is served only on HTTPS
  • HTTP_AND_HTTPS. The interface is served both on HTTP and HTTPS
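For example, serving all HDFS and YARN web endpoints only over HTTPS might be configured as sketched below (HTTPS additionally requires keystore and truststore settings, typically in ssl-server.xml, which are omitted here):

<!-- hdfs-site.xml -->
<property>
  <name>dfs.http.policy</name>
  <value>HTTPS_ONLY</value>
</property>

<!-- yarn-site.xml -->
<property>
  <name>yarn.http.policy</name>
  <value>HTTPS_ONLY</value>
</property>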

Figure 6. Resource Manager supporting privacy on its RPC and HTTP ports

Quality of protection for the Data Transfer Protocol

The quality of protection for the Data Transfer Protocol can be specified in a similar way to that for the RPC protocol. The configuration property is dfs.data.transfer.protection. Like RPC, the value can be one of authentication, integrity, or privacy. Specifying this property makes SASL effective on the Data Transfer Protocol.

Figure 7. Datanode supporting privacy on all its ports

Specifying authentication as the value of dfs.data.transfer.protection forces the client to require a block access token while reading and storing blocks. Setting dfs.data.transfer.protection to privacy results in encrypting the data transfer. The algorithm used for encryption will be 3DES. It is possible to agree upon a different algorithm by setting dfs.encrypt.data.transfer.cipher.suites on both client and server sides. The only value supported is AES/CTR/NoPadding. Using AES results in better performance and security. It is possible to further speed up encryption and decryption by using Advanced Encryption Standard New Instructions (AES-NI) via the libcrypto.so native library on Datanodes and clients.
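A sketch of the corresponding hdfs-site.xml entries, enabling privacy on the Data Transfer Protocol with AES instead of the default 3DES, might look like this (values are examples):

<property>
  <name>dfs.data.transfer.protection</name>
  <value>privacy</value>
</property>
<property>
  <name>dfs.encrypt.data.transfer.cipher.suites</name>
  <value>AES/CTR/NoPadding</value>
</property>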

Performance impact

Enabling privacy comes at the cost of performance: encryption and decryption add processing overhead.

Setting hadoop.rpc.protection to privacy encrypts all communication from clients to the Namenode, from clients to the Resource Manager, from Datanodes to Namenodes, from Node Managers to Resource Managers, and so on.

Setting dfs.data.transfer.protection to privacy encrypts all data transfer between clients and Datanodes. The client could be any HDFS client, such as a map task reading data, a reduce task writing data, or a client JVM reading or writing data.

Setting dfs.http.policy and yarn.http.policy to HTTPS_ONLY causes all HTTP traffic to be encrypted. This includes the web UI for Namenodes and Resource Managers, Web HDFS interactions, and others.

While this guarantees the privacy of interactions, it slows down the cluster considerably. The cumulative effect was a fivefold decrease in cluster throughput. We had set the quality of protection for both the RPC and data transfer protocols to privacy in our main cluster, and we experienced severe delays in processing that resulted in days of backlogs in data processing. The Namenode throughput was low, data reads and writes were very slow, and this increased the completion times of individual applications. Since applications were occupying containers for five times longer than before, there was severe resource contention, which ultimately caused the application backlogs.

The immediate solution was to change the quality of protection back to authentication for both protocols, but we still required privacy for some of our interactions, so we explored the possibility of choosing a quality of protection dynamically during connection establishment.

Selective encryption

As noted, the RPC protocol and Data Transfer Protocol internally use SASL to incorporate security in the protocol. On each side (client/server), SASL can support an ordered list of QOPs. During client-server handshake, the first common QOP is selected as the QOP of that interaction. Table 2 lists the valid QOP values.

Figure 8. Sequence of client and server agreeing on a QOP based on configured QOPs

In our case, we wanted most client-server interactions to require only authentication and very few client-server interactions to use privacy. To achieve this, we implemented the SaslPropertiesResolver interface. Its getSaslProperties method is invoked during the handshake to determine the list of QOPs to offer. The default implementation of SaslPropertiesResolver simply returns the values specified in hadoop.rpc.protection and dfs.data.transfer.protection for the RPC and data transfer protocols, respectively.

More details on SASLPropertiesResolver can be found by reviewing the work on the related JIRA HADOOP-10221.

eBay’s implementation of SaslPropertiesResolver

In our implementation of SaslPropertiesResolver, we use a whitelist of IP addresses on the server side. If the client's IP address is in the whitelist, then the list of QOPs specified in hadoop.rpc.protection/dfs.data.transfer.protection is used. If the client's IP address is not in the whitelist, then we use the list of QOPs specified in a custom configuration property, hadoop.rpc.protection.non-whitelist. To avoid very long whitelists, we use CIDRs to represent IP address ranges.
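The core of this logic can be sketched in plain Java as shown below. The class is intentionally standalone for readability rather than wired into Hadoop's SaslPropertiesResolver API, and the CIDR ranges, QOP strings, and configuration names mentioned in the comments are illustrative.

import java.net.InetAddress;
import java.util.Arrays;
import java.util.List;

// Standalone sketch of whitelist-based QOP selection.
public class WhitelistQopSketch {

    // Values a real resolver would read from hadoop.rpc.protection / dfs.data.transfer.protection
    private static final String WHITELIST_QOP = "authentication,privacy";
    // Value a real resolver would read from hadoop.rpc.protection.non-whitelist
    private static final String NON_WHITELIST_QOP = "privacy";

    // Internal network ranges, kept short by using CIDRs (example values)
    private static final List<String> WHITELIST = Arrays.asList("10.0.0.0/8", "192.168.1.0/24");

    // Returns the SASL QOP list to offer to a connecting client.
    static String qopFor(InetAddress client) {
        for (String cidr : WHITELIST) {
            if (matches(cidr, client)) {
                return WHITELIST_QOP;
            }
        }
        return NON_WHITELIST_QOP;
    }

    // Minimal IPv4-only CIDR matcher, for illustration only.
    static boolean matches(String cidr, InetAddress addr) {
        try {
            String[] parts = cidr.split("/");
            int prefix = Integer.parseInt(parts[1]);
            int mask = prefix == 0 ? 0 : -1 << (32 - prefix);
            int net = toInt(InetAddress.getByName(parts[0]));
            return (net & mask) == (toInt(addr) & mask);
        } catch (Exception e) {
            return false;
        }
    }

    private static int toInt(InetAddress addr) {
        byte[] b = addr.getAddress();
        return ((b[0] & 0xff) << 24) | ((b[1] & 0xff) << 16) | ((b[2] & 0xff) << 8) | (b[3] & 0xff);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(qopFor(InetAddress.getByName("10.12.34.56")));  // authentication,privacy
        System.out.println(qopFor(InetAddress.getByName("203.0.113.7")));  // privacy
    }
}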

In our clusters, we set hadoop.rpc.protection and dfs.data.transfer.protection on both client and servers to be authentication,privacy. The configuration hadoop.rpc.protection.non-whitelist was set to privacy.
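In configuration terms, that setup is roughly the following (hadoop.rpc.protection.non-whitelist is the custom property consumed by our resolver, not a stock Hadoop property):

<property>
  <name>hadoop.rpc.protection</name>
  <value>authentication,privacy</value>
</property>
<property>
  <name>dfs.data.transfer.protection</name>
  <value>authentication,privacy</value>
</property>
<property>
  <name>hadoop.rpc.protection.non-whitelist</name>
  <value>privacy</value>
</property>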

When whitelisted clients connect to servers, they agree upon authentication as the QOP for the connection, since both the client and the servers support the QOP list authentication,privacy and the first common value is selected. The order of QOP values in the list is therefore important: if both the client and the servers were configured with privacy,authentication, they would agree upon privacy as the QOP for their connections.

Figure 9. Internal clients, external clients, and the cluster

Only clients inside the firewall are in the whitelist. When these clients connect to servers, the QOP used will be authentication as both clients and servers use authentication,privacy for RPC and data transfer protocols.

The external clients are not in the whitelist. When these clients connect to servers, the QOP used will be privacy, since the servers offer only privacy based on the value of hadoop.rpc.protection.non-whitelist. The clients in this case could be using either authentication,privacy or privacy in their configuration. If a client specifies only authentication, the connection will fail because there is no common QOP supported by the client and the servers.

Figure 10. Sequence of Hadoop client and server choosing a QOP using SaslPropertiesResolver

The diagram above depicts the sequence of an external client and a Hadoop server agreeing on a QOP. Both the client and the server support two potential QOPs, authentication,privacy. When the client initiates the connection, the server consults the SaslPropertiesResolver with the client's IP address to determine the QOPs to offer. The whitelist-based SaslPropertiesResolver checks the whitelist, finds that the client's IP address is not whitelisted, and hence returns privacy as the only QOP. The server then offers privacy as the only QOP choice to the client during SASL negotiation, so the QOP of the subsequent connection will be privacy. This requires setting up a secret key based on the cipher suites available at the client and the server.

In some cases, internal clients need to transmit sensitive data and prefer to encrypt all communication with the cluster. In this case, the internal clients can force encryption by setting both hadoop.rpc.protection and dfs.data.transfer.protection to privacy. Even though the servers support authentication,privacy, connections will use the common QOP, privacy.

Table 3: QOP list at clients, servers, and the negotiated QOP

  # | Client QOP              | Server QOP              | Negotiated QOP | Comments
  1 | Authentication, Privacy | Authentication, Privacy | Authentication | For normal clients
  2 | Authentication, Privacy | Privacy                 | Privacy        | For external clients
  3 | Privacy                 | Authentication, Privacy | Privacy        | For clients transmitting sensitive data

Selective secure communication using Apache Knox

Supporting multiple QOPs on the Hadoop protocols enables selective encryption, which protects any data transfer, whether real data or signaling data such as delegation tokens. Another approach is to use a reverse proxy like Apache Knox. In this setup, external clients have connectivity only to the Apache Knox servers, and Knox allows only HTTPS traffic. The cluster itself supports only one QOP, authentication.

Figure 11. Protecting data transfer using a reverse proxy (Apache Knox)

As shown in the diagram, the external clients interact with the cluster via the Knox servers. The transmission between the client and Knox is encrypted. Knox, being an internal client, forwards the transmission to the cluster in plain text.

Selective Encryption using reverse proxy vs SaslPropertiesResolver

As we discussed, it is possible to achieve a different quality of protection for the client-cluster communication in two ways.

  • With a reverse proxy to receive encrypted traffic
  • By enabling the cluster to support multiple QOPs simultaneously

There are pros and cons with both approaches, as shown in Table 4.

Table 4: Pros and cons of the two approaches to supporting multiple qualities of protection

SaslPropertiesResolver
  • Pro: Natural and efficient. Clients communicate directly with the Hadoop cluster using the Hadoop commands and native Hadoop protocols.
  • Con: Whitelist maintenance. The whitelist needs to be maintained on the cluster, and care should be taken to keep it small by using CIDRs so that each connection does not require a lookup against a long list of IP addresses.

Reverse proxy (Knox)
  • Pro: Closed cluster. External clients need to communicate only with the Knox servers, so all other ports can be closed.
  • Con: Additional components to maintain. A pool of Knox servers needs to be maintained to handle traffic from external clients; depending on the volume of data, this could be quite a lot of Knox servers.

Depending on the organization’s requirements, the proper approach should be chosen.

Further work

Enable separate SaslPropertiesResolver properties for client and server

A process can be a client, a server, or both. In some environments, it is desirable to use different logic for choosing the QOP depending on whether the process acts as the client or the server for a specific interaction. Currently, there is a provision to specify only one SaslPropertiesResolver via configuration. If a client needs a different SaslPropertiesResolver, it has to use a different configuration, and if the same process needs different resolvers while acting as client and as server, there is no way to do that. It would be a good enhancement to be able to specify different SaslPropertiesResolvers for the client and server sides.

Use Zookeeper to maintain whitelist of IP Addresses

Currently, the whitelist of IP addresses is maintained in local files. This introduces the problem of updating thousands of machines and keeping them all in sync. Storing the whitelist on a Zookeeper node may be a better alternative.

Currently, the whitelist is cached by each server during initialization and then reloaded periodically. A better approach may be to reload the whitelist based on an update event.

Multiple Authentication Mechanisms for Hadoop Web Interfaces

 

Apache Hadoop is a base component for Big Data processing and analysis. Hadoop servers, in general, allow interaction via two protocols: a TCP-based RPC (Remote Procedure Call) protocol and the HTTP protocol.

The RPC protocol currently allows only one primary authentication mechanism: Kerberos. The HTTP interface allows enterprises to plug in different authentication mechanisms. In this post, we are focusing on enhancing Hadoop with a simple framework that allows us to plug in multiple authentication mechanisms for Hadoop web interfaces.

diagram showing a user interfacing with Hadoop via a custom authentication mechanism

Note that the Hadoop HTTP authentication module (deployed as the hadoop-auth-<version>.jar file) is reused by different Hadoop servers like the NameNode, ResourceManager, NodeManager, and DataNode, as well as by other Hadoop-based components like HBase and Oozie.

We can follow the steps below to plug in a custom authentication mechanism.

  1. Implement the interface AuthenticationHandler, which is in the org.apache.hadoop.security.authentication.server package.
  2. Specify the implementation class in the configuration, as shown in the sketch after this list. Make sure that the implementation class is available in the classpath of the Hadoop server.
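For example, a custom handler could be wired in as follows; the class name is hypothetical, and hadoop.http.authentication.type accepts either a built-in alias (simple, kerberos) or a fully qualified AuthenticationHandler class name.

<!-- core-site.xml -->
<property>
  <name>hadoop.http.authentication.type</name>
  <!-- hypothetical custom handler class -->
  <value>com.example.auth.MyTokenAuthenticationHandler</value>
</property>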

AuthenticationHandler interface

The implementation of the AuthenticationHandler will be loaded by the AuthenticationFilter, which is a servlet Filter loaded during startup of the Hadoop server’s web server.

The definition of AuthenticationHandler interface is as follows:

package org.apache.hadoop.security.authentication.server;

import java.io.IOException;
import java.util.Properties;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.hadoop.security.authentication.client.AuthenticationException;

public interface AuthenticationHandler {
    public String getType();
    public void init(Properties config) throws ServletException;
    public void destroy();
    public boolean managementOperation(AuthenticationToken token, HttpServletRequest request, HttpServletResponse response) throws IOException, AuthenticationException;
    public AuthenticationToken authenticate(HttpServletRequest request, HttpServletResponse response) throws IOException, AuthenticationException;
}

The init method accepts a Properties object, which contains the properties read from the Hadoop configuration. Any configuration property prefixed with hadoop.http.authentication.<type> is added to the Properties object.

The authenticate method performs the actual authentication. On successful authentication, an AuthenticationToken is returned. The AuthenticationToken implements java.security.Principal and contains the following set of properties:

  • Username
  • Principal
  • Authentication type
  • Expiry time
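A minimal skeleton of a custom handler is sketched below. The X-App-Token header, the token.validity.ms property name, and the token validation logic are placeholders for this example, not names taken from any real system.

import java.io.IOException;
import java.util.Properties;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.hadoop.security.authentication.client.AuthenticationException;
import org.apache.hadoop.security.authentication.server.AuthenticationHandler;
import org.apache.hadoop.security.authentication.server.AuthenticationToken;

// Sketch of a header-token-based AuthenticationHandler.
public class MyTokenAuthenticationHandler implements AuthenticationHandler {
    public static final String TYPE = "mytoken";
    private long tokenValidityMs;

    @Override
    public String getType() {
        return TYPE;
    }

    @Override
    public void init(Properties config) throws ServletException {
        // Receives the properties prefixed with hadoop.http.authentication.<type>.
        tokenValidityMs = Long.parseLong(config.getProperty("token.validity.ms", "3600000"));
    }

    @Override
    public void destroy() {
    }

    @Override
    public boolean managementOperation(AuthenticationToken token, HttpServletRequest request,
            HttpServletResponse response) throws IOException, AuthenticationException {
        return true; // no token-management operations in this sketch
    }

    @Override
    public AuthenticationToken authenticate(HttpServletRequest request, HttpServletResponse response)
            throws IOException, AuthenticationException {
        String appToken = request.getHeader("X-App-Token"); // hypothetical header
        if (appToken == null) {
            // No credentials yet: ask the client to authenticate.
            response.setStatus(HttpServletResponse.SC_UNAUTHORIZED);
            return null;
        }
        String user = validate(appToken); // placeholder for real token validation
        AuthenticationToken token = new AuthenticationToken(user, user, TYPE);
        token.setExpires(System.currentTimeMillis() + tokenValidityMs);
        return token;
    }

    private String validate(String appToken) throws AuthenticationException {
        // A real implementation would verify the token signature and extract the user.
        if (appToken.isEmpty()) {
            throw new AuthenticationException("Invalid application token");
        }
        return appToken.split(":")[0];
    }
}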

Existing AuthenticationHandlers

A few implementations of the AuthenticationHandler interface ship with the Hadoop distribution.

  • KerberosAuthenticationHandler — Performs SPNEGO authentication.
  • PseudoAuthenticationHandler — Performs simple authentication. It authenticates the user based on the identity passed via the user.name URL query parameter.
  • AltKerberosAuthenticationHandler — Extends KerberosAuthenticationHandler and allows you to provide an alternate authentication mechanism by extending AltKerberosAuthenticationHandler. The developer has to implement the alternateAuthenticate method, which contains the custom authentication logic.

Composite AuthenticationHandler

At eBay, we would like to provide multiple authentication mechanisms in addition to Kerberos and anonymous authentication. The operators prefer to be able to turn any authentication mechanism off by modifying the configuration rather than by rolling out new code. For this reason, we implemented a CompositeAuthenticationHandler.

The CompositeAuthenticationHandler accepts a list of authentication mechanisms via the property hadoop.http.authentication.composite.handlers. This property contains a list of classes that are implementations for AuthenticationHandler corresponding to different authentication mechanisms.

diagram showing a user and a service interfacing with Hadoop via two different authentication mechanisms, Kerberos and 2FA

The properties for each individual authentication mechanism can be passed via configuration properties prefixed with hadoop.http.authentication.<type>. The following table lists the properties supported by CompositeAuthenticationHandler.

 

  # | Property                                                               | Description                                                                                     | Default value
  1 | hadoop.http.authentication.composite.handlers                         | List of classes that implement AuthenticationHandler for the various authentication mechanisms | (none)
  2 | hadoop.http.authentication.composite.default-non-browser-handler-type | The default authentication mechanism for non-browser access                                    | (none)
  3 | hadoop.http.authentication.composite.non-browser.user-agents          | List of user agents whose presence in the User-Agent header marks a request as coming from a non-browser | java,curl,wget,perl
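Putting it together, a sketch of the relevant configuration might look like the following; the composite handler's package and the second handler class are assumptions made for this example, not names taken from the patch.

<property>
  <name>hadoop.http.authentication.type</name>
  <!-- fully qualified class name of the composite handler (package assumed) -->
  <value>org.apache.hadoop.security.authentication.server.CompositeAuthenticationHandler</value>
</property>
<property>
  <name>hadoop.http.authentication.composite.handlers</name>
  <!-- Kerberos plus a hypothetical custom two-factor handler -->
  <value>org.apache.hadoop.security.authentication.server.KerberosAuthenticationHandler,com.example.auth.TwoFactorAuthenticationHandler</value>
</property>
<property>
  <name>hadoop.http.authentication.composite.default-non-browser-handler-type</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.http.authentication.composite.non-browser.user-agents</name>
  <value>java,curl,wget,perl</value>
</property>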

The source code for CompositeAuthenticationHandler is attached to the JIRA page HADOOP-10307.