Secure Communication in Hadoop without Hurting Performance


Apache Hadoop is used for processing big data at many enterprises. A Hadoop cluster is formed by assembling a large number of commodity machines, and it enables the distributed processing of data. Enterprises store lots of important data on the cluster. Different users and teams process this data to obtain summary information, generate insights, and gain other very useful information.

Figure 1 diagram of a Typical Hadoop cluster with clients communicating from all sides
Figure 1. Typical Hadoop cluster with clients communicating from all sides

Depending on the size of the enterprise and the domain, there can be lots of data stored on the cluster and a lot of clients working on the data. Depending on the type of client, its network location, and the data in transit, the security requirements of the communication channel between the client and the servers can vary. Some of the interactions need to be authenticated. Some of the interactions need to be authenticated and protected from eavesdropping. How to enable different qualities of protection for different cluster interactions forms the core of this blog post.

The requirement for different qualities of protection

At eBay, we have multiple large clusters that hold hundreds of petabytes of data and are growing by many terabytes daily. We have different types of data. We have thousands of clients interacting with the cluster. Our clusters are kerberized, which means clients have to authenticate via Kerberos.

Some clients are within the same network zone and interact directly with the cluster. In most cases, there is no requirement to encrypt the communication for these clients. In some cases, it is required to encrypt the communication because of the sensitive nature of the data. Some clients are outside the firewall, and it is required to encrypt all communication between these external clients and Hadoop servers. Thus we have the requirement for different qualities of protection.

figure 2 diagram of a hadoop cluster with clients with different quality of protection requirements
Figure 2. Cluster with clients with different quality of protection requirements (Lines in red indicate communication channel requiring confidentiality)

Also note that there is communication between the Hadoop servers within cluster. They also can fall into category of clients within the same network zone or internal clients.

Table 1. Quality protection based on client location and data sensitivity
# Client Data QOP
1 Internal Normal data Authentication
2 Internal Sensitive Authentication + Encryption + Integrity
3 External Normal or sensitive data Authentication + Encryption + Integrity

Hadoop protocols

All Hadoop servers support both RPC and HTTP protocols. In addition, Datanodes support the Data Transfer Protocol.

Figure 3 diagram of the Hadoop Resource Manager with multiple protocols on different ports
Figure 3. Hadoop Resource Manager with multiple protocols on different ports

RPC protocol

Hadoop servers mainly interact via the RPC protocol. As an example, file system operations like listing and renaming happen over the RPC protocol.


All Hadoop server expose their status, JMX metrics, etc. via HTTP. Some Hadoop servers support additional functionality to offer a better user experience over the HTTP. For example, the Resource Manager provides a web interface to browse applications over the HTTP.

Data Transfer Protocol

Datanodes store and serve the blocks via the Data Transfer Protocol.

Figure 4 diagram of a Hadoop Datanode with multiple protocols on different ports
Figure 4. A Hadoop Datanode with multiple protocols on different ports

Quality of Protection for the RPC Protocol

The quality of protection for the RPC protocol is specified via the configuration property The default value is authentication on a kerberized cluster. Note that is effective only in a cluster where hadoop.rpc.authentication is set to Kerberos.

Hadoop supports Kerberos authentication and quality protection via the SASL (Simple Authentication and Security Layer) Java library. The Java SASL library supports the three levels of Quality of protection shown in Table 2.

Table 2: SASL and Hadoop QOP values
# SASL QOP Description
1 auth Authentication only authentication
2 auth-int Authentication + Integrity protection integrity
3 auth-conf Authentication + Integrity protection + Privacy protection privacy

If privacy is chosen for, data transmitted between client and server will be encrypted. The algorithm used for encryption will be 3DES.

Figure 5 diagram of Resource Manager supporting privacy on its RPC port
Figure 5: Resource Manager supporting privacy on its RPC port

Quality of protection for HTTP

The quality of protection for HTTP can be controlled by the policies dfs.http.policy and yarn.http.policy. The policy values can be one of the following:

  • HTTP_ONLY. The interface is served only on HTTP
  • HTTPS_ONLY. The interface is served only on HTTPS
  • HTTP_AND_HTTPS. The interface is served both on HTTP and HTTPS

Figure 6 diagram of Resource Manager supporting privacy on its RPC and HTTP ports
Figure 6. Resource Manager supporting privacy on its RPC and HTTP ports

Quality of protection for the Data Transfer Protocol

The quality of protection for the Data Transfer Protocol can be specified in a similar way to that for the RPC protocol. The configuration property is Like RPC, the value can be one of authentication, integrity, or privacy. Specifying this property makes SASL effective on the Data Transfer Protocol.

Figure 7 diagram of a Datanode supporting privacy on all its ports
Figure 7. Datanode supporting privacy on all its ports

Specifying authentication as the value of forces the client to require a block access token while reading and storing blocks. Setting to privacy results in encrypting the data transfer. The algorithm used for encryption will be 3DES. It is possible to agree upon a different algorithm by setting on both client and server sides. The only value supported is AES/CTR/NoPadding. Using AES results in better performance and security. It is possible to further speed up encryption and decryption by using Advanced Encryption Standard New Instructions (AES-NI) via the native library on Datanodes and clients.

Performance impact

Enabling privacy comes at the cost of performance. Encryption and decryption require extra performance.

Setting to privacy encrypts all communication from clients to Namenode, from clients to Resource Manager, from datanodes to Namenodes, from Node Managers to Resource managers, and so on.

Setting to privacy encrypts all data transfer between clients and Datanodes. The clients could be any HDFS client like a map-task reading data, reduce-task writing data or a client JVM reading/writing data.

Setting dfs.http.policy and yarn.http.policy to HTTPS_ONLY causes all HTTP traffic to be encrypted. This includes the web UI for Namenodes and Resource Managers, Web HDFS interactions, and others.

While this guarantees the privacy of interaction, it slows down the cluster considerably. The cumulative effective will be a fivefold decrease in the cluster throughput. We had set quality of protection for both RPC and data transfer protocols to privacy in our main cluster. We experienced severe delays in processing that resulted in days of backlogs in data processing. The Namenode throughput was low, the data read/writes were very slow, and this increased the completion times of individual applications. Since applications were occupying containers for five times longer than before, there was severe resource contention that ultimately caused the application backlogs.

The immediate solution was to change the quality of protection back to authentication for both protocols, but we still required privacy for some of our interactions, so we explored the possibility of choosing a quality of protection dynamically during connection establishment.

Selective encryption

As noted, the RPC protocol and Data Transfer Protocol internally use SASL to incorporate security in the protocol. On each side (client/server), SASL can support an ordered list of QOPs. During client-server handshake, the first common QOP is selected as the QOP of that interaction. Table 2 lists the valid QOP values.

Figure 8 diagram showing sequence of client and server agreeing on a QOP based on configured QOPs
Figure 8. Sequence of client and server agreeing on a QOP based on configured QOPs

In our case, we wanted most client-server interactions to require only authentication and very few client-server interactions to use privacy. To achieve this, we implemented an interface SASLPropertiesResolver. The method getSaslProperties on SASLPropertiesResolver will be invoked during handshake to determine the potential list of QOP to be used for the handshake. The default implementation of the SASLPropertiesResolver simply returned the value specified in and for RPC and data transfer protocols.

More details on SASLPropertiesResolver can be found by reviewing the work on the related JIRA HADOOP-10221.

eBay’s implementation of SASLPropertiesResolver

In our implementation of SASLPropertiesResolver, we use a whitelist of IP addresses on the server side. If the client’s IP address is in the whitelist, then the list of QOP specified in is used. If the client’s IP address in not in whitelist, then we use the list of QOPs specified in a custom configuration, namely To avoid very long whitelists, we use CIDRs to represent IP address ranges.

In our clusters, we set and on both client and servers to be authentication,privacy. The configuration was set to privacy.

When whitelisted clients connect to servers, they will agree upon authentication as the QOP for the connection, since both client and servers support a list of QOP values, authentication,privacy. The order of QOP values in the list is important. If both client and servers support a list of QOP values as privacy,authentication, then client and servers agree upon privacy as the QOP for their connections.

Figure 9 digram showing internal clients, external clients, cluster
Figure 9. internal clients, external clients, cluster

Only clients inside the firewall are in the whitelist. When these clients connect to servers, the QOP used will be authentication as both clients and servers use authentication,privacy for RPC and data transfer protocols.

The external clients are not in the whitelist. When these clients connect to servers, QOP used will be privacy as servers offer only privacy based on the value of The clients in this case could be using authentication,privacy or privacy in their configuration. If the client only specifies authentication, the connection will fail as there will be no common QOP supported between client and servers.

Figure 10 diagram showing sequence of Hadoop client and server choosing QOP using SASL PropertiesResolver
Figure 10. Sequence of Hadoop client and server choosing QOP using SASL PropertiesResolver

The diagram above depicts the sequence of an external client and Hadoop server agreeing on a QOP. Both client and server support two potential QOP, authentication,privacy. When the client initiates the connection, the server consults the SASLPropertiesResolver with the client IP address to determine the QOP. The whitelist-based SASLPropertiesResolver checks the whitelist, finds that the client’s IP address is not whitelisted and hence offers only privacy as the QOP. The server then offers only privacy as the only QOP choice to the client during SASL negotiation. Thus the QOP of the subsequent connection will be privacy. This necessitates the need for setting up a secret key based on the cipher suites available at the client and server.

In some cases, internal clients need to transmit sensitive data and prefer to encrypt all communication with the cluster. In this case, the internal clients can force encryption by setting both and to privacy. Even though the servers support authentication,privacy, connections will use the common QOP, privacy.

Table 3: QOP list at clients, servers and the negotiated QOP.
# Client QOP Server QOP Negotiated QOP Comments
1 Authentication, Privacy Authentication, Privacy Authentication For normal clients
2 Authentication, Privacy Privacy Privacy For external clients
3 Privacy Authentication, Privacy Privacy For clients to transmit sensitive data

Selective secure communication using Apache Knox

Supporting multiple QOPs on Hadoop protocols enables selective encryption. This will protect any data transfer. The data could be real data or signal data like delegation tokens. Another approach will be to use a reverse proxy like Apache Knox. Here the external clients will have connectivity only to Apache Knox servers and Apache Knox will allow only HTTPS traffic. The cluster will support only one QOP, which is Authentication.

Figure 11 showing the process of Protecting data transfer using Reverse Proxy (Apache Knox)
Figure 11. Protecting data transfer using Reverse Proxy (Apache Knox)

As shown in the diagram, the external clients interact with the cluster via knox servers. The transmission between the client and knox will be encrypted. knox, being an internal client, forwards the transmission to the cluster in plain text.

Selective Encryption using reverse proxy vs SaslPropertiesResolver

As we discussed, it is possible to achieve a different quality of protection for the client-cluster communication in two ways.

  • With a reverse proxy to receive encrypted traffic
  • By enabling the cluster to support multiple QOPs simultaneously

There are pros and cons with both approaches, as shown in Table 4.

Table 4: Pros and cons of two approaches to support multiple qualities of protection
Approaches PROS CONS
SaslPropertiesResolver Natural and efficient. The clients can communicate directly with the Hadoop cluster using the Hadoop commands and using native Hadoop protocols. Maintenance of whitelist. The whitelist needs to be maintained on the cluster. Care should be taken to keep the whitelist at the minimal size using CIDRs to avoid lookup in a large list of IP addresses for each connection.
Reverse Proxy Closed Cluster. The external clients need to communicate only with Knox servers and hence all other ports can be closed. Maintenance of additional components. A pool of knox servers needs to be maintained to handle traffic from external clients. Depending on the amount of data, there could be a quite a lot of Knox servers.

Depending on the organization’s requirements, the proper approach should be chosen.

Further work

Enable separate SaslPropertiesResolver properties for client and server

A process can either be a client or server or both. In some environments, it is desirable to use different logic for choosing QOP depending on whether the process is a client or server for the specific interaction. Currently, there is provision to specify only one SaslPropertiesResolver via configuration. If a client needs a different SaslPropertiesResolver, it needs to use a different configuration. If the same process needs a different SaslPropertiesResolver while acting as client and server, there is no way to do that. It would be a good enhancement to be able to specify different SaslPropertiesResolver for client and server.

Use Zookeeper to maintain whitelist of IP Addresses

Currently the whitelist of IP addresses is maintained on local files. This introduces the problem of updating thousands of machines and keeping them all in sync. Storing the whitelist information on a Zookeeper node may be a better alternative.

Currently, the whitelist is cached by each server during initialization and then reloaded periodically. A better approach may be to reload the whitelist based on an update event.

Multiple Authentication Mechanisms for Hadoop Web Interfaces


Apache Hadoop is a base component for Big Data processing and analysis. Hadoop servers, in general, allow interaction via two protocols: a TCP-based RPC (Remote Procedure Call) protocol and the HTTP protocol.

The RPC protocol currently allows only one primary authentication mechanism: Kerberos. The HTTP interface allows enterprises to plug in different authentication mechanisms. In this post, we are focusing on enhancing Hadoop with a simple framework that allows us to plug in multiple authentication mechanisms for Hadoop web interfaces.

diagram showing a user interfacing with Hadoop via a custom authentication mechanism

Note that the Hadoop HTTP Authentication module (deployed as the hadoop-auth-version.jar file) is reused by different Hadoop servers like NameNode, ResourceManager, NodeManager, and DataNode as well as other Hadoop-based components like Hbase and Oozie.

We can follow the steps below to plug in custom authentication mechanism.

  1. Implement interface AuthenticationHandler, which is under the package.
  2. Specify the implementation class in the configuration. Make sure that the implementation class is available in the classpath of the Hadoop server.

AuthenticationHandler interface

The implementation of the AuthenticationHandler will be loaded by the AuthenticationFilter, which is a servlet Filter loaded during startup of the Hadoop server’s web server.

The definition of AuthenticationHandler interface is as follows:

public interface AuthenticationHandler {
    public String getType();
    public void init(Properties config) throws ServletException;
    public void destroy();
    public boolean managementOperation(AuthenticationToken token, HttpServletRequest request, HttpServletResponse response) throws IOException, AuthenticationException;
    public AuthenticationToken authenticate(HttpServletRequest request, HttpServletResponse response)throws IOException, AuthenticationException;

The init method accepts a Properties object. This contains the properties read from the Hadoop configuration. Any config property that is prefixed by hadoop.http.authentication.Type will be added to the Properties object.

The authenticate method does the job of performing the actual authentication. For successful authentication, an AuthenticationToken is returned. The AuthenticationToken implements java.user.Principal and contains the following set of properties:

  • Username
  • Principal
  • Authentication type
  • Expiry time

Existing AuthenticationHandlers

There are a few implementations of AuthenticationHandler interface that are part of the Hadoop distribution.

  • KerberosAuthenticationHandler — Performs Spnego Authentication.
  • PseudoAuthenticationHandler — Performs simple authentication. It authenticates the user based on the identity passed via the URL query parameter.
  • AltKerberosAuthenticationHandler — Extends KerberosAuthenticationHandler. Allows you to provide an alternate authentication mechanism by extending
  • AltKerberosAuthenticationHandler. The developer has to implement the alternateAuthenticate method in which to add the custom authentication logic.

Composite AuthenticationHandler

At eBay, we like to provide multiple authentication mechanisms in addition to the Kerberos and anonymous authentication. The operators prefer to turn off any authentication mechanism by modifying the configuration rather than rolling out new code. For this reason, we implemented a CompositeAuthenticationHandler.

The CompositeAuthenticationHandler accepts a list of authentication mechanisms via the property hadoop.http.authentication.composite.handlers. This property contains a list of classes that are implementations for AuthenticationHandler corresponding to different authentication mechanisms.

diagram showing a user and a service interfacing with Hadoop via two different authentication mechanisms, Kerberos and 2FA

The properties for each individual authentication mechanism can be passed via configuration properties prefixed with hadoop.http.authentication.Type. The following table lists the different properties supported by CompositeAuthenticationHandler.


# Property Description Default Value
1 hadoop.http.authentication.composite.handlers List of classes that implement AuthenticationHandler for various authentication mechanisms
2 hadoop.http.authentication.composite.default-non-browser-handler-type The default authentication mechanism for a non-browser access
3 hadoop.http.authentication.composite.non-browser.user-agents List of user agents whose presence in the User-Agent header is considered to be a non-browser. java,curl,wget,perl

The source code for CompositeAuthenticationHandler is attached to the JIRA page HADOOP-10307.

ColorizerJS — My First Hack Project at eBay


First blog on the eBay Tech Blog, whooooh!

Hello! This blog will mostly be about my project, ColorizerJS, but since this is my first post on the eBay Tech Blog, I thought I’d introduce myself. For those unaware of some of the terminology and nomenclature, I will try to provide some brief short-bracket explanations[“[]”]. If you are submerged in web software or already cool enough to know the terminology, just pretend like there are no brackets.

I started working at eBay in May after graduating from the University of California, Santa Cruz, with a bachelor’s degree in Computer Science Game Design (now called CGPM). CSGD at UCSC was a great program for front-facing developers, as games are as much about the user (or player) as they are the algorithm. Plus, at least at UCSC, I got a whole lot more experience with managing assets and interfacing between technologies in my Game Design classes than in my vanilla CS classes.

The interview process at eBay was great, not too niche-specific and not too broad, with great questions on both niche and broad ends of the interview spectrum. I was hired as a full-time IC [individual contributor] on the Search Front-End [abbreviated to FE, meaning closest to the client] engineering team. This means that whenever you enter a search query, the stuff you get back — JavaScript [functionality, the bulk of it], CSS [look and feel], HTML [structure] — is what my team handles.

I was lucky enough to be on a great team at a great time. When Hack Week 2016 rolled in mid-June, I was too new to have many responsibilities keeping me from joining the Hack Week, yet experienced enough to know our codebase a bit and have some ideas for my potential project.

After some thought, I determined something that would be fun to work on, enhance the user experience, and bring in revenue to eBay through increased buyer activity. The initial problem arose from eBay sellers often having non-standard sized images, either shorter or thinner than a square. With the current redesign of Search Results Page (SRP), the method to combat this problem was to overlay all images onto a grey bounding square. This technically fixed the problem of non-standard image dimensions, but it did not lend itself to good aesthetics. This is where ColorizerJS comes in.

Current background implementation in production — not so hot

I was inspired to do ColorizerJS by my coworker (and one of my many mentors) Yoni Medoff, who made a script that drew loaded images to an HTML Canvas, averaged the RGB [red, green, and blue channels. You definitely know this. These brackets are so condescending.] values of each image in the search results for a query, then created an RGB color out of those average channel numbers, and set the containing square (normally just white or grey) behind the image to be the average color. This was intended to provide a more visually appealing way of standardizing the size and shape of each image in search results than just slapping them on top of a grey square. Yoni’s implementation looked significantly better than just a grey square, but it could definitely be improved upon.

There were four problems with his approach:

  • It did not look good in many cases. Sometimes the generated average does match any color in an image. A strong foreground color can throw that average out of whack, and most of all, the average may not resemble the “background” at all.
  • It depended on HTML <img/> elements [images on the (meta)physical webpage], which limited use to the DOM, so that isomorphic [using the same code in a browser and on the server] server-side use was impossible and parallelized web-workers [multiple CPU threads running JavaScript] wouldn’t work.
  • It converted the <img/> to an HTML Canvas element to extract the pixel data, a slow process that required drawing to large Canvas objects before being of any use. This was not time- or memory-efficient.
  • It was very tightly coupled to our codebase, meaning that it could not be used in other parts of the site or on other websites without huge changes.

My code running average algorithm — it’s alright

I thought, “Oh snap, here’s my chance to win $10k and become VP of engineering or whatever the Hack Week prize is.” Easy! All I needed to do was come up with a way to load the image into the browser cache to reduce double-requests [when a browser makes two requests to the server for the same image, BAD] and then read that cached image as binary [1000100101], analyze the binary file to extract its RGB pixel data, organize that in a logical way to analyze, and then analyze it and determine the most likely background color of the image to be appended to the background bounding square of the image, all while keeping the analysis code modular so as to be used on the client, web-worker, or server, and independent of our codebase and modular so that I could open-source it as an NPM module to be used by anyone on any page. Mayyybe not so easy.

So now I have a considerable job — meet all these criteria in a week — but where do I start? I decided starting with a client-side [user’s browser] image decoding [decoders analyze an image and give back a string of numbers indicating the channel values of each pixel] NPM module. eBay supports JPEG and WebP (Google’s new super-great image compression format) formats, and I figured since JPEG was older, I’d have more luck with it, so off I went looking for a client-side decoder. There weren’t any client-side JPEG decoders. Only Node [server-side]. Nice. After a few PRs [Pull Request] to jpeg-js (to support native UInt8Arrays instead of just Node-specific Buffer arrays) and just like that I had a client-side JPEG decoder. Nice.

Next I had to figure out how to request the image file as binary, and found a great article on JQuery Ajax custom transport overriding. This allowed me to send the data to jpeg-js as a bitstream in a JavaScript typed-array (UInt8Array in this case). jpeg-js sadly only supported 4-channel (including alpha-transparency) output, so in my color analysis code I handled both 3-channel and 4-channel output flags as 4-channel. This increased data overhead by about 1/3 more bytes per image [since each channel of a pixel is one byte] — inconvenient but not a dealbreaker.

With my now merged pull request to support client side analysis and array type, jpeg-js analyzed my binary input (after some array conversion) and gave me back an object with a height, width, and array of bytes [number from 0 – 255], each four array indices corresponding to a pixel in the image binary. I found a great WebP analysis library called libwebp (I got the code by viewing the page source) and got it working.

Now it’s time to do some background analysis!

I started with the simple average-of-the-edge-pixels algorithm and appended the resulting color to the bounding box behind each image result, which expectedly yielded sub-par results, but at least the analysis was working. I then I decided to up the size of the pixels analyzed to around 10 pixels per side. If the image was tall I would check the outer 8 pixels on the left and right side, if it was short I would check the top and bottom. I made a function that determined two modes for each color channel, determined the weight of each mode against each other and against the average, shaving off outliers and conversely, fully using the mode if it was significant enough in the image. This yielded great results, especially for images with fairly flat background colors or heavily dominating colors.

Lookin’ pretty good

But some images, especially those with sporadic colors, colors that are very different from side to side, edges with very different colors than the rest of the image, borders, and other complex images, sometimes did not cooperate with this implementation. Some generated colors would clearly not occur at all in the picture or just not fit as a background.

WHERE IS THAT MINT COMING FROM? — Mr Nice Guy could have fixed this if he were still here.

I did not want this algorithm to be just KINDA similar to the background for an image, I wanted it to be almost EXACT. I came up with a few modifications, more modes, more pixels analyzed, some analysis of sectors and their potential significance to determine their weight, and doing analysis of 2 pixels as well as large sections of 20–30 pixels from each edge and determining their weights. Also I fine-tuned the cutoffs for modes and averages to be more likely to exclude an average and include a mode. Some modifications to weights for each sector was required before I came up with a fairly finished product.

Niiiiiiice. It actually looks like a background.

Particularly impressed with the fighter jet analysis, good job Colorizer

I presented the Colorizer at the end of the week of Hack Week 2016 and got some positive responses from the judges. Fun project, and some hard work for a little more than a week.

I hope to make all magic numbers into parameters that can be defined by the user to widen the range of use cases that ColorizerJS can apply to. I also want to make some more functions in the Colorizer that can make it work with path image files to be used on the server, satisfying my original requirement of the module being isomorphic. If it ends up looking good, I might also come up with separate background colors for each side (top and bottom or left and right) if they are different enough. This would especially work well with unprofessionally shot images in sellers’ homes that might have vastly different upper and lower backgrounds in their images.

Soon I will get the code up and tested on Search Results Page, working with Web workers, and published as an NPM module with some examples of usage so that you and all of your friends can do some sweet image analysis. All that is needed is an image URL passed to ColorizerJS, the image is cached and analyzed, and out comes the background color!

Thanks for reading. I will update soon with more info as the project progresses. When I open-source ColorizerJS, I will post a link to the Git repo.

Big thanks to Yoni Medoff for the idea and initial implementation along with encouragement along the way, Eugene Ware for jpeg-js and the help getting it implemented, and Senthil Padmanabhan for the inspiration to write this blog.