GZinga: Seekable and Splittable Gzip

Coauthors: Ruchir Shah and Mahesh Somani

Generally, data compression techniques are used to conserve space and network bandwidth. Widely used compression techniques include Gzip, bzip2, lzop, and 7-Zip. According to performance benchmarks, lzop is one of the fastest compression algorithms, while bzip2 achieves a high compression ratio but is very slow. Gzip sits between the two. It is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding. The zlib software library writes compressed data with a Gzip wrapper. (Note that in this post we will use Gzip and zlib interchangeably.) With Gzip, compression is as fast as or faster than serial writes on disk, and the compression ratio is far better than with lzop. For decompression as well, Gzip performs better than many alternatives.

These are the reasons why Gzip is one of the most popular and widely used data compression techniques in the industry. However, Gzip has limitations: you cannot randomly access a Gzip file, and you cannot split a Gzip file in the case of Hadoop map-reduce jobs. As a result, Gzip is slower in those scenarios, and it does not leverage the power of parallel processing.

The eBay Cloud Platform team is happy to announce the open-source project GZinga, which aims to provide two additional capabilities for Gzip-compressed files:

  • Seekable: Random search within Gzip files
  • Splittable: Division of a Gzip file into multiple chunks


It is common to collect logs from production applications and use them to debug and triage issues. Log files are generally rolled over periodically (hourly, daily) or based on size. To save disk space, log files are stored in a compressed format such as Gzip.

In most outage scenarios, developers are interested in the logs for a particular activity or around a certain timestamp, and scanning an entire file to find a particular time period is costly. The data for these logs can be stored on commodity storage like Hadoop. However, to take advantage of Hadoop's capabilities, large Gzip files (beyond a few hundred MB) need to be processed in parallel as small chunks.

The idea of GZinga originated as a means to provide optimal performance for reading/processing Gzip files. Though we have looked at options for log files, the technique we’ve developed can apply to other use cases—textual documents, web crawls, weather data, etc.

Seekable Gzip: write

In this section, we will discuss generating Gzip files with meta information that will be used to perform random access in those files. We use two techniques from the Gzip specification—multiple compressed blocks and the Gzip header flags format—to make files seekable.

Multiple compressed blocks

A Gzip file consists of a series of “members” (compressed data sets). The format of each member is specified as follows:

   <Header>Compressed Data<Footer>

As a Gzip file can have a series of such members, an actual Gzip file can look like this:

   <Header>Compressed Data<Footer><Header>Compressed Data<Footer>….

Such multiple compressed blocks can be achieved using the Z_FULL_FLUSH option, according to the zlib manual. These blocks help us in reading a file from any arbitrary location after jumping to its inner header location. The main advantages of this feature are that it is not necessary to uncompress the entire file and it is possible to read from different locations. Only the requested compressed block will be uncompressed, which improves performance considerably.
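
For illustration, here is one way to produce such a series of members using only the JDK's standard classes. This sketch is not GZinga's writer (which manages headers and flush points itself), but the resulting file has the same member-after-member layout and remains readable by any standard Gzip tool:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.zip.GZIPOutputStream;

    public class MultiMemberGzipDemo {
        public static void main(String[] args) throws IOException {
            try (OutputStream raw = new FileOutputStream("multi-member.gz")) {
                for (int block = 0; block < 3; block++) {
                    // Each GZIPOutputStream instance writes one complete member
                    // (header + compressed data + footer) to the same file.
                    GZIPOutputStream gz = new GZIPOutputStream(raw);
                    gz.write(("block " + block + " payload\n").getBytes("UTF-8"));
                    gz.finish(); // writes the member's footer without closing `raw`
                }
            }
            // `gunzip -c multi-member.gz` decompresses all members in sequence.
        }
    }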

Because the deflater is reset for every block, compression efficiency (both size and processing time) diminishes. In practice, however, we have observed that the impact on compression is minimal when data is compressed with zlib. The following table shows the difference in compression and timing for logs, with and without the multiple-blocks approach. In all of these scenarios, 60 blocks are written—1 per minute for an hourly log file.

Original file size (MB) | Time with compressed blocks (sec) | Time without compressed blocks (sec) | Δ increase in time (sec) | File size with compressed blocks (MB) | File size without compressed blocks (MB) | % increase in file size
8.12    | 0.41  | 0.387 | 0.023 | 1.078   | 1.076   | 0.19
105.5   | 3.685 | 3.49  | 0.195 | 13.982  | 13.981  | 0.007
527.86  | 14.54 | 14.12 | 0.420 | 69.904  | 69.903  | 0.001
2111.47 | 53.44 | 52.6  | 0.840 | 279.614 | 279.6   | 0.0005

It is evident that when the size per block is greater than 1 MB, the processing and space overhead is minimal and can be ignored. (A smaller block size is generally unnecessary, since blocks of this size are already fast to decompress.)

Header flags format

The header section has a provision for extra comments, according to the Gzip specification. Each header has the following format:

|ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)
  1   2   3   4   5   6   7   8   9   10

The 4th byte of the header represents the flag information:

bit 0   FTEXT
bit 1   FHCRC
bit 2   FEXTRA
bit 3   FNAME
bit 4   FCOMMENT
bit 5   reserved
bit 6   reserved
bit 7   reserved

When bit 4 (FCOMMENT) is set, a zero-terminated comment follows the fixed header fields. We can use this flag to store an index for the file. The interface for providing the index is shown below in the GZipOutputStreamRandomAccess class:

class GZipOutputStreamRandomAccess extends DeflaterOutputStream {

    // Index: each key maps to the byte offset of its compressed block in the gzip file.
    private Map<Long, Long> offsetMap = new LinkedHashMap<Long, Long>();

    /** Adds the current byte offset (in the gzip file) for the given key. */
    public void addOffset(Long key) { ... }

    /** Writes a header whose extra comment contains the entries of offsetMap. */
    private void writeHeaderWithComment() { ... }
}

The comment section in the header contains the index entries collected up to that point. Note that standard Gzip libraries ignore the comment field (they do not look for comments or extra fields), so indexed files remain fully compatible. The comments are added as name/value pairs, as shown below:

|ID1|ID2|CM |FLG|     MTIME     |XFL|OS | key1:value1;key2:value2;........;0|
  1   2   3   4   5   6   7   8   9   10  Index in specific format ending with 0
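
For concreteness, here is a sketch of how such a header can be assembled by hand per the Gzip specification. The helper name is ours; GZinga's writeHeaderWithComment() presumably produces an equivalent byte layout:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    public class GzipHeaderSketch {
        /** Builds a gzip member header whose FCOMMENT field carries the index string. */
        static byte[] headerWithComment(String comment) throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            out.write(0x1f); out.write(0x8b);          // ID1, ID2: gzip magic bytes
            out.write(8);                               // CM: deflate
            out.write(1 << 4);                          // FLG: FCOMMENT (bit 4) set
            for (int i = 0; i < 4; i++) out.write(0);   // MTIME: not recorded
            out.write(0);                               // XFL: no extra flags
            out.write(255);                             // OS: unknown
            out.write(comment.getBytes("ISO-8859-1"));  // e.g. "key1:value1;key2:value2;"
            out.write(0);                               // zero terminator for the comment
            return out.toByteArray();
        }
    }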

At the time of the file close operation, an extra block is added whose header contains information about all indices, with a footer and no data. This final block can be located deterministically while reading, as described in the next section. The comment block increases the effective file size, as shown below (in a test with 60 blocks):

Original file size (MB) | Time with index data (sec) | Time without index data (sec) | Δ increase in time (sec) | File size with index data (MB) | File size without index data (MB) | % increase in file size
8.12    | 0.582 | 0.387 | 0.195 | 1.154  | 1.08  | 6.85
105.5   | 3.926 | 3.49  | 0.436 | 14.09  | 13.98 | 0.79
527.86  | 15    | 14.12 | 0.880 | 70.02  | 69.9  | 0.17
2111.47 | 54.35 | 52.6  | 1.750 | 279.72 | 279.6 | 0.04

Seekable Gzip: read

To take advantage of an index written in the header, the file is first opened for random access and the last few chunks of data are read. The final header is located by byte comparison (scanning in reverse order), and the index information is extracted and loaded into memory. Given a requested key, the reader jumps to the corresponding location and hands the stream to Gzip for decompression of that block. The interface for jumping to the desired location and reading the metadata is shown below in the GZipInputStreamRandomAccess class:

class GZipInputStreamRandomAccess extends GZIPInputStream {

    /** Returns metadata (index) information for the given file. */
    public Map<Long, Long> getMetadata();

    /** Jumps to the location for the specified key; if the key does not
        exist, jumps to the beginning of the file. */
    public void jumpToIndex(Long index) throws IOException;
}


Reading the index and then jumping to the desired location improves performance dramatically. The following table provides the numbers for jumping to the last block of data and reading one record from there:

Original file size (MB) | Compressed file size (MB) | Time with index data (sec) | Time without index data (sec)
8.12    | 1.08  | 0.162 | 0.266
105.5   | 13.98 | 0.166 | 0.87
527.86  | 69.9  | 0.167 | 3.59
2111.47 | 279.6 | 0.17  | 13.71

We also observed that the jump takes roughly the same amount of time regardless of which index is requested before reading one record. Overall, this approach provides a significant performance improvement, with the extent depending on file size (e.g., a 10-50x gain).

Random access in Hadoop

SeekableGZipDataInputStream is an extension of Hadoop’s FSDataInputStream class to enable random access for Gzip files that are stored inside Hadoop. Using this extended class, one can jump to any location in a Gzip file stored in Hadoop and read required data from there with much faster performance. Here is an example of using this class to perform random access:

FSDataInputStream fin = fs.open(new Path("testfile"));
long len = fs.getFileStatus(new Path("testfile")).getLen();
SeekableGZipDataInputStream sin = new SeekableGZipDataInputStream(fin, len);
GZipInputStreamRandomAccess gzin = new GZipInputStreamRandomAccess(sin);
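
Continuing the example, the index can then be consulted and used to jump before reading. The key 42 here is illustrative; keys are whatever the writer passed to addOffset:

    // Load the index, jump to the block recorded under key 42, and read a record.
    Map<Long, Long> index = gzin.getMetadata();
    if (index.containsKey(42L)) {
        gzin.jumpToIndex(42L);
    }
    BufferedReader reader = new BufferedReader(new InputStreamReader(gzin));
    String record = reader.readLine();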

Splittable Gzip

In many Hadoop production environments, Gzipped files arrive as the raw input. When these files are fed to Hadoop, an MR job runs with only one map task, as a Gzip file is not splittable. This is the biggest disadvantage of Gzip in this setting, because it defeats the real power of Hadoop. But when Gzip files are generated with the Seekable approach described above, it becomes possible to split a file and feed each chunk to an individual map task for processing.

Determining the split

A Gzip file with multiple compressed blocks allows the file to be split at the start of a new header. In our approach, we implemented Hadoop's SplittableCompressionCodec interface for the GzipCodec class. When a split is invoked with given "start" and "end" positions, the codec scans forward from each position to locate the start of the next block header and moves the corresponding pointer there. Compressed data before the adjusted "start" position is handled by the previous split: that split's "end" and this split's "start" both point to the same next header, so every block is processed exactly once.
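
The core of the boundary adjustment is a forward scan for the next member header. The following is a simplified sketch under our assumptions, not GZinga's actual implementation; a production codec would validate more of the header (for example, CM == 8) to avoid false positives on payload bytes that happen to contain the magic sequence:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;

    public class GzipSplitHelperSketch {
        /** Returns the offset of the next gzip member header (magic bytes 0x1f 0x8b)
            at or after pos, or end if none is found in the range. */
        static long nextHeader(FSDataInputStream in, long pos, long end) throws IOException {
            in.seek(pos);
            int prev = -1;
            for (long p = pos; p < end; p++) {
                int b = in.read();
                if (b < 0) return end;                        // EOF: nothing left for this split
                if (prev == 0x1f && b == 0x8b) return p - 1;  // header starts at the 0x1f byte
                prev = b;
            }
            return end;                                       // no header in this range
        }
    }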

Configuration changes

In order to use this feature, one needs to set the io.compression.codecs property in the configuration object before submitting the MR job. Instead of org.apache.hadoop.io.compress.GzipCodec for Gzip files, the value should be io.gzinga.hadoop.SplittableGzipCodec:
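
A minimal driver-side sketch (the property names are standard Hadoop; the 64 MB cap anticipates the split-size discussion below):

    Configuration conf = new Configuration();
    // Register GZinga's splittable codec in place of the stock GzipCodec.
    conf.set("io.compression.codecs", "io.gzinga.hadoop.SplittableGzipCodec");
    // Cap split size at 64 MB so the file is divided among several map tasks.
    conf.set("mapreduce.input.fileinputformat.split.maxsize",
             String.valueOf(64 * 1024 * 1024));
    Job job = Job.getInstance(conf, "wordcount-on-gzinga");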


This value can also be set in the mapred-site.xml file to take effect for all MR jobs.

Another important property is the split size. mapreduce.input.fileinputformat.split.maxsize indicates the maximum split size; the number of splits depends on this value and the actual Gzip file size. More splits generally mean a faster MR job, up to the parallelism the cluster can provide. Below are performance numbers showing the improvement with respect to splits (the input Gzip file is 280 MB, and the MR job performs a wc on the file):

Split size (MB) | # of splits | Avg map time (sec) | Job time (sec)
32  | 9 | 24  | 50
64  | 5 | 36  | 55
128 | 3 | 52  | 79
300 | 1 | 127 | 139

As these performance numbers show, a map task took 127 seconds when the job ran with a single split, compared to 52 seconds when it ran with 3 map tasks. And as the split size decreases (and the number of splits increases), map task time decreases accordingly.


Gzip provides one of the best compression algorithms and is widely used in the industry. But without support for random access and file splitting, it cannot leverage the power of parallel and distributed processing systems like Hadoop. With the introduction of the Seekable and Splittable features, both can now be achieved with high performance.

Company-wide Hack Week Encourages Innovation

What happens when you give more than 2500 employees an entire week to develop an innovation concept they are passionate about? Amazing things happen.

Last month, eBay Chief Product Officer RJ Pittman and Chief Technology Officer Steve Fisher piloted a new program, Hack Week, that gave everyone in their organization an entire week to design, develop, and deliver a new innovation concept.

In a bold move, we adjusted our roadmap commitments to give our innovators a dedicated week to focus on the ideas that inspire them and create innovative concepts that deliver value for our customers. From our interns, to new hires fresh out of college, to design experts, to data researchers, to our most seasoned programmers, everyone came together in the spirit of innovation.

More than 400 concepts were submitted across 7 major eBay offices around the world. Concepts included wearables, on-demand manufacturing, digital goods, computer vision, cryptocurrencies, security, natural language processing, machine learning, voice recognition, search science, and more. Innovations spanned product marketing, policy, business process, data insights, and technology infrastructure. Add in exciting activities like dueling drones, battling robots, and virtual reality gaming, and the campus was buzzing with excitement.

One of our innovators, Robinson Piramuthu, shared his experience:

An event like Hack Week is like a marathon and tests our perseverance and rapid coding. We were allowed to choose the topics that inspired us, to be entrepreneurs driving the ideas that will transform eBay.

Our project involves computer vision, which is an emerging topic at eBay with still-developing resources and infrastructure. The challenge is how to put a complex system together in one week, from the UI to the back-end servers. Prior to hack week, a lot of planning happened mentally – picturing how the product will look like, what features it will support, which features exist (in some other form), and which needed to be developed in short time.

During Hack Week there were some organic unexpected developments that involved code optimization, UI simplification, and believe it or not, algorithm development. The most difficult day was the day before the final day. We encountered some problems with the back-end server that we postponed since we were focusing a lot on the front end. But it was a treat when we fixed the problem.

It was great to have full support from our eBay leaders during Hack Week. The continued focus without interruptions made it very productive.

Each location recognized its innovators with local awards, and several concepts are already underway having picked up executive sponsors during Hack Week demos. The best from each location will now move on to be reviewed by the executive leadership team to get their sponsorship in pursuing some of the larger, cross-domain innovation concepts.

Events like Hack Week, which are part of eBay's larger Innovation Programs, are critical drivers of our continuing cultural transformation at eBay. They inspire our talent, help them develop new skills, and give them a channel for their voices to be heard. Driven by our purpose of delivering connected commerce and creating economic opportunity, our mandate to innovate to help our customers succeed is stronger than it has ever been.

Application Resiliency Using Netflix Hystrix

Resilience is the ability of the network to provide and maintain an acceptable level of service in the face of various faults and challenges to normal operation.
Ever since the term services, and more recently microservices, came into use, application developers have been converting monolithic APIs into simple, single-function microservices. However, such conversions come with the cost of ensuring consistent response times and resiliency when certain dependencies become unavailable. For example, a monolithic web application that performs a retry for every call is potentially resilient to some extent, as it can recover when certain dependencies (such as databases or other services) are unavailable, and this resilience comes without any additional network or code complexity.

For a service that orchestrates numerous dependencies, each invocation is costly, and a failure can lead to diminished user experience as well as to higher stress on the underlying system that is attempting to recover from the failure.

Circuit breaker pattern

Consider a typical use case: an e-commerce site is overloaded with requests on Black Friday, and the vendor providing payment operations goes offline for a few seconds under the heavy traffic. Users begin to see long wait times for their checkouts due to the high concurrency of requests. These conditions also leave all of the application servers clogged with threads waiting for a response from the vendor. After a long wait, the eventual result is a failure.

Such events lead to abandoned carts or users trying to refresh or retry their checkouts, increasing the load on the application servers—which already have long-waiting threads, leading to network congestion.

A circuit breaker is a simple structure that remains constantly vigilant, monitoring for faults. In the scenario above, the circuit breaker identifies the long wait times on calls to the vendor and fails fast, returning an error response to the user instead of making threads wait. Thus the circuit breaker spares users a badly degraded response time.

The basic idea behind the circuit breaker is very simple. You wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all. Usually you’ll also want some kind of monitor alert if the circuit breaker trips.

-Martin Fowler
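
To make Fowler's description concrete, here is a minimal, single-threaded sketch of the pattern (ours, not Hystrix's implementation): wrap a protected call, trip after a threshold of consecutive failures, and fail fast while the circuit is open:

    import java.util.concurrent.Callable;

    class SimpleCircuitBreaker<T> {
        private final int threshold;   // consecutive failures before tripping
        private final long sleepMs;    // how long to stay open before retrying
        private int failures = 0;
        private long openedAt = -1;

        SimpleCircuitBreaker(int threshold, long sleepMs) {
            this.threshold = threshold;
            this.sleepMs = sleepMs;
        }

        T call(Callable<T> protectedCall) throws Exception {
            if (openedAt >= 0 && System.currentTimeMillis() - openedAt < sleepMs) {
                throw new IllegalStateException("circuit open: failing fast");
            }
            try {
                T result = protectedCall.call();
                failures = 0;
                openedAt = -1;                             // success closes the circuit
                return result;
            } catch (Exception e) {
                if (++failures >= threshold) {
                    openedAt = System.currentTimeMillis(); // trip the circuit
                }
                throw e;
            }
        }
    }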

Recovery time is crucial for the underlying resource, and having a circuit breaker that fails-fast without overloading the system ensures that the vendor can recover quickly.

A circuit breaker is an always-live system keeping watch over dependency invocations. In case of a high failure rate, it stops calls from going through for a short period of time, responding immediately with a standard error instead.

Circuit breakers in eBay

Previously, we used a simple setup called AUTO_MARK_DOWN, which prevented such long-wait problems by short-circuiting invocations of a dependency until it was brought back up via MARK_UP. A bot periodically checked for AUTO_MARK_DOWN on various machines for each dependency and performed the MARK_UP.

However, the bot and MARK_UP infrastructure were not embedded within the system; they sat outside it. In the absence of live, continuous feedback about request volume and failure rates, the MARK_UP of a failed dependency would occur without any verification of its availability. The setup also produced false positives, as the bot ran outside the client and could not evaluate whether failures were continuing.

Another major flaw was the absence of comprehensive, real-time monitoring across all of an application's dependencies. The old system was slow and erratic, had no ongoing telemetry, and blindly marked all systems back up on the assumption that the application would AUTO_MARK_DOWN a dependency again if failures persisted. The result was unpredictable behavior and incorrect evaluation.

Recovery in a circuit breaker

A circuit breaker takes care of tripping the dependencies at the appropriate time. However, a more sophisticated system needs to continue the vigilance to determine if the dependency is available, and if so to close the circuit again to let dependent calls go through.

This behavior can be achieved in two ways:

  1. Allow all calls to go through during a regular time interval and check for errors.
  2. Allow one single call to go through at a more frequent rate to gauge the availability.

AUTO_MARK_DOWN was a variant of type 1: the circuit was closed without any proof of recovery, relying on subsequent errors to reveal an ongoing issue.

Type 2 is a more sophisticated mechanism. It does not let multiple calls through, because those calls might take a long time to execute and still fail; instead, it allows a single trial call at a more frequent rate, enabling faster closure of the circuit and revival of the system.

Ideal circuit breaker

A harmonious system is one where we have an ideal circuit breaker, real-time monitoring, and a fast recovery variable setup, making the application truly resilient.

Circuit Breaker + Real-time Monitoring + Recovery = Resiliency

– Anonymous

Using the example of the e-commerce site from above, with a resilient system in place, the circuit breaker keeps an ongoing evaluation of the faults from the payments processor and identifies long wait times or errors from the vendor. On such occurrences, it breaks the circuit, failing fast. As a result, users are notified of the problem and the vendor has enough time to recover.

In the meantime, the circuit breaker also keeps sending one request at regular intervals to evaluate if the vendor system is back again. If so, the circuit breaker closes the circuit immediately, allowing the rest of the calls to go through successfully, thereby effectively removing the problem of network congestion and long wait times.

Netflix Hystrix

Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.


From its inception in 2012, Hystrix has become the go-to solution for many systems attempting to improve their ability to sustain failures. It has a fairly mature API and a highly tunable configuration system, enabling application developers to make optimal use of their underlying service dependencies.

Circuit breaker (CB) states for Hystrix

The following state diagram and narrative depict how a resilient system functions during the various states of a circuit breaker's lifecycle.

Circuit Breaker State Diagram

Normal function (Closed)

When a system is functioning smoothly, success counters reflect its resiliency, while failure gauges track any errors. When the failure threshold is reached, the circuit breaker opens the circuit to prevent further calls to the dependent resource.

Failure state (Open)

At this juncture, every call to the dependency is short-circuited with a HystrixRuntimeException and FailureType of SHORTCIRCUIT, giving clear indication of its cause. Once the sleepInterval passes, the Hystrix circuit breaker moves into a half-open state.

Half-open state

In this state, Hystrix takes care of sending the first request to check system availability, letting other requests fail-fast until the response is obtained. If the call is successful, the circuit breaker is reset to Closed; in case of failure, the system goes back to the Open state, and the cycle continues.
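
These thresholds and the sleep window are tunable through Hystrix's property setters. A sketch with illustrative values (the group key name here is made up):

    HystrixCommand.Setter config = HystrixCommand.Setter
        .withGroupKey(HystrixCommandGroupKey.Factory.asKey("PaymentsGroup"))
        .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
            // Trip the circuit when at least 50% of requests fail...
            .withCircuitBreakerErrorThresholdPercentage(50)
            // ...within a rolling window that saw at least 20 requests...
            .withCircuitBreakerRequestVolumeThreshold(20)
            // ...then stay Open for 5 seconds before the half-open trial request.
            .withCircuitBreakerSleepWindowInMilliseconds(5000));
    // Pass `config` to the command's constructor via super(config).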

How to use Hystrix

The Hystrix GitHub wiki has comprehensive documentation on how to use the library. Using it is as simple as creating a command class that invokes Hystrix for service consumption.

    public class CommandHelloWorld extends HystrixCommand<String> {

        private final String name;

        public CommandHelloWorld(String name) {
            // Every command belongs to a group, used for thread-pool and metrics grouping.
            super(HystrixCommandGroupKey.Factory.asKey("ExampleGroup"));
            this.name = name;
        }

        @Override
        protected String run() {
            // a real example would do work like a network call here
            return "Hello " + name + "!";
        }
    }
Reference: https://github.com/Netflix/Hystrix/wiki/Getting-Started

Internally, this class uses the RxJava library to invoke service dependencies asynchronously. This design helps the application manage its resources intelligently, using application threads to their maximum potential. For application developers who perform parallel processing and manage their dependencies using lazy invocations, Hystrix also exposes Future<?> and Observable<?>.
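
Per the Getting Started guide, the same command can be invoked synchronously, asynchronously, or reactively:

    String s = new CommandHelloWorld("World").execute();             // blocking
    Future<String> f = new CommandHelloWorld("World").queue();       // asynchronous
    Observable<String> o = new CommandHelloWorld("World").observe(); // reactive (RxJava)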

How we use Hystrix in eBay

Multiple applications within eBay have started using Hystrix, either as a standalone library or with our platform wrappers. Our platform wrappers expose the Hystrix configurations as JMX beans for centralized management. They also inject custom Hystrix plugin implementations to capture the real-time metrics being published and feed them to the site monitoring systems for critical applications.

The Hystrix dashboard is integrated as part of the core server-monitoring systems, enabling teams to view how their application dependencies are performing during various times of the day.

The execution hook provided by Hystrix is a critical component of this integration, as it helps monitor/alert various failures in real time—especially on errors and fallback failures, thereby helping investigate and resolve issues more quickly with little to no user impact.

eBay example: Secure Token service

eBay hosts a slew of service APIs for both internal and external consumption. All of these services are authenticated via tokens, with the Secure Token service acting as the issuer and validator of these tokens. The Guards in all of the services are now upgraded with the Hystrix-based circuit breaker, which enables the Secure Token service to be highly available. In times of heavy traffic from one of the services, the circuit breaker for that service trips and opens the circuit, failing calls only to that specific service while allowing the other services to function normally.

Secure Token Service protected using Hystrix


The circuit breaker is the default one available through the Hystrix library. The functioning of the circuit breaker can be summarized as follows:

  1. Every incoming call is verified against the current state of the circuit breaker.
  2. A Closed state of the Circuit allows the requests to be sent through.
  3. An Open state fails all requests.
  4. A Half-Open state (entered when the sleep time completes) allows one request to go through and, on success or failure, moves the circuit to the Closed or Open state as appropriate.


Hystrix is not just a circuit breaker but a complete library with extensive monitoring capabilities that can easily be plugged into existing systems. We have started exploring the library's Request Collapsing and Request Caching abilities for future use cases. A few other Java-based implementations are available, such as the Akka and Spring circuit breakers, but Hystrix has proven to be a sound and mature library for maintaining a resilient environment for our critical applications, keeping them highly available.