HDFS Storage Efficiency Using Tiered Storage

At eBay, we run Hadoop clusters comprising thousands of nodes that are shared by thousands of users. We store hundreds of petabytes of data in these clusters. In this post, we look at how to optimize big data storage based on how frequently data is used. This approach reduces cost effectively.

Hadoop and its promise

It is now common knowledge that commodity hardware can be grouped together to create a Hadoop cluster with big data storage and computing capability. Parts of the data are stored on each individual machine, and the data processing logic runs on those same machines.

For example: A 1,000-node Hadoop cluster with storage capacity of 20 TB per node can store up to 20 petabytes (PB) of data. All these machines have sufficient computing power to fulfill Hadoop’s motto of “take compute to data.”

Temperature of data

Different types of datasets are usually stored in the clusters, which are shared by different teams running different types of workloads to crunch through the data. Each dataset is enhanced and enriched by daily and hourly feeds through the data pipelines.

A common trait of datasets is heavy initial usage: in its first days, a dataset is considered HOT, with jobs or applications reading it many times a day. Our analysis shows a definite decline in usage with age. Through the first month, the data is accessed only a few times a week and ages into WARM data. Over the next 90 days, as usage falls to a few times a month, the data becomes COLD. Finally, when the data is used very rarely, at a frequency of once or twice a year, its “temperature” is referred to as FROZEN.

Data Age                   Usage Frequency   Temperature
Age < 7 days               20 times a day    HOT
7 days < Age < 1 month     5 times a week    WARM
1 month < Age < 3 months   5 times a month   COLD
3 months < Age < 3 years   2 times a year    FROZEN

In general, a temperature can be associated with each dataset. In this case, temperature is inversely proportional to the age of the data. Other factors can affect the temperature of a particular dataset. You can also write algorithms to determine the temperature of datasets.
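
As a sketch of such an algorithm, classification by age alone takes only a few lines of shell against HDFS. The path below is hypothetical, the modification time stands in for last use, and a real implementation would also mine access counts (for example, from the HDFS audit logs):

    # Classify a dataset's temperature from its age alone (hypothetical path).
    # "%Y" prints the modification time in milliseconds since the epoch.
    path=/data/events/2015/01/01
    mtime_ms=$(hdfs dfs -stat "%Y" "$path")
    age_days=$(( ( $(date +%s) * 1000 - mtime_ms ) / 86400000 ))

    if   [ "$age_days" -lt 7 ];  then echo HOT
    elif [ "$age_days" -lt 30 ]; then echo WARM
    elif [ "$age_days" -lt 90 ]; then echo COLD
    else                              echo FROZEN
    fi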

Tiered storage in HDFS

HDFS has supported tiered storage since Hadoop 2.3.

How does it work?

Normally, a machine is added to the cluster, and local file system directories are specified to store the block replicas. The parameter used to specify the local storage directories is dfs.datanode.data.dir. Another tier, such as ARCHIVE, can be added using an enum called StorageType. To denote that a local directory belongs to the ARCHIVE tier, the directory is prefixed in the configuration with [ARCHIVE]. In theory, multiple tiers can exist, as defined by a Hadoop cluster administrator.
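
For illustration, the relevant datanode setting might look like the following hdfs-site.xml sketch. The directory paths are hypothetical; a directory with no prefix defaults to the DISK tier:

    <!-- hdfs-site.xml on a datanode: one volume in the default DISK tier
         and one in the ARCHIVE tier (paths are hypothetical) -->
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>[DISK]/grid/0/hdfs/data,[ARCHIVE]/grid/1/hdfs/archive</value>
    </property>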

For example: Let’s add 100 nodes that contain 200 TB of storage per node to an existing 1,000-node cluster having a total of 20 PB of storage. These new nodes have limited computing capability compared to the existing 1,000 nodes. Let’s prefix all the local data directories with ARCHIVE. These 100 nodes now form the ARCHIVE tier and can store 20 PB of data. The total capacity of the cluster is 40 PB, which is divided into two tiers – the DISK tier and the ARCHIVE tier. Each tier has 20 PB.

Mapping data to a storage tier based on temperature

For this example, we will store the heavily used HOT data in the DISK tier, which has nodes with better computing power.

For WARM data, we will keep most of its replicas in the DISK tier. For data with a replication factor of 3, we will keep two replicas in the DISK tier and one replica in the ARCHIVE tier.

If data is COLD, we will keep at least one replica of each block of the COLD data in the DISK tier. All the remaining replicas go to the ARCHIVE tier.

When a dataset is deemed FROZEN, which means it is almost never used, it is not optimal to store it on a node that has lots of CPU power to run many tasks or containers. We will keep it on a node that has minimal computing power. Thus, all the replicas of all blocks of FROZEN data can move to the ARCHIVE tier.
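
The stock HDFS policies approximate this mapping, though the replica splits differ slightly from the scheme above: the built-in HOT policy keeps all replicas on DISK, WARM keeps one replica on DISK and the rest on ARCHIVE, and COLD keeps every replica on ARCHIVE (the counterpart of FROZEN here). As a quick check, recent releases can list the policies they ship with:

    # List the block storage policies built into this release
    hdfs storagepolicies -listPolicies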

Data flow across tiers

When data is first added to the cluster, it is stored in the default tier, DISK. Based on the temperature of the data, one or more replicas are then moved to the ARCHIVE tier. Mover handles the movement from one storage tier to another; it works like Balancer, except that it moves block replicas across tiers. Mover accepts an HDFS path, a replica count, and destination tier information. It identifies the replicas to be moved based on that tier information and schedules the moves between source and destination datanodes.
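
As a minimal sketch, a Mover invocation looks like this; the path is hypothetical, and -p accepts one or more HDFS paths whose replicas should be reconciled with their target tiers:

    # Ask Mover to reconcile replica placement for everything under a path
    hdfs mover -p /data/events/2014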

Changes in Hadoop 2.6 to support tiered storage

Many improvements in Hadoop 2.6 further support tiered storage. You can attach a storage policy to a directory to denote it as HOT, WARM, COLD, or FROZEN. The storage policy defines the number of replicas to be located on each tier. It is possible to change the storage policy on a directory and then invoke Mover on that directory to make the policy effective.
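
A minimal sketch of that workflow, assuming the stock policy names and a hypothetical path. Command names vary slightly across releases (Hadoop 2.6 exposed the same operation through hdfs dfsadmin -setStoragePolicy); in recent releases:

    # Tag a directory with a policy, move replicas to match, then verify.
    hdfs storagepolicies -setStoragePolicy -path /data/events/2014 -policy COLD
    hdfs mover -p /data/events/2014
    hdfs storagepolicies -getStoragePolicy -path /data/events/2014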

Applications using data

Based on the data temperature, some or all replicas of data could be on either tier. But the location is transparent to applications consuming the data via HDFS.

Even though all the replicas of FROZEN data are on ARCHIVE storage, applications can still access it just like any other HDFS data. Because no computing power is available on ARCHIVE nodes, map tasks running on DISK nodes read the data from ARCHIVE nodes over the network, which increases network traffic for the applications. If this happens too frequently, you can declare the data WARM or COLD again, and Mover can move one or more replicas back to DISK.
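
Re-warming data is just the reverse: change the policy and run Mover again (hypothetical path, stock policy names):

    # Bring one or more replicas back to DISK nodes for frequently read data
    hdfs storagepolicies -setStoragePolicy -path /data/events/2014 -policy WARM
    hdfs mover -p /data/events/2014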

The determination of data temperature and the designated replica movement to pre-defined tiered storage can be fully automated.
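
As a hedged sketch of such automation, the age classification above can be combined with the policy commands in a nightly job. The directory layout, age thresholds, and policy mapping are all assumptions:

    # Hypothetical nightly job: retag dated partitions by age, then let
    # Mover reconcile replica placement in one pass.
    for dir in $(hdfs dfs -ls -d "/data/events/*" | awk '/^d/ {print $NF}'); do
      mtime_ms=$(hdfs dfs -stat "%Y" "$dir")
      age_days=$(( ( $(date +%s) * 1000 - mtime_ms ) / 86400000 ))
      if   [ "$age_days" -lt 30 ]; then policy=HOT
      elif [ "$age_days" -lt 90 ]; then policy=WARM
      else policy=COLD   # stock COLD is all-ARCHIVE, i.e. FROZEN above
      fi
      hdfs storagepolicies -setStoragePolicy -path "$dir" -policy "$policy"
    done
    hdfs mover -p /data/events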

Tiered storage at eBay

Tiered storage is enabled on one of the very large clusters at eBay. The cluster holds 40 PB of data. We added 10 PB of additional storage with limited computing power, with each new machine storing 220 TB, and marked the additional storage as ARCHIVE. We identified a few directories as WARM, COLD, or FROZEN and, based on their temperature, moved all or some of their replicas to ARCHIVE storage.

The price per GB of the ARCHIVE tier is one-fourth of the price per GB of the DISK tier. This difference is mainly because machines in the ARCHIVE tier have very limited computing power and hence cost less.

Summary

Storage without computing is cheaper than storage with computing. We can use the temperature of the data to make sure that storage with computing is wisely used. Because each block of data is replicated a few times (the default is three), some replicas can be moved to the low-cost storage based on the temperature of the data. HDFS supports tiered storage and provides the necessary tools to move data between tiers. Tiered storage is enabled on one of the very large clusters at eBay to archive data.

Benoy Antony is an Apache Hadoop committer who focuses on HDFS and Hadoop security. Benoy works as a software engineer in the Global Data Infrastructure team at eBay.

15 thoughts on “HDFS Storage Efficiency Using Tiered Storage”

  1. Wes Floyd

    Great article! Could you describe in more detail what the “Mover” application is and how it operates? Do you use Apache Falcon?

  2. steppenknee

    Useful article, thanks for sharing. We’ve also been evaluating a number of cheap-and-deep object stores for data archiving, and we’d like to potentially use them as an archive tier for HDFS. Did you look at integrating any vendor solutions like Caringo Swarm, EMC ECS, etc.? These all provide HDFS interfaces, but I’m not 100% clear on whether you can use them as a target in a Hadoop cluster.

    1. Benoy

      ebay wanted our data to be on premise (in the same datacenter) for security and better administration. So we did not try with cloud storage solutions. But could storage could work and will be interesting to try that !! Let me know how it goes if you ever try that .

  3. Manoj

    Very useful. What I understood is: running Mover on certain files/directories will check whether the blocks that belong to those files/directories are located appropriately per the policy, and if not, Mover will schedule them to be relocated. Is there a notification mechanism, or some other way to know when Mover is done?

    1. Benoy

      The Mover (like Balancer) will exit when it has finished moving all the blocks under a directory to their respective storage types, so that’s one indication.
      Another way to verify is to run fsck with the "-storagepolicies" option. This prints a detailed report of the state of block replicas across storage types and their associated storage policies.
      The details are in this JIRA: https://issues.apache.org/jira/browse/HDFS-7467

  4. Yuanbo

    Great article!
    I have one question: how can we measure the temperature of data? The age of a file is easy to obtain, but I don’t know how to record the frequency of data access. Is there a metric corresponding to this frequency in Hadoop?

    1. Benoy Antony

      We get the frequency by scanning the HDFS audit logs.
      We are working on a simple notebook to measure temperature, which we can share.

  5. rao yendluri

    Archival is fine; appreciate the article. How about reading data from any combination of those three storages (HOT, WARM, and COLD) seamlessly?

    1. Benoy Antony

      In terms of reading, HDFS really doesn’t care whether the block replicas are in Hot or Cold storage.
      So the reads will work seamlessly. Let me know if I am not understanding your question.
