
Hadoop - The Power of the Elephant

In a previous post, Junling discussed data mining and our need to process petabytes of data to gain insights from information. We use several tools and systems to help us with this task; the one I'll discuss here is Apache Hadoop.

Hadoop is an open source framework for fault-tolerant, scalable, distributed computing on commodity hardware. Doug Cutting created it in 2006, naming it after his son's stuffed yellow elephant, and based it on Google's 2004 MapReduce paper.

MapReduce is a flexible programming model for processing large data sets:
Map takes key/value pairs as input and generates an intermediate output of another type of key/value pair. Reduce takes each key produced in the Map step, along with the list of values associated with that key, and produces the final output of key/value pairs.

Map (key1, value1) -> list (key2, value2)
Reduce (key2, list (value2)) -> list (key3, value3)
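As a toy illustration of these signatures, here is a hypothetical word-count job in Python that simulates the map, shuffle, and reduce phases in memory (the function and variable names are invented for this sketch, not Hadoop APIs):

```python
from collections import defaultdict

# map_fn: (doc_id, text) -> list of (word, 1) intermediate pairs
def map_fn(key1, value1):
    return [(word, 1) for word in value1.split()]

# reduce_fn: (word, [counts]) -> list of (word, total) output pairs
def reduce_fn(key2, values2):
    return [(key2, sum(values2))]

def run_job(records):
    # Shuffle phase: group intermediate values by key.
    grouped = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            grouped[k2].append(v2)
    # Reduce phase: fold each key's value list into final output.
    output = []
    for k2, v2s in grouped.items():
        output.extend(reduce_fn(k2, v2s))
    return dict(output)

print(run_job([(1, "to be or not to be")]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In real Hadoop, the shuffle happens across the network between the map and reduce tasks; this sketch only shows the data flow of the programming model.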

Ecosystem

Hadoop Stack

Athena, our first large cluster, was put into use earlier this year.
Let's look at the stack from bottom to top:

  • Core – The Hadoop runtime, some common utilities, and the Hadoop Distributed File System (HDFS). The File System is optimized for reading and writing big blocks of data (128 MB to 256 MB).
  • MapReduce – provides the APIs and components to develop and execute jobs.
  • Data Access – the most prominent data access frameworks today are HBase, Pig and Hive.
    • HBase – Column oriented multidimensional spatial database inspired by Google’s BigTable. HBase provides sorted data access by maintaining partitions or regions of data. The underlying storage is HDFS.
    • Pig (Latin) – A procedural language which provides capabilities to load, filter, transform, extract, aggregate, join and group data. Developers use Pig for building data pipelines and factories.
    • Hive – A declarative language with SQL syntax used to build data warehouses. The SQL interface makes Hive an attractive choice for developers who want to quickly validate data, as well as for product managers and analysts.
  • Tools & Libraries - UC4 is an enterprise scheduler used by eBay to automate data loading from multiple sources.
    Libraries: Statistical (R), machine learning (Mahout), and mathematical libraries (Hama), and eBay's homegrown library for parsing web logs (Mobius).
  • Monitoring & Alerting – Ganglia is a distributed monitoring system for clusters. Nagios is used for alerting on key events like servers being unreachable or disks being full.

Infrastructure

Our enterprise servers run 64-bit RedHat Linux.

  • NameNode is the master server responsible for managing the HDFS.
  • JobTracker is responsible for coordinating jobs and the tasks associated with them.
  • HBaseMaster stores the root catalog for HBase and coordinates access to the blocks, or regions, of storage.
  • Zookeeper is a distributed lock coordinator providing consistency for HBase.

The storage and compute nodes are 1U units running CentOS, each with two quad-core processors and 12 to 24 TB of storage. We pack our racks with 38 to 42 of these units to build a highly dense grid.

On the networking side, we use top-of-rack switches with a node bandwidth of 1 Gbps. The rack switches uplink to the core switches at a line rate of 40 Gbps to support the high bandwidth necessary for data to be shuffled around.

Scheduling

Our cluster is used by many teams within eBay, for production as well as one-time jobs. We use Hadoop's Fair Scheduler to manage allocations, define job pools for teams, assign weights, limit concurrent jobs per user and team, set preemption timeouts and delayed scheduling.
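As a rough sketch of how those allocations are expressed, a Fair Scheduler allocations file defines pools with minimum shares, weights, job limits, and preemption timeouts. The pool and user names and all values below are invented for illustration, not our actual settings:

```xml
<?xml version="1.0"?>
<allocations>
  <!-- Hypothetical pool for a production team -->
  <pool name="search">
    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
    <maxRunningJobs>20</maxRunningJobs>
    <weight>2.0</weight>
    <minSharePreemptionTimeout>300</minSharePreemptionTimeout>
  </pool>
  <!-- Hypothetical per-user cap for ad hoc work -->
  <user name="adhoc_user">
    <maxRunningJobs>3</maxRunningJobs>
  </user>
  <userMaxJobsDefault>5</userMaxJobsDefault>
</allocations>
```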

Data Sourcing


On a daily basis we ingest about 8 to 10 TB of new data.

Road Ahead

Here are some of the challenges we are working on as we build out our infrastructure:

  • Scalability
    In its current incarnation, the master server NameNode has scalability issues. As the file system of the cluster grows, so does the NameNode's memory footprint, since it keeps the entire file system metadata in memory; roughly 1 GB of memory is needed per 1 PB of storage. Possible solutions are hierarchical namespace partitioning or leveraging Zookeeper in conjunction with HBase for metadata management.
  • Availability
    NameNode’s availability is critical for production workloads. The open source community is working on several cold, warm, and hot standby options, such as Checkpoint and Backup nodes, AvatarNodes that switch between active and standby roles, and journal metadata replication techniques. We are evaluating these to build our production clusters.
  • Data Discovery
    Supporting data stewardship, discovery, and schema management on top of a system that inherently does not impose structure. A new project proposes combining Hive's metadata store and Owl into a new system called Howl. Our effort is to tie this into our analytics platform so that our users can easily discover data across the different data systems.
  • Data Movement
    We are working on publish/subscription data movement tools to support data copy and reconciliation across our different subsystems like the Data Warehouse and HDFS.
  • Policies
    Enable good Retention, Archival, and Backup policies with storage capacity management through quotas (the current Hadoop quotas need some work). We are working on defining these across our different clusters based on the workload and the characteristics of the clusters.
  • Metrics, Metrics, Metrics
    We are building robust tools which generate metrics for data sourcing, consumption, budgeting, and utilization. The metrics exposed by some of the Hadoop enterprise servers are either insufficient or transient, which makes patterns of cluster usage hard to see.

eBay is changing how it collects, transforms, and uses data to generate business intelligence. We're hiring, and we'd love to have you come help.

Anil Madan
Director of Engineering, Analytics Platform Development

eBay at QCon San Francisco 2010

QCon, the Annual International Software Development Conference, is taking place next week in San Francisco. We’re pretty passionate about software development at eBay, so you can usually find us both on the speakers’ side of the house as well as in the audience.

This year, Randy Shoup will give a talk on Best Practices for Large Scale Web Sites, a continuation of his previous QCon presentations on the topic.
In a talk titled RESTful SOA in the Real World, Sastry Malladi will present the approach we took at eBay to describe and enable RESTful access to SOA services in an optimal and highly performant fashion.

Say hi if you see us!

Data mining and e-commerce

In the last 15 years, eBay grew from a simple website for online auctions to a full-scale e-commerce enterprise that processes petabytes of data to create a better shopping experience.

Data mining is important in creating a great experience at eBay. Data mining is a systematic way of extracting information from data. Techniques include pattern mining, trend discovery, and prediction. For eBay, data mining plays an important role in the following areas:

Product search

When the user searches for a product, how do we find the best results for the user? Typically, a user query of a few keywords can match many products. For example, “Verizon Cell phones” is a popular query at eBay, and it matches more than 34,000 listed items.

One factor we can use in product ranking is user click-through rates or product sell-through rate. Both indicate a facet of the popularity of a product page. In addition, user behavioral data gives us the link from a query, to a product page view, and all the way to the purchase event. Through large-scale data analysis of query logs, we can create graphs between queries and products, and between different products. For example, the user who searches for "Verizon cell phones" might click on the Samsung SCH U940 Glyde product, and the LG VX10000 Voyager. We now know the query is related to those two products, and the two products have a relationship to each other since a user viewed (and perhaps considered buying) both.
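The query-to-product and product-to-product graphs described above can be sketched with a few lines of Python. The click-log records and product identifiers below are invented for illustration:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical click-log records: (query, product_id clicked).
clicks = [
    ("verizon cell phones", "samsung_sch_u940"),
    ("verizon cell phones", "lg_vx10000"),
    ("verizon cell phones", "samsung_sch_u940"),
    ("honda civic parts",   "civic_brake_pads"),
]

# Query -> product edges, weighted by click count.
query_product = defaultdict(lambda: defaultdict(int))
for query, product in clicks:
    query_product[query][product] += 1

# Product -> product edges: two products are related when the
# same query led to clicks on both of them.
related = defaultdict(set)
for products in query_product.values():
    for a, b in combinations(sorted(products), 2):
        related[a].add(b)
        related[b].add(a)

print(dict(query_product["verizon cell phones"]))
# {'samsung_sch_u940': 2, 'lg_vx10000': 1}
print(sorted(related["samsung_sch_u940"]))
# ['lg_vx10000']
```

At eBay scale, the same grouping and counting is done as large MapReduce jobs over the raw query logs rather than in memory.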

We can also mine data to understand user query intent. When a user searches for “Honda Civic”, are they searching for a new car, or just repair parts of the car? Query intent detection comes from understanding the user, other users’ searches, and the semantics of query terms.

Product recommendation

Recommending similar products is an important part of eBay. A good product recommendation can save hours of search time and delight our users.

Typical recommendation systems are built upon the principle of “collaborative filtering”, where the aggregated choices of similar, past users can be used to provide insights for the current user. We do this in our new product based experience. Try viewing our Apple iPod touch 2nd generation page and scroll down -- you'll see that users who viewed this product also viewed other generations of the iPod touch and the iPod classic.
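A minimal "users who viewed this also viewed" sketch of that collaborative filtering idea, with invented user sessions and product names, might look like this:

```python
from collections import Counter, defaultdict

# Hypothetical per-user sets of viewed products.
sessions = {
    "user1": {"ipod_touch_2g", "ipod_touch_1g", "ipod_classic"},
    "user2": {"ipod_touch_2g", "ipod_classic"},
    "user3": {"ipod_touch_2g", "ipod_touch_1g"},
    "user4": {"lg_vx10000"},
}

# Count, for each product, how often other products were viewed
# by the same user.
co_views = defaultdict(Counter)
for viewed in sessions.values():
    for a in viewed:
        for b in viewed:
            if a != b:
                co_views[a][b] += 1

def recommend(product, k=2):
    # Top-k co-viewed products (ties may come back in either order).
    return [p for p, _ in co_views[product].most_common(k)]

print(recommend("ipod_touch_2g"))
```

Production systems refine this with product attributes, price ranges, and purchase (not just view) signals, as the next paragraph describes.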

Discovering item similarity requires understanding product attributes, price ranges, user purchase patterns, and product categories. Given the hundreds of millions of items sold on eBay, and the diversity of merchandise on our website, this is a challenging computational task. Data mining provides possible tools to tackle this problem, and we are always actively improving our approach to the problem.

Fraud detection

A problem faced by all e-commerce companies is misuse of our systems and, in some cases, fraud. For example, sellers may deliberately list a product in the wrong category to attract user attention, or an item sold may not be as the seller described it. On the buy side, all retailers face problems with buyers using stolen credit cards to make purchases or to register new accounts.

Fraud detection involves constant monitoring of online activities, and automatic triggering of internal alarms. Data mining uses statistical analysis and machine learning for the technique of “anomaly detection”, that is, detecting abnormal patterns in a data sequence.
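As a toy illustration of anomaly detection, the sketch below flags points that sit far from the mean of a data sequence. The seller's daily listing counts are invented, and real systems use far richer models than a simple deviation threshold:

```python
import statistics

# Hypothetical daily listing counts for one seller; the spike on the
# last day is the kind of abnormal pattern that should raise an alarm.
daily_listings = [12, 9, 11, 10, 13, 8, 11, 95]

def anomalies(series, threshold=2.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(series)
    stdev = statistics.stdev(series)
    if stdev == 0:
        return []  # constant series: nothing abnormal to flag
    return [x for x in series if abs(x - mean) / stdev > threshold]

print(anomalies(daily_listings))  # [95]
```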

Detecting seller fraud requires mining data on seller profiles, item categories, listing prices, and auction activities. By combining all of this data, we can form a complete picture and detect fraud quickly, in real time.

Business intelligence

Every company needs to understand its business operation, inventory and sales pattern. The unique problem facing eBay is its large and diverse inventory. eBay is the world’s largest marketplace for buyers and sellers, with items ranging from collectible coins to new cars. There is no complete product catalog that can cover all items sold on eBay’s website. How do we know the exact number of “sunglasses” sold on eBay? They can be listed under different categories, with different titles and descriptions, or even offered as part of a bundle with other items.

Inventory intelligence requires us to use data mining to process items and map them to the correct product category. This involves text mining, natural language understanding, and machine learning techniques. Successful inventory classification also helps us provide a better search experience and gives a user the most relevant product.
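One simple instance of such classification is a naive Bayes model over title words. The training titles and categories below are invented, and a real classifier would use far more data and features:

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled item titles.
training = [
    ("ray ban aviator sunglasses", "sunglasses"),
    ("oakley polarized sunglasses", "sunglasses"),
    ("leather wallet brown", "wallets"),
    ("mens bifold wallet", "wallets"),
]

word_counts = defaultdict(Counter)  # per-category word frequencies
category_counts = Counter()         # category priors
vocab = set()
for title, category in training:
    words = title.split()
    word_counts[category].update(words)
    category_counts[category] += 1
    vocab.update(words)

def classify(title):
    words = title.split()
    best, best_score = None, float("-inf")
    for category in category_counts:
        # log P(category) + sum of log P(word | category), Laplace-smoothed.
        score = math.log(category_counts[category] / sum(category_counts.values()))
        total = sum(word_counts[category].values())
        for w in words:
            score += math.log((word_counts[category][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = category, score
    return best

print(classify("vintage aviator sunglasses"))  # sunglasses
```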

We are seeing a growing need for data mining and its huge potential for e-commerce sites. The success of an e-commerce company is determined by the experience it offers its users, which these days is linked to data understanding. Stay tuned for exciting developments and an improved experience at eBay.

Junling Hu
Principal Data Mining Lead