Author Archives: Junling Hu

Natural Language Processing and eBay Listings

You’ve heard before on this blog about the difference between products and items on eBay: products use a well-defined structure to describe product information, whereas items let a seller enter free-form text to describe what’s for sale. To help buyers find what they’re looking for, how can we extract relevant information from these unstructured item titles and make them comparable to products?

Natural language processing (NLP) can be used in this context. In a paper titled “Bootstrapped Named Entity Recognition for Product Attribute Extraction”, we present a named entity recognition (NER) system for extracting product attributes and values from listing titles.

These titles pose some unique challenges for NLP:

  • They’re relatively short
  • Often they’re just a list of nouns without any grammatical structure
  • They contain abbreviations and acronyms, and even typographical errors
  • There is no contextual information that could help in identifying product attributes

We combine supervised NER with bootstrapping to expand the seed list, and output normalized results. Focusing on listings from eBay’s fashion categories, our bootstrapped NER system is able to identify new brands corresponding to spelling variants and typographical errors of the known brands, as well as identify novel brands. Among the top 300 new brands predicted, our system achieves 90.33% precision. To output normalized attribute values, we explore several string comparison algorithms and find n-gram substring matching to work well in practice.
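To make the normalization step more concrete, here is a minimal sketch of character n-gram matching in Python. It is an illustration only, not the implementation from the paper: the function names, the use of the Dice coefficient as the overlap score, the trigram size, and the 0.5 threshold are all assumptions made for this example.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Dice coefficient over character n-gram sets, in [0, 1] (an assumed scoring choice)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def normalize_brand(candidate, known_brands, threshold=0.5):
    """Map a candidate string to the closest known brand, or None if nothing is close enough."""
    if not known_brands:
        return None
    scored = [(ngram_similarity(candidate, brand), brand) for brand in known_brands]
    best_score, best_brand = max(scored, key=lambda pair: pair[0])
    return best_brand if best_score >= threshold else None


# Hypothetical seed list, used only for illustration.
known = ["Ralph Lauren", "Calvin Klein", "Michael Kors"]
print(normalize_brand("ralph laurn", known))
print(normalize_brand("xyzzy", known))
```

A misspelling such as "ralph laurn" still shares most of its trigrams with "Ralph Lauren", which is why substring-level matching tends to be forgiving of typographical errors where exact string comparison is not.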

We presented our work (*) at the international conference on Empirical Methods in Natural Language Processing (EMNLP) this July.

(*) Duangmanee Putthividhya and Junling Hu, “Bootstrapped Named Entity Recognition for Product Attribute Extraction”, Proceedings of EMNLP-2011, July 2011.

-Junling Hu
Principal Data Mining Lead

Next ACM Data Mining Camp at eBay

The San Francisco Bay Area Chapter of ACM, the Association for Computing Machinery, is holding its next data mining camp on Saturday, Nov 13. The event will be hosted by eBay on our San Jose North Campus (2161 N First St, San Jose, CA 95131). So far, more than 345 participants have signed up.

This one-day data mining camp features an expert panel discussion and more than 12 individual sessions. The expert panel includes Neel Sundaresan, Senior Director Research Labs at eBay; Omid Madani, Senior Researcher at SRI; Susan Holmes, Professor at Stanford University; and Lionel Jouffe, CEO and co-founder of Bayesia.

Individual session topics include, among others, Statistical Design of Experiments, Large-scale Supervised Learning: Parallel Implementation, Text Mining for Financial Market Prediction, and Manipulating Very Large Data Sets with R.

For more information, please visit the event website.

See you next Saturday!

Data mining and e-commerce

In the last 15 years, eBay grew from a simple website for online auctions to a full-scale e-commerce enterprise that processes petabytes of data to create a better shopping experience.

Data mining, a systematic way of extracting information from data, is important in creating a great experience at eBay. Techniques include pattern mining, trend discovery, and prediction. For eBay, data mining plays an important role in the following areas:

Product search

When the user searches for a product, how do we find the best results for the user? Typically, a user query of a few keywords can match many products. For example, “Verizon Cell phones” is a popular query at eBay, and it matches more than 34,000 listed items.

One factor we can use in product ranking is the user click-through rate or the product sell-through rate. Both indicate a facet of the popularity of a product page. In addition, user behavioral data gives us the link from a query, to a product page view, and all the way to the purchase event. Through large-scale data analysis of query logs, we can create graphs between queries and products, and between different products. For example, a user who searches for “Verizon cell phones” might click on the Samsung SCH U940 Glyde product and the LG VX10000 Voyager. We now know the query is related to those two products, and the two products have a relationship to each other since a user viewed (and perhaps considered buying) both.
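The graph construction described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a description of eBay's pipeline: the log format (one query plus a list of clicked product IDs per search session), the product IDs, and the function name `build_graphs` are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_graphs(sessions):
    """Build click-count graphs from (query, clicked product ids) pairs.

    Returns two weighted edge maps:
      query_to_products[query][product]  -> number of sessions with that click
      product_to_product[(a, b)]         -> number of sessions that clicked both
    """
    query_to_products = defaultdict(lambda: defaultdict(int))
    product_to_product = defaultdict(int)
    for query, clicked in sessions:
        unique = sorted(set(clicked))  # count each product once per session
        for product in unique:
            query_to_products[query][product] += 1
        for a, b in combinations(unique, 2):  # pairs come out in sorted order
            product_to_product[(a, b)] += 1
    return query_to_products, product_to_product


# Hypothetical sessions, for illustration only.
sessions = [
    ("verizon cell phones", ["samsung-glyde", "lg-voyager"]),
    ("verizon cell phones", ["lg-voyager"]),
]
query_graph, product_graph = build_graphs(sessions)
print(dict(query_graph["verizon cell phones"]))
print(dict(product_graph))
```

Edge weights here are raw session counts. A production system would more likely normalize them, for example by impressions, before using them as ranking signals.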

We can also mine data to understand user query intent. When a user searches for “Honda Civic”, are they searching for a new car, or just repair parts of the car? Query intent detection comes from understanding the user, other users’ searches, and the semantics of query terms.

Product recommendation

Recommending similar products is an important part of eBay. A good product recommendation can save hours of search time and delight our users.

Typical recommendation systems are built upon the principle of “collaborative filtering”, where the aggregated choices of similar past users are used to provide insights for the current user. We do this in our new product-based experience. Try viewing our Apple iPod touch 2nd generation page and scrolling down: you’ll see that users who viewed this product also viewed other generations of the iPod touch and the iPod classic.
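As a rough illustration of the "users who viewed this also viewed" idea, the sketch below ranks co-viewed products with a cosine-style normalization, so that very popular products do not dominate every list. It is one simple form of item-based collaborative filtering and not eBay's actual system. The input format (one list of viewed product IDs per user), the product IDs, and the function names are assumptions.

```python
import math
from collections import Counter, defaultdict


def build_co_views(view_histories):
    """Count how many users viewed each product, and each pair of products."""
    views = Counter()
    co_views = defaultdict(Counter)
    for products in view_histories:
        unique = set(products)  # one vote per user per product
        views.update(unique)
        for product in unique:
            for other in unique - {product}:
                co_views[product][other] += 1
    return views, co_views


def also_viewed(views, co_views, product, top_n=3):
    """Rank co-viewed products by co-view count / sqrt(views_a * views_b)."""
    scores = {
        other: count / math.sqrt(views[product] * views[other])
        for other, count in co_views.get(product, {}).items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda other: (-scores[other], other))[:top_n]


# Hypothetical view histories, one list per user.
histories = [
    ["ipod-touch-2g", "ipod-touch-3g", "ipod-classic"],
    ["ipod-touch-2g", "ipod-touch-3g"],
    ["ipod-touch-2g", "ipod-nano"],
]
views, co_views = build_co_views(histories)
print(also_viewed(views, co_views, "ipod-touch-2g", top_n=2))
```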

Discovering item similarity requires understanding product attributes, price ranges, user purchase patterns, and product categories. Given the hundreds of millions of items sold on eBay, and the diversity of merchandise on our website, this is a challenging computational task. Data mining provides possible tools to tackle this problem, and we are always actively improving our approach to the problem.

Fraud detection

A problem faced by all e-commerce companies is misuse of their systems and, in some cases, fraud. For example, sellers may deliberately list a product in the wrong category to attract user attention, or sell an item that is not as they described it. On the buy side, all retailers face problems with users using stolen credit cards to make purchases or register new user accounts.

Fraud detection involves constant monitoring of online activities and automatic triggering of internal alarms. Data mining supports this through “anomaly detection”: using statistical analysis and machine learning to detect abnormal patterns in a data sequence.
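One of the simplest forms of anomaly detection on a data sequence is a rolling z-score: compare each new value against the mean and standard deviation of the values just before it. The sketch below is only meant to show the idea. The window size, the threshold, and the example numbers are assumptions, and a real fraud system would combine many more signals.

```python
import statistics


def find_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value is far from the mean of the preceding window.

    "Far" means more than `threshold` standard deviations away.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mean = statistics.mean(history)
        stdev = statistics.pstdev(history)
        if stdev == 0:
            # A perfectly flat history: any change at all is unusual.
            if values[i] != mean:
                anomalies.append(i)
            continue
        if abs(values[i] - mean) / stdev > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical daily listing counts for one seller.
daily_listings = [10, 12, 11, 9, 10, 11, 95, 10, 12]
print(find_anomalies(daily_listings))
```

Note that once a spike enters the window it inflates the standard deviation for the next few points, so this simple version can miss anomalies that follow closely after one another.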

Detecting seller fraud requires mining data on seller profiles, item categories, listing prices, and auction activity. Combining all of this data gives us a complete picture and enables fast detection in real time.

Business intelligence

Every company needs to understand its business operations, inventory, and sales patterns. The unique problem facing eBay is its large and diverse inventory. eBay is the world’s largest marketplace for buyers and sellers, with items ranging from collectible coins to new cars. There is no complete product catalog that can cover all items sold on eBay’s website. How do we know the exact number of “sunglasses” sold on eBay? They can be listed under different categories, with different titles and descriptions, or even offered as part of a bundle with other items.

Inventory intelligence requires us to use data mining to process items and map them to the correct product category. This involves text mining, natural language understanding, and machine learning techniques. Successful inventory classification also helps us provide a better search experience and gives a user the most relevant product.
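To give a flavor of what mapping items to categories can look like, here is a tiny multinomial naive Bayes classifier over title words, written from scratch in Python. This is a textbook technique chosen for illustration, not a description of the classifier eBay uses. The training titles, category names, and function names are all made up for the example.

```python
import math
from collections import Counter, defaultdict


def train(labeled_titles):
    """Collect per-category word counts from (title, category) pairs."""
    word_counts = defaultdict(Counter)
    category_counts = Counter()
    vocabulary = set()
    for title, category in labeled_titles:
        tokens = title.lower().split()
        word_counts[category].update(tokens)
        category_counts[category] += 1
        vocabulary.update(tokens)
    return word_counts, category_counts, vocabulary


def classify(title, model):
    """Return the category with the highest log-probability for the title."""
    word_counts, category_counts, vocabulary = model
    total_titles = sum(category_counts.values())
    best_category, best_score = None, -math.inf
    for category in sorted(category_counts):
        # Log prior plus log likelihood with add-one (Laplace) smoothing.
        score = math.log(category_counts[category] / total_titles)
        total_words = sum(word_counts[category].values())
        for token in title.lower().split():
            count = word_counts[category][token]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical labeled titles, for illustration only.
model = train([
    ("Ray-Ban aviator sunglasses gold", "sunglasses"),
    ("Oakley polarized sunglasses black", "sunglasses"),
    ("Honda Civic brake pads front", "car parts"),
    ("Honda Accord oil filter", "car parts"),
])
print(classify("vintage aviator sunglasses lot", model))
print(classify("Honda brake pads", model))
```

Whitespace tokenization is the weakest part of this sketch. Real listing titles contain abbreviations, typos, and bundles, which is where the text mining and natural language understanding mentioned above come in.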

We are seeing a growing need for data mining and its huge potential for e-commerce sites. The success of an e-commerce company is determined by the experience it offers its users, which these days depends on how well the company understands its data. Stay tuned for exciting developments and an improved experience at eBay.

Junling Hu
Principal Data Mining Lead