Author Archives: Tatyana Badeka

Types of Content at eBay: Titles

 

All eBay-generated content is currently translated by our talented localization team, whereas eBay’s user-generated (UG) content is handled by our Machine Translation (MT) engine. It is common knowledge that UG text can get pretty noisy due to typos, non-dictionary terms, etc. At eBay, however, MT deals with more than that. We work with multiple types of UG content — search queries, item titles, and item descriptions — and each presents its own challenges. In the previous post we talked about search queries. This post discusses item titles.

Item Titles

Translating item titles (IT) provides our buyers from Russia, Brazil, Spanish Latin America, France, Italy, Germany, and Spain with an option to view eBay search results in their own language. This allows customers to look through pages of results and make an informed decision on which listings to open, because an image alone does not contain enough information. Being able to read and understand item titles is essential to a positive customer experience, which is why we invest a lot of effort into improving the MT engine for titles.

This type of content is very specific and presents a number of challenges for MT.

Syntax

A title is a summarized item description composed of keywords. The eBay Help article on writing effective titles encourages sellers to omit punctuation and avoid trying to create a grammatically correct sentence. Following these and other tips is supposed to help sellers create a clear picture of an item and a good first impression, so it is important that the MT output meets the same expectations. However, the lack of syntax and punctuation presents a problem for an MT engine that is normally trained on sentences. If it tries to translate a sequence of nouns, adjectives, and numbers as a sentence, errors in meaning are unavoidable. It may start looking for a subject and a predicate, and for sentence structure in general, and so translate adjectives as verbs, move words around, and so on.

As an example, let’s take a title for a can of paint: “20g Glow in the Dark Acrylic Luminous Paint Bright Pigment Graffiti Party DIY”.

[Image: glow-in-the-dark acrylic paint]

What might go wrong here?

“Glow” may get translated as the imperative form of a verb, and “dark acrylic” as a noun phrase with “acrylic” as its noun (as in “Stay in the shaded area!”), and that is just part of the title. A similar transformation may happen with polysemous words or words that can belong to different parts of speech: “can”, “paint”, “party”, etc. The result of such a translation may describe a completely different item.

Segmentation

This is closely related to the previous issue. Segmenting a title and correctly identifying semantic units is of utmost importance for machine translation. Take “Gucci fake snake leather purse”: with incorrect segmentation, we may get a translation of “Gucci fake” instead of the intended “fake snake leather”. Such translations are the most dangerous because they sound correct and believable yet present misleading information, which in the end may leave both the buyer and the seller unhappy with the experience.

To address these major issues, the science team created an engine just for item titles; it is trained on separate data sets. In addition, they have been working on a named entity recognition (NER) algorithm that identifies semantic units in a title before it goes into the MT engine for translation.
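To make the idea of identifying semantic units before translation concrete, here is a minimal sketch. It is not the NER algorithm mentioned above; it swaps in a simple greedy longest-match lookup against a hand-written phrase list. The names `KNOWN_UNITS` and `segment_title` and the phrase list itself are assumptions made for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical phrase inventory (an assumption for this sketch);
# a real system would learn semantic units from data.
KNOWN_UNITS: Set[Tuple[str, ...]] = {
    ("snake", "leather"),
    ("fake", "snake", "leather"),
    ("glow", "in", "the", "dark"),
    ("shoulder", "bag"),
}
MAX_UNIT_LEN = max(len(unit) for unit in KNOWN_UNITS)


def segment_title(title: str) -> List[str]:
    """Split a title into units using greedy longest-match lookup."""
    tokens = title.split()
    units: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shrink it.
        for n in range(min(MAX_UNIT_LEN, len(tokens) - i), 0, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            # A single token is always accepted as a fallback unit.
            if n == 1 or candidate in KNOWN_UNITS:
                units.append(" ".join(tokens[i:i + n]))
                i += n
                break
    return units


print(segment_title("Gucci fake snake leather purse"))
```

With this phrase list, “fake snake leather” stays together as one unit and “Gucci” stands alone, so a downstream engine would not be tempted to pair “Gucci” with “fake”.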

Synonyms

Sellers tend to use multiple synonyms in a title assuming this will increase the chances of matching search queries and coming up high in search results (which is a common misconception). For MT this means several things:

A chain of adjacent nouns or adjectives that are in no relation to each other

The machine needs to learn to translate them independently of each other. This is similar to the first issue described above, because the engine may try to create agreement where there should be none.

[Image: toddler backpacks]

Example: “Baby Toddler Kids Child Mini Cartoon Animal Backpack Schoolbag Shoulder Bag”

We see four synonyms for the age reference and three synonyms for the item itself. The age reference terms are not all adjectives, nor can all of them be translated as adjectives. Even a human translator would have to get creative and produce something like “for a baby/toddler, kids’, child’s”, because we could not simply leave all four of them as nouns; it would sound too abrupt and possibly confusing. The task is much more challenging for a machine. Not only should it avoid creating noun phrases (“Kids Child” may turn into “a kid’s child”), it also needs to rephrase or insert prepositions where necessary (“baby toddler child” -> “for baby, toddler, child”; “kids” -> “kids’”). The best way to approach this varies depending on the target language.

Agreement with the head noun

In our example, there are three synonyms for the head noun: Backpack, Schoolbag, Shoulder Bag. What if they have different genders in the target language? Which one should the adjectives agree with? A human translator would probably pick the first one, but MT may not think the same way. Here is a bigger challenge: the head noun does not immediately follow the adjectives describing it. In our example there are two other nouns between the attributes “Kids Child” and the head noun “Backpack”. The machine is supposed to figure out that “kids” describes “backpack”, not “cartoon” or “animal”. As you can imagine, however, the most logical decision for a machine would be to connect “kids” with “cartoon”.
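The “pick the first head noun” heuristic that a human translator might apply can be sketched as follows. This is only an illustration, assuming Russian as the target language and a toy lexicon; `NOUN_GENDER`, `KIDS_ADJ`, and `agree_with_first_head` are hypothetical names, and a real engine would not resolve agreement with a lookup table.

```python
from typing import Dict, List

# Hypothetical toy lexicon (an assumption for this sketch):
# Russian noun -> grammatical gender.
NOUN_GENDER: Dict[str, str] = {
    "рюкзак": "masc",  # backpack
    "сумка": "fem",    # bag
}

# Hypothetical adjective forms for "kids'" keyed by gender.
KIDS_ADJ: Dict[str, str] = {
    "masc": "детский",
    "fem": "детская",
}


def agree_with_first_head(head_nouns: List[str]) -> str:
    """Return the adjective form that agrees with the first head noun."""
    gender = NOUN_GENDER[head_nouns[0]]
    return KIDS_ADJ[gender]


# The first synonym decides the adjective form for the whole chain.
print(agree_with_first_head(["рюкзак", "сумка"]))
```

The sketch covers only the choice of which noun governs agreement. It assumes the head nouns have already been identified, which, as described above, is the harder part.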

Agreement plays a very important role in translating item titles, because it provides a customer with a description of features and qualities of the item. If you connect an attribute with the wrong noun, it will modify an incorrect object and produce an overall misleading translation. In our example, with the incorrect agreement, a user will read: “backpack with a kids’ cartoon animal”, which is in essence a different item than a “kids’ backpack with a cartoon animal”. One may argue that an image would be a clear indication that the item is a kids’ backpack. Unfortunately, a picture is not always a reliable source of information. In our case, there are similar backpacks for adults, which is why an accurate translation will make a difference.

[Image: Totoro backpack]

Acronyms

Sellers use multiple acronyms to save space and fit as much information in a title as possible. For MT this presents several challenges.

  • Rare, unknown acronyms or acronyms that sellers made up on the spot. Gathering more training data and compiling additional lists of expanded out-of-vocabulary (OOV) acronyms helps address this.
  • Polysemous acronyms that have different translations in different categories. The most challenging acronyms are the ones that have more than one meaning in the same category. For example, “RN” appears in Clothing, Shoes and Accessories as “registered nurse”, “Rapa Nui”, “Rusty Neal”, and as part of model names for Nike, Hugo Boss, A&F and other brands.
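One way to picture a category-aware acronym list is a lookup keyed by acronym and category that expands only when the meaning is unambiguous. The sketch below is an assumption about how such a list could be used, not a description of the production system; `ACRONYMS` and `expand_acronym` are hypothetical names, and the “NWT” entry is added purely for illustration.

```python
from typing import Dict, List, Tuple

CLOTHING = "Clothing, Shoes and Accessories"

# Hypothetical expansion list keyed by (acronym, category).
ACRONYMS: Dict[Tuple[str, str], List[str]] = {
    ("RN", CLOTHING): ["registered nurse", "Rapa Nui", "Rusty Neal"],
    ("NWT", CLOTHING): ["new with tags"],  # illustrative entry
}


def expand_acronym(token: str, category: str) -> str:
    """Expand an acronym only if it has exactly one sense in the category."""
    senses = ACRONYMS.get((token.upper(), category), [])
    # Unknown or ambiguous acronyms are left untouched rather than guessed.
    return senses[0] if len(senses) == 1 else token


print(expand_acronym("NWT", CLOTHING))
print(expand_acronym("RN", CLOTHING))
```

Leaving “RN” unexpanded is a deliberately conservative choice in this sketch: with several meanings in the same category, a wrong expansion would mislead the buyer more than the original acronym would.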

Writings and names of songs/music bands/movies/video games

This is common content for certain categories. Singling out a movie or song title from the rest of the string may be difficult because there is often no contextual information indicating that it is a movie or a song. It is not much of a problem in the DVD or Music category, but quite often you will find a reference to a movie title or a music band name in other categories such as Collectables or Clothing. It is also common for sellers to quote the writing on the item they are selling. Ideally, we would want the writing to be left as is so that the customer knows exactly what the item depicts. As you can imagine, however, literally anything can be written on a t-shirt or a poster, which is why it is very difficult for a machine to tell the writing apart from the actual item description. In such cases a user would have to rely on the quality and size of an image, which may not be the best on the search results page.

[Image: Lake Champlain poster]

In this example, “New York Vermont Quebec” is part of the poster design, but it is barely visible. In the text of the item title, however, it may be interpreted as locations of the poster, places it originally came from, etc. Identifying this as verbatim writing, thus keeping it in English, would be a very difficult task for an MT engine, but it would clearly benefit an eBay customer.
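If verbatim writing could be identified, keeping it in English is the easy part. A common technique is to mask the protected span with a placeholder before translation and restore it afterwards. The sketch below assumes the protected spans are already known, which is exactly the hard problem described above; `mask_spans`, `unmask`, the placeholder format, and the sample title are all hypothetical.

```python
from typing import Dict, List, Tuple


def mask_spans(title: str, protected: List[str]) -> Tuple[str, Dict[str, str]]:
    """Replace each protected span with a placeholder token."""
    mapping: Dict[str, str] = {}
    masked = title
    for idx, span in enumerate(protected):
        placeholder = f"__DNT{idx}__"  # "do not translate" marker
        if span in masked:
            # Replace only the first occurrence of the span.
            masked = masked.replace(span, placeholder, 1)
            mapping[placeholder] = span
    return masked, mapping


def unmask(translated: str, mapping: Dict[str, str]) -> str:
    """Put the original spans back in place of their placeholders."""
    for placeholder, span in mapping.items():
        translated = translated.replace(placeholder, span)
    return translated


# Hypothetical title used only for this sketch.
title = "Lake Champlain Map Poster New York Vermont Quebec"
masked, mapping = mask_spans(title, ["New York Vermont Quebec"])
print(masked)
print(unmask(masked, mapping))
```

In a real pipeline the masked string would pass through the MT engine between the two calls, and the engine would need to leave the placeholder tokens untouched.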

Conclusion

With so many aspects to keep in mind, training the engine to translate eBay item titles is certainly a challenge. Our teams of scientists and linguists are actively and successfully working on ways to improve the quality of the training data and the MT output.

Machine Translation: Search Queries at eBay

All eBay-generated content is currently translated by our talented localization team, whereas eBay’s user-generated (UG) content is handled by our Machine Translation (MT) engine. It is common knowledge that UG text can get pretty noisy due to typos, non-dictionary terms, etc. At eBay, however, MT deals with more than that. We work with multiple types of UG content (search queries, item titles, and item descriptions), and each presents its own challenges. This post discusses one of those content types: search queries.

Translating search queries (SQ) is a very important step in providing our customers with a localized shopping experience, because before even opening a listing, customers look through the search results to choose the listings of interest. What we offer our users is an opportunity to search for items in their native language by automatically translating their queries into the language of the market (English, German) and matching their queries against our inventory.

[Image: beanie search query]

Search on eBay is a complex process (think of polysemy, broad context, the variety of inventory, etc.); try adding a machine translation step to that! Here are some of the main challenges we face when translating SQ.

  • Training set (see SMT for more). Queries are translated from the user’s language, which means there has to be a separate training corpus for this direction in every language. Being able to use actual post-edited queries for training is very helpful.
  • Lack of context. As you can imagine, search queries are quite short; a typical query is one to three words long (iPhone; iPhone case; blue iPhone case). The training data therefore has very little context for the engine to learn from.
  • Polysemy. Given that search queries provide zero category information, polysemous terms are prime candidates for errors. Categories are listed for search results, but we have no way to know which category a user had in mind when they typed “pipe” in the search field: was it Plumbing, Motors, or Collectibles? Which translation do we choose? The same issue applies to all languages.
  • Domain limitations. This goes hand in hand with polysemy. Guessing the user intent and choosing the right category is not the only problem with polysemous terms; sometimes we also need to keep in mind legal aspects, domain specifics, and even shipping policies. For instance, some possible meanings of a term might point to items that would be very expensive or illegal to ship, or that for other reasons are very unlikely to be sold on eBay. In this case, we might give preference to a less common meaning that is more likely to be sold and bought on eBay.
  • Language trends. Sometimes the most obvious and common translation, the one that would be offered in a dictionary, is not suitable for our purposes. Dictionaries cannot keep up with the fast-changing language culture and do not always reflect the current trends. This situation is especially true for clothing and gadgets. Fashion is changing, technologies are developing, and new words are emerging – or new meanings are assigned to existing terms. A lot of English words are also getting adopted by other languages (or transliterated, in the case of Russian), often with a local “flavor.” We must keep up with these trends.
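The domain-limitation idea from the list above, preferring a less common meaning when the most common one is unlikely to be sold, can be sketched as a filter followed by a ranking. Everything here is an assumption made for illustration: the `Sense` fields, the `pick_sense` function, and the numbers are hypothetical, not how the engine actually scores candidates.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sense:
    translation: str
    general_frequency: float  # how common the sense is in everyday language
    sellable: bool            # whether such items plausibly appear in inventory


def pick_sense(senses: List[Sense]) -> str:
    """Prefer the most frequent sense among those likely to be sold."""
    # Fall back to all senses if none is marked as sellable.
    candidates = [s for s in senses if s.sellable] or senses
    return max(candidates, key=lambda s: s.general_frequency).translation


# Placeholder senses with made-up values, for illustration only.
senses = [
    Sense("common_but_unshippable", 0.7, sellable=False),
    Sense("rarer_but_sold", 0.3, sellable=True),
]
print(pick_sense(senses))
```

The rarer sense wins in this sketch because the more common one is filtered out before ranking, which mirrors the preference described in the “Domain limitations” item.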

Here’s a specific example: the simple word “шапка”, which means “hat” in Russian and would be translated as “hat” in any other circumstances, should be translated as “beanie” on eBay, because when Russian users search for “шапка”, what they have in mind looks like a beanie; the search results for “hat” are much more diverse and less relevant.
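One simple way to express this kind of preference is a marketplace glossary that takes priority over the general dictionary. The sketch below is an assumption for illustration, not the mechanism the MT engine uses; `GENERAL_DICTIONARY`, `MARKETPLACE_GLOSSARY`, and `translate_query_term` are hypothetical names.

```python
from typing import Dict

# Hypothetical general-purpose dictionary entry.
GENERAL_DICTIONARY: Dict[str, str] = {"шапка": "hat"}

# Hypothetical marketplace glossary that reflects what users expect to find.
MARKETPLACE_GLOSSARY: Dict[str, str] = {"шапка": "beanie"}


def translate_query_term(term: str) -> str:
    """Prefer the marketplace glossary, then the general dictionary."""
    key = term.lower()
    if key in MARKETPLACE_GLOSSARY:
        return MARKETPLACE_GLOSSARY[key]
    # Unknown terms are passed through unchanged.
    return GENERAL_DICTIONARY.get(key, term)


print(translate_query_term("шапка"))
```

A statistical engine would learn this preference from post-edited queries rather than from a hard-coded table, but the ordering of sources is the point of the sketch.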

Translating search queries correctly means helping eBay users find what they want. Providing an accurate, grammatically correct translation of a query is never enough; what we always keep in mind is user intent and relevance of the results.