Human Evaluation of Machine Translation


Machine translation (MT) evaluation is essential to machine translation development. It is key to determining the effectiveness of an existing MT system, estimating the level of post-editing required, negotiating prices, and setting reasonable expectations. Machine translation output can be evaluated automatically, using methods like BLEU and NIST, or by human judges.

The automatic metrics use one or more human reference translations, which are considered the gold standard of translation quality. The difficulty lies in the fact that there may be many alternative correct translations for a single source segment.

Human evaluation, however, also has a number of disadvantages. Primarily, it is a costly and time-consuming process. Human judgment is also subjective in nature, so it is difficult to achieve a high level of intra-rater (consistency of the same human judge) and inter-rater (consistency across multiple judges) agreement. In addition, there are no standardized metrics and approaches to human evaluation.
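To make "agreement" concrete, here is a minimal sketch of Cohen's kappa, one common way to quantify inter-rater agreement between two judges rating the same segments on a 1–5 scale. This is purely illustrative; the rating arrays are invented.

```javascript
// Sketch: Cohen's kappa for inter-rater agreement between two judges.
// Returns a value near 1 for strong agreement, near 0 for chance-level agreement.
function cohensKappa(ratesA, ratesB) {
    var n = ratesA.length;
    var categories = [1, 2, 3, 4, 5];
    // Observed agreement: fraction of segments with identical scores
    var agree = 0;
    for (var i = 0; i < n; i++) {
        if (ratesA[i] === ratesB[i]) agree++;
    }
    var po = agree / n;
    // Expected chance agreement, from each judge's score distribution
    var pe = 0;
    categories.forEach(function(c) {
        var pa = ratesA.filter(function(r) { return r === c; }).length / n;
        var pb = ratesB.filter(function(r) { return r === c; }).length / n;
        pe += pa * pb;
    });
    return (po - pe) / (1 - pe);
}

var judge1 = [5, 4, 3, 5, 2, 4];
var judge2 = [5, 4, 2, 5, 2, 3];
console.log(cohensKappa(judge1, judge2).toFixed(2)); // → 0.56
```

A kappa this low would signal that the judges interpret the scale quite differently, which is exactly the consistency problem described above.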

Let us explore the most commonly used types of human evaluation.


Scoring

Judges rate translations based on a predetermined scale. For example, a scale from 1 to 5 can be used, where 1 is the lowest and 5 is the highest score. One of the challenges of this approach is establishing a clear description of each value on the scale and the exact differences between the levels of quality. Even with explicit evaluation guidelines, human judges still find it difficult to assign numerical values to the quality of a translation (Koehn & Monz, 2006).

The two main dimensions or metrics used in this type of evaluation are adequacy and fluency.


Adequacy

Adequacy, according to the Linguistic Data Consortium, is defined as “how much of the meaning expressed in the gold-standard translation or source is also expressed in the target translation.” The annotators must be bilingual in the source and target languages in order to judge whether the information is preserved across translation.

A typical scale used to measure adequacy is based on the question “How much meaning is preserved?”

5: all meaning
4: most meaning
3: some meaning
2: little meaning
1: none


Fluency

Fluency refers to the target text only, without taking the source into account; the criteria are grammar, spelling, word choice, and style. A typical scale used to measure fluency is based on the question “Is the language in the output fluent?”

5: flawless
4: good
3: non-native
2: disfluent
1: incomprehensible
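As a simple illustration, adequacy and fluency judgments like the above can be aggregated into a mean score per MT system. The judgment records and system names below are invented.

```javascript
// Sketch: averaging 1-5 adequacy and fluency judgments per MT system.
function meanScores(judgments) {
    var totals = {};
    judgments.forEach(function(j) {
        var t = totals[j.system] || { adequacy: 0, fluency: 0, n: 0 };
        t.adequacy += j.adequacy;
        t.fluency += j.fluency;
        t.n += 1;
        totals[j.system] = t;
    });
    var result = {};
    Object.keys(totals).forEach(function(s) {
        result[s] = {
            adequacy: totals[s].adequacy / totals[s].n,
            fluency: totals[s].fluency / totals[s].n
        };
    });
    return result;
}

var judgments = [
    { system: 'A', adequacy: 5, fluency: 4 },
    { system: 'A', adequacy: 4, fluency: 4 },
    { system: 'B', adequacy: 3, fluency: 5 }
];
console.log(meanScores(judgments));
// → { A: { adequacy: 4.5, fluency: 4 }, B: { adequacy: 3, fluency: 5 } }
```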


Ranking

Judges are presented with two or more translations (usually from different MT systems) and are required to choose the best option. This task can be confusing when the ranked segments are nearly identical or contain errors that are difficult to compare; the judges must decide which errors have a greater impact on the quality of the translation (Denkowski & Lavie, 2010). On the other hand, it is often easier for human judges to rank systems than to assign absolute scores (Vilar et al., 2007), because quantifying the quality of a translation in isolation is difficult.
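Here is a minimal sketch of how such pairwise judgments can be turned into an overall ranking by win rate. The comparison records and system names are invented.

```javascript
// Sketch: rank MT systems by their win rate across pairwise judgments.
function rankByWins(comparisons) {
    var wins = {}, total = {};
    comparisons.forEach(function(c) {
        [c.winner, c.loser].forEach(function(s) {
            if (!(s in wins)) { wins[s] = 0; total[s] = 0; }
            total[s] += 1;
        });
        wins[c.winner] += 1;
    });
    return Object.keys(wins)
        .map(function(s) { return { system: s, winRate: wins[s] / total[s] }; })
        .sort(function(a, b) { return b.winRate - a.winRate; });
}

var comparisons = [
    { winner: 'A', loser: 'B' },
    { winner: 'A', loser: 'C' },
    { winner: 'C', loser: 'B' }
];
console.log(rankByWins(comparisons));
// → A (winRate 1), C (winRate 0.5), B (winRate 0)
```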

Error Analysis

Human judges identify and classify errors in MT output. Classification of errors might depend on the specific language and content type. Some examples of error classes are “missing words,” “incorrect word order,” “added words,” “wrong agreement,” “wrong part of speech,” and so on. It is useful to have reference translations in order to classify errors; however, as mentioned above, there may be several correct ways to translate the same source segment. Accordingly, reference translations should be used with care.

When evaluating the quality of eBay MT systems, we use all the aforementioned methods. However, our metrics can also capture micro-level details about areas specific to eBay content. For example, one of the evaluation criteria is whether brand names and product names (the main noun or noun phrase identifying an item) were translated correctly. This information helps identify the problem areas of MT and focus enhancement efforts on those particular areas.

Some types of human evaluation, such as error analysis, can only be conducted by professional linguists, while other types of judgment can be performed by annotators who are not linguistically trained.

Is there a way to cut the cost of human evaluation? Yes, but unfortunately, low-budget crowdsourcing evaluations tend to produce unreliable results. How then can we save money without compromising the validity of our findings?

  • Start with a pilot test — a process of trying out your evaluation on a small data set. This can reveal critical flaws in your metrics, such as ambiguous questions or instructions.
  • Monitor response patterns to remove judges whose answers are outside the expected range.
  • Use dynamic judgments — a feature that allows fewer judgments on the segments where annotators agree, and more judgments on segments with a high inter-rater disagreement.
  • Use professional judgments that are randomly inserted throughout your evaluation job. Pre-labeled professional judgments will allow for the removal of judges with poor performance.
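The last point can be sketched as follows: judges whose accuracy on pre-labeled "gold" professional judgments falls below a threshold are removed. The judge names, gold labels, and threshold below are invented for illustration.

```javascript
// Sketch: keep only crowd judges whose accuracy on gold questions
// meets a minimum threshold.
function reliableJudges(answers, goldLabels, minAccuracy) {
    return Object.keys(answers).filter(function(judge) {
        var correct = 0, total = 0;
        Object.keys(goldLabels).forEach(function(segId) {
            if (segId in answers[judge]) {
                total += 1;
                if (answers[judge][segId] === goldLabels[segId]) correct += 1;
            }
        });
        // Require at least one gold answer and sufficient accuracy
        return total > 0 && correct / total >= minAccuracy;
    });
}

var gold = { s1: 5, s2: 2 };
var answers = {
    alice: { s1: 5, s2: 2, s3: 4 },  // 2/2 correct on gold segments
    bob:   { s1: 1, s2: 2, s3: 3 }   // 1/2 correct on gold segments
};
console.log(reliableJudges(answers, gold, 0.8)); // → [ 'alice' ]
```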

Human evaluation of machine translation quality is still very important, even though there is no clear consensus on the best method. It is a key element in the development of machine translation systems, as automatic metrics are validated through correlation with human judgment.
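That validation is typically a correlation computation. Below is a minimal Pearson correlation sketch; the metric scores and human scores are invented.

```javascript
// Sketch: Pearson correlation between automatic metric scores and
// human judgment scores across a set of systems.
function pearson(xs, ys) {
    var n = xs.length;
    var mx = xs.reduce(function(a, b) { return a + b; }, 0) / n;
    var my = ys.reduce(function(a, b) { return a + b; }, 0) / n;
    var num = 0, dx = 0, dy = 0;
    for (var i = 0; i < n; i++) {
        num += (xs[i] - mx) * (ys[i] - my);
        dx += Math.pow(xs[i] - mx, 2);
        dy += Math.pow(ys[i] - my, 2);
    }
    return num / Math.sqrt(dx * dy);
}

// Invented per-system scores: a high correlation would suggest the
// automatic metric tracks human judgment well.
var metricScores = [0.25, 0.31, 0.18, 0.40];
var humanScores = [3.1, 3.6, 2.5, 4.2];
console.log(pearson(metricScores, humanScores).toFixed(2));
```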

If you enjoyed this article, please check other posts from the eBay MT Language Specialists series.

Igniting Node.js Flames

“Simple things bring infinite pleasure. Yet, it takes us a while to realize that. But once simple is in, complex is out – forever.” ― Joan F. Marques

Now that I have your attention, let me clear up the word “flames.” The flames I’m referring to have nothing to do with fire; I am talking about performance tools in Node.js. When it comes to performance, everyone thinks of fighting fires, because many consider performance optimization a nightmare that only a few individuals have mastered.

Anyone can become a master of profiling when given simple tools. At eBay, we strive to make things simple and easy for our developers to use. Over the course of Node.js development and production issues, we soon realized that profiling in Node.js is not an easy thing to do.

Before jumping to the CPU profiling tool that simplified our lives, let me walk you through the journey that led us to see flame charts from a completely different angle.

Flame graphs using kernel tools

With Brendan Gregg’s flame graph generation technique, it became much easier to visualize CPU bottlenecks. However, we need to run a number of tools and scripts to generate these graphs.

Yunong Xiao has posted an excellent blog on how to generate flame graphs using the perf command, based on Gregg’s tools. Kernel tools like DTrace (BSD and Solaris) and perf (Linux) are very useful for generating stack traces at the core level and transforming the stack calls into flame graphs. This approach gives us the complete picture of Node internals, from the V8 engine all the way up to JS code.

However, running tools like these requires a good understanding of the tools themselves, and sometimes a different OS altogether. In most cases, your production box and your profiling box are set up completely differently, which makes it hard to investigate an issue occurring in production: you have to attempt to reproduce it in a completely different environment.

After managing to run the tools, you will end up with flame charts like this.

Image source: Yunong Xiao’s blog

Here are some pros and cons for this approach.


Pros

  • Easy to find CPU bottlenecks
  • Graphical view
  • Complete profile graph for both native and JS frames

Cons

  • Complexity in generating graphs
  • Limited DTrace support across platforms, making it harder to profile on DEV boxes

Chrome profiling tool

The Chrome browser is just amazing. It is famous not only for its speed but also for its V8 engine, which is core to Node.js. In addition to these features, one tool that web developers love about Chrome is Developer Tools.


There is one tool inside Developer Tools that is used to profile browser-side JS. The v8-profiler module enables us to load server-side profile data into that same Chrome Profiles tool.


Let us see how we can use this to profile our Node.js application. Before using Profiles in Chrome, we have to generate some profiling data from our running Node.js application. We will use v8-profiler to create the CPU profile data.

In the following code, I have created a route /cpuprofile for generating CPU profile data for a given number of seconds and then streaming the dump to a browser to open in Chrome.

This sample code creates a CPU dump using v8-profiler.

//file index.js
var express = require('express');
var profiler = require('v8-profiler');

var app = express();

app.get('/', function(req, res){
    res.send('Hello World!!');
});

app.get('/cpuprofile', function(req, res){
    var duration = req.query.duration || 2;
    res.header('Content-type', 'application/octet-stream');
    res.header('Content-Disposition', 'attachment; filename=cpu-profile.cpuprofile');
    //Start Profiling
    profiler.startProfiling('CPU Profile', true);
    //Stop Profiling after duration
    setTimeout(function(){
        var profile = profiler.stopProfiling();
        //Pipe profile dump to browser
        profile.export().pipe(res).on('finish', function() {
            profile.delete();
        });
    }, duration * 1000); //Convert to milliseconds
});

app.listen(8080);
To generate CPU profile data, use these steps:

  1. Start your app.
    node index.js

    It’s a good idea to run ab to put some load on the page.

  2. Access the CPU profile dump using http://localhost:8080/cpuprofile?duration=2. A cpu-profile.cpuprofile will be downloaded from the server.
  3. Load the downloaded file cpu-profile.cpuprofile into Chrome using Developer Tools > Profiles > Load. Upon loading, you should see something like the following in your Profiles tab.

Now that you have opened the profile data, you can drill down into the tree and analyze which pieces of code are taking the most CPU time. With this tool, anyone can generate profile data anytime with just one click, but imagine how hard it is to drill down through this tree structure when you have a big application.

In comparison to Flame Graphs using Kernel Tools, here are some pros and cons.


Pros

  • Easy generation of a profile dump
  • Platform independent
  • Profiling available during live traffic

Cons

  • Chrome provides a graphical view for profile data, but the data is not aggregated and navigation is limited.

Flame graphs @ eBay

Now that we have seen two different approaches for generating CPU profile data, let us see how we can bring in a nice graphical view like flame graphs to V8-profiler data.

At eBay, we have taken a different approach to make a very simple and easy-to-use tool for our Node.js developers. We take v8-profiler data, apply an aggregation algorithm, and render the data as flame charts using the d3-flame-graphs module.

If you look closely at the .cpuprofile file created above, it is basically a JSON file. We came across a generic d3-flame-graphs library that can draw flame graphs in a browser from input JSON data. Thanks to “cimi” for his d3-flame-graphs module.

After making some modifications to the chrome2calltree aggregation algorithm and aggregating the profile data (removing core-level CPU profile data), we could convert the .cpuprofile file into JSON readable by d3-flame-graphs, and the final outcome is simply amazing.

Three-step process

  1. Generate .cpuprofile on demand using v8-profiler as shown in Chrome Profiling Tool.
  2. Convert .cpuprofile into aggregated JSON format (source code).
  3. Load the JSON using d3-flame-graphs to render the flame graph on browser.
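Step 2 can be sketched roughly as follows. This is not our actual conversion code: it assumes the older .cpuprofile format, in which a `head` node carries a tree of `{functionName, hitCount, children}` records, and it produces the nested `{name, value, children}` shape that d3-flame-graphs renders. The sample profile data is invented.

```javascript
// Sketch: convert a v8-profiler .cpuprofile node tree into the nested
// {name, value, children} JSON consumed by d3-flame-graphs.
function toFlameGraphJson(node) {
    var children = (node.children || []).map(toFlameGraphJson);
    var childTotal = children.reduce(function(sum, c) { return sum + c.value; }, 0);
    return {
        name: node.functionName || '(anonymous)',
        // Total samples for this frame = its own hits plus everything below it
        value: (node.hitCount || 0) + childTotal,
        children: children
    };
}

var sample = {
    head: {
        functionName: '(root)', hitCount: 0, children: [
            { functionName: 'handleRequest', hitCount: 3, children: [] }
        ]
    }
};
console.log(toFlameGraphJson(sample.head).value); // → 3
```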


This time, access the CPU flame graph in a browser using the same URL (http://localhost:8080/cpuprofile?duration=2) as in the Chrome profiling tool section.


The above flame chart shows only JS frames, which is what most Node application developers are interested in.

Third-party packages used

  • v8-profiler
  • d3-flame-graphs

Pros

  • Easy and simple flame graph generation
  • Doesn’t need a special setup
  • Platform independent
  • Early performance analysis during development
  • Graphical view integrated into every application

Cons

  • Imposes around 10% overhead during profiling


To summarize, we have seen three different ways of profiling CPU in Node.js, from using OS-level tools to rendering flame graphs in a browser with simple open-source JS code. Simple and easy-to-use tools help anyone master profiling and performance tuning. At eBay, we always strive to make a difference in our developers’ lives.


Correcting Severe Errors Made by Machine Translation

This article presents a few examples of severe errors made by machine translation engines that most of us want to prevent or correct. First, let me categorize what I would consider a severe error created by MT:

  • Errors with economic consequences for the company
  • Errors due to offensive words
  • Errors with legal or safety consequences

Errors with economic consequences

Economic consequences come from errors that prevent a customer from doing business with the company; for eBay, these are mostly issues that prevent buyers from buying. eBay customers start buying by entering a query to search for the items they want. A query entered in their native language is translated into English, for example, and the English query is then used to find items. So it is critical that the translated query is adequate to find the best possible results. When a query is translated in a way that brings no results, this becomes a severe error, because the customer cannot buy from that translation.

Our example comes from Brazilian Portuguese: when searching for an iPhone case, Brazilians enter the term “capinha” for “case”, a diminutive form of the word. The corpora used to train most MT engines tend to come from formally written sources, which rarely use diminutives. As a result, “capinha” may not be translated into English as “case”; in fact, it may not be translated at all. The search then produces no results, and this becomes an important error. This is something we fixed, to the delight of our Brazilian customers.

Another type of error that could have economic consequences would be the translation of “free shipping” as “paid shipping”, or the translation of “seller pays for shipping” as “buyer pays for shipping”. This could result in less buying. However, we haven’t seen this happen.

Errors due to offensive or inappropriate words

Words can be offensive or inappropriate because they are explicit language or have sexual connotations. We have these examples:

Consider the word “female”. In many languages, it is translated differently depending on whether you are referring to a person or to an animal (or a mechanical part). If you are referring to a person, the translation of “female jacket” should read like “feminine jacket” or “jacket for women”. If you are referring to an animal, the word is more anatomical, expressing the idea of something being physically female. The MT engine should not translate a female jacket as “anatomically” female; it should translate it with the meaning of a style. This is an issue we found across several languages. To illustrate, here is what happens in two of them.


The other example is language-specific. There is a doll called the Adora doll, where Adora is a brand. It turns out that “adora” is a Portuguese word meaning that you “adore” or “love” someone. The translation of “boneca Adora” from Portuguese turned into “love doll” in English, and the results of a search for “love doll” may not be the most suitable for someone looking for a children’s doll.

Errors with legal or safety consequences

Errors of a legal or safety nature can come from converting units of measurement from one system (English units) to another (metric). This kind of issue is critical in medical translations, where a medication dosage in English units rendered as a metric unit, keeping the same number, could risk a person’s life! It is less likely to be an issue in an e-commerce environment. Also, our engine is trained to keep the same unit system, so we have not seen issues of that nature.
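To illustrate why relabeling a number with a new unit is dangerous, here is a tiny sketch. The conversion factor is the standard one for fluid ounces to milliliters; the function itself is invented for illustration.

```javascript
// Sketch: units must be converted, not merely relabeled.
var OZ_TO_ML = 29.5735; // milliliters per US fluid ounce

function convertOzToMl(valueOz) {
    return valueOz * OZ_TO_ML;
}

// A dose of "2 fl oz" relabeled as "2 ml" would understate it almost 30x:
console.log(convertOzToMl(2).toFixed(1) + ' ml'); // → '59.1 ml'
```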

If you know other examples of severe errors made by machine translation, we would love to hear from you and improve our system.

And if you enjoyed this article, please check other posts from the eBay MT Language Specialists series.