
Five Tools to Build Your Basic Machine Translation Toolkit

 

If you are a linguist working with Machine Translation (MT), your job will be a lot easier if you have the right tools at hand. Having a strong toolkit, and knowing how to use it, will save you loads of time and headaches. It will also help you work more efficiently.

As a Machine Translation Language Specialist at eBay, I use these tools on a regular basis at work, and that is why I feel comfortable recommending them. At eBay, we use MT to translate search queries and listing titles and descriptions into several languages. If you want to learn more, I encourage you to read this article.

Advanced Text Editors

The text editor that comes with your laptop won’t cut it, trust me. You need an advanced text editor that can provide at least these capabilities:

  • Deal with different encodings (UTF, ANSI, etc.)
  • Open big files, sometimes with unusual formats or extensions
  • Do global search and replace operations with regular-expression support
  • Highlight syntax (display different programming, scripting, or markup languages — such as XML and HTML — with color codes)
  • Have multiple files open at the same time (that is, support tabs)

This is a list of my favorite text editors, but there are a lot of good ones out there.

Notepad++

Notepad++ is my editor of choice. You can open virtually any file with it, it’s really fast, and it will keep your files in the editor even if you quit it. You can easily search and replace in a file or in all open files, using regular expressions or just extended characters (control characters like \n or \t). It’s really easy to convert to and from different encodings and to save all opened files at once. You can also download different plug-ins, like spellcheckers, comparators, etc. It’s free, and you can download it from the Notepad++ web site.

notepadplusplus-window

Sublime Text

Sublime Text is another amazing editor, and it’s a developers’ favorite. Personally, I find it great for writing scripts. You can do many cool things with it, like using multiple selections to change several instances of a word at once, splitting a selection of words into different lines, etc. It supports regular expressions and tabs as well. It has a distraction-free mode if you really need to focus. It comes with a free trial period, and you can get it at the Sublime Text web site.

sublime-window

EmEditor

Syntax highlighting, document comparison, regular expressions, handling of huge files, encoding conversion: EmEditor is a complete package. My favorite feature, however, is the scriptable macros. This means that you can create, record, and run macros within EmEditor — you can use these macros to automate repetitive tasks, like making changes in several files or saving them with different extensions. You can download it from the EmEditor web site.

emeditor-window

Language Quality Assurance Tools

Quality Assurance tools assist you in automatically finding different types of errors in translated content. They all basically work in a similar way:

  1. You load files with your translated content (source + target).
  2. You optionally load reference content, like glossaries, translation memories, previously translated files, or blacklists.
  3. The tool checks your content and provides a report listing potential errors.

You can find lots of errors using a QA tool:

  • Terminology: where term A in the source is not translated as B in the target
  • Blacklisted terms: terms you don’t want to see in the target
  • Inconsistencies: same source segment with different translations
  • Differences in numbers: when the source and target numbers don’t match (see the sketch right after this list)
  • Capitalization
  • Punctuation: missing or extra periods, duplicate commas, and so on
  • Patterns: certain user-defined patterns of words, numbers, and signs (which may contain regular expressions to make them more flexible) expected to occur in a file
  • Grammar and spelling errors
  • Duplicate words, tripled letters, and more
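
To make a check like the number comparison above concrete, here is a minimal Python sketch of the idea. It is only an illustration, not how Xbench or Checkmate actually work: the (source, target) pairs are passed in directly instead of being read from a bilingual file, and the regular expression covers plain integers and simple decimals only.

import re

NUMBER = re.compile(r"\d+(?:[.,]\d+)?")

def number_mismatches(pairs):
    """Return the (source, target) pairs whose numbers don't match."""
    issues = []
    for source, target in pairs:
        if sorted(NUMBER.findall(source)) != sorted(NUMBER.findall(target)):
            issues.append((source, target))
    return issues

# Made-up example segments: only the second pair is reported.
print(number_mismatches([
    ("Vintage doll, 10 inches tall", "Muñeca vintage de 10 pulgadas"),
    ("Vintage doll, 10 inches tall", "Muñeca vintage de 12 pulgadas"),
]))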

Some QA tools you should try are Xbench and Checkmate.

Xbench

Xbench allows you to run the following QA Checks:

  • Untranslated segments
  • Segments with the same source text and different target text
  • Segments with the same target text and different source text
  • Segments whose target text matches the source text (potentially untranslated text)
  • Tag mismatches
  • Number mismatches
  • Double blanks
  • Repeated words
  • Terminology mismatches against a list of key terms
  • Spell-checking of translations

Some linguists like to load all their reference materials into Xbench (translation memories, glossaries, termbases, and other reference files), as the tool allows you to look up a term from any other running application with just a shortcut.

Xbench also has an Internet Search tab to run searches on Google. The list is pretty limited, but there are ways to expand it.

xbench-window

Checkmate

Checkmate is the QA tool of the Okapi Framework, an open-source suite of applications that support the localization process. The framework includes several other tools, but Checkmate is the one you want for running quality checks on your files. It supports many bilingual file formats, like XLIFF, TTX, and TMX. Some of the checks you can run are repeated words, corrupted characters, patterns, inline code differences, significant differences in length between source and target, missing translations, spaces, etc.

The patterns section is especially interesting; I will come back to it in the future. Checkmate also produces comprehensive error reports in different formats. It can also be integrated with LanguageTool, an open-source spelling and grammar checker.

checkmate-window

Comparison Tools

Why do you need a comparison tool? Comparing files is a very practical way to see in detail what changes were introduced, for example, which words were replaced, which segments contain changes, or whether there is any content added or missing. Comparing different versions of a file (for example, before and after post-editing) is essential for processes that involve multiple people or steps. Beyond Compare is, by far, the best and most complete comparison tool, in my opinion.

With Beyond Compare, you can compare entire folders, too. If you work with many files, comparing two folders is an effective way to determine if you are missing any files or if a file does not belong in a folder. You can also see if the contents of the files are different or not.

beyond-compare-window
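
If you ever need a quick, scriptable alternative, Python’s standard difflib module can produce a similar line-by-line comparison. This is just a bare-bones sketch (the file names are made up); for day-to-day work, a dedicated tool like Beyond Compare is far more convenient.

import difflib

# Hypothetical files: the same batch of titles before and after post-editing.
with open("titles_mt.txt", encoding="utf-8") as f:
    before = f.readlines()
with open("titles_postedited.txt", encoding="utf-8") as f:
    after = f.readlines()

# Lines starting with "-" come from the first file, lines starting with "+"
# from the second one; unchanged lines are printed with a leading space.
for line in difflib.unified_diff(before, after, fromfile="MT output", tofile="Post-edited"):
    print(line, end="")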

Concordance Tool

As defined by its website, AntConc is a “freeware corpus analysis toolkit for concordancing and text analysis.” This is, in my opinion, one of the most helpful tools out there when you want to analyze your content, regardless of the language. AntConc will let you easily find n-grams and sort them by the number of occurrences. This is extremely important when you want to find patterns in your content. Remember: with MT, you want to fix patterns, not specific occurrences of errors. It may sound obvious, but finding and fixing patterns is a more efficient way to get rid of an issue than trying to fix each particular instance of an error.

AntConc will also create a list of each word in your content, preceded by the number of hits. This is extremely helpful for your terminology work. There are so many things you can use this tool for that it deserves its own article.

antconc-window

CAT Tools

In most cases, using a translation memory (TM) is a good idea. If you are not familiar with CAT (computer-assisted translation) tools, they provide a translation environment that combines an editor, a translation memory, and a terminology database. The key part here is the TM, which is essentially a database that stores translations in the form of translation units (that is, a source segment plus its target segment). If you come across the same or a similar source segment later, the TM will “remember” the translation you previously stored there.
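
As a toy illustration of the idea (the segments below are made up, and real CAT tools store TMs in far more sophisticated ways), a TM can be pictured as a mapping from source segments to target segments, with fuzzy matching returning the translation of the most similar stored source:

import difflib

# A toy translation memory: source segments mapped to stored translations.
tm = {
    "Black leather case for iPhone": "Funda de cuero negro para iPhone",
    "Red cotton dress, size M": "Vestido de algodón rojo, talla M",
}

def tm_lookup(segment, cutoff=0.75):
    """Return the translation of the closest stored source segment, if any."""
    match = difflib.get_close_matches(segment, tm.keys(), n=1, cutoff=cutoff)
    if not match:
        return None
    similarity = difflib.SequenceMatcher(None, segment, match[0]).ratio()
    return tm[match[0]], round(similarity, 2)

# A new segment that is similar, but not identical, to a stored one (a fuzzy match).
print(tm_lookup("Black leather case for iPad"))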

CAT Tools make a great post-editing environment. Most modern tools can be connected to different machine translation systems, so you get suggestions both from a TM and from an MT system. And you can use the TM to save your post-edited segments and reuse them in the future. If you have to use glossaries or term bases, CAT tools are ideal, as they can also display terminology suggestions.

When post-editing with a CAT tool, there are usually two approaches: you can get MT matches from a connected MT system or from a TM (assuming, of course, that the MT output was previously added to it), or you can work on bilingual, pre-translated files and store only the post-edited segments in your TM.

Matecat

If you have never tried it, I totally recommend Matecat. It’s a free, open-source, web-based CAT tool, with a nice and simple editor that is easy to use. You don’t have to install a single file. They claim you will always get up to 20% more matches than with any other CAT tool. Considering that some tools out there cost around 800 dollars, what Matecat has to offer for free can’t be ignored. It can process 50+ file types; you can get statistics on your files (like word counts or even how much time you spent on each segment), split them, save them in the cloud, and download your work. Even if you have never used a CAT tool before, you will feel comfortable post-editing in Matecat in just a few minutes.

matecat-window

OmegaT

Another interesting free, open-source option is OmegaT. It’s not as user-friendly as Matecat, so you will need some time to get used to it, even if you are an experienced TM user. It has pretty much all the main features commercial CAT tools have, like fuzzy matching, propagation, and support for around 40 different file formats, and it offers integration with Google Translate. If you have never used it, you should give it a try.

omegat-window

memoQ

If you are looking to invest some money in a commercial tool, my personal favorite is memoQ. It has tons of cool features and overall it is a solid translation environment. It probably deserves a more detailed review, but that is outside the scope of this post.

memoq-window

If you enjoyed this article, please check other posts from the eBay MT Language Specialists series.

Machine Translation Corpus Analysis

 

Statistical Machine Translation (SMT) needs very large amounts of text data to produce good translations. We are talking about millions of words. But it’s not simply any text data: it’s good data that will produce good translations.

The challenge is making sense of all these millions of words. How do you find out whether the quality of a corpus is good enough to be used in your MT system? How do you know what to improve if you realize a corpus is not good? How do you know what your corpus is about? Reading every single word or line is completely out of the question.

Corpus analysis can help you find answers to these questions. It can also help you understand how your MT system is performing and why. It can even help you understand how your post-editors are performing.

I will cover some analysis techniques that I believe are effective for understanding your corpus better. To keep things simple, I will use the word “corpus” to refer to any text sample, either one used to produce translations or one that is the result of a translation-related process.

The tools

I’m going to cover two tools: AntConc and Python. The former is exclusively a corpus analysis tool. The latter is a programming language (linguists, please don’t panic!), but I’m going to show you how you can use a natural language processing module (NLTK) to dig into your corpora, and I’ll also provide bits of code for you to try.

AntConc and Python can be used on Windows, Mac, and Linux.

AntConc

As defined by its web site, AntConc is a “freeware corpus analysis toolkit for concordancing and text analysis.” It’s really simple to use. It contains seven main tools for analysis and has several interesting features. We will take a closer look at the details and how the tool can be used with the following examples.

Getting a Word List

A great way to know more about your corpus is getting a list of all the words that appear in it. AntConc can easily create a list of all the words that appear in your corpus and show important additional information about them, like how many tokens there are and the frequency of each. Knowing which words appear in your corpus can help you identify what it is about; the frequency can help you determine which are the most important words.

You can also see how many tokens (individual words) and word types (unique words) there are in a corpus. This is important for determining how varied your text is (how many different words it contains).

To create a word list, after loading your corpus files, click the Word List tab and click Start. You’ll see a list of words sorted by frequency by default. You can change the sorting order in the Sort by drop-down. Besides frequency, you can sort alphabetically and by word ending.

adding-keyword-lists-window

Frequency is often a good indicator of important words — it makes sense to assume that tokens that appear many times have a more relevant role in the text.

But what about prepositions or determiners and other words that don’t really add any meaning to the analysis? You can define a word list range, that is, you can add stop words (words you want to exclude from your analysis) individually or in entire lists.

adding-stopwords-window-demo
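
The same kind of filtering can be scripted with Python and NLTK (covered in more detail later in this article). A minimal sketch, assuming the English stop word list has already been downloaded with nltk.download('stopwords') and using a made-up token list:

import nltk
from nltk.corpus import stopwords

tokens = ["The", "vintage", "leather", "case", "is", "in", "the", "original", "box"]
stops = set(stopwords.words("english"))

# Keep only alphabetic tokens that are not stop words, then count them.
content_words = [t for t in tokens if t.isalpha() and t.lower() not in stops]
freq = nltk.FreqDist(content_words)
print(freq.most_common())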

Word lists are also very good resources to create glossaries. You can either use the frequency to identify key words or just go through the list to identify words that may be difficult to translate.

Keyword Lists

This feature allows you to compare a reference corpus and a target corpus and then identify words that are unusually frequent or infrequent in the target. What’s the use for this? Well, this can help you get better insight into post-editing changes, for example, and try to identify words and phrases that were consistently changed by post-editors. It’s safe to assume that the MT system is not producing a correct translation for such words and phrases. You can add these to any blacklists, QA checks, or automated post-editing rules you may be using.

adding-keyword-lists-window

A typical scenario would be this: you use your MT output as the target corpus and a post-edited/human translation (for the same source text, of course) as the reference corpus; the comparison will tell you which words are frequent in the MT output but not so frequent in the PE/HT content.

“Vintage” here is at the top of the list. In my file with MT output segments, it occurs 705 times. If I do the same with the post-edited content, there are 0 occurrences. This means post-editors have consistently changed “vintage” to something else. It’s safe to add this word to my blacklist then, as I’m sure I don’t want to see it in my translated content. If I know how it should be translated, it could be part of an automated post-processing rule. Of course, if you retrain your engine with the post-edited content, “vintage” should become less common in the output.

To add a reference corpus, in the Tool Preferences menu, select Add Directory or Add Files to choose your corpus file(s). Click the Load button after adding your files.

adding-a-keyword-list-window
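
If you prefer to approximate this kind of comparison with a script, a rough Python sketch could simply count word frequencies in both files and look at the words whose counts diverge the most (the file names below are made up, and this is much cruder than AntConc’s keyword statistics):

import re
from collections import Counter

def word_counts(path):
    """Count the words in a plain text file (lowercased, very naive tokenization)."""
    with open(path, encoding="utf-8") as f:
        return Counter(re.findall(r"\w+", f.read().lower()))

mt = word_counts("mt_output.txt")        # hypothetical MT output
pe = word_counts("post_edited.txt")      # hypothetical post-edited version

# Words that are frequent in the MT output but rare or absent after post-editing
# are good candidates for blacklists or automated post-editing rules.
for word in sorted(mt, key=lambda w: mt[w] - pe.get(w, 0), reverse=True)[:20]:
    print(word, mt[word], pe.get(word, 0))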

Collocates

Collocates are simply words that occur together. This feature allows you to search for a word in a corpus and get a list of results that show other words that appear next to the search term. You can see how frequent a collocate is and also choose whether your results should include collocates appearing to the right of the term, to the left, or both. What’s really interesting about this is that it can help you find occurrences of words that occur near your search term and not necessarily next to it. For example, in eBay’s listing titles, the word “clutch” can sometimes be mistranslated. It’s a polysemous word, and it can be either a small purse or an auto part. I can do some analysis on the collocate results for clutch (auto parts) and see if terms like bag, leather, purse, etc., occur near it.

You can also select how frequent a collocate needs to be in order to be included in the results.

This is very useful to spot unusual combinations of words as well. It obviously depends on the language, but a clear example could be a preposition followed by another preposition.

collocates-window

To use this feature, load your corpus files and click the Collocates tab. Select the From and To ranges — values here contain a number and a letter: L(eft)/R(ight). The number indicates how many words away from the search term should be included in the results, and L/R indicates the direction in which collocates must appear. You can also select a frequency value. Enter a search term and click Start.

All the results obtained with any of the tools that AntConc provides can be exported into several formats. This allows you to take your data and process it in any other tool.

Clusters and n-grams

This is perhaps one of the most useful features in AntConc. Why? Because it allows you to find patterns. Remember that, when working with MT output, most of the time it’s not realistic to try to find or fix every single issue. There may be tons of errors with varying levels of severity in the MT output (especially considering the volumes of content processed by MT), so it does make sense to focus first on those that occur more frequently or that have a higher severity.

Here’s a simple example: let’s assume that by looking at your MT output you realize that your MT system is translating the word “inches” as “centimeters” without making any changes to the numbers that usually precede that word; that is, “10 inches” is being consistently translated as “10 centimeters”. You could try to find and fix “1 centimeter”, “2 centimeters”, “3 centimeters”, etc. Instead, a much better choice would be to identify a pattern: “any number” followed by the word “centimeters” should instead be “any number” followed by “inches”. This is an oversimplification, but the point is that identifying an error pattern is a much better approach than fixing individual errors.

Once you have identified a pattern, the next step is to figure out how you can create some sort of rule to find or fix such a pattern. Simple patterns made of words or phrases are pretty straightforward: find all instances of “red dress” and replace them with “blue dress”, for example. Now, you can take this to the next level by using regular expressions. Going back to the inches example, you could easily find all instances of “any number” followed by centimeters with a simple regex like \d+ centimeters, where \d stands for any digit and the + sign stands for one or more (digits).
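
For instance, a rule like the inches example can be expressed in a couple of lines of Python. This is just a sketch with a made-up segment; a real automated post-editing setup would normally read its rules from a file rather than hard-code them.

import re

mt_output = "Vintage doll, 10 centimeters tall, original box"   # made-up MT segment

# \d+ matches one or more digits; the parentheses capture the number so it
# can be reused in the replacement as \1.
fixed = re.sub(r"(\d+) centimeters", r"\1 inches", mt_output)
print(fixed)   # Vintage doll, 10 inches tall, original box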

Clusters

Using the Clusters/N-Grams tool helps you find strings of text based on their length (number of tokens or words), frequency, and even the occurrence of any specific word. Once you open your corpus, AntConc can find a word or a pattern in it and cluster the results in a list. If you search for a word in your corpus, you can opt to see words that precede or follow the word you searched for.

Results can be sorted by several criteria:

  • By frequency (ideal to find recurring patterns — the more frequent a pattern is, the more relevant it might be)
  • By word (ideal to see how your MT system is dealing with the translation of a particular term)
  • By word end (sorted alphabetically based on the last word in the string)
  • By range (if your corpus is composed of more than one file, in how many of those files the search term appears)
  • By transitional probability (how likely it is that word2 will occur after word1; e.g., the probability of “am” occurring after “I” is much higher than that of “dishwasher” occurring after “I”)
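
If you are curious how a transitional probability like this can be estimated, here is a minimal Python/NLTK sketch that counts, for each word in a made-up token list, which words follow it and how often:

import nltk

tokens = "i am happy i am here i love this old dishwasher".split()   # toy corpus

# Count, for each word, the words that immediately follow it.
cfd = nltk.ConditionalFreqDist(nltk.bigrams(tokens))

# Estimate P(next word | "i") from the counts.
total = sum(cfd["i"].values())
for word, count in cfd["i"].most_common():
    print(word, count / total)
# "am" follows "i" twice as often as "love"; "dishwasher" never follows it here.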

Let’s see how the Clusters tool can be used. I’ve loaded my corpus in AntConc, and I want to see how my system is dealing with the word “case”. Under the Clusters/N-Grams tab, let’s select the Word check box, as I want to enter a specific search term. I want to see clusters that are three to four words long. And, very important here, the Search Term Position option: if you select Left, your search term will be the first word in the cluster; if you select Right, it’ll be the last one instead. The following screenshots show how the Left/Right option selection affects the results.

clusters-on-left
On left

clusters-on-right
On right

We can also use regular expressions here for cases in which we need more powerful searches. Remember the example about numbers and inches above? Well, numbers, words, spaces, letters, punctuation — all these can be covered with regular expressions.

Let’s take a look at a few examples:

Here I want to see all two-word clusters that start with the word “original”, so I’m going to use a boundary (\b) before “original”. I don’t know the second word (it’s actually what I want to find out), so I’m going to use \w, which stands for “any word”. All my results will then have the following form: original + word.

original-two-word-clusters

Now I want to see all clusters, regardless of their frequency, that contain the words “price” OR “quality”. So, in addition to adding the boundaries, I’m going to separate these words with | (a vertical bar), which simply stands for “or”.

This is really useful when you want to check how the system is dealing with certain words — there’s no need to run separate searches since you can combine any number of words with | between them. Check the Global Settings menu for reference.

price-or-quality-clusters

For seasoned regex users, note that regex capabilities in AntConc are pretty modest and that some operators are not standard.

N-grams

If you are not familiar with this term, in a nutshell, an n-gram is a word or sequence of words of any length: a 1-gram is composed of one element, a 2-gram of two elements, etc. It’s a term that describes the length of a string rather than its content.

n-gram-window

What’s great about this feature is that you can find recurring phrases without specifying any search terms. That is, you can easily obtain a list of, for example, all the 6-grams to 3-grams that occur more than 10 times in your corpus. Remember that clusters work in the opposite way — you find words that surround a specific search term.

The n-gram search is definitely an advantage when you don’t know your corpus very well and you still don’t know what kind of issues to expect. It’s usually a good choice if it’s the first time you are analyzing a corpus — it finds patterns for you: common expressions, repeated phrases, etc.

When working with n-grams, it’s really important to consider frequency. You want to focus your analysis on n-grams that occur frequently first, so you can cover a higher number of issues.
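
The Python section below covers this kind of scripting in more detail, but as a preview, a similar list of frequent n-grams can be produced with NLTK in a few lines. A sketch, assuming a hypothetical plain text file and the NLTK tokenizer data already installed:

import nltk

# Hypothetical corpus file; nltk.word_tokenize needs the "punkt" tokenizer data.
with open("mt_output.txt", encoding="utf-8") as f:
    tokens = nltk.word_tokenize(f.read())

# All 3-grams that occur more than 10 times, most frequent first.
trigrams = nltk.FreqDist(nltk.ngrams(tokens, 3))
for ngram, count in trigrams.most_common():
    if count <= 10:
        break
    print(" ".join(ngram), count)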

What can you do with your findings, besides the obvious benefit of knowing your corpus better? You can find recurring issues and create automated post-editing rules. Automated post-editing is a technique that consists of applying search-and-replace operations to the MT output. For instance, going back to our initial “inches” vs. “centimeters” example, you could create a rule that replaces all instances of number+centimeters with number+inches. Using regular expressions, you can create very powerful, flexible rules. Even though this technique was particularly effective when working with rule-based machine translation (RBMT), it’s still pretty useful for SMT between training cycles (the process in which you feed new data to your system so it learns to produce better translations).

You can also create blacklists with issues found in your MT output. A blacklist is simply a list of terms that you don’t want to see in your target, so, for example, if your system is consistently mistranslating the word “case” as a legal case instead of a protective case, you can add the incorrect terms to a blacklist and easily detect when they occur in your output. In the same way, you can create QA checks to run in tools like Checkmate or Xbench.
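
Checking output against a blacklist is simple enough to script yourself as well; a minimal sketch, with a made-up blacklist and segments:

blacklist = {"vintage", "centimeters"}        # terms you never want in the target

segments = [
    "Vintage leather bag, 10 centimeters",
    "Original leather bag, 10 inches",
]

for number, segment in enumerate(segments, start=1):
    hits = [term for term in blacklist if term in segment.lower()]
    if hits:
        print("Segment {}: blacklisted term(s): {}".format(number, ", ".join(hits)))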

Python

(Please note that this article is intended for a general audience and, if you are a Python expert, you may find some of the ideas below too basic. I’m not a Python expert myself, so I apologize in advance if the terminology I use here is not what experts use.)

For those of you not familiar with Python, it’s a programming language that has been gaining more and more popularity for several reasons: it’s easy to learn and easy to read, it can be run in different environments and operating systems, and there’s a significant number of modules that can be imported and used.

Modules are files that contain classes, functions, and other data. Without getting too technical, a module is code already written that you can reuse without having to write it yourself from scratch. Quick example: if you want to write a program that uses regular expressions, you can simply import the re module, and Python will know how to deal with them thanks to the code in the module.

Enter Natural Language Processing Toolkit

Modules are the perfect segue to introduce the Natural Language Processing Toolkit (NLTK). Let me just steal the definition from their site, www.nltk.org: “NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries…”

Using Python and NLTK, there are quite a few interesting things you can do to learn more about your corpora. I have to assume you are somewhat familiar with Python (not an expert!), as a full tutorial would simply exceed the scope of this post. If you want to learn more about it, there are really good courses on Coursera, Udemy, and YouTube, for example. I personally like Codecademy’s hands-on approach.

The Mise en place

To follow these examples, you’ll need the following installed:

  • Python 3.5 (version 2.7 works too, but some of these examples may need tweaking)
  • NLTK
  • Numpy (optional)

To get corpora, you have two options: you can choose to use corpora provided by NLTK (ideal if you just want to try these examples, see how Python works, etc.) or you can use your own files. Let me walk you through both cases.

If you want to use corpora from NLTK, open Python’s IDLE, import the nltk module (you’ll do this every time you want to use nltk), and then download the corpora:

>>> import nltk
>>> nltk.download()

A new window will open, and you’ll be able to download one or more corpora, as well as other packages. You can find the entire list on the NLTK Corpora page.

list-of-corpora

When working in Python, you can import (a) all available corpora at the same time or (b) a single corpus. Notice that (a) will import books (like Moby Dick and The Book of Genesis).

(a) >>> from nltk.book import *
(b) >>> from nltk.corpus import brown

If you want to use your own files, you’ll have to tell Python where they are so it can read them. Follow these steps if you want to work with one file (but remember to import nltk first):

>>> f = open(r'c:\reviews.txt', 'r')
>>> raw = f.read()
>>> tokens = nltk.word_tokenize(raw)
>>> text = nltk.Text(tokens)

Basically, I’m telling Python to open my file called reviews.txt saved in C:\. The “r” in front of the path marks it as a raw string, so the backslashes in the path are read correctly. I’m also telling Python that I want to read, not write to, this file.

Then, I’m telling Python to read the contents of my file and store them in a variable called raw, to tokenize the content (“identify” the words in it) and store the tokens in a variable called tokens, and finally to wrap those tokens in an NLTK Text object called text. Don’t get scared by the technical lingo at this point: a variable is just a name that we assign to a bucket where we store information, so we can refer to it later.

What if you have more than one file? You can use the Plaintext Corpus Reader to deal with several plaintext documents. Note that if you follow the example below, you’ll need to replace sections with the relevant information, such as your path and your file extension.

>>> from nltk.corpus import PlaintextCorpusReader
>>> files = ".*\.txt"
>>> corpus0 = PlaintextCorpusReader(r"C:/corpus", files)
>>> corpus  = nltk.Text(corpus0.words())

Here, I’m importing PlaintextCorpusReader, telling Python that my files have the .txt extension and where they are stored, and storing the data from my files in a variable called corpus.

You can test if your data was correctly read just by typing the name of the variable containing it:

>>> corpus
<Text: This black and silver Toshiba Excite is a...>
>>> text
<Text:`` It 's a Motorola StarTac , there...>

corpus and text are the variables I used to store data in the examples above.

Analyzing (finally!)

Now that we are all set up and have our corpora imported, let’s see some of the things we can do to analyze them.

We can get a word count using the len function. It is important to know the size of our corpus, basically to understand what we are dealing with. What we’ll obtain is a count of all words and symbols, repeated words included.

>>> len(text)
43237
>>> len(corpus)
610

If we want to count only unique tokens, excluding repeated elements, we can combine len with set, as in this example.

>>> len(set(corpus))
370

With the set function, we can get a list of all the words used in our corpus, that is, a vocabulary.

>>> set(corpus)
{'knowledge', 'Lord', 'stolen', 'one', ':', 'threat', 'PEN', 'gunslingers', 'missions', 'extracting', 'ensuring', 'Players', 'player', 'must', 'constantly', 'except', 'Domino', 'odds', 'Core', 'SuperSponge', etc.

A list of words is definitely useful, but it’s usually better to have them alphabetically sorted. We can also do that easily.

>>> sorted(set(corpus))
["'", '(', ').', ',', '-', '--', '.', '3', '98', ':', 'Ancaria', 'Apocalypse', 'Ashen', 'Barnacle', 'Bikini', 'Black', 'Bond', 'Bottom', 'Boy', 'Core', 'Croft', 'Croy', 'D', 'Dalmatian', 'Domino', 'Egyptian', etc.

Note that Python will put capitalized words at the beginning of your list.

We can check how many times a word is used on average, what we call “lexical richness.” From a corpus analysis perspective, it’s good that a corpus is lexically rich, as theoretically the MT system will “learn” how to deal with a broader range of words. This indicator can be obtained by dividing the total number of words by the number of unique words.

>>> len(text)/len(set(text))
6.1459843638948115
>>> len(corpus)/len(set(corpus))
1.6486486486486487

If you need to find out how many times a word occurs in your corpus, you can try the following. (Notice that this is case-sensitive.)

>>> text.count("leave")
50
>>> text.count("Leave")
10

One key piece of information you probably want to get is the number of occurrences of each token or vocabulary item. As we mentioned previously, frequent words say a lot about your corpus. They can be used to create glossaries, for example. One way to do this is using frequency distributions. You can also use this method to find how many times a certain word occurs.

>>> from nltk import FreqDist
>>> fdistcorpus = FreqDist(corpus)
>>> fdistcorpus
FreqDist({',': 33, 'the': 27, 'and': 24, '.': 20, 'a': 20, 'of': 17, 'to': 16, '-': 12, 'in': 8, 'is': 8, ...})

>>> fdistcorpus['a']
20

A similar way to do this is using the vocab function:

>>> text.vocab()
FreqDist({',': 2094, '.': 1919, 'the': 1735, 'a': 1009, 'of': 978, 'and': 912, 'to': 896, 'is': 597, 'in': 543, 'that': 518, ...})

Conversely, if you want to see the words that only appear one time, use the hapaxes function:

>>> fdistcorpus.hapaxes()
['knowledge', 'opening', 'mystical', 'return', 'bound']

If you only want to see, for example, the ten most common tokens from your corpus, there’s a function for that:

>>> fdistcorpus.most_common(10)
[(',', 33), ('the', 27), ('and', 24), ('.', 20), ('a', 20), ('of', 17), ('to', 16), ('-', 12), ('in', 8), ('is', 8)]

We can have the frequency distribution results presented in many ways:

  • One column
    >>> for sample in fdistcorpus:
    	print(sample)
    
    	
    knowledge
    opening
    mystical
    return
    bound
    moving
    Bottom
    slam
    Lord
    Ms
    

    Here, I’m using a for loop. Loops are typically used when you want to repeat or iterate an action. In this case, I’m asking Python, for each token or sample in my corpus, to print said sample. The loop will perform the same action for all the tokens, one at a time, and stop when it has covered every single one of them.

  • Tab-separated
    >>> fdistcorpus.tabulate()
                   ,              the              and                .                a               of               to                -               in               is             with              for               as                '                s              The               by               or               --               be             them               on              has
    
  • Chart
    >>> fdistcorpus.plot()
    

    python-analysis-chart

Visualizing Machine Translation Quality Data — Part I

There is no knowledge that is not power

We are, no doubt, living through some of the most exciting days of the Information Age. Computers keep getting faster, and smartphones are ubiquitous. Huge amounts of data are created daily by amazingly diverse sources. It is definitely easier than ever for language services buyers and providers to gather data, but it looks like the localization industry is really not doing a lot to harness all this information. Overall, and with a few exceptions, of course, the industry seems to be missing out and is not fully leveraging, or at least trying to understand, all these wonderful bits and pieces of information that are generated every day.

Perhaps the issue is that too much information can be hard to make sense out of and may even feel overwhelming. That is, precisely, the advantage of data visualization.

In this series of articles, I will cover three different tools you can use to visualize your translation quality data: Tableau, TAUS DQF, and Excel. This article is part 1 and will only focus on general information and Tableau.

The case for data visualization

Perhaps the single most important point of data visualization is that it allows you to assimilate information in a very natural way. An enormous amount of information that is difficult to take in when presented in a table suddenly makes sense when summarized in a nice chart. Patterns and trends may become easier to spot and, sometimes, even obvious. Correlations may pop up and give you much-needed business or strategic advantages, allowing you to act effectively on your information.

How does this apply to translation and localization practices? Well, there simply is a lot of information you can measure and analyze, for example:

  • Productivity
  • Vendor performance
  • MT system performance
  • Tool performance
  • Financials
  • Process efficiency, etc.

At eBay, we use data visualization to track our vendors’ performance, the quality of our MT output for different language combinations, details on the types of issues found in our MT output, what types of issues we are finding in our vendors’ deliverables, and more.

The Keys

Let’s take a minute to examine what is necessary to make a visualization effective. I’m by no means an expert on this subject, you’ll notice, but based on my experience and research, these are the key points to consider:

First of all, be clear. What are you trying to find out with a chart? What do you want to draw attention to? What are you trying to say? Transmitting a clear message is a priority.

Be selective: don’t cram columns and lines into a visualization just because you can. Carefully plan the data points you want to include, assessing whether they contribute to the main purpose of your message. This can be difficult, especially if you have a lot of information; you may feel tempted to add data that doesn’t add any value at all.

Keep your audience in mind, and be relevant. Shape your message to answer the questions they may have. Discard any information they may find unnecessary. Project managers may be interested in financials and the percentage of on-time deliveries, engineers in process efficiencies, while language managers may be focused on quality and language performance.

Put some thought into the best way to represent the information and how you can make the most important information stand out. It’s usually a good idea to include trends, highlight patterns, and make meaningful correlations obvious.

Tableau

Tableau is perhaps one of the most popular visualization programs available. The concept is simple: Tableau can read your data, from a simple Excel file or a database (among several other options), parse it, and turn the information into dimensions and measures. And here’s the best part: you can simply drag and drop those dimensions and measures onto columns and rows, and Tableau will generate charts (or views, as they like to call them) for you. Automatically. Effortlessly.

And it comes with an amazing range of chart types and customization options that may seem overwhelming when you start using the software but, once you get the hang of it, make total sense.

Let’s look at some examples:

  • This chart shows in a very simple way how vendors are performing for each of the two content types we are working with at the moment, that is, titles and descriptions. It becomes evident that Vendor 2 may be the best option for descriptions while Vendor 5 is underperforming when it comes to titles.

    1-vendor-perf0rmance-by-content-type

  • Now, let’s imagine we want to analyze how post-editors for the different languages are doing, again based on the content type. We can take a look at how many errors reviewers found for each of them.

    Here it becomes evident that German post-editors are doing great with descriptions, but they are struggling with titles, as there’s a big difference in the position of the blue columns. We can also see that Spanish and French seem to be above the error average. Italian, Portuguese and Russian don’t show major changes from one content type to the other.

    2-content-type-by-language

  • Now we want to dig deeper into the errors our reviewers are finding, and for that, we are going to look at the different types of errors by language. Looking at this chart, it seems like the biggest problem is mistranslations. This is a good hint to try to find out why this is happening: Is the source too complex? Are post-editors not doing enough research? Are we providing the right reference material? On the other hand, the data seems to indicate that terminology is not really a big problem. We could infer that our glossaries are probably good, our tool is showing the right glossary matches, and our translators are subject matter experts.

    We can also see that French has many more issues than Italian, for example.

    3-accuracy-by-language

    Tableau will easily let you swap your columns and rows to change the way the data is presented. In the example below, the focus is now on error categories and not on the number of errors found. However, what I don’t like in this view is that the names of the error categories are vertical and are hard to read — it is possible to rotate them, but that will make the chart wider.

    There are plenty of options you can try to create a view that shows exactly what you want, in the best possible way.

    4-accuracy-by-language-vertical

  • Here’s a very simple one to quickly see which are the busiest months based on the number of words processed.

    5-errors-by-language-by-month

  • Now we want to look at edit distance — analyzing this information can help us figure out, for example, MT performance by language, considering that a low edit distance indicates less post-editing effort. I’m going to include the wordcount, as well, to see the edit distance in context.

    I can ask Tableau to display the average edit distance for all languages by placing a line in the chart.

    The blue dots indicate that German is the language with the lowest edit distance, with an average of 21.15. This may be an indication that my DE output is good, or at least better than the rest of the languages. The red dots for Italian are all over the place, which may indicate that the quality of my IT MT output is inconsistent — just the opposite of Portuguese, with most purple dots concentrated in the center of the chart.

    6-edit-distance-by-language

  • In this final example, let’s assume we want to see how much content our reviewers are covering; ideally, they should be reviewing 50% of the total wordcount. Here we can see, by language, how many words we’ve processed and how many were reviewed. You can quickly see that the wordcount for French is double the Russian wordcount. You can also easily notice that the FR reviewer is not covering as much as the rest. This may indicate that you need another reviewer or that the current reviewer is underperforming. Compare this to Portuguese, where the difference between total words and reviewed words is minimal. If we only need to review 50% of the content, the PT reviewer is covering too much.

    7-review-coverage-by-language