Machine Translation Corpus Analysis


Statistical Machine Translation (SMT) needs very large amounts of text data to produce good translations. We are talking about millions of words. But it’s not simply any text data: only good data will produce good translations.

The challenge is how to make sense of all of these millions of words. How do you find out whether the quality of a corpus is good enough for your MT system? How do you know what to improve if you realize a corpus is not good? How do you know what your corpus is about? Reading every single word or line is completely out of the question.

Corpus analysis can help you find answers to these questions. It can also help you understand how your MT system is performing and why. It can even help you understand how your post-editors are performing.

I will cover some analysis techniques that I believe are effective to understand your corpus better. To keep things simple, I will use the word “corpus” to refer to any text sample, either one used to produce translations or one being the result of a translation-related process.

The tools

I’m going to cover two tools: AntConc and Python. The first is purely a corpus analysis tool. The latter is a programming language (linguists, please, don’t panic!), but I’m going to show you how you can use a natural language processing module (NLTK) to dig into your corpora, and I’ll also provide bits of code for you to try.

AntConc and Python can be used in Windows, Mac, and Linux.


As defined on its website, AntConc is a “freeware corpus analysis toolkit for concordancing and text analysis.” It’s really simple to use. It contains seven main tools for analysis and several interesting features. We’ll take a closer look at the details and at how the tool can be used in the following examples.

Getting a Word List

A great way to know more about your corpus is getting a list of all the words that appear in it. AntConc can easily create a list with all the words that appear in your corpus and show important additional information about them, like how many tokens there are and the frequency of each. Knowing which words appear in your corpus can help you identify what it is about; the frequency can help you determine which are the most important words.

You can also see how many tokens (individual words) and word types (unique words) there are in a corpus. This is important for determining how varied your text is (how many different words it contains).

To create a word list, after loading your corpus files, click the Word List tab and click Start. You’ll see a list of words sorted by frequency by default. You can change the sorting order in the Sort by drop-down. Besides frequency, you can sort alphabetically and by word ending.


Frequency is often a good indicator of important words — it makes sense to assume that tokens that appear many times have a more relevant role in the text.

But what about prepositions or determiners and other words that don’t really add any meaning to the analysis? You can define a word list range, that is, you can add stop words (words you want to exclude from your analysis) individually or in entire lists.


Word lists are also very good resources to create glossaries. You can either use the frequency to identify key words or just go through the list to identify words that may be difficult to translate.

Keyword Lists

This feature allows you to compare a reference corpus and a target corpus and then calculate words that are unusually frequent or infrequent. What’s the use for this? Well, this can help you gain better insight into post-editing changes, for example, and identify words and phrases that were consistently changed by post-editors. It’s safe to assume that the MT system is not producing a correct translation for such words and phrases. You can add these to any blacklists, QA checks, or automated post-editing rules you may be using.


A typical scenario would be this: you use your MT output as the target corpus and a post-edited/human translation (for the same source text, of course) as the reference corpus; the comparison will tell you which words are frequent in the MT output but not in the PE/HT content.

“Vintage” here is at the top of the list. In my file with MT output segments, it occurs 705 times. If I do the same with the post-edited content, there are 0 occurrences. This means post-editors have consistently changed “vintage” to something else. It’s safe to add this word to my blacklist then, as I’m sure I don’t want to see it in my translated content. If I know how it should be translated, it could be part of an automated post-processing rule. Of course, if you retrain your engine with the post-edited content, “vintage” should become less common in the output.
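You don’t strictly need AntConc for this comparison; the same idea can be sketched in a few lines of Python. The token lists below are made up for illustration, and the frequency threshold is arbitrary:

```python
from collections import Counter

# Hypothetical token lists; in practice you would read and tokenize your
# MT output file and your post-edited file.
mt_tokens = ["vintage", "dress", "vintage", "bag", "red", "vintage"]
pe_tokens = ["retro", "dress", "retro", "bag", "red", "retro"]

mt_freq = Counter(mt_tokens)
pe_freq = Counter(pe_tokens)

# Words frequent in the MT output but absent from the post-edited text
# are blacklist candidates.
candidates = {w: c for w, c in mt_freq.items() if c >= 3 and pe_freq[w] == 0}
print(candidates)  # {'vintage': 3}
```

On real data you would tokenize both files first and likely use a statistical keyness measure, as AntConc does, rather than a raw threshold.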

To add a reference corpus, in the Tool Preferences menu, select Add Directory or Add Files to choose your corpus file(s). Click the Load button after adding your files.



Collocates are simply words that occur together. This feature allows you to search for a word in a corpus and get a list of results that show other words that appear next to the search term. You can see how frequent a collocate is and also choose if your results should include collocates appearing to the right of the term, to the left, or both. What’s really interesting about this is that it can help you find occurrences of words that occur near your search term and not necessarily next to it. For example, in eBay’s listing titles, the word “clutch” can be sometimes mistranslated. It’s a polysemous word, and it can be either a small purse or an auto part. I can do some analysis on the collocate results for clutch (auto parts) and see if terms like bag, leather, purse, etc., occur near it.
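The “clutch” check can also be scripted. Here is a minimal sketch in plain Python; the token list and window size are invented for illustration:

```python
# Collect the words that appear within a window around each occurrence
# of a search term (a simple collocate search).
tokens = ("new leather clutch bag black small clutch purse "
          "clutch kit for 5 speed manual transmission").split()

def collocates(tokens, term, window=3):
    found = []
    for i, tok in enumerate(tokens):
        if tok == term:
            left = tokens[max(0, i - window):i]   # words to the left
            right = tokens[i + 1:i + 1 + window]  # words to the right
            found.extend(left + right)
    return found

# Words like "bag" and "purse" near "clutch" suggest the fashion sense;
# "kit" and "transmission" suggest the auto-part sense.
near = collocates(tokens, "clutch")
print(near)
```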

You can also select how frequent a collocate needs to be in order to be included in the results.

This is very useful to spot unusual combinations of words as well. It obviously depends on the language, but a clear example could be a preposition followed by another preposition.


To use this feature, load your corpus files and click the Collocates tab. Select the From and To ranges — values here contain a number and a letter: L(eft)/R(ight). The number indicates how many words away from the search terms should be included in the results, and L/R indicates the direction in which collocates must appear. You can also select a frequency value. Enter a search term and click Start.

All the results obtained with any of the tools that AntConc provides can be exported into several formats. This allows you to take your data and process it in any other tool.

Clusters and n-grams

This is perhaps one of the most useful features in AntConc. Why? Because it allows you to find patterns. Remember that, when working with MT output, most of the time it’s not realistic to try to find or fix every single issue. There may be tons of errors with varying levels of severity in the MT output (especially considering the volumes of content processed by MT), so it does make sense to focus first on those that occur more frequently or that have a higher severity.

Here’s a simple example: let’s assume that by looking at your MT output you realize that your MT system is translating the word “inches” into “centimeters” without making any changes to the numbers that usually precede that word, that is, “10 inches” is being consistently translated as “10 centimeters”. You could try to find and fix “1 centimeter”, “2 centimeters”, “3 centimeters”, etc. Instead, a much better choice would be to identify a pattern: “any number” followed by the word “centimeters” should instead be “any number” followed by “inches”. This is an oversimplification, but the point is that identifying an error pattern is a much better approach than fixing individual errors.

Once you have identified a pattern, the next step is to figure out how to create some sort of rule to find or fix it. Simple patterns made of words or phrases are pretty straightforward: find all instances of “red dress” and replace them with “blue dress”, for example. Now, you can take this to the next level by using regular expressions. Going back to the inches example, you could easily find all instances of “any number” followed by “centimeters” with a simple regex like \d+ centimeters, where \d stands for any digit and the + sign stands for one or more (digits).
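As a quick illustration, the same regex works directly in Python’s re module; the sample sentence below is made up:

```python
import re

# Find every "number + centimeters" instance in a sample MT output string.
mt_output = "Screen size 10 centimeters, blade length 3 centimeters, blue case"
matches = re.findall(r"\d+ centimeters", mt_output)
print(matches)  # ['10 centimeters', '3 centimeters']
```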


Using the Clusters/N-Grams tool helps you find strings of text based on their length (number of tokens or words), frequency, and even the occurrence of any specific word. Once you open your corpus, AntConc can find a word or a pattern in it and cluster the results in a list. If you search for a word in your corpus, you can opt to see words that precede or follow the word you searched for.

Results can be sorted by several criteria:

  • By frequency (ideal to find recurring patterns — the more frequent a pattern is, the more relevant it might be)
  • By word (ideal to see how your MT system is dealing with the translation of a particular term)
  • By word end (sorted alphabetically based on the last word in the string)
  • By range (if your corpus is composed of more than one file, in how many of those files the search term appears)
  • By transitional probability (how likely it is that word2 will occur after word1; e.g., the probability of “Am” occurring after “I” is much higher than “dishwasher” occurring after “I”.)
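Transitional probability is easy to approximate yourself from bigram counts. Here is a minimal sketch on a toy corpus (the sentence is invented):

```python
from collections import Counter

# Estimate the transitional probability P(word2 | word1):
# how often word1 is followed by word2, out of all times word1 starts a bigram.
tokens = "i am sure i am happy i am here i said".split()
bigrams = list(zip(tokens, tokens[1:]))
bigram_freq = Counter(bigrams)
start_freq = Counter(tokens[:-1])  # every token except the last starts a bigram

def transitional_probability(w1, w2):
    return bigram_freq[(w1, w2)] / start_freq[w1]

print(transitional_probability("i", "am"))    # 0.75
print(transitional_probability("i", "said"))  # 0.25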

Let’s see how the Clusters tool can be used. I’ve loaded my corpus in AntConc, and I want to see how my system is dealing with the word “case”. Under the Clusters/N-Grams tab, let’s select the Word check box, as I want to enter a specific search term. I want to see clusters that are three to four words long. And, very important here, the Search Term Position option: if you select Left, your search term will be the first word in the cluster; if you select Right, it’ll be the last one instead. The following screenshots show how the Left/Right selection affects the results.

On left

On right

We can also use regular expressions here for cases in which we need more powerful searches. Remember the example about numbers and inches above? Well, numbers, words, spaces, letters, punctuation — all these can be covered with regular expressions.

Let’s take a look at a few examples:

Here I want to see all two-word clusters that start with the word “original”, so I’m going to use a word boundary (\b) before “original”. I don’t know the second word; it’s actually what I want to find out, so I’m going to use \w+, which matches any word. All my results will then have the form original + word.


Now I want to see all clusters, regardless of their frequency, that contain the words “price” OR “quality”. So, in addition to adding the boundaries, I’m going to separate these words with | (vertical bar) that simply stands for “or”.

This is really useful when you want to check how the system is dealing with certain words — there’s no need to run separate searches since you can combine any number of words with | between them. Check the Global Settings menu for reference.


For seasoned regex users, note that regex capabilities in AntConc are pretty modest and that some operators are not standard.


If you are not familiar with this term, in a nutshell, an n-gram is any contiguous sequence of words of a given length: a 1-gram is composed of one word, a 2-gram of two words, etc. It’s a term that describes the length of a string rather than its content.
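Extracting n-grams is straightforward to do yourself. A minimal sketch, using an invented sentence:

```python
def ngrams(tokens, n):
    # Slide a window of size n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "free shipping on all orders".split()
print(ngrams(tokens, 2))
# [('free', 'shipping'), ('shipping', 'on'), ('on', 'all'), ('all', 'orders')]
```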


What’s great about this feature is that you can find recurring phrases without specifying any search terms. That is, you can easily obtain a list of, for example, all the 6-grams to 3-grams that occur more than 10 times in your corpus. Remember that clusters work in the opposite way — you find words that surround a specific search term.

The n-gram search is definitely an advantage when you don’t know your corpus very well and you still don’t know what kind of issues to expect. It’s usually a good choice if it’s the first time you are analyzing a corpus — it finds patterns for you: common expressions, repeated phrases, etc.

When working with n-grams, it’s really important to consider frequency. You want to focus your analysis on n-grams that occur frequently first, so you can cover a higher number of issues.
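The frequency-first idea can be sketched in a few lines of Python. The toy corpus and the threshold below are made up; with real data you would tokenize your files first:

```python
from collections import Counter

def frequent_ngrams(tokens, n, min_freq):
    # Build all n-grams, count them, and keep only the frequent ones.
    grams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(grams)
    return {g: c for g, c in counts.items() if c >= min_freq}

# "in good condition" repeats, so it surfaces without specifying any search term.
tokens = ("item in good condition ships fast item in good condition "
          "box in good condition").split()
print(frequent_ngrams(tokens, 3, 3))  # {('in', 'good', 'condition'): 3}
```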

What can you do with your findings, besides knowing your corpus better? You can find recurring issues and create automated post-editing rules. Automated post-editing is a technique that consists of applying search-and-replace operations to the MT output. For instance, going back to our initial “inches” vs. “centimeters” example, you could create a rule that replaces all instances of number+centimeters with number+inches. Using regular expressions, you can create very powerful, flexible rules. Even though this technique was particularly effective when working with RBMT, it’s still pretty useful for SMT between training cycles (the process in which you feed new data to your system so it learns to produce better translations).
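Going back to the centimeters example once more, an automated post-editing rule of this kind could be sketched with re.sub; the sample sentence is invented:

```python
import re

# Rewrite "N centimeters" back to "N inches", keeping the number
# via a capture group.
mt_output = "This tablet has a 10 centimeters screen and a 2 centimeters bezel."
fixed = re.sub(r"(\d+) centimeters", r"\1 inches", mt_output)
print(fixed)  # This tablet has a 10 inches screen and a 2 inches bezel.
```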

You can also create blacklists with issues found in your MT output. A blacklist is simply a list of terms that you don’t want to see in your target, so for example, if your system is consistently mistranslating the word “case” as a legal case instead of a protective case, you can add the incorrect terms to the blacklists and easily detect when they occur in your output. In the same way, you can create QA checks to run in tools like Checkmate or Xbench.
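A blacklist check itself is simple to prototype. Here is a sketch; the terms and segments are invented for illustration:

```python
# Flag output segments that contain any blacklisted term.
blacklist = {"vintage", "case"}

def flag_segments(segments, blacklist):
    flagged = []
    for seg in segments:
        hits = blacklist.intersection(seg.lower().split())
        if hits:
            flagged.append((seg, sorted(hits)))
    return flagged

segments = ["Vintage leather bag", "Blue protective cover", "Phone case black"]
print(flag_segments(segments, blacklist))
```

A production check would normally tokenize properly and handle punctuation, but the principle is the same as in tools like Checkmate or Xbench.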


(Please note that this article is intended for a general audience and, if you are a Python expert, you may find some of the ideas below too basic. I’m not a Python expert myself, so I apologize in advance if the terminology I use here is not what experts use.)

For those of you not familiar with Python, it’s a programming language that has been gaining more and more popularity for several reasons: it’s easy to learn and easy to read, it can be run in different environments and operating systems, and there’s a significant number of modules that can be imported and used.

Modules are files that contain classes, functions, and other data. Without getting too technical, a module is code already written that you can reuse, without having to write it yourself from scratch. Quick example: if you want to write a program that will use regular expressions, you can simply import the re module, and Python will learn how to deal with them thanks to the data in the module.

Enter Natural Language Processing Toolkit

Modules are the perfect segue to introduce the Natural Language Processing Toolkit (NLTK). Let me just steal the definition from their site, “NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries…”

Using Python and NLTK, there are quite a few interesting things you can do to learn more about your corpora. I have to assume you are somewhat familiar with Python (not an expert!), as a full tutorial would simply exceed the purpose of this post. If you want to learn more about it, there are really good courses on Coursera, Udemy, and YouTube, for example. I personally like Codecademy’s hands-on approach.

The Mise en place

To follow these examples, you’ll need the following installed:

  • Python 3.5 (version 2.7 works too, but some of these examples may need tweaking)
  • NLTK
  • Numpy (optional)

To get corpora, you have two options: you can choose to use corpora provided by NLTK (ideal if you just want to try these examples, see how Python works, etc.) or you can use your own files. Let me walk you through both cases.

If you want to use corpora from NLTK, open your Python’s IDLE, import the nltk module (you’ll do this every time you want to use nltk), and then download the corpora:

>>> import nltk
>>> nltk.download()

A new window will open, and you’ll be able to download one or more corpora, as well as other packages. You can find the entire list on the NLTK Corpora page.


When working in Python, you can import (a) all available corpora at the same time or (b) a single corpus. Notice that (a) will import books (like Moby Dick and The Book of Genesis.)

a	>>> from nltk.book import *
b	>>> from nltk.corpus import brown

If you want to use your own files, you’ll have to tell Python where they are so it can read them. Follow these steps if you want to work with one file (but remember to import nltk first):

>>> f = open(r'c:\reviews.txt','rU')
>>> raw = f.read()
>>> tokens = nltk.word_tokenize(raw)
>>> text = nltk.Text(tokens)

Basically, I’m telling Python to open my file called reviews.txt saved at the root of the C: drive. The “r” in front of the path makes it a raw string, so Python reads the backslashes literally. I’m also telling Python that I want to read, not write to, this file.

Then, I’m telling Python to read the contents of my file and store them in a variable called raw, to tokenize the content (“identify” the words in it), and to store those tokens in a variable named text. Don’t get scared by the technical lingo at this point: a variable is just a name that we assign to a bucket where we store information, so we can later make reference to it.

What if you have more than one file? You can use the Plaintext Corpus Reader to deal with several plaintext documents. Note that if you follow the example below, you’ll need to replace sections with the relevant information, such as your path and your file extension.

>>> from nltk.corpus import PlaintextCorpusReader
>>> files = ".*\.txt"
>>> corpus0 = PlaintextCorpusReader(r"C:/corpus", files)
>>> corpus  = nltk.Text(corpus0.words())

Here, I’m importing PlaintextCorpusReader, telling Python that my files have the .txt extension and where they are stored, and storing the data from my files in a variable called corpus.

You can test if your data was correctly read just by typing the name of the variable containing it:

>>> corpus
<Text: This black and silver Toshiba Excite is a...>
>>> text
<Text:`` It 's a Motorola StarTac , there...>

corpus and text are the variables I used to store data in the examples above.

Analyzing (finally!)

Now that we are all set up and have our corpora imported, let’s look at some of the things we can do to analyze them.

We can get a word count using the len function. It is important to know the size of our corpus, basically to understand what we are dealing with. What we’ll obtain is a count of all words and symbols, repeated words included.

>>> len(text)
>>> len(corpus)

If we want to count unique tokens, excluding repeated elements, we can follow this example.

>>> len(set(corpus))

With the set function, we can get a list of all the words used in our corpus, that is, a vocabulary.

>>> set(corpus)
{'knowledge', 'Lord', 'stolen', 'one', ':', 'threat', 'PEN', 'gunslingers', 'missions', 'extracting', 'ensuring', 'Players', 'player', 'must', 'constantly', 'except', 'Domino', 'odds', 'Core', 'SuperSponge', etc.

A list of words is definitely useful, but it’s usually better to have them alphabetically sorted. We can also do that easily.

>>> sorted(set(corpus))
["'", '(', ').', ',', '-', '--', '.', '3', '98', ':', 'Ancaria', 'Apocalypse', 'Ashen', 'Barnacle', 'Bikini', 'Black', 'Bond', 'Bottom', 'Boy', 'Core', 'Croft', 'Croy', 'D', 'Dalmatian', 'Domino', 'Egyptian', etc.

Note that Python will put capitalized words at the beginning of your list.

We can check how many times a word is used on average, what we call “lexical richness.” From a corpus analysis perspective, it’s good that a corpus is lexically rich, as theoretically the MT system will “learn” how to deal with a broader range of words. This indicator can be obtained by dividing the total number of words by the number of unique words.

>>> len(text)/len(set(text))
>>> len(corpus)/len(set(corpus))

If you need to find out how many times a word occurs in your corpus, you can try the following. (Notice that this is case-sensitive.)

>>> text.count("leave")
>>> text.count("Leave")

One key piece of information you probably want to get is the number of occurrences of each token or vocabulary item. As we mentioned previously, frequent words say a lot about your corpus. They can be used to create glossaries, for example. One way to do this is using frequency distributions. You can also use this method to find how many times a certain word occurs.

>>> fdistcorpus = nltk.FreqDist(corpus)
>>> fdistcorpus
FreqDist({',': 33, 'the': 27, 'and': 24, '.': 20, 'a': 20, 'of': 17, 'to': 16, '-': 12, 'in': 8, 'is': 8, ...})

>>> fdistcorpus['a']

A similar way to do this is using the vocab function:

>>> text.vocab()
FreqDist({',': 2094, '.': 1919, 'the': 1735, 'a': 1009, 'of': 978, 'and': 912, 'to': 896, 'is': 597, 'in': 543, 'that': 518, ...})

Conversely, if you want to see the words that only appear one time, use the hapaxes function:

>>> fdistcorpus.hapaxes()
['knowledge', 'opening', 'mystical', 'return', 'bound']

If you only want to see, for example, the ten most common tokens from your corpus, there’s a function for that:

>>> fdistcorpus.most_common(10)
[(',', 33), ('the', 27), ('and', 24), ('.', 20), ('a', 20), ('of', 17), ('to', 16), ('-', 12), ('in', 8), ('is', 8)]

We can have the frequency distributions results presented in many ways:

  • One column
    >>> for sample in fdistcorpus:
    ...     print(sample)

    Here, I’m using a for loop. Loops are typically used when you want to repeat or iterate an action. In this case, I’m asking Python, for each token or sample in my corpus, to print said sample. The loop will perform the same action for all the tokens, one at a time, and stop when it has covered every single one of them.

  • Tab-separated
    >>> fdistcorpus.tabulate()
           ,   the   and     .     a    of    to
          33    27    24    20    20    17    16
  • Chart
    >>> fdistcorpus.plot()


Significance Testing for Ratio Metrics in Experiments


Written by the Experimentation Analytics Team (with Experimentation Platform Product Team)

What’s new?

We recently improved the ASP (Average Selling Price) metric calculation on our experimentation platform. As of Oct 31, 2016, we are reporting the ASP shift between test and control for all experiments.

However, one question may come to your mind: how do we report it? It’s actually a question of how to report any ratio metric. In this article, I will explain the background and then describe how we solved the problem.

Defining ASP: what do we want to measure?

ASP stands for average selling price, which seems very straightforward — the average of an item’s selling price on the eBay website. It’s not the listing price that you can browse directly on the eBay website. ASP only reflects the item price trend from completed transactions.

When it comes to calculation, we simply compute ASP as the ratio of two other important metrics: Gross Merchandise Bought (GMB) and Bought Item (BI) (or GMV and Sold Item, from the seller side). In any experiment, when we want to measure ASP’s lift between test and control groups, we simply compare the test group’s ASP with the control group’s ASP. It’s exactly the same as all other metrics we report.

Challenges for the ratio metric’s significance test

However, when we calculate the ratio metric’s p-value, many questions come up.

  • Non-buyers: How do we define ASP for GUIDs or users without any purchase? They have 0 GMB and 0 Bought Item count. Hence, what’s 0/0? It’s undefined!
  • Selection bias: If we discard non-buyers and only compute ASP among buyers, then are we incorporating a selection bias in our measurement? We all know that a treatment may already drive a lift in number of buyers.
  • Aggregation: At which level should we aggregate data to derive the standard deviation? For GMB, we usually aggregate to the user level and then calculate standard deviation. However, how can we aggregate the user-level ASP?
  • Outliers: How do we do outlier capping? Can we just apply the convenient 99.9% capping that we do for GMB?

In fact, all four of these questions are common challenges for any ratio metric. For instance, they apply equally to all conversion rates, exit rates, and defect rates. Therefore, we need to solve these four questions to develop a generic method for conducting significance tests on any ratio metric.

Significance testing for ratio metrics

Conditional ratio metrics

The answer to the first question is closely tied to the denominator of the ratio metric. In the ASP case, ASP = GMB/BI, so ASP exists conditional on BI. Clearly, 0/0 does not make any mathematical sense. Therefore, we can only report ASP conditional on transactions.

However, if we condition on transactions, we encounter possible selection bias between test and control transactions. Here, although we are not utilizing the advantage of randomization directly (that is, this is not a reduced-form estimate as in econometrics), we can still safely condition on transactions if we impose one assumption: for a specific treatment, the BI and ASP lifts are independent. This lets us decompose ASP’s lift from BI’s lift, as long as we also report BI’s lift; when BI’s lift is small, ASP’s lift can be approximated by the difference between the GMB lift and the BI lift. (A more precise calculation requires a structural model and instrumental variables; we are not doing that for now.)
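A quick numeric sketch (all figures invented) shows how close the approximation is when BI’s lift is small:

```python
# Since ASP = GMB / BI, when BI's lift is small,
# lift(ASP) is approximately lift(GMB) - lift(BI).
gmb_control, bi_control = 1_000_000.0, 50_000
gmb_test, bi_test = 1_030_000.0, 50_500

asp_lift = (gmb_test / bi_test) / (gmb_control / bi_control) - 1
gmb_lift = gmb_test / gmb_control - 1   # +3% GMB lift
bi_lift = bi_test / bi_control - 1      # +1% BI lift

print(round(asp_lift, 4))              # 0.0198
print(round(gmb_lift - bi_lift, 4))    # 0.02
```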

In conclusion, our decision is to report ASP conditional on transactions. For other ratio metrics, if the denominator is GUID or user count, then it’s just a natural unconditional ratio metric, and there will be no selection bias anyway.

Data aggregation and standard deviation

The reason for data aggregation is that we think a given user’s past behavior will be correlated with their future behavior. For example, we make recommendations on the eBay website based on a user’s past purchase patterns. Thus, there is a time dependency (or auto-correlation), and we aggregate transaction-level data to the user level to get rid of such correlation. So the answer to question #3 is still user-level aggregation.

At the user level, we aggregate both GMB and BI. For the calculation of the standard deviation, we apply the delta method, so that ASP’s standard deviation is a function of user-level GMB and BI. Fortunately, we report GMB and BI by default, so we have already collected the raw materials. For other ratio metrics, we need to aggregate both the numerator and the denominator to the user level.
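For reference, here is a sketch of that delta-method calculation in plain Python. The user-level figures are invented; the variance formula is the standard first-order expansion for a ratio of means:

```python
# Delta-method standard deviation for a ratio metric R = mean(X) / mean(Y),
# with X = user-level GMB and Y = user-level bought items (BI).
def ratio_std(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    r = mx / my
    var_x = sum((xi - mx) ** 2 for xi in x) / (n - 1)
    var_y = sum((yi - my) ** 2 for yi in y) / (n - 1)
    cov_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    # Var(R) ~ (var_x - 2*r*cov_xy + r^2 * var_y) / (n * my^2)
    var_r = (var_x - 2 * r * cov_xy + r ** 2 * var_y) / (n * my ** 2)
    return var_r ** 0.5

gmb = [120.0, 35.0, 260.0, 80.0, 150.0]   # user-level GMB
bi = [3, 1, 6, 2, 4]                      # user-level bought items
print(ratio_std(gmb, bi))
```

Note that when GMB is exactly proportional to BI across users, the delta-method standard deviation collapses to zero, as expected.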

Outlier capping

We always want to control any outlier’s impact to reduce standard deviation. Capping always depends on the metric and parameter we want to estimate. Do we want to control for users with extreme purchases, luxurious items, or bunches of cheap item purchases? Different concerns will lead to different capping choices, and we can test them all with the data.

Alternatively, we can estimate a different parameter that is less impacted by outliers, or we can use statistical tests that rely less on the mean (say, quantile tests or rank-based methods). We would like to offer more test results in the future to help people understand how ASP’s distribution is affected in each experiment.

Here are a few options.

  • P0: Everything uncapped.
  • P1: GMB and BI capped at GUID level.
  • P2: Item price capped at item level, keep quantity uncapped.
  • P3: Cap item price at item level, then cap quantity at item-GUID level.
  • P4: Treat ASP as a weighted average, cap item price at item level, then cap BI at GUID level.
  • Rank-based test: Wilcoxon rank-sum test to test difference in distribution.

It’s a metric-specific choice, and we hope our options can inspire people.
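As a tiny illustration of P1-style user-level capping, here is a nearest-rank percentile cap in plain Python. The numbers are invented, and the threshold is deliberately low so the effect is visible on a small sample:

```python
# Clip values above a chosen percentile to that percentile
# (nearest-rank method; good enough for a sketch).
def cap_at_percentile(values, pct=99.9):
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(round(pct / 100 * len(ordered))) - 1)
    cap = ordered[idx]
    return [min(v, cap) for v in values]

gmb = [25.0, 40.0, 12.0, 9000.0, 33.0]   # one extreme user
print(cap_at_percentile(gmb, pct=80))    # [25.0, 40.0, 12.0, 40.0, 33.0]
```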

Our solution for ASP

In summary, we are calculating ASP in this way:

  1. Define ASP conditional on transactions.
  2. Aggregate to the user level, and use delta method to calculate its standard deviation.
  3. Use user-level capping, the same as we do for GMB. It’s not perfect, but it requires less development time. We will keep monitoring the difference and make future enhancements if necessary.


Typically, a ratio metric brings more challenges to significance testing. Here we illustrate ASP as an example to highlight major concerns and propose some solutions. We will keep monitoring ASP’s performance on the Experimentation Platform and make improvements over time.

Ready-to-use Virtual-machine Pool Store via warm-cache

Problem overview

Conventional on-demand Virtual Machine (VM) provisioning methods on a cloud platform can be time-consuming and error-prone, especially when we need to provision VMs in large numbers quickly.

The following list captures different issues that we often encounter while trying to provision a new VM instance on the fly:

  • Insufficient availability of compute resources due to capacity constraints
  • Desire to place VMs on different fault domains to avoid concentration of VM instances on the same rack
  • Transient failures or delays in the service provider platform that result in failure, or an increase in the time, to provision a VM instance

Elasticsearch-as-a-service, or Pronto, is a cloud-based platform that provides distributed, easy to scale, and fully managed Elasticsearch clusters. This platform uses the OpenStack-based Nova module to get different compute resources (VMs). Nova is designed to power massively scalable, on-demand, self-service access to compute resources. The Pronto platform is available across multiple data centers with a large number of managed VMs.

Typically, the time taken to provision a complete Elasticsearch cluster via Nova APIs is directly proportional to the longest time taken by any member node to reach a “ready to use” (active) state. Provisioning a single node usually takes up to three minutes (95th percentile) but can take up to 15 minutes in some cases. Therefore, for a fairly large cluster, our platform would take a long time to complete provisioning. This greatly impacts our turnaround time to remediate production issues. In addition to provisioning time, it is time-consuming to validate newly created VMs.

There are many critical applications that leverage our platform for their search use cases. Therefore, as a platform provider, we need high availability to ensure that in the case of a catastrophic cluster event (such as a node or infrastructure failure), we can quickly flex up our clusters in seconds. Node failures are also quite common in a cloud-centric world, and applications need to ensure that there is sufficient resiliency built in. To avoid over-provisioning nodes, remediation actions such as flex-up (adding a new node) should ideally complete in seconds to maintain high availability.

New hardware capacity is acquired as racks from external vendors. Each rack typically has two independent fault domains with minimal resource overlap (for example, different networks), and sometimes they don’t share a common power source. Each fault domain hosts many hypervisors, which are virtual machine managers. Standalone VMs are provisioned on such hypervisors. VMs can be of different sizes (tiny, medium, large, and so on). VMs on the same hypervisor can compete for disk and network I/O resources, which can lead to noisy-neighbor issues.


Nova provides ways to be fault-domain- and hypervisor-aware. However, it is still difficult to guarantee rack isolation during run-time provisioning of VM instances. For example, once we start provisioning VMs, there is no guarantee that we will successfully create VM instances on different racks; this depends entirely on the hardware available at that point in time. Rack isolation is important to ensure high availability of Elasticsearch master nodes (the cluster brain). Every master node in an Elasticsearch cluster must reside on a different rack for fault tolerance (if a rack fails, another master node on a different rack can take up the active master role). Additionally, all data nodes of a given cluster must reside on different hypervisors for logical isolation. Our APIs must fail immediately when we cannot get VMs on different racks or hypervisors; a subsequent retry will not necessarily solve this problem.


The warm-cache module intends to solve these issues by creating a cache pool of VM instances well ahead of actual provisioning needs. Many pre-baked VMs are created and loaded in a cache pool. These ready-to-use VMs cater to the cluster-provisioning needs of the Pronto platform. The cache is continuously built, and it can be continuously monitored via alerts and user-interface (UI) dashboards. Nodes are periodically polled for health status, and unhealthy nodes are auto-purged from the active cache. At any point, interfaces on warm-cache can help tune or influence future VM instance preparation.

The warm-cache module leverages open source technologies like Consul, Elasticsearch, Kibana, Nova, and MongoDB for realizing its functionality.

Consul is an open-source distributed service discovery tool and key value store. Consul is completely distributed, highly available, and scalable to thousands of nodes and services across multiple data centers. Consul also provides distributed locking mechanisms with support for TTL (Time-to-live).

We use Consul as a key-value (KV) store for these functions:

  • Configuring VM build rules
  • Storing VM flavor configuration metadata
  • Leader election (via distributed locks)
  • Persisting VM-provisioned information
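As a rough sketch, the KV functions above could be exercised with the python-consul client. The key layout below is hypothetical (the real paths are the ones shown in the Consul snapshot); `client.kv.put` is the library's actual KV write call.

```python
import json

def build_rule_key(az: str, flavor: str) -> str:
    """Hypothetical key under which a flavor's build rules are stored."""
    return f"warm-cache/{az}/build-rules/{flavor}"

def store_build_rule(client, az: str, flavor: str, rule: dict) -> None:
    """Persist a flavor's build rule as JSON in Consul.

    `client` is assumed to be a `consul.Consul()` handle from the
    python-consul package.
    """
    client.kv.put(build_rule_key(az, flavor), json.dumps(rule))
```

The same client object would serve the other three functions (leader locks and VM-provisioned records) against different key prefixes.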

The following snapshot shows a representative warm-cache KV store in Consul.


The following screenshot shows a sample Consul’s web UI.


Elasticsearch “is a highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze big volumes of data quickly and in near real time. It is generally used as the underlying engine/technology that powers applications that have complex search features and requirements.” Apart from provisioning and managing Elasticsearch clusters for our customers, we ourselves use Elasticsearch clusters for our platform monitoring needs. This is a good way to validate our own platform offering. An Elasticsearch backend is used for warm-cache module monitoring.

Kibana is “built on the power of Elasticsearch analytics capabilities to analyze your data intelligently, perform mathematical transformations, and slice and dice your data as you see fit.” We use Kibana to depict the entire warm-cache build history stored in Elasticsearch. This build history is rendered on a Kibana dashboard with various views. The build history contains information such as how many instances were created and when they were created, how many errors occurred, how much time provisioning took, how many different racks are available, and the VM instance density on racks/hypervisors. The warm-cache module can additionally send email notifications whenever the cache is built, updated, or affected by an error.

We use the Kibana dashboard to check active and ready-to-use VM instances of different flavors in a particular datacenter, as shown in the following figure.


MongoDB “is an open-source, document database designed for ease of development and scaling.” warm-cache uses this technology to store information about flavor details. Flavor corresponds to the actual underlying VM hardware used. (They can be tiny, large, xlarge, etc.) Flavor details consist of sensitive information, such as image-id and flavor-id, which are required for actual Nova compute calls. warm-cache uses a Mongo service abstraction layer (MongoSvc) to interact with the backend MongoDB in a secure and protected manner. The exposed APIs on MongoSvc are authenticated and authorized via Keystone integration.

CMS (Configuration Management System) is a high-performance, metadata-driven persistence and query service for configuration data with support for RESTful API and client libraries (Java and Python). This system is internal to eBay, and it is used by warm-cache to get hardware information of various compute nodes (including rack and hypervisor info).

System Design

The warm-cache module is built as a pluggable library that can be integrated or bundled into any long-running service or daemon process. On successful library initialization, a warm-cache instance handle is created. Optionally, a warm-cache instance can enroll for leader election participation. Leader instances are responsible for preparing VM cache pools for different flavors. The warm cache consists of VM pools for every flavor across all available data centers.


The following figure shows the system dependencies of warm-cache.



The warm-cache module is expected to bring VM instance preparation time down to a few seconds. It should also remedy many of the exceptions and errors that occur while VM instances get to a usable state, because these errors are handled well in advance of actual provisioning needs. Typical errors encountered today include nodes being unavailable in Foreman due to sync issues and long waits for VM instances to reach the active state.

The figure below depicts the internal state diagram of the warm-cache service. This state flow is triggered on every warm-cache service deployed. Leader election is triggered at every 15-minute boundary interval (which is configurable). This election is done via Consul locks with an associated TTL (Time-to-live). After a leader instance is elected, that particular instance holds the leader lock and reads metadata from Consul for each Availability Zone (AZ, equivalent to a data center). These details include information such as the minimum number of instances of each flavor to be maintained by warm-cache. The leader instance spawns parallel tasks for each AZ and starts preparing the warm cache based on predefined rules. Preparation of a VM instance is marked as complete when the VM instance moves to an active state (for example, as indicated by an OpenStack Nova API response). All successfully created VM instances are persisted on an updated warm-cache list maintained on Consul. The leader instance releases the leader lock on complete execution of its VM build rules and waits for the next leader election cycle.
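The 15-minute boundary scheduling described above can be sketched as a small helper. This is illustrative only; the interval and the function name are assumptions, not the platform's actual code.

```python
from datetime import datetime, timedelta

ELECTION_INTERVAL_MIN = 15  # configurable, per the state flow described above

def next_election_boundary(now: datetime) -> datetime:
    """Return the next 15-minute wall-clock boundary (:00, :15, :30, :45)."""
    minutes = (now.minute // ELECTION_INTERVAL_MIN + 1) * ELECTION_INTERVAL_MIN
    base = now.replace(minute=0, second=0, microsecond=0)
    return base + timedelta(minutes=minutes)
```

Each deployed warm-cache service would sleep until this boundary and then attempt to acquire the leader lock.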

The configuration of each specific flavor (for example, g2-highmem-16-slc07) is persisted in Consul as build rules for that particular flavor. The following figure shows an example.


In the above sample rule, the max_instance_per_cycle attribute indicates how many instances are to be created for this flavor in one leadership cycle. min_fault_domain is used with the Nova API to ensure that at least two nodes in a leader cycle go to different fault domains. reserve_cap specifies the number of instances that will be blocked and unavailable via warm-cache. user_data is the base64-encoded Bash script that a VM instance executes on first start-up. total_instances keeps track of the total number of instances that need to be created for a particular flavor. An optional group_hint can be provided to ensure that no two instances with the same group-id are configured on the same hypervisor.
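A build rule with these attributes might be modeled as follows. All values here are illustrative, not taken from a real deployment; only the attribute names come from the rule described above.

```python
import base64

# Hypothetical first-boot script carried in user_data (base64-encoded).
startup_script = "#!/bin/bash\necho 'bootstrap elasticsearch node'\n"

build_rule = {
    "max_instance_per_cycle": 2,    # VMs created per leadership cycle
    "min_fault_domain": 2,          # spread across at least 2 fault domains
    "reserve_cap": 1,               # instances held back, unavailable via warm-cache
    "total_instances": 10,          # target pool size for this flavor
    "group_hint": "es-master-grp",  # no two VMs with the same group-id per hypervisor
    "user_data": base64.b64encode(startup_script.encode()).decode(),
}
```

Serialized as JSON, such a rule would be stored in Consul under the flavor's key (for example, g2-highmem-16-slc07).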

For every VM instance added to warm-cache, the following metadata is persisted on Consul:

  • Instance Name
  • Hypervisor ID
  • Rack ID
  • Server ID
  • Group name (OS scheduler hint used)
  • Created time
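A sketch of building that per-instance record, using the $AZ/$INSTANCE key convention the article describes for Consul. Field names are assumptions based on the list above.

```python
from datetime import datetime, timezone

def instance_key(az: str, instance_name: str) -> str:
    """Consul key for a cached VM: $AZ/$INSTANCE."""
    return f"{az}/{instance_name}"

def instance_metadata(name: str, hypervisor_id: str, rack_id: str,
                      server_id: str, group_name: str) -> dict:
    """Metadata persisted for each VM added to the cache (illustrative)."""
    return {
        "instance_name": name,
        "hypervisor_id": hypervisor_id,
        "rack_id": rack_id,
        "server_id": server_id,
        "group_name": group_name,  # OS scheduler hint used
        "created_time": datetime.now(timezone.utc).isoformat(),
    }
```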


Since there are multiple instances of the warm-cache service deployed, only one of them is elected leader to prepare the warm cache during a given time interval. This is necessary to avoid conflicts among multiple warm-cache instances. Consul is again used for leader election. Each warm-cache service instance registers itself as a warm-cache service on Consul. This information is used to track available warm-cache instances. The registration has a TTL (Time-To-Live) value of one hour associated with it, and any deployed warm-cache service is expected to re-register itself within that TTL. Each registered warm-cache service attempts to elect itself as leader by trying to acquire the leader lock on Consul. Once a warm-cache service acquires the lock, it acts as the leader for VM cache pool preparation. All other warm-cache service instances move to stand-by mode during this time. There is a TTL associated with each leader lock to handle leader failures and to enable leader re-election.
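Consul implements this pattern with sessions: a lock acquired under a session is released when the session's TTL expires, which is what enables re-election after a leader failure. A minimal sketch with the python-consul client (the key name is hypothetical; `session.create` and `kv.put(..., acquire=...)` are the library's real calls):

```python
LEADER_KEY = "warm-cache/leader"  # hypothetical key name

def try_acquire_leadership(client, node_name: str, ttl_seconds: int = 900):
    """Attempt to become leader via a Consul session-based lock.

    `client` is assumed to be a python-consul `consul.Consul()` handle.
    Returns the session id if this node acquired the lock, else None.
    If the leader dies, session TTL expiry releases the lock so that
    another instance can win the next election.
    """
    session_id = client.session.create(ttl=ttl_seconds, lock_delay=0)
    acquired = client.kv.put(LEADER_KEY, node_name, acquire=session_id)
    if acquired:
        return session_id  # we hold the leader lock for this cycle
    client.session.destroy(session_id)  # lost the race; stand by
    return None
```

The winning instance would later release the lock (or simply let the session expire) once its build rules have been executed.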

In the following figure, leader is a Consul key that is managed by a distributed lock for the leadership role. The last leader node name and leader start timestamp are captured on this key. When a warm-cache service completes its functions in the leader role, this key is released so that other prospective warm-cache service instances can become the new leader.



The leadership time-series graph depicts which node assumed the leadership role. The number 1 in the graph below indicates a leadership cycle.



When a leader has to provision a VM instance for a particular flavor, it first looks up the meta-information for that flavor in MongoDB (via MongoSvc). This lookup provides details such as image-id and flavor-id. This information is used when creating the actual VM instance via Nova APIs. Once a VM is created, its rack-id information is available via CMS. This information is stored in Consul under the key $AZ/$INSTANCE, where $AZ is the Availability Zone and $INSTANCE is the actual instance name. This information is then also persisted in Elasticsearch for monitoring purposes.

The following figure shows a high-level system sequence diagram (SSD) of a leader role instance:




A Kibana dashboard can be used to check how VM instances in the cache pool are distributed across available racks. The following figure shows how many VM instances are provisioned on each rack. Using this information, DevOps can change the warm-cache build attributes to influence how the cache is built in the future.




The following options are available for acquiring VM instances from the warm-cache pool:

  • The Rack-aware mode option ensures that all nodes provided by warm-cache reside on different racks.
  • The Hypervisor-aware mode option returns nodes that reside on different hypervisors, with no two nodes sharing a common hypervisor.
  • The Best-effort mode option tries to get nodes from mutually exclusive hypervisors but does not guarantee it.

The following figure illustrates the process for acquiring a VM.




The following screenshot includes a table from Kibana showing the time when an instance was removed from warm-cache, the instance’s flavor, data center information, and instance count.




For acquired VM instances, the corresponding metadata on Consul is updated, and the instances are removed from the active warm-cache list.

Apart from our ability to quickly flex up, another huge advantage of the warm-cache technique compared to conventional run-time VM creation methods is that, before an Elasticsearch cluster is provisioned, we know exactly whether we have all the required non-error-prone VM nodes to satisfy our capacity needs. Many generic applications hosted in a cloud environment require the ability to quickly flex up or to guarantee non-error-prone capacity for their deployment needs. They can take a cue from the warm-cache approach for solving similar problems.