NoSQL — Filling in the Gaps


When you look for a NoSQL data store for your service, there are a number of choices. However, if your selection criteria are not simple, you find very few options, and you may even need to develop features to cover the gaps in the database. For a collection of services that we were developing, we needed a key-value store and a document store with high availability, high read performance, and high write performance. Some of the use cases needed this data store to be the system of record, needing point-in-time restore capability. We also needed “read-your-own-write” consistency for some of the operations. This blog is about how we came to use Couchbase 3.1 and how we addressed the gaps that were found.

Couchbase setup

In our setup, there are three Couchbase clusters, one each in a data center, forming a cluster set. Each cluster has the full data set and has bidirectional Cross Datacenter Replication (XDCR) set up with each of the other two clusters in the cluster set. Each cluster has a replication factor of 2, that is, there are six copies of a document in the cluster set. The services read from and write to the primary node local to it. If the call to primary fails, the application uses its secondary node. This setup gives a highly available and eventually consistent distributed data store.






Though the above setup gives high availability, one issue with the above setup is that when an entire cluster is down or a node is slow performing, there is no way for the application to detect and switch to another cluster or a better performing node.

If the application is aware of all nodes in the cluster set and can choose the best performing node, availability improves and the service-level agreement (SLA) is more easily achieved. For this, a performance score, which is a weighted average of the query response time for that node, is assigned to the node and stored in a map. We assign more weight (say 90) to the existing score and less (say 10) to the new scores. With this strategy, scores change slowly, small spikes do not affect cluster selection, while large or sustained spikes cause a different cluster to be selected. The application chooses the best performing cluster set and optimal node (primary vs replica) at any given time. This allows for the application to automatically switch when there are external factors, such as prolonged network issues or Couchbase version upgrades.


The chart below shows the performance during a manually simulated outage scenario. Each window represents a Couchbase cluster that is in a geographically different location. The application that is driving traffic is in the same geographic location as the upper left window.




As the Couchbase cluster with the traffic is taken out with a vicious shutdown -r now across all the nodes, you can see the traffic in the upper left window take a dive, while it picks up a different data center shown in the upper right window.


As the Couchbase cluster with the traffic is taken out with a vicious shutdown -r, now across all the nodes, you can see the traffic in the upper left window take a dive, while it picks up a different datacenter shown in the upper right window.


As the application continues to request data from Couchbase, it occasionally shifts traffic to the third data center, shown as spikes in the lower left window. This is because the application is constantly evaluating the “best” performing cluster and shunting traffic to it.




Once the nodes in the “down” cluster finish rebooting and reform the cluster, the application can detect the cluster with trial queries and cause the traffic to shift back to it.


A single Couchbase cluster is highly consistent. Given a single key, reads and writes are always to the same primary node and are immediately consistent. The trade-off that Couchbase offers is write availability.

In a multiple-cluster setup, Couchbase sacrifices its strong consistency but gains much better availability. In the event that one node goes down, data for that node can’t be written in the affected cluster; however, you can choose to write to other clusters and rely on the XDCR setup to bring all your clusters back into consistency.

Even though the multiple-cluster Couchbase was meant to be eventually consistent, some of our operations needed high (“read your own write”) consistency.

High-consistency API

There are many ways to impose a higher consistency than what the underlying storage technology provides. In the data write-read cycle, there are three options on which phase can be modified to increase consistency. Additional process can be imposed on the read phase, the write phase, or both phases.

Read-phase consistency

With no guarantees during the write phase, we can increase the consistency in the read phase by simply always reading data from all the clusters and either picking the latest or picking the value that has a quorum.

Unfortunately, this does not guarantee “read your own write” consistency. When a value is updated, until the updated value is propagated to a quorum of nodes, the value returned is always the old value, unless “pick latest” is used. If the node that has the latest data is down, “pick latest” won’t work.

Write-phase consistency

We can enforce strong consistency in the write phase by making sure that the data is consistent before we return success on write. There are many ways to do this.

  • Multi-write. The application can write to all the clusters all the time and wait until the writes all return successfully before returning a success to the caller. This does make the reads very fast and efficient. The complexity of the solution greatly increases when a node is down or the network is partitioned or multiple threads are attempting to update the same key.
  • Read-back. The application can take advantage of XDCR by writing to the local cluster and performing reads to all clusters until all the reads fetch the value written. Once XDCR completes the transfer of documents and the correct values are fetched, the write operation can finally succeed. This solution also suffers from the same complexity increases for down nodes, partitioned network, or race conditions.

Both of techniques focusing on this phase improve consistency most of the time and allow us to perform high-performance local reads most of the time, but it is too complex if we have to guarantee “read your own write” capability during node failures or when the network is partitioned.

Both-phase consistency

A popular approach in both-phase consistency requires quorums during both phases. Obtain a majority quorum upon write, and during the read phase obtain another quorum. We can mathematically choose the consensus threshold in each the read and write quorums to guarantee “read your own write” consistency.



This communication diagram shows how expensive get and set calls are when we want to guarantee “read your own write” consistency. Every call requires network communication with every cluster that has the information. And as the number of clusters grows, the overhead with this approach can grow.

It is best to provide this only to the use cases that absolutely need this.

Monitoring XDCR lag

Discovering the data lag due to XDCR can give us the ability to make informed suggestions to our clients on when to actually use the very expensive “read your own write” consistency capability. In order for us to accurately measure how long it takes between writing the data and when data is available in the other clusters to be read, we simply generate heartbeats. With every heartbeat, we write the current timestamp to the local cluster and then immediately read the same document out of all the other clusters in parallel. When we take the difference between what timestamp was written, and what timestamp was received, we measure the data latency very accurately. The delta of the timestamp from the data read and the timestamp of when the data arrived can give the absolute staleness of the data from each cluster.

From our measurements we found that the data lag is very close to the network latencies between the various clusters.

Backup and restore

Some of the use cases needed point-in-time backup and restore capability. Couchbase 3.X backup and restore tools were not yet mature, but it provided a Kafka connector to stream changes to Kafka based on DCP. This was used to implement point-in-time backup and restore.

We set up a Kafka cluster to which the changes from one of the Couchbase clusters were written to. From Kafka, the readers can read the changes and write them to an Oracle database table. This table contains the changes to the Couchbase document along with the timestamp of the change. We developed a restore tool that gives the users the ability to select the data and the point in time from which restoration needs to be made. On execution, the tool extracts rows from the table, saves them to a file, and updates one of the Couchbase clusters, which then gets replicated.

In addition to the standard Couchbase keys, application-level keys are stored. This allows us to perform selective restores. If some specific values are corrupted, the corrupted data alone can be restored.

Race conditions

We avoid race conditions where data being restored might be actively updated from the service by leveraging the following approaches:

  • Highly selective restore. Rarely would we want to restore everything back to a specific point in time. There is always a time window when the corruption occurs for specific attributes or users. This reduces the data volume and the chance of collision.
  • Smaller window of collision. Because the data volume to be restored is small, the duration of the restore process is kept small. The goal is to have the restore complete in minutes rather than hours. In the worst case, where a collision does happen, it is straightforward to identify the documents that collided and run another restore with just the collided documents, so as to roll back the changes performed during the restore that had collisions as “corrupted” data.

Purging logic in the Oracle database

Over time, a lot of data can be collected. To maintain space, we used partition-dropping logic. The data is partitioned in Oracle using the LAST_MODIFIED_TS column modulo 24 months. Every month, the 24th oldest partition is dropped.


NoSQL is great, but it may not be possible to find a product that meets all your needs. Spend the necessary time to evaluate your options. Prioritize your core needs and select the solution that best meets your priorities. Evaluate the gaps, and see if the gaps can be filled in by your application.

A well-chosen NoSQL offering along with the proper application design and implementation can produce a highly available, low-latency, and scalable service to serve your needs.

Five Tools to Build Your Basic Machine Translation Toolkit


If you are a linguist working with Machine Translation (MT), your job will be a lot easier if you have the right tools at hand. Having a strong toolkit, and knowing how to use it, will save you loads of time and headaches. It will help you work in an efficient manner, as well.

As a Machine Translation Language Specialist at eBay, I use these tools on a regular basis at work, and that is why I feel comfortable recommending them. At eBay, we use MT to translate search queries and listing titles and descriptions into several languages. If you want to learn more, I encourage you to read this article.

Advanced Text Editors

The text editor that comes with your laptop won’t cut it, trust me. You need an advanced text editor that can provide at least these capabilities:

  • Deal with different encodings (UTF, ANSI, etc.)
  • Open big files, sometimes with unusual formats or extensions
  • Do global search and replace operations with regular-expression support
  • Highlight syntax (display different programming, scripting, or markup languages — such as XML and HTML — with color codes)
  • Have multiple files open at the same time (that is, support tabs)

This is a list of my favorite text editors, but there are a lot of good ones out there.


Notepad++ is my editor of choice. You can open virtually any file with it, it’s really fast, and it will keep your files in the editor even if you quit it. You can easily search and replace in a file or in all open files, using regular expressions or just extended characters (control characters like \n or \t). It’s really easy to convert to and from different encodings and to save all opened files at once. You can also download different plug-ins, like spellcheckers, comparators, etc. It’s free, and you can download it from the Notepad++ web site.


Sublime Text

Sublime Text is another amazing editor, and it’s a developers’ favorite. Personally, I find it great to write scripts. You can do many cool things with it, like using multiple selections to change several instances of a word at once, split a selection of words into different lines, etc. It supports regular expressions and tabs as well. It has a distraction-free mode if you really need to focus. It comes with a free trial period, and you can get it at the Sublime Text web site.



Syntax highlighting, document comparison, regular expressions, handling of huge files, encoding conversion: Emeditor is complete. My favorite feature, however, is the scriptable macros. This means that you can create, record, and run macros within EmEditor — you can use these macros to automate repetitive tasks, like making changes in several files or saving them with different extensions. You can download it from the EmEditor web site.


Language Quality Assurance Tools

Quality Assurance tools assist you in automatically finding different types of errors in translated content. They all basically work in a similar way:

  1. You load files with your translated content (source + target).
  2. You optionally load reference content, like glossaries, translation memories, previously translated files, or blacklists.
  3. The tool checks your content and provides a report listing potential errors.

You can find lots of errors using a QA tool:

  • Terminology: where term A in the source is not translated as B in the target
  • Blacklisted terms: terms you don’t want to see in the target
  • Inconsistencies: same source segment with different translations
  • Differences in numbers: when the source and target numbers don’t match
  • Capitalization
  • Punctuation: missing or extra periods, duplicate commas, and so on
  • Patterns: certain user-defined patterns of words, numbers and signs (which may contain regular expressions to make them more flexible) expected to occur in a file.
  • Grammar and spelling errors
  • Duplicate words, tripled letters, and more

Some QA tools you should try are Xbench and Checkmate.


Xbench allows you to run the following QA Checks:

  • Untranslated segments
  • Segments with the same source text and different target text
  • Segments with the same target text and different source text
  • Find segments whose target text matches the source text (potentially untranslated text)
  • Tag mismatches
  • Number mismatches
  • Double blanks
  • Repeated words
  • Terminology mismatches against a list of key terms
  • Spell-check translations.

Some linguists like to add all their reference materials into Xbench, like translation memories, glossaries, termbases, and other reference files, as the tool allows you to find a term while working on any other running application with just a shortcut.

Xbench also has an Internet Search tab to run searches on Google. The list is pretty limited, but there are ways to expand it.



Checkmate is the QA Tool part of the Okapi Framework, which is an open-source suite of applications to support the localization process. That means that the framework includes some other tools, but Checkmate is the one you want to perform quality checks on your files. It supports many bilingual file formats, like XLIFF, TTX, and TMX. Some of the checks you can run are repeated words, corrupted characters, patterns, inline code differences, significant differences in length between source and target, missing translations, spaces, etc.

The patterns section is especially interesting; I will come back to it in the future. Checkmate also produces comprehensive error reports in different formats. It can also be integrated with LanguageTool, an open-source spelling and grammar checker.


Comparison Tools

Why do you need a comparison tool? Comparing files is a very practical way to see in detail what changes were introduced, for example, which words were replaced, which segments contain changes, or whether there is any content added or missing. Comparing different versions of a file (for example, before and after post-editing) is essential for processes that involve multiple people or steps. Beyond Compare is, by far, the best and most complete comparison tool, in my opinion.

With Beyond Compare, you can compare entire folders, too. If you work with many files, comparing two folders is an effective way to determine if you are missing any files or if a file does not belong in a folder. You can also see if the contents of the files are different or not.


Concordance Tool

As defined by its website, AntConc is a “freeware corpus analysis toolkit for concordancing and text analysis.” This is, in my opinion, one of the most helpful tools you can find out there when you want to analyze your content, regardless the language. AntConc will let you easily find n‑grams and sort them by the number of occurrences. This is extremely important when you want to find patterns in your content. Remember: with MT, you want to fix patterns, not specific occurrences of errors. It may sound obvious, but finding and fixing patterns is a more efficient way to get rid of an issue than trying to fix each particular instance of an error.

AntConc will also create a list of each word in your content, preceded by the number of hits. This is extremely helpful for your terminology work. There are so many things you can use this tool for, that it deserves its own article.


CAT Tools

In most cases, using a translation memory (TM) is a good idea. If you are not familiar with CAT tools (computer-assisted translation), they provide a translation environment that combines an editor, a translation memory, and a terminology database. The key part here is the TM, which is essentially a database that stores translations in the form of translation units (that is, a source segment plus a target segment), and if you come across the same or a similar source segment, the TM will “remember” the translation you previously stored there.

CAT Tools make a great post-editing environment. Most modern tools can be connected to different machine translation systems, so you get suggestions both from a TM and from an MT system. And you can use the TM to save your post-edited segments and reuse them in the future. If you have to use glossaries or term bases, CAT tools are ideal, as they can also display terminology suggestions.

When post-editing with a CAT tool, there are usually two approaches: you can get MT matches from a TM or a connected MT system (assuming, of course, that the matches are added to it previously), or you can work on bilingual, pre-translated files and store only post-edited segments in your TM.


If you have never tried it, I totally recommend Matecat. It’s a free, open-source, web-based CAT tool, with a nice and simple editor that is easy to use. You don’t have to install a single file. They claim you will always get up to 20% more matches than with any other CAT tool. Considering that some tools out there cost around 800 dollars, what Matecat has to offer for free can’t be ignored. It can process 50+ file types; you can get statistics on your files (like word counts or even how much time you spent on each segment), split them, save them on the cloud, and download your work. Even if you never used a CAT tool before, you will feel comfortable post-editing in Matecat in just a few minutes.



Another interesting free, open-source option is OmegaT. It’s not as user-friendly as Matecat, so you will need some time to get used to it, even if you are an experienced TM user. It has pretty much all the same main features commercial CAT tools have, like fuzzy matching, propagation, and support for around 40 different file formats, and it boasts an interface with Google Translate. If you never used it, you should give it a try.



If you are looking into investing some money and getting a commercial tool, my personal favorite is Memoq. It has tons of cool features and overall is a solid translation environment. It probably deserves a more detailed review, but that is outside of the scope of this post.


If you enjoyed this article, please check other posts from the eBay MT Language Specialists series.

Machine Translation Corpus Analysis


Statistical Machine Translation (SMT) needs considerably large amounts of text data to produce good translations. We are talking about millions of words. But it’s not simply any text data — it’s good data that will produce good translations.

The challenge is how to make sense of all of these millions of words. What to do to find out whether the quality of a corpus is good enough to be used in your MT system? How do you know what to improve if you realize a corpus is not good? How to know what your corpus is about? Reading every single word or line is completely out of the question.

Corpus analysis can help you find answers to these questions. It can also help you understand how your MT system is performing and why. It can even help you understand how your post-editors are performing.

I will cover some analysis techniques that I believe are effective to understand your corpus better. To keep things simple, I will use the word “corpus” to refer to any text sample, either one used to produce translations or one being the result of a translation-related process.

The tools

I’m going to cover two tools: AntConc and Python. The first one is a corpus analysis tool exclusively. The latter is a programming language (linguists, please, don’t panic!), but I’m going to show you how you can use a natural language processing module (NLTK) to dig into your corpora and also provide bits of code for you to try.

AntConc and Python can be used in Windows, Mac, and Linux.


As defined by its web site, AntConc is a “freeware corpus analysis toolkit for concordancing and text analysis.” It’s really simple to use. It contains seven main tools for analysis and has several interesting features. We will take a closer look at the details and how the tool can be used with the following examples.

Getting a Word List

A great way to know more about your corpus is getting a list of all the words that appear in it. AntConc can easily create a list with all the words that appear in your corpus and show important additional information about them, like how many tokens are there and the frequency of each. Knowing which words appear in your corpus can help you identify what it is about; the frequency can help you determine which are the most important words.

You can also see how many tokens (individual words) and word types (unique words) are there in a corpus. This is important to determine how varied (how many different words) your text is.

To create a word list, after loading your corpus files, click the Word List tab and click Start. You’ll see a list of words sorted by frequency by default. You can change the sorting order in the Sort by drop-down. Besides frequency, you can sort alphabetically and by word ending.


Frequency is often a good indicator of important words — it makes sense to assume that tokens that appear many times have a more relevant role in the text.

But what about prepositions or determiners and other words that don’t really add any meaning to the analysis? You can define a word list range, that is, you can add stop words (words you want to exclude from your analysis) individually or in entire lists.


Word lists are also very good resources to create glossaries. You can either use the frequency to identify key words or just go through the list to identify words that may be difficult to translate.

Keyword Lists

This feature allows you to compare a reference corpus and a target corpus and then calculate words that are unusually frequent or infrequent. What’s the use for this? Well, this can help you get a better insight on post-editing changes, for example, and try to identify words and phrases that were consistently changed by post-editors. It’s safe to assume that the MT system is not producing a correct translation for such words and phrases. You can add these to any blacklists, QA checks, or automated post-editing rules you may be using.


A typical scenario would be this: you use your MT output as the target corpus, and a post-edited/human translation (for the same source text, of course) as the source corpus; the comparison will tell you which words are frequent in the MT output that are not so frequent in the PE/HT content.

“Vintage” here is at the top of the list. In my file with MT output segments, it occurs 705 times. If I do the same with the post-edited content, there are 0 occurrences. This means post-editors have consistently changed “vintage” to something else. It’s safe to add this word to my blacklist then, as I’m sure I don’t want to see it in my translated content. If I know how it should be translated, it could be part of an automated post-processing rule. Of course, if you retrain your engine with the post-edited content, “vintage” should become less common in the output.

To add a reference corpus, in the Tool Preferences menu, select Add Directory or Add Files to choose your corpus file(s). Click the Load button after adding your files.



Collocates are simply words that occur together. This feature allows you to search for a word in a corpus and get a list of results that show other words that appear next to the search term. You can see how frequent a collocate is and also choose if your results should include collocates appearing to the right of the term, to the left, or both. What’s really interesting about this is that it can help you find occurrences of words that occur near your search term and not necessarily next to it. For example, in eBay’s listing titles, the word “clutch” can be sometimes mistranslated. It’s a polysemous word, and it can be either a small purse or an auto part. I can do some analysis on the collocate results for clutch (auto parts) and see if terms like bag, leather, purse, etc., occur near it.

You can also select how frequent a collocate needs to be in order to be included in the results.

This is very useful to spot unusual combinations of words as well. It obviously depends on the language, but a clear example could be a preposition followed by another preposition.


To use this feature, load your corpus files and click the Collocates tab. Select the From and To ranges — values here contain a number and a letter: L(eft)/R(ight). The number indicates how many words away from the search terms should be included in the results, and L/R indicates the direction in which collocates must appear. You can also select a frequency value. Enter a search term and click Start.

All the results obtained with any of the tools that AntConc provides can be exported into several formats. This allows you to take your data and process it in any other tool.

Clusters and n-grams

This is perhaps one of the most useful features in AntConc. Why? Because it allows you to find patterns. Remember that, when working with MT output, most of the time it’s not realistic to try to find or fix every single issue. There may be tons of errors with varying levels of severity in the MT output (especially considering the volumes of content processed by MT), so it does make sense to focus first on those that occur more frequently or that have a higher severity.

Here’s a simple example: let’s assume that by looking at your MT output you realize that your MT system is translating the word “inches” into “centimeters” without making any changes to the numbers that usually precede that word, that is, “10 inches” is being consistently translated as “10 centimeters”. You could try to find and fix “1 centimeter”, “2 centimeters”, “3 centimeters”, etc. Instead, a much better choice would be to identify a pattern: “any number” followed by the word “centimeter” should be instead “any number” “inches”. This is an oversimplification, but the point is that identifying an error pattern is a much better approach than fixing individual errors.

Once you have identified a pattern, the next step is to figure out how you can create some sort of rule to find/fix such pattern. Simple patterns made of word or phrases are pretty straightforward — find all instances of “red dress” and replace with “blue dress”, for example. Now, you can take this to the next level by using regular expressions. Going back to the inches example, you could easily find all instances of “any number” followed by centimeters with a simple regex like \d+ centimeters, where \d stands for any digit and the + sign stands for one or more (digits).


Using the Clusters/N-Grams tool helps you find strings of text based on their length (number of tokens or words), frequency, and even the occurrence of any specific word. Once you open your corpus, AntConc can find a word or a pattern in it and cluster the results in a list. If you search for a word in your corpus, you can opt to see words that precede or follow the word you searched for.

Results can be sorted by several criteria:

  • By frequency (ideal to find recurring patterns — the more frequent a pattern is, the more relevant it might be)
  • By word (ideal to see how your MT system is dealing with the translation of a particular term)
  • By word end (sorted alphabetically based off the last word in the string)
  • By range (if your corpus is composed of more than one file, in how many of those files the search term appears)
  • By transitional probability (how likely it is that word2 will occur after word1; e.g., the probability of “Am” occurring after “I” is much higher than “dishwasher” occurring after “I”.)

Let’s see how the Clusters tool can be used. I’ve loaded my corpus in AntConc. and I want to see how my system is dealing with the word case. Under the Cluster/N-grams tab, let’s select the Word check box, as I want to enter a specific search term. I want to see clusters that are three to four words long. And very important here, the Search Term Position option: if you select Left, your search term will be the first word in the cluster; if you select Right, it’ll be the last one instead. The following screenshots show how the Left/Right option selection affects the results.

On left

On right

We can also use regular expressions here for cases in which we need more powerful searches. Remember the example about numbers and inches above? Well, numbers, words, spaces, letters, punctuation — all these can be covered with regular expressions.

Let’s take a look at a few examples:

Here I want to see all two-word clusters that start with the word “original”, so I’m going to use a boundary (\b) before “original”. I don’t know the second word, it’s actually what I want to find out, so I’m going to use \w, which stands for “any word”. All my results will then have the following form: original+word.


Now I want to see all clusters, regardless of their frequency, that contain the words “price” OR “quality”. So, in addition to adding the boundaries, I’m going to separate these words with | (vertical bar) that simply stands for “or”.

This is really useful when you want to check how the system is dealing with certain words — there’s no need to run separate searches since you can combine any number of words with | between them. Check the Global Settings menu for reference.


For seasoned regex users, note that regex capabilities in AntConc are pretty modest and that some operators are not standard.


If you are not familiar with this term, in a nutshell, an n-gram is any word or sequence of words of any size; a 1-gram is composed of one element, a 2-gram is composed of 2 elements, etc. It’s a term that defines the length of a string rather than its content.


What’s great about this feature is that you can find recurring phrases without specifying any search terms. That is, you can easily obtain a list of, for example, all the 6-grams to 3-grams that occur more than 10 times in your corpus. Remember that clusters work in the opposite way — you find words that surround a specific search term.

The n-gram search is definitely an advantage when you don’t know your corpus very well and you still don’t know what kind of issues to expect. It’s usually a good choice if it’s the first time you are analyzing a corpus — it finds patterns for you: common expressions, repeated phrases, etc.

When working with n-grams, it’s really important to consider frequency. You want to focus your analysis on n-grams that occur frequently first, so you can cover a higher number of issues.

What can you do with your findings, besides the obvious fact of knowing your corpus better? You can find recurring issues and create automated post-editing rules. Automated post-editing is a technique that consists in applying search and replace operations on the MT output. For instance, going back to our initial “inches” vs. “centimeters” example, you could create a rule that replaces all instances of number+centimeters with number+inches. Using regular expressions, you can create very powerful, flexible rules. Even though this technique was particularly effective when working with RBMT, it’s still pretty useful for SMT between training cycles (the process in which you feed new data to your system so it learns to produce better translations).

You can also create blacklists with issues found in your MT output. A blacklist is simply a list of terms that you don’t want to see in your target, so for example, if your system is consistently mistranslating the word “case” as a legal case instead of a protective case, you can add the incorrect terms to the blacklists and easily detect when they occur in your output. In the same way, you can create QA checks to run in tools like Checkmate or Xbench.


(Please note that this article is intended for a general audience and, if you are a Python expert, you may find some of the ideas below too basic. I’m not a Python expert myself, so I apologize in advance if the terminology I use here is not what experts use.)

For those of you not familiar with Python, it’s a programming language that has been gaining more and more popularity for several reasons: it’s easy to learn and easy to read, it can be run in different environments and operating systems, and there’s a significant number of modules that can be imported and used.

Modules are files that contain classes, functions, and other data. Without getting too technical, a module is code already written that you can reuse, without having to write it yourself from scratch. Quick example: if you want to write a program that will use regular expressions, you can simply import the re module, and Python will learn how to deal with them thanks to the data in the module.

Enter Natural Language Processing Toolkit

Modules are the perfect segue to introduce the Natural Language Processing Toolkit (NLTK). Let me just steal the definition from their site, “NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries…”

Using Python and NLTK, there are quite a few interesting things you can do to learn more about your corpora. I have to assume you are somewhat familiar with Python (not an expert!), as a full tutorial would simply exceed the purpose of this post. If you want to learn more about it, there are really good courses on Coursera, Udemy, and YouTube, for example. I personally like Codeacademy’s hands-on approach.

The Mise en place

To follow these examples, you’ll need the following installed:

  • Python 3.5 (version 2.7 works too, but some of these examples may need tweaking)
  • NLTK
  • Numpy (optional)

To get corpora, you have two options: you can choose to use corpora provided by NLTK (ideal if you just want to try these examples, see how Python works, etc.) or you can use your own files. Let me walk you through both cases.

If you want to use corpora from NLTK, open your Python’s IDLE, import the nltk module (you’ll do this every time you want to use nltk), and then download the corpora:

>>> import nltk

A new window will open, and you’ll be able to download one or more corpora, as well as other packages. You can find the entire list on the NLTK Corpora page.


When working in Python, you can import (a) all available corpora at the same time or (b) a single corpus. Notice that (a) will import books (like Moby Dick and The Book of Genesis.)

a	>>> from import *
b	>>> from nltk.corpus import brown

If you want to use your own files, you’ll have to tell Python where they are so it can read them. Follow these steps if you want to work with one file (but remember to import nltk first):

>>> f = open(r'c:\reviews.txt','rU')
>>> raw =
>>> tokens = nltk.word_tokenize(raw)
>>> text = nltk.Text(tokens)

Basically, I’m telling Python to open my file called reviews.txt saved in C. The “r” in front of the path is required for Python to read it correctly. I’m also telling Python that I want to read, not write on, this file.

Then, I’m telling Python to read the contents of my file and store them in a variable called raw, to tokenize the content (“identify” the words in it), and to store those tokens in a variable named text. Don’t get scared by the technical lingo at this point: a variable is just a name that we assign to a bucket where we store information, so we can later make reference to it.

What if you have more than one file? You can use the Plaintext Corpus Reader to deal with several plaintext documents. Note that if you follow the example below, you’ll need to replace sections with the relevant information, such as your path and your file extension.

>>> from nltk.corpus import PlaintextCorpusReader
>>> files = ".*\.txt"
>>> corpus0 = PlaintextCorpusReader(r"C:/corpus", files)
>>> corpus  = nltk.Text(corpus0.words())

Here, I’m asking Python to import PlaintextCorpusReader, that my files have the txt extension, where the files are stored, and to store the data from my files into a variable called corpus.

You can test if your data was correctly read just by typing the name of the variable containing it:

>>> corpus
<Text: This black and silver Toshiba Excite is a...>
>>> text
<Text:`` It 's a Motorola StarTac , there...>

corpus and text are the variables I used to store data in the examples above.

Analyzing (finally!)

Now that we are all set up and have our corpora imported, let’s see some of the things we can do to analyze it.

We can get a word count using the len function. It is important to know the size of our corpus, basically to understand what we are dealing with. What we’ll obtain is a count of all words and symbols, repeated words included.

>>> len(text)
>>> len(corpus)

If we wanted to count unique tokens, excluding repeated elements, we can follow this example.

>>> len(set(corpus))

With the set function, we can get a list of all the words used in our corpus, that is, a vocabulary.

>>> set(corpus)
{'knowledge', 'Lord', 'stolen', 'one', ':', 'threat', 'PEN', 'gunslingers', 'missions', 'extracting', 'ensuring', 'Players', 'player', 'must', 'constantly', 'except', 'Domino', 'odds', 'Core', 'SuperSponge', etc.

A list of words is definitely useful, but it’s usually better to have them alphabetically sorted. We can also do that easily.

>>> sorted(set(corpus))
["'", '(', ').', ',', '-', '--', '.', '3', '98', ':', 'Ancaria', 'Apocalypse', 'Ashen', 'Barnacle', 'Bikini', 'Black', 'Bond', 'Bottom', 'Boy', 'Core', 'Croft', 'Croy', 'D', 'Dalmatian', 'Domino', 'Egyptian', etc.

Note that Python will put capitalized words at the beginning of your list.

We can check how many times a word is used on average, what we call “lexical richness.” From a corpus analysis perspective, it’s good that a corpus is lexically rich, as theoretically the MT system will “learn” how to deal with a broader range of words. This indicator can be obtained by dividing the total number of words by the number of unique words.

>>> len(text)/len(set(text))
>>> len(corpus)/len(set(corpus))

If you need to find out how many times a word occurs in your corpus, you can try the following. (Notice that this is case-sensitive.)

>>> text.count("leave")
>>> text.count("Leave")

One key piece of information you probably want to get is the number of occurrences of each token or vocabulary item. As we mentioned previously, frequent words say a lot about your corpus. They can be used to create glossaries, for example. One way to do this is using frequency distributions. You can also use this method to find how many times a certain word occurs.

>>> fdistcorpus = FreqDist(corpus)
>>> fdistcorpus
FreqDist({',': 33, 'the': 27, 'and': 24, '.': 20, 'a': 20, 'of': 17, 'to': 16, '-': 12, 'in': 8, 'is': 8, ...})

>>> fdistcorpus['a']

A similar way to do this is using the vocab function:

>>> text.vocab()
FreqDist({',': 2094, '.': 1919, 'the': 1735, 'a': 1009, 'of': 978, 'and': 912, 'to': 896, 'is': 597, 'in': 543, 'that': 518, ...})

Conversely, if you want to see the words that only appear one time, use the hapaxes function:

>>> fdistcorpus.hapaxes()
['knowledge', 'opening', 'mystical', 'return', 'bound']

If you only want to see, for example, the ten most common tokens from your corpus, there’s a function for that:

>>> fdistcorpus.most_common(10)
[(',', 33), ('the', 27), ('and', 24), ('.', 20), ('a', 20), ('of', 17), ('to', 16), ('-', 12), ('in', 8), ('is', 8)]

We can have the frequency distributions results presented in many ways:

  • One column
    >>> for sample in fdistcorpus:

    Here, I’m using a for loop. Loops are typically used when you want to repeat or iterate and action. In this case, I’m asking Python, for each token or sample in my corpus, to print said sample. The loop will perform the same action for all the tokens, one at the time, and stop when it has covered every single one of them.

  • Tab-separated
    >>> fdistcorpus.tabulate()
                   ,              the              and                .                a               of               to                -               in               is             with              for               as                '                s              The               by               or               --               be             them               on              has
  • Chart
    >>> fdistcorpus.plot()