Collocations: The Secret Web of Language

Imagine this.

You are a beginning English learner. You enjoy the methodical approach, so you tackle the language systematically, memorizing lists of irregular verbs, spelling norms, and syntactic rules. No conversational practice, no watching movies. You want to get the theory right first.

One day, you think you have mastered it. You are a walking grammar book.

You after 6 months of English studies

Then you realize you have been so engrossed in your studies that you skipped lunch, so you ask a passer-by:

Excuse me, sir. I am heavily hungry. Could you point me to the nearest swift-food restaurant?

Which he greets with a baffled stare.

you dont say

What went wrong?

You studied the standard rules of English, but there is a part of the language (of any language) that will never fit in that tidy set of axioms — collocations.

collocations_imageCollocations — a vast n-gram web that connects all words in a language.

According to the Oxford English Dictionary, collocations are pairs or groups of words that are habitually juxtaposed, such as strong coffee or heavy drinker. As such, they are the final touch foreign learners (or say, machine translation systems) need to “speak properly.” You can communicate without knowing them, but you will sound pretty weird to the native ear.

In a wider sense, collocations are a sort of lightweight idioms, but they differ from idioms in a couple of ways:

  • Their meaning is transparent — you can guess it the first time you see them (which you can’t with proper, metaphorical idioms, such as kick the bucket or a piece of cake).
  • There are vastly more collocations than idioms. (The largest explanatory collocation dictionary in existence covers only 1% of all possible vocabulary.)

Most importantly, collocations don’t follow clear rules. There is no apparent reason why we should say a burning thirst and not a blazing thirst, except that most people say the former and not the latter. In a way, these whimsical word patterns are like an unexplored realm at the edges of grammar — a lush rainforest with all sorts of curious and apparently random species. At the edge of the forest, the human language learner and the MT system both face the same problem — how to chart it?

Playing Linnaeus to the human language “biosphere” is no trivial task, but fortunately there is help — massive computational power applied to vast sets of texts (linguistic corpora) is producing resources for us all:

The work with collocations is far from over. For MT, the challenge is finding enough corpora. (Except for a few — such as English, French, and Spanish — most languages don’t have enough online texts to create accurate models.) For human learners, there is the additional problem of analyzing and describing the vast amount of data in terms useful to the language student.

The good news is that here, as in other areas, human linguists and MT systems can leverage each other’s efforts. Every new language model provides helpful data that can be used by the next generation of dictionaries, and every dictionary throws new light on the relationship patterns between words that MT can incorporate.


Meanwhile, language students will do well heading to the pub every once in a while for some conversational practice.

Want to learn more?

  • Go academic — Check out Mike Dillinger’s and Brigitte Orliac’s paper on Collocation extraction for MT.
  • Go deeper — Learn about collocations’ extended family: the phrasemes.

And if you enjoyed this article, please check other posts from the eBay MT Language Specialists series.

Running Arbitrary DAG-based Workflows in the Cloud

The problem

We were working on a problem that involved processing workflows in the cloud. The problem was how we design a system that is scalable and fault tolerant, allowing dynamic allocation of cloud nodes to every step in the workflow.

To explain with an example, imagine a workflow like the following:


Diagram A

We need to execute A based on a trigger, followed by B; then we need to execute C, D, and E in parallel; and when they are complete, we need to execute F, which will trigger the end of the flow. There is a strict dependency hierarchy in the above graph that needs to be followed as we visit each node. The problem was that we needed to process a series of different independent workflow graphs like the above, but depending on the trigger. So we were faced with two potential issues:

  1. How do we execute different workflows in a generic, coherent manner?
  2. How do we scale our solution so that we have a system where we can run thousands or more arbitrary workflows concurrently in the cloud?

Initial idea

We started with a brute-force idea:  Why not just hard-code all of the workflows (initially we had just a handful)? We could have, for instance, six nodes running each of the tasks corresponding to the diagram above, and the tasks could communicate state information via a shared database. This primitive approach worked well initially. All of the tasks could run concurrently on different nodes and would start doing actual work only when the dependency was met as per the database state.

However, we started facing problems when the number of workflow variations increased. We also had an issue hard-coding the dependency above since, for instance, task B depended on A and it was harder to insert a new task between the two.  Moreover, the larger we got the trickier it became to maintain, enhance, and optimize the system.

Our solution

We started hashing out ideas on an effective solution that is quick and easy and that can work with all of the eBay configurations. We looked at each problem and took a deep dive on it.

  1. How do we execute different workflows in a generic, coherent manner?

The best approach we came up with for this first problem was to use metadata services. Each of the workflows was defined as part of the metadata based on the trigger point type. So for instance, for the above diagram we defined metadata that explicitly stated these dependencies.  There are multiple ways this approach can be accomplished. One way is to use the GraphML library. We can then define the above graph using GraphML XML to define all of the dependencies. If the use case is a lot simpler, another way might be to create a table of tasks, names, and sequence numbers showing the ordering to be followed. The actual representation of the graph-based workflow can be done in multiple ways, and then we need services to assemble the metadata and isolate the implementation details from all of the clients.

This first problem was the easier one to deal with.

  1. How do we scale our solution so that we have a system where we can run thousands or more arbitrary workflows concurrently in the cloud?

To deal with this second problem, we wanted to start by getting rid of the static bounding of tasks and nodes. The aim was to make the bounding dynamic based on load on the platform. So we wanted to come up with two things:

  • A way of assigning and running tasks on a node
  • A way of load-balancing the above assignment

The second item immediately prompted us towards a load balancer. It’s a fairly common approach now, and we needed just that:  a load balancer that will know the state of the pool and allocate tasks to all nodes on the cloud. Therefore, we needed a way to describe the state of each node. We came up with a simple format consisting of node name, tasks to execute, each task’s current run status, etc. The load balancer then has to keep track of this data, which represents what’s happening within the pool. Based on this data, the load balancer can decide which nodes have fewer tasks or are overloaded, and can allocate accordingly.

Now onto the first item:  once we have this data we can have a daemon service/task running on all of the nodes in the pool which will then just look for this data and spawn the given task that is assigned to it. The last thing we wanted was to form a queue of all incoming trigger types, which would result in a series of workflows, and then to dynamically come up with the workflow graph by reading this queue.

We were putting all of these dependencies in place and came up with a really simple, elegant, yet powerful solution that achieved what we wanted to do. The platform met all of our constraints and was scalable and, from an implementation perspective, easy to start with. Our project actually involved more business logic that corresponded to the specific workflows, but we also came up with this generic platform for putting it all together.

Our final design looked something like this:

Diagram B

Diagram B

(For simplicity, other design aspects like monitoring, disaster recovery, and node provisioning are omitted from the above diagram.)

Let’s go through some details about each of the above components.

  • Input Queue Manager — The role of an Input Queue Manager is to maintain a queue of what’s to be processed by taking the input events and sorting them by time of arrival.  Note that these events can be input files, crontab-like scheduled tasks, or any set of external services/dependencies that need a workflow to be processed downstream.  The role of the Input Queue Manager is also to avoid duplicates and discard invalid requests going into the pipeline.
  • Workflow Manager — This is one of the core components of the platform. Given an input queue type, the Workflow Manager looks for the workflow metadata services to get an instance of workflow tasks to be executed. The metadata services have pre-configured data backed by all possible input types that result in the workflow. Again referring to Diagram A above, this sample workflow can be configured as an input to a given file, a service call, etc.  The output is a queue of workflow task instances that need to be executed as per the graph ordering.
  • Task Scheduler — The primary purpose of this component is to schedule tasks on any node in the pool. Figuring out a good node to schedule requires the help of a load balancer, which will return a healthy node back to the Task Scheduler. It then uses the Task Assignment Service to allocate a task to the selected node.
  • Load Balancer — This component calls the Task Allocation Service, which constructs the complete graph of the state of the pool at any given instant. The Load Balancer then uses this graph to come up with the best possible node that can be assigned to process a given task. Load balancing can be implemented in multiple ways, from round robin to random to load-based. The Load Balancer can also be extended to have knowledge of all of the tasks in the system, and can use machine learning to extract average times, based on past performances, as training data. Again, depending on the use case, load balancing can be altered as the platform matures.
  • Node Executor — This is a relatively simpler component. The primary function of this is to find out all the assigned tasks and then just spawn them as separate processes on that node. All the nodes in the pool need to have these executors as daemon processes which look out for any new tasks that need to be spawned.  Each task that runs can do a heartbeat ping to these executors so that they can be made smarter in cases where things get stuck or node restarts.

The above model can easily be extended to provide better fault tolerance by putting master/slave dependencies on components. Also, cases like node restarts, nodes going down when tasks are running, etc. can be accommodated by adding a monitor to the pool to keep track of what is going on in the overall system, and then using the Load Balancer to perform the necessary steps for automatic failover/recovery.

Note that using the above system we were not only able to process DAG workflows in the cloud, but also were able to process simple crontab-scheduled tasks dynamically and not bounding it to any particular node in the cloud.

We are in the process of open-sourcing some of our solution. We’ll announce the GitHub location once we’re ready.

Machine Translation: The Basics of Quality Estimation

 A good point to make before we go into more detail on QE is the difference between evaluation and estimation. You can evaluate the quality of MT output in two main ways: human evaluation (a person will check the translation and provide feedback) and automatic evaluation (there are different methods that can provide a score on the translation quality without human intervention).

Traditionally, automatically evaluating the quality of any given MT output has required a reference translation created by a human translator. The differences and similarities between the MT output and the reference translation can then be turned into a score to determine the quality of said output. This is the approach followed by certain methods, such as BLEU and NIST.

 The main differentiator of quality estimation is that it does not require a human reference translation.

QE is a prediction of the quality based on certain features. These features can be, for example, the number of noun or prepositional phrases in the source and target (and their difference), the number of named entities (names of places, people, companies, etc.), and many more. With these features, using techniques like machine learning, a QE model can be created to obtain a score that represents the estimation of the translation quality.

At eBay, we use MT to translate search queries, item titles and item descriptions; an earlier post discussed our work with search queries. To train our MT systems, we work with vendors that help us post-edit content. Due to the challenging nature of our content (user-generated, diversity of categories, millions of listings, etc.), a method to estimate the level of effort required for post-editing definitely adds value. QE can help you obtain important information on this effort in an automated manner. For example, one can estimate how many segments have a very low-quality translation and could be just discarded instead of post-edited.

So, what can you do with the help of QE? First and foremost, you can estimate the quality of translations at the segment and file level. Segment-level scores can help you target post-editing, focusing only on content that makes sense to post-edit. You can also estimate post-editing effort/time – it would be rather safe to assume that segments with a low-quality score take more time to post-edit. It is also possible to compare MT systems based on QE scores and see which one performs better. This last application example is especially helpful if you are trying to decide which engine you should use, or whether a new version of an engine is working better than its predecessor.