Mastering the Fire

 

“If you play with fire, you’re gonna get burned.”  ~ Anonymous

There were several reasons we built the NodeJS stack at eBay and now offer it as part of our polyglot initiative: an active open source community, development productivity, reliability, scalability, and speed. The community support and productivity gains proved themselves from the start, but reliability, scalability, and speed all depend on developer culture.

We use static code analysis tools, code reviews, unit tests, and regression testing to make sure our modules and applications work according to spec. An isolated module can perform perfectly well in its own test environment and as part of an application, but once all the modules are packaged together and ready to roll, the app may turn out to be much slower than expected, for example, because it logs too much data. This can be a tough situation for the application stack provider, who does not have ready answers.

Thankfully, flame graphs came on the scene. They were really promising, but their promise turned out to be far from the reality: the flame graphs were hot like real flames. We touched them a few times, got burned, and backed off. The first time we approached them, flame graphs were available only on SmartOS, and one had to follow specific steps to generate them. That was the problem, especially when one runs applications on a completely different platform. Addicted to the simplicity of Node, which just works, we found this option far from simple, and we put it in reserve for tough cases that we could not solve any other way. The second time we approached flame graphs, they were already available on Linux and OS X, but creating them still required a special setup and too many steps (including merging symbols with profile results) to get SVG charts on OS X.

“It’s a living thing, Brian. It breathes, it eats, and it hates. The only way to beat it is to think like it.” ~ Robert De Niro (as Donald ‘Shadow’ Rimgale), Backdraft, 1991

Meanwhile, we were using v8-profiler to generate profile data that we would load into the Chrome profiling tool, and then we would analyze the aggregation tree for performance hot spots. It was a laborious task that demanded a lot of focus when one had to look at all the call stacks of a big application. We could not offer this solution to our application developers, as it would take too much of their time to troubleshoot. It was going to become a task for a dedicated profiling expert who would do a lot of profiling, gain a lot of experience, and learn to spot things easily and know where to look. This was not scalable. As a big project started knocking at our door, we had to figure out a better way to profile so that application developers could do the work themselves.
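For reference, here is a minimal sketch of that capture workflow, assuming the v8-profiler npm module; the five-second window and output file name are illustrative, not our production setup.

```js
// A minimal sketch of capturing a CPU profile with v8-profiler and saving it
// as a .cpuprofile file that Chrome's profiling tool can load.
const fs = require('fs');
const profiler = require('v8-profiler');

function captureProfile(durationMs, outputPath) {
  profiler.startProfiling('on-demand');
  setTimeout(() => {
    const profile = profiler.stopProfiling('on-demand');
    profile.export((err, json) => {
      if (err) throw err;
      fs.writeFileSync(outputPath, json); // open this file in the Chrome Profiles panel
      profile.delete();                   // release the memory held by the profile
    });
  }, durationMs);
}

captureProfile(5000, 'app.cpuprofile');
```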

We realized that if Chrome can show profile results in an aggregated format, there should be a way to calculate the same aggregation ourselves and present it as a flame graph using one of the available tools. We found our calculator and a suitable charting tool that was built to consume JSON profile data. All we needed to do was put it all together.
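As an illustration of the idea (not the exact code we shipped), the sketch below flattens a .cpuprofile call tree into the folded-stacks text format that common flame graph tools consume. It assumes the tree-shaped profile layout exported by v8-profiler, with a root head node carrying functionName, hitCount, and children.

```js
// Flatten a v8-profiler .cpuprofile call tree into "folded stacks":
// one line per unique call path, followed by the number of samples on it.
const fs = require('fs');

function foldStacks(node, stack, out) {
  const name = node.functionName || '(anonymous)';
  const path = stack ? stack + ';' + name : name;
  if (node.hitCount > 0) {
    out[path] = (out[path] || 0) + node.hitCount; // samples attributed to this frame
  }
  (node.children || []).forEach(child => foldStacks(child, path, out));
  return out;
}

const profile = JSON.parse(fs.readFileSync('app.cpuprofile', 'utf8'));
const folded = foldStacks(profile.head, '', {});
const output = Object.keys(folded)
  .map(path => path + ' ' + folded[path])
  .join('\n');
fs.writeFileSync('app.folded', output); // feed this to a flame graph generator
```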

“Playing with fire is bad for those who burn themselves. For the rest of us, it is a very great pleasure.”  ~ Jerry Smith, National Football League tight end, Washington Redskins ‘65-77

The result is pretty exciting. We are now able to turn on profiling in production any time without restarting the server and look right into the problem via flame graphs with one click of a button. The results show the JavaScript part of the profiling (no native code), which is what developers want most of the time anyway when it comes to performance issues in their applications.
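The sketch below shows the general shape of such a runtime toggle, assuming an Express app and the v8-profiler module; the route path and file name are placeholders, not our actual production plumbing.

```js
// Toggle CPU profiling at runtime through an admin route, so a profile can be
// collected in production without restarting the server.
const express = require('express');
const fs = require('fs');
const profiler = require('v8-profiler');

const app = express();
let profiling = false;

app.post('/admin/profile', (req, res) => {
  if (!profiling) {
    profiler.startProfiling('production');
    profiling = true;
    return res.send('profiling started');
  }
  const profile = profiler.stopProfiling('production');
  profiling = false;
  profile.export((err, json) => {
    if (err) return res.status(500).send(err.message);
    fs.writeFileSync('production.cpuprofile', json);
    profile.delete();
    res.send('profile written to production.cpuprofile');
  });
});

app.listen(8080);
```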

It also works anywhere that can run Node. For example, developers now can profile right on their Macs or Windows machines without any special effort on their part.

We have already successfully used it to find and optimize performance in platform code as well as in many applications that will soon be rolled to production. We were able to quickly identify a performance problem in production for one critical application when, after a fresh deployment, it started using 80% of CPU instead of the expected 20–30%. Below you can see the problem: the application was loading templates over and over again with every request. The fix was simply to cache the templates on first load.
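The sketch below shows the general shape of that fix, with a toy file loader and string interpolation standing in for the real templating engine; the file paths and placeholder syntax are hypothetical.

```js
// Cache compiled templates on first use instead of reading and compiling them
// on every request. The "compiler" here is a toy string interpolation.
const fs = require('fs');

const templateCache = new Map();

function getTemplate(path) {
  if (!templateCache.has(path)) {
    const source = fs.readFileSync(path, 'utf8'); // expensive: hits the disk
    const render = model =>
      source.replace(/\{\{(\w+)\}\}/g, (match, key) => model[key] || '');
    templateCache.set(path, render); // every later request reuses this function
  }
  return templateCache.get(path);
}

// First call reads and compiles the file; subsequent calls are cache hits.
// const html = getTemplate('views/item.html')({ title: 'Camera Drones' });
```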

This first flame graph shows the application’s behavior before the fix. Total time spent on requests was 3500 msec.

[Flame graph of the sample application before the fix was applied]

This next illustration shows a close-up view of the same flame graph, highlighting the trouble spots.

[Close-up view of part of the flame graph of the sample application before the fix was applied]

This next flame graph shows the optimization we got after applying the fix.

[Flame graph of the sample application after the fix was applied]

As you can see, the rendering part became much smaller. The total time spent on all requests dropped to 1100 msec.

Most of the problems we discovered were not as big as the one that Netflix uncovered with flame graphs, but fixing them helped us save a lot on CPU usage.

“Don’t let your dreams go up in smoke — practice fire safety.”  ~ Unknown Author

[Cartoon: a data center in flames, captioned “Someone rolled to production without CPU profiling”]

There is still work to do. We need to train developers to read flame graphs. Otherwise, this valuable tool could acquire an undeserved negative reputation and disappear from developers’ toolsets.

After profiling many applications, we have also found common problems that we can highlight by default, and we can implement new rules for static code analysis to identify these problems.

We have found it useful to profile the following areas with flame graphs:

  • Application profiling during development
  • Unexpected activity detection during memory leak analysis
  • Capacity estimation based on CPU usage
  • Issue troubleshooting at runtime in production
  • Proactive smoke testing with live traffic in a special environment using a traffic mirror (cloning read requests and directing them to the target test box)
  • Sampling and storing for future investigation

To summarize our experience with Node and profiling, I would say that the successful adoption of any language, no matter how promising, depends on the way it is used, and performance tools like flame graphs play a major role in helping developers deliver what was promised at the start.

Browse eBay with Style and Speed

One of the top initiatives for eBay this year is to provide a compelling browse experience to our users. In a recent interview, Devin Wenig gave a good overview of why this matters to eBay. The idea is to leverage structured data and machine learning to let users shop across a whole spectrum of value, where some users might want great savings, while others may want to focus on, say, best-selling products.

When we started to design the experience, our first area of focus was mobile web. As at many other organizations, mobile web has been our fastest-growing sector. We wanted to launch the new browse experience on mobile web first, followed by desktop and native.

The core design principles of the new mobile web browse experience were to keep it simple, accessible, and fast, really fast. On the front-end side of things, we made a couple of choices to achieve this.

  • Lean and accessible — From the beginning we wanted the page to be as lean as possible. This meant keeping the HTML, CSS, and JS to a minimum. To achieve this goal, we followed a modular architecture and started building atomic components. Basically a page is a bunch of modules, and a module is built from other sub-modules, and so on. This practice enabled maximum code reuse, which in turn reduced the size of resources (CSS and JS) drastically. In addition, our style library enforced accessibility through CSS — by using ARIA attributes to define styles rather than just class names. This forces developers to write a11y-friendly markup from the beginning, instead of it being an afterthought. You can read more about it here.
  • Code with the platform — The web platform has evolved into a more developer-friendly stack, and we wanted to leverage this: code with the platform rather than against it. This meant we could reduce our dependency on big libraries and frameworks and use native APIs instead. For instance, we avoid jQuery for DOM manipulation and use the native DOM APIs, and we use the fetch polyfill instead of $.ajax (see the sketch after this list). The end result is a faster-loading page that is also very responsive to user interactions. jQuery is still loaded on the page, because some eBay platform-specific code depends on it, and we are working toward removing that dependency altogether.
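For illustration, here is a small sketch of what “code with the platform” looks like in practice; the selectors and endpoint URL are placeholders rather than actual eBay code.

```js
// Native DOM and Fetch APIs covering two cases where jQuery is often reached for.

// jQuery: $('.listing-item').addClass('is-active');
Array.from(document.querySelectorAll('.listing-item'))
  .forEach(el => el.classList.add('is-active'));

// jQuery: $.ajax({ url: '/api/listings', success: render });
fetch('/api/listings') // the fetch polyfill provides this on older browsers
  .then(response => response.json())
  .then(listings => console.log(listings.length, 'listings loaded'))
  .catch(err => console.error('request failed', err));
```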

But our efforts did not stop there. Speed was critical for us, and we wanted to do more. That is when we ran into AMP.

Experimenting with AMP

The AMP project was announced around the same time we started the initial brainstorming for browse. It seemed to resonate a lot with our own thinking on how we wanted to render the new experience. Although AMP was more tuned towards publisher-based content, it was still an open source project built using the open web. Also, a portion of the traffic to the new browse experience is going to be from search engines, which made it more promising to look into AMP. So we quickly pinged the AMP folks at Google and discussed the idea of building an AMP version for the browse experience, in addition to the normal mobile web pages. They were very supportive of it. This positive reaction encouraged us to start looking into AMP technology for the eCommerce world and in parallel develop an AMP version of browse.

Today we are proud to announce that the AMP version of the new browse experience is live, and about 8 million AMP-based browse nodes are available in production. Check out some of the popular queries in a mobile browser — Camera Drones and Sony PlayStation, for example. Basically, adding amp/ to the path of any browse URL renders an AMP version (compare, for example, the non-AMP and AMP versions of the same page). We have not linked all of them from our regular (non-AMP) pages yet; this step is waiting on a few pending tasks to be completed. For now, we have enabled the new browse experience only on mobile web. In the next couple of weeks, the desktop web experience will also be launched.

So how was the experience in implementing AMP for the eCommerce world? We have highlighted some of our learnings below.

What worked well?

  • Best practices — One of the good things about AMP is that, at the end of the day, it is a bunch of best practices for building mobile web pages. We were already following some of them, but adoption was scattered across teams, each having its own preference. This initiative helped us consolidate the list and incorporate these best practices into our regular development life cycle, which made our approach to AMP organic rather than a forcing function. The other good side effect is that even our non-AMP pages became faster.
  • Less forking in code — This follows the previous point. Since we started following some of the AMP best practices for building regular pages, we were able to reuse most of the UI components between our non-AMP and AMP browse page. This resulted in less forking in code, which otherwise would have become a maintenance nightmare. Having said that, there is still some forking when it comes to JavaScript-based components, and we are still figuring out the best solution.
  • AMP Component list — Although the AMP project’s initial focus was more towards publisher-based content and news feeds, the AMP component list was still sufficient to build a basic product for viewing eCommerce pages. Users will not be able to do actions on items (such as “Add To Cart”), but they still get a solid browsing experience. The good news is that the list is getting better and growing day by day. Components like sidebar, carousel, and lightbox are critical in providing a compelling eCommerce experience.
  • Internal AMP platform — We have been thinking about leveraging the AMP ecosystem for our own search, similar to how Google handles AMP results. This plan is in very early stages of discussion, but the possibility of our search using AMP technology is very interesting.

The complex parts

  • Infrastructure components — To launch an eBay page to production, a lot of infrastructure components automatically come into play: the global header and footer, the site speed beacon kit, the experimentation library, and the analytics module. All of them carry some amount of JavaScript, which immediately disqualifies them from being used in the AMP version. This adds complexity to development. We had to fork a few infrastructure components to support the AMP guidelines, and they had to go through a strict regression cycle before being published, which added delays. Also, our default front-end server pipeline had to be conditionally tweaked to exclude or swap certain modules. It was a good learning curve, and over time we have replaced our early quick hacks with more robust and sustainable solutions.
  • Tracking — AMP provides user activity tracking through its amp-analytics component. amp-analytics can be configured in various ways, but it still was not sufficient for the granular tracking needs that eBay has. We also do stuff like session stitching, which needs cookie access. Creating an amp-analytics configuration to suit our needs was slowly becoming unmanageable. We need some enhancements in the component, which we are hoping to develop and commit to the project soon.

What’s next?

We are excited to partner with Google and everyone else participating on the AMP Project to close the gap in launching a full-fledged eCommerce experience in AMP. We have created a combined working group to tackle the gap, and we will be looking into these items and more.

  • Smart buttons — These enable us to do actions like “Add To Cart” and “Buy It Now” with authentication support.
  • Input elements — User interactive elements are critical to eCommerce experiences, be they simple search text boxes or checkboxes.
  • Advanced tracking — As mentioned earlier, we need more granular tracking for eBay, and so we have to figure out a way to achieve it.
  • A/B Testing — This will enable experimentation on AMP.

With items like these in place, AMP for eCommerce will soon start surfacing.

We will also be looking into creating a seamless transition from the AMP view to a regular page view, similar to what the Washington Post did using Service Workers. This will enable users to have a complete and delightful eBay experience without switching contexts.

We are also asked whether this means more focus on web over native. The answer is no. At eBay, we strongly believe that web and native do not compete with each other; they complement each other, and the combined ecosystem works very well for us. We will soon be launching these browse experiences on our native platforms.

We are on our path to making eBay the world’s first place to shop, and this is a step toward it. Thanks to my colleague Suresh Ayyasamy, who partnered in implementing the AMP version of browse nodes and successfully rolling it out to production.

Senthil


Human Evaluation of Machine Translation

 

Machine translation (MT) evaluation is essential to MT development. It is key to determining the effectiveness of an existing MT system, estimating the level of required post-editing, negotiating prices, and setting reasonable expectations. Machine translation output can be evaluated automatically, using metrics like BLEU and NIST, or by human judges.

The automatic metrics use one or more human reference translations, which are considered the gold standard of translation quality. The difficulty lies in the fact that there may be many alternative correct translations for a single source segment.

Human evaluation, however, also has a number of disadvantages. Primarily, it is a costly and time-consuming process. Human judgment is also subjective in nature, so it is difficult to achieve a high level of intra-rater (consistency of the same human judge) and inter-rater (consistency across multiple judges) agreement. In addition, there are no standardized metrics and approaches to human evaluation.
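To make the agreement problem concrete, here is a minimal sketch of one common way to quantify inter-rater agreement, Cohen's kappa, for two judges rating the same segments; the example ratings are invented.

```js
// Cohen's kappa: observed agreement corrected for the agreement two judges
// would reach by chance, given their individual score distributions.
function cohensKappa(ratingsA, ratingsB) {
  const n = ratingsA.length;
  const categories = [...new Set([...ratingsA, ...ratingsB])];

  // Observed agreement: fraction of segments where both judges gave the same score.
  const observed = ratingsA.filter((r, i) => r === ratingsB[i]).length / n;

  // Expected chance agreement from each judge's own rating distribution.
  const expected = categories.reduce((sum, c) => {
    const pA = ratingsA.filter(r => r === c).length / n;
    const pB = ratingsB.filter(r => r === c).length / n;
    return sum + pA * pB;
  }, 0);

  return (observed - expected) / (1 - expected);
}

// Two judges rating the adequacy of five segments on a 1-5 scale:
console.log(cohensKappa([5, 4, 3, 4, 2], [5, 3, 3, 4, 2])); // ≈ 0.74
```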

Let us explore the most commonly used types of human evaluation.

Rating

Judges rate translations based on a predetermined scale. For example, a scale from 1 to 5 can be used, where 1 is the lowest and 5 is the highest score. One of the challenges of this approach is establishing a clear description of each value in the scale and the exact differences between the levels of quality. Even if human judges have explicit evaluation guidelines, they still find it difficult to assign numerical values to the quality of the translation (Koehn & Monz, 2006).

The two main dimensions or metrics used in this type of evaluation are adequacy and fluency.

Adequacy

Adequacy, according to the Linguistic Data Consortium, is defined as “how much of the meaning expressed in the gold-standard translation or source is also expressed in the target translation.” The annotators must be proficient in both the source and the target language in order to judge whether the information is preserved across the translation.

A typical scale used to measure adequacy is based on the question “How much meaning is preserved?”

5: all meaning
4: most meaning
3: some meaning
2: little meaning
1: none

Fluency

Fluency refers to the target only, without taking the source into account; criteria are grammar, spelling, choice of words, and style. A typical scale used to measure fluency is based on the question “Is the language in the output fluent?”

5: flawless
4: good
3: non-native
2: disfluent
1: incomprehensible

Ranking

Judges are presented with two or more translations (usually from different MT systems) and are required to choose the best option. This task can be confusing when the ranked segments are nearly identical or contain difficult-to-compare errors. The judges must decide which errors have greater impact on the quality of the translation (Denkowski & Lavie, 2010). On the other hand, it is often easier for human judges to rank systems than to assign absolute scores (Vilar et al., 2007). This is because it is difficult to quantify the quality of the translation.

Error Analysis

Human judges identify and classify errors in MT output. Classification of errors might depend on the specific language and content type. Some examples of error classes are “missing words,” “incorrect word order,” “added words,” “wrong agreement,” “wrong part of speech,” and so on. It is useful to have reference translations in order to classify errors; however, as mentioned above, there may be several correct ways to translate the same source segment. Accordingly, reference translations should be used with care.

When evaluating the quality of eBay MT systems, we use all of the aforementioned methods. However, our metrics can vary in how much micro-level detail they provide about areas specific to eBay content. For example, one of our evaluation criteria is whether brand names and product names (the main noun or noun phrase identifying an item) were translated correctly. This information helps identify the problem areas of MT so that enhancement efforts can focus on them.

Some types of human evaluation, such as error analysis, can only be conducted by professional linguists, while other types of judgment can be performed by annotators who are not linguistically trained.

Is there a way to cut the cost of human evaluation? Yes, but unfortunately, low-budget crowdsourcing evaluations tend to produce unreliable results. How then can we save money without compromising the validity of our findings?

  • Start with a pilot test — a process of trying out your evaluation on a small data set. This can reveal critical flaws in your metrics, such as ambiguous questions or instructions.
  • Monitor response patterns to remove judges whose answers are outside the expected range.
  • Use dynamic judgments — a feature that allows fewer judgments on the segments where annotators agree, and more judgments on segments with a high inter-rater disagreement.
  • Use professional judgments that are randomly inserted throughout your evaluation job. Pre-labeled professional judgments allow you to remove judges with poor performance (see the sketch after this list).
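As a rough illustration of that last point, the sketch below screens crowd judges against gold labels from professional linguists and keeps only those above an accuracy threshold; the data shapes, scores, and threshold are made up.

```js
// Keep only judges whose answers on pre-labeled control segments match the
// professional gold labels often enough.
function filterJudges(judgments, goldLabels, minAccuracy) {
  const reliable = [];
  for (const [judgeId, answers] of Object.entries(judgments)) {
    const controls = Object.keys(goldLabels).filter(id => id in answers);
    const correct = controls.filter(id => answers[id] === goldLabels[id]).length;
    const accuracy = controls.length ? correct / controls.length : 0;
    if (accuracy >= minAccuracy) reliable.push(judgeId);
  }
  return reliable;
}

// Gold labels for two control segments and three judges' 1-5 adequacy answers:
const gold = { seg1: 5, seg2: 2 };
const judgments = {
  judgeA: { seg1: 5, seg2: 2, seg3: 4 }, // matches both controls
  judgeB: { seg1: 1, seg2: 5, seg3: 3 }, // misses both controls
  judgeC: { seg1: 5, seg2: 3, seg3: 4 }, // matches one of two
};
console.log(filterJudges(judgments, gold, 0.8)); // → [ 'judgeA' ]
```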

Human evaluation of machine translation quality is still very important, even though there is no clear consensus on the best method. It is a key element in the development of machine translation systems, as automatic metrics are validated through correlation with human judgment.

If you enjoyed this article, please check out the other posts in the eBay MT Language Specialists series.