Beats @ eBay – Collectbeat – A Journey where Company and Community Come Together

In the beginning…

In early 2016, the Monitoring Special Interest Group (SIG) ventured into solving the problem of shipping logs and metrics from eBay’s Kubernetes ecosystem. Kubernetes, as one may be aware, is a container management system. Users have the flexibility to drop in Docker containers and let Kubernetes manage them. These Kubernetes clusters, or Tess clusters inside of eBay, are multi-tenanted. We have many customers running their workloads at any given time. The multi-tenanted aspect of our Tess clusters brings about some interesting problems. Some of them are:

  • All logs and metrics need to be logically grouped (namespaced) based on customer/application type.
  • Workloads are ephemeral and tend to move across Nodes.
  • Additional metadata is required to search/query logs and metrics from Pods.
  • Metrics and logs need to be exposed/collected in a cloud native fashion.
  • Ability to self onboard logs and metrics to the centralized system.

Our first offering…

With these problems in mind, we wanted to offer a solution that allowed users to drop their Pods into Kubernetes and obtain their logs and metrics in the simplest possible way. If a user’s Pod logs to stdout/stderr, we should be able to collect the logs and append all the Pod metadata to each log line. If a user exposes metrics at a well-known HTTP endpoint, we should be able to collect them.

Knowing these challenges and the goal in mind, we embarked on the problem, attempting to solve one issue at a time.

Let us take logs as the first example and see how we attempted to solve it. Docker allows users to write logs to stdout/stderr, which are captured and placed in a well-known file path of:


If we were to listen to /var/lib/docker/containers/*/*-json.log, we would be able to collect all the logs that are being generated by all Pods in a given Node. Configuring Filebeat to listen to that path is simple and is exactly what we did. When we collected all these logs, we needed a way for users to be able to query based on Pod name, Namespace name, etc. Kubelet also started exposing these files in a symlink of:


It would be easy to write a processor on Filebeat to split the source value in the payload and extract the pod, namespace, and container name. But we realized that pod labels also carry significance when querying an entire deployment’s worth of logs, and that information was not present. To solve this, we wrote our own custom Beat called Annotatebeat, which can:

  • Listen on lumberjack protocol for Beat events
  • Look for the source field and extract the container ID
  • Look up the Kube API server for metadata of all pods in a given node
  • Use the container ID to append all the remaining metadata onto the event
  • Send it to a prescribed destination
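The enrichment steps listed above can be sketched roughly as follows. This is a minimal illustration rather than Annotatebeat's actual code: the function names, the `kubernetes` output field, and the in-memory `pods_by_container_id` dictionary (standing in for a Kube API server lookup) are all assumptions made for the example.

```python
import re

# Assumption: the container ID is the directory name directly under
# /var/lib/docker/containers/, as in the log path quoted in the text.
CONTAINER_ID_RE = re.compile(r"^/var/lib/docker/containers/([^/]+)/")


def extract_container_id(source):
    """Return the container ID embedded in a `source` path, or None."""
    match = CONTAINER_ID_RE.match(source)
    return match.group(1) if match else None


def annotate(event, pods_by_container_id):
    """Return a copy of the event, enriched with pod metadata when known."""
    container_id = extract_container_id(event.get("source", ""))
    metadata = pods_by_container_id.get(container_id)
    enriched = dict(event)
    if metadata is not None:
        # Hypothetical field name for the appended metadata.
        enriched["kubernetes"] = metadata
    return enriched


# Hypothetical stand-in for the pod metadata fetched from the Kube API server.
pods_by_container_id = {
    "abc123": {"pod": "web-1", "namespace": "shop", "labels": {"app": "web"}},
}

event = {
    "source": "/var/lib/docker/containers/abc123/abc123-json.log",
    "message": "hello",
}
annotated = annotate(event, pods_by_container_id)
```

Events whose source path does not match the pattern pass through unchanged, so a lookup miss never drops a log line.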

As long as a user writes an application that can write to stdout/stderr, Docker would pick up the log and place it in a well-known log file. Filebeat tails the logs and sends them to Annotatebeat, which annotates each log message with pod metadata and ships the logs out. At this time, the Beats community wasn’t fully invested in Kubernetes, so we built some of these features internal to eBay.

Seeing how simple it was to write logs and have them shipped, we wanted a simple experience for metrics as well. At Elastic{ON} 2016, the Elastic folks announced Metricbeat as a new offering they were coming up with. Metricbeat has the concept of “modules” where a module is a procedural mechanism by which metrics can be collected from a given application. If Metricbeat is configured to listen to localhost:3306 for module type “mysql”, the MySQL module knows that it should connect to the host:port and run a `SHOW GLOBAL STATUS` query to extract metrics and ship them out to the configured backend.

This concept appealed to us because it allows “drop in” installations like MySQL, Nginx, etc. to be monitored out of the box. However, we needed users who write their own code and deploy applications into Kubernetes to also be able to monitor their applications. We hence came up with Prometheus/Dropwizard modules so that users could expose their metrics in those formats as HTTP endpoints, and we could collect metrics from them and ship them. However, when Metricbeat was created, it was tailored to specific applications like MySQL, Apache, and Nginx and not to generic frameworks like Prometheus or Dropwizard. Hence our PR was not initially accepted by the community, and we managed the module internally.

Discovery is something that is not supported by Beats out of the box. We had to come up with a mechanism that says “given a node on Kubernetes, find out all the pods that are exposing metrics and start polling for metrics.” How do we find the pods that are poll worthy? We look for the following metadata found as annotations:

io.collectbeat.metrics/type - the type of metrics exposed (Metricbeat module name)
io.collectbeat.metrics/endpoints - ports to look at
io.collectbeat.metrics/namespace - namespace to write metrics into
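To illustrate how annotations like these could drive discovery, here is a minimal sketch. It is not Collectbeat's implementation: the pod dictionary shape, the `polling_config` helper, and the comma-separated format of the endpoints value are assumptions made for the example.

```python
REQUIRED_ANNOTATIONS = (
    "io.collectbeat.metrics/type",
    "io.collectbeat.metrics/endpoints",
    "io.collectbeat.metrics/namespace",
)


def polling_config(pod):
    """Return a polling config for a pod, or None if an annotation is missing."""
    annotations = pod.get("annotations", {})
    if not all(key in annotations for key in REQUIRED_ANNOTATIONS):
        return None
    # Assumption: endpoints are written as a comma-separated list of ports.
    raw_ports = annotations["io.collectbeat.metrics/endpoints"].split(",")
    ports = [port.strip() for port in raw_ports if port.strip()]
    return {
        "module": annotations["io.collectbeat.metrics/type"],
        "hosts": [f"{pod['ip']}:{port}" for port in ports],
        "namespace": annotations["io.collectbeat.metrics/namespace"],
    }


# Hypothetical pod description, as a watcher might see it.
pod = {
    "ip": "10.0.0.5",
    "annotations": {
        "io.collectbeat.metrics/type": "prometheus",
        "io.collectbeat.metrics/endpoints": "8080, 9090",
        "io.collectbeat.metrics/namespace": "shop",
    },
}
config = polling_config(pod)
```

A pod that lacks any one of the annotations yields `None` and is simply not polled.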

As long as these three mandatory annotations are present, we should be able to start polling for metrics and write them into the configured backend. This discovery module uses Kubernetes’ controller mechanism to keep watching for updates within the node and start polling configured endpoints. This discovery module resided in a custom Beat that we lovingly call Collectbeat. To sum up, we used Collectbeat for collecting metrics from pods and Filebeat for collecting logs. Both sent their data to Annotatebeat, which appended pod metadata and shipped it to the configured backend. We ran this setup internally for about a year on version 1.x. Then Beats came out with 5.x.

Challenges in managing an internal fork…

When we were ready to upgrade to Beats 5.x, most of the interfaces had changed, and all of our custom code had to be upgraded to the newer interfaces. By this time, the Beats community had evolved Metricbeat to support generic collectors like Prometheus, and several other features that we had written in our internal fork had become available upstream. The effort to upgrade to 5.x would be substantial.

We had two options in front of us. One was to keep going down this path of managing our internal fork and invest a month every major release to pull in all the new features and make necessary changes to our internally owned features. The second option was to open source anything that was generic enough to be accepted by the community. On taking stock of all the features that we had written, 90% of them were features applicable to any Kubernetes cluster. The remaining 10% was required to ship data to our custom backend. Hence, we decided to upstream that 90% so that we would not have to manage it any longer.

Be one with community…

At Elastic{ON} 2016 we met with the Beats community and agreed to open source as much as we could for the Kubernetes use case, since we already had expertise monitoring Kubernetes inside eBay, in return for faster PR reviews.

The first thing that we decided to get rid of internally was Annotatebeat, which did the metadata enrichment. Today in libbeat there is a processor called add_kubernetes_metadata, which was a result of that decision. We took all the logic present in Annotatebeat and converted it into a processor with the help of Carlos Pérez-Aradros, a member of the Beats community. We also took our internal Prometheus implementation and used it as a reference to update the community-available version to cover a few missing use cases. The Dropwizard and Kubernetes Metricbeat modules were something we used internally that we also open sourced.

Eventually we got to a point where we could run both Filebeat and Metricbeat as available upstream without any changes. With Go 1.8 out, there was also support for plugins, so we moved all of our eBay-specific custom code into plugins that are managed independently of stock Beats.

We realized the hard way that it is impossible to keep up with the rapid pace of an open source community if we have custom code residing in our internal fork. Not having a custom fork internally has helped us to be on the most recent version of Beats all the time and has reduced the burden of pulling in new changes.

It is always easier to make progress when we work with the community on features that not only benefit us today, but may also benefit someone else tomorrow. More thoughts and ideas on the code can always make it better. A good working relationship with the Beats community has helped us not only with code management, but also with features that were required internally that ended up getting built by the community. Today, eBay contributes the most code to the Beats product outside of Elastic itself. This has benefited not only the product, but also eBay. With the combined effort of eBay and Elastic, Beats will have native Kubernetes support in 6.0.

A new day…

Removing all of our custom code improved our agility to think of newer use cases. We wanted to increase coverage for the number of applications from which metrics can be collected. We realized that writing Metricbeat modules for every application is an impossible task and that going after protocols is a more scalable option.

One protocol that has tremendous coverage is the plain text protocol understood by Graphite. Tools like CollectD and StatsD can write to destinations that understand the Graphite protocol. We then implemented “Graphite server” as a Metricbeat module and contributed it back to Beats. This module inside of Collectbeat’s Kubernetes discovery helped us support use cases where customers can annotate their Pods with a parsing rule, and Collectbeat would receive metrics and parse them to split the metric name and tags before ingesting them into the desired backend. Another similar protocol that we went after was vanilla HTTP, where users can send metrics as JSON payloads to Metricbeat, and they would be shipped to the desired backend.
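As a rough illustration of this kind of parsing, the sketch below handles one line of Graphite's plain text format (`<metric path> <value> <timestamp>`) and splits the path into a metric name and tags. The template syntax used here is a simplified, hypothetical rule invented for the example; it is not the syntax of the Metricbeat Graphite module or of Collectbeat's annotations.

```python
def parse_graphite_line(line, template):
    """Parse '<path> <value> <timestamp>' and split the path using a template.

    `template` is a hypothetical dot-separated rule: each position is either
    a tag name or the word 'metric' (positions marked 'metric' form the name).
    """
    path, value, timestamp = line.split()
    parts = path.split(".")
    keys = template.split(".")
    if len(parts) != len(keys):
        raise ValueError(f"path {path!r} does not match template {template!r}")
    tags = {}
    name_parts = []
    for key, part in zip(keys, parts):
        if key == "metric":
            name_parts.append(part)
        else:
            tags[key] = part
    return {
        "name": ".".join(name_parts),
        "tags": tags,
        "value": float(value),
        "timestamp": int(timestamp),
    }


point = parse_graphite_line("web01.cpu.user 42.5 1500000000", "host.metric.metric")
```

Here the first path segment becomes a `host` tag and the remaining segments are joined back into the metric name.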

Being able to discover metrics inside of a Kubernetes environment is a big win in itself. The benefits were substantial, and we saw the need to do the same for logs to support two use cases:

  • Being able to stitch stack trace-like log patterns
  • Being able to read logs that are not being written into stdout

Because Kubernetes clusters inside of eBay are multi-tenanted, it becomes impossible to configure a single multiline pattern on Filebeat for all Pods inside of the cluster. We applied our learnings from metrics to log collection and decided to expose annotations that users can use to define multi-line patterns based on how Filebeat expects multiline to be configured. A user can, at a container level, configure multiline via annotations, and Collectbeat ensures that the required Filebeat prospectors are spun up to stitch stack traces.

A long-standing problem that we have seen in our Kubernetes clusters is that, since we rely heavily on Docker’s JSON log driver, performance is always a concern. Letting Filebeat decode each log line as a JSON payload is quite expensive. Also, there are a lot of use cases where a container may expose one of its many log files via stdout, while all others are written to specific files inside the container.

One such example is Apache Tomcat, where catalina.out’s logs are written into stdout, whereas access logs are not. We wanted to solve both these problems with an unconventional solution. Collectbeat was rewritten to accept log paths in the Pod’s annotations, and based on the underlying Docker file system, Collectbeat would spin up prospectors by appending the log path to the container’s filesystem path. This lets us tail log files present inside of the container and helps us avoid relying on JSON log file processing. We can also collect logs from several different files written by a container.
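The path mapping described here can be sketched as follows. This is only an illustration under stated assumptions: the `host_log_path` helper is hypothetical, and the container root directory is passed in as a plain argument, since how it is located depends on the Docker storage driver in use.

```python
import posixpath


def host_log_path(container_root, log_path):
    """Map an absolute in-container log path to its location on the node."""
    if not log_path.startswith("/"):
        raise ValueError("log path inside the container must be absolute")
    # Normalize first so '..' segments cannot climb above the container root,
    # then drop the leading slash so join() keeps container_root.
    relative = posixpath.normpath(log_path).lstrip("/")
    return posixpath.join(container_root, relative)


# Hypothetical container root and in-container log file.
path = host_log_path("/rootfs/abc123", "/usr/local/tomcat/logs/access.log")
```

A prospector could then be pointed at the resulting node-level path instead of the JSON log file.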

Where we are today…

Collectbeat has become our de facto agent that sits on every node through DaemonSets in our Kubernetes clusters to collect logs and metrics. Collectbeat runs in both Filebeat mode and Metricbeat mode to be able to tail log files and collect metrics respectively. This is what our Node looks like:

What are the features that Collectbeat has today? We are able to:

  • Collect metrics from any Pod that exposes them in a format supported by a Metricbeat module
  • Collect logs written to stdout or files inside the Docker container
  • Append Pod metadata on every log and metric collected
  • Allow Pods to push metrics through Graphite protocol and parse them uniquely
  • Stitch stack traces for application logs

Today we run Collectbeat on over 1000 nodes shipping more than 3TB of logs and several billion data points per day. Our end goal is to put Collectbeat on every host in eBay and be able to collect logs and metrics from any application that is being deployed.

Are we there yet? No, but we are slowly but surely getting there. There are still several more features that we have yet to crack, like being able to give QoS for all Pods so that all Pods are treated equally when shipping logs and metrics. We also want to be able to provide quotas and throttle workloads when applicable.

We have greatly benefited from Collectbeat, and with great excitement we are happy to announce the open sourcing of Collectbeat. Putting our code out in the open will help us get feedback from the community and improve our implementation, and at the same time help others who are trying to solve the same problem as we are. So, go get and let us know your feedback.


A big shout out to all the folks in eBay who made this a reality:

Also, a big shout out to the Elastic folks from the Beats community who have helped us along the way:

Automating the Creation of Standard Change Requests at eBay

eBay’s Network Engineering team operates a large-scale network infrastructure with a presence across the globe. Our mission is to provide a seamless experience connecting buyers and sellers wherever they may be. The network we created to support that goal is comprised of different vendors and designs that have evolved over time. Networks require care and feeding on a regular basis in order to ensure that performance targets are met. How can we make the numerous weekly changes required while minimizing the risk of an impact?

One way in which we accomplish our goal is by making all change management procedures as standard and reproducible as possible. Common tasks such as line card installations, BGP changes, or the turn up of new ports are formalized into Standard Operating Procedures (SOPs). A SOP lays out all of the needed pre-checks, change steps, and post-checks for a successful change to be executed. Our SOPs are put through an engineering review process where we review and hone these steps so that the combined experience of all team members can be brought to bear on the problem.

As we went through this process of creating SOPs for most of our workload, we realized that we were doing many of the same things each time. Examples include things such as backing up the configuration, verifying that the console works, and executing commands that let us verify status before and after a change is executed. All of these steps, taken together, began to sound very much like a broken record to us as we created SOP after SOP.

Project Broken Record (PBR)

We determined that fully automating the creation of SOP-based change requests would be a worthwhile investment of our time. Now that we had the most common tasks well-documented in SOPs, we could actually run through most steps programmatically with some work invested. Because many steps (such as collecting ‘show ip ospf neighbor’) were identical from one type of change to another, bits of code would be reusable. Some challenges, such as how to detect different vendors, code versions, or design standards, would present themselves, but the important part for us was to get started and validate that the concept was workable before expanding it.

Our project outline for automating standard change requests was as follows:

  • Preparation and Planning
  • Design the System
  • Develop Proof of Concept
  • Document the System
  • Execute Pilot
  • Evaluation

Preparation and Planning

We decided to focus on a few common and relatively easier tasks with already defined SOPs. The tasks selected were:

  • Costing links in or out for maintenance
  • Enabling or disabling ports
  • Decommissioning switches
  • VLAN add/change
  • Code upgrades (various vendors)

A smaller set of tasks like this kept the scope contained to a reasonable size while still allowing the opportunity to bump into a few challenges and solve problems that might be encountered when the project is expanded to cover all of our SOPs.

Dividing the work among several people allowed us to build components in parallel. All coding was stored in a Git repository to facilitate group participation.

Design the System

The system is built out of various building blocks. The foundation is a Python script named ‘Auto About.’ This script contains functions that lay out the high-level outline of the pending change request. It defines specific devices, interfaces, or neighbors that are involved in the pending change. It gathers the most basic information, “What is this maintenance about?”, hence the name. A few examples of functions within Auto About are ‘get_routing_instances,’ ‘collect_vlan_info,’ and ‘collect_power_supply_data.’  Feeding Auto About the arguments of a device name and the type of maintenance is all that is required to gather information. The output of Auto About is a small YAML file that contains the information collected at this step.   

This small file is fed into the Collector script. This component gathers information from the network devices. Collector is written in Python and is a stateless system. There is no database or long-term storage of information at this point. Collector’s output is a YAML file, much longer than Auto About’s file, with everything we need to know about the change we’re about to execute.

At this point, we have all of the information we need, but reading a YAML is not very friendly for humans. We still track changes in a ticketing system, and we want to be able to review them.

A separate Python script, combined with a Jinja2 template that matches the specific type of maintenance desired, takes that long YAML file and generates a few plain text files for us. Each step or check from our original SOP is broken down in the same way within the script, and the output lists each step and sub-step in the proper order. Any device-collected information is added where it is required. Output files created include an Action plan (your “forward” steps), a Verification plan where changes are tested, and a Rollback plan (your “backwards” steps in case you need to undo your changes). These plain text, human-readable files are used to create the Change Request (CR) in our in-house ticketing system. They represent a step-by-step and line-by-line plan to execute the work.
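The idea of turning collected data into plan files can be sketched as below. To keep the example self-contained, the standard library’s `string.Template` is used as a stand-in for Jinja2, and a plain dictionary stands in for the long YAML file; the device name, interface, field names, and step wording are all hypothetical.

```python
from string import Template

# Stand-in for the long YAML file produced by Collector (hypothetical fields).
collected = {"device": "sw-example-01", "interface": "Ethernet1/1"}

# Hypothetical templates for a "disable port" maintenance. string.Template is
# used only as a stdlib stand-in for the Jinja2 templates the text describes.
TEMPLATES = {
    "action": Template("1. Log in to $device\n2. Shut down $interface"),
    "verification": Template("1. Confirm $interface on $device is admin down"),
    "rollback": Template("1. Log in to $device\n2. Re-enable $interface"),
}


def render_plans(data):
    """Render every plan template with the collected device data."""
    return {name: template.substitute(data) for name, template in TEMPLATES.items()}


plans = render_plans(collected)
```

Because `substitute` raises an error for any placeholder missing from the data, an incomplete collection step fails loudly instead of producing a plan with blanks.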

A final Python script (with ‘cr’ in its name indicating change request) handles the task of pushing the generated plan files into Trace, our internally developed ticketing system that tracks changes. This saves engineer time by automating another piece of the CR puzzle for them. The script handles aspects of the change ticket process, such as filling in names of the people who submit or check tickets, setting the time and date of the proposed change, and requesting the creation of a new CR ticket.

Develop Proof of Concept

The proof of concept (PoC) involved creating the first versions of the components highlighted above and testing for functionality as well as interoperation of the individual pieces. A number of different people worked on this project, and the correct operation of all of the parts together was tested in the PoC phase. The PoC was a success, and we decided to press forward to a pilot phase.

Document the System

Documentation was created primarily within our Git repository. This was done so that everything a contributor would need was in one place and could be easily updated by anybody working on the project. A simple ‘readme’ file uploaded into the appropriate directory in Git provided a place to put higher level information about how a piece of code was supposed to function. This was done in addition to good commenting within each file, of course! Some project tracking items were also hosted on a wiki page, where they were more easily accessed by stakeholders who were not directly involved in the coding aspects.

Execute Pilot

During the pilot phase, we saw a rapid expansion of the PBR program as we started onboarding more use cases and actually using this system in our live change management workflow.

Exposing the output from PBR to the wider group of engineers during our pilot phase was a great way to get additional feedback on how we could collect the right information that would be valuable for the change type being executed. During the pilot period of about six months, numerous small issues with the various CR templates were corrected. Many of these issues were uncovered in our regular change management meetings as we discussed pending CR tickets.

Where it was possible, we aimed to make the CR tickets have the same look and feel.  For example, standardizing the sequence numbers for prechecks, change steps, and post checks is one way we found to make the CRs more readable and faster to evaluate at change management meetings. As a result of this feedback loop, our templates and methods quickly evolved to be more comprehensive and polished.


Evaluation

Project Broken Record took us approximately one year to complete from the initial meetings to a working product that had been successfully piloted. We found that all of the pieces of this product require updating and fine tuning from time to time as we strive to execute the perfect change. This type of regular time investment is a good tradeoff for eBay, because we are confident that this system has helped to avoid outages while streamlining the change process.

Our change management meetings were able to be run more efficiently, because we became familiar with the standard layout of change tickets. It was easier to review and approve very standard SOP-based things vs. the previous system of a queue of tickets all written differently by different engineers. Increasing the throughput of the review process directly benefited our internal customers waiting on change work to be completed.

We track all impacts to business availability and analyze what we could have done to avoid impact. One way that we sort this data is by root cause. Causes could be things such as hardware failure, vendor software bugs, change tickets gone wrong, etc. In 2017, impact time due to change tickets was very nearly zero. There were several parallel initiatives that contributed to this, but Project Broken Record was a part of that success story to be sure. Doing the same change the same way each time reduces the chance of unexpected consequences and builds our confidence in our procedures.

Where We’re Headed

We are happy with the progress we have made so far, but there are still a number of things we would like to improve upon.

We want to become more disciplined in our coding by creating development and master branches of our code. Currently, most portions of this system are in a development type state, but are also being used daily. We are also testing systems built up from this that will perform standard maintenances completely automatically by following the SOPs using the large YAML file information.

The larger goal we are pursuing here is minimal human interaction with the production network. Now that we have seen a return on our initial investment, we want to take this to a higher level of engineered solution. A ground-up rewrite of many of the pieces described above is already underway to consolidate functions and improve the way in which we gather information from the network. We are committed to this program, and we expect it to continue to evolve and grow.

Our team exists to help eBay’s business to be successful. As we explore this new automation-focused landscape, we are looking for the best ways to achieve that goal through solid uptime, delivery of projects, and a great user experience for everyone on the platform. The thought processes on our team have shifted from one in which we directly care for an ever-increasing number of network devices to one in which we create tools that can do that for us. This new way of approaching operations at eBay is much more scalable and is where we are placing a heavy emphasis as we march toward 2018 and beyond.

eBay’s Font Loading Strategy

The usage of custom fonts in web pages has steadily increased in recent years. As of this writing, 68% of sites in the HTTP Archive use at least one custom font. At eBay, we have been discussing custom web fonts for typography for quite some time, but never really pursued it. The main reason was uncertainty about the end user experience from a performance standpoint. But this changed recently.

Our design team made a strong case for a custom font to complement our new branding and, after multiple reviews, we all agreed it makes sense. Now it was on the engineering team to come up with an optimized implementation that not only uses the new custom font, but also tackles the performance overhead. This post gives a quick overview of the strategy we use at eBay to load custom web fonts.

Meet “Market Sans”

Our new custom font is rightly named “Market Sans” to denote affiliation with an online marketplace. As observed in the below image, it adds a subtle difference to typography, but in its entirety makes the whole page elegant and creates a unique eBay branded experience. Check out our desktop homepage, where “Market Sans” has been deployed.

Custom Font vs. System Font


It is well known and documented that custom web fonts come with a cost, and that is performance. They often delay rendering of text (critical to any web page) until the font is downloaded. A recent post from Akamai gives an excellent overview of the problems associated with custom fonts. To summarize, there are two major issues, and which one appears varies among browsers:

  • FOUT — Flash of Unstyled Text
  • FOIT — Flash of Invisible Text

As expected, the design and product teams were not happy with the compromise. Yes, custom fonts create a unique branded experience, but not at the cost of delaying the same experience. Additionally, from an e-commerce perspective, custom fonts are a good enhancement and not an absolute necessity. System fonts can still provide a compelling typography. So it was up to the engineering team to come up with an efficient font loading strategy with minimal tradeoffs.


Our strategy was pretty simple — avoid FOUT and FOIT: Use the custom font if it is already available (meaning downloaded and cached), else use default system fonts.

Fortunately, there is a CSS Font-Rendering proposal that adds a new @font-face descriptor named font-display. Using font-display, developers can specify how a font is displayed, based on whether and when it’s downloaded and ready to use. There are many values for font-display (check out this quick video to understand them), and the one that maps to our strategy would be ‘font-display: optional’.

Unfortunately, the adoption of font-display among browsers is not widespread, as it is relatively new. So for now, until the adoption becomes mainstream, we came up with a solution that leverages the localStorage and FontFaceSet APIs and the Font Face Observer utility (as a backup if the FontFaceSet API is not present).

The below illustration gives an overview of how it works:

Flow diagram for Font Loader

To summarize,

  • When users visit an eBay web page, we add a tiny inline CSS and JavaScript snippet in the response HTML <head> tag. We also include a small JavaScript snippet in the footer HTML that incorporates the font loader logic.
  • The JavaScript in the <head> checks localStorage to see whether a font flag is set. If the flag is set, it immediately adds the CSS class to the document root to enable the custom font. The page renders with the “Market Sans” custom font. This is the happy path.
  • The JavaScript in the footer again checks the localStorage for a font flag. If it is NOT set, it calls the font loader function on the document load event.
  • The font loader function loads (downloads) the custom fonts either using the built-in FontFaceSet API (if present) or through the Font Face Observer utility. The Font Face Observer is asynchronously downloaded on demand.
  • Once the font download is complete, a font flag is set on the localStorage. One thing to note — even though the font flag is set, we do not update the current view with the custom font. It is done on the next page visit with Step 2 above kicking in.
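The decision logic in these steps can be modeled in a few lines. The real implementation is JavaScript running in the browser; the Python below is only a language-neutral model of the flow, with a dictionary standing in for localStorage and a callback standing in for the font download. The flag name and function names are hypothetical.

```python
FONT_FLAG = "font-loaded"  # hypothetical localStorage key


def render_page(local_storage, download_font):
    """Model one page visit and return the font the page renders with."""
    # <head> snippet: use the custom font only if the flag is already set.
    font = "custom" if local_storage.get(FONT_FLAG) else "system"
    # Footer snippet: if the flag is not set, load the font after page load
    # and record success for the next visit. The current view is not updated.
    if not local_storage.get(FONT_FLAG) and download_font():
        local_storage[FONT_FLAG] = True
    return font


storage = {}
first_visit = render_page(storage, lambda: True)
second_visit = render_page(storage, lambda: True)
```

In this model the first visit renders with the system font and sets the flag, and the second visit takes the happy path. A failed download leaves the flag unset, so the next visit simply tries again.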

We have open sourced this module as ebay-font. It is a small utility that works along with other eBay open source modules Skin, Marko, and Lasso, as well as in standalone mode. We hope others can benefit from it.


There are a couple of tradeoffs with this strategy:

  1. First time users: A new user visiting eBay for the first time will get the system font. On navigation or subsequent visits they will get our custom font. This is acceptable, as we have to start the custom font at some point, and we will start it from the second visit of a new user.
  2. Private or incognito mode: When a user starts a new browsing session in private or incognito mode, they get the system font initially. But subsequent browsing in the same session will render the custom font (Safari is an exception, but it is getting fixed). We do not have metrics on how many users fall under this category, but this is something we have to live with.
  3. Cache eviction: In certain rare scenarios we observed that the custom font entity in the browser cache is evicted, but the localStorage entry is still present. Probably browsers clean up cache more frequently than localStorage. In these scenarios, users will experience a FOIT or FOUT based on the browser. This is more of an edge case and hence less concerning.

As a team we agreed that these tradeoffs are acceptable, considering the unpredictable behavior that comes with default font loading.


Custom web fonts do add value to the overall user experience, but it should not be at the cost of delaying the critical content. Each organization should have a font loading strategy based on its application needs. The new built-in CSS descriptor ‘font-display’ makes it very easy to choose one. We should start using it right away, even if the support is minimal and even if there is already an in-house implementation.

Huge thanks to my colleague Raja Ramu for partnering on this effort and helping to open source the module ebay-font.

—  Senthil Padmanabhan