Elasticsearch Cluster Lifecycle at eBay

Defining an Elasticsearch cluster lifecycle

eBay’s Pronto, our implementation of the “Elasticsearch as a service” (ES-AAS) platform, provides fully managed Elasticsearch clusters for various search use cases. Our ES-AAS platform is hosted in a private internal cloud environment based on OpenStack. The platform currently manages more than 35 clusters and supports multiple data center deployments. This blog provides guidelines on the different pieces of a cluster lifecycle that allow streamlined management of Elasticsearch clusters. All Elasticsearch clusters deployed within the eBay infrastructure follow our defined Elasticsearch lifecycle, depicted in the figure below.

Cluster preparation

This lifecycle stage begins when a new use case is being onboarded onto our ES-AAS platform.

On-boarding information

Customers’ requirements are captured on an onboarding template that contains information such as document size, retention policy, and read/write throughput requirements. Based on the inputs provided by the customer, infrastructure sizing is performed. The sizing uses historic learnings from our benchmarking exercises. Onboarding information has helped us in cluster planning and in defining SLAs for customer commitments.

We collect the following information from customers before any use case is onboarded:

  • Use case details: Consists of queries relating to use case description and significance.
  • Sizing information: Captures the number of documents, their average size, and year-on-year growth estimation.
  • Data read/write information: Consists of expected indexing/search rate, mode of ingestion (batch mode or individual documents), data freshness, average number of users, and specific search queries containing any aggregation, pagination, or sorting operations.
  • Data source/retention: Original data source information (such as Oracle, MySQL, etc.) is captured on an onboarding template. If the indices are time-based, then an index purge strategy is logged. Typically, we do not use Elasticsearch as the source of data for critical applications.

Benchmarking strategy

Before undertaking any benchmarking exercise, it is important to understand the underlying infrastructure that hosts your VMs. This is especially true in a cloud-based environment, where such information is usually abstracted from end users. Be aware of potential noisy-neighbor issues, especially on multi-tenant infrastructure.

Like most teams, we have performed extensive benchmarking exercises on our existing hardware infrastructure and image flavors. The data stored in Elasticsearch clusters is specific to each customer use case, and it is nearly impossible to perform benchmarking runs on all the data schemas used by different customers. Therefore, we made assumptions before embarking on any benchmarking exercise, and the following assumptions were key.

  • Clients will use a REST path for any data access on our provisioned Elasticsearch clusters (no transport client).
  • To start with, we kept a ratio of 1 GB of RAM to 32 GB of disk space. (This was later refined as we learned from benchmarking.)
  • Indexing numbers were carefully profiled for different numbers of replicas (1, 2, and 3 replicas).
  • Search benchmarking was always done on GetById queries (as search queries are custom, and profiling every custom search query was not viable).
  • We used fixed-size documents of 1 KB, 2 KB, 5 KB, and 10 KB.

Working from these assumptions, we arrived at a maximum shard size for performance (around 22 GB), the right payload size for _bulk requests (~5 MB), and so on. We used our own custom JMeter scripts to perform benchmarking. Elastic has since developed and open-sourced the Rally benchmarking tool, which can be used as well. Additionally, based on our benchmarking learnings, we created a capacity-estimation calculator tool that takes customer requirements as input and calculates the infrastructure required for a use case. Sharing this tool directly with end users saved us many conversations with customers about infrastructure cost.
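To illustrate the kind of arithmetic such a calculator performs, here is a minimal Python sketch. Only the ~22 GB shard ceiling and the 1 GB RAM : 32 GB disk ratio come from the benchmarking figures above; the formulas, default node sizes, and the 20% disk headroom are assumptions made for this example, not our actual tool.

    import math

    MAX_SHARD_GB = 22        # benchmarking-derived shard size ceiling
    RAM_TO_DISK_RATIO = 32   # 1 GB RAM per ~32 GB of disk
    DISK_HEADROOM = 1.2      # keep ~20% of disk free (illustrative assumption)

    def estimate_capacity(total_data_gb, replicas=1, node_ram_gb=64, node_disk_gb=2048):
        """Rough, illustrative infrastructure estimate for one use case."""
        primary_shards = math.ceil(total_data_gb / MAX_SHARD_GB)
        total_disk_gb = total_data_gb * (1 + replicas) * DISK_HEADROOM
        ram_needed_gb = total_disk_gb / RAM_TO_DISK_RATIO
        nodes_by_disk = math.ceil(total_disk_gb / node_disk_gb)
        nodes_by_ram = math.ceil(ram_needed_gb / node_ram_gb)
        return {
            "primary_shards": primary_shards,
            "total_disk_gb": round(total_disk_gb),
            "data_nodes": max(nodes_by_disk, nodes_by_ram, 1 + replicas),
        }

    print(estimate_capacity(total_data_gb=1500, replicas=2))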

VM cache pool

Our ES clusters are deployed by leveraging an intelligent warm-cache layer. The warm-cache layer consists of ready-to-use VM nodes that are prepared over a period of time based on predefined rules. This ensures that VMs are distributed uniformly across different underlying hardware. This layer has allowed us to quickly spawn large clusters within seconds. Additionally, our remediation engine leverages this layer to flex up nodes on existing clusters without errors or any manual intervention. More details on our cache pool are available in another eBay tech blog post, Ready-to-use Virtual-machine Pool Store via warm-cache.

Cluster deployment

Cluster deployment is fully automated via a Puppet/Foreman infrastructure. We will not go into detail about how the Elasticsearch Puppet module is leveraged for provisioning Elasticsearch clusters; this is well documented in the Elasticsearch puppet module itself. A corresponding version of the Puppet module is generally made publicly available with every Elasticsearch release. We have made minor modifications to these Puppet scripts to suit eBay-specific needs. Different Elasticsearch configuration settings are customized based on our benchmarking learnings. As a general guideline, we do not set the JVM heap size to more than 28 GB (because doing so leads to long garbage-collection cycles), and we always disable in-memory swapping for the Elasticsearch JVM process. Independent clusters are deployed across data centers, and load-balancing VIPs (Virtual IP addresses) are provisioned for data access.
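In the standard Elasticsearch configuration files, those two guidelines look roughly like the following. This is a minimal sketch rather than our full production configuration; on Elasticsearch 5.x and later the memory-lock setting is bootstrap.memory_lock, while older releases used bootstrap.mlockall.

    # jvm.options -- cap the heap at 28 GB; equal min/max avoids heap resizing
    -Xms28g
    -Xmx28g

    # elasticsearch.yml -- lock the JVM heap in memory so the OS cannot swap it out
    bootstrap.memory_lock: true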

Typically, with each cluster provisioned we give out two VIPs, one for data read operations and another one for write operations. Read VIPs are always created over client nodes (or coordinating nodes), while write VIPs are configured over data nodes. We have observed improved throughput from our clusters with such a configuration.

Deployment diagram

 

We use a lot of open-source software on our platform, such as OpenStack, MongoDB, Airflow, Grafana, InfluxDB (open-source version), OpenTSDB, etc. Our internal services, such as cluster provisioning, cluster management, and customer management services, allow REST API-driven management for deployment and configuration. They also help in tracking clusters as assets against different customers. Our cluster provisioning service relies heavily on OpenStack. For example, we use Nova for managing compute resources (nodes), Neutron APIs for load balancer provisioning, and Keystone for internal authentication and authorization of our APIs.

We do not use federated or cross-region deployments for an Elasticsearch cluster; network latency rules out such a deployment strategy. Instead, we host independent clusters for use cases across multiple regions. Clients have to perform dual writes when clusters are deployed in multiple regions. We also do not use Tribe nodes.

Cluster onboarding

We create the cluster topology during customer onboarding. This helps us track resources and the cost associated with cluster infrastructure. The metadata stored as part of a cluster topology maintains region deployment information, SLA agreements, cluster owner information, etc. We use eBay’s internal configuration management system (CMS) to track cluster information in the form of a directed graph. External tools hook into this topology, which allows easy monitoring of our clusters from centralized eBay-specific systems.

Cluster topology example

Cluster management

Cluster security

Security is provided on our clusters via a custom security plug-in that both authenticates and authorizes use of the Elasticsearch clusters. The plug-in intercepts messages and then performs context-based authorization and authentication using an internal authentication framework. Explicit whitelisting based on client IP is supported, which is useful for configuring Kibana or other external UI dashboards. Admin (DevOps) users are configured to have complete access to an Elasticsearch cluster. We encourage using HTTPS (based on TLS 1.2) to secure communication between clients and Elasticsearch clusters.

The following is a simple sample security rule that can be configured for clusters provisioned on our platform.

sample json code implementing a security rule
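An illustrative example of such a rule, based on the field descriptions below, might look like this (the IP values and the exact JSON layout are placeholder assumptions):

    {
      "enabled": true,
      "whitelisted_ip_list": ["10.10.1.15", "10.10.1.16"]
    }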

In the above sample rule, the enabled field controls whether the security feature is enabled. whitelisted_ip_list is an array attribute listing all whitelisted client IPs. Any open/close index operations or delete index operations can be performed only by admin users.

Cluster monitoring

Cluster monitoring is done by a custom monitoring plug-in that pushes 70+ metrics from each Elasticsearch node to a back-end TSDB-based data store. The plug-in works on a push-based design. External Grafana dashboards consume the data from the TSDB store. Custom templates created on the Grafana dashboard allow easy, centralized monitoring of our clusters.
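As a sketch of the push model, the following Python snippet sends one data point to an OpenTSDB-style /api/put HTTP endpoint. The endpoint URL, metric name, and tags are placeholders for illustration; this is not our plug-in’s actual code.

    import json
    import time
    import urllib.request

    TSDB_URL = "http://tsdb.example.internal:4242/api/put"  # placeholder endpoint

    def push_metric(metric, value, tags):
        """Push a single data point to the TSDB back end (illustrative only)."""
        datapoint = {
            "metric": metric,
            "timestamp": int(time.time()),
            "value": value,
            "tags": tags,
        }
        req = urllib.request.Request(
            TSDB_URL,
            data=json.dumps(datapoint).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=5)

    # Example: report JVM heap usage for one node of one cluster.
    push_metric("es.jvm.heap_used_pct", 63.2,
                {"cluster": "pronto-demo", "node": "es-data-01"})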

 

 

We leverage an internal alert system that can be used to configure threshold-based alerts on data stored in OpenTSDB. Currently, we have 500+ active alerts configured on our clusters with varying severity. Alerts are classified as ‘Errors’ or ‘Warnings’. Error alerts, when raised, are immediately attended to either by DevOps or by our internal auto-remediation engine, based on the alert rule configured.

Alerts are created during cluster provisioning based on various thresholds. For example, if a cluster’s status turns RED, an ‘Error’ alert is raised; if a node’s CPU utilization exceeds 80%, a ‘Warning’ alert is raised.
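For illustration only (this is not our actual rule format), threshold alerts of this kind can be expressed as simple declarative rules:

    # Hypothetical alert-rule definitions; metric names and thresholds are illustrative.
    ALERT_RULES = [
        {"metric": "es.cluster.status", "condition": "== RED", "severity": "Error"},
        {"metric": "es.node.cpu_pct",   "condition": "> 80",   "severity": "Warning"},
    ]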

Cluster remediation

Our ES-AAS platform can perform auto-remediation actions on receiving any cluster anomaly event. Such actions are enabled via our custom Lights-Out-Management (LOM) module. Any auto-remediation module can significantly reduce manual intervention for DevOps. Our LOM module uses a rule-based engine that listens to all alerts raised on our clusters. The reactor instance maintains a context of the alerts raised and, based on the cluster topology state (AUTO ON/OFF), takes remediation actions. For example, if a cluster loses a node and that node does not return to the cluster within the next 15 minutes, the remediation engine replaces the node via our internal cluster management services. Optionally, alerts can be sent to the team instead of taking a remediation action. The actions of the LOM module are tracked as stateful jobs that are persisted in a back-end MongoDB store. Due to the stateful nature of these jobs, they can be retried or rolled back as required. Audit logs are also maintained to capture the history and timeline of all remediation actions initiated by our custom LOM module.
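A minimal sketch of the node-loss rule described above, in Python. The cluster_mgmt interface and its method names are hypothetical stand-ins for our internal cluster management services; only the 15-minute grace period comes from the text.

    import time

    GRACE_PERIOD_SECONDS = 15 * 60  # node must rejoin within 15 minutes

    def handle_node_lost_alert(cluster, node_id, auto_on, cluster_mgmt):
        """Replace a lost node if it does not rejoin within the grace period."""
        if not auto_on:
            # Topology state is AUTO OFF: notify the team instead of acting.
            cluster_mgmt.notify_team(cluster, "Node %s lost" % node_id)
            return

        deadline = time.time() + GRACE_PERIOD_SECONDS
        while time.time() < deadline:
            if cluster_mgmt.node_is_back(cluster, node_id):
                return  # node recovered on its own; nothing to do
            time.sleep(30)

        # Node did not return in time: flex in a replacement (for example, from
        # the warm-cache pool) and record the action as a stateful, auditable job.
        job_id = cluster_mgmt.create_job("replace_node", cluster, node_id)
        cluster_mgmt.replace_node(cluster, node_id, job_id)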

Cluster logging

Along with the standard Elasticsearch distribution, we also ship our custom logging library. This library pushes all Elasticsearch application logs onto a back-end Hadoop store via an internal system called Sherlock. All centralized application logs can be viewed at both the cluster and node levels. Once Elasticsearch log data is available on Hadoop, we run daily Pig jobs on our log store to generate reports for error log or slow log counts. We generally keep our logging level at INFO, and whenever we need to triage issues, we use a transient logging setting of DEBUG, which collects detailed logs onto our back-end Hadoop store.
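Elasticsearch’s cluster settings API makes such a transient change straightforward; a sketch of the call is shown below. The exact logger key varies by Elasticsearch version, so treat this as illustrative.

    PUT /_cluster/settings
    {
      "transient": {
        "logger.org.elasticsearch": "DEBUG"
      }
    }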

Cluster decommissioning

We follow a cluster decommissioning process for major version upgrades of Elasticsearch. For major upgrades, we spawn a new cluster with our latest offering of the Elasticsearch version. We replay all documents from the old, existing cluster to the newly created one. Clients (user applications) start using both cluster endpoints for all new ingestion until data catches up on the new cluster. Once data parity is achieved, we decommission the old cluster. In addition to freeing up infrastructure resources, we also clean up the associated cluster topology. Elasticsearch also provides a migration plug-in that can be used to check whether a direct, in-place upgrade can be done across major Elasticsearch versions. Minor Elasticsearch upgrades are done on an as-needed basis and are usually done in place.
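One way to replay documents between clusters on Elasticsearch 5.x and later is the built-in reindex-from-remote API, sketched below. The host and index names are placeholders, and this is an illustration of the general approach rather than the exact mechanism we use; the old cluster’s address must also be listed under reindex.remote.whitelist in the new cluster’s elasticsearch.yml.

    POST /_reindex
    {
      "source": {
        "remote": { "host": "http://old-cluster-vip:9200" },
        "index": "my_index"
      },
      "dest": { "index": "my_index" }
    }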

Healthy Team Backlogs

 

What is a backlog?

Agile product owners use a backlog to organize and communicate the requirements for a team’s work. Product backlogs are deceptively simple, which can sometimes make them challenging to adopt for product owners who may be used to working with lengthy PRDs (“project requirement documents” or similar).

Scrum most commonly uses the term product backlog. However, many product owners who are new to Scrum are confused by this term. Reasonable questions arise: Does this suggest that a team working on multiple products would have multiple backlogs? If so, how do we prioritize between them? Where do bugs get recorded? What happens if work needs to be done, but it isn’t associated with a product; do we create a placeholder?

Therefore, we prefer the term team backlog. Our working definition of team backlog is “the maintained, ordered list of work that the team plans to do now or in the future.” This is a dense description, so let’s unpack it a little.

“The” and “Team”

  • We say the and team because each team needs a single source of truth to track their current and future work.
  • If a team is working on multiple projects or products, all of the work for those projects or products should appear on a single, unified team backlog.
  • Teams do not generally share backlogs.

“Work”

  • Work includes almost everything that the development team needs to do.
  • Features, bugs, technical debt, research, improvements, and even user experience work all appear on the same backlog.
  • Generally speaking, recurring team meetings and similar events do not appear on the backlog.

“Maintained”

  • We say maintained because the backlog is a “living” artifact.
  • The product owner and team must continually update and refine their backlog. Otherwise, the team will waste time doing useless work and chasing requirements.
  • This requires several hours per week for the product owner and 1–2 hours per week for the team. It involves adding, ordering, discussing, describing, justifying, deleting, and splitting work.

“Ordered”

  • We say ordered list rather than prioritized list because the backlog is ordered, not just prioritized.
  • If the backlog is only prioritized, there can be multiple items that are all “very high priority.”
  • If the backlog is ordered, we communicate exactly in what order those “very high priority” tasks should be worked on.

“Plans to Do”

  • We say plans to do because we regularly delete everything from the backlog that we no longer plan to work on.
  • Deleting unnecessary work is essential. Unnecessary work clutters up our backlog and distracts from the actual work.

What makes a backlog healthy?

Now that we know what a backlog is, what makes a backlog healthy or not? While what makes for a good backlog is somewhat subjective — in the same way that what makes a good PRD could be subjective — there are 10 characteristics that we’ve found to be particularly important.

Would you like to know if your backlog is healthy? Download this handy PDF checklist, print it out, then open up your backlog and follow along. For each criterion, take note of whether your backlog currently does, doesn’t, or only somewhat meets the criterion. In exchange for less than half an hour of your time, you’ll have a good sense of the health of your backlog and a few ideas for improvement.

  1. Focused, ordered by priority, and the team follows the order diligently

    • At all times, anyone can look at the backlog and know what needs to be worked on next without ambiguity.
    • Even if you have several “P1” issues, the team needs to know which P1 issue needs to be addressed next. Simply saying “they’re all important” will paralyze the team.
    • Although the PO is responsible for the product backlog order and makes the final call, the PO should be willing to negotiate the order with their team. The team often has good insights that can mitigate dependencies or help the PO deliver more value.
    • Stay focused on one thing at a time when possible to deliver value earlier and reduce context switching waste.

  2. Higher-value items towards the top, lower-value items towards the bottom

    • In general, do high-value, low-cost work first (“lowest hanging fruit”).
    • Next, do high-value, high-cost work because it is usually more strategic.
    • Then, do low-value, low-cost work.
    • Finally, eliminate low-value, high-cost work. You will almost always find something better to do with your time and resources, so don’t waste your time tracking it. It will be obvious if and when that work becomes valuable.
    • Hint: You can use Weighted Shortest Job First or a similar technique if you’re having difficulty prioritizing.

  3. Granular, ready-to-work items towards the top, loosely-defined epics towards the bottom

    • Items that are at the top of the backlog will be worked on next, so we want to ensure that they are the right size to work on.
    • The typical team’s Definition of Ready recommends that items take ≤ ½ of a sprint to complete.
    • Delay decision-making and commitments — manifested as small, detailed, team-ready items — until the last responsible moment.
    • There is little value specifying work in detail if you will not work on it soon. Due to learning and changing customer/company/competitive conditions, your requirements may change or you may cancel the work altogether.

     
    What is an Epic?

    • An “epic” is simply a user story that is too large to complete in one sprint. It gets prioritized in the backlog like every other item.
    • JIRA Tip: “Epics” in JIRA do not appear in the backlog for Scrum boards. As a result, they behave more like organizing themes than epics. Therefore, we suggest using JIRA’s epic functionality to indicate themes, and user stories with the prefix “Epic: ” to indicate actual epics.

  4. Solutions towards the top, statements of need towards the bottom

    • Teams can decide to start working on an item as soon as they know which customer needs they hope to address. However, collaboration between product, design, development, and stakeholders to translate customer needs into solutions takes time.
    • As with other commitments, defer solutioning decisions until the last responsible moment:
      • Your ideal solution may change through learning or changing conditions such as customer, competitors, company, or even technology options.
      • You may decide not to work on the problem after all.

  5. 1½ to 2 sprints’ worth of work that’s obviously ready to work on at the top

    • Teams sometimes surprise the product owner by having more capacity than expected.
    • Having enough ready stories ensures that the team is:
      • Unlikely to run out of work to pull into their sprint backlog during sprint planning.
      • Able to pull in additional work during the sprint if they complete the rest of the work on their sprint backlog.
    • It should be obvious what work is and isn’t ready to work on so that the team doesn’t have to waste time figuring it out each time they look at the backlog.
      • Some teams prefix a story title with a “* ” to indicate a ready story (or a story that isn’t ready).

  6. The value of each piece of work is clearly articulated

    • Your team should be able to understand why the work is important to work on.
    • There are three primary sources of value (and you can define your own):
      • User/Business Value: Increase revenue, reduce costs, make users happy
      • Time Criticality: Must it happen soon due to competition, risk, etc.?
      • Opportunity Enablement/Risk Reduction/Learning: Is it strategic? Is it necessary to enable another valuable item (for example, a dependency)?
    • You won’t usually need a complex financial projection, just a reasonable justification as to why the item should be worked on next relative to all other known possibilities. Time previously spent with complex projections can instead be used to talk to customers and identify other opportunities.

  7. The customer persona for the work is clearly articulated

    • The “As a” part of the “As a ____, I can ___, so that ____” user story isn’t a mere formality; it’s an essential part of user-centered product development.
    • Who is the customer? Who are you completing this work for? Even if you’re on a “back-end” team, keep the end-user in mind.
    • Partner with your designer to identify your personas and reference them whenever possible. Is this feature for “Serious Seller Sally?” Can you imagine her personality and needs just as well as any of your friends?
      • Example: “As Serious Seller Sally, I can list items using an ‘advanced’ flow so that I can get the options I need without the guidance for casual sellers that only slows me down.”
    • Tool Tip: Most teams and POs find it best to put just the “I can” part of the user story (for example, “List items using an ‘advanced’ flow”) in the planning tool’s title field. Otherwise it can be harder to read the backlog. Put the entire user story at the top of your tool’s description field.

  8. ≤ 100 items (a rule of thumb), and contains no work that — realistically — will never be done

    • This is a general rule. If your team works on many very small items or has considerable work that you must track, your backlog could be longer.
    • Assuming that each backlog item takes a minute to read and understand, 100 items alone would take over an hour and a half to process. Keeping our backlog limited like this makes it easier and faster to fully understand.
    • A longer backlog is more likely to contain features that will never be built or bugs that will never be fixed. Keeping a short backlog helps us ensure that we triage effectively and delete items that we are unlikely to work on.

  9. The team backlog is not a commitment

    • A Scrum team cannot make a realistic, firm commitment on an entire team backlog because:
      • It has not been through high-level design (for example, tasking at the end of Sprint planning).
      • The risk of missed dependencies and unexpected requests/impediments is too great.
      • “Locking in” a plan that far into the future considerably restricts flexibility.
    • A Scrum team can make a valid commitment on a sprint backlog if there are no mid-sprint scope changes and few unexpected requests and impediments.

  10. Backlog reflects the release plan if available

    • If the team has conducted release planning, create pro forma sprints with items in your planning tool to reflect the release plan.
    • If there are production release, moratorium, or similar dates, communicate those too.
    • Update the release plan at the end of each sprint as you learn.

What does a healthy team backlog look like in JIRA?

Glad you asked. Here are four sample “sprints” that take good advantage of JIRA’s built-in functionality.

Sprint 1 (active sprint)

Sprint 2 (next sprint)

Sprint 3 (future sprint)

Sprint 4 (future sprint)

Conclusion

Now you know what a healthy team backlog looks like. If you’ve filled out our printable checklist, mark off up to three items that you’ll work to improve over the next week or two with your teams. We hope this is of use to you!

Email Tech Is Now Ad Tech


 

eBay has come a long way in our CRM and email marketing in the past two years. Personalization is a relatively easy task when you’re dealing with just one region and one vertical and a hundred thousand customers. With 167M active buyers across the globe, eBay’s journey to help each of our buyers find their version of perfect was quite complex.

Like many in our industry, we’ve had to deal with legacy systems, scalability, and engineering resource constraints. And yet, we’ve made email marketing a point of pride — instead of the “check mark” that we started from. Here’s our story.

Our starting point was a batch-and-blast approach. Our outbound communications very much reflected our organizational structure: as a customer, I’d get a fashion email on Monday, a tech email on Tuesday, and a motors email on Wednesday. This, of course, wasn’t the kind of experience we wanted to create.

Additionally, for each of our marketing campaigns, we hand-authored targeting criteria — just as many of our industry colleagues do today in tools like Marketo and ExactTarget. This approach worked OK, but the resulting segment sizes were too large — in the hundreds of thousands. This meant that we were missing out on the opportunity to treat customers individually. It also didn’t scale well — as our business grew internationally, we needed to add more and more business analysts, and the complexity of our contact strategy was becoming unmanageable.

We wanted to create a structurally better experience for our customers — and our big bet was to go after 1:1 personalization using real-time data. We wanted to use machine learning to do the targeting, with real-time feedback loops powering our models.

Since email is such a powerful driver for eCommerce, we committed to a differentiated experience in this channel. After evaluating multiple off-the-shelf solutions, we settled on building an in-house targeting and personalization system — as the size of the eBay marketplace is astounding, and many of its opportunities and issues are quite unique. We set a high bar: every time we show an offer to a customer, it has to be driven by our up-to-the-minute understanding of what the customer did and how other customers are responding to this offer.

Here are some examples of the scenarios we targeted:

  • eBay has many amazing deals, and our community is very active. Deals quickly run out of inventory. We can’t send an offer to a customer and direct them to an expired deal. Thus, our approach involved open-time rendering of offers in email.
  • Some of our retail events turn out to be much more popular than we anticipate. We want to respond to this real-time engagement feedback by adjusting our recommendations quickly. We thus built a feedback loop that shows an offer to a subset of customers; then, if an event is getting a much higher click-through rate than we expected, we show it to more customers. If for some reason it isn’t doing well — for example, if the creative is underperforming — the real-time “bandit” approach reduces its visibility.
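To make the “bandit” idea concrete, here is a minimal Thompson-sampling sketch in Python. It illustrates the general technique rather than eBay’s production model; the offer names and counts are invented.

    import random

    # Hypothetical per-offer engagement counts, updated from the real-time
    # open/click stream (numbers are invented for illustration).
    offers = {
        "summer-tech-deal": {"clicks": 120, "impressions": 2400},
        "fashion-event":    {"clicks": 45,  "impressions": 300},
        "motors-parts":     {"clicks": 10,  "impressions": 900},
    }

    def choose_offer(offers):
        """Sample a plausible click-through rate for each offer from a Beta
        posterior and return the offer with the highest sample."""
        best_offer, best_sample = None, -1.0
        for name, stats in offers.items():
            alpha = 1 + stats["clicks"]                        # successes
            beta = 1 + stats["impressions"] - stats["clicks"]  # failures
            sample = random.betavariate(alpha, beta)
            if sample > best_sample:
                best_offer, best_sample = name, sample
        return best_offer

    # Each email open triggers a fresh draw, so a surprisingly popular event
    # quickly earns more slots as its clicks accumulate, while an
    # underperforming creative sees its visibility shrink.
    print(choose_offer(offers))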

Both of these scenarios required us to have real-time CRM and engagement streams. That is, we needed to know when a customer opens an email or clicks on it, and based on this knowledge, instantaneously adjust our recommendations for other customers. This of course is miles away from a typical multi-step batch system based on ETL pipelines that most retailers have today. We were no different — we had to reengineer our delivery and data collection pipes to be real-time. The payoff, however, is quite powerful — this real-time capability is foundational to our ability to personalize better: both today and in the years to come.

The resulting solution transformed email marketing at eBay: instead of hundreds of small, uncoordinated campaigns each month, we now have a small set of “flagship” campaigns, each of which comprises one or more offers. Each offer is selected at the open time of the email, and the selection is made by a machine-learned model that uses real-time data. As a result, we saw significant growth in both engagement and sales driven by our emails.

You’ll notice that this component-level personalization approach is all about treating email content as an ad canvas. The problem is fundamentally similar: once you’ve captured the customer’s attention — be it via a winning bid in an ad auction, or by having that customer open your email — you need to find the most relevant offer to show. Each email slot can be thought of as a first-party ad slot. This realization allowed us to unify our approaches between display advertising and email: the same stack now powers both.

We extended this approach to scenarios like paid social campaigns, where Facebook would want to retrieve the offer from us a priori to manage their customer experience. We built a real-time push framework, where, whenever we find a deal that is better than what we previously apprised Facebook of, we immediately push that offer to Facebook.

This creates a powerful cross-channel multiplier: if we happen to see the customer on the display channel, the same ad-serving pipeline is engaged — and our flagship deal-finding campaign can be served to that customer, too. This means that evolving our flagship campaigns — adding more sophisticated machine learning, improving our creatives — contributes to all channels that are powered by this pipeline, not just email.

Orchestration across channels also becomes possible: we can choose to send an email to a customer with a relevant offer; if they don’t open it, we can then target them with a display ad and an onsite banner; then, after showing the offer a set number of times across all channels, we can choose to stop the ad — implementing an effective cross-channel impression cap. And each condition for state transitions in this flow can itself be powered by a machine-learning model.
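As a sketch of that orchestration logic (the channel names, the cap, and the state fields are illustrative assumptions, not our actual flow):

    IMPRESSION_CAP = 6  # illustrative cross-channel cap

    def next_action(state):
        """Decide the next touch for one customer/offer pair.

        `state` tracks what has already happened for this pair; in production,
        each transition could itself be gated by a machine-learning model."""
        if state["impressions"] >= IMPRESSION_CAP:
            return "stop"                 # cap reached across all channels
        if not state["email_sent"]:
            return "send_email"
        if not state["email_opened"]:
            return "show_display_ad_and_onsite_banner"
        return "wait"

    print(next_action({"email_sent": True, "email_opened": False, "impressions": 3}))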

eBay’s scale creates an admirable engineering challenge for a true CRM. By putting our customers, and their behavioral signals, at the top of our priority list, we were able to create an asset in our CRM platform that positions us well in this journey towards 1:1 personalization. The single real-time, event-driven pipeline we’ve built allows coordinated, up-to-the-minute offers to be served — wherever we happen to see the customer.

Alex Weinstein (@alexweinstein) is the Director of Marketing Technologies and CRM at eBay and the author of the Technology + Entrepreneurship blog, where he explores data-driven decision making in the face of uncertainty. Prior to eBay, Alex was the head of product development at Wetpaint, a personalization tech startup.

Graphic: Rahul Rodriguez