
Elasticsearch Cluster Lifecycle at eBay

Defining an Elasticsearch cluster lifecycle

eBay’s Pronto, our implementation of the “Elasticsearch as a service” (ES-AAS) platform, provides fully managed Elasticsearch clusters for various search use cases. Our ES-AAS platform is hosted in a private internal cloud environment based on OpenStack. The platform currently manages more than 35 clusters and supports multiple data center deployments. This blog provides guidelines on the different pieces needed to create a cluster lifecycle that allows streamlined management of Elasticsearch clusters. All Elasticsearch clusters deployed within the eBay infrastructure follow our defined Elasticsearch lifecycle, depicted in the figure below.

Cluster preparation

This lifecycle stage begins when a new use case is being onboarded onto our ES-AAS platform.

On-boarding information

Customers’ requirements are captured in an onboarding template that contains information such as document size, retention policy, and read/write throughput requirements. Based on the inputs provided by the customer, infrastructure sizing is performed. The sizing uses historic learnings from our benchmarking exercises. Onboarding information has helped us with cluster planning and with defining SLAs for customer commitments.

We collect the following information from customers before any use case is onboarded (a sketch of such a template follows the list):

  • Use case details: Consists of queries relating to use case description and significance.
  • Sizing information: Captures the number of documents, the average document size, and year-on-year growth estimates.
  • Data read/write information: Consists of expected indexing/search rate, mode of ingestion (batch mode or individual documents), data freshness, average number of users, and specific search queries containing any aggregation, pagination, or sorting operations.
  • Data source/retention: Original data source information (such as Oracle, MySQL, etc.) is captured in the onboarding template. If the indices are time-based, then an index purge strategy is logged. Typically, we do not use Elasticsearch as the source of data for critical applications.
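Our actual onboarding template is an internal document; the following is a minimal sketch, expressed as a Python dict purely for illustration, of the kind of information it captures. The field names and values are hypothetical, not our real schema.

```python
# Illustrative onboarding request -- field names and values are hypothetical.
onboarding_request = {
    "use_case": {
        "description": "Item search for seller tools",
        "significance": "Tier-1, customer-facing",
    },
    "sizing": {
        "document_count": 500_000_000,
        "avg_document_size_kb": 2,
        "yoy_growth_pct": 20,
    },
    "read_write": {
        "indexing_rate_per_sec": 2000,
        "search_rate_per_sec": 500,
        "ingestion_mode": "batch",            # or "individual documents"
        "data_freshness": "near-real-time",
        "avg_concurrent_users": 50,
        "query_features": ["aggregation", "pagination", "sorting"],
    },
    "data_source_retention": {
        "source_of_data": "Oracle",
        "time_based_indices": True,
        "purge_after_days": 90,
    },
}
```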

Benchmarking strategy

Before undertaking any benchmarking exercise, it is really important to understand the underlying infrastructure that hosts your VMs. This is especially true in a cloud-based environment, where such information is usually abstracted from end users. Be aware of potential noisy-neighbor issues, especially on multi-tenant infrastructure.

Like most folks, we have performed extensive benchmarking exercises on our existing hardware infrastructure and image flavors. The data stored in Elasticsearch clusters is specific to customer use cases, and it is nearly impossible to perform benchmarking runs on all the data schemas used by different customers. Therefore, we made assumptions before embarking on any benchmarking exercise, and the following assumptions were key.

  • Clients will use a REST path for any data access on our provisioned Elasticsearch clusters. (No transport client.)
  • To start with, we kept a ratio of 1GB RAM to 32GB disk space. (This was later refined as we learned from benchmarking.)
  • Indexing numbers were carefully profiled for different numbers of replicas (1, 2, and 3 replicas).
  • Search benchmarking was always done on GetById queries (as search queries are custom, and profiling different custom search queries was not viable).
  • We used fixed-size 1KB, 2KB, 5KB, and 10KB documents.

Working from these assumptions, we arrived at a maximum shard size for performance (around 22GB), the right payload size for _bulk requests (~5MB), and so on. We used our own custom JMeter scripts to perform benchmarking. Elastic has since developed and open-sourced the Rally benchmarking tool, which can be used as well. Additionally, based on our benchmarking learnings, we created a capacity-estimation calculator tool that takes customer requirements as input and calculates the infrastructure required for a use case. Sharing this tool directly with end users has saved us many conversations with customers about infrastructure cost.
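Our capacity-estimation calculator is an internal tool, but the sketch below shows roughly how such an estimate could be derived from the numbers above (22GB maximum shard size, 1GB RAM per 32GB of disk). The function, its parameters, and the node sizes are illustrative assumptions, not the actual tool.

```python
# Rough, illustrative capacity estimate -- not the actual Pronto calculator.
MAX_SHARD_SIZE_GB = 22      # maximum shard size derived from benchmarking
RAM_TO_DISK_RATIO = 32      # 1GB RAM per 32GB of disk (initial assumption)

def estimate_cluster(doc_count, avg_doc_kb, replicas=1, node_ram_gb=64, node_disk_gb=2048):
    """Return an indicative (primary_shards, data_nodes) pair."""
    primary_gb = doc_count * avg_doc_kb / (1024 * 1024)       # raw primary data size
    total_gb = primary_gb * (1 + replicas)                    # primaries plus replicas
    primary_shards = max(1, int(-(-primary_gb // MAX_SHARD_SIZE_GB)))     # ceiling division
    usable_disk_per_node = min(node_disk_gb, node_ram_gb * RAM_TO_DISK_RATIO)
    data_nodes = max(3, int(-(-total_gb // usable_disk_per_node)))        # illustrative 3-node floor
    return primary_shards, data_nodes

print(estimate_cluster(doc_count=500_000_000, avg_doc_kb=2, replicas=1))
```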

VM cache pool

Our ES clusters are deployed by leveraging an intelligent warm-cache layer. The warm-cache layer consists of ready-to-use VM nodes that are prepared over a period of time based on predefined rules. This ensures that VMs are distributed uniformly across the underlying hardware. This layer has allowed us to spawn large clusters within seconds. Additionally, our remediation engine leverages this layer to flex up nodes on existing clusters without errors or any manual intervention. More details on our cache pool are available in another eBay tech blog post, Ready-to-use Virtual-machine Pool Store via warm-cache (included below).

Cluster deployment

Cluster deployment is fully automated via a Puppet/Foreman infrastructure. We will not talk in detail about how the Elasticsearch Puppet module is leveraged for provisioning Elasticsearch clusters; this is well documented at Elasticsearch puppet module. Along with every release of Elasticsearch, a corresponding version of the Puppet module is generally made publicly available. We have made minor modifications to these Puppet scripts to suit eBay-specific needs. Different configuration settings for Elasticsearch are customized based on our benchmarking learnings. As a general guideline, we do not set the JVM heap size to more than 28 GB (because doing so leads to long garbage collection cycles), and we always disable in-memory swapping for the Elasticsearch JVM process. Independent clusters are deployed across data centers, and load-balancing VIPs (Virtual IP addresses) are provisioned for data access.
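For illustration only, the overrides implied by those guidelines could be captured as data that a templating layer consumes; the dict below is a sketch, not our actual Puppet/Hiera configuration.

```python
# Illustrative Elasticsearch node overrides reflecting the guidelines above;
# not the actual eBay Puppet data.
es_node_overrides = {
    "jvm_heap_size": "28g",            # never above 28 GB, to avoid long GC cycles
    "bootstrap.memory_lock": True,     # lock the JVM heap so it is never swapped out
}
```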

Typically, with each cluster provisioned we give out two VIPs: one for data read operations and another for write operations. Read VIPs are always created over client nodes (or coordinating nodes), while write VIPs are configured over data nodes. We have observed improved throughput from our clusters with such a configuration.
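As an illustration of this split, a client application might point separate connections at the two VIPs, as in the sketch below. The hostnames are made up, and the elasticsearch-py 8.x client is assumed.

```python
# Sketch of a client honoring the read/write VIP split; hostnames are hypothetical.
from elasticsearch import Elasticsearch

write_es = Elasticsearch("https://mycluster-write.vip.example.com:9200")  # VIP over data nodes
read_es = Elasticsearch("https://mycluster-read.vip.example.com:9200")    # VIP over coordinating nodes

write_es.index(index="items", id="1", document={"title": "vintage camera"})
resp = read_es.search(index="items", query={"match": {"title": "camera"}})
print(resp["hits"]["total"])
```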

Deployment diagram


We use a lot of open-source software on our platform, such as OpenStack, MongoDB, Airflow, Grafana, InfluxDB (open-source version), OpenTSDB, etc. Our internal services, such as cluster provisioning, cluster management, and customer management services, allow REST API-driven management for deployment and configuration. They also help in tracking clusters as assets against different customers. Our cluster provisioning service relies heavily on OpenStack. For example, we use Nova for managing compute resources (nodes), Neutron APIs for load balancer provisioning, and Keystone for internal authentication and authorization of our APIs.

We do not use federated or cross-region deployments for an Elasticsearch cluster; network latency prevents us from using such a deployment strategy. Instead, we host independent clusters for a use case across multiple regions. Clients have to perform dual writes when clusters are deployed in multiple regions. We also do not use Tribe nodes.

Cluster onboarding

We create the cluster topology during customer onboarding. This helps us track resources and the cost associated with cluster infrastructure. The metadata stored as part of a cluster topology includes region deployment information, SLA agreements, cluster owner information, etc. We use eBay’s internal configuration management system (CMS) to track cluster information in the form of a directed graph. External tools hook onto this topology; such integrations allow easy monitoring of our clusters from centralized eBay-specific systems.

Cluster topology example

Cluster management

Cluster security

Security is provided on our clusters via a custom security plug-in that provides a mechanism to both authenticate and authorize the use of Elasticsearch clusters. Our security plug-in intercepts messages and then performs context-based authorization and authentication using an internal authentication framework. Explicit whitelisting based on client IP is supported. This is useful for configuring Kibana or other external UI dashboards. Admin (DevOps) users are configured to have complete access to the Elasticsearch cluster. We encourage using HTTPS (based on TLS 1.2) for securing communication between clients and Elasticsearch clusters.

The following is a simple sample security rule that can be configured on our platform for provisioned clusters.

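A minimal sketch of such a rule, written here as a Python dict for readability, might look like this; the field names follow the description below, and the values are made up.

```python
# Hypothetical reconstruction of a security rule; field names follow the
# description below, values are made up.
security_rule = {
    "enabled": True,               # controls whether the security feature is on
    "whitelisted_ip_list": [       # explicitly whitelisted client IPs (e.g., Kibana hosts)
        "10.10.1.15",
        "10.10.1.16",
    ],
}
```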

In the above sample rule, the enabled field controls whether the security feature is enabled. whitelisted_ip_list is an array attribute listing all whitelisted client IPs. Any open/close index operations or delete index operations can be performed only by admin users.

Cluster monitoring

Cluster monitoring is done by a custom monitoring plug-in that pushes 70+ metrics from each Elasticsearch node to a back-end TSDB-based data store. The plug-in works on a push-based design. External Grafana dashboards consume the data from the TSDB store. Custom templates are created on a Grafana dashboard, which allows easy centralized monitoring of our clusters.
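Our monitoring plug-in is custom and internal, but the push model is roughly the one sketched below, assuming OpenTSDB's standard HTTP /api/put endpoint; the metric names and endpoint host are illustrative.

```python
# Sketch of a push-based publisher of node metrics to OpenTSDB; the real
# plug-in is custom and runs on each Elasticsearch node.
import time
import requests

TSDB_URL = "http://tsdb.example.com:4242/api/put"   # hypothetical OpenTSDB endpoint

def push_metrics(cluster, node, metrics):
    """Push a batch of node-level metrics as OpenTSDB data points."""
    now = int(time.time())
    datapoints = [
        {
            "metric": f"es.{name}",
            "timestamp": now,
            "value": value,
            "tags": {"cluster": cluster, "node": node},
        }
        for name, value in metrics.items()
    ]
    requests.post(TSDB_URL, json=datapoints, timeout=5).raise_for_status()

push_metrics("pronto-demo", "node-1", {"jvm.heap_used_pct": 63.4, "search.query_latency_ms": 12})
```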


We leverage an internal alert system that can be used to configure threshold-based alerts on the data stored in OpenTSDB. Currently, we have 500+ active alerts configured on our clusters with varying severity. Alerts are classified as ‘Errors’ or ‘Warnings’. Error alerts, when raised, are immediately attended to either by DevOps or by our internal auto-remediation engine, based on the alert rule configured.

Alerts are created during cluster provisioning based on various thresholds. For example, if a cluster status turns RED, an ‘Error’ alert is raised; if the CPU utilization of a node exceeds 80%, a ‘Warning’ alert is raised.
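The alert definitions live in our internal alert system; the snippet below only illustrates the kind of threshold classification described above, with a made-up rule structure.

```python
# Illustrative threshold rules mirroring the examples above; not the actual
# alert-system configuration format.
ALERT_RULES = [
    {"metric": "cluster.status", "trips": lambda v: v == "RED", "severity": "Error"},
    {"metric": "node.cpu_utilization_pct", "trips": lambda v: v > 80, "severity": "Warning"},
]

def classify(metric, value):
    """Return the severities of every rule the observed value trips."""
    return [rule["severity"] for rule in ALERT_RULES
            if rule["metric"] == metric and rule["trips"](value)]

print(classify("node.cpu_utilization_pct", 92))   # ['Warning']
```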

Cluster remediation

Our ES-AAS platform can perform an auto-remediation action on receiving any cluster anomaly event. Such actions are enabled via our custom Lights-Out-Management (LOM) module. Any auto-remediation module can significantly reduce manual intervention for DevOps. Our LOM module uses a rule-based engine that listens to all alerts raised on our clusters. The reactor instance maintains a context of the alerts raised and, based on the cluster topology state (AUTO ON/OFF), takes remediation actions. For example, if a cluster loses a node and the node does not return to its cluster within the next 15 minutes, the remediation engine replaces that node via our internal cluster management services. Optionally, alerts can be sent to the team instead of taking a cluster remediation action. The actions of the LOM module are tracked as stateful jobs that are persisted in a back-end MongoDB store. Due to the stateful nature of these jobs, they can be retried or rolled back as required. Audit logs are also maintained to capture the history and timeline of all remediation actions initiated by our custom LOM module.
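As a rough illustration of the node-loss rule above, the reactor logic could look like the sketch below; the topology, cluster-management, and notification objects stand in for internal eBay services and are not real APIs.

```python
# Sketch of the LOM reactor handling a lost-node alert; service calls are
# placeholders for internal eBay systems.
import time

NODE_MISSING_GRACE_SECONDS = 15 * 60   # give the node 15 minutes to rejoin

def on_node_missing_alert(cluster, node, first_seen_ts, topology, cluster_mgmt, notifier):
    if time.time() - first_seen_ts < NODE_MISSING_GRACE_SECONDS:
        return  # still within the grace period; do nothing yet
    if topology.auto_remediation_enabled(cluster):        # AUTO ON/OFF flag on the topology
        job = cluster_mgmt.replace_node(cluster, node)    # stateful job persisted in MongoDB
        job.audit("replace-node triggered by LOM")        # audit log for the remediation timeline
    else:
        notifier.alert_team(cluster, f"Node {node} missing for more than 15 minutes")
```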

Cluster logging

Along with the standard Elasticsearch distribution, we also ship our custom logging library. This library pushes all Elasticsearch application logs to a back-end Hadoop store via an internal system called Sherlock. All centralized application logs can be viewed at both the cluster and node levels. Once Elasticsearch log data is available on Hadoop, we run daily PIG jobs on our log store to generate reports for error-log and slow-log counts. We generally keep our logging level at INFO, and whenever we need to triage issues, we use a transient logging setting of DEBUG, which collects detailed logs onto our back-end Hadoop store.
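Switching a logger to DEBUG transiently can be done through Elasticsearch's dynamic cluster-settings API; the sketch below assumes the elasticsearch-py 8.x client and a hypothetical cluster endpoint.

```python
# Sketch: transiently raise an Elasticsearch logger to DEBUG for triage, then reset it.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://mycluster-write.vip.example.com:9200")   # hypothetical VIP

# Transient settings are not persisted across full cluster restarts.
es.cluster.put_settings(transient={"logger.org.elasticsearch": "DEBUG"})

# ... triage the issue, collect detailed logs ...

# Reset the logger back to the node default (INFO in our deployments).
es.cluster.put_settings(transient={"logger.org.elasticsearch": None})
```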

Cluster decommissioning

We follow a cluster decommissioning process for major version upgrades of Elasticsearch. For major upgrades of Elasticsearch clusters, we spawn a new cluster with our latest offering of the Elasticsearch version. We replay all documents from the old (existing) cluster to the newly created cluster. Clients (user applications) start using both cluster endpoints for all future ingestion until the data catches up on the new cluster. Once data parity is achieved, we decommission the old cluster. In addition to freeing up infrastructure resources, we also clean up the associated cluster topology. Elasticsearch also provides a migration plug-in that can be used to check whether direct, in-place upgrades can be done across major Elasticsearch versions. Minor Elasticsearch upgrades are done on an as-needed basis and are usually done in place.
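A data-parity check before decommissioning can be as simple as comparing per-index document counts between the two clusters, as in this sketch; the endpoints and index names are hypothetical, and the elasticsearch-py client is assumed.

```python
# Naive sketch of a data-parity check between the old and new clusters.
from elasticsearch import Elasticsearch

old_es = Elasticsearch("https://old-cluster.vip.example.com:9200")   # hypothetical endpoints
new_es = Elasticsearch("https://new-cluster.vip.example.com:9200")

def parity_reached(indices):
    for index in indices:
        old_count = old_es.count(index=index)["count"]
        new_count = new_es.count(index=index)["count"]
        if new_count < old_count:
            print(f"{index}: new cluster is behind ({new_count} < {old_count})")
            return False
    return True

if parity_reached(["items", "orders"]):
    print("Data parity achieved; safe to decommission the old cluster")
```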

Ready-to-use Virtual-machine Pool Store via warm-cache

Problem overview

Conventional on-demand Virtual Machine (VM) provisioning methods on a cloud platform can be time-consuming and error-prone, especially when we need to provision VMs in large numbers quickly.

The following list captures different issues that we often encounter while trying to provision a new VM instance on the fly:

  • Insufficient availability of compute resources due to capacity constraints
  • The need to place VMs in different fault domains to avoid concentrating VM instances on the same rack
  • Transient failures or delays in the service-provider platform, which cause provisioning to fail or take longer

Elasticsearch-as-a-service, or Pronto, is a cloud-based platform that provides distributed, easy-to-scale, and fully managed Elasticsearch clusters. This platform uses the OpenStack-based Nova module to obtain compute resources (VMs). Nova is designed to power massively scalable, on-demand, self-service access to compute resources. The Pronto platform is available across multiple data centers with a large number of managed VMs.

The time taken to provision a complete Elasticsearch cluster via Nova APIs is bounded by the slowest member node reaching a “ready to use” (active) state. Provisioning a single node typically takes up to three minutes (95th percentile) but can take up to 15 minutes in some cases. Therefore, for a fairly large cluster, our platform would take a long time to complete provisioning. This greatly impacts our turnaround time to remediate production issues. In addition to provisioning time, it is time-consuming to validate newly created VMs.

Many critical applications leverage our platform for their search use cases. Therefore, as a platform provider, we need to ensure high availability so that, in the case of a catastrophic cluster event (such as a node or infrastructure failure), we can quickly flex up our clusters in seconds. Node failures are quite common in a cloud-centric world, and applications need to ensure that there is sufficient resiliency built in. To avoid over-provisioning nodes, remediation actions such as flexing up (adding a new node) should ideally be done in seconds.

New hardware capacity is acquired as racks from external vendors. Each rack typically has two independent fault domains with minimal resource overlap (for example, different networks), and sometimes they do not share a common power source. Each fault domain hosts many hypervisors (virtual machine managers). Standalone VMs are provisioned on these hypervisors. VMs can be of different sizes (tiny, medium, large, and so on). VMs on the same hypervisor can compete for disk and network I/O resources, which can lead to noisy-neighbor issues.


Nova provides ways to be fault-domain- and hypervisor-aware. However, it is still difficult to guarantee rack isolation during run-time provisioning of VM instances. For example, once we start provisioning VMs, there is no guarantee that we will successfully create VM instances on different racks; this depends entirely on the hardware available at that point in time. Rack isolation is important to ensure high availability of Elasticsearch master nodes (the cluster brain). Every master node in an Elasticsearch cluster must reside on a different rack for fault tolerance (if a rack fails, a master node on another rack can take up the active master role). Additionally, all data nodes of a given cluster must reside on different hypervisors for logical isolation. Our APIs must fail immediately when we cannot get VMs on different racks or hypervisors; a subsequent retry will not necessarily solve this problem.

Solution

The warm-cache module intends to solve these issues by creating a cache pool of VM instances well ahead of actual provisioning needs. Many pre-baked VMs are created and loaded in a cache pool. These ready-to-use VMs cater to the cluster-provisioning needs of the Pronto platform. The cache is continuously built, and it can be continuously monitored via alerts and user-interface (UI) dashboards. Nodes are periodically polled for health status, and unhealthy nodes are auto-purged from the active cache. At any point, interfaces on warm-cache can help tune or influence future VM instance preparation.

The warm-cache module leverages open source technologies like Consul, Elasticsearch, Kibana, Nova, and MongoDB for realizing its functionality.

Consul is an open-source distributed service discovery tool and key value store. Consul is completely distributed, highly available, and scalable to thousands of nodes and services across multiple data centers. Consul also provides distributed locking mechanisms with support for TTL (Time-to-live).

We use Consul as a key-value (KV) store for these functions (a short usage sketch follows the list):

  • Configuring VM build rules
  • Storing VM flavor configuration metadata
  • Leader election (via distributed locks)
  • Persisting VM-provisioned information
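For illustration, reading and writing such keys with the python-consul client looks roughly like the sketch below; the key paths and values are made up.

```python
# Sketch of warm-cache style reads/writes against the Consul KV store;
# key names and values are illustrative.
import json
import consul

c = consul.Consul(host="consul.example.com", port=8500)   # hypothetical Consul agent

# Persist build rules for a flavor under a warm-cache keyspace.
build_rule = {"total_instances": 50, "max_instance_per_cycle": 5}
c.kv.put("warm-cache/flavors/g2-highmem-16-slc07/rules", json.dumps(build_rule))

# Read the rule back (the value is returned as bytes).
_, entry = c.kv.get("warm-cache/flavors/g2-highmem-16-slc07/rules")
print(json.loads(entry["Value"]))
```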

The following snapshot shows a representative warm-cache KV store in Consul.


The following screenshot shows a sample of the Consul web UI.


Elasticsearch “is a highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze big volumes of data quickly and in near real time. It is generally used as the underlying engine/technology that powers applications that have complex search features and requirements.” Apart from provisioning and managing Elasticsearch clusters for our customers, we use Elasticsearch clusters ourselves for our platform monitoring needs. This is a good way to validate our own platform offering. An Elasticsearch back end is used for warm-cache module monitoring.

Kibana is “built on the power of Elasticsearch analytics capabilities to analyze your data intelligently, perform mathematical transformations, and slice and dice your data as you see fit.” We use Kibana to depict the entire warm-cache build history stored in Elasticsearch. This build history is rendered on a Kibana dashboard with various views. The build history contains information such as how many instances were created and when, how many errors occurred, how much time was taken for provisioning, how many different racks are available, and the VM instance density on racks/hypervisors. The warm-cache module can additionally send email notifications whenever the cache is built, updated, or affected by an error.

We use the Kibana dashboard to check active and ready-to-use VM instances of different flavors in a particular datacenter, as shown in the following figure.


MongoDB “is an open-source, document database designed for ease of development and scaling.” The warm-cache module uses MongoDB to store flavor details. A flavor corresponds to the underlying VM hardware profile (tiny, large, xlarge, etc.). Flavor details consist of sensitive information, such as image-id and flavor-id, which are required for the actual Nova compute calls. warm-cache uses a Mongo service abstraction layer (MongoSvc) to interact with the back-end MongoDB in a secure and protected manner. The exposed APIs on MongoSvc are authenticated and authorized via Keystone integration.

CMS (Configuration Management System) is a high-performance, metadata-driven persistence and query service for configuration data with support for RESTful API and client libraries (Java and Python). This system is internal to eBay, and it is used by warm-cache to get hardware information of various compute nodes (including rack and hypervisor info).

System Design

The warm-cache module is built as a pluggable library that can be integrated or bundled into any long-running service or daemon process. On successful library initialization, a warm-cache instance handle is created. Optionally, a warm-cache instance can enroll for leader election participation. Leader instances are responsible for the preparation of VM cache pools for different flavors. The warm cache consists of VM pools for every flavor across the different available data centers.


The following figure shows the system dependencies of warm-cache.


The warm-cache module is expected to bring VM instance preparation time down to a few seconds. It should also remedy many of the exceptions and errors that occur while VM instances are getting to a usable state, because these errors are handled well in advance of actual provisioning needs. Typical errors encountered today include nodes not being available in Foreman due to sync issues and delays waiting for VM instances to reach the active state.

The figure below depicts the internal state diagram of the warm-cache service. This state flow is triggered on every warm-cache service deployed. Leader election is triggered at every 15-minute boundary interval (which is configurable). This election is done via Consul locks with an associated TTL (time to live). After a leader instance is elected, that particular instance holds the leader lock and reads metadata from Consul for each Availability Zone (AZ, equivalent to a data center). These details include information such as the minimum number of instances of each flavor to be maintained by warm-cache. The leader instance spawns parallel tasks for each AZ and starts preparing the warm cache based on predefined rules. Preparation of a VM instance is marked as complete when the instance moves to an active state (as indicated by an OpenStack Nova API response). All successfully created VM instances are added to the warm-cache list maintained in Consul. The leader instance releases the leader lock on complete execution of its VM build rules and waits for the next leader election cycle.

The configuration of each specific flavor (for example, g2-highmem-16-slc07) is persisted in Consul as build rules for that particular flavor. An example is sketched below.

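A sketch of such a build rule, using the attribute names described below and made-up values, might look like this.

```python
# Hypothetical build rule for one flavor; attribute names follow the
# description below, values are illustrative.
build_rule = {
    "max_instance_per_cycle": 5,         # instances to create in one leadership cycle
    "min_fault_domain": 2,               # have Nova place at least 2 nodes on different fault domains
    "reserve_cap": 10,                   # instances held back, unavailable via warm-cache
    "user_data": "IyEvYmluL2Jhc2gK...",  # base64-encoded Bash run on first start-up (truncated)
    "total_instances": 50,               # total instances to be created for this flavor
    "group_hint": "es-data-group",       # optional: no two instances of a group on one hypervisor
}
```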

In the above sample rule, the max_instance_per_cycle attribute indicates how many instances are to be created for this flavor in one leadership cycle. min_fault_domain is used with the Nova API to ensure that at least two nodes in a leader cycle go to different fault domains. reserve_cap specifies the number of instances that are blocked and unavailable via warm-cache. user_data is the base64-encoded Bash script that a VM instance executes on first start-up. total_instances keeps track of the total number of instances that need to be created for a particular flavor. An optional group_hint can be provided to ensure that no two instances with the same group-id are configured on the same hypervisor.

For every VM instance added to the warm cache, the following metadata is persisted in Consul:

  • Instance Name
  • Hypervisor ID
  • Rack ID
  • Server ID
  • Group name (OS scheduler hint used)
  • Created time


Since there are multiple instances of the warm-cache service deployed, only one of them is elected leader to prepare the warm cache during a given time interval. This is necessary to avoid conflicts among multiple warm-cache instances. Consul is again used for leader election. Each warm-cache service instance registers itself as a warm-cache service on Consul. This information is used to track the available warm-cache instances. The registration has a TTL (time-to-live) value of one hour associated with it, and any deployed warm-cache service is expected to re-register itself within that TTL. Each of the registered warm-cache services attempts to elect itself as the leader by trying to acquire the leader lock on Consul. Once a warm-cache service acquires the lock, it acts as the leader for VM cache pool preparation. All other warm-cache service instances move to a stand-by mode during this time. There is a TTL associated with each leader lock to handle leader failures and to enable leader re-election.
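A minimal sketch of this lock-based election with the python-consul client follows; the key name, TTL values, and the surrounding retry/renewal logic are illustrative.

```python
# Sketch of Consul-based leader election for warm-cache; key names and TTLs
# are illustrative, and the real service adds retries and session renewal.
import json
import socket
import time
import consul

c = consul.Consul(host="consul.example.com")   # hypothetical Consul agent

def prepare_vm_cache_pools():
    """Placeholder for the per-AZ, per-flavor cache build described above."""

def try_become_leader():
    # The session TTL bounds how long a failed leader can keep the lock.
    session_id = c.session.create(name="warm-cache-leader", ttl=900, lock_delay=15)
    payload = json.dumps({"leader": socket.gethostname(), "since": int(time.time())})
    acquired = c.kv.put("warm-cache/leader", payload, acquire=session_id)
    return session_id if acquired else None

session = try_become_leader()
if session:
    try:
        prepare_vm_cache_pools()
    finally:
        c.kv.put("warm-cache/leader", "", release=session)   # release for the next election cycle
# else: another instance is leader; stay in stand-by mode until the next 15-minute cycle
```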

In the following figure, leader is a Consul key that is managed by a distributed lock for the leadership role. The last leader node name and the leader start timestamp are captured on this key. When a warm-cache service completes its functions in the leader role, this key is released for other prospective warm-cache service instances to become the new leader.


The leadership time-series graph depicts which node assumed the leadership role. The number 1 in the graph below indicates a leadership cycle.


When a leader has to provision a VM instance for a particular flavor, it first looks up meta information for the flavor in MongoDB (via MongoSvc). This lookup provides details such as image-id and flavor-id, which are used when creating the actual VM instance via Nova APIs. Once a VM is created, its rack-id information is available via CMS. This information is stored in Consul under the key $AZ/$INSTANCE, where $AZ is the Availability Zone and $INSTANCE is the actual instance name. The information is also persisted to Elasticsearch for monitoring purposes.
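The sketch below strings these steps together using python-novaclient and python-consul; the Keystone session setup, the MongoSvc flavor lookup, and the CMS rack lookup are stubbed out, and the field names are illustrative.

```python
# Sketch of the leader's provisioning step: create a VM via Nova, wait for it
# to become ACTIVE, then persist its metadata in Consul under $AZ/$INSTANCE.
# The keystone session, flavor metadata, and rack lookup are stubbed/hypothetical.
import json
import time
import consul
from novaclient import client as nova_client

def provision_instance(keystone_session, az, name, flavor_meta, group_hint, rack_lookup):
    nova = nova_client.Client("2.1", session=keystone_session)

    # image-id and flavor-id come from the flavor metadata stored in MongoDB (via MongoSvc).
    server = nova.servers.create(
        name=name,
        image=flavor_meta["image_id"],
        flavor=flavor_meta["flavor_id"],
        availability_zone=az,
        userdata=flavor_meta.get("user_data"),
        scheduler_hints={"group": group_hint} if group_hint else None,  # assumed mapping of group_hint
    )

    # Preparation is complete only when Nova reports the instance as ACTIVE.
    while server.status not in ("ACTIVE", "ERROR"):
        time.sleep(10)
        server = nova.servers.get(server.id)
    if server.status != "ACTIVE":
        raise RuntimeError(f"{name} failed to reach the ACTIVE state")

    # Rack information comes from CMS; persist the record for monitoring/acquisition.
    record = {
        "instance_name": name,
        "server_id": server.id,
        "rack_id": rack_lookup(server.id),   # CMS lookup, stubbed
        "group_name": group_hint,
        "created_time": int(time.time()),
    }
    consul.Consul().kv.put(f"{az}/{name}", json.dumps(record))
    return server
```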

The following figure shows a high-level system sequence diagram (SSD) of a leader role instance:


A Kibana dashboard can be used to check how VM instances in the cache pool are distributed across the available racks. The following figure shows how many VM instances are provisioned on each rack. Using this information, DevOps can change the warm-cache build attributes to influence how the cache is built in the future.


The following options are available for acquiring VM instances from the warm-cache pool (a sketch of the rack-aware mode follows the list):

  • The Rack-aware mode option ensures that all nodes provided by warm-cache reside on different racks.
  • The Hypervisor-aware mode option returns nodes that reside on different hypervisors, with no two nodes sharing a common hypervisor.
  • The Best-effort mode option tries to get nodes from mutually exclusive hypervisors but does not guarantee it.
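As an illustration of the rack-aware mode, acquisition can be a simple greedy selection over the cached instances' rack metadata; the data structures below are made up.

```python
# Sketch of rack-aware acquisition from the warm-cache pool: pick at most one
# cached instance per rack. The instance records are illustrative.
def acquire_rack_aware(cached_instances, count):
    """cached_instances: list of dicts with 'instance_name' and 'rack_id' keys."""
    picked, used_racks = [], set()
    for inst in cached_instances:
        if inst["rack_id"] in used_racks:
            continue                      # skip instances on racks we already used
        picked.append(inst)
        used_racks.add(inst["rack_id"])
        if len(picked) == count:
            return picked                 # caller removes these from the active warm-cache list
    raise RuntimeError("Not enough rack-isolated instances available in the warm cache")
```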

The following figure illustrates the process for acquiring a VM.


The following screenshot includes a table from Kibana showing the time when an instance was removed from the warm cache, the instance’s flavor, data center information, and the instance count.


The corresponding metadata in Consul for acquired VM instances is updated, and the instances are removed from the active warm-cache list.

Apart from our ability to quickly flex up, another huge advantage of the warm-cache technique compared to conventional run-time VM creation is that, before an Elasticsearch cluster is provisioned, we know exactly whether we have all the required, healthy VM nodes to satisfy our capacity needs. Many applications hosted in a cloud environment require the ability to quickly flex up or to guarantee error-free capacity for their deployment needs; they can take a cue from the warm-cache approach for solving similar problems.