Effective Front-End Code Review

 

Whether it is open-source or in-house code, front-end or back-end code, the pull request (PR) model is widely used in the industry. Before a PR is merged, it is reviewed by peers, and the code is modified by the developer as needed, in an iterative and interactive process. Ideally, we want to spend quality time reviewing each modified line (and any unmodified but associated code that may be impacted) so that we don’t defer and build tech debt when code is added or modified.

However, with pressing timelines, sometimes we simply don’t have enough time for code reviews. And when we do have time, we may not be doing a good job because we don’t have a clear set of team coding guidelines or we may not know what we should look for. Below are some suggestions about how code review can be done, with specific front-end things to focus on.

Code review prerequisites

  • Thorough and meaningful code reviews take time. We need to scope for code review just like any other task. For that to happen, everyone must believe in its purpose and importance — that means management and product managers must actively encourage code reviews and give developers the necessary time.
  • The team has agreed on naming convention, styling, and coding standards, whether it is HTML, CSS, or JS. Styling standards give the code a consistent and professional look. A naming convention makes it easier to understand, search, build, and maintain. Coding standards make the code cleaner and easier to read, build, maintain, and debug. There are many popular standards out there. It is less important which one you choose, but it is crucial that the entire team follows one standard. If a rule is controversial within the team, discuss and decide, then change the standard and apply the new rule across the entire code base. Consistency is key.
  • The team has agreed on what to review — from spaces and camelCase, to where business logic should be, to whether the new code is violating established architecture.

Some things to look for in HTML

  • Does the code follow agreed-upon standards and conventions?
  • Does the markup use the same element ID more than once? If so, change it to a class, because each ID should be unique. If a reusable module contains an element ID, the ID is no longer unique once the module is reused in the same page. Remember that getting an element by ID returns only the first matching element, which can cause subtle bugs when multiple elements share the same ID (see the sketch after this list).
  • Are there unnecessary wrapper elements? If a <div> does not have a class, or the <div> has a class but the class has no associated styles, chances are that it is not needed. Remove it and check the layout. If the layout does not change, you do not need that wrapper.
  • If the app uses a template language that allows complicated logic, are you over-using it, thus making the markup hard to read? Consider applying the logic to the view model BEFORE the view model is passed to the template.
  • Does the markup contain the proper ARIA attributes for accessibility?
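To make the duplicate-ID pitfall concrete, here is a minimal sketch (the price element is hypothetical):

    // Suppose a reusable module renders <span id="price"> and appears twice on the page.
    document.getElementById('price');            // silently returns only the FIRST match
    document.querySelectorAll('#price').length;  // greater than 1 reveals the duplicate ID
    // Fix: address every instance with a class instead.
    document.querySelectorAll('.price');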

Some things to look for in CSS

  • Does the code follow agreed-upon standards and conventions?
  • If you are using a CSS preprocessor, such as LESS or SASS, are you nesting classes (or even worse, nesting IDs) because it looks more organized? Unnecessary nesting makes overriding CSS difficult; it also hurts performance and maintainability.
  • Is CSS in modules namespaced to avoid name clashing? Consider a top-level wrapper class with the same name as the module. If you don’t want class nesting, consider using BEM convention.
  • Are some class names so generic that name clashes are likely? This can happen with either other modules in the same app or with other libraries that you may use.
  • Are regularly used CSS rules (or blocks of rules) repeated so often that they should become mixins?
  • If the app uses a CSS library, such as Bootstrap or Skin, are you maximizing its use? Does your app redefine mixins available in the library, or worse, does your app have blocks of CSS where using a single class from the library would do?
  • Are padding and margin used correctly? Is padding used when margin should have been, or vice versa?
  • Does using percentage instead of pixel make sense? How about rem vs. px?
  • Are there duplicate rules for the same selector? Remove the duplicates.
  • Are the same rules found for different selectors in the same module? Merge them.
  • Are the same rules found across different modules? Put them in a shared CSS file.
  • Are there too many nested classes? Are they necessary for override? Try removing some nesting. If the layout doesn’t change, they are not needed.
  • Are IDs used for CSS? IDs are HIGHLY specific and need other IDs to override them. And if you must insist on using IDs for CSS, DO NOT nest IDs. First, nested IDs require the same number of nested IDs or more to override, which makes your CSS longer and more complicated. Second, and this relates to a fundamental CSS principle — IDs are by definition unique. If an ID needs to be nested to work, it’s not unique. It should not be an ID. Make it a class instead.
  • If the code uses media queries, see the “Media queries” section in this post.

Some things to look for in JavaScript

  • Does the code follow agreed-upon standards and conventions?
  • Is there JSDoc or similar documentation, especially for exposed methods?
  • If something is to be fixed/enhanced later, is there a TODO with a developer’s name next to it?
  • Is a solution either too clever or too complicated? Is there a slightly longer but simpler and more maintainable solution? If you have to read the code (or its comments) several times to understand what it is trying to do, it is probably too complicated and should be simplified.
  • Does each method do one specific task only? If not, break it up into smaller, focused tasks. It will help in reusability and unit testing.
  • Is a method particular to one use case, with hard-coded values and hard-coded CSS selectors, and so on? Or is it generic and reusable, and takes parameters? Generic methods are scalable and easier to maintain, AND they are much easier to unit test.
  • Does a block of code look familiar from elsewhere? If an existing util method does the same, use it. If there is a util method that is related but slightly different, update it. If the util does not exist, create it. Then, use that util and remove identical and similar blocks globally.
  • If an app already uses a utility library such as Underscore or Lodash, is the PR introducing another similar library? Discuss with your team which library best suits your app.
  • Is the app loading a heavy library when you use only a few methods in it? Is there a lighter alternative? Can you pull out just what you need?
  • Does a function take too many parameters? If there are more than 4 (or some number set by your team), use an object to avoid the hassle of parameter ordering and null values (see the sketch after this list).
  • Are values and data cached? Any string, object, etc., should be cached if it is used more than once.
  • Are variable and function names so short that they are not meaningful at a glance? Longer names (be reasonable, of course) are more likely to be meaningful and self-documenting. Furthermore, searching globally for a longer name will return fewer results, which makes code-digging easier.
  • Is there code that looks like the following? Use Lodash’s get (or similar method from another library) to get an object property down a path. You can avoid long chains (it’s easy to miss a check) and undefined exceptions when the response is not what you expect.
    if (a && a.b && a.b.c && a.b.c[0] && a.b.c[0].d ......) {
        ...
    }
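    For example, a minimal sketch, assuming Lodash is available as _ (the path shown is hypothetical):

    // _.get walks the path and returns a default instead of throwing when a.b is missing
    var d = _.get(a, 'b.c[0].d', null);
    if (d) {
        // safe to use d here
    }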
  • If using third-party libraries, is the app using deprecated methods? Use the latest APIs as recommended by the libraries.
  • Are all console logs and debuggers removed?
  • Are listeners removed before being re-added? Does the handler run multiple times on the same event on the same element? One way to check is by printing logs in the handler.
  • Pay attention to elements (usually inputs) that are listening to multiple events, such as keydown, keypress, and keyup. When one user action causes all events to fire, such that the handler is run once for each fired event, is the app still behaving as expected? Even if it is behaving correctly, does it make the app less responsive?
  • If the app architecture is based on the modular pattern, is one module referencing or modifying DOM elements of an unrelated module, thus breaking the pattern?
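To illustrate the options-object suggestion above, a hypothetical sketch:

    // Before: positional parameters are easy to misorder and force null placeholders
    // createUser('Ann', null, null, true);
    // After: a single options object is self-documenting and order-independent
    function createUser(options) {
        var name = options.name;
        var email = options.email || null;
        var isAdmin = options.isAdmin || false;
        return { name: name, email: email, isAdmin: isAdmin };
    }
    createUser({ name: 'Ann', isAdmin: true });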

If the app uses jQuery

  • Are all jQuery object variables prefixed with $ for quick identification?
  • Are jQuery objects cached, if used more than once?
  • Are jQuery calls chained?
  • Are selectors cached in a string variable if used more than once? When no element is found for a selector and your code operates on this “missing” selector, jQuery fails silently, which is both GOOD and BAD at the same time. It is good because no error is thrown at run time; it is bad because the app may fail in other ways. A common reason is that the selector was changed and it was not updated globally. Cache the selector to minimize that risk.
  • Is the code using jQuery event delegation to minimize number of listeners?
  • How are you listening to events on elements that can be re-rendered? Via jQuery event delegation? That way, you don’t have to set up listeners again every time the element is re-rendered. It is easy to miss the re-add, especially when there are multiple render paths (see the sketch after this list).
  • Are you using deprecated methods? For example, use .on() instead of .bind()/.live().
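A minimal delegation sketch (element names are hypothetical; assumes jQuery 1.7+, where .on() supports delegated handlers):

    var $list = $('#item-list');   // $-prefixed and cached, per the checklist above
    // One listener on the stable ancestor handles clicks for current AND future
    // .list-item elements, so re-rendered items need no re-binding.
    $list.on('click', '.list-item', function () {
        $(this).toggleClass('selected');
    });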

Browser support

One difficulty in front-end development is getting a web app to work well across devices, operating systems, and browsers. With the large number of device, operating system, and browser combinations in the wild, including older browsers and native webviews, working well everywhere is an impossible task for developers. The reasonable approach is to get your app’s traffic breakdown and test the most popular combinations during development.

To review layout and browser quirks, an in-person code review is the most effective as the reviewer can play with the feature/fix while reviewing the code. (A very simple exercise, which is also very effective in finding layout bugs, is resizing the browser window.) If an in-person review is not feasible, attach screenshots of the app to the PR, showing the layouts in different forms:

  • Desktop browser
  • Tablet browser (both portrait and landscape mode)
  • Phone browser (both portrait and landscape mode)

And finally

Code review is for more than the pull request at hand. Use it as a chance to refine the team’s standards and best practices and to discuss implementation strategies. While code review is an essential part of any development, it is especially important for front end as the front-end world changes very quickly. Team discussion via review comments is an effective way for all team members to get on the same page, learn from each other, and improve as a team.

How much time do you spend on code reviews? Do you scope for it in your sprint planning? Do you already look at some of the things mentioned above? What other specific things do you focus on?

Practical NoSQL resilience design pattern for the enterprise

 

Building applications resilient to infrastructure failure is essential for systems that run in a distributed environment, such as NoSQL databases. For example, failure can come from nodes, local or remote clusters, network switches, or an entire data center. On occasion, NoSQL nodes or clusters may be marked down by Operations to perform administrative tasks, such as a software upgrade or adding extra capacity. In this blog, you can find information on how to build resilient NoSQL applications using design patterns suited to your needs.

Note that the resilience design pattern described in this blog is complementary to other patterns, such as the Hystrix library. The difference is that it focuses on the following abstractions:

  • NoSQL database architecture and capability abstraction, such as replication strategy, durability, and high availability
  • Infrastructure and environment abstraction, such as standard minimal deployment, disaster recovery, and continuous operation
  • Application abstraction, such as workload, performance, and access patterns

Why resiliency design patterns for NoSQL databases?

Enabled by more efficient distributed consensus protocols, such as Paxos and Raft, and new thinking, such as flexible database schema, NoSQL databases have grown out of early adoption and are gaining popularity among mainstream companies. Although many NoSQL vendors tout their products as having unparalleled web-level scalability and capabilities such as auto-failover, not all are created equal in terms of availability, consistency, durability and recoverability. This is especially true if you are running mission-critical applications using heterogeneous NoSQL systems that include key-value (KV) pairs, column families (CF), and document and graph databases across a geographically dispersed environment.

It becomes imperative that a well-run enterprise optimizes its management to achieve the highest operational efficiency and availability. To accomplish this, an enterprise should not only empower application developers with well-defined NoSQL database resilience design patterns so that they can be relieved from preventable infrastructure failure, but it should also provide a guaranteed-availability Service Level Agreement (SLA) on these resilience design patterns.

NoSQL resiliency design pattern consideration

Given the enormous complexity associated with integrating an enterprise-class infrastructure such as eBay’s, developing a meaningful and usable NoSQL resilience pattern involves more than one dimension; it comprises a multitude of coherent and intricately interconnected cogs such as these:

  • Use case qualification
  • Application persistence error handling
  • Technology stack and engineering framework
  • Data center infrastructure (for example, public/private/hybrid cloud) configuration and setup
  • NoSQL database-agnostic and -specific resilience abstraction
  • Operation and management best practice and standard operation procedure (SOP)

Since NoSQL databases are not panaceas, ill-suited applications can disrupt operations and cause friction between stakeholders within the organization. The first step toward defining a meaningful NoSQL database resilience pattern is to ensure that only qualified use cases will use it. With that, we recommend the following guidelines when qualifying new NoSQL use cases.

NoSQL qualifying guidelines

NoSQL resiliency design pattern approach

One main objective of a NoSQL resilience design pattern is that it should support a wide range of use cases across different NoSQL databases. To achieve this objective, we devised the following steps in our approach:

  1. Identify a meaningful NoSQL database architectural abstraction based on the CAP theorem, ACID/BASE properties, and performance characteristics.
  2. Categorize and define types of different resilience patterns based on workload, performance, and consistency properties that are meaningful to applications.
  3. Define a standardized minimal NoSQL deployment pattern for common small-to-medium, non-mission critical use cases.
  4. Define an enhanced design pattern to support mission-critical use cases that require high availability, consistency, durability, scalability, and performance.
  5. Define other design patterns to support non-conforming use cases, for example, standalone without disaster recovery (DR), and application sharding.

Given that JSON has become the de facto standard data format for web-centric e-commerce businesses like eBay, this blog will use two of the top document-centric NoSQL databases (MongoDB and Couchbase) to illustrate our proposed resilience design pattern.

DB comparison

Although a tutorial on MongoDB and Couchbase is beyond the scope of this blog, the high-level comparison between them in Table 1 helps illustrate their differences. It also helps explain how these differences influence respective resilience design patterns for each product.

Table 1. Comparison of MongoDB and Couchbase
  • Name: MongoDB uses replica sets/sharded clusters; Couchbase uses “clusters.”*
  • Architecture: MongoDB is master-slave; Couchbase is peer-to-peer.
  • Replication: MongoDB replicates from master to slave (or primary to secondary) nodes. Couchbase replicates from primary to replica nodes within a local cluster, and bi- or uni-directionally across clusters via Cross Data Center Replication (XDCR).
  • Sharding: optional in MongoDB; built-in in Couchbase.
  • Quorum reads and writes: supported by MongoDB; not supported by Couchbase.

* Throughout this blog post, for convenience we designate one Couchbase cluster per data center even though multiple clusters can be set up per data center. However, nodes in the same Couchbase cluster do not span multiple data centers.

From the comparison above, we observe the following characteristics when designing resilience patterns:

  • NoSQL databases tend to simplify their resilience pattern if they are based on peer-to-peer architecture.
  • NoSQL databases tend to complicate their resilience pattern if they lack quorum read/write, an important capability to support global read and write consistency.

Lastly, we found it helpful to organize resilience design patterns into different categories, as shown in Table 2. For brevity, we will focus the examples on the following three categories: workload, durability, and application sharding.

Table 2. Categories of Resilience Design Patterns
  • Workload: general-purpose mixed read and write
  • Performance: high-performance read and/or write
  • Durability: 100% durability; high local and/or cross-data center durability
  • High availability (HA): high-availability local read and write; high-availability multi-data center read and write
  • High read and write consistency: high local data center read and write consistency; high multi-data center read and write consistency
  • Others: administration, backup and restore, application sharding, and so on

Standard minimal deployment

As mentioned earlier, one key objective of NoSQL resilience design patterns is to relieve applications and Operations of preventable infrastructure failure. In our opinion, regardless of the size or importance of the application, enterprise-class NoSQL clusters should handle node, cluster, site, or network partition failure gracefully with built-in preventive measures. They should also be able to perform disaster recovery (DR) within reasonable bounds. This warrants the “standard minimal deployments” described below.

MongoDB standard minimal deployment

According to our experience managing some of the largest 24×7 heterogeneous NoSQL clusters in the world, we recommend the following “standard minimal deployment” best practice for enterprise-class MongoDB deployments:

  • Whenever possible, nodes in the same MongoDB replica set should be provisioned in different fault domains to minimize the availability risk associated with node and network failure.
  • A standard MongoDB deployment (for example, type 1 in Table 3) always includes a third data center with two or more secondary nodes. For applications that do not have traffic in a third data center, lightweight arbiter nodes should be used for cross-data center quorum voting during site or total data center failure.
  • With few exceptions (for example, type 2 below), a MongoDB replica set can be deployed in only one data center if applications do not require remote DR capability.
Table 3. MongoDB Standard Minimal Deployments

  • Type 1: standard deployment with built-in DR capability
    • Data Center 1: minimum three nodes, which include one primary and two secondary nodes
    • Data Center 2: minimum two secondary nodes
    • Data Center 3: minimum two secondary or arbiter nodes. This ensures that if any one data center suffers a total failure, the remaining data centers are not excluded from quorum voting for a new primary node.
    • Description: in selected high-traffic or important use cases, additional secondary nodes may be added in each data center. The purpose is to increase application read resilience so that surviving nodes won’t be overwhelmed if one or more secondary nodes fail in any data center.
  • Type 2: special qualified use case without DR
    • Data Center 1: three nodes (one primary, two secondary nodes)
    • Description: this deployment pattern is for use cases that do not require remote DR, since applications can rebuild the entire dataset from an external data source. Furthermore, the lifespan of the data can be relatively short, expiring in a matter of minutes, if not seconds. In selected high-traffic or important use cases, additional secondary nodes may be added to the replica set to help increase resilience.

The following diagram illustrates the MongoDB “standard minimal deployment” pattern.

Mongo deploy

During primary node failover, one of the available secondary nodes in either data center will be elected as the new primary node, as shown in the following diagram.

Mongo before failover

During site or data center failure, one of the available secondary nodes in other data centers will be elected as the new primary node, as shown in the following diagram.

Mongo after failover

Couchbase standard minimal deployment

With its peer-to-peer architecture, Couchbase’s “standard minimal deployment” does not require a third data center, since failing over to a new primary node does not involve quorum voting. Arguably, increasing the minimum from four nodes to five, six, or more per data center can help strengthen operational viability should two or more nodes in the same data center fail at the same time (itself an unlikely scenario if they were provisioned in different fault domains in the first place). With that, we feel that a minimum of four nodes per data center is sufficient for most Couchbase use cases to start with.

Table 4. Couchbase Standard Minimal Deployments

  • Type 1: standard deployment with built-in DR capability
    • Data Center 1: 4+ nodes
    • Data Center 2: 4+ nodes
    • Data Center 3 (optional): 4+ nodes
    • Description: see the discussion above
  • Type 2: special qualified use case without DR
    • Data Center 1: 4+ nodes
    • Description: same as the MongoDB type 2 deployment

The following diagram illustrates the Couchbase “standard minimal deployment” pattern where each data center/cluster has two copies of the same document (for example, P1 being the primary copy and R1 the eventually consistent replica copy).

Couchbase deploy

During automatic node failover, the Couchbase client SDK in applications will detect node failure and receive an updated failover cluster map with a topology containing the new location of replica documents that have been promoted to primary. This is shown in the following diagram.

Couchbase failover

Although Couchbase supports bidirectional Cross Data Center Replication (XDCR) between geographically dispersed clusters, its current implementation does not offer automatic cluster failover. To address this limitation, Couchbase will support a new feature called Multi-Cluster Awareness (MCA) in a future release (tentatively v4.x) to provide this capability, as shown in the following diagram.

Couchbase MCA

High-level MongoDB and Couchbase resilience capability comparison

Before we talk about the details of an individual resilience design pattern for each product, it helps to understand the capability of MongoDB and Couchbase in terms of high availability, consistency, durability, and DR across local and multiple data centers. The following table highlights their differences.

Table 5. Comparison of MongoDB and Couchbase Resilience Capabilities
  • MongoDB, local data center: write HA no, read HA yes; high consistency yes; high durability yes; DR no
  • MongoDB, multiple data centers: write HA no, read HA yes; high consistency yes; high durability yes; DR yes
  • Couchbase, local data center: write HA no, read HA yes; high consistency yes; high durability yes; DR no
  • Couchbase, multiple data centers: HA yes with MCA*; high consistency no; high durability yes with MCA and TbCR*; DR yes

* MCA (Multi-Cluster Awareness) and TbCR (Timestamp-based Conflict Resolution) will be available in a future Couchbase v4.x release.

From the above comparison, we observe one similarity between MongoDB and Couchbase: their support for high-availability writes, or rather the lack of it. This is because both products limit writes to a single primary node. Nonetheless, Couchbase does plan to support high-availability writes across multiple data centers through its future Multi-Cluster Awareness feature, which will help alleviate this limitation. This is also why Couchbase’s standard minimal deployment pattern requires a minimum of two data centers/clusters.

MongoDB resilience design pattern examples

Before we show MongoDB’s resilience design pattern examples, it is important to understand how data loss can happen in MongoDB if applications do not use a proper resilience design pattern.

Non-synchronous writes

Let’s assume that you have a MongoDB replica set comprised of three nodes. The following operations illustrate how data loss can happen:

  • First, the application writes five documents and receives write confirmation from the primary node only.
  • The first three documents are replicated successfully at the second secondary node, and two documents are replicated at the third secondary node.

    Mongo Non-Synchronous Writes A

  • The primary node fails before all five documents reach both of the two secondary nodes

    Mongo Non-Synchronous Writes B

  • Quorum voting elects the second node as the new primary node because it holds the more recent write (the third document) compared with the third node

    Mongo Non-Synchronous Writes C

  • The original primary node steps down to become a secondary and rolls back its fourth and fifth documents, since they didn’t reach the other two secondary nodes

    Mongo Non-Synchronous Writes D

  • In effect, the application loses the fourth and fifth documents even though it received confirmation from the original primary node. This type of data loss (a non-synchronous write) occurs because the application didn’t apply the “quorum write” option across a majority of nodes in the replica set.
  • In addition to “quorum write”, one should also enable MongoDB’s write-ahead logging journal file to further increase primary node write durability.

Since non-synchronous write is just one type of issue that can happen to applications, it is not sufficient to develop a pattern just to solve it. Having various add-on patterns, like synchronous write and others, on top of the MongoDB standard minimal deployment patterns helps enrich the overall NoSQL database resilience capabilities.

MongoDB write durability pattern

To achieve MongoDB replica-set-wide write durability, an application should use MongoDB’s WriteConcern = Majority option, where a write waits for confirmation from a majority of nodes across multiple data centers. During primary node failover, one of the secondary nodes holding the latest committed write will be elected as the new primary node.

One caveat of this pattern is that one should weigh the pros and cons associated with waiting for confirmation from majority nodes across data centers, since it may not be suitable for applications that require low latency.
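A minimal sketch, assuming the MongoDB Node.js driver 2.x (the collection and document fields are hypothetical):

    // w: 'majority' waits until a majority of replica set members acknowledge the write;
    // j: true additionally waits for the write to reach the on-disk journal;
    // wtimeout bounds the wait so the app can react to slow cross-data center confirmation.
    var MongoClient = require('mongodb').MongoClient;
    MongoClient.connect('mongodb://host1,host2,host3/appdb?replicaSet=rs0', function (err, db) {
        db.collection('carts').insertOne(
            { user: '9999999999', items: [] },
            { w: 'majority', j: true, wtimeout: 5000 },
            function (err, result) {
                // err is set if majority confirmation did not arrive in time
            });
    });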

MongoDB Write Durability Pattern

MongoDB read intensive pattern

This pattern is for applications that require high-availability reads. With MongoDB’s master-slave replication architecture and the ability to scale up to 49 secondary nodes across multiple data centers, it inherently supports highly available reads, provided that the healthy secondary nodes in each data center can handle the read traffic if one or more of them fail. Note that even though a MongoDB replica set can scale up to 50 nodes, only 7 of them can vote during primary node failover.
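A hypothetical mongo shell sketch of steering reads to secondaries:

    // secondaryPreferred reads from secondaries when available, falling back to the
    // primary, which spreads read traffic across the replica set's data centers.
    db.carts.find({ user: "9999999999" }).readPref("secondaryPreferred")

The following diagram illustrates this capability.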

MongoDB Read Intensive Pattern

MongoDB extreme high write pattern

Use cases requiring extremely high write rates can use MongoDB’s optional sharding, since it allows horizontal write scaling across multiple replica sets (shards). Each shard (or replica set) stores the part of the entire dataset defined by a predefined shard key on the collection. With a well-designed shard key, MongoDB automatically distributes and balances read/write queries to the designated shards. Although MongoDB sharding provides horizontal scale-out write capability, this pattern incurs complexity and overhead and should be used with caution:

  • An application should shard with an appropriate shard key from the start instead of migrating from a single replica set as an afterthought, because migrating a single replica set to a sharded cluster is a major undertaking from both the development and operations perspectives (see the sketch after this list).
  • We recommend starting with a predefined number of shards capable of handling capacity and traffic in the foreseeable future. This helps eliminate the overhead associated with rebalancing when adding new shards.
  • Note that MongoDB automatic shard balancing may introduce spikes during chunk migration and can potentially impact performance.
  • Developers need to understand the behavior and limitations of mongos, the software router process, for queries that do not include the shard key, such as scatter-gather operations.
  • Developers should weigh the pros and cons of the overhead associated with running mongos as part of application servers vs. in separate nodes.
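A minimal mongo shell sketch of sharding with a predefined key (the database and key names are hypothetical):

    // Enable sharding for the database, then shard the collection on a hashed key.
    // A hashed shard key spreads writes evenly across shards, avoiding hot spots.
    sh.enableSharding("cartdb")
    sh.shardCollection("cartdb.carts", { userId: "hashed" })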

The following diagram illustrates this capability.

MongoDB Extreme High Write Pattern

Couchbase resilience design pattern examples

Although Couchbase supports neither write-ahead logging nor quorum writes, it achieves high durability through the following mechanisms:

  • Local cluster durability — ReplicateTo and PersistTo functions
  • Multi-cluster durability (to be released in a future v4.x release):
    • Multi-Cluster Awareness (MCA) — Without application logic, this feature allows application developers and DBAs to define rules on how applications should behave in the event of a complete data center/cluster failover.
    • Timestamp-based Conflict Resolution (TbCR) — Based on a server-synced clock, this feature provides Last-Write-Win (LWW) capability on bidirectional replication update conflict resolution to ensure correct cross-data center/cluster write durability.

Couchbase local cluster write durability pattern

For local cluster durability, Couchbase provides the following two functions through its client SDK:

  • ReplicateTo — This function allows writes on the same document to be successfully replicated in memory for all copies in the local cluster. However, it does not guarantee writes to be persisted on disk, which may result in data loss if both primary and replica nodes fail before that happens.
  • PersistTo — For increased durability, applications can use this function so that writes will not only be successfully replicated in memory but also persisted on disk in the local cluster.
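A minimal usage sketch, assuming the Couchbase Node.js SDK 2.x, where these durability requirements are expressed as the persist_to and replicate_to upsert options (bucket and key names are hypothetical):

    var couchbase = require('couchbase');
    var cluster = new couchbase.Cluster('couchbase://db-host');
    var bucket = cluster.openBucket('carts');
    // replicate_to: 1 waits until the write is replicated in memory to one replica node;
    // persist_to: 1 waits until the write is persisted to disk on the primary node.
    bucket.upsert('cart::9999999999', { items: [] },
        { persist_to: 1, replicate_to: 1 },
        function (err) {
            // err is set if the durability requirement could not be met in time
        });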

Note that even with the PersistTo function, the current Couchbase version still does not guarantee that writes are successfully replicated to remote data centers/clusters. (That capability is explained in the next section, “Couchbase multi-cluster write durability pattern.”) The following operations illustrate how both functions work.

  1. Assume that the Couchbase topology contains four nodes per data center/cluster with each storing two copies of the same document replicated through XDCR between clusters.

    Couchbase Local Cluster Write Durability Pattern A

  2. The application in Data Center 1 writes document P1 to node N1.

    Couchbase Local Cluster Write Durability Pattern B

  3. Before P1 is replicated to the replica node in the local cluster or to the remote data center/cluster, node N1 fails, and as a result the application suffers data loss.

    Couchbase Local Cluster Write Durability Pattern C

  4. Before P1 reaches the remote data center/cluster, even though P1 has been replicated successfully in memory to the local cluster replica node N4, if both N1 and N4 nodes fail, the application still suffers data loss.

    Couchbase Local Cluster Write Durability Pattern D

  5. Using the ReplicateTo function can circumvent the failure described in step 3, and using the PersistTo function can circumvent the failure described in step 4, as shown in the following figure.

    Couchbase Local Cluster Write Durability Pattern E

  6. Lastly, for multi-data center/cluster durability, use the design pattern described in “Couchbase multi-cluster write durability pattern.”

Couchbase multi-cluster write durability pattern

With its Multi-Cluster Awareness and Timestamp-based Conflict Resolution features, Couchbase supports multi-cluster durability as shown below.

Couchbase Multi Cluster Write Durability Pattern A

In the absence of write-ahead logging or quorum writes, and even though Couchbase provides sufficient support for local and multi-cluster durability, one should still ask this question: what is the likelihood that all primary and replica nodes fail in multiple data centers, or, even worse, that all data centers fail completely at the same time? These two unlikely failure scenarios are shown in the following two diagrams. We feel that the odds are next to zero if one follows this proposed Couchbase durability design pattern.

Couchbase Multi Cluster Write Durability Pattern B
Couchbase Multi Cluster Write Durability Pattern C

Couchbase read/write intensive scalability pattern

With its peer-to-peer architecture and XDCR’s bidirectional multi-cluster/data center replication capability, Couchbase affords users the flexibility to provision clusters with different sizes and shapes tailored for specific traffic/capacity and usage patterns. This flexibility is shown in the following diagram. On the other hand, the cluster-sizing calculation exercise can become complicated. This is especially true if it involves Multi-Dimensional Scaling.

Couchbase Read:Write Intensive Scalability Pattern

Other NoSQL design pattern examples

NoSQL DB-agnostic application sharding pattern

The motivation behind this design pattern is that almost all large-scale, mission-critical use cases require high availability. In our experience, no matter which NoSQL database you use or how comprehensive the best practices and SOPs you follow are, sometimes simple, mundane maintenance tasks can jeopardize overall database availability, and the larger the database, the more severe the damage it may suffer. This design pattern offers an alternative solution using a divide-and-conquer approach, that is, reducing each NoSQL database cluster to a small and manageable size through the following mechanisms:

  • Applications shard their data using modulo, hash, round-robin, or any other suitable algorithm.
  • The DBA and Operations provide a standard-size NoSQL cluster to host each application-level shard.
  • If needed, each application-level shard can further use built-in NoSQL vendor product sharding if it is available.

Using MongoDB as an example, the following diagram illustrates this design pattern. One caveat associated with this pattern is that it requires a middle-tier data access layer to help direct traffic, an effort that must not be underestimated.

NoSQL DB Agnostic Application Sharding Pattern

Future work and direction

In this blog, we described the motivation, consideration, and approach behind our proposed NoSQL resilience design pattern. We overviewed key differences between MongoDB and Couchbase in the context of resilience patterns. We also walked through three NoSQL resilience design pattern examples and a DB-agnostic application sharding pattern. In conclusion, we would like to suggest the following future work and direction on this topic:

  • Provide end-to-end integration of proven NoSQL design patterns with application frameworks and also cloud provisioning and management infrastructure.
  • Formalize the above NoSQL design patterns as officially supported products rather than just engineering patterns.
  • Add other types of resilience patterns, such as high consistency.
  • Add support for other NoSQL databases, for example, Cassandra.
  • Collaborate with NoSQL vendors and develop new resilience patterns for new features and capabilities.

The graphics in Non-synchronous writes are reproduced with the kind permission of Todd Dampier from his presentation “Rock-Solid Mongo Ops”.

How eBay’s Shopping Cart used compression techniques to solve network I/O bottlenecks

 

Introduction

eBay’s data storage related to Shopping Cart information relies on two different data stores. There’s a MongoDB store that stores the entire cart contents for a user’s cart, and there’s an Oracle store that stores only the cart summary and basic cart details but has enough hints to recompute the full cart if needed. The Oracle store is used to overcome the persistence and consistency challenges (if any) associated with storing data in MongoDB. It’s easier to think of the MongoDB layer as a “cache” and the Oracle store as the persistent copy. If there’s a cache miss (that is, missing data in MongoDB), the services fall back to recover the data from Oracle and make further downstream calls to recompute the cart.

All MongoDB carts are in JSON format, while the Oracle carts are stored in JSON format in a BLOB column. (These Oracle carts are user-sharded and are used only for OLTP reasons. All other filtering and querying to peek into cart data happens via a data warehouse and denormalized access patterns.)

Figure 1. Storage architecture

While this is not a discussion on the choice of technology (Oracle vs. MongoDB vs. any other database), we are hoping that our experience gives folks an insight into what it takes to identify, debate, solve, and (most importantly) roll out fixes rapidly to a site that handles hundreds of millions of calls across tens of millions of users each day.

Problem statement

Late in 2016, the Shopping Cart service started experiencing a high number of misses at its caching layer, and at around the same time, we started getting alerts from eBay’s Ops team about our MongoDB replicas failing to catch up in a decent amount of time. (MongoDB runs in master-slave configuration and this description is worth a read if you are unfamiliar with the setup.) The setup, which was seemingly working “fine” for the past few years, had suddenly started to hit bottlenecks that were unrelated to any infrastructure or code change. Upon further investigation, this was narrowed down to MongoDB’s oplog hitting network I/O limits. You can think of the oplog as Oracle’s redo logs, except that this has a very critical purpose in making sure that the MongoDB read replicas are refreshed in as real time as possible. Due to this, the read replicas for a GET Cart operation, for example, were returning stale carts and were being rejected as a “cache” miss because of the incorrect (stale) version that was stamped on the cart. This meant that we were doing a lot more recomputes of users’ carts by falling back on the Oracle cart and using that to call downstream systems to formulate the full cart from the basic summary.

Obviously, this was a bad thing since we were making our users wait more than normal. (Cache misses are bad! Recomputes are bad!)

Solutions

Before we go into specific solutions, we want to call out the numerous other conversations and options that were worth a mention but not selected into the final bucket. For example, once the replicas were not catching up, we ran trials on “hiding” some of the replicas to try and allow them to catch up faster without taking user traffic. We also played around with timeouts and periodic replica reboots, but none of them seemed to cause the issues to die down.

Option 1: sharding (MongoDB)

The first obvious option was to split our JSON data into shardable sections. (Note that the entire cart was stored as a single JSON object.) This allows specific sections (only) to be written to disk and into a separate MongoDB cluster, which in turn reduces the number of writes and I/Os sent across the wire into the single master instance. The challenge with this approach was the re-write of the existing business logic to now understand a seemingly new schema.

Option 2: selective writes

This option would use MongoDB’s $set operator to update only the specific values that change on each update. While this works in theory, the kind of updates happening on the cart JSON object would involve domino-effect updates, which would pretty much trigger a change across numerous sections of the JSON. Leaving this aside, there was no assurance from the MongoDB folks that this would reduce the amount of oplog being written. Their concern was that, given the large number of selective updates within a document, it might trigger the entire document update — thus not helping.

Option 3: client-side compressed writes and reads

Given the desire to get our situation fixed quickly, without rewriting the business logic, compression seemed like another logical option to consider. Reducing the amount of data coming into the master should correspondingly reduce the amount of data flowing into the oplog. However, this would convert the data into a binary format and would need clients to understand compressed reads and writes.

Our choice

Considering various options and the timings for them, we chose Option 3 as the fastest, cleanest, and the most likely to succeed at first try. The biggest assumption was that the CPU hit taken by the clients on compression and decompression would be well worth the savings in terms of the latency on the MongoDB side. One advantage we had is that we had already embraced SOA, and our service encapsulated all the calls in and out of the MongoDB cluster, so it was only our service (“client” to MongoDB) that needed to change.

Goals and considerations

Before we got started to do something as fundamental as this, we had to get our goals right first. Here were our simple premises.

  • Allow carts to be compressed and persisted into MongoDB (no change of the data store at this time).
  • Allow a choice of compression codecs and seamless shifting between reading with one codec and persisting/writing with another, in case there’s ever a need to change because of performance reasons.
  • Allow reading of old, new, and intermediate carts, that is, forward and backward compatibility.
    • Old carts can be read by new clients and should work.
    • New carts can be read by old clients and should work.
    • Old carts can be read and persisted as new carts.
    • New carts can be read and persisted as old carts.
  • Implementation should allow for uncompressed and compressed carts to be stored at the same time (not necessarily for the same exact cart).
  • Ultimately, getting rid of raw uncompressed storage in favor of compressed carts.
  • Make sure that there are no use cases needing real-time query of JSON data in the MongoDB store. (The new data will be compressed binary and won’t be query-able).

After a quick round of deliberation, we moved to the next phase of evaluation.

Codecs, codecs everywhere

For our choice of codecs, we had four well-known Java implementations we could choose from:

  • SNAPPY: Google-contributed compressor/de-compressor and heavily used in many Google products, externally and internally. The source code for the SNAPPY codec is available on GitHub.
  • LZ4_HIGH: Open-source software, BSD 2-clause license, running in high-compression mode and requiring the decompressed size to be known up front for decompression to work. The source code for the LZ4 codec is available on GitHub.
  • LZ4_FAST: Same as above, just running in fast-compression mode.
  • GZIP: The Java-provided implementation.

We benchmarked these libraries with sample cart data from production (all sizes in bytes) to make our initial choice of algorithm. To make sure that the results were not impacted by CPU or I/O spikes on the local machine, the tests were run multiple times to confirm that the results (a sample is shown below) were consistent.
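The harness had roughly the following shape (a sketch only; Node’s built-in zlib gzip stands in here for the four Java codecs that were actually measured):

    var zlib = require('zlib');
    function benchmark(name, sample) {          // sample is a Buffer of cart JSON
        var t0 = process.hrtime();
        var compressed = zlib.gzipSync(sample);
        var t1 = process.hrtime(t0);            // [seconds, nanoseconds] since t0
        var d0 = process.hrtime();
        zlib.gunzipSync(compressed);
        var d1 = process.hrtime(d0);
        console.log(name, sample.length, compressed.length,
            (sample.length / compressed.length).toFixed(3),     // compression ratio
            (t1[0] * 1e3 + t1[1] / 1e6).toFixed(3) + 'ms',      // compression time
            (d1[0] * 1e3 + d1[1] / 1e6).toFixed(3) + 'ms');     // decompression time
    }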

Running test for: SNAPPY
Orig_size    Compressed   Ratio    Comp_time   Decom_time    Total_time
687          598          1.149    188.062ms   1.683ms       189.745ms
786          672          1.170    0.079ms     0.193ms       0.272ms
848          695          1.220    0.081ms     0.227ms       0.308ms
15022        3903         3.849    0.296ms     3.181ms       3.477ms
17845        4112         4.340    0.358ms     2.268ms       2.627ms
45419        9441         4.811    0.775ms     5.999ms       6.774ms
55415        10340        5.359    0.860ms     3.851ms       4.711ms
125835       21338        5.897    1.012ms     8.834ms       9.845ms
1429259      179838       7.947    6.248ms     14.742ms      20.990ms
4498990      625338       7.194    19.339ms    64.469ms      83.808ms
Running test for: LZ4_FAST
Orig_size    Compressed   Ratio    Comp_time   Decom_time    Total_time
687          591          1.162    0.091ms     16.615ms      16.705ms
786          658          1.195    0.020ms     26.348ms      26.368ms
848          685          1.238    0.026ms     15.140ms      15.166ms
15022        3539         4.245    0.065ms     17.934ms      17.999ms
17845        3712         4.807    0.096ms     16.895ms      16.991ms
45419        6224         7.297    0.197ms     18.445ms      18.642ms
55415        6172         8.978    0.164ms     17.282ms      17.445ms
125835       11830        10.637   0.538ms     16.239ms      16.777ms
1429259      49364        28.953   4.201ms     21.064ms      25.265ms
4498990      167400       26.876   14.094ms    26.003ms      40.097ms
Running test for: LZ4_HIGH
Orig_size    Compressed   Ratio    Comp_time   Decom_time    Total_time
687          589          1.166    0.149ms     0.032ms       0.181ms
786          648          1.213    0.064ms     0.012ms       0.076ms
848          676          1.254    0.066ms     0.013ms       0.079ms
15022        3270         4.594    0.232ms     0.041ms       0.273ms
17845        3349         5.328    0.244ms     0.045ms       0.290ms
45419        5192         8.748    0.587ms     0.154ms       0.741ms
55415        5130         10.802   0.666ms     0.102ms       0.769ms
125835       8413         14.957   1.776ms     0.403ms       2.179ms
1429259      17955        79.602   12.251ms    3.162ms       15.413ms
4498990      60096        74.863   35.819ms    8.585ms       44.404ms
Running test for: GZIP
Orig_size    Compressed   Ratio    Comp_time   Decom_time    Total_time
687          447          1.537    0.939ms     0.636ms       1.575ms
786          489          1.607    0.138ms     0.103ms       0.240ms
848          514          1.650    0.099ms     0.109ms       0.207ms
15022        2579         5.825    0.502ms     0.400ms       0.902ms
17845        2659         6.711    0.596ms     0.508ms       1.104ms
45419        4265         10.649   1.209ms     0.755ms       1.964ms
55415        4324         12.816   1.301ms     0.775ms       2.076ms
125835       7529         16.713   3.108ms     1.651ms       4.760ms
1429259      23469        60.900   26.322ms    37.250ms      63.572ms
4498990      69053        65.153   153.870ms   103.974ms     257.844ms

Here are some observations:

  • SNAPPY seems to suffer from a slow start. The time taken for the first run is always notoriously high. It also has not-so-great compression ratios. We were surprised that SNAPPY performed so badly given how much talk it gets.
  • LZ4_FAST decompression times are almost constant, independent of size. It also shows not-so-great compression ratios though.
  • LZ4_HIGH provides great times for both compression and decompression and gets a bit slower at high data sizes.
  • GZIP seems to get worse at high data sizes.

Given these results, LZ4_HIGH seemed to be the optimal codec for the following reasons.

  • No slow start issues or performance quirks observed
  • Linear time growth with data sizes
  • Excellent overall performance for small to medium cart sizes (well into the 99th percentile cart size) and reasonable performance at very large cart sizes

However, there’s one caveat. Decompression with LZ4_HIGH expects the output buffer size to be specified in the API call; this memory pre-allocation is likely what enables the faster decompression. It is a price to pay for the benefits, but one we felt was worth accounting for in the final design. So while the decision was clear, the implementation was designed to keep all four codecs available as choices, making it possible to shift seamlessly between codecs (one of our goals, as mentioned previously) should a future need arise.

Payload design

A sample payload of today’s shopping cart looks like this. The cart variable is the only interesting piece in the payload.

{
    "_id" : ObjectId("560ae017a054fc715524e27a"),
    "user" : "9999999999",
    "site" : 0,
    "computeMethod" : "CCS_V4.0.0",
    "cart" : "...JSON cart object...",
    "lastUpdatedDate" : ISODate("2016-09-03T00:47:44.406Z")
}

Our new payload with support for compressed data looks like this. The cart variable remains unchanged, but the new elements, initially derived from the cart variable, provide support for the Compression feature.

{
    "_id" : ObjectId("560ae017a054fc715524e27a"),
    "user" : "9999999999",
    "site" : 0,
    "computeMethod" : "CCS_V4.0.0",
    "cart" : "...JSON cart object...",
    "compressedData" : {
        "compressedCart" : "...Compressed cart object...",
        "compressionMetadata" : {
            "codec" : "LZ4_HIGH",
            "compressedSize" : 3095,
            "uncompressedSize" : 6485
        }
    },
    "lastUpdatedDate" : ISODate("2016-09-03T00:47:44.406Z")
}

Each of the new fields is described below:

  • "compressedData" — The container that stores the compressed cart and metadata about the cart itself that will be used for compression/decompression.
  • "compressedCart" : "...Compressed cart object..." — The compressed data for the cart field in the upper-level container.
  • "compressionMetadata" — The subcontainer that holds the metadata required for decompression and hints for certain codecs (for example, LZ4_HIGH that needs the destination size (uncompressedSize) to work.
  • "codec" : "LZ4_HIGH" — Stored when compression runs and used when decompression runs.
  • "compressedSize" : 3095 — Stored when compression runs and used when decompression runs (used by LZ4_HIGH only, but we do not differentiate)
  • "uncompressedSize" : 6485 — Stored when compression runs and used when compression runs (used by LZ4_HIGH only, but we do not differentiate)

Note that the sizes of the compressed and decompressed data are also stored every time although they are only really used when the codec is LZ4_HIGH. While all of this helped with the need to seamlessly switch between codecs, it also acted as a great source of statistical information for compression metrics for us.
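A hedged sketch of how a reader can honor this metadata (decompressForCodec is a hypothetical helper wrapping the four codec implementations; the sketch assumes the cart is stored as a JSON string):

    function readCart(doc) {
        if (doc.compressedData) {
            var meta = doc.compressedData.compressionMetadata;
            // Pass the stored codec and size hint; uncompressedSize is the
            // destination-buffer hint that LZ4_HIGH needs to decompress.
            var buf = decompressForCodec(meta.codec, doc.compressedData.compressedCart,
                                         meta.uncompressedSize);
            if (buf.length !== meta.uncompressedSize) {
                // log a critical error: metadata and decompressed payload disagree
            }
            return JSON.parse(buf.toString('utf8'));
        }
        return JSON.parse(doc.cart);   // backward compatibility: old uncompressed cart
    }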

Config-driven approach (versus experimentation)

Once the code rolled out to production, we had two approaches towards experimentation to make sure our implementation works seamlessly. The first was to go with an A/B test, the typical experimentation paradigm, and verify performance via standard reporting that’s already in place. The other option was via back-end config-driven testing. We chose the latter since we were confident that we could figure out issues with pure back-end metrics and have enough logging in place to identify issues. For example, we could make sure that the compressed and decompressed data matched the size stored in the metadata (else log a critical error). We also had alerting built into place that would give us an immediate read if any codec was mis-configured or failed to compress or decompress at run time.

To add to it all, eBay’s services follow the 12-factor approach for the most part. The one factor we focus on here is the config-driven approach that enables us to play around with different codec settings.

In summary, the whole rollout and validation process looked like this:

  1. Configure builds with compression feature = OFF as default.
  2. Roll out to all boxes, with a relaxed rollout template, making sure that there are no inadvertent impacts caused by the feature even when it is turned off. Bake for a few hours.
  3. Pick one random box and turn on compression config for that box ONLY in backwards-compatible mode. This mostly verifies compression, and very seldom is decompression expected to be triggered unless the same single box reads the previously compressed cart.
  4. Turn on the compression config for a few more boxes until all boxes compress, but no box reads the compressed data yet. This further increases traffic and worsens the current situation. A necessary evil.
  5. Turn on reading decompression config for one box. Make sure that this box is able to read the compressed field and use ONLY the compressed field.
  6. Like before, turn on reading decompression config for multiple boxes and then for all boxes. Verify across all boxes.

Finally, here are the different configs we used:

  • compressionEnabled = true/false — The master switch that controls whether this feature is enabled or not.
  • compressionCodec = {SNAPPY, LZ4_HIGH, LZ4_FAST, GZIP} — One of the four choices that will be enabled, with LZ4_HIGH being the default choice.
  • WRITE_MODE = One of three modes.
    • NO_COMPRESS — Writes will not write the compressed data fields.
    • DUAL — Writes will write the compressed data fields and the uncompressed/regular cart.
    • COMPRESS_ONLY — Writes will write only the compressed data fields and null out the uncompressed/regular cart field.
  • READ_MODE = one of three modes
    • NO_COMPRESS — Reads will read from the regular cart fields.
    • VERIFY_COMPRESS — Reads will read from both the fields, use the regular cart field, but verify that the compression data is being decompressed correctly. Think of this as “audit” mode.
    • COMPRESS_ONLY — Reads will directly read the cart from the compressed data fields, decompress it, and use it.

Note that not all mode combinations are valid. For example, WRITE_MODE = COMPRESS_ONLY and READ_MODE = NO_COMPRESS is invalid, and care should be taken to avoid that combination.
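A small hypothetical guard makes the note above concrete:

    // Writes that drop the plain cart must never pair with reads that expect it.
    function validateModes(writeMode, readMode) {
        if (writeMode === 'COMPRESS_ONLY' && readMode === 'NO_COMPRESS') {
            throw new Error('invalid combo: the plain cart field is nulled out, ' +
                            'but reads still depend on it');
        }
    }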

It might look like overkill, but the key goal, keeping the site up and running at all costs with zero impact, was on our minds all the time. We felt that the critical factor was to have all the different controls at our disposal for any eventuality.

Everything is backwards compatible. Backwards compatibility is everything.

We can’t emphasize this enough. For any site that cares about its users, backward compatibility is a critical requirement to factor into any new feature that is rolled out. Our goal was no different, and the diagram in Figure 2 captures some of the complexity we faced.

A few things to keep in mind:

  • Do not delete fields that are being deprecated. We added the compressedData field as a new one to our existing JSON structure.
  • Make sure to separate out the code paths as close to the source of the changing data as possible. This practice almost always allows a better code-deprecation option in the future.
  • Backward compatibility is critical even while new code is being rolled out. Do not assume a 30-minute “incompatibility window” after which everything will be fine. For example, you never know when things simply stall and you are stuck in limbo for much longer.

For example, Figure 2 shows how our code paths looked while we were in the midst of rollout and both the old and new services were taking traffic. By separating out the logic right at the DAO/DO layer, the rest of the business logic continued as if nothing changed.

Figure 2. Code paths

Results

After a successful rollout, we saw some magical numbers. Our oplog write rates, which had been almost 150GB/hour, were down to about 11GB/hour, a reduction of more than 13x! The average object size of the documents, which had been hovering around 32KB, was down to 5KB, a reduction of more than 6x! In fact, we even saw some improvements in our eventual service response times as well. So overall we achieved what we wanted with ZERO production issues, and all our bets and analysis paid fantastic dividends at the end, benefiting our customers.

Figure 3. Before-and-after average object sizes.

Figure 3 shows a screenshot from MongoDB’s Ops Manager UI tool that highlights the slight increase (when we were doing dual writes of compressed and uncompressed data) and the dramatic drop in the average object size from 32KB down to 5KB after we turned the feature on fully.

Figure 4. Before-and-after oplog throughput

Figure 4 shows another screenshot from MongoDB’s Ops Manager UI tool showing the slight increase and the dramatic drop in the oplog throughput after the feature was turned on fully.

Post-release, we also went ahead and removed the “dead code,” which was basically the old logic that worked only on uncompressed data. Over a period of time, as older carts keep getting refreshed, there will be no hint that we ever had uncompressed data at some point in time! So how did the machines running the (de-)compression do? Not too bad at all. There was hardly any noticeable change in CPU and JVM heap usage before and after the code rollout.

Finally, here are some other scatter graphs for a random hour’s worth of data showing how the compression did in the Production environment. These graphs provide many other interesting observations, but those are left as an exercise for the reader.

Figure 5. A scatter plot showing compression achieved for the uncompressed data plotted against the original (uncompressed) data

Figure 6. A scatter plot showing the behavior of read times (that is, decompression) versus the size of the compressed data

Figure 7. A scatter plot showing the behavior of write times (that is, compression) versus the size of uncompressed data