Author Archives: Donovan Hsieh

Practical NoSQL resilience design pattern for the enterprise

 

Building applications resilient to infrastructure failure is essential for systems that run in a distributed environment, such as NoSQL databases. For example, failure can come from compute resources such as nodes, local/remote clusters, network switches, or the entire data center. On occasion, NoSQL nodes or clusters may be marked down by Operations to perform administrative tasks, such as a software upgrade and adding extra capacity. In this blog, you can find information on how to build resilient NoSQL applications using appropriate design patterns that are suitable to your needs.

Note that the resilience design pattern described in this blog is complementary to other patterns, such as the Hystrix library. The difference is that it focuses on the following abstractions:

  • NoSQL database architecture and capability abstraction, such as replication strategy, durability, and high availability
  • Infrastructure and environment abstraction, such as standard minimal deployment, disaster recovery, and continuous operation
  • Application abstraction, such as workload, performance, and access patterns

Why resiliency design patterns for NoSQL databases?

Enabled by more efficient distributed consensus protocols, such as Paxos and Raft, and new thinking, such as flexible database schema, NoSQL databases have grown out of early adoption and are gaining popularity among mainstream companies. Although many NoSQL vendors tout their products as having unparalleled web-level scalability and capabilities such as auto-failover, not all are created equal in terms of availability, consistency, durability and recoverability. This is especially true if you are running mission-critical applications using heterogeneous NoSQL systems that include key-value (KV) pairs, column families (CF), and document and graph databases across a geographically dispersed environment.

It becomes imperative that a well-run enterprise optimizes its management to achieve the highest operational efficiency and availability. To accomplish this, an enterprise should not only empower application developers with well-defined NoSQL database resilience design patterns so that they can be relieved from preventable infrastructure failure, but it should also provide a guaranteed-availability Service Level Agreement (SLA) on these resilience design patterns.

NoSQL resiliency design pattern consideration

Given the enormous complexity associated with integrating an enterprise-class infrastructure, such as eBay, developing meaningful and usable NoSQL resilience pattern is more than just one dimension — it comprises a multitude of coherent and intricately interconnected cogs such as these:

  • Use case qualification
  • Application persistence error handling
  • Technology stack and engineering framework
  • Data center infrastructure (for example, public/private/hybrid cloud) configuration and setup
  • NoSQL database-agnostic and -specific resilience abstraction
  • Operation and management best practice and standard operation procedure (SOP)

Since NoSQL databases are not panaceas, ill-suited applications can disrupt operations and cause friction between stakeholders within the organization. The first step toward defining a meaningful NoSQL database resilience pattern is to ensure that only qualified use cases will use it. With that, we recommend the following guidelines when qualifying new NoSQL use cases.

NoSQL qualifying guidelines

NoSQL resiliency design pattern approach

One main objective of a NoSQL resilience design pattern is that it should support a wide range of use cases across different NoSQL databases. To achieve this objective, we devised the following steps in our approach:

  1. Identify a meaningful NoSQL database architectural abstraction based on the CAP theorem, ACID/BASE properties, and performance characteristics.
  2. Categorize and define types of different resilience patterns based on workload, performance, and consistency properties that are meaningful to applications.
  3. Define a standardized minimal NoSQL deployment pattern for common small-to-medium, non-mission critical use cases.
  4. Define an enhanced design pattern to support mission-critical use cases that require high availability, consistency, durability, scalability, and performance.
  5. Define other design patterns to support non-conforming use cases, for example, standalone without disaster recovery (DR), and application sharding.

Given that JSON has become the de facto standard data format for web-centric e-commerce businesses like eBay, this blog will use two of the top document-centric NoSQL databases (MongoDB and Couchbase) to illustrate our proposed resilience design pattern.

DB comparison

Although a tutorial on MongoDB and Couchbase is beyond the scope of this blog, the high-level comparison between them in Table 1 helps illustrate their differences. It also helps explain how these differences influence respective resilience design patterns for each product.

Table 1. Comparison of MongoDB and Couchbase
MongoDB Couchbase
* Throughout this blog post, for convenience we designate one Couchbase cluster per data center even though multiple clusters can be set up per data center. However, nodes in the same Couchbase cluster do not span across multiple data centers.
Name Replica set/sharded cluster “Cluster”*
Architecture Master-slave Peer-to-peer
Replication Master to slave (or primary to secondary) nodes Local cluster Primary to replica nodes
Multiple cluster Bi- or uni-directional Cross Data Center Replication (XDCR)
Sharding Optional Built-in
Quorum Read Yes No
Write Yes

From the comparison above, we observe the following characteristics when designing a resilience pattern design:

  • NoSQL databases tend to simplify their resilience pattern if they are based on peer-to-peer architecture.
  • NoSQL databases tend to complicate their resilience pattern if they lack quorum read/write, an important capability to support global read and write consistency.

Lastly, we found it helpful to organize resilience design patterns into different categories, as shown in the Table 2. For brevity, we will focus the examples on the following three categories: workload, durability, and application sharding.

Table 2. Categories of Resilience Design Patterns
Category Pattern
Workload General purpose mixed read and write
Performance High performance read and/or write
Durability 100% durability
High local and/or cross data center durability
High availability (HA) High availability local read and write
High availability multi-data center read and write
High Read and write consistency High local data center read and write consistency
High multi-data center read and write consistency
Others Administration, backup and restore, application sharding …

Standard minimal deployment

As mentioned earlier, one key objective of NoSQL resilience design patterns is to relieve applications and Operation of preventable infrastructure failure. It is our opinion that, disregarding the size or importance of applications, enterprise-class NoSQL clusters should be able to handle node, cluster, site or network partition failure gracefully with built-in preventive measures. Also, they should be able to perform disaster recovery (DR) within reasonable bounds. This assumption warrants the need of the “standard minimal deployments” described below.

MongoDB standard minimal deployment

According to our experience managing some of the largest 24×7 heterogeneous NoSQL clusters in the world, we recommend the following “standard minimal deployment” best practice for enterprise-class MongoDB deployments:

  • Whenever possible, nodes in the same MongoDB replica set will be provisioned in different fault domain to minimize any availability risk associated with node and network failure.
  • Standard MongoDB deployment (for example, type 1 in Table 3) always includes third data center with two or more secondary nodes. For applications that do not have traffic in a third data center, light-weight arbiter nodes should be used for cross-data center quorum voting during site or total data center failure.
  • With few exceptions (for example, type 2 below), a MongoDB replica set can be deployed in only one data center if applications do not require remote DR capability.
Table 3. MongoDB Standard Minimal Deployments
Deployment Type MongoDB Replica Set Nodes Configuration Description
Data Center 1 Data Center 2 Data Center 3
1 Standard deployment with built-in DR capability Minimum three nodes, which include one primary and two secondary nodes Minimum two secondary nodes Minimum two secondary or arbiter nodes.This is to ensure that should any one data center experience total failure, it is not excluded from quorum voting for new primary node. In selected high traffic or important use cases, additional secondary nodes may be added in each data center. The purpose is to increase application read resilience so that surviving nodes won’t be overwhelmed in case of one or more secondary nodes fail in any data center.
2 Special qualified use case without DR Three (one primary, two secondary nodes) This deployment pattern is for use cases that do not require remote DR since applications can rebuild the entire dataset from an external data source.Furthermore, the lifespan of data can be relatively short-lived and may expire in a matter of minutes, if not seconds. In selected high traffic or important use cases, additional secondary nodes may be added to the replica set to help increase resilience.

The following diagram illustrates the MongoDB “standard minimal deployment” pattern.

Mongo deploy

During primary node failover, one of the available secondary nodes in either data center will be elected as the new primary node, as shown in the following diagram.

Mongo before failover

During site or data center failure, one of the available secondary nodes in other data centers will be elected as the new primary node, as shown in the following diagram.

Mongo after failover

Couchbase standard minimal deployment

With peer-to-peer architecture, Couchbase’s “standard minimal deployment” does not require a third data center since it does not involve quorum voting when failing over to a new primary node. Arguably, increasing minimal nodes from four to five or six or more per data center can help strengthen operational viability should two or more nodes in the same data center fail at the same time (itself an unlikely scenario if they were provisioned in different fault domains in the first place). With that, we feel that a minimal four nodes per data center are sufficient for most Couchbase use cases to start with.

Table 4. Couchbase Standard Minimal Deployments
Deployment Type Couchbase Replica Set Nodes Configuration Description
Data Center 1 Data Center 2 Data Center 3 (Optional)
1 Standard deployment with built-in DR capability 4+ nodes 4+ nodes 4+ nodes See description above
2 Special qualified use case without DR 4+ nodes Same as MongoDB

The following diagram illustrates the Couchbase “standard minimal deployment” pattern where each data center/cluster has two copies of the same document (for example, P1 being the primary copy and R1 the eventually consistent replica copy).

Couchbase deploy

During automatic node failover, the Couchbase client SDK in applications will detect node failure and receive an updated failover cluster map with a topology containing the new location of replica documents that have been promoted to primary. This is shown in following diagram.

Couchbase failover

Although Couchbase supports bidirectional Cross Data Center Replication (XDCR) between geographically dispersed clusters, its current implementation does not offer automatic cluster failover. To address this limitation, Couchbase will support a new feature called Multi-Cluster Awareness (MCA) in a future release (tentatively v4.x) to provide this capability, as shown in following diagram.

Couchbase MCA

High-level MongoDB and Couchbase resilience capability comparison

Before we talk about the details of an individual resilience design pattern for each product, it helps to understand the capability of MongoDB and Couchbase in terms of high availability, consistency, durability, and DR across local and multiple data centers. The following table highlights their differences.

Table 5. Comparison of MongoDB and Couchbase Resilience Capabilities
NoSQL Database Data center High Availability High Consistency High Durability DR
* MCA (Multi-Cluster Awareness) and TbCR (Timestamp-based Conflict Resolution) will be available in a future Couchbase v4.x release.
MongoDB Local DC No for Write
Yes for Read
Yes Yes No
Multi DC Yes
Couchbase Local DC No for Write
Yes for Read
Yes Yes No
Multi DC Yes with MCA* No Yes with MCA and TbCR* Yes

From the above comparison, we observe one similarity between MongoDB and Couchbase in their support for high-availability writes or the lack of. This is because both products limit their writes to a single primary node only. Nonetheless, Couchbase does plan to support high-availability write for multiple data centers through its future Multi-Cluster Awareness feature, which will help alleviate this limitation. This is the reason why Couchbase’s standard minimal deployment pattern requires at minimum two data centers/clusters.

MongoDB resilience design pattern examples

Before we show MongoDB’s resilience design pattern examples, it is important to understand how data loss can happen in MongoDB if applications do not use a proper resilience design pattern.

Non-synchronous writes

Let’s assume that you have a MongoDB replica set comprised of three nodes. The following operations illustrate how data loss can happen:

  • First, the application writes five documents and receives write confirmation from the primary node only.
  • The first three documents are replicated successfully at the second secondary node, and two documents are replicated at the third secondary node.

    Mongo Non-Synchronous Writes A

  • The primary node fails before all five documents reach both of the two secondary nodes

    Mongo Non-Synchronous Writes B

  • Quorum voting re-elects the second secondary node as the new primary node because it receives the third document, that is, more recently than the third secondary node

    Mongo Non-Synchronous Writes C

  • The original primary node steps down to be a secondary and rolls back its fourth and fifth documents, since they didn’t reach other two secondary nodes

    Mongo Non-Synchronous Writes D

  • In effect, the application loses the fourth and fifth documents even though it receives confirmation from the original primary node. The issue associated with this type of data loss (non-synchronous write) occurs because that application didn’t apply the “quorum write” option to majority nodes in the same replica set.
  • In addition to “quorum write”, one should also enable MongoDB’s write-ahead logging journal file to further increase primary node write durability.

Since non-synchronous write is just one type of issue that can happen to applications, it is not sufficient to develop a pattern just to solve it. Having various add-on patterns, like synchronous write and others, on top of the MongoDB standard minimal deployment patterns helps enrich the overall NoSQL database resilience capabilities.

MongoDB write durability pattern

To achieve MongoDB replica-set-wide write durability, an application should use MongoDB’s WriteConcern = Majority option, where a write waits for confirmation from majority nodes across multiple data centers. During primary node failover, one of the majority secondary nodes having the latest committed write will be elected as the new primary node.

One caveat of this pattern is that one should weigh the pros and cons associated with waiting for cross-data center majority nodes confirmation, since it may not suitable for applications that require short latency.

MongoDB Write Durability Pattern

MongoDB read intensive pattern

This pattern is for applications that require high availability reads. With MongoDB’s master-slave replication architecture and ability to scale up to 49 secondary nodes across multiple data centers, it inherently supports highly available reads, provided that the number of healthy secondary nodes in any data center are capable of handling read traffic if one or more fail. Note that even though the current MongoDB replica set can scale up to 50 nodes, only 7 nodes can vote during primary node failover. The following diagram illustrates this capability.

MongoDB Read Intensive Pattern

MongoDB extreme high write pattern

Use cases requiring extreme high writes can use MongoDB optional sharding since it allows horizontal write scaling across multiple replica sets (shards). Each shard (or replica set) stores part of the entire dataset as defined by a predefined shard key of the collection. With a well-designed shard key, MongoDB automatically distributes and balances read/write queries to designated shards. Although MongoDB sharding provides horizontal scale-out write capability, this pattern incurs consequent complexity and overhead and should be used with caution:

  • An application should start with sharding using an appropriate shard key from the start instead of migrating from a single replica set as afterthought. This is because migrating a single replica set to a sharded cluster is a major undertaking from both development and operations perspectives.
  • We recommend starting with a predefined number of shards capable of handling capacity and traffic in the foreseeable future. This helps eliminate the overhead associated with rebalancing when adding new shards.
  • Note that MongoDB automatic shard balancing may introduce spikes during chunk migration and can potentially impact performance.
  • Developers needs to understand behavior and limitation on how mongos, a software router process, on queries that do not include shard key, such as scatter gather operations.
  • Developers should weigh the pros and cons of the overhead associated with running mongos as part of application servers vs. in separate nodes.

The following diagram illustrates this capability.

MongoDB Extreme High Write Pattern

Couchbase resilience design pattern examples

Although Couchbase does not support write-ahead logging or quorum write, it achieves high durability through following mechanisms:

  • Local cluster durability — ReplicateTo and PersistTo functions
  • Multi-cluster durability (to be released in a future v4.x release):
    • Multi-Cluster Awareness (MCA) — Without application logic, this feature allows application developers and DBA to define rules on how applications should behave in the event of complete data center/cluster failover.
    • Timestamp-based Conflict Resolution (TbCR) — Based on a server-synced clock, this feature provides Last-Write-Win (LWW) capability on bidirectional replication update conflict resolution to ensure correct cross-data center/cluster write durability.

Couchbase local cluster write durability pattern

For local cluster durability, Couchbase provides the following two functions through its client SDK:

  • ReplicateTo — This function allows writes on the same document to be successfully replicated in memory for all copies in the local cluster. However, it does not guarantee writes to be persisted on disk, which may result in data loss if both primary and replica nodes fail before that happens.
  • PersistTo — For increased durability, applications can use this function so that writes will not only be successfully replicated in memory but also persisted on disk in the local cluster.

Note that even with the second PersistTo function, the current Couchbase version still does not guarantee writes to be successfully replicated to remote data centers/clusters. (This capability is explained in the next section, “Couchbase Multi-cluster write durability pattern.”) The following operations illustrate how both functions work.

  1. Assume that the Couchbase topology contains four nodes per data center/cluster with each storing two copies of the same document replicated through XDCR between clusters.

    Couchbase Local Cluster Write Durability Pattern A

  2. The application in Data Center 1 writes document P1 to node N1.

    Couchbase Local Cluster Write Durability Pattern B

  3. Before P1 is replicated to the replica node in the local cluster or to the remote data center/cluster, node N1 fails, and as a result the application suffers data loss.

    Couchbase Local Cluster Write Durability Pattern C

  4. Before P1 reaches the remote data center/cluster, even though P1 has been replicated successfully in memory to the local cluster replica node N4, if both N1 and N4 nodes fail, the application still suffers data loss.

    Couchbase Local Cluster Write Durability Pattern D

  5. Using the ReplicateTo function can circumvent the failure described in step 3, and using the PersistTo function can circumvent the failure described in step 4, as shown in the following figure.

    Couchbase Local Cluster Write Durability Pattern E

  6. Lastly, for multi-data center/cluster durability, use the design pattern described in “Couchbase multi-cluster write durability pattern.

Couchbase multi-cluster write durability pattern

With its Multi-Cluster Awareness and Timestamp-based Conflict Resolution features, Couchbase supports multi-cluster durability as shown below.

Couchbase Multi Cluster Write Durability Pattern A

In the absence of write-ahead logging or quorum write, and even though Couchbase provides sufficient support for local and multi-cluster durability, one should still ask this question: what is the likelihood that all primary and replica nodes fail in multiple data centers or even worse that all data centers fail completely at the same time? These two unlikely failure scenarios are shown in the following two diagrams. We feel that the odds are next to zero if one follows this proposed Couchbase durability design pattern.

Couchbase Multi Cluster Write Durability Pattern B                         Couchbase Multi Cluster Write Durability Pattern C

Couchbase read/write intensive scalability pattern

With its peer-to-peer architecture and XDCR’s bidirectional multi-cluster/data center replication capability, Couchbase affords users the flexibility to provision clusters with different sizes and shapes tailored for specific traffic/capacity and usage patterns. This flexibility is shown in the following diagram. On the other hand, the cluster-sizing calculation exercise can become complicated. This is especially true if it involves Multi-Dimensional Scaling.

Couchbase Read:Write Intensive Scalability Pattern

Other NoSQL design pattern examples

NoSQL DB-agnostic application sharding pattern

The motivation behind this design pattern is that almost all large-scale mission-critical use cases require high availability. According to our experience, it doesn’t matter which NoSQL database you use or how comprehensive the best practice and SOP you follow, sometimes simple mundane maintenance tasks can jeopardize the overall database availability. The larger the size of database, the more severe the damage it may suffer. This design pattern offers an alternative solution using a divide-and-conquer approach, that is, by reducing the NoSQL database cluster to a small and manageable size through the following mechanisms:

  • Applications shard their data using modular, hash, round robin, or any other suitable algorithm.
  • The DBA and Operations provide a standard-size NoSQL cluster to host each application-level shard.
  • If needed, each application-level shard can further use built-in NoSQL vendor product sharding if it is available.

Using MongoDB as an example, the following diagram illustrates this design pattern. One caveat associated with this pattern is that it requires a middle-tier data access layer to help direct traffic, an effort that must not be underestimated.

NoSQL DB Agnostic Application Sharding Pattern

Future work and direction

In this blog, we described the motivation, consideration, and approach behind our proposed NoSQL resilience design pattern. We overviewed key differences between MongoDB and Couchbase in the context of resilience patterns. We also walked through three NoSQL resilience design pattern examples and a DB-agnostic application sharding pattern. In conclusion, we would like to suggest the following future work and direction on this topic:

  • Provide end-to-end integration of proven NoSQL design patterns with application frameworks and also cloud provisioning and management infrastructure.
  • Formalize the above NoSQL design patterns as officially supported products rather than just engineering patterns.
  • Add other types of resilience patterns, such as high consistency.
  • Add support for other NoSQL databases, for example, Cassandra.
  • Collaborate with NoSQL vendors and develop new resilience patterns for new features and capabilities.

The graphics in Non-synchronous writes are reproduced with the kind permission of Todd Dampier from his presentation “Rock-Solid Mongo Ops“.

NoSQL Data Modeling

Data modeling for RDBMS has been a well-defined discipline for many years. Techniques like logical to physical mapping and normalization / de-normalization have been widely practiced by professionals, including novice users. However, with the recent emergence of NoSQL databases, data modeling is facing new challenges to its relevance. Generally speaking, NoSQL practitioners focus on physical data model design rather than the traditional conceptual / logical data model process for the following reasons:

  • Developer-centric mindset – With flexible schema (or schema-free) support in NoSQL databases, application developers typically assume data model design responsibility. They have been ingrained with the notion that the database schema is an integral part of application logic.
  • High-performance queries running in massive scale-out distributed environments – Contrary to traditional, centralized scale-up systems (including the RDBMS tier), modern applications run in distributed, scale-out environments. To accomplish scale-out, application developers are driven to tackle scalability and performance first through focused physical data model design, thus abandoning the traditional conceptual, logical, and physical data model design process.
  • Big and unstructured data – With its rigidly fixed schema and limited scale-out capability, the traditional RDBMS has long been criticized for its lack of support for big and unstructured data. By comparison, NoSQL databases were conceived from the beginning with the capability to store big and unstructured data using flexible schemas running in distributed scale-out environments.

In this blog post, we explore other important mindset changes in NoSQL data modeling: development agility through flexible schemas vs. database manageability; the business and data model design process; the role of RDBMS in NoSQL data modeling; NoSQL variations that affect data modeling; and visualization approaches for NoSQL logical and physical data modeling. We end the post with a peak into the NoSQL data modeling future.

Development agility vs. database manageability

One highly touted feature in today’s NoSQL is application development agility. Part of this agility is accomplished through flexible schemas, where developers have full control over how data is stored and organized in their NoSQL databases. Developers can create or modify database objects in application code on the fly without relying on DBA execution. The result is, indeed, increased application development and deployment agility.

However, the flexible schema is not without its challenges. For example, dynamically created database objects can cause unforeseen database management issues due to the lack of DBA oversight. Furthermore, unsupervised schema changes increase DBA challenges in diagnosing associated issues. Frequently, such troubleshooting requires the DBA to review application code written in programming languages (e.g., Java) rather than in RDBMS DDL (Data Definition Language) – a skill that most DBAs do not possess.

NoSQL business and data model design process

In old-school software engineering practice, sound business and (relational) data model designs are key to successful medium- to large-scale software projects. As NoSQL developers assume business / data model design ownership, another dilemma arises: data modeling tools. For example, traditional RDBMS logical and physical data models are governed and published by dedicated professionals using commercial tools, such as PowerDesigner or ER/Studio.

Given the nascent state of NoSQL technology, there isn’t a professional-quality data modeling tool for such tasks. It is not uncommon for stakeholders to review application source code in order to uncover data model information. This is a tall order for non-technical users such as business owners or product managers. Other approaches, like sampling actual data from production databases, can be equally laborious and tedious.

It is obvious that extensive investment in automation and tooling is required. To help alleviate this challenge, we recommend that NoSQL projects use the business and data model design process shown in the following diagram (illustrated with MongoDB’s document-centric model):

design_process

Figure 1

  • Business Requirements & Domain Model: At the high level, one can continue using database-agnostic methodologies, such as domain-driven design, to capture and define business requirements
  • Query Patterns & Application Object Model: After preliminary business requirements and the domain model are produced, one can work iteratively and in parallel to analyze top user access patterns and the application model, using UML class or object diagrams. With RDMS, applications can implement database access using either a declarative query (i.e., using a single SQL table join) or a navigational approach (i.e., walking individual tables embedded in application logic). The latter approach typically requires an object-relational mapping (ORM) layer to facilitate tedious plumbing work. By nature, almost all NoSQL databases belong to the latter category. MongoDB can support both approaches through the JSON Document model, SQL-subset query, and comprehensive secondary indexing capabilities.
  • JSON Document Model & MongoDB Collection / Document: This part is where native physical data modeling takes place. One has to understand the specific NoSQL product’s strengths and weaknesses in order to produce efficient schema designs and serve effective, high-performance queries. For example, modeling social network entities like followed and followers is very different from modeling online blogging applications. As such, social networking applications are best implemented using Graph NoSQL databases like Neo4j, while online blogging applications can be implemented using other flavors of NoSQL like MongoDB.

RDBMS data modeling influence on NoSQL

Interestingly enough, old-school RDBMS data modeling techniques still play a meaningful role for those who are new to NoSQL technology. Using document-centric MongoDB as an example, the following diagram illustrates how one can map a relational data model to a comparable MongoDB document-centric data model:

mongodb_mapping

Figure 2

NoSQL data model variation

In the relational world, logical data models are reasonably portable among different RDBMS products. In a physical data model, design specifications such as storage clauses or non-standard SQL extensions might vary from vendor to vendor. Various SQL standards, such as SQL-92 and the latest SQL:2008 as defined by industry bodies like ANSI/ISO, can help application portability across different database platforms.

However, in the NoSQL world, physical data models vary dramatically among different NoSQL databases; there is no industry standard comparable to SQL-92 for RDBMS. Therefore, it helps to understand key differences in the various NoSQL database models:

  • Key-value stores – Collections comprised of unique keys having 1-n valid values
  • Column families – Distributed data stores in which a column consists of a unique key, values for the key, and a timestamp differentiating current from stale values
  • Document databases – Systems that store and manage documents and their metadata (type, title, author, creation/modification/deletion date, etc.)
  • Graph databases – Systems that use graph theory to represent and store data as nodes (people, business, accounts, or other entities), node properties, and edges (lines connecting nodes/properties to each other)

The following diagram illustrates the comparison landscape based on model complexity and scalability:

nosql_comparisons

Figure 3

It is worth mentioning that for NoSQL data models, a natural evolutionary path exists from simple key-value stores to the highly complicated graph databases, as shown in the following diagram:

nosql_evolution

Figure 4

NoSQL data model visualization

For conceptual data models, diagramming techniques such as the Entity Relationship Diagram can continue to be used to model NoSQL applications. However, logical and physical NoSQL data modeling requires new thinking, due to each NoSQL product assuming a different native structure. One can intuitively use any of the following three visualization approaches, using a document-centric data model like MongoDB as an example:

  • Native visual representation of MongoDB collections with support for nested sub-documents (see Figure 2 above)

Pros – It naturally conveys a complex document model through an intuitive visual representation.
Cons – Without specialized tools support, visualization results in ad-hoc drawing using non-uniform conventions or notations.

  • Reverse engineering selected sample documents using JSON Designer (see Figure 5 below)

Pros – It can easily reverse engineer a hierarchical model into a visual representation from existing JSON documents stored in NoSQL databases like MongoDB.
Cons – As of this writing, JSON Designer is available only on iPhone / iPad. Furthermore, it does not include native DB objects, such as MongoDB indexes.

json_designer

Figure 5

  • Traditional RDBMS data modeling tools like PowerDesigner (see Figure 6 below)

Pros – Commercial tools support is available.
Cons – it requires tedious manual preparation and diagram arrangement to represent complex and deeply nested document structure.

power_designer

Figure 6

In a future post, we’ll cover specific data model visualization techniques for other NoSQL products such as Cassandra, which is based on the Column Family structure.

New NoSQL data modeling opportunities

Like any emerging technology, NoSQL will mature as it becomes mainstream. We envision the following new data modeling opportunities for NoSQL:

  • Reusable data model design patterns (some product-specific and some agnostic) to help reduce application development effort and cost
  • Unified NoSQL model repository to support different NoSQL products
  • Bi-directional, round-trip engineering support for (data) model-driven design processes and tools
  • Automated data model extraction from application source code
  • Automated code-model-data consistency validation and consistency conformance metrics
  • Strong control for application / data model change management, with proactive tracking and reconciliation between application code, embedded data models, and the actual data in the NoSQL databases