eBay Tech Blog

Diversity in Search

by David Goldberg on 11/26/2014

in Search Science

This post is about how to rank search results, specifically the list of items you get after doing a search on eBay.

Diversity

One simple approach to ranking is to give a score to each item. For concreteness, imagine the score to be an estimate of the probability that a user will buy the item. Each item is scored individually, and then the items are presented in score order, with the item most likely to be purchased at the top.

This simple approach makes a lot of sense. But there are some situations where it can work poorly. For example, suppose that there are 40 results, 20 for Sony and 20 for Panasonic. And suppose that all the Sony items have a score that is just a tiny bit better than any of the Panasonic items. Then the first 20 results will be Sony, giving the incorrect impression that we only have Sony. It would be better to show at least one Panasonic near the top, preferably high enough to be “above the fold”. In other words, in this case I want to take diversity into account. An ultra-simple method to handle this case would be to add a small amount of random noise to each score. But I think you can see that this is somewhat of a kludge, and doesn’t address the problem of balancing score and diversity head-on.

Solutions in the literature

One of the first principled methods for diversity goes back more than 15 years to the paper by Jaime Carbonell and Jade Goldstein, The use of MMR, diversity-based reranking for reordering documents and producing summaries, in Proceedings of SIGIR ’98. Their idea is a simple one. Find a way to measure the similarity of items, where presumably two Sony items would be more similar than a Sony item and a Panasonic one. Then trade off the score and the similarity. In more detail, they propose placing items on the results page one step at a time. The first step places the item with the highest score, and records its index in the set of placed items I. The second step examines every unplaced item and compares each one’s score with its similarity to the item i (where I={i}). If P(j) is the score of j (P for probability), then each item j is evaluated using what they call marginal relevance: λP(j)−(1−λ)S(i,j). The marginal relevance is higher when the score P is better, and it is higher when the similarity to i is less. The parameter λ determines the relative importance of these two numbers. The item chosen for the second slot is the one with maximum marginal relevance (MMR):

    \[ MMR = \mbox{argmax}_{j \notin I}\left(\lambda P(j) - (1 - \lambda)S(i, j)\right)\]

The MMR item is placed on the results page, and its index is added to I. In general the algorithm computes

    \[MMR = \mbox{argmax}_{j \notin I}\left(\lambda P(j) - (1 - \lambda)\max_{i \in I}S(i, j)\right)\]

and places the resulting item on the results page as well as adding its index in I.
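
To make the procedure concrete, here is a minimal Python sketch of MMR re-ranking (my illustration, not the paper’s reference implementation). The score and similarity functions are hypothetical placeholders supplied by the caller.

    def mmr_rerank(items, score, similarity, lam=0.7):
        """Re-rank items by Maximal Marginal Relevance (Carbonell & Goldstein).

        items      : list of item ids
        score      : callable giving P(j), the relevance score of item j
        similarity : callable S(i, j); higher means more similar
        lam        : trade-off between relevance (lam=1) and diversity (lam=0)
        """
        unplaced = list(items)
        placed = []
        while unplaced:
            def marginal_relevance(j):
                # Penalize j by its similarity to the most similar placed item.
                max_sim = max((similarity(i, j) for i in placed), default=0.0)
                return lam * score(j) - (1 - lam) * max_sim
            best = max(unplaced, key=marginal_relevance)
            unplaced.remove(best)
            placed.append(best)
        return placed

With an empty placed set the similarity term is zero, so the first slot simply goes to the highest-scoring item, matching the description above.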

This method seems an improvement over adding random noise, but is unsatisfying for several reasons. Once it places an item, it is reluctant to place a second similar item. This reluctance may be fine for web search where you only want to see one instance of each kind of page, but is not so good for e-commerce where you’d like to compare multiple items. Also, there are almost certainly several dimensions of similarity (brand, cost, new vs. used, etc.), and it’s not clear how to combine them all into a single similarity function.

Rakesh Agrawal et al. have a paper, Diversifying search results, from WSDM ’09 that doesn’t address any of those objections, but does address a different problem: how do you pick λ? They propose an algorithm that doesn’t have an arbitrary tuning parameter. They suppose that each item is in a category and that the demand (for a fixed query) of each category is known. This approach maps well to eBay search, since we know the category demand for our top queries. How to use this info? They imagine that users who issue query Q are diverse—that some are looking for items in category C1 while others want an item in a different category C2, and that this diversity is captured in the category demand table. The paper gives an algorithm that maximizes the chance that a user will find an item in his or her desired category. Let’s call this the Agrawal algorithm.

The Agrawal algorithm is a step above MMR in elegance because it doesn’t require fiddling with an arbitrary constant like λ. It works well for categories, but what about other forms of diversity like brand or item condition (new vs. used)? Just as we record category demand, we could record demand for brands, item condition, etc. But then we would need a way to soup up the Agrawal algorithm to combine these different demands, and presumably would need to prioritize them. And like MMR, the Agrawal algorithm heavily penalizes an item if a similar one already exists, which is not appropriate for e-commerce.

A paper by Michael Welch et al., Search result diversity for informational queries, WWW ’11, addresses one of the objections to Agrawal. They assume that users want to see more than one document per category. Specifically, they introduce a random variable J representing the number of items a user wants to see. You have to provide an estimate for the probability distribution of J (i.e., find numbers pj = Pr(J = j)) and then the algorithm uses that in its ranking. But this approach still doesn’t address our requirement to gracefully handle multiple dimensions of diversity.

There’s another factor to take into account. With multiple dimensions, we are unlikely to have a good estimate of the demand. For example, if the dimensions are category, brand and condition, we would ideally want the demands for each tuple: for example, (category=TV, Brand=Sony, condition=Used). For all these reasons, we abandoned the Carbonell → Agrawal → Welch train of thought, and took a different approach.

Our solution: agents

For each query, we specify which dimensions are important, together with constraints in the form of a max and/or min. For example, for the query “flat screen TV”, we might want at least 10% new items, and at most 50% with brand Samsung. This fits in nicely with our search architecture, which lets us customize the search behavior for a specific query by putting info into DSBE, a database indexed by query. Also, we expect that in the common case the min/max constraints aren’t violated, so the exact values of the constraints aren’t critical.

You can think of this as a very simple agent system. Each dimension has an agent with constraints (max/min). Each agent monitors the construction of the results page, and tries to keep its constraint satisfied. The agents won’t always succeed, so think of the requirements as being soft constraints.

Whenever you have multiple constraints, your first question should be how to prioritize them. The answer falls out almost automatically from the agent algorithm, which I now describe.

Each agent monitors the construction of the result set. At each stage, the agent checks on its constraints. If any are violated, it scans through the unplaced items looking for an item that will get the result set closer to satisfying its constraints. This item is the candidate.

If the agent’s constraint is violated, it is unhappy, and it quantifies its unhappiness by combining two terms. The first term measures how much the placed items deviate from the agent’s constraints: the more deviation, the more unhappiness. The second term involves the candidate: there is a penalty for placing the candidate on the search results page (SRP) instead of the default item, because the candidate has a lower score (P) than the default. Summarizing the terminology:

  • Candidate: An item proposed by the agent for the next spot on the results page. If no agent proposes a candidate, the default item is the unplaced item of highest score.
  • Deviance: How far the placed items deviate from the agent’s constraints. Larger means more deviation.
  • Penalty: The penalty (in reduced score) for placing the candidate instead of the default. Larger means more of a penalty, i.e., a larger gap between the scores.
  • Unhappiness: The agent’s unhappiness. If unhappiness>0 then the agent’s candidate will be a contender for the next placed item.

Now back to the question of prioritizing. The priority falls out automatically. If multiple agents are unhappy, we simply pick the one with the largest unhappiness score, and use its candidate as the next item to be placed.

This approach isn’t hassle-free. You need to pick constraints (max/min) for each query. And you need to quantify deviance and penalty. Or in other words, find how to weight them, which is reminiscent of the parameter λ in MMR. But we prefer this approach because it is efficient, handles multiple constraints, and is not too sensitive to the exact values of max and min. For most queries, we expect the constraints to hold with the default ranking, and so placing an agent’s candidate is the exception rather than the rule.

A sample formula for unhappiness

The description in the previous section was abstract in that it talked about deviance and penalty without offering a way to compute them. Here is one way.

A typical constraint requires that the fraction of items with property P be at least f (this is a min constraint). P might be that an item is used, or has brand Sony, or is newly listed. The straightforward way to compute deviation from a constraint is to look at the number of items with P placed so far and compare it to the number needed to meet the constraint. Suppose that there are n items placed so far, and that k of them have P. Since the constraint is min = f, we’d like at least nf items. If k < nf the constraint isn’t met, so set the deviance to nf − k. Or more precisely, deviance is (nf − k)+, where x+ = x when x > 0 and 0 otherwise.

A very simple way to compute the penalty is to compare the score of the candidate item IC with the score of the item that would otherwise be placed, Idefault. The penalty is S(Idefault) − S(IC). Since items are sorted in score order, the penalty is always nonnegative.

Unhappiness is computed using the formula unhappiness = deviance − λ × penalty. Imagine deviance > 0, but the only way to fix it is by placing an item with a huge penalty. The large λ × penalty term makes unhappiness negative, so the agent isn’t unhappy after all. In other words, agents are altruistic and don’t insist on their constraint if it results in placing a poor item. The parameter λ controls the tradeoff between score and diversity.

As a preliminary formula, unhappiness is

    \[\text{unhappiness} = \text{deviance} - \lambda \times \text{penalty} = (nf - k)^+ - \lambda \left( S(I_{\text{default}}) - S(I_C) \right)\]

But there’s one issue remaining. Suppose the constraint asks for at least 10%, and suppose the first item placed does not have the desired property. Then n = 1, f = 0.1, and k = 0, so (nf − k)+ = (0.1)+ = 0.1, and the agent is already unhappy and will be in the running to have its proposed item placed next. This seems like jumping the gun, since the goal is 10% but only one item has been placed. There are at least two ways to account for this situation.

The first would be rounding: replace (nf − k)+ with round(nf − k)+. But I think of constraints as something only triggered in exceptional circumstances, and so prefer to have unhappiness go positive only at the last possible moment. In other words, if it is still possible to satisfy nf ≤ k on the next round, then don’t trigger the constraint.

Writing this as an equation: suppose I pass on this constraint and the next item turns out not to have P. Then k stays the same and n becomes n + 1. But on the following round I pick an item with P, so k becomes k + 1 and n becomes n + 2. Thus I can still satisfy the constraint in the future if (n + 2)f ≤ k + 1. So replace (nf − k)+ with ((n+2)f − (k+1))+. The formula becomes

    \[ \boxed{ \text{unhappiness} = \text{deviance} - \lambda \times \text{penalty} = ((n+2)f - k - 1)^+ - \lambda \left(S(I_{\text{default}}) - S(I_C)\right)} \]
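
As a small worked sketch using the definitions above (the penalty being the nonnegative score gap between the default item and the candidate), the boxed formula for a min constraint might be coded like this in Python:

    def min_constraint_unhappiness(n, k, f, candidate_score, default_score, lam):
        """Unhappiness of a min-constraint agent.

        n               : number of items already placed on the SRP
        k               : how many of them have the desired property
        f               : required minimum fraction
        candidate_score : score of the item the agent proposes
        default_score   : score of the highest-scoring unplaced item
        lam             : trade-off between diversity and score
        """
        # Look-ahead deviance: positive only if the constraint can no longer
        # be satisfied on the next round.
        deviance = max((n + 2) * f - k - 1, 0.0)
        # Penalty for placing the candidate instead of the default item.
        penalty = default_score - candidate_score
        return deviance - lam * penalty

For the 10% example above (n = 1, k = 0, f = 0.1), the deviance is (3 × 0.1 − 1)+ = 0, so the agent stays quiet — exactly the behavior the look-ahead form was designed to give.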

Constraints

There are two types of constraints based on a property P. The first is a min constraint, which asks that the fraction of items with P be greater than fmin. The second is a max constraint, asking for the fraction to be less than fmax. Most properties are something very specific, like condition=used. But there is also an any property. This would be used (for example) to keep any one seller from dominating the results. A max constraint with property seller, bound f, and the any property asks that no single seller id account for more than a fraction f of the results. The any property doesn’t make sense with a min constraint.

The any property is also a nice way to do duplicate (or fuzzy-duplicate) removal. We assign a hash to each item, so that duplicate (or similar) items have the same hash. Then using a max constraint with f = 1/50 means (roughly) that each item will appear only once on each 50-item page. If λ = 0 the requirement is absolute (unless there are competing agents). But by setting λ to a small number, we can allow dups if the alternative is to show very low-quality items.
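
Here is a hedged Python sketch of how the hashing plus a max constraint with the any property might look; the fingerprint function is a made-up stand-in for whatever fuzzy hash is actually used.

    import hashlib
    from collections import Counter

    def fingerprint(title):
        # Hypothetical fuzzy hash: normalize the title so near-duplicates collide.
        normalized = " ".join(sorted(title.lower().split()))
        return hashlib.md5(normalized.encode()).hexdigest()

    def any_max_deviance(placed_fingerprints, f):
        """Deviance of a max constraint with the 'any' property.

        k is the largest count of any single fingerprint among placed items;
        the look-ahead form mirrors the formulas above.
        """
        n = len(placed_fingerprints)
        k = max(Counter(placed_fingerprints).values(), default=0)
        return max(k + 1 - (n + 2) * f, 0.0)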

The algorithm

I close with a more formal description of the agent-based diversity algorithm. It can be implemented quite efficiently. In MMR (as well as Welch and Agrawal), at each step you must scan all unplaced items (think O(n²)). In our algorithm, each item is scanned at most once over the entire run (although this is done separately for each constraint). Scanning is implemented by keeping a pointer per constraint; the pointer moves down the unplaced list as the algorithm proceeds.

Input: a list of unplaced items, sorted by score
Output: items in their final arrangement for display on the search results page (SRP)

Remove the top item from unplaced and append it to SRP
For every constraint C, set C.ptr := first unplaced item
While unplaced is not empty
    initialize the set candidates to be empty
    ForEach C in the set of soft constraints
        If C.unhappiness() > 0
            add C to candidates
            // this has the side effect of updating C.ptr
        end
    EndForEach
    If candidates = empty then
        set item := the top (highest-score) item on unplaced
    else
        take the element C in candidates with the largest unhappiness score
        set item := C.proposed_item()
    end
    if any C.ptr is pointing at item, move it down one
    remove item from unplaced and append it to SRP
EndWhile

Method C.unhappiness() =
    n := number of items placed in SRP so far
    f := C.f   // the bound for this constraint
    k := the number of items placed so far that satisfy the constraint C
    // For the "any" property, k is the max number of placed items sharing
    // the same property value
    if C.op = min then
        deviance := ((n+2)f - k - 1)+
    else
        deviance := (k + 1 - (n+2)f)+
    end
    Starting at C.ptr, scan down unplaced until reaching an item I that will reduce deviance
    C.ptr := I
    Idefault := highest-scoring unplaced item
    penalty := S(Idefault) - S(I)
    unhappiness := deviance - lambda * penalty
    // set the proposed_item field and return
    C.proposed_item := I
    return unhappiness
end
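
For readers who prefer runnable code, below is a compact Python sketch of the same loop under simplifying assumptions of mine: it omits the any property and the per-constraint pointers (it simply rescans the unplaced list), and it represents each item as a (score, properties) pair.

    def place_items(items, constraints, lam=1.0):
        """Arrange items for the SRP under soft diversity constraints.

        items       : list of (score, properties) pairs, sorted by score descending
        constraints : list of dicts like {"op": "min", "f": 0.1, "has": predicate}
                      where predicate(properties) says whether an item counts
        """
        unplaced = list(items)
        srp = [unplaced.pop(0)]            # the top-scored item is always placed first

        def deviance(c):
            n = len(srp)
            k = sum(1 for it in srp if c["has"](it[1]))
            if c["op"] == "min":
                return max((n + 2) * c["f"] - k - 1, 0.0)
            return max(k + 1 - (n + 2) * c["f"], 0.0)

        def proposal(c):
            # Scan for the highest-scoring unplaced item that moves the result
            # set toward satisfying this constraint.
            for it in unplaced:
                helps = c["has"](it[1]) if c["op"] == "min" else not c["has"](it[1])
                if helps:
                    return it
            return None

        while unplaced:
            default = unplaced[0]          # highest-scoring unplaced item
            best, best_unhappiness = None, 0.0
            for c in constraints:
                d = deviance(c)
                if d <= 0:
                    continue
                cand = proposal(c)
                if cand is None:
                    continue
                unhappiness = d - lam * (default[0] - cand[0])
                if unhappiness > best_unhappiness:
                    best, best_unhappiness = cand, unhappiness
            chosen = best if best is not None else default
            unplaced.remove(chosen)
            srp.append(chosen)
        return srp

For example, place_items(results, [{"op": "min", "f": 0.1, "has": lambda p: p.get("condition") == "new"}]) asks for at least 10% new items, in the spirit of the flat-screen-TV example earlier.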


We are very excited to announce that eBay has released to the open-source community our distributed analytics engine: Kylin (http://kylin.io). Designed to accelerate analytics on Hadoop and allow the use of SQL-compatible tools, Kylin provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop to support extremely large datasets.

Kylin is currently used in production by various business units at eBay. In addition to open-sourcing Kylin, we are proposing Kylin as an Apache Incubator project.

Background

The challenge faced at eBay is that our data volume has become bigger while our user base has become more diverse. Our users – for example, in analytics and business units – consistently ask for minimal latency but want to continue using their favorite tools, such as Tableau and Excel.

So, we worked closely with our internal analytics community and outlined requirements for a successful product at eBay:

  1. Sub-second query latency on billions of rows
  2. ANSI-standard SQL availability for those using SQL-compatible tools
  3. Full OLAP capability to offer advanced functionality
  4. Support for high cardinality and very large dimensions
  5. High concurrency for thousands of users
  6. Distributed and scale-out architecture for analysis in the TB to PB size range

We quickly realized nothing met our exact requirements externally – especially in the open-source Hadoop community. To meet our emergent business needs, we decided to build a platform from scratch. With an excellent team and several pilot customers, we have been able to bring the Kylin platform into production as well as open-source it.

Feature highlights

Kylin is a platform offering the following features for big data analytics:

  • Extremely fast OLAP engine at scale: Kylin is designed to reduce query latency on Hadoop for 10+ billion rows of data.
  • ANSI SQL on Hadoop: Kylin supports most ANSI SQL query functions in its ANSI SQL on Hadoop interface.
  • Interactive query capability: Users can interact with Hadoop data via Kylin at sub-second latency – better than Hive queries for the same dataset.
  • MOLAP cube query serving on billions of rows: Users can define a data model and pre-build cubes in Kylin from more than 10 billion raw data records.
  • Seamless integration with BI Tools: Kylin currently offers integration with business intelligence tools such as Tableau and third-party applications.
  • Open-source ODBC driver: Kylin’s ODBC driver is built from scratch and works very well with Tableau. We have open-sourced the driver to the community as well.
  • Other highlights:
      • Job management and monitoring
      • Compression and encoding to reduce storage
      • Incremental refresh of cubes
      • Leveraging of the HBase coprocessor for query latency
      • Approximate query capability for distinct counts (HyperLogLog)
      • Easy-to-use Web interface to manage, build, monitor, and query cubes
      • Security capability to set ACL at the cube/project level
      • Support for LDAP integration

The fundamental idea

The idea of Kylin is not brand new. Many technologies over the past 30 years have used the same theory to accelerate analytics. These technologies include methods to store pre-calculated results to serve analysis queries, generate each level’s cuboids with all possible combinations of dimensions, and calculate all metrics at different levels.

For reference, here is the cuboid topology:

[Figure: cuboid topology]

When data becomes bigger, the pre-calculation processing becomes impossible – even with powerful hardware. However, with the benefit of Hadoop’s distributed computing power, calculation jobs can leverage hundreds of thousands of nodes. This allows Kylin to perform these calculations in parallel and merge the final result, thereby significantly reducing the processing time.

From relational to key-value

As an example, suppose there are several records stored in Hive tables that represent a relational structure. When the data volume grows very large – 10 billion or even 100 billion rows – a question like “how many units were sold in the technology category in 2010 on the US site?” will produce a query with a large table scan and a long delay to get the answer. Since the values are fixed every time the query is run, it makes sense to calculate and store those values for further usage. This technique is called Relational to Key-Value (K-V) processing. The process generates all of the dimension combinations and measured values shown in the example below, at the right side of the diagram. The middle columns of the diagram, from left to right, show how data is calculated by leveraging MapReduce for the large-volume data processing.

[Figure: relational to key-value (K-V) processing]
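
To make the pre-calculation idea concrete, here is a minimal Python sketch (my illustration, not Kylin code) that enumerates every cuboid for a handful of rows and sums a measure per dimension combination; the column names are invented.

    from itertools import combinations
    from collections import defaultdict

    def build_cuboids(rows, dimensions, measure):
        """Pre-calculate aggregates for every combination of dimensions."""
        cuboids = {}
        for r in range(len(dimensions) + 1):
            for dims in combinations(dimensions, r):
                agg = defaultdict(int)
                for row in rows:
                    key = tuple((d, row[d]) for d in dims)   # K-V key for this group
                    agg[key] += row[measure]
                cuboids[dims] = dict(agg)
        return cuboids

    # "How many units were sold in the Technology category in 2010 on the US site?"
    # becomes a single key lookup instead of a table scan.
    rows = [
        {"category": "Technology", "site": "US", "year": 2010, "units_sold": 3},
        {"category": "Technology", "site": "US", "year": 2010, "units_sold": 5},
        {"category": "Fashion", "site": "UK", "year": 2011, "units_sold": 2},
    ]
    cubes = build_cuboids(rows, ["category", "site", "year"], "units_sold")
    key = (("category", "Technology"), ("site", "US"), ("year", 2010))
    print(cubes[("category", "site", "year")][key])   # -> 8

Kylin does this at a vastly larger scale by distributing the per-cuboid aggregation across MapReduce jobs and storing the resulting key-value pairs in HBase.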

Kylin is based on this theory and is leveraging the Hadoop ecosystem to do the job for huge volumes of data:

  1. Read data from Hive (which is stored on HDFS)
  2. Run MapReduce jobs to pre-calculate
  3. Store cube data in HBase
  4. Leverage ZooKeeper for job coordination

Architecture

The following diagram shows the high-level architecture of Kylin.

[Figure: Kylin high-level architecture]

This diagram illustrates how relational data becomes key-value data through the Cube Build Engine offline process. The yellow lines also illustrate the online analysis data flow. Data requests can originate as SQL submitted from a SQL-based tool, or from third-party applications via Kylin’s RESTful services. The RESTful services call the Query Engine, which determines whether the target dataset exists. If so, the engine directly accesses the target data and returns the result with sub-second latency. Otherwise, the engine routes non-matching queries to a SQL-on-Hadoop engine on the Hadoop cluster, such as Hive.

Following are descriptions of all of the components the Kylin platform includes.

  • Metadata Manager: Kylin is a metadata-driven application. The Metadata Manager is the key component that manages all metadata stored in Kylin, including the most important cube metadata. All other components rely on the Metadata Manager.
  • Job Engine: This engine is designed to handle all of the offline jobs including shell script, Java API, and MapReduce jobs. The Job Engine manages and coordinates all of the jobs in Kylin to make sure each job executes and handles failures.
  • Storage Engine: This engine manages the underlying storage – specifically the cuboids, which are stored as key-value pairs. The Storage Engine uses HBase – the best solution from the Hadoop ecosystem for leveraging an existing K-V system. Kylin can also be extended to support other K-V systems, such as Redis.
  • REST Server: The REST Server is an entry point for applications to develop against Kylin. Applications can submit queries, get results, trigger cube build jobs, get metadata, get user privileges, and so on.
  • ODBC Driver: To support third-party tools and applications – such as Tableau – we have built and open-sourced an ODBC Driver. The goal is to make it easy for users to onboard.
  • Query Engine: Once the cube is ready, the Query Engine can receive and parse user queries. It then interacts with other components to return the results to the user.

In Kylin, we are leveraging an open-source dynamic data management framework called Apache Calcite to parse SQL and plug in our code. The Calcite architecture is illustrated below. (Calcite was previously called Optiq, which was written by Julian Hyde and is now an Apache Incubator project.)

[Figure: Apache Calcite architecture]

Kylin usage at eBay

At the time of open-sourcing Kylin, we already had several eBay business units using it in production. Our largest use case is the analysis of 12+ billion source records generating 14+ TB of cubes. Its 90th-percentile query latency is less than 5 seconds. Our use cases now target analysts and business users, who can easily access analytics and get results through a Tableau dashboard – no more Hive queries, shell commands, and so on.

What’s next

  • Support TopN on high-cardinality dimensions: The current MOLAP technology is not perfect when it comes to querying a high-cardinality dimension – such as TopN on millions of distinct values in one column. Similar to search engines (as many researchers have pointed out), an inverted index is a reasonable mechanism for pre-building such results.
  • Support Hybrid OLAP (HOLAP): MOLAP is great to serve queries on historical data, but as more and more data needs to be processed in real time, there is a growing requirement to combine real-time/near-real-time and historical results for business decisions. Many in-memory technologies already work on Relational OLAP (ROLAP) to offer such capability. Kylin’s next generation will be a Hybrid OLAP (HOLAP) to combine MOLAP and ROLAP together and offer a single entry point for front-end queries.

Open source

Kylin has already been open-sourced to the community. To develop and grow an even stronger ecosystem around Kylin, we are currently working on proposing Kylin as an Apache Incubator project. With distinguished sponsors from the Hadoop developer community supporting Kylin, such as Owen O’Malley (Hortonworks co-founder and Apache member) and Julian Hyde (original author of Apache Calcite, also with Hortonworks), we believe that the greater open-source community can take Kylin to the next level.

We welcome everyone to contribute to Kylin. Please visit the Kylin web site for more details: http://kylin.io.

To begin with, we are looking for open-source contributions not only in the core code base, but also in the following areas:

  1. Shell Client
  2. RPC Server
  3. Job Scheduler
  4. Tools

For more details and to discuss these topics further, please follow us on twitter @KylinOLAP and join our Google group: https://groups.google.com/forum/#!forum/kylin-olap

Summary

Kylin has been deployed in production at eBay and is processing extremely large datasets. The platform has demonstrated great performance benefits and has proved to be a more convenient way for analysts to leverage data on Hadoop using their favorite tools. We are pleased to open-source Kylin. We welcome feedback and suggestions, and we look forward to the involvement of the open-source community.

 

 


NoSQL Data Modeling

by Donovan Hsieh on 10/10/2014

in Data Infrastructure and Services

Data modeling for RDBMS has been a well-defined discipline for many years. Techniques like logical-to-physical mapping and normalization / de-normalization have been widely practiced by professionals and novice users alike. However, with the recent emergence of NoSQL databases, data modeling is facing new challenges to its relevance. Generally speaking, NoSQL practitioners focus on physical data model design rather than the traditional conceptual / logical data model process, for the following reasons:

  • Developer-centric mindset – With flexible schema (or schema-free) support in NoSQL databases, application developers typically assume data model design responsibility. They have been ingrained with the notion that the database schema is an integral part of application logic.
  • High-performance queries running in massive scale-out distributed environments – Contrary to traditional, centralized scale-up systems (including the RDBMS tier), modern applications run in distributed, scale-out environments. To accomplish scale-out, application developers are driven to tackle scalability and performance first through focused physical data model design, thus abandoning the traditional conceptual, logical, and physical data model design process.
  • Big and unstructured data – With its rigidly fixed schema and limited scale-out capability, the traditional RDBMS has long been criticized for its lack of support for big and unstructured data. By comparison, NoSQL databases were conceived from the beginning with the capability to store big and unstructured data using flexible schemas running in distributed scale-out environments.

In this blog post, we explore other important mindset changes in NoSQL data modeling: development agility through flexible schemas vs. database manageability; the business and data model design process; the role of RDBMS in NoSQL data modeling; NoSQL variations that affect data modeling; and visualization approaches for NoSQL logical and physical data modeling. We end the post with a peek into the NoSQL data modeling future.

Development agility vs. database manageability

One highly touted feature in today’s NoSQL is application development agility. Part of this agility is accomplished through flexible schemas, where developers have full control over how data is stored and organized in their NoSQL databases. Developers can create or modify database objects in application code on the fly without relying on DBA execution. The result is, indeed, increased application development and deployment agility.

However, the flexible schema is not without its challenges. For example, dynamically created database objects can cause unforeseen database management issues due to the lack of DBA oversight. Furthermore, unsupervised schema changes increase DBA challenges in diagnosing associated issues. Frequently, such troubleshooting requires the DBA to review application code written in programming languages (e.g., Java) rather than in RDBMS DDL (Data Definition Language) – a skill that most DBAs do not possess.

NoSQL business and data model design process

In old-school software engineering practice, sound business and (relational) data model designs are key to successful medium- to large-scale software projects. As NoSQL developers assume business / data model design ownership, another dilemma arises: data modeling tools. For example, traditional RDBMS logical and physical data models are governed and published by dedicated professionals using commercial tools, such as PowerDesigner or ER/Studio.

Given the nascent state of NoSQL technology, there isn’t a professional-quality data modeling tool for such tasks. It is not uncommon for stakeholders to review application source code in order to uncover data model information. This is a tall order for non-technical users such as business owners or product managers. Other approaches, like sampling actual data from production databases, can be equally laborious and tedious.

It is obvious that extensive investment in automation and tooling is required. To help alleviate this challenge, we recommend that NoSQL projects use the business and data model design process shown in the following diagram (illustrated with MongoDB’s document-centric model):

[Figure 1: NoSQL business and data model design process (illustrated with MongoDB)]

  • Business Requirements & Domain Model: At the high level, one can continue using database-agnostic methodologies, such as domain-driven design, to capture and define business requirements
  • Query Patterns & Application Object Model: After preliminary business requirements and the domain model are produced, one can work iteratively and in parallel to analyze top user access patterns and the application model, using UML class or object diagrams. With an RDBMS, applications can implement database access using either a declarative query (i.e., a single SQL statement with table joins) or a navigational approach (i.e., walking individual tables embedded in application logic). The latter approach typically requires an object-relational mapping (ORM) layer to facilitate tedious plumbing work. By nature, almost all NoSQL databases belong to the latter category. MongoDB can support both approaches through the JSON Document model, SQL-subset query, and comprehensive secondary indexing capabilities.
  • JSON Document Model & MongoDB Collection / Document: This part is where native physical data modeling takes place. One has to understand the specific NoSQL product’s strengths and weaknesses in order to produce efficient schema designs and serve effective, high-performance queries. For example, modeling social network entities like followed and followers is very different from modeling online blogging applications. As such, social networking applications are best implemented using Graph NoSQL databases like Neo4j, while online blogging applications can be implemented using other flavors of NoSQL like MongoDB.

RDBMS data modeling influence on NoSQL

Interestingly enough, old-school RDBMS data modeling techniques still play a meaningful role for those who are new to NoSQL technology. Using document-centric MongoDB as an example, the following diagram illustrates how one can map a relational data model to a comparable MongoDB document-centric data model:

[Figure 2: Mapping a relational data model to a MongoDB document-centric data model]
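
As a small, hypothetical illustration of the kind of mapping in Figure 2, the Python sketch below folds relational parent/child rows into a single embedded, MongoDB-style document; the table and field names are invented.

    def to_document(order_row, line_item_rows):
        """Map a relational order plus its line items to one embedded document.

        order_row      : dict of columns from a hypothetical 'orders' table
        line_item_rows : list of dicts from a hypothetical 'order_items' table
        """
        return {
            "_id": order_row["order_id"],
            "buyer": order_row["buyer_name"],
            "created": order_row["created_at"],
            # Child rows become an embedded array instead of a joined table.
            "items": [
                {"sku": li["sku"], "qty": li["qty"], "price": li["price"]}
                for li in line_item_rows
            ],
        }

    order = {"order_id": 1001, "buyer_name": "alice", "created_at": "2014-10-01"}
    items = [{"sku": "TV-42", "qty": 1, "price": 499.0},
             {"sku": "HDMI-CBL", "qty": 2, "price": 9.99}]
    doc = to_document(order, items)   # ready to insert into a MongoDB collection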

NoSQL data model variation

In the relational world, logical data models are reasonably portable among different RDBMS products. In a physical data model, design specifications such as storage clauses or non-standard SQL extensions might vary from vendor to vendor. Various SQL standards, such as SQL-92 and the latest SQL:2008 as defined by industry bodies like ANSI/ISO, can help application portability across different database platforms.

However, in the NoSQL world, physical data models vary dramatically among different NoSQL databases; there is no industry standard comparable to SQL-92 for RDBMS. Therefore, it helps to understand key differences in the various NoSQL database models:

  • Key-value stores – Collections composed of unique keys, each having 1-n valid values
  • Column families – Distributed data stores in which a column consists of a unique key, values for the key, and a timestamp differentiating current from stale values
  • Document databases – Systems that store and manage documents and their metadata (type, title, author, creation/modification/deletion date, etc.)
  • Graph databases – Systems that use graph theory to represent and store data as nodes (people, business, accounts, or other entities), node properties, and edges (lines connecting nodes/properties to each other)

The following diagram illustrates the comparison landscape based on model complexity and scalability:

[Figure 3: NoSQL database models compared by complexity and scalability]

It is worth mentioning that for NoSQL data models, a natural evolutionary path exists from simple key-value stores to the highly complicated graph databases, as shown in the following diagram:

[Figure 4: Evolution from simple key-value stores to graph databases]

NoSQL data model visualization

For conceptual data models, diagramming techniques such as the Entity Relationship Diagram can continue to be used to model NoSQL applications. However, logical and physical NoSQL data modeling requires new thinking, due to each NoSQL product assuming a different native structure. One can intuitively use any of the following three visualization approaches, using a document-centric data model like MongoDB as an example:

  • Native visual representation of MongoDB collections with support for nested sub-documents (see Figure 2 above)

Pros – It naturally conveys a complex document model through an intuitive visual representation.
Cons – Without specialized tool support, visualization results in ad-hoc drawings using non-uniform conventions or notations.

  • Reverse engineering selected sample documents using JSON Designer (see Figure 5 below)

Pros – It can easily reverse engineer a hierarchical model into a visual representation from existing JSON documents stored in NoSQL databases like MongoDB.
Cons – As of this writing, JSON Designer is available only on iPhone / iPad. Furthermore, it does not include native DB objects, such as MongoDB indexes.

[Figure 5: Document model reverse-engineered in JSON Designer]

  • Traditional RDBMS data modeling tools like PowerDesigner (see Figure 6 below)

Pros – Commercial tools support is available.
Cons – It requires tedious manual preparation and diagram arrangement to represent complex and deeply nested document structures.

[Figure 6: Document model represented in PowerDesigner]

In a future post, we’ll cover specific data model visualization techniques for other NoSQL products such as Cassandra, which is based on the Column Family structure.

New NoSQL data modeling opportunities

Like any emerging technology, NoSQL will mature as it becomes mainstream. We envision the following new data modeling opportunities for NoSQL:

  • Reusable data model design patterns (some product-specific and some agnostic) to help reduce application development effort and cost
  • Unified NoSQL model repository to support different NoSQL products
  • Bi-directional, round-trip engineering support for (data) model-driven design processes and tools
  • Automated data model extraction from application source code
  • Automated code-model-data consistency validation and consistency conformance metrics
  • Strong control for application / data model change management, with proactive tracking and reconciliation between application code, embedded data models, and the actual data in the NoSQL databases

