A Middle Approach to Schema Design in OLTP Applications

eBay is experiencing phenomenal growth in the transactional demands on our databases, in no small part due to our being at the forefront of mobile application development. To keep up with such trends, we continually assess the design of our schemas.

Schema design is a logical representation of the structures used to store the data that applications produce and consume.  Given that database resources are finite, execution times for transactions can vary wildly as those transactions compete for the resources they require. As a result, schema design is the most essential part of any application development life cycle. This blog post covers schema design for online transaction processing applications and recommends a specific approach.

Unfortunately, there is no predefined set of rules for designing databases in an efficient manner, but there can certainly be a defined design process that achieves that outcome. Such a design process includes, but is not limited to, the following activities:

  1. determining the purpose of the database
  2. gathering the information to be recorded
  3. dividing the  information items into major entities
  4. deciding what information needs to be stored
  5. setting up relationships between entities
  6. refining the design further

Historically, OLTP systems have relied on a systematic way of ensuring that a database structure is suitable for general-purpose querying and is free of certain undesirable characteristics – insertion, update, and deletion anomalies– that could lead to loss of data integrity. A highly normalized database offers benefits such as minimizing redundancy, freeing up the relations from undesired insertion, update and deletion dependency, data consistency within the database, and a much more flexible database design.

But as they say, there are no free lunches. A normalized database exacts the price of inserting into multiple tables and reading by way of joining multiple tables. Normalization involves design decisions that are likely to cause reduced database performance. Schema design requires keeping in mind that when a query or transaction request is sent to the database, multiple factors are involved, such as CPU usage, memory usage, and input/output (I/O). Depending on the use case, a normalized database may require more CPU, memory, and I/O to process transactions and database queries than does a denormalized database.

Recent developments further compound schema design challenges. With increasing competition and technological advances such as mobile web applications, transactional workload on the database has increased exponentially. As the competitor is only a click away, an online application's valuable users must be ensured consistently good performance via QoS and transaction prioritization. Schema design for such applications cannot be merely focused on normalization; performance and scalability are no less important.

For example, at eBay we tried a denormalized approach to improve our Core Cart Service’s DB access performance specifically for writes. We switched to using a BLOB-based cart database representation, combining 14 table updates into a single BLOB column. Here are the results:

  • The response time for an “add to cart” operation improved on average 30%. And in use cases where this same call is made against a cart that already contains several items (>20), the performance at 95 percentile has improved by 40% .
  • For the “create cart” operation, total DB call time for the worst case was improved by approximately 50% due to significant reduction in SQL counts.
  • Parallel transaction DB call times improved measurably for an average use case.

These realities do not imply that denormalization is in any way a blessing. There are costs to denormalization. Data redundancy is increased in a denormalized database. This redundancy can improve performance, but it also requires more extraneous efforts to keep track of related data. Application coding can create further complications, because the data is spread across various tables and may be more difficult to locate. In addition, referential integrity is more of a chore, because related data is divided among a number of tables.

There is a happy medium between normalization and denormalization, but both require a thorough knowledge of the actual data and the specific business requirements. This happy medium is what I call “the medium approach.” Denormalizing a database is the process of taking the level of normalization within the database down a notch or two. Remember, normalization can provide data integrity, which is the assurance of consistent and accurate data within a database, but at the same time it can slow performance because of its frequently occurring table join operations.

There is an old proverb:  “normalize until it hurts, denormalize until it works.” One has to land in the middle ground to get all of the goodies of these two different worlds.

4 thoughts on “A Middle Approach to Schema Design in OLTP Applications

  1. suchi banerjee

    Good article!

    Quite often students asks this question, how far we should normalize or when to de-normalize?
    Essentially Schema Design and SQL scripting are tightly coupled, so a change in Schema impacts all existing scripts. View helps to a certain extent but again cost benefit depends on a specific use case.

  2. Pingback: My links of the week – October 27, 2013 | R4

  3. Harold Ames

    Thank you for sharing this information. If you adopt an object-oriented approach to schema design, then make sure that you do not lose the flexibility of the relational storage model. In many cases, the object-oriented approach to schema design ends up in a heavily denormalized data structure that requires considerable maintenance and REF pointers associated with objects. Often, these designs represent a step backward to the hierarchical and network database designs that were replaced with the relational storage method. In summary, if you are storing your data in your database for the long-term and you anticipate a degree of ad hoc queries or application development on the same schema, then you will probably find that the relational storage method gives the best performance and flexibility.


Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>