Sunday, July 6, 2008

Database Performance: Pay Me Now or Pay Me Later

Many database performance decisions come down to "pay me now or pay me later." Some decisions will produce faster inserts and updates at the cost of slower and more complex reads, while other decisions will slow down inserts and updates but provide faster and easier reads.

There are links to related essays on normalization and denormalization at the bottom of this post.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Paying With Indexes

An index speeds up SELECT operations. We are not going to go into any detail about how indexes work, this week we will stick to how indexes affect performance.

Imagine a table of 10,000 sales orders. You wish to pull out a handful of fields for orders placed on 5/1/08, so you issue this SELECT:

SELECT customer,order_total 
  FROM orders
 WHERE date = '2008-05-01'

If you do not have an index on the table, then the database server will have to scan every single row in the table to find the rows that match the WHERE clause. On the other hand, if you had an index on the date column, the server would first read the index to find pointers to the correct rows, and then read only the rows you needed. The index itself is optimized by various methods so that only a very few reads are necessary to find the correct values. Most databases support the following syntax:

CREATE INDEX some_unique_name ON orders (date)

As far as performance goes, an index will slow down write operations (INSERT, UPDATE, and DELETE) because the index must be updated when the write operation occurs. This cost is on top of the write operation itself. (Addendum added July 7: Jochen points out correctly in his comment that this statement oversimplifies things. While it is true that the index must be updated for writes, the index can also dramatically speed up UPDATE and DELETE operations if those operations use a WHERE clause that can benefit from the index.)

In terms of "pay me now or pay me later", when you regularly add a lot of indexes you are opting to "pay me now." You pay the price of slower writes to get faster reads. If you regularly avoid adding any indexes you are opting to "pay me later." You defer the costs of access to read time to get faster writes.

I should note that it is not possible to completely avoid indexes, nor is there any value in trying to. For instance, a primary key requires an index because otherwise you have to scan the entire table every time you do an INSERT, which is just plain crazy. Foreign keys benefit from indexes as well for similar reasons.

Paying With Views

A "view" is a stored SQL statement that you can SELECT from as if were a table. Imagine we have a table of TEACHERS and a table of COURSES that they are teaching in a particular year. We often need to display a list of courses with the names of the teachers. We can do this with a JOIN, but a view gives us an easier pre-defined way to do this:

CREATE VIEW courses_teachers AS
SELECT courses.room,courses.period,courses.teacher
      ,courses.year
      ,teacher.first_name,teacher.last_name
  FROM courses 
  JOIN teachers ON courses.teacher = teachers.teacher

...which now lets you do the easier SELECT:

SELECT * FROM courses_teachers WHERE year='2008';

In terms of "pay me now or pay me later" a view is always a "pay me later" decision. It makes for easier coding but the server must go out on every SELECT and gather together the data required.

The "pay me later" nature of a VIEW meets its greatest extreme when the view contains aggregations. Consider the following view which gives you easy access to customers and their lifetime history of orders and payments:

CREATE VIEW customers_extended AS
SELECT customers.*
      ,SUM(orders.order_total) as orders_total
      ,SUM(invoices.balance)   as balance
  FROM customers
  JOIN orders    ON customers.customer = orders.customer
  JOIN invoices  ON customers.customer = invoices.customer
  
-- Pulling from the VIEW requires a complete read
-- of relevant ORDERS and INVOICES tables
SELECT * FROM customers_extended
 WHERE customer = X;

This view is a "pay me later" proposition because every time you issue a SELECT from the view, it will have to scan many rows from the ORDERS and INVOICES tables. The contrasting method is to denormalize which is a "pay me now" approach.

Paying With Denormalization

Denormalizing means taking a normalized database and deliberately inserting redundant values. I have an essay on the three Denormalization Patterns that I use myself, which follow these three forms:

  • FETCH operations, where a value such as an item's price is copied from the ITEMS table into the ORDER_LINES table.
  • EXTEND operations, where you take the QTY and the PRICE columns in the ORDER_LINES table and write the EXTENDED_PRICE.
  • AGGREGATE operations, such as writing the total of ORDER_LINES onto the ORDERS table.

All of these operations fall into the "pay me now" category. When these denormalized columns are put into tables, they add to the the size of the table and increase the cost of write operations. However, when it comes time to SELECT out of the tables the values are all there ready to go, usually with fewer JOINs and lower overall disk activity.

Extreme Pay Me Now

In my line of work I deal with line-of-business programs that are commissioned or purchased by businesses to do their daily work. User counts are low and resources are high, because often I will have 10 users on a single server, with access via internet limited to only a few thousand potential customers of which very few are ever on at the same time.

In this context, I prefer to take the "pay me now" approach to its fullest realization. This means I tend to design my systems so that:

  • Any column a user is likely to filter on has an index.
  • Tables are fully denormalized, containing a wealth of derived values.

This means that all write operations on my systems are slower than they might otherwise be. However, this is more than acceptable within this context because the server is largely untaxed, and users do not notice the difference between 100ms and 200ms to save a row. So I can pay when the user does not notice and as reward I have very rich reporting and lookup abilities.

The extreme pay-me-now approach has one more advantage. The wealth of derived values in the database lets end-users find what they are looking for without calling a programmer and asking for a special page or report. Generally the more derived values there are the truer this becomes.

Extreme Pay Me Later

The extreme form of pay-me-later is a fully normalized database with no derived values and a minimum of indexes. Calculated values are available either in views, client-side code or both. This type of database is tuned for lots of fast writes because the cost of an INSERT or UPDATE has been kept to an absolute minimum. The database will be slower to perform ad-hoc or one-off queries because the server will have to do table scans whenever a user filters on anything except primary keys and foreign keys.

The lack of derived values in fully normalized databases also leads to more phone calls and emails asking the programmer to create a report or page that will work out derived values that are not present in the database.

Conclusion: Know Your Context

This week we have taken common database technologies such as indexes and views and seen how they affect performance. All of these technologies can be judged in terms of the "pay me now or pay me later" decision.

Database programmers normally choose to "pay me later" when they must support a large number of simultaneous write operations with a minimum of contention. These situations call for fewer indexes and strict normalization. The trade-off is that ad-hoc or one-off queries will involve more JOINs, more table scans and an increased likelihood the programmer will be called in for special cases.

When read operations are more common than writes, or when inquiries and reports are likely to be unpredictable, database programmers will choose to "pay me now" by doing more work on the write operation. There will be more indexes and more denormalized values, so that the user is more likely to quickly locate whatever they want without programmer intervention.

Related Essays

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

The normalization essays on this blog are:

7 comments:

Beth Breidenbach said...

Great post. I often find that app developers want a "pay me later" approach -- leaving data querying and manipulation as an afterthought. Sometimes this is a perfectly valid approach, sometimes not.

The most extreme case was a recent scenario in which a developer wanted to pare a fairly complex app down to 4 tables -- each holding a hierarchical xml blob. You can imagine how functional that was for downstream reporting....

Anonymous said...

Your comment -

>An index speeds up SELECT operations.

Isn't strictly correct, I would state it as -

>An index *may* speed up a SELECT operation.

Since clearly -

1) The index might not be used
2) Using the INDEX might actually be slower than doing a FULL TABLE SCAN operation.

KenDowns said...

Anonymous: not a very enlightening comment. The original statement stands: indexes speed up SELECT statements, that's what they do. The next paragraph illustrates exactly how this works from the application viewpoint.

Jochen said...

Your over-simplifying things a bit with your "indexes help selects but hurt insert/update/delete".

If your indexes are well chosen, they also speed up your updates and deletes - under the assumption that you don't delete or update the full table all the time, but rather a few rows. And there an index on the proper SARGs will help finding the rows to be updated.

KenDowns said...

Jochen: good point, I have added an addendum to the OP.

Crystal said...

I think you oversimplified views by placing them only in the "pay me later" category. You are overlooking materialized views, which can provide "pay me now" denormalized storage.

KenDowns said...

Crystal: In the context of the post, a non-materialized
traditional view is a pay-me-later approach.

Materialized Views are not natively supported (to my knowledge as of July 2008) in MySQL or PostgreSQL, and as I am not currently using SQL Server or Oracle I prefer not to comment overmuch on their support.

That being said, materialized views appear to give you some fine control over the pay-me-now or pay-me-later decision by letting you control when the view is refreshed. This is nice, and it would be nice if support were more widespread in open source db's.