Many database performance decisions come down to "pay me now or pay me later." Some decisions will produce faster inserts and updates at the cost of slower and more complex reads, while other decisions will slow down inserts and updates but provide faster and easier reads.
There are links to related essays on normalization and denormalization at the bottom of this post.
Paying With Indexes
An index speeds up SELECT operations. We are not going to go into any detail about how indexes work, this week we will stick to how indexes affect performance.
Imagine a table of 10,000 sales orders. You wish to pull out a handful of fields for orders placed on 5/1/08, so you issue this SELECT:
SELECT customer,order_total FROM orders WHERE date = '2008-05-01'
If you do not have an index on the table, then the database server will have to scan every single row in the table to find the rows that match the WHERE clause. On the other hand, if you had an index on the date column, the server would first read the index to find pointers to the correct rows, and then read only the rows you needed. The index itself is optimized by various methods so that only a very few reads are necessary to find the correct values. Most databases support the following syntax:
CREATE INDEX some_unique_name ON orders (date)
As far as performance goes, an index will slow down write operations (INSERT, UPDATE, and DELETE) because the index must be updated when the write operation occurs. This cost is on top of the write operation itself. (Addendum added July 7: Jochen points out correctly in his comment that this statement oversimplifies things. While it is true that the index must be updated for writes, the index can also dramatically speed up UPDATE and DELETE operations if those operations use a WHERE clause that can benefit from the index.)
In terms of "pay me now or pay me later", when you regularly add a lot of indexes you are opting to "pay me now." You pay the price of slower writes to get faster reads. If you regularly avoid adding any indexes you are opting to "pay me later." You defer the costs of access to read time to get faster writes.
I should note that it is not possible to completely avoid indexes, nor is there any value in trying to. For instance, a primary key requires an index because otherwise you have to scan the entire table every time you do an INSERT, which is just plain crazy. Foreign keys benefit from indexes as well for similar reasons.
Paying With Views
A "view" is a stored SQL statement that you can SELECT from as if were a table. Imagine we have a table of TEACHERS and a table of COURSES that they are teaching in a particular year. We often need to display a list of courses with the names of the teachers. We can do this with a JOIN, but a view gives us an easier pre-defined way to do this:
CREATE VIEW courses_teachers AS SELECT courses.room,courses.period,courses.teacher ,courses.year ,teacher.first_name,teacher.last_name FROM courses JOIN teachers ON courses.teacher = teachers.teacher
...which now lets you do the easier SELECT:
SELECT * FROM courses_teachers WHERE year='2008';
In terms of "pay me now or pay me later" a view is always a "pay me later" decision. It makes for easier coding but the server must go out on every SELECT and gather together the data required.
The "pay me later" nature of a VIEW meets its greatest extreme when the view contains aggregations. Consider the following view which gives you easy access to customers and their lifetime history of orders and payments:
CREATE VIEW customers_extended AS SELECT customers.* ,SUM(orders.order_total) as orders_total ,SUM(invoices.balance) as balance FROM customers JOIN orders ON customers.customer = orders.customer JOIN invoices ON customers.customer = invoices.customer -- Pulling from the VIEW requires a complete read -- of relevant ORDERS and INVOICES tables SELECT * FROM customers_extended WHERE customer = X;
This view is a "pay me later" proposition because every time you issue a SELECT from the view, it will have to scan many rows from the ORDERS and INVOICES tables. The contrasting method is to denormalize which is a "pay me now" approach.
Paying With Denormalization
Denormalizing means taking a normalized database and deliberately inserting redundant values. I have an essay on the three Denormalization Patterns that I use myself, which follow these three forms:
- FETCH operations, where a value such as an item's price is copied from the ITEMS table into the ORDER_LINES table.
- EXTEND operations, where you take the QTY and the PRICE columns in the ORDER_LINES table and write the EXTENDED_PRICE.
- AGGREGATE operations, such as writing the total of ORDER_LINES onto the ORDERS table.
All of these operations fall into the "pay me now" category. When these denormalized columns are put into tables, they add to the the size of the table and increase the cost of write operations. However, when it comes time to SELECT out of the tables the values are all there ready to go, usually with fewer JOINs and lower overall disk activity.
Extreme Pay Me Now
In my line of work I deal with line-of-business programs that are commissioned or purchased by businesses to do their daily work. User counts are low and resources are high, because often I will have 10 users on a single server, with access via internet limited to only a few thousand potential customers of which very few are ever on at the same time.
In this context, I prefer to take the "pay me now" approach to its fullest realization. This means I tend to design my systems so that:
- Any column a user is likely to filter on has an index.
- Tables are fully denormalized, containing a wealth of derived values.
This means that all write operations on my systems are slower than they might otherwise be. However, this is more than acceptable within this context because the server is largely untaxed, and users do not notice the difference between 100ms and 200ms to save a row. So I can pay when the user does not notice and as reward I have very rich reporting and lookup abilities.
The extreme pay-me-now approach has one more advantage. The wealth of derived values in the database lets end-users find what they are looking for without calling a programmer and asking for a special page or report. Generally the more derived values there are the truer this becomes.
Extreme Pay Me Later
The extreme form of pay-me-later is a fully normalized database with no derived values and a minimum of indexes. Calculated values are available either in views, client-side code or both. This type of database is tuned for lots of fast writes because the cost of an INSERT or UPDATE has been kept to an absolute minimum. The database will be slower to perform ad-hoc or one-off queries because the server will have to do table scans whenever a user filters on anything except primary keys and foreign keys.
The lack of derived values in fully normalized databases also leads to more phone calls and emails asking the programmer to create a report or page that will work out derived values that are not present in the database.
Conclusion: Know Your Context
This week we have taken common database technologies such as indexes and views and seen how they affect performance. All of these technologies can be judged in terms of the "pay me now or pay me later" decision.
Database programmers normally choose to "pay me later" when they must support a large number of simultaneous write operations with a minimum of contention. These situations call for fewer indexes and strict normalization. The trade-off is that ad-hoc or one-off queries will involve more JOINs, more table scans and an increased likelihood the programmer will be called in for special cases.
When read operations are more common than writes, or when inquiries and reports are likely to be unpredictable, database programmers will choose to "pay me now" by doing more work on the write operation. There will be more indexes and more denormalized values, so that the user is more likely to quickly locate whatever they want without programmer intervention.
The normalization essays on this blog are:
- Revisiting Normalization and Denormalization (this essay).
- Pay Me Now Or Pay Me Later
- The Argument for Normalization
- First Normal Form
- Second Normal Form
- Third Normal Form and Calculated Values
- The Argument for Denormalization
- Denormalization Patterns
- Keeping Denormalized Values Correct
- Triggers, Encapsulation and Composition
- The Data Dictionary and Calculations, Part 1
- The Data Dictionary and Calculations, Part 2