Sunday, March 2, 2008

The Requirements Are Always Wrong, Or, Iterative Database Development

Welcome to the Database Programmer, the blog for anybody who wants to learn the practical realities of working with databases.

There is a new entry every Monday morning, and the complete table of contents is here. This week we are talking about tools again, specifically upgrades.

Requirements and Changes

We all write programs for other people to use, a programmer without his users is like a musician without an audience. So it goes without saying that we need to know what our customers want.

The requirements we are given will never be right. We are destined to get requirements that are incomplete, incorrect, contradictory, and impossible. We cannot control this reality, we can only control our response to it.

Most programming theory is at its base a response to this reality, so that you have the "waterfall" crowd trying to get everything right before you start, the CMM crowd trying to figure out exactly how long it will take, and the "agile" crowd saying you'll never get it right so get started and modify as you go.

On the database side we do not have such a wealth of theory on how to deal with mistakes (and changes). This may be because the trend these days is for theories and frameworks that preach a gleeful ignorance of database principles, or it may be because SQL/Relational databases were already firmly established before the explosion of the web. But whatever the reason, we have a general need for a basic review of how to build your database when you know that the requirements you are given will be full of holes.

The Perfect Requirement

Before we look at best practices, we will spend a few paragraphs looking at why the requirements will never be right. For one particular requirement to go completely right, each of these seven things numbered events below must happen:

  1. The customer herself must understand the requirement.
  2. The customer must consider the requirement worth mentioning.
  3. The customer must state the requirement correctly.
  4. The requirement must get to the programmer without corruption.
  5. The programmer must understand it.
  6. The programmer must design tables correctly.
  7. The programmer must correctly write supporting code.

These are not impossible, but since human beings are not perfect, each step will be wrong some percent of the time, and sometimes more than one will be wrong, and so on.

The last 3 all involve programmer error, and they all point to the need for your development process to have some iterative element, because it is inevitable that somebody will be going back and repeating some work.

Steps 1 and 3 involve customer error. If you are lucky the error will be bad enough to be noticeable, so that you can go back to the customer and ask about it before any tables are built or code is written. In any event, dealing with these problems is much more about diplomacy than technical knowledge. On the plus side, since these things are inevitable, if you build your diplomatic skills you will win a lot of good will from your customers, which is worth far more than technical skills.

Item 4 deserves some special mention. Item 4 is more likely to get screwed up the larger your organization is. Some schools of thought recognize this and preach a super-close connection between customers and programmers, though this is often very difficult to get across to managers who believe that layers of communication provide protection against mistakes.

Item 2 is probably the worst. This gets messed up when some requirement is so much a part of a customer's experience that it would never occur to them to tell you, anymore than they would include "remember to keep breathing" in instructions on how to setup a home theatre system. (See the earlier essay on Limited Transactions for an example).

It Only Gets Worse

The seven items listed above are the simplest possible list of requirements. In a larger company there will be intermediate steps, each of which is usually meant to introduce oversight but which usually introduces more errors. This can only make things worse for the requirements.

Moreover, the guiding philosophy of the technical manager(s) can introduce serious systematic error. For instance, certain fashionable ideas like "Table structure is not that important, just get the code right first" can take hold of a project and cause huge amounts of wasted effort because it just ain't so.

There are lots more problems like prima donna programmers, staff changes, politics, and so on. There is no end to the number of factors that can interfere with the correct articulation, communication, and implementation of requirements.

Iterative Database Development

The experience of decades has led again and again to the conclusion that software development is iterative, you will always go back and change the code. There is no such thing as the finished program (except maybe for PING), and there is rarely any such thing as the bug-free program (again except maybe for PING).

But how exactly do we do iterative develpment with a database? That is what I would like to address now.

We have to begin by looking at the kinds of changes you can make to a database.

  1. Trivial Changes Like widening a column
  2. Additons such as new columns and new tables.
  3. Tangled and Difficult changes like correcting a structure mistake and copying and modifying data into new locations.

Any sensible person is going to try to avoid #3, and by contrast, changes of type #1 are no problem if we have tools that support structure changes. This leaves us with the surprising conclusion that a strategy for iterative database development is to spend your time making additions. When requirements are found to be wrong, incomplete, contradictory, or in any other way messed up, the ideal situation is to add a column or two, add a table or two, and add some code, without ever having to touch the existing code.

We will now see how to do this.

Identify Master Tables and Master Transactions

Every program has a core purpose, and at the center of this purpose will be master entities that have to be tracked. For an online store these will be at the very least customers, items, and a cart (header and lines). For the school management program that I have used in these essays, these are teachers, students, courses, and course enrollments.

When you are strongly confident that you have a few master entities identified, you can start to frame your efforts around them. You can ask for clarification of customer requirements in terms of these master entities by asking, "Is this a property of the customer?" or "Can a student have more than one major?"

The deliverable at this stage is skeleton tables, primary keys and foreign keys. That is, to identify the central entities and their relationship to each other. This effort is not so much about identifying the attributes or properties of these entities, but simply the entities themselves. Once you have identified the entities, you will basically spend the the rest of the project adding columns (and code) as the requirements are worked through. This is the essence of doing iterative development as a series of additions.

The great thing about this approach is that if you miss some major table you rarely have to perform major surgery on the work you have already done. If further discussions reveal a central entity that was missed before, just make a new table for it. The amazing and cool thing about this is that you rarely have to change the existing tables you have already worked up, except perhaps to add a new foreign key.

Translate Features Into Columns and Tables

Many features can be resolved into columns once you have the main tables worked out. For instance, on a school system the specifications may say "every student must choose a major." A seasoned database veteran immediately recognizes this as a foreign key. There is a table of majors (which is definitely related to the table of departments somehow) and a "MAJOR" column in the students table.

In terms of iterative development, we can see that it should be fairly painless to add in a system for majors without disturbing the rest of the system. We add a table of MAJORS (probably with a foreign key to the table of departments), and add a MAJOR column to the STUDENTS table as a foreign key to the MAJORS table.

To repeat the main theme, the last thing we want to do is get stuck massaging data or moving it from one table to another on a live system. By far the much easier route is to simply add tables or columns. If you can learn to think of features in terms of the underlying tables and columns, then you can move towards always making additions, which are far easier.

Identify Inert Columns

I like to use the term "inert" to refer to columns that are not involved in any business rules but must be printed out or displayed from time to time. The easiest example would be a student's mailing address. This has nothing to do with her major, her grades, or her courses, but must be printed occassionally for mailings or on paperwork.

The main point with inert columns is that you can add them in whenever you want without affecting the rest of the system. When the requirements omit these features they are the easiest to put in after the fact.

Go For Maximum Detail

Imagine you are asked to make a simple Business-to-Business system in which customers are shipped their goods and billed later. The requirements say only that "customers often pay more than one invoice with a single check and this should be easy for the user to process."

So you know already that you have a table of customers, and a table of invoices. Now we have to make a table of payments. Here is the killer question: does the table of payments link to invoices or to customers?

If we link payments to customers then we have the easiest system for the user, they just enter the customer and the check and we are done. By contrast, linking the payments to the invoices is much harder because we have to actually create two tables, a table of PAYMENTS and a table of SPLITS, which details how a payment is split out to apply to various payments. Worse, we probably have to add some special user interface screens to make it easy for the user to drag-n-drop or otherwise enter a payment and apply it to their open invoices.

However, the right thing to do is to go for maximum detail, and apply the payments to the invoices. There are two major reasons. First, it is more faithful record keeping. If you do not do it then the customer will sooner or later ask for the details you have not provided. Second, horror of horrors, if you have a live system and have been applying payments to customers, then when you go over to the other system you will have to move data around and, worse, you will have to guess at how to apply already entered data. This is why tables should always be created with maximum detail in terms of primary and foreign keys.

Everything Else

We have now seen a basic outline of how to do database design in such a way that you can do it iteratively by making mostly additions and avoiding painful data re-arrangments. There are more details than will fit into a single blog entry, but hopefully this will give the basic idea.

Next week we will begin to see query building. I had originally expected to do calculated values first, but upon reflection realized that queries should come first, because query needs often determine strategies for putting in calculated values.

Next Essay: Introduction To Queries


Phu Son said...

Great tips for designing new databases! Thanks.

Bruno Martinez said...

I find the customers' appetite for inert columns is endless. I'm thinking of storing these as a single blob or xml. I would love to see a blog about this subject.

KenDowns said...

Bruno: don't do it!!! That violates 1st normal form and means your code will get very complicated.

Inert fields are the easiest possible to deal with, so just stick 'em in and forget about them.

Also keep in mind that sometimes a column you think is inert suddenly becomes involved in logic. If you've trapped it in some blob now you've got to pull it out during an upgrade (that's that change type #3 we try to avoid) and make a column anyway.

Anonymous said...

I just stumbled upon this great series and am now working my way through all the articles. It's very gracious of you to share all your knowledge and experience with the rest of us. Thank you so much!

A quick question regarding your invoice payments example. Is it really necessary to have two tables? I could imagine a simple solution being:

invoice_payment table:
customer (PK) (FK to customer table)
invoice (PK) (FK to invoice table)
payment (PK)

I suppose we could have a separate table that just holds individual payments and the total payment amount (in which case the 'customer' and 'payment' columns in the table above would be a FK to this payment table). But how would you enforce that the split amounts must add up to the total payment amount? I guess you would also need to enforce that each split amount equals the outstanding balance for the invoice (unless you allow invoices to be paid off in multiple payments .... I guess this isn't so simple!).