Monday, December 24, 2007

Database Skills: Second Normal Form

The most basic skill that you need for successful database projects is table design. Strong table design leads directly to efficient and clean code, while weak table design leads in the opposite direction, code that is hard to write and hard to maintain.

This essay is part of the Keys, Normalization and Denormalization series. There is also a Complete Table of Contents and a Skills-oriented Table Of Contents.

The Example

Our example is a school that has different kinds of schedules on different days. Periods may be 35, 40, or 45 minutes, and there may or may not be an assembly. Our assignment is to create a page that displays the periods for a day.

Nobody Argues About Second Normal Form

This week's essay is going to be rather short for two reasons. The first reasons is that I have never really heard anybody argue against second normal form, and so there is no real debate for me to review. In fact, from my own personal experience I would say that Second Normal Form is possibly the least controversial concept in computer science.

The second reason this essay will be short is that I have no real-world examples of violations that appear to make sense, so I have to make up the example. Many people will put in repeating groups (violating first normal form) and give you a determined argument claiming they are doing the Right Thing, but nobody has done that to me with second normal form. So I had to make something up to demonstrate what 2nd Normal Form is.

Prelude: Multi-Column Primary Key

Before we get to second normal form we have to talk very briefly about multi-column primary keys. A multi-column primary key is exactly what it sounds like, a primary key that has more than one column.

Our example from prior essays is a table of periods with a primary key on DAY_TYPE and PERIOD. No combination of these columns can appear twice. So in the table below the combinations in the first two columns must be unique:

DAY_TYPE  |  PERIOD | SEQUENCE  |  START    |   END
----------+---------+-----------+-----------+----------
NORMAL    |  HOME   |   0       | 8:00 am   |  8:10 am
NORMAL    |  1      |   1       | 8:15 am   |  9:00 am
NORMAL    |  2      |   2       | 9:05 am   |  9:50 am   

Second Normal Form

Second normal form is violated if you have a column that depends on only a part of the primary key. In this case it would mean a column that depended only on DAY_TYPE or PERIOD. This example has two violations of second normal form:

DAY_TYPE  |  DAY_DESC     | PERIOD   | DESCRIPTION | SEQUENCE  |  START    |   END
----------+---------------+----------+-------------+-----------+-----------+----------
NORMAL    |  Normal       | HOME     | Homeroom    |  0        | 8:00 am   |  8:10 am
NORMAL    |  Normal       | 1        | Period 1    |  1        | 8:15 am   |  9:00 am
NORMAL    |  Normal       | 2        | Period 2    |  2        | 9:05 am   |  9:50 am   
...
...more rows for type NORMAL
...
ALONG     | Long Assembly | HOME     | Homeroom    |  0        | 8:00 am   |  8:10 am
ALONG     | Long Assembly | 1        | Period 1    |  1        | 8:15 am   |  9:00 am
ALONG     | Long Assembly | 2        | Period 2    |  2        | 9:05 am   |  9:50 am   

Getting to Second Normal Form

In the example above the column DAY_DESC always has the same value for a given value of DAY_TYPE. DAY_DESC depends on only part of the key (which is DAY_TYPE + PERIOD), and so it is a violation of second normal form.

The exact same error occurs with DESCRIPTION, which is always the same for a given value of PERIOD. This value depends on only part of the key, and so it is a violation of second normal form.

The fix in both cases is to create parent tables to hold these values. The primary key of the parent table is one of the columns from the original table. Here are the two tables we create when we normalize the table from above:

DAY_TYPES:

DAY_TYPE          |   DESCRIPTION
------------------+--------------------------------
NORMAL            |   Normal Schedule 
ALONG             |   Long Assembly
ASHORT            |   Short Assembly
ALUNCH            |   Short Assembly After Lunch

PERIODS:
    
PERIOD       |   DESCRIPTION
-------------+-----------------------------
HR           |   Homeroom
1            |   Period 1
2            |   Period 2
3            |   Period 3
4            |   Period 4
LUNCH        |   Lunch
6            |   Period 6
7            |   Period 7
8            |   Period 8
ASSM         |   Assembly

Why Normalize?

Because this is a short essay this week, now would be a good time to review the basic reasons why we normalize.

In a normalized database, every distinct fact is stored in exactly one location, and each location stores exactly one fact (where a "location" means the value of one column in one particular row). This goal itself serves the goal of avoiding something that the database theorists call "anomalies".

In database terms, an "UPDATE anomaly" occurs when you issue a reasonable update statement to change something, but in fact the same information is stored somewhere else as well, and your two locations are now out of sync, your database has become corrupted. An "INSERT anomaly" and a "DELETE anomaly" are similar.

What this means to us programmers is that when a database is not normalized we end up with a significant burden trying to keep the data correct. We waste a lot of time fixing data on live systems (admit it, you've done it!) or adding code trying to prevent the problem. Many of those problems would be avoided completely if the data were normalized.

Now, as always, programmer who do not know databases well will argue that a system of fewer tables is "simpler", because he only needs one table instead of three, and he does not need to use JOINs when querying, or his ORM system is easier to write. This may be true, but he now has the burden of making sure that when an end-user changes a description in one row that it gets changed in all of the others. I personally would rather spend my time adding a new feature then fixing one I coded last year.

Normalization cannot solve all of your problems, but it can very definitely solve the ones it was meant to solve. If you can get each fact stored in exactly one place then you never have to waste time coding special fix-it routines or crawling through inconsistent data on a live system.

Conclusion: Easier Coding

Today we have seen another example of how a normalized set of tables is actually easier to work with than a so-called "simpler" single table, because it is easier to make sure that the data is correct.

Other Posts

This essay is part of the Keys, Normalization and Denormalization series. There is also a Complete Table of Contents and a Skills-oriented Table Of Contents.

Monday, December 17, 2007

Database Skills: First Normal Form

Since all non-trivial websites require a database, and since database skills are different from coding skills, there is very good reason for any programmer to master the principles of database design and use.

The most basic skill that you need for successful database projects is table design. A strong table design can be the foundation of a wonderful application, while a weak design brings nothing but delays, expense, headaches and heartaches.

One of the cornerstones of good table design is the process of normalization. Normalization proceeds through different stages. The result of each stage is called a "Normal form." So if we have completed the first stage we say the database is in "First Normal Form", and after we have completed the second stage we say the database is in "Second Normal Form" and so on.

This essay is part of the Keys, Normalization and Denormalization series. There is also a Complete Table of Contents and a Skills-oriented Table Of Contents.

The Example

Our example is a school that has different kinds of schedules on different days. A normal day has 4 periods in the morning, lunch, and then 3 afternoon periods. Sometimes the periods are 35 minutes and there is an assembly at the end of the day that goes to the normal end of day. On short assembly days the periods are 40 minutes each. There is a also a variation where the short assembly might be after lunch instead of at the end of the day.

A First Draft Table

Taking a stab at some table design, we may say something like, "Well I can see that there are never more than nine periods, and there are four kinds of schedules. I can make a table that lists the four kinds of schedules and tells me what happens in each period, something like this:"

DAY_TYPE   | MINUTES | P0   P1   P2   P3   P4   P5   P6   P7  P8   
-----------+---------+----+----+----+----+----+----+----+----+----
NORMAL     | 45      | C  | C  |  C | C  | L  | C  | C  | C  | X 
ALONG      | 35      | C  | C  |  C | C  | L  | C  | C  | C  | A 
ASHORT     | 40      | C  | C  |  C | C  | L  | C  | C  | C  | A 
ALUNCH     | 40      | C  | C  |  C | C  | L  | A  | C  | C  | C 

....where "C" means a class period, "L" means lunch, "A" means assembly and "X" means nothing happens.

It so happens that this table is not in first normal form. We will look at what this means first in theory, and then we will see the practical problems that come from it.

First Normal Form as Theory

A table is in first normal form if it has no repeating groups. The example above repeats the period column nine times with columns P0, P1, P2. This is a repeating group, and so it violates first normal form. Theory tells us that we must redesign the table, splitting the repeating group out into a child table. This will leave us with two tables, one that is for day types and the other that tells us about periods. The table of day types is now quite simpler:

DAY_TYPE   | MINUTES | DESCRIPTION
-----------+---------+-----------------------------                  
NORMAL     | 45      | Normal Day                            
ALONG      | 35      | Long Assembly
ASHORT     | 40      | Short Assembly                           
ALUNCH     | 40      | Short Assembly After Lunch                        

The original table had ordering information buried in the names of the columns. We need to have that same information in our child table so we will have the PERIOD column be numeric and it will determine the order of the periods. Here is the table with some of its entries:

DAY_TYPE   |  PERIOD   |   ACTIVITY 
-----------+-----------+---------------
NORMAL     |  1        | Class
NORMAL     |  2        | Class
NORMAL     |  3        | Class
NORMAL     |  4        | Class
NORMAL     |  5        | Lunch 
NORMAL     |  6        | Class
NORMAL     |  7        | Class
NORMAL     |  8        | Class
ALONG      |  1        | Class
ALONG      |  2        | Class
ALONG      |  3        | Class
ALONG      |  4        | Class
ALONG      |  5        | Lunch 
ALONG      |  6        | Class
ALONG      |  7        | Class
ALONG      |  8        | Class
ALONG      |  9        | Assembly

We have now satisfied the theory, and put our fledgling database into First Normal Form. But the question arises, wasn't the single-table version a lot simpler? This new "fixed" version seems more complicated. A programmer might say something like, "I have to look all over the place to find what I need." To answer these objections we will now examine the practical side of the example.

Problem 1: Structure Changes

The non-normalized example is probably fine if you have just one school that never changes their schedule. But if you have multiple schools you will need to change your structure for them, adding or removing periods.

To this we might reply, "but sure, structures change all of the time." Yes, it is true that a structure change may be required to improve a program or add functionality, but in the example above you must change the structure in cases where you are not improving the program or adding functionality. This means the more people use your program, the more work for you! By contrast, in the normalized situation a different schedule just means an end-user task to make entries in the tables.

You can argue that this particular example is not so bad, because, after all, how many periods can any school really have? Let's make a table with 15 periods and be done with it, no? The problem is that if you win that argument and make a habit out of repeating groups then you saddle your development efforts with the ongoing need to do structure changes where simple data editing would have done the job.

Problem 2: Secondary Columns

I worked on a system one time that had a repeating group of 24 columns in several tables. The columns were for the size of an item of clothing. In one table there were 24 additional columns for price. Then 24 more for cost. Then 24 more for gross profit. There were more, should I continue?

The problem would appear in our school schedule if we wanted to put the start and end times of the periods into the tables so we did not have to figure them out on every page load. We would now have START0, START1, END0, END1, and so on, eighteen more columns. Such fat tables like this are very clumsy to work with both in code and when doing ad-hoc queries.

Problem 3: Constants Required

Imagine you have a simple routine called SQL_ALLROWS() that gives you query results in an array (if you don' have one, code it today!). If your database is well normalized your code will display this pattern a lot:

$rows = SQL_AllRows(
    "SELECT activity FROM period_info 
      WHERE day_type = 'NORMAL'
      ORDER BY sequence"
);
foreach($rows as $row) {
   // ...do something interesting...
}

This simple code is not possible when you violate first normal form. You must put a constant into your code somewhere so that your code knows how many columns there are. This is an extra bit of maitenance to worry about here:

$row = SQL_OneRow(
    "SELECT * FROM period_info 
      WHERE day_type = 'NORMAL'
);
for($x = 0; $x < NUM_PERIODS; $x++ ) {
   $period = $row['P'.$x];
   // ... do some display...
}

Problem 4: Query Construction

For problem 3 above, I put a "Select *" into the query. I really should not have done that. I should have listed all 9 "P" columnns (or all 27 columns if we put start and end times in also). Something like this:

SELECT P0,P1,P2,....

But this violates good programming practice. If the structure changes then I have to go to all of those queries and modify them. A better programming practice would be to use that NUM_PERIODs constant to generate column lists on the fly:

$cols = array();
for ($x = 0; $x < NUM_PERIODS; $x++ ) {
  $cols[] = 'P'.$x;
}
$query = "SELECT ".implode(',',$cols)." from...";

Most programmers will avoid a bit of code like this because it seems at first glance to be really stretching a point. Of course once the program is successful and the additions and changes come in that leaves you going through and changing every query every time you modify the structure. None of this is necessary with the normalized version.

It is starting to seem like a stretch to claim that we have take the simpler path here. Perhaps the "complicated" normalized tables really are simpler.

Problem 5: Hard-Coded Tools

As I write this in December of 2007 the web is rediscovering a UI widget that has been in use for decades in terminals and desktops: the grid. The grid is the natural method of editing child tables in admin interfaces.

Before 1995, when there were no web programmers, a programmer who violated first normal form paid the price by needing to code up things like grids by hand because he could not use grid-like tools available in his framework. The modern web programmer will pay the price in not being able to use the grids that are rapidly coming online.

Problem 6: Cannot Aggregate

Finally we will consider a simple question, tell me how many periods there are in any particular DAY_TYPE. In a normalized database the SQL is easy:

SELECT COUNT(*) FROM periods WHERE day_type='NORMAL'
-- or, to get them all:
SELECT COUNT(*),Day_type
  FROM periods
 GROUP BY day_type

Getting the same answer in the non-normalized database requires code. It does not matter how simple that code is, the fact is that if the tables were normalized the code would not be required at all. Multiply that by 10 tables, 100 tables, and you just have complication after complication.

Conclusion

Violating first normal form puts us in a situation where we need programmers to make changes, while normalized databases can be modified by end users. Violating first normal form also forces us to embed and maintaint constants we would not otherwise need and to change queries that could otherwise be left alone. These problems can be mitigated with libraries of code that build queries for us, but still those libraries have to be written.

In this light, we see can see first normal form as something that creates efficient code and easier-to-maintain databases.

Other Posts

This essay is part of the Keys, Normalization and Denormalization series. There is also a Complete Table of Contents and a Skills-oriented Table Of Contents.

Database Skills: Complete Contents

This posting is updated whenever a new post goes up.

There is a also a Skills-oriented Table Of Contents. It is not as complete as this list, which lists all posts, but it is more centered on links as they relate to skills.

If you want some free analysis, why not submit your schema to the Database Programmer? If you are willing to discuss your issues with a bit of public exposure, I will provide free analysis, and everybody can benefit!

User-Submitted Analysis Topic: Email


The Application Stack

Table Design Basics: Keys, Normalization, Denormalization

The first group of posts introduces the must-know terms and techniques for table design.

It might be a good idea to start with The Relational Model.

Following up on the normal forms are some basic discussions of normalization and denormalization.


Table Design Patterns

The second subseries details commonly occurring patterns in table design, how to recognize them and when to use them.

There is a complete List of Table Design Patterns. The rest of the entries are:


SQL SELECT and Queries

Algorithms and Processes

Server-Side Code

Analysis

Development Cycle

Philosophy

Data Dictionary

Many of these posts are listed elsewhere in this table of contents, but I wanted to have them altogether in one place as well.

Security

Performance

The Browser

Monday, December 10, 2007

Database Skills: Foreign Keys

This is the third in a series of articles that will help you develop your database skills, Since all non-trivial websites require a database, and since database skills are different from coding skills, there is very good reason for any programmer to master the principles of database design and use.

This essay is part of the Keys, Normalization and Denormalization series. There is also a Complete Table of Contents and a Skills-oriented Table Of Contents.

The Example

This week the example continues from the last two weeks. A school has 8 periods in a day that are varying lengths, sometimes an assembly, and of course homeroom. Our assignment is to code a screen that shows the begin and end times of each period depending on what schedule was being followed for the day. We have already decided to "spell it out" by populating a table with the start and end times for each period in each kind of day (the schedule was described in the first essay in the series. Here is a list of about half of the rows in the table:

DAY_TYPE  |  PERIOD | SEQUENCE  |  START    |   END
----------+---------+-----------+-----------+----------
NORMAL    |  HOME   |   0       | 8:00 am   |  8:10 am
NORMAL    |  1      |   1       | 8:15 am   |  9:00 am
NORMAL    |  2      |   2       | 9:05 am   |  9:50 am   
NORMAL    |  3      |   3       | 9:55 am   |  10:40 am
NORMAL    |  4      |   4       | 10:45 am  |  11:30 pm
NORMAL    |  LUNCH  |   5       | 11:35 pm  |  12:20 pm
NORMAL    |  6      |   6       | 12:25 pm  |  1:10 pm
NORMAL    |  7      |   7       | 1:15 pm   |  2:00 pm
NORMAL    |  8      |   8       | 2:05 pm   |  2:50 pm
ALONG     |  HOME   |   0       | 8:00 am   |  8:10 am
ALONG     |  1      |   1       | 8:15 am   |  8:50 am
ALONG     |  2      |   2       | 8:55 am   |  9:30 am
ALONG     |  3      |   3       | 9:35 am   |  10:10 am
ALONG     |  4      |   4       | 10:15am   |  10:50 am
ALONG     |  LUNCH  |   5       | 10:55 am  |  11:30 am
ALONG     |  6      |   6       | 11:35 am  |  12:10 pm
ALONG     |  7      |   7       | 12:15 pm  |  12:50 pm
ALONG     |  8      |   8       | 12:55 pm  |  1:30 pm
ALONG     |  ASMB   |   9       | 1:35 pm   |  2:50 pm
....
.... same for other two schedules
....

Now we will see how a foreign key will help to make sure the above table is correct.

The Foreign Key

A foreign key is used to prevent certain kinds of errors. In our example above we see that the column DAY_TYPE has only two values, NORMAL and ALONG (and ASHORT and ALUNCH that were not dipslayed). If a user were typing in this schedule and accidentally typed 'NRMAL' or 'NORMALL' we want to catch that and prevent it. A foreign key is how we do this. Making the column SCHEDULE_TYPE a foreign key means here is some other table that holds the allowed values. We call that other table the parent table. Such tables are also called lookup tables, reference tables and sometimes leaf tables. Here is the parent table of SCHEDULE_TYPES, which lists the allowed values of SCHEDULE_TYPE.

SCHEDULE_TYPE     |   DESCRIPTION
------------------+--------------------------------
NORMAL            |   Normal Schedule 
ALONG             |   Long Assembly
ASHORT            |   Short Assembly
ALUNCH            |   Short Assembly After Lunch

The column PERIOD will also be a foreign key, and the parent table will look like this:

PERIOD       |   DESCRIPTION
-------------+-----------------------------
HR           |   Homeroom
1            |   Period 1
2            |   Period 2
3            |   Period 3
4            |   Period 4
LUNCH        |   Lunch
6            |   Period 6
7            |   Period 7
8            |   Period 8
ASSM         |   Assembly

Database people say that a foreign key enforces "referential integrity". This term comes from the fact that one column reference another, and this reference ensures the integrity of the data.

The SQL Code

This SQL is illustrative only. Your particular database may require additional commands or syntactical variations.

Creating the two lookup tables is accomplished like so:

CREATE TABLE periods (
     PERIOD char(8)
    ,DESCRIPTION char(25)
    ,PRIMARY KEY (PERIOD)
)
CREATE TABLE day_types (
     DAY_TYPE char(8)
    ,DESCRIPTION char(25)
    ,PRIMARY KEY (DAY_TYPE) 
)

The table of actual periods for each kind of day is created with the code below. I have used type 'character' for the start and end times to keep things simple for now. Note the two foreign keys, one to each table:

CREATE TABLE periods_x_days (
     DAY_TYPE char(8)
    ,PERIOD char(8)
    ,SEQUENCE int
    ,START char(8)
    ,END char(8)
    ,PRIMARY KEY (DAY_TYPE,PERIOD)
    ,FOREIGN KEY (PERIOD) REFERENCES PERIODS(period)
    ,FOREIGN KEY (DAY_TYPE) REFERENCES DAY_TYPES(DAY_TYPE)
)

There is a multiple-column primary key that I snuck in there. By enforcing a primary key on DAY_TYPE and PERIOD, we make sure that there is no accidental double listing for a particular type-period combination.

The Basic Building Blocks

Now that we know how a foreign key works and what it does, we can start to talk about why it is important. In short, the foreign key connects your tables together so that you have a true database. Without the foreign key all you have is a jumbled bag of unrelated tables.

An analogy to code may be in order here. Most of us can list many ways in which bits of code can connect to each other. You can call a function, you can make complex objects via composition, you can make classes via inheritance, along with many others. All of these mechanisms tie bits of code together into a coherent application. The corresponding fundamental way to tie tables together in a database is the foreign key.

So to sum it up, we have seen so far that:

  1. A database programmer by inclination wants to spell everything out in tables.
  2. Each table stores information about exactly one kind of thing, and a primary key makes sure there are no duplications.
  3. Foreign keys define how different kinds of things relate to each other.

The Server Does The Work

Using a foreign key is a good way towards moving your business logic into the database server itself. Once you have defined the foreign keys above, the server will never allow you (or your users) to accidentally or maliciously insert bad data. This command:

INSERT INTO periods_x_days (
    day_type,period,sequence,start,end
)
values (
   'FICTION','99',5,'11:00 pm','11:05 pm'
)

...will throw an error when you try to execute it. The exact wording varies from product to product, but the bottom line is you are fully protected from these errors without doing any coding of your own.

Foreign Keys and The Classic Blunder

Last week we noted that many programmers have the idea that they can combine the PERIODS table and the DAY_TYPES table into a single table, thereby "saving" a table. I did my best to debunk this idea by pointing out that

  1. First, it is actually less efficient to do so because it complicates code and requires at least one extra colum, and
  2. Second, it destroys the value of the primary key column as a unique identifier of a single kind of thing.

We now have a third argument, based on our understanding of the foreign key. If we combine the tables then we can no longer use the combined table as a lookup table for foreign keys. The classic blunder of "saving a table" means the resulting table cannot be tied properly to anything else. This kills the idea off completely, since the entire purpose of creating the tables is to use them as parent tables for foreign keys!

Conclusion: The Building Blocks

When we grind out code, we do not usually expect the customer to give us rigorous specs. They tell us what they need in their own language and we have to convert those needs into functioning code. We use building blocks like classes, functions, conditionals and iterators to make those programs.

Designing a database is the same as designing code insofar as the customer will not be able to design it for you. They will be able to tell you want they want in their own language, but you will have to design the tables. That process of table design is all about identifying primary and foreign keys. Next week we will begin to see how to do this, through the process of normalization.

Other Posts

This essay is part of the Keys, Normalization and Denormalization series. There is also a Complete Table of Contents and a Skills-oriented Table Of Contents.

Monday, December 3, 2007

Database Skills: Primary Keys

This is the second in a series of articles that will help you develop your database skills, Since all non-trivial websites require a database, and since database skills are different from coding skills, there is very good reason for any programmer to master the principles of database design and use.

This essay is part of the Keys, Normalization and Denormalization series. There is also a Complete Table of Contents and a Skills-oriented Table Of Contents.

The Example

This week the example continues from last week. A school has 8 periods in a day that are varying lengths, sometimes an assembly, and of course homeroom. Our assignment is to code a screen that shows the begin and end times of each period depending on what schedule was being followed for the day.

Last week we introduced the idea that the database programmer will tend towards trying to spell it out. That is to say, the database programmer will cast the solution in terms of saving information in a database that can be retrieved later on. He will generally avoid coding an algorithm that "figures it out" if he has the chance to simply store data that can be retrieved as needed. We will now see how a primary key works by looking at two tables that make up part of the solution.

The Primary Key Defined

The primary key is pretty easy to understand. It is simply the column or columns in a table that must be unique. In our example of the school schedule the database programmer is going to create two reference tables that have single-column primary keys and just a few rows each. Here they are, the table of schedule types and the table of periods:

SCHEDULE_TYPE     |   DESCRIPTION
------------------+--------------------------------
NORMAL            |   Normal Schedule 
ALONG             |   Long Assembly
ASHORT            |   Short Assembly
ALUNCH            |   Short Assembly After Lunch
PERIOD       |   DESCRIPTION
-------------+-----------------------------
HR           |   Homeroom
1            |   Period 1
2            |   Period 2
3            |   Period 3
4            |   Period 4
LUNCH        |   Lunch
6            |   Period 6
7            |   Period 7
8            |   Period 8
ASSM         |   Assembly

The first table has a primary key of SCHEDULE_TYPE. No two rows in that table can have the same value of SCHEDULE_TYPE. The second table has a primry key of PERIOD, so that no two rows in that table can have the same value of PERIOD.

A primary key is in fact very simple and does not really require a lot of discussion except for one further point: a primary key identifies something. For example, the value 'NORMAL' identifies a certain kind of schedule, and if I am given this value I can look up things about the schedule in the SCHEDULE_TYPES table.

Detour: Code and Data Decisions

These examples are so far very simple, so it may not be obvious that the tables SCHEDULE_TYPES and PERIODS are really necessary. After all, we can just put them into an associative array in some type of include file and not bother making tables, right?

Technically, you can of course put anything you want into an include file and save yourself some time at the front of a project. The drawbacks emerge over time and are the consequence of "trapping" data inside of code.

The first and most obvious drawback is that if the customer wants to change a description then a programmer must change a program file, instead of just having a user go to an admin screen. But worse, what if our program becomes a smash hit and schools all over the country are using it? What if they all want different descriptions, or what if they have different numbers of periods and kinds of schedules? If we start off by putting such things into tables then we will be well prepared for future demands, but if we "trap" the data in code then future success will produce choking demands on the programmers to change the files.

If the tables were more complicated we would have another issue. Databases were designed for efficient querying, but arrays were not. Querying the data that is trapped in code is much harder than querying tables in the database.

So we will take it as a given that data belongs in a database, even for simple two-column tables.

The Primary Key And a Classic Blunder

Programmers are by nature visual thinkers and we tend to be good at pattern recognition. This is why it is so common for programmers to look at the example above and say, "Hey! I just noticed that both of those tables have the same structure, a code and a description! I've got a great idea, I'm going to combine them into one table, which will be more efficient!"

This is a classic blunder. It is a result of applying wisdom from one situation into a different situation where that wisdom does not apply. Here is what happens. The programmer takes his idea of efficiency, which he learned from writing code, and tries to apply it to table design, which operates on completely different principles. He thinks that by replacing two tables with one he is being more efficient, but the real result is less efficiency.

First off, the new combined table now requires a new column, ROW_TYPE, which was not necessary before. More complicated, less efficient.

Secondly, a simple query to pull all types of schedules now requires a filter on ROW_TYPE that was not required before. This raises major alarms for seasoned programmers because any time you make a simple action more complicated with no apparent gain then you are on a bad road!

But most importantly for today's essay, the "efficient" single-table approach has muddled the primary key. The value 'ASHORT' by itself no longer uniquely identifies a schedule type. To solve this we end up in strange discussions. Should the column ROW_TYPE be part of the primary key? Or should we keep the single column key and have the odd situation that a schedule type and a period cannot have the same code? These conversations should raise alarms with any seasoned programmer because we have barely gotten started with two simple tables and already we are struggling with the very basics of picking primary keys! Even a programmer with no database experience should be suspicious of an approach that requires wrangling with the very basics. What good can come of that? The big irony is that these complications arose from the "simplification" of combining two tables into one.

Finally, there is a big technical problem with foreign keys and the one-table approach, but that can wait until next week.

All of these problems can be avoided by remembering that each table in a database contains information about one kind of thing and only one kind of thing. The primary key is that column or columns that uniquely identifies each of these individual things.

Conclusion: Look for the Keys

The primary key is simple to understand. It is a column or columns that must be unique in a table. But beyond that, the primary key is not just unique values, but unique values for like items. Database design skills begin with identifying the kinds of things that must be tracked, putting each into a table, and assigning the primary keys to those tables.

Other Posts

This essay is part of the Keys, Normalization and Denormalization series. There is also a Complete Table of Contents and a Skills-oriented Table Of Contents.

Monday, November 26, 2007

Database Skills: Introduction

This is the first in a series of essays that will help you develop your database skills from the beginning to the advanced.

This essay is part of the Keys, Normalization and Denormalization series. There is also a Complete Table of Contents and a Skills-oriented Table Of Contents.

Since all non-trivial websites require a database, there is plenty good reason to master the principles of database programming.

Database skills are very different from coding skills. In many cases, the reflexes and intuition that lead an experienced programmer into great code will just as easily lead his database efforts into one classic blunder after another. To avoid these blunders, and indeed to produce masterful database sites, the programmer must learn something of the database in its own terms.

The recurring theme of these essays will be that good database design makes for good and efficient code, the two ideas do not oppose each other. Another basic thrust of the entire series will be that databases should be learned and understood in their own terms, instead of trying to gloss over the differences between database design and coding.

The Example

During the series we are going to build a school database, and in this first essay we are going to look only at the start and end times for the class periods. Our school administrator, Dr. Ann Clayborne, explains the schedule to us in plain English like so:

Each day is divided into 8 periods, where period 5 is always lunch. The day begins promptly with homeroom at 8:00 am which lasts for 10 minutes. Period one begins at 8:15 am. All periods are 45 minutes and there is a five minute break between each period. On assembly days the periods are 35 minutes each. If we are going to have a short assembly the periods are 40 minutes. Assemblies are at the end of the day and fill out the remainder of day. Sometimes we have the short assembly after lunch.

Figure it Out or Spell it Out?

Our assignment is to code up a screen that displays the start and end times of each period, including homeroom, lunch, and an assembly if necessary. The start and end times of periods will vary based on what kind of day it is.

For this introduction, I want to begin with the different mindsets that can be used to get the job done. In a nutshell, a code grinder is going to figure it out, while a database programmer is going to spell it out. Let us look at how the coder thinks of the problem, and then we will look at how the database programmer thinks of the problems.

The coder says to himself something like, "Well let's see, I'll hardcode homeroom at the start of each day. Then I'll need a loop that goes from one to eight. I'll hardode 8:15 as the start of period one, and depending on the day, I'll keep adding 35, 40, or 45 minutes to figure out the end of the period. Then I'll add 5 minutes to get the start of the next period. If there's an assembly I'll just have to add some if/else stuff in there. Oh, wait, I'll also have to calculate ahead of time the end of the day so I can work out the length of an assembly, since they did not tell me that."

Now let us consider what the database programmer might do. He says to himself, "Hmmm, I think I'll make a table that lists the start and end times of each period for each kind of day. I will put the assemblies and homeroom in there. On the scheduling page I'll just pull the schedule out of the database and put it on the screen."

The coder has cast the problem in terms of actions, which is what programs are all about. This may seem so obvious that there is no reason to bring it up, but but we do need to make this point strongly. No matter what language or orientation (procedural, object, aspect, functional), every program always comes down to a series of actions take in sequence. It is not so much knowing this that is important as remembering this when making design decisions, as we shall see in this series.

The database programmer, on the other hand, has cast the problem in terms of information he can store for later retrieval. Our example, while very simple, illustrates very clearly how different mindsets approach a problem. The coder thinks in terms of a series of actions while the database programmer thinks in terms of storing and retrieving facts. If we need a bumber sticker for this or a T-shirt slogan, we restate our starting point, which is that the coder wants to figure it out while a database programmer wants to spell it out.

Conclusion: the Tables and the Code

So we know that the database programmer is going to create tables and populate them with the data that he can then read out of the screen. A portion of the table would look like so:

DAY_TYPE  |  PERIOD | SEQUENCE  |  START    |   END
-----------+---------+-----------+-----------+----------
NORMAL    |  HOME   |   0       | 8:00 am   |  8:10 am
NORMAL    |  1      |   1       | 8:15 am   |  9:00 am
NORMAL    |  2      |   2       | 9:05 am   |  9:50 am 
NORMAL    |  3      |   3       | 9:55 am   |  10:40 am
NORMAL    |  4      |   4       | 10:45 am  |  11:30 pm
NORMAL    |  LUNCH  |   5       | 11:35 pm  |  12:20 pm
NORMAL    |  6      |   6       | 12:25 pm  |  1:10 pm
NORMAL    |  7      |   7       | 1:15 pm   |  2:00 pm
NORMAL    |  8      |   8       | 2:05 pm   |  2:50 pm
SHORT     |  HOME   |   0       | 8:00 am   |  8:10 am
SHORT     |  1      |   1       | 8:15 am   |  8:50 am
SHORT     |  2      |   2       | 8:55 am   |  9:30 am
SHORT     |  3      |   3       | 9:35 am   |  10:10 am
SHORT     |  4      |   4       | 10:15am   |  10:50 am
SHORT     |  LUNCH  |   5       | 10:55 am  |  11:30 am
SHORT     |  6      |   6       | 11:35 am  |  12:20 pm
SHORT     |  7      |   7       | 12:15 pm  |  12:50 pm
SHORT     |  8      |   8       | 12:55 pm  |  1:30 pm
SHORT     |  ASMB   |   9       | 1:35 pm   |  2:50 pm
....
.... same for other two schedules
....

Returning now to the main theme, that good database design makes for good and efficient code, we look at the code that would display the periods for a single day (assume that $DAY is sanitized):

$rows=SQL_AllRows(
  "SELECT * FROM scheduleperiods
    where day_type=$DAY
    ORDER BY sequence"
);
foreach($rows as $row) {
  // Display start and end times
}

I will contend here with little argument that you cannot reduce the code to anything simpler than that. Any program that does not already know the period start and end times will have to figure them out, and will be more complicated, therefore harder to code, and more likely to contain errors.

In the next essay we will look at this table again, along with two other tables, and we will demystify primary and foreign keys.

Appendix: Fixing A Mistake

There is a deliberate mistake in the data listed above. How easy is it to fix, and how does this compare with fixing a mistake in code?

Other Posts

This essay is part of the Keys, Normalization and Denormalization series. There is also a Complete Table of Contents and a Skills-oriented Table Of Contents.