Sunday, October 26, 2008

Data and Code at the Application Level

This week I would like to address the assertion that "code is data" and how the application developer might benefit or be harmed by this idea in the practical pursuit of deadlines and functioning code. For some reason my essay written back in May, Minimize Code, Maximize Data, got picked up in the blogosphere on Thursday, and comments on ycombinator, on reddit.com, and on the post itself have suggested the thesis is flawed or unworkable because "code is data." Let's take a look at that.

Credit Where Credit Due

I first heard the thesis "Minimize Code, Maximize Data" from A. Neil Pappalardo. I consider it the "best kept secret in programming" because I personally have found it to be almost completely absent from my own day-to-day experience with other programmers.

However, glomek over at reddit.com also credits Eric Raymond with the following quote, "Smart data structures and dumb code works a lot better than the other way around."

Also, sciolizer over on the news.ycombinator.com comments area gives us these quotes from some of the greats:

  • Fred Brooks: "Show me your flow charts and conceal your tables and I shall continue to be mystified, show me your tables and I won't usually need your flow charts; they'll be obvious."
  • Rob Pike: "Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming."
  • Eric S. Raymond (again): "Fold knowledge into data, so program logic can be stupid and robust."
  • Peter Norvig: "Use data-driven programming, where pattern/action pairs are stored in a table."
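
Norvig's last point is concrete enough to sketch in SQL. In the hypothetical example below (all table and column names are my own invention), the knowledge of customer discounts lives in a table of pattern/action pairs, so the code that uses it can stay dumb; adding a new customer type becomes an INSERT, not a code change.

-- Hypothetical pattern/action table: the "pattern" is the
-- customer type, the "action" is the discount to apply.
CREATE TABLE discount_rules (
    customer_type VARCHAR(10) PRIMARY KEY,
    discount_pct  NUMERIC(5,2) NOT NULL
);

INSERT INTO discount_rules VALUES ('RETAIL',     0.00);
INSERT INTO discount_rules VALUES ('WHOLESALE', 15.00);
INSERT INTO discount_rules VALUES ('EMPLOYEE',  25.00);

-- The code stays stupid and robust: it just joins to the rules.
SELECT o.order_id,
       o.subtotal * (1 - r.discount_pct / 100) AS amount_due
  FROM orders o
  JOIN discount_rules r ON r.customer_type = o.customer_type;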

And finally, Kragen Javier Sitaker left a comment on my original essay mentioning Tim Berners-Lee and his theory of "least power." You can read a description of that here.

So Why Do They Say Code is Data?

The surprising answer is that code is data, in particular contexts and when trying to accomplish certain tasks. The contexts do not include application development, and the tasks do not involve storing customer information, but the fact remains true for those who work in the right contexts.

As an example, at the bottom layer of the modern computer are the physical devices of CPU and RAM. Both the computer program being executed and the data it operates on are stored in RAM in the same way. This is called the Von Neumann architecture. It's a fascinating study and a programmer can only be improved by understanding it. At this level code is data in the most fundamental ways. There are many, many other contexts and tasks for which it is true that code is data.

But we who create applications for customers are separated from Von Neumann by decades. These decades have seen a larger and larger stack of tools that allow us to concentrate on specialized tasks without worrying about how the tools below are doing their jobs. One of the most significant sets of tools we use allows us to cleanly separate code from data and to handle them differently.

The One and Only Difference

Trying to explain the differences between code and data is like trying to explain the differences between a fish and a bicycle. You can get bogged down endlessly explaining the rubber tire of the wheel, which the fish does not even have, or explaining the complexity of the gills, which the bicycle does not even have.

To avoid all of that nonsense I want to go straight to what data is and what code is. The differences after that are apparent.

Data is an inert record of fact. It does nothing but sit there.

A program is the actor, the agent, the power. The application program picks up the data, shakes it, polishes it, and puts it down somewhere else (as in picking it up from the db server, transforming it into HTML, and delivering it to a browser).

To repeat: data is facts. Code is actions that operate on facts. The one and only difference is simply that they are not the same thing at all, they are a fish and a bicycle.

Exploiting The Difference

All of the quotes listed above, and my original essay in May on the subject, try to bring home a certain point. The point is simply that the better programs are those that begin with a distinction between fact and action, and seek first to organize the facts and only then to plan the actions.

Put another way, it is of enormous practical advantage to the programmer to fully understand that first and always he is manipulating facts (data). If he ignores the principles of how facts are organized and operated on, he can never reach his full abilities as a programmer. Only when he understands how the facts are organized can he see the clearest program designs.

And Again: Understand the facts first. From there design your data structures. After that the algorithms write themselves.

Minimizing and Maximizing

The specific advice to minimize code and maximize data is nothing more than taking the idea to its logical conclusion. If I write program X so that the data structures are paramount, and I find the algorithms to be simple (or "dumb" as ESR would say), easy to write and easy to maintain, don't I want to do that all of the time?

Conclusion

The wise programmer is one who can take the wisdom and theory of the industry and correctly judge what is appropriate and applicable and what is not. This is the programmer who has a shot at keeping focused, making budget and making deadlines. He knows when a generalized routine will support the overall project and when to just code the case at hand and move on.

The unwise programmer is one who cannot properly apply a theoretical concept to the correct context, or cannot judge the context in which a concept is appropriate. He is the one who produces mammoth abstractions, loses sight of the end-goals of the check-signers and end-users, and never seems to be able to make the deadline.

Of all of the advice I have received over the years, one of the most useful and productive has been to "minimize code, maximize data." As an application developer and framework developer it has served me better than most.

Sunday, October 19, 2008

The Argument For Denormalization

There are links to related essays on normalization and denormalization at the bottom of this post.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Non-normalized, Normalized and Denormalized

A non-normalized database is a disorganized one, where nobody has bothered to work out where the facts should be stored. It is like a stack of paper files that has been tossed down the stairs. We are not interested in non-normalized databases.

A normalized database has been organized so that each fact is stored in exactly one place (2nf and greater) and no more than one fact is stored in each place (1nf). In a normalized database there is a place for everything and everything is in its place.

A denormalized database is a normalized database that has had redundancies deliberately re-introduced for some practical gain.

Most denormalizing means adding columns to tables that provide values you would otherwise have to calculate as needed. Values are copied from table to table, calculations are made within a row, and totals, averages and other aggregations are made between child and parent tables.
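
As a hypothetical sketch (the names are mine, not a prescription), an order_lines table might carry all three kinds of denormalized values described above:

CREATE TABLE order_lines (
    order_id       INTEGER       NOT NULL,
    sku            VARCHAR(12)   NOT NULL,
    price          NUMERIC(10,2) NOT NULL,  -- copied from the items table at order time
    qty            INTEGER       NOT NULL,
    extended_price NUMERIC(10,2) NOT NULL,  -- within-row calculation: price * qty
    PRIMARY KEY (order_id, sku)
);

-- The parent orders table would likewise carry order_total,
-- an aggregation of extended_price over its child rows.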

Related Essays

If you are a first-time reader of this blog, I recommend taking a look at Third Normal Form and Calculated Values and Denormalization Patterns, which cover issues related to today's post.

The Practical Problems Of Normalization

There are four practical problems with a fully normalized database, three of which I have listed before. I will list them all here for completeness:

  1. No calculated values. Calculated values are a fact of life for all applications, but a normalized database lacks them. The burden of providing calculated values must be taken up by somebody somehow. Denormalization is one approach to this, though there are others.
  2. Non-reproducible Calculations. If you do not store calculated values in your database, your application must generate them on the fly as needed. If your application changes over time, you risk not being able to reproduce prior results when the business rules drift far enough from the original.
  3. Join Jungles. When each fact is stored in exactly one place, you may find it daunting to pull together everything needed for a certain query. A query joining 4, 5, 7, or even 12 tables may be required for something the end-user considers trivial and easy (see the sketch after this list). Such queries are hard to code, hard to debug, and dangerous to alter.
  4. Performance. When you face a JOIN jungle you almost always face performance problems. A JOIN is a very expensive operation compared to a single-table read, and the more JOINs you have the worse it gets.
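
To make the join jungle concrete, here is a hypothetical query (table and column names are my own) for something the end-user would call trivial: each order line labeled with its customer, item description, and salesperson.

-- Four JOINs just to put readable labels on one row of order_lines.
SELECT c.customer_name,
       i.description,
       e.employee_name,
       ol.qty,
       ol.price
  FROM order_lines ol
  JOIN orders    o ON o.order_id    = ol.order_id
  JOIN customers c ON c.customer_id = o.customer_id
  JOIN items     i ON i.sku         = ol.sku
  JOIN employees e ON e.employee_id = o.employee_id;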

The Convenience Argument

The convenience argument addresses the first problem listed above, no calculated values. When calculated values are generated and added to tables, it is far easier for downstream programmers (including members of the customer's IT department) to generate their own reports and ad-hoc queries. It is also much easier for members of the original team to generate display pages and reports.

This convenience is not a result of the simple presence of the calculated values. The convenience stems from the fact that the downstream programmers do not have to get involved in code that generates or calculates the values. They do not have to know anything about the API, the language the app was written in, or anything else; they just have to pull the data they need.
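
For instance, if the orders table carries a stored order_total (a hypothetical column, as above), the downstream query needs no knowledge of the application at all:

-- A report writer can run this without ever reading the application code.
SELECT order_id, order_date, order_total
  FROM orders
 WHERE order_date >= '2008-10-01';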

This convenience goes beyond the programmers to semi-technical users who may want to use their favorite 3rd-party reporting tool (like Crystal Reports) to query the database. If your application API will not work with their favorite tool (or if you don't have an API), then you have a disappointed customer. But if the data is right there in the tables, they can pretty much use anything.

At this point you may be saying, sure, that's fine, but views get all of this done without denormalizing. That is true, but when we go on to the next 3 arguments we will see something of why denormalizing often wins out over views.
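
For comparison, here is what the view-based version of the same convenience might look like, assuming a normalized order_lines that does not store the calculated column:

-- Same convenience, but computed at read time instead of write time.
CREATE VIEW order_lines_ext AS
SELECT order_id, sku, price, qty,
       price * qty AS extended_price
  FROM order_lines;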

The Stability Argument

Every healthy computer program changes and grows as new users and customers make use of it. During this process it is inevitable that later customers will request significant changes to very basic functions that were coded early on and are considered stable. When this happens the programmers have the daunting task of providing the original functionality unchanged for established customers, while providing the new functionality for the newer customers.

Denormalizing can help here. When derived values are calculated during write operations and put directly into the database, they can basically stay there forever unchanged. When a significant new version brings newer code to older users, there is no need to fear that an invoice printed last week will suddenly come out with different numbers.
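
In practice this usually means the write operation itself captures the derived values; a hypothetical sketch (names are mine):

-- Capture today's price and the computed extension at write time.
-- Even if the pricing rules change next year, this row is frozen.
INSERT INTO order_lines (order_id, sku, price, qty, extended_price)
SELECT 1001, i.sku, i.price, 3, i.price * 3
  FROM items i
 WHERE i.sku = 'WIDGET-12';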

There still remains of course the fact that a bug in this whole effort means future calculations are wrong, and the worst case is when a bug gets out to production and generates bad calculated values. When this happens you face the prospect of fixing bad data on a live system. This is definitely my least favorite thing to do.

The Simple Queries Argument

The third problem listed above is JOIN jungles: queries that involve so many JOINs that they become impractical to write, difficult to debug, and dangerous to change.

When you denormalize a database by copying values around between parent and child tables, you reduce the number of JOINs that are required. Very obvious examples include things like copying an item's price onto an order_lines table when a customer puts an item in their cart. Each time you copy a fact from one table to another, you eliminate the need for a JOIN between those two tables. Each eliminated JOIN is a simpler query that is easier to get right the first time, easier to debug, and easier to keep correct when changed.
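
If we copy the item's description onto order_lines and the customer's name onto orders (hypothetical choices, purely for illustration), the four-JOIN query shown earlier collapses to one JOIN:

-- One JOIN instead of four, because the labels travel with the rows.
SELECT o.customer_name,
       ol.description,
       ol.qty,
       ol.price
  FROM order_lines ol
  JOIN orders o ON o.order_id = ol.order_id;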

This argument also goes directly back to the convenience argument. If that huge customer you just landed is happy to hear that they can use Crystal Reports to generate reports, you may still face disappointment when they find the reports involve "too many tables" from their perspective for reports that "ought to be simple".

The Performance Argument

The final argument proceeds from our fourth problem listed above. Normalized databases require a lot more JOINs than denormalized databases, and JOINs are very expensive. This means that, overall, any operation that reads and presents data will be more expensive in a normalized database than a denormalized one.

Once we reduce the JOINs by copying data between tables, we end up improving performance because we need fewer JOINs to retrieve the same number of facts.

Denormalization is not the only way to get the convenience of copied values and calculated values. Views and materialized views are the most often mentioned alternatives. The choice between denormalizing and using views has a lot to do with the Pay Me Now or Pay Me Later decision.

Conclusion

Denormalization is not an absolute: it is not one of those things that all wise experienced programmers always do, and it is not something that only fools ignore. The four arguments listed here have guided me well in deciding when to denormalize (and when not to), and I hope that they are of some benefit to you when you face the same decisions.

Related Essays

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.


Sunday, October 12, 2008

The Argument For Normalization

This week we will review the practical arguments in favor of normalization. The major concern as always on this blog is to examine database decisions in light of how they affect the overall application. The major argument for normalization is very simple: you end up coding less, coding easier, and coding stronger, and you end up with fewer data errors.

There are links to related essays on normalization and denormalization at the bottom of this post.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Informal Description of Normalization

When I find a programmer who is stuck trying to grasp the concepts of normalization, the problem usually comes down to not being able to see the big picture. The programmer may read any number of rigorous papers on the subject (the Wikipedia article is a good place to start) but still be unable to get the basic point. This leaves the programmer stumbling through table design, second-guessing himself, and then running through a frustrating sequence of redesigns. In the worst case it leads him to conclude normalization may not be worth the effort, at which point he starts writing really crappy applications.

The goal of normalization in simple terms is just this: to store each fact in exactly one place. When you put each fact in only one place, you always know where to go to read it or write it. When facts are repeated in the database, the application programmer has an increased burden to make sure they are all consistent. If he fails to shoulder this burden completely, the database will have inconsistent values for the same facts, leading to emergency phone calls and emails requesting help. These requests for help always come at 4:30 pm as you are getting ready for a date or an extended vacation.

Once the programmer understands this very simple goal, he usually has an "aha!" moment reading the various articles on normalization. Each rule for normalizing suddenly makes more sense, as it appears as just another way to make sure that there is a place for everything and everything is in its place.

The Programmer's Burden

When a programmer is dealing with a non-normalized database, he is going to run into four problems over and over again. Three of these are called "anomalies" and the fourth is "inconsistency" (a fancy way of saying the database holds bad data). Most authors who write on normalization take it for granted that the reader can readily see why the anomalies are bad, but I would like to spell it out here to make it crystal clear. First we will look at the three kinds of anomalies, then we will go to the inconsistency problem, and see exactly how they affect the programmer.

Consider a fledgling programmer who has read too many Web 2.0 blogs saying that relational databases are bad, and so has not bothered to learn anything about them. In the name of "simplicity", he creates a single table that lists employees, their email addresses, the customers they are assigned to, and the primary email address of each customer. This will lead to three kinds of anomaly, each of which leads to inconsistency.

  • An Update Anomaly occurs when a fact is stored in multiple locations and a user is able to change one without changing them all. If a user goes to this employee-customer table and changes an employee's email on only one row, and no provision is made to change the others, then the database now has inconsistent values for the employee's email.

  • An Insert Anomaly occurs when it is not actually possible to record a fact. If an employee is hired but not yet assigned to any customers, it is not possible to store his email address!

  • A Delete Anomaly occurs when the user deletes one fact and clobbers some other fact along the way. If an employee goes on leave, so that we must remove (delete) their assignments, then we have lost their email address!

This non-normalized database requires the programmer to write additional application code to try to intercept and correct these issues. This is the Programmer's Burden in a non-normalized situation, and it gets worse and worse as the program expands.

The Programmer's Burden also emerges as a continuing stream of complaints from users that "the program is wrong." For every case where the programmer fails to provide exception-handling code, a user will stumble across inconsistent data. The customer says, "it says 'X' on this screen but it says 'Y' on that screen, what's going on?" As far as they are concerned it is a bug (which of course it is) and must be fixed. You can't make money coding new features when you are fixing garbage like that.
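
For concreteness, here is the kind of table our fledgling programmer might create (hypothetical names), along with the update anomaly in action:

-- One row per employee-customer assignment; the employee's email
-- is repeated on every one of his rows.
CREATE TABLE employee_customers (
    employee       VARCHAR(10),
    employee_email VARCHAR(40),
    customer       INTEGER,
    customer_email VARCHAR(40)
);

-- Update anomaly: this fixes one row and silently leaves the
-- employee's other rows holding the old address.
UPDATE employee_customers
   SET employee_email = 'art@newdomain.com'
 WHERE employee = 'ARANDOLPH'
   AND customer = 100;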

The Basic Argument

So the basic argument for normalization is: we wish to avoid the Programmer's Burden as completely as possible. We want to spend our time on cool features, not going back over and over to fix features we thought were finished already.

Special Comment on First Normal Form

First normal form is different from the others. When a database designer violates the higher normal forms, the result is that a fact is recorded in more than one place. However, when you violate first normal form the result is more than one fact in the same place.

A basic example would be the same table of employees and customers, where we "solve" the problems listed above by storing only one row for each employee, with a comma-separated list of customers, like so:

EMPLOYEE   EMAIL                     CUSTOMERS
------------------------------------------------------------
ARANDOLPH  art@praxis.com            100, 523, 638, 724
SRUSSELL   sax@overlook.edu          516, 123, 158
PBOYLE     phyllis@sp-elevator.com   713, 928, 212


The above scheme increases the Programmer's Burden because now he must decompose the data that comes from the server. In technical terms we say that the value of CUSTOMERS is non-atomic: it is not a single fact. Every piece of code that touches that table must break down the list of customers and sometimes reassemble it.

To see this, consider the basic task of adding a customer for employee Art Randolph. If the tables were set up properly, you would insert into a cross-reference of employees and customers, and duplicates would be trapped by a primary key. But here you must retrieve the list of existing customers, split it up in application code, and check that the value is not repeated. Then you have to collapse the list back down and send it up to the server.

All I can say is, no thanks.

By The Way, What Is The Right Way?

Now that we have beaten up our fledgling programmer's lousy employee-customer table, it would be worthwhile to spell out how to do it correctly.

First off, we always need one table for each kind of thing we are keeping track of. That means we will have a table of employees and a table of customers. This solves all of the anomalies and inconsistencies listed above because we put facts about employees in the employees table (like email address) and facts about customers in the customers table.

This leaves the issue of linking employees to customers. There are three ways to do it:

  1. If each customer gets a team of employees assigned to them, but an employee only ever works for one customer, then put a foreign key on the employees table that links to the customers table.
  2. If each employee works on more than one customer, but each customer gets only one employee, then put a foreign key on the customers table that links back to employees.
  3. If an employee can work for more than one customer and vice-versa, make a cross-reference between customers and employees (sketched below).
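
Here is a minimal sketch of case 3 (hypothetical names again). The primary key on the cross-reference table is what traps duplicate assignments, the very check we had to hand-code against the comma-separated list:

CREATE TABLE employees (
    employee VARCHAR(10) PRIMARY KEY,
    email    VARCHAR(40) NOT NULL      -- stored once: no update anomaly
);

CREATE TABLE customers (
    customer INTEGER PRIMARY KEY,
    email    VARCHAR(40) NOT NULL
);

-- Case 3: many-to-many, handled by a cross-reference table.
CREATE TABLE employees_customers (
    employee VARCHAR(10) REFERENCES employees (employee),
    customer INTEGER     REFERENCES customers (customer),
    PRIMARY KEY (employee, customer)   -- duplicates rejected by the database
);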

Conclusion

This week we have seen a fairly simple argument for normalization, and one that regular readers of this blog have seen before: normalization eliminates unnecessary coding burdens. It is hard enough to get software projects done on time and on budget without imposing additional labor that could be avoided entirely by normalizing.

I do not mean to imply that normalizing takes no time or is instantly easier than a fear-based retreat into coding your way out of things. It does take time to learn to normalize, and it takes time to learn to code an application around normalized tables. In my own experience I passed through the various erroneous mindsets that I make fun of in this blog, and each time I put effort into learning the "right way", every effort I made after that was easier, had fewer bugs, and made my customers happier. So I am not saying it is free, but I am saying it is one of the best bargains in town.

Related Essays

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.
