Sunday, May 4, 2008

Minimize Code, Maximize Data

Early in my career, I was fortunate to receive some programming lessons from one of the early pioneers of the computer age, A. Neil Pappalardo. While I am sure he would not remember me, I certainly remember him, and one thing he said has stayed with me ever since: Minimize Code, Maximize Data.

This is the Database Programmer blog, for anybody who wants practical advice on database use.

There are links to other essays at the bottom of this post.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

The Best Kept Secret in Programming

This week we are going to examine what I was told so many years ago:

Minimize Code, Maximize Data

Since then, and it was nearly 15 years ago, I have never once heard another programmer (except myself) express this very basic and simple idea. It is the best kept secret in programming. I can think of two reasons why most programmers working today have never heard the simple idea, "minimize code, maximize data."

First possibility: the guy was totally wrong. I highly doubt this, however, as his company is approaching its 40th year of worldwide growth. How many companies in this field last that long, never mind prosper and grow?

The second possibility seems much more likely to me: most programmers just don't think that way. Most programmers would never stumble upon this idea on their own and if they do hear it they forget it or reject it. We programmers love to code, and are reluctant at best to accept the idea that he who codes least codes best.

Now we will see how this simple idea impacts the entire software cycle, from designer to user.

The Example: Magazine Regulation

Our example is fairly easy. Consider a magazine distributor, a company that receives magazines in bulk and distributes them to stores. His system contains a table called DEFAULTS that lists the default quantity of each individual magazine that is distributed to each store (a big fat cross-reference table).

Now for the tricky part. Let's say we run a "SELECT SUM(qty)" from the DEFAULTS table for TV GUIDE and we get 2123. That means he needs 2123 copies of TV GUIDE to give every store its default delivery. But when the magazines arrived from the supplier, he received only 1900. What does he do? As it turns out, this happens every day: the distributor never receives his exact default requirements, and the shipment is always over or under. So the distributor must run a process called regulation.
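To make the shortfall concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. Only the DEFAULTS table name, the SUM(qty) query, and the figures 2123 and 1900 come from the text above; the column names (store, magazine) and the three store rows are assumptions invented so the numbers add up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical layout of the DEFAULTS cross-reference table.
conn.execute("CREATE TABLE defaults (store TEXT, magazine TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO defaults VALUES (?, ?, ?)",
    [
        ("Store A", "TV GUIDE", 1000),  # made-up rows that total 2123
        ("Store B", "TV GUIDE", 800),
        ("Store C", "TV GUIDE", 323),
    ],
)


def shortfall(conn, magazine, received):
    """Return default total minus copies received (positive means short)."""
    (needed,) = conn.execute(
        "SELECT COALESCE(SUM(qty), 0) FROM defaults WHERE magazine = ?",
        (magazine,),
    ).fetchone()
    return needed - received


print(shortfall(conn, "TV GUIDE", 1900))  # 223
```

A positive result means regulation has to cut deliveries; a negative one means there are extra copies to hand out.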

Regulation is an automatic computer process that runs through the defaults, compares them to what you actually have, and increases or decreases the delivery amounts to each store until everything lines up. It is by nature iterative: you run through the stores over and over, adjusting amounts by one or two until you balance out. Next we will look at how to write it.

Writing a Regulation Program

Before we get to the details of the regulation program, I have to point out one detail about magazine distribution. It turns out that 50-80% of magazines sent to retail racks go back to the distributors and are shredded. This means it is not uncommon to talk about sales percentages of 50% or 60%. If a store is receiving too many magazines, they might easily have a sales percent as low as 10%.

So how do we write the regulation program? I'm going to avoid any long build-up here and just go straight to the answer. You make a table of rules that determine how the regulation process works. The owner of the distribution company will say things like, "If the store sold less than 20% of the TV GUIDES for the past 4 weeks, drop their amount by 2, but do not go below 2." So you make a table with columns like "THRESHOLD_PERCENT", "DECREMENT" and "ABSOLUTE_MINIMUM". There will also be rules that operate when the company has too many magazines. The owner says, "If his sales percent is greater than 80%, give him two more." So your table now contains percentages and increase amounts to use when raising the distribution amounts.
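A rules table along these lines might be sketched as follows. The column names THRESHOLD_PERCENT, DECREMENT and ABSOLUTE_MINIMUM come from the paragraph above; the table name regulation_rules, the sequence and direction columns, and the single step column that doubles as decrement or increase amount are my own assumptions. The "past 4 weeks" window is not modeled here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE regulation_rules (
        sequence          INTEGER PRIMARY KEY,  -- order in which rules are tried (assumed)
        direction         TEXT NOT NULL
                          CHECK (direction IN ('decrease', 'increase')),
        threshold_percent INTEGER NOT NULL,     -- compared with the store's sales percent
        step              INTEGER NOT NULL,     -- DECREMENT, or the increase amount
        absolute_minimum  INTEGER               -- floor; only meaningful for decreases
    )
""")
conn.executemany(
    "INSERT INTO regulation_rules VALUES (?, ?, ?, ?, ?)",
    [
        (1, "decrease", 20, 2, 2),     # sold < 20%: drop by 2, never below 2
        (2, "increase", 80, 2, None),  # sold > 80%: give two more
    ],
)

# The regulation program reads the rules in order; it never hard-codes them.
rules = conn.execute(
    "SELECT direction, threshold_percent, step, absolute_minimum "
    "FROM regulation_rules ORDER BY sequence"
).fetchall()
```

When the owner changes his mind about a threshold, the change is an UPDATE to one row, not an edit to the program.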

From here your program is reduced to a fairly simple affair: it is little more than a double-nested loop. You fetch a list of the rules, in order, and you begin to iterate through the magazines, applying the current rule to the current magazine. You keep applying each rule over and over until there are no more cases where it applies, then you move on to the next rule. As soon as the total number of magazines matches the amount you have, the program terminates. It is the very soul of simplicity, and a perfect example of what database applications can do so well.
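The loop described above could look something like this in Python. It is a sketch under stated assumptions, not the distributor's actual program: it regulates a single magazine across stores, the Rule fields and the regulate function are hypothetical names, and I have assumed that a change is trimmed so it never overshoots the target total or dips below the rule's minimum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    direction: str           # "decrease" or "increase"
    threshold_percent: int   # compared with the store's sales percent
    step: int                # how much to add or remove per pass
    absolute_minimum: int = 0


def regulate(defaults, sales_percent, rules, available):
    """Adjust per-store quantities for one magazine until they sum to `available`."""
    qty = dict(defaults)
    total = sum(qty.values())
    for rule in rules:                          # outer loop: rules, in order
        changed = True
        while changed and total != available:   # repeat until the rule stops applying
            changed = False
            for store in qty:                   # inner loop: one small change per store
                surplus = total - available     # > 0 means we must cut
                if surplus == 0:
                    break
                pct = sales_percent[store]
                if rule.direction == "decrease":
                    if surplus < 0 or pct >= rule.threshold_percent:
                        continue
                    amount = min(rule.step, qty[store] - rule.absolute_minimum, surplus)
                    delta = -amount
                else:
                    if surplus > 0 or pct <= rule.threshold_percent:
                        continue
                    amount = min(rule.step, -surplus)
                    delta = amount
                if amount <= 0:                 # already at the floor
                    continue
                qty[store] += delta
                total += delta
                changed = True
    return qty


rules = [Rule("decrease", 20, 2, 2), Rule("increase", 80, 2)]
defaults = {"A": 10, "B": 10, "C": 10}
sales = {"A": 10, "B": 50, "C": 90}
print(regulate(defaults, sales, rules, available=25))  # {'A': 5, 'B': 10, 'C': 10}
```

Notice that nothing in regulate knows about 20%, 80%, or "drop by 2". All of that lives in the rules.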

The Wrong Way To Do It

It so happens that I have a customer who is a magazine distributor, which is how I learned about this process. The system he is using now, which I maintain for him, was not written as I described it above. There is no table of rules. The programmer coded up each different rule separately, and the programmer is now dead (I feel like Dave Barry saying this, but I swear I am not making this up). The owner and I are both afraid to touch the code, and so he lives with it while we plan for something better.

While I do not wish to speak ill of the deceased, it is safe to say that the original programmer did not know he was supposed to minimize code and maximize data. Let's examine the consequences of that ignorance:

The customer is not in total control of his own business, because he cannot reliably modify one of his most basic operations.

So we can now draw the conclusion that the "minimize code, maximize data" rule has very direct consequences for the user experience, especially for those very important users who sign the checks.

The Impact Upon Your Code

A code grinder who is not trying to minimize code usually ends up with very complicated programs. This is true even of veteran and expert programmers who supposedly know better. They set out with an idea to simplify things and then end up writing complex, heavily interdependent class libraries. I contend that this happens because they don't know that they are supposed to minimize code by maximizing data.

On the other hand, the database programmer who knows to maximize data ends up with dramatically simpler code patterns. This happens because a lot of the complex conditionals and branching logic that is required for code-centric solutions is reduced to row scanning operations that contain much simpler algorithms in the innermost loop.

We can see this in the example above for the regulation program. The data-centric solution is basically one loop nested inside of another with the core operation occurring on the inside. The code-centric solution, which I mentioned we are afraid to touch, is full of conditionals and branches that make it dangerous to mess with for fear of causing unintended side-effects. The original programmer could probably do anything with it; everybody else is reduced to doing nothing.
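A toy contrast may help here. The two functions below make the same one-shot adjustment to a store's quantity; the first buries the rules in conditionals, the second scans rows. The function names and the row layout (comparison, threshold, delta, floor) are hypothetical, chosen only for illustration.

```python
import operator

# Code-centric: every rule is a branch, so every new rule is a code change.
def adjust_hardcoded(sales_percent, qty):
    if sales_percent < 20:
        return max(qty - 2, 2)
    elif sales_percent > 80:
        return qty + 2
    return qty


# Data-centric: the rules are rows; the code only scans them in order.
COMPARE = {"<": operator.lt, ">": operator.gt}
RULES = [
    # (comparison, threshold_percent, delta, floor)
    ("<", 20, -2, 2),
    (">", 80, +2, 0),
]


def adjust_from_rules(rules, sales_percent, qty):
    for comparison, threshold, delta, floor in rules:
        if COMPARE[comparison](sales_percent, threshold):
            return max(qty + delta, floor)  # first matching row wins
    return qty


# A new business rule is one more row, with no new branch to get wrong.
EXTENDED = RULES + [("<", 40, -1, 2)]
print(adjust_from_rules(EXTENDED, 30, 10))  # 9
```

With two rules the difference looks cosmetic. With twenty rules and their exceptions, the first function is the program nobody dares to touch.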

The problem becomes more exaggerated as time goes on. As programs mature, their original simplicity is enhanced to handle more sophisticated edge cases and exceptions. As this process continues, many of these sophisticated edge cases and exceptions will be mutually exclusive or will interact in subtle ways. When this happens the code-centric program becomes increasingly difficult to understand and keep correct, as the layering of conditionals and branches and interdependencies makes it harder and harder to eliminate unwanted side effects. The data-centric solution on the other hand, while still becoming more difficult, is reduced to simply making sure that the tables provide the correct options for the code, and the code remains a matter of scanning the rules, picking precedence, and executing them.

The Impact on Debugging

It is much easier to debug data than code, especially when a simple operation has been matured to handle a collection of subtle edge cases and exceptions that may interact with each other.

If we continue with the example of the regulation process, the basic question arises: how do you test this thing? The entire concept of reliable testing is far more than we can cover in a single essay, but I do want to introduce the idea of testing with, you guessed it, more data.

The regulation program as I have described it loops through rules and magazine quantities and adjusts those quantities. It is a row-by-row process, with each pass through the inner loop executing a single change. Debugging this process can be vastly simplified if a table is made that records, in order, each update that was made and what rule was used to make the update. Obvious errors could be detected if a rule was being applied out of order, or if the rule made the wrong change. Generating the test cases brings in yet again more data, creating a body of magazines, defaults and customers that deliberately stresses the system with extreme cases.
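Such a log could be sketched like this, again with Python's sqlite3 module. The table name regulation_log, its columns, and the two helper functions are assumptions of mine; the idea of recording each update in order, together with the rule that made it, is the one described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE regulation_log (
        seq           INTEGER PRIMARY KEY AUTOINCREMENT,  -- order of the updates
        rule_sequence INTEGER NOT NULL,                   -- which rule made the change
        store         TEXT NOT NULL,
        old_qty       INTEGER NOT NULL,
        new_qty       INTEGER NOT NULL
    )
""")


def log_update(conn, rule_sequence, store, old_qty, new_qty):
    """Record one change made by the inner loop."""
    conn.execute(
        "INSERT INTO regulation_log (rule_sequence, store, old_qty, new_qty) "
        "VALUES (?, ?, ?, ?)",
        (rule_sequence, store, old_qty, new_qty),
    )


def out_of_order_entries(conn):
    """Return the seq of every entry logged after a later rule had already run."""
    bad, highest = [], None
    for seq, rule_sequence in conn.execute(
        "SELECT seq, rule_sequence FROM regulation_log ORDER BY seq"
    ):
        if highest is not None and rule_sequence < highest:
            bad.append(seq)
        highest = rule_sequence if highest is None else max(highest, rule_sequence)
    return bad


log_update(conn, 1, "A", 10, 8)
log_update(conn, 1, "A", 8, 6)
log_update(conn, 2, "C", 10, 12)
log_update(conn, 1, "A", 6, 4)     # deliberately wrong: rule 1 after rule 2
print(out_of_order_entries(conn))  # [4]
```

The check itself is just another query over data, which is rather the point.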

This debug-the-data approach can have a huge impact on boosting the customer's confidence and control. If he can directly view this log he has a perfect black-and-white explanation of how the process ran. If he does not like it he can change the rules. If you let your program run in "planning only" mode, where it generates these logs without making the changes, then your customer can play what-if and you will find you have truly made a new friend!
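A "planning only" switch can be as small as the sketch below. The function name, the plan_only flag, and the shape of the changes list are hypothetical, and for brevity the changes are passed in already computed, where a real run would derive them from the rules.

```python
def run_regulation(quantities, changes, plan_only=False):
    """Apply (rule_sequence, store, delta) changes and return the log.

    With plan_only=True the log is produced but `quantities` is left untouched.
    """
    working = dict(quantities)          # always work on a copy
    log = []
    for rule_sequence, store, delta in changes:
        old = working[store]
        working[store] = old + delta
        log.append((rule_sequence, store, old, working[store]))
    if not plan_only:
        quantities.update(working)      # commit only on a real run
    return log


quantities = {"A": 10, "C": 10}
changes = [(1, "A", -2), (2, "C", +2)]

plan = run_regulation(quantities, changes, plan_only=True)
print(plan)        # [(1, 'A', 10, 8), (2, 'C', 10, 12)]
print(quantities)  # {'A': 10, 'C': 10}
```

The customer reads the plan, tweaks a rule, and runs it again, all without a single delivery amount changing.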

Meta-Data and the Data Dictionary

This week's essay leads naturally to the matter of meta-data and data dictionaries and how they can dramatically reduce the amount of code needed for routine table maintenance tasks. However, that is a large topic and must be reserved for one or more future essays.

When we get to the topic of data dictionaries, we will be looking at how a data dictionary can give you true zero-code generated table maintenance forms, among other things.

Conclusion

The motto "Minimize Code, Maximize Data" is not well known in popular discussions today, which I contend is a natural consequence of our basic personalities as programmers. Coding is what we do and we tend to think of solving problems as an exercise in code and more code.

Nevertheless, we have seen this week in a specific example (which could be repeated many times over) that the code-centric versus data-centric decision impacts everybody. The rule "Minimize Code, Maximize Data" has positive impacts on the coding process, the debugging process, the maintenance process, and the user experience. Since that covers all parties concerned with software development, it is safe to conclude that this is a crucial design concept.

Related Essays

Other philosophy essays are:

31 comments:

zippy1981 said...

Ken,

This article is the first hit on Google for "Minimize Code, Maximize Data" when you quote it. I guess that just proves that this is a very foreign concept to average software developers.

You verbalize a truism that I, and I'm sure other programmers, probably need to hear on a regular basis.

I think programmers such as myself, have a tendency to combine the factory pattern with run time loading to try to create the one size almost fits all solution. We then stick all our per (customer|document type|file format) in classes instantiated by said factories. While this approach has its place, it doesn't solve any of the problems that "Minimize Code, Maximize Data" does. It just neatly segregates all the messy business logic into distinct heaping piles of hard to maintain code.

So from what I can tell your mentor is the creator of a project called MUMPS written in Fortran 90. Well, Fortran has a reputation of being written by scientists and engineers, people who code as a means to another end. If your mentor is one who sees coding as a means to an end, and is familiar with having a lot of data that he would have to process in different ways until he found some sort of meaningful pattern, his mindset makes perfect sense. Those of us who like to code for the sake of coding would never make this simple realization in a million years.

This also explains why whenever I need to keep track of data that means something to me or troubleshoot a problem I find myself using Excel or SQL queries as opposed to writing some throwaway C# or PHP. Once one gets over the "I am a coder I write code" mindset this makes perfect sense.

Now if you excuse me I have to refactor some code to use a simple table so I can turn 2 php scripts that generate xml into one.

Gary Capell said...

Pretty sure this is also a theme in "The Practice of Programming" by Kernighan and Pike, and "Programming Pearls", by Bentley.

Alex said...

Can you share the name of your mentor?

Anonymous said...

one could also consider table oriented programming (http://www.geocities.com/tablizer/top.htm). Also, Apple's WebObjects for more than 10 years has had a rules-system that not only allows one to externalize such conditional processing from the code, the rules system can actually be used to generate the entire program (http://developer.apple.com/documentation/WebObjects/Developing_With_D2W/Architecture/chapter_3_section_9.html#//apple_ref/doc/uid/TP30001015-DontLinkChapterID_2-BAJHEDBJ). WebObjects can not only generate web applications, but also web services and even java rich clients.

KenDowns said...

alex: I did not do that because I did not want to name drop, but if you look at zippy's comments he got it right, you can google it from there.

KenDowns said...

anonymous: I doubt they're as cool as my own framework, www.andromeda-project.org :)

Alex said...

Ken: No problem with name dropping, but in science a quote without a citation is rather useless ;-)

KenDowns said...

alex: yes, you are right. Neil Pappalardo.

Alex said...

Ken: Thanks a lot - your hint will help me a lot within my work!

manoreza said...

Minimize Code, Maximize Data calls for a T-Shirt! I take a large in black please.

As others have said, the concept is not new but it is rarely employed. Code run amok is the order of the day. Over complicated, too many dependencies, no documentation. I far prefer to change a table than to change 10 programs.

As this series continues, I would like to see more side to side comparisons of the right and wrong way to do things with small data grids and code. Let's show the nitty gritty and we will convert more programmers into true believers!

Stacy Curl said...

I think that pure functional languages such as Haskell have a similar orientation. In a well written Haskell program the core part of the program consists of expressions operating over data, and one only finds statements & side-effects in the 'outer shell' : the parts that feed the inner pure parts.

Since watching SICP I can't help judging languages according to Abelson's & Sussman's criteria: (i) What are the primitives? (ii) What are the means of composition? (iii) What are the means of abstraction? and crucially: (iv) Is the language closed (in a mathematical sense) under (ii) & (iii)?

SQL could be a better language because whilst it comes with good primitives and some good means of composition and abstraction it is not closed.

Because SQL is not closed one is forced out of the database too soon, as soon as one wants to create derivative data, essentially a view of the original data one almost always has to leave the database. As soon as one is out of the database one lands more often than not in an imperative, predominantly side-effectful language.

I think the recent trend of DSLs moves code toward a healthier situation: place code under control of the business by providing an interface to it (in the form of a language) that business can understand. I think a business can understand more than plain ol' data. So I would argue that not all code need disempower those it was written for.

KenDowns said...

Stacy: your comments give us all much food for thought. But let me ask you specifically why you find that SQL forces you "out of the database." My experience does not follow this, so I hoped to find out more of what you meant before responding to anything particular.

Stacy Curl said...

Suppose I keep a record of how far I have travelled recently using various vehicles. Here's a query to bring back some information about my travels:

SELECT Vehicle, Miles
FROM TripLog
WHERE Vehicle IN ( 'Car', 'Bike' )

Now say that I want to remove the hard coded values 'Car' & 'Bike' and replaced them with 'vehicles that get good gas-mileage'.

-- EfficientVehicles:
SELECT Type FROM Vehicle
WHERE ( Miles / MPG ) > 5.00

Now what I'd like to do is take the data from the TripLog query ('Car', 'Bike') and replace it with a reference to the EfficientVehicles query:

SELECT Vehicle, Miles
FROM TripLog
WHERE Vehicle IN EfficientVehicles

I could then bind all of this to a view: EfficientTripLog, this would enable me to reference these new views of the data from any application. I could also potentially build more complex aggregations & views on top of them.

Now I may be wrong but I think that the above is not valid SQL. Furthermore I think that even the approach of functional decomposition cannot be cleanly performed in SQL.
The 'WHERE Vehicle IN EfficientVehicles' clause would effectively have to be inlined into the original query and probably rewritten to use a join instead.

In my previous comment I said that SQL was not closed; the problem with SQL goes further: it does not support sufficient means of abstraction. You can assign names only to built-in types: Tables, Columns, Views, Sprocs. You cannot name where clauses or a collection of columns. You cannot build your own aggregations and name these. You cannot abstract over any part of a select statement and give a name to that. You cannot invoke a named query and pass a 'where clause' as an argument.

If SQL had better support for abstraction & composition (and closure of these) then much less code would be needed in other languages.

Of course if SQL had these features they could be (ab)used to create code that disempowers the end user, code that is not written in terms of the domain the user understands.


Hope this helps elucidate my view a little more clearly.


(Examples copied, but modified from: http://jasonf-blog.blogspot.com/2007/04/closures-in-sql.html)

http://en.wikipedia.org/wiki/Closure_(mathematics)

KenDowns said...

Stacy: Thanks for the clarification. Personally I have used SQL for so long that I long ago stopped looking for the improvements you suggest and understood that SQL makes me do a lot of typing. Most of the queries you mention are possible, it's just that SQL makes you do a lot of the work. Long ago I adopted a mentality aimed at generating most of my SQL from data dictionaries, so the problem for me is effectively solved.

However, when you mention things like being able to name a WHERE clause, it really ought to shake up the vendors as to how far SQL has not gone while so many other languages provide such basic abstractions.

My own personal big item is that SQL should be tremendously smarter about "knowing what I mean", for example your query:

SELECT Vehicle, Miles
FROM TripLog
WHERE Vehicle IN EfficientVehicles

...which will not work, should work, because SQL should be able to identify the keys and infer the link between TripLog and EfficientVehicles, only throwing an error if there is ambiguity.

But, with a sigh, I do not expect these things to appear. From a practical standpoint I built a dictionary layer that works out these things for me and I moved on...

John "Z-Bo" Zabroski said...

@Ken: "Personally I have used SQL for so long that I long ago stopped looking for the improvements you suggest and understood that SQL makes me do a lot of typing."

All I can do is smile.

@Stacy: the hard part is not typing, it's quickly designing queries that can answer complicated questions, and making them perform well. this means understanding cost-based optimizers in addition to expressing intent. there is fertile ground for this. i suggest looking at kleisli, k2, etc. however, these tools are not a "framework".

also, i think anyone who depends on sql directly for most work is a nut.

didroe said...

Stacy:

You can do this:

CREATE VIEW EfficientVehicles AS
SELECT Type FROM Vehicle
WHERE ( Miles / MPG ) > 5.00
/

CREATE VIEW EfficientTripLog AS
SELECT Vehicle, Miles
FROM TripLog
WHERE Vehicle IN
(SELECT Type FROM EfficientVehicles)
/

Granted it would be nicer if a single column table/view could be treated as a set without a SELECT (eg. IN EfficientVehicles).

"You cannot build your own aggregations and name these."
That's what views are for.

"You cannot abstract over any part of a select statement and give a name to that."
You can't with the text of the query (although you could with stored procedures) but you can use views to abstract things pretty nicely.

"You cannot invoke a named query and pass a 'where clause' as an argument."

CREATE VIEW my_view ...(complicated SQL);
SELECT * FROM my_view WHERE a_column = 'some value';

Anonymous said...

What's the difference between code and data? Going back to the dawn of computing and the salad days of LISP, there is no difference between code and data.

So in reality you can't minimize code over data. It's all the same thing. Also, being a big company founder may mean you know something about systems or it may mean you're a great salesman. If you had said "My mentor is a great computer theorist and he's Alan Kay" it would carry more weight.

KenDowns said...

Anonymous: nonsense. What's the difference between butter and a throwing star, they're both electrons, protons and neutrons, no? It does make a big difference though if one of them is thrown at your chest at high speed.

With all due respect to Von Neumann, the contention that code and data are identical is only true at a level far deeper than anything the application developer sees. At our level code and data are completely different animals that follow different principles: we use different tools to create and handle them, the rules for modifying each are totally different, the qualifications for being allowed to create or modify each are totally different, and their value to the customer is completely different.

Sandro Magi said...

Stacy, I agree that a better query language is needed. This topic came up on LtU awhile back, and "higher-order SQL" was discussed as a solution. Current SQL is a rather poorly abstracted language, being relegated to first-order constructs, but there seems to be no fundamental reason the DB couldn't execute a higher-order language. Queries would thus be much more concise, and there could be more "query reuse", just as higher-order code reuse results from higher-order functions. The only stumbling block is ensuring termination, but there are many simple ways to ensure termination in a higher-order language, so that shouldn't be a problem.

Anonymous said...

Right but wrong...

Database centric application logic is generally a bad idea. Data is NOT easier to debug than code. Remember, data has no behavior. Therefore the significance of the data in the tables is really a mindset of the programmer and no more or less maintainable than anything else.

Also, since the SQL used to query (decode?) this "logic" is written in very limited SQL or worse a stored procedure language, the expressiveness of the logic will be more or less obfuscated by mechanisms which were intended to merely persist "data".

What if the necessary "logic" you prescribe to the database can't even be optimized properly by the database and should have been implemented using the app tier and not the DB tier? Your app will perform like a dog.

Coding in your database is a bad idea. And if anyone wants to take stored procedures to the next level (hint: there isn't one) then go for it.

What I was hoping you would say instead of saying min code max data was that software should avoid retyping data structures within languages like Java C# etc, to enable the domain to grow new fingers and cut other ones off with a minimal amount of code break.

Max data to me does not mean max data supported algorithms, it means the data in my app should be able to change and that assumption should be present throughout the system.

Data supported algorithms are just as dangerous and avoidable as obtuse object models in the hands of OOP experts.

Stelios said...

I strongly agree. The database is the BASE of an application. It's the last thing you have to change and the starting point for your application.

@didroe
@stacy
@KenDowns
Why not something like:
SELECT Vehicle, Miles
FROM TripLog
WHERE Vehicle IN
(SELECT Type FROM Vehicle
WHERE ( Miles / MPG ) > 5.00)

It's elementary and working on most of the DBs

Kragen Javier Sitaker said...

Tim Berners-Lee's Principle of Least Power, which underlies the design of the web, is a more nuanced version of "Minimize Code, Maximize Data".

Casper Bang said...

Good post. The Pragmatic Programmer (Andrew Hunt & Dave Thomas) are using a slightly different wording and context, but essentially it boils down to the same thing. They say "Configure, don't integrate".

KenDowns said...

Anonymous: It appears unlikely that we will ever agree, though it might be fun if we ever had a chance to sit down and really debate the matter.

I'll repeat my basic thesis and comment no further (though I have been inspired to write an entire entry on the matter): at the application level data and code are fundamentally different, regardless of their relationship at lower levels, and the application developer has everything to gain practically from treating them differently.

KenDowns said...

Kragen: I'm embarrassed to say I was unaware of Berners-Lee's thesis until recently. I gave it a quick read and was very impressed. I recommend that anybody interested in this topic google it and read it. Much food for thought there.

KenDowns said...

Casper: The Pragmatic Programmer is probably one of the best overall combinations of theory and practicality I have ever read. Required reading in my shop.

John "Z-Bo" Zabroski said...

I gather anonymous is saying that duplication is a bad thing, especially when you duplicate metadata, because you now have two (or more) sources of data and business rules, and therefore business processes are undefined.

Seems like anonymous is just grandstanding and inadvertently talking past Ken. This is not a battle of wits.

By the way, in Meditech, the system defined by Neil Pappalardo, each datum has one metadatum and one business rule. Even tho' I would not recommend purchasing a system from Meditech (due to other design flaws), this particular aspect of Meditech is incredibly well done.

KenDowns said...

Z-BO: "each datum has one metadatum and one business rule". I did my best to implement this idea in Andromeda, and am pretty happy with the result.

I did not separate a 'datum', or fact, from its implementation as a column in a table. So we would say each column is completely defined in a single location. Outside application code can refer to the meta-data for assistance in generating HTML (or anything else), but it cannot override the meta data.

nate said...

Ken,

Great article, really spot-on. I've always aimed to design my software like this, though never been able to express the concept as succinctly. I wrote about it briefly here: http://debuggable.com/posts/code-insults-round-1---why-switch-blocks-are-dumb:4901d363-d210-482c-9794-65bd4834cda3, but with a slightly different approach and terminology.

Scratchy Tag said...

Hi Ken,

Great post, stimulates thinking. Comments are good too. Maybe this could be boiled down to one or more Rules of Thumb, such as:

1) if it *can* go into the DB instead of the code, then put it in the former.

2) any list that appears in the code (think of long, nested CASE statements) is presumptively a candidate for the DB. This even includes classic GUI elements such as the members of a pull-down list.

Anonymous has a grumpy attitude, but he was on to something with the reference to Von Neumann and McCarthy/LISP. One program's code is another program's data.

In an application, some things must be in the code, others must be in the DB, leaving much that may go into either. Those tables that are not basic facts (like your REGULATION table) are a kind of code, a structured representation of a business rule.

By moving it to the DB, this logic becomes more transparent to all (especially to the customer) and thus more maintainable.

I have seen advocacy of 'Fat Database', wherein you move everything possible into the DB (which includes catalog elements like stored procedures) in order to keep the logic close to the data (see, e.g., http://ora-00001.blogspot.com/2009/06/fat-database-or-thick-database-approach.html).

But your approach puts as much of the program logic as it can into tables. In my own work I would try to mark tables by category, such as 'basic facts', 'lookup/reference', business rules, GUI lists, etc., but this is merely window dressing.

Anonymous said...

http://www.lysator.liu.se/c/pikestyle.html


Rule 5. Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming. (See Brooks p. 102.)

--Rob Pike, 1989