Sunday, October 26, 2008

Data and Code at the Application Level

This week I would like to address the assertion that "code is data" and how the application developer might benefit or be harmed by this idea in the practical pursuit of deadlines and functioning code. For some reason my essay from back in May, "Minimize Code, Maximize Data," got picked up on the blogosphere on Thursday, and comments on ycombinator, on reddit.com, and on the post itself have suggested the thesis is flawed or unworkable because "code is data." Let's take a look at that.

Credit Where Credit Is Due

I first heard the thesis "Minimize Code, Maximize Data" from A. Neil Pappalardo. I consider it the "best kept secret in programming" because I personally have found it to be almost completely absent from my own day-to-day experience with other programmers.

However, glomek over at reddit.com also credits Eric Raymond with the following quote, "Smart data structures and dumb code works a lot better than the other way around."

Also, sciolizer over on the news.ycombinator.com comments area gives us these quotes from some of the greats:

  • Fred Brooks: "Show me your flow charts and conceal your tables and I shall continue to be mystified, show me your tables and I won't usually need your flow charts; they'll be obvious."
  • Rob Pike: "Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming."
  • Eric S. Raymond (again): "Fold knowledge into data, so program logic can be stupid and robust."
  • Peter Norvig: "Use data-driven programming, where pattern/action pairs are stored in a table."

And finally, Kragen Javier Sitaker left a comment on my original essay mentioning Tim Berners-Lee and his theory of "least power." You can read a description of that here.

So Why Do They Say Code is Data?

The surprising answer is that code is data, in particular contexts and when trying to accomplish certain tasks. The contexts do not include application development, and the tasks do not involve storing customer information, but the fact remains true for those who work in the right contexts.

As an example, at the bottom layer of the modern computer are the physical devices of CPU and RAM. Both the computer program being executed and the data it operates on are stored in RAM in the same way. This is called the Von Neumann architecture. It's a fascinating study, and a programmer can only be improved by understanding it. At this level code is data in the most fundamental way. There are many other contexts and tasks for which it is true that code is data.
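For readers who want to see one of those contexts first-hand, here is a minimal sketch in Python. It is not the machine level that Von Neumann describes, only an analogous illustration: the interpreter exposes a function's compiled body as an ordinary bytes object, and a string of source text can be compiled and run. The function `add` and the variable names are made up for the example.

```python
def add(a, b):
    """An ordinary function: the 'code' side of the question."""
    return a + b

# The compiled body of the function is available as a plain bytes object,
# which can be measured, copied, or stored like any other data.
bytecode = add.__code__.co_code

# Source text is data too: compile a string, then execute the result.
source = "result = 6 * 7"
namespace = {}
exec(compile(source, "<example>", "exec"), namespace)
answer = namespace["result"]
```

This is the kind of work done by people who write compilers, interpreters and debuggers. It is true in their context, and it is not what the application developer is doing when storing a customer's address.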

But we who create applications for customers are separated from Von Neumann by decades. These decades have seen a larger and larger stack of tools that allow us to concentrate on specialized tasks without worrying about how the tools below are doing their jobs. One of the most significant sets of tools that we use allows us to cleanly separate code from data and handle them differently.

The One and Only Difference

Trying to explain the differences between code and data is like trying to explain the differences between a fish and a bicycle. You can get bogged down endlessly explaining the rubber tire of the wheel, which the fish does not even have, or explaining the complexity of the gills, which the bicycle does not even have.

To avoid all of that nonsense I want to go straight to what data is and what code is. The differences after that are apparent.

Data is an inert record of fact. It does nothing but sit there.

A program is the actor, the agent, the power. The application program picks up the data, shakes it, polishes it, and puts it down somewhere else (as in picking it up from the db server, transforming it into HTML, and delivering it to a browser).

To repeat: data is facts. Code is actions that operate on facts. The one and only difference is simply that they are not the same thing at all; they are a fish and a bicycle.
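A small sketch in Python may make the distinction concrete. The `customers` rows are hypothetical and stand in for whatever a query against the db server would return; `render_table` is likewise an invented name, not part of any framework.

```python
from html import escape

# Data: an inert record of fact. It does nothing but sit there.
# (Hypothetical rows, standing in for a database query result.)
customers = [
    {"name": "Ada & Co", "city": "London"},
    {"name": "Hopper Ltd", "city": "New York"},
]

def render_table(rows):
    """Code: pick the facts up, transform them into HTML, put them down."""
    if not rows:
        return "<table></table>"
    columns = list(rows[0].keys())
    header = "".join(f"<th>{escape(col)}</th>" for col in columns)
    body = "".join(
        "<tr>"
        + "".join(f"<td>{escape(str(row[col]))}</td>" for col in columns)
        + "</tr>"
        for row in rows
    )
    return f"<table><tr>{header}</tr>{body}</table>"

page = render_table(customers)
```

Nothing in `customers` can act, and nothing in `render_table` is a fact about a customer. Add a row or a column and the function does not change.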

Exploiting The Difference

All of the quotes listed above, and my original essay in May on the subject, try to bring home a certain point. This point is simply that the better class of programs are those that begin with a distinction between fact and action, and seek first to organize the facts and only then to plan the actions.

Put another way, it is of enormous practical advantage to the programmer to fully understand that first and always he is manipulating facts (data). If he ignores the principles of how facts are organized and operated on, he can never reach his full abilities as a programmer. Only when he understands how the facts are organized can he see the clearest program designs.

And again: Understand the facts first. From there design your data structures. After that the algorithms write themselves.

Minimizing and Maximizing

The specific advice to minimize code and maximize data is nothing more than taking the idea to its logical conclusion. If I write program X so that the data structures are paramount, and I find the algorithms to be simple (or "dumb" as ESR would say), easy to write and easy to maintain, don't I want to do that all of the time?
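As a rough illustration, here is a minimal sketch in Python of what that can look like for field validation. The `FIELDS` table, its keys, and the `validate` function are all hypothetical, invented for this example and not taken from any particular framework.

```python
# Maximize data: what we know about each field lives in a table.
# (Hypothetical field definitions, for illustration only.)
FIELDS = {
    "name":  {"type": str, "required": True,  "max_length": 40},
    "state": {"type": str, "required": True,  "max_length": 2},
    "notes": {"type": str, "required": False, "max_length": 200},
}

def validate(record, fields=FIELDS):
    """Minimize code: one 'dumb' loop applies whatever the table says."""
    errors = []
    for name, spec in fields.items():
        value = record.get(name)
        if value is None or value == "":
            if spec["required"]:
                errors.append(f"{name} is required")
            continue
        if not isinstance(value, spec["type"]):
            errors.append(f"{name} must be {spec['type'].__name__}")
        elif len(value) > spec["max_length"]:
            errors.append(f"{name} is too long (max {spec['max_length']})")
    return errors

errors = validate({"name": "Ken", "state": "New York"})
```

Here `errors` holds a single message, that the state is too long. Supporting a new field means adding a row to `FIELDS`; the loop stays as it is.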

Conclusion

The wise programmer is one who can take the wisdom and theory of the industry and correctly judge what is appropriate and applicable and what is not. This is the programmer who has a shot at keeping focused, making budget and making deadlines. He knows when a generalized routine will support the overall project and when to just code the case at hand and move on.

The unwise programmer is one who cannot properly apply a theoretical concept to the correct context, or cannot judge the context in which a concept is appropriate. He is the one who produces mammoth abstractions, loses sight of the end-goals of the check-signers and end-users, and never seems to be able to make the deadline.

Of all of the advice I have received over the years, one of the most useful and productive has been to "minimize code, maximize data." As an application developer and framework developer it has served me better than most.

8 comments:

Greg said...

Don't let the redditors, etc. bug you. Any time someone gives solid advice the comments usually are filled with people using South Park philosophy: "well, both sides are right! Sometimes you should focus on code, sometimes you should focus on data. Moderation is key!"

Of course, those of us who have been around the block realize that saying you shouldn't *always* do something isn't really advice, it's just common sense and can be applied to anything, from programming to diet and exercise. Your point stands regardless: it is important for programmers (particularly today, when RDBMS's are decried left and right for 'object databases') to remember that at the end of the day they are operating on data and need to think about how they model their data and operate on it. Getting the data model correct, in my experience as well, is hard but worthwhile, as the code ends up being simpler.

KenDowns said...

Greg: What I've noticed is that there are plenty of strong comments on both ycombinator and reddit, but the weakest comments will be found on reddit and the most compelling will usually be found on ycombinator.

Your comment that advice to "don't always do x" is not really advice is well taken. Basically content free.

Anonymous said...

Maybe folks get confused by "code", in which some people see "code" as a finite representation of something that is somehow unchanging (as in a coded transmission). Have you ever heard of "legacy data" as opposed to "legacy code"?

John "Z-Bo" Zabroski said...

@anonymous

Well, it seems like you were asking a rhetorical question and forgot to follow up with the answer. I'll bite and ask: what is "legacy data"?

Is it data defined by a data model that is not at least in BCNF?

There is some literature that vaguely discusses how to normalize a schema that was never normalized, and then how to denormalize it when it makes sense.

It's not mainstream, mainly because it's not merely intuitable and requires discipline to get value out of. The more disciplined something is, the less likely people are going to investigate the benefits and simply complain about how disciplined it is. At this point, you just have to ask yourself, How do parents get their kids to do their homework? Well, step 1 is that they don't bother helping out children not their own.

However, 50% of the world's currently employed programmers are untrained and develop their skills informally. They know nothing about discipline, and take years of getting burned by various scenarios before they finally start to accept there is a grand picture: a portrait of discipline. In the words of Erich Gamma, “You have to feel the pain of a design which has some problem. [Only then] you. . . appreciate a pattern. Like realizing your design isn’t flexible enough, a single change ripples through the entire system, you have to duplicate code, or the code is just getting more and more complex. If you then apply a pattern in such a messy situation it can happen that the pain goes away and you feel good afterwards. It’s an eye opener to realize that oh, actually this pattern, factory or strategy, is a solution to my problem.”

@Ken

Why is the Pragmatic Programmer required reading? I read it, and it doesn't even mention the principle that is even more important than Minimize Code, Maximize Data: Fire and Motion. When you have the chance, read Joel Spolsky's two essays on the topic.

KenDowns said...

@z-bo: The Pragmatic Programmer was one of those books that presented about half of my own experience to me with far more insight than I had, allowing me to immediately grasp the concepts. Based on that, I had great faith in the other half and took great benefit from it.

In short: they understand the balance between theory and practice, and are not selling a silver bullet, just their experience.

John "Z-Bo" Zabroski said...

My basic strategy for blending theory and practice is to know theory at a superficial level and apply it. It sounds shameful, but most businesses can't and don't want to hire rockstar computer scientists to solve their problems. They want to hire programmers who know the theory well enough to apply it. At the same time, knowing what I don't know is key, and stops me from proposing theories of my own, which in a production environment will always be totally uncalled for. It's learning this temperament that appears to be what you, the Pragmatic Programmers, and I have done. Selling round wheels is a surprisingly healthy way to make a living, and less risky than trying to commercialize faster travel by rocket.

I'll repeat my suggestion of reading Fire and Motion, and add it to your collection of zen koans.

Anonymous said...

@z-bo: I think over the long-term the data (related to the model) is more important than the means at which it is glued (coded) together for presentation (the human factor).

For instance, we may intuitively know that a state field groups outcomes by geography, but without the record of a state in relation to its correlated attributes (creating meaningfulness), having the means of anticipating and grouping the data is meaningless without means to store and retrieve the data itself accurately and normally.

Content matters, although the medium is often interpreted as the message (to paraphrase McLuhan), which is how the process goes whacky. Quality control is measured by strength of process to produce appropriate quality according to accepted guidelines, which demands acceptable rigidness and discipline to assure performance.

John "Z-Bo" Zabroski said...

@anonymous

Well, I follow your guidelines, but there are still times when they get me in trouble. It's just that the reward outweighs the risk, but as with any solution to a problem it always comes with future aggravation.

The way I measure my quality as a programmer is how different my problems are compared with others. My problems tend to be microscopic, not macroscopic.

I am also a huge fan of McLuhan.