tag:blogger.com,1999:blog-4269223998705770722024-03-19T05:29:44.065-04:00The Database ProgrammerAll things related to database applications, both desktop and web.KenDownshttp://www.blogger.com/profile/11117175783163937575noreply@blogger.comBlogger75125tag:blogger.com,1999:blog-426922399870577072.post-89560168345472383502013-01-16T21:36:00.000-05:002013-01-16T21:37:45.269-05:00All Software Development is Schema Management<div dir="ltr" style="text-align: left;" trbidi="on">
Before you automatically disagree, stop a bit and think about it. Can you think of any code you have ever written that did not handle data of some sort? Of course not. This is not about relational data either, it is about any data. There is no software that does not process data. Even the textbook example of some function that squares a number is processing the parameter and returning a value. The first line of that function, in most languages, names the input parameter and its type. That is a schema. <br />
<br />
This remains true as you climb out of the toy textbook examples into simple one-off programs, into small packages, and ultimately to apps that build to megabytes of executables running on a rack (or a floor) of servers. Just as each method call has parameters, so does each class have its attributes, every table its columns, every XML file its XSD (or so we hope), and that code works if and only if everybody on the team understood what was supposed to happen to the data.<br />
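To make that concrete, here is a minimal sketch in Python: even a one-line function declares a schema in the sense used here, because its signature names its input and its type.

```python
# A function signature is a tiny schema: it names the input and its type.
# Anything calling square() works if and only if it respects that contract.
def square(n: int) -> int:
    return n * n

print(square(7))  # 49
```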
<h2>
<br /></h2>
<h2>
Are We Talking UI or Platform or Middleware?</h2>
<div>
<br /></div>
<div>
Generally today we are talking about the server side of things. This is about any project that is going to take a user request and consult a data store: the file system, a search engine, a database, a NoSQL database, or an XML database. If you go to the disk, that is what we are talking about.</div>
<h2>
<br /></h2>
<h2>
New Versions Are (Almost Always) About Schema Changes</h2>
<div>
<br /></div>
<div>
So imagine you've got some working code. Maybe a single script file, or perhaps a library or package, or maybe some huge application with hundreds of files. It is time to change it. In my experience new code means some kind of change to the schema.</div>
<div>
<br /></div>
<div>
I cannot prove that, nor do I intend to try. It is a mostly-true, not an always-true.</div>
<h2>
<br /></h2>
<h2>
Schema Here Does Not Mean Relational</h2>
<div>
<br /></div>
<div>
This is not about relational schemas, though they are included. If you are using Mongo or some other NoSQL Database which does not manage the schema for you, it just means you are managing the schema yourself somewhere in code. But since you cannot write code that has no knowledge of the structure of the data it handles, that schema will be there somewhere, and if the code changes the schema generally changes.</div>
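As a sketch of what that looks like (the "ticket" document and its required fields are invented for illustration), here is the kind of schema-in-code that appears when the database is not managing the schema for you:

```python
# With a schemaless store the schema does not disappear; it moves into code.
# These required fields for a "ticket" document are invented for illustration.
REQUIRED_FIELDS = {"title": str, "status": str, "opened": str}

def validate_ticket(doc):
    # Return a list of schema violations; an empty list means the document is valid.
    errors = []
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in doc:
            errors.append(f"missing field: {field}")
        elif not isinstance(doc[field], ftype):
            errors.append(f"wrong type for field: {field}")
    return errors

print(validate_ticket({"title": "Printer on fire", "status": "open", "opened": "2013-01-16"}))  # []
```

Change the code that writes tickets and you have changed this schema, whether or not anything called a "schema" was touched.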
<h2>
<br /></h2>
<h2>
Does It Need To Be Said?</h2>
<div>
<br /></div>
<div>
Sometimes you will find yourself in a situation where people do not know this. You will notice something is wrong; the first symptom is that the conversation does not appear to be proceeding to a conclusion. Even worse, people do not seem to know what conclusion they are even seeking. They do not know that they are trying to work out the schema, so they wander about the requirements trying to make sense of them.</div>
<div>
<br /></div>
<div>
Order and progress can be restored when somebody ties the efforts down to the discovery and detailing of the schema. The question is usually, "What data points are we handling and what are we doing to them?"</div>
</div>
KenDownshttp://www.blogger.com/profile/11117175783163937575noreply@blogger.com473tag:blogger.com,1999:blog-426922399870577072.post-82510342682907516132013-01-15T21:06:00.001-05:002013-01-15T21:07:29.365-05:00Code Today's Requirements Today<div dir="ltr" style="text-align: left;" trbidi="on">
In Part 1 of this series, <a href="http://database-programmer.blogspot.com/2012/06/do-you-know-what-day-it-is.html">Do You Know What Day It Is?</a> (written a mere 6 months ago, I really ought to update this blog more often), we looked at Ken's First Law of Architecture:<br />
<br />
<span style="background-color: lime; border: 1px solid black; padding: 10px;">Today's Constant is Tomorrow's Variable</span><br />
<br />
<br />
If you do not know this, there are two mistakes you can make.<br />
<br />
<h2 style="text-align: left;">
Mistake 1: You Ain't Gonna Need It</h2>
<br />
Mistake 1 is creating a variable, option, switch, parameter, or other control for something which as far as we know <i>for today is a constant.</i> You can avoid this mistake if you remember that <a href="http://en.wikipedia.org/wiki/You_ain't_gonna_need_it">You Ain't Gonna Need It</a>. <br />
<br />
It is this simple: you can invent an infinite number of variables that the user may wish to control in the future, but the chance of guessing which ones they will actually want to control is near zero. It is a complete waste of time. <br />
<br />
Let's take an example. Imagine you are asked to do a task. Any task. It does not even have to be programming. The task will be loaded with points where programmers <i>invent decisions that nobody asked them to make</i>. For example, consider this blog post.<br />
<br />
1) Should I have used a different background color up there in that green box?<br />
2) Should I have used a sub-section heading by this point, maybe I should go back and do that?<br />
3) Should this list be numbered or maybe just bullets instead?<br />
<br />
Notice that this list is kind of arbitrary. I did not think of it very hard, I just made up those questions. Another person might have made up different questions. The problem is that the list never ends.<br />
<br />
This can never be eradicated; it is fundamental to the human condition. Well-meaning individuals will raise irrelevancies in an effort to be helpful, and the rules of polite society work against weeding these out. Committees will put two heads on a horse and you will end up with a bunch of YAGNI options. But we can still fight against them.<br />
<br />
<h2 style="text-align: left;">
Mistake 2: Constants in Code - Nobody Does That!</h2>
<br />
It is possible to go the other way, which is to embed constants into your code so that when you need to change them you find it is expensive, difficult and error prone. Most of us learn not to do this so early that it seems unthinkable to us. Why even write a blog post about it in 2013?<br />
<br />
Well the very good reason to write about it is that we all still embed "constants in code" in ways that we never think about.<br />
<br />
For example, if you are using a relational database, then the structure of your table is a "constant in code", it is something that is woven through every line. Your code "knows" which tables are in there and what columns they have. When you change the structure you are changing something very much "constant" in your system. This is why it is so important to use a lot of library code that itself reads out of data dictionaries, so that you have removed this particular "constant in code." <br />
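A minimal sketch of that kind of library code, using Python's built-in sqlite3 and a hypothetical customers table: the column list comes out of the data dictionary at runtime rather than being woven through the code.

```python
import sqlite3

# Instead of hardcoding column names, ask the database's own data dictionary.
# The 'customers' table here is a stand-in for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, state TEXT)")

def columns_of(conn, table):
    # PRAGMA table_info is SQLite's dictionary; row[1] is the column name.
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]

print(columns_of(conn, "customers"))  # ['id', 'name', 'state']
```

Code written against `columns_of` keeps working when the table gains a column; code with the column list spelled out does not.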
<br />
The great problem here is that we cannot see the future, and you never think that a basic feature of your system is an embedded constant in code. In a simpler period in my life a system lived on a server. One server, that was it, who ever heard of a server farm? That was a constant: number of servers=1, and all of my code assumed it would always be true. I had to weed out that constant from my code.<br />
<br />
Those who understand this basic inability to see the future are most at risk of slipping into over-generalization and generating YAGNI features. It is an attempt to fight the very universe itself, and you cannot win. <br />
<br />
<h2 style="text-align: left;">
Keep it Simple, Keep it Clean</h2>
<div>
<br />
In the end you can only write clean code, architect clean systems, follow the old knowledge and be the wise one who learns to expect the unexpected.</div>
</div>
KenDownshttp://www.blogger.com/profile/11117175783163937575noreply@blogger.com110tag:blogger.com,1999:blog-426922399870577072.post-34476369415570888042012-06-30T08:51:00.000-04:002012-06-30T13:57:00.064-04:00Do You Know What Day It Is?<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: left;">
Have you ever sat in on a meeting like this?</div>
<div>
<br />
<b>Person 1</b>: The customer wants a green background on this page.<br />
<br />
<b>Person 2</b>: Do we know they're going to stick with green? Maybe we need a configuration option for background color.<br />
<br />
<i>Nobody wants to disagree and the manager never comes to these boring design meetings so there is quiet before somebody says...</i><br />
<br />
<b>Person 3</b>: A configuration option just for background color is kind of weak, we'd be better off allowing choices for all colors.<br />
<br />
<b>Person 4: </b>Is this an instance level configuration, or do we let the end-user pick her color?<br />
<br />
<b>Person 2 (the original troublemaker)</b>: well it really has to be instance level, but the end-user can override, that way we satisfy all possibilities.<br />
<br />
From here the debate continues until they've decided to create a skinning system with lots of UI color pickers and other stuff, and they wrap up with:<br />
<br />
<b>Person 6: </b>Well this sounds great, but what should we make the default background color?<br />
<br />
...and nobody can remember that:<br />
<br />
<div style="background-color: yellow; border: 2px solid black; color: red; margin: 10px; padding: 10px;">
The customer specifically asked for green, only green, and nothing but green.</div>
<br />
Is this an exaggeration? Perhaps, but I've sat in meetings that come close. Seems to me that I've sat in meetings that were worse, but I'd hate to slander anybody, and being as I'm not perfect myself, we'll leave that one alone for now.<br />
<br />
So anyway, there were many things going wrong in that meeting, but we're going to stick with the simple fact that nobody there knew this one simple rule, <i>the first rule of system design</i>, which should be burned into your brain, which you should repeat before and during any design or architecture meeting:</div>
<div>
<br /></div>
<div style="background-color: #99ffcc; border: 1px solid black; margin: 10px; padding: 10px;">
<b>Today's Constant is Tomorrow's Variable.</b></div>
<br />
This is not so much a rule as an observation, but when you realize that it is true on almost any level, it can become a guiding rule, something that actually lets you make decisions with confidence, decisions that turn out to work well.<br />
<br />
The problem is that most people in technology don't know what day it is. They make one of two mistakes, they either:<br />
<ol style="text-align: left;">
<li>constantly plan, estimate, design, or program for <i>tomorrow's variable</i> when all they need to do is handle the simpler case of <i>today's constant</i>.</li>
<li>or they don't realize that when "the customer changed the requirements!!!" the customer is doing what everybody does, taking something simple from yesterday and making it a bit more complex today.</li>
</ol>
So you can turn this around and say, "Everything I do today will get changed tomorrow. It will become more complicated. Everything I think is fixed or constant will float or become a variable."<br />
<h3 style="text-align: left;">
</h3>
<h3 style="text-align: left;">
<br /></h3>
<h3 style="text-align: left;">
What It Means</h3>
<div>
We've all learned (hopefully) that you don't embed constants in code because it makes the code difficult to change. We define the constants in headers, or we accept all values from outside sources as parameters. This makes the code more flexible, and it is a good thing. The code is more <i style="font-weight: bold;">robust</i>, it can handle more cases without breaking or needing alteration.<br />
<br />
The trick to using <i>Ken's first and only rule of system design</i>, "Today's Constant is Tomorrow's Variable" is to recognize the many forms of "constants" that we build into our systems in hard-to-reach places, making it hard to change them when tomorrow comes and they are no longer fixed and constant. <br />
<br />
Another trick to using the rule is to <i>always know what day it is.</i> Most of what I do today will involve features that may never change. It is extremely easy to see how they might change, but impossible to know for sure. So we leave them constant for today.<br />
<h3 style="text-align: left;">
</h3>
<h3 style="text-align: left;">
<br /></h3>
<h3 style="text-align: left;">
Back To That Meeting</h3>
</div>
<div>
Let's go back to that meeting we started with. Here is how it goes when somebody knows that "Today's Constant is Tomorrow's Variable."<br />
<br />
<br />
<b>Person 1:</b> The customer wants a green background on this page.<br />
<br />
<b>Person 2:</b> Do we know they're going to stick with green? Maybe we need a configuration option for background color.<br />
<br />
<b>Our Hero:</b> We don't know they are going to stick with green because we never know if any customer is ever going to stick with anything, and we all know that <i>today's constant is tomorrow's variable. </i>However, I hope we've put the styles into a style sheet and not embedded them directly into the HTML so we can change it later if we have to, right?<br />
<br />
<br />
Somebody mumbles that yes in fact we do use style sheets, and the meeting moves on.<br />
<h3 style="text-align: left;">
</h3>
<h3 style="text-align: left;">
<br /></h3>
<h3 style="text-align: left;">
Is That a Lame Example?</h3>
<div>
That's a pretty lame example really, who doesn't use style sheets? And who embeds constants in their code?</div>
<div>
<br /></div>
<div>
Well kiddies it turns out that we didn't always use style sheets. When I got interested in web pages CSS was still optional (Yes! Believe it!) and you put your style information directly into tags, which was basically <i>embedding constants into code, </i>and it didn't take long before you intuitively realized this was not right, and you discovered CSS. </div>
<div>
<br /></div>
<div>
It is amazing how often that basic pattern repeats itself, trying to identify what you thought was a constant, realizing it is "buried in code", and turning it into a variable.</div>
<div>
<br /></div>
<h3 style="text-align: left;">
</h3>
<h3 style="text-align: left;">
Today's Post Is Tomorrow's Promise</h3>
<div>
<br />
I'll be posting a lot more on this subject, using a very reliable and strict schedule based on a host of variables that mostly comes down to whenever-the-hell-I-feel-like-it(TM). </div>
<div>
<br /></div>
<div>
Cheers.</div>
<br /></div>
</div>KenDownshttp://www.blogger.com/profile/11117175783163937575noreply@blogger.com320tag:blogger.com,1999:blog-426922399870577072.post-37120618534786205642011-01-21T22:21:00.001-05:002011-01-21T22:21:39.967-05:00Maintaining One Code Base with Possibly Conflicting Custom Features<p>Today's essay deals with the tricky issue of custom features
for individual customers who are running instances of your
software.
</p>
<p>The question comes by way of a regular reader who prefers to
remain anonymous, but asks this:
</p>
<blockquote>
<p>... I work on a large (to me, anyway) application that serves as a client database, ticket system, time-tracking, billing, asset-tracking system. We have some customers using their own instances of the software. Often, those customers want additional fields put in different places (e.g., a priority column on tickets). This results in having multiple branches to account for versions with slight changes in code and in the database. This makes things painful and time-consuming in the long run: applying commits from master to the other branches requires testing on every branch; same with database migrate scripts, which frequently have to be modified.
</p>
<p>
Is there an easier way? I have thought about the possibility of making things "optional" in the database, such as a column on a table, and hiding its existence in the code when it's not "enabled." This would have the benefit of a single code set and a single database schema, but I think it might lead to more dependence on the code and less on the database -- for example, it might mean constraints and keys couldn't be used in certain cases.
</p>
</blockquote>
<h2>Restating the Question</h2>
<p>Our reader asks, is it better to have different code branches
or to try to keep a lot of potentially conflicting and optional
items mixed in together?
</p>
<p>Well, the wisdom of the ages is to maintain a single code branch,
including the database schema. I tried exactly once, very early
in my career, to fork my own code, and gave up almost within days.
When I went to work in larger shops I always arrived in a situation
where the decision had already been made to maintain a single
branch. Funny thing, since most programmers cannot agree on the
color of the sky when they're staring out the window, this is
the only decision I have ever seen maintained with absolute
unanimity no matter how many difficulties came out of it.
</p>
<p>There is some simple arithmetic as to why this is so. If you have
a single feature for a customer that is giving you a headache, and
you fork the code, you now have to update both code branches for
every change plus regression test them both, including the feature
that caused the headache. But if you keep them combined you only
have the one headache feature to deal with. That's why people
keep them together.
</p>
<h2>Two Steps</h2>
<p>Making custom features work smoothly is a two-step process.
The first step is arguably more difficult than the second,
but the second step is absolutely crucial if you have
business logic tied to the feature.
</p>
<p>Most programmers when confronted with this situation
will attempt to make various features optional. I
consider this to be a mistake because it complicates
code, especially when we get to step 2. By far the
better solution is to make features <i>ignorable</i>
by anybody who does not want them.
</p>
<p>The wonderful thing about ignorable features is
they tend to eliminate the problems with apparently
conflicting features. If you can rig the features
so anybody can use either or both, you've eliminated
the conflict.
</p>
<h2>Step 1: The Schema</h2>
<p>As mentioned above, the first step is arguably more
difficult than the second, because it may involve
casting requirements differently than they are
presented.
</p>
<p>For example,
our reader asks about a priority column on tickets,
asked for by only one customer. This may seem like
a conflict because nobody else wants it, but we
can dissolve the conflict when we make the feature
ignorable. The first step involves doing this at
the database or schema level.
</p>
<p>But first we should mention that the UI is easy:
we might have a control panel
where we can make fields invisible. Or maybe our
users just ignore the fields they are not interested
in. Either way works.
</p>
<p>The problem is in the database.
If the values for priority come
from a lookup table, which they should,
then we have a foreign key, and
we have a problem if we try to ignore it:
</p>
<ul><li>We can allow nulls in the foreign key, which is
fine for the people ignoring it, but
<li>This means the people who require it can end
up with tickets that have no priority because it does
not prevent a user from leaving it blank.
</ul>
<p>A simple answer here is to pre-populate your priority
lookup table with a value of "Not applicable", perhaps
with a hardcoded id of zero. Then we set the default
value for the TICKET.priority to zero. This means people
can safely ignore it because it will always be valid.
</p>
<p>Then, for the customer who paid for it, we just go in
after the install and delete the default entry. It's
a one-time operation, not even worth writing a script
for, and it forces them to create a set of priorities
before using the system. Further, by leaving the
default of zero in there, it forces valid answers
because users will be dinged with an FK violation if
they do not provide a real priority.
</p>
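Here is a sketch of that schema (expressed through Python's sqlite3; all table and column names are illustrative):

```python
import sqlite3

# Sketch of the ignorable-feature schema: a priorities lookup pre-populated
# with a 'Not applicable' row at id 0, and tickets defaulting to it.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE priorities (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
INSERT INTO priorities VALUES (0, 'Not applicable');
CREATE TABLE tickets (
    id       INTEGER PRIMARY KEY,
    title    TEXT NOT NULL,
    priority INTEGER NOT NULL DEFAULT 0 REFERENCES priorities(id)
);
""")

# A customer ignoring the feature never mentions priority; the default keeps the FK valid.
conn.execute("INSERT INTO tickets (title) VALUES ('Printer on fire')")

# For the customer who paid for priorities, delete the 'Not applicable' row
# after install; from then on every ticket must carry a real priority.
print(conn.execute("SELECT priority FROM tickets").fetchone()[0])  # 0
```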
<p>For this particular example, there is no step 2, because
the problem is completely solved at the schema level.
To see how to work with step 2, I will make up an
example of my own.
</p>
<h2>Step 2: Unconditional Business Logic</h2>
<p>To illustrate step 2, I'm going to make up an
example that is not really appropriate to our
reader's question, frankly because I cannot think
of one for that situation.
</p>
<p>Let's say we have an eCommerce system, and one
of our sites wants customer-level discounts based
on customer groups, while another wants discounts
based on volume of order -- the more you buy, the
deeper the discount. At this point most programmers
start shouting in the meeting, "We'll make them
optional!" Big mistake, because it makes for lots
of work. Instead we will make them ignorable.
</p>
<p>Step 1 is to make ignorable features in the schema.
Our common code base contains a table of customer
groups with a discount percent, and in the customers
table we make a nullable foreign key to the customer
groups table. If anybody wants to use it, great, and
if they want to ignore it, that's also fine. We do
the same thing with a table of discount amounts,
we make an empty table that lists threshold amounts
and discount percents. If anybody wants to use it
they fill it in, everybody else leaves it blank.
</p>
<p>Now for the business logic, the calculations of
these two discounts. The crucial idea here is
<i>not to make up conditional logic that tries to
figure out whether or not to apply the discounts.</i>
It is vastly easier to <i>always apply both
discounts, with the discounts coming out zero for
those users who have ignored the features.</i>
</p>
<p>So for the customer discount, if the customer's
entry for customer group is null, it will not match
to any discount, and you treat this as zero.
Same for the sale amount discount, the lookup to
see which sale amount they qualify for doesn't find
anything because the table is empty, so it treats
it as zero.
</p>
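A sketch of that unconditional logic in Python (all names and rates are invented): both discounts always run, and each naturally comes out zero for a site that ignores its feature.

```python
# Both discounts always run; each comes out zero when its feature is ignored.
def group_discount(customer_group, group_rates):
    # A null group, or an empty rate table, simply matches nothing: zero.
    return group_rates.get(customer_group, 0.0)

def volume_discount(amount, thresholds):
    # thresholds is a list of (minimum_amount, rate); an empty list means zero.
    rate = 0.0
    for minimum, r in sorted(thresholds):
        if amount >= minimum:
            rate = r
    return rate

def price(amount, customer_group=None, group_rates=None, thresholds=None):
    # No conditionals asking "is this feature enabled?" -- we always apply both.
    g = group_discount(customer_group, group_rates or {})
    v = volume_discount(amount, thresholds or [])
    return amount * (1 - g) * (1 - v)

print(price(100.0))                          # site ignoring both features: 100.0
print(price(100.0, "gold", {"gold": 0.10}))  # group discount only: 90.0
```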
<p>So the real trick at the business logic level is
not to figure out which feature to use, which leads
to complicated conditionals that always end up
conflicting with each other, but to <i>always use
all features and code them so they have no effect
when they are being ignored.</i>
</p>
<h2>Conclusion</h2>
<p>Once upon a time almost everybody coding for a living
dealt with these situations -- we all wrote code that
was going to ship off to live at our customer's site.
Nowadays this is less common, but for those of us
dealing with it it is a big deal.
</p>
<p>The wisdom of the ages is to maintain a common code
base. The method suggested here takes that idea
to its most complete implementation, a totally common
code base in which all features are active all of
the time, with no conditionals or optional features
(except perhaps in the UI and on printed reports),
and with schema and business logic set up so that
features that are being ignored simply have no
effect on the user.
</p>KenDownshttp://www.blogger.com/profile/11117175783163937575noreply@blogger.com72tag:blogger.com,1999:blog-426922399870577072.post-8292279301174790162011-01-06T20:21:00.005-05:002011-01-06T23:41:42.192-05:00Can You Really Create A Business Logic Layer?<p>The past three posts of this little mini-series
have gone from a <a href="http://database-programmer.blogspot.com/2010/12/working-definition-of-business-logic.html"
>Working definition of business logic</a>
to a <a href="http://database-programmer.blogspot.com/2011/01/business-logic-from-working-definition.html"
>Rigorous definition of business logic</a>
and on to some <a href="http://database-programmer.blogspot.com/2011/01/theorems-regarding-business-logic.html"
>theorems about business logic</a>.
To wrap things up, I'd like to ask the question,
is it possible to isolate business logic into
a single tier?
</p>
<h2>Related Reading</h2>
<p>There are plenty of opinions out there.
For a pretty thorough explanation of how to put
everything into the DBMS, check out
<a href="http://thehelsinkideclaration.blogspot.com/2009/04/helsinki-code-layers-in-dbms.html"
>Toon Koppelaars' description</a>. Mr.
Koppelaars has some good material, but you do
need to read through his earlier posts to get
the definitions of some of his terms. You can also
follow his links through to some high quality
discussions elsewhere.
</p>
<p>Contrasting Mr. Koppelaars' opinion is a piece
which does not have nearly the same impact, IMHO,
because in
<a href="http://www.codeproject.com/KB/architecture/DudeWheresMyBusinessLogic.aspx"
>Dude, Where's My Business Logic?</a> we get some solid
history mixed with normative assertions based on
either anecdote or nothing at all. I'm a big believer
in anecdote, but when I read a sentence that
says, "The database should not have any knowledge of what a customer is, but only of the elements that are used to store a customer." then
I figure I'm dealing with somebody who needs to see
a bit more of the world.
</p>
<h2>Starting At the Top: The User Interface</h2>
<p>First, let's review that our rigorous definition of business logic
includes schema (types and constraints),
derived values (timestamps, userstamps, calculations,
histories), non-algorithmic compound operations
(like batch billing) and algorithmic compound
operations, those that require looping in their
code. This encompasses everything we might do
from the simplest passive things like a constraint
that prevents discounts from being over 100% to
the most complex hours-long business process,
along with everything in between accounted for.
</p>
<p>Now I want to start out by using that definition
to see a little bit about what is going on in
the User Interface. This is not the <i>presentation</i>
layer as it is often called but the <i>interaction</i>
layer and even the <i>command</i> layer.
</p>
<p>Consider an admin interface to
a database, where the user is entering or modifying
prices for the price list. Now, if the user could
enter "Kim Stanley Robinson" as the price, that would be
kind of silly, so of course the numeric inputs
only allow numeric values. Same goes for dates.
</p>
<p>So the foundation of usability for a UI
is at very least
knowledge of <i>and enforcement of</i> types in
the UI layer. Don't be scared off that I am
claiming the UI is enforcing anything, we'll
get to that a little lower down.
</p>
<p>Now consider the case where the user is
typing in a discount rate for this or that,
and a discount is not allowed to be over 100%.
The UI really ought to enforce this,
otherwise the user's time is wasted when she
enters an invalid value, finishes the entire form,
and only then gets an error when she tries to
save. In the database world we call this
a constraint, so the UI needs to know about
constraints to better serve the user.
</p>
<p>Now this same user is typing a form where there
is an entry for US State. The allowed values are
in a table in the database, and it would be nice
if the user had a drop-down list, and one that
was auto-suggesting as the user typed. Of course
the easiest way to do something like this is just
make sure the UI form "knows" that this field is
a foreign key to the STATES table, so it can generate
the list using some generic library function that
grabs a couple of columns out of the STATES
table. Of course, this kind of lookup thing will
be happening all over the place, so it would work
well if the UI knew about <i>and enforced</i> foreign
keys during entry.
</p>
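A sketch of that generic library function (through Python's sqlite3, with an assumed STATES table): once the UI knows a field is a foreign key, one routine can populate every dropdown.

```python
import sqlite3

# One generic routine serves every foreign-key dropdown in the UI.
# The states table and its columns are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE states (code TEXT PRIMARY KEY, name TEXT NOT NULL);
INSERT INTO states VALUES ('NY', 'New York'), ('PA', 'Pennsylvania');
""")

def dropdown_choices(conn, table, key_col, label_col):
    # The UI never hardcodes the list; it reads the lookup table it was told about.
    sql = f"SELECT {key_col}, {label_col} FROM {table} ORDER BY {label_col}"
    return list(conn.execute(sql))

print(dropdown_choices(conn, "states", "code", "name"))
```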
<p>And I suppose the user might at some point be
entering a purchase order. The purchase order is
automatically stamped with today's date. The
user might see it, but not be able to change it,
so now our UI knows about system-generated values.
</p>
<p>Is this user allowed to delete a customer?
If not, the button should either be grayed out or not
be there at all. The UI needs to know about
<i>and enforce</i> some security.
</p>
<h2>More About Knowing and Enforcing</h2>
<p>So in fact the UI layer not only knows the logic
but is enforcing it. It is enforcing it for
two reasons, to improve the user experience with
date pickers, lists, and so forth, and to prevent the user
from entering invalid data and wasting round trips.
</p>
<p>And yet, because we cannot trust what comes in
to the web server over the wire, we have to
<i>enforce every single rule a second time when
we commit the data.</i>
</p>
<p>You usually do not hear people say that the UI
enforces business logic. They usually say the
opposite. But the UI does enforce business logic.
The problem is, everything the UI enforces has
to be enforced again. That may be why we often
overlook the fact that it is doing so.
</p>
<h2>The Application and The Database</h2>
<p>Now let's go through the stuff the UI is
enforcing, and see
what happens in the application and the database.
</p>
<p>With respect to <b>type</b>, a strongly typed language
will throw an error if the type is wrong, and a weakly
typed language is wise to put in a type check anyway.
The DBMS is going to allow only correctly typed
values, so, including the UI,
<i>type is enforced three times</i>.
</p>
<p>With respect to <b>lookups</b> like US state, in
a SQL database we always let the server do that
with a foreign key, if we know what is good for
us. That makes double enforcement for lookups.
</p>
<p>So we can see where this is going. As we look at
constraints and security and anything else that
must be right, we find it will be enforced at least
twice, and as much as three times.
</p>
<h2>You Cannot Isolate What Must be Duplicated</h2>
<p>By defining First Order Business Logic, the simplest
foundation layer, as including things like types
and keys and constraints, we find that the enforcement
of this First Order stuff is done 2 or 3 times, but
never only once.
</p>
<p>This more or less leaves in tatters the idea of a
"Business Logic Layer" that is in any way capable of
handling all business logic all by its lonesome.
The UI layer is completely useless unless it is
also enforcing as much logic as possible, and
even when we leave the Database Server as the
final enforcer of First Order Business Logic
(types, constraints, keys), it is still often good
engineering to do some checks to prevent
expensive wasted trips to the server.
</p>
<p>So we are wasting time if we sit around trying to figure
out how to get the Business Logic
"where it belongs", because it "belongs" in at
least two places and sometimes three. Herding
the cats into a single pen is a fool's errand, it
is at once unnecessary, undesirable, and impossible.
</p>
<p><b>Update: Regular reader Dean Thrasher of Infovark summarizes
most of what I'm saying here using an apt industry
standard term: Business Logic is a <i>cross-cutting concern</i>.
</b></p>
<h2>Some Real Questions</h2>
<p>Only when we have smashed the concept that Business
Logic can exist in serene isolation in its own layer
can we start to ask the questions that would actually
speed up development and make for better engineering.
</p>
<p>Freed of the illusion of a separate layer, when we
look at the higher Third and Fourth Order Business
Logic, which always require coding, we can decide where
they go based either on <a href="http://database-programmer.blogspot.com/2010/12/critical-analysis-of-algorithm-sproc.html"
>engineering</a> or the availability of qualified
programmers in particular technologies,
but we should not make
the mistake of believing they are going where they
go because the gods would have it so.
</p>
<p>But the real pressing question if we are seeking
to create efficient manageable large systems is
this: how do we distribute
the same business logic into 2 or 3 (or more)
different places so that it is enforced
consistently everywhere? Because a smaller code
base is always easier to manage than a large one,
and because configuration is always easier than
coding, this comes down to meta-data, or if you
prefer, a data dictionary. That's the trick that
always worked for me.
</p>
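<p>As a purely illustrative sketch of that data dictionary idea (the dictionary format here is invented, not from any real tool), the point is only that a single entry can drive enforcement in more than one tier:</p>

```python
# Hypothetical dictionary format: one entry per column, one rule definition.
# The same entry is rendered for the database tier AND the application tier,
# so the rule is enforced twice but only ever written once.

DICTIONARY = {
    "discount": {"type": "numeric", "min": 0.0, "max": 1.0},
    "hours":    {"type": "numeric", "min": 0.0, "max": 24.0},
}

def ddl_check(column):
    """Render the rule as a SQL CHECK constraint for the database tier."""
    rule = DICTIONARY[column]
    return f"CHECK ({column} >= {rule['min']} AND {column} <= {rule['max']})"

def validate(column, value):
    """Enforce the same rule in the app tier, saving a wasted server trip."""
    rule = DICTIONARY[column]
    return rule["min"] <= value <= rule["max"]

print(ddl_check("discount"))   # CHECK (discount >= 0.0 AND discount <= 1.0)
print(validate("hours", 25))   # False
```

<p>The code base stays small because neither tier's enforcement is hand-written; both are generated from the one dictionary entry.</p>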
<h2>Is This Only A Matter of Definitions?</h2>
<p>Anybody who disagrees with the thesis here has
only to say, "Ken, those things are not business
logic just because you wrote a blog that says they
are. In my world business logic is about <b>code</b>
baby!" Well sure, have it your way.
After all, the nice thing about definitions is that we
can all pick the ones we like.
</p>
<p>But these definitions, the theorems I derived on
Tuesday, and the multiple-enforcement thesis presented
here today should make sense to anybody struggling
with where to put the business logic. That struggle
and its frustrations come from the mistake of
<i>imposing abstract
conceptual responsibilities</i> on each tier instead
of <i>using the tiers as each is able to get the
job done.</i> Databases are wonderful for type,
entity integrity (uniqueness), referential integrity,
ACID compliance, and many other things. Use them!
Code is often better when the problem at hand cannot
be solved with a combination of keys and constraints
(Fourth Order Business Logic), but even that code can
be put into the DB or in the application.
</p>
<p>So beware of paradigms that assign responsibility
without compromise to this or that tier. It cannot
be done. Don't be afraid to use code for doing things
that require structured imperative step-wise operations,
and don't be afraid to use the database for what it is
good for, and leave the arguments about "where everything
belongs" to those with too much time on their hands.
</p>KenDownshttp://www.blogger.com/profile/11117175783163937575noreply@blogger.com133tag:blogger.com,1999:blog-426922399870577072.post-71466482527609642122011-01-04T17:33:00.001-05:002011-01-04T17:33:38.639-05:00Theorems Regarding Business Logic<p>In yesterday's <a href="http://database-programmer.blogspot.com/2011/01/business-logic-from-working-definition.html"
>Rigorous Definition of Business Logic</a>, we saw that
business logic can be defined in four orders:
</p>
<ul><li>First Order Business Logic is entities and
attributes that users (or other agents) can save,
and the security rules that govern read/write
access to the entities and attributes.
<li>Second Order Business Logic is entities
and attributes derived by rules and formulas,
such as calculated values and history tables.
<li>Third Order Business Logic is non-algorithmic
compound operations (no structure or looping is
required in expressing the solution), such as
a month-end batch billing or, for the old-timers
out there, a year-end general ledger
roll-up.
<li>Fourth Order Business Logic is algorithmic
compound operations. These occur when the action
of one step affects the input to future steps.
One example is ERP Allocation.
</ul>
<h2>A Case Study</h2>
<p>The best way to see if these have any value is to
cook up some theorems and examine them with an
example. We will take
a vastly simplified time billing system, in which
employees enter time which is billed once/month to
customers. We'll work out some details a little below.
</p>
<h2>Theorem 1: 1st and 2nd Order, Analysis</h2>
<p>The first theorem we can derive from these definitions
is that we should look at First and Second Order Schemas
together during analysis. This is because:
</p>
<ul><li>First Order Business Logic is about entities and attributes
<li>Second Order Business Logic is about entities and attributes
<li>Second Order Business Logic is about values
generated from First Order values and, possibly,
other Second Order values
<li>Therefore, Second Order values are always
expressed ultimately in terms of First Order
values
<li>Therefore, they should be analyzed together
</ul>
<p>To give the devil his due, ORM does this easily, because
it ignores so much database theory (paying a large price
in performance for doing so) and
considers an entire row, with its first order and
second order values together, as being part of one class.
This is likely the foundation for the claims of ORM
users that they experience productivity gains when
using ORM. Since I usually do nothing but bash ORM,
I hope this statement will be taken as utterly sincere.
</p>
<p>Going the other way, database theorists and evangelists
who adhere to full normalization can hobble an
analysis effort by refusing to consider
2nd order because those values <i>denormalize</i> the database,
so sometimes the worst of my own crowd will prevent
analysis by trying to keep these out of the conversation.
So, assuming I have not pissed off my own friends,
let's keep going.
</p>
<p>So let's look at our case study of the time billing
system. By theorem 1, our analysis of entities and
attributes should include both 1st and 2nd order
schema, something like this:
</p>
<pre>
INVOICES
-----------
invoiceid 2nd Order, a generated unique value
date 2nd Order if it always takes the date of the batch run
customer 2nd Order, a consequence of this being an
aggregation of INVOICE_LINES
total_amount 2nd Order, a sum from INVOICE_LINES
INVOICE_LINES
---------------
invoiceid 2nd order, copied from INVOICES
customer +- All three are 2nd order, a consequence
employee | of this being an aggregation of
activity +- employee time entries
rate 2nd order, taken from ACTIVITIES table
(not depicted)
hours 2nd order, summed from time entries
amount 2nd order, rate * hours
TIME_ENTRIES
--------------
employeeid 2nd order, assuming system forces this
value to be the employee making
the entry
date 1st order, entered by employee
customer 1st order, entered by employee
activity 1st order, entered by employee
hours 1st order, entered by employee
</pre>
<p>Now, considering how much of that is 2nd order, which
is almost all of it, the theorem is not only supported
by the definition, but ought to line up squarely
with our experience. Who would want to try to analyze
this and claim that all the 2nd order stuff should
not be there?
</p>
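<p>To make the 1st/2nd order split concrete, here is a small sketch (in Python, with made-up data; RATES stands in for the ACTIVITIES table) that derives the INVOICE_LINES values from nothing but the 1st order time entries:</p>

```python
from collections import defaultdict

# Hypothetical data. The only First Order values are the time entries the
# employees typed in -- every INVOICE_LINES value below is Second Order,
# derived by rule and formula from those entries.
RATES = {"consulting": 100, "training": 80}

time_entries = [  # 1st order: entered by employees
    {"employee": "kim", "customer": "acme", "activity": "consulting", "hours": 3},
    {"employee": "kim", "customer": "acme", "activity": "consulting", "hours": 2},
    {"employee": "lee", "customer": "acme", "activity": "training",   "hours": 4},
]

hours_by_line = defaultdict(int)  # one line per customer/employee/activity
for t in time_entries:
    hours_by_line[(t["customer"], t["employee"], t["activity"])] += t["hours"]

invoice_lines = [  # 2nd order: every field generated, none user-supplied
    {"customer": c, "employee": e, "activity": a,
     "rate": RATES[a], "hours": h, "amount": RATES[a] * h}
    for (c, e, a), h in hours_by_line.items()
]
print(invoice_lines[0]["amount"])  # 500
```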
<h2>Theorem 2: 1st and 2nd Order, Implementation</h2>
<p>The second theorem we can derive from these definitions
is that First and Second Order Business logic require
separate implementation techniques. This is because:
</p>
<ul><li>First Order Business Logic is about user-supplied values
<li>Second Order Business Logic is about generated values
<li>Therefore, unlike things cannot be implemented with
like tools.
</ul>
<p>Going back to the time entry example, let's zoom in on
the lowest table, the TIME_ENTRIES. The employee
entering her time must supply customer, date, activity, and
hours, while the system forces the value of employeeid.
This means that customer and activity must be validated
in their respective tables, and hours must be checked
for something like &lt;= 24. But for employeeid the
system provides the value out of its context.
So the two kinds of values are processed in very
unlike ways. It seems reasonable that our code would
be simpler if it did not try to force both kinds of
values down the same validation pipe.
</p>
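<p>A minimal sketch of those two unlike pipes (names and lookup sets are invented for the example): the user-supplied values go through validation, while employeeid is simply forced from the session context and never validated against user input at all.</p>

```python
# Sketch of Theorem 2: 1st order values flow through a validation pipe,
# while the system-supplied employeeid is forced from context -- two
# unlike mechanisms, deliberately not unified.
VALID_CUSTOMERS = {"acme", "globex"}          # stands in for a lookup table
VALID_ACTIVITIES = {"consulting", "training"}

def save_time_entry(session_employee, customer, date, activity, hours):
    # 1st order: validate what the user typed
    if customer not in VALID_CUSTOMERS:
        raise ValueError("unknown customer")    # referential integrity
    if activity not in VALID_ACTIVITIES:
        raise ValueError("unknown activity")
    if not 0 < hours <= 24:
        raise ValueError("hours out of range")  # constraint
    # system-supplied: the user never provides this value
    return {"employeeid": session_employee, "customer": customer,
            "date": date, "activity": activity, "hours": hours}
```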
<h2>Theorem 3: 2nd and 3rd Order, Conservation of Action</h2>
<p>This theorem states that
the sum of Second and Third Order
Business Logic is fixed:
</p>
<ul><li>Second Order Business Logic is about generating
entities and attributes by rules or formulas
<li>Third Order Business Logic is coded
compound creation of entities and attributes
<li>Given that a particular set of requirements
resolves to a finite set of actions that generate
entities and values, then
<li>The sum of Second Order and Third Order Business
Logic is fixed.
</ul>
<p>In plain English, this means that the more Business
Logic you can implement through 2nd Order
<i>declarative</i> rules and formulas, the fewer
processing routines you have to code. Or, if you
prefer, the more processes you code, the fewer
declarative rules about entities and
attributes you will have.
</p>
<p>This theorem may be hard to compare to experience
for verification
because most of us are so used to thinking in
terms of batch billing as a process that we cannot imagine it
being implemented any other way: how exactly am I
supposed to implement batch billing <i>declaratively</i>?
</p>
<p>Let's go back to the schema above, where we can
realize upon examination that the entirety of the batch
billing "process" has been detailed in a 2nd Order
Schema. If we could somehow add these facts to our
CREATE TABLE commands the way we add keys, types,
and constraints, batch billing would occur
without the batch part.
</p>
<p>Consider this. Imagine that a user enters
a TIME_ENTRY. The system
checks for a matching EMPLOYEE/CUSTOMER/ACTIVITY
row in INVOICE_DETAIL, and when it finds the row
it updates the totals. But if it does not find
one then it creates one! Creation
of the INVOICE_DETAIL record causes the system to
check for the existence of an invoice for that
customer, and when it does not find one it creates
it and initializes the totals. Subsequent time entries
not only update the INVOICE_DETAIL rows but the
INVOICE rows as well. If this were happening, there would be no
batch billing at the end of the month because the
invoices would all be sitting there ready to go
when the last time entry was made.
</p>
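<p>The cascade in that paragraph can be sketched in a few lines (this is illustrative pseudocode in Python, not the Triangulum implementation): every time entry immediately creates or updates its detail row, which in turn creates or updates the invoice.</p>

```python
# Table names follow the article; the dicts stand in for database tables.
# Each entry cascades upward, so invoices are always ready -- no batch run.
invoice_detail = {}   # (customer, employee, activity) -> hours
invoices = {}         # customer -> total hours

def enter_time(employee, customer, activity, hours):
    key = (customer, employee, activity)
    if key not in invoice_detail:    # no matching row? create it
        invoice_detail[key] = 0
    invoice_detail[key] += hours
    if customer not in invoices:     # cascade: create the invoice header too
        invoices[customer] = 0
    invoices[customer] += hours      # subsequent entries update the totals

enter_time("kim", "acme", "consulting", 3)
enter_time("kim", "acme", "consulting", 2)
print(invoices["acme"])   # 5 -- the invoice is already sitting there
```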
<p>By the way, I coded something that does this in a
pretty straight-forward way a few years ago, meaning
you could skip the batch billing process and add a few
details to a schema that would cause the database to
behave exactly as described above. Although
the format for specifying these extra features
was easy enough (so it seemed to me as the author),
it seemed the <i>conceptual shift of thinking</i>
that it required of people was far larger than I
initially and naively imagined. Nevertheless,
I toil forward, and that is
the core idea behind my <a href="http://code.google.com/p/triangulum-db/"
>Triangulum</a> project.
</p>
<h2>Observation: There Will Be Code</h2>
<p>This is not so much a theorem as an observation.
This observation is that if your application
requires Fourth Order Business Logic then somebody
is going to code something somewhere.
</p>
<p>An anonymous reader pointed out in the comments
to <a href="http://database-programmer.blogspot.com/2011/01/business-logic-from-working-definition.html"
>Part 2</a> that Oracle's MODEL clause may work
in some cases. I would assume so, but I would also
assume that reality can create complicated Fourth
Order cases faster than SQL can evolve. Maybe.
</p>
<p>But anyway, the real observation here is that
no modern language, either app
level or SQL flavor, can express an algorithm
declaratively. In other words, no combination
of keys, constraints, calculations and derivations,
and no known combination of advanced SQL functions
and clauses
will express an ERP Allocation routine or a
Magazine Regulation routine. So you have to code it.
This may not always be true, but I think it is
true now.
</p>
<p>This is in contrast to the example given in the
previous section about the fixed total of
2nd and 3rd Order Logic. Unlike that example,
you cannot provide enough
2nd order wizardry to eliminate fourth order.
<i style="color:gray">(well ok maybe you can,
but I haven't figured it
out yet myself and have never heard that anybody
else is even trying. The trick would be to have
a table that you truncate and insert a single row
into, a trigger would fire that would know how
to generate the
next INSERT, generating a cascade. Of course, since
this happens in a transaction, if you end up
generating 100,000 inserts this might be a bad
idea ha ha.)</i>
</p>
<h2>Theorem 5: Second Order Tools Reduce Code</h2>
<p>This theorem rests on the acceptance of an observation,
that using meta-data repositories, or data dictionaries,
is easier than coding. If that does not hold true,
then this theorem does not hold true. But if that
observation (my own observation, admittedly) does
hold true, then:
</p>
<ul><li>By Theorem 3, the sum of 2nd and 3rd order
logic is fixed
<li>By observation, using meta-data that manages
schema requires less time than coding,
<li>By Theorem 1, 2nd order is analyzed and specified
as schema
<li>Then it is desirable to specify as much business
logic as possible as 2nd order schema, reducing
and possibly eliminating manual coding of Third
Order programs.
</ul>
<p>Again we go back to the batch billing example.
Is it possible to convert it all to 2nd Order as
described above? Well yes it is, because I've done
it. The trick is an extremely counter-intuitive
modification to a foreign key that causes a
failure to actually generate the parent row that
would let the key succeed. To find out more about
this, check out <a href="http://code.google.com/p/triangulum-db/"
>Triangulum</a> (not ready for prime time as of this
writing).
</p>
<h2>Conclusions</h2>
<p>The major conclusion in all of this is that analysis
and design should begin with First and Second Order
Business Logic, which means working out schemas, both
the user-supplied values and the system-supplied
values.
</p>
<p>When that is done, what we often call "processes"
are layered on top of this.
</p>
<p>Tomorrow we will see part 4 of 4, examining the
business logic layer, asking: is it possible to
create a pure business logic layer that gathers
all business logic unto itself?
</p>KenDownshttp://www.blogger.com/profile/11117175783163937575noreply@blogger.com31tag:blogger.com,1999:blog-426922399870577072.post-77727680063348489532011-01-02T13:05:00.004-05:002011-01-04T17:36:12.995-05:00Business Logic: From Working Definition to Rigorous Definition<p>This is part 2 of a 4 part mini-series that began
before the holidays with <a href="http://database-programmer.blogspot.com/2010/12/working-definition-of-business-logic.html"
>A Working Definition of Business Logic</a>. Today we proceed
to a rigorous definition, tomorrow we will see <a href=
"http://database-programmer.blogspot.com/2011/01/theorems-regarding-business-logic.html">some theorems</a>,
and the series will wrap up with a post on the "business layer."
</p>
<p>In the first post, the working definition said that
business logic includes at least:
<ul><li>The Schema
<li>Calculations
<li>Processes
</ul>
<p>None of these was very rigorously defined, kind of a
"I'll know it when I see it" type of thing, and we did
not talk at all about security. Now the task becomes
tightening this up into a rigorous definition.
</p>
<h2>Similar Reading</h2>
<p>Toon Koppelaars has some excellent material along
these same lines, and a good place to start is his
<a href="http://thehelsinkideclaration.blogspot.com/2009/03/window-on-data-applications.html">Helsinki Declaration (IT Version)</a>.
The articles have a different focus than this series,
so they make great contrasting reading. I consider
my time spent reading through it very well spent.
</p>
<h2>Definitions, Proofs, and Experience</h2>
<p>What I propose below is a definition in four parts.
As definitions, they are not supposed
to prove anything, but they are definitely supposed
to ring true to the experience of any developer
who has created or worked on
a non-trivial business application. This effort
would be a success if we reach some consensus that
"at least it's all in there", even if we go
on to argue bitterly about which components
should be included in which layers.
</p>
<p>Also, while I claim the definitions below are
rigorous, they are not yet <i>formal</i>. My
instinct is that formal definitions can be
developed using <a href="http://en.wikipedia.org/wiki/First-order_logic"
>First Order Logic</a>, which would allow the
theorems we will see tomorrow to move from
"yeah that sounds about right" to being
formally provable.
</p>
<p>As for their practical benefit, inasmuch as
"the truth shall make you free", we ought to be
able to improve our architectures if we can settle
at very least <i>what we are talking about</i>
when we use the vague term "business logic."
</p>
<h2>The Whole Picture</h2>
<p>What we commonly call "business logic", by
which we vaguely mean, "That stuff I have
to code up",
can in fact be rigorously defined
as having four parts, which I believe are
best termed <i>orders</i>, as there is a definite
precedence to their discovery, analysis and implementation.
</p>
<ul><li>First Order: Schema
<li>Second Order: Derivations
<li>Third Order: Non-algorithmic compound operations
<li>Fourth Order: Algorithmic compound operations
</ul>
<p>Now we examine each order in detail.
</p>
<h2>A Word About Schema and NoSQL</h2>
<p>Even "schema-less" databases have a schema, they
simply do not enforce it in the database server.
Consider: an eCommerce site using MongoDB is not
going to be tracking the local zoo's animal
feeding schedule, because that is out of scope.
No, the code
is limited to dealing with orders, order lines,
customers, items and stuff like that.
</p>
<p><i>It is in the very act of expressing scope as
"the data values we will handle" that a schema is
developed.</i> This holds true regardless of whether
the datastore will be a filesystem, an RDBMS, a
new NoSQL database, or anything else.
</p>
<p>Because all applications have a schema, whether the
database server enforces it or whether the
application enforces it, we need a vocabulary
to discuss the schema. Here we have an embarrassment
of choices: we can talk about entities and attributes,
classes and properties, documents and values, or
columns and tables. The choice of "entities and
attributes" is likely best because it is as close as
possible to an implementation-agnostic language.
</p>
<h2>First Order Business Logic: Schema</h2>
<p>We can define schema, including security, as:
</p>
<p class="quote">that body of entities and
their attributes whose relationships and
values will be managed by the
application stack, including the authorization of
roles to read or write to entities and properties.
</p>
<p>Schema in this definition does not include derived
values of any kind or the processes that may operate
on the schema values; those are higher orders of
business logic. This means that the schema
actually defines <i>the entire body of values that
the application will accept from outside sources
(users and other programs) and commit to the
datastore.</i> Restating again into even more
practical terms, the schema is the stuff users
can save themselves.
</p>
<p>With all of that said, let's enumerate the properties
of a schema.
</p>
<p><b>Type</b> is required for every attribute.
</p>
<p><b>Constraints</b> are limits to the values allowed
for an attribute beyond its type. We may have a
discount percent that may not exceed 1.0 or 100%.
</p>
<p><b>Entity Integrity</b> is usually thought of
in terms of primary keys
and the vague statement "you can't have duplicates."
We cannot have a list of US States where "NY" is
listed 4 times.
</p>
<p><b>Referential Integrity</b> means that when one
entity links or refers to another entity, it must
always refer to an existing entity.
We cannot have some script kiddie flooding our
site with sales of
items "EAT_ME" and "F***_YOU", becuase those are
not valid items.
</p>
<p>The general term 'validation' is not included
because any particular validation rule
is a combination of any or all of type, constraints,
and integrity rules.
</p>
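<p>All four properties can be expressed as ordinary DDL. Here is a sketch in SQLite (table and column names are invented for the example; SQLite only enforces foreign keys when asked):</p>

```python
import sqlite3

# Illustrative First Order schema: type, constraint, entity integrity,
# and referential integrity all declared once, enforced by the server.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only on request
conn.execute("CREATE TABLE states (code TEXT PRIMARY KEY)")  # entity integrity
conn.execute("""
    CREATE TABLE items (
        sku      TEXT PRIMARY KEY,                           -- entity integrity
        discount REAL NOT NULL                               -- type
                 CHECK (discount >= 0 AND discount <= 1.0),  -- constraint
        state    TEXT REFERENCES states (code)               -- referential integrity
    )""")
conn.execute("INSERT INTO states VALUES ('NY')")
conn.execute("INSERT INTO items VALUES ('A1', 0.25, 'NY')")  # passes every rule
try:
    conn.execute("INSERT INTO items VALUES ('A2', 0.5, 'ZZ')")  # no such state
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```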
<h2>Second Order Business Logic: Derived values</h2>
<p>When we speak of derived values, we usually
mean calculated values, but some derivations
are not arithmetic, so the more general term
"derived" is better. Derivations are:
</p>
<p class="quote">A complete entity or an attribute
of an entity generated from other entities
or attributes according to a formula or rule.
</p>
<p>The definition is sufficiently general that
a "formula or rule" can include conditional
logic.
</p>
<p>Simple arithmetic derived values include things
like calculating price * qty, or summing an
order total.
</p>
<p>Simple non-arithmetic derivations include
things like
fetching the price of an item to use on an
order line. The price in the order is <i>defined</i>
as being a copy of the item's price at the
time of purchase.
</p>
<p>An example of a complete entity being derived
is a history table that tracks changes
in some other table.
This can also be implemented
in NoSQL as a set of documents tracking the
changes to some original document.
</p>
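<p>A tiny sketch (hypothetical data, dicts standing in for tables) covering all three kinds of derivation just described: an arithmetic formula, a non-arithmetic copy, and a derived entity in the form of a history row.</p>

```python
# Second Order derivations: amount = price * qty (arithmetic), the order
# line freezes the item's price at time of purchase (non-arithmetic copy),
# and every price change derives a history row (a derived entity).
items = {"A1": {"price": 10.0}}
history = []   # derived entity: a change log, never written by a user

def set_price(sku, new_price):
    history.append({"sku": sku, "old": items[sku]["price"], "new": new_price})
    items[sku]["price"] = new_price

def make_order_line(sku, qty):
    price = items[sku]["price"]    # copy: defined as the price at purchase
    return {"sku": sku, "qty": qty, "price": price, "amount": price * qty}

line = make_order_line("A1", 3)    # amount derived as 30.0
set_price("A1", 12.0)              # history row generated by rule
print(line["amount"], len(history))   # 30.0 1
```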
<p>Security also applies to generated values
only insofar as who can see them. But security
is not an issue for writing these values
because by definition they are generated from
formulas and rules, and so no outside user
can ever attempt to explicitly specify the
value of a derived entity or property.
</p>
<p>One final point about Second Order Business
Logic is that it can be expressed declaratively,
<i>if we have the tools</i>, which we do not, at
least not in common use. I wrote one myself some
years ago and am re-releasing it as <a href=
"http://code.google.com/p/triangulum-db/"
>Triangulum</a>, but that is a post for another day.
</p>
<h2>Sorting out First and Second Order</h2>
<p>The definitions of First and Second Order Business Logic
have the
advantage of being agnostic to what kind of
datastore you are using, and being agnostic
to whether or not the derived values are
materialized. (In relational terms, derivations
are almost always <i>denormalizing</i> if
materialized, so in a fully normalized database
they will not be there, and you have to go through
the application to get them.)
</p>
<p>Nevertheless, these two definitions can right off
bring some confusion to the term "schema."
Example: a history table is absolutely in a database schema,
but I have called First Order Business Logic "schema" and
Second Order Business Logic is, well, something else.
The best solution here is to simply use the
terms First Order Schema and Second Order Schema.
An order_lines table is First Order schema, and
the table holding its history is Second Order Schema.
</p>
<p>The now ubiquitous auto-incremented surrogate primary
keys pose another stumbling block. Because they are
used so often (and so often because of seriously faulty
reasoning, see <a href="http://database-programmer.blogspot.com/2008/01/database-skills-sane-approach-to.html"
>A Sane Approach To Choosing Primary Keys</a>) they
would automatically be considered schema -- one of the
very basic values of a sales order, check, etc. But
they are system-generated so they must be Second Order, no?
Isn't the orderid a very basic part of the schema and
therefore First Order? No. In fact, by these
definitions, very little if any of an order header
is First Order; the tiny fragments that are First Order
might be the shipping address, the user's choice of
shipping method, and payment details provided by the
user. The other information that is system-generated,
like Date, OrderId, and order total are all Second
Order.
</p>
<h2>Third Order Business Logic</h2>
<p>Before defining Third Order Business Logic
I would like to offer a simple example:
<b>Batch Billing</b>. A consulting
company bills by the hour. Employees enter time
tickets throughout the day. At the end of the
month the billing agent runs a program that, in
SQL terms:
</p>
<ul><li>Inserts a row into INVOICES for each
customer with any time entries
<li>Inserts a row into INVOICE_LINES that
aggregates the time for each employee/customer
combination.
</ul>
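<p>Those two steps can be written directly as set-oriented SQL. A sketch in SQLite (schema heavily simplified, data invented):</p>

```python
import sqlite3

# Two INSERT...SELECT statements: headers first (for referential integrity),
# then the aggregated lines -- no loops, no conditionals, no algorithm.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE time_entries (customer TEXT, employee TEXT, hours REAL);
    CREATE TABLE invoices (customer TEXT PRIMARY KEY);
    CREATE TABLE invoice_lines (customer TEXT, employee TEXT, hours REAL);
    INSERT INTO time_entries VALUES
        ('acme','kim',3), ('acme','kim',2), ('acme','lee',4), ('globex','kim',1);

    -- Step 1: one INVOICES row per customer with any time entries
    INSERT INTO invoices SELECT DISTINCT customer FROM time_entries;

    -- Step 2: one INVOICE_LINES row per employee/customer combination
    INSERT INTO invoice_lines
        SELECT customer, employee, SUM(hours)
          FROM time_entries
         GROUP BY customer, employee;
""")
print(conn.execute("SELECT COUNT(*) FROM invoices").fetchone()[0])  # 2
```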
<p>This example ought to make clear what I mean by
defining Third Order Business Logic as:
</p>
<p class="quote">A Non algorithmic compound
operation.
</p>
<p>The "non-algorithmic" part comes from the fact that
none of the individual documents, an INVOICE
row and its INVOICE_LINES, is dependent on any other.
There is no case in which the
invoice for one customer will influence the value
of the invoice for another. You do not need an
algorithm to do the job, just one or more steps
that may have to go in a certain order.
</p>
<p>Put another way, it is a one-pass set-oriented
operation. The fact that it must be executed in
two steps is an <i>artifact</i> of how database
servers deal with referential integrity, which is
that you need the headers before you can put in
the detail. In fact,
when using a NoSQL database, it may be possible to
insert the complete set of documents in one
command, since the lines can be nested directly
into the invoices.
</p>
<p>Put yet a third way, in more practical terms,
there is no conditional or looping logic required
to <i>specify the operation</i>. This does not
mean there will be no looping logic in the final
implementation, because performance concerns and
locking concerns may cause it to be implemented
with 'chunking' or other strategies, but the
important point is that the <i>specification</i>
does not include loops or step-wise operations
because the individual invoices are all
functionally independent of each other.
</p>
<p>I do not want to get side-tracked here, but I
have had a working hypothesis in my mind for
almost 7 years that Third Order Business Logic,
even before I called it that, is an <i>artifact</i>,
which appears necessary because of the limitations
of our tools. In future posts I would like to
show how a fully developed understanding and
implementation of Second Order Business Logic
can dissolve many cases of Third Order.
</p>
<h2>Fourth Order Business Logic</h2>
<p>We now come to the upper bound of complexity
for business logic, Fourth Order, which
we label "algorithmic compound operations",
and define a particular Fourth Order Business
Logic process as:
</p>
<p class="quote">Any operation where it
is possible or certain that
there will be at least
two steps, X and Y, such that the result
of Step X modifies the inputs available to
Step Y.
</p>
<p>In comparison to Third Order:
</p>
<ul><li>In Third Order the results are
independent of one another, in Fourth Order
they are not.
<li>In Third Order no conditional or branching
is required to express the solution, while in
Fourth Order conditional, looping, or branching
logic will be present in the expression of the
solution.
</ul>
<p>Let's look at the example of ERP Allocation.
In the interest of brevity, I am going to skip most
of the explanation of the ERP Allocation algorithm
and stick to this basic review: a company has a list
of sales orders (demand) and a list of purchase
orders (supply). Sales orders come in through EDI,
and at least once/day the purchasing department
must match supply to demand to find out what they
need to order. Here is an unrealistically simple
example of the supply and demand they might be facing:
</p>
<pre>
*** DEMAND *** *** SUPPLY ***
DATE | QTY DATE | QTY
------------+----- ------------+-----
3/ 1/2011 | 5 3/ 1/2011 | 3
3/15/2011 | 15 3/ 3/2011 | 6
4/ 1/2011 | 10 3/15/2011 | 20
4/ 3/2011 | 7
</pre>
<p>The desired output of the ERP Allocation
might look like this:
</p>
<pre>
*** DEMAND *** *** SUPPLY ****
DATE | QTY | DATE_IN | QTY | FINAL
------------+-----+------------+------+-------
3/ 1/2011 | 5 | 3/ 1/2011 | 3 | no
| 3/ 3/2011 | 2 | Yes
3/15/2011 | 15 | 3/ 3/2011 | 4 | no
| 3/15/2011 | 11 | Yes
4/ 1/2011 | 10 | 3/15/2011 | 9 | no
4/ 3/2011 | 7 | null | null | no
</pre>
<p>From this the purchasing agents know that the
Sales Order that ships on 3/1 will be two days
late, and the Sales Orders that will ship on
4/1 and 4/3 cannot be filled completely. They
have to order more stuff.
</p>
<p>Now for the killer question: Can the desired
output be generated in a single SQL query?
The answer is no, not even with Common
Table Expressions or other recursive constructs.
The reason is that <b>each match-up of a purchase
order to a sales order modifies the supply
available to the next sales order.</b> Or,
to use the definition of Fourth Order Business
Logic, each iteration will consume some supply
and so <i>will affect the inputs available to
the next step</i>.
</p>
<p>We can see this most clearly if we look at some
pseudo-code:
</p>
<pre>
for each sales order by date {
while sales order demand not met {
get earliest purchase order w/qty avail > 0
break if none
make entry in matching table
<b><font color="blue">// This is the write operation that
// means we have Fourth Order Business Logic</font>
reduce available qty of purchase order</b>
}
break if no more purchase orders
}
</pre>
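<p>For the curious, here is a runnable rendering of that pseudo-code in Python (illustrative only, not the Triangulum implementation), reproducing the allocation table from the earlier example. The write operation that reduces available supply is exactly what makes this Fourth Order.</p>

```python
# Demand and supply from the example above; supply quantities are mutable
# because each match-up consumes them, modifying the input to the next step.
demand = [("3/01", 5), ("3/15", 15), ("4/01", 10), ("4/03", 7)]
supply = [["3/01", 3], ["3/03", 6], ["3/15", 20]]

matches = []  # (demand date, supply date, qty, final?)
for so_date, need in demand:
    for po in supply:
        if need == 0:
            break                 # this sales order is fully met
        if po[1] == 0:
            continue              # this purchase order is used up
        take = min(need, po[1])
        matches.append((so_date, po[0], take, need - take == 0))
        po[1] -= take             # the write that makes this Fourth Order
        need -= take
    if need:                      # supply exhausted: order cannot be filled
        matches.append((so_date, None, None, False))

for m in matches:
    print(m)
```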
<h2>Conclusions</h2>
<p>As stated in the beginning, it is my belief
that these four orders should "ring true" with
any developer who has experience with non-trivial
business applications. Though we may dispute
terminology and argue over edge cases, the
recognition and naming of the Four Orders should
be of immediate benefit during analysis, design,
coding, and refactoring. They rigorously
establish both the minimum and maximum bounds of
complexity while also filling in the two kinds of
actions we all take between those bounds.
They are datamodel agnostic,
and even agnostic to implementation strategies
within data models (like the normalize/denormalize
debate in relational).
</p>
<p>But their true power is in providing a framework
of thought for the process of synthesizing
requirements into a specification and from there
an implementation.
</p>
<p>Tomorrow we will see some theorems that we can
derive from these definitions.
</p>KenDownshttp://www.blogger.com/profile/11117175783163937575noreply@blogger.com223tag:blogger.com,1999:blog-426922399870577072.post-17277887882828625672010-12-21T22:25:00.006-05:002011-01-04T17:34:42.521-05:00A Working Definition of Business Logic, with Implications for CRUD Code<p><b>Update: the <a href="http://database-programmer.blogspot.com/2011/01/business-logic-from-working-definition.html"
>Second Post</a> of this series is now available.
</b>
</p>
<p><b>Update: the <a href="http://database-programmer.blogspot.com/2011/01/theorems-regarding-business-logic.html"
>Third Post</a> of this series is now available.
</b>
</p>
<p>The <a href="http://en.wikipedia.org/wiki/Business_logic"
>Wikipedia entry on "Business Logic"</a> has a
wonderfully honest opening sentence stating
that "Business logic,
or domain logic, is a <i>non-technical term</i>...
(emphasis mine)". If this is true, that the term
is non-technical, or if you like, <i>non-rigorous</i>,
then most of us spend the better part of our efforts
working on something that <i>does not even have a definition</i>.
Kind of scary.
</p>
<p>Is it possible to come up with a decent working
definition of business logic? It is certainly
worth a try. This post is the first in a four
part series. The <a href="http://database-programmer.blogspot.com/2011/01/business-logic-from-working-definition.html">second post</a> is about
a more rigorous definition of Business Logic.
</p>
<p>This blog has two tables of contents, the
<a href="http://database-programmer.blogspot.com/2008/09/comprehensive-table-of-contents.html">Complete Table of Contents</a> and the list
of
<a href="http://database-programmer.blogspot.com/2010/11/database-skills.html">Database Skills</a>.
</p>
<h2>The Method</h2>
<p>In this essay we will pursue a method of finding
operations that we can define as business logic
with a minimum of controversy, and identify those
that can likely be excluded with a minimum of
controversy. This may leave a bit of gray area
that can be taken up in a later post.
</p>
<h2>An Easy Exclusion: Presentation Requirements</h2>
<p>If we define <b>Presentation Requirements</b>
as all requirements about "how it looks"
as opposed to "what it is", then we can rule
these out. But if we want to be rigorous
we have to be clear: Presentation Requirements
has to mean things like branding, skinning,
accessibility, any and all formatting, and
anything else that is about the appearance
and not about the actual values fetched from
somewhere.
</p>
<h2>Tables as the Foundation Layer</h2>
<p>Database veterans are likely to agree that your
table schema constitutes the foundation layer of all
business rules. The schema, being the tables,
columns, and keys, determines <i>what must be
provided</i> and <i>what must be excluded</i>.
If that is not business logic, I don't
know what is.
</p>
<p>What about CouchDB and MongoDB and others that do
not require a predefined schema? These systems
give up the advantages of a fixed schema for
scalability and simplicity. I would argue here
that the schema has not disappeared, it has simply
moved into the code that writes documents to the
database. Unless the programmer wants a nightmare
of chaos and confusion, he will enforce some document
structure in the code, and so I still think it safe
to say that even for these databases there is a
schema <i>somewhere</i> that governs what must be
stored and what must be excluded.
</p>
<p>So we have at least a foundation for a rigorous
definition of business rules: the schema, be it
enforced by the database itself or by the code,
forms the bottom layer of the business logic.
</p>
<h2>Processes are Business Logic</h2>
<p>The next easy addition to our definition of
business logic would be processes, where
a process can be defined loosely as anything
involving multiple statements, can run without
user interaction, may depend on parameter
tables, and may take longer
than a user is willing to wait, requiring background
processing.
</p>
<p>I am sure we can all agree this is business logic,
but as long as we are trying to be rigorous, we
might say it is business logic because:
</p>
<ul><li>It must be coded
<li>The algorithm(s) must be inferred from the requirements
<li>It is entirely independent of Presentation Requirements
</ul>
<h2>Calculations are Business Logic</h2>
<p>We also should be able to agree that calculated
values like an order total, and the total after
tax and freight, are business logic. These are
things we must code for to take user-supplied
values and complete some picture.
</p>
<p>The reasons are the same as for processes, they
must be coded, the formulas must often be inferred
from requirements (or forced out of The Explainer
at gunpoint), and the formulas are completely
independent of Presentation Requirements.
</p>
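<p>As a minimal sketch of this kind of business logic (the item names, the flat
tax rate, and the freight charge are all invented for illustration), an EXTEND-style
calculation and an order total might look like this in application code:</p>

```python
# Hypothetical order-total calculation: the "calculations" variety
# of business logic. Names and rates are invented for this sketch.
TAX_RATE = 0.06  # assumed flat sales tax


def extended_price(qty, price):
    # One order line: qty * price = extended price
    return qty * price


def order_total(lines, freight=0.0):
    # Sum the extended prices, then add tax and freight.
    subtotal = sum(extended_price(q, p) for q, p in lines)
    return round(subtotal + subtotal * TAX_RATE + freight, 2)


# A two-line order: 3 @ 10.00 and 1 @ 5.00, with 4.00 freight.
total = order_total([(3, 10.00), (1, 5.00)], freight=4.00)
```

<p>The formulas here would of course come from the requirements (or from The
Explainer), not from the programmer's imagination.</p>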
<h2>The Score So Far</h2>
<p>So far we have excluded "mere" Presentation
Requirements, and included three entries I hope
so far are non-controversial:
</p>
<ul><li>Schema
<li>Processes
<li>Calculations
</ul>
<p>These are three things that some programmer must
design and code. The schema, either in a
conventional relational database or in application
code. Processes, which definitely must be
coded, and calculations, which also have to be
coded.
</p>
<h2>What Have We Left Out?</h2>
<p>Plenty. At very least security and notifications.
But let's put those off for another day and see
how we might handle what we have so far.
</p>
<p>For the Schema, I have already mentioned that you
can either put it into a Relational database or
manage it in application code when using a "NoSQL"
database. More than that will have to wait for
2011, when I am hoping to run a series detailing
different ways to implement schemas. I'm kind of
excited to play around with CouchDB or MongoDB.
</p>
<p>For processes, I have a <a href="http://database-programmer.blogspot.com/2010/12/critical-analysis-of-algorithm-sproc.html"
>separate post</a> that examines the implications
of the stored procedure route, the embedded SQL route,
and the ORM route.
</p>
<p>This leaves calculations. Let us now see how
we might handle calculations.
</p>
<h2>Mixing CRUD and Processes</h2>
<p>But before we get to CRUD, I should state that
if your CRUD code involves processes,
<i>seek professional help immediately</i>.
Mixing processes into CRUD is an extremely
common design error, and it can be
devastating. It can be recognized
when somebody says, "Yes, but when the salesman
closes the sale we have to pick this up and move
it over there, and then we have to...."
</p>
<p>Alas, this post is running long already and so
I cannot go into exactly how to solve these, but
the solution will always be one of these:
</p>
<ul><li>Spawning a background job to run the process
asynchronously. Easy because you don't have to
recode much, but highly suspicious.
<li>Examining why it seems necessary to do so
much work on what ought to be a single INSERT
into a sales table, with perhaps a few extra
rows with some details. Much the better solution,
but often very hard to see without a second pair
of eyes to help you out.
</ul>
<p>So now we can move on to pure CRUD operations.</p>
<h2>Let The Arguments Begin: Outbound CRUD</h2>
<p>Outbound CRUD is any application code that
grabs data from the database and passes it
up to the Presentation layer.
</p>
<p>A <b>fully normalized database</b>
will, in appropriate cases, require business logic of the
calculations variety, otherwise the
display is not
complete and meaningful to the user.
There is really no
getting around it in those cases.
</p>
<p>However, a database <b>Denormalized With
Calculated Values</b> requires no business logic
for outbound CRUD, it only has to pick up what
is asked for and pass it up. This is the route
I prefer myself.
</p>
<p>Deciding whether or not to include denormalized
calculated values has heavy implications for
the architecture of your system, but before we
see why, we have to look at inbound CRUD.
</p>
<h2>Inbound CRUD</h2>
<p>Inbound CRUD, in terms of business logic, is
the mirror image of outbound. If your
database is fully normalized, inbound CRUD
should be free of business logic, since it
is simply taking requests and pushing them to
the database. However, if you are denormalizing
by adding derived values, then it has to be
done on the way in, so inbound CRUD code must
contain business logic code of the calculations
variety.
</p>
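<p>A small sketch of this inbound path, using SQLite and hypothetical table
names: the calculation runs once, on the way in, and the stored result turns
the outbound path into a plain read.</p>

```python
# Sketch of inbound CRUD against a denormalized table: the derived
# column is computed on the write, so outbound CRUD just picks it up.
# Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE order_lines (
    order_id INTEGER, sku TEXT, qty INTEGER, price REAL,
    extended_price REAL  -- derived, stored on write
)""")


def insert_line(order_id, sku, qty, price):
    # The calculation (business logic) lives in the inbound path.
    conn.execute(
        "INSERT INTO order_lines VALUES (?,?,?,?,?)",
        (order_id, sku, qty, price, qty * price),
    )


insert_line(1, "A-100", 3, 10.0)
# Outbound CRUD now only has to pick up what is asked for and pass it up.
row = conn.execute("SELECT extended_price FROM order_lines").fetchone()
```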
<p>Now let us examine how the normalization
question affects system architecture and
application code.
</p>
<h2>Broken Symmetry</h2>
<p>As stated above, denormalizing by including
derived values forces calculated business
logic on the inbound path, but frees your
outbound path to be the "fast lane".
The opposite decision, not storing calculated
values, allows the inbound path to be the "fast lane"
and forces the calculations into the outbound
path.
</p>
<p>The important conclusion is: if you have business
logic of the calculation variety in both lanes
then you may have some inconsistent practices,
and there may be some gain involved in sorting
those out.
</p>
<p>But the two paths are not perfectly symmetric.
Even a fully normalized database will often,
sooner or later, commit those calculated values
to columns. This usually happens when some
definition of finality is met. Therefore, since
the inbound path is more likely to contain calculations
in any case, the two options are not really
balanced. This is one reason why I prefer
to store the calculated values and get them right
on the way in.
</p>
<h2>One Final Option</h2>
<p>When people ask me if I prefer to put business
logic in the server, it is hard to answer without
a lot of information about context. But when
calculations are involved the answer is yes.
</p>
<p>The reason is that calculations are incredibly
easy to fit into patterns. The patterns themselves
(almost) all follow foreign keys, since the foreign
key is the only way to correctly relate data between
tables. So you have the "FETCH" pattern, where a
price is fetched from the items table to the cart,
the "EXTEND" pattern, where qty * price = extended_Price,
and various "AGGREGATE" patterns, where totals are
summed up to the invoice. There are others, but it
is surprising how many calculations fall into these
patterns.
</p>
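<p>To make the patterns concrete, here is a minimal sketch with the FETCH and
EXTEND patterns coded as a trigger and an AGGREGATE done as a query. This uses
SQLite trigger syntax, and the items/cart tables are hypothetical stand-ins:</p>

```python
# Sketch only: FETCH (price copied from items to cart) and EXTEND
# (qty * price) as a trigger; AGGREGATE as a SUM query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (sku TEXT PRIMARY KEY, price REAL);
CREATE TABLE cart (cart_id INTEGER, sku TEXT, qty INTEGER,
                   price REAL, extended_price REAL);
-- FETCH: the price comes from items; EXTEND: qty * price.
CREATE TRIGGER cart_ai AFTER INSERT ON cart
BEGIN
  UPDATE cart
     SET price = (SELECT price FROM items WHERE sku = NEW.sku),
         extended_price = NEW.qty *
             (SELECT price FROM items WHERE sku = NEW.sku)
   WHERE rowid = NEW.rowid;
END;
""")
conn.execute("INSERT INTO items VALUES ('A-100', 10.0)")
conn.execute("INSERT INTO cart (cart_id, sku, qty) VALUES (1, 'A-100', 3)")
# AGGREGATE: line totals summed up to the cart/invoice level.
total = conn.execute(
    "SELECT SUM(extended_price) FROM cart WHERE cart_id = 1").fetchone()[0]
```

<p>A code generator working from a data dictionary would emit triggers of
exactly this shape, one per foreign-key-driven pattern.</p>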
<p>Because these patterns are so easy to identify, it
is actually conceivable to code triggers by hand
to do them, but being an incurable toolmaker, I
prefer to have a code generator put them together
out of a data dictionary. More on that around the
first of the year.
</p>
<h2>Updates</h2>
<p><b>Update 1:</b> I realize I never made it quite clear that this is part
1, as the discussion so far seems reasonable but is
hardly rigorous (yet). Part 2 will be on the way after I've
fattened up for the holidays.</p>
<p><b>Update 2:</b> It is well worth following the link Mr. Koppelaars has put in the comments:
<a href="http://thehelsinkideclaration.blogspot.com/2009/03/window-on-data-applications.html">http://thehelsinkideclaration.blogspot.com/2009/03/window-on-data-applications.html</a>
</p>KenDownshttp://www.blogger.com/profile/11117175783163937575noreply@blogger.com24tag:blogger.com,1999:blog-426922399870577072.post-1055208724929478582010-12-19T13:10:00.001-05:002010-12-19T18:38:55.356-05:00User-Submitted Analysis Topic: Email<p>Reader <a href="mailto:dean.thrasher@infovark.com">Dean Thrasher</a>
of <a href="http://www.infovark.com">Infovark</a> has submitted
a schema for review and analysis as part of my
<a href="http://database-programmer.blogspot.com/p/submit-your-analysis-request.html"
>User-Submitted Analysis Request</a> series.
Today we are going to take a first look
at what he has. Mr. Thrasher and I both hope that any and all readers
will benefit from the exercise of publicly reviewing the schema.
</p>
<p>This particular analysis request is a great start to the series,
because it has to do with email. Everybody uses email so we all
understand at a very basic level what data will be handled.
</p>
<h2>Brief Introduction to User-Submitted Schemas</h2>
<p>Mr. Thrasher and I have exchanged a couple of emails, but we have
avoided any in-depth discussion. Instead, we want to carry out the
conversation on the public blog. So I am not aiming to provide
any "from on high" perfect analysis, instead this essay will contain
a lot of questions and suggestions, and we will then move into the
comments to go forward.
</p>
<p><b>Disclosure:</b> None. We are not paying each other anything, nor have
I received any merchandise that would normally carry a licensing fee.
</p>
<p>Today's essay is the very first in the
<a href="http://database-programmer.blogspot.com/2008/09/comprehensive-table-of-contents.html#user"
>User-Submitted Analysis Requests</a> series. If you would like to see an analysis
of your schema, follow that link and contact me.
</p>
<p>This blog has a <a href="http://database-programmer.blogspot.com/2008/09/comprehensive-table-of-contents.html">Complete Table of Contents</a> and a list
of <a href="http://database-programmer.blogspot.com/2010/11/database-skills.html">Database Skills</a>.
</p>
<h2>Brief Description and Starting Point</h2>
<p>To get us started, I am going to quote the <a href="http://www.infovark.com/product/"
>Infovark Product Page</a>, and then we will see what we want to zoom in on:
</p>
<p style="padding: 10px 20px 10px 20px">Infovark automatically collects and catalogs your files and email. It consolidates your digital life into a personal wiki based on what it finds. Once you set Infovark to work, it will monitor your computer and keep your web site up-to-date</p>
<p>So we know even before we see anything technical that we are going to have
tables of contacts, emails, phones, addresses, appointments and many other
things pulled in from email systems, plus the value-add provided by the
product.
</p>
<h2>The Schema As-Is</h2>
<p>We are going to start by looking at how the details of a CONTACT
are stored. The schema models contacts with a group of
cross references, aka many-to-many relationships, like so:
</p>
<pre>
CONTACTS +----- CONTACTS-X-EMAILS -------- EMAILADDRESSES
|
+----- CONTACTS-X-PHONES -------- PHONES
|
+----- CONTACTS-X-ADDRESSES ----- ADDRESSES
|
+----- CONTACTS-X-WEBADDRESSES--- WEBADDRESSES
</pre>
<p>The first thing we have to note is that there is nothing wrong
with this at all. It is fully normalized and so it will be
very easy to make sure that database writes will not produce
anomalies or bad data.
</p>
<p>But, not surprisingly, Mr. Thrasher notes this makes for complicated
SELECTS, so we want to ask if
perhaps it is <i>over-normalized</i>, are there complications
in there that do not need to be there?
</p>
<h2>Email as A Property of Contact</h2>
<p>If I were to
<a href="http://database-programmer.blogspot.com/2008/01/table-design-patterns.html"
>follow my own advice</a>, I would first want to identify the
master tables. Master tables generally represent real things in the
world: people, places, products, services, events.
</p>
<p>So my first question is this: is an email address a free-standing
entity in its own right that deserves a master table? Or is it
instead a property of the CONTACT? I am going to suggest that an
email address is a property of a CONTACT, and, since a CONTACT
may have more than one email address, they should be stored in
a child table of the CONTACTS, more like this:
</p>
<pre>
CONTACTS +----- <font color="red"><s>CONTACTS-X-EMAILS -------- EMAILADDRESSES</s></font>
+----- <font color="green">CONTACTEMAILADDRESSES</font>
|
+----- CONTACTS-X-PHONES -------- PHONES
|
+----- CONTACTS-X-ADDRESSES ----- ADDRESSES
|
+----- CONTACTS-X-WEBADDRESSES--- WEBADDRESSES
</pre>
<p>Whether or not we make this switch depends not on
technical arguments about keys or data types, but on
<i>whether this accurately models reality</i>. If in fact
email addresses are simply properties of contacts, then
this is the simplest way to do it. Going further, the
code that imports and reads the data will be easier to
code, debug and maintain for two reasons: one, because
it is simpler, but more importantly, two, because it
accurately models reality and therefore will be easier
to think about.
</p>
<p>If this proves to be the right way to go, it may be
a one-off improvement, or it may repeat itself for
Phones, Addresses, and Web Addresses, but we will take
that up in the next post in the series.
</p>
<p>I am going to proceed as if this change is correct, and
ask then how it will ripple through the rest of the
system.
</p>
<h2>Some Specifics on the Email Addresses Table</h2>
<p>The EMAILADDRESSES table currently has these columns:
</p>
<pre>-- SQL Flavor is Firebird
CREATE TABLE EMAILADDRESS (
ID INTEGER NOT NULL,
USERNAME VARCHAR(64) NOT NULL COLLATE UNICODE_CI,
HOSTNAME VARCHAR(255) NOT NULL COLLATE UNICODE_CI,
DISPLAYNAME VARCHAR(255) NOT NULL
);
ALTER TABLE EMAILADDRESS
ADD CONSTRAINT PK_EMAILADDRESS
PRIMARY KEY (ID);
CREATE TRIGGER BI_EMAILADDRESS FOR EMAILADDRESS
ACTIVE BEFORE INSERT POSITION 0
AS
BEGIN
IF (NEW.ID IS NULL) THEN
NEW.ID = GEN_ID(GEN_EMAILADDRESS_ID,1);
END^</pre>
<p><b>Suggestion:</b> The first thing I notice is that the
complete email itself is not actually stored. So we need to
ask Mr. Thrasher what the thinking behind that was. My first
instinct is to store that, because it is the original natural
value of interest.
</p>
<p><b>Suggestion:</b> The columns USERNAME and HOSTNAME I could
go either way on. If they are needed for querying and statistics,
it is better to put them in. While this violates 3rd Normal Form and
so puts us at risk, the values are supplied AFAIK by a batch import,
and so there is only one codepath populating them, and we are likely safe.
However, if we DO NOT need to query these values for statistics,
and they are only there for convenience at display time, I would
likely remove them and generate them on-the-fly in application
code. There are some other good reasons to do this that will
come up a little further along.
</p>
<p><b>Suggestion:</b> Unless I missed something in the schema
sent over, we need a unique constraint on the combination of
CONTACTID and USERNAME and HOSTNAME. Or, if we remove
USERNAME and HOSTNAME in favor of the original EMAILADDRESS,
we need a unique constraint on CONTACTID + EMAILADDRESS.
</p>
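<p>Sketched in SQLite rather than Firebird, and with hypothetical column names,
the suggested constraint would look and behave like this:</p>

```python
# Sketch of the suggested unique constraint: one contact cannot
# carry the same email address twice. SQLite stand-in for Firebird.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE contactemailaddresses (
    contact_id   INTEGER NOT NULL,
    emailaddress TEXT    NOT NULL,
    UNIQUE (contact_id, emailaddress)
)""")
conn.execute("INSERT INTO contactemailaddresses VALUES (1, 'a@x.com')")
try:
    # Second identical row: the database itself rejects the anomaly.
    conn.execute("INSERT INTO contactemailaddresses VALUES (1, 'a@x.com')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```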
<h2>Before We Get To Transactions</h2>
<p>We are about to go into Part 2, which is about the other
tables that reference EMAILADDRESSES, but before we do
let's look at what the two tables would be if we made all
changes suggested so far:
</p>
<pre>
CONTACTS EMAILADDRESSES
------------ --------------------
ID (surrogate key)
CONTACT_ID --------&gt; CONTACT_ID
other columns... EMAILADDRESS
LABEL
USERNAME (possibly removed)
HOSTNAME (possibly removed)
DISPLAYNAME
</pre>
<p>You may notice the LABEL column showed up out of nowhere.
That column was previously in the cross-reference. When
the cross-reference went away it landed in EMAILADDRESSES.
That column LABEL holds values like "work", "home" and
so on. It is supplied from whatever system we pull
emails from, and so we have no constraints on it or
rules about it.
</p>
<h2>Changing Emails And Transactions</h2>
<p>Now we move on from the basic storage of EMAIL addresses
to the other tables that reference those addresses.
These are things like emails themselves with their lists
people sent to/from, and meetings, and presumably other
types of transactions as well.
</p>
<p>When we look at transactions, which will reference
contacts and email addresses, we also have to consider
the fact that a CONTACT may change their email address
over time. Consider a person working for Vendor A, who
moves over to Vendor B. For some of the transactions
they will have been at Vendor A, and then going forward
they are all at Vendor B. This leads to this very
important question:
</p>
<p><b>Do Transactions store details about the CONTACTS as
they were at the time of the transaction, or as they
are now?</b>
</p>
<p>In other words, if a CONTACT moves from one company to
another, and you look at a meeting with that person
from last year, should it link to where they are now?
Or should it be full of information about where they
were at the time?
</p>
<p>The answer to this question is important because it
determines how to proceed on the two final points I
would like to raise:
</p>
<ol><li>Should the various transactions have a foreign
key back to EMAILADDRESSES, or should they simply
link back to CONTACTS and contain the EMAILADDRESS
itself?
<li>Do we need an integer surrogate key on the
EMAILADDRESSES table, especially if we do not link
back to it?
</ol>
<h2>First Final Suggestion</h2>
<p>So the first of the final two suggestions is: maybe
the transactions tables should just link back to CONTACTID
and contain a freestanding EMAILADDRESS. The first argument for
this is that it preserves the history as it was, and
if that is what we want, then this accomplishes it.
The second argument is that by putting the actual value
instead of an integer key back to some table, we
simplify coding by removing a join.
</p>
<p>The arguments against embedding the email address might
be basically, "hey, if this is a kind of a data warehouse,
you are really supposed to be doing the snowflake thing and
you don't want to waste space on that value." To which I
respond that the engineer always has the choice of trading
space for speed. Putting the email in directly is a correct
recording of a fact, and takes more space, but eliminates
a very common JOIN from many queries, so Mr. Thrasher may
choose to make that call.
</p>
<p>This also plays back to my question about whether we should
have USERNAME and HOSTNAME in the EMAILADDRESSES table.
If we start putting email addresses directly into tables,
we can also keep putting these other two columns in, which
trades again space for speed. We could also skip them and
code a parser in the application that generates them
on-the-fly as needed.
</p>
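<p>For what it is worth, the on-the-fly parser is tiny. A sketch (the function
name is made up, and the no-@ fallback is one arbitrary choice among several):</p>

```python
# Minimal sketch of deriving USERNAME and HOSTNAME from the stored
# address at display time instead of persisting them as columns.
def split_email(address):
    # rpartition splits on the LAST '@', which is the only one that
    # can separate the local part from the host.
    username, sep, hostname = address.rpartition("@")
    if not sep:
        # No '@' present: treat the whole value as the username.
        return address, ""
    return username, hostname


user, host = split_email("someone@example.com")
```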
<h2>Second Final Suggestion</h2>
<p>Now we go all of the way back to the child table
and ask a basic question: Why is there is an integer
surrogate key there? Integer surrogate keys are
useful in many situations, but contrary to what the
web generation learned, they are not some kind of
required approach in relational databases.
</p>
<p>Consider: we need a unique constraint on CONTACTID+EMAILADDRESS
anyway, so we have to justify why we would add a new
column that does not add value. The reflex answer tends
to be "because they join faster" but that ignores the fact
that if you use the natural key of CONTACTID+EMAILADDRESS,
and put these columns into child tables, <i>you do not need
to join at all!</i> If we use the surrogate key and embed
it in child tables, then getting the CONTACT information
forces two joins: through EMAILADDRESS to CONTACTS. But if
we use the natural key of CONTACTID + EMAILADDRESS <i>we
already have the contact id</i> which saves a JOIN when we
are after CONTACTS details, and, unless we want to know
something like LABEL, <i>we do not have to JOIN back to
EMAILADDRESSES at all</i>.
</p>
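<p>A small SQLite sketch (hypothetical tables) of that payoff: with the natural
key embedded in the transaction, the email costs zero joins and the contact
details cost exactly one.</p>

```python
# Sketch of the natural-key argument: CONTACT_ID + EMAILADDRESS live
# on the transaction row itself. Names and data are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contacts (contact_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE meetings (meeting_id INTEGER PRIMARY KEY,
                       contact_id INTEGER,   -- natural link to contacts
                       emailaddress TEXT);   -- value as of the meeting
INSERT INTO contacts VALUES (7, 'Dean');
INSERT INTO meetings VALUES (1, 7, 'dean@vendor-a.example');
""")
# Email straight off the transaction row: no join at all.
email = conn.execute(
    "SELECT emailaddress FROM meetings WHERE meeting_id = 1").fetchone()[0]
# Contact details: a single join, not two hops through EMAILADDRESSES.
name = conn.execute("""SELECT c.name FROM meetings m
                       JOIN contacts c USING (contact_id)
                       WHERE m.meeting_id = 1""").fetchone()[0]
```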
<h2>Conclusion</h2>
<p>Well that's it. As promised, we have a few suggestions
and a lot of questions for Mr. Thrasher. Check back in the
coming days to see how the various questions work themselves
out in the comments.
</p>KenDownshttp://www.blogger.com/profile/11117175783163937575noreply@blogger.com12tag:blogger.com,1999:blog-426922399870577072.post-69043468843958345512010-12-16T23:24:00.003-05:002010-12-21T22:52:08.094-05:00Critical Analysis of an Algorithm: Sproc, Embedded SQL, and ORM<p>This is a follow-up to yesterday's
<a href="http://database-programmer.blogspot.com/2010/12/historical-perspective-of-orm-and.html"
>historical perspective
on ORM.</a> In this essay we examine a particular class of
business logic and ask what happens if we go server-side,
embedded SQL, or ORM.
</p>
<p>This blog has two tables of contents, the
<a href="http://database-programmer.blogspot.com/2008/09/comprehensive-table-of-contents.html">Complete Table of Contents</a> and the list
of
<a href="http://database-programmer.blogspot.com/2010/11/database-skills.html">Database Skills</a>.
</p>
<h2>Processes</h2>
<p>We are going to look at a process. The term is not
well defined, but my own working definition is any
operation that has as many of the following properties
as I seem to think are important at the time I make
the decision:
</p>
<ul><li>Is categorically not CRUD: requires much more
than displaying data to user or sending a single-row
operation to the server
<li>Involves reading or writing many rows
<li>Involves reading or writing from multiple tables
<li>Involves multiple passes of the same data
<li>Involves no user interaction while executing
<li>If not coded to be idempotent can cause huge
headaches
<li>Will likely take longer than a user is willing
to wait (these days 3-10 seconds) and so runs in the
background.
<li>Depends upon rules tables that control its behavior
</ul>
<h2>A Particular Process: Magazine Regulation</h2>
<p>I have a more complete description of this problem
<a href="http://database-programmer.blogspot.com/2008/05/minimize-code-maximize-data.html">here</a>, so this is going to be very
short. The system handles magazine deliveries to
stores. The shop running the system has thousands of stores
and thousands of magazines. Every store has a
<i>default quantity</i> of the particular magazines they
carry.
For one particular magazine, <b>NewsTime</b>,
there are 1000 stores that get an average default
quantity of 50, requiring 50,000 magazines each week.
</p>
<p>Here is the twist. You never get exactly 50,000, no
matter what you tell the distributor. Some weeks you
get 45,000, others you get 55,000, with any variation
in between. So the <b><i>Regulation
Process</i></b> adjusts the defaults for each store
until the delivery amounts equal the on-hand total that
was delivered on the truck.
</p>
<h2>The Naive or Simple Algorithm</h2>
<p>In the first pass, we are going to consider an
unrealistically simple version of Magazine Regulation,
where we have too many magazines and must up the
quantities until we're giving out the entire amount on-hand.
</p>
<p>Assume a table has already been populated that has
the default quantities for each store, where the relevant
columns for the moment would be these:
</p>
<pre>
StoreId | MagazineId | QTY_DEFAULT | QTY_REGULATED
-----------+---------------+--------------+---------------
1 | 50 | 75 | 0
2 | 50 | 23 | 0
4 | 50 | 48 | 0
10 | 50 | 19 | 0
17 | 50 | 110 | 0
21 | 50 | 82 | 0
</pre>
<p>We are told only that the increases must be evenly
distributed, we can't just ship 5000 extra magazines to
a single store. That makes sense. A simple algorithm to do this would be:
</p>
<ol><li>Give each store one additional magazine until you
run out of magazines or run out of stores.
<li>Repeat step 1 until you run out of magazines.
</ol>
<h2>The Pseudo-Code</h2>
<p>Let's mock something up in pseudo-code that
shows the structure of the solution:
</p>
<pre>
magsToDeliver = get total magazines...
magsDefault = get total of defaults...
-- Outer loop implements rule 2:
-- "repeat until you run out of magazines"
while magsToDeliver > magsDefault {
-- Inner loop implements rule 1:
-- "increase each store by 1 until you
-- run out of stores or run out of magazines"
for each store getting this magazine {
if magsToDeliver <= magsDefault break
-- If you want to go nuts, and even allow
-- the accidental simultaneous execution
-- of two routines doing the same thing,
-- put these lines here instead
magsToDeliver = get total magazines...
magsDefault = get total of defaults...
-- This is the actual job getting done
qty_regulate +=1
magsToDeliver -=1
}
}
</pre>
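<p>For the curious, the pseudo-code above translates to runnable Python along
these lines (the store ids and quantities are made up, and the double-check
inside the loop is omitted):</p>

```python
# Runnable sketch of the naive regulation loop from the pseudo-code:
# hand one extra copy to each store per pass until the truck is empty.
def regulate(defaults, mags_to_deliver):
    # defaults: {store_id: qty_default}; returns {store_id: increase}
    regulated = {store: 0 for store in defaults}
    mags_default = sum(defaults.values())
    # Outer loop: "repeat until you run out of magazines"
    while mags_to_deliver > mags_default:
        # Inner loop: "+1 per store until out of stores or magazines"
        for store in defaults:
            if mags_to_deliver <= mags_default:
                break
            regulated[store] += 1
            mags_to_deliver -= 1
    return regulated


# Three stores defaulting to 10+20+30 = 60 copies; 65 arrived.
result = regulate({1: 10, 2: 20, 3: 30}, 65)
```

<p>Note that each `+= 1` here stands in for a round trip when the loop runs in
application code, which is the whole point of the comparison below.</p>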
<h2>The Three Methods</h2>
<p>Let's characterize what happens with our three
choices of stored procedure, embedded SQL, or
ORM.
</p>
<p><b>Stored Procedure.</b>
Likely the fastest
solution, considering all of that row-by-row
going on. If that were in app code (ORM or not) we would
be making two
<a href="http://database-programmer.blogspot.com/2010/12/cost-of-round-trips-to-server.html"
>round trips</a> to the server per iteration.
</p>
<p>The really crazy thing about the stored procedure
solution is that it is <i>utterly neutral to the
ORM question</i>. The entire ORM good-bad debate
dissolves because there is no OOP code involved.
So this could be the magic solution that ORM lovers
and haters could both agree upon.
</p>
<p><b>App Code With Embedded SQL (no ORM).</b> Just about
everybody hates this idea these days, but it should
be here for completeness, and because there are some
advantages. The top of the pseudo-code requires to
aggregate pulls, and if you are not afraid of SQL you
can pull down the result in one pass, instead of
iterating on the client. Further, the innermost operation
can be coded in SQL as a "UPDATE deliveries from (SELECT
TOP 1 deliverid From deliveries...)" so that you get only
one round trip per iteration, where ORM will cost two.
</p>
<p><b>Any Kind of ORM.</b> By this I mean the code will
contain no SQL, and the innermost loop will likely
instantiate some "delivery" objects, one after another,
increment their qty_regulated property by one, and flush them out.
This is twice as expensive as embedded SQL because you
have to fetch the row from the database and then
write it back, where the embedded SQL can issue a single
command that locates and updates the row in a single
statement.
</p>
<p>Some may argue that I misunderstand ORM
here, in that the library may be smart enough to allow
the write without the read, and <i>without forcing you
to embed SQL</i>. It would have to be something like
A) instantiate empty object with key, B) assign value
as an expression, like "+=1", C) save.
I welcome any such examples and will
update the post accordingly if any are forthcoming.
I am assuming that no ORM tool I have seen can do this
and would be happy to be corrected.
</p>
<p>If the ORM forces us to calculate the initial sum
of QTY_Default by fetching each row as an object and summing
them in the app, we get an extra complete set of round
trips. Bummer. But if we say, "Hey my ORM tool lets
me embed SQL in *emergencies*" then perhaps we can embed
a query with an aggregate and skip that cost. But
oops, we've got some embedded SQL. Leaky abstraction.
</p>
<h2>The Score So Far</h2>
<p>So how does it look so far? All methods have
the same number of reads and writes to disk,
so we are scoring
them on round trips. If "X" is the number
of extra magazines to be distributed, and "Y" is
the number of stores getting the magazine, we have
for round trips:
</p>
<ul><li><b>Stored Procedure:</b> 1
<li><b>Embedded SQL:</b> X + 1 (the first pull plus one trip per
extra copy of the magazine)
<li><b>ORM, Hypothetical:</b> X + 1 (if the ORM tool can figure out how
to do an update without first reading the row to the app)
<li><b>ORM, Best Case:</b> 2X + 1 (if the first pull can be an aggregate
without embedding SQL, and two round trips per iteration)
<li><b>ORM, Worst Case:</b> 2X + Y (if the first pull must aggregate
app-side and there are two round trips per iteration)
</ul>
<p><b>Update: if you want a laugh, check out the image on the
<a href="http://en.wikipedia.org/wiki/Business_logic">Wikipedia page for "Business Logic"</a>, it depicts
aggregation occurring on the client side.</b>
</p>
<p>This gives us the shape of the engineering decision.
With all options reading and updating the same number of
rows, it all comes down to round trips. As soon as you
go client side your round trips go way up, and if your
ORM tool does not support Update without Select, then
it doubles from there.
</p>
<p>Now multiply this across the entire application,
every single action in code, every bit of "business
logic" with any kind of loop that iterates over
rows.
</p>
<h2>It Gets Worse/Better: Considering SQL Possibilities</h2>
<p>If you happen to know much about modern SQL,
you may be aware of the amazingly cool SQL RANK()
function. If this function is used in the sproc
or embedded SQL approaches, you can execute the
algorithm with only one loop, in a maximum of
<b>N=CEILING((Delivered-Regulated)/Stores)</b>
iterations. This will go much faster than the
row-by-row, and now those two options are pulling
even further ahead of the naive row-by-row methods
encouraged by an ORM tool.
</p>
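<p>To make the batch idea concrete, here is one way to sketch it in Python
against SQLite (window functions require SQLite 3.25 or later, and all names
and quantities are invented). Each pass is a single UPDATE that bumps up to
one copy per store, so the loop runs CEILING(extra/stores) times instead of
once per magazine:</p>

```python
# Sketch of the RANK()-based batch update: each round trip regulates
# up to one copy per store, instead of one copy per trip.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE deliveries (store_id INTEGER PRIMARY KEY,
                         qty_default INTEGER, qty_regulated INTEGER);
INSERT INTO deliveries VALUES (1,10,0),(2,20,0),(3,30,0);
""")
delivered = 65
extra = delivered - conn.execute(
    "SELECT SUM(qty_default) FROM deliveries").fetchone()[0]
passes = 0
while extra > 0:
    # One statement updates min(extra, store count) rows at once.
    cur = conn.execute("""
        UPDATE deliveries SET qty_regulated = qty_regulated + 1
         WHERE store_id IN (
            SELECT store_id FROM (
                SELECT store_id,
                       RANK() OVER (ORDER BY store_id) AS rnk
                  FROM deliveries)
             WHERE rnk <= ?)""", (extra,))
    extra -= cur.rowcount
    passes += 1
```

<p>With 5 extra copies across 3 stores, the loop completes in 2 passes rather
than 5 round trips; the ORDER BY inside RANK() is also where the prioritizing
rules of the real-world version would plug in.</p>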
<p>This ability of SQL will become extremely
important, as we are about to blow apart the
simplicity of the algorithm.
</p>
<h2>We Now Return You To The Real World</h2>
<p>I have never been paid good money to write an
algorithm as simple as the one described above.
This is because mature businesses have always
refined these simple methods for years or decades,
and so the real situation is always more complex.
</p>
<p>In a real magazine regulation algorithm, the rules
tend to be more like this:
</p>
<ol><li>Apply all of these rules whether you are
increasing or decreasing the amounts to deliver
<li>Stop applying the rules when delivery amounts
have been balanced to what we have on hand, no matter
where you are in the process
<li>Always increase/decrease any particular store
by exactly one on each iteration, no matter which rule
you are working on
<li>Never decrease any store below 2
<li>Decrease any store whose past 3 issues sold less
than 60% by 1, unless this would project their sales
of this issue above 60%, and prioritize
by prior sales % ascending.
<li>If the previous step completes, and we are
<b>short of magazines</b> decrease each store
by 1 by order of previous sales percents
ascending. Repeat until we are in balance.
<li>If all stores are reduced to 2 and we are
still short, abort and write error to log.
<li>If after the decreases we have <b>excess magazines</b>,
increase any store whose past 3 issues sold more than
70% by 1, unless this would reduce their projected
sales of this issue below 70%, and prioritize by
prior sales % descending (so the stores with the most
sales are handled first in case we don't get to all of them)
<li>If the previous step completes, and we are
<b>still in excess</b>, increase each store by 1 in order
of previous sales percents descending. Repeat until
we are in balance.
</ol>
<p>This can also get doubled again if we must implement one
set of rules when we start out with a too few magazines,
and another set of rules when we start out with too many.
</p>
<p>Well, it's not that hard. It actually comes down to
having four outer loops in succession. The percent-based
reduction, then the by 1 reduction, then the percent-based
increase, then the by 1 increase.
</p>
<p>But technically the more important matters are these:
</p>
<ul><li>We now have to
grab the sales % for every store for this magazine
on their past 3 issues and keep it handy throughout
the routine.
<li>The rules stated above contain <b>constants</b>
like 70%, 60%. These would properly be in some
parameter table to allow the user to control them,
so those have to be fetched.
<li>The loop through the stores is now much different,
as we are <i>filtering</i> on prior sales percent for
the percent-based passes, and <i>filtering and ordering</i>
on prior sales percent for the by 1 passes.
</ul>
<h2>Revisiting the Three Approaches</h2>
<p>Now let's see how our three approaches would change.
</p>
<p><b>The Improved Stored Procedure.</b> If we change the
sproc to use RANK() and make batch updates, we would
pull the prior sales percents into a temp table and
apply a <a href="http://www.sql-server-performance.com/tips/covering_indexes_p1.aspx"
>covering index</a> to cut the reads from that table in
half. Our main loop would then simply join to this
temp table and use it for both filtering and ordering.
</p>
<p><b>The Embedded SQL.</b> If we also changed the
embedded SQL so it was making batch updates with
RANK(), we would also generate a temp table. This
option remains the same as the sproc except for where
we put the SQL. However, it now has far fewer
round trips, and starts to look much more like the
sproc in terms of performance.
</p>
<p><b>The ORM Approach.</b> The idea here would be to
get those prior sales percents down into an ordered
collection and then use them as
the basis for the loops. The thorny part is that they
must be aggregated and sorted. If we want to avoid
all embedded SQL, then the aggregation
can be done client-side if we don't mind pulling down
3 times as many rows as are required. The sorting we
can pull off by putting the objects into a
collection such as an associative array keyed
on the sales percent; then we can use <b>[language of choice]</b>'s
built-in sorting (hopefully), and we have escaped the
dread evil of embedded SQL.
</p>
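<p>A sketch of that client-side aggregation step, in Python for brevity
(the row shape and the three-issue history are assumptions made up for
this example):</p>

```python
# Client-side aggregation the ORM approach needs: pull one row per store
# per prior issue (3x as many rows as stores), roll them up, and sort.
from collections import defaultdict

def prior_sales_percents(history_rows):
    """history_rows: (store_id, delivered, sold) tuples, one per store per
    prior issue. Returns (store_id, pct) pairs sorted by pct ascending."""
    delivered = defaultdict(int)
    sold = defaultdict(int)
    for store_id, d, s in history_rows:
        delivered[store_id] += d
        sold[store_id] += s
    pcts = {sid: sold[sid] / delivered[sid] for sid in delivered}
    # Sort once here; the passes iterate this ascending, or reversed()
    # for the descending passes.
    return sorted(pcts.items(), key=lambda kv: kv[1])
```

<p>This is work the database would have done in one aggregate query; here
it costs extra rows over the wire and extra app-side CPU, which is the
trade the ORM approach makes.</p>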
<p>So we end up where we were, only more so. The sproc
remains the fastest, and if we know how to code set-oriented
nifty stuff with RANK() then the embedded SQL will run
in almost the exact same time. The ORM most likely
requires even more round trips and expensive
app-side operations that are performed much more efficiently
in the db server, unless we are willing to break the
abstraction and embed a bit of SQL.
</p>
<p>But in the end, if all of that ORM cost kicks
a 3 second routine up to 7 seconds, still short enough
that users are unlikely to complain, and it lets you avoid
embedded SQL and keep your paradigm,
who am I to judge?
</p>
<h2>Conclusions</h2>
<p>I offer none. There are so many conventions in play
regarding where to put your code, what tool you are
already using, and so forth, that it is really up to
the reader to draw conclusions. I only hope there is
enough information here to do so.
</p>KenDownshttp://www.blogger.com/profile/11117175783163937575noreply@blogger.com16tag:blogger.com,1999:blog-426922399870577072.post-4896892701966852812010-12-15T21:55:00.007-05:002011-01-03T12:23:44.009-05:00Historical Perspective of ORM and Alternatives<p>A couple of years ago I broke my basic rule of sticking
to practical how-to and general programming philosophy
and wrote <a href="http://database-programmer.blogspot.com/2008/06/why-i-do-not-use-orm.html"
>Why I Do Not Use ORM</a>. It sure got a lot of hits,
and is read every day
by people searching such things as "orm bad" or "why use orm".
But I have never been
satisfied with that post, and so I decided to take
another stab from another angle. There are legitimate
problems that led to ORM, and those problems need to
be looked at even if we cannot quite agree on what they
are or if ORM is the answer.
</p>
<p><b>UPDATE: In response to comments below and on reddit.com,
I have a <a href="http://database-programmer.blogspot.com/2010/12/critical-analysis-of-algorithm-sproc.html">new post</a> that gives a detailed
analysis of an algorithm implemented as a sproc, in app
code with embedded SQL, and in ORM.</b>
</p>
<p>Here then, is one man's short history of commercial
database application programming, from long before
the ORM system, right up to the present.
</p>
<p>This blog has two tables of contents, the
<a href="http://database-programmer.blogspot.com/2007/12/database-skills-complete-contents.html">Topical Table of Contents</a> and the list
of
<a href="http://database-programmer.blogspot.com/2010/11/database-skills.html">Database Skills</a>.
</p>
<h2>The Way Back Machine</h2>
<p>When I began my career the world was a different place.
No Web, no Java, and Object Orientation had not yet
entered the mainstream. My first
application was written on a timeshare system (a microVAX)
and writing LAN applications made me a good living for
awhile before I graduated to client/server.
</p>
<p>In those days there were three things a programmer
(We were not "software engineers" yet, just
programmers) had to know. Every programmer I knew
wanted to master all of these skills. They were:
</p>
<ul><li>How to design a database schema for correctness
and efficiency.
<li>How to code an application that could process
data from the database, correctly and efficiently.
<li>How to make a good UI, which came down to
hotkeys and stuffing the screen with as much info
as possible.
</ul>
<p>In this essay we are going to look at those first two.</p>
<p>My own experience may be somewhat peculiar in that I
have never worked on a team where the programmers
were separated from the database. <i style="color:#555">(OK, one exception, in
my current assignment there is an iron curtain between
the two, but happily it is not my problem from where
I sit).</i> Coders made tables, and "tablers" wrote
code. So this focus on being a good developer by
developing both skills may be rare, shared only by those
who have the same ecumenical background that I enjoyed.
</p>
<h2>Some Changes That Did Not Matter</h2>
<p>Things changed rapidly, but most of those changes
did not really affect application development.
</p>
<p>When Windows 95 came out, being "almost as good as
a Mac", we recoded our DOS apps into Windows apps
without too much trouble and life went on as before.
</p>
<p>Laser printers replaced dot-matrix for most office use,
CPUs kept getting faster (and Windows kept getting
slower), each year there were more colors on the
screen, disks got bigger and RAM got cheaper.
</p>
<p>Only the internet and the new <i>stateless programming</i>
required any real adjustment, but it was easy for a
database guy because good practice had always been to
keep your transactions as short as possible. The stateless
thing just kind of tuned that to a sharp edge.
</p>
<p>Finally, with the internet, the RDBMS lost its
place as sole king of the datastore realm, but those new
datastores will have to wait for another day, lest we
get bogged down.
</p>
<h2>Enter Object Orientation</h2>
<p>Arguably nothing changed programming more than
Object Orientation. Certainly not Windows 95, faster
graphics or any of those other Moore's Law consequences.
I would go so far as to say that even
the explosion of the web just produced more programming,
and of different kinds of apps, and even that did not
come close to the impact of Object Orientation.
Disagree if you like, but as it came in, it was
new, it was strange, it was beautiful, and we were
in love.
</p>
<p>Now here is something you may not believe. The biggest
question for those of us already successfully developing
large applications was: What is it good for? What does
it give me that I do not already have? Sure it's
beautiful, but <i>what does it do?</i>
</p>
<p>User interfaces were for me the easiest first place to
see the benefits. When the widgets became classes and objects,
and we employed encapsulation, inheritance and
composition, the world
changed and I don't know anybody who ever looked back.
</p>
<h2>OOP, Data, and Data Structures</h2>
<p>But in the matter of processing data, things were not
so clear cut. The biggest reason may have been that
all languages back then had <i>specialized data structures</i>
that were highly tuned to handling relational data.
These worked so well that nobody at first envisioned
anything like <a href="http://en.wikipedia.org/wiki/Active_record_pattern"
>ActiveRecord</a> because
we just did not need it.
</p>
<p>With these structures you could write applications
that ran processes involving dozens of tables, lasting
hours, and never wonder, "Gosh, how do I map this data
to my language of choice?" You chose the language you
were using <i>precisely because it knew how to handle
data!</i></p>
<p>I would like to throw in just one example to show how
OOP was not relevant to getting work done back then.
I was once asked to optimize something called "ERP
Allocation" that ran once/day, but was taking 26 hours
at the largest customer site, obviously a big problem.
It turned out there was a call to the database inside of
a tightly nested loop, and when I moved the query outside
of the loop the results were dramatic. The programmers
got the idea and they took over from there. The main
point of course is that it was all about how to
efficiently use a database. The language was OOP, and
the code was in a class, but that had nothing to do
with the problem or the solution. Going further,
coding a process so data intensive as this one
using ActiveRecord
was prima facie absurd to anybody who knew about data
and code.
</p>
<h2>Java and the Languages of The Internet</h2>
<p>But the web had another impact that was far
more important than just switching to stateless
programming. This was the introduction
of an entirely new family of languages that took
over the application space, listed here in no
particular order: Perl, PHP,
Python, Ruby, and the king of them all: Java.
</p>
<p>All of these languages have one thing in common
that positively jumps out at a veteran: <i>a complete
lack of data structures specialized for handling
relational data.</i>
So as these languages exploded in popularity
with their dismal offerings in data handling,
the need to provide something better in that
area became rapidly clear.
</p>
<p>Java has a special role to play because it was
pure OOP from the ground up. Even the whitespace
is an object! The impact of Java is very important
here because Object Orientation was now the One True
Faith, and languages with a more
flexible approach were gradually demoted
to mere 'scripting' languages. <i style='color:#555'>(
Of course proponents will quickly point out that 1/12 of the
world's population is now using a single application
written in one of those 'scripting' languages).</i>
</p>
<p>So the explosion of languages without decent
data handling abilities, coupled with a rise in
OOP-uber-alles thinking led us quite naturally to:
</p>
<h2>The First Premise of ORM: The Design Mismatch</h2>
<p>The first premise of ORM is that there is a design
mismatch between OOP and Relational, which must be resolved
before any meaningful work can be done.
</p>
<p>This view is easy to sympathize with, even if you
disagree, when you consider the points raised in the
above sections, that the languages in play lack any real
specialized data structures, and that a certain
exclusive truthiness to OOP has arisen that is blind
to entire classes of solutions.
</p>
<p>So we must grant the ORM crowd their first
premise, in modified form. It is not that there
is a design mismatch, it is that there is something
missing, something that was in older systems that
is just not there in the newer languages. Granting
that this missing feature is an actual mismatch
requires a belief in the Exclusive Truth of OOP,
which I do not grant. OOP is like the computer
itself, of which Commander Spock said, "Computers
make excellent servants, but I have no wish to be
servant to a computer."
</p>
<p>But anyway, getting back to the story, the race
was on to replace what had been lost, and to do it
in an OOPy way.
</p>
<h2>The Second Premise of ORM: Persistence</h2>
<p>Fast forward and we soon have an entire family
of tools known as Object-Relational-Mappers,
or ORM. With them came an old idea: persistence.
</p>
<p>The idea has always been around that databases
exist to <i>persist</i> the work of the programmer.
I thought that myself when I was, oh, about 25 or
so. I learned fast that my view of reality was,
*cough*, lacking,
and that in fact there are two things
that are truly real for a developer:
</p>
<ul><li>The users, who create the paycheck, and
<li>The data, which those users seemed to think
was supposed to be correct 100% of the time.
</ul>
<p>From this perspective, the application code suddenly
becomes a go-between, the necessary appliance that
gets data from the db to the user (who creates the
paycheck), and takes instructions back from the user
and puts them in the database (correctly, thank you,
and don't make the user wait). No matter how
beautiful the code was, the user would only ever see
the screen (or page nowadays) and you only heard about
it if it was wrong. Nobody cares about my code, nobody
cares about yours.
</p>
<p>However, in the ORM world the idea of a database as the
<i>persistence</i> layer now sits on a throne reserved for
axiomatic truth. Those who disagree with me on this
may say that I have the mistaken perspective of an outsider,
to which I could say only that it is this very idea that
keeps me an outsider.
</p>
<p>But we should not paint the world with a broad brush.
Chris Wong writes an excellent blog where he occasionally
details how to respect the database while using Hibernate, in
<a href="http://chriswongdevblog.blogspot.com/2010/12/oops-mangling-your-database-with.html">this post</a>
and <a href="http://chriswongdevblog.blogspot.com/2010/10/beware-magic-flush.html">this post</a>.</p>
<h2>An Alternative World View</h2>
<p>There are plenty of alternatives to ORM, but I would
contend that they begin with a different world view.
Good business recognizes the infinite value of the
users as the generators of the Almighty Paycheck, and
the database as the permanent record of a job well
done.
</p>
<p>This worldview forces us into a humble position with
respect to our own application code, which is that it
is little more than a waiter, carrying orders to the
kitchen and food back to the patrons. When we see it
this way, the goal becomes to write code that can
efficiently get data back and forth. A small handful
of library routines can trap SQL injection, validate
types, and ship data off to the database. Another
set can generate HTML, or, can simply pass JSON
data up to those nifty browser client libraries
like <a href="http://www.sencha.com/">ExtJS (now
"Sencha" for some reason)</a>.
</p>
<p>This covers a huge amount of what an application
does, if you do not have much in the way of
business logic.
</p>
<h2>But How Do You Handle Business Logic?</h2>
<p>I have an entire essay on this about half-written,
but in short, it comes down to understanding what
business logic really is. <b>Update: <a href=
"http://database-programmer.blogspot.com/2011/01/business-logic-from-working-definition.html">This post is now available</a></b>
</p>
<p>The tables themselves are the bottom layer of
business logic. The table design itself implements
the foundation for all of the business rules.
This is why it is so important to get it right.
The tables are organized using normalization to
have a place for everything and everything in its
place, and after that the application code mostly
writes itself.
</p>
<p>The application code then falls into two areas:
value-add and no value-add. There is no value-add
when the application simply ships data off to the
user or executes a user request to update the
database. Those kinds of things should be handled
with the lightest possible library that gets the
job done.
</p>
<p>But the value-add stuff is different, where a
user's request requires lookups, possibly computations
and so forth. The problem here is that a naive
analysis of requirements (particularly the
<a href="http://database-programmer.blogspot.com/2008/02/false-patterns-such-as-reverse-foreign.html">transliteration error (scroll down to "The
Customer Does Not Design Tables")</a>)
will tend to generate many cases of perceived need for
value-add where a simpler design can reduce these
cases to no value-add. But even when the database has
been simplified to pristine perfection, there are jobs
that require loops, multiple passes and so forth,
which must be made idempotent and robust, which
will always require some extra coding. But if you know
what you are doing, these always turn out to be the
ERP Allocation example given above: they are a lot more
about the data than the classes.
</p>
<p>Another huge factor is where you come down on the
normalization debate, particularly on the inclusion of
derived values. If you keep derived values out of the database,
which is technically correct from a limited perspective,
then suddenly the value-add code is much more important
because <i>without it your data is incomplete</i>. If
you elect to put derived values into your database then
value-add code is only required <i>when writing to the
database</i>, so huge abstractions meant to handle any
read/write situation are unnecessary. (And of course,
it is extremely important to <a href="http://database-programmer.blogspot.com/2008/11/keeping-denormalized-values-correct.html">
Keep denormalized values correct</a>
).
</p>
<h2>And the Rest of It</h2>
<p>This essay hardly covers the entirety of
making code and data work together. You still have
to synchronize schema changes to code, and I still
think a data dictionary is the best <a href="http://en.wikipedia.org/wiki/Don't_repeat_yourself">D-R-Y</a> way to
do that.
</p>
<p>I hope this essay shows something of why many programmers
are so down on ORM, but much more importantly that there
are coherent philosophies out there that begin with a
different worldview and deliver what we were all doing
before ORM and what we will all still be doing after
ORM: delivering data back and forth between user and
database.
</p>KenDownshttp://www.blogger.com/profile/11117175783163937575noreply@blogger.com38tag:blogger.com,1999:blog-426922399870577072.post-54178605491993543852010-12-11T14:33:00.003-05:002010-12-11T14:37:40.467-05:00The Cost of Round Trips To The Server<p>A database is not much without the applications
that connect to it, and one of the most important
factors that affects the application's performance
is how it retrieves data from queries. In this essay
we are going to see the effect of <i>round trips</i>
on application performance.
</p>
<p>This blog has two tables of contents, the
<a href="http://database-programmer.blogspot.com/2008/09/comprehensive-table-of-contents.html">Complete Table of Contents</a> and the list
of
<a href="http://database-programmer.blogspot.com/2010/11/database-skills.html">Database Skills</a>.
</p>
<h2>Pulling 15,000 Rows</h2>
<p>The test will pull 15,000 rows from a
table. We do it three different ways and see
which is faster and by how much.
</p>
<h2>Getting a Lot of Rows</h2>
<p>The script below creates a table and puts 1 million
rows into it. We want far more rows in the table than
we will actually pull so that we can pull fresh rows
on every pass through the test. It is deliberately crafted to spread
out the adjacent values of the integer primary key.
This is because, inasmuch as we can control what is
going on, we want
every single row to be on a different page, so that
in all tests the cost of retrieving the row is roughly
the same and we are measuring only the effect of our
retrieval methods.
</p>
<p>The script can be run without modification in pgAdmin3,
and with slight mods on MS SQL Server.
</p>
<pre class="sql" name="code">create table test000 (
intpk int primary key
,filler char(40)
)
-- BLOCK 1, first 5000 rows
-- pgAdmin3: run as pgScript
-- All others: modify as required
--
declare @x,@y;
set @x = 1;
set @y = string(40,40,1);
while @x <= 5000 begin
insert into test000 (intpk,filler)
values ((@x-1)*200 +1,'@y');
set @x = @x + 1;
end
-- BLOCK 2, put 5000 rows aside
--
select * into test000_temp from test000
-- BLOCK 3, Insert the 5000 rows 199 more
-- times to get 1million altogether
-- pgAdmin3: run as pgScript
-- All others: modify as required
--
declare @x;
set @x = 1;
while @x <= 199 begin
insert into test000 (intpk,filler)
select intpk+@x,filler from test000_temp;
set @x = @x + 1;
end</pre>
<h2>Test 1: The Naive Code</h2>
<p>The simplest code is a straight loop that
pulls 15,000 consecutive rows by sending
an explicit query for each one.
</p>
<pre class="php" name="code"># Make a database connection
$dbConn = pg_connect("dbname=roundTrips user=postgres");
# Program 1, Individual explicit fetches
$x1 = rand(0,199)*5000 + 1;
$x2 = $x1 + 14999;
echo "\nTest 1, using $x1 to $x2";
$timeBegin = microtime(true);
while ($x1 <= $x2) {
$dbResult = pg_exec("select * from test000 where intpk=$x1");
$row = pg_fetch_array($dbResult);
$x1++;
}
$elapsed = microtime(true)-$timeBegin;
echo "\nTest 1, elapsed time: ".$elapsed;
echo "\n";</pre>
<h2>Test 2: Prepared Statements</h2>
<p>The next command asks the server to prepare a
statement, but it still makes 15,000 round trips,
executing the prepared statement with a new parameter
each time. The code looks like this:
</p>
<pre class="php" name="code"># Make a database connection
$dbConn = pg_connect("dbname=roundTrips user=postgres");
# Program 2, Individual fetches with prepared statements
$x1 = rand(0,199)*5000 + 1;
$x2 = $x1 + 14999;
echo "\nTest 2, using $x1 to $x2";
$timeBegin = microtime(true);
$dbResult = pg_prepare("test000","select * from test000 where intpk=$1");
while ($x1 <= $x2) {
$pqResult = pg_execute("test000",array($x1));
$row = pg_fetch_all($pqResult);
$x1++;
}
$elapsed = microtime(true)-$timeBegin;
echo "\nTest 2, elapsed time: ".$elapsed;
echo "\n";</pre>
<h2>Test 3: A single round trip</h2>
<p>This time we issue a single command to retrieve
15,000 rows, then we pull them all down in one
shot.
</p>
<pre class="php" name="code"># Make a database connection
$dbConn = pg_connect("dbname=roundTrips user=postgres");
# Program 3, One fetch, pull all rows
$timeBegin = microtime(true);
$x1 = rand(0,199)*5000 + 1;
$x2 = $x1 + 14999;
echo "\nTest 3, using $x1 to $x2";
$dbResult = pg_exec(
"select * from test000 where intpk between $x1 and $x2"
);
$allRows = pg_fetch_all($dbResult);
$elapsed = microtime(true)-$timeBegin;
echo "\nTest 3, elapsed time: ".$elapsed;
echo "\n";</pre>
<h2>Results</h2>
<p>I ran this five times in a row, and this is what I got:</p>
<table>
<tr><td style="border-bottom:1px solid black;">Naive 15,000</td>
<td style="border-bottom:1px solid black;">Prepared 15,000</td>
<td style="border-bottom:1px solid black;">One Round Trip</td></tr>
<tr><td>~1.800 seconds</td>
<td>~1.150 seconds</td>
<td>~0.045 seconds</td></tr>
</table>
<p>Compared to the naive example, the <i>set-oriented</i>
fetch of all 15,000 rows in a single shot <b><i>ran
40 times faster</i></b>. This is what set-oriented
code does for an application.
</p>
<p>While the prepared statement option ran faster than
the naive option, the
set-oriented example still ran <b><i>25 times faster
than the repeated prepared statements</i></b>.
</p>
<p>I also re-arranged the order of the tests, and
the results were the same.
</p>
<h2>Does Server or Language Matter?</h2>
<p>So this test was done using PHP against PostgreSQL,
will other servers and client languages get different
results? Given the same hardware, a different client
language or server is going to have a different spread
but the shape will be the same. Fetching all rows in
a single shot beats the living frack out of round trips
inside of loops in any client language against any
server.
</p>
<h2>Putting It Into Use</h2>
<p>The most obvious conclusion is that any query
returning more than 1 row should return all rows
as a set. The advantage is so stark with large
row counts that it is worthwhile making this the
default for our applications, unless we can find
a very good reason not to. So what would the
objections be?
</p>
<p>One objection might go something like, "<font color="#333">Ken, I
see the numbers, but I know my app very well and
we never pull more than 10-20 rows in a pop. I
cannot imagine how it would matter at 10-20 rows,
and I do not want to recode.</font>" This makes sense
so I ran a few more
tests with 20 and 100 rows, and found that, on
my hardware, you need about 100 rows to see a
difference. At 20 rows all three are neck and neck
and at 100 the set is pulling 4 times faster than
the prepared statement and 6 times faster than the
naive statement. So the conclusion is not an
absolute after all, some judgment is in order.
</p>
<p>Another thing to consider is how many simultaneous
reads and writes might be going on at any given
time. If your system is known to have
simultaneous transactions running regularly, then the
complete fetch may be a good idea even if you do some
tests for best-guess row count and the tests are inconclusive.
The reason is that the test is a <i>single user case</i>,
but multiple <i>simultaneous</i> users put a strain on
the database, even when they are not accessing the same
tables. In this case we want the application to
play the "good citizen" and get in and out as quickly
as possible to reduce strain on the server, which will
improve the performance of the entire application, not
just the portions optimized for complete fetches.
</p>
<p>Another objection might be, "<font color="#333">Well, my code needs to
pull from multiple tables, so I cannot really do this.
When we do -PROCESS-X- we go row by row and need to pull
from multiple tables for each row.</font>" In this case
you *definitely* need to go set oriented and pull all
associated quantities down in a query with a JOIN or two.
Consider this, if on your particular hardware the ratio
of naive row-by-row to single fetch is 10, and you must
pull from 2 other tables for each row, that means you are
really running 30 times slower (ratio is 10 x 3 reads)
than you could be.
</p>
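<p>The multi-table case can be sketched quickly. This example uses
Python with in-memory SQLite so it runs anywhere; the tables are
invented stand-ins for whatever -PROCESS-X- actually touches:</p>

```python
# Row-by-row lookups versus one set-oriented JOIN: same answer, but the
# loop pays one round trip per order plus one per customer lookup.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table orders (id integer primary key, cust integer);
    create table customers (id integer primary key, name text);
    insert into customers values (1,'Acme'),(2,'Bell');
    insert into orders values (10,1),(11,2),(12,1);
""")

# Row-by-row: a separate query for every order's customer.
slow = []
for oid, cust in conn.execute("select id, cust from orders order by id"):
    name = conn.execute(
        "select name from customers where id=?", (cust,)).fetchone()[0]
    slow.append((oid, name))

# Set-oriented: a single round trip brings every order with its customer.
fast = conn.execute("""
    select o.id, c.name from orders o
    join customers c on c.id = o.cust order by o.id
""").fetchall()

assert slow == fast  # identical results, a fraction of the round trips
```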
<h2>A Final Note About PHP, Data Structures, and Frameworks</h2>
<p>Back when dinosaurs ruled the Earth and there was
no internet (outside of Universities, etc),
the languages we used had specialized data structures
that were tuned to database use. Compared to those
older systems the newer languages born on the
internet are more or less starving for such a
data structure.
</p>
<p>PHP gets by fairly well because its associative
array can be used as a passive (non object-oriented)
data structure that comes pretty close to what we had
before.
</p>
<p>I bring this up because the choice of a language and
its support for a "fetch all" operation obviously
impacts how well the conclusions of this post can
be implemented. If your mapping tool has an iterator
that absolves you of all knowledge of what is going
on under the hood, it may be worthwhile to see if it
is doing a complete fetch or a row-by-row.
</p>KenDownshttp://www.blogger.com/profile/11117175783163937575noreply@blogger.com67tag:blogger.com,1999:blog-426922399870577072.post-62686394731913248312010-12-08T18:50:00.005-05:002010-12-08T21:23:36.053-05:00Submit Analysis Request to the Database Programmer<p>
I generally do not reveal too many details about systems I design for customers
or employers. This leaves me sometimes in a bind for example material. I either
have to simplify it beyond what I would like, or make something up that I have
not actually put into Production.
</p>
<p>On top of that, one of the key themes of this blog is that table design
is a crucial skill, and if the examples I give do not match what you
are doing, they may be hard to make use of.
</p>
<p>So I would like to invite analysis requests. Go over to the
<a href="http://database-programmer.blogspot.com/p/contact-author.html"
>Contact the Author</a> page and drop me an email and tell me
about the system you are trying to design or optimize.
</p>
<p>There are no rules on the type of system.
</p>
<p>The most interesting mini-projects would be those where advice you
have been given elsewhere (or here for that matter) does not seem
to fit.
</p>
<p>I will do my best to reply, even if I have to say no, so that
nobody is left wondering.
</p>
<p>Remember this blog is one of those hobby/professional things,
good for all of us but nobody is getting paid, so if you are in
a terrible hurry this might not be the best thing.
</p>KenDownshttp://www.blogger.com/profile/11117175783163937575noreply@blogger.com4tag:blogger.com,1999:blog-426922399870577072.post-27718989780724802192010-12-02T21:36:00.002-05:002010-12-02T21:42:04.422-05:00A Case When Table Design is Easy and Predictable<p>Good table design is a great foundation for a successful
application stack. Table design patterns basically resolve
into master tables and transaction tables. When we know
a thing or two about the master tables (or entities if you
prefer), we can infer a great deal about the transactions.
</p>
<p>This blog has two tables of contents, the
<a href="http://database-programmer.blogspot.com/2008/09/comprehensive-table-of-contents.html">Topical Table of Contents</a> and the list
of
<a href="http://database-programmer.blogspot.com/2010/11/database-skills.html">Database Skills</a>.
</p>
<h2>A Time Billing System</h2>
<p>Imagine we have been asked to recode the company's
time-billing system. Because this is for the company
we work for, we have some inside knowledge about how
things work. We know that:
</p>
<ul><li>There are, of course, customers.
<li>....and employees who record time
<li>Each hour we record goes against a Work Order
<li>There are different kinds of jobs, like
project management, programming, programming
management, and others.
</ul>
<p>Knowing only this, is it
possible to anticipate what the system will look like?
A safe answer is "no", on the claim that we will
undoubtedly learn more, but this safe answer happens
to be wrong. We can in fact anticipate
the overall shape of the system, and new information
will shift details, but it will not change the shape.
</p>
<p>We can anticipate the nature of the transactions
if we determine the <i>upper bound of complexity</i>
and the <i>combinatorial completeness</i> of the
system.
</p>
<h2>The Upper Bound of Complexity</h2>
<p>We can safely assume that the big number to get
right is going to be the billing rate. Our employer
assumes we will get everything else right, but the
billing rate is going to have them chewing their fingernails
until they know we understand it and have coded it
correctly.
</p>
<p>The cool thing is that we already have enough information
to establish an <i>upper bound on the complexity</i> of
the system by looking at the master tables, where a master table
is generally one that lists details about real things
like people, places, things, or activities.
So far we know (or think we know) about three master tables:
</p>
<ul><li>Customers
<li>Employees
<li>Services
</ul>
<p>Now we define the upper bound of complexity as:</p>
<p style="margin: 10px; border: 2px solid black;
padding: 8px;
background-color: lightgreen;
font-weight: bolder">The upper bound of complexity
occurs when the billing rate is determined by all three
master entities.
</p>
<p>In plain English, calculating a billing rate can be as
complicated as looking up a rate specific to a customer
for a service for an employee <i>but cannot be more
complex than that</i> because there are no other entities
with which to work.
</p>
<h2>Combinatorially Complete</h2>
<p>We can also anticipate all possible calculations for
the billing rate by working through the complete set
of combinations of master entities. This would look
like the list below. Note that we are not trying to
figure out right now which of these is likely to occur,
we just want to get them listed out:
</p>
<ul><li>Each service has a default rate
<li>Each customer has a negotiated rate
<li>Each employee bills out at a default rate
<li>The combination customer-service may have a rate
<li>The combination customer-employee may have a rate
<li>The combination customer-service-employee may have
a rate (this is the upper bound of complexity, all
three master entities determine the rate).
</ul>
<p>Unless we live in a super-simple world where only the first
item in the list is present, we will end up dealing with
several if not all of the combinations listed above.
</p>
<p>Each of these combinations then becomes a table, and
we know the billing rate will be determined by a
<a href="http://database-programmer.blogspot.com/2008/04/advanced-table-design-resolutions.html">resolution</a>.
</p>
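<p>To make the resolution idea concrete, here is a hypothetical sketch
using Python's built-in sqlite3 module. All table and column names are
invented, and only two of the combination tables are shown; the point
is the COALESCE cascade from the most specific rate down to the
service default:</p>

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
  -- Invented schema: a default rate per service, plus one
  -- customer-service combination table
  CREATE TABLE servicerates (service TEXT PRIMARY KEY, rate REAL);
  CREATE TABLE custrates    (customer TEXT, service TEXT, rate REAL);
  CREATE TABLE timeentries  (customer TEXT, service TEXT,
                             employee TEXT, hours REAL);
  INSERT INTO servicerates VALUES ('audit', 100.0);
  INSERT INTO custrates    VALUES ('acme', 'audit', 80.0);
  INSERT INTO timeentries  VALUES ('acme',  'audit', 'alice', 2.0),
                                  ('bigco', 'audit', 'alice', 1.0);
""")
# Resolve each time entry to the most specific rate available
rows = db.execute("""
  SELECT t.customer, t.hours,
         COALESCE(
           (SELECT c.rate FROM custrates c
             WHERE c.customer = t.customer AND c.service = t.service),
           (SELECT s.rate FROM servicerates s
             WHERE s.service = t.service)
         ) AS rate
    FROM timeentries t
   ORDER BY t.customer
""").fetchall()
# acme gets its negotiated 80.0; bigco falls back to the default 100.0
```

<p>Each additional combination table becomes one more branch in the
COALESCE, checked before the more general fallbacks.</p>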
<h2>New Information</h2>
<p>Now comes the big day and we interview with somebody
we'll call "The Explainer" who is going to officially
explain the billing system. Can he break what we
already know? No. At most he can:
</p>
<ul><li>Make us aware of new master entities, perhaps
there are "projects" and "contracts" that get their
own billing arrangements.
<li>Dispel our notions about some of the combinations
by saying, "Oh we never give a customer a default
rate, the default rates come out of the services."
</ul>
<h2>Going in Cold</h2>
<p>What about the case where we know absolutely nothing
about an assignment when we go in to begin the interviews?
We can do a good job of thinking on our feet if we draw
"The Explainer" towards the master entities. As we gain
confidence that we know what the master entities are,
we can ask questions to probe Combinatorial Completeness
and the Upper Bound of Complexity.</p>
<p>One caveat: This method works for transactions between
master entities. When "The Explainer" starts describing
something that cannot be recognized as an interaction
between master entities, do not try to stuff the problem
into this box; it may not fit.
</p>
<h2>What About the Application?</h2>
<p>At this point, we can also anticipate a lot of
what the application will look like. We will need
maintenance screens for all of the master entities,
and a really slick UI will allow for very easy editing
of those various cross-reference combination tables.
As long as that much is done, we are almost finished,
but not yet.
</p>
<p>There will be some billing process that pulls
the time entries, finds the correct billing rate for
each one, and permanently records the invoices. If
we use a <a href="http://database-programmer.blogspot.com/2008/04/advanced-table-design-resolutions.html">resolution</a> this task is
child's play to code, debug, and maintain.
</p>
<p>Then of course there is the presentation, the actual
bill. Depending on the company, these may be delivered
as hardcopy or in email. That will of course have to
be coded up.
</p>
<h2>Conclusion</h2>
<p>There are two conclusions. First, as originally stated,
many transactions can be anticipated when you know what
the master entities are.
</p>
<p>But secondly, and every bit as important, once the table
design is sound, the application pretty much writes itself.
On a personal note, this is probably why I do not find
application coding as exciting as I once did. Once I
realized that the real challenge and satisfaction were in
working out the tables, the coding of the app became a
bit of a drudge; it requires no judgment as far as
business rules are concerned.
</p>KenDownshttp://www.blogger.com/profile/11117175783163937575noreply@blogger.com3tag:blogger.com,1999:blog-426922399870577072.post-24016430540822385882010-11-30T22:29:00.006-05:002010-12-01T21:43:22.904-05:00The Really Cool NTILE() Window Function<p>If you regularly code queries and have never been
introduced to the <i>windowing functions</i>, then
you are in for a treat. I've been meaning to write
about these for over a year, and now it's time to get
down to it.
</p>
<h2>Support in Major Servers</h2>
<p>SQL Server calls these functions
<a href="http://msdn.microsoft.com/en-us/library/ms189798.aspx"
>Ranking Functions</a>.</p>
<p>PostgreSQL supports a wider range of these functions
than MS SQL Server, having added them in version
8.4, and calls them
<a href="http://www.postgresql.org/docs/8.4/interactive/functions-window.html"
>Window Functions</a>.</p>
<p>Oracle's support is broader (by a reading of the docs)
than SQL Server or PostgreSQL, and they call them
<a href="http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/functions001.htm#i81407"
>Analytic Functions</a>.</p>
<p>I try to stay away from MySQL, but I did a quick Google on
all three terms and came up with a few forum posts asking
when and if they will be supported.
</p>
<h2>The NTILE() Function</h2>
<p>In this post we are going to look at NTILE, a cool function
that allows you to segment query results into groups and
put numbers onto them. The name is easy to remember because
it can create any -tile, a percentile, a decile, or anything
else. In short, an <i><b>n-</b></i>tile. But it is much easier to
understand with an example, so let's go right to it.
</p>
<h2>Finding percentiles</h2>
<p>Consider a table of completed sales, perhaps on an eCommerce site.
The Sales Manager would like them divided up into quartiles,
four equally divided groups, and she wants the average and
maximum sale in each quartile. Let's say the company is not
exactly hopping, and there are only twelve sales, which is good
because we can list them all for the example. If we already
had the quartiles provided then the query would be easy, so if
we were lucky enough to be starting with this:
</p>
<pre>
CUSTTYPE | AMOUNT | QUARTILE
----------+---------+----------
RETAIL | 78.00 | <font color="blue">1</font>
RETAIL | 234.00 | <font color="blue">1</font>
DEALER | 249.00 | <font color="blue">1</font>
DEALER | 278.00 | <font color="red">2</font>
RETAIL | 392.00 | <font color="red">2</font>
RETAIL | 498.00 | <font color="red">2</font>
DEALER | 500.00 | <font color="purple">3</font>
RETAIL | 738.00 | <font color="purple">3</font>
DEALER | 1250.00 | <font color="purple">3</font>
RETAIL | 2029.00 | <font color="green">4</font>
RETAIL | 2393.00 | <font color="green">4</font>
RETAIL | 3933.00 | <font color="green">4</font>
</pre>
<p>The query would be child's play <i>if we already
had the quartile</i>:</p>
<pre>
Select quartile
, avg(amount) as avgAmount
, max(amount) as maxAmount
FROM ORDERS
GROUP BY quartile
ORDER BY quartile
</pre>
<h2>The Problem is We Do Not Have Quartile</h2>
<p>The problem of course is that we do not usually
have handy columns like QUARTILE provided, but
we can generate the QUARTILE column during the
query by using NTILE.
</p>
<pre>
Select quartile
, avg(amount) as avgAmount
, max(amount) as maxAmount
FROM (
<font color="green">-- The subquery is necessary
-- to process all rows and add the quartile column</font>
SELECT amount
, ntile(4) over (order by amount) as quartile
FROM ORDERS
) x
GROUP BY quartile
ORDER BY quartile
</pre>
<p>This query will give us what the Sales Manager wants.</p>
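<p>If you want to try this without setting up a server, the query runs
as-is on SQLite (version 3.25 or newer, which added window functions),
for example through Python's built-in sqlite3 module. This sketch
reproduces the twelve sales from the table above:</p>

```python
import sqlite3  # the bundled SQLite must be 3.25+ for window functions

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (custtype TEXT, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)",
               [("RETAIL", 78), ("RETAIL", 234), ("DEALER", 249),
                ("DEALER", 278), ("RETAIL", 392), ("RETAIL", 498),
                ("DEALER", 500), ("RETAIL", 738), ("DEALER", 1250),
                ("RETAIL", 2029), ("RETAIL", 2393), ("RETAIL", 3933)])

# Same query as in the post: the subquery adds the quartile column
rows = db.execute("""
    SELECT quartile, avg(amount) AS avgAmount, max(amount) AS maxAmount
      FROM (SELECT amount,
                   ntile(4) OVER (ORDER BY amount) AS quartile
              FROM orders) x
     GROUP BY quartile
     ORDER BY quartile
""").fetchall()
# Four rows come back, one per quartile; quartile 4 averages 2785.00
```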
<h2>Dissecting the Function and The OVER Clause</h2>
<p>The NTILE() function takes a single argument, which tells
the server how many groups to divide the rows into. If
the rows do not divide evenly, the extra rows go into
the groups at the front. So when the division is exact all of
your groups have the same count of rows, and when it is
not, the later groups will be one row short.
</p>
<p>If you pass 100 to NTILE(), you get a percentile. If you
pass 10, you get a decile, and so forth.
</p>
<p>The magic is in the OVER clause. It supports two sub-clauses,
and the example shows one, the ORDER BY. Quite simply, the
ORDER BY clause tells the server how to line up the rows when
assigning the NTILE values. The clause is very flexible, and has
nothing to do with your query's overall ORDER BY clause.
</p>
<h2>The Second Clause: PARTITION</h2>
<p>Now we will pretend the Sales Manager is not satisfied, and
wants separate numbers for the two Customer Types. We could
do this if the NTILE() function would create two sets
of quartiles, one for each Customer Type, like so:
</p>
<pre>
CUSTTYPE | AMOUNT | QUARTILE
----------+---------+----------
DEALER | 249.00 | <font color="blue">1</font>
DEALER | 278.00 | <font color="red">2</font>
DEALER | 500.00 | <font color="purple">3</font>
DEALER | 1250.00 | <font color="green">4</font>
RETAIL | 78.00 | <font color="blue">1</font>
RETAIL | 234.00 | <font color="blue">1</font>
RETAIL | 392.00 | <font color="red">2</font>
RETAIL | 498.00 | <font color="red">2</font>
RETAIL | 738.00 | <font color="purple">3</font>
RETAIL | 2029.00 | <font color="purple">3</font>
RETAIL | 2393.00 | <font color="green">4</font>
RETAIL | 3933.00 | <font color="green">4</font>
</pre>
<p>We can do this by using the PARTITION BY clause,
which tells the server to break the rows into
groups and apply the NTILE() numbering separately
within each group. The new query would be this:
</p>
<pre>
Select custtype
, quartile
, avg(amount) as avgAmount
, max(amount) as maxAmount
FROM (
<font color="green">-- The subquery is necessary
-- to process all rows and add the quartile column</font>
SELECT amount
, ntile(4) over (partition by custtype
order by amount) as quartile
FROM ORDERS
) x
GROUP BY custtype,quartile
ORDER BY custtype,quartile
</pre>
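<p>Continuing the SQLite sketch (again requiring SQLite 3.25+), we can
check that PARTITION BY restarts the quartile numbering for each
customer type:</p>

```python
import sqlite3  # needs SQLite 3.25+ for window functions

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (custtype TEXT, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)",
               [("DEALER", 249), ("DEALER", 278), ("DEALER", 500),
                ("DEALER", 1250), ("RETAIL", 78), ("RETAIL", 234),
                ("RETAIL", 392), ("RETAIL", 498), ("RETAIL", 738),
                ("RETAIL", 2029), ("RETAIL", 2393), ("RETAIL", 3933)])

# The quartile numbering restarts at 1 within each custtype partition
rows = db.execute("""
    SELECT custtype, amount,
           ntile(4) OVER (PARTITION BY custtype
                          ORDER BY amount) AS quartile
      FROM orders
     ORDER BY custtype, amount
""").fetchall()
# DEALER's four rows land in quartiles 1-4; RETAIL's eight rows pair up
```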
<h2>Bonus Points: The Median</h2>
<p>Now once again the Sales Manager, who is never satisfied,
comes down and says that the average is no good, she
needs the max and the <i>median</i> sale value within each quartile.
To keep it simple, she does not need this broken out
by customer type, it can be applied to the entire set.
</p>
<p>This is a case where we can use NTILE() twice. The first
time we will break all sales up into four groups, to get
the quartiles, and then we will break up each quartile into
two groups to get the median. The code looks like this:
</p>
<pre>
Select quartile
, max(case when bitile=1 then amount else 0 end) as medAmount
, max(amount) as maxAmount
FROM (
<font color="green">-- The second pass adds the
-- 2-tile value we will use to find medians</font>
SELECT quartile
, amount
, ntile(2) over (partition by quartile
order by amount) as bitile
FROM (
<font color="green">-- The subquery is necessary
-- to process all rows and add the quartile column</font>
SELECT amount
, ntile(4) over (order by amount) as quartile
FROM ORDERS
) x1
) x2
GROUP BY quartile
ORDER BY quartile
</pre>
<p>The magic here is that we know we've divided the data
evenly into four sets, so the median will be the maximum
value half way through each set. In other words, it will be the
maximum value where bitile=1 within each quartile. (For a
quartile with an even number of rows, this yields the lower
of the two middle values.)
</p>
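<p>Running the double-NTILE query through the same SQLite sketch
confirms the medians (234, 392, 738, and 2393 for the four quartiles):</p>

```python
import sqlite3  # needs SQLite 3.25+ for window functions

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (amount REAL)")
db.executemany("INSERT INTO orders VALUES (?)",
               [(a,) for a in (78, 234, 249, 278, 392, 498,
                               500, 738, 1250, 2029, 2393, 3933)])

# First pass assigns quartiles, second pass splits each quartile in two
rows = db.execute("""
    SELECT quartile,
           max(CASE WHEN bitile = 1 THEN amount ELSE 0 END) AS medAmount,
           max(amount) AS maxAmount
      FROM (SELECT quartile, amount,
                   ntile(2) OVER (PARTITION BY quartile
                                  ORDER BY amount) AS bitile
              FROM (SELECT amount,
                           ntile(4) OVER (ORDER BY amount) AS quartile
                      FROM orders) x1) x2
     GROUP BY quartile
     ORDER BY quartile
""").fetchall()
```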
<h2>One More Note About Oracle</h2>
<p>Once you get down the basics of the OVER clause, Oracle
looks really good, because they support the clause over
the largest range of functions, at least going by the
respective doc pages for each platform.
</p>KenDownshttp://www.blogger.com/profile/11117175783163937575noreply@blogger.com24tag:blogger.com,1999:blog-426922399870577072.post-87902057043446044082010-11-29T22:19:00.005-05:002010-12-02T20:34:00.983-05:00Loops Without Cursors<h2>Looping Without Cursors</h2>
<p>Sometimes you need to process a table row-by-row,
and the established approach is to use <i>cursors</i>,
which are verbose, slow, and painful to code
and use.
</p>
<h2>The Cursor Example</h2>
<p>Here is the basic minimum syntax required to
loop through a table and get something done.
The SQL flavor is MS SQL Server, but it's not
much better in any other flavor.
</p>
<pre>
<font color="red">-- I coded this off the top of my head, there
-- may be a minor syntax error or two</font>
<font color="green">-- Most of this is pseudo-code, but take
-- note that it is ordered on column1</font>
declare someCursorName cursor for
select column1, column2, column3
from anyTable
ORDER BY column1
<font color="green">-- Have to do this now</font>
open someCursorName
<font color="green">-- Now you need to declare some variables
-- For the example I'm just making everything int</font>
declare @column1 int
, @column2 int
, @column3 int
<font color="green">-- Gosh, we're actually about to start the loop! Finally!</font>
fetch next from someCursorName into @column1,@column2,@column3
while @@fetch_status = 0 begin
<font color="green">-- If you still remember what you actually wanted
-- to do inside the loop, code it here:</font>
<font color="green">-- Repeat this line from the top here again:</font>
fetch next from someCursorName into @column1,@column2,@column3
end
<font color="green">-- Not done yet, these two lines are crucial</font>
close someCursorName
deallocate someCursorName
</pre>
<p>Call me petty, but what I hate about that code is that I
have to refer to specific columns of interest 3 times (not
counting the declarations). You refer to them in the
cursor declaration and in the two FETCH commands. With
a little clever coding, we can vastly simplify this
and do it only once.
</p>
<h2>Using An Ordered Column</h2>
<p>We can execute the same loop without the cursor if
one of the columns is ordered and unique. Let us say
that column1 is the primary key, and is an auto-incremented
integer. So it is ordered and unique. The code now
collapses down to:
</p>
<pre><font color="red">-- I coded this off the top of my head, there
-- may be a minor syntax error or two</font>
<font color="green">-- We can't get around declaring the vars, so do that</font>
declare @column1 int
, @column2 int
, @column3 int
<font color="green">-- If you know a safe value for initialization, you
-- can use the code below. If this is not 100%
-- safe, you must query for the value or it must
-- be supplied from some other source</font>
set @column1 = -1
<font color="green">-- BONUS POINTS: Can this become an infinite loop?</font>
while 1 = 1 begin
<font color="green">-- Now we code the query and exit condition</font>
select TOP 1
@column1 = column1
, @column2 = column2
, @column3 = column3
from anyTable
<font color="red">WHERE column1 > @column1 -- this is what advances the loop</font>
ORDER BY column1
if @@rowcount = 0 begin
break
end
<font color="green"> -- Put the actions here </font>
end
</pre>
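<p>The same trick works from client code in any language, since it is
nothing more than "fetch the first row past the last key I saw." Here
is a sketch using Python and SQLite; the table and column names are
stand-ins:</p>

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE anytable (column1 INTEGER PRIMARY KEY,"
           " column2 TEXT)")
db.executemany("INSERT INTO anytable VALUES (?, ?)",
               [(1, "a"), (5, "b"), (9, "c")])

processed = []
last_key = -1  # safe initializer: below any real key value
while True:
    # Fetch the first row past the last key we saw
    row = db.execute(
        "SELECT column1, column2 FROM anytable"
        " WHERE column1 > ? ORDER BY column1 LIMIT 1",
        (last_key,)).fetchone()
    if row is None:
        break                 # plays the role of the @@rowcount check
    last_key = row[0]         # this is what advances the loop
    processed.append(row[1])  # put the real actions here
```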
<h2>Final Notes</h2>
<p>The only requirement for this approach is
that you have a unique ordered column.
This usually means a unique key or primary
key. If "column1" is not unique, the loop
will skip all but the first value in each
group.
</p>
<p>Also, it is very nice if you know a safe
value to use as an initializer. Without that,
you must query for the minimum value that matches
the condition and then decrement it by one.
</p>
<p>Finally, can this loop become infinite? No.
Well, in the extremely unlikely situation
that rows are being added to the base table faster
than you are processing them, it could
go on for a very long time. But if that were
happening I'd say there was a separate problem to
look at.
</p>
<p>It should probably go without saying, but if
the particular loop is going to happen very
often, the table should be indexed on your
unique ordered column. If it is a primary key
or you already have a unique constraint it is not
necessary to create an index explicitly because
there will be one as part of the key or constraint.
</p>KenDownshttp://www.blogger.com/profile/11117175783163937575noreply@blogger.com12tag:blogger.com,1999:blog-426922399870577072.post-35492696026742163932010-11-27T13:43:00.003-05:002010-11-28T22:05:24.263-05:00Revisiting Normalization and Denormalization<p>In this blog I have written many articles on Normalization
and Denormalization, but I have never put all of the arguments
together in one place, so that is what I would like to do today.
</p>
<p>There are links to related essays on normalization and denormalization at the <a href="#bottom">bottom of this post</a>.</p>
<p>This blog has two tables of contents, the
<a href="http://database-programmer.blogspot.com/2008/09/comprehensive-table-of-contents.html">Topical Table of Contents</a> and the list
of
<a href="http://database-programmer.blogspot.com/2010/11/database-skills.html">Database Skills</a>.
</p>
<h2>The What and Why of Normalization</h2>
<p>Normalization is the process of designing tables so that each fact is
stored in exactly one place. A "fact" in this case is any detail that
we have to keep track of, such as a product's description, a product's
price, an employee's social security number, and so forth.
</p>
<p>The process is all about figuring out what tables you need and what
columns each table will have. If we are talking about an employee's
social security number, then we can guess right from the start that
will have a table of EMPLOYEES, and that one of the columns will be
SSN. As we get more details, we add more tables and columns.
</p>
<p>The advantage of normalization comes when your application writes
data to the database. In the simplest terms, when the application
needs to store some fact, it only has to go to one place to do it.
Writing this kind of code is very easy. Easy to write, easy to debug,
easy to maintain and improve.
</p>
<p>When the database is not normalized, you end up spending more time
writing more complicated application code that is harder to debug.
The chances of bad data in your production database go way up.
When a shop first experiences bad data in production, it starts to
become tempting to "lock down" access to the database, either by
forcing updates to go through stored procedures or by trying to
enforce access to certain tables through certain codepaths. Both
of these strategies, stored procedures and code paths, are
actually the same strategy implemented in different tiers; they
both try to prevent bugs by routing access through some bit of
code that "knows what to do." But if the database is normalized,
you do not need any magic code that "knows what to do."
</p>
<p>So that, in brief, is what normalization is and why we do it.
Let's move on now to denormalization.
</p>
<h2>Denormalization is Harder to Talk About</h2>
<p>Normalization is easy to explain because there is a clearly
stated end-goal: correct data. Moreover, there are well-defined
methods for reaching the goal, which we call the normal forms,
<a href="http://database-programmer.blogspot.com/2007/12/database-skills-first-normal-form.html">First Normal Form</a>, <a href="http://database-programmer.blogspot.com/2007/12/database-skills-second-normal-form.html">Second Normal Form</a>,
and higher forms. By contrast, denormalization is much harder
to talk about because there is no agreed-upon end goal. To make
matters worse, denormalization violates the original theory of
Relational Databases, so you still have plenty of people screaming
not to do it at all, making things even more confusing. What we have
now in our industry is different
shops denormalizing in different ways for different reasons.
</p>
<p>The arguments that I have heard in my career boil down to two
basic groups. The first set of arguments centers around
calculated or derived values, and the second set centers
around programmer convenience.
</p>
<h2>Arguments for Derived Values</h2>
<p>My own experience comes down heavily in favor of denormalizing
by storing derived values directly into the tables, with the
extremely significant caveat that you must have a way to ensure
that they are always correct. In this paradigm you maintain
strict normalization for facts supplied from the outside,
and then layer on additional facts that are calculated during
write operations and saved permanently.
</p>
<p>Here is a very simple example.
A strictly normalized database happens to be missing data
that many programmers would automatically assume should be
stored. Believe it or not, a simple value in a shopping
cart like EXTENDED_PRICE is forbidden by 3rd normal form
because it is a <i>non-key dependency</i>, or, in plain
English, since it can be derived from other values (QTY * PRICE),
then it is redundant, and we no longer have each fact stored
in exactly one place. The value of EXTENDED_PRICE is only
correct if it always equals QTY * PRICE, and so there is now
a "fact" that is spread across three locations.
If you store EXTENDED_PRICE, but do not have a way to ensure
that it will always 100% of the time equal QTY * PRICE,
then you will get bad data.
</p>
<p>So, given the risk of bad data, what is to be gained by
putting EXTENDED_PRICE into the cart? The answer is that
it adds value to the database and actually simplifies
application code. To see why, imagine a simple eCommerce
shopping cart that does not store any derived values.
Every single display of the cart to the user must go all
over the place to gather lots of details and recalculate
everything. This means re-calculating not just the
EXTENDED_PRICE, but adding in item level discounts, taking
account of possible tax exemptions for different items,
rolling
the totals to the cart, adding in tax, shipping, perhaps
a customer discount, a coupon, and who knows what else.
All of this just to display the cart, every time, no matter
what the purpose.
</p>
<p>This situation leads to three problems. A pitifully slow
application (too many disk reads and lots of cycles calculating
the values), maddening bugs when an application update
has subtle changes to the calculations so the customer's
order no longer displays the same numbers as it did yesterday,
and the frustrating requirement that the simplest of reports
must route through application code to calculate these values
instead of simply reading them off the disk, which leads to
reporting systems that are orders of magnitude slower than they
could be and horribly more complicated than they need to be
because they can't just read straight from the tables.
</p>
<p>Now let's look at how that same shopping cart would be used
if all of those calculated values were generated and saved
when the order is written. Building on your foundation of
normalized values (price, qty), you need only one body of code
that has to perform calculations. This magic body of code
takes the user-supplied values, adds in the calculations,
and commits the changes. <i>All other subsequent operations
need only to read and display the data, making them faster,
simpler, and more robust.</i>
</p>
<p>So the obvious question is how to make sure the derived
values are correct. If they are correct, we gain the
benefits with no down side. If there is the smallest chance
of bad data, we will quickly pay back any benefit we gained
by chasing down the mistakes.
</p>
<p>From a technical standpoint, what we really need is some
technology that will make sure the calculations cannot
be <i>subverted</i>, it cannot be possible for a stray
bit of program code or SQL Statement to
put the wrong value in for EXTENDED_PRICE. There are a
few generally accepted ways to do this:
</p>
<ul><li>Require all writes to go through a certain codepath.
The only PRO here is that you keep the logic in the
application code, and since most shops have more programmers
than database people, this makes sense. The only CON is that
it never works. One programmer working alone can maintain
discipline, but a team cannot. All it takes is one programmer
who did not know about the required codepath to screw it all
up. Also, it makes your system inflexible, as it is no longer
safe to write to the database except through a single application.
<li>Require all writes to go through stored procedures.
This is nominally better than the codepath solution because it
is not subvertible, and you can allow different side apps and
utilities to safely write to the database. But it makes a lot
of work and tends to be very inflexible.
<li>Putting triggers onto tables that perform the calculations
and throw errors if a SQL statement attempts to explicitly
write to a derived column. This makes the values completely
non-subvertible, ensures they will always be correct, and allows
access from any application or utility. The downside is that
the triggers cannot be coded by hand except at extreme cost, and
so must be generated from a data dictionary, which is fairly easy
to do but tends to involve extreme psychological barriers. In
these days of ORM many programmers mistakenly believe their
class files define reality, but this is not true. Reality is
defined by the users who one way or another create the paychecks, and
by the database, which is the permanent record of facts. But
a programmer who thinks his classes define reality simply cannot
see this and will reject the trigger solution for any number of
invalid reasons.
</ul>
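<p>As a sketch of the trigger approach, here is a hypothetical cart
table in SQLite, driven from Python (trigger syntax varies by engine,
and all names here are invented). One trigger fills in EXTENDED_PRICE
on insert, and a second repairs any update that leaves it wrong,
including a deliberate attempt to subvert it:</p>

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
  CREATE TABLE cart (sku TEXT, qty REAL, price REAL,
                     extended_price REAL);

  -- Compute the derived column on every insert
  CREATE TRIGGER cart_ins AFTER INSERT ON cart
  BEGIN
    UPDATE cart SET extended_price = NEW.qty * NEW.price
     WHERE rowid = NEW.rowid;
  END;

  -- Repair any update that leaves the derived column wrong,
  -- whether qty/price changed or it was written to directly
  CREATE TRIGGER cart_upd AFTER UPDATE ON cart
  WHEN NEW.extended_price IS NOT NEW.qty * NEW.price
  BEGIN
    UPDATE cart SET extended_price = NEW.qty * NEW.price
     WHERE rowid = NEW.rowid;
  END;
""")
db.execute("INSERT INTO cart (sku, qty, price) VALUES ('A1', 3, 5.0)")
# A stray statement tries to subvert the derived value...
db.execute("UPDATE cart SET extended_price = 999 WHERE sku = 'A1'")
# ...but the trigger puts it back to qty * price
ext = db.execute(
    "SELECT extended_price FROM cart WHERE sku = 'A1'").fetchone()[0]
```

<p>A production version would be generated from the data dictionary
rather than written by hand, as described above.</p>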
<p>So denormalizing by putting in derived values can make a database
much more valuable, but it does require a clear systematic
approach to generating the derived values. There is no technical
problem associated with ensuring the values are correct because
of course the application has to do that somehow somewhere anyway;
the real barriers tend to be psychological and political.
</p>
<h2>Arguments For Programmer Convenience</h2>
<p>The second set of arguments for denormalization tend to be
rather weak, and come down to something like this (you have to
picture the programmer whining like a child when he
says this), "I don't like
my data scattered around so many tables, can't we play some
other game instead?"
</p>
<p>Many programmers, when they first learn about normalization
and build a normalized database,
discover that the data they need to build a screen is "scattered"
about in many tables, and that it is tedious and troublesome to
get it all together for presentation to the user. A simple
example might be a contacts list. The main table is CONTACTS,
and it contains not much more than first and last name. A second
table is a list of PHONES for each contact, and a third
table is a list of various mailing addresses. A fourth table
of EMAILS stores their email addresses. This makes four tables
just to store a simple contact! We programmers look at this and
something inside of us says, "That's just way too complicated,
can't I do something else instead?"
</p>
<p>This is a case of programmer convenience clashing with correctness
of data. Nobody argues (at least not that I've heard) that they
do not want the data to be correct, they just wonder if it is possible
to simplify the tables so that they do not have to go out to so
many places to get what they need.
</p>
<p>In this case, programmers argue that denormalization will make
for simpler code if they <i>deliberately skip one or more steps
in the normalizing process.</i> (Technically I like to call the
result a "non-normalized" database instead of denormalized, but
most people call it denormalized, so we will go with that.)
</p>
<p>The argument goes something like this: I know for a fact that
nobody in the contacts list will have more than 3 emails, so
I'm going to skip the EMAILS table and just put columns EMAIL1,
EMAIL2, and EMAIL3 into the main CONTACTS table. In this case,
the programmer has decided to skip 1st Normal Form and put a
<i>repeating group</i> into the CONTACTS table. This he argues
makes for simpler database retrieval and easier coding.
</p>
<p>The result is painfully predictable. The simplification the
programmer sought at one stage becomes a raft of complications
later on. Here is an example that will appear trivial but really
gets to the heart of the matter. How do you count how many
emails a user has? A simple SELECT COUNT(*)...GROUP BY CONTACT
that would have worked before now
requires more complicated SQL. But isn't this trivial? Is it
really that bad? Well, if all you are coding is a CONTACTS
list probably not, but if you are doing a real application with
hundreds of tables and this "convenience" has been put out there
in dozens of cases,
then it becomes a detail that programmers need to know on a
table-by-table basis, it is an exception to how things ought
to be that has to be accounted for by anybody who touches the
table. In any shop with more than 5 programmers, whatever
convenience the original programmer gained is lost quickly
in the need to document and communicate these exceptions.
And this is only a single trivial example.
</p>
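<p>To see the difference in the code, compare the two counts below,
sketched in Python with SQLite (invented names). The normalized count
is a plain aggregate; the denormalized one must know about every slot:</p>

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
  -- Denormalized: a repeating group of email slots
  CREATE TABLE contacts (contact TEXT PRIMARY KEY,
                         email1 TEXT, email2 TEXT, email3 TEXT);
  -- Normalized: one row per email
  CREATE TABLE emails   (contact TEXT, email TEXT);
  INSERT INTO contacts VALUES ('ann', 'a@x.com', 'a@y.com', NULL);
  INSERT INTO emails   VALUES ('ann', 'a@x.com'), ('ann', 'a@y.com');
""")
# Normalized: a plain aggregate, works for any number of emails
norm = db.execute(
    "SELECT COUNT(*) FROM emails WHERE contact = 'ann'").fetchone()[0]
# Denormalized: every such query hard-codes the three slots
denorm = db.execute("""
    SELECT (email1 IS NOT NULL) + (email2 IS NOT NULL)
         + (email3 IS NOT NULL)
      FROM contacts WHERE contact = 'ann'""").fetchone()[0]
```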
<p>Other examples come when it turns out you need more than
three slots for email. In the normalized case this never comes
up. Any contact can have any number of emails, and the code to
display the emails is running through a loop, so it does not
need to be modified for the case of 1 email, 2 emails, etc.
But in the "convenient" denormalized case you now must
modify the table structure <i>and the code that displays the contacts,</i>
making it quite inconvenient.
</p>
<p>Then you have the case of how to define unused slots. If the
user has only one email, do we make EMAIL2 and EMAIL3 empty
or NULL? This may also seem like a silly point until you've sat
through a flamewar at the whiteboard and discovered just how
passionate some people are about NULL values. Avoiding that argument
can save your shop a lot of wasted time.
</p>
<p>In short, programmer convenience should never lead to a shortcut
in <i>skipping normalization steps</i> because it introduces far
more complications than it can ever pay for.
</p>
<a name="bottom">
<h2>Related Essays</h2>
</a>
<p>This blog has two tables of contents, the
<a href="http://database-programmer.blogspot.com/2008/09/comprehensive-table-of-contents.html">Topical Table of Contents</a> and the list
of
<a href="http://database-programmer.blogspot.com/2010/11/database-skills.html">Database Skills</a>.
</p>
<p>The normalization essays on this blog are:</p>
<ul><li><i><a href="http://database-programmer.blogspot.com/2010/11/revisiting-normalization-and.html"
>Revisiting Normalization and Denormalization (this essay)</a></i>.
<li><a href="http://database-programmer.blogspot.com/2008/07/database-performance-pay-me-now-or-pay.html">Pay Me Now Or Pay Me Later</a>
<li><a href="http://database-programmer.blogspot.com/2008/10/argument-for-normalization.html">The Argument for Normalization</a>
<li><a href="http://database-programmer.blogspot.com/2007/12/database-skills-first-normal-form.html">First Normal Form</a>
<li><a href="http://database-programmer.blogspot.com/2007/12/database-skills-second-normal-form.html">Second Normal Form</a>
<li><a href="http://database-programmer.blogspot.com/2008/01/database-skills-third-normal-form-and.html">Third Normal Form and Calculated Values</a>
<li><a href="http://database-programmer.blogspot.com/2008/10/argument-for-denormalization.html">The Argument for Denormalization</a>
<li><a href="http://database-programmer.blogspot.com/2008/04/denormalization-patterns.html">Denormalization Patterns</a>
<li><a href="http://database-programmer.blogspot.com/2008/11/keeping-denormalized-values-correct.html">Keeping Denormalized Values Correct</a>
<li><a href="http://database-programmer.blogspot.com/2008/05/database-triggers-encapsulation-and.html">Triggers, Encapsulation and Composition</a>
<li><a href="http://database-programmer.blogspot.com/2009/01/data-dictionary-and-calculations-part-1.html">The Data Dictionary and Calculations, Part 1</a>
<li><a href="http://database-programmer.blogspot.com/2009/01/data-dictionary-and-calcuations-part-2.html">The Data Dictionary and Calculations, Part 2</a>
</ul>KenDownshttp://www.blogger.com/profile/11117175783163937575noreply@blogger.com6tag:blogger.com,1999:blog-426922399870577072.post-18443781164041455322010-11-19T23:06:00.004-05:002010-11-28T22:08:51.814-05:00Prepare Now For Possible Future Head Transplant<p>This is the Database Programmer blog, for anybody who wants
practical advice on database use.</p>
<p>There are links to other essays at the <a href="#bottom">bottom of this post</a>.</p>
<p>This blog has two tables of contents, the
<a href="http://database-programmer.blogspot.com/2008/09/comprehensive-table-of-contents.html">Topical Table of Contents</a> and the list
of
<a href="http://database-programmer.blogspot.com/2010/11/database-skills.html">Database Skills</a>.
</p>
<h2>Planning For The Unlikely</h2>
<p>We programmers love to plan for things that will hardly
ever happen, like coding the system's upgrade engine to
handle spontaneous human combustion, making sure the
SQL scrubbing layer can also launch a rocket into
space, and, well, trying to work out ahead of time
what to do if
we ever need a head transplant.
</p>
<p>The boss comes over and says, "can you toss a simple
plot onto the sales staff's home page that shows sales
by day? Use that 'jquicky' or whatever you call it.
Should take a couple of hours, right?" And three days
later we're working on the world's greatest plotting
system that can report <i>everything except what the
boss actually asked for because we haven't gotten
around to that part of it yet.</i> (Really, can he expect
me to just bang this out without the required
Seven Holy Layers of Abstraction and Five Ritual
Forms of Parameterization, and the Just and Wholesome
Mobile Support, or the features Not Yet Required
but visible to the Far Seeing Wise and Learned Men?)
</p>
<h2>Abstraction Contraptions</h2>
<p>So what I am getting at is that programmers of all
stripes are addicted to abstraction; it gives us
goosebumps and makes us feel warm and tingly, and
so we do it even when we do not need to. We
build abstraction contraptions.
</p>
<p>When it comes to designing a database, this
unhealthy proclivity can seriously slow you
down, because of what I call:
</p>
<h2>Ken's Law</h2>
<p>Everybody wants to be remembered for something. If
I could write my own epitaph, it might be:
</p>
<p style="border: 2px solid black; font-weight:bolder;
text-align: center; padding: 5px 5px 5px 5px;
background-color:lightgreen">
Table-based datastores are optimally abstract
</p>
<p>This law is not about database access, it is about
database design. It can be expressed informally as:
</p>
<p style="border: 2px solid black; font-weight:bolder;
text-align: center; padding: 5px 5px 5px 5px;
background-color:lightgreen">
People Understand Tables Just Fine
</p>
<p>Or more rigorously as:</p>
<p style="border: 2px solid black; font-weight:bolder;
text-align: center; padding: 5px 5px 5px 5px;
background-color:lightgreen">
Table-based datastores are optimally
abstract; they require no additional abstraction
when requirements are converted to design; they
cannot be reduced to a less abstract form.
</p>
<h2>Structured Atomic Values</h2>
<p>I should point out that this essay deals with
structured atomic values, who live in the
Kingdom of The Relational Database. The
concepts discussed here do not apply to
free-text documents or images, or sound
files, or any other media.
</p>
<h2>No Additional Abstraction Required</h2>
<p>My basic claim here is that you cannot create
an abstraction of data schemas that will pay
for itself. At best you will create a
description of a database where everything
has been given a different name, where tables have been
designated
'jingdabs' and columns have been designated
'floopies' and in the end all of your jingdab
floopies will become columns in tables. Oh,
and I suppose I should mention the Kwengars will
be foreign keys and the Slerzies will be primary
keys.
</p>
<p>After that it goes downhill, because if we
generate an abstraction that is not a simple
one-to-one mapping, we actually obscure the
end goal. Consider an example so simple as to
border on the trivial or absurd.
Why would we ever use the terms
"One-to-Many" or "Many-to-Many" when the more
precise terms "child table" and "cross-reference
table" convey the same idea without the noise?
I said above that this would sound
trivial, and you can accuse me of nit-picking,
but this is one of those camel's nose things,
or perhaps a slippery slope. When technical folk
get together to design a system, we should call
things what they are, and not make up new words
to make ourselves sound smarter.
</p>
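<p>To make that vocabulary concrete, here is a minimal sketch, run here against SQLite through Python purely as a stand-in for any SQL database (the table names are invented for illustration): a child table carries a foreign key to its parent, and a cross-reference table carries foreign keys to both sides of a many-to-many relationship.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- "One-to-Many": orders is a child table of customers
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers (customer_id)
);

-- "Many-to-Many": a cross-reference table between students and courses
CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE courses  (course_id  INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE students_x_courses (
    student_id INTEGER NOT NULL REFERENCES students (student_id),
    course_id  INTEGER NOT NULL REFERENCES courses (course_id),
    PRIMARY KEY (student_id, course_id)
);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```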
<h2>No de-Abstraction is Possible</h2>
<p>The second half of Ken's law says that you
cannot de-Abstract a table schema into some
simpler form. This should be very easy to
understand, because relational databases
deal with atomic values, that is, values
which cannot themselves be decomposed. If
you cannot decompose something, then it cannot
be an abstraction of something more specific.
</p>
<p>Going further, if the schema has been
normalized, then every fact is stored in
exactly one place, so no further simplification
is possible. If you cannot simplify it or
resolve it into something more specific, then
it is not an abstraction of something else.
</p>
<h2>But Does it Work?</h2>
<p>I originally began to suspect the existence
of what I call in all humility "Ken's Law"
when I was sitting at a large conference table
with my boss, her boss, a couple of peers, and
3 or 4 reps from a Fortune 5 company. My job
was basically to be C3PO, human-cyborg relations.
Some people at the table protested loudly to
being non-technical, while others had technical
roles. But <i>everybody at the table spent all
day discussing table design.</i>
</p>
<p>Later on, when at a different position, the
programmers received their instructions from
Project Managers. The best Project Managers worked
with their customers to figure out what they were
trying to keep track of, and handed us specs that
were basically table layouts. The customers loved
these guys because they felt they could "understand
what the project manager was talking about", and
the project managers, who swore they were not technical,
were respected because they handed us requirements
we could actually understand and implement.
</p>
<p>Since that time I have further learned that it
is very easy for anybody who deals with non-technical
people to bring them directly to table design without
telling them you are doing it. All you have to do
is listen to what words they use and adapt your
conversation accordingly. If they say things like
"I need a screen that shows me orders by customer
types" they have told you there will be a table of
customer types. Talk to them in terms of screens.
If they say, "Our catalog has three different
price lists and four discount schemes" then you know that
there will be a PRICELIST table, a DISCOUNTS table, and
likely some cross-references and parent-child relationships
going on here.
</p>
<h2>So How Does ORM Come Into This?</h2>
<p>One of the greatest abstraction contraptions of
this century (so far), is ORM, or Object-Relational
Mapping, which I <a href="http://database-programmer.blogspot.com/2008/06/why-i-do-not-use-orm.html">do not use</a>
precisely because it is an abstraction contraption.
</p>
<p>To be clear, the mistake that ORM makes is not
at the design phase, but at the access phase.
The ORM meme complex instructs its victims
that it is ok to put
structured atomic values into a Relational
Database, but when it comes time to access and
use that data <i>we will pretend we did not put it
into tables and we will pretend that the data is in
objects.</i> In this sense the term
Object-Relational Mapping is a complete misnomer, because
the point is not to map data to objects but to
create the illusion that the tables do not even
exist.
</p>
<p>Perhaps ORM should stand for Obscuring Reality Machine.
</p>
<h2>Getting Back to That Head Transplant</h2>
<p>So what about that weird title involving head
transplants? Obviously a head transplant is
impossible, making it also very unlikely, besides
being silly and nonsensical. It came to mind
as a kind of aggregate of all of the bizarre
and unrealistic ideas about abstracting data
designs that I have heard over the years.
</p>
<p>One of these ideas is that it is possible and
beneficial to create a design that is abstract
so that it can be implemented in any model:
relational, hierarchical, or network. I'm not
saying such a thing is impossible, it is likely
just a <a href="http://www.catb.org/jargon/html/S/SMOP.html">small matter of programming</a>,
but for heaven's sake, what's the point?
</p>
<h2>Conclusion</h2>
<p>So don't waste time creating abstractions that
add steps, possibly obscure the goal, and
add no value. Don't plan for things that are
not likely to happen, and avoid abstraction
contraptions.
</p>
<a name="bottom">
<h2>Related Essays</h2>
</a>
<p>This blog has two tables of contents, the
<a href="http://database-programmer.blogspot.com/2008/09/comprehensive-table-of-contents.html">Topical Table of Contents</a> and the list
of
<a href="http://database-programmer.blogspot.com/2010/11/database-skills.html">Database Skills</a>.
</p>
<p>Other philosophy essays are:</p>
<ul><li><i><a href="http://database-programmer.blogspot.com/2010/11/prepare-now-for-possible-future-head.html"
>Prepare Now For Possible Future Head Transplant (This Essay)</a></i>
<li><a href="http://database-programmer.blogspot.com/2008/09/quest-for-absolute.html"
>The Quest for The Absolute</a>
<li><a href="http://database-programmer.blogspot.com/2009/03/i-am-but-humble-filing-clerk.html"
>I Am But A Humble Filing Clerk</a>
<li><a href="http://database-programmer.blogspot.com/2008/06/why-i-do-not-use-orm.html"
>Why I Do Not Use ORM</a>
<li><a href="http://database-programmer.blogspot.com/2008/05/minimize-code-maximize-data.html"
>Minimize Code, Maximize Data</a>
</ul>KenDownshttp://www.blogger.com/profile/11117175783163937575noreply@blogger.com7tag:blogger.com,1999:blog-426922399870577072.post-46864586389585335852010-11-13T12:14:00.002-05:002010-11-29T20:45:18.533-05:00Database Skills<p>It seems strange to me that I've been working on this blog
for 3 years or so (with one very long break) and somehow never
got around to writing a simple list of skills that all
database experts need. So here it is!
</p>
<h2>Various Job Titles for Database People</h2>
<p>There are three common job titles in the database area,
which are Database Administrator (DBA), Database Programmer,
and Database Architect. These titles tend to be somewhat
variable from shop-to-shop, but generally the "Architect"
term indicates the highest level of skill combined with
considerable management responsibilities. The "Programmer"
term is somewhere below that, but the "DBA" is extremely
variable. I have seen shops where a person was called a
DBA and filled a relatively constrained role closer to
IT or operations (routine tasks, no real programming) and
other shops where a person with the DBA title was basically
the Architect.
</p>
<p>Because of this high variability in what titles mean, I am not
going to waste time categorizing skills as belonging to one
job title or another; I am simply going to list them all out.
</p>
<p>The various levels of skills are these:</p>
<ul><li>Before Hello World!: The basics of tables, columns, rows
<li>The Hello World! Level: SQL Select
<li>Just after Hello World!: Writing Data
<li>Commands to create, modify and drop tables, or Data
Definition Language (DDL)
<li>Knowing how to use a Query Analyzer or optimization tool
<li>Understanding Normalization
<li>Understanding Denormalization
<li>Understanding Primary Keys, Foreign Keys and Constraints
<li>Understanding Transactions
<li>Understanding ACID
<li>Understanding Indexes as an optimization tool
<li>Views
<li>Database Security
<li>Upgrades and Installs
<li>Efficient access of database from application
<li>Bulk operations: loading or exporting large amounts
of data
<li>Understanding of Contexts and how they lead to
different sets of Best Practices
<li>Preventing performance degradation through
various maintenance tasks
<li>Deployment strategies: partitioning, tablespaces
<li>Deployment strategies, failure protection, from
simple backup to hot standbys
<li>Server side coding: stored procedures and functions
<li>Server side coding: triggers
<li>Temporary tables
</ul>
<p>As long as that list is, it only covers those of us who
<i>use</i> database systems. There is an entire set of
skills for those who actually <i>create and maintain</i> these
systems, but that is not something that will be treated
in this blog.
</p>
<h2>Before Hello World!: Tables and Columns</h2>
<p>If you have never so much as typed a single SQL command,
or seen a table diagram, or anything like that, then it is
worth a few minutes to go through the basics of what
a database does, which is to organize atomic values into
tables.
</p>
<p>I am going to write an essay on this soon, even though it
may seem so obvious as to be completely unnecessary. But I
will do it because the most popular essay on this
blog is about using GROUP BY, which tells me newer programmers
are starving for useful tutorials at the beginner level.
So it seems to me, why not put something out there at the
very beginning of the beginning?
</p>
<h2>The Hello World! Level: SQL Select</h2>
<p>If you are starting completely from scratch and want to know
about database programming, you want to start with the SQL
SELECT command. This is the (almost) only command used to
extract data from a database, and all of the possible ways to
combine, filter and extract data are expressed in the many
clauses of this command.
</p>
<ul><li>Simplest Possible <a href="http://database-programmer.blogspot.com/2008/03/introduction-to-queries.html">SQL SELECT</a> commands
<li>Simple embellishments: renaming columns, calculations
<li>Aggregations: <a href="http://database-programmer.blogspot.com/2008/04/group-by-having-sum-avg-and-count.html">GROUP BY...HAVING</a>
<li>Multiple table queries: <a href="http://database-programmer.blogspot.com/2008/03/how-sql-union-affects-table-design.html">UNION</a> and <a href="http://database-programmer.blogspot.com/2008/03/join-is-cornerstone-of-powerful-queries.html">JOIN (part 1)</a> and
<a href="http://database-programmer.blogspot.com/2008/04/joins-part-two-many-forms-of-join.html">JOIN (part 2)</a>
<li>Subqueries as column values
<li>Subqueries as tables
<li>Partitioning Functions: ntile(), row_number(), etc.
<li><a href="http://database-programmer.blogspot.com/2010/11/recursive-queries-with-common-table.html">Recursive queries using Common Table Expressions</a>
<li>Extracting results as XML
</ul>
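<p>A few of those clauses in action, using SQLite through Python as a convenient stand-in (any SQL database accepts similar queries; the sales table here is invented for illustration):</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount NUMERIC)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100), ("east", 150), ("west", 40), ("west", 30)])

# Simple embellishments: renaming columns and a calculation in the select list
rows = conn.execute(
    "SELECT region AS r, amount * 2 AS doubled FROM sales").fetchall()

# Aggregation: GROUP BY ... HAVING keeps only the groups that pass the test
totals = conn.execute("""
    SELECT region, SUM(amount) AS total
      FROM sales
     GROUP BY region
    HAVING SUM(amount) > 75
""").fetchall()
print(totals)   # west sums to 70 and is filtered out by HAVING
```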
<h2>Just after Hello World!: Writing Data</h2>
<p>When it comes time to change the data in a database
there are three commands, listed below. These commands
are based on the tables-and-rows nature of databases,
and allow you to add a row (or rows), change a row (or rows)
and delete a row (or rows).
<ul><li>The INSERT command
<li>The UPDATE command
<li>The DELETE command
</ul>
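<p>The three commands side by side, again using SQLite through Python as a stand-in (the tasks table is invented for illustration):</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (task_id INTEGER PRIMARY KEY, status TEXT)")

# INSERT adds rows
conn.executemany("INSERT INTO tasks VALUES (?, ?)",
                 [(1, "open"), (2, "open"), (3, "closed")])

# UPDATE changes every row matching the WHERE clause
conn.execute("UPDATE tasks SET status = 'closed' WHERE task_id = 1")

# DELETE removes every row matching the WHERE clause
conn.execute("DELETE FROM tasks WHERE status = 'closed'")

remaining = conn.execute("SELECT task_id FROM tasks").fetchall()
print(remaining)
```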
<h2>Commands to create, modify and drop tables, or Data
Definition Language (DDL)</h2>
<p>The term "DDL" stands for "Data Definition Language"
and includes all of the commands used to build the
tables that will hold the data for the INSERT, UPDATE,
DELETE and SELECT statements. The basic list of
commands to be familiar with is:
</p>
<ul><li>Understanding Data Types (databases are strongly typed)
<li>CREATE TABLE and ALTER TABLE
<li>Commands to add and drop primary keys
<li>Commands to add and drop foreign keys
<li>Commands to add and drop constraints
<li>Commands to add and drop indexes
</ul>
<p>There are also numerous commands that are specific
to different products. Those will not be listed here
today, but who knows what the future may bring.
</p>
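<p>A sketch of the core DDL commands, run against SQLite through Python as a stand-in (the customers/orders tables are invented; note that SQLite in particular only enforces foreign keys after <code>PRAGMA foreign_keys = ON</code>, a detail that varies by product):</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable FK enforcement

# CREATE TABLE with a primary key and a foreign key
conn.execute("""CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL)""")
conn.execute("""CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers (customer_id))""")

# ALTER TABLE modifies an existing table
conn.execute("ALTER TABLE orders ADD COLUMN placed_on TEXT")

# The foreign key now rejects orders for customers that do not exist
conn.execute("INSERT INTO customers VALUES (1, 'Acme')")
conn.execute("INSERT INTO orders VALUES (10, 1, '2010-11-13')")
try:
    conn.execute("INSERT INTO orders VALUES (11, 99, '2010-11-13')")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
print(fk_enforced)
```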
<h2>Knowing how to use a Query Analyzer or optimization tool</h2>
<p>Database programmers, once they get started with the skills
listed above, tend to become more and more obsessed with
performance. Every major database has some type of tool
that lets you examine how the server is going to process
a SQL SELECT, and database programmers depend on these tools
to discover where they might alter tables or indexes or the
SELECT itself to make the queries go faster.
</p>
<h2>Understanding Normalization</h2>
<p>The term "normalization" refers to the process of analyzing
the data that your system is required to store, and organizing
it so that every fact is stored in exactly one place. Understanding
how to normalize data is an absolute requirement for the
database programmer who wants to design databases.
</p>
<p>We speak of normalization in "forms" as in "first normal form",
"second normal form", and so on. It is a good idea to understand
<a href="http://database-programmer.blogspot.com/2008/10/argument-for-normalization.html">The argument for normalization</a> and then to pursue
at the very least:
</p>
<ul><li><a href="http://database-programmer.blogspot.com/2007/12/database-skills-first-normal-form.html">First Normal Form</a>
<li><a href="http://database-programmer.blogspot.com/2007/12/database-skills-second-normal-form.html">Second Normal Form</a>
<li><a href="http://database-programmer.blogspot.com/2008/01/database-skills-third-normal-form-and.html">Third Normal Form</a>
<li>Fourth Normal Form and higher forms
</ul>
<p>Normalization is a fascinating topic to study, and it
extends all the way up to "Domain-key Normal Form" which is
considered the most complete normalization for a database.
</p>
<h2>Understanding Denormalization</h2>
<p>Every database programmer, after fully understanding
normalization, realizes that there are severe practical
problems with a fully normalized database: such a database
solves many problems but generates problems of its own.
This has led programmer after programmer down the path
of <i>denormalization</i>, the deliberate re-introduction
of redundant values to improve the usability of the
database.
</p>
<p>There is a surprising lack of material available on the
web regarding denormalization strategies. Most of what
you find is arguments and flame wars about whether or not
to do it, with little to nothing on how to actually do it.
For this reason, I provide my own essays on this blog on
the strategies and methods I have worked out over the years:
</p>
<p>After reviewing <a href="http://database-programmer.blogspot.com/2008/10/argument-for-denormalization.html">The Argument For Denormalization</a>
it is worthwhile to follow up with:</p>
<ul><li>Understanding the "Automation Constraint" and how
to <a href="http://database-programmer.blogspot.com/2008/11/keeping-denormalized-values-correct.html">Keep denormalized values correct</a>
<li>Understanding the <a href="http://database-programmer.blogspot.com/2008/04/denormalization-patterns.html">three denormalization patterns of FETCH, EXTEND, and AGGREGATE</a>
<li>Other Patterns
</ul>
<p>The arguments for and against denormalization are heavily
affected by the <a href="http://database-programmer.blogspot.com/2008/07/database-performance-pay-me-now-or-pay.html">Pay me now or pay me later</a>
design tradeoff.
</p>
<h2>Understanding Primary Keys, Foreign Keys and Constraints</h2>
<p>One might argue that this list of skills belongs much higher
up the list, up there with the CREATE TABLE command. However,
I have found it useful to distinguish between simply <i>knowing the
commands</i> to make a primary key and actually understanding
<i>the tremendous power</i> of keys.
</p>
<p>In this author's opinion it is not truly possible to understand
how powerful and beneficial Primary keys and Foreign Keys are
for an entire application stack until you have learned the commands,
built some databases, and worked through the concepts of normalization
and denormalization. Only then can you revisit these humble
tools and realize how powerful they are.
</p>
<ul><li><a href="http://database-programmer.blogspot.com/2007/12/database-skills-primary-keys-this-is.html">Primary Keys and Table Design</a>
<li><a href="http://database-programmer.blogspot.com/2008/01/database-skills-sane-approach-to.html">Choosing data types for primary keys</a>
<li><a href="http://database-programmer.blogspot.com/2007/12/database-skills-foreign-keys-this-is.html">Foreign keys and table design</a>
<li><a href="http://database-programmer.blogspot.com/2008/07/different-foreign-keys-for-different.html">Foreign keys and cascading actions</a>
<li><a href="http://database-programmer.blogspot.com/2008/03/of-tables-and-constraints.html">The How and Why of Constraints</a>
</ul>
<h2>Understanding Transactions</h2>
<p>The word "transaction" has two meanings in common day-to-day
database talk. One meaning is very loose and refers to some
individual command or set of commands. You might hear somebody
using the term loosely when they say, "We're seeing about
10 transactions per second this week."
</p>
<p>The more rigorous use of the term refers to a statement or
set of statements that <i>must be guaranteed to either
complete in their entirety or fail in their entirety.</i>
This is a profoundly important concept once you get beyond
simply making tables with keys and get into real-world
heavy multi-user activity. And this leads us to the
next topic...
</p>
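<p>The all-or-nothing guarantee is easy to see in code. A sketch using SQLite through Python as a stand-in (the accounts table is invented; the CHECK constraint makes the second statement fail so the rollback undoes both):</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE accounts (
    name    TEXT PRIMARY KEY,
    balance NUMERIC NOT NULL CHECK (balance >= 0))""")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

# A transfer is two statements that must succeed or fail together
try:
    conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'bob'")
    conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'alice'")
    conn.commit()
except sqlite3.IntegrityError:
    conn.rollback()   # alice cannot go negative, so bob's credit is undone too

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)
```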
<h2>Understanding ACID</h2>
<p>Modern relational databases expect multiple simultaneous
users to be writing and reading data all of the time.
The term "ACID Compliance" refers to both the philosophy
of how to handle this and the actual methods that
implement that philosophy. The term ACID refers to:
</p>
<ul><li>The Atomic nature of each transaction
<li>The Consistency of the database during and
after simultaneous overlapping transactions
<li>The Isolation of each transaction
<li>The Durability of the results
</ul>
<h2>Understanding Indexes as an optimization tool</h2>
<p>An index is a special tool databases use to provide very
rapid access to large amounts of data. Just like keys, it
is not enough to know the commands, it is necessary to
understand the subtle power of indexes when used with some
craftsmanship. The basic uses of indexes are:
</p>
<ul><li>A simple index on a column to provide rapid search
on that column
<li>A "covering index" that includes extra columns that
can further speed up certain common access patterns
<li>Clustered indexes (MS SQL Server) and what they give
and what they take away
<li>The cost of indexes on write operations
</ul>
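<p>The covering-index idea can be seen directly: in the sketch below (SQLite through Python as a stand-in, invented table names), the index carries every column the query needs, so the plan reports that the table itself is never touched:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER,
    placed_on   TEXT,
    total       NUMERIC)""")

# placed_on is included purely so the query below never reads the table
conn.execute("CREATE INDEX idx_cust_date ON orders (customer_id, placed_on)")

detail = [row[-1] for row in conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT customer_id, placed_on FROM orders WHERE customer_id = 1")]
print(detail)
```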
<h2>Views</h2>
<p>A view looks like a table to the SQL SELECT command. The view
itself is a stored SQL SELECT command that encodes some
query that is either used very often or is very complex. In
all cases, views are used to present the database data to
the application in some simplified, convenient, or secure
form. The two major uses of views are:
<ul><li>To simplify the application programmer's job
<li>To provide a read-only interface for
some applications
</ul>
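<p>A minimal sketch of both uses at once (SQLite through Python as a stand-in; the items table and view names are invented): the view hides a column and a filter, and the application SELECTs from it exactly as if it were a table.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (sku TEXT, price NUMERIC, discontinued INTEGER)")
conn.executemany("INSERT INTO items VALUES (?, ?, ?)",
                 [("A1", 10, 0), ("B2", 20, 1), ("C3", 30, 0)])

# The view hides the discontinued flag and the rows it filters out
conn.execute("""CREATE VIEW active_items AS
    SELECT sku, price FROM items WHERE discontinued = 0""")

# To SELECT, the view looks exactly like a table
rows = conn.execute("SELECT sku FROM active_items ORDER BY sku").fetchall()
print(rows)
```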
<h2>Upgrades and Installs</h2>
<p>If you are a single programmer or hobbyist working with a
database, it is all well and good to just add and drop tables
as you wish. But as soon as you get into development
with quality control stages and multiple programmers, it becomes
evident that you need a strategy for handling the
<i>schema changes</i> that come with new versions of
the system. There are multiple essays available on
this blog, covering:
</p>
<ul><li><a href="http://database-programmer.blogspot.com/2009/01/dictionary-based-database-upgrades.html">Dictionary based upgrades</a>
<li><a href="http://database-programmer.blogspot.com/2009/01/upgrading-indexes-with-data-dictionary.html">Upgrading indexes and keys</a>
<li><a href="http://database-programmer.blogspot.com/2009/01/data-dictionary-and-calculations-part-1.html">Upgrades, dictionary, and calculated values part 1</a>
<li><a href="http://database-programmer.blogspot.com/2009/01/data-dictionary-and-calcuations-part-2.html">Upgrades, dictionary, and calculated values part 2</a>
</ul>
<h2>Database Security</h2>
<p>Databases provide incredible security provisions that are
just about completely ignored by modern web developers.
Sometimes there is good reason for this, but overall anybody
who wants to become a truly accomplished Database Programmer
or Database Architect must have a thorough understanding
of database security and how it can simplify the entire
system stack.
</p>
<p>Database security comes down to specifying who is allowed
to perform the 4 basic operations of INSERT, UPDATE,
DELETE and SELECT against which tables:
</p>
<p>My basic introduction to security is <a href="http://database-programmer.blogspot.com/2008/05/introducing-database-security.html">here</a>.
<ul><li>Understanding roles (we used to say users and groups)
<li>Simple table-level security
<li>Column-level security (not widely supported)
<li>Row-level security (not widely supported)
</ul>
<h2>Efficient access of database from application</h2>
<p>Imagine you have the perfectly designed database, with
every nuance and subtlety excellently crafted in the
areas of keys, indexes, normalization, denormalization
and security. At this point your job branches out into
several new areas, but one of the most important is
knowing how to write application code that efficiently
accesses the database.
</p>
<h2>Bulk operations: loading or exporting large amounts of data</h2>
<p>Some database applications involve a large number of small
transactions, where each trip to the database writes only a
single row or reads only a dozen or so rows.
</p>
<p>But in many cases you need to bulk load large amounts of
data in one shot, thousands or even millions of rows. In
these cases the techniques that govern small transactions
are useless and counter-productive, and you need to learn
some new commands and strategies to handle the bulk loads.
</p>
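<p>The simplest of those strategies is batching: send the rows in one statement inside one transaction instead of a commit per row. A sketch with SQLite through Python as a stand-in (real products also offer dedicated loaders such as bulk-copy or COPY-style commands; the readings table is invented):</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor INTEGER, value NUMERIC)")

rows = [(i % 10, i * 0.5) for i in range(100_000)]

# One executemany inside a single transaction, instead of 100,000
# separate INSERT round trips each followed by its own commit
conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(count)
```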
<h2>Understanding Contexts and how they lead to different sets of Best Practices</h2>
<p>Not all databases are created for the same purpose. If you
have a very large operation then it will likely have multiple
independent databases that fill the classical roles, while in
a smaller shop the roles may be combined in one database. I
like to refer to these roles as "contexts" because they determine
how the tables will be designed and how access to the tables
will be governed. The major contexts are:
</p>
<ul><li>OLTP or Online Transaction Processing, characterized
by simultaneous reads and writes, generally assumes
few or no periods of inactivity, and generally
assumes that the individual transactions are very
small. The apps we were all writing in the 80s and
90s to do accounting, ERP, MRP, Job control, payroll,
airline reservations and many others fall into this
context.
<li>Data Warehouse context, characterized by periodic
bulk loads of new information with most activity
being reads. The Data Warehouse context is largely
associated with the "Star Schema" table design.
Data in a Warehouse is historical, it never changes
after it is loaded.
<li>CMS or Content Management System, also characterized
by very few writes compared to reads, but more likely
to have a normalized structure. Unlike a Data
Warehouse, the data is subject to change, just not that
often.
<li>Any other Read Only Context. I include this category
because I spent some time working on Direct Marketing
databases, which are like a Data Warehouse in that they
are updated periodically and the data does not change,
but the Star Schema is completely inappropriate for them.
</ul>
<p>If you consider a huge online shopping system, you can see that
within that application there are at least two contexts. The
product catalog is likely to see vastly fewer writes than
reads, but the shopping cart tables will be in a constant state
of reads and writes.
</p>
<h2>Preventing performance degradation through
various maintenance tasks</h2>
<p>Once the database and its application stack is up and running,
and the reads and writes are coming through, the laws of
thermodynamics come into play and system performance can
begin to degrade even if the database stays the same size
and the load on the system is steady.
</p>
<p>Different vendors have different tools for combating this,
but they tend to come down to reclaiming temporary space and
occasionally rebuilding indexes. There are also log files
that have to be purged, regular backups to be made, and other
operations along those lines.
</p>
<h2>Deployment strategies: partitioning, tablespaces</h2>
<p>When systems become sufficiently large, it is no longer
possible to just attach some disks to a box and run
a database server. The Database Architect must consider
breaking different tables out onto different sets of
spindles, which is usually done with "tablespaces", and
moving older data onto slower cheaper spindles, which is
often done with Partitioning.
</p>
<h2>Deployment strategies, failure protection, from
simple backup to hot standbys</h2>
<p>Because a database typically experiences simultaneous
reads and writes from multiple sources, and may be expected
to be up and running 24/7 <i>indefinitely</i>, the concept
of making a backup and recovering from a failure becomes
more complicated than simply copying a few files to a
safe location.
</p>
<p>In the most demanding case, you will need to provide a
second complete box that can become fully live within
seconds of a disastrous failing of the main box. This is
called various things, but Postgres calls it a "hot standby"
in version 9 and some MS SQL Server shops call it a
"failover cluster."
</p>
<p>The ability to come up live on a second box when the first
one fails is made possible by the way databases handle
ACID compliance, and the fact that they produce something
called a Write-Ahead-Log (WAL) that can be fed into a
second box that "replays" the log so that its copy of the
database is getting the same changes as the master copy.
</p>
<h2>Server side coding: stored procedures and functions</h2>
<p>I really could not figure out where to put this entry
in the list, so I just punted and put it near the end.
It could really go anywhere.</p>
<p>Stored procedures or functions are procedural routines
(not object oriented) that are on the database server and
can be invoked directly from an application or embedded
inside of SQL commands. Generally speaking they provide
various flow-control statements and rudimentary variable
support so that you can code multi-step processes on the
server itself instead of putting them in application code.</p>
<h2>Server side coding: Triggers</h2>
<p>Triggers are quite possibly the most elegant and beautiful
technology that servers support, absolutely the least
understood, and definitely the most reviled by the
ignorant. You will find virtually no web content today
that will explain why and how to use triggers and
what they are good for.
</p>
<p>Except of course for <a href="http://database-programmer.blogspot.com/2008/05/database-triggers-encapsulation-and.html">my own essay on
triggers</a> that discusses them in terms of
encapsulation.
</p>
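<p>As a small taste of what triggers are good for, here is a sketch (SQLite through Python as a stand-in; the orders/order_lines tables are invented) where a trigger keeps a denormalized total correct no matter which program writes the rows:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, total NUMERIC DEFAULT 0);
CREATE TABLE order_lines (order_id INTEGER, amount NUMERIC);

-- The trigger keeps the denormalized orders.total correct on every
-- insert, without any cooperation from application code
CREATE TRIGGER order_lines_ai AFTER INSERT ON order_lines
BEGIN
    UPDATE orders SET total = total + NEW.amount
     WHERE order_id = NEW.order_id;
END;
""")

conn.execute("INSERT INTO orders (order_id) VALUES (1)")
conn.executemany("INSERT INTO order_lines VALUES (?, ?)", [(1, 10), (1, 5)])

total = conn.execute(
    "SELECT total FROM orders WHERE order_id = 1").fetchone()[0]
print(total)
```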
<h2>Temporary tables</h2>
<p>Temporary tables are like Stored Procedures inasmuch as
I had no idea where to put them in the list, so they
just ended up at the end.
</p>
<p>As the name implies, a temporary table is a table that
you can create on-the-fly, and which usually disappears
when your transaction is complete. They are most often
found in Stored Procedures. They can impact performance
for the worst in many ways, but can be extremely
useful when you are doing multi-staged analysis of
data in a Data Warehouse (that's where I use them the most).
</p>KenDownshttp://www.blogger.com/profile/11117175783163937575noreply@blogger.com592tag:blogger.com,1999:blog-426922399870577072.post-19939540824475662752010-11-06T13:20:00.001-04:002010-11-28T22:14:13.556-05:00Recursive Queries with Common Table Expressions<p>
This week The Database Programmer returns after almost 18 months
with an entry on using Common Table Expressions (CTEs) to do
recursive queries. Relational databases were plagued from their
inception with a lack of meaningful treatment for recursive
operations, and CTEs have finally plugged that hole.
</p>
<p>
Common Table Expressions appeared in SQL Server 2005, and in
PostgreSQL 8.4, and are also available in Oracle. As for
mySQL, since I don't use it, I did a quick Google search and
looked at the Docs for 5.5, and couldn't really find
anything. I generally tend to assume mySQL cannot do the
tough stuff.
</p>
<h2>But First, A Rant</h2>
<p>
There have always been plenty of people who claimed SQL was
a bloated and clumsy language to work with. Most of the time
I tend to agree, but I find the advantages of relational/SQL
systems to be so large that I'm willing to pay that price.
</p>
<p>
But with Common Table Expressions (CTEs) I just can't help
drifting into conspiracy theories involving the enemies of
SQL infiltrating the committees and deliberately suggesting
the most twisted, bloated, complicated way they could think
of to do what is really a very basic operation. In other
words, I am profoundly unimpressed with the syntax of CTEs,
but as long as they are here and they work, we'll go along.
</p>
<h2>The Basic Example</h2>
<p>
Your basic recursive table contains a foreign key to itself,
so that some rows in the table are children of some other
row in the table. This recursion can nest to any depth,
and the chart below shows a very simple example:
</p>
<pre>
Primary_key | Parent_Key | Notes
--------------+---------------+---------------------------
A | null | top level row, no parent
B | A | first level child of A
C | B | child of B, grandchild
| | of A
D | C | child of C, grandchild
| | of B, great-grandchild
| | of A
X | null | top level row, no parent
Y | X | child of X
Z | Y | child of Y, grandchild
| | of X
</pre>
<p>What we want is a query that can return a given row
and <i>all of its children</i> out to any level, including
helpful information about the structure of the recursion,
something like this:
</p>
<pre>
Primary_key | Level | Top_Key | Immediate_Parent_key
------------+-------+---------+-----------------------
A | 1 | A | null
B | 2 | A | A
C | 3 | A | B
D | 4 | A | C
X | 1 | X | null
Y | 2 | X | X
Z | 3 | X | Y
</pre>
<h2>And Another Rant</h2>
<p>
At this point the mind boggles at how long this blog entry
needs to be to explain this simple operation. But let's
get going anyway.
</p>
<h2>The First Step and Last Step</h2>
<p>A Common Table Expression begins with the "WITH" clause
and ends with a standard SQL Select:
</p>
<pre>
;WITH myCTEName (primary_key,level,top_key,immediate_parent_key)
as (
....we'll get to this below....
)
select * from myCTEName
</pre>
<p>The basic idea is that we are going to define a CTE with
a name and a list of columns, and then SELECT out of it.
Without that final SELECT statement the CTE does not actually
do anything. The SELECT can also be arbitrarily complex, doing
aggregations, WHERE clauses and anything else you might need.</p>
<p>The first thing to notice is the leading semi-colon. This is
a trick adopted by MS SQL Server users. SQL Server does not
require statements to be terminated with a semi-colon, but a
SQL Server CTE requires the <b>previous</b> statement to have
been terminated with a semi-colon (nice huh?). So SQL Server
programmers adopted the strategy of starting the CTE with
a semi-colon, which keeps the syntactical requirement with
the CTE, where it belongs.</p>
<p>A given CTE sort of has a name. That is, you have to name
it something, but think of it as a table alias in a SQL SELECT,
such as "Select * from myTable a JOIN otherTable b...": it
exists only during the execution of the statement.
</p>
<p>The columns listed in the parentheses can have any names
(at least in SQL Server). But these column names are what
you will refer to in the final SQL SELECT statement.
</p>
<h2>Coding The Inside of the CTE, Step 1</h2>
<p>Now we code the inside of the CTE in two steps.
The first step is called the "anchor", and it is a
straightforward query to find the top-level rows:
</p>
<pre>
;WITH myCTEName (primary_key,level,top_key,immediate_parent_key)
as (
select primary_key as primary_key
, 1 as level
, primary_key as top_key
, null as immediate_parent_key
from myRecursiveTable
where Parent_key is null
)
select * from myCTEName
</pre>
<p>This should be self-explanatory, we are querying only for
rows that have no parent (WHERE Parent_key is null) and we
are hardcoding the "level" column to 1, and we are also
hardcoding the "immediate_parent_key" column to null.
</p>
<p>This query alone would return two of the rows from
our desired output:
</p>
<pre>
Primary_key | Level | Top_Key | Immediate_Parent_key
------------+-------+---------+-----------------------
A | 1 | A | null
X | 1 | X | null
</pre>
<h2>Coding The Inside of the CTE, Step 2</h2>
<p>Now we are going to add the actual recursion. When I first
learned CTEs this was the hardest part to figure out, because my
hard-won set-oriented thinking was actually slowing me down:
I had to think like a procedural programmer when defining
the second half of the query.
</p>
<pre>
;WITH myCTEName (primary_key,level,top_key,immediate_parent_key)
as (
select primary_key,1,primary_key,null
from myRecursiveTable
where Parent_key is null
UNION ALL
select chd.primary_key,par.level+1,par.top_key,chd.parent_key
FROM myCTEName par
JOIN myRecursiveTable chd ON chd.parent_key = par.primary_key
)
select * from myCTEName
</pre>
<p>Thinking step-wise, here is what is going on under the hood:</p>
<ol><li>The server executes the "anchor" query, generating a
result set called "myCTEName" containing just the
top level rows.
<li>The server then executes the second query. At this
point the result set "myCTEName" exists and can be
referenced, so that you can link children to their
parents. (That's why you see the JOIN)
<li>Step 2 is repeated recursively, adding grand-children,
great-grand-children, and so on, until no more rows
are being added, at which point it stops, and...
<li>The final result set is passed to the trailing
SELECT, which pulls results out of "myCTEName"
as if it were a table or view.
</ol>
<p>So when we code the 2nd part of the inside of the CTE, the
part after the UNION ALL, act as if the first query has
already run and produced a table/view called "myCTEName"
that can be referenced. Once you understand that, the
query is pretty easy to understand:
</p>
<ul><li>The "From myCTEName par" clause tells us we are pulling
from the previously generated set. I like to use the alias
"par" for "parent" to remind myself that the prior result is
the parent row.
<li>We then join to the original source table and use the
alias "chd" to remind ourselves we are pulling child rows
from there. The "ON chd.parent_key = par.primary_key"
defines how children are joined to parents.
<li>Our first column, "chd.primary_key", is the unique
key for the results.
<li>Our second column, "par.level+1" gives us a nifty
automatically incremented "level" column.
<li>Our third column, "par.top_key" ensures that all rows
contain a reference to their top-most parent.
<li>Our final column, "chd.parent_key", makes sure each
row contains a reference to its immediate parent.
</ul>
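<p>To make all of this concrete, here is the whole example as a
self-contained sketch you can actually run. I'm using SQLite through
Python purely for illustration (the table name and toy data are the
example from above); one wrinkle to note is that SQLite and Postgres
spell the opening clause "WITH RECURSIVE", while SQL Server accepts
the plain "WITH" shown in this essay.</p>

```python
import sqlite3

# Toy data from the essay, loaded into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE myRecursiveTable (
  primary_key TEXT PRIMARY KEY,
  parent_key  TEXT
);
INSERT INTO myRecursiveTable (primary_key, parent_key) VALUES
  ('A', NULL), ('B', 'A'), ('C', 'B'), ('D', 'C'),
  ('X', NULL), ('Y', 'X'), ('Z', 'Y');
""")

rows = conn.execute("""
WITH RECURSIVE myCTEName (primary_key, level, top_key, immediate_parent_key)
AS (
  -- Anchor: top-level rows only
  SELECT primary_key, 1, primary_key, NULL
    FROM myRecursiveTable
   WHERE parent_key IS NULL
  UNION ALL
  -- Recursive member: join children to the rows found so far
  SELECT chd.primary_key, par.level + 1, par.top_key, chd.parent_key
    FROM myCTEName par
    JOIN myRecursiveTable chd ON chd.parent_key = par.primary_key
)
SELECT primary_key, level, top_key, immediate_parent_key
  FROM myCTEName
 ORDER BY top_key, level
""").fetchall()

for row in rows:
    print(row)
```

<p>Running this prints the seven rows of the desired output shown
earlier, with each row carrying its level, top-most ancestor, and
immediate parent.</p>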
<h2>Finding Various Statistics</h2>
<p>Once you have the inside of the CTE coded, the fun part
moves to the final SELECT, which is operating on the
complete set of results. You do not necessarily have to pull
the complete list. For instance, you may want to find out
the maximum nesting level for each parent, or the count
of children for each parent:
</p>
<pre>
;WITH myCTEName (primary_key,level,top_key,immediate_parent_key)
as (
select primary_key,1,primary_key,null
from myRecursiveTable
where Parent_key is null
UNION ALL
select chd.primary_key,par.level+1,par.top_key,chd.parent_key
FROM myCTEName par
JOIN myRecursiveTable chd ON chd.parent_key = par.primary_key
)
select top_key
, max(level) as maxNestingLevel
, count(*) as countRows
, count(*)-1 as countChildren
from myCTEName
group by top_key
</pre>
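<p>A quick way to sanity-check the aggregate version is to run it
against the same toy data. As before this is an SQLite/Python sketch,
not something from a production system:</p>

```python
import sqlite3

# Same hypothetical toy table as the earlier example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE myRecursiveTable (
  primary_key TEXT PRIMARY KEY,
  parent_key  TEXT
);
INSERT INTO myRecursiveTable (primary_key, parent_key) VALUES
  ('A', NULL), ('B', 'A'), ('C', 'B'), ('D', 'C'),
  ('X', NULL), ('Y', 'X'), ('Z', 'Y');
""")

stats = conn.execute("""
WITH RECURSIVE myCTEName (primary_key, level, top_key, immediate_parent_key)
AS (
  SELECT primary_key, 1, primary_key, NULL
    FROM myRecursiveTable
   WHERE parent_key IS NULL
  UNION ALL
  SELECT chd.primary_key, par.level + 1, par.top_key, chd.parent_key
    FROM myCTEName par
    JOIN myRecursiveTable chd ON chd.parent_key = par.primary_key
)
SELECT top_key
     , max(level)   AS maxNestingLevel
     , count(*)     AS countRows
     , count(*) - 1 AS countChildren
  FROM myCTEName
 GROUP BY top_key
 ORDER BY top_key
""").fetchall()

print(stats)
```

<p>Chain A has four rows nested four levels deep (so three children
of A), and chain X has three rows nested three deep.</p>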
<h2>Conclusion</h2>
<p>Common Table Expressions give SQL-based databases the (very)
long-needed ability to execute recursive queries, albeit with
a rather complex syntax. Once you grasp the basics of how
to code them, there are many possible uses that go far beyond
the simple example shown here.
</p>KenDownshttp://www.blogger.com/profile/11117175783163937575noreply@blogger.com23tag:blogger.com,1999:blog-426922399870577072.post-90269346882583661052009-06-29T21:33:00.000-04:002010-11-29T20:48:40.101-05:00Approaches to "UPSERT"<p>This week in the Database Programmer we look at something
called an "UPSERT", the strange trick where an insert
command may magically convert itself into an update if
a row already exists with the provided key. This trick
is very useful in a variety of cases. This week we will
see its basic use, and next week we will see how the same
idea can be used to materialize summary tables efficiently.
</p>
<h2>An UPSERT or ON DUPLICATE KEY...</h2>
<p>The idea behind an UPSERT is simple. The client issues
an INSERT command. If a row already exists with the
given primary key, then instead of throwing a key
violation error, it takes the non-key values and updates
the row.
</p>
<p>This is one of those strange (and very unusual) cases
where MySQL actually supports something you will not
find in all of the other more mature databases. So if you
are using MySQL, you do not need to do anything special
to make an UPSERT. You just add the term "ON DUPLICATE
KEY UPDATE" to the INSERT statement:
</p>
<pre>
insert into mytable (a,b,c) values (1,2,3)
    on duplicate key update
    b = 2,
    c = 3
</pre>
<p>The MySQL command gives you the flexibility to specify
different operations on UPDATE versus INSERT, but with
that flexibility comes the requirement that the UPDATE
clause completely restates the operation.
</p>
<p>With the MySQL command there are also various considerations
for AUTO_INCREMENT columns and multiple unique keys.
You can read more at the MySQL page for the
<a href=
"http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html"
>INSERT ... ON DUPLICATE KEY UPDATE</a> feature.
</p>
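<p>If you want to experiment without a MySQL server handy, SQLite
offers a close analog in its "ON CONFLICT ... DO UPDATE" clause
(SQLite 3.24 or later). The sketch below uses a made-up toy table,
and the syntax shown is SQLite's, not MySQL's:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (a INTEGER PRIMARY KEY, b INTEGER, c INTEGER)")

# First insert goes in normally
conn.execute("INSERT INTO mytable (a, b, c) VALUES (1, 2, 3)")

# Re-inserting key 1 becomes an update; 'excluded' refers to the
# values the failed insert tried to provide
conn.execute("""
INSERT INTO mytable (a, b, c) VALUES (1, 20, 30)
ON CONFLICT(a) DO UPDATE SET b = excluded.b, c = excluded.c
""")

rows = conn.execute("SELECT a, b, c FROM mytable").fetchall()
print(rows)
```

<p>Just as with the MySQL command, the table ends up with a single
row holding the updated values, and no key violation is raised.</p>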
<h2>A Note About MS SQL Server 2008</h2>
<p>MS SQL Server introduced something like UPSERT in
SQL Server 2008. It uses the MERGE command, which is
a bit hairy, check it out in this
<a href=
"http://www.databasejournal.com/features/mssql/article.php/3739131/UPSERT-Functionality-in-SQL-Server-2008.htm"
>nice tutorial.</a>
</p>
<h2>Coding a Simpler UPSERT</h2>
<p>Let us say that we want a simpler UPSERT, where you do not
have to mess with SQL Server's MERGE or rewrite the entire
command as in MySQL. This can be done with triggers.
</p>
<p>To illustrate, consider a shopping cart with a natural key
of ORDER_ID and SKU. I want simple application code that
does not have to figure out if it needs to do an INSERT or
UPDATE, and can always happily do INSERTs, knowing they will
be converted to updates if the line is already there.
In other words, I want simple application code that just keeps
issuing commands like this:
</p>
<pre>
INSERT INTO ORDERLINES
(order_id,sku,qty)
VALUES
(1234,'ABC',5)
</pre>
<p>We can accomplish this by a trigger. The trigger must occur
before the action, and it must redirect the action to an
UPDATE if necessary. Let us look at examples for MySQL,
Postgres, and SQL Server.
</p>
<h2>A MySQL Trigger</h2>
<p>Alas, MySQL giveth, and MySQL taketh away. You cannot code
your own UPSERT in MySQL because of an extremely severe
limitation in MySQL trigger rules. A MySQL trigger <i>may not
affect a row in a table different from the row originally
affected by the command that fired the trigger.</i> So a
trigger fired by an INSERT cannot reach over and update
some other existing row instead.
</p>
<p><i>Note: I may be wrong about this. This limitation has bitten
me on several features that I would like to provide for MySQL.
I am actually hoping this limitation will not
apply for UPSERTs because the new row does not yet exist, but
I have not had a chance yet to try.</i>
</p>
<h2>A Postgres Trigger</h2>
<p>The Postgres trigger example is pretty simple, hopefully the
logic is self-explanatory. As with all code samples, I did
this off the top of my head, you may need to fix a syntax
error or two.
</p>
<pre>
CREATE OR REPLACE FUNCTION orderlines_insert_before_F()
RETURNS TRIGGER
AS $BODY$
DECLARE
result INTEGER;
BEGIN
SET SEARCH_PATH TO PUBLIC;
-- Find out if there is a row
result := (select count(*) from orderlines
where order_id = new.order_id
and sku = new.sku);
-- On the update branch, perform the update
-- and then return NULL to prevent the
-- original insert from occurring
IF result = 1 THEN
UPDATE orderlines
SET qty = new.qty
WHERE order_id = new.order_id
AND sku = new.sku;
RETURN null;
END IF;
-- The default branch is to return "NEW" which
-- causes the original INSERT to go forward
RETURN new;
END; $BODY$
LANGUAGE 'plpgsql' SECURITY DEFINER;
-- That extremely annoying second command you always
-- need for Postgres triggers.
CREATE TRIGGER orderlines_insert_before_T
before insert
ON ORDERLINES
FOR EACH ROW
EXECUTE PROCEDURE orderlines_insert_before_F();
</pre>
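<p>As a quick way to play with the idea, here is a rough SQLite
translation of the same trigger. SQLite cancels the pending row with
RAISE(IGNORE) inside a BEFORE trigger instead of returning NULL; any
changes the trigger made before the RAISE (our UPDATE) are kept. As
with the other samples, consider this a sketch rather than production
code:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orderlines (
  order_id INTEGER,
  sku      TEXT,
  qty      INTEGER,
  PRIMARY KEY (order_id, sku)
);
-- If the row already exists, update it and silently skip the insert
CREATE TRIGGER orderlines_insert_before
BEFORE INSERT ON orderlines
WHEN EXISTS (SELECT 1 FROM orderlines
              WHERE order_id = NEW.order_id AND sku = NEW.sku)
BEGIN
  UPDATE orderlines SET qty = NEW.qty
   WHERE order_id = NEW.order_id AND sku = NEW.sku;
  SELECT RAISE(IGNORE);
END;
""")

# The application happily issues plain INSERTs both times
conn.execute("INSERT INTO orderlines (order_id, sku, qty) VALUES (1234, 'ABC', 5)")
conn.execute("INSERT INTO orderlines (order_id, sku, qty) VALUES (1234, 'ABC', 9)")

rows = conn.execute("SELECT order_id, sku, qty FROM orderlines").fetchall()
print(rows)
```

<p>The second INSERT never raises a key violation; it is quietly
converted into an UPDATE, leaving a single row with the new quantity.</p>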
<h2>A SQL Server Trigger</h2>
<p>SQL Server has no BEFORE INSERT trigger; the closest equivalent
is an INSTEAD OF INSERT trigger, which differs significantly
from the Postgres version. First of all, it operates at the
<i>statement level</i>, so that you have a set of new rows instead
of just one. Secondly, the trigger must itself contain an
explicit INSERT command, or the INSERT never happens. All of this
means our SQL Server example is quite a bit more verbose.
</p>
<p>The basic logic of the SQL Server example is the same as the
Postgres, with two additional complications. First, we must use
a CURSOR to loop through the incoming rows. Second, we must
explicitly code the INSERT operation for the case where it
occurs. But if you can see past the cruft we get for all of that,
the SQL Server example is doing the same thing:
</p>
<pre>
CREATE TRIGGER upsource_insert_before
ON orderlines
INSTEAD OF insert
AS
BEGIN
SET NOCOUNT ON;
DECLARE @new_order_id int;
DECLARE @new_sku varchar(15);
DECLARE @new_qty int;
DECLARE @result int;
DECLARE trig_ins_orderlines CURSOR FOR
SELECT order_id,sku,qty FROM inserted;
OPEN trig_ins_orderlines;
FETCH NEXT FROM trig_ins_orderlines
INTO @new_order_id
,@new_sku
,@new_qty;
WHILE @@Fetch_status = 0
BEGIN
-- Find out if there is a row now
SET @result = (SELECT count(*) from orderlines
WHERE order_id = @new_order_id
AND sku = @new_sku
);
IF @result = 1
BEGIN
-- Since there is already a row, do an
-- update
UPDATE orderlines
SET qty = @new_qty
WHERE order_id = @new_order_id
AND sku = @new_sku;
END
ELSE
BEGIN
-- When there is no row, we insert it
INSERT INTO orderlines
(order_id,sku,qty)
VALUES
(@new_order_id,@new_sku,@new_qty);
END
-- Pull the next row
FETCH NEXT FROM trig_ins_orderlines
INTO @new_order_id
,@new_sku
,@new_qty;
END -- Cursor iteration
CLOSE trig_ins_orderlines;
DEALLOCATE trig_ins_orderlines;
END
</pre>
<h2>A Vague Uneasy Feeling</h2>
<p>While the examples above are definitely cool and nifty,
they ought to leave a certain nagging doubt in many
programmers' minds. This doubt comes from the fact that
an <i>insert is not necessarily an insert anymore</i>,
which can lead to confusion. Just imagine the new programmer
who has joined the team and is banging his head on his desk
because he cannot figure out why his INSERTS are not
working!
</p>
<p>We can add a refinement to the process by making the
UPSERT behavior optional. Here is how we do it.
</p>
<p>First, add a column to the ORDERLINES table called
_UPSERT that is a char(1). Then modify the trigger so that
the UPSERT behavior only occurs if this column holds
'Y'. It is also extremely important to always set this value
back to 'N' or NULL in the trigger, otherwise it will appear
as 'Y' on subsequent INSERTS and it won't work properly.
</p>
<p>So our new modified explicit upsert requires a SQL statement
like this:
</p>
<pre>
INSERT INTO ORDERLINES
(_upsert,order_id,sku,qty)
VALUES
('Y',1234,'ABC',5)
</pre>
<p>Our trigger code needs only a very slight modification.
Here is the Postgres example, the SQL Server example should
be very easy to update as well:
</p>
<pre>
...trigger declaration and definition above
IF new._upsert = 'Y' THEN
    result := (SELECT.....);
    new._upsert := 'N';
ELSE
    result := 0;
END IF;
...rest of trigger is the same
</pre>
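<p>Here is the optional-flag idea in the same SQLite sketch. One
wrinkle specific to this translation: SQLite triggers cannot assign
to NEW, so instead of clearing the flag inside the BEFORE trigger,
we clear it with a second AFTER INSERT trigger. Again, treat this as
an illustrative sketch, not the canonical implementation:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orderlines (
  _upsert  TEXT,
  order_id INTEGER,
  sku      TEXT,
  qty      INTEGER,
  PRIMARY KEY (order_id, sku)
);
-- Upsert branch fires only when the flag is set AND the row exists
CREATE TRIGGER orderlines_insert_before
BEFORE INSERT ON orderlines
WHEN NEW._upsert = 'Y'
 AND EXISTS (SELECT 1 FROM orderlines
              WHERE order_id = NEW.order_id AND sku = NEW.sku)
BEGIN
  UPDATE orderlines SET qty = NEW.qty
   WHERE order_id = NEW.order_id AND sku = NEW.sku;
  SELECT RAISE(IGNORE);
END;
-- SQLite cannot modify NEW, so clear the flag after a real insert
CREATE TRIGGER orderlines_insert_after
AFTER INSERT ON orderlines
WHEN NEW._upsert = 'Y'
BEGIN
  UPDATE orderlines SET _upsert = NULL
   WHERE order_id = NEW.order_id AND sku = NEW.sku;
END;
""")

# Explicit upserts carry the flag; a plain insert does not
conn.execute("INSERT INTO orderlines (_upsert,order_id,sku,qty) VALUES ('Y',1234,'ABC',5)")
conn.execute("INSERT INTO orderlines (_upsert,order_id,sku,qty) VALUES ('Y',1234,'ABC',9)")
conn.execute("INSERT INTO orderlines (order_id,sku,qty) VALUES (1234,'DEF',2)")

rows = conn.execute(
    "SELECT order_id, sku, qty, _upsert FROM orderlines ORDER BY sku").fetchall()
print(rows)
```

<p>The flagged second insert becomes an update, the unflagged insert
behaves normally, and no stored row is left holding 'Y'.</p>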
<h2>Conclusion</h2>
<p>The UPSERT feature gives us simplified code and fewer
round trips to the server. Without the UPSERT there are
times when the application may have to query the server to
find out if a row exists, and then issue either an UPDATE
or an INSERT. With the UPSERT, one round trip is eliminated,
and the check occurs much more efficiently inside of the
server itself.
</p>
<p>The downside to UPSERTs is that they can be confusing if
some type of explicit control is not put onto them such as
the _UPSERT column.
</p>
<p>Next week we will see a concept similar to UPSERT used
to efficiently create summary tables.
</p>KenDownshttp://www.blogger.com/profile/11117175783163937575noreply@blogger.com34tag:blogger.com,1999:blog-426922399870577072.post-26764730186094051602009-04-19T16:50:00.000-04:002010-11-28T22:10:05.094-05:00The Relational Model<p>
If you look at any system that was born on and for the
internet, like Ruby on Rails, or the PHP language, you find
an immense wealth of resources on the internet itself, in
endless product web sites, blogs, and forums. But when
you look for the same comprehensive information on products
or ideas that matured before the web you find it is not there.
Relational databases stand out as a product family that matured
before the internet, and so their representation in cyberspace
is very different from the newer stuff.
</p>
<h2>The Math Stuff</h2>
<p>You may have heard relational theorists argue that the
strength of relational databases comes from their solid
mathematical foundations. Perhaps you have wondered,
what does that mean? And why is that good?
</p>
<p>To understand this, we have to begin with
<a href="http://en.wikipedia.org/wiki/Edsger_W._Dijkstra"
>Edsger W. Dijkstra</a>, a pioneer in the area of computer
science with many accomplishments to his name. Dijkstra
believed that the best way to develop a system or program
was to begin with a mathematical description of the system,
and then refine that system into a working program. When
the program completely implemented the math, you were
finished.
</p>
<p>There is a really huge advantage to this approach. If you
start out with a mathematical theory of some sort, which
presumably has well known behaviors, then the working program
will have all of those behaviors and, put simply, everybody
will know what to expect of it.
</p>
<p>This approach also reduces time wasted on creative efforts
to work out how the program should behave. All those
decisions collapse into the simple drive to make the program
mimic the math.
</p>
<h2>A Particular Bit of Math</h2>
<p>It so happens that there is a particular body of math
known as Relational Theory, which it seemed to
<a href="http://en.wikipedia.org/wiki/Edgar_F._Codd"
>E. F. Codd</a> would be a very nice fit for storing
business information. In his landmark 1970 paper
<a href="http://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf"
>A Relational Model of Data for Large Shared Data Banks
(pdf) </a> he sets out to show how these mathematical
things called "relations" have behaviors that would be
ideal for storing business models.
</p>
<p>If we take the Dijkstra philosophy seriously, which is to
build systems based on well-known mathematical theories,
and we take Codd's claim that "Relations" match well to
business record-keeping needs, the obvious conclusion is
that we should build some kind of "Relational" datastore,
and so we get the Relational Database systems of today.
</p>
<p>So there in a nutshell is why relational theorists are
so certain of the virtues of the relational model: its
behaviors are well-known, and if you can build something
that matches them, you will have a very predictable
system.
</p>
<h2>They are Still Talking About It</h2>
<p>If you want to know more about the actual mathematics,
check out the <a href=
"http://groups.google.com/group/comp.databases.theory/topics"
>comp.databases.theory</a> Usenet group, or check out
Wikipedia's articles on <a href=
"http://en.wikipedia.org/wiki/Relational_algebra"
>Relational Algebra</a> and <a href=
"http://en.wikipedia.org/wiki/Relational_calculus"
>Relational Calculus</a>.
</p>
<h2>A Practical Downside</h2>
<p>The downside to all of this comes whenever the mathematical
model describes behaviors that are contrary to human goals
or simply irrelevant to them. Examples are not hard to
find.
</p>
<p>When the web exploded in popularity, many programmers found
that their greatest data storage needs centered on <i>documents</i>
like web pages rather than <i>collections of atomic values</i>
like a customer's discount code or credit terms. They found
that relational databases were just not that good at storing
documents, which only stands to reason because they were never
intended to. In <i>theory</i> the model could be stretched,
(if the programmer stretched as well), but the programmers
could feel in their bones that the fit was not right, and they
began searching for something new.
</p>
<p>Another example is that of calculated values. If you have
shopping cart, you probably have some field "TOTAL" somewhere
that stores the final amount due for the customer. It so
happens that such a thing violates relational theory, and there
are some very bright theorists who will refuse all requests
for assistance in getting that value to work, because you
have violated their theory. This is probably the most shameful
behavior that relational theorists exhibit - a complete
refusal to consider extending the model to better reflect
real world needs.
</p>
<h2>The Irony: There are No Relational Databases</h2>
<p>The irony of it all is that when programmers set out to build
relational systems, they ran into quite a few practical
downsides and a sort of consensus was reached to break the
model and create the SQL-based databases we have today.
In a <i>truly relational</i> system a table would have quite
a few more rules on it than we have in our SQL/TABLE based
systems of today. But these rules must have seemed
impractical or too difficult to implement, and they were
scratched.
</p>
<p>There is at least one product out there that claims to
be truly relational, that is <a href=
"http://en.wikipedia.org/wiki/Dataphor">Dataphor</a>.</p>
<h2>The Weird Optional Part</h2>
<p>Probably the grandest irony in the so-called relational
database management systems is that any programmer can
completely break the relational model by making bad
table designs. If your tables are not normalized, you
lose much of the benefits of the relational model,
and you are completely free to make lots of
non-normalized and de-normalized tables.
</p>
<h2>Conclusion</h2>
<p>I have to admit I have always found the strength of
relational databases to be their simplicity and power,
and not so much their foundations (even if shaky) in
mathematical theory. A modern database is very good
at storing data in tabular form, and if you know how
to design the tables, you've got a great foundation for
a solid application. Going further, I've always found
relational theorists to be unhelpful in the extreme in
the edge cases where overall application needs are not
fully met by the underlying mathematical model. The
good news is that the products themselves have all of
the power we need, so I left the relational theorists
to their debates years ago.
</p>KenDownshttp://www.blogger.com/profile/11117175783163937575noreply@blogger.com11tag:blogger.com,1999:blog-426922399870577072.post-44444607801556029592009-03-01T19:08:00.003-05:002010-11-28T22:10:19.298-05:00I Am But a Humble Filing Clerk<p>This week we are returning to the series on Philosophy,
and we will nail down the role of data and the
database in any application that requires such
things.
</p>
<p>This is the Database Programmer blog, for anybody who wants
practical advice on database use.</p>
<p>There are links to other essays at the <a href="#bottom">bottom of this post</a>.</p>
<p>This blog has two tables of contents, the
<a href="http://database-programmer.blogspot.com/2008/09/comprehensive-table-of-contents.html">Topical Table of Contents</a> and the list
of
<a href="http://database-programmer.blogspot.com/2010/11/database-skills.html">Database Skills</a>.
</p>
<h2>Review of The Absolute</h2>
<p>In the first essay in this series, <a href=
"http://database-programmer.blogspot.com/2008/09/quest-for-absolute.html"
>The Quest For the Absolute</a>, I offered the opinion
that all programmers by nature seek absolutes to
simplify and inform the development effort. Taking
a page from the ancient Greek philosopher Aristotle,
I suggested that the best absolute was the quest for
the <i>virtuous</i> program, which is to say a program
that served its purpose well.
</p>
<p>A program that serves its purpose well is one that
meets the needs of the check signer, the end-user,
and the programmer. The check signer needs some
strategic goal to be met, the end-user must be
productive, and the programmer must make a living.
If a program achieves all of these, it is an ideal
virtuous program, and has satisfied the absolute
requirements that are true of all programs.
</p>
<h2>Considering the Decision Maker</h2>
<p>Normally we think of a decision maker as some important
person who has the power to choose your product or
services, or to give her money to your competitor.
She makes her decision based on how well she can judge
who will meet her strategic needs.
</p>
<p>Although the decision maker will have vastly different
needs in different situations, and is usually thinking
at a high level, she has at least one
need that is universal: the simple need to keep and
use records. She needs a filing system. All of her
stated goals will <i>assume</i> that you both know
this unstated goal is down there at the foundation of
the entire proposed system.
</p>
<p>We programmers often forget this simple fact because
computers have been around long enough that we
do not remember that in their original forms it was
impossible to mistake that computers were just
electronic filing systems. Way back when
I was a kid the day came when phone bills started
arriving with an "IBM Card" slipped into them. You
returned the card with your check -- they were moving
their files into the electronic age. Then came
electronic tickets on airlines -- nothing more than
a big filing system. The modern web sites we visit
to buy tickets are nothing but an interface to what
remains a filing system at its heart.
</p>
<h2>The Virtuous Programmer</h2>
<p>So if we go back to the idea of "virtue" as the Greeks
thought of it, which means serving your function well,
a virtuous programmer will remember always that he
is but humble filing clerk. This is not his entire
purpose, but it is the beginning of all other purposes
and the foundation that the higher purposes are
built upon.
</p>
<h2>Not Just Relational</h2>
<p>This principle is general to all programming. An
email server is a program that must receive and
store email for later retrieval. What good is an
email server that cannot store anything? What
good is a camera without its memory card? What
good is a mobile phone without its contacts list?
What good is a image editing program if it cannot
read and write files?
</p>
<p>So all programs exist to process data, and the
business application programmer knows that in his
context this means we are really making giant sexy
record-keeping systems. We are the guys that
color-code the filing cabinets.
</p>
<h2>Does Not Mean Relational Is Required</h2>
<p>This idea, that we are filing clerks, does not
automatically mean we must pick relational databases
for the persistence layer -- the question of what
filing system to use is a completely different
question.
</p>
<h2>Conclusion</h2>
<p>If we begin with the idea that the ideal program
meets the needs of decision maker, end-user, and
programmer, and if we consider first the needs of
the decision maker, then we begin with the universal
strategic need to keep good records. The ideal
programmer knows this need is at the bottom of all
other needs, and remembers always that we are but
humble filing clerks.
</p>
<a name="bottom">
<h2>Related Essays</h2>
</a>
<p>This blog has two tables of contents, the
<a href="http://database-programmer.blogspot.com/2008/09/comprehensive-table-of-contents.html">Topical Table of Contents</a> and the list
of
<a href="http://database-programmer.blogspot.com/2010/11/database-skills.html">Database Skills</a>.
</p>
<p>Other philosophy essays are:</p>
<ul><li><a href="http://database-programmer.blogspot.com/2010/11/prepare-now-for-possible-future-head.html"
>Prepare Now For Possible Future Head Transplant</a>
<li><a href="http://database-programmer.blogspot.com/2008/09/quest-for-absolute.html"
>The Quest for The Absolute</a>
<li><i><a href="http://database-programmer.blogspot.com/2009/03/i-am-but-humble-filing-clerk.html"
>I Am But A Humble Filing Clerk (this essay)</a></i>
<li><a href="http://database-programmer.blogspot.com/2008/06/why-i-do-not-use-orm.html"
>Why I Do Not Use ORM</a>
<li><a href="http://database-programmer.blogspot.com/2008/05/minimize-code-maximize-data.html"
>Minimize Code, Maximize Data</a>
</ul>KenDownshttp://www.blogger.com/profile/11117175783163937575noreply@blogger.com3tag:blogger.com,1999:blog-426922399870577072.post-75500153382018360872009-02-14T14:05:00.002-05:002009-02-14T14:08:09.923-05:00A Comprehensive Database Security Model<p>This week I am taking a bit of a departure. Normally I write
about things I have already done, but this week I want to
speculate a bit on a security model I am thinking of coding
up. Basically I have been asking myself how to create a
security model for database apps that never requires elevated
privileges for code, but still allows for hosts sharing multiple
applications, full table security including row level and
column level security, and structural immunity to SQL injection.
</p>
<h2>The Functional Requirements</h2>
<p>Let's consider a developer who will be hosting multiple
database applications on a server, sometimes instances of the
same application for different customers. The applications
themselves will have different needs, but they all boil down
to this:
</p>
<ul><li>Some applications will allow surfers to join the site
and create accounts for themselves, while others will be
private sites where an administrator must make user accounts.
<li>Some applications will not contain sensitive data, and
so the site owner wants to send forgotten passwords in email
-- which means the passwords must be stored in plaintext. Other
site owners will need heightened security that disallows
storing of passwords in plaintext.
<li>In both cases, administrators must of course be able to
manage accounts themselves.
<li>The system should be structurally immune
to SQL injection.
<li>It must be possible to have users with the same user id
("Sheila", "John", etc.) on multiple applications who are
actually totally different people.
<li>The application code must never need to run at an
elevated privilege level for any reason -- not
even to create accounts on public sites where
users can join up and conduct transactions.
<li>It must be possible for the site owners or their
agents to directly
connect to the database, at the very least for querying and
possibly to do database writes without going through our
application.
<li>Users with accounts on one app must never be able to
sign on to another app on the same server.
</ul>
<p>These requirements represent the most flexible possible
combination of demands that I have so far seen in real life.
The question is, can they be met while still providing
security? The model I'd like to speculate on today says
yes.</p>
<h2>Informed Paranoia Versus Frightened Ignorance</h2>
<p>Even the most naive programmer knows that the internet
is not a safe place, but all too often a lot of security
advice you find is based on <i>frightened ignorance</i>
and takes the form, "never do x, you don't know what might
happen." If we are to create a strong security model,
we have to do better than this.
</p>
<p>Much better is to strive to be like a strong system architect,
whose approach is based on <i>informed paranoia</i>.
This hypothetical architect knows everybody is out
to compromise his system, but he seeks a thorough knowledge
of the inner workings of his tools so that he can
engineer the vulnerabilities out as much as possible.
He is not looking to write rules for the programmer
that say "never do this", he is rather looking to make it
impossible for the user or programmer to compromise
the system.
</p>
<h2>Two Examples</h2>
<p>Let us consider a server hosting two applications, which
are called "social" and "finance".
</p>
<p>The "social" application is a social networking site with
minimal security needs. Most important is that the site
owners want members of the general public to sign up, and
they want to be able to email forgotten passwords
(and we can't talk them out of it) -- so we
have to store passwords in plaintext.
</p>
<p>The "finance" application is a private site used by employees
of a corporation around the world. The general public is
absolutely not welcome. To make matters worse however, the
corporation's IT department demands to be able to directly
connect to the database and write to the database without
going through the web app. This means the server will have
an open port to the database. Sure it will be protected with
SSL and passwords, but we must make sure that only users
of "finance" can connect, and only to their own application.
</p>
<h2>Dispensing With Single Sign-On</h2>
<p>There are two ways to handle connections to a database. One
model is to give users real database accounts, the other is
to use a single account to sign on to the database. Prior to
the web coming along, there were proponents of both models in
the client/server world, but amongst web developers the single
sign-on method is so prevalent that I often wonder if they
know there is any other way to do it.
</p>
<p>Nevertheless, we must dispense with the single sign-on method
at the start, regardless of how many people think that Moses
carved it on the third tablet, because it just has too many
problems:
</p>
<ul><li>Single Sign-on <b>is the primary architectural flaw that makes
SQL injection possible</b>. As we will see later, using real
database accounts makes your site (almost) completely immune
to SQL injection.
<li>Single Sign-on requires a connection at the maximum privilege
level that any system user might have, where the code then decides
what it will let a particular user do. This is a complete
violation of the requirement that code always run at the lowest
possible privilege level.
<li>Single Sign-on totally prevents the requirement that
authorized agents be allowed to connect to the database and
directly read and write values.
</ul>
<p>So single sign-on just won't work with the requirements listed.
This leads us to creating real accounts on the database server.
</p>
<h2>Real Accounts and Basic Security</h2>
<p>When you use a real database account, your code connects
to the database using the username and password provided
by the user. Anything he is allowed to do your code will
be allowed to do, and anything he is not allowed to do will
throw an error if your code tries to do it.
</p>
<p>This approach meets quite a few of our requirements nicely.
A site owner's IT department can connect with the same
accounts they use on the web interface -- they have
the same privileges in both cases. Also, there is no
need to ever have application code elevate its privilege
level during normal operations, since no regular users should ever be
doing that. This still leaves the issue of how to create
accounts, but we will see that below.
</p>
<p>A programmer who thinks of security in terms of <i>what code
can run</i> will have a very hard time wrapping his head around
using real database accounts for public users. The trick to
understanding this approach
is to forget about code for a minute and to
think about tables. The basic fact of database application
security is that <i>all security
resolves to table permissions</i>. In other words, our security
model is all about who can read or write to what tables, it is
not about who can run which program.
</p>
<p>If we grant public users real database accounts, and they
connect with those accounts, the security must be handled
within the database itself, and it comes down to:
</p>
<ul><li>Defining "groups" as collections of users who share
permissions at the table level.
<li>Deciding which groups are allowed select, insert, update,
and delete privileges on which tables.
<li>Granting and revoking those privileges on the server itself
when the database is built.
<li>At very least row-level security will be required, wherein
a user can only see and manipulate certain rows in a table.
This is how you keep users from using SQL Injection to mess
with each other's order history or member profiles.
<li>Column security is also very nice to finish off the
picture, but we will not be talking about that today as it
does not play into the requirements.
</ul>
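<p>As a sketch, the grants described above might look like this in
PostgreSQL. The group and table names here are illustrative, not
taken from any real application:
</p>

```sql
-- Deny-by-default: a freshly built table grants nothing to anybody.
REVOKE ALL ON profiles FROM PUBLIC;

-- A "group" is simply a role that cannot log in by itself.
CREATE ROLE social_members NOLOGIN;

-- Members may read and write the table (row-level limits come later).
GRANT SELECT, UPDATE ON profiles TO social_members;

-- A real user account inherits the group's table privileges:
-- GRANT social_members TO social_johnsmith;
```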
<p>Now we can spend a moment and see why this approach eliminates
most SQL Injection vulnerabilities. We will imagine a table of
important information called SUPERSECRETS. If somebody could
slip in a SQL injection exploit and wipe out this table we'd all
go to jail, so we absolutely cannot allow this.
Naturally, most users would have no privileges on
this table -- even though they are directly connected to the
database they cannot even see the table exists, let alone
delete from it. So if our hypothetical black hat
somehow slips in ";delete from supersecrets"
and our code fails to trap for it, nothing happens. They have
no privilege on that table. On the other side of things, consider
the user who is privileged to delete from that table. If this
user slips in a ";delete from supersecrets" he is only going to
the trouble of SQL Injection <i>to do something he is perfectly
welcome to do anyway through the user interface.</i> So much
for SQL injection.
</p>
<p>To repeat a point made above: row-level security is a must.
If you grant members of a social site global UPDATE privileges
on the PROFILES table, and you fail to prevent a SQL Injection,
all hell could break loose. Much better is the ability to
limit the user to seeing only his own row in the PROFILE table,
so that once again you have created a structural immunity
to SQL injection.
</p>
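<p>Modern versions of PostgreSQL can express "only his own row"
directly in the database with row-level security policies. A
minimal sketch, where the table and column names are assumptions
for illustration:
</p>

```sql
-- Each member sees and updates only the row whose username
-- matches the account he connected with.
ALTER TABLE profiles ENABLE ROW LEVEL SECURITY;

CREATE POLICY own_profile ON profiles
    USING (username = current_user)         -- rows the user may read
    WITH CHECK (username = current_user);   -- rows the user may write
```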
<h2>Anonymous Access</h2>
<p>Many public sites allow users to see all kinds of information
when they are not logged on. The most obvious example would
be an eCommerce site that needs read access to the ITEMS table,
among others. Some type of anonymous access must be allowed
by our hypothetical framework.
</p>
<p>For our two examples, the "social" site might allow limited
viewing of member profiles, while the "finance" application
must show absolutely nothing to the general public.
</p>
<p>If we want a general solution that fits both cases, we opt
for a <i>deny-by-default</i> model and allow each application
to optionally have an anonymous account.
</p>
<p>First we consider deny-by-default. This means simply that
our databases are always built so that no group has any
permissions on any tables. The programmer of the "social"
site now has to grant certain permissions to the anonymous
account, while the programmer of the "finance" application
does nothing -- he already has a secure system.
</p>
<p>But still the "finance" site is not quite so simple. An anonymous
user account with no privileges <i>can still log in</i>, and
that should make any informed paranoid architect nervous.
We should extend
the deny-by-default philosophy so the framework will
not create an anonymous
account unless requested. This way the programmer of the
"finance" application still basically does nothing, while
the programmer of the "social" must flip a flag to create
the anonymous account.
</p>
<h2>Virtualizing Users</h2>
<p>If we are having real database accounts, there is one small
detail that has to be addressed. If the "social" site has
a user "johnsmith" and the finance application has a user
of the same name, but they are totally different people,
we have to let both accounts exist but be totally separate.
</p>
<p>The answer here is to alias the accounts. The database
server would actually have accounts "finance_johnsmith" and
"social_johnsmith". Our login process would simply take
the username provided and append the code in front of it
when authenticating on the server. 'nuf said on that.
</p>
<h2>Allowing Public Users To Join</h2>
<p>The "social" site allows anybody to join up and create
an account. This means that somehow the web application
must be able to create accounts on the database server.
Yet it must do this without allowing the web code to
elevate its privileges, and while preventing the disaster
that would ensue if a user on the "social" site somehow
got himself an account on the "finance" site.
</p>
<p>Believe it or not, this is the easy part! Here is how it
works for the "social" site:
</p>
<ul><li>Create a table of users. The primary key is the user_id
which prevents duplication.
<li>For the social site, there is a column called
PASSWORD that stores the password in plaintext.
<li>Allow the anonymous account to INSERT into this table!
(Remember though that deny-by-default means that so far
this account has no other privileges).
<li>Put an INSERT trigger on the table that automatically creates
an aliased user account, so that "johnsmith" becomes
"social_johnsmith". The trigger also sets the password.
<li>A DELETE trigger on the table would delete users if
the row is deleted.
<li>An UPDATE trigger on the table would update the password
if the user UPDATES the table.
<li>Row level security is an absolute must.
Users must be able to
SELECT and UPDATE the table, but only their own row. If your
database server or framework cannot support row-level
security, it's all out the window.
</ul>
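<p>The INSERT trigger that turns a row in the users table into a
real aliased account could be sketched as follows in PL/pgSQL. It
must be declared SECURITY DEFINER, since the anonymous account
itself has no right to create users; all names here are
hypothetical:
</p>

```sql
CREATE FUNCTION make_member_account() RETURNS trigger AS $$
BEGIN
    -- Alias the account: "johnsmith" becomes "social_johnsmith".
    EXECUTE format('CREATE USER %I PASSWORD %L',
                   'social_' || NEW.user_id, NEW.password);
    -- Join the new account to the members group.
    EXECUTE format('GRANT social_members TO %I',
                   'social_' || NEW.user_id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;

CREATE TRIGGER users_after_insert
    AFTER INSERT ON users
    FOR EACH ROW EXECUTE PROCEDURE make_member_account();
```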
<p>This gives us a system that almost gets us where we need
to be: the general public can create accounts,
the web application does not need to elevate its privileges,
users can set and change their passwords, and no user can
see or set anything for any other user. However, this leaves
the issue of password recovery.
</p>
<p>In order to recover passwords and email them to members of
the "social" site, it is tempting to think that
the anonymous account must be able to
somehow read the users table, but that is no good because
then we have a <i>structural flaw</i> where a successful
SQL injection would expose user accounts. However, this
also turns out to be easy. There are two options:
</p>
<ul><li>Write a stored procedure that the anonymous user is
free to execute, which does not return a password but
actually emails it directly from within the database
server. This requires your database server be able to
send emails. (Postgres can, and I assume SQL Server
can, and I don't really know about mySql).
<li>Create a table for password requests, allow inserts
to it but nothing else. A trigger sends the email.
In this approach you can track email recovery requests.
</ul>
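<p>The second option is particularly easy to make structurally safe.
A write-only table (names illustrative) means that even a successful
injection can do nothing worse than trigger a password email:
</p>

```sql
CREATE TABLE password_requests (
    request_id  serial PRIMARY KEY,
    user_id     text NOT NULL,
    asked_at    timestamp DEFAULT now()
);

-- The anonymous account may ask, but never read or change anything.
REVOKE ALL ON password_requests FROM PUBLIC;
GRANT INSERT ON password_requests TO social_anonymous;

-- An AFTER INSERT trigger looks up the address and sends the mail.
```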
<p>For the "finance" application we cannot allow any of this
to happen, so again we go to the deny-by-default idea. All
of the behaviors above will not happen unless the programmer
sets a flag to turn them on when the database is built.
</p>
<p>This does leave the detail of how users of the "finance"
application will reset their passwords.
For details on how a secure app can still allow password
resets, see my posting of Sept 7 2008 <a href=
"http://database-programmer.blogspot.com/2008/09/advanced-table-design-secure-password.html"
>Secure Password Resets</a>.
</p>
<h2>One More Detail on Public Users</h2>
<p>We still have one more detail to handle for public users.
Presumably a user, having joined up, has more privileges than
the anonymous account. So the web application must be able
to join them into a group without elevating its privileges.
The solution here is the same as for creating the account:
there will be a table that the anonymous user can make
inserts into (but nothing else), and a trigger will join
the user to whatever group is named.
</p>
<p>Except for one more detail. We cannot let the user join
whatever group they want, only the special group for members.
This requirement can be met by defining the idea of a "freejoin"
group and also a "solo" group. If the anonymous user inserts
into a user-group table, and the requested group is flagged
as allowing anybody to join, the trigger will allow it, but
for any other group the trigger will reject the insert.
The "solo" idea is similar, it means that if a user is in
the "members" group, and that group is a "solo" group, they
may not join any other groups. This further jails in
members of the general public.
</p>
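<p>The "freejoin" check belongs in the trigger on the users-to-groups
table. A sketch, with hypothetical table and flag names:
</p>

```sql
CREATE FUNCTION check_freejoin() RETURNS trigger AS $$
BEGIN
    -- Reject any group that is not flagged as open to the public.
    IF NOT EXISTS (SELECT 1 FROM groups
                    WHERE group_id = NEW.group_id
                      AND freejoin) THEN
        RAISE EXCEPTION 'group % is not open for self-join',
                        NEW.group_id;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
```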
<h2>Almost Done: User Administration</h2>
<p>In the last two sections we saw the idea of a table of users
and a cross-reference of users to groups. This turns out to
solve another issue we will have: letting administrators
manage groups. If we define a group called "user_administrators"
and give them total
power on these tables, and also give them CRUD screens
for them, then we have a user administrator system.
This works for both the "social" and the "finance" application.
</p>
<p>The triggers on the table have to be slightly different
for the two cases, but that is a small exercise to code
them up accordingly.
</p>
<h2>Cross-Database Access</h2>
<p>Believe it or not, the system outlined above has met all of
our requirements except one. So far we have a system that never
requires the web server to have any elevated privileges within
the database, allows members of the public to join some sites
while barring them from others, is structurally immune from
SQL injection, allows different people on different sites to
have the same user id, and allows administrators
of both sites to directly manage accounts. Moreover, we
can handle both plaintext passwords and more serious
reset-only situations.
</p>
<p>This leaves only one very thorny issue: cross-database
access. The specific database server I use most is PostgreSQL,
and this server has a problem (for this scenario),
which is that out-of-the-box, a database account can connect
to any database. This does not mean the account has any
privileges on the database, but we very seriously do not want
this to happen at all. If a member of the "social" site can
connect to the "finance" app, we have a potential vulnerability
even if he has zero privileges in that database. We would be
much happier if he could not connect at all.
</p>
<p>In Postgres there is a solution to this, but I've grown to
not like it. In Postgres you can specify that a user can only
connect to a database if they are in a group that has the
same name as the database. This is easy to set up, but it
requires changing the default configuration of Postgres.
However, for the sheer challenge of it I'd like to work out
how to do it without requiring that change. So far I'm
still puzzling this out. I'd also like to know that the
approach would work at very least on MS SQL Server and
mySql.
</p>
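<p>For reference, the Postgres configuration change mentioned above
is a one-line entry in pg_hba.conf using the keyword "samerole"
(formerly "samegroup"), which allows a connection only when the
account belongs to a group named after the database. The address
and auth-method fields below are placeholders for a real deployment:
</p>

```
# pg_hba.conf: SSL connections allowed only into databases whose
# name matches a group the connecting account belongs to.
hostssl  samerole  all  0.0.0.0/0  md5
```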
<h2>Conclusion</h2>
<p>Most of what is in this week's essay is not that radical to
any informed database veteran. But to web programmers
who were unfortunate enough to grow up in the world
of relational-databases-must-die nonsense, it is probably
hard or impossible to imagine a system where users are
connecting with real database accounts. The ironic thing
is that the approach described here is far more secure
than any single sign-on system, but it requires the programmer
to shift thinking away from action-based code-centric models
to what is really going on: table-based privileges. Once
that hurdle is past, the rest of it comes easy.
</p>KenDownshttp://www.blogger.com/profile/11117175783163937575noreply@blogger.com16tag:blogger.com,1999:blog-426922399870577072.post-91155735810132998102009-02-01T16:29:00.000-05:002009-02-01T16:30:55.164-05:00This Application Has Unique Business Rule Needs<p>No it does not. If it did, then your customer/employer
would be
doing something no other human being has ever done, which
is unlikely in the extreme. The application may be
unique in its particulars, but it is almost certainly
extremely common in its patterns. This week we will see
how "unique" needs are in fact nothing but common ordinary
development projects.
</p>
<h2>Beginning With the Conclusion</h2>
<p>I have had this conversation with many programmers over
the past few years, and it always follows the same
patterns. The easy part of the argument is showing the
programmer that what he thinks is special or unique
is in fact common. The much harder part, because it
involves the delicate human ego, is showing the programmer
that he has not seen this because he is ignorant. This
is not fun to do and I myself usually skip it; it's
usually not worth the trouble.
</p>
<h2>Path 1: Details</h2>
<p>Occasionally I speak to a programmer who thinks he has
a unique situation. His claim begins with
the mountain of details he must handle, details which appear
to be contradictory, subtle, and overall perplexing. He
wonders if some new approach is required to handle them.
</p>
<p>In answering this claim, we begin with the easy part,
showing that the situation is itself not unique. In short,
all worthwhile projects involve mountains of detail, so
there is nothing special there. When it comes to the
subtleties and the maze of exceptions and special cases,
these are common in mature businesses that have evolved
this complexity in response to business needs over the years.
So again there is nothing unique here, the programmer's
situation is again common.
</p>
<p>At this point we have to ask how the programmer will deal
with this perplexing mountain of detail. If he knows
what he is doing, he will give the general answer that he
is going to break it down as much as possible into
independent smaller problems that can be solved on their
own. Since this is nothing more than how all programmers
solve complex problems, the entire "uniqueness" claim
has completely collapsed. His project is utterly common.
</p>
<p>The much harder part of the conversation comes if the
programmer does not know how to break down
the problem. For instance, if the problem is all about
a fiendishly complex pricing system with lots of discounts
and pricing levels, and the programmer does not know that
he needs to begin with the database, and he further does not
want to hear that, well, there is not much I can do for
him. He will end up working a lot harder than he needs
to, and will probably remain convinced he is dealing with
something "unique".
</p>
<p>But let's go a little deeper into that example of the
complicated pricing system. Why do I claim that he must
start with the tables, and that he is wasting time
if he does not? Well, a complete answer is much more than
will fit here, and in fact I hit that theme over and over
in these essays, but it comes down to:
</p>
<ul><li>He must have an accurate and precise description of
the details that govern the pricing scheme. That is
what tables are for.
<li>In working out the mechanics of the tables, particularly
their primary and foreign keys, he will come to
his most complete understanding of the mechanisms
involved.
<li>When the tables completely reflect the details he
must work with, the code will just about write itself.
<li>Lastly, but probably most importantly, the customer
will expect to control the pricing system by adjusting
the parameters at all levels. Again, that is what tables
are for. The user is in control of the pricing system
if he can edit the tables (because of course he cannot
edit the code).
</ul>
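<p>To make that concrete, the skeleton of such a pricing schema might
begin like this. All names are invented for illustration; a real
system would have many more tables:
</p>

```sql
-- The discount rules live in data the customer can edit,
-- not in code he cannot.
CREATE TABLE price_levels (
    level_code   char(3)  PRIMARY KEY,   -- e.g. retail, wholesale
    description  text     NOT NULL
);

CREATE TABLE item_prices (
    item_id     int,
    level_code  char(3)  REFERENCES price_levels,
    unit_price  numeric(10,2) NOT NULL,
    PRIMARY KEY (item_id, level_code)
);
```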
<h2>Path 2: Combinations</h2>
<p>Once upon a time we had simple desktop business applications,
games, and then this weird new thing, "the web". Now they
are all mixed together, as we play games on the internet that
are tied into huge databases. Modern applications often
combine technologies that used to be comfortably separate.
On any particular project,
some of the requirements look like they
can be met with an RDBMS, some require management and
delivery of media such as MP3 or video, and the programmer is
told as well that he must provide RSS feeds and import data coming in
XML format. Perhaps as well there will be stone tablets
and papyrus scrolls.
</p>
<p>This programmer may believe he is in a unique situation
because of this <i>combination</i> of needs. Because no single
toolset out there can meet the entire project, perhaps this
is something never before seen? But this does
not hold up. Just like the argument about complexity,
he must break the problem up correctly, and when he has done
so he will have a perfectly ordinary project. Though I might
add it will also be a very <i>interesting</i> project and
probably a lot of fun.
</p>
<h2>In The End It Is All About Patterns</h2>
<p>I have given two examples above taken from my own experience
where programmers have claimed to me that they faced some
unique situation. There are many other cases, and they always
make perfect sense to the person who thinks he has discovered
something new. The biggest flaw in the programmer's thinking
is failing to distinguish between <i>particulars</i> and
<i>patterns</i>.
</p>
<p>My claim in this essay is that the patterns of all problems
are the same. Somebody has seen it before, somebody has done
it before, the answer is out there. The process of analysis
and programming is about slotting your particulars into the
patterns that have already been established.
</p>
<p>In the broadest sense all programs process data, and
particular programs break down into broad patterns of data
access and manipulation. Sometimes you have a broad range
of users putting in data with very little massaging
(think twitter) and sometimes you have one group controlling
much of the data while others make use of it (think
Amazon), and sometimes your data is mostly relational
and table based (think any ecommerce or biz app) and
sometimes it's mostly media (think youtube).
</p>
<p>Once you have these broad patterns identified, you can then
proceed to make use of established practices within
each particular area. What is the best way to provide
sensitive data on the web and protect it from unauthorized
eyes? Somebody has done it before. What is the best way
to track large amounts of media? Somebody has done it
before. What is the best way to set up a complex pricing
system with lots of discounts and pricing levels? Somebody
has done it before. In all cases, your particulars may
be different, but the patterns will be the same.
</p>
<h2>Conclusion: Find the Patterns</h2>
<p>Whenever I find myself looking at a situation that appears
to be new, I try to tell myself that it may be new to me,
but it is not likely to be new to the human race. If it
does not appear to follow a well-known pattern then I
proceed as if <i>I have not yet recognized the pattern</i>
and continue to analyze and break it apart until the pattern
emerges. So far it always has.
</p>KenDownshttp://www.blogger.com/profile/11117175783163937575noreply@blogger.com8