Wednesday, January 16, 2013

All Software Development is Schema Management

Before you automatically disagree, stop a bit and think about it.  Can you think of any code you have ever written that did not handle data of some sort?  Of course not.  This is not about relational data either, it is about any data.  There is no software that does not process data.  Even the textbook example of some function that squares a number is processing the parameter and returning a value.  The first line of that function, in most languages, names the input parameter and its type.  That is a schema.

This remains true as you climb from toy textbook examples to simple one-off programs, to small packages, and ultimately to apps that build to megabytes of executables running on a rack (or a floor) of servers.  Just as each method call has parameters, so does each class have its attributes, every table its columns, and every XML file its XSD (or so we hope), and that code works if and only if everybody on the team understands what is supposed to happen to the data.


Are We Talking UI or Platform or Middleware?


Generally today we are talking about the server side of things.  This is about any project that is going to take a user request and consult a data store: the file system, a search engine, a database, a NoSQL database, or an XML database.  If you go to the disk, that is what we are talking about.


New Versions Are (Almost Always) About Schema Changes


So imagine you've got some working code.  Maybe a single script file, or perhaps a library or package, or maybe some huge application with hundreds of files.  It is time to change it.  In my experience new code means some kind of change to the schema.

I cannot prove that, nor do I intend to try.  It is a mostly-true not an always-true.


Schema Here Does Not Mean Relational


This is not about relational schemas, though they are included.  If you are using Mongo or some other NoSQL database that does not manage the schema for you, that just means you are managing the schema yourself somewhere in code.  But since you cannot write code that has no knowledge of the structure of the data it handles, that schema will be there somewhere, and if the code changes, the schema generally changes.


Does It Need To Be Said?


Sometimes you will find yourself in a situation where people do not know this.  You will notice something is wrong; the first symptom is that the conversation does not appear to be proceeding to a conclusion.  Even worse, people do not seem to know what conclusion they are even seeking.  They do not know that they are trying to work out the schema, so they wander about the requirements trying to make sense of them.

Order and progress can be restored when somebody ties the efforts down to the discovery and detailing of the schema.  The question is usually, "What data points are we handling and what are we doing to them?"

Tuesday, January 15, 2013

Code Today's Requirements Today

In Part 1 of this series, Do You Know What Day It Is? (written a mere 6 months ago, I really ought to update this blog more often), we looked at Ken's First Law of Architecture:

Today's Constant is Tomorrow's Variable


If you do not know this, there are two mistakes you can make.

Mistake 1: You Ain't Gonna Need It


Mistake 1 is creating a variable, option, switch, parameter, or other control for something which, as far as we know today, is a constant.  You can avoid this mistake if you remember that You Ain't Gonna Need It.

It is this simple: you can invent an infinite number of variables that the user may wish to control in the future, but the chance of guessing which ones they will actually want to control is near zero.  It is a complete waste of time.

Let's take an example.  Imagine you are asked to do a task.  Any task.  It does not even have to be programming.  The task will be loaded with points where programmers invent decisions that nobody asked them to make.  For example, consider this blog post.

1) Should I have used a different background color up there in that green box?
2) Should I have used a sub-section heading by this point, maybe I should go back and do that?
3) Should this list be numbered or maybe just bullets instead?

Notice that this list is kind of arbitrary.  I did not think very hard about it; I just made up those questions.  Another person might have made up different questions.  The problem is that the list never ends.

This can never be eradicated; it is fundamental to the human condition.  Well-meaning individuals will raise irrelevancies in an effort to be helpful, and the rules of polite society work against weeding these out.  Committees will put two heads on a horse and you will end up with a bunch of YAGNI options.  But we can still fight against them.

Mistake 2: Constants in Code - Nobody Does That!


It is possible to go the other way, which is to embed constants into your code so that when you need to change them you find it is expensive, difficult and error prone.  Most of us learn not to do this so early that it seems unthinkable to us.  Why even write a blog post about it in 2013?

Well the very good reason to write about it is that we all still embed "constants in code" in ways that we never think about.

For example, if you are using a relational database, then the structure of your tables is a "constant in code": it is something woven through every line.  Your code "knows" which tables are in there and what columns they have.  When you change the structure you are changing something very much "constant" in your system.   This is why it is so important to use a lot of library code that itself reads out of data dictionaries, so that you have removed this particular "constant in code."
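
To make this concrete, here is a minimal sketch, using the standard information_schema, of the kind of dictionary read such library code performs so that column names stay out of your application code (the table name "tickets" is my own invented example):

  -- Discover a table's columns at runtime instead of hard-coding them.
  -- "tickets" is a hypothetical table name, used only for illustration.
  SELECT column_name, data_type, is_nullable
    FROM information_schema.columns
   WHERE table_name = 'tickets'
   ORDER BY ordinal_position;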

The great problem here is that we cannot see the future, and you never think that a basic feature of your system is an embedded constant in code.  In a simpler period in my life a system lived on a server.  One server, that was it, who ever heard of a server farm?  That was a constant: number of servers=1, and all of my code assumed it would always be true.  I had to weed out that constant from my code.

Those who understand this basic inability to see the future are most at risk of slipping into over-generalization and generating YAGNI features.  It is an attempt to fight the very universe itself, and you cannot win.

Keep it Simple, Keep it Clean


In the end you can only write clean code, architect clean systems, follow the old knowledge and be the wise one who learns to expect the unexpected.

Saturday, June 30, 2012

Do You Know What Day It Is?

Have you ever sat in on a meeting like this?

Person 1:  The customer wants a green background on this page.

Person 2: Do we know they're going to stick with green?  Maybe we need a configuration option for background color.

Nobody wants to disagree and the manager never comes to these boring design meetings so there is quiet before somebody says...

Person 3: A configuration option just for background color is kind of weak, we'd be better off allowing choices for all colors.

Person 4: Is this an instance level configuration, or do we let the end-user pick her color?

Person 2 (the original troublemaker): well it really has to be instance level, but the end-user can override, that way we satisfy all possibilities.

From here the debate continues until they've decided to create a skinning system with lots of UI color pickers and other stuff, and they wrap up with:

Person 6: Well this sounds great, but what should we make the default background color?

...and nobody can remember that:

The customer specifically asked for green, only green, and nothing but green.

Is this an exaggeration?  Perhaps, but I've sat in meetings that come close.  Seems to me that I've sat in meetings that were worse, but I'd hate to slander anybody, and being as I'm not perfect myself, we'll leave that one alone for now.

So anyway, there were many things going wrong in that meeting, but we're going to stick with the simple fact that nobody there knew this one simple rule, the first rule of system design, which should be burned into your brain, which you should repeat before and during any design or architecture meeting:

Today's Constant is Tomorrow's Variable.

This is not so much a rule as an observation, but when you realize that it is true on almost any level, it can become a guiding rule, something that actually lets you make decisions with confidence, decisions that turn out to work well.

The problem is that most people in technology don't know what day it is.  They make one of two mistakes; they either:
  1. constantly plan, estimate, design, or program for tomorrow's variable when all they need to do is handle the simpler case of today's constant.
  2. or they don't realize that when "the customer changed the requirements!!!" the customer is doing what everybody does, taking something simple from yesterday and making it a bit more complex today.
So you can turn this around and say, "Everything I do today will get changed tomorrow.  It will become more complicated.  Everything I think is fixed or constant will float or become a variable."


What It Means

We've all learned (hopefully) that you don't embed constants in code because it makes the code difficult to change.  We define the constants in headers, or we accept all values from outside sources as parameters.  This makes the code more flexible, and it is a good thing.  The code is more robust, it can handle more cases without breaking or needing alteration.

The trick to using Ken's first and only rule of system design, "Today's Constant is Tomorrow's Variable" is to recognize the many forms of "constants" that we build into our systems in hard-to-reach places, making it hard to change them when tomorrow comes and they are no longer fixed and constant.

Another trick to using the rule is to always know what day it is.  Most of what I do today will involve features that may never change.  It is extremely easy to see how they might change, but impossible to know for sure.  So we leave them constant for today.


Back To That Meeting

Let's go back to that meeting we started with.  Here is how it goes when somebody knows that "Today's Constant is Tomorrow's Variable."


Person 1:  The customer wants a green background on this page.

Person 2: Do we know they're going to stick with green?  Maybe we need a configuration option for background color.

Our Hero:  We don't know they are going to stick with green because we never know if any customer is ever going to stick with anything, and we all know that today's constant is tomorrow's variable.  However, I hope we've put the styles into a style sheet and not embedded them directly into the HTML so we can change it later if we have to, right?


Somebody mumbles that yes in fact we do use style sheets, and the meeting moves on.


Is That a Lame Example?

That's a pretty lame example really, who doesn't use style sheets?  And who embeds constants in their code?

Well kiddies, it turns out that we didn't always use style sheets.  When I got interested in web pages, CSS was still optional (Yes! Believe it!) and you put your style information directly into tags, which was basically embedding constants into code, and it didn't take long before you intuitively realized this was not right and discovered CSS.

It is amazing how often that basic pattern repeats itself, trying to identify what you thought was a constant, realizing it is "buried in code", and turning it into a variable.

Today's Post Is Tomorrow's Promise


I'll be posting a lot more on this subject, using a very reliable and strict schedule based on a host of variables that mostly comes down to whenever-the-hell-I-feel-like-it(TM).  

Cheers.

Friday, January 21, 2011

Maintaining One Code Base with Possibly Conflicting Custom Features

Today's essay deals with the tricky issue of custom features for individual customers who are running instances of your software.

The question comes by way of a regular reader who prefers to remain anonymous, but asks this:

... I work on a large (to me, anyway) application that serves as a client database, ticket system, time-tracking, billing, asset-tracking system. We have some customers using their own instances of the software. Often, those customers want additional fields put in different places (e.g., a priority column on tickets). This results in having multiple branches to account for versions with slight changes in code and in the database. This makes things painful and time-consuming in the long run: applying commits from master to the other branches requires testing on every branch; same with database migrate scripts, which frequently have to be modified.

Is there an easier way? I have thought about the possibility of making things "optional" in the database, such as a column on a table, and hiding its existence in the code when it's not "enabled." This would have the benefit of a single code set and a single database schema, but I think it might lead to more dependence on the code and less on the database -- for example, it might mean constraints and keys couldn't be used in certain cases.

Restating the Question

Our reader asks, is it better to have different code branches or to try to keep a lot of potentially conflicting and optional items mixed in together?

Well, the wisdom of the ages is to maintain a single code branch, including the database schema. I tried exactly once, very early in my career, to fork my own code, and gave up almost within days. When I went to work in larger shops I always arrived in a situation where the decision had already been made to maintain a single branch. Funny thing, since most programmers cannot agree on the color of the sky when they're staring out the window, this is the only decision I have ever seen maintained with absolute unanimity no matter how many difficulties came out of it.

There is some simple arithmetic as to why this is so. If you have a single feature for a customer that is giving you a headache, and you fork the code, you now have to update both code branches for every change, plus regression test them both, including the feature that caused the headache. But if you keep them combined you only have the one headache feature to deal with. That's why people keep them together.

Two Steps

Making custom features work smoothly is a two-step process. The first step is arguably more difficult than the second, but the second step is absolutely crucial if you have business logic tied to the feature.

Most programmers when confronted with this situation will attempt to make various features optional. I consider this to be a mistake because it complicates code, especially when we get to step 2. By far the better solution is to make features ignorable by anybody who does not want them.

The wonderful thing about ignorable features is that they tend to eliminate the problems with apparently conflicting features. If you can rig the features so anybody can use either or both, you've eliminated the conflict.

Step 1: The Schema

As mentioned above, the first step is arguably more difficult than the second, because it may involve casting requirements differently than they are presented.

For example, our reader asks about a priority column on tickets, asked for by only one customer. This may seem like a conflict because nobody else wants it, but we can dissolve the conflict when we make the feature ignorable. The first step involves doing this at the database or schema level.

But first we should mention that the UI is easy: we might have a control panel where we can make fields invisible, or maybe our users just ignore the fields they are not interested in. Either way works.

The problem is in the database. If the values for priority come from a lookup table, which they should, then we have a foreign key, and we have a problem if we try to ignore it:

  • We can allow nulls in the foreign key, which is fine for the people ignoring it, but
  • This means the people who require it can end up with tickets that have no priority because it does not prevent a user from leaving it blank.

A simple answer here is to pre-populate your priority lookup table with a value of "Not applicable", perhaps with a hardcoded id of zero. Then we set the default value for the TICKET.priority to zero. This means people can safely ignore it because it will always be valid.

Then, for the customer who paid for it, we just go in after the install and delete the default entry. It's a one-time operation, not even worth writing a script for, and it forces them to create a set of priorities before using the system. Further, by leaving the default of zero in there, it forces valid answers because users will be dinged with an FK violation if they do not provide a real priority.
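
Sketched in SQL, with invented table and column names, the whole arrangement might look like this:

  -- A minimal sketch of the ignorable priority feature.
  CREATE TABLE priorities (
      priorityid  INTEGER     PRIMARY KEY,
      description VARCHAR(40) NOT NULL
  );

  -- Pre-populate the harmless default so the foreign key always succeeds.
  INSERT INTO priorities (priorityid, description)
  VALUES (0, 'Not applicable');

  CREATE TABLE tickets (
      ticketid INTEGER      PRIMARY KEY,
      title    VARCHAR(100) NOT NULL,
      priority INTEGER      NOT NULL DEFAULT 0
               REFERENCES priorities (priorityid)
  );

  -- One-time step for the customer who paid for the feature: after this,
  -- a blank priority raises an FK violation instead of passing silently.
  -- DELETE FROM priorities WHERE priorityid = 0;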

For this particular example, there is no step 2, because the problem is completely solved at the schema level. To see how to work with step 2, I will make up an example of my own.

Step 2: Unconditional Business Logic

To illustrate step 2, I'm going to make up an example that is not really appropriate to our reader's question, frankly because I cannot think of one for that situation.

Let's say we have an eCommerce system, and one of our sites wants customer-level discounts based on customer groups, while another wants discounts based on volume of order -- the more you buy, the deeper the discount. At this point most programmers start shouting in the meeting, "We'll make them optional!" Big mistake, because it makes for lots of work. Instead we will make them ignorable.

Step 1 is to make ignorable features in the schema. Our common code base contains a table of customer groups with a discount percent, and in the customers table we make a nullable foreign key to the customer groups table. If anybody wants to use it, great, and if they want to ignore it, that's also fine. We do the same thing with the volume discounts: we make an empty table that lists threshold amounts and discount percents. If anybody wants to use it they fill it in; everybody else leaves it blank.
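
A minimal sketch of that shared schema, with invented names, might be:

  -- Both discount features are present for everybody, ignorable by anybody.
  CREATE TABLE customer_groups (
      groupid      INTEGER      PRIMARY KEY,
      description  VARCHAR(40)  NOT NULL,
      discount_pct NUMERIC(5,2) NOT NULL DEFAULT 0
  );

  CREATE TABLE customers (
      customerid INTEGER     PRIMARY KEY,
      name       VARCHAR(60) NOT NULL,
      -- Nullable: sites ignoring group discounts simply leave it null.
      groupid    INTEGER     REFERENCES customer_groups (groupid)
  );

  -- Ships empty; a site that wants volume discounts fills it in.
  CREATE TABLE volume_discounts (
      threshold_amount NUMERIC(12,2) PRIMARY KEY,
      discount_pct     NUMERIC(5,2)  NOT NULL
  );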

Now for the business logic, the calculations of these two discounts. The crucial idea here is not to make up conditional logic that tries to figure out whether or not to apply the discounts. It is vastly easier to always apply both discounts, with the discounts coming out zero for those users who have ignored the features.

So for the customer discount, if the customer's entry for customer group is null, it will not match any discount, and you treat this as zero. Same for the sale amount discount: the lookup to see which discount tier the sale qualifies for finds nothing because the table is empty, so it is also treated as zero.
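
In SQL, that unconditional logic might look like the sketch below, where both discounts are always computed and COALESCE turns an ignored feature into zero (:customerid and :order_total stand in for host parameters; all names are invented):

  SELECT COALESCE(cg.discount_pct, 0) AS group_discount_pct,
         COALESCE((SELECT MAX(vd.discount_pct)
                     FROM volume_discounts vd
                    WHERE vd.threshold_amount <= :order_total), 0)
             AS volume_discount_pct
    FROM customers c
    LEFT JOIN customer_groups cg ON cg.groupid = c.groupid
   WHERE c.customerid = :customerid;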

So the real trick at the business logic level is not to figure out which feature to use, which leads to complicated conditionals that always end up conflicting with each other, but to always use all features and code them so they have no effect when they are being ignored.

Conclusion

Once upon a time almost everybody coding for a living dealt with these situations -- we all wrote code that was going to ship off to live at our customer's site. Nowadays this is less common, but for those of us dealing with it, it is a big deal.

The wisdom of the ages is to maintain a common code base. The method suggested here takes that idea to its most complete implementation, a totally common code base in which all features are active all of the time, with no conditionals or optional features (except perhaps in the UI and on printed reports), and with schema and business logic set up so that features that are being ignored simply have no effect on the user.

Thursday, January 6, 2011

Can You Really Create A Business Logic Layer?

The past three posts of this little mini-series have gone from a Working definition of business logic to a Rigorous definition of business logic and on to some theorems about business logic. To wrap things up, I'd like to ask the question, is it possible to isolate business logic into a single tier?

Related Reading

There are plenty of opinions out there. For a pretty thorough explanation of how to put everything into the DBMS, check out Toon Koppelaars' description. Mr. Koppelaars has some good material, but you do need to read through his earlier posts to get the definitions of some of his terms. You can also follow his links through to some high quality discussions elsewhere.

Contrasting Mr. Koppelaars' opinion is a piece which does not have nearly the same impact, IMHO, because in Dude, Where's My Business Logic? we get some solid history mixed with normative assertions based on either anecdote or nothing at all. I'm a big believer in anecdote, but when I read a sentence that says, "The database should not have any knowledge of what a customer is, but only of the elements that are used to store a customer," then I figure I'm dealing with somebody who needs to see a bit more of the world.

Starting At the Top: The User Interface

First, let's review that our rigorous definition of business logic includes schema (types and constraints), derived values (timestamps, userstamps, calculations, histories), non-algorithmic compound operations (like batch billing), and algorithmic compound operations, those that require looping in their code. This encompasses everything we might do, from the simplest passive things, like a constraint that prevents discounts from being over 100%, to the most complex hours-long business process, and everything in between.

Now I want to start out by using that definition to see a little bit about what is going on in the User Interface. This is not the presentation layer, as it is often called, but the interaction layer and even the command layer.

Consider an admin interface to a database, where the user is entering or modifying prices for the price list. Now, if the user could enter "Kim Stanley Robinson" as the price, that would be kind of silly, so of course the numeric inputs only allow numeric values. Same goes for dates.

So the foundation of usability for a UI is at the very least knowledge of and enforcement of types in the UI layer. Don't be scared off that I am claiming the UI is enforcing anything; we'll get to that a little lower down.

Now consider the case where the user is typing in a discount rate for this or that, and a discount is not allowed to be over 100%. The UI really ought to enforce this, otherwise the user's time is wasted when she enters an invalid value, finishes the entire form, and only then gets an error when she tries to save. In the database world we call this a constraint, so the UI needs to know about constraints to better serve the user.

Now this same user is typing a form where there is an entry for US State. The allowed values are in a table in the database, and it would be nice if the user had a drop-down list, and one that was auto-suggesting as the user typed. Of course the easiest way to do something like this is just make sure the UI form "knows" that this field is a foreign key to the STATES table, so it can generate the list using some generic library function that grabs a couple of columns out of the STATES table. Of course, this kind of lookup thing will be happening all over the place, so it would work well if the UI knew about and enforced foreign keys during entry.
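
As a sketch of how such a generic library function might find the lookup table in the first place, the standard information_schema can report which table a foreign key points at (the table name "customers" is invented for illustration):

  -- Find the table referenced by a foreign key so the UI can build its list.
  SELECT ccu.table_name AS referenced_table
    FROM information_schema.table_constraints tc
    JOIN information_schema.constraint_column_usage ccu
         ON ccu.constraint_name = tc.constraint_name
   WHERE tc.constraint_type = 'FOREIGN KEY'
     AND tc.table_name = 'customers';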

And I suppose the user might at some point be entering a purchase order. The purchase order is automatically stamped with today's date. The user might see it, but not be able to change it, so now our UI knows about system-generated values.

Is this user allowed to delete a customer? If not, the button should either be grayed out or not be there at all. The UI needs to know about and enforce some security.

More About Knowing and Enforcing

So in fact the UI layer not only knows the logic but is enforcing it. It is enforcing it for two reasons, to improve the user experience with date pickers, lists, and so forth, and to prevent the user from entering invalid data and wasting round trips.

And yet, because we cannot trust what comes in to the web server over the wire, we have to enforce every single rule a second time when we commit the data.

You usually do not hear people say that the UI enforces business logic. They usually say the opposite. But the UI does enforce business logic. The problem is, everything the UI enforces has to be enforced again. That may be why we often overlook the fact that it is doing so.

The Application and The Database

Now let's go through the stuff the UI is enforcing, and see what happens in the application and the database.

With respect to type, a strongly typed language will throw an error if the type is wrong, and a weakly typed language is wise to put in a type check anyway. Then the DBMS is going to allow only correctly typed values, so, including the UI, type is enforced three times.

With respect to lookups like US state, in a SQL database we always let the server do that with a foreign key, if we know what is good for us. That makes double enforcement for lookups.

So we can see where this is going. As we look at constraints and security and anything else that must be right, we find it will be enforced at least twice, and as much as three times.

You Cannot Isolate What Must be Duplicated

By defining First Order Business Logic, the simplest foundation layer, as including things like types and keys and constraints, we find that the enforcement of this First Order stuff is done 2 or 3 times, but never only once.

This more or less leaves in tatters the idea of a "Business Logic Layer" that is in any way capable of handling all business logic all by its lonesome. The UI layer is completely useless unless it is also enforcing as much logic as possible, and even when we leave the Database Server as the final enforcer of First Order Business Logic (types, constraints, keys), it is still often good engineering to do some checks to prevent expensive wasted trips to the server.

So we are wasting time if we sit around trying to figure out how to get the Business Logic "where it belongs", because it "belongs" in at least two places and sometimes three. Herding the cats into a single pen is a fool's errand, it is at once unnecessary, undesirable, and impossible.

Update: Regular reader Dean Thrasher of Infovark summarizes most of what I'm saying here using an apt industry standard term: Business Logic is a cross-cutting concern.

Some Real Questions

Only when we have smashed the concept that Business Logic can exist in serene isolation in its own layer can we start to ask the questions that would actually speed up development and make for better engineering.

Freed of the illusion of a separate layer, when we look at the higher Third and Fourth Order Business Logic, which always require coding, we can decide where they go based either on engineering or the availability of qualified programmers in particular technologies, but we should not make the mistake of believing they are going where they go because the gods would have it so.

But the real pressing question, if we are seeking to create efficient manageable large systems, is this: how do we distribute the same business logic into 2 or 3 (or more) different places so that it is enforced consistently everywhere? Because a smaller code base is always easier to manage than a large one, and because configuration is always easier than coding, this comes down to meta-data, or if you prefer, a data dictionary. That's the trick that always worked for me.

Is This Only A Matter of Definitions?

Anybody who disagrees with the thesis here has only to say, "Ken, those things are not business logic just because you wrote a blog that says they are. In my world business logic is about code baby!" Well sure, have it your way. After all, the nice thing about definitions is that we can all pick the ones we like.

But these definitions, the theorems I derived on Tuesday, and the multiple-enforcement thesis presented here today should make sense to anybody struggling with where to put the business logic. That struggle and its frustrations come from the mistake of imposing abstract conceptual responsibilities on each tier instead of using the tiers as each is able to get the job done. Databases are wonderful for type, entity integrity (uniqueness), referential integrity, ACID compliance, and many other things. Use them! Code is often better when the problem at hand cannot be solved with a combination of keys and constraints (Fourth Order Business Logic), but even that code can be put into the DB or in the application.

So beware of paradigms that assign responsibility without compromise to this or that tier. It cannot be done. Don't be afraid to use code for doing things that require structured imperative step-wise operations, and don't be afraid to use the database for what it is good for, and leave the arguments about "where everything belongs" to those with too much time on their hands.

Tuesday, January 4, 2011

Theorems Regarding Business Logic

In yesterday's Rigorous Definition of Business Logic, we saw that business logic can be defined in four orders:

  • First Order Business Logic is entities and attributes that users (or other agents) can save, and the security rules that govern read/write access to the entities and attributes.
  • Second Order Business Logic is entities and attributes derived by rules and formulas, such as calculated values and history tables.
  • Third Order Business Logic is non-algorithmic compound operations (no structure or looping is required in expressing the solution), such as a month-end batch billing or, for the old-timers out there, a year-end general ledger roll-up.
  • Fourth Order Business Logic is algorithmic compound operations. These occur when the action of one step affects the input to future steps. One example is ERP Allocation.

A Case Study

The best way to see if these have any value is to cook up some theorems and examine them with an example. We will take a vastly simplified time billing system, in which employees enter time which is billed once/month to customers. We'll work out some details a little below.

Theorem 1: 1st and 2nd Order, Analysis

The first theorem we can derive from these definitions is that we should look at First and Second Order Schemas together during analysis. This is because:

  • First Order Business Logic is about entities and attributes
  • Second Order Business Logic is about entities and attributes
  • Second Order Business Logic is about values generated from First Order values and, possibly, other Second Order values
  • Therefore, Second Order values are always expressed ultimately in terms of First Order values
  • Therefore, they should be analyzed together

To give the devil his due, ORM does this easily, because it ignores so much database theory (paying a large price in performance for doing so) and considers an entire row, with its first order and second order values together, as being part of one class. This is likely the foundation for the claims of ORM users that they experience productivity gains when using ORM. Since I usually do nothing but bash ORM, I hope this statement will be taken as utterly sincere.

Going the other way, database theorists and evangelists who adhere to full normalization can hobble an analysis effort by refusing to consider 2nd order because those values denormalize the database, so sometimes the worst of my own crowd will prevent analysis by trying to keep these out of the conversation. So, assuming I have not pissed off my own friends, let's keep going.

So let's look at our case study of the time billing system. By theorem 1, our analysis of entities and attributes should include both 1st and 2nd order schema, something like this:

 
 INVOICES
-----------
 invoiceid      2nd Order, a generated unique value
 date           2nd Order if always takes date of batch run
 customer       2nd Order, a consequence of this being an
                           aggregation of INVOICE_LINES
 total_amount   2nd Order, a sum from INVOICE_LINES
               
 INVOICE_LINES
---------------
 invoiceid      2nd order, copied from INVOICES
 customer         +-  All three are 2nd order, a consequence
 employee         |   of this being an aggregation of
 activity         +-  employee time entries
 rate           2nd order, taken from ACTIVITIES table
                           (not depicted)
 hours          2nd order, summed from time entries
 amount         2nd order, rate * hours
 
 TIME_ENTRIES
--------------
 employeeid     2nd order, assuming system forces this
                    value to be the employee making
                    the entry
 date           1st order, entered by employee
 customer       1st order, entered by employee
 activity       1st order, entered by employee
 hours          1st order, entered by employee

Now, considering how much of that is 2nd order, which is almost all of it, the theorem is not only supported by the definition, but ought to line up squarely with our experience. Who would want to try to analyze this and claim that all the 2nd order stuff should not be there?

Theorem 2: 1st and 2nd Order, Implementation

The second theorem we can derive from these definitions is that First and Second Order Business logic require separate implementation techniques. This is because:

  • First Order Business Logic is about user-supplied values
  • Second Order Business Logic is about generated values
  • Therefore, unlike things cannot be implemented with like tools.

Going back to the time entry example, let's zoom in on the lowest table, the TIME_ENTRIES. The employee entering her time must supply customer, date, activity, and hours, while the system forces the value of employeeid. This means that customer and activity must be validated in their respective tables, and hours must be checked for something like <= 24. But for employeeid the system provides the value out of its context. So the two kinds of values are processed in very unlike ways. It seems reasonable that our code would be simpler if it did not try to force both kinds of values down the same validation pipe.
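
Here is a sketch of that separation in PostgreSQL-flavored SQL, with invented names: the user-supplied First Order values are validated by keys and checks, while the Second Order employeeid is forced from session context by a trigger rather than validated.

  CREATE TABLE time_entries (
      employeeid INTEGER NOT NULL REFERENCES employees (employeeid),
      date       DATE    NOT NULL,
      customer   INTEGER NOT NULL REFERENCES customers (customerid),
      activity   INTEGER NOT NULL REFERENCES activities (activityid),
      hours      NUMERIC(4,2) NOT NULL CHECK (hours > 0 AND hours <= 24)
  );

  -- The system, not the user, supplies employeeid from session context.
  CREATE OR REPLACE FUNCTION force_employeeid() RETURNS trigger AS $$
  BEGIN
      NEW.employeeid := current_setting('app.employeeid')::integer;
      RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER time_entries_employeeid
      BEFORE INSERT ON time_entries
      FOR EACH ROW EXECUTE FUNCTION force_employeeid();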

Theorem 3: 2nd and 3rd Order, Conservation of Action

This theorem states that the sum of Second and Third Order Business Logic is fixed:

  • Second Order Business Logic is about generating entities and attributes by rules or formulas
  • Third Order Business Logic is coded compound creation of entities and attributes
  • Given that a particular set of requirements resolves to a finite set of actions that generate entities and values, then
  • The sum of Second Order and Third Order Business Logic is fixed.

In plain English, this means that the more Business Logic you can implement through 2nd Order declarative rules and formulas, the fewer processing routines you have to code.  Or, if you prefer, the more processes you code, the fewer declarative rules about entities and attributes you will have.

This theorem may be hard to compare to experience for verification, because most of us are so used to thinking of batch billing as a process that we cannot imagine it being implemented any other way: how exactly am I supposed to implement batch billing declaratively?

Let's go back to the schema above, where we can realize upon examination that the entirety of the batch billing "process" has been detailed in a 2nd Order Schema.  If we could somehow add these facts to our CREATE TABLE commands the way we add keys, types, and constraints, batch billing would occur without the batch part.

Consider this.  Imagine that a user enters a TIME_ENTRY.  The system checks for a matching EMPLOYEE/CUSTOMER/ACTIVITY row in INVOICE_LINES, and when it finds the row it updates the totals.  But if it does not find one then it creates one!  Creation of the INVOICE_LINES record causes the system to check for the existence of an invoice for that customer, and when it does not find one it creates it and initializes the totals.  Subsequent time entries not only update the INVOICE_LINES rows but the INVOICES rows as well.  If this were happening, there would be no batch billing at the end of the month because the invoices would all be sitting there ready to go when the last time entry was made.
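
A rough sketch of how that cascade could be wired up with ordinary triggers (PostgreSQL-flavored, invented names, and assuming unique keys on (customer, period) and on (customer, period, employee, activity)); this is only an illustration of the behavior, not a finished design:

  CREATE OR REPLACE FUNCTION roll_up_time_entry() RETURNS trigger AS $$
  BEGIN
      -- Make sure the invoice header exists for this customer and period.
      INSERT INTO invoices (customer, period, total_amount)
      VALUES (NEW.customer, date_trunc('month', NEW.date), 0)
      ON CONFLICT (customer, period) DO NOTHING;

      -- Make sure the matching detail line exists, then accumulate hours.
      INSERT INTO invoice_lines (customer, period, employee, activity, hours)
      VALUES (NEW.customer, date_trunc('month', NEW.date),
              NEW.employeeid, NEW.activity, NEW.hours)
      ON CONFLICT (customer, period, employee, activity)
      DO UPDATE SET hours = invoice_lines.hours + EXCLUDED.hours;

      -- Keep the header total current as each entry arrives.
      UPDATE invoices
         SET total_amount = total_amount
                          + NEW.hours * (SELECT rate FROM activities
                                          WHERE activityid = NEW.activity)
       WHERE customer = NEW.customer
         AND period = date_trunc('month', NEW.date);

      RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER time_entries_roll_up
      AFTER INSERT ON time_entries
      FOR EACH ROW EXECUTE FUNCTION roll_up_time_entry();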

By the way, I coded something that does this in a pretty straightforward way a few years ago, meaning you could skip the batch billing process and add a few details to a schema that would cause the database to behave exactly as described above.  Although the format for specifying these extra features was easy enough (so it seemed to me as the author), the conceptual shift of thinking it required of people was far larger than I initially and naively imagined.  Nevertheless, I toil forward, and that is the core idea behind my Triangulum project.

Observation: There Will Be Code

This is not so much a theorem as an observation. This observation is that if your application requires Fourth Order Business Logic then somebody is going to code something somewhere.

An anonymous reader pointed out in the comments to Part 2 that Oracle's MODEL clause may work in some cases. I would assume so, but I would also assume that reality can create complicated Fourth Order cases faster than SQL can evolve. Maybe.

But anyway, the real observation here is that no modern language, either app level or SQL flavor, can express an algorithm declaratively. In other words, no combination of keys, constraints, calculations and derivations, and no known combination of advanced SQL functions and clauses will express an ERP Allocation routine or a Magazine Regulation routine. So you have to code it. This may not always be true, but I think it is true now.

This is in contrast to the example given in the previous section about the fixed total of 2nd and 3rd Order Logic. Unlike that example, you cannot provide enough 2nd order wizardry to eliminate fourth order. (well ok maybe you can, but I haven't figured it out yet myself and have never heard that anybody else is even trying. The trick would be to have a table that you truncate and insert a single row into, a trigger would fire that would know how to generate the next INSERT, generating a cascade. Of course, since this happens in a transaction, if you end up generating 100,000 inserts this might be a bad idea ha ha.)

Theorem 5: Second Order Tools Reduce Code

This theorem rests on the acceptance of an observation, that using meta-data repositories, or data dictionaries, is easier than coding. If that does not hold true, then this theorem does not hold true. But if that observation (my own observation, admittedly) does hold true, then:

  • By Theorem 3, the sum of 2nd and 3rd order logic is fixed
  • By observation, using meta-data that manages schema requires less time than coding,
  • By Theorem 1, 2nd order is analyzed and specified as schema
  • Then it is desirable to specify as much business logic as possible as 2nd order schema, reducing and possibly eliminating manual coding of Third Order programs.

Again we go back to the batch billing example. Is it possible to convert it all to 2nd Order as described above? Well yes it is, because I've done it. The trick is an extremely counter-intuitive modification to a foreign key that causes a failure to actually generate the parent row that would let the key succeed. To find out more about this, check out Triangulum (not ready for prime time as of this writing).

Conclusions

The major conclusion in all of this is that analysis and design should begin with First and Second Order Business Logic, which means working out schemas, both the user-supplied values and the system-supplied values.

When that is done, what we often call "processes" are layered on top of this.

Tomorrow we will see part 4 of 4, examining the business logic layer, asking, is it possible to create a pure business logic layer that gathers all business logic unto itself?

Sunday, January 2, 2011

Business Logic: From Working Definition to Rigorous Definition

This is part 2 of a 4 part mini-series that began before the holidays with A Working Definition of Business Logic. Today we proceed to a rigorous definition, tomorrow we will see some theorems, and the series will wrap up with a post on the "business layer."

In the first post, the working definition said that business logic includes at least:

  • The Schema
  • Calculations
  • Processes

None of these was very rigorously defined, kind of a "I'll know it when I see it" type of thing, and we did not talk at all about security. Now the task becomes tightening this up into a rigorous definition.

Similar Reading

Toon Koppelaars has some excellent material along these same lines, and a good place to start is his Helsinki Declaration (IT Version). The articles have a different focus than this series, so they make great contrasting reading. I consider my time spent reading through it very well spent.

Definitions, Proofs, and Experience

What I propose below is a definition in four parts. As definitions, they are not supposed to prove anything, but they are definitely supposed to ring true to the experience of any developer who has created or worked on a non-trivial business application. This effort would be a success if we reach some consensus that "at least it's all in there", even if we go on to argue bitterly about which components should be included in which layers.

Also, while I claim the definitions below are rigorous, they are not yet formal. My instinct is that formal definitions can be developed using First Order Logic, which would allow the theorems we will see tomorrow to move from "yeah that sounds about right" to being formally provable.

As for their practical benefit, inasmuch as "the truth shall make you free", we ought to be able to improve our architectures if we can settle at very least what we are talking about when we use the vague term "business logic."

The Whole Picture

What we commonly call "business logic", by which we vaguely mean, "That stuff I have to code up", can in fact be rigorously defined as having four parts, which I believe are best termed orders, as there is a definite precedence to their discovery, analysis and implementation.

  • First Order: Schema
  • Second Order: Derivations
  • Third Order: Non-algorithmic compound operations
  • Fourth Order: Algorithmic compound operations

Now we examine each order in detail.

A Word About Schema and NoSQL

Even "schema-less" databases have a schema, they simply do not enforce it in the database server. Consider: an eCommerce site using MongoDB is not going to be tracking the local zoo's animal feeding schedule, because that is out of scope. No, the code is limited to dealing with orders, order lines, customers, items and stuff like that.

It is in the very act of expressing scope as "the data values we will handle" that a schema is developed. This holds true regardless of whether the datastore will be a filesystem, an RDBMS, a new NoSQL database, or anything else.

Because all applications have a schema, whether the database server enforces it or whether the application enforces it, we need a vocabulary to discuss the schema. Here we have an embarrassment of choices: we can talk about entities and attributes, classes and properties, documents and values, or tables and columns. The choice of "entities and attributes" is likely best because it is as close as possible to an implementation-agnostic language.

First Order Business Logic: Schema

We can define schema, including security, as:

that body of entities and their attributes whose relationships and values will be managed by the application stack, including the authorization of roles to read or write to entities and properties.

Schema in this definition does not include derived values of any kind or the processes that may operate on the schema values, those are higher order of business logic. This means that the schema actually defines the entire body of values that the application will accept from outside sources (users and other programs) and commit to the datastore. Restating again into even more practical terms, the schema is the stuff users can save themselves.

With all of that said, let's enumerate the properties of a schema.

Type is required for every attribute.

Constraints are limits to the values allowed for an attribute beyond its type. We may have a discount percent that may not exceed 1.0 or 100%.

Entity Integrity is usually thought of in terms of primary keys and the vague statement "you can't have duplicates." We cannot have a list of US States where "NY" is listed 4 times.

Referential Integrity means that when one entity links or refers to another entity, it must always refer to an existing entity. We cannot have some script kiddie flooding our site with sales of items "EAT_ME" and "F***_YOU", because those are not valid items.

The general term 'validation' is not included because any particular validation rule is a combination of any or all of type, constraints, and integrity rules.
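
All four properties appear in this minimal sketch (names invented):

  CREATE TABLE states (
      state CHAR(2)     PRIMARY KEY,  -- entity integrity: no duplicate states
      name  VARCHAR(40) NOT NULL      -- type
  );

  CREATE TABLE discounts (
      discountid   INTEGER PRIMARY KEY,
      -- Referential integrity: must point at an existing state.
      state        CHAR(2) NOT NULL REFERENCES states (state),
      -- Constraint: a limit beyond the type itself.
      discount_pct NUMERIC(5,4) NOT NULL
                   CHECK (discount_pct >= 0 AND discount_pct <= 1.0)
  );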

Second Order Business Logic: Derived Values

When we speak of derived values, we usually mean calculated values, but some derivations are not arithmetic, so the more general term "derived" is better. Derivations are:

A complete entity or an attribute of an entity generated from other entities or attributes according to a formula or rule.

The definition is sufficiently general that a "formula or rule" can include conditional logic.

Simple arithmetic derived values include things like calculating price * qty, or summing an order total.

Simple non-arithmetic derivations include things like fetching the price of an item to use on an order line. The price in the order is defined as being a copy of the item's price at the time of purchase.

An example of a complete entity being derived is a history table that tracks changes in some other table. This can also be implemented in NoSQL as a set of documents tracking the changes to some original document.
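
Some of these derivations can be sketched directly in modern SQL (PostgreSQL-flavored, names invented); the history table would be populated by a trigger, omitted here for brevity:

  CREATE TABLE order_lines (
      orderid INTEGER       NOT NULL,
      item    VARCHAR(20)   NOT NULL,
      price   NUMERIC(10,2) NOT NULL,  -- copied from the item at time of purchase
      qty     INTEGER       NOT NULL,
      -- A simple arithmetic derivation, materialized by the server.
      amount  NUMERIC(12,2) GENERATED ALWAYS AS (price * qty) STORED
  );

  -- A derived entity: one row generated for every change to an order line.
  CREATE TABLE order_lines_history (
      changed_at TIMESTAMP   NOT NULL DEFAULT now(),
      orderid    INTEGER     NOT NULL,
      item       VARCHAR(20) NOT NULL,
      old_qty    INTEGER,
      new_qty    INTEGER
  );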

Security also applies to generated values only insofar as who can see them. But security is not an issue for writing these values because by definition they are generated from formulas and rules, and so no outside user can ever attempt to explicitly specify the value of a derived entity or property.

One final point about Second Order Business Logic is that it can be expressed declaratively, if we have the tools, which we do not, at least not in common use. I wrote one myself some years ago and am re-releasing it as Triangulum, but that is a post for another day.

Sorting out First and Second Order

The definitions of First and Second Order Business Logic have the advantage of being agnostic to what kind of datastore you are using, and being agnostic to whether or not the derived values are materialized. (In relational terms, derivations are almost always denormalizing if materialized, so in a fully normalized database they will not be there, and you have to go through the application to get them.)

Nevertheless, these two definitions can right off bring some confusion to the term "schema." Example: a history table is absolutely in a database schema, but I have called First Order Business Logic "schema" and Second Order Business Logic is, well, something else. The best solution here is to simply use the terms First Order Schema and Second Order Schema. An order_lines table is First Order schema, and the table holding its history is Second Order Schema.

The now ubiquitous auto-incremented surrogate primary keys pose another stumbling block. Because they are used so often (and so often because of seriously faulty reasoning, see A Sane Approach To Choosing Primary Keys) they would automatically be considered schema -- one of the very basic values of a sales order, check, etc. But they are system-generated so they must be Second Order, no? Isn't the orderid a very basic part of the schema and therefore First Order? No. In fact, by these definitions, very little if any of an order header is First Order, the tiny fragments that are first order might be the shipping address, the user's choice of shipping method, and payment details provided by the user. The other information that is system-generated, like Date, OrderId, and order total are all Second Order.

Third Order Business Logic

Before defining Third Order Business Logic I would like to offer a simple example: Batch Billing. A consulting company bills by the hour. Employees enter time tickets throughout the day. At the end of the month the billing agent runs a program that, in SQL terms:

  • Inserts a row into INVOICES for each customer with any time entries
  • Inserts a row into INVOICE_LINES that aggregates the time for each employee/customer combination.
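
A minimal sketch of those two statements (invented names; :run_date, :period_start, and :period_end stand in for host parameters):

  -- One invoice header per customer with any time entries this period.
  INSERT INTO invoices (customer, invoice_date)
  SELECT DISTINCT customer, :run_date
    FROM time_entries
   WHERE date >= :period_start AND date < :period_end;

  -- One line per employee/customer combination, aggregating the hours.
  INSERT INTO invoice_lines (invoiceid, customer, employee, hours)
  SELECT i.invoiceid, t.customer, t.employeeid, SUM(t.hours)
    FROM time_entries t
    JOIN invoices i ON i.customer = t.customer
                   AND i.invoice_date = :run_date
   WHERE t.date >= :period_start AND t.date < :period_end
   GROUP BY i.invoiceid, t.customer, t.employeeid;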

This example ought to make clear what I mean by defining Third Order Business Logic as:

A non-algorithmic compound operation.

The "non-algorithmic" part comes from the fact that none of the individual documents, an INVOICE row and its INVOICE_LINES, is dependent on any other. There is no case in which the invoice for one customer will influence the value of the invoice for another. You do not need an algorithm to do the job, just one or more steps that may have to go in a certain order.

Put another way, it is a one-pass set-oriented operation. The fact that it must be executed in two steps is an artifact of how database servers deal with referential integrity, which is that you need the headers before you can put in the detail. In fact, when using a NoSQL database, it may be possible to insert the complete set of documents in one command, since the lines can be nested directly into the invoices.

Put yet a third way, in more practical terms, there is no conditional or looping logic required to specify the operation. This does not mean there will be no looping logic in the final implementation, because performance concerns and locking concerns may cause it to be implemented with 'chunking' or other strategies, but the important point is that the specification does not include loops or step-wise operations because the individual invoices are all functionally independent of each other.

I do not want to get side-tracked here, but I have had a working hypothesis in my mind for almost 7 years that Third Order Business Logic, even before I called it that, is an artifact, which appears necessary because of the limitations of our tools. In future posts I would like to show how a fully developed understanding and implementation of Second Order Business Logic can dissolve many cases of Third Order.

Fourth Order Business Logic

We now come to the upper bound of complexity for business logic, Fourth Order, which we label "algorithmic compound operations", and define a particular Fourth Order Business Logic process as:

Any operation where it is possible or certain that there will be at least two steps, X and Y, such that the result of Step X modifies the inputs available to Step Y.

In comparison to Third Order:

  • In Third Order the results are independent of one another, in Fourth Order they are not.
  • In Third Order no conditional or branching is required to express the solution, while in Fourth Order conditional, looping, or branching logic will be present in the expression of the solution.

Let's look at the example of ERP Allocation. In the interest of brevity, I am going to skip most of the explanation of the ERP Allocation algorithm and stick to this basic review: a company has a list of sales orders (demand) and a list of purchase orders (supply). Sales orders come in through EDI, and at least once/day the purchasing department must match supply to demand to find out what they need to order. Here is an unrealistically simple example of the supply and demand they might be facing:

  *** DEMAND ***          *** SUPPLY ***

    DATE    | QTY           DATE    | QTY
------------+-----      ------------+----- 
  3/ 1/2011 |  5          3/ 1/2011 |  3
  3/15/2011 | 15          3/ 3/2011 |  6
  4/ 1/2011 | 10          3/15/2011 | 20
  4/ 3/2011 |  7   

The desired output of the ERP Allocation might look like this:

 *** DEMAND ***      *** SUPPLY ****
    DATE    | QTY |  DATE_IN   | QTY  | FINAL 
------------+-----+------------+------+-------
  3/ 1/2011 |  5  |  3/ 1/2011 |  3   |  no
                  |  3/ 3/2011 |  2   | Yes 
  3/15/2011 | 15  |  3/ 3/2011 |  4   |  no
                  |  3/15/2011 | 11   | Yes
  4/ 1/2011 | 10  |  3/15/2011 |  9   |  no
  4/ 3/2011 |  7  |    null    | null |  no

From this the purchasing agents know that the Sales Order that ships on 3/1 will be two days late, and the Sales Orders that will ship on 4/1 and 4/3 cannot be filled completely. They have to order more stuff.

Now for the killer question: Can the desired output be generated in a single SQL query? The answer is no, not even with Common Table Expressions or other recursive constructs. The reason is that each match-up of a purchase order to a sales order modifies the supply available to the next sales order. Or, to use the definition of Fourth Order Business Logic, each iteration will consume some supply and so will affect the inputs available to the next step.

We can see this most clearly if we look at some pseudo-code:

for each sales order by date {
   while sales order demand not met {
      get earliest purchase order w/qty avail > 0
         break if none
      make entry in matching table
      // This is the write operation that 
      // means we have Fourth Order Business Logic
      reduce available qty of purchase order
   }
   break if no more purchase orders
}

Conclusions

As stated in the beginning, it is my belief that these four orders should "ring true" with any developer who has experience with non-trivial business applications. Though we may dispute terminology and argue over edge cases, the recognition and naming of the Four Orders should be of immediate benefit during analysis, design, coding, and refactoring. They rigorously establish both the minimum and maximum bounds of complexity while also filling in the two kinds of actions we all take between those bounds. They are datamodel agnostic, and even agnostic to implementation strategies within data models (like the normalize/denormalize debate in relational).

But their true power is in providing a framework of thought for the process of synthesizing requirements into a specification and from there an implementation.

Tomorrow we will see some theorems that we can derive from these definitions.