Today is Memorial Day in the United States, a national holiday in which we remember those who made the ultimate sacrifice to preserve our liberties.
The Database Programmer will return next week with an essay on meta-data and data dictionaries.
When Object Oriented Programming burst into the mainstream, encapsulation rapidly came to be accepted as a basic tenet of sound programming. Modern database servers provide a wonderful way to tightly couple code and data, a way that is secure, non-subvertible, flexible and powerful. This week we will see the first use of the database trigger, and examine its curious ability to accomplish the same goal as OOP encapsulation. Furthermore, while the database requires no analog to the idea of inheritance, triggers also serve as a very powerful way to build up business processes through composition.
There are links to related essays on normalization and denormalization at the bottom of this post.
A "trigger" is a small computer program stored on the database server that fires whenever a table is written to. The trigger can validate and if necessary reject the operation, and it can also modify the data being written. These two simple abilities can be combined to implement incredibly powerful operations and combinations of operations.
Before we get to the first example, we need to review the four basic options that are available when you define a trigger. These four options will be used in the examples below, and this quick review can serve as a reference when you review the examples.
Three weeks ago I reviewed the three basic Denormalization Patterns that are used to put calculated values into a database. These denormalization patterns fit hand-in-glove with triggers. The following code shows three simple calculations that can be performed in a trigger. I should note that the syntax is peculiar to Postgres, and you will need to review your own server's documentation to get the syntax right. If you have never written a trigger before, I should further warn that Postgres has a perverse two-step process, it makes you write a function and then attach that function to a trigger. MS SQL Server lets you define the trigger and code together in one step, which makes much more sense. But that being said, here it is:
CREATE FUNCTION order_lines_before_row_func RETURNS TRIGGER AS $BODY$ -- execute the FETCH of the price SELECT INTO new.price price FROM items WHERE items.sku = new.sku ; -- EXTEND The price SET new.extended_price = new.price * new.qty; -- AGGREGATE the total to the sales order UPDATE ORDERS set lines_total = COALESCE(lines_total,0) + new.extended_price; $BODY$ SECURITY DEFINER LANGUAGE PLPGSQL; -- This second step is a peculiar requirement of postgres, -- by contrast, MS SQL Server would have let me define these options -- above and would not require this very annoying second step. CREATE TRIGGER order_lines_before_row_trigger BEFORE INSERT OR UPDATE ON ORDER_LINES FOR EACH ROW EXECUTE PROCEDURE order_lines_before_row_func;
It is no understatment to say that you can build huge applications of hundreds of tables that use little more than the example above repeated over and over across many tables. There is of course more to the story, lots more, but the example above lays the pattern for most of what follows.
Some months ago a member of the Long Island PHP User's Group asked about a method for ensuring that an ordered playlist would never have two songs in the same position. This question struck me because it is a great example of the general problem of maintaining any ordered list when you may be moving items around. It turns out an extremely simple trigger takes care of it neatly.
CREATE FUNCTION playlist_before_row_func RETURNS TRIGGER AS $BODY$ -- given a column "position", if an item is given a position, -- already occupied by another row, bump that other row forward UPDATE playlist SET position = position + 1 WHERE position = new.position; $BODY$ SECURITY DEFINER LANGUAGE PLPGSQL; -- This second step is a peculiar requirement of postgres, -- by contrast, MS SQL Server would have let me define these options -- above and would not require this very annoying second step. CREATE TRIGGER playlist_before_row_trigger ON playlist BEFORE INSERT OR UPDATE FOR EACH ROW EXECUTE FUNCTION playlist_before_row_func();
This trigger works because triggers will cascade, which is explained next.
A trigger can contain code that executes an UPDATE statement, which may then fire other triggers, which themselves contain UPDATE statements, and so on. This is called cascading.
The second example above only works if a trigger can cascade. The chart below shows how it works. The first column shows the value of "position" for a few rows. Imagine the user inserts a new row at position 1, "stepping on" the value that is already there. The trigger shown in example two takes care of everything:
- POSITION - Old | New | Trigger action -----+-----+------------------------------------------------------ | 1 | Insert causes trigger to fire the UPDATE 1 | 2 | UPDATE flips this from 1 to 2, causing trigger to fire again 2 | 3 | 2nd UPDATE flips this from 2 to 3, causing the trigger to fire again 3 | 4 | 3rd UPDATE flips this from 3 to 4, trigger fires a 4th time 6 | 6 | Fourth UPDATE finds no target row, execution ends, no action on this row 7 | 7 | no action
This patterns has some latent performance issues that are too complex to get into until I can do a complete essay on this table pattern. But the point here is that those performance problems are a part of the pattern itself, and using a trigger for the pattern is actually the fastest way to implement the pattern.
The term encapsulation is used (as far as I know) exclusively by Object Oriented theorists. Nevertheless, it is striking how well the term describes the effect of triggers. When a trigger is put onto a table, you have the tightest possible integration of code and data, at least as far as database applications are concerned. This is true because the trigger allows for non-subvertibility and integrated security.
A system is said to be "subvertible" if a program or user can, by accident or malice, disobey a business rule or circumvent any of the protections set up in the system. Triggers represent a non-subvertible implementation of business rules, because no application code or user can bypass their actions. If that is not encapsulation I don't know what is.
There is a really nifty advantage to triggers that falls into this general discussion. When you implement calculations in triggers, you can freely allow access to the database from outside of your own application. This is a crucial feature when selling to companies with large IT departments that expect this ability like a birthright. Going further along these lines, you can implement browser and desktop versions of the application without recoding all of the calculations.
Last week we had an Introduction To Security in which we saw that database security is implemented as restrictions on who can INSERT, UPDATE, DELETE or SELECT from a table. This dovetails perfectly with the use of triggers because it allows you to think of security entirely in terms of table access, which is what it all boils down to in the end anyway. Using triggers plus table security gives the tightest possible integration of security and business logic, and puts your entire security framework on a simple and consistent basis.
The Object Oriented Programmer is used to instantiating and using objects together in different patterns. This technique, known as "composition", is one of the basic ways for classes to interact with each other. I will not argue here that triggers are an exact analog to composition, because such a statement is not true and will invite meaningless debates over terms. What I will argue is that within a database the use of triggers to write to other tables accomplishes the same goal: the interaction of code from different but related contexts.
To illustrate this, our third example makes use of the cascading effect described above to link together the ORDER_LINES table, the ORDERS table, and the CUSTOMERS table. The idea is this. One of the rules on the ORDER_LINES table is that the extended_price must be calculated whenever price or qty change. But there is also a rule on the ORDERS table that it must always contain the total of the line items of the orders. Now let us imagine that the CUSTOMERS table must always contain the total of all open orders for the customer. A complete example for INSERT, UPDATE and DELETE would be far more than will fit here, but if we concentrate just on UPDATE operations we can see the general idea:
-- This is a modified form of example one above CREATE FUNCTION order_lines_before_row_func RETURNS TRIGGER AS $BODY$ -- EXTEND The price SET new.extended_price = new.price * new.qty -- AGGREGATE the total to the sales order, -- notice that we are subtracting the old value and -- adding the new UPDATE ORDERS set lines_total = COALESCE(lines_total,0) + new.extended_price - old.extended_price; $BODY$ SECURITY DEFINER LANGUAGE PLPGSQL; -- The trigger above affected the ORDERS table, and now -- we go the next step and affect the customers table CREATE FUNCTION orders_before_row_func RETURNS TRIGGER AS $BODY$ -- AGGREGATE the total to the customer UPDATE CUSTOMERS set orders_total = COALESCE(orders_total,0) + new.lines_total - old.lines_total; $BODY$ SECURITY DEFINER LANGUAGE PLPGSQL; -- Postgres requires this 2nd step, but I'm not going into it here CREATE TRIGGER order_lines_before_row_trigger ..... CREATE TRIGGER orders_before_row_trigger.....
I want to repeat here that I am not claiming this mechanism is an exact analog to composition, but I do want to stress that when you go with triggers a lot of code disappears out of your application, and you may find yourself wondering how all of this fits with the OOP mentality. All programming systems require the basic ability to both isolate code while also defining and controlling how these isolated fragments can interact. OOP makes use of inheritance and composition to relate code segments to each other, but triggers change the scene quite radically. The trigger encapsulates the behavior of the table, and because triggers can update other tables, the ability to have cascading or secondary actions is satisfied.
My final note on this idea is to repeat something I have stressed in these essays many times. The seasoned and experienced database programmer thinks of things in terms of his basic building block, the table. The trigger gives him the ability to attach code directly to this building block. The fact that triggers can update other tables gives him the necessary mechanism to have operations that span multiple tables.
There is far more to cover on triggers, and I have glossed over many details in this first introduction. Future essays will demonstrate table design patterns that have their cleanest implementations with triggers, and we will also examine grave performance issues that must be considered on large complex systems. Before we get to that however, next week we will have an introduction to meta data and data dictionaries, which we have to cover first because they lead to more essays on triggers and code generation, and the data dictionary will play more and more into future essays on table design patterns.
Triggers are a very powerful mechanism for creating business logic that is guaranteed to execute, cannot be subverted, dovetails nicely with security. The trigger has the curious property of strongly encapsulating code and data, even though we do not usually use the term encapsulation to describe this. Triggers put great power in the hands of the programmer.
The normalization essays on this blog are:
For a database application, all security in the end becomes a question of who can read and write to what tables. While this is obvious to veteran database programmers, it is not always so obvious to relative newcomers. A programmer who thinks primarily in terms of code will be used to conceptualizing actions, steps, procedures and so forth. This programmer will therefore think of security in terms of who is allowed to perform what actions. However, when you examine the actions performed by database application code, you find that, no surprise, it is a lot of database reads and writes.
This week in The Database Programmer we are going to see the basics of table-based security. This is a blog for anybody who wants to learn the practical realities of working with databases. A new essay is published every Monday morning, and the Complete Table of Contents is here.
The most basic unit of storage for a database is the table, and so security begins with deciding who can do what to a table. The SQL language has one read command and three write commands (INSERT, UPDATE, and DELETE), so there are four possible permissions that can be granted to a user. Most databases out there provide syntax that is pretty close to this, which makes some direct permission assignments to a group named "ARSTAFF"
CREATE GROUP ARSTAFF GRANT SELECT ON RECEIVABLES TO ARSTAFF GRANT INSERT ON RECEIVABLES TO ARSTAFF GRANT UPDATE ON RECEIVABLES TO ARSTAFF GRANT DELETE ON RECEIVABLES TO ARSTAFF
These commands should be run at the same time that your tables are built, they are part of the basic database build.
A very basic security dictionary will list all tables and show who can do what to what. The dictionary begins with a single table, which lists tables and groups and permissions, something like this:
TABLE | GROUP | SELECT | INSERT | UPDATE | DELETE -------------+-----------+--------+--------+--------+-------- AR_INVOICES | AR_STAFF | 1 | 1 | 1 | 0 AR_PAYMENTS | AR_STAFF | 1 | 1 | 1 | O CUSTOMERS | AR_STAFF | 1 | 1 | 1 | 0 CUST_TYPES | AR_STAFF | 1 | 0 | 0 | 0 CUST_TYPES | AR_ADMIN | 1 | 1 | 1 | 1
Security permissions should always be defined in terms of groups, not individual users. This is important because it lets you make the group definitions part of the basic definition of the system, leaving the system administrator free to add users to groups at his own discretion.
Most security systems make use of the idea of deny by default. This means you do not have to specify who cannot see a table, the system starts out by assuming that nobody can see anything. Users can then only work with a table if they are in a group that can work with that table.
Another key idea is this: if a user is in even one group that allows a permission, the user gets the permission, even if his other groups are not allowed in.
With these two ideas, deny-by-default, and any-group-gets-you-in, then assuming you have a table that lists what groups each user is in, you can find a particular user's permissions on a table with this query:
SELECT MAX(select) as select ,MAX(insert) as insert ,MAX(update) as update ,MAX(delete) as delete FROM secure_dictionary WHERE table = 'table name' AND group IN (list,of,groups,user,is,in)
If the query comes back empty, the user has no rights, otherwise you will get the user's permissions for that table.
The security dictionary portrayed above can become tedious to work with because it often happens that you want to make similar assignments to groups of tables. For instance, in an accounting application you may an "APSTAFF" group that has basic permissions on Accounts Payable tables, and an "APADMIN" group that can do anything to any table in Accounts Payable. However, those groups have no permissions on the Accounts Receivable tables, so you have two new groups, "ARSTAFF" and "ARADMIN". Likewise for inventory, general ledger, and so forth.
This can all be made much easier by organizing your tables into modules and defining default permissions within a module to each group. The expanded dictionary would have these tables:
THE TABLE OF TABLES TABLE | MODULE -------------+--------- AP_INVOICES | AP AP_CHECKS | AP AR_INVOICES | AR AR_PAYMENTS | AR GL_ACCOUNTS | GL GL_LEDGER | GL MODULE PERMISSION MODULE | GROUP | SELECT | INSERT | UPDATE | DELETE --------+-----------+--------+--------+--------+-------- AP | AP_STAFF | 1 | 0 | 0 | 0 AP | AP_ADMIN | 1 | 1 | 1 | 1 AR | AR_STAFF | 1 | 0 | 0 | 0 AR | AR_ADMIN | 1 | 1 | 1 | 1 GL | GL_STAFF | 1 | 0 | 0 | 0 GL | GL_ADMIN | 1 | 1 | 1 | 1
The permissions above basically let the regular in each group see anything, but write nothing. The admin users in each group can do anything. Specific table assignments to particular tables can then override these defaults to give the *STAFF members increased access on a table-by-table basis.
We are now going to see the first table design pattern that is based on security: the read-only lookup table. Consider a table of customer types and discounts. This table does not change that often, and regular staff are not permitted to make changes to the table, only managers can do that.
I call this pattern the "Read-Only Lookup Table." The table holds values that everybody needs to see, but only a few people can modify. The SQL to create a table might look like this:
CREATE GROUP AR_STAFF CREATE GROUP AR_ADMIN CREATE TABLE CUSTTYPES ( custtype char(10) primary key ,description char(25) ,discount numeric(4,2) ) DENY ALL ON CUSTTYPES TO AR_STAFF GRANT SELECT ON CUSTTYPES TO AR_STAFF GRANT ALL ON CUSTTYPES TO AR_ADMIN
By the way, this is yet another argument against the deep tendency that code grinders have to combine tables that have similar structures, especially lookup tables. We know in theory that combining lookup tables is bad because like things should be stored together in a table and unlike thing should be stored in separate tables. Now we see the practical impact, which is that you cannot implement security on tables that have been combined (just like you can't implement constraints, foreign keys, calculations...) The crucial point here is that table design is not so much about identifying column names and widths, at its heart it is about organizing data. When you mistake a superficial resemblance (similar column structure) for an actual similarity, you confound your own ability to manage either entity correctly.
You can certainly implement security in your framework by doing on-the-fly lookups to see who is allowed to do what. However, your run-time framework code will be much simpler if the security definitions are built into the server when the tables are built. The drawback is that it requires a dramatic change to how you connect to the database.
Many programmers use frameworks (or write their own) that make only one connection to the database. This connection has the maximum allowed permissions that any user of the system has, and the framework determines what particular users can actually do. This system is so common that many programmers probably assume it is the 11th commandment, and these same programmers are unaware that there is any other way to do it.
But in fact databases are able to create users and groups just the same way that operating systems are, and then to allow you to connect as these various users. Making all connections as a super-user and trusting your code to get it right puts a huge burden on the programmer, and we have a ready example of what can go wrong by looking at a very popular operating system produced in Redmond, Washington. By contrast, a simple table-based system reduces the task to careful review of the much simpler security dictionary.
The decision on which way to go is not necessarily cut and dried. If you have a site with a very simple database and very simple security, and thousands or millions of users, there is much to be gained by caching connections and arbitrating security in code. However, if you are producing line of business programs like accounting, ERP, medical billing or any other app where your users are limited, there is nothing to be gained by caching connections and everything to be gained by simplifying the system and making it easier to code and maintain.
While table security can be vastly simpler than trying to secure programs one by one, it cannot in fact handle all of the cases that might come up. In later essays we will examine column-level security and row-level security. These are not well supported in the popular products available today, so you may have to work it into your framework (which is what I did for my Andromeda framework).
Also, now that we have seen an introduction to database security, we will be seeing more secure table design patterns in the future.
Early in my career, I was fortunate to receive some programming lessons from one of the early pioneers in computer age, A. Neil Pappalardo. While I am sure he would not remember me, I certainly remembered him, and he said one thing that I remember very much: Minimize Code, Maximize Data.
This is the Database Programmer blog, for anybody who wants practical advice on database use.
There are links to other essays at the bottom of this post.
This week we are going to examine what I was told so many years ago:
Since then, and it was nearly 15 years ago, I have never once heard another programmer (except myself) express this very basic and simple idea. It is the best kept secret in programming. I can think of two reasons why most programmers working today have never heard the simple idea, "minimize code, maximize data."
First possibility: the guy was totally wrong. I highly doubt this however as his company is approaching its 40th year of worldwide growth, and how many companies in this field last that long, never mind prosper and grow?
The second possibility seems much more likely to me: most programmers just don't think that way. Most programmers would never stumble upon this idea on their own and if they do hear it they forget it or reject it. We programmers love to code, and are reluctant at best to accept the idea that he who codes least codes best.
Now we will see how this simple idea impacts the entire software cycle, from designer to user.
Our example is fairly easy. Consider a magazine distributor, a company that receives magazines in bulk and distributes them to stores. His system contains a table called DEFAULTS that lists the default quantity of each individual magazine that is distributed to each store (a big fat cross-reference table).
Now for the tricky part. Let's say we run a "SELECT SUM(qty)" from the DEFAULTS table for TV GUIDE and we get 2123. That means he needs 2123 copies of TV GUIDE to give every store their default delivery. But when the magazines arrived from the supplier, they only received 1900. What does he do? Well it so happens this happens every day, the distributor in fact never receives his exact default requirements, they are always over or under. So the distributor must run a process called regulation.
Regulation is an automatic computer process that runs through the defaults, compares them to what you actually have, and increases or decreases the delivery amounts to each store until everything lines up. It is by nature iterative, you run through the stores over and over increasing amounts by one or two until you balance out. Next we will look at how to write it.
Before we get to the details of the regulation program, I have to point out one detail about magazine distribution. It turns out that 50-80% of magazines sent to retail racks go back to the distributors and are shredded. This means it is not uncommon to talk about sales percentages of 50% or 60%. If a store is receiving too many magazines, they might easily have a sales percent as low as 10%.
So how do we write the regulation program? I'm going to avoid any long build-up here and just go straight to the answer. You make a table of rules that determine how the regulation process works. The owner of the distribution company will say things like, "If the store sold less than 20% of the TV GUIDES for the past 4 weeks, drop their amount by 2, but do not go below 2." So you make a table with columns like "THRESHOLD_PERCENT", "DECREMENT" and "ABSOLUTE_MINIMUM". There will also be rules that operate when the company has too many magazines. The owner says, "If his sales percent is greater than 80%, give him two more." So your table now contains percentages and increase amounts to use when raising the distribution amounts.
From here your program is reduced to a fairly simple affair, it is little more than a double-nested loop. You fetch a list of the rules, in order, and you begin to iterate through the magazines, applying the current rule to the current magazine. You keep applying each rule over and over until there are no more cases where it applies, then you move on to the next rule. As soon as the total magazines matches the amount you have, the program terminates. It is the very soul of simplicity, and a perfect example of what database applications can do so well.
It so happens that I have a customer who is a magazine distributor, which is how I learned about this process. The system he is using now, which I maintain for him, was not written as I described it above. There is no table of rules. The programmer coded up each different rule separately, and the programmer is now dead (I feel like Dave Barry saying this, but I swear I am not making this up). The owner and I are both afraid to touch the code, and so he lives with it while we plan for something better.
While I do not wish to speak ill of the deceased, it is safe to say that the original programmer did not know he was supposed to minimize code and maximize data. Let's examine the consequences of that ignorance:
So we can now draw the conclusion that the "minimize code, maximize data" has very direct consequences for the user experience, especially those very important users who sign the checks.
A code grinder who is not trying to minimize code usually ends up with very complicated programs. This is true even of veteran and expert programmers who supposedly know better. They set out with an idea to simplify things and then end up writing complex heavily inter-dependent class libraries. I contend that this happens because they don't know that they are supposed to minimize code by maximizing data.
On the other hand, the database programmer who knows to maximize data ends up with dramatically simpler code patterns. This happens because a lot of the complex conditionals and branching logic that is required for code-centric solutions is reduced to row scanning operations that contain much simpler algorithms in the innermost loop.
We can see this in the example above for the regulation program. The data-centric solution is basically one loop nested inside of another with the core operation occurring on the inside. The code-centric solution, which I mentioned we are afraid to touch, is full of conditionals and branches that make it dangerous to mess with for fear of causing unintended side-effects. The original programmer could probably do anything with it, everybody else is reduced to doing nothing.
The problem becomes more exaggerated as time goes on. As programs mature, their original simplicity is enhanced to handle more sophisticated edge cases and exceptions. As this process continues, many of these sophisticated edge cases and exceptions will be mutually exclusive or will interact in subtle ways. When this happens the code-centric program becomes increasingly difficult to understand and keep correct, as the layering of conditionals and branches and interdependencies makes it harder and harder to eliminate unwanted side effects. The data-centric solution on the other hand, while still becoming more difficult, is reduced to simply making sure that the tables provide the correct options for the code, and the code remains a matter of scanning the rules, picking precedence, and executing them.
It is much easier to debug data than code, especially when a simple operation has been matured to handle a collection of subtle edge cases and exceptions that may interact with each other.
If we continue to the example of the regulation process, the basic question arises, how do you test this thing? The entire concept of reliable testing is far more than we can cover in a single essay, but I do want to introduce the idea of testing with -- you guessed it, more data.
The regulation program as I have described it loops through rules and magazine quantities and adjusts those quantities. It is a row-by-row process, with each pass through the inner loop executing a single change. Debugging this process can be vastly simplified if a table is made that records, in order, each update that was made and what rule was used to make the update. Obvious errors could be detected if a rule was being applied out of order, or if the rule made the wrong change. Generating the test cases brings in yet again more data, creating a body of magazines, defaults and customers that deliberately stresses the system with extreme cases.
This debug-the-data approach can have a huge impact on boosting the customer's confidence and control. If he can directly view this log he has a perfect black-and-white explanation of how the process ran. If he does not like it he can change the rules. If you let your program run in "planning only" mode, where it generates these logs without making the changes, then your customer can play what-if and you will find you have truly made a new friend!
This week's essay leads naturally to the matter of meta-data and data dictionaries and how they can dramatically reduce the amount of code needed for routine table maintenance tasks. However, that is a large topic and must be reserved for one or more future essays.
When we get to the topic of data dictionaries, we will be looking at how a data dictionary can give you true zero-code generated table maintenance forms, among other things.
The motto "Minimize Code, Maximize Data" is not well known in popular discussions today, which I contend is a natural consequence of our basic personalities as programmers. Coding is what we do and we tend to think of solving problems as an exercise in code and more code.
Nevertheless, we have seen this week in a specific example (which could be repeated many times over) that the code-centric versus data-centric decision impacts everybody. The rule "Minimize Code, Maximize Data" has positive impacts and the coding process, the debugging process, the maintenance process, and the user experience. Since that covers all parties concerned with software development, it is safe to conclude that this is a crucial design concept.
Other philosophy essays are: