Saturday, November 13, 2010

Database Skills

It seems strange to me that I've been working on this blog for 3 years or so (with one very long break) and somehow never got around to writing a simple list of skills that all database experts need. So here it is!

Various Job Tiles for Database People

There are three common job titles in the database area, which are Database Administrator (DBA), Database Programmer, and Database Architect. These titles tend to be somewhat variable from shop-to-shop, but generally the "Architect" term indicates the highest level of skill combined with considerable management responsibilities. The "Programmer" term is somewhere below that, but the "DBA" is extremely variable. I have seen shops where a person was called a DBA and filled a relatively constrained role closer to IT or operations (routine tasks, no real programming) and other shops where a person with the DBA title was basically the Architect.

Because of this high variability in what titles mean, I am not going to waste time categorizing skills as belonging to one job title or another, I am simply going to list them all out.

The various levels of skills are these:

  • Before Hello World!: The basics of tables, columns, rows
  • The Hello World! Level: SQL Select
  • Just after Hello World!: Writing Data
  • Commands to create, modify and drop tables, or Data Definition Language (DDL)
  • Knowing how to use a Query Analyzer or optimization tool
  • Understanding Normalization
  • Understanding Denormalization
  • Understanding Primary Keys, Foreign Keys and Constraints
  • Understanding Transactions
  • Understanding ACID
  • Understanding Indexes as optimization tool
  • Views
  • Database Security
  • Upgrades and Installs
  • Efficient access of database from application
  • Bulk operations: loading or exporting large amounts of data
  • Understanding of Contexts and how they lead to different sets of Best Practices
  • Preventing performance degradation through various maintenance tasks
  • Deployment strategies: partitioning, tablespaces
  • Deployment strategies, failure protection, from simple backup to hot standbys
  • Server side coding: stored procedures and functions
  • Server side coding: triggers
  • Temporary tables

As long as that list is, it only covers those of us who use database systems. There is an entire set of skills for those who actually create and maintain these systems, but that is not something that will be treated in this blog.

Before Hello World!: Tables and Columns

If you have never so much as typed a single SQL command, or seen a table diagram, or anything like that, then it is worth a few minutes to go through the basics of what a database does, which is to organize atomic values into tables.

I am going to write an essay on this soon, even though it may seem so obvious as to be completely unnecessary. But I will do it because the most popular essay on this blog is about using GROUP BY, which tells me newer programmers are starving for useful tutorials at the beginner level. So it seems to me, why not put something out there at the very beginning of the beginning?

The Hello World! Level: SQL Select

If you are starting completely from scratch and want to know about database programming, you want to start with the SQL SELECT command. This is the (almost) only command used to extract data from a database, and all of the possible ways to combine, filter and extract data are expressed in the many clauses of this command.

Just after Hello World!: Writing Data

When it comes time to change the data in a database there are three commands, listed below. These commands are based on the tables-and-rows nature of databases, and allow to add a row (or rows), change a row (or rows) and delete a row (or rows).

  • The INSERT command
  • The UPDATE command
  • The DELETE command

Commands to create, modify and drop tables, or Data Definition Language (DDL)

The term "DDL" stands for "Data Definition Language" and includes all of the commands use to build the tables that will hold the data for the INSERT, UPDATE, DELETE and SELECT statements. The basic list of commands to be familiar with is:

  • Understanding Data Types (databases are strongly typed)
  • CREATE TABLE and ALTER TABLE
  • Commands to add and drop primary keys
  • Commands to add and drop foreign keys
  • Commands to add and drop constraints
  • Commands to add and drop indexes

There are also numerous commands that are specific to different products. Those will not be listed here today, but who knows what the future may bring.

Knowing how to use a Query Analyzer or optimization tool

Database programmers, once they get started with the skills listed above, tend to become more and more obsessed with performance. Every major database has some type of tool that lets you examine how the server is going to process a SQL SELECT, and database programmers depend on these tools to discover where they might alter tables or indexes or the SELECT itself to make the queries go faster.

Understanding Normalization

The term "normalization" refers to the process of analyzing the data that your system is required to store, and organizing it so that every fact is stored in exactly one place. Understanding how to normalize data is an absolute requirement for the database programmer who wants to design databases.

We speak of normalization in "forms" as in "first normal form", "second normal form", and so on. It is a good idea to understand The argument for normalization and then to pursue at very least:

Normalization is a a fascinating topic to study, and it extends all they way up to "Domain-key Normal Form" which is considered the most complete normalization for a database.

Understanding Denormalization

Every database programmer, after fully understanding normalization, realizes that there are severe practical problems with a fully normalized database, such a database solves many problems but generates problems of its own. This has led programmer after programmer down the path of denormalization, the deliberate re-intoduction of redundant values to improve the usability of the database.

There is a surprising lack of material available on the web regarding denormalization strategies. Most of what you find is arguments and flame wars about whether or not to do it, with little to nothing on how to actually do it. For this reason, I provide my own essays on this blog on the strategies and methods I have worked out over the years:

After reviewing The Argument For Denormalization it is worthwhile to follow up with:

The arguments for and against denormalization are heavily affected by the Pay me now or pay me later design tradeoff.

Understanding Primary Keys, Foreign Keys and Constraints

One might argue that this list of skills belongs much higher up the list, up there with the CREATE TABLE command. However, I have it useful to distinguish between simply knowing the commands to make a primary key and actually understanding the tremendous power of keys.

In this author's opinion it is not truly possible to understand how powerful and beneficial Primary keys and Foreign Keys are for an entire application stack until you have learned the commands, built some databases, and worked through the concepts of normalization and denormalization. Only then can you revisit these humble tools and realize how powerful they are.

Understanding Transactions

The word "transaction" has two meanings in common day-to-day database talk. One meaning is very loose and refers to some individual command or set of commands. You might hear somebody using the term loosely when they say, "We're seeing about 10 transactions per second this week."

The more rigorous use of the term refers to a statement or set of statements that must be guaranteed to either complete in their entirety or fail in their entirety. This is a profoundly important concept once you get beyond simply making tables with keys and get into real-world heavy multi-user activity. And this leads us to the next topic...

Understanding ACID

Modern relational databases expect multiple simultaneous users to be writing and reading data all of the time. The term "ACID Compliance" refers to both the philosophy of how to handle this and the actual methods that implement that philosophy. The term ACID refers to:

  • The Atomic nature of each transaction
  • The Consistentcy of the database during and after simultaneous overlapping transactions
  • The Isolation of each transaction
  • The Durability of the results

Understanding Indexes as optimization tool

An index is a special tool databases use to provide very rapid access to large amounts of data. Just like keys, it is not enough to know the commands, it is necessary to understand the subtle power of indexes when used with some craftsmanship. The basic uses of indexes are:

  • A simple index on a column to provide rapid search on that column
  • A "covering index" that includes extra columns that can further speed up certain common access patterns
  • Clustered indexes (MS SQL Server) and what they give and what they take away
  • The cost of indexes on write operations

Views

A view looks like a table to the SQL SELECT command. The view itself is a stored SQL SELECT command that encodes some query that is either used very often or is very compex. In all cases, views are used to present the database data to the application in some simplified convenient or secure form. The two major uses of views are:

  • To simplify the application programmer's job
  • To provide a read-only interface for some applications

Upgrades and Installs

If you are a single programmer or hobbyist working with a database, it is all well and good to just add and drop tables as you wish. But as soon as you get into development with quality control stages and multiple programmers, it becomes evident that you need a strategy for handling the schema changes that come with with new versions of the system. There are multiple essays available on this blog, covering:

Database Security

Databases provide incredible security provisions that are just about completely ignored by modern web developers. Sometimes there is good reason for this, but overall anybody who wants to become a truly accomplished Database Programmer or Database Architect must have a thorough understanding of database security and how it can simplify the entire system stack.

Database security comes down to specifying who is allowed to perform the 4 basic operations of INSERT, UPDATE, DELETE and SELECT against which tables:

My basic introduction to security is here.

  • Understanding roles (we used to say users and groups)
  • Simple table-level security
  • Column-level security (not widely supported)
  • Row-level security (not widely supported)

Efficient access of database from application

Imagine you have the perfectly designed database, with every nuance and subtlety excellently crafted in the ares of keys, indexes, normalization, denormalization and security. At this point your job branches out into several new areas, but one of the most important is knowing how to write application code that efficiently accesses the database.

Bulk operations: loading or exporting large amounts of data

Some database applications involve a large number of small transactions, where each trip to the database writes only a single row or reads only a dozen or so rows.

But in many cases you need to bulk load large amounts of data in one shot, thousands or even millions of rows. In these cases the techniques that govern small transactions are useless and counter-productive, and you need to learn some new commands and strategies to handle the bulk loads.

Understanding Contexts and how they lead to different sets of Best Practices

Not all databases are created for the same purpose. If you have a very large operations then it will likely have multiple independent databases that fill the classical roles, while in a smaller shop the roles may be combined in one database. I like to refer to these roles as "contexts" because they determine how the tables will be designed and how acess to the tables will be governed. The major contexts are:

  • OLTP or Online Transaction Processing, characterized by simultaneous reads and writes, generally assumes little or no periods of inactivity, and generally assumes that the individual transactions are very small. The apps we were all writing in the 80s and 90s to do accounting, ERP, MRP, Job control, payroll, airline reservations and many others fall into this context.
  • Data Warehouse context, characterized by periodic bulk loads of new information with most activity being reads. The Data Warehouse context is largely associated with the "Star Schema" table design. Data in a Warehouse is historical, it never changes after it is loaded.
  • CMS or Content Management System, also characterized by very few writes compared to reads, but more likely to have a normalized structure. Unlike a Data Warehouse, the data is subject to change, just not that often.
  • Any other Read Only Context. I include this category because I spent some time working on Direct Marketing databases, which are like a Data Warehouse in that they are updated periodically and the data does not change, but the Star Schema is completely inappropriate for them.

If you consider a huge online shopping system, you can see that within that application there are at least two contexts. The product catalog is likely to see vastly fewer writes than reads, but the shopping cart tables will be in a constant state of reads and writes.

Preventing performance degradation through various maintenance tasks

Once the database and its application stack is up and running, and the reads and writes and coming through, the laws of thermodynamics come into play and system performance can begin to degrade even if the database stays the same size and the load on the system is steady.

Different vendors have different tools for combatting this, but they tend to come down to reclaiming temporary space and occassionally rebuilding indexes. There are also log files that have to be purged, regular backups to be made, and other operations along those lines.

Deployment strategies: partitioning, tablespaces

When systems become sufficiently large, it is no longer possible to just attach some disks to a box and run a database server. The Database Architect must consider breaking different tables out onto different sets of spindles, which is usually done with "tablespaces", and moving older data onto slower cheaper spindles, which is often done with Partitioning.

Deployment strategies, failure protection, from simple backup to hot standbys

Because a database typically experiences simultaneous reads and writes from multiple sources, and may be expected to be up and running 24/7 indefinitely, the concept of making a backup and recovering from a failure becomes more complicated than simply copying a few files to a safe location.

In the most demanding case, you will need to provide a second complete box that can become fully live within seconds of a disastrous failing of the main box. This is called various things, but Postgres calls it a "hot standby" in version 9 and some MS SQL Server shops call it a "failover cluster."

The ability to come up live on a second box when the first one fails is made possible by the way databases handle ACID compliance, and the fact that they produce something called a Write-Ahead-Log (WAL) that can be fed into a second box that "replays" the log so that its copy of the database is getting the same changes as the master copy.

Server side coding: stored procedures and functions

I really could not figure out where to put this entry in the list, so I just punted and put it near the end. It could really go anywhere.

Stored procedures or functions are procedural routines (not object oriented) that are on the database server and can be invoked directly from an application or embedded inside of SQL commands. Generally speaking they provide various flow-control statements and rudimentary variable support so that you can code multi-step processes on the server itself instead of putting them in application code.

Server side coding: Triggers

Triggers are quite possibly the most elegant and beautiful technology that servers support, absolutely the least understood, and definitely the most reviled by the ignorant. You will find virtually no web content today that will explain why and how to use triggers and what they are good for.

Except of course for my own essay on triggers that discusses them in terms of encapsulation.

Temporary tables

Temporary tables are like Stored Procedures inasmuch as I had no idea where to put them in the list, so they just ended up at the end.

As the name implies, a temporary table is a table that you can create on-the-fly, and which usually disappears when your transaction is complete. They are most often found in Stored Procedures. They can impact performance for the worst in many ways, but can be extremely useful when you are doing multi-staged analsysis of data in a Data Warehouse (that's where I use them the most).

3 comments:

Dean said...

I'm glad you decided to start posting again, Ken. I've found your articles and advice very useful. Even though I know my way around SQL, its great to get a refresher course on the basics.

Charlie 1 木匠 said...

Nice post, nice blog.

As a database developer and architect, I like them, please keep up the good work.

mohammed razal said...

This is an Awesome Post , For People like me wondering where study with DB...


Thanks,
Razal