It seems strange to me that I've been working on this blog
for 3 years or so (with one very long break) and somehow never
got around to writing a simple list of skills that all
database experts need. So here it is!
Various Job Titles for Database People
There are three common job titles in the database area,
which are Database Administrator (DBA), Database Programmer,
and Database Architect. These titles tend to be somewhat
variable from shop-to-shop, but generally the "Architect"
term indicates the highest level of skill combined with
considerable management responsibilities. The "Programmer"
term is somewhere below that, but the "DBA" is extremely
variable. I have seen shops where a person was called a
DBA and filled a relatively constrained role closer to
IT or operations (routine tasks, no real programming) and
other shops where a person with the DBA title was basically
the Architect.
Because of this high variability in what titles mean, I am not
going to waste time categorizing skills as belonging to one
job title or another; I am simply going to list them all out.
The various levels of skills are these:
- Before Hello World!: The basics of tables, columns, rows
- The Hello World! Level: SQL Select
- Just after Hello World!: Writing Data
- Commands to create, modify and drop tables, or Data
Definition Language (DDL)
- Knowing how to use a Query Analyzer or optimization tool
- Understanding Normalization
- Understanding Denormalization
- Understanding Primary Keys, Foreign Keys and Constraints
- Understanding Transactions
- Understanding ACID
- Understanding Indexes as an optimization tool
- Views
- Database Security
- Upgrades and Installs
- Efficient access of database from application
- Bulk operations: loading or exporting large amounts
of data
- Understanding of Contexts and how they lead to
different sets of Best Practices
- Preventing performance degradation through
various maintenance tasks
- Deployment strategies: partitioning, tablespaces
- Deployment strategies: failure protection, from
simple backup to hot standbys
- Server side coding: stored procedures and functions
- Server side coding: triggers
- Temporary tables
As long as that list is, it only covers those of us who
use database systems. There is an entire set of
skills for those who actually create and maintain these
systems, but that is not something that will be treated
in this blog.
Before Hello World!: Tables and Columns
If you have never so much as typed a single SQL command,
or seen a table diagram, or anything like that, then it is
worth a few minutes to go through the basics of what
a database does, which is to organize atomic values into
tables.
I am going to write an essay on this soon, even though it
may seem so obvious as to be completely unnecessary. But I
will do it because the most popular essay on this
blog is about using GROUP BY, which tells me newer programmers
are starving for useful tutorials at the beginner level.
So it seems to me, why not put something out there at the
very beginning of the beginning?
The Hello World! Level: SQL Select
If you are starting completely from scratch and want to know
about database programming, you want to start with the SQL
SELECT command. This is (almost) the only command used to
extract data from a database, and all of the possible ways to
combine, filter and extract data are expressed in the many
clauses of this command.
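Since every product speaks a slightly different dialect, here is a minimal sketch of one SELECT exercising several of those clauses at once, using Python's built-in sqlite3 module; the table and data are invented for illustration.

```python
import sqlite3

# A throwaway in-memory database with an invented "orders" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("alice", 10.0), ("bob", 25.0), ("alice", 5.0)])

# One SELECT combines filtering (WHERE), aggregation (GROUP BY),
# group filtering (HAVING) and ordering (ORDER BY).
rows = conn.execute("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount > 1
    GROUP BY customer
    HAVING SUM(amount) > 12
    ORDER BY total DESC
""").fetchall()
print(rows)  # [('bob', 25.0), ('alice', 15.0)]
```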
Just after Hello World!: Writing Data
When it comes time to change the data in a database
there are three commands, listed below. These commands
are based on the tables-and-rows nature of databases,
and allow you to add a row (or rows), change a row (or rows)
and delete a row (or rows).
- The INSERT command
- The UPDATE command
- The DELETE command
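A quick sketch of all three write commands in action, again using Python's sqlite3 module with an invented table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")

# INSERT adds a row.
conn.execute("INSERT INTO people (name, age) VALUES ('carol', 30)")
# UPDATE changes every row matching the WHERE clause.
conn.execute("UPDATE people SET age = 31 WHERE name = 'carol'")
# DELETE removes matching rows (nothing matches here, so carol survives).
conn.execute("DELETE FROM people WHERE age > 40")

result = conn.execute("SELECT * FROM people").fetchall()
print(result)  # [('carol', 31)]
```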
Commands to create, modify and drop tables, or Data
Definition Language (DDL)
The term "DDL" stands for "Data Definition Language"
and includes all of the commands used to build the
tables that will hold the data for the INSERT, UPDATE,
DELETE and SELECT statements. The basic list of
commands to be familiar with is:
- Understanding Data Types (databases are strongly typed)
- CREATE TABLE and ALTER TABLE
- Commands to add and drop primary keys
- Commands to add and drop foreign keys
- Commands to add and drop constraints
- Commands to add and drop indexes
There are also numerous commands that are specific
to different products. Those will not be listed here
today, but who knows what the future may bring.
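As a sketch of the basic DDL commands (note that SQLite's ALTER TABLE is more limited than most servers', but the shape of the commands is the same everywhere):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# CREATE TABLE with typed columns, a primary key and a CHECK constraint.
conn.execute("""
    CREATE TABLE products (
        sku   TEXT PRIMARY KEY,
        price REAL NOT NULL CHECK (price >= 0)
    )
""")
# ALTER TABLE modifies an existing table in place.
conn.execute("ALTER TABLE products ADD COLUMN descr TEXT")
# Indexes are created (and dropped) separately from the table itself.
conn.execute("CREATE INDEX idx_price ON products (price)")

cols = [row[1] for row in conn.execute("PRAGMA table_info(products)")]
print(cols)  # ['sku', 'price', 'descr']
```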
Knowing how to use a Query Analyzer or optimization tool
Database programmers, once they get started with the skills
listed above, tend to become more and more obsessed with
performance. Every major database has some type of tool
that lets you examine how the server is going to process
a SQL SELECT, and database programmers depend on these tools
to discover where they might alter tables or indexes or the
SELECT itself to make the queries go faster.
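Each product has its own tool (EXPLAIN in Postgres and MySQL, the graphical plan in MS SQL Server), but they all answer the same question. A sketch using SQLite's EXPLAIN QUERY PLAN:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")

# Ask the engine how it would run the query, without actually running it.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE id = 1").fetchall()
# The last column describes the access path, e.g. a SEARCH on the
# primary key rather than a full table SCAN (exact wording varies
# by SQLite version).
print(plan[0][-1])
```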
Understanding Normalization
The term "normalization" refers to the process of analyzing
the data that your system is required to store, and organizing
it so that every fact is stored in exactly one place. Understanding
how to normalize data is an absolute requirement for the
database programmer who wants to design databases.
We speak of normalization in "forms", as in "first normal form",
"second normal form", and so on. It is a good idea to understand
the argument for normalization and then to work through at the
very least the first few normal forms.
Normalization is a fascinating topic to study, and it
extends all the way up to "Domain-Key Normal Form", which is
considered the most complete normalization for a database.
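The payoff of storing every fact in exactly one place is easy to demonstrate. In this sketch (invented tables, sqlite3) the customer's city lives in one row, so a single UPDATE corrects it for every order:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized: the city is stored once, in customers, and orders
# merely reference it. In an unnormalized design the city would be
# repeated on every order row, and an update would have to chase
# down every copy.
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (customer_id INTEGER REFERENCES customers(id),
                         amount REAL);
    INSERT INTO customers VALUES (1, 'alice', 'Boston');
    INSERT INTO orders VALUES (1, 10.0), (1, 5.0);
""")

# One UPDATE fixes the fact everywhere it is used.
conn.execute("UPDATE customers SET city = 'Chicago' WHERE id = 1")
row = conn.execute("""
    SELECT c.city, SUM(o.amount) FROM customers c
    JOIN orders o ON o.customer_id = c.id
""").fetchone()
print(row)  # ('Chicago', 15.0)
```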
Understanding Denormalization
Every database programmer, after fully understanding
normalization, realizes that a fully normalized database has
severe practical problems: it solves many problems but
generates new problems of its own.
This has led programmer after programmer down the path
of denormalization, the deliberate re-introduction
of redundant values to improve the usability of the
database.
There is a surprising lack of material available on the
web regarding denormalization strategies. Most of what
you find is arguments and flame wars about whether or not
to do it, with little to nothing on how to actually do it.
For this reason, I provide my own essays on this blog on
the strategies and methods I have worked out over the years,
beginning with the argument for denormalization.
The arguments for and against denormalization are heavily
affected by the "pay me now or pay me later"
design tradeoff.
Understanding Primary Keys, Foreign Keys and Constraints
One might argue that this list of skills belongs much higher
up the list, up there with the CREATE TABLE command. However,
I have found it useful to distinguish between simply knowing the
commands to make a primary key and actually understanding
the tremendous power of keys.
In this author's opinion it is not truly possible to understand
how powerful and beneficial Primary keys and Foreign Keys are
for an entire application stack until you have learned the commands,
built some databases, and worked through the concepts of normalization
and denormalization. Only then can you revisit these humble
tools and realize how powerful they are.
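A small demonstration of that power: once a foreign key is declared, the server itself refuses orphan rows, no matter how buggy the application is. (SQLite only enforces foreign keys when asked via a PRAGMA; most servers enforce them by default.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL
    );
    INSERT INTO customers VALUES (1);
""")

conn.execute("INSERT INTO orders VALUES (1, 9.99)")  # fine: customer 1 exists
rejected = False
try:
    conn.execute("INSERT INTO orders VALUES (2, 1.00)")  # no such customer
except sqlite3.IntegrityError:
    rejected = True
print("orphan row rejected:", rejected)  # orphan row rejected: True
```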
Understanding Transactions
The word "transaction" has two meanings in common day-to-day
database talk. One meaning is very loose and refers to some
individual command or set of commands. You might hear somebody
using the term loosely when they say, "We're seeing about
10 transactions per second this week."
The more rigorous use of the term refers to a statement or
set of statements that must be guaranteed to either
complete in their entirety or fail in their entirety.
This is a profoundly important concept once you get beyond
simply making tables with keys and get into real-world
heavy multi-user activity. And this leads us to the
next topic...
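A sketch of the rigorous sense of the word, using an invented transfer between two accounts; the simulated crash stands in for anything that can go wrong mid-transaction:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100.0), ("bob", 0.0)])
conn.commit()

# A transfer must debit and credit together, or not at all.
try:
    conn.execute(
        "UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
    raise RuntimeError("simulated crash between the two statements")
    conn.execute(
        "UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
    conn.commit()
except RuntimeError:
    conn.rollback()  # the half-finished debit is undone

balances = conn.execute(
    "SELECT name, balance FROM accounts ORDER BY name").fetchall()
print(balances)  # [('alice', 100.0), ('bob', 0.0)]
```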
Understanding ACID
Modern relational databases expect multiple simultaneous
users to be writing and reading data all of the time.
The term "ACID Compliance" refers to both the philosophy
of how to handle this and the actual methods that
implement that philosophy. The term ACID refers to:
- The Atomic nature of each transaction
- The Consistency of the database during and
after simultaneous overlapping transactions
- The Isolation of each transaction
- The Durability of the results
Understanding Indexes as an optimization tool
An index is a special tool databases use to provide very
rapid access to large amounts of data. Just like keys, it
is not enough to know the commands, it is necessary to
understand the subtle power of indexes when used with some
craftsmanship. The basic uses of indexes are:
- A simple index on a column to provide rapid search
on that column
- A "covering index" that includes extra columns that
can further speed up certain common access patterns
- Clustered indexes (MS SQL Server) and what they give
and what they take away
- The cost of indexes on write operations
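The query analyzer mentioned earlier is exactly the tool for watching an index earn its keep. A sketch (invented table, sqlite3) showing the plan change before and after CREATE INDEX:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (ts INTEGER, level TEXT, msg TEXT)")
conn.executemany("INSERT INTO logs VALUES (?, ?, ?)",
                 [(i, "INFO", "x") for i in range(1000)])

# Without an index, the WHERE clause forces a scan of every row.
before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT msg FROM logs WHERE ts = 500").fetchone()[-1]

# A simple index turns the scan into a direct search. Adding msg to
# the index as an extra column would make it a covering index for
# this particular query.
conn.execute("CREATE INDEX idx_ts ON logs (ts)")
after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT msg FROM logs WHERE ts = 500").fetchone()[-1]

print(before)  # a SCAN of the logs table
print(after)   # a SEARCH using idx_ts
```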
Views
A view looks like a table to the SQL SELECT command. The view
itself is a stored SQL SELECT command that encodes some
query that is either used very often or is very complex. In
all cases, views are used to present the database data to
the application in some simplified convenient or secure
form. The two major uses of views are:
- To simplify the application programmer's job
- To provide a read-only interface for
some applications
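A sketch of both uses at once: the invented view below freezes a common query and hides the salary column from any application that only needs the directory.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept TEXT, salary REAL);
    INSERT INTO employees VALUES ('alice', 'eng', 90.0), ('bob', 'ops', 70.0);

    -- The view is a stored SELECT: it hides salary and fixes the ordering.
    CREATE VIEW directory AS
        SELECT name, dept FROM employees ORDER BY name;
""")

# To the application, the view reads exactly like a table.
rows = conn.execute("SELECT * FROM directory").fetchall()
print(rows)  # [('alice', 'eng'), ('bob', 'ops')]
```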
Upgrades and Installs
If you are a single programmer or hobbyist working with a
database, it is all well and good to just add and drop tables
as you wish. But as soon as you get into development
with quality control stages and multiple programmers, it becomes
evident that you need a strategy for handling the
schema changes that come with new versions of
the system. There are multiple essays available on
this blog covering upgrade and install strategies.
Database Security
Databases provide incredible security provisions that are
just about completely ignored by modern web developers.
Sometimes there is good reason for this, but overall anybody
who wants to become a truly accomplished Database Programmer
or Database Architect must have a thorough understanding
of database security and how it can simplify the entire
system stack.
Database security comes down to specifying who is allowed
to perform the four basic operations of INSERT, UPDATE,
DELETE and SELECT against which tables. My basic
introduction to security on this blog covers:
- Understanding roles (we used to say users and groups)
- Simple table-level security
- Column-level security (not widely supported)
- Row-level security (not widely supported)
Efficient access of database from application
Imagine you have the perfectly designed database, with
every nuance and subtlety excellently crafted in the
areas of keys, indexes, normalization, denormalization
and security. At this point your job branches out into
several new areas, but one of the most important is
knowing how to write application code that efficiently
accesses the database.
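One habit matters more than most: use parameter placeholders and ask only for the columns you need, rather than gluing values into SQL strings and SELECTing *. A sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, bio TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice', 'a very long biography')")

# Placeholders let the driver reuse one prepared statement and close
# the door on SQL injection; naming the column avoids shipping the
# large bio field across the wire when only the name is wanted.
name = conn.execute(
    "SELECT name FROM users WHERE id = ?", (1,)).fetchone()[0]
print(name)  # alice
```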
Bulk operations: loading or exporting large amounts of data
Some database applications involve a large number of small
transactions, where each trip to the database writes only a
single row or reads only a dozen or so rows.
But in many cases you need to bulk load large amounts of
data in one shot, thousands or even millions of rows. In
these cases the techniques that govern small transactions
are useless and counter-productive, and you need to learn
some new commands and strategies to handle the bulk loads.
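Real servers have dedicated bulk commands (COPY in Postgres, BULK INSERT in MS SQL Server, LOAD DATA INFILE in MySQL). The portable version of the idea is to batch the rows and pay the commit cost once, sketched here with sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor INTEGER, value REAL)")

bulk = [(i % 10, float(i)) for i in range(100_000)]

# One transaction, one batched statement. Committing per row would
# pay the durability cost 100,000 times instead of once.
with conn:  # opens a transaction, commits on success
    conn.executemany("INSERT INTO readings VALUES (?, ?)", bulk)

total = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(total)  # 100000
```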
Understanding Contexts and how they lead to different sets of Best Practices
Not all databases are created for the same purpose. A very
large operation will likely have multiple independent
databases that fill the classical roles, while in
a smaller shop the roles may be combined in one database. I
like to refer to these roles as "contexts" because they determine
how the tables will be designed and how access to the tables
will be governed. The major contexts are:
- OLTP or Online Transaction Processing, characterized
by simultaneous reads and writes, generally assumes
few or no periods of inactivity, and generally
assumes that the individual transactions are very
small. The apps we were all writing in the 80s and
90s to do accounting, ERP, MRP, Job control, payroll,
airline reservations and many others fall into this
context.
- Data Warehouse context, characterized by periodic
bulk loads of new information with most activity
being reads. The Data Warehouse context is largely
associated with the "Star Schema" table design.
Data in a Warehouse is historical; it never changes
after it is loaded.
- CMS or Content Management System, also characterized
by very few writes compared to reads, but more likely
to have a normalized structure. Unlike a Data
Warehouse, the data is subject to change, just not that
often.
- Any other Read Only Context. I include this category
because I spent some time working on Direct Marketing
databases, which are like a Data Warehouse in that they
are updated periodically and the data does not change,
but the Star Schema is completely inappropriate for them.
If you consider a huge online shopping system, you can see that
within that application there are at least two contexts. The
product catalog is likely to see vastly fewer writes than
reads, but the shopping cart tables will be in a constant state
of reads and writes.
Preventing performance degradation through
various maintenance tasks
Once the database and its application stack is up and running,
and the reads and writes are coming through, the laws of
thermodynamics come into play and system performance can
begin to degrade even if the database stays the same size
and the load on the system is steady.
Different vendors have different tools for combating this,
but they tend to come down to reclaiming temporary space and
occasionally rebuilding indexes. There are also log files
that have to be purged, regular backups to be made, and other
operations along those lines.
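The commands differ by vendor (Postgres has VACUUM and REINDEX, MS SQL Server rebuilds indexes with its own statements, and so on); SQLite happens to share the Postgres names, so a sketch there reads:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])
conn.execute("CREATE INDEX idx_x ON t (x)")
conn.execute("DELETE FROM t WHERE x < 900")  # leaves dead space behind
conn.commit()

# Reclaim the dead space and rebuild the index.
conn.execute("VACUUM")
conn.execute("REINDEX idx_x")

remaining = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(remaining)  # 100
```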
Deployment strategies: partitioning, tablespaces
When systems become sufficiently large, it is no longer
possible to just attach some disks to a box and run
a database server. The Database Architect must consider
breaking different tables out onto different sets of
spindles, which is usually done with "tablespaces", and
moving older data onto slower cheaper spindles, which is
often done with Partitioning.
Deployment strategies: failure protection, from
simple backup to hot standbys
Because a database typically experiences simultaneous
reads and writes from multiple sources, and may be expected
to be up and running 24/7 indefinitely, the concept
of making a backup and recovering from a failure becomes
more complicated than simply copying a few files to a
safe location.
In the most demanding case, you will need to provide a
second complete box that can become fully live within
seconds of a disastrous failing of the main box. This is
called various things, but Postgres calls it a "hot standby"
in version 9 and some MS SQL Server shops call it a
"failover cluster."
The ability to come up live on a second box when the first
one fails is made possible by the way databases handle
ACID compliance, and the fact that they produce something
called a Write-Ahead Log (WAL) that can be fed into a
second box that "replays" the log so that its copy of the
database is getting the same changes as the master copy.
Server side coding: stored procedures and functions
I really could not figure out where to put this entry
in the list, so I just punted and put it near the end.
It could really go anywhere.
Stored procedures or functions are procedural routines
(not object oriented) that are on the database server and
can be invoked directly from an application or embedded
inside of SQL commands. Generally speaking they provide
various flow-control statements and rudimentary variable
support so that you can code multi-step processes on the
server itself instead of putting them in application code.
Server side coding: Triggers
Triggers are quite possibly the most elegant and beautiful
technology that servers support, absolutely the least
understood, and definitely the most reviled by the
ignorant. You will find virtually no web content today
that will explain why and how to use triggers and
what they are good for.
Except of course for my own essay on
triggers that discusses them in terms of
encapsulation.
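In that encapsulation spirit, here is a sketch of an audit trigger (invented tables, sqlite3): the rule lives with the data, so the log is written no matter which application touches the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL);
    CREATE TABLE audit (account_id INTEGER,
                        old_balance REAL, new_balance REAL);

    -- Fires automatically on every balance change, no matter which
    -- application (or careless script) performs the UPDATE.
    CREATE TRIGGER log_balance AFTER UPDATE OF balance ON accounts
    BEGIN
        INSERT INTO audit VALUES (OLD.id, OLD.balance, NEW.balance);
    END;

    INSERT INTO accounts VALUES (1, 100.0);
    UPDATE accounts SET balance = 75.0 WHERE id = 1;
""")

audit_rows = conn.execute("SELECT * FROM audit").fetchall()
print(audit_rows)  # [(1, 100.0, 75.0)]
```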
Temporary tables
Temporary tables are like Stored Procedures inasmuch as
I had no idea where to put them in the list, so they
just ended up at the end.
As the name implies, a temporary table is a table that
you can create on-the-fly, and which usually disappears
when your transaction is complete. They are most often
found in Stored Procedures. They can impact performance
for the worst in many ways, but can be extremely
useful when you are doing multi-staged analysis of
data in a Data Warehouse (that's where I use them the most).
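A sketch of that staged-analysis pattern (invented data, sqlite3): an intermediate result is parked in a TEMP table, which evaporates when the connection closes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('east', 10), ('east', 20), ('west', 5);

    -- Stage one of the analysis; nothing to clean up afterwards.
    CREATE TEMP TABLE region_totals AS
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
""")

rows = conn.execute(
    "SELECT * FROM region_totals ORDER BY total DESC").fetchall()
print(rows)  # [('east', 30.0), ('west', 5.0)]
```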