Sunday, June 22, 2008

Database Performance 1: Huge Inserts

The modern database server provides a wealth of features that provide robust and reliable storage. Understanding how these features work is vital if you want fast performance for your databases. This week we begin a series on performance by looking at "ACID" compliance and how it affects our handling of large operations.

Welcome to the Database Programmer blog. This blog is for anybody who wants to see practical examples of how databases work and how to create lean and efficient database applications. There is a Complete Table Of Contents that is updated each week, and a Master list of table design patterns that is updated whenever a new design pattern is presented.

What is ACID Compliance

The modern database provides a set of features known as ACID compliance which make the database very robust. To paraphrase the Wikipedia article, ACID compliance means that:

  • Each transaction is Atomic. It is completed in its entirety or not at all.
  • The database is always Consistent. No user ever sees the intermediate and possibly invalid state of the database while your transaction is in progress.
  • Each transaction is isolated. Your changes do not get mixed up with other people's changes, even though they are executing at the same time (see more in the Wikipedia article on Serializability.)
  • The transaction is durable. Once the database says the job is complete without errors, you are assured the database has checked all constraints and keys and the transaction is completely valid. In most cases we also take this to mean the data is safely on disk and pulling the plug will not corrupt it.

Maintaining ACID compliance is expensive for the database server. It must in effect keep two versions of every row in play, and it must do so while multiple users have multiple transactions running at the same time, even while other users may be trying to read the rows that are being effected. This cost is considered more than acceptable when the reliability requirement is high. But there is one case where the inevitable consequence of ACID compliance is to destroy performance, and this is on large UPDATES and INSERTS. Today we are going to look particularly at large INSERT operations.

Consider the case where you are creating a new system that must load 1 million rows into a single table from an older system. You go to your server's manual and find the command to do so (For PostgreSQL it is COPY..., for SQL Server it is BULK INSERT...). You work out the painful details of the command with a test file of 10 rows and get the syntax down. Then you issue the command with the real file of 1 million rows. After a minute or two you realize, well, 1 million rows is a lot, time to get some coffee. Returning with the coffee you see that it is still not finished, so time to check some email. An hour later it is still not done, and when you leave it running overnight and come back in the morning your machine is frozen.

The problem here is simply one of numbers. One million rows of average length of 100 characters (in ASCII or UTF-8) will be about 100 megabytes. The server must do no less than maintain two completely separate states for database -- one state without your rows and one with your rows. The cost of this is several times the actual size of the input data, so the 1 million rows in this example will take several hundred megabytes of resources, at least! The server will be managing this process on both disk and in RAM. This will simply die on a laptop or development workstation, even one with a gig or two of RAM.

You can maybe go out and buy some RAM, but the purpose of this essay is to explain how to deal with those inevitable cases where the operation you are performing requires more resources than you have. This situation will always come up, so it is good to know how to deal with it.

Step 1: Drop Indexes and Keys

ACID compliance extends to indexes as well. When you INSERT many thousands or millions of rows to a single table in one shot, the server must maintain two separate versions of each index. This burden is laid on top of the burden of calculating the index keys for every single row one-by-one. If we began with a burden of several hundred megabytes of resources, just a few indexes on your table could end up more than doubling that.

This is why you will see advice on mailing lists and forums to drop indexes before doing large insert operations.

Your table will have one index automatically for the primary key, so you must drop the primary key. You will also want to drop foreign keys so that you do not waste time checking them (they also have indexes). Unique constraints also end up creating indexes automatically behind the scenes, so you must drop those. Other constraints must also be dropped to prevent having them checked for every single one of your million rows. Finally, you must drop all indexes you created yourself. After the load is complete, you must recreate these keys, indexes, and constraints.

In some cases, where your load is small enough, this may be enough to get predictable load times, and you can stop here. But the larger the operation, the more likely that this will not be enough. In those cases, there is another step to take.

Step 2: Chunks

The basic problem described above is that the database performance has gone non-linear. When you double the number of rows, it does not take twice as long, but four times (or 3 or 10 or whatever). When you multiply the rows by 10, it may not take 10 times as long, you might see it take 100 times as long, or more! (Or maybe you just killed it after you came back in the morning and your workstation was frozen).

We can restore linear performance if we break the input into chunks and load them one at a time in succession. You break the input into files that are small enough so that no individual file will send the server into non-linear hell. If we find that a chunk of 5000 rows loads in 4 seconds, and we have 2000 of these files, we now have a predictable load time. We have restored linear performance because we know that twice as many rows will take twice as long.

I am currently working on a system where I must occassionally load 3 tables of about 3 million rows each periodically from an old desktop Visual Foxpro system into a Postgres database. The chunking code on the output looks something like this:

    * This is FOXPRO code, which is vaguely like BASIC...
  mCount     = 0
  mIncrement = 5000                    * Hardcoded chunk size
  mRN        = Reccount(p_tableName)   * Fox's command to get row count

    * these three commands get the record pointer to the top
    * of the table and turn off all indexes 
 SELECT (p_tableName) 
 set order to 0
 for rnStart = 1 TO mRN step mIncrement
  mCount = mCount + 1
        * each loop outputs the next 5000 rows and leaves the
        * record pointer ready for the next loop. 
        * Foxpro uses a semi-colon to mean continue onto next line
  COPY column1,column2,column3 ;
        TO (m_dir+p_tableName+"_"+PADL(mCount,6,'0')+".asc") DELIMITED ;
   WHILE recno() < (rnStart + mIncrement)
        * This is foxpro's 'echo' command, a question mark
  ? p_tableName + "  copied "+str(_TALLY)+" records"

Then on the receiving end I need a program that reads the chunks and loads them in. The relevant portion of the code is here (the example is in PHP and loads to PostgreSQL):

#  Assume variable $fcnt holds the number of chunks to load
#  and that variables like $tabname, $ddir etc hold file names,
#  directory locations, column lists and so forth.
for($c = 1; $c<= $fcnt; $c++) {
    $insert = "_".str_pad($c,6,'0',STR_PAD_LEFT);
    LogEntry("     loading file # "
        .str_pad($c,6,' ',STR_PAD_LEFT)
        .' of '.str_pad($fcnt,6,' ',STR_PAD_LEFT)
    $cmd="COPY $tabname ($collist) "
        ." FROM '$ddir$afile$insert.asc' DELIMITERS ',' CSV QUOTE '\"'";
    # This is my frameworks super-simple direct SQL command

Conclusion: Chunks Restore Linear Performance

Database programmers depend heavily on "ACID" features to provide robust data storage. We depend upon these features so much that we will not consider using systems that cannot provide them (MySQL's MyISAM engine for instance). The cost of these features for performance is considered part of the bargain when robustness is required, but when you are doing a huge insert, the ACID features cause performance to go "non-linear", to become unpredictably long. As a first step you can drop indexes, keys, and constraints on a table to improve load times, but if that is not enough, you can restore linear performance by breaking the large operation into many "chunks", each of which is small enough to stay linear.

Next Essay: Performance in the Web Layer


Anonymous said...

But why is the performance non-linear?

KenDowns said...

anonymous: As long as the transaction is small enough, the amount of time it takes is roughly proportional to the amount of data going in, which makes it roughly linear. When the transaction gets to big and the server starts swapping out then all bets are off, that's when you've gone non-linear.

Tim said...

How would you approach a need to be able to quickly search over the data being inserted in these chunks? You wouldn't want to be re-indexing after each chunk because a new chunk may be coming in, but you also want to be able to search your data in near-realtime.

I ran into this problem when I was trying to import ~20 million rows over the course of 24 hours. I had the database partitions for every hour. My only solution was "ok we just accept the risk that we can't search for the current hour"


Anonymous said...

This may be some really stupid questions, but why not just insert those rows one-by-one? Or if that causes too much communication between the application and the database server, 5000-by-5000 (or something like that)? I know that that is what you said in the "chunking"-part. What I don't really understand though, is why you would want to drop all of those indices. I understand that it helps with performance, but it seems a lot easier to just make more, smaller INSERTs. So why would you drop indices?

KenDowns said...
This comment has been removed by the author.
KenDowns said...

: I would rephrase it to say that you need to keep the table online during the load.

Do you need to load 20 million every 24 hours, or did you only need to do it once?

KenDowns said...

anonymous: 1-by-1 also takes forever, because the round-trip cost comes into play.

Rebuilding the indexes is done soley for performance. The database can build it much faster by examining all rows than it can by adding one key at a time.

Tim said...

In the particular situation I've dealt with, it was large imports consistently over the course of an hour (10k+ every 5 minutes).

After the import though, data for the "past" was static or wouldn't need to be updated at such a rate as to degrade performance.

Pinning down an acceptable process for doing those large continuous imports while still maintaining snappy searches was beyond me though.

KenDowns said...

Tim, sounds like pure inserts then. Without having it in front of me I cannot say for sure, but I think I would try small chunks into an online table (w/o dropping indexes) and users would have to be informed that the load itself took some time.

tieTYT said...

"anonymous: As long as the transaction is small enough, the amount of time it takes is roughly proportional to the amount of data going in, which makes it roughly linear. When the transaction gets to big and the server starts swapping out then all bets are off, that's when you've gone non-linear."

OK, and why is that non-linear? Regardless, this comment implies that you should commit after every chunk. But, your article doesn't really make this statement. Should you commit after every chunk?


KenDowns said...

tietyt: Nonlinear means that you cannot reliably predict that performance will go as the number of rows. Twice as much rows could take 10 times as long.

Boby Thomas said...

Interesting especially the droping of indexes

Boby Thomas