Reader Dean Thrasher of Infovark has submitted a schema for review and analysis as part of my User-Submitted Analysis Request series. Today we are going to take a first look at what he has. Mr. Thrasher and I both hope that any and all readers will benefit from the exercise of publicly reviewing the schema.
This particular analysis request is a great start to the series, because it has to do with email. Everybody uses email so we all understand at a very basic level what data will be handled.
Brief Introduction to User-Submitted Schemas
Mr. Thrasher and I have exchanged a couple of emails, but we have avoided any in-depth discussion. Instead, we want to carry out the conversation on the public blog. So I am not aiming to provide any "from on high" perfect analysis, instead this essay will contain a lot of questions and suggestions, and we will then move into the comments to go forward.
Disclosure: None. We are not paying each other anything, nor have I received any merchandise that would normally carry a licensing fee.
Today's essay is the very first in the User-Submitted Anlaysis Requests series. If you would like to see an analysis of your schema, follow that link and contact me.
Brief Description and Starting Point
To get us started, I am going to quote the Infovark Product Page, and then we will see what we want to zoom in on:
Infovark automatically collects and catalogs your files and email. It consolidates your digital life into a personal wiki based on what it finds. Once you set Infovark to work, it will monitor your computer and keep your web site up-to-date
So we know even before we see anything technical that we are going to have tables of contacts, emails, phones, addresses, appointments and many other things pulled in from email systems, plus the value-add provided by the product.
The Schema As-Is
We are going to start by looking at how the details of a CONTACT are stored. The schema models contacts with a group of cross references, aka many-to-many relationships, like so:
CONTACTS +----- CONTACTS-X-EMAILS -------- EMAILADDRESSES | +----- CONTACTS-X-PHONES -------- PHONES | +----- CONTACTS-X-ADDRESSES ----- ADDRESSES | +----- CONTACTS-X-WEBADDRESSES--- WEBADDRESSES
The first thing we have to note is that there is nothing wrong with this at all. It is fully normalized and so it will be very easy to make sure that database writes will not produce anomalies or bad data.
But, not surprisingly, Mr. Thrasher notes this makes for complicated SELECTS, so we want to ask if perhaps it is over-normalized, are there complications in there that do not need to be there?
Email as A Property of Contact
If I were to follow my own advice, I would first want to identify the master tables. Master tables generally represent real things in the world: people, places, products, services, events.
So my first question is this: is an email address a free-standing entity in its own right that deserves a master table? Or is it instead a property of the CONTACT? I am going to suggest that an email address is a property of a CONTACT, and, since a CONTACT may have more than one email address, they should be stored in a child table of the CONTACTS, more like this:
CONTACTS-X-EMAILS -------- EMAILADDRESSES+----- CONTACTEMAILADDRESSES | +----- CONTACTS-X-PHONES -------- PHONES | +----- CONTACTS-X-ADDRESSES ----- ADDRESSES | +----- CONTACTS-X-WEBADDRESSES--- WEBADDRESSES
Whether or not we make this switch depends not on technical arguments about keys or data types, but on whether this accurately models reality. If in fact email addresses are simply properties of contacts, then this is the simplest way to do it. Going further, the code that imports and reads the data will be easier to code, debug and maintain for two reasons: one, because it is simpler, but more importantly, two, because it accurately models reality and therefore will be easier to think about.
If this proves to be the right way to go, it may be a one-off improvement, or it may repeat itself for Phones, Addresses, and Web Addresses, but we will take that up in the next post in the series.
I am going to proceed as if this change is correct, and ask then how it will ripple through the rest of the system.
Some Specifics on the Email Addresses Table
The EMAILADDRESSES table currently has these columns:
-- SQL Flavor is Firebird CREATE TABLE EMAILADDRESS ( ID INTEGER NOT NULL, USERNAME VARCHAR(64) NOT NULL COLLATE UNICODE_CI, HOSTNAME VARCHAR(255) NOT NULL COLLATE UNICODE_CI, DISPLAYNAME VARCHAR(255) NOT NULL ); ALTER TABLE EMAILADDRESS ADD CONSTRAINT PK_EMAILADDRESS PRIMARY KEY (ID); CREATE TRIGGER BI_EMAILADDRESS FOR EMAILADDRESS ACTIVE BEFORE INSERT POSITION 0 AS BEGIN IF (NEW.ID IS NULL) THEN NEW.ID = GEN_ID(GEN_EMAILADDRESS_ID,1); END^
Suggestion: The first thing I notice is that the complete email itself is not actually stored. So we need to ask Mr. Thrasher what the thinking behind that was. My first instinct is to store that, because it is the original natural value of interest.
Suggestion: The columns USERNAME and HOSTNAME I could go either way on. If they are needed for querying and statistics, it is better to put them in. While this violates 3rd Normal Form and so puts us at risk, the values are supplied AFAIK by a batch import, and so there is only one codepath populating them, and we are likely safe. However, if we DO NOT need to query these values for statistics, and they are only there for convenience at display time, I would likely remove them and generate them on-the-fly in application code. There are some other good reasons to do this that will come up a little further along.
Suggestion: Unless I missed something in the schema sent over, we need a unique constraint on the combination of CONTACTID and USERNAME and HOSTNAME. Or, if we remove USERNAME and HOSTNAME in favor of the original EMAILADDRESS, we need a unique constraint on CONTACTID + EMAILADDRESS.
Before We Get To Transactions
We are about to go into Part 2, which is about the other tables that reference EMAILADDRESSES, but before we do let's look at what the two tables would be if we made all changes suggested so far:
CONTACTS EMAILADDRESSES ------------ -------------------- ID (surrogate key) CONTACT_ID --------& CONTACT_ID other columns... EMAILADDRESS LABEL USERNAME (possibly removed) HOSTNAME (possibly removed) DISPLAYNAME
You may notice the LABEL column showed up out of nowhere. That column was previously in the cross-reference. When the cross-reference went away it landed in EMAILADDRESSES. That column LABEL holds values like "work", "home" and so on. It is supplied from whatever system we pull emails from, and so we have no constraints on it or rules about it.
Changing Emails And Transactions
Now we move on from the basic storage of EMAIL addresses to the other tables that reference those addresses. These are things like emails themselves with their lists people sent to/from, and meetings, and presumably other types of transactions as well.
When we look at transactions, which will reference contacts and email addresses, we also have to consider the fact that a CONTACT may change their email address over time. Consider a person working for Vendor A, who moves over to Vendor B. For some of the transactions they will have been at Vendor A, and then going forward they are all at Vendor B. This leads to this very important question:
Do Transactions store details about the CONTACTS as they were at the time of the transaction, or as they are now?
In other words, if a CONTACT moves from one company to another, and you look at a meeting with that person from last year, should it link to where they are now? Or should it be full of information about where they were at the time?
The answer to this question is important because it determines how to proceed on the two final points I would like to raise:
- Should the various transactions have a foreign key back to EMAILADDRESSES, or should they simply link back to CONTACTS and contain the EMAILADDRESS itself?
- Do we need an integer surrogate key on the EMAILADDRESSES table, especially if we do not link back to it?
First Final Suggestion
So the first of the final two suggestions is: maybe the transactions tables should just link back to CONTACTID and contain a freestanding EMAILADDRESS. The first argument for this is that it preserves the history as it was, and if that is what we want, then this accomplishes it. The second argument is that by putting the actual value instead of an integer key back to some table, we simplify coding by removing a join.
The arguments against embedding the email address might be basically, "hey, if this is a kind of a data warehoues, you are really supposed to be doing the snowflake thing and you don't want to waste space on that value." To which I respond that the engineer always has the choice of trading space for speed. Putting the email in directly is a correct recording of a fact, and takes more space, but eliminates a very common JOIN from many queries, so Mr. Thrasher may choose to make that call.
This also plays back to my question about whether we should have USERNAME and HOSTNAME in the EMAILADDRESSES table. If we start putting email addresses directly into tables, we can also keep putting these other two columns in, which trades again space for speed. We could also skip them and code a parser in the application that generates them on-the-fly as needed.
Second Final Suggestion
Now we go all of the way back to the child table and ask a basic question: Why is there is an integer surrogate key there? Integer surrogate keys are useful in many situations, but contrary to what the web generation learned, they are not some kind of required approach in relational databases.
Consider: we need a unique constraint on CONTACTID+EMAILADDRESS anyway, so we have to justify why we would add a new column that does not add value. The reflex answer tends to be "because they join faster" but that ignores the fact that if you use the natural key of CONTACTID+EMAILADDRESS, and put these columns into child tables, you do not need to join at all! If we use the surrogate key and embed it in child tables, then getting the CONCTACT information forces two joins: through EMAILADDRESS to CONTACTS. But if we use the natural key of CONTACTID + EMAILADDRESS we already have the contact id which saves a JOIN when we are after CONTACTS details, and, unless we want to know something like LABEL, we do not have to JOIN back to EMAILADDRESSES at all.
Well that's it. As promised, we have a few suggestions and a lot of questions for Mr. Thrasher. Check back in the coming days to see how the various questions work themselves out in the comments.