Sunday, December 19, 2010

User-Submitted Analysis Topic: Email

Reader Dean Thrasher of Infovark has submitted a schema for review and analysis as part of my User-Submitted Analysis Request series. Today we are going to take a first look at what he has. Mr. Thrasher and I both hope that any and all readers will benefit from the exercise of publicly reviewing the schema.

This particular analysis request is a great start to the series, because it has to do with email. Everybody uses email so we all understand at a very basic level what data will be handled.

Brief Introduction to User-Submitted Schemas

Mr. Thrasher and I have exchanged a couple of emails, but we have avoided any in-depth discussion. Instead, we want to carry out the conversation on the public blog. So I am not aiming to provide any "from on high" perfect analysis, instead this essay will contain a lot of questions and suggestions, and we will then move into the comments to go forward.

Disclosure: None. We are not paying each other anything, nor have I received any merchandise that would normally carry a licensing fee.

Today's essay is the very first in the User-Submitted Anlaysis Requests series. If you would like to see an analysis of your schema, follow that link and contact me.

This blog has a Complete Table of Contents and a list of Database Skills.

Brief Description and Starting Point

To get us started, I am going to quote the Infovark Product Page, and then we will see what we want to zoom in on:

Infovark automatically collects and catalogs your files and email. It consolidates your digital life into a personal wiki based on what it finds. Once you set Infovark to work, it will monitor your computer and keep your web site up-to-date

So we know even before we see anything technical that we are going to have tables of contacts, emails, phones, addresses, appointments and many other things pulled in from email systems, plus the value-add provided by the product.

The Schema As-Is

We are going to start by looking at how the details of a CONTACT are stored. The schema models contacts with a group of cross references, aka many-to-many relationships, like so:

CONTACTS +----- CONTACTS-X-EMAILS -------- EMAILADDRESSES
         |
         +----- CONTACTS-X-PHONES -------- PHONES
         |
         +----- CONTACTS-X-ADDRESSES ----- ADDRESSES
         |
         +----- CONTACTS-X-WEBADDRESSES--- WEBADDRESSES

The first thing we have to note is that there is nothing wrong with this at all. It is fully normalized and so it will be very easy to make sure that database writes will not produce anomalies or bad data.

But, not surprisingly, Mr. Thrasher notes this makes for complicated SELECTS, so we want to ask if perhaps it is over-normalized, are there complications in there that do not need to be there?

Email as A Property of Contact

If I were to follow my own advice, I would first want to identify the master tables. Master tables generally represent real things in the world: people, places, products, services, events.

So my first question is this: is an email address a free-standing entity in its own right that deserves a master table? Or is it instead a property of the CONTACT? I am going to suggest that an email address is a property of a CONTACT, and, since a CONTACT may have more than one email address, they should be stored in a child table of the CONTACTS, more like this:

CONTACTS +----- CONTACTS-X-EMAILS -------- EMAILADDRESSES
         +----- CONTACTEMAILADDRESSES
         |
         +----- CONTACTS-X-PHONES -------- PHONES
         |
         +----- CONTACTS-X-ADDRESSES ----- ADDRESSES
         |
         +----- CONTACTS-X-WEBADDRESSES--- WEBADDRESSES

Whether or not we make this switch depends not on technical arguments about keys or data types, but on whether this accurately models reality. If in fact email addresses are simply properties of contacts, then this is the simplest way to do it. Going further, the code that imports and reads the data will be easier to code, debug and maintain for two reasons: one, because it is simpler, but more importantly, two, because it accurately models reality and therefore will be easier to think about.

If this proves to be the right way to go, it may be a one-off improvement, or it may repeat itself for Phones, Addresses, and Web Addresses, but we will take that up in the next post in the series.

I am going to proceed as if this change is correct, and ask then how it will ripple through the rest of the system.

Some Specifics on the Email Addresses Table

The EMAILADDRESSES table currently has these columns:

-- SQL Flavor is Firebird
CREATE TABLE EMAILADDRESS (
  ID           INTEGER NOT NULL,
  USERNAME     VARCHAR(64) NOT NULL COLLATE UNICODE_CI,
  HOSTNAME     VARCHAR(255) NOT NULL COLLATE UNICODE_CI,
  DISPLAYNAME  VARCHAR(255) NOT NULL
);

ALTER TABLE EMAILADDRESS
  ADD CONSTRAINT PK_EMAILADDRESS
  PRIMARY KEY (ID);

CREATE TRIGGER BI_EMAILADDRESS FOR EMAILADDRESS
ACTIVE BEFORE INSERT POSITION 0
AS
BEGIN
  IF (NEW.ID IS NULL) THEN
  NEW.ID = GEN_ID(GEN_EMAILADDRESS_ID,1);
END^

Suggestion: The first thing I notice is that the complete email itself is not actually stored. So we need to ask Mr. Thrasher what the thinking behind that was. My first instinct is to store that, because it is the original natural value of interest.

Suggestion: The columns USERNAME and HOSTNAME I could go either way on. If they are needed for querying and statistics, it is better to put them in. While this violates 3rd Normal Form and so puts us at risk, the values are supplied AFAIK by a batch import, and so there is only one codepath populating them, and we are likely safe. However, if we DO NOT need to query these values for statistics, and they are only there for convenience at display time, I would likely remove them and generate them on-the-fly in application code. There are some other good reasons to do this that will come up a little further along.

Suggestion: Unless I missed something in the schema sent over, we need a unique constraint on the combination of CONTACTID and USERNAME and HOSTNAME. Or, if we remove USERNAME and HOSTNAME in favor of the original EMAILADDRESS, we need a unique constraint on CONTACTID + EMAILADDRESS.

Before We Get To Transactions

We are about to go into Part 2, which is about the other tables that reference EMAILADDRESSES, but before we do let's look at what the two tables would be if we made all changes suggested so far:

 CONTACTS             EMAILADDRESSES 
------------         --------------------
                      ID            (surrogate key)
 CONTACT_ID --------& CONTACT_ID
 other columns...     EMAILADDRESS  
                      LABEL
                      USERNAME      (possibly removed)
                      HOSTNAME      (possibly removed)
                      DISPLAYNAME

You may notice the LABEL column showed up out of nowhere. That column was previously in the cross-reference. When the cross-reference went away it landed in EMAILADDRESSES. That column LABEL holds values like "work", "home" and so on. It is supplied from whatever system we pull emails from, and so we have no constraints on it or rules about it.

Changing Emails And Transactions

Now we move on from the basic storage of EMAIL addresses to the other tables that reference those addresses. These are things like emails themselves with their lists people sent to/from, and meetings, and presumably other types of transactions as well.

When we look at transactions, which will reference contacts and email addresses, we also have to consider the fact that a CONTACT may change their email address over time. Consider a person working for Vendor A, who moves over to Vendor B. For some of the transactions they will have been at Vendor A, and then going forward they are all at Vendor B. This leads to this very important question:

Do Transactions store details about the CONTACTS as they were at the time of the transaction, or as they are now?

In other words, if a CONTACT moves from one company to another, and you look at a meeting with that person from last year, should it link to where they are now? Or should it be full of information about where they were at the time?

The answer to this question is important because it determines how to proceed on the two final points I would like to raise:

  1. Should the various transactions have a foreign key back to EMAILADDRESSES, or should they simply link back to CONTACTS and contain the EMAILADDRESS itself?
  2. Do we need an integer surrogate key on the EMAILADDRESSES table, especially if we do not link back to it?

First Final Suggestion

So the first of the final two suggestions is: maybe the transactions tables should just link back to CONTACTID and contain a freestanding EMAILADDRESS. The first argument for this is that it preserves the history as it was, and if that is what we want, then this accomplishes it. The second argument is that by putting the actual value instead of an integer key back to some table, we simplify coding by removing a join.

The arguments against embedding the email address might be basically, "hey, if this is a kind of a data warehoues, you are really supposed to be doing the snowflake thing and you don't want to waste space on that value." To which I respond that the engineer always has the choice of trading space for speed. Putting the email in directly is a correct recording of a fact, and takes more space, but eliminates a very common JOIN from many queries, so Mr. Thrasher may choose to make that call.

This also plays back to my question about whether we should have USERNAME and HOSTNAME in the EMAILADDRESSES table. If we start putting email addresses directly into tables, we can also keep putting these other two columns in, which trades again space for speed. We could also skip them and code a parser in the application that generates them on-the-fly as needed.

Second Final Suggestion

Now we go all of the way back to the child table and ask a basic question: Why is there is an integer surrogate key there? Integer surrogate keys are useful in many situations, but contrary to what the web generation learned, they are not some kind of required approach in relational databases.

Consider: we need a unique constraint on CONTACTID+EMAILADDRESS anyway, so we have to justify why we would add a new column that does not add value. The reflex answer tends to be "because they join faster" but that ignores the fact that if you use the natural key of CONTACTID+EMAILADDRESS, and put these columns into child tables, you do not need to join at all! If we use the surrogate key and embed it in child tables, then getting the CONCTACT information forces two joins: through EMAILADDRESS to CONTACTS. But if we use the natural key of CONTACTID + EMAILADDRESS we already have the contact id which saves a JOIN when we are after CONTACTS details, and, unless we want to know something like LABEL, we do not have to JOIN back to EMAILADDRESSES at all.

Conclusion

Well that's it. As promised, we have a few suggestions and a lot of questions for Mr. Thrasher. Check back in the coming days to see how the various questions work themselves out in the comments.

10 comments:

Dean Thrasher said...

Hey Ken, thanks for taking a look at a piece of our schema. I'll do my best to respond to your questions and those of your readers here in the comments.

In your suggestions, you asked why I chose to break up the email address into user name (the part before the @ symbol) host name (the part after the @ symbol) and the display name (the optional, human-readable part that some email systems provide).

The display name is separate because it is optional and not really part of the identifier. It's not necessary for the delivery of the email. But the split between username and hostname is a bit strange. Why divide the key into two parts?

The reasoning behind it is that we wanted to construct queries that could return email addresses from individuals working at the same organization. It was easier to split the data when we stored the address that it was to construct a SQL select that performed string manipulation. This also meant that we could index the hostname field to make those lookups fast.

In retrospect, since our users don't perform this "find other coworkers" query particularly often, and since its performance is not a crucial issue, your suggestion to use single email field makes a lot of sense. It's one of the first thing's I'd change.

Dean Thrasher said...

...And some comments on your two final suggestions.

Your first final suggestion makes a ton of sense. As you said, the email addresses to which a particular email was sent never change. Linking these to a Contact entity whose email information may change in the future is a bad idea. It was already starting to cause us headaches.

But it's worth mentioning why we screwed this up: We thought we were normalizing properly by creating a master email address table.

What we were actually doing was confusing two facts that ought to have been kept separate. The historical address used to send an email is not the same thing as a Contact's current email address.

If we'd considered the problem from a transactional perspective we might have avoided this pitfall.

Your second final suggestion follows naturally from the first. If our new design allows us embed the email address, there's no reason to use a surrogate Id. We won't ever need to reference a child row from outside the context of its master table.

Both suggestions would greatly simplify the amount of work we do to get items into and out of the database. Avoiding unnecessary joins is always a good thing.

KenDowns said...

Dean:

From your first comment, it sounds like it might be useful to keep the hostname for analysis, but maybe drop the username. Or what the heck, keep 'em both. Even if users are not making use of it much, it sounds like the kind of thing you'll hear about when you take it out, if you know what I mean :)

Regarding your second comment on normalization and email as a master entity, I am very glad that the suggestion in the post was worthwhile and that the transactional reasoning helped.

Am also glad the second suggestion made sense.

Sounds like our first stab at "Submit your analysis request" was a success. Thank you very much for being willing to expose a bit of your work to public scrutiny, I hope that many readers will benefit.

Arnaud said...

thanks for running that series. This is good problem solving and the way you ask questions makes the reader think, not just read.

John Zabroski said...

Dean,

Are you aware that storing e-mail addresses this way actually disallows you from storing all possible e-mail addresses? Most e-mail addresses today follow the username@hostname convention, but certainly not all. Read the RFC.

I would augment Ken's suggestion by saying that when we are dealing with natural keys as candidate keys, we should always read what the international specification (if any) says for the candidate key. In this case, there is an RFC governing valid e-mail addresses.

Finally, keep in mind that Organization is not necessarily one to one with a hostname. You might have Jeff@CinemaNow.com and Bob@BestBuy.com, and despite the fact they both work for the same corporation, your query will not tell you that they report to the same supervisor.

If you find yourself saying, "I need a report on an organization's ____", then you should probably first look at your schema and ask, Where da heck is my Organization table...?

Cheers,
Z-Bo

Dean Thrasher said...

Hey Z-Bo,

Sorry for my delay in replying to your question.

We deliberately chose to support only the "Internet Email" address specification, as it's the most prevalent one used. It's also the identifier most likely to be recognized by our end users.

Our application actually maps other email address formats, such as X.400, to the SMTP internet mail address used in these tables.

For those interested, the current version of the Internet Message Specification can be found here: http://www.ietf.org/rfc/rfc5322.txt

After reading through it, you'll never think of email as "unstructured" information again. :)

ethernett001 said...

red scope studios Web Design is a Sugar Land and Houston Web Design / graphic design company specializing in web design, web development, database design, custom database development, content management systems, and ecommerce websites.

Esay leoz said...

red scope studios Web Design is a Sugar Land and Houston Web Design / graphic design company specializing in web design, web development, database design, custom database development, content management systems, and ecommerce websites.

Esay leoz said...

red scope studios Web Design is a Sugar Land and Houston Web Design / graphic design company specializing in web design, web development, database design, custom database development, content management systems, and ecommerce websites.

sajukhann001 said...

red scope studios Web Design is a Sugar Land and Houston Web Design / graphic design company specializing in web design, web development, database design, custom database development, content management systems, and ecommerce websites.