The Database Programmer: User-Submitted Analysis Topic: Email

Sunday, December 19, 2010

User-Submitted Analysis Topic: Email

Reader Dean Thrasher of Infovark has submitted a schema for review and analysis as part of my User-Submitted Analysis Request series. Today we are going to take a first look at what he has. Mr. Thrasher and I both hope that any and all readers will benefit from the exercise of publicly reviewing the schema.

This particular analysis request is a great start to the series, because it has to do with email. Everybody uses email so we all understand at a very basic level what data will be handled.

Brief Introduction to User-Submitted Schemas

Mr. Thrasher and I have exchanged a couple of emails, but we have avoided any in-depth discussion. Instead, we want to carry out the conversation on the public blog. So I am not aiming to provide any "from on high" perfect analysis, instead this essay will contain a lot of questions and suggestions, and we will then move into the comments to go forward.

Disclosure: None. We are not paying each other anything, nor have I received any merchandise that would normally carry a licensing fee.

Today's essay is the very first in the User-Submitted Anlaysis Requests series. If you would like to see an analysis of your schema, follow that link and contact me.

This blog has a Complete Table of Contents and a list of Database Skills.

Brief Description and Starting Point

To get us started, I am going to quote the Infovark Product Page, and then we will see what we want to zoom in on:

Infovark automatically collects and catalogs your files and email. It consolidates your digital life into a personal wiki based on what it finds. Once you set Infovark to work, it will monitor your computer and keep your web site up-to-date

So we know even before we see anything technical that we are going to have tables of contacts, emails, phones, addresses, appointments and many other things pulled in from email systems, plus the value-add provided by the product.

The Schema As-Is

We are going to start by looking at how the details of a CONTACT are stored. The schema models contacts with a group of cross references, aka many-to-many relationships, like so:

CONTACTS +----- CONTACTS-X-EMAILS -------- EMAILADDRESSES
         |
         +----- CONTACTS-X-PHONES -------- PHONES
         |
         +----- CONTACTS-X-ADDRESSES ----- ADDRESSES
         |
         +----- CONTACTS-X-WEBADDRESSES--- WEBADDRESSES

The first thing we have to note is that there is nothing wrong with this at all. It is fully normalized and so it will be very easy to make sure that database writes will not produce anomalies or bad data.

But, not surprisingly, Mr. Thrasher notes this makes for complicated SELECTS, so we want to ask if perhaps it is over-normalized, are there complications in there that do not need to be there?

Email as A Property of Contact

If I were to follow my own advice, I would first want to identify the master tables. Master tables generally represent real things in the world: people, places, products, services, events.

So my first question is this: is an email address a free-standing entity in its own right that deserves a master table? Or is it instead a property of the CONTACT? I am going to suggest that an email address is a property of a CONTACT, and, since a CONTACT may have more than one email address, they should be stored in a child table of the CONTACTS, more like this:

CONTACTS +----- CONTACTS-X-EMAILS -------- EMAILADDRESSES
         +----- CONTACTEMAILADDRESSES
         |
         +----- CONTACTS-X-PHONES -------- PHONES
         |
         +----- CONTACTS-X-ADDRESSES ----- ADDRESSES
         |
         +----- CONTACTS-X-WEBADDRESSES--- WEBADDRESSES

Whether or not we make this switch depends not on technical arguments about keys or data types, but on whether this accurately models reality. If in fact email addresses are simply properties of contacts, then this is the simplest way to do it. Going further, the code that imports and reads the data will be easier to code, debug and maintain for two reasons: one, because it is simpler, but more importantly, two, because it accurately models reality and therefore will be easier to think about.

If this proves to be the right way to go, it may be a one-off improvement, or it may repeat itself for Phones, Addresses, and Web Addresses, but we will take that up in the next post in the series.

I am going to proceed as if this change is correct, and ask then how it will ripple through the rest of the system.

Some Specifics on the Email Addresses Table

The EMAILADDRESSES table currently has these columns:

-- SQL Flavor is Firebird
CREATE TABLE EMAILADDRESS (
  ID           INTEGER NOT NULL,
  USERNAME     VARCHAR(64) NOT NULL COLLATE UNICODE_CI,
  HOSTNAME     VARCHAR(255) NOT NULL COLLATE UNICODE_CI,
  DISPLAYNAME  VARCHAR(255) NOT NULL
);

ALTER TABLE EMAILADDRESS
  ADD CONSTRAINT PK_EMAILADDRESS
  PRIMARY KEY (ID);

CREATE TRIGGER BI_EMAILADDRESS FOR EMAILADDRESS
ACTIVE BEFORE INSERT POSITION 0
AS
BEGIN
  IF (NEW.ID IS NULL) THEN
  NEW.ID = GEN_ID(GEN_EMAILADDRESS_ID,1);
END^

Suggestion: The first thing I notice is that the complete email itself is not actually stored. So we need to ask Mr. Thrasher what the thinking behind that was. My first instinct is to store that, because it is the original natural value of interest.

Suggestion: The columns USERNAME and HOSTNAME I could go either way on. If they are needed for querying and statistics, it is better to put them in. While this violates 3rd Normal Form and so puts us at risk, the values are supplied AFAIK by a batch import, and so there is only one codepath populating them, and we are likely safe. However, if we DO NOT need to query these values for statistics, and they are only there for convenience at display time, I would likely remove them and generate them on-the-fly in application code. There are some other good reasons to do this that will come up a little further along.

Suggestion: Unless I missed something in the schema sent over, we need a unique constraint on the combination of CONTACTID and USERNAME and HOSTNAME. Or, if we remove USERNAME and HOSTNAME in favor of the original EMAILADDRESS, we need a unique constraint on CONTACTID + EMAILADDRESS.

Before We Get To Transactions

We are about to go into Part 2, which is about the other tables that reference EMAILADDRESSES, but before we do let's look at what the two tables would be if we made all changes suggested so far:

 CONTACTS             EMAILADDRESSES 
------------         --------------------
                      ID            (surrogate key)
 CONTACT_ID --------& CONTACT_ID
 other columns...     EMAILADDRESS  
                      LABEL
                      USERNAME      (possibly removed)
                      HOSTNAME      (possibly removed)
                      DISPLAYNAME

You may notice the LABEL column showed up out of nowhere. That column was previously in the cross-reference. When the cross-reference went away it landed in EMAILADDRESSES. That column LABEL holds values like "work", "home" and so on. It is supplied from whatever system we pull emails from, and so we have no constraints on it or rules about it.

Changing Emails And Transactions

Now we move on from the basic storage of EMAIL addresses to the other tables that reference those addresses. These are things like emails themselves with their lists people sent to/from, and meetings, and presumably other types of transactions as well.

When we look at transactions, which will reference contacts and email addresses, we also have to consider the fact that a CONTACT may change their email address over time. Consider a person working for Vendor A, who moves over to Vendor B. For some of the transactions they will have been at Vendor A, and then going forward they are all at Vendor B. This leads to this very important question:

Do Transactions store details about the CONTACTS as they were at the time of the transaction, or as they are now?

In other words, if a CONTACT moves from one company to another, and you look at a meeting with that person from last year, should it link to where they are now? Or should it be full of information about where they were at the time?

The answer to this question is important because it determines how to proceed on the two final points I would like to raise:

Should the various transactions have a foreign key back to EMAILADDRESSES, or should they simply link back to CONTACTS and contain the EMAILADDRESS itself?
Do we need an integer surrogate key on the EMAILADDRESSES table, especially if we do not link back to it?

First Final Suggestion

So the first of the final two suggestions is: maybe the transactions tables should just link back to CONTACTID and contain a freestanding EMAILADDRESS. The first argument for this is that it preserves the history as it was, and if that is what we want, then this accomplishes it. The second argument is that by putting the actual value instead of an integer key back to some table, we simplify coding by removing a join.

The arguments against embedding the email address might be basically, "hey, if this is a kind of a data warehoues, you are really supposed to be doing the snowflake thing and you don't want to waste space on that value." To which I respond that the engineer always has the choice of trading space for speed. Putting the email in directly is a correct recording of a fact, and takes more space, but eliminates a very common JOIN from many queries, so Mr. Thrasher may choose to make that call.

This also plays back to my question about whether we should have USERNAME and HOSTNAME in the EMAILADDRESSES table. If we start putting email addresses directly into tables, we can also keep putting these other two columns in, which trades again space for speed. We could also skip them and code a parser in the application that generates them on-the-fly as needed.

Second Final Suggestion

Now we go all of the way back to the child table and ask a basic question: Why is there is an integer surrogate key there? Integer surrogate keys are useful in many situations, but contrary to what the web generation learned, they are not some kind of required approach in relational databases.

Consider: we need a unique constraint on CONTACTID+EMAILADDRESS anyway, so we have to justify why we would add a new column that does not add value. The reflex answer tends to be "because they join faster" but that ignores the fact that if you use the natural key of CONTACTID+EMAILADDRESS, and put these columns into child tables, you do not need to join at all! If we use the surrogate key and embed it in child tables, then getting the CONCTACT information forces two joins: through EMAILADDRESS to CONTACTS. But if we use the natural key of CONTACTID + EMAILADDRESS we already have the contact id which saves a JOIN when we are after CONTACTS details, and, unless we want to know something like LABEL, we do not have to JOIN back to EMAILADDRESSES at all.

Conclusion

Well that's it. As promised, we have a few suggestions and a lot of questions for Mr. Thrasher. Check back in the coming days to see how the various questions work themselves out in the comments.

31 comments:

Dean Thrasher said...: Hey Ken, thanks for taking a look at a piece of our schema. I'll do my best to respond to your questions and those of your readers here in the comments.

In your suggestions, you asked why I chose to break up the email address into user name (the part before the @ symbol) host name (the part after the @ symbol) and the display name (the optional, human-readable part that some email systems provide).

The display name is separate because it is optional and not really part of the identifier. It's not necessary for the delivery of the email. But the split between username and hostname is a bit strange. Why divide the key into two parts?

The reasoning behind it is that we wanted to construct queries that could return email addresses from individuals working at the same organization. It was easier to split the data when we stored the address that it was to construct a SQL select that performed string manipulation. This also meant that we could index the hostname field to make those lookups fast.

In retrospect, since our users don't perform this "find other coworkers" query particularly often, and since its performance is not a crucial issue, your suggestion to use single email field makes a lot of sense. It's one of the first thing's I'd change.; December 19, 2010 at 2:24 PM
Dean Thrasher said...: ...And some comments on your two final suggestions.

Your first final suggestion makes a ton of sense. As you said, the email addresses to which a particular email was sent never change. Linking these to a Contact entity whose email information may change in the future is a bad idea. It was already starting to cause us headaches.

But it's worth mentioning why we screwed this up: We thought we were normalizing properly by creating a master email address table.

What we were actually doing was confusing two facts that ought to have been kept separate. The historical address used to send an email is not the same thing as a Contact's current email address.

If we'd considered the problem from a transactional perspective we might have avoided this pitfall.

Your second final suggestion follows naturally from the first. If our new design allows us embed the email address, there's no reason to use a surrogate Id. We won't ever need to reference a child row from outside the context of its master table.

Both suggestions would greatly simplify the amount of work we do to get items into and out of the database. Avoiding unnecessary joins is always a good thing.; December 19, 2010 at 3:25 PM
KenDowns said...: Dean:

From your first comment, it sounds like it might be useful to keep the hostname for analysis, but maybe drop the username. Or what the heck, keep 'em both. Even if users are not making use of it much, it sounds like the kind of thing you'll hear about when you take it out, if you know what I mean :)

Regarding your second comment on normalization and email as a master entity, I am very glad that the suggestion in the post was worthwhile and that the transactional reasoning helped.

Am also glad the second suggestion made sense.

Sounds like our first stab at "Submit your analysis request" was a success. Thank you very much for being willing to expose a bit of your work to public scrutiny, I hope that many readers will benefit.; December 21, 2010 at 7:35 PM
Arnaud said...: thanks for running that series. This is good problem solving and the way you ask questions makes the reader think, not just read.; January 10, 2011 at 3:33 PM
John Zabroski said...: Dean,

Are you aware that storing e-mail addresses this way actually disallows you from storing all possible e-mail addresses? Most e-mail addresses today follow the username@hostname convention, but certainly not all. Read the RFC.

I would augment Ken's suggestion by saying that when we are dealing with natural keys as candidate keys, we should always read what the international specification (if any) says for the candidate key. In this case, there is an RFC governing valid e-mail addresses.

Finally, keep in mind that Organization is not necessarily one to one with a hostname. You might have Jeff@CinemaNow.com and Bob@BestBuy.com, and despite the fact they both work for the same corporation, your query will not tell you that they report to the same supervisor.

If you find yourself saying, "I need a report on an organization's ____", then you should probably first look at your schema and ask, Where da heck is my Organization table...?

Cheers,
Z-Bo; February 18, 2011 at 6:45 PM
Dean Thrasher said...: Hey Z-Bo,

Sorry for my delay in replying to your question.

We deliberately chose to support only the "Internet Email" address specification, as it's the most prevalent one used. It's also the identifier most likely to be recognized by our end users.

Our application actually maps other email address formats, such as X.400, to the SMTP internet mail address used in these tables.

For those interested, the current version of the Internet Message Specification can be found here: http://www.ietf.org/rfc/rfc5322.txt

After reading through it, you'll never think of email as "unstructured" information again. :); March 10, 2011 at 10:38 PM
Kenneth Leitner said...: abcya real game and news; February 15, 2020 at 3:04 AM
Unknown said...: Analyzde the statistics and evaluate these values.

run 3; July 13, 2020 at 1:40 PM
impotence said...: Please let me know if you're looking for a writer for your weblog. You have some really great articles and I think I would be a good asset. If you ever want to take some of the load off, I'd really like to write some content for your blog in exchange for a link back to mine. Please shoot me an email if interested. Thanks!; November 24, 2020 at 8:42 PM
aditya said...: This comment has been removed by the author.; February 15, 2021 at 3:59 AM
mikejasonuk said...: brother Printers continue changing and refreshing its product so it matches with the current climate. In the event that you are the person who is thinking why your Brother printer is in the blunder state then there can be numerous explanations behind the equivalent. Brother printer is an error stateThe interaction to print now and then turns out to be exceptionally troublesome when the brother printer is in a error state. On the off chance that your printer is in error state it will ask you for a fast fix and when the brother printer stops to work then it obviously implies that the issue is in the actual printer and you need to fix something similar. In this article, you will figure out how might you fix this issue, so let us start the theme.; June 22, 2021 at 7:19 AM
Benjamin Jones said...: H0t Profile Picture
LadyinSwimwear
MatureFemale
LadyinSwimwear
CollegeGoingGirl
IndianAunty
Indianface
ImageGallery
VideoGallery
VideoGallery; January 12, 2022 at 6:55 AM
스테트슨 메종 우랜돌프 said...: Such an interesting topic that everyone should know this. keep posting; December 2, 2024 at 3:02 AM
카일리 엔리케스에덴 굿 said...: I am happy to helpful information here within the article. Thanks for sharing.; December 2, 2024 at 3:02 AM
카밀라 카이 황슈나이더 said...: This site is good and I found very interesting stuff here. Great job, thanks; December 2, 2024 at 3:03 AM
카지노사이트 카지노사이트 모음 said...: Thank you for the information.

´´´´´████████´´´´
´´`´███▒▒▒▒███´´´´´
´´´███▒●▒▒●▒██´´´
´´´███▒▒▒▒▒▒██´´´´´
´´´███▒▒💋▒▒██´
´´██████▒▒███´´´´´
´██████▒▒▒▒███´´
██████▒▒▒▒▒▒███´´´´
´´▓▓▓▓▓▓▓▓▓▓▓▓▓▒´´
´´▒▒▒▒▓▓▓▓▓▓▓▓▓▒´´´´´
´.▒▒▒´´▓▓▓▓▓▓▓▓▒´´´´´
´.▒▒´´´´▓▓▓▓▓▓▓▒
..▒▒.´´´´▓▓▓▓▓▓▓▒
´▒▒▒▒▒▒▒▒▒▒▒▒
´´´´´´´´´███████´´´´
´´´´´´´´████████´´´´´´
´´´´´´´█████████´´´´´
´´´´´´██████████´´´
´´´´´´██████████´´
´´´´´´´█████████´
´´´´´´´█████████´
´´´´´´´´████████´´´
________▒▒▒▒▒
_________▒▒▒▒
_________▒▒▒▒
________▒▒_▒▒
_______▒▒__▒▒
_____ ▒▒___▒▒
_____▒▒___▒▒
____▒▒____▒▒
___▒▒_____▒▒
_███______▒▒
█_███____███
█__███_ _█_███
█ ███; September 28, 2025 at 11:27 PM
토토사이트 토토사이트 추천 said...: I should say only that its awesome! The blog is informational and always produce amazing things.; September 28, 2025 at 11:28 PM
카지노사이트 카지노사이트 추천 said...: This is an interesting way of explaining that most bloggers don’t understand that even when writing a blog, they need to consider this fact. And after reading your post, I feel like you’re doing that on your blog. Would you like to visit my blog later? Check out this!; September 29, 2025 at 2:26 AM
Nikhil said...: Learn Django online to understand modern web application development.It focuses on application-based learning.This learn Django online supports beginners.It is helpful.; January 21, 2026 at 10:42 AM
vr said...: Learn UI/UX design with Online IT Guru. Create user-friendly interfaces using wireframing, prototyping, usability testing, and design tools. ui ux design classes; January 22, 2026 at 6:58 AM
바카라사이트 모음 said...: Thanks for this nice article; January 30, 2026 at 10:18 AM
vr said...: Great share! This aws & devops course
course explains DevOps concepts clearly while offering hands-on labs for real-world pipeline implementation.; February 21, 2026 at 5:11 AM
Nikhil said...: Python online course provides flexible and structured learning sessions. It covers beginner to advanced topics step-by-step. This python online course enhances practical coding ability. Learners work on live assignments. File handling is included. Database integration is covered. API basics are explained. It supports career growth.; February 21, 2026 at 10:49 AM
vr said...: DevOps online training simplifies complex workflows. Learn devops training
CI/CD, containerization, infrastructure as code, and monitoring with expert guidance and practical exercises.; February 22, 2026 at 1:47 AM
ab initio course said...: Ab initio online training
ab initio online training offers the flexibility that modern learners really need. Being able to access high-quality learning materials from anywhere makes a big difference for working professionals. The interactive sessions and hands-on exercises enhance understanding effectively. It’s definitely a convenient way to gain expertise without disrupting your daily schedule.; March 23, 2026 at 7:20 AM
vr said...: Great insights! A devops training online
helps learners master CI/CD, automation, and cloud tools. Practical training ensures they gain job-ready skills and can handle real-world IT challenges efficiently.; March 30, 2026 at 12:29 PM
ab initio course said...: Salesforce Development Course
Salesforce development course is great for developers in India. This salesforce developer course india looks detailed. Covers coding basics. Good learning path.; March 31, 2026 at 11:40 AM
ab initio course said...: Salesforce Admin Course
Salesforce admin course is in demand in Hyderabad. This salesforce admin hyderabad looks informative. Easy to understand. Good for beginners. Worth exploring.; April 1, 2026 at 1:00 AM
ab initio course said...: Tableau Training
I’m searching for a good Tableau training program. Found this tableau training. Does it cover real-time projects?; April 6, 2026 at 11:43 AM
vr said...: Insightful read! Our data modeling class
combines theory and practical exercises in database design, normalization, and data analytics to boost career opportunities in IT.; April 7, 2026 at 10:24 AM
anjani02 said...: This blog is really educational and useful. I found the examples very practical and relatable. Great job on this post.servicenow classes; May 8, 2026 at 7:23 AM

The Database Programmer

Sunday, December 19, 2010

User-Submitted Analysis Topic: Email

Brief Introduction to User-Submitted Schemas

Brief Description and Starting Point

The Schema As-Is

Email as A Property of Contact

Some Specifics on the Email Addresses Table

Before We Get To Transactions

Changing Emails And Transactions

First Final Suggestion

Second Final Suggestion

Conclusion

31 comments:

Pages

In the Top 200 Tech Blogs!

Labels

Database Programmer Links

External Links

Blog Archive

Statcounter