Wednesday, March 21, 2012

Database Normalization

For those IT professionals that know me know that I have a passion for data and data architecture.  I also do many other IT things (for example I have a small web development company).  One of my things that drives my passion for data is that it is the basis that drives efficiencies in almost all areas of IT and business.  I find it very frustrating when we pay a bunch of money for an application and install it and discover that the data model is so poorly designed that the application performs poorly or it is difficult to get good info out of it.  For this reason I put a lot of stress on data models on ever IT project that I work on.

The processes where by a person would go through tuning a data model is called normalization and denormalization.

Normalization is the process where you organize data using different steps or levels.

First Normal Form (1NF) contains the following steps:
  • Organize data set(s) into data sets specific content (e.g. data relative to persons and then data related to address).
  • Identify each record in each data set with a unique ID (primary key).
  • Remove duplicate columns with the same information
Second Normal Form (2NF) contains the following steps:
  • Meets 1NF criteria.
  • Remove subsets of data that create multiple rows (e.g. an employee is in multiple divisions, create a table that links the employee to the division with foreign keys).
  • Create a a table and foreign key relationships for data that is duplicated across data sets (e.g. the account, contact and order data sets all have address fields.  The address info should be put into a table and then a foreign key to that table should then be stored in the account, contact and order data sets).
Third Normal Form (3NF) contains the following steps:
  • Meets 2NF criteria.
  • Any data element that could be repeated in multiple records for a single field should be put into its own table and then referenced via foreign key (e.g. The university name in a list of national students should be put into a university data set and then a reference should be used in the national student list).
Boyce-Codd Normal Form (BCNF) contains the following steps:
  • Meets 3NF criteria.
  • Create separate tables for data that is dependent upon a value other than the primary key (e.g. a table with address information might have street address, city, state and zip.  The city and state are dependent on the zip code which should then be the key to another table that contains the city and state info to reduce duplication of data). 
There are other Normal forms though most processes do not typically go beyond 3NF.

Denormalization is the process of violating Normal Forms for the purpose of performance.  For those not familiar with the concept of "Query Cost" there is a performance hit for each table joined together in a query.  For this reason there are reasons to violate these forms to improve query performance.  One common example is to include a full address in the address table instead of just having the street address and zip code and another table with the zip code to state and city associations.

I hope that others find this information useful.

No comments:

Post a Comment