Database Normalization

February 17th 2019 05:31:54 PM


Database normalization is a technique for organizing the data in a database. It is a systematic approach of decomposing tables to eliminate data redundancy (repetition) and undesirable characteristics such as insertion, update and deletion anomalies. It is a multi-step process that puts data into tabular form and removes duplicated data from the related tables.

Normalization is used mainly for two purposes:

  • Eliminating redundant (repeated) data.
  • Ensuring that data dependencies make sense, i.e. that data is logically stored.

Problems Without Normalization

If a table is not properly normalized and has data redundancy, it will not only take up extra storage space but will also make the database difficult to handle and update without data loss. Insertion, update and deletion anomalies are very frequent if a database is not normalized. To understand these anomalies, let us take the example of a Student table.

student_id roll_no name department dept_head office_phone
1 101 Akon CSE Mr. X 53337
2 102 Bkon CSE Mr. X 53337

In the table above, we have data for 2 students. As we can see, the values of the fields department, dept_head and office_phone are repeated for students who are in the same department of the college. This is data redundancy.

Insertion Anomaly

Suppose a student is newly admitted but has not yet been assigned a department. The student's data cannot be inserted, or else we have to insert the department information as NULL.

Also, if we insert data for 100 students of the same department, the department information will be repeated for all 100 students.

These scenarios are nothing but insertion anomalies.
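The first scenario can be sketched with Python's built-in sqlite3 module. This is only an illustration: the NOT NULL constraints and the new student "Ckon" are assumptions made for the example and are not part of the original table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: the department columns are declared NOT NULL.
conn.execute(
    "CREATE TABLE student ("
    "student_id INTEGER PRIMARY KEY, roll_no INTEGER, name TEXT, "
    "department TEXT NOT NULL, dept_head TEXT NOT NULL, "
    "office_phone TEXT NOT NULL)"
)

try:
    # A hypothetical new student who has no department yet.
    conn.execute(
        "INSERT INTO student (student_id, roll_no, name) "
        "VALUES (3, 103, 'Ckon')"
    )
    inserted = True
except sqlite3.IntegrityError:
    # The row is rejected because the department data is missing.
    inserted = False

print(inserted)
```

Without the NOT NULL constraints the insert would succeed, but the row would carry NULL department values, which is the other half of the anomaly.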

Update Anomaly

What if Mr. X leaves the college, or is no longer the dept_head of the computer science department? In that case all the student records will have to be updated, and if by mistake we miss any record, it will lead to data inconsistency. This is an update anomaly.
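A minimal sketch of this inconsistency, using Python's built-in sqlite3 module and the two rows of the Student table. The replacement head "Mr. Y" is a hypothetical name introduced for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student (student_id INTEGER PRIMARY KEY, roll_no INTEGER, "
    "name TEXT, department TEXT, dept_head TEXT, office_phone TEXT)"
)
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 101, "Akon", "CSE", "Mr. X", "53337"),
        (2, 102, "Bkon", "CSE", "Mr. X", "53337"),
    ],
)

# A careless update that touches only one of the two CSE rows.
# "Mr. Y" is a hypothetical new department head.
conn.execute("UPDATE student SET dept_head = 'Mr. Y' WHERE student_id = 1")

# The same department now reports two different heads.
heads = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT dept_head FROM student "
        "WHERE department = 'CSE' ORDER BY dept_head"
    )
]
print(heads)
```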

Deletion Anomaly

In our Student table, two different kinds of information are kept together: student information and department information. Hence, if student records are deleted at the end of the academic year, we will also lose the department information. This is a deletion anomaly.
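The loss of department information can be sketched the same way with sqlite3. The schema below is an assumed rendering of the Student table as a single SQL table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student (student_id INTEGER PRIMARY KEY, roll_no INTEGER, "
    "name TEXT, department TEXT, dept_head TEXT, office_phone TEXT)"
)
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 101, "Akon", "CSE", "Mr. X", "53337"),
        (2, 102, "Bkon", "CSE", "Mr. X", "53337"),
    ],
)

query = "SELECT DISTINCT dept_head FROM student WHERE department = 'CSE'"

# Before the deletion, the department head can still be looked up.
head_before = conn.execute(query).fetchall()

# End of the academic year: all student records are removed.
conn.execute("DELETE FROM student")

# The department head is gone together with the students.
head_after = conn.execute(query).fetchall()
print(head_before, head_after)
```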

Normalization Rule

Normalization rules are divided into the following normal forms:

  1. First Normal Form
  2. Second Normal Form
  3. Third Normal Form
  4. Boyce-Codd Normal Form (BCNF)
  5. Fourth Normal Form

First Normal Form (1NF)

For a table to be in First Normal Form, it should follow these 4 rules:

  1. All the columns in a table should have unique names.
  2. Values stored in a column should be of the same datatype.
  3. Each column should hold only a single (atomic) value per row, e.g. either CSE or Math, not CSE,Math.
  4. The order in which data is stored does not matter.
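As an illustration of the atomic-value rule, the sketch below splits a multi-valued department column into one row per value. The sample rows and the helper name to_atomic are assumptions made for the example.

```python
# Hypothetical rows where one column holds several departments at once.
rows = [
    (1, "Akon", "CSE,Math"),
    (2, "Bkon", "CSE"),
]


def to_atomic(rows):
    """Return one row per (student, department) pair."""
    atomic = []
    for student_id, name, departments in rows:
        for department in departments.split(","):
            atomic.append((student_id, name, department.strip()))
    return atomic


atomic_rows = to_atomic(rows)
for row in atomic_rows:
    print(row)
```

After the split, every department value is a single item, at the price of repeating the student's name, which the later normal forms deal with.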

Second Normal Form (2NF)

For a table to be in Second Normal Form, it should follow these 2 rules:

  1. It should be in First Normal Form.
  2. It should not have any Partial Dependency.

What is Dependency

Consider the following Student table:

student_id roll_no name department address
1 101 Akon CSE xyz
2 102 Akon IT xxxx

In this table, student_id is the primary key and is unique for every row, hence we can use student_id to fetch any row of data from this table.

Even in a case where student names are the same, if we know the student_id we can easily fetch the correct record.

Hence we can say that a primary key for a table is the column (here student_id) or group of columns (a composite key) that uniquely identifies each record in the table.

I can get the department of the student with student_id 1. Similarly, I can get the name of the student with student_id 1 or 2. So all I need is student_id; every other column depends on it and can be fetched using it.

This is Dependency, and we also call it Functional Dependency.
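The idea can be sketched as a small check over sample rows. The helper name holds is an assumption made for the example. Note that sample data can only refute a functional dependency; a dependency that holds in two rows is not thereby proven for the table in general.

```python
def holds(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = tuple(row[c] for c in rhs)
        # Remember the first value seen for this key and compare later ones.
        if seen.setdefault(key, value) != value:
            return False
    return True


students = [
    {"student_id": 1, "roll_no": 101, "name": "Akon",
     "department": "CSE", "address": "xyz"},
    {"student_id": 2, "roll_no": 102, "name": "Akon",
     "department": "IT", "address": "xxxx"},
]

# student_id determines every other column in the sample.
by_id = holds(students, ["student_id"], ["name", "department", "address"])
# name does not determine department: both students are called Akon.
by_name = holds(students, ["name"], ["department"])
print(by_id, by_name)
```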

What is Partial Dependency

Consider the Subject and Score tables:

subject_id subject_name
1 Java
2 PHP
3 C++

In the Subject table, subject_id is the primary key.

score_id student_id subject_id marks teacher_name
1 1 1 70 Java Teacher
2 1 2 80 PHP Teacher
3 2 3 60 C++ Teacher

In the Score table, student_id and subject_id together make a candidate key, which can be used as the primary key.

We cannot get the marks of the student with student_id 1, because we don't know for which subject. Similarly, if I give you subject_id, you would not know for which student. Hence we need student_id + subject_id to uniquely identify any row.

Now if you look at the Score table, we have a column named teacher_name which depends only on the subject: for Java it's Java Teacher, for C++ it's C++ Teacher, and so on.

As we just discussed, the primary key for this table is a composition of two columns, student_id and subject_id, but the teacher's name depends only on the subject, hence on subject_id, and has nothing to do with student_id.

This is Partial Dependency, where an attribute/field in a table depends on only a part of the primary key and not on the whole key.
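A rough check of this on the Score rows, as a sketch: the helper holds is a hypothetical name, and a check on three sample rows only illustrates the dependency rather than proving it.

```python
def holds(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = tuple(row[c] for c in rhs)
        if seen.setdefault(key, value) != value:
            return False
    return True


scores = [
    {"score_id": 1, "student_id": 1, "subject_id": 1,
     "marks": 70, "teacher_name": "Java Teacher"},
    {"score_id": 2, "student_id": 1, "subject_id": 2,
     "marks": 80, "teacher_name": "PHP Teacher"},
    {"score_id": 3, "student_id": 2, "subject_id": 3,
     "marks": 60, "teacher_name": "C++ Teacher"},
]

# teacher_name is fixed by subject_id alone, one part of the composite key.
teacher_by_subject = holds(scores, ["subject_id"], ["teacher_name"])
# student_id alone does not fix teacher_name: student 1 has two teachers.
teacher_by_student = holds(scores, ["student_id"], ["teacher_name"])
print(teacher_by_subject, teacher_by_student)
```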

Remove Partial Dependency

To remove the partial dependency, move teacher_name out of the Score table and into the Subject table, where it depends on the whole primary key, subject_id:

subject_id subject_name teacher_name
1 Java Java Teacher
2 PHP PHP Teacher
3 C++ C++ Teacher

score_id student_id subject_id marks
1 1 1 70
2 1 2 80
3 2 3 60
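The two tables can be sketched as an SQL schema using Python's sqlite3 module. The constraints (NOT NULL, UNIQUE, REFERENCES) are assumptions added for the example; the data is taken from the tables above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE subject (
        subject_id   INTEGER PRIMARY KEY,
        subject_name TEXT NOT NULL,
        teacher_name TEXT NOT NULL
    );
    CREATE TABLE score (
        score_id   INTEGER PRIMARY KEY,
        student_id INTEGER NOT NULL,
        subject_id INTEGER NOT NULL REFERENCES subject(subject_id),
        marks      INTEGER NOT NULL,
        UNIQUE (student_id, subject_id)
    );
    INSERT INTO subject VALUES
        (1, 'Java', 'Java Teacher'),
        (2, 'PHP', 'PHP Teacher'),
        (3, 'C++', 'C++ Teacher');
    INSERT INTO score VALUES
        (1, 1, 1, 70),
        (2, 1, 2, 80),
        (3, 2, 3, 60);
    """
)

# The teacher is looked up through the subject instead of being repeated
# in every score row.
rows = conn.execute(
    "SELECT score.student_id, subject.subject_name, score.marks, "
    "subject.teacher_name "
    "FROM score JOIN subject ON subject.subject_id = score.subject_id "
    "ORDER BY score.score_id"
).fetchall()
for row in rows:
    print(row)
```

A join brings the teacher name back whenever it is needed, so nothing is lost by moving the column.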

Introduction to Database Keys

Keys are a very important part of the relational database model. They are used to establish and identify relationships between tables and also to uniquely identify any record or row of data inside a table.

A key can be a single attribute/field or a group of attributes/fields, where the combination acts as a key.

student_id name phone age
1 xx 01822222222 22
2 xy 01522222222 29
3 xyz 01922222222 25

Super Key

Super Key is defined as a set of attributes within a table that can uniquely identify each record within that table. Every super key is a superset of some candidate key.

In the table defined above, super keys include student_id, (student_id, name), phone, etc. Note that (student_id, name) is a super key but not a candidate key, because student_id alone is already enough to identify a record.

Candidate Key

Candidate keys are defined as the minimal sets of fields which can uniquely identify each record in a table.

A candidate key is an attribute/field or a set of attributes/fields that can act as a primary key for a table to uniquely identify each record in that table. There can be more than one candidate key.

In our example, student_id and phone are both candidate keys for the Student table.

Rules of Candidate Key

  • A candidate key can never be NULL or empty, and its value should be unique.
  • There can be more than one candidate key for a table.
  • A candidate key can be a combination of more than one column (attribute/field).
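The difference between super keys and candidate keys can be sketched with the standard attribute-closure computation. The functional dependencies listed in FDS are an assumption read off the text (student_id and phone each identify a student); the function names are hypothetical.

```python
from itertools import combinations

ATTRIBUTES = {"student_id", "name", "phone", "age"}

# Assumed functional dependencies for the Student table.
FDS = [
    ({"student_id"}, {"name", "phone", "age"}),
    ({"phone"}, {"student_id"}),
]


def closure(attrs, fds):
    """Return every attribute that can be derived from attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_super_key(attrs):
    """A super key determines all attributes of the table."""
    return closure(attrs, FDS) == ATTRIBUTES


def candidate_keys():
    """Return the minimal super keys, smallest sets first."""
    keys = []
    for size in range(1, len(ATTRIBUTES) + 1):
        for combo in combinations(sorted(ATTRIBUTES), size):
            combo = set(combo)
            # Skip any set that already contains a smaller key.
            if is_super_key(combo) and not any(k <= combo for k in keys):
                keys.append(combo)
    return keys


print(is_super_key({"student_id", "name"}))
print(candidate_keys())
```

Under these assumptions (student_id, name) is reported as a super key, but only student_id and phone come out as candidate keys.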

Primary Key

Primary key is the candidate key that is most appropriate to become the main key for a table. It is a key that can uniquely identify each record in a table.

For the Student table above, we can make the student_id column the primary key.

Composite Key

A key that consists of two or more attributes/fields that together uniquely identify any record in a table is called a Composite Key. The attributes/fields which together form the composite key are not keys independently or individually.

In the Score table, student_id and subject_id together form the primary key, hence it is a composite key.

Secondary or Alternate Key

The candidate keys which are not selected as the primary key are known as secondary keys or alternate keys.

Non-key Attributes

Non-key attributes are the attributes or fields of a table other than the candidate key attributes/fields.

Non-prime Attributes

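Non-prime attributes are attributes that are not part of any candidate key. In a table whose only candidate key is the primary key, these are simply the attributes other than the primary key attribute(s).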

Third Normal Form (3NF)

Suppose we have 3 tables: Student, Subject and Score.

student_id roll_no name department address
1 101 Akon CSE xyz
2 102 Akon IT xxxx

subject_id subject_name teacher_name
1 Java Java Teacher
2 PHP PHP Teacher
3 C++ C++ Teacher

score_id student_id subject_id marks
1 1 1 70
2 1 2 80
3 2 3 60

In the Score table, we need to store some more information, which is the exam name and total marks, so let's add 2 more columns to the Score table.

score_id student_id subject_id marks exam_name total_marks

Requirements for Third Normal Form

  1. It should be in Second Normal Form.
  2. It should not have any Transitive Dependency.

What is Transitive Dependency

With exam_name and total_marks added to our Score table, it stores more data now. The primary key for our Score table is a composite key, which means it's made up of two attributes or columns: student_id + subject_id.

Our new column exam_name depends on both student and subject. For example, a mechanical engineering student will have a Workshop exam but a computer science student won't. And for some subjects you have Practical exams and for some you don't. So we can say that exam_name is dependent on both student_id and subject_id.

The column total_marks, however, depends on exam_name, because the total score changes with the exam type. For example, practicals carry fewer marks while theory exams carry more.

But exam_name is just another column in the Score table. It is not a primary key or even a part of the primary key, and total_marks depends on it.

This is Transitive Dependency: a non-prime attribute depends on other non-prime attributes rather than on the prime attributes or primary key.

How to remove Transitive Dependency

Again the solution is very simple. Take the columns exam_name and total_marks out of the Score table, put them in an Exam table, and use the exam_id wherever required.

score_id student_id subject_id marks exam_id

exam_id exam_name total_marks
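A minimal sqlite3 sketch of the Score and Exam tables. The exam rows (Theory with 100 marks, Practical with 50), the exam_id values in the score rows and the later change to 120 are hypothetical values chosen for the example; the source gives only the column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE exam (
        exam_id     INTEGER PRIMARY KEY,
        exam_name   TEXT NOT NULL,
        total_marks INTEGER NOT NULL
    );
    CREATE TABLE score (
        score_id   INTEGER PRIMARY KEY,
        student_id INTEGER NOT NULL,
        subject_id INTEGER NOT NULL,
        marks      INTEGER NOT NULL,
        exam_id    INTEGER NOT NULL REFERENCES exam(exam_id)
    );
    -- Hypothetical exam types and totals.
    INSERT INTO exam VALUES (1, 'Theory', 100), (2, 'Practical', 50);
    INSERT INTO score VALUES
        (1, 1, 1, 70, 1),
        (2, 1, 2, 80, 1),
        (3, 2, 3, 60, 1);
    """
)

# total_marks is stored once per exam, so changing it touches one row.
conn.execute("UPDATE exam SET total_marks = 120 WHERE exam_name = 'Theory'")

totals = conn.execute(
    "SELECT DISTINCT exam.total_marks "
    "FROM score JOIN exam ON exam.exam_id = score.exam_id"
).fetchall()
print(totals)
```

Every score row sees the new total through the join, so the rows cannot disagree with each other.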

Advantage of removing Transitive Dependency

  1. Amount of data duplication is reduced.
  2. Data integrity is achieved.

Boyce-Codd Normal Form(BCNF)

Boyce-Codd Normal Form or BCNF is an extension of the Third Normal Form, and is also known as 3.5 Normal Form.

Rules for BCNF

  1. It should be in the Third Normal Form.
  2. For any dependency A -> B, A should be a super key. In particular, this means that for a dependency A -> B, A cannot be a non-prime attribute if B is a prime attribute.

student_id subject_name teacher_name
1 Java Java Teacher
2 PHP PHP Teacher
3 C++ C++ Teacher
4 Java Java Teacher2
1 C++ C++ Teacher

In the table above:

  1. One student can enrol for multiple subjects. For example, the student with student_id 1 has opted for the subjects Java and C++.
  2. For each subject, a teacher is assigned to the student.
  3. There can be multiple teachers teaching one subject, as we have for Java.

In the table above, student_id and subject_name together form the primary key, because using student_id and subject_name we can find all the columns of the table.

There is one more dependency here, between subject and teacher: the subject depends on the teacher name, because each teacher teaches only one subject.

This table satisfies the 1st Normal form because all the values are atomic, column names are unique and all the values stored in a particular column are of same domain.

This table also satisfies the 2nd Normal Form as there is no Partial Dependency.

And, there is no Transitive Dependency, hence the table also satisfies the 3rd Normal Form.

But this table is not in Boyce-Codd Normal Form.

Why this table is not in BCNF

In the table above, student_id and subject_name form the primary key, which means the subject_name column is a prime attribute.

But there is one more dependency, teacher_name -> subject_name.

And while subject_name is a prime attribute, teacher_name is a non-prime attribute and is not a super key, which is not allowed by BCNF.
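The BCNF rule can be checked mechanically with attribute closure, as sketched below. The two functional dependencies are taken from the discussion above; the function names are hypothetical.

```python
ATTRIBUTES = {"student_id", "subject_name", "teacher_name"}

# Dependencies described in the text.
FDS = [
    ({"student_id", "subject_name"}, {"teacher_name"}),
    ({"teacher_name"}, {"subject_name"}),
]


def closure(attrs, fds):
    """Return every attribute that can be derived from attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def bcnf_violations(attributes, fds):
    """Return non-trivial dependencies whose left side is not a super key."""
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not rhs <= lhs and closure(lhs, fds) != attributes
    ]


violations = bcnf_violations(ATTRIBUTES, FDS)
print(violations)
```

Only teacher_name -> subject_name is reported, because teacher_name on its own cannot determine student_id and is therefore not a super key.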

How to make BCNF

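To make the table satisfy BCNF, we decompose it into two tables: one that maps students to teachers, and one that stores each teacher together with the subject taught. Below we have the structure for both tables.

student_id teacher_id
1 1
2 2

teacher_id teacher_name subject_name
1 Java Teacher Java
2 PHP Teacher PHP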
Fourth Normal Form (4NF)

Fourth Normal Form comes into the picture when a Multi-valued Dependency occurs in any relation. In this tutorial we will learn about Multi-valued Dependency, how to remove it and how to make any table satisfy the Fourth Normal Form.

Rules for 4th Normal Form

  1. It should be in the Boyce-Codd Normal Form
  2. And, the table should not have any Multi-valued Dependency.

What is Multi-valued Dependency

A table is said to have a multi-valued dependency if the following conditions are true:

  1. For a dependency A -> B, if multiple values of B exist for a single value of A, then the table may have a multi-valued dependency.
  2. The table should have at least 3 columns for it to have a multi-valued dependency.
  3. For a relation R(A,B,C), if there is a multi-valued dependency between A and B, then B and C should be independent of each other.

If all these conditions are true for a relation (table), it is said to have a multi-valued dependency.

student_id course hobby
1 Science Cricket
1 Maths Hockey
2 Php Hockey
2 C# Cricket

As you can see in the table above, the student with student_id 1 has opted for two courses, Science and Maths, and has two hobbies, Cricket and Hockey.

The two records for the student with student_id 1 give rise to two more records, as shown below: because the student has two hobbies, both hobbies have to be specified along with each of the courses.

student_id course hobby
1 Science Cricket
1 Maths Hockey
1 Science Hockey
1 Maths Cricket

And, in the table above, there is no relationship between the columns course and hobby. They are independent of each other.

So there is a multi-valued dependency, which leads to unnecessary repetition of data and other anomalies as well.
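Why the two extra records are needed can be sketched with a small check: if course and hobby are independent, the rows for each student must contain every course paired with every hobby. The function name mvd_holds is an assumption made for the example.

```python
from itertools import product


def mvd_holds(rows):
    """Check student_id ->> course for rows of (student_id, course, hobby)."""
    by_student = {}
    for student_id, course, hobby in rows:
        by_student.setdefault(student_id, set()).add((course, hobby))
    for pairs in by_student.values():
        courses = {course for course, _ in pairs}
        hobbies = {hobby for _, hobby in pairs}
        # Independence requires the full cross product to be stored.
        if pairs != set(product(courses, hobbies)):
            return False
    return True


two_rows = [
    (1, "Science", "Cricket"),
    (1, "Maths", "Hockey"),
]
four_rows = two_rows + [
    (1, "Science", "Hockey"),
    (1, "Maths", "Cricket"),
]

print(mvd_holds(two_rows), mvd_holds(four_rows))
```

With only two rows the table wrongly suggests that Science goes with Cricket and Maths with Hockey; the four-row version removes that suggestion at the cost of repetition.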

How to satisfy 4th Normal Form

To make the above relation satisfy the 4th normal form, we can decompose the table into 2 tables.

student_id course
1 Science
1 Maths
2 Php
2 C#

student_id hobby
1 Cricket
1 Hockey
2 Cricket
2 Hockey

Now these relations satisfy the Fourth Normal Form.
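The decomposition can be sketched as two projections of the original rows, followed by a join on student_id. The variable names are assumptions made for the example.

```python
rows = [
    (1, "Science", "Cricket"),
    (1, "Maths", "Hockey"),
    (2, "Php", "Hockey"),
    (2, "C#", "Cricket"),
]

# Project the single table onto the two independent facts.
student_course = sorted({(s, course) for s, course, _ in rows})
student_hobby = sorted({(s, hobby) for s, _, hobby in rows})

# Joining the two tables on student_id pairs every course of a student
# with every hobby of that student.
rejoined = sorted(
    (s, course, hobby)
    for s, course in student_course
    for s2, hobby in student_hobby
    if s == s2
)

print(len(student_course), len(student_hobby), len(rejoined))
```

Each fact is stored once in its own table, while the join still yields every course and hobby combination that a single table would have had to store explicitly.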
