Normalization is a process of organizing the data in a relational database to reduce redundancy and improve data consistency. The goal of normalization is to eliminate data redundancy by breaking down a large table into smaller, more efficient tables that are easier to maintain.
Normalization is a critical part of database design and is essential for ensuring data consistency and accuracy. In this tutorial, we will discuss the process of normalization and the various normal forms that are used to ensure that a database is well-organized.
The normalization process involves several steps, each of which results in a table that is more efficient and easier to maintain than the previous table. The steps of the normalization process are as follows:
First Normal Form (1NF)
First Normal Form (1NF) is the first step in the normalization process of a relational database. It is the most basic level of normalization, and it requires that a table should not contain repeating groups of data. This means that each table cell should hold only one value and should not contain multiple values separated by commas or any other delimiter.
To convert a table to 1NF, we need to follow these rules:
- Each table cell should contain a single value.
- Each column in a table should have a unique name.
- Each row in a table should be unique.
- The order in which the data is stored in the table does not matter.
Let’s consider an example of a table that is not in 1NF:
Student ID | Name | Courses
———–+——+————————
1 | John | Math, Science, English
2 | Jane | History, Math
Here, the “Courses” column contains multiple values separated by commas. To convert this table to 1NF, we need to split the “Courses” column into separate rows, like this:
Student ID | Name | Course
———– +—— +———
1 | John | Math
1 | John | Science
1 | John | English
2 | Jane | History
2 | Jane | Math
Now, each table cell contains a single value, and each row in the table is unique. This table is in 1NF.
Second Normal Form (2NF)
Second Normal Form (2NF) is the second step in the normalization process of a relational database. It builds on the first normal form (1NF) by removing redundancy and dependencies on partial keys.
A table is in 2NF if it meets the following two criteria:
- It is in 1NF.
- It does not contain partial dependencies.
Partial dependencies occur when non-key attributes are dependent on only a portion of the primary key. In other words, when there is a composite primary key consisting of more than one attribute, and a non-key attribute is dependent on only one of the attributes of the composite key.
To convert a table to 2NF, we need to split the table into two or more tables, each with its own primary key, and eliminate partial dependencies. Here’s an example:
Order # | Item # | Item Name | Quantity | Price
——-+——–+———–+———-+——-
1001 | 1 | Widget | 10 | $2.00
1001 | 2 | Gizmo | 5 | $1.50
1002 | 1 | Widget | 3 | $2.00
In this table, the primary key is a composite key made up of Order # and Item #. However, the non-key attribute “Item Name” is dependent only on Item # and not on Order #. This is a partial dependency, and it violates the second normal form.
To eliminate the partial dependency, we can split the table into two tables: one for orders and one for items, as follows:
Orders:
Order # | Date
——-+———–
1001 | 2022-01-01
1002 | 2022-01-02
Items:
Item # | Item Name | Price
——-+———–+——-
1 | Widget | $2.00
2 | Gizmo | $1.50
Order Items:
Order # | Item # | Quantity
——-+——–+———-
1001 | 1 | 10
1001 | 2 | 5
1002 | 1 | 3
Now, the “Item Name” and “Price” attributes are moved to a separate table, and the partial dependency is eliminated. The “Order #”, “Item #”, and “Quantity” attributes are moved to a new table called “Order Items”. This new table has a composite primary key made up of both “Order #” and “Item #”. This way, the second normal form is satisfied.
Third Normal Form (3NF):
Third Normal Form (3NF) is the third step in the normalization process of a relational database. It builds on the first normal form (1NF) and the second normal form (2NF) by removing transitive dependencies.
A table is in 3NF if it meets the following two criteria:
- It is in 2NF.
- It does not contain transitive dependencies.
Transitive dependencies occur when a non-key attribute is dependent on another non-key attribute rather than directly on the primary key. For example, consider the following table:
Student | Course | Professor | Professor Office
——–+——–+———–+—————-
John | Math | Smith | 101
John | English | Brown | 201
Sally | Math | Smith | 101
Sally | English | Brown | 201
In this table, the primary key is a composite key made up of both “Student” and “Course”. However, the non-key attribute “Professor Office” is dependent on “Professor” and not directly on the primary key. This is a transitive dependency, and it violates the third normal form.
To eliminate the transitive dependency, we can split the table into three tables: one for students, one for courses, and one for professors, as follows:
Students:
ID | Name
—–+—–
1 | John
2 | Sally
Courses:
ID | Course
—–+——-
1 | Math
2 | English
Professors:
ID | Name | Office
—–+——–+——-
1 | Smith | 101
2 | Brown | 201
Enrollment:
Student ID | Course ID | Professor ID
———–+———-+————-
1 | 1 | 1
1 | 2 | 2
2 | 1 | 1
2 | 2 | 2
Now, the “Professor Office” attribute is moved to a separate table called “Professors”. The “Student”, “Course”, and “Professor” attributes are moved to a new table called “Enrollment”. This new table has a composite primary key made up of “Student ID” and “Course ID”. The “Professor ID” attribute is a foreign key that refers to the “ID” attribute in the “Professors” table. This way, the third normal form is satisfied.
Boyce-Codd Normal Form (BCNF):
A table is in BCNF if it meets the following two criteria:
- It is in 3NF.
- Every determinant is a candidate key.
In BCNF, a determinant is a column or a set of columns that uniquely determines the value of another column. A candidate key is a minimal set of columns that can uniquely identify each row in a table.
To understand the concept of BCNF, consider the following example:
Salespeople (Salesperson ID, Region, Product, Sales)
In this table, the primary key is a composite key made up of “Salesperson ID”, “Region”, and “Product”. However, the “Sales” attribute is dependent only on the “Salesperson ID” and “Product” attributes, and not on the “Region” attribute. This is a violation of the BCNF.
To bring the table into BCNF, we need to split it into two tables, as follows:
Salespeople (Salesperson ID, Region, Product)
Sales (Salesperson ID, Product, Sales)
In this example, we have removed the “Sales” attribute from the “Salespeople” table and placed it in a new table called “Sales”. The “Salesperson ID” and “Product” attributes now form the primary key of the “Sales” table, and they are both candidate keys. The “Region” attribute is no longer part of the primary key, which eliminates the dependency issue.
In summary, BCNF is a higher level of normalization that aims to reduce redundancy and eliminate anomalies in a relational database design. It ensures that every determinant is a candidate key, which means that there are no redundant or unnecessary dependencies in the table.
Fourth Normal Form (4NF):
A table is in 4NF if it meets the following criteria:
- It is in BCNF.
- It does not contain any multivalued dependencies.
Multivalued dependencies occur when a table contains multiple independent sets of values for the same set of attributes. This can result in redundancy and inconsistencies in the database.
To understand the concept of 4NF, consider the following example:
Employee Projects (Employee ID, Project ID, Skills)
In this table, an employee can work on multiple projects and possess different skills for each project. This creates a multivalued dependency between the “Employee ID” and “Skills” attributes, as shown below:
To bring the table into 4NF, we need to split it into two tables, as follows:
Employee Skills (Employee ID, Project ID, Skills)
In this example, we have removed the multivalued dependency by creating a new table called “Employee Skills”, which contains the “Employee ID”, “Project ID”, and “Skills” attributes. The “Employee ID” and “Project ID” attributes are both part of the primary key of the “Employee Skills” table.
In summary, 4NF is a higher level of normalization that aims to eliminate multivalued dependencies in a relational database design. It ensures that the tables are free of any redundancy and inconsistency, resulting in a more efficient and effective database design.
Fifth normal form(5NF):
Fifth Normal Form (5NF), also known as Project-Join Normal Form (PJNF), is a level of normalization that deals with cases where a relation contains multiple sets of independent and overlapping multi-valued dependencies.
A table is in 5NF if it meets the following criteria:
- It is in 4NF.
- It does not contain any non-trivial join dependencies.
A non-trivial join dependency occurs when three or more tables are joined together based on the values of their primary keys. The dependency is considered non-trivial when the tables are not functionally dependent on each other.
To understand the concept of 5NF, consider the following example:
Students (student_id, course_id, professor_name, textbook, textbook_author)
In this table, a student can take multiple courses, and each course can have multiple textbooks and authors. There is a multi-valued dependency between the “course_id” and “textbook” attributes, as well as between “textbook” and “textbook_author” attributes. There is also a join dependency between “course_id” and “professor_name”, meaning that the professors assigned to each course depend on the course itself.
To bring the table into 5NF, we need to split it into four tables, as follows:
Students (student_id, course_id, professor_id)
Courses (course_id, professor_name)
Textbooks (course_id, textbook)
Authors (textbook, textbook_author)
In this example, we have removed the multi-valued dependencies and join dependencies by creating three additional tables: “Courses”, “Textbooks”, and “Authors”. The “Courses” table contains the “course_id” and “professor_name” attributes, and the “Textbooks” table contains the “course_id” and “textbook” attributes. Finally, the “Authors” table contains the “textbook” and “textbook_author” attributes.
In summary, 5NF is a higher level of normalization that ensures that a database is free from any anomalies and redundancies, resulting in a more efficient and effective database design. However, it should be noted that achieving 5NF can sometimes result in an overly complex database structure, which can be difficult to maintain and query efficiently. Therefore, a trade-off between normalization and practicality should be considered when designing a database.
Advantages of Normalization
Normalization provides several benefits to the database design and management process. Some of the key advantages of normalization are:
- Data Consistency: Normalization helps in eliminating data redundancy, which in turn helps to maintain consistency in the data. With normalization, each data item is stored only once, thereby reducing the risk of inconsistencies arising due to multiple instances of the same data.
- Improved Data Integrity: By eliminating redundancy and ensuring that each data item is stored in a single location, normalization ensures that data integrity is maintained. This helps to avoid inconsistencies that could arise if different instances of the same data have different values.
- Efficient Data Retrieval: Normalization simplifies the database structure and ensures that data is organized in a logical and efficient manner. This helps to improve data retrieval times, as queries can be optimized to access only the data that is needed.
- Flexibility: Normalization provides a flexible design that can be easily modified to accommodate changing data requirements. This allows the database to evolve over time as new requirements emerge, without requiring major restructuring of the entire database.
- Scalability: Normalization provides a solid foundation for scaling the database as the amount of data and users grows. With a normalized database, additional data can be easily added without compromising the integrity of the existing data.
- Better Database Design: Normalization encourages good database design practices, such as breaking down data into smaller, more manageable chunks. This helps to ensure that the database is well-designed from the start, which reduces the risk of issues arising further down the line.
Disadvantages of Normalization
- Complexity: Normalization can result in a more complex database structure, which can be difficult to understand and manage, especially for those who are not familiar with the normalization process.
- Increased Storage Requirements: Normalization can result in more tables being created, which can increase the storage requirements for the database. This can result in slower data retrieval times, especially if the database is large.
- Performance Overhead: Because normalization requires more tables and relationships between those tables, the performance of the database can be impacted. This can result in slower query execution times, especially if the database is large and complex.
- Cost: Normalization can increase the cost of database design and development, as it requires more time and effort to create a normalized database. Additionally, more complex databases may require more resources and hardware to support, which can increase costs further.
- Data Duplication: While normalization is designed to eliminate data redundancy, in some cases, it can actually result in more data being duplicated across tables. This can lead to data inconsistencies and errors if not managed properly.
- Difficulty in Querying: Normalization can make it more difficult to write complex queries that involve data from multiple tables. This can make it harder for developers to retrieve the data they need from the database.