Database normalization is a process, Of arranging your tables, columns and relationships between tables; In order to avoid data redundancy and to attain data integrity Normalization: to put correct data in correct tables In order to avoid CUD (Create, update and delete) anomalies (issues) So let us discuss in detail now as to why should we do database normalization Let us write a table which is in de-normalized form So this is my FilmAgents table it has data such as agent_id, agent_name the agents are from certain agencies the agencies are located in LA, NY, LA and Mumbai and they have customers agent A1(Ari Gold) has two customers; Bread Pith and Fakira Tori Bronze has Nedd Stark, CNolan and Solman Kan and Gori Diamond (agent) has two customers; Rajini Sir and Saruk Kan This table is in de-normalized form. Now, why do I say that? because it has certain issues the same issues to avoid which we do database normalization There is data redundancy, it is clearly visible you can see that the agency name has been repeated and because the agency name is repeated so is the agency office location and office contact no. as I said before, we need to avoid CUD (create, update, delete) anomalies let us start with CREATE (or INSERT) anomaly that exists here BTW, AgentId column is the primary key here. I will denote it by underlining the column name so AgentId is my primary key in this FilmAgents denormalized table let us suppose that there is a new agency in the market called as Free Folks (watch GoT) this agency is based in Los Angeles (where else) and they have already acquired a client;
Mr. Morgage Freeman the problem is, as this is a new agency, they don't have an agent yet so we can't create a database entry in this table (FilmAgents) right now for this new agency because they don't have an agent So AgentName value will be null, which is still okay but the primary key column will also be null, as there is no AgentId because there is no agent so we can't insert this data (new agency info) into this table now that's a problem, the agency is there but we can't store that data because don't have an agent so this is an example of Insert (Create) anomaly Now let's look at an example of Update anomaly let's say that "New Cine" agency has moved out from LA to San Diego and hence they have to change their phone number so this (111-222-333) has been changed to 222-444-555 and this will be San Diego, this has to be updated but the problem is that because this data is repeated in multiple rows, we are only looking at 4-5 entries right now (thanks to my white-board size) but imagine if this table had 100s of records, there will be at least 20-30 entries of data for New Cine agency so we will have to make sure that, we update these new details (contact number and office location) in all of those 20-30 rows if, it happens that this new information gets updated in few rows and not in remaining rows then in that case, the data will be in inconsistent state So this is called as UPDATE anomaly and it is arising because you have data redundancy (duplication) Now let's look at an example of DELETE anomaly Now we have this agent, Gori Diamond from the FireFolks Pvt (agency) in Mumbai What if tomorrow, Gori Diamond is no more with us (permanent retirement) due to some unforeseeable circumstances so surely, AgentId being the primary key, we will have to remove this entry of (RIP) Gori Diamond from the table but the problem is, if we remove that entry, the information about FireFolks Pvt Ltd (the agency) will also get removed from the table the agency info will also be gone and if we assume that there was only one agent (RIP) Gori Diamond for this agency, then we won't have the agency information at all What about the clients? Mr Rajini, Mr Saruk Kan? so we can't possibly delete the agency just because the agent is no more (RIP Gori Diamond) so this is an example of DELETE anomaly there is one more issue with this denormalized table although we don't have something called as READ anomaly Let us suppose I want to find who is the agent or which is the agency working for Mr Morgage Freeman So how do I write the query? Select AgencyName from FilmAgents table where customer1 = "M Freeman" OR customer2 = "M Freeman" OR customer3 = "M Freeman"; what if tomorrow we add two more columns to add infor about five customers so we will have to add two more columns in where clause in all our queries feels a bit clumsy right? What if we want to sort the data based on customer name so on which column will we sort? on customer1, customer2 or customer3? so there are problems in the retrieval of data also in this denormalized way of storing data So let us rather normalize this database In database normalization there are few forms, you can achieve database normalization in stages the forms are 1 NF, 2 NF and 3 NF (NF: Normal Form) apart from these three, there are few more normal forms such as: BCNF (Boyd-Codd Normal Form) fondly called as 3.5 NF and there are 4 NF, 5 NF and 6 NF also but it is generally said that; if your database is in 3 NF, it is sufficiently normalized so we will study normalization till 3 NF What are the rules for attaining 1 NF? 1 NF says that; each column should be atomic i.e you should not have column values as Fakira, Sakira single value in every column and, there can't be a group of columns representing similar information so we can't have three customer columns the second rule of 1 NF says that: there should be a key which identifies each row of the database table that key could be one column (like AgentId) or a combination of two or more columns So currently this table is not in 1 NF. Why? because it has a group of columns representing similar information (customer1, customer2, customer 3) In order to achieve 1 NF, I will divide this table into two tables so I will separate out the customer information into another table so now I will call this table as AgentInfo and I will create another table called as CustomerInfo so as you can see here, I have customer id (CId) as my primary key agent id value for this row is: A1 so I can identify that Mr Bread Pith is represented by agent A1 which is: Ari Gold Mr Morgage Freeman (C6) is represented by agent A4 that is Gori Diamond (if she is still alive) and it has customer city and customer pin code so now our database is in 1 NF all the columns are atomic because they have single values and there are no more group of columns representing similar information Also, as the rules of 1 NF says for each table there should be a column or a group of columns which should be uniquely able to identify each row In short, there should be a primary key So in CustomerInfo table we have CId (customer_id) and in AgentInfo we have AgentId so we have primary keys and we have atomic columns so we are in now 1st normal form now being in 1 NF, we have resolved certain issues now we can add unlimited amount of customers for each agent remember earlier we had three columns customer1, customer2, customer3...so at max every agent could represent three customers In this table (CustomerInfo) you can keep adding new entries, so one agent can have as many customers he/she wants to represent Sorting, remember we could not sort data based on customer name now we can sort the data based on customer name as it just one column instead of three insert and delete anomalies on the customer table are also gone I could delete a customer, let us say Mr Morgage Freeman goes out of planet in that case, I can delete this entry of Mr Morgage Freeman from this table and still my agent info A4 (i.e of Gori Diamond) will not be harmed in any manner So delete anomaly is avoided but we still have some anomalies left in this design we still haven't solved this problem there is a new agency in LA but as they don't have any agents so they can't represent any customers so although we are in 1 NF we still have some anomalies left so we will now go to 2 NF in order to get rid of those anomalies, have more data integrity and less data redundancy in order to attain 2 NF, you should adhere to all the rules of 1 NF; as I told you earlier, they are progressive plus all non-key columns should be dependent on primary key So first, what is a non-key column? so in this AgentInfo table, AgentId is primary key here (in CustomerInfo) table, CustomerId is primary key so all these other columns (AgentName, AgencyName, location, contact) are non-key columns so primary key could be a single column as in our case or it could be a combination of multiple columns for e.g here the primary key could be (a combination of) Customer_Id and CustName so at times, primary key is a composite key i.e it is a combination of two columns so let us say here if CId and CustName form your composite key; then CustCity and CustPinCode will be your non-key columns (* AgentId is foreign key, so not a non-key column) here, in our case (AgentInfo table), it is more simpler AgentId is the primary key, so all these are non-key columns they (non-key columns should be dependent on primary key) so what do you mean by dependent? In order to answer that question; are they dependent or not? you have to ask one simple question Does that non-key column describe what this primary key identifies? so in this case, consider OfficeLocation column (a non-key column) Los Angeles: Does it describe what is uniquely identified by this primary key A1? This primary key A1 is uniquely identifying what? The Agent: Ari Gold Now Los Angeles office location, does it describe Ari Gold? Not really. It is actually describing his office, his work-place So if your answer is yes, for e.g column AgentName primary key A1 is representing this agent AgentName is describing that agent? Yes so it (the column) should be in this table if the answer is NO, in the case of OfficeLocation column, e.g Los Angeles then that information should not be in this table it should be in some other table they say that; for a table to be in 2 NF it should have a single purpose so our current database design is not in 2 NF why? because in AgentInfo table we have three columns (AgencyName, OfficeLocation and OfficeContact) which are not describing the agent they are describing his/her work-place So this is not in 2 NF So lets make our database in 2 NF if your non-key columns are not dependent on your primary key; meaning if your non-key columns do not describe what is being uniquely identified by your primary key; move those columns into another table so I have divided my database into three tables now AgentInfo (table) has AgentId, AgentName and AgencyId AgentId is primary key and AgencyId is my foreign key we have established a relation between Agents and Agencies via foreign key of AgencyId Thus we know that this agent: Ari Gold belongs to New Cine Agency What about, we removing the foreign key AgentId from the CustomerInfo table so how would we know, a customer (e.g Bread Pith) is being represented by which agent and consequently by which agency? so we will create a join table so this is my join table It is called as join table or intersection table it will have just two columns: AgentId and CustomerId both foreign keys join tables are usually used to represent many-to-many relationships an Agent can have many customers and a Customer can be represented by many agents (that is what our system says, usually it's one-to-one) so from this table we get the information that; agent A1 (Ari Gold) is representing customer C1 (Bread Pith) so this is our join table 2 NF: All the non-key columns shuld be dependent on primary-key here (in AgentInfo) table, AgencyId is a foreign key so it's not a non-key column rest of the columns are dependent on the primary key as they describe the agent here (in AgencyInfo table) AgencyName, OfficeLocation and OfficePhone are describing the agency so they are dependent on priamry key here (in CustomerInfo table) CustName, CustCity and CustPinCode are describing the customer which is uniquely identified by this CId primary key hence they are also dependent on primary key so all these three tables are 2 NF compliant so is this table, because there are only keys here both columns are foreign keys so these rules (2 NF) do not apply here at all so all these tables are in 2 NF, hence we are (our database) is in 2 NF now this in itself is a very good database design In fact, 3 out of 4 of our tables are already 3 NF compliant except for this CustomerInfo table so let us see now, what is 3 NF so 3NF will be all the rules of 2NF plus all the non-key columns should be NON-TRANSITIVELY dependent on primary key so what do you mean by NON-TRANSITIVE ? so first let us see what is Transitive A depends on B and then B depends on C so in turn we say: A is dependent on C so this is transitive property 3 NF says: all non-key columns should be non-transitively (i.e directly) dependent on primary key in our case, this CustomerInfo table is violating that rule How? so we have the CId (customer_id) as primary key CustName describes the customer which is uniquely identified by this primary key hence CustName depends on the primary key CustCity? The same. It describes the customer hence it is directly depends on the primary key now, CustPinCode? CustPinCode depends on CustCity and CustCity depends on customer id (primary key) so here CustPinCode is transitively dependent on the primary key CId (customer id) not directly but transitively through CustCity column hence this table (CustomerInfo) is not in 3NF so let us make it in 3NF I will create another table: CustomerPinCodes It has CustPinCode and CustCity as columns now this is the foreign key here (CustPinCode) joining these two tables hence that non-transitive rule does not apply on this column (CustPinCode) anymore as it is not a non-key column anymore. It is a foreign-key column. so this table (CustomerInfo) is in 3NF And here also, City is describing (or directly dependent on primary key) Pin Code hence it is also 3NF compliant So now our database is in 3NF It is a good relational database design when your database is in 3NF