How nested loop, hash, and merge joins work.

Video Statistics and Information

Video

Captions Word Cloud

Reddit Comments

Captions

so joints are pretty common across both transactional databases like postgress my SQL and analytics database like red shift big query and even distributed computation Frameworks like Apaches spark they are so common so common that we don't even think twice before adding the next inner joint clause in our table right as an engineer we don't care typically on how they're implemented we just write SQL is a declarative language you say what you want and the SQL engine figures out what it needs to do behind the scenes right but in this one go in depth of the algorithms that it uses when we give them the joint query now let's say we take a we take a simple example to understand it throughout across all algorithms let's say I am building a blogging website I have a blogs table in which all the blogs are listed each blog table has a column called user ID to tell who wrote that particular blog right and I have a users table what I want to do I just want to fire a simple query which says for like across all users each user the number of blogs they public ordered by the total number of blogs in the descending order so my query would look simple something like this select user.name and countstar from blogs inner join users on blog. user ID equal to user. ID right so now I'm joining the two tables on blog. user ID and user. ID and Counting the number of blogs that are there order by bount descending order and this would give me for each user the total number of blogs he or she published order by total number of logs in descending order right okay but given this query how is your database engine my SQL POS whatever they are implementing it internally what are they doing internally to do it one of the four one of the easiest algorithm that databases Implement to do joints across two table is called nested Loop joint now what nested Loop joint does it's very simple as the name suggest it is a nested Loop for within a for right so the idea is very simple you start from each row in your left table in your left relation for each row of that you match every other Row in the right relation that's it literally for I like literally for each row in the left and for each row in the right you try to see if your join condition matches if it matches you add it to the result set done that's the easiest of all implementation so if I were to go slightly deeper into implementation details or as a walk through of this example it would look something like this let's say I have users table with two rows ID 1 id2 arpit and Arya and I have a blogs table in which I have fly five blogs written three from user One and Two from user two so overall when I'm joining this two table the resultant set would be having five rows right because I'm joining to row on user ID so what I want as a result is the I'm joining this two table so my result should contain columns across both the tables combined together on the join attribute right so in this case my flow looks something like this I pick the first row from this table and for each row I'm going through all the rows from the second table then I will check if the join condition matches so first row from this first row from this join condition matches I add it to the result set first R from this second row from the second table join condition matches added to the result set First Row third row from this this doesn't match so I'll skip it then I'll to the next row then to the next row and so on and so forth so literally it's the easiest of all implementation one nested Loop and you are done obviously looking at this we all know how expensive uh for within a for hour or nested Loops are so although it's easy to implement it could be very timeconsuming for a large large data set imagine joining two tables millions of rows too to to too slow right but it because of its Simplicity they're really efficient for smaller data set and if you have the right indexes the flow becomes even faster right so be mindful whenever you are adding two columns just have the right set of indexes on them right okay now the second one is merge join in case of merge join what do we do here the core idea is the merge of the merge sort that's the that's the tldr of it so the algorithm goes like this if you want to join two tables what we do is we sort each table on the join attribute so my table one which is users I would join it on user ID the primary key of the table and the blogs one I would sort it by user undor ID this way after the step I would have two tables both sorted on the join attributes look here they're all sorted right now given I have two sorted arrays imagine as list right for you the merging of them becomes very easy in the nested Loop joint we had to go for each row across all other rows in the other table now here life is very easy see now because now what we know that our because of sorting in the right table and on the join attribute the rows are now grouped so now my joining becomes really very efficient I start with the first row I start with the first row from both the tables I see if the join attribute matches I would add it over here then I would go to the second row join attribute matches added over here third row join attribute matches added over here which means in the result set right but when I go to the next I figure out as the join attribute don't match right but given that I have sorted this table all the possible ones user ID equal to one would have been grouped there I know there is no other user ID equal to one in subsequent rows so what I can do I can be very assured that I can move I can I've done processing all the user IDs equal to one from here from both the tables so now I'll go to the next step go to the second row and match the sub subsequent rows from that it's literally joining of two sorted arrays dead simple right but again you see the beauty of how your computer loves data to be kept in an organized fashion in an ordered fashion because a lot of algorithms become really efficient when your data is ordered this is a classic example of that again this merge join approach it is very efficient for large data set because you are just sorting two things once right and then just a sequential scan across them and you are done with join but again that pre-join preparation that you are doing which is sorting of two two tables that could become a problem if it is very large but it's still decent enough for large data set scanning happens once which is excellent and if you have the right set of indexes like always you can skip going through all the rows and like directly go to that location and start Ting from there again indexes do make these things faster this is the second algorithm the third algorithm that that data can use to perform joints of two tables are hash joint in hash joint the idea is again very simple you have some sort of preparation that you need to do and as the name suggests it has to do something with hash function right so what do you do you take you do very simple things you take one table on the join attribute store it in the hash table right for example if let's say I want to have join across users and blocks as the example that we are talking about what I would do is I would take the blocks table and I would iterate it row by Row for the relevant Row for the rows that are relevant to me and I will put them in a hash table the hash key of the table would be the join attribute because I'm joining on user ID so I would put all the rows pass I'll pass the join attribute through the hash function that holds my hash key and that goes in my hash table so for example all the rows where user ID is equal to one goes to this slot all the rows for user ID equal to two goes to this slot again collisions are there not a problem but I'm storing all the rows over here in this hash table so now what happens is now this is my Preparatory stage once this is done now the joints is pretty simple I can literally simply just go through each row from the users table then do a simple hash table lookup to figure out where this id1 is present I have all the rows I'll pick the ones which are relevant from there which means which has exactly the real ID the same Jo attributed I'm looking for because there are possible that two different user ID is m to the same hash slot and there is a collision so I would still need to match the join attribute and then keep adding it to the result set right so I'll start with Row one I have user ID one I'll go to that hash slot for each of the rows present in this in this value like that map to the same hash slot I'll go through them one by one if the join condition matches I'll add it to the result set right once I exhaust this row then I go to this row this row two I'll go to that all the row that match join them and put it in my result set so again this hash join is really well suited for equ joints equ joints means where you mapping user ID is equal to u. ID right so blog. user ID equal to u. ID equ equivalent joints it is efficient for L data set again pre-join preparation is required but it just hash T Construction once but the overall benefit that you get is pretty high right but again you could very clearly see given that constructing a new hash table additional memory requirements Ken and the entire thing is defined on the premise that my hash function should distribute the data equally almost equally that if it is a skewed hash function then everything gets skewed to the same row then you don't get any benefit out of it so some precursors but this is the third algorithm the databases used to perform joints of two tables now one key thing how do your SQL engine decide which one to break so what SQL engine typically postgress engine for example what they do is they take a look at the data they do statistic they they typically maintain the statistics across all tables distribution cardinality etc etc and they figure out what is the best way to join these two tables and they pick one of these three algorithm to perform the join so this is where your SQL query Optimizer on query planner and Optimizer kick who does this analysis for you right and this is how your databases join two tables internally the three famous algorithm Being Ned Loop joint mer joint and hash joint right there are variants of these algorithms which are optimized for certain set of use cases but again the core idea would be one of the three right and yeah this is all what I wanted to cover in this one I hope you found it interesting hope you found it amusing that's it for this one I'll see you in the next one thanks l

Info

Channel: Arpit Bhayani

Views: 27,164

Rating: undefined out of 5

Keywords: Arpit Bhayani, Computer Science, Software Engineering, System Design, Interview Preparation, Handling Scale, Asli Engineering, Architecture, Real-world System Design, Database Internals, PostgreSQL Internals, How Joins work?, Nested Loop Join, Hash Join, Merge Join, How databases join the two tables, Joins explained, MySQL internals, Spark internals

Id: -htbah3eCYg

Channel Id: undefined

Length: 11min 7sec (667 seconds)

Published: Fri Apr 26 2024