How do nested loop, hash, and merge joins work? Databases for Developers Performance #7

Video Statistics and Information

Video

Captions Word Cloud

Reddit Comments

Captions

Hi there! I want to combine all the cards in this deck, with all the cards in this deck. It's time to talk about joins! First up I'm going to take the naive approach. For every card in this first or outer deck, I'm gonna look for a card with the same value and suit in this second or inner deck. So what we'll do is we'll take the top card from here. That's the six of clubs. I'll then search through all the cards in this second deck until - eventually - we get to there we go there's six of clubs. So we have a join there. We've joined those cards. We can then go back to our first deck. First card here is the three of clubs. And it's also the first card in this second deck. Now you might think at this point we can stop. No! We still need to search through every single other card in this second deck, just in case someone snuck an extra three of clubs in at the bottom. So once I've found the join for the three of clubs I'll then go to the four of spades and look through all these cards. So for every card in this first or outer deck I'm going to look through every single card in this second or inner deck. This is the basis of a nested loop join. For every row in the first table, we will query the second table looking for matching rows. Clearly in this case it is very slow. We're going to read every row in the second table once for every row in the first. That's a lot of work. So how can we make combining the decks faster? Well one trick is to sort the decks first. So just bear with me a moment while I do this. So I'll take this first deck and start ordering the cards by value and suit. So hang on a sec ... so we've finished sorting this second deck by putting the ace of spades at the bottom. So now we have two ordered decks and I'm going to join them again. So we'll start with the first card in the outer deck. That's the two of clubs. We then go to the second inner deck and that also has the two clubs as the first card. So we've got a join. I can then go to the next card in this second deck, that's the three of clubs. This no longer matches. So because this second deck is sorted, we now know that there can be no other cards with the value two of clubs in the second deck. So we can stop processing now. And go back to the first deck and look for the first card in it; that's the three of clubs. Now at this point we will restart the search from the last value we had a join on, so that's the two of clubs. Two and three don't match so we go to the three of clubs, find the join there the two joining three of clubs. Until we get to the four of clubs again. Four is now greater than three, this deck is sorted, so we know there can be no other cards with the value four of clubs anywhere in it. So we can stop again, go back to our first deck. Look for the four of clubs. So again we're gonna have to restart our search from the last matched value. That's the three of clubs. We'll do three, four, five, and so on. And we can repeat this process walking down both decks bit by bit. This is the basis of a merge join. First we sort the tables on the join columns and then walk down them bit by bit comparing the rows. The key thing is because the rows are sorted we can stop processing once we've looked at a value in the second table which is greater than the value of the current row in the first table. So combining the decks with a merge join is faster than nested loops. But you need to sort them first. And sorting is slow. So how can we do any better? For my final approach what I'm going to do is first look at all the cards in my outer deck. For each one what I'll do is look at its value and suit and apply a hash function to it. So we'll get the six of spades, apply a hash function and build a hash table on it. So then go to the next card, put that in the hash table, nine of diamonds, Queen of Hearts, and so on. And I will do that for all the cards in this first deck. Once I've read them all I will then go to the second deck and apply the same hash function to the value and suit. So we've got the Queen of Hearts, apply the same hash function and then lookup - probe - the hash table we built for matching values. We've got a matching entry here so we have a join. Do that for the nine of diamonds, probe the hash table we've got a join. We'll repeat this process for all the cards in this second deck. As you might guess from the fact that I'm talking about hash functions, this is the basis of a hash join. This first reads all the rows in one table, applying a hash function to the join criteria to build the hash table. It then walks through all the rows in the second table, applying the same hash function and looking up - probing - the hash table to see if there is a matching value in the hash table we've just built. So we've got our three different approaches: nested loops, merge-join, hash join. How do they compare in terms of the amount of work they do? With nested loops for every row in the first table we're looking at every row in the second table. With 52 rows in each that's over two and a half thousand comparisons! A lot of work. With a merge-join, first we have to walk through each table sorting the rows. The exact work to do this varies, but it's roughly the number of rows in the table multiplied by the log of the number of rows. We do this for both tables giving a total of around a couple hundred sorting operations. Then we walk down the tables comparing the rows. In this case we'll compare each row in the first table to around three rows in a second, giving a total of over 300 operations. Finally with the hash join we walk through all the rows in this first table once, applying a nice fast hash function. We then start walking through the second table, applying the same hash function. Assuming we're using a good hash function this will find at most one matching value. So the total operations here is a little over 100. Significantly less than the two other join methods. This is because, unlike nested loops and merge joins, we're guaranteed to read each row at most once. Now at this point you might be wondering: if hash joins are so great why do we even bother with the two other join types? Well a couple of reasons. First up in order to use a hash join then your join must include equalities. That is check the values from the column in one table equal of the column in the other table. If this is not the case and all your join criteria are range comparisons; greater than, less than, and so on or inequalities, not equal to, then the optimizer will not use a hash join. But even if your join includes equalities, it's still possible that nested loops or merge joins can be faster in some cases. How?? Well so far we've looked at joining every single card in this first deck to find all the matching cards in this second deck. What if we don't want to do that? What if we just want to join a handful? Say a hand of five cards? Well with a hash join we've still got to go through every single card in this first deck before we can start reading any of them in this second deck. With nested loops on the other hand, as soon as we've read one row from here we can start looking for matching rows in our second deck. Now at this point you might be saying: well for each of the five cards we get from here we've still got to read all 52 cards from our second deck!? That's still pretty slow, right? Well that's true. Unless of course, we create an index! With an index on the suit and value of the second table, all the lookups of it become far more efficient. After reading the first row from the outer table, we can use the index to find the location of the matching row in the inner one. So instead of having to read all 52 cards here five times, we just do five index lookups of this deck to find the matching card. So we read five cards from here. And five cards from here. The hash join on the other hand cannot make use of the index on this table. So it starts by still reading all the cards in this deck. It still has to read all the rows in the outer table first building the hash table. Then start reading the rows in the inner table. It does this until it finds five matching rows which may mean it needs to read all the rows in the second table too! So now the hash join is much slower than nested loops! So what about our merge-join? Remember this had to sort the datasets first. But indexes are ordered data structures. So it can use this to avoid a sort. So if we've got an index on the join columns, instead of having to sort all the rows in the outer table, we can just walk down the index, getting the matching rows from the table. So we've saved ourselves a sort of the first table. Sadly, even if we've gone index on color and suit for the second table, Oracle Database will still sort it before performing the join. So if you're returning a fraction of the rows from each table, nested loops can be very efficient. Say I want to search through this deck for damaged cards. Ones which are bent, torn or whatever. So you can tell what they are just by looking at their back. Which isn't good if you want to play bluffing games such as poker! So I want to replace my bent six of spades from this deck with a nice fresh one from this deck. I've got an index on this one to find it nice and efficiently. But with our hash join, even though we've got an index on the value and suit over here, after building the hash table with the first deck I've still got to go through every single card in this second deck probing the hash table for matches. This is because it cannot take advantage of an index on this second table. What about nested loops? Well again I can find my bent card nice and efficiently, use the index to find the location of the six of spades in this deck, and find just that card. So we read one row from each table. Nice and efficient. So if your query finds a small fraction of the rows from one table and each of those match few rows in the other table, nested loops can be the fastest way. But what if I go and bend a whole bunch of the cards over here, so I've got lots of lookups to do over here? With nested loops the more cards we get from the first table, the more index lookups we need to do on our second table. At some point it becomes faster to just switch to a hash join and full scan this second table! This is a problem for the optimizer, because the switching point from nested loops being the fastest to hash joins being the fastest can be tiny. So if your table stats are even ever so slightly out of date, its row estimates can lead the optimizer to choose the wrong join type! Luckily from Oracle Database 12c we have a solution: adaptive plans! This is one plan which considers two join types. With these, the first time you run a query the optimizer says: I'm not sure if a nested loop is the best approach or if a hash join is superior. So I'll consider both. Then I'll have an operation - a statistics collector - watching the rows coming out of the first table. Provided this stays under some threshold, I will then use nested loops to look up the rows in the second table. But if the number of rows exceeds this threshold, I'll start building the hash table instead and switch to a full scan of the second. This allows the optimizer to adapt and choose the best join type based on what's actually stored in the tables. Let's recap. We have our three join types: nested loops, merge joins, hash joins. Nested loops are best suited when joining tiny data sets or small fractions of much larger tables. Hash joins on the other hand are best when joining all the rows in a table or just large data sets in general. A merge join can be super efficient - if the rows are sorted. But sorting is slow. So this is a relatively rare operation. OK, so how does knowing this help you tune your SQL queries? Well as always check a couple of things: the table statistics and their indexes. One of the key things driving the optimizer's decision behind the join type and the join order - which table it accesses first - is how many rows it thinks it's going to get from each table. Adaptive plans means it can hedge its bets when it comes to the join type. But the join order is still fixed. So if you find that the estimated number of rows is orders of magnitude different to the actual number of rows you get from a table, look to see if tweaking the stats can help the optimizer find a better plan. When it comes to indexes, look to see which exists on the join columns. If there isn't one, will creating it help the optimizer use a nested loop join? If there is one will removing it force the optimizer to use a hash join? By tweaking the stats and indexes on a table you can help the optimizer find a better plan for your query. Just remember: changing these will affect all other queries against the table! Thanks for watching! I hope you enjoyed this and it helped you understand how the different join types work. If you have any questions about these let us know, down in the comments. And subscribe for more SQL tips :)

Info

Channel: The Magic of SQL

Views: 10,997

Rating: undefined out of 5

Keywords: SQL, Oracle Database, RDBMS, Software, Programming

Id: pJWCwfv983Q

Channel Id: undefined

Length: 15min 54sec (954 seconds)

Published: Fri Jun 26 2020