Hi there! I want to combine all the cards
in this deck, with all the cards in this deck. It's time to talk about joins! First up I'm going to take the naive approach. For every card in this first or outer
deck, I'm gonna look for a card with the same value and suit in this second or
inner deck. So what we'll do is we'll take the top card from here. That's the
six of clubs. I'll then search through all the cards in this second deck until -
eventually - we get to there we go there's six of clubs. So we have a join there.
We've joined those cards. We can then go back to our first deck. First card here
is the three of clubs. And it's also the first card in this second deck. Now you
might think at this point we can stop. No! We still need to search through every
single other card in this second deck, just in case someone snuck an extra
three of clubs in at the bottom. So once I've found the join for the
three of clubs I'll then go to the four of spades and look through all these
cards. So for every card in this first or outer deck I'm going to look through
every single card in this second or inner deck. This is the basis of a nested
loop join. For every row in the first table, we will
query the second table looking for matching rows. Clearly in this case it is
very slow. We're going to read every row in the second table once for every row
in the first. That's a lot of work. So how can we make combining the decks faster?
Well one trick is to sort the decks first. So just bear with me a moment
while I do this. So I'll take this first deck and start ordering the cards by
value and suit. So hang on a sec ... so we've finished sorting this second deck by
putting the ace of spades at the bottom. So now we have two ordered decks and I'm
going to join them again. So we'll start with the first card in the outer deck.
That's the two of clubs. We then go to the second inner deck and that also has
the two clubs as the first card. So we've got a join. I can then go to the next
card in this second deck, that's the three of clubs. This no longer matches. So
because this second deck is sorted, we now know that there can be no other
cards with the value two of clubs in the second deck. So we can stop processing
now. And go back to the first deck and look for the first card in it; that's the
three of clubs. Now at this point we will restart the search from the last value
we had a join on, so that's the two of clubs. Two and three don't match so we go
to the three of clubs, find the join there the two joining three of clubs.
Until we get to the four of clubs again. Four is now greater than three, this deck
is sorted, so we know there can be no other cards with the value four of clubs
anywhere in it. So we can stop again, go back to our first deck. Look for the four
of clubs. So again we're gonna have to restart our search from the last matched
value. That's the three of clubs. We'll do three, four, five, and so on. And we can
repeat this process walking down both decks bit by bit. This is the basis of a
merge join. First we sort the tables on the join
columns and then walk down them bit by bit
comparing the rows. The key thing is because the rows are sorted we can stop
processing once we've looked at a value in the second table which is greater
than the value of the current row in the first table. So combining the decks with
a merge join is faster than nested loops. But you need to sort them first.
And sorting is slow. So how can we do any better? For my final approach what I'm
going to do is first look at all the cards in my outer deck. For each one what
I'll do is look at its value and suit and apply a hash function to it. So we'll get
the six of spades, apply a hash function and build a hash table on it.
So then go to the next card, put that in the hash table, nine of diamonds, Queen of
Hearts, and so on. And I will do that for all the cards in this first deck. Once
I've read them all I will then go to the second deck and apply the same hash
function to the value and suit. So we've got the Queen of Hearts, apply the same
hash function and then lookup - probe - the hash table we built for matching values.
We've got a matching entry here so we have a join. Do that for the nine of
diamonds, probe the hash table we've got a join. We'll repeat this process for all
the cards in this second deck. As you might guess from the fact that I'm
talking about hash functions, this is the basis of a hash join. This first reads all the rows in one table, applying a hash function to the
join criteria to build the hash table. It then walks through all the rows in the
second table, applying the same hash function and looking up - probing - the hash
table to see if there is a matching value in the hash table we've just built.
So we've got our three different approaches: nested loops, merge-join,
hash join. How do they compare in terms of the amount of work they do? With
nested loops for every row in the first table we're looking at every row in the
second table. With 52 rows in each that's over two and a half thousand comparisons!
A lot of work. With a merge-join, first we have to walk through each table
sorting the rows. The exact work to do this varies, but it's roughly the number
of rows in the table multiplied by the log of the number of rows. We do this for
both tables giving a total of around a couple hundred sorting operations. Then
we walk down the tables comparing the rows. In this case we'll compare each row
in the first table to around three rows in a second, giving a total of over 300
operations. Finally with the hash join we walk
through all the rows in this first table once, applying a nice fast hash function.
We then start walking through the second table, applying the same hash function.
Assuming we're using a good hash function this will find at most one
matching value. So the total operations here is a little over 100. Significantly
less than the two other join methods. This is because, unlike nested loops and
merge joins, we're guaranteed to read each row at most once. Now at this point
you might be wondering: if hash joins are so great why do we even bother with the
two other join types? Well a couple of reasons. First up in order to use a hash
join then your join must include equalities. That is check
the values from the column in one table equal of the column in the other table.
If this is not the case and all your join criteria are range comparisons;
greater than, less than, and so on or inequalities, not equal to, then the
optimizer will not use a hash join. But even if your join includes equalities,
it's still possible that nested loops or merge joins can be faster in some cases.
How?? Well so far we've looked at joining every single card in this first deck to
find all the matching cards in this second deck. What if we don't want to do
that? What if we just want to join a handful? Say a hand of five cards? Well
with a hash join we've still got to go through every single card in this first
deck before we can start reading any of them in this second deck. With nested
loops on the other hand, as soon as we've read one row from here
we can start looking for matching rows in our second deck. Now at this point you
might be saying: well for each of the five cards we get from here we've still
got to read all 52 cards from our second deck!? That's still pretty slow, right? Well
that's true. Unless of course, we create an index! With an index on the suit and
value of the second table, all the lookups of it become far more efficient.
After reading the first row from the outer table, we can use the index to find
the location of the matching row in the inner one. So instead of having to read
all 52 cards here five times, we just do five index
lookups of this deck to find the matching card. So we read five cards from
here. And five cards from here. The hash join on the other hand cannot make use
of the index on this table. So it starts by still reading all the cards in this
deck. It still has to read all the rows in the
outer table first building the hash table. Then start reading the rows in the inner
table. It does this until it finds five matching rows which may mean it needs to
read all the rows in the second table too!
So now the hash join is much slower than nested loops! So what about our
merge-join? Remember this had to sort the datasets first. But indexes are ordered
data structures. So it can use this to avoid a sort. So if we've got an index
on the join columns, instead of having to sort all the rows in the outer table, we
can just walk down the index, getting the matching rows from the table. So we've
saved ourselves a sort of the first table. Sadly, even if we've gone index on
color and suit for the second table, Oracle Database will still sort it
before performing the join. So if you're returning a fraction of the rows from
each table, nested loops can be very efficient. Say I want to search through
this deck for damaged cards. Ones which are bent, torn or whatever. So you can
tell what they are just by looking at their back. Which isn't good if you want
to play bluffing games such as poker! So I want to replace my bent six of spades
from this deck with a nice fresh one from this deck. I've got an index on this
one to find it nice and efficiently. But with our hash join, even though we've got
an index on the value and suit over here, after building the hash table with the
first deck I've still got to go through every single card in this second deck
probing the hash table for matches. This is because it cannot take advantage of
an index on this second table. What about nested loops?
Well again I can find my bent card nice and efficiently, use the index to find
the location of the six of spades in this deck, and find just that card. So we
read one row from each table. Nice and efficient.
So if your query finds a small fraction of the rows from one table and each of
those match few rows in the other table, nested loops can be the fastest way. But
what if I go and bend a whole bunch of the cards over here, so I've got lots of
lookups to do over here? With nested loops the more cards we get from the
first table, the more index lookups we need to do on our second table. At some
point it becomes faster to just switch to a hash join and full scan this
second table! This is a problem for the optimizer, because the switching point
from nested loops being the fastest to hash joins being the fastest can be tiny.
So if your table stats are even ever so slightly out of date, its row estimates
can lead the optimizer to choose the wrong join type!
Luckily from Oracle Database 12c we have a solution:
adaptive plans! This is one plan which considers two join types.
With these, the first time you run a query the optimizer says: I'm not sure if
a nested loop is the best approach or if a hash join is superior.
So I'll consider both. Then I'll have an operation - a statistics collector -
watching the rows coming out of the first table. Provided this stays under
some threshold, I will then use nested loops to look up the rows in the second
table. But if the number of rows exceeds this threshold, I'll start building the
hash table instead and switch to a full scan of the second. This allows the
optimizer to adapt and choose the best join type based on what's actually
stored in the tables. Let's recap. We have our three join types: nested loops, merge
joins, hash joins. Nested loops are best suited when joining tiny data sets or
small fractions of much larger tables. Hash joins on the other hand are best
when joining all the rows in a table or just large data sets in general. A merge
join can be super efficient - if the rows are sorted. But sorting is slow. So this
is a relatively rare operation. OK, so how does knowing this help you tune your
SQL queries? Well as always check a couple of things: the table statistics
and their indexes. One of the key things driving the optimizer's decision behind
the join type and the join order - which table it accesses first - is how many rows
it thinks it's going to get from each table. Adaptive plans means it can hedge
its bets when it comes to the join type. But the join order is still fixed. So if
you find that the estimated number of rows is orders of magnitude different to
the actual number of rows you get from a table, look to see if tweaking the stats
can help the optimizer find a better plan. When it comes to indexes, look to
see which exists on the join columns. If there isn't one, will
creating it help the optimizer use a nested loop join? If there is one will
removing it force the optimizer to use a hash join? By tweaking the stats and
indexes on a table you can help the optimizer find a better plan for your
query. Just remember: changing these will affect all other queries against the
table! Thanks for watching! I hope you enjoyed this and it helped you
understand how the different join types work. If you have any questions about
these let us know, down in the comments. And subscribe for more
SQL tips :)