JASON KU: Welcome to the
fourth lecture of 6.006. Today we are going to be
talking about hashing. Last lecture, on Tuesday,
Professor Solomon was talking about
set data structures, storing things so that
you can query items by their key right, by what
they intrinsically are-- versus what Professor
Demaine was talking about last week, which was
sequence data structures, where we impose an external
order on these items and we want you
to maintain those. I'm not supporting operations
where I'm looking stuff up based on what they are. That's what the set
interface is for. So we're going to be talking a
little bit more about the set interface today. On Tuesday, you saw two
ways of implementing the set interface-- one using just a
unsorted array-- just, I threw these things
in an array and I could do a linear
scan of my items to support basically
any of these operations. It's a little exercise
you can go through. I think they show it to you
in the recitation notes, but if you'd like to implement
it for yourself, that's fine. And then we saw a slightly
better data structure, at least for the find operations. Can I look something
up, whether this key is in my set interface? We can do that faster. We can do that in log n
time with a build overhead that's about n log n, because we
showed you three ways to sort. Two of them were n squared. One of them was n log n, which
is as good as we showed you how to do yesterday. So the question then becomes,
can I build that data structure faster? That'll be a subject of next
week's Thursday lecture. But this week we're going to
concentrate on this static find. we got log n, which is an
exponential improvement over linear right, but
the question now becomes, can I do faster than log n time? And what we're going to do at
the first part of this lecture is show you that, no, you-- AUDIENCE: [INAUDIBLE] JASON KU: What's up? No? OK-- that you can't do
faster than log n time, in the caveat that we are in a
slightly more restricted model of computation that we
were-- than what we introduce to you a couple of weeks ago. And then so if we're not in
that more constrained model of computation, we can
actually do faster. Log n's already pretty good. Log n is not going to be larger
than like 30 for any problem that you're going to be
talking about in the real world on real computers, but a
factor of 30 is still bad. I would prefer to do faster with
those constant factors, when I can. It's not a constant factor. It's a logarithmic factor,
but you get what I'm saying. OK, so what we're
going to do is first prove that you can't
do faster for-- does everyone understand--
remember what find key meant? I have a key, I have a bunch of
items that have keys associated with them, and I want to see
if one of the items that I'm storing contains a key
that is the same as the one that I searched for. The item might
contain other things, but in particular,
it has a search key that I'm maintaining the
set on so that it supports find operations, search
operations based on that key quickly. Does that make sense? So there's the find one
that we want to improve, and we also want to
improve this insert delete. We want to be-- make this data
structural dynamic, because we might do those
operations quite a bit. And so this lecture's about
optimizing those three things. OK, so first, I'm
going to show you that we can't do faster
than log n for find, which is a little weird. OK, the model of
computation I'm going to be proving this
lower bound on-- how I'm going to approach
this is I'm going to say that any way that I store these-- the items that I'm storing
in this data structure-- for anyway I saw these
things, any algorithm of this certain type
is going to require at least logarithmic time. That's what we're
going to try to prove. And the model of
computation that's weaker than what we've been
talking about previously is what I'm going to call
the comparison model. And a comparison model
means-- is that the items, the objects I'm storing-- I can kind of think of
them as black boxes. I don't get to touch these
things, except the only way that I can distinguish
between them is to say, given a key and an item, or two
items, I can do a comparison on those keys. Are these keys the same? Is this key bigger
than this one? Is it smaller than this one? Those are the only operations
I get to do with them. Say, if the keys are
numbers, I don't get to look at what number that is. I just get to take two
keys and compare them. And actually, all of
the search algorithms that we saw on Tuesday we're
comparison sort algorithms. What you did was stepped
through the program. At some point, you
came to a branch and you looked at
two keys, and you branched based on whether one
key was bigger than another. That was a comparison. And then you move
some stuff around, but that was the
general paradigm. Those three sorting operations
lived in this comparison model. You've got a
comparison operations, like are they equal,
less than, greater than, maybe greater than or
equal, less than or equal? Generally, you have
all these operations that you could do--
maybe not equal. But the key thing here
is that there are only two possible outputs to
each of these comparitors. There's only one thing
that I can branch on. It's going to branch
into two different lines. It's either true and I do
some other computation, or it's false and I'll do a
different set of computation. That makes sense? So what I'm going
to do is I'm going to give you a comparison-- an algorithm in the
comparison model as what I like to
call a decision tree. So if I specify an
algorithm to you, the first thing it's going to
do-- if I don't compare items at all, I'm kind of
screwed, because I'll never be able to tell if my
keys in there or not. So I have to do
some comparisons. So I'll do some computation. Maybe I find out the
length of the array and I do some constant time
stuff, but at some point, I'll do a comparison,
and I'll branch. I'll come to this node,
and if the comparison-- maybe a less than-- if it's true, I'm going to go
this way in my computation, and if it's false, I'm going to
go this way in my computation. And I'm going to keep doing
that with various comparisons-- sure-- until I get down
here to some leaf in which I I'm not branching. The internal nodes here are
representing comparisons, but the leaves
are representing-- I stopped my computation. I'm outputting something. Does that make sense,
what I'm trying to do? I'm changing my
algorithm to be put in this kind of graphical
way, where I'm branching what my program could possibly
do based on the comparisons that I do. I'm not actually counting
the rest of the work that the program does. I'm really only looking
at the comparisons, because I know that I need to
compare some things eventually to figure out what my items are. And if that's the only way
I can distinguish items, then I have to do those
comparisons to find out. Does that make sense? All right, so what I
have is a binary tree that's representing
the comparisons done by the algorithm. OK. So it starts at one comparison
and then it branches. How many leaves must
I have in my tree? What does that question mean,
in terms of the program? AUDIENCE: [INAUDIBLE] JASON KU: What's up? AUDIENCE: The number
of comparisons-- JASON KU: The number
of comparisons-- no, that's the number
of internal nodes that I have in the algorithm. And actually, the
number of comparisons that I do in an execution
of the algorithm is just along a path from
here to the-- to a leaf. So what do the leaves
actually represent? Those represent outputs. I'm going to output
something here. Yep? AUDIENCE: [INAUDIBLE] JASON KU: The number of-- OK. So what is the output
to my search algorithm? Maybe it's the-- an index of
an item that contains this key. Or maybe I return the
item is the output-- the item of the
thing I'm storing. And I'm storing n things, so
I need at least n outputs, because I need to be able
to return any of the items that I'm storing based on a
different search parameter, if it's going to be correct. I actually need one more output. Why do I need one more output? If it's not in there-- so any correct comparison
searching algorithm-- I'm doing some comparisons
to find this thing-- needs to have at
least n plus 1 leaves. Otherwise, it can't be correct,
because I could look up the one that I'm not
returning in that set and it would never be
able to return that value. Does that make sense? Yeah? AUDIENCE: [INAUDIBLE] JASON KU: What's n? For a data structure,
n is the number of things stored in that
data structure at that time-- so the number of items
in the data structure. That's what it means
in all of these tables. Any other questions? OK, so now we get
to the fun part. How many comparisons does
this algorithm have to do? Yeah, up there-- AUDIENCE: [INAUDIBLE] JASON KU: What's up? All right, your colleague is
jumping ahead for a second, but really, I have to do as many
comparisons in the worst case as the longest root-to-leaf
path in this tree-- because as I'm executing
this algorithm, I'll go down this thing,
always branching down, and at some point,
I'll get to a leaf. And in the worst
case, if I happen to need to return this
particular output, then I'll have to walk down the
longest thing, just the longest path. So then the longest path is the
same as the height of the tree, so the question
then becomes, what is the minimum height of any
binary tree that has at least n plus 1 leaves? Does everyone understand why
we're asking that question? Yeah? AUDIENCE: Could you over again
why it needs n plus 1 leaves? JASON KU: Why it needs
n plus 1 leaves-- if it's a correct algorithm,
it needs to return-- it needs to be able to
return any of the n items that I'm storing or say that
the key that I'm looking for is not there-- great question. OK, so what is
the minimum height of any binary tree
that has n plus 1-- at least n plus 1 leaves? You can actually state
a recurrence for that and solve that. You're going to do that
in your recitation. But it's log n. The best you can do is if this
is a balanced binary tree. So the min height is going
to be at least log n height. Or the min height
is logarithmic, so it's actually
theta right here. But if I just said
height here, I would be lower
bounding the height. I could have a linear height,
if I just changed comparisons down one by one, if I was doing
a linear search, for example. All right, so this is saying
that, if I'm just restricting to comparisons, I have to
spend at least logarithmic time to be able to find whether
this key is in my set. But I don't want
logarithmic time. I want faster. So how can I do that? AUDIENCE: [INAUDIBLE] JASON KU: I have one operation
in my model of computation I presented a
couple of weeks ago that allows me to do faster,
which allows me to do something stronger than comparisons. Comparisons have a
constant branching factor. In particular, I can-- if I do this operation-- this
constant time operation-- I can branch to two
different locations. It's like an if kind of
situation-- if, or else. And in fact, if I had
constant branching factor for any constant here-- if I had three or four, if
it was bounded by a constant, the height of this
tree would still be bounded by a log
base the constant of that number of leaves. So I need, in some sense,
to be able to branch a non-constant amount. So how can I branch a
non-constant amount? This is a little tricky. We had this really neat
operation in the random access machine that we
could randomly go to any place in memory
in constant time based on a number. That was a super
powerful thing, because within a single
constant time operation, I could go to any
space in memory. That's potentially much larger
than linear branching factor, depending on the
size of my model and the size of my machine. So that's a very
powerful operation. Can we use that to find quicker? Anyone have any ideas? Sure. AUDIENCE: [INAUDIBLE] JASON KU: We're going to
get to hashing in a second, but this is a simpler
concept than hashing-- something you probably
are familiar with already. We've kind of been
using it implicitly in some of our sequence
data structure things. What we're going to do is, if
I have an item that has key 10, I'm going to keep an array and
store that item 10 spaces away from the front of the
array, right at index 9, or the 10th index. Does that make sense? If I store that item at
that location in memory, I can use this random
access to that location and see if there's
something there. If there's something
there, I return that item. Does that make sense? This is what I call a
direct access array. It's really no different
than the arrays that we've been talking
about earlier in the class. We got an array, and
if I have an item here with key equals 10, I'll stick
it here in the 10th place. Now, I can only now store
one item with the key 10 in my thing, and that's
one of the stipulations we had on our set data structures. If we tried to insert
something with the same key as something already
stored there, we're going to replace the item. That's what the semantics
of our set interface was. But that's OK. That's satisfying the
conditions of our set interface. So if we store it
there, that's fantastic. How long does it
take to find, if we have an item with the key 10? It takes constant
time, worst case-- great. How about inserting
or deleting something? AUDIENCE: [INAUDIBLE] JASON KU: What's that? AUDIENCE: [INAUDIBLE] JASON KU: Again, constant time-- we've solved all our problems. This is amazing. OK. What's not amazing about this? Why don't we just do
this all the time? Yeah? AUDIENCE: You don't know
how high the numbers go. JASON KU: I don't know
how high the numbers go. So let's say I'm
storing, I don't know, a number associated with
that the 300 or 400 of you that are in this classroom. But I'm storing your MIT IDs. How big are those numbers? Those are like
nine-digit numbers-- pretty long numbers. So what I would need to do--
and if I was storing your keys as MIT IDs, I
would need an array that has indices
that span the tire space of nine-digit numbers. That's like 10 to the-- 10 to the 9. Thank you. 10 to the 9 is the size of
a direct access road off to build to be able
to use this technique to create a direct access array
to search on your MIT IDs, when there's only really
300 of you in here. So 300 or 400 is
an n that's much smaller than the
size of the numbers that I'm trying to store. What I'm going to
use as a variable to talk about the size of
the numbers I'm storing-- I'm going to say u is the
maximum size of any number that I'm storing. It's the size of the universe of
space of keys that I'm storing. Does that make sense? OK, so to instantiate a direct
access array of that size, I have to allocate
that amount of space. And so if that is
much bigger than n, then I'm kind of
screwed, because I'm using much more space. And these order operations are
bad also, because essentially, if I am storing these
things non-continuously, I kind of just have
to scan down the thing to find the next
element, for example. OK, what's your question? AUDIENCE: Is a
direct access array a sequence data structure? JASON KU: A direct access
array is a set data structure. That's why it's a set
interface up there. Your colleague is asking whether
you can use a direct accessory to implement a set-- I mean a sequence. And actually, I think you'll
see in your recitation notes, you have code that can
take a set data structure and implement sequence
data structure, and take sequence data structure
and implement a set data structure. They just won't necessarily
have very good run time. So this direct access
array semantics is really just good for these
specific set operations. Does that makes sense? Yeah? AUDIENCE: What is u? JASON KU: u is this the
size of the largest key that I'm allowed to store. That makes sense? The direct access array is
supporting up to u size keys. Does that make sense? OK, we're going to
move on for a second. That's the problem, right? When u largest key-- we're assuming integers here-- integer keys-- so in
the comparison model, we could store any
arbitrary objects that supported a comparison. Here we really need
to have integer keys, or else we're not going to be
able to use those as addresses. So we're making an
assumption on the inputs that I can only
store integers now. I can't store
arbitrary objects-- items with keys. And in particular, I also
need to-- this is a subtlety that's in the word RAM model-- how can I be assured
that these keys can be looked up in constant time? I have this little CPU. It's got some number of
registers it can act upon. How big is those registers? AUDIENCE: [INAUDIBLE] JASON KU: What? Right now, they're 64 bits,
but in general, they're w. They're the size of your
word on your machine. 2 to the w is the number
of dresses I can access. If I'm going to be able to
use this direct accessory, I need to make sure that the
u is less than 2 to the w, if I want these operations
to run in constant time. If I have kids that are
much larger than this, I'm going to need to
do something else, but this is kind
of the assumption. In this class, when we give
you an array of integers, or an array of
strings, or something like that on your
problem or on an exam, the assumption is,
unless we give you bounds on the size of those things-- like the number of
characters in your string or the size of the
number in the-- you can assume that those things
will fit in one word of memory. w is the word size of your
machine, the number of bits that your machine can do
operations on in constant time. Any other questions? OK, so we have this problem. We're using way too
much space, when we have a large universe of keys. So how do we get around
that Problem any ideas? Sure. AUDIENCE: Instead
of [INAUDIBLE].. JASON KU: OK, so what
your colleague is saying-- instead of just storing
one value at each place, maybe store more than one value. If we're using
this idea, where I am storing my key at
the index of the key, that's getting
around the us having to have unique keys
in our data structure. It's not getting around
this space usage problem. Does that make sense? We will end up storing
multiple things at indices, but there's another trick that
I'm looking for right now. We have a lot of
space that we would need to allocate for
this data structure. What's an alternative? Instead of allocating a
lot of space, we allocate-- less space. Let's allocate less space. All right. This is our space of keys, u. But instead, I want to store
those things in a direct access array of maybe size n, something
like the order of the things that I'm going to be storing. I'm going to relax
that and say we're going to make this
a length m that's around the size of the
things I'm storing. And what I'm going to
do is I'm going to try to map this space of keys-- this large space of
keys, from 0 to u minus 1 or something like that-- down to arrange
that 0 to m minus 1. I'm going to want a function-- this is what I'm
going to call h-- which maps this range
down to a smaller range. Does that make sense? I'm going to have
some function that takes that large base of keys-- sticks them down here. And instead of staring
at an index of the key, I'm going to put the key through
this function, the key space, into a compressed
space and store it at that index location. Does that make sense? Sure. AUDIENCE: [INAUDIBLE] JASON KU: Your colleague is-- comes up with the question I
was going to ask right away, which was, what's
the problem here? The problem is it's the
potential that we might be-- have to store more than
one thing at the same index location. If I have a function that
matches this big space down to this small
space, I got to have multiple of these things going
to the same places here, right? It can't be objective. But just based on
pigeonhole principle, I have more of these things. At least two of them have to
go to something over here. In fact, if I have, say, u
is bigger than n squared, for example, there-- for any function I
give you that maps this large space down to the
small space, n of these things will map to the same place. So if I choose a
bad function here, then I'll have to store n things
at the same index location. And if I go there,
I have to check to see whether any of
those are the things that I'm looking for. I haven't gained anything. I really want a hash function
that will evenly distribute keys over this space. Does that make sense? But we have a problem here. If we need to store
multiple things at a given location in memory-- can't do that. I have one thing
I can put there. So I have two options
on how to deal-- what I call collisions. If I have two items
here, like a and b, these are different keys
in my universe of space. But it's possible that
they both map down to some hash that
has the same value. If I first hash a, and a is-- I put a there, where do I put b? There are two options. AUDIENCE: Is the second
data structure [INAUDIBLE] so that it can
store [INAUDIBLE]?? JASON KU: OK, so what
your colleague is saying-- can I store this one
is a linked list, and then I can just insert a
guy right next to where it was? What's the problem there? Are linked lists good with
direct accessing by an index? No, they're terrible
with get_at and set_at They take linear time there. So really, the whole
point of direct this array is that there is an
array underneath, and I can do this
index arithmetic and go down to the next thing. So I really don't want
to replace a linked list as this data structure. Yeah? What's up? AUDIENCE: [INAUDIBLE] JASON KU: We can make
it really unlikely. Sure. I don't know what likely
means, because I'm giving you a hash function--
one hash function. And I don't know
what the inputs are. Yeah? Go ahead. AUDIENCE: [INAUDIBLE] JASON KU: OK, right. So there are actually
two solutions here. One is I-- maybe, if I
choose m to be larger than n, there's going to be
extra space in here. I'll just stick it somewhere
else in the existing array. How I find an open space
is a little complicated, but this is a technique
called open addressing, which is much more common
than the technique we're going to be talking
about today in implementations. Python uses an open addressing
scheme, which is essentially, find another place in the
array to put this collision. Open addressing is notoriously
difficult to analyze, so we're not going to
do that in this class. There's a much easier
technique that-- we have an implementation for you
in the recitation handouts. It's what your
colleague up here-- I can't find him-- over there was saying-- was, instead of storing
it somewhere else in the existing direct
access array down here, which we usually
call the hash table-- instead of storing it somewhere
else in that hash table, we'll instead, at that
key, store a pointer to another data structure,
some other data structure that can store a bunch of things--
just like any sequence data structure, like a dynamic
array, or linked list, or anything right. All I need to do is be able
to stick a bunch of things on there when there
are collisions, and then, when I go up
to look for that thing, I'll just look through all
of the things in that data structure and see
if my key exists. Does that make sense? Now, we want to make sure
that those additional data structures, which
I'll call chains-- we want to make sure that
those chains are short. I don't want them to be long. So what I'm going to do is,
when I have this collision here, instead I'll have
a pointer to some-- I don't know-- maybe make it
a dynamic array, or a linked list, or something like that. And I'll put a here
and I'll b here. And then later, when I look
up key K, or look up a or b-- let's look up b-- I'll go to this hash value here. I'll put it through
the hash function. I'll go to this index. I'll go to the data structure,
the chain associated to that index, and I'll
look at all of these items. I'm just going to
do a linear find. I'm going to look. I could put any
data structure here, but I'm going to look at
this one, see if it's b. It's not b. Look at this one-- it is b. I return yes. Does that make sense? So this is an idea
called chaining. I can put anything I want there. Commonly, we talk about
putting a linked list there, but you can put a
dynamic array there. You can put a sorted array
there to make it easier to check whether
the key is there. You can put anything
you want there. The point of this
lecture is going to try to show that there's
a choice of hash function I can make that make sure
that these chains are small so that it really doesn't
matter how I saw them there, because I can just-- if there's a constant number
of things stored there, I can just look at all of
them and do whatever I want, and still get constant time. Yeah? AUDIENCE: So does that means
that, when you have [INAUDIBLE] let's just say, for some
reason, the number of things [INAUDIBLE] is that most of
them get multiple [INAUDIBLE].. Is it just a data structure
that only holds one thing? JASON KU: Yeah. So what your colleague
is saying is, at initialization,
what is stored here? Initially, it points to
an empty data structure. I'm just going to initialize
all of these things to have-- now, you get some overhead here. We're paying something for
this-- some extra space and having pointer and
another data structure at all of these things. Or you could have
the semantics where, if I only have one
thing here, I'm going to store that
thing at this location, but if I have multiple, it
points to a data structure. These are kind of complicated
implementation details, but you get the basic idea. If I just have a 0
size data structure at all of these
things, I'm still going to have a constant
factor overhead. It's still going to be a
linear size data structure, as long as m is linear in n. Does that makes sense? OK. So how do we pick a
good hash function? I already told you
that any fixed hash function I give you is going
to experience collisions. And if u is large, then there's
the possibility that I-- for some input, all of
the things in my set go directly to the same
hashed index value. So that ain't great. Let's ignore that for a second. What's the easiest
way to get down from this large space of
keys down to a small one? What's the easiest
thing you could do? Yeah? AUDIENCE: [INAUDIBLE] JASON KU: Modulus-- great. This is called the
division method. And what its function
is is essentially, it's going to take
a key and it's going to say equal
to be K mod m. I'm going to take
something of a large space, and I'm going to mod it so
that it just wraps around-- perfectly valid thing to do. It satisfies what we're
doing in a hash table. And if my kids are completely
uniformly distributed-- if, when I use my hash
function, all of the keys here are uniformly distributed
over this larger space, then actually, this isn't
such a bad thing. But that's imposing some kind
of distribution requirements on the type of
inputs I'm allowed to use with this
hash function for it to have good performance. But this plus a little bit
of extra mixing and bit manipulation is essentially
what Python does. Essentially, all it
does is jumbles up that key for some fixed
amount of jumbling, and then mods it m,
and sticks it there. It's hard coded in the Python
library, what this hash function is, and so there
exist some sequences of inserts into a hash table
in Python which will be really bad in
terms of performance, because these chain links are
the amount number of collisions that I'll get at a single
hash is going to be large. But they do that
for other reasons. They want a deterministic
hash function. They want something that
I do the program again-- it's going to do the
same thing underneath. But sometimes Python
gets it wrong. But if your data
that you're storing is sufficiently uncorrelated
to the hash function that they've chosen-- which, usually, it is-- this is a pretty
good performance. But this is not a
practical class. Well, it is a practical
class, but one of the things that we are-- that's the emphasis
of this class is making sure we can prove that
this is good in theory as well. I don't want to know that
sometimes this will be good. I really want to know
that, if I choose-- if I make this data structure
and I put some inputs on it, I want a running time that
is independent on what inputs I decided to use,
independent of what keys I decided to store. Does that makes sense? But it's impossible for me to
pick a fixed hash function that will achieve this,
because I just told you that, if u is large-- this is u-- if u is
large, then there exists inputs that map
everything to one place. I'm screwed, right? There's no way to
solve this problem. That's true if I want a
deterministic hash function-- I want the thing
to be repeatable, to do the same thing
over and over again for any set of inputs. What can I do instead? Weaken my notion of what
constant time is to do better-- OK, use a non-deterministic-- what does
non-deterministic mean? It means don't choose a
hash function up front-- choose one randomly later. So have the user-- they pick whatever inputs
they're going to do, and then I'm going to pick
a hash function randomly. They don't know which hash
function I'm going to pick, so it's hard for them to
give me an input that's bad. I'm going to choose a
random hash function. Can I choose a hash
function from the space of all hash functions? What is the space of all
hash functions of this form? For every one of these values,
I give a value in here. For each one of these
independently random number between this range, how many
such hash functions are there? m to the this number--
that's a lot of things. So I can't do that. What I can do is fix a
family of hash functions where, if I choose
one from-- randomly, I get good performance. And so the hash function
I'm going to use, and we're going to spend
the rest of the time on, is what I call a
universal hash function. It satisfies what we call
a universal hash property-- so universal hash function. And this is a little bit
of a weird nomenclature, because I'm defining this to you
as the universal hash function, but actually, universal
is a descriptor. There exist many
universal hash functions. This just happens to be
an example of one of them. OK? So here's the hash function-- doesn't look actually
all that different. Goodness gracious-- how
many parentheses are there-- mod p, mod m. OK. So it's kind of doing the same
thing as what's happening up here, but before modding by m,
I'm multiplying it by a number, I'm adding a number, I'm
taking it mod another number, and then I'm getting by m. This is a little weird. And not only that-- this is
still a fixed hash function. I don't want that. I want to generalize this to
be a family of hash functions, which are this habk for
some random choice of a, b in this larger range. All right, this is a
lot of notation here. Essentially what this is
saying is, I have a has family. It's parameterized by the
length of my hash function and some fixed large random
prime that's bigger than u. I'm going to pick some
large prime number, and that's going to be fixed
when I make the hash table. And then, when I
instantiate the hash table, I'm going to choose
randomly one of these things by choosing a random a and
a random b from this range. Does that makes sense? AUDIENCE: [INAUDIBLE] JASON KU: This is
a not equal to 0. If I had 0 here, I lose
the key information, and that's no good. Does this make sense? So what this is doing
is multiplying this key by some random number,
adding some random number, modding by this prime,
and then modding by the size of my thing. So it's doing a
bunch of jumbling, and there's some
randomness involved here. I'm choosing the hash
function by choosing an a, b randomly from this thing. So when I start
up my program, I'm going to instantiate this
thing with some random a and b, not deterministically. The user, when they're
using this thing, doesn't know which
a and b I picked, so it's really hard for them
to give me a bad example. And this universal
hash function-- this universal hash family,
shall we say-- really, this is a family of functions,
and I'm choosing one randomly within that family-- is universal. And universality says that-- what is the property
of universality? It means that the probability,
by choosing a hash function from this hash family,
that a certain key collides with another key is less than
or equal to 1/m for all-- any different two
keys in my universe. Does that make sense? Basically, this thing has the
property that, if I randomly-- for any two keys that I
pick in my universe space, if I randomly choose
a hash function, the probability that
these things collide is less than 1/m. Why is that good? This is, in some
sense, a measure of how well distributed
these things are. I want these things to
collide with 1/m probability so that these things
don't collide very-- it's not very likely for
these things to collide. Does that make sense? So we want proof that
this hash family satisfies this universality property. You'll do that in 046. But we can use this
result to show that, if we use a universal--
this universal hash family, that the length of our change-- chains is expected to
be constant length. So we're going to use this
property to prove that. How do we prove that? We're going to do a
little probability. So how are we going
to prove that? I'm going to define a random
variable, an indicator random variable. Does anyone remember what an
indicator in a variable is? Yeah, it's a variable that,
with some amount of probability, is 1, and 1 minus
that probability is 0. So I'm going to
define this indicator random variable xij is a random
variable over my choice-- over choice of a hash
function in my has family. And what does this mean? It means xij equals 1,
if hash Ki equals hKj-- these things collide--
and 0 otherwise. So I'm choosing randomly
over this hash family. If, for two keys-- key i and and j-- if these things collide,
that's going to be 1. If they don't, then it's 0. OK? Then, how can we write
a formula for the length of a chain in this model? So the size of a chain-- or let's put it here-- the size of the chain at i-- at i in my hash table-- is going to equal-- I'm going to call that
the random variable xi-- that's going to equal the
sum over j equals 0 to-- what is it-- over, I think,
u minus 1 of summation-- or sorry-- of xij. So basically, if I
fix this location i, this is where this key goes. Sorry. This is the size of
chain at h of Ki. Sorry. So I look at wherever
Ki goes is hashed, and I see how many
things collide with it. I'm just summing over
all of these things, because this is 1 if there's a
collision and 0 if there's not. Does that make sense? So this is the size of the chain
at the index location mapped to by Ki. So here's where your
probability comes in. What's the expected
value of this chain length over my random choice? Expected value of
choosing a hash function from this universal hash
family of this chain length-- I can put in my definition here. That's the expected value of
the summation over j of xij. What do I know about
expectations and summations? If these variables are
independent from each other-- AUDIENCE: [INAUDIBLE] JASON KU: Say what? AUDIENCE: [INAUDIBLE] JASON KU: Linearity
of expectation-- basically, the expectation sum
of these independent random variables is the
same as the summation of their expectations. So this is equal
to the summation over j of the expectations
of these individual ones. One of these j's
is the same as i. j loops over all of the
things from 0 to u minus 1. One of them is i, so when xhi is
hj, what is the expected value that they collide? 1-- so I'm going
to refactor this as being this, where j
does not equal i, plus 1. Are people OK with that? Because if i equals-- if j and i are equal,
they definitely collide. They're the same key. So I'm expected to have
one guy there, which was the original key, xi. But otherwise, we can use
this universal property that says, if they're not
equal and they collide-- which is exactly this case-- the probability that
that happens is 1/m. And since it's an
indicator random variable, the expectation is
there are outcomes times their probabilities--
so 1 times that probability plus 0 times 1 minus that
probability, which is just 1/m. So now we get the
summation of 1/m for j not equal to i plus 1. Oh, and this-- sorry. I did this wrong. This isn't u. This is n. We're storing n keys. OK, so now I'm looping over j-- this over all of those things. How many things are there? n minus 1 things, right? So this should equal 1
plus n minus 1 over m. So that's what
universality gives us. So as long as we choose
m to be larger than n, or at least linear
in n, then we're expected to have our
chain lengths be constant, because this thing becomes a
constant if m is at least order n. Does that make sense? OK. The last thing I'm
going to leave you with is, how do we make
this thing dynamic? If we're growing
the number of things we're storing in
this thing, it's possible that, as we
grow n for a fixed m, this thing will stop being-- m will stop being
linear in n, right? Well, then all we have to
do is, if we get too far, we rebuild the entire thing-- the entire hash
table with the new m, just like we did
with a dynamic array. And you can prove-- we're not going to
do that here, but you can prove that you won't do that
operation too often, if you're resizing in the right way. And so you just
rebuild completely after a certain
number of operations. OK, so that's hashing. Next week, we're
going to be talking about doing a faster sort.