Hello and welcome everyone! Hope you're having a great day. And welcome to mCoding, where we try to get just a little bit
better at programming every episode. Let's just jump into it today
and start off with a question for you. Suppose you're given two or more dictionaries
containing corresponding data. By corresponding data, I mean they
share the same set of keys. In this case, the keys are some channel
IDs for some of my favorite YouTube channels. And the values are names and
subscriber counts of those channels. Take a moment and let me know
in the comments below how you would iterate over these
corresponding dictionaries. While you're doing that,
let me motivate this example a little bit. Because the first question you
might have might be: Why do we have the names and
sub_counts stored in separate dictionaries? You might be expecting
something more structured like this. In this example, we've defined a class that
stores the ID, name, and sub_count of a channel. And then we have this structured data
of ChannelData stored in a list. If you got this data from
a web API, this is what you might be
expecting to get back. An aggregate of structured data types that each
contain many different types of data. You can then have a dictionary
of IDs to the channels they correspond to. And then the preferred way to loop
is to loop over the items of that dictionary. This is certainly a common case.
But it's not the only case. For instance, if the names came
from one API like the YouTube Data API but the subscriber counts came from a
different API like the Analytics API, then you could very well end
up with two separate data structures that have the same keys. And that's how we end up with
the situation from before. Whether you got these pieces of
data from separate sources and your goal is to combine them or whether it just makes more sense
in the context of your program to keep them separate and you just
want to loop over both of them, the question remains: how should you
loop over the corresponding data? Here's one way to do it.
It works but it's not the best way. Loop over the channel IDs (cid)
from the names. Then use the channel ID (cid) to pull
out the name and sub_count. But the first thing I want to note is
this is actually an anti-pattern of for-loops. Whenever you loop over the
keys of a dictionary and then the first thing you do is
get out the value corresponding to that key, you should instead prefer to
loop over the items of that dictionary. This is going to be slightly more efficient because when you loop over items,
you don't have to compute the hash of the keys. This method is also fine
and it works. But it's still not quite what
I would like. The names and sub_counts are supposed to
have corresponding data. But if there was a missing subscriber
count, I would get a key error. Whereas, if there was a missing name, I would
silently just loop over less data. This kind of asymmetry
can cause very subtle bugs. Additionally, the code just looks
a little bit asymmetric. And that can cause it to be a little bit
harder for people to read. Here I'm looping over names,
why not loop over sub_counts instead? Even though they share kind of equal
purpose in this iteration, I had to pick one of them
to go first. So while it works, this is still not
quite what I'm looking for. To get a better feel for what
I am looking for Let's take inspiration from
a very similar example. Suppose that instead of dictionaries,
these data were stored in corresponding lists. In that case, the simple and straightforward way
to do it is to use the built-in `zip` function. As of Python 3.10, you could even use the `strict` flag
to ensure that all the lists have the same length. I think, this is a very clear and
great way to do it for lists. So, how do we zip dictionaries? We can't just use `zip` itself because
`zip` is going to loop over the keys of dictionaries. We could try to zip corresponding items. But there are a lot of problems
with this approach. We end up repeating the ID field. We now need extra parentheses. And we're actually depending on
the fact that the keys were inserted into
the dictionaries in the same order. As of Python 3.7, dictionary iteration
order is guaranteed to be insertion order. But since these dictionaries came from
potentially separate sources, we can't really assume that the keys
were inserted in the same order even if we know that they have
the same keys. What I really want is to be able
to write something like this. I want a dictionary version of `zip` (dict_zip), And when I loop over it, I get out the key and then value, value,
value for each of the corresponding dictionaries. To me this API is simple
and straightforward, just like `zip`. The only problem is there
is no `dict_zip`. It's not a built-in like `zip` and it's not
part of the itertools library either. But we're all programmers here.
If we want it, we can write it ourselves. So let's write it.
Take in the dictionaries. We're going to make this
function a generator. So let's just throw in a yield
so that we have that mindset. If we weren't passing any dictionaries,
then we just stop. I personally like the strictness to be built in.
So, let's do a length check. Then we go ahead and loop over corresponding
keys in the most efficient way that we can. That means using the items method
of the zeroth dictionary. And then just grabbing out the rest of the values from
the other dictionaries in a comprehension. Here I'm using a generator
comprehension to get all the values. Then the star is used to unpack all the
values so that the result is a tuple. So all in all, we have a tuple of the key,
first value, and then the rest of the values. And that's all there is to it! Now we can loop over corresponding dictionaries
in a straightforward way. And all the asymmetry that we had before
is hidden inside the implementation. There are a few other things that I want
to say about this implementation though. First off, this is a prime example of something
that really should be implemented in C. In Python, we're forced to use brackets
in order to get the corresponding values for the keys. Every time we do this, this involves
computing the hash of the key. But the hash of the key is the same
for every dictionary, so why recompute it? This is from the CPython source code. It's in the file dictobject.c, which defines
the implementation of a dictionary. In particular, in C, we can use this
`GetItem_KnownHash` function. This function is special built for getting
a key out of a dictionary when you already know the hash. So our `dictionary_zip` is a perfect
example of a place where we could save a lot of computation time
by removing the need to recompute a hash function. This is something I could say more about in
a future video if you're interested. But I claim that if we did this
using the C API, we could get away with not computing
the hash function at all for any key. There's also one other hidden usability
issue in this implementation: Look what happens if I try to use autocomplete
on one of these entries (say, the channel ID). I get nothing useful! Not everyone depends on autocomplete,. But having it there is a huge
win for most developers. And the way we wrote our implementation
is not amenable to autocomplete. Of course, if this was part of
the standard library, we wouldn't have to worry about this.
It would be taken care of. But a little tip when you try to implement
these things in one of your own code bases: you can help autocomplete work
better by using type hints. Unfortunately, if you think about
what the signature of this function should be, Python's type hinting system
does not yet support it. There's an early draft of a PEP
to fix this in 3.12 or 3.13. But for now here's a quick fix. The quick fix in this case is to provide
typing overloads for the most common use cases. I'd say that zipping one, two or three
dictionaries together are the most common cases. So make all the type variables
you need. And then just provide overloads
for each of those cases. We actually don't even use dictionaries;
any mapping will do. So here's an overload that takes
in two mappings. They share the same key type.
But they can have different value types. Then I'll return an iterator of tuples
of the common key, value 1 and value 2. Let's go back down to our example
and see how the autocomplete works now. Here we are again
and let's see what it does. Perfect! This time I'm getting string
autocomplete suggestions for the string key. And for the sub_count which is an
integer we're getting `int` suggestions. Finally, don't be afraid to go wild and to
find ways of iterating that work for your code base. Maybe you just want to loop
over common elements of a dictionary like a SQL inner join. Or maybe you want to loop over a union
of keys like an outer join and provide a fillvalue. In the right circumstances, providing these functions
can actually vastly improve the readability of your code. And that's what I recommend you do. Think
about the ways that you iterate. Is there anything that you do often by
hand that you could just make a function for? If so, do it! Anyway, thanks for watching!
That's all I've got. Thank you to my patrons and donors.
And I'll see you in the next one! As always, don't forget to slap that like
button an odd number of times.