Writing zip, but for dictionaries

Captions

Hello and welcome everyone! Hope you're having a great day. And welcome to mCoding, where we try to get just a little bit better at programming every episode. Let's just jump into it today and start off with a question for you. Suppose you're given two or more dictionaries containing corresponding data. By corresponding data, I mean they share the same set of keys. In this case, the keys are some channel IDs for some of my favorite YouTube channels. And the values are names and subscriber counts of those channels. Take a moment and let me know in the comments below how you would iterate over these corresponding dictionaries. While you're doing that, let me motivate this example a little bit. Because the first question you might have might be: Why do we have the names and sub_counts stored in separate dictionaries? You might be expecting something more structured like this. In this example, we've defined a class that stores the ID, name, and sub_count of a channel. And then we have this structured data of ChannelData stored in a list. If you got this data from a web API, this is what you might be expecting to get back. An aggregate of structured data types that each contain many different types of data. You can then have a dictionary of IDs to the channels they correspond to. And then the preferred way to loop is to loop over the items of that dictionary. This is certainly a common case. But it's not the only case. For instance, if the names came from one API like the YouTube Data API but the subscriber counts came from a different API like the Analytics API, then you could very well end up with two separate data structures that have the same keys. And that's how we end up with the situation from before. 
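The on-screen code is not part of these captions, so here is a minimal sketch of the two layouts described above. The channel IDs, names, and subscriber counts are made-up placeholders, and `ChannelData` only mirrors the three fields mentioned (ID, name, sub_count); the actual class in the video may differ.

```python
from dataclasses import dataclass


@dataclass
class ChannelData:
    id: str
    name: str
    sub_count: int


# Structured layout: one record per channel (placeholder values).
channels = [
    ChannelData("id_a", "Channel A", 1_000),
    ChannelData("id_b", "Channel B", 2_500),
]
channels_by_id = {ch.id: ch for ch in channels}

# Preferred loop for the structured case: iterate over the items.
for cid, channel in channels_by_id.items():
    print(cid, channel.name, channel.sub_count)

# Separate layout: two dictionaries sharing the same keys,
# as if each had come from a different API.
names = {ch.id: ch.name for ch in channels}
sub_counts = {ch.id: ch.sub_count for ch in channels}
```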
Whether you got these pieces of data from separate sources and your goal is to combine them or whether it just makes more sense in the context of your program to keep them separate and you just want to loop over both of them, the question remains: how should you loop over the corresponding data? Here's one way to do it. It works but it's not the best way. Loop over the channel IDs (cid) from the names. Then use the channel ID (cid) to pull out the name and sub_count. But the first thing I want to note is this is actually an anti-pattern of for-loops. Whenever you loop over the keys of a dictionary and then the first thing you do is get out the value corresponding to that key, you should instead prefer to loop over the items of that dictionary. This is going to be slightly more efficient because when you loop over items, you don't have to compute the hash of the keys. This method is also fine and it works. But it's still not quite what I would like. The names and sub_counts are supposed to have corresponding data. But if there was a missing subscriber count, I would get a `KeyError`. Whereas, if there was a missing name, I would silently just loop over less data. This kind of asymmetry can cause very subtle bugs. Additionally, the code just looks a little bit asymmetric. And that can make it a little bit harder for people to read. Here I'm looping over names, why not loop over sub_counts instead? Even though they share kind of equal purpose in this iteration, I had to pick one of them to go first. So while it works, this is still not quite what I'm looking for. To get a better feel for what I am looking for, let's take inspiration from a very similar example. Suppose that instead of dictionaries, these data were stored in corresponding lists. In that case, the simple and straightforward way to do it is to use the built-in `zip` function. As of Python 3.10, you could even use the `strict` flag to ensure that all the lists have the same length.
I think this is a very clear and great way to do it for lists. So, how do we zip dictionaries? We can't just use `zip` itself because `zip` is going to loop over the keys of dictionaries. We could try to zip corresponding items. But there are a lot of problems with this approach. We end up repeating the ID field. We now need extra parentheses. And we're actually depending on the fact that the keys were inserted into the dictionaries in the same order. As of Python 3.7, dictionary iteration order is guaranteed to be insertion order. But since these dictionaries came from potentially separate sources, we can't really assume that the keys were inserted in the same order even if we know that they have the same keys. What I really want is to be able to write something like this. I want a dictionary version of `zip` (`dict_zip`), and when I loop over it, I get out the key and then value, value, value for each of the corresponding dictionaries. To me this API is simple and straightforward, just like `zip`. The only problem is there is no `dict_zip`. It's not a built-in like `zip` and it's not part of the `itertools` library either. But we're all programmers here. If we want it, we can write it ourselves. So let's write it. Take in the dictionaries. We're going to make this function a generator. So let's just throw in a yield so that we have that mindset. If we weren't passed any dictionaries, then we just stop. I personally like the strictness to be built in. So, let's do a length check. Then we go ahead and loop over corresponding keys in the most efficient way that we can. That means using the items method of the zeroth dictionary. And then just grabbing out the rest of the values from the other dictionaries in a comprehension. Here I'm using a generator expression to get all the values. Then the star is used to unpack all the values so that the result is a tuple. So all in all, we have a tuple of the key, first value, and then the rest of the values.
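The steps just described can be sketched roughly as follows. This is a reconstruction from the narration, not the exact code shown on screen; in particular, the exception type and message used for the length check are assumptions. The sample data is made up, with the two dictionaries deliberately built in different insertion orders.

```python
def dict_zip(*dicts):
    """Yield (key, value_0, value_1, ...) for dictionaries sharing the same keys."""
    if not dicts:
        return  # nothing to zip: an empty generator

    # Built-in strictness: every dictionary must have the same number of keys.
    n = len(dicts[0])
    if any(len(d) != n for d in dicts):
        raise ValueError("arguments must have the same length")

    # Items of the zeroth dictionary, then index into the others.
    for key, first_val in dicts[0].items():
        # d[key] raises KeyError if another dictionary lacks this key.
        yield (key, first_val, *(d[key] for d in dicts[1:]))


names = {"id_a": "Channel A", "id_b": "Channel B"}
sub_counts = {"id_b": 2_500, "id_a": 1_000}  # different insertion order

rows = list(dict_zip(names, sub_counts))
```

Equal lengths combined with a successful lookup of every key from the zeroth dictionary imply that all the dictionaries share the same key set, so insertion order no longer matters.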
And that's all there is to it! Now we can loop over corresponding dictionaries in a straightforward way. And all the asymmetry that we had before is hidden inside the implementation. There are a few other things that I want to say about this implementation though. First off, this is a prime example of something that really should be implemented in C. In Python, we're forced to use brackets in order to get the corresponding values for the keys. Every time we do this, this involves computing the hash of the key. But the hash of the key is the same for every dictionary, so why recompute it? This is from the CPython source code. It's in the file dictobject.c, which defines the implementation of a dictionary. In particular, in C, we can use this `_PyDict_GetItem_KnownHash` function. This function is specially built for looking up a key in a dictionary when you already know its hash. So our `dict_zip` is a perfect example of a place where we could save a lot of computation time by removing the need to recompute a hash function. This is something I could say more about in a future video if you're interested. But I claim that if we did this using the C API, we could get away with not computing the hash function at all for any key. There's also one other hidden usability issue in this implementation: Look what happens if I try to use autocomplete on one of these entries (say, the channel ID). I get nothing useful! Not everyone depends on autocomplete, but having it there is a huge win for most developers. And the way we wrote our implementation is not amenable to autocomplete. Of course, if this was part of the standard library, we wouldn't have to worry about this. It would be taken care of. But a little tip when you try to implement these things in one of your own code bases: you can help autocomplete work better by using type hints. Unfortunately, if you think about what the signature of this function should be, Python's type hinting system does not yet support it.
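The point about repeated hashing can be observed from pure Python with a small experiment. `CountingKey` below is a hypothetical key type invented for this sketch; it counts calls to `__hash__`. CPython strings cache their hash, so a custom key is used here to make each lookup visible. The counts describe CPython's behavior.

```python
class CountingKey:
    """Hypothetical key type that counts how often it is hashed."""

    hash_calls = 0

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        CountingKey.hash_calls += 1
        return hash(self.value)

    def __eq__(self, other):
        return isinstance(other, CountingKey) and self.value == other.value


keys = [CountingKey(i) for i in range(3)]
names = {k: f"name {k.value}" for k in keys}
sub_counts = {k: k.value * 100 for k in keys}

# Building the dicts already hashed each key; start counting from here.
CountingKey.hash_calls = 0
for key, name in names.items():  # iteration does not call __hash__
    pass
calls_items_only = CountingKey.hash_calls

CountingKey.hash_calls = 0
for key, name in names.items():
    sub_count = sub_counts[key]  # each bracket lookup hashes the key again
calls_with_lookup = CountingKey.hash_calls
```

Iterating over items alone triggers no hash calls, while each bracket lookup into another dictionary triggers one. That per-lookup hash is the cost a C implementation with a known-hash lookup could avoid.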
There's an early draft of a PEP to fix this in 3.12 or 3.13. But for now here's a quick fix. The quick fix in this case is to provide typing overloads for the most common use cases. I'd say that zipping one, two or three dictionaries together are the most common cases. So make all the type variables you need. And then just provide overloads for each of those cases. We actually don't even need dictionaries; any mapping will do. So here's an overload that takes in two mappings. They share the same key type. But they can have different value types. Then I'll return an iterator of tuples of the common key, value 1 and value 2. Let's go back down to our example and see how the autocomplete works now. Here we are again and let's see what it does. Perfect! This time I'm getting string autocomplete suggestions for the string key. And for the sub_count, which is an integer, we're getting `int` suggestions. Finally, don't be afraid to go wild and define ways of iterating that work for your code base. Maybe you just want to loop over common elements of a dictionary like a SQL inner join. Or maybe you want to loop over a union of keys like an outer join and provide a fillvalue. In the right circumstances, providing these functions can actually vastly improve the readability of your code. And that's what I recommend you do. Think about the ways that you iterate. Is there anything that you do often by hand that you could just make a function for? If so, do it! Anyway, thanks for watching! That's all I've got. Thank you to my patrons and donors. And I'll see you in the next one! As always, don't forget to slap that like button an odd number of times.
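The overload approach described above can be sketched as follows. The type variable names (`K`, `V1`, `V2`, `V3`) and parameter names are assumptions, and the positional-only `/` markers (Python 3.8+) are one way to keep the overloads compatible with a `*dicts` implementation; the code in the video may be written differently.

```python
from typing import Any, Iterator, Mapping, Tuple, TypeVar, overload

K = TypeVar("K")
V1 = TypeVar("V1")
V2 = TypeVar("V2")
V3 = TypeVar("V3")


@overload
def dict_zip(m1: Mapping[K, V1], /) -> Iterator[Tuple[K, V1]]: ...
@overload
def dict_zip(
    m1: Mapping[K, V1], m2: Mapping[K, V2], /
) -> Iterator[Tuple[K, V1, V2]]: ...
@overload
def dict_zip(
    m1: Mapping[K, V1], m2: Mapping[K, V2], m3: Mapping[K, V3], /
) -> Iterator[Tuple[K, V1, V2, V3]]: ...


def dict_zip(*dicts: Mapping[Any, Any]) -> Iterator[Tuple[Any, ...]]:
    """Untyped implementation; the overloads above describe the common cases."""
    if not dicts:
        return
    n = len(dicts[0])
    if any(len(d) != n for d in dicts):
        raise ValueError("arguments must have the same length")
    for key, first_val in dicts[0].items():
        yield (key, first_val, *(d[key] for d in dicts[1:]))


names = {"id_a": "Channel A"}
sub_counts = {"id_a": 1_000}

# A type checker can now infer cid: str, name: str, sub_count: int.
rows = [(cid, name.upper(), sub_count + 1)
        for cid, name, sub_count in dict_zip(names, sub_counts)]
```

The overloads only affect static analysis and editor suggestions; at runtime only the final, untyped definition is called.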

Info

Channel: mCoding

Views: 41,805

Keywords: python

Id: k2ZWyHdahEk

Length: 8min 23sec (503 seconds)

Published: Mon Jun 27 2022