Python Iterators! COPY or NO COPY?

Video Statistics and Information

Video

Captions Word Cloud

Reddit Comments

Captions

Hello and welcome to mCoding! This is James Murphy. I am doing another giveaway. So, if you'd like a professional license to PyCharm, which is the editor that I'm currently using, then check the description. In this video, we're going over whether or not a bunch of built-in things in Python make copies of your data. This is actually inspired by one of my previous videos. I had a lot of commenters, rightfully worried that looping over a list this way in order to loop over it backwards might be making a copy. I got suggestions to use a slice of a list instead. And, in the wild, I've even seen people fall back to using error-prone indices just to avoid this issue. Well, it's definitely clear this index-based approach doesn't make a copy of my list. But it's so much less readable. So, I'm happy to tell you that reversed() does not make a copy of your sequence. If you want to loop over something backwards, this is the way to do it. This isn't a foolproof method to tell whether or not a copy is made. But we can actually get an inclination that one is not made by printing out the reversed of the list. Notice that instead of getting an actual list back, we get this list reverse iterator object. If you see something that says like an iterator or a view or a generator, that should be a clue that a copy probably isn't being made. Just for comparison, notice that if we use this fancy slice of a list, then we actually get another list back. So although it's shorter in this case, it is also making a copy. So then how is reversed() able to loop over the list without making a copy? Well, here's an implementation of a class that acts the way that reversed does. The key idea in all of these ways of iterating is to define your own iterator. An iterator is just an object that you can repeatedly ask for elements. The way that you ask it for an element is by calling its __next__ method. This is also what's called by the built-in next function. If there are any more elements left to iterate over, then next returns the next one. If there aren't any more elements to iterate over, you signal that by raising a stop iteration. Technically speaking, an iterator must also have an __iter__ method that returns self. This is a little bit confusing because the magic __iter__ method is used to define an iterable, not an iterator. Iterables are just things that you can ask for an iterator from. Whereas, an iterator you can think of more as an in-flight stream of elements. So all the return self really says is that if you want to get a stream of elements from a stream of elements, just return the stream. Most of the important logic is in the next method. So all we do when we construct the reversed object is store reference to the sequence. Initialize an index to the last element of the sequence. And do some error checking. Every time you ask a reversed for the next element, it gives you the element at its current index. And then subtracts one from the index. The last or I guess first valid index is 0. So, after returning that one, our index gets set to -1. So, if we see minus one, then we know there are no more elements. So as you can see this allows us to iterate over the entire sequence without making a copy of it. The built-in reverse is actually going to do something more like this that actually does things in terms of this next method. A common way to do something similar without as much code is to use a generator instead. This generator code does essentially the same thing. And is quite a bit clearer in what it's doing. But the major difference here is that the error checking doesn't happen until you actually start the generator which happens when you actually ask for the first element. Whereas, the class solution did the error checking in its init function. Of course, doing error checking earlier rather than later is usually a good idea. But this is an option. So then the question is: Which of the major built-ins actually make copies and which don't? Aside from sorted, most of the built-in that make copies are containers themselves. And that kind of makes sense. If you ask to make a new container using an old one, then you're kind of asking to make a copy. On the other hand, there are a bunch of them that don't make copies. So enumerate, filter, inter, map, reversed, and zip, any of the dictionary views and generator comprehensions don't make copies. Once again, you can mostly tell by printing them out. The ones that make copies look like they've returned a new version of the original thing. Whereas, the ones that don't make copies are returning these special purpose looking objects. As is often the case in Python though, there are almost always exceptions. You're able to change how a lot of built-ins interact with your classes. For instance, slicing a list makes a copy. Because that's the behavior that the built-in list shows. But you can define your own get item method for your classes. And do whatever you want with it. And that's exactly what NumPy arrays do. Things are not looking good if you just print it out. Both of these objects are numpy arrays. So, you might think that reversed is a copy. But NumPy arrays are actually just very thin views around raw memory. And printing out the data attribute of both of these arrays, we see that they're sharing the same exact memory object. Contrary to the way that lists behave, no copy was made here. The way that NumPy achieves this is by using strides. Strides tell you how many bytes you need to jump if you want to get to the next element of the array. These are tuples instead of just numbers because arrays can be multi-dimensional. But for a 1D array, we basically just have a number. The `4` here tells me that from some element I need to jump four bytes to get to the next one. That's because NumPy chose a 32-bit or 4 byte integer type for this array. So, all I need to do to get a reversed array is start at the end and jump by negative e 4 for each element instead of positive 4. It's a very simple, clever and effective way to do things efficiently. I wouldn't expect anything less from NumPy. And finally, we spend all this time talking about how to avoid making copies of things. But I have to say that copies are not always bad. Sometimes you really need a copy. The canonical example of when you'd want to copy is if you're iterating over something and mutating it at the same time. For instance, here I have a, b, c mapped to 1, 2, 3. I want to go into the dictionary and add uppercase A, B and C with the same values 1, 2 and 3. Seems simple enough. Iterate over key value pairs. Convert the key to uppercase and assign it the same value. It seems straightforward but we got slapped with this runtime error. Dictionary changes size during iteration. If you're not familiar how dictionaries are implemented in terms of hash tables, this might be a very confusing error. Why does Python care if I modify something while I'm looping over it? The very short and simple explanation is that modifying the dictionary might cause the dictionary to have to move all the elements around. If that were to happen then the iterator might basically just lose its place. The solution is to manually make the copy. Making a copy here means that we exhaust the iterator and collect all the results into the list. So the iterator is completely done before we ever start to modify the dictionary. So, there you have it. I hope this helped a bit. Don't forget about the giveaway if you're interested in a professional PyCharm license! Of course, thank you to my fantastic set of patrons and donors! If you like my content, hit that subscribe button. And if you especially like it, please consider becoming a patron or donor! My team and I are also available for Python and C++ consulting! Thanks for watching! See you next time!

Info

Channel: mCoding

Views: 25,317

Rating: undefined out of 5

Keywords: pyplease, python

Id: hVFKy9Gw95c

Channel Id: undefined

Length: 7min 10sec (430 seconds)

Published: Mon Jan 02 2023