Python Generators

Video Statistics and Information

Captions
Hello and welcome to mCoding, where the only limit is your imagination. And your download speed. But mostly your imagination. In this episode, we're talking about Python generators, including the `yield` keyword, generator comprehensions, `yield from`, and how all this relates to async.
You define a generator function in Python just like you would any other function, using the `def` keyword, and it can take parameters. The only thing that makes this a generator is the presence of a `yield` statement somewhere in the body. Generators act like normal functions, except that when you hit a `yield` statement, the function pauses. Every time you pause you can also yield a value, in this case "hello", that becomes available to the caller.
Unlike with normal functions, calling the generator function doesn't run its body. Instead, printing out the result, we just see a generator object. The way you actually run a generator is by calling its `__next__` method, which is what the `next` built-in will do. It runs the generator until it hits a `yield` statement and returns the value at that `yield`. With every `next` call, it resumes the generator until it hits another `yield` statement or until the function ends. If you resume the generator and the function ends before hitting another `yield`, you'll get a `StopIteration` exception raised.
Returning a value from a generator is fine, but it doesn't appear as the result of a `next` call like the others do. The return value of the generator actually appears as an attribute on the `StopIteration` exception that's raised. This is mainly for a very niche purpose that we'll talk about at the end.
While Ariana Grande might prefer to thank you, next, next, next over and over again, in Python it's more common to let a for-loop do it for you. This works because under the hood a for-loop will call that `next` function over and over again until it gets a `StopIteration`, at which point it stops.
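The behaviour described so far can be tried out with a minimal sketch like the one below. The generator name `greet` and its return value are made up for illustration and are not taken from the video.

```python
def greet():
    # Hypothetical generator: the yield statements make this a generator function.
    yield "hello"
    yield "world"
    return "done"  # ends up on StopIteration.value, not as a result of next()


gen = greet()        # nothing in the body has run yet; gen is a generator object
first = next(gen)    # runs the body up to the first yield
second = next(gen)   # resumes and runs up to the second yield

try:
    next(gen)        # the body finishes, so StopIteration is raised
except StopIteration as exc:
    returned = exc.value

# A for-loop calls next() for us and stops quietly on StopIteration.
items = []
for item in greet():
    items.append(item)
```

Here `first` and `second` are "hello" and "world", `returned` is "done", and the for-loop collects only the yielded values.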
Check out my video on what a for-loop actually translates to under the hood if you haven't seen it already. In any case, here we see the items printed out: hello, world, 1, 2, 3.
So what are generators good for? The most common use case for a generator is to define the iterator of a class. For a simple example, consider this `Range` class. Just pretend you don't know about the built-in `range` and you're building a `Range` class yourself. In the spirit of being lazy like a generator, our ranges are all going to go from zero to some stop value. We don't support start or step. So just like the built-in `range`, which we don't know exists, `Range` of 5 is the numbers 0, 1, 2, 3, 4. But when we create the `Range`, we don't actually store all those numbers somewhere. We just store the start and the stop.
But if all we have is a start and a stop, how do we iterate over the elements of the `Range` that we're supposed to be representing? The answer is the highly sophisticated solution of counting. Start at the starting value, and then continually yield the current value and add 1 until we get to the stop. We can now iterate over our `Range` just the same as we would over the built-in `range`.
And just like the built-in `range`, because we're only storing the start and the stop, not all the numbers in between, we can construct huge ranges. There's no way a list of numbers from 0 all the way up to this number would ever fit into memory. Yet we can construct the `Range` and iterate over it quickly and efficiently. And to reiterate, it's this laziness, where we're not actually constructing those numbers until we ask to see them, that allows us to do this. So, while generators can be slightly slower than lists in certain situations, if you're just processing items one at a time, like here where we're just printing out the current n, then a generator can be a huge win over a list.
And here's a little history for you. The built-in `range` in Python 2 actually used to return a list.
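The `Range` class described above might look something like the following sketch. The attribute names and the upper bound `10**18` are assumptions chosen for illustration; the exact code from the video is not reproduced in the captions.

```python
class Range:
    """Toy stand-in for the built-in range: counts from 0 up to (not including) stop."""

    def __init__(self, stop: int):
        self.start = 0     # always zero in this simplified version
        self.stop = stop

    def __iter__(self):
        # The yield makes __iter__ a generator function, so each call
        # returns a fresh generator that counts from start to stop.
        current = self.start
        while current < self.stop:
            yield current
            current += 1


small = list(Range(5))

huge = Range(10**18)   # cheap to build: only start and stop are stored
iterator = iter(huge)
first_three = [next(iterator), next(iterator), next(iterator)]
```

`small` is `[0, 1, 2, 3, 4]`, and `first_three` is `[0, 1, 2]` even though the full range could never be held in memory as a list.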
Having `range` return a list turned out to be a huge mistake, and for Python 3 it was changed to be something more similar to what the generator is doing. It's not actually using a generator. It's using something handwritten in C. But it's the same idea.
Another very common and useful place that you might use a generator is reading from a file. Once again, this is a situation where, because files can be so big, you might not want to read the whole file into memory all at once. If you can process things line by line using a generator, then even if the file is gigabytes big, it doesn't matter. You'll only need as much memory as you would need to process a single line.
So for example, I have some custom dataclass. In this case, it's just xyz points. My file just looks like this: I just have floating point values xyz, xyz, xyz. I don't know what's up with the red squiggle. PyCharm thinks I have a syntax error in my text file. Feel free to free associate about what the problem is. In any case, we define a generator that expects a file handle. What I mean by that is the file object that you get back from an `open` call. In a very generator-like fashion, iterating over a file actually iterates over the lines of the file. We strip off the trailing newline and split the line by the commas. Then we convert everything to floats, create one of our custom data structures, and yield it. And then we just print out the rows. Of course, you can do whatever data processing you like.
The next very common use case for generators is to think about them as lazy sequences. You can loop over them, repeatedly getting values. So you can think about those as values of some sequence, whether it be a mathematical sequence like in this case, or just a sequence of objects. And you just don't compute the next term in the sequence until someone asks for it. So, here's a `collatz` sequence. Take a positive integer n. If it's even, divide it by 2. Otherwise, multiply it by 3 and add 1. Then repeat.
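The file-reading generator described above can be sketched roughly as follows. The names `Point` and `read_points` are hypothetical, and an in-memory `io.StringIO` stands in for a real file opened with `open` so that the example runs on its own.

```python
import io
from dataclasses import dataclass


@dataclass
class Point:
    # Hypothetical dataclass for one x,y,z row.
    x: float
    y: float
    z: float


def read_points(file_handle):
    # Iterating over a file object yields one line at a time,
    # so only a single line needs to be in memory at once.
    for line in file_handle:
        x, y, z = line.rstrip("\n").split(",")
        yield Point(float(x), float(y), float(z))


# Stand-in for a file on disk; the values are made up.
fake_file = io.StringIO("1.0,2.0,3.0\n4.5,5.5,6.5\n")
points = list(read_points(fake_file))
```

With a real file, the same generator would be driven by something like a for-loop inside a `with open(...)` block, processing each `Point` as it arrives instead of collecting them into a list.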
If you ever got to 1, then the sequence would start repeating. It would go 1, 4, 2, 1, 4, 2 and so on. As of 2022, it's one of the world's most famous unsolved problems in math. Starting at any number, do you always get to 1? Or could there be a sequence that goes off to infinity? Or maybe some other cycle like 4, 2, 1, 4, 2, 1? Well, I'm here to announce that I've actually proved the Collatz Conjecture to be independent of the axioms of mathematics. Just kidding.
Anyway, here's what a typical Collatz sequence looks like. It does some unpredictable stuff, and then eventually you hit a power of 2 and it shrinks down to 1. This showcases another very important property of generators. Imagine if, instead of a generator, we were using a list. Well, besides the fact that this list might need to be arbitrarily large, because we don't know how long a Collatz sequence will be, what if I didn't care about the whole sequence? For instance, what if I just wanted to know how long it is? If you're wondering, it's 111 elements. But if I had returned a list in this case, that would have been a huge waste. Why allocate all that memory and store all those numbers just to get the length?
If I wanted to be more efficient, I'd need to write another function that calculates the length instead of storing the list. But that length function would be basically identical to the list function. Just instead of appending to a list, we add one to a count. If only there was a way to have one implementation of the Collatz sequence that I can do whatever I want with. Once again, generators to the rescue. If I want the length of the sequence, then I just count one for every element of the sequence. Once again, we see 111. And if I did actually want the whole list of numbers in memory, then I can just call `list` on it.
Generators can even be used to represent sequences that we know are infinite. We can only ever use finitely many terms.
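A sketch of the single Collatz implementation described above, used once to count terms and once to build a list. The starting value 6 and the helper `count_items` are choices made here for illustration; the captions do not say which starting number produced the 111-element sequence.

```python
def collatz(n: int):
    # Yield n, then each following term, stopping once 1 has been yielded.
    while True:
        yield n
        if n == 1:
            break
        n = n // 2 if n % 2 == 0 else 3 * n + 1


def count_items(iterable) -> int:
    # Hypothetical helper: counts elements without storing any of them.
    count = 0
    for _ in iterable:
        count += 1
    return count


length = count_items(collatz(6))   # no list is ever built here
terms = list(collatz(6))           # same generator, materialised on request
```

For the starting value 6, `terms` is `[6, 3, 10, 5, 16, 8, 4, 2, 1]` and `length` is 9.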
But we're able to compute as many as we need without specifying how many ahead of time. So, you could represent all the powers of two, all the rational numbers, the Fibonacci sequence, or all the prime numbers. All you need is an algorithm for enumerating them.
Defining a generator is as simple as defining a regular function. But you can go even simpler if your generator is simple enough. This is a list comprehension, which hopefully you're familiar with. It creates a list whose elements are x times x for each x in the range. Replace those brackets with parentheses, and you now have a generator comprehension. This is really just shorthand notation for defining and calling a generator function, meaning the elements of this sequence will not be computed until you actually try to iterate over the generator. And once again, this can be more efficient than the list version, which creates all the elements in memory immediately. If you happen to be immediately passing a generator into a function, you can also do it this way. This creates a generator just like in the previous line and passes it to the `sum` function. Basically, it just lets you leave off a pair of parentheses.
And now that we know about generator comprehensions, another great feature of generators is that they're extremely easy to compose. You can build pipelines of data out of generators in no time. Suppose you want to be able to parse a file like this. It has data in it that you want to treat as floating point numbers. But you also want to allow trailing comments, full-line comments, NaNs, infinities and blank lines. No need to write a fancy parser. Generators are plenty expressive enough to get this job done. We start by opening the file and iterating over the rows. Remember, each row is one line of the file. Strip off the newline, and remove anything after a hash in order to strip trailing comments. Then define another generator that loops over the generator from the first line.
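The infinite sequences and generator comprehensions mentioned above can be illustrated with a short sketch. The generator name `powers_of_two` is hypothetical, and `itertools.islice` is used here as one convenient way to take finitely many terms; it is not mentioned in the video.

```python
from itertools import islice


def powers_of_two():
    # An infinite generator: it never finishes on its own.
    value = 1
    while True:
        yield value
        value *= 2


# Take only as many terms as needed from the infinite sequence.
first_five = list(islice(powers_of_two(), 5))

squares_list = [x * x for x in range(5)]   # list comprehension: built immediately
squares_gen = (x * x for x in range(5))    # generator comprehension: nothing computed yet
total = sum(x * x for x in range(5))       # the extra parentheses can be dropped in a call
```

`first_five` is `[1, 2, 4, 8, 16]`, and `total` is 30, the same as summing `squares_list`.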
All it does is filter out empty lines. Each line should now contain a floating point number, so we use `float` to convert it from a string to a float. Then we do another filter operation to throw out any infinities or NaNs. Then let's just pretend that we want to replace anything negative with 0. And just for something to do, let's say we want to add up those numbers.
This was very simple to write and easy to read, like a step-by-step instruction manual on how to create the pipeline. And once again, all this happens lazily, so it's very memory efficient. We've completely defined our pipeline before we ever actually read from the file. At this line in the code, we haven't even read a single byte from the file. Each `next` call inside the `sum` triggers this generator to look for one more element. That triggers this generator to compute one more element, which triggers this generator to compute more elements until it finds a finite one, which triggers this one to compute more elements until it finds a non-empty row, which triggers this one, which finally reads a line from the file. So, we're able to process the whole file, and we never need more than one line at a time in memory.
And now we get to the advanced usage of generators. A `yield` statement is not just a statement. It's also an expression. It returns a value back to you. And that's because generators are not just pausable functions that yield values. Generators are actually bi-directional pipelines. Just like a generator can yield a value up to its caller, its caller can send a value back down to the generator. And it's these sent values that are returned from the `yield` expression.
So, here's how we read this. We have a worker generator. The worker has a collection of tasks, initially empty. Initially, we yield `None` because we haven't had any chance to receive any tasks. Our caller is expected to send us a batch of tasks.
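The parsing pipeline walked through above might be sketched like this. The variable names and the sample data are made up, and an `io.StringIO` stands in for the opened file so the example is self-contained.

```python
import io
import math

# Stand-in for a real file; the contents are invented for illustration.
source = io.StringIO(
    "# a full-line comment\n"
    "1.5\n"
    "\n"
    "-2.0  # trailing comment\n"
    "nan\n"
    "inf\n"
    "4\n"
)

# Each stage is lazy and pulls from the stage before it.
stripped = (line.partition("#")[0].strip() for line in source)  # drop comments and newline
non_empty = (text for text in stripped if text)                  # drop blank rows
numbers = (float(text) for text in non_empty)                    # parse
finite = (x for x in numbers if math.isfinite(x))                # drop NaN and infinities
clipped = (max(0.0, x) for x in finite)                          # negative values become 0

position_before = source.tell()  # still 0: the pipeline is defined but nothing was read
total = sum(clipped)             # only now do the stages start pulling lines
```

For this sample data `total` is 5.5 (1.5 + 0.0 + 4.0), and `position_before` shows that defining the pipeline did not consume any input.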
The idea is that we're supposed to evaluate the given function using the arguments that the caller passed to us. If the caller passes some new tasks, then we extend our task list with those arguments. Otherwise, we assume that the caller is asking us to complete a task. So, if there's a task available, we pop it off and evaluate the function with those arguments to get some value, which then gets yielded back out to the caller.
And here's how a caller would use the worker. Our worker is just going to convert whatever arguments we give it to a string. We use the `send` method to send values into the generator. However, just creating the generator doesn't start running it. The very first value that we send can't possibly be accepted by the worker, because the function is going to start at the beginning and there's no `yield` statement there. So, first we just send `None` to cause the generator to run to the first `yield`. Just like a call to `next`, `send` will cause the generator to run until a `yield` or until the function returns. After the `send(None)`, the generator will be paused here.
So now, we'll send in three tasks: the number 1, the number 2, and the number 3. These are wrapped as single-element tuples because we're processing them using `*args`. Now, if we call `next` three times, then we'll see our three values evaluated. When we use the `next` call, the return value from `yield` is going to be `None`. We can send in more values and then print them out. And that's part of the usefulness of this setup. I can add tasks or evaluate tasks at any time. I don't need to have everything prepared ahead of time, and I don't need to compute all the answers at once.
Another thing you can do is use the `throw` method to throw an exception inside the generator. As you can see, the exception acts as if it was thrown from the `yield` statement. So, you could surround this in a try-except if you wanted to handle exceptions that way.
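The worker described above is not shown in the captions, so the following is only a reconstruction from the spoken description. The use of `collections.deque` and first-in-first-out ordering are assumptions made here.

```python
from collections import deque


def worker(f):
    # Reconstruction of the worker: holds a queue of argument tuples for f.
    tasks = deque()
    value = None
    while True:
        batch = yield value        # the value passed to send() comes back here
        value = None
        if batch is not None:
            tasks.extend(batch)    # caller sent new tasks
        elif tasks:
            args = tasks.popleft() # caller asked for a result (assumed FIFO order)
            value = f(*args)


w = worker(str)
w.send(None)                       # prime the generator: run to the first yield
w.send([(1,), (2,), (3,)])         # single-element tuples, unpacked with *args
results = [next(w), next(w), next(w)]

try:
    w.throw(ValueError("boom"))    # raised inside the generator at the paused yield
except ValueError as exc:
    caught = str(exc)              # not handled inside, so it propagates back out
```

`results` is `["1", "2", "3"]`. Because this sketch has no try-except around the `yield`, the thrown exception comes straight back to the caller and the generator is finished afterwards.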
There's also the `close` method, which does basically the same thing as `throw`, except it throws a special `GeneratorExit` exception. This exception gets special treatment, and it's basically a way for you to cancel the generator without having that error propagate up.
So at this point, this should feel very similar to something else in Python. We're basically submitting tasks into this worker thing. Then something drives the worker, and the worker decides how the tasks are scheduled and when to actually call the function and do the work. Doesn't that sound a lot like async? I don't expect you to already know all about async. Don't worry, I've got a video on that coming. But just the general idea of defining tasks and pausing functions and continuing later when things are convenient, well, it's no coincidence. As it happens, under the hood in Python, async/await coroutines are defined in terms of generators. So, once again, being lazy is paying off big time. The lazy machinery of generators is powerful enough to design an entire async framework around.
And that's not even the end of the video, because we still have one major feature of generators left to cover: `yield from`. `yield from` allows one generator to yield values from another generator. In most cases, you could use it exactly like you would a for-loop, looping over the values and yielding them. And that's totally fine. There's nothing wrong with doing that. Using `yield from` is going to be one line shorter, and it's going to avoid using an extra local variable. However, that's not the intended use of `yield from`, and it's not why it was introduced into the language. Just think about it. Do you think they would really introduce a whole new set of keywords just to have a shortened for-loop? The true purpose of `yield from` is to facilitate the bi-directional nature of generators. Remember, a caller can receive values from generators.
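The `close` behaviour mentioned above can be seen in a small sketch. The generator `ticker` and the `events` list are hypothetical names used only to make the cancellation visible.

```python
events = []


def ticker():
    # Hypothetical endless counter that notices when it is cancelled.
    n = 0
    try:
        while True:
            yield n
            n += 1
    except GeneratorExit:
        events.append("cancelled")  # a chance to clean up
        raise                       # re-raise so close() can finish normally


t = ticker()
first = next(t)
t.close()            # raises GeneratorExit at the paused yield; nothing propagates to us

try:
    next(t)
    finished = False
except StopIteration:
    finished = True  # a closed generator behaves like an exhausted one
```

After `close()`, `events` contains "cancelled" and no exception reaches the caller, in contrast to `throw`, where an unhandled exception propagates back out.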
But they can also send values to generators. What if a generator wants to take the values that it receives from its caller and pass them to a sub-generator? For instance, here's a quiet worker. It's another generator because it has a `yield`. All it does is create a worker, and then `yield from` allows the quiet worker to pass messages from its caller directly into the worker. Likewise, whatever the worker yields is yielded back up to the caller. If a task causes an error, the quiet worker just catches it and then creates a new worker to keep going. This is of course very bad practice, because the task queue of the worker may not have been empty, but we're throwing it out and creating a new worker anyway.
In any case, the `yield from` is what allows us to pass messages bi-directionally. It essentially acts as a pass-through, taking whatever messages come from our caller and passing them to the worker, and taking whatever messages come from our worker and yielding them up to the caller. That is the true purpose of `yield from`. The fact that you can use it to write for-loops that are one line shorter is just a bonus. Oh yeah, and just like `yield`, `yield from` also returns something. It's the return value of the subgenerator: that thing inside the `StopIteration` from the beginning of the video. This is its real purpose.
So, there you have it. That's all I've got on generators. I hope you learned something. I hope you enjoyed it. Let me know in the comments how much you want me to make that async video. If you really enjoy my channel, please do subscribe and consider becoming a patron or donor. As always, don't forget to slap that like button an odd number of times. See you next time.
Info
Channel: mCoding
Views: 113,420
Id: tmeKsb2Fras
Length: 15min 32sec (932 seconds)
Published: Mon Oct 10 2022