More About Generators - James Powell

Video Statistics and Information

Captions
We get started. We're just gonna jack that up to about that big. Should be big enough for everyone? Okay. So, no, I didn't forget to write slides; I was too lazy to write slides. So we're gonna do something totally different: we're gonna write a presentation together as a Python module. Why not? It's easier than writing slides, and maybe we can have some fun.

The title of this talk is "More Generators", and I think one of the very first talks I ever gave at a PyData event was about generators. In fact, at the very first PyData London that I ever attended, in 2013 (boy, that's going back a long way), I gave a keynote because the other guy cancelled, and so I had the whole audience to myself. So let's see what we can do now, five years on, with generators.

Generators have been around in Python for 15 years or so. An interesting question that I keep coming to is: how come you see people talk about them at conferences, and everyone thinks they're such a cool feature, but nobody really uses them? You don't see any big libraries use them as a core metaphor or core API, other than some of the async stuff. How come it doesn't show up that much in the field of data science?

Now, one huge disclaimer: despite having spoken for my fifth year in a row at PyData London, and having spoken at 25 PyDatas around the world, I don't really know a whole lot about data science or machine learning. I do mostly scientific computing; I'm more of a programmer than anything else. So please, if you don't want to know a lot about programming, just leave the room. I won't be too upset.

Okay, so let's get started. My name is James Powell. This is the Python community, so I really should make sure to include my Twitter handle, and I should put my email here. So I'm james@dontusethiscode.com, and my Twitter handle is also dontusethiscode. Okay, so we'll make sure that we print those out so we can get started. Probably the most
important things. Now, as you can see at the top of my file here, I have a couple of simple Vim macros just to make my life easier. As we go through this, I'll be pressing one of those key combos and you'll see some output at the bottom of the screen, so let's see if we can use that to follow along.

Now, one unfortunate thing here is that, because these aren't slides, I can't do the self-promotional thing of getting all of you to follow me on Twitter by having a big Twitter thing at the bottom of the screen. But because this is Python, we can just override print to always print my Twitter handle every time I print something out. So let's do that really quickly, and we'll see if this gets annoying or not. We'll say from builtins import print as _print, so we don't lose the original print. We'll print what we were originally going to print, and then we'll print out (oops) "follow me on Twitter at" and then my Twitter handle. Yeah, this is gonna be the whole 45 minutes, folks. Get ready for this. And let's just put this all the way on the side of the screen. There we go. Let's give this a try with this module and see if it works.

Oh, and by the way, we need to make this from builtins import print as _print, since this is Python 3. That's one small thing: we have to make sure that we're using Python 3 here. There we go. We got "follow me on Twitter", although that's kind of annoying, so we'll just play with this a little bit. This is harder than it looks, by the way. This is much harder than it looks, folks. I'm not faking it; this is for real. If first, and first.pop(), else. Because we want this to be really classy, we'll only put the full "follow me on Twitter" the first time it prints out. There we go. Isn't that fantastic? Hopefully we won't do this for the whole talk, because that would be really tedious, but that's
kind of what you're in store for. If you want to leave now, that's okay.

Let me give you a quick overview of what we're going to talk about in this session. It took us five minutes to even print out the title, so let's see how fast we can go. In the first part, we're going to do a little bit of review. In the second part, we're going to talk about what generators actually are. In the third part, we're going to talk about why they are not what you think they are. And in the last part, we're gonna have some fun.

Okay, now let's get started with what generators really are. We'll create a new file called review.py. We'll grab these from the top. Do you want me to grab the print and actually have us print out my Twitter handle every slide, or every print? Okay, good, neither did I. [unclear] You really do want me to do that every time? Okay. Otherwise you might think I'm faking this whole thing. So there we go, there's our custom print function.

I'm gonna start off with a very simple question, and since this is so tedious to see on the screen, we'll just zip those. My question for you is: what's the difference between these two? Yeah, we can try that. You can also sit closer to the front. The colors... hey, look at that. Thank you very much for your feedback.

Okay, what's the difference between these two? [unclear] Come on, folks, this is the very beginning. What's the difference between these two? Okay, now, if I make this a little bit simpler and ask you what's the difference between these two from the perspective of somebody just using them, who can't see the source code: is there any difference? From the perspective of just using them, they do the same thing, right? They're basically formal differences. And what you could say, if somebody were clever, is that there might be some help text on
this, and you wouldn't see that help text on add2 because it's an anonymous function, right? You'd see the help text on add, and the name wouldn't be right. But then I'd say: you know what, this is Python, we can just monkey-patch this thing, right? Set its __name__ to add2, and then it really becomes difficult for us to distinguish between these two. In fact, it turns out that these are just two ways to write essentially the same thing, except there are certain syntactical restrictions on the lambda that we don't have in a normal function: we can't have statements, just expressions.

Let me throw in another example, and I want you to tell me what the difference is. We'll call this Adder. We'll put a pass here so we can use this in a second. Okay, how about this one? What's the difference between this structure and the previous structures? Do you see any difference? It still adds numbers; it's just a different formulation. You might say, well, we still have the problem with the name and the help text, but trust me, we can patch that out.

Now, an interesting thing here is you might say: well, there is a difference, because if I have this Adder, I can take this code and move it somewhere; I can't move the other code. I can do things with it that I can't do with the other one, right? I can do something like this: I can initialize this to zero, and when I actually run it, I can give it some kind of weird behavior where it remembers some state, because this is an object, and objects are about state. So now the adding, behind the scenes, transparently adds something in there. But I'll tell you: what are you talking about? That's nonsense. I can do the same thing with my original add up here, right? And I really didn't do much to change this. You might say, "Ah, but James, there's something more: you've gotta make this global." And you could even say, "You know, I don't really like using globals, even though that's the
same, and having this global state, it's not encapsulated." And I'll say: oh, it's not encapsulated, is it? Well, can I do this? add is a function, right? Functions in Python have a little __dict__. So you can see the behavior is the same.

But you might say: ah, but if you're being really clever, you could make two versions of this and they'd operate separately, because it's a class. Classes can create instances, and because classes can create instances, we can have two of these, and they both have independent state, right? And I might say: yeah, but why can't I just do this? Then I don't even need this crazy thing here; I just use nonlocal. And then we finally get tired of this exercise, and we realize that what we have discovered here is a fundamental principle of computer science, namely... and I'll finish the example before I tell you what the principle is. Boy, I must be very concentrated to be able to talk and code at the same time, huh? Who's impressed? I hope some of you are impressed. This is really, really hard.

Okay, so we're there and we're done. We can see the exact same thing: in one case we have a closure and a function returning it; in the other case we have an object. And we've now discovered first-hand, by trying to tease apart some differences, a fundamental principle in computer science, which is that objects and closures are identical: there's a clear equivalence between them.

Let's park that for a second and talk about something slightly different. Let's pretend that we have a function that computes something. Because this is hard enough as it is, I'll just give you a dumb compute function that doesn't do anything that interesting, but it takes a tenth of a second to run. So it returns a random number between 1 and 10, and it takes a tenth of a second to run. See, you never even feel that tenth of a second there.
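The object/closure equivalence described above can be sketched as follows. This is a minimal reconstruction rather than the code typed on stage: the names Adder and make_adder and the hidden counter z are assumptions made for illustration.

```python
class Adder:
    """Callable object that remembers how many times it has been called."""

    def __init__(self):
        self.z = 0

    def __call__(self, x, y):
        self.z += 1
        return x + y + self.z


def make_adder():
    """Closure with the same behavior; nonlocal gives it private state."""
    z = 0

    def add(x, y):
        nonlocal z
        z += 1
        return x + y + z

    return add


obj_a, obj_b = Adder(), Adder()
clo_a, clo_b = make_adder(), make_adder()

print(obj_a(1, 2), clo_a(1, 2))  # 4 4
print(obj_a(1, 2), clo_a(1, 2))  # 5 5 (each remembers its own state)
print(obj_b(1, 2), clo_b(1, 2))  # 4 4 (independent instances, independent state)
```

From the caller's side the two forms are indistinguishable: both are called the same way, both carry state, and both can be instantiated more than once.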
Now, I can have a function, just a regular old function, that maybe does a couple of computations in a row. So maybe it calls this function 10 times, right? I can print out what this function does, and this should take about a second. You can see it takes about a second. Now I can look at this and say: that's very interesting, it takes a full second; it computes everything up front. And I can use what I learned before to see if I could just take this code here and put it into my class formulation. All right, there we go. And again, if I do this, you can see... oh, of course, we'll put a mark up here. There we go. You can see they both do the same thing.

Now, one thing that I notice is that typically, when I actually want to use this, I'm just gonna iterate through the values, right? I'll put a mark up here. You know, the last time I did one of these presentations it went on YouTube and got 300,000 views, and some people in the comments were like, "Wow, he's really good at using Vim," and then some guy's like, "He's such a noob, his Vim skills suck, he doesn't even use vip for visual in paragraph." So I'm like, okay, I've gotta step up my game. I'll try to step up my game here. That really happened. YouTube comments are harsh; they're always very harsh. But you can see we have the same equivalence.

Now, something interesting, because we're iterating over this: one thing we might know is that in Python, whenever we have a structure that looks like the following, behind the scenes Python turns it into something that looks like the following. [unclear] You get an iterator for xs, then you try to get the next value out of the iterator, and you except a StopIteration and break. So we know there's an equivalence here, and we know that if we implement __iter__ and __next__, we can do something kind of interesting. So
I'll show you what we can do. We can create our own object that's iterable. We'll say that it's its own iterator, and in __next__ we'll just return the next value. What we'll do is, in the iterator, keep track of the size of the computation we want to do, which is 10, and in __next__ we'll just say: if not self.size, raise StopIteration; otherwise decrement self.size and just return this, right? And now if we look at this (we no longer need these here, and we have to go back up to our mark to run everything), we can see it does the same thing. So what we've discovered here is that there's another concordance: we can swap out calling something for iterating over something.

And we can see that this result is actually a lot better, because if I give you these two variants side by side, there's one thing that I want you to notice, and this is one of the key things people first learn when they learn about generators. See if you can notice it. Okay, did you notice something? Let me put a little visual break between the two instances of these two functions. I want you to tell me really quickly if anybody sees a difference between these two formulations, because all I've done right now is play around with sticking some code in a different place. Anyone see something really visually different there? One's immediate and the other one is streaming. What we've discovered here is some laziness.

Now, something kind of interesting occurs when we think about generators in Python, because there are some distinct advantages to the second form. Number one, notice there's no storage: we're just doing one computation and returning the value. Here, there's storage. Number two, notice we're getting the values only as we need them. So if we only looked at one value, the top version takes all of one second and the bottom version only takes a tenth of a second. See,
that first one took the entire time. It was eager: it did the whole computation and gave us the whole result, and we threw away 90% of it. The second one only gave us results as we wanted them. So we can say: oh, that's really interesting, that's a lot more efficient from both a memory and a time perspective. We're customizing what resources we use, in terms of computational time and memory, to exactly what the user wants.

And the whole nice thing about generators is that you can write a formulation like the above without all of this tedium. So we can just write this here, grab our code from here, and we can see that this generator formulation here and this class formulation here are just as equivalent as this function formulation here, the class, and this raw formulation here. So what we can see is that generators are a way for us to very conveniently, formally describe a structure that performs a computation on demand, where internally the structure itself looks very similar to something that has an __iter__ and a __next__ implemented.

Okay, so that's the review. I hope that was well within what all of you know, because that's just the start of where we're gonna be. Is anyone lost? Okay, let's take a look next at what generators are.

So, one of the things that we saw from that example (while I grab the top of this file here and my print statement, which we didn't use because I wasn't highlighting each time, but we'll use it by the end; and we'll hide some of this stuff because it's a little garbage on the screen)... what we saw immediately is some laziness versus eagerness. One of them runs to completion and does the entire computation; the other one lazily gives you the computation as you go. And what we might also know is some kind of pipelining mechanism. One of the things we see from that generator is that the generator only gives you values as you ask for them. So if we take our compute... let's just copy our compute
over. Actually, no, how about this: from review import compute. Oh, look at this, this really is now a Python package, not just a module. And I'm doing this off the top of my head. Come on, you've gotta give me something. Oh, let me just do this, and let's call this... okay, there we go. Because the other module has that at the top, now everything's gonna have this. Isn't that great? That's gonna be so annoying. Okay, so we're gonna be nice to you, because that's just gonna annoy you. I'll just move this out here to the top. Okay, there we go.

So what we can see is that we now have an infinite sequence, right? If I say for x in f(), print(x), I now have an infinite sequence, which would be very difficult to model eagerly, because you can't store an infinite sequence anywhere. And what we know from this is that we can do all sorts of cute little pipelining things, just like at the shell, things which approximate the kind of data pipelining that we often do on fancy platforms like Hadoop and Spark and whatnot.

And we know that Python gives us the itertools module, and the itertools module gives us a lot of helpers. So we can do things like take a generator and tee it to make copies. We can enumerate that to get the index of each copy, and then we can do something like islice each of these, so we can offset each of these by some amount. Then we can zip them together (put the star here, make sure we have enough closing parentheses). We zip these together, and this is a very useful tool called nwise that gives us pairwise views of some structure. I'll show this in action. You can see, just using what's in Python, we can get pairwise views: ab, bc, cd. And we can even take this further and do fun stuff like getting windowed averages, which is just the sum of these values divided by the length of these values, for some values in a window of whatever size we want. And so now we
can take all of this fanciness and do windowed averages of f() of some size. Let's do windowed averages of size 3. Now we have a lot of little helpers in a very terse, almost functional form. You can see it's processing an infinite sequence of numbers and getting the 3-windowed averages, the moving averages. And that's really cool.

But there's more. There's one other thing that we might have noticed, and I want to ask you: has anyone ever seen an API that looked kind of like this? Let's do this three times. There was some function that you had to run first, some function that you had to run second, and some function you had to run last, and you had to run them in that order; otherwise it would be total anarchy, cats marrying dogs, crossing the streams, absolute chaos. You had some API that's really order-dependent, and it's so easy to screw up because there's nothing enforcing that order. Well, one thing that we also know about generators is that they are a sequencing mechanism. So we can take all the code that we wrote here (you know, the average programmer writes 100 lines of code a day and we're already at 200, so you're really getting your money's worth, folks) and we can do this, right? And fi will be an instance of this generator, and we can do this, and you can see we get sequencing, except we can't screw it up.

And we might even know that in Python 3.6 and later we have this async/await structure. So we can even do stuff like this: create tasks that yield and print some message out, and we can write our own scheduler, and we can do something like scheduler(task('hello'), task('goodbye')). We can see that the sequencing is actually very important: it's key to why we have generators as the async mechanism in Python for single-process asynchrony. You can see that in the scheduling, the yield is actually doing literally what it says: yielding
control back to the scheduler, and the scheduler is figuring out how to interleave. So this is why generators work this way.

Now let's talk about something else. Let's talk about why generators don't appear in PyData talks, why I'm the only one who has given this. And it's not just because I'm dumb and that's all I know. That is one answer: this is really all I know. I wish I could give a talk on machine learning. I'm sure it would be interesting, maybe as interesting as the average machine learning talk is, which is... we're really going off the rails here, aren't we?

I will say that one thing we don't see at a PyData conference is... you know, out there, there's probably like a million biologists whose state-of-the-art tool is Perl, and so Python itself is a step beyond what they're doing, and Python with generators even more so. They're doing a lot of text processing on things like DNA sequences, looking for substring matches, and maybe this is an appropriate technique for them. But we don't see this in the world of scientific computing or numerical analysis, and I'll give you an idea why that is.

One problem is that, fundamentally, even though there's a very nice lazy mechanism, Python is pretty junk at doing anything numeric-computing related. I'll show you a very simple example. Let's just time two operations. We'll time using this generator and this windowed-average mechanism (I'll get this code from up here, because I don't really want to import that) against just doing an operation on a NumPy ndarray. We'll import from NumPy's random module and just create a random ndarray with values from zero to ten, of size a million. Okay. And what we'll do, very similar to how that nwise up there works, is something like this to get the windowed views, and we'll just divide by two, and we'll see how long this takes.
Make sure that we [unclear] timeit, and we'll only run this five times because it's not gonna be that fast. Okay, you can see it took like 0.1 seconds here. We'll try the same thing with our generator approach and see what happens, and it's not gonna be very good. It's just that Python integers and Python lists, or even Python generators, are just not a very good tool for any sort of numerical computation. You can see why, as a tool, these don't really show up when you're doing large numeric stream multiplications or, you know, vector operations.

We'll just take this from here and take that from there. We don't need this, do we? And we'll just say from random import randrange; xs equals randrange from zero to ten, for this range. This is like an interview. This is the hardest interview that means the least that I've ever done. And we'll do this: list of windowed... I hope if I ever interview at any of your companies, you ask me exactly the questions that we're covering in this presentation. But we'll see how fast this is, and we probably need to make sure we typed this right. Have you ever seen anyone live-code this much code and get it right? I've never done this before; this is really special.

You can see the speed of these two things. Boy, this is slow. Windowed average... let's try that one more time. You see, it's really, really slow. Even the construction of the... you see, it's two seconds versus 0.09 seconds for a million integers, right? So you can see, even at very large scales, if I make this a billion, yeah, we have easy memory use, but if you want to do anything numeric involving large streams, just use a NumPy ndarray. It's cache-coherent, it's packed into memory, it's just going to be faster. This generator approach, using lists and Python integers for anything that's mathematical in nature, doesn't work. So I think this is one of the reasons why we don't see generators being used as part of data science
libraries in this kind of numeric stream processing sense. If you're doing text processing, you're a biologist, and the computational part of your computational biology isn't really that computational, yeah, it's a great tool. If you're doing serious machine learning or models or deep learning, it's not actually a very valuable tool, because it's hindered by the poor mathematical support in Python itself. Python is an orchestration language, a glue language, and a mechanism like generators is not going to give you the ability to do that much.

But let's look at the last part of our presentation. We have 15 minutes; let's have some fun. Because if we really think about it, when we go back to what we learned about generators in the second section and we think about this sequencing, what we can see is that generators are a mechanism that can fit into orchestration. They can give structure to a program that the program might not already have. They're not necessarily a computational mechanism; they're not about doing very quick vector operations, because NumPy can do that way better, and if it's out of memory, you have other tools that can do out-of-memory; Julia can do out-of-memory operations as part of the core language. What you might think, though, is that generators might be a tool for structure in a program.

And what's a common non-mathematical structuring for a program? How about a graph, right? We've seen these; they used to be all the rage a year or two ago. I don't know why they're not that popular anymore. But the idea is that you could take some equation and do it as a graph, and if each of the nodes is very hard to compute, you can minimalistically recompute the nodes to get some results. So think about some operation where... we'll do this very simply; we'll do something like this. Do I really have ten minutes? Because I only need 10 minutes to finish this. Okay, what we're going to do is build, in the next
10 minutes, a graph computation framework in about 70 lines of Python using generators.

So what we might have is some term, and the term is just a generator that yields its own value and accepts a new value. This is gonna be really, really sloppy, so we can always improve this slightly. We'll do something like this. Okay, so this is just a generator that can take new values, just a little computational unit that we're going to fit into a larger structure. And maybe we'll do simple addition of terms. This, again, will be something where we set the value to None, and we'll do a while True, and we'll say value equals sum of next(t) for t in terms. Okay. And you can see we can add term one, term two, term three, and by virtue of these being live computational units, we can say for x in q, print(x); or actually, we'll do it like this: while True, print(next(q)). Okay. Now every time we do this, we do a loop. And we have to yield the value. You can see we just yield this value.

And if we want to, we can make this a little bit easier to read. We'll just say from itertools import count; for step in count(): if step equals two, we'll actually inject a value into one of these, so we do a send, and we'll send the value 10 in instead of one; and then if step equals five, we'll break out of this. Okay, so here you can see that's even reactive, right? We can send something in and it recomputes. But it's not lazy yet. And there are a couple of kind of irritating things that we have to do with this. You'll see the first few values are not quite correct. Now, it's not lazy, and we want to make it lazy.

One problem is that, unlike Python functions... one thing you might have noticed is that you can define a Python function like this and throw attributes on it. You can't do that with a Python generator instance; you can't throw attributes on it. So here we'll need to continue to use a generator formulation, but
we'll wrap it in a larger object. So what we'll do is create just a very simple decorator, and we'll call it wrapped, and we'll have a class here called Wrapped, and we'll just wrap our generator so we have a storage place to put something like this. And we'll pump it; this is one thing you'll find with these coroutines: you have to pump them. We don't have time to talk about this. We'll do something kind of interesting: we'll send ourself into this, so we can do something like this at the top of each of these generators, and we can actually keep track of the state of it itself. Okay. The rest of this is fairly boilerplate, and we won't implement the entire protocol. And we do something like that.

Now, as we do this, we can do the following. Actually, no... So what we can do is the following: here, when we have the terms that we're adding, we might say, for each term, t.subscribers += 1, and we can keep track of how many people are subscribing to these values, so we know if we're dirty or not, and we know how many people are waiting to see values from us. Then what we can do (this is really inconvenient, and we'll just clip this to zero) is, every time we yield a value, decrement by one, so we're no longer a dirty value.

And then we'll use our same trick from before: we'll put a little sleep in there, just so that we can see the dirtiness take effect. Here in the term, we can say that whenever we compute the term, we mark ourself as dirty and sleep for some period of time. So what we'll see is that as long as we're not dirty, it'll just instantly recompute. If we're dirty... this represents a small piece of a very large chunk of computation, like a deep learning model that's connected in a graph up to something else, right? Some learning model that takes a few seconds to run, but the graph only recomputes it on demand. As we go up
here in the add, it's not that hard what we have to do. We just have to say: if any of our terms are dirty, we mark ourself as dirty and recompute our value. So we'll only recompute our value here, but we will compute it once up here. Okay, so let's see if this thing works. It's gonna be a real pain to debug if it doesn't. And then we'll just set seconds equal to zero so that most of these are instantaneous, and then, for the one that we tweaked, we'll see... "function is not iterable". Oh, thank you. Wow, you're really paying attention. One more thing: we yield our value here, and in the terms we have to decrement this up here, and we also decrement it for dirty here. Let's see if this thing works. There's one last thing that we're missing.

So, I'm not stupid; I did actually prep a little bit ahead of time. We'll just grab from my preparation. Yeah, so we'll just grab my prepped version here. Oh, we actually have to use our Wrapped class as well. Let's see if we can continue with this without screwing it up too much, but if not, we'll switch over to the canonical version. I mean, I did actually practice this thing; I'm not doing this all off the top of my head. Okay, so we'll switch over to the canonical version real quick. We'll just remove this code here, set paste, grab the code from my canonical version (which I promise you I haven't been cheating from the whole time), and move on. And what we can see through this approach is that... we can paste this. Okay, yeah, I do know where we're going with this. We'll just grab this code here and I'll show you this in practice. This version was the final version that I wrote in preparing this, and I'll show you the final version because I don't want you to have to suffer too much reading through me live-coding this. It's gonna be real difficult to debug. There we go. But this is the final
version, in about eighty-two lines, slightly harder to write when live-coded on stage. What we can see down here is this generator: it'll pause for 10 seconds when one of the values is updated. And the extra 20 lines of code mean that it also... I mean, it's a graph computation framework, so it'll also graph the thing. So you can see this is our add. This one's a much harder computation. There's an add; I think it's another add. We'll close it. And you can even see the terms go dirty. That's the 9; you can see that's the one-second wait for it to do the recomputation, and it's happening at each level. And then it should update... you can see that one went dirty, and then you can see each of the results is now faster again, because it's not being recomputed. You can see that dirtiness propagate through the network. About a hundred lines of code, using a generator as your fundamental organizing structure.

And I'll show you: all I added here was not that much. The only thing I added was a little bit of additional mechanism to use NetworkX to draw the thing. I guess it's a little bit easier to do five minutes before you get on stage than when you're on stage. But for the most part, you can see this code is essentially the same. It's just a generator that yields its values, waits to be signaled as dirty, and then recomputes.

And you can see this generator is actually an interesting structuring mechanism for larger code. By itself, it's not an interesting computational mechanism, but with tricks like this you can form interesting structural things. Because each generator is an object that represents a computation, it can also do neat things: if I had three generators in a sequence, I could have them elide themselves and fuse into one another. This is the second example I wanted to do if I happened to be running faster than I thought I would, which is to show you how you can have a pipeline of three generators
that automatically rearrange themselves or fuse themselves, with another twenty lines of code. It's actually not that difficult. You can see the API here is maybe a little hinky, but it gives rise to thinking about formal structures in your code. And I'd say, even though this API is somewhat hinky, a comparable graph computation framework that does just these two pieces is gonna be at least two hundred, three hundred, four hundred lines of code. With generators, and a little bit of knowledge of how they fit into how Python objects work and the structuring behind them... I mean, I have 95 lines of code here for a graph computation framework. Not too bad, and I almost got it right on the first try, live-coding it. So I hope you enjoyed that. That's more about generators. Thank you so much.

[Music] [Applause]
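The review section's three formulations (an eager function, a class that is its own iterator, and a generator) can be sketched roughly as below, together with the loop that a for statement performs behind the scenes. This is a reconstruction under assumptions, not the code from the talk: the names eager, Lazy, lazy and loop are made up here, and the delay is shortened from the talk's tenth of a second so the sketch runs quickly.

```python
from random import randint
from time import sleep

DELAY = 0.01  # assumption: the talk sleeps for 0.1 s; shortened here


def compute():
    """Stand-in for an expensive computation."""
    sleep(DELAY)
    return randint(1, 10)


def eager(size=10):
    """Function form: does all the work up front and stores every result."""
    return [compute() for _ in range(size)]


class Lazy:
    """Class form: an object that is its own iterator."""

    def __init__(self, size=10):
        self.size = size

    def __iter__(self):
        return self

    def __next__(self):
        if not self.size:
            raise StopIteration
        self.size -= 1
        return compute()


def lazy(size=10):
    """Generator form: same on-demand behavior without the boilerplate."""
    for _ in range(size):
        yield compute()


def loop(xs):
    """Roughly what a for loop over xs does behind the scenes."""
    results = []
    it = iter(xs)
    while True:
        try:
            x = next(it)
        except StopIteration:
            break
        results.append(x)
    return results


first = next(lazy())  # a single compute() call, not ten
print(first, len(eager()), len(loop(Lazy())), len(loop(lazy())))
```

Taking only the first value from the generator costs one call to compute, whereas the eager function always pays for all ten.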
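The nwise helper and the windowed average described in the second part can be sketched with itertools as follows. The talk names tee, enumerate, slicing and zip; the exact signatures and default arguments below are assumptions.

```python
from itertools import count, islice, tee


def nwise(iterable, n=2):
    """Yield overlapping n-tuples: ab, bc, cd for 'abcd' with n=2."""
    copies = tee(iterable, n)
    # Offset the i-th copy by i items, then zip the copies back together.
    return zip(*(islice(copy, i, None) for i, copy in enumerate(copies)))


def windowed_average(iterable, n=3):
    """Lazily yield the mean of each length-n window."""
    for window in nwise(iterable, n):
        yield sum(window) / len(window)


print(list(nwise("abcd")))
# [('a', 'b'), ('b', 'c'), ('c', 'd')]

# Works on an infinite sequence, because nothing is computed until asked for.
print(list(islice(windowed_average(count(), 3), 3)))
# [1.0, 2.0, 3.0]
```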
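The point about generators as a sequencing mechanism can be illustrated with a small sketch: one generator that fixes the order of three steps, and a hand-written round-robin scheduler that interleaves tasks each time they yield. The names api, task and scheduler, and the use of a shared log list instead of print, are assumptions chosen so the behavior is easy to check.

```python
from collections import deque


def api(log):
    """Three order-dependent steps in one generator; the order cannot be mixed up."""
    log.append("first")
    yield
    log.append("second")
    yield
    log.append("last")


def task(name, log, steps=2):
    """A cooperative task: does a bit of work, then yields control."""
    for i in range(steps):
        log.append(f"{name} {i}")
        yield


def scheduler(*tasks):
    """Round-robin scheduler: resume each task in turn until all have finished."""
    queue = deque(tasks)
    while queue:
        current = queue.popleft()
        try:
            next(current)
        except StopIteration:
            continue           # task finished, drop it
        queue.append(current)  # task yielded, run it again later


log = []
for _ in api(log):
    pass
print(log)  # ['first', 'second', 'last']

log = []
scheduler(task("hello", log), task("goodbye", log))
print(log)  # ['hello 0', 'goodbye 0', 'hello 1', 'goodbye 1']
```

Each yield hands control back to the scheduler, which is what lets two tasks make progress in alternation inside a single process.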
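The first, non-lazy version of the graph example (terms that accept new values via send, and an add node that sums them) might look something like this. It is a sketch of the idea only, not the 95-line framework shown at the end of the talk: the dirty flags, the Wrapped class and the NetworkX drawing are left out, and the exact bodies of term and add are assumptions.

```python
from itertools import count


def term(value):
    """A node that keeps yielding its value and accepts a new one via send()."""
    while True:
        new = yield value
        if new is not None:
            value = new


def add(*terms):
    """A node whose value is the sum of the current values of its inputs."""
    while True:
        yield sum(next(t) for t in terms)


t1, t2, t3 = term(1), term(2), term(3)
q = add(t1, t2, t3)

results = []
for step in count():
    if step == 2:
        t1.send(10)  # inject a new value into a term that is already running
    if step == 5:
        break
    results.append(next(q))
print(results)  # [6, 6, 15, 15, 15]

# Unlike functions, generator instances cannot carry extra attributes,
# which is why the talk wraps them in a larger object.
try:
    t1.dirty = True
except AttributeError:
    print("generator instances cannot take attributes")
```

Note that send only works here because the add node has already advanced each term to its first yield; sending a non-None value into a generator that has not started raises TypeError.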
Info
Channel: PyData
Views: 9,848
Rating: 4.6972971 out of 5
Id: m6asOJmfGpY
Length: 40min 45sec (2445 seconds)
Published: Sun May 27 2018