Larry Hastings - The Gilectomy: How's It Going? - PyCon 2017

Reddit Comments

It'll be done in 50 years' time, just like fusion power...

u/skulgnome (9 points), May 23 2017
Captions
So, it's time for our last talk before lunch, everyone, so we really should make a start now. Our next speaker is here to talk about the Gilectomy and how it's going. Please make welcome Larry Hastings.

Thank you. My name is Larry Hastings. Ladies and gentlemen, boys and girls, children of all ages: this is "The Gilectomy: How's It Going", PyCon 2017 edition. I call the talk that because when I go to conferences now, people walk up to me and say, "Oh, Larry, I hear you work on the Gilectomy, how's it going?" And I'm about to speak for 40 minutes about how the Gilectomy is going, and I don't want to answer that question for every person who walks up to me at a conference, so I'm going to do it for everybody at once. Now you don't have to ask that question; you can ask a more specific question.

About this talk: I want to warn you, just like last year, when I gave an introductory talk about the Gilectomy, and this is kind of the same preface slide: this talk will be exceedingly technical, and it assumes that everybody in here is comfortable with me talking about multi-threading issues and about CPython internals. This year I'm also assuming that you're mildly familiar with my talk from last year, so there's a lot of stuff I'm not going to explain; I'm just going to dive right into some of these concepts. If you want to know more, you can watch my talk from two years ago, "Python's Infamous GIL", where I explained the GIL itself, and then last year's talk, where I got started on the Gilectomy: removing Python's GIL.

So, the goal of the Gilectomy. I want to run existing multi-threaded Python programs, programs you could write today using the threading module, on multiple cores simultaneously. I want to implement this with as little C API breakage as possible; it's impossible not to break the C API, because there are some guarantees the GIL gives you that I cannot guarantee anymore, but I want as little breakage as possible. And I want to run faster than CPython does with a GIL, measured by wall time. That is: you have a stopwatch, you start it when your program starts running, and when the program finishes, you hit it again. If the program ran faster, then I will declare the Gilectomy a success. Which it has never done.

The approach I'm taking, and again, this was most of the talk last year: Python uses reference counting for tracking the lifetimes of objects, and I switched that to atomic incref and decref. That's something your Intel CPU does for you: it will increment or decrement a number stored in memory in an atomic way, such that there are no race conditions. You're guaranteed that after you're done, the number has been incremented by one, and it was done in a safe manner.
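To make that concrete, here is a toy version in C11 of what atomic incref and decref mean. This is an illustration only, with invented names; it is not the Gilectomy's actual code:

    /* Toy object whose reference count is adjusted with C11 atomics,
     * the way the "atomic" Gilectomy used atomic incr/decr instructions. */
    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        atomic_long refcnt;              /* adjusted atomically, no GIL needed */
    } toy_object;

    static void toy_incref(toy_object *op) {
        /* one indivisible read-modify-write; no other core can race us */
        atomic_fetch_add(&op->refcnt, 1);
    }

    static void toy_decref(toy_object *op) {
        /* fetch_sub returns the PREVIOUS value, so 1 means we just
         * dropped the last reference and must free the object */
        if (atomic_fetch_sub(&op->refcnt, 1) == 1)
            free(op);
    }

    int main(void) {
        toy_object *op = malloc(sizeof *op);
        atomic_init(&op->refcnt, 1);
        toy_incref(op);
        toy_decref(op);
        toy_decref(op);                  /* refcount hits zero: freed */
        puts("done");
        return 0;
    }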
Next, I added locks to all the internal data structures inside Python objects that are mutable, like dicts and lists, so that those operations are safe. If you append to a list, or you set an item in a dict, that has to be an atomic operation; it has to be safe, so those objects lock themselves internally.

I also added locks around a bunch of internal C data structures used inside CPython. There are two in particular I want to call out. The first is obmalloc, which is what we call the small-block allocator. It's used for memory allocation for small objects, which means something like under 256 bytes; I don't remember exactly where the cutoff point is. This is used for most of the objects allocated inside CPython, it's very finely tuned, and it's super fast. CPython relies on it being fast, and adding locking to it in order to make it safe has slowed it down a great deal.

The second: there are a bunch of free lists inside CPython. If you're using a lot of, say, integers or frame objects, allocating a fresh one every time is a little slower than saying, "the last one I used, I'll put it on a list, and then reuse it the next time I need a fresh frame object or an integer." So there are a bunch of free lists of Python objects internally that get used constantly, and of course I had to put locks around those data structures too, or make them per-thread.
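As a rough illustration of that kind of locking, here is what a mutex-guarded free list can look like in C. This is a sketch with invented names, not CPython's actual free list code:

    #include <pthread.h>
    #include <stdlib.h>

    /* recycled blocks are chained through their own storage */
    typedef struct freeblock { struct freeblock *next; } freeblock;

    static freeblock *free_list = NULL;
    static pthread_mutex_t free_list_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Pop a recycled block if one exists, else fall back to malloc. */
    static void *alloc_block(size_t size) {
        if (size < sizeof(freeblock))
            size = sizeof(freeblock);
        pthread_mutex_lock(&free_list_lock);
        freeblock *b = free_list;
        if (b != NULL)
            free_list = b->next;
        pthread_mutex_unlock(&free_list_lock);
        return b != NULL ? (void *)b : malloc(size);
    }

    /* Push the block back for reuse instead of freeing it. */
    static void release_block(void *p) {
        freeblock *b = p;
        pthread_mutex_lock(&free_list_lock);
        b->next = free_list;
        free_list = b;
        pthread_mutex_unlock(&free_list_lock);
    }

    int main(void) {
        void *p = alloc_block(64);
        release_block(p);
        void *q = alloc_block(64);   /* reuses p's block */
        release_block(q);
        return 0;
    }

The per-thread variant mentioned above removes even this lock, by giving each thread its own list in thread-local storage.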
I also disabled the garbage collector entirely. I'm just not ready to deal with garbage collection under the Gilectomy. There's a lot of research here; there are absolutely thread-safe, multi-threading-friendly garbage collection algorithms. But that's going to be a lot of work, and that's not the thing that interests me. What interests me is getting Python to be fast enough that this is useful for somebody, ever.

So my general approach is: I got it working, and now I figure out what the slowest thing is. I run a profiler, I come up with thought experiments, and I experiment. I try something and see if I can make it go faster. If it goes faster, I keep it. If it doesn't go faster, I don't throw it away; I put it on the back burner and try it again later, because sometimes you turn a corner and suddenly that technology works again.

Just to give you an idea of how crappy this project is: my official benchmark, and this is literally all I ever run under the Gilectomy, is a really bad Fibonacci number generator, a recursive Fibonacci. So this exercises an if statement, some math, recursive function calls, and of course bytecode and a bunch of internal stuff inside the Python engine. But this is all I ever run, and I run it on multiple cores simultaneously inside of CPython.

Now, an overview of what's happened since last year, when I gave the Gilectomy talk. We start with what I'm going to call last June's version of the Gilectomy, the "atomic" version, which used atomic incref and decref. I switched to something called buffered reference counting, which was done by about October. Then I did a bunch of work on obmalloc, and that was done in about April. And then just this month I did some work that I'm calling "no TLS". When I say TLS, I mean thread-local storage: data that you can store per-thread, so that each thread has its own version of something. You can store something there and no other thread is going to examine it, so you don't have to lock it in order to talk to it. I'm going to go through each of these items, what it is, and show you how much faster it got.

But first I want to talk for a moment about how benchmarking is nearly impossible on modern computers, as Victor talked about in his talk this morning. Modern CPUs, like an Intel Xeon, have SpeedStep technology, where the CPU is constantly adjusting the speed of its cores. I have a computer that literally has 56 cores; it's an enormous server. This slide shows some of the cores at a particular moment in time and how fast they're running; you can see I was running a benchmark at the time. The ones in the lower right, the fastest ones, are running at 3100 MHz; the slowest ones are running at 1200 MHz. You really don't know from moment to moment how fast your CPU cores are running, which makes it almost impossible to do meaningful benchmarks. This is just to give you an idea: already my benchmarking is crappy, and it's even crappier because of this. I haven't figured this one out yet; Victor says there's a way you can turn it off on Linux, and we'll figure it out. But that's just a preview.

So, let's look at the benchmarks from a 10,000-foot view; let's not get concerned about 5 percent this way or that. This is a graph of the speed of the Gilectomy versus stock CPython, and since I'm talking about different eras, I'm going to update the graph every time I get to a new one. The red line at the bottom is stock CPython, the base revision that I forked the Gilectomy from, which was trunk last February; it's what became 3.6, although there was a lot of work put into 3.6 after I branched. The blue line is the Gilectomy. The x-axis is the number of cores I'm using, because it's the number of threads that I'm running (CPython, of course, runs multiple threads but only ever uses one core at a time; I really should relabel that axis "threads", not "cores"). The y-axis is the number of seconds the benchmark took.

Just to drive home how bad this is: on seven cores, CPython runs it in 4.4 seconds, and the Gilectomy runs it in 83.0 seconds. It is 18.9 times slower. As an aside, this is why, I've got to say, I haven't been terribly gracious when people walk up to me and say, "Well, have you heard about this super fast locking mechanism?" That's not where this war is going to be won. The difference at 19 times slower is not "oh, you should use a slightly faster locking mechanism." There are major battles to be won here. Once I get it down to within shooting range of CPython, then I might be interested in more efficient locking mechanisms. Right now there are more fundamental structural problems inside the Gilectomy, enormous nails that need to be hammered down, and that's where my head is at. So unless you've spent time doing analysis on the Gilectomy and you're saying, "I found your super slow thing, did you know about this, Larry? Here's a way you can make it faster," which would be super interesting; "oh, this is the way that Grand Central Dispatch does locking" is not so interesting.

So let's start with where we were last June. Like I mentioned, CPython uses incref and decref for reference counting, and I was using atomic incref and decref just to get the Gilectomy working. But this was 30 percent slower right off the top, using just two threads, and it gets slower and slower: each thread you add is more overhead than the last. I think what's going on here, and I don't know for certain, but what I believe is going on, is that the cores need to talk to each other. There's an internal bus that they use to tell each other, "hey, I'm doing an atomic increment on this thing; you need to not look at it right now." The more atomic increfs and decrefs I do, the more traffic there is on that bus, until we start to saturate the bus and cores have to wait. So making that work not be atomic would be a major win, it turns out.
First of all, I spent a lot of time thinking about this problem, because I knew this was my hardest problem to solve. I actually came up with my own technique to solve it, and then I felt terrible, because nobody invents anything new in computer science; everything's been done. You don't want to invent something new; if you do, you've probably done something wrong, because all the good ideas have already been taken, particularly in multi-core and multi-threading, where all the work was done a million years ago. Well, it turns out there's this book called The Garbage Collection Handbook, whose second edition was updated in the last couple of years. It's mostly about what's technically called tracing garbage collection, but reference counting also counts as garbage collection, so there's a chapter near the beginning about reference counting, and it turns out there's a chapter at the end about reference counting in multi-threaded programs. There are two things in this rather slim chapter. One of them was exactly the thing I had invented on my own. The other is something very complicated that I don't think we can use, called the Recycler. To be honest, I don't understand how the Recycler works internally; I've got to sit down and just read this book. But I don't think we can use it; I think it requires compile-time support.

So let's talk about this concept. The technique is called buffered reference counting. Let's examine our problem. We have this object o at the top, and we have three threads that want to talk to object o. Every time they want to talk to it, they want to increment the reference count, and when they're done talking to it, they want to decrement the reference count. They're all trying to talk to the same area of memory, and that's why I used atomic incref and decref. We'd like to make it cheaper. The way you make it cheaper is to have only one actor ever talking to the reference count at a time, in a guaranteed way; then it doesn't have to use atomic incref and decref.

How do you do that? Well, let's add a separate thread, which we're going to call the reference count committing thread, and a big log; let's just wave our hands and say this area of memory is as big as we'll ever need it to be. It's a reference count log, like a transaction log for reference count changes. Now the three threads talk to that instead. Every time one wants to increment the reference count on o, it writes down "o: plus one to the reference count". Every time it wants to decrement, it writes "o: minus one to the reference count". The separate thread sits there spinning, watching the log, and whenever there's work to do, it goes and does it. Since the reference count committing thread is the only thread that ever touches reference counts, it doesn't have to use atomic incref and decref; it can just modify the memory directly. As a matter of fact, it's a good idea for it to be the only thread that ever even looks at reference counts, because one of the problems of this technique is that reference counts no longer happen in real time. You don't know whether a reference count is accurate or not; there could be reference count changes waiting in the logs that haven't been committed yet. This is going to cause some problems that we'll talk about in a minute.

This solves the problem of atomic incref and decref, but all we've really done is change it so that now we have contention around this single reference count log, which has to be locked and unlocked by every thread.
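Here is a toy model in C of that single-shared-log design, with invented names; the real Gilectomy code is more elaborate, but the shape is the same. Worker threads append (object, delta) records, and the one committer thread is the only code that ever touches a reference count:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define LOG_CAP 1024                 /* toy: no overflow handling */

    typedef struct { long refcnt; } toy_object;
    typedef struct { toy_object *obj; int delta; } refrec;

    static refrec log_buf[LOG_CAP];
    static int log_len = 0;
    static bool shutting_down = false;
    static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t log_cond = PTHREAD_COND_INITIALIZER;

    /* What incref/decref become: append to the log, never touch the
     * object. (This single shared lock is exactly the contention the
     * per-thread logs described next are meant to remove.) */
    static void log_refchange(toy_object *obj, int delta) {
        pthread_mutex_lock(&log_lock);
        log_buf[log_len].obj = obj;
        log_buf[log_len].delta = delta;
        log_len++;
        pthread_cond_signal(&log_cond);
        pthread_mutex_unlock(&log_lock);
    }

    /* The committing thread: drains the log and applies the deltas
     * with plain, non-atomic arithmetic; it is the only writer. */
    static void *committer(void *arg) {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&log_lock);
            while (log_len == 0 && !shutting_down)
                pthread_cond_wait(&log_cond, &log_lock);
            if (log_len == 0 && shutting_down) {
                pthread_mutex_unlock(&log_lock);
                return NULL;
            }
            for (int i = 0; i < log_len; i++)
                log_buf[i].obj->refcnt += log_buf[i].delta; /* no atomics */
            log_len = 0;
            pthread_mutex_unlock(&log_lock);
        }
    }

    int main(void) {
        toy_object o = { 1 };
        pthread_t tid;
        pthread_create(&tid, NULL, committer, NULL);
        log_refchange(&o, +1);
        log_refchange(&o, -1);
        pthread_mutex_lock(&log_lock);
        shutting_down = true;
        pthread_cond_signal(&log_cond);
        pthread_mutex_unlock(&log_lock);
        pthread_join(tid, NULL);
        printf("final refcount: %ld\n", o.refcnt);  /* prints 1 */
        return 0;
    }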
But we can solve that contention too: let's have one refcount log per thread. Now each thread writes its reference count changes to its own local log, and there's no contention over the logs. There needs to be a little locking between the committing thread and the individual threads, but you just take a full buffer and pass it off; that's no big deal. This solves all the contention problems, but now we have an ordering problem.

So let's walk through that. We really only need two threads for this example. Let's say, for the sake of argument, we have a list, capital L, and it has an object inside it; we'll just call it o, it's unnamed in this example. List L holds the only reference to object o that exists, so o's reference count is one, and all of those changes happened a million years ago; the dust has settled, so it's nice and stable right now. Thread 1 comes along and says "for x in L: print(x)". That does an increment and a decrement on object o. Then later, thread 0 comes along and says "L.clear()", which blows away everything inside of L; that drops L's reference to object o, an "o: minus one". So that looks like this: in the reference count log for thread 1, we have an o plus-one and an o minus-one, and then later, in the reference count log for thread 0, we have an o minus-one.

The problem is: what if we commit the reference count log for thread 0 before we commit the reference count log for thread 1? We drop the last reference to object o, and the object is destroyed. Then we later commit the reference count log for thread 1, and we say "o: plus one", and now your program is no longer correct, because you're touching an object that has been destroyed.

Now, you might say, "well, just do those in the opposite order." The first thing I would say to that is: how do you know you were supposed to do them in the opposite order? And the second thing I would say is: what if I do this to you? Now we have two lists, L and L2, and two objects, o and o2. We iterate over one list in each thread, and then each thread clears the other list. There is no order in which you can process these logs such that your program is correct; now your program is inevitably incorrect.

So how do we solve that problem? One way might be to write down a timestamp every time we write a reference count change into the log. That's expensive, and it may not even be correct. I would use the RDTSC instruction on Intel x86, which gives you a very high-precision timestamp counter internal to the CPU, which is what I would need for this. But there have been bugs in the past: if you have two physical CPUs in your computer, like I have, sometimes they can drift, so the timestamp counters differ, and that would be problematic; it would mean we'd commit these changes out of order. So it would be both expensive and unsafe.

We can solve this problem, but it takes stepping back and re-examining the problem itself, because it turns out we don't actually need super-strong ordering of reference count changes. We can do something much weaker that is much cheaper, and we can achieve it very easily.
Let's start with this observation. Consider two reference count change events; these are two events we're going to write to the log. Call them change one and change two, where each might be an incref or a decref. The question is: can I swap them in time? Is that harmless? It turns out that three times out of four, the answer is yes. If you have an incref followed by an incref, you can swap those; that's harmless. If you have a decref followed by a decref, you can swap those; that's harmless. If you have a decref followed by an incref, you can swap those too: either they aren't talking about the same object, or, if they are, the object is obviously still alive, because if we did a decref and then an incref on the same object, and we assume the program was correct, the object must still be alive. The only pair you have to keep in the same order is an incref followed by a decref. If those are talking about the same object, and there was only one reference to the object, that incref is what keeps it alive when you do the decref immediately afterwards; you cannot swap those. But all this is telling us is that any decref that happens after an incref in time also has to happen after it when we commit. And that's a lot cheaper to achieve.

So here's what we do: we have two separate logs per thread, one for increfs and one for decrefs, and all we need to do is queue them in such a way that any time we process a decref log, we ensure that all the increfs that could possibly have happened before that time are committed first. I got this working, and then I iterated on the algorithm, and now I have a very safe queuing algorithm that runs in constant time for everything. So this is working really well.

As an aside, right here: this was an indispensable tool in working on the Gilectomy. This is UndoDB, the Undo debugger. It's a reversible debugger: it behaves exactly like gdb, and what you do is set a breakpoint, examine data, and go, "how the hell did we get into this state?" So you set a memory watchpoint, and then you run your program backwards, and you see when it triggers: "oh, that memory changed in this way; how did that get there?" Then you set another breakpoint or memory watchpoint and run the program backwards some more. Indispensable for solving these problems, because you get into situations where you have no idea how the program got into this state, and now you can run it backwards and find out. I used UndoDB a lot in the development of the reference count manager.

Now let's talk about what happened as a result. This is the old macro in C for performing an incref; this is what adds one to the reference count. Very simple, very straightforward: we take the object, we find the reference count inside, and we add one. It's more complicated now. Look at the bottom: Py_INCREF now expands into a call to the "incref2" form. What it does is go to thread-local storage and pull out this "ref log" object, which is where we store our reference count logs per thread, and then it does the incref2 on the object. incref2 is implemented like this: is there space in the ref log right now for me to write another pointer? If there is, I go ahead and write one. If there isn't, I need to rotate the logs out and get fresh incref and decref buffers, and then I write the entry. And then I have a third variant, which is unsafe, and just writes the entry directly. The reason I have three of these is that they get progressively faster. The first one is the really slow version, but I knew it was going to be slow, so I also have a macro that you put at the top of a function, and now the ref log for your thread is cached in a stack variable that you can refer to the whole time. So you can change all your Py_INCREFs and Py_DECREFs into Py_INCREF2 and Py_DECREF2, and they're faster for free. If you want to work a little harder, you can pre-ensure that there's room in the ref log for all the things you're about to do. If you're about to do five decrefs in a row, you don't have to check for space each time; you say, "is there space for five decrefs?", and if there is, you just go, and if there isn't, you rotate the logs right then. That's a little troublesome, just because you have to be careful: you can't recurse into the Python interpreter in between, because that's going to do its own increfs and decrefs, so this has to be very tightly coupled. But you can occasionally use those Py_INCREF3s and Py_DECREF3s by pre-establishing that you're going to have room in the logs. It looks like a lot of code, but it's actually pretty fast; it's really not a big deal. The temporaries go away, and all the write path is really doing is dereferencing a pointer, storing to it, and incrementing.
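To make the shape of this concrete, here is a hedged sketch of the per-thread buffered scheme. The names (reflog, py_incref2, py_incref3) echo the talk but are reconstructions, not the actual Gilectomy source, and the rotate step is stubbed out:

    #include <stddef.h>

    #define REFLOG_CAP 256

    typedef struct {
        void  *incs[REFLOG_CAP];   /* pending increfs */
        void  *decs[REFLOG_CAP];   /* pending decrefs, committed only
                                      after increfs that preceded them */
        size_t ninc, ndec;
    } reflog;

    static _Thread_local reflog my_log;    /* one pair of logs per thread */

    static reflog *reflog_for_thread(void) {
        return &my_log;                    /* stands in for the TLS lookup */
    }

    static void reflog_rotate(reflog *rl) {
        /* toy stand-in: the real thing hands the full buffers to the
         * committing thread and installs fresh empty ones */
        rl->ninc = rl->ndec = 0;
    }

    /* Py_INCREF2-style: write the pointer into this thread's incref
     * log, rotating to fresh buffers when the log is full. */
    static void py_incref2(reflog *rl, void *op) {
        if (rl->ninc == REFLOG_CAP)
            reflog_rotate(rl);
        rl->incs[rl->ninc++] = op;
    }

    /* Py_INCREF3-style "unsafe" variant: the caller has already
     * guaranteed there is room, so just store and bump the index. */
    static void py_incref3(reflog *rl, void *op) {
        rl->incs[rl->ninc++] = op;
    }

    /* Slowest path: look the log up in "TLS" on every call. The caching
     * macro from the talk hoists this lookup to the top of a function. */
    static void py_incref(void *op) {
        py_incref2(reflog_for_thread(), op);
    }

    int main(void) {
        int dummy;
        py_incref(&dummy);                /* slow path */
        reflog *rl = reflog_for_thread(); /* cached-log path */
        py_incref2(rl, &dummy);
        py_incref3(rl, &dummy);           /* only after ensuring room */
        return 0;
    }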
The other problem this caused, apart from making incref and decref more complicated, is that there are a couple of places, one place I know of for certain, where we really need real-time reference counts on an object, and that is weakrefs. Let me remind you how a weakref is implemented. You have an object, say object A, and then you have a weakref object, a separate object we'll call B, and B is a weak reference to A. You can say, "hey, B, give me a strong reference to A," and it goes and gives you one. The way that's implemented inside Python is that A knows there are weakrefs pointing to it, and when A is destroyed, it calls all of its weakrefs and tells them, "hey, A is being destroyed; you're not valid anymore," and they say okay. But from a technical point of view, B has a pointer to A; it does not have a reference to A. It does not actually know, at any particular moment, whether A is alive. It's relying on the fact that we're running under the GIL, and that A will tell it "I'm dead now," to be safe. Under the Gilectomy, of course, we don't have the GIL; we don't have that safety. And because of these buffered reference counts, I had no way of knowing in real time whether the object was alive or not.

What I wound up solving the problem with was a secondary reference count, which is only used for weakrefs and a couple of other special cases, and which is atomically modified. It's actually not increffed and decreffed. The way it works is: you read the value. If it's negative one, the object is being destroyed, and you're done. If it's not negative one, you add one to the value you got and do an atomic compare-and-swap. If that succeeds, you've kept the object alive, you have a reference, and you're good. If you get back a negative one, then you know the object is being destroyed and you can't modify it anymore.
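That algorithm in C11 looks roughly like this; again, a sketch with invented names, built on a standard compare-and-swap loop:

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct {
        atomic_long wr_refcnt;       /* -1 means "being destroyed" */
    } toy_object;

    /* Returns true (holding a new reference) on success, false if the
     * object is already on its way out. */
    static bool try_incref_for_weakref(toy_object *op) {
        long seen = atomic_load(&op->wr_refcnt);
        for (;;) {
            if (seen == -1)
                return false;        /* too late: object is dying */
            /* attempt seen -> seen + 1; on failure, 'seen' is reloaded
             * with the current value and we try again */
            if (atomic_compare_exchange_weak(&op->wr_refcnt, &seen, seen + 1))
                return true;         /* we kept the object alive */
        }
    }

    int main(void) {
        toy_object o;
        atomic_init(&o.wr_refcnt, 1);
        return try_incref_for_weakref(&o) ? 0 : 1;
    }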
So that solved the problem for weakrefs. I'm actually also using it for interned strings. People have said, "Larry, you're using too big a hammer here; you can solve that another way." I'm using it right now, and I haven't worried about it, because it's a minor implementation detail and not a performance-critical thing, but I've got interned strings using the same technology. I don't think I need to.

Finally, someone pointed out, I think it was the same guy actually, Mark Shannon pointed out that I have a problem with resurrecting objects. If you have a __del__ method that takes self and writes it to an external variable, that's doing an incref on the object, and that's going to keep the object alive. Well, guess what: I don't know that the object is being kept alive. I don't have a real-time reference count. Do I go ahead and delete it, so the other reference points to a deleted object? I have no idea how I'm going to solve that. He says he has an idea; I don't believe it. So anyway, my advice is: don't resurrect objects inside of __del__. That's never been a good idea in Python, and now it's an even worse one. Actually, Jython and IronPython had to solve a lot of these problems themselves, so I can always talk to the IronPython and Jython guys and ask them. They said they have a solution; it's terrible, but it works.

So let's look at the graph. This is what happened as a result of the buffered reference count manager. Again, we have the same two lines: red is always stock CPython, blue is always the atomic version as of June, and this new green line is what it was like as of October. So it's getting faster.

Then the next thing that seemed to be slow was the small-block allocator, obmalloc. What I originally did was have one big lock for all of obmalloc: you grab it, you can allocate memory, you drop it. And that lock was super hot right away. Internally, obmalloc thinks in terms of size classes, where each class is a range of bytes. If you want to allocate zero bytes, I think it just always gives you back some valid pointer. Then one to eight bytes is a class, nine to sixteen bytes is a class, seventeen to twenty-four bytes is a class, and so on; it treats those internally as separate. So what I did first was add per-class locking, and that made it faster, but it still wasn't fast enough. Next I added two-stage per-class locking. There's the fast lock, for the super fast case of "we have memory, we just need to slice a piece off and hand it back," and that's got its own super fast lock; it's actually a spin lock. Then there's a slower, heavier lock for "I need to go talk to an arena," or "I actually need to go allocate memory," or whatever; anything that isn't the super fast case takes a second, heavier lock, which doesn't prevent other threads from doing the fast thing if that happens to work for them at the time. That sped things up. I also added a per-thread free list for generic allocated memory, stored in thread-local storage, and that sped things up again. And finally, around this time: I gather a bunch of statistics when I'm debugging the Gilectomy, and turning that on slows things down immensely. I did the conscientious job of making sure there is absolutely no overhead from statistics when statistics are turned off, because I had been pretty sloppy about that before.
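Here is a sketch of those two obmalloc ideas just described, the 8-byte size classes and the two-stage per-class locks, with the arena machinery reduced to a malloc stand-in; this is an illustration, not the real obmalloc:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdlib.h>

    #define ALIGNMENT 8
    #define NCLASSES  32

    typedef struct {
        atomic_flag fast_lock;      /* spin lock for the hot path */
        pthread_mutex_t slow_lock;  /* heavier lock: arenas, fresh memory */
        void *freeblock;            /* toy: one ready-to-go block */
    } size_class;

    static size_class classes[NCLASSES];

    /* 1..8 bytes -> class 0, 9..16 -> class 1, 17..24 -> class 2, ... */
    static size_t class_index(size_t nbytes) {
        return (nbytes - 1) / ALIGNMENT;
    }

    static void *alloc_small(size_t nbytes) {
        size_class *sc = &classes[class_index(nbytes)];
        void *p;

        /* fast path: spin briefly, slice off the ready block if any */
        while (atomic_flag_test_and_set(&sc->fast_lock))
            ;                       /* spin */
        p = sc->freeblock;
        sc->freeblock = NULL;
        atomic_flag_clear(&sc->fast_lock);
        if (p != NULL)
            return p;

        /* slow path: take the heavier lock; other threads can still
         * run this class's fast path in the meantime */
        pthread_mutex_lock(&sc->slow_lock);
        p = malloc(ALIGNMENT * (class_index(nbytes) + 1)); /* arena stand-in */
        pthread_mutex_unlock(&sc->slow_lock);
        return p;
    }

    int main(void) {
        for (int i = 0; i < NCLASSES; i++) {
            atomic_flag_clear(&classes[i].fast_lock);
            pthread_mutex_init(&classes[i].slow_lock, NULL);
        }
        void *p = alloc_small(24);  /* lands in class 2 */
        free(p);
        return 0;
    }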
As a result, it got faster. The problem is, that doesn't show up in the graph. I don't know what happened when I ran these benchmarks. You can tell this yellow line is janky, and you can also tell it's above the green line when it really should be below. I guarantee you, I swear it was faster when I was done, but I don't have the data to show for it. I'm sorry; I don't know what I did. But get used to it: you're going to be staring at that yellow line. That was as of about April of this year, and that's as far as I took it; then I took a break for a couple of months. I was just tired of working on it.

But I came back to it, and the next thing that seemed to be slow was the physical act of pulling things out of thread-local storage. I'm going to show you a little bit of sweet CPython internals here. The function that actually runs bytecode is one gigantic function called PyEval_EvalFrameEx; that's the guy who literally runs bytecode. I needed my thread-local storage variable in order to look at stuff inside that function, so at the top of the function I say "PyThreadState *tstate = PyThreadState_GET()". That goes to thread-local storage, pulls the thread state out, and sticks it in a stack variable. Then, whenever you make a function call, it calls a function called call_function. call_function needed the thread state too, so it pulled it out of thread-local storage again. That recursively calls another function called fast_function, which needs the thread state as well, and then that calls PyEval_EvalFrameEx again. These three functions get called every time you make a function call in CPython, and the Fibonacci benchmark is nothing but recursion, so I'm making millions of recursive function calls all over the place, and every time I do, I'm looking in thread-local storage three times. I was making 370 million calls to pthread_getspecific, which was at the top of the profile. It wasn't dominating runtime, but it was an enormous piece of runtime, and I wanted to get rid of it.

That's pretty straightforward, actually. All I did was add a sort of external membrane: I took all the existing functions and added "2" to the end of their names. (Again, this is the Gilectomy; we're not going to merge this as-is, and we'd do a proper job if it ever makes it into CPython; this was just to get it going.) The externally visible PyEval_EvalFrameEx looks up the thread-local storage thing and passes it in as a parameter, and now PyEval_EvalFrameEx2 takes it as a parameter, call_function2 takes it as a parameter, and fast_function2 takes it as a parameter. Now we only look it up once, and then we can do eight million recursive function calls and we're fine. That sped things up again.
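Schematically, the change looks like this; the names are stand-ins echoing PyEval_EvalFrameEx, call_function, and fast_function, and the bodies are stubs:

    #include <pthread.h>
    #include <stddef.h>

    typedef struct PyThreadState { int dummy; } PyThreadState;
    static pthread_key_t tstate_key;

    /* BEFORE: each function in the hot call chain re-read TLS, so one
     * Python-level call cost several pthread_getspecific() calls. */
    static void fast_function_old(void) {
        PyThreadState *tstate = pthread_getspecific(tstate_key);
        (void)tstate;   /* ... run the callee's frame ... */
    }

    /* AFTER: only the outermost entry point touches TLS; everything
     * below it takes the thread state as an explicit parameter. */
    static void fast_function2(PyThreadState *tstate) {
        (void)tstate;   /* no TLS lookup here anymore */
    }

    static void call_function2(PyThreadState *tstate) {
        fast_function2(tstate);
    }

    static void eval_frame_ex2(PyThreadState *tstate) {
        call_function2(tstate);  /* recursion keeps passing it down */
    }

    static void eval_frame_ex(void) {
        /* one TLS lookup per entry, instead of three per call */
        PyThreadState *tstate = pthread_getspecific(tstate_key);
        eval_frame_ex2(tstate);
    }

    int main(void) {
        PyThreadState ts = { 0 };
        pthread_key_create(&tstate_key, NULL);
        pthread_setspecific(tstate_key, &ts);
        fast_function_old();
        eval_frame_ex();
        return 0;
    }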
So this is where we are today. This black line is what I'm calling the no-TLS line. You'll notice it is faster still than any of the previous lines; it is getting faster and faster. Now, I want to draw your attention to the fact that this is the CPU-time graph: the collective amount of CPU time spent across all seven cores at the right side of the graph, as opposed to CPython, which only uses one core. But what I said at the beginning was that I'm actually interested in wall time; I'm defining success or failure in terms of wall time. So here's the wall-time graph. Guess what: the black line looks a lot better now, doesn't it? I'm pretty close. Again, the benchmarking is so terrible, and the CPU cores are changing frequency; the ground is shifting out from beneath my feet. I don't know; it could be that I'm actually faster. Probably not; it's more likely that I'm slower. But I'm within striking distance, and I feel like if I find another big problem, I may actually start to dip below the CPython line now and then, at which point I will declare the Gilectomy a success to everybody.

So, the next thing I'm working on: I'm working on obmalloc again, because I think it's still dominating runtime. At one point I switched obmalloc to using spin locks instead of my futex-based locks, and with the spin locks, running under the profiler, allocating memory was 20 percent of runtime and freeing memory was another 20 percent. That doesn't show up when I use the futex-based locks, for some reason; I don't know, maybe Cachegrind doesn't like me. But if I swap the locks out, the runtime doesn't change significantly; only the profile changes. I suspect it's still spending the time; it's just not showing up in the profiler. So I want to rewrite obmalloc. There's a central data structure called usedpools, and it's global for the entire process; I want to make it per-thread. Actually, technically, I did make it per-thread, and now it's not working properly: I run my benchmark, and instead of using 200 megabytes, it uses 10 gigabytes, and the physical act of allocating and freeing 10 gigabytes' worth of objects, or whatever it's doing underneath, is slow, so the whole program got slower. Once I figure out what that memory allocation problem is, I can make that go away.

But I still have another experiment or two to try. I have an idea I call private locking. You start with the observation that most objects never leave the thread they were created in. If we create a dict, and it only ever lives in a single thread, and then it's destroyed, never escaping that thread, why do we have to lock and release, lock and release, every time we talk to that dict? It would be cheaper to do something else, if we knew it was a per-thread object. So I have an idea where, essentially, dicts and lists and other objects like that are created in a sort of pre-locked mode, and if another thread wants to talk to one, it has to say, "hey, I actually want to talk to this; you need to turn into a regular lockable object." Before that happens, locking and unlocking would be a simple, local, non-atomic increment and decrement. I tried it once, and it made things slower, but like I said, maybe I'll turn a corner, maybe I'll think of something, or maybe I botched it the first time; if I try it again, maybe I'll get it to be faster.
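Here is a speculative sketch of what private locking might look like; everything in it is invented for illustration, and the promotion step glosses over the hard part, which is making the conversion itself thread-safe:

    #include <pthread.h>
    #include <stdbool.h>

    typedef struct {
        pthread_t owner;        /* creating thread */
        bool shared;            /* has another thread ever touched it? */
        int private_depth;      /* cheap non-atomic "lock" while private */
        pthread_mutex_t lock;   /* real lock, used once shared */
    } toy_container;

    static void container_init(toy_container *c) {
        c->owner = pthread_self();
        c->shared = false;
        c->private_depth = 0;
        pthread_mutex_init(&c->lock, NULL);
    }

    static void container_lock(toy_container *c) {
        if (!c->shared && pthread_equal(c->owner, pthread_self())) {
            c->private_depth++;  /* fast path: no atomics, no fence */
            return;
        }
        if (!c->shared)
            c->shared = true;    /* promote; real code must make this safe */
        pthread_mutex_lock(&c->lock);
    }

    static void container_unlock(toy_container *c) {
        if (!c->shared && pthread_equal(c->owner, pthread_self())) {
            c->private_depth--;
            return;
        }
        pthread_mutex_unlock(&c->lock);
    }

    int main(void) {
        toy_container c;
        container_init(&c);
        container_lock(&c);      /* owner thread: cheap, non-atomic */
        container_unlock(&c);
        return 0;
    }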
The other idea, and this was implemented for me by Thomas Wouters at the sprints last year, based on something they do where he works, is storing the reference count outside of the object. Right now the reference count is stored right at the top of the object. The reason moving it out might help the Gilectomy is cache lines. Every time you change memory, you invalidate that cache line for all the other cores: if you have eight cores in your CPU all looking at the same cache line and you change that memory, suddenly all the other caches are told, "hey, throw away your copy and get a fresh one; it's changed." Since we store the reference count inside the object, that means we're changing the memory of every object in Python any time we so much as examine it, including objects that are otherwise immutable, like, say, an integer or a string. And again, the Gilectomy benchmark is doing nothing but integers, so every time it examines the number zero or one or two, it's invalidating cache lines for all the other cores. If we could arrange for these objects to be genuinely immutable, they wouldn't blow away those cache lines, and that would reduce contention on this internal bus where the cores talk to each other. It also seems like it would be better for copy-on-write semantics when you go multi-process, although, again, the goal of the Gilectomy is to make it so you don't have to go multi-process. But storing the reference count outside the object is itself overhead, and it made things slower. Maybe in the future we'll figure out a better way, or maybe I just botched it and it'll be faster in the future; it certainly might help with copy-on-write semantics. We'll see. (A toy sketch of this side-table idea appears at the end of this section, below.)

If all else fails, I have one more huge thing I can do. This wasn't my idea; Mark Shannon and Łukasz both suggested it independently. Everybody knows that tracing garbage collection is faster than reference counting if you want to go multi-threaded, and that's what everybody does these days: Java, Go, all the new languages use tracing garbage collection. So: we switch CPython internally to tracing garbage collection. This is a more difficult API to get right; it would break the entire C API. But there's a technology inside PyPy called cpyext. PyPy, of course, who knows, is the Python implemented in Python; it uses a JIT, and it uses real tracing garbage collection rather than reference counts, so its object model is very different from CPython's. A classic problem for PyPy is that it doesn't run C extensions. They had an idea where they could run C extensions unmodified (you could literally take a compiled shared library that plugs into CPython and plug it into PyPy) by simulating the C API, which involves simulating CPython's object model, which involves simulating reference counts, all in a tracing garbage collection environment. Now, we don't share their problem that the internal representation of an object is completely different. What they wound up having to do was keep two different objects: the actual internal PyPy object, whatever it was, and its cpyext representation, and they had trouble keeping the two in sync. We don't have that problem: we could just use the internal CPython object, without the GIL, let C extensions talk to it, and simulate reference counts on top of it. That would be pretty harmless, and it would be a lot simpler for us: all we'd need to do is simulate the reference counts, and we could probably get that working. Again, we can't guarantee no breakage of the C API, but we could do some things to mitigate the multi-threadedness of the Gilectomy, and we could get away, perhaps, with running CPython extensions in a Gilectomy version with tracing garbage collection. That would be a lot of work, and I really don't want to do it, so I'm going to push the existing approach as far as I can; but if I really have to give up, maybe I can start over and do that.
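As promised above, here is a toy sketch of the side-table refcount idea: refcounts live in a separate table, so merely reading an otherwise-immutable object never dirties the cache line holding the object itself. The address-to-slot mapping is invented and collisions are ignored:

    #include <stdatomic.h>
    #include <stdint.h>

    #define TABLE_SIZE 4096

    typedef struct { long value; } toy_int;     /* object body: immutable */

    static atomic_long refcounts[TABLE_SIZE];   /* refcounts live elsewhere */

    static atomic_long *refcnt_slot(const void *op) {
        /* toy mapping from object address to its external slot */
        return &refcounts[((uintptr_t)op >> 4) % TABLE_SIZE];
    }

    static void ext_incref(const toy_int *op) {
        /* only the table's cache line bounces between cores; the
         * object's own line stays clean and shareable */
        atomic_fetch_add(refcnt_slot(op), 1);
    }

    static void ext_decref(const toy_int *op) {
        atomic_fetch_sub(refcnt_slot(op), 1);
    }

    int main(void) {
        toy_int three = { 3 };
        ext_incref(&three);
        ext_decref(&three);
        return 0;
    }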
The final thing I want to talk about is the graph we actually want to see. I just drew this purple line on here. The idea is: you add threads to your program, it just uses multiple cores, and the program doesn't get any slower. That's what we wish we had. Actually, we do have that, because that's Jython. And what I say is that Jython and IronPython are existence proofs that the Gilectomy can work. Consider: we have a whole pile of C code, and it just happens that one of the piles is a Java runtime and the other is the CLR runtime, but it's essentially a big pile of C code, and at the end of the day you have a Python interpreter running multi-threaded without a GIL. This proves you can write a multi-threaded Python interpreter in C without a GIL. The question is not "can we write one; will it work?" The question is: how much of the C API do I have to break before I can get it to work well? Thank you.

(Host:) Thank you, Larry Hastings, ladies and gentlemen. Lunch is happening at the moment, so if you're wanting to leave, can you please do so quietly. If you have questions for Larry, there are two microphones here in the aisles; please line up, and please be quiet if you're walking out. (Larry:) By the way, I have stickers; you can have one of each design of the sticker, but we don't do more than one of each.

First question, on my right. "Hi, Larry, can you hear me?" I can hear you; go ahead. "Some of this sounded very Intel- and architecture-specific. When you think about ARM CPUs, or running Python in other environments, do you expect this approach to work in those places as well?" Okay, I heard everything you said. The question is: is this approach going to be usable on other architectures, like ARM? You talked about Intel-specific CPU instructions. The answer is, it should be; right now, it literally is not. Like I said, a friend of mine told me about a week ago, "Oh, did you realize your cores are changing speed on you?" and I said, "Oh my god, no." So I wanted to test on something I had handy that I knew was multi-core and that I guessed was not sophisticated enough to change speeds on me, so I tried compiling the Gilectomy on a Raspberry Pi 3, which is four-core ARM, and it didn't work. First I had one problem, then I fixed it and had another problem, and then I hit "I can't include this x86-internal header file," something GCC had given me, and I couldn't solve that right there. But about the Intel-specific things I mentioned: RDTSC, of course, is Intel-specific, but I'm really only using that for statistics gathering, not as an implementation basis of the Gilectomy. Every modern architecture has atomic increment and decrement; every modern architecture has atomic test-and-set and compare-and-swap instructions. So all those things should be portable. There will be more platform reliance in the Gilectomy, because we have to use those things and they're not part of standard C, but they're all available, so all we need is a little more platform hacking to get them to work. Next question.
"Okay, so basically you said the benchmarking is really bad; everyone knows it. So how worried are you that the fib function isn't good enough? If you tried a different function, or anything else, would the results be completely different? How worried are you about that?"

I don't think it would be a world of difference. Again, I had a list of all the things this benchmark exercises. For example, someone suggested an optimization to me: looking up the Fibonacci function itself takes a lot of time, because it has to do a lookup in the module, since it's not cached locally, so it needs to lock and unlock a dict, and that module dict itself was hot. So if we did a multiple-reader, single-writer lock on the dict, that would speed it up. And I said: well, on the one hand, I think that's optimizing for my particular benchmark; let's not do that. But on the other hand, it's a generally helpful optimization for a lot of people; it isn't really tailored to my code, it would work on all dicts, and it would be fine. So I think ultimately we're going to merge that and make the benchmark faster. Of course, we wouldn't see that benefit on a different benchmark, like the one David Beazley uses when he's timing things and playing with the GIL: he uses "countdown", which is just a loop counting down from ten million or something like that. (A sketch of that reader/writer dict lock appears after this exchange.) At the end of the day, I'm not all that worried, because fundamentally I am exercising bytecode, I'm exercising function calls, I'm exercising dict and list lookups internally, I'm exercising a little bit of boolean logic, and I'm exercising integers. Not strings, but the things that are implemented in C I'm not that worried about; what I'm worried about is the internal core bytecode engine of CPython, and one function is as good as another at that point. Once I'm more confident about the Gilectomy and the approach, I'd be more interested in running more code through it. But right now I just have my head down: if I can make Fibonacci run faster than stock CPython, then it becomes much easier to make other things run fast. So yes, eventually I'll run other benchmarks, but not now.

Next question. "If I understand your refcount log implementation correctly, it would actually make __del__ invocations truly non-deterministic. I've unfortunately seen a lot of Python code that relies on reference counting behavior, even though that was never really guaranteed by the language per se. If you go forward with this plan, I could see that code breaking. What's your take?"

Okay, so you're talking about people who rely on the fact that once the last reference to an object is dropped, the object is freed immediately, correct? The Python language spec specifically says you're not allowed to rely on that, and none of the other implementations make it happen. It is a Python-visible side effect of the implementation of the GIL, and if this were ever merged and became official Python, then yes, it would go away, and I'm sorry. That's why we have things like context managers: they let you manage that sort of object lifetime explicitly. Exactly; I can't do anything about it. Sorry, can't help you.
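Here is what that suggested multiple-reader, single-writer dict lock could look like using POSIX rwlocks; the dict itself is reduced to a stub, and all names are invented:

    #include <pthread.h>

    typedef struct {
        pthread_rwlock_t lock;
        /* ... hash table storage would live here ... */
    } toy_dict;

    static void dict_lookup(toy_dict *d) {
        pthread_rwlock_rdlock(&d->lock);  /* many readers may hold this */
        /* ... find the key, e.g. the benchmark's repeated lookups of
         * the name "fib" in the module dict ... */
        pthread_rwlock_unlock(&d->lock);
    }

    static void dict_store(toy_dict *d) {
        pthread_rwlock_wrlock(&d->lock);  /* writers get exclusive access */
        /* ... insert or resize ... */
        pthread_rwlock_unlock(&d->lock);
    }

    int main(void) {
        toy_dict d;
        pthread_rwlock_init(&d.lock, NULL);
        dict_lookup(&d);
        dict_store(&d);
        return 0;
    }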
Actually, I thought you were going to bring up another topic, which I'll touch on really quickly: one side effect of the original implementation of the reference count manager. Of course, the committing thread was the one applying the increfs and decrefs, which meant the last decref happened on that other thread, which meant the dealloc happened on that other thread. That has two effects: one, deallocs happened on a different thread from where they would naturally have happened, and two, the commit thread had to work like a pit pony. It turned out that this just swamped it: it was spending all its time doing deallocs, fell behind immediately, and never caught up. So what I wound up doing internally is this: when an object's reference count reaches zero, I put it on another list and pass it back to the thread that performed the last decref, and that thread notices it later and commits the dealloc. So there is yet another delay built in before the object is destroyed; it's worse than you probably thought. Next question, please.

"Just another quick question about benchmarking; well, it's probably a long answer, sorry. Assuming SpeedStep is just for reducing power consumption, isn't there a BIOS setting to turn it off?" So, I went through my BIOS and tried to turn off everything, and I guess maybe I have a lame BIOS, or I was looking in the wrong spot, but there was literally no big flashing "set your CPU frequency here" switch. I turned off Turbo Boost or something like that; there were about three things I turned off, and it seemed a little more stable afterwards, and these benchmarks were actually run with those settings. But apparently, Victor said, there's literally a Linux kernel setting where you can specify the maximum frequency. So if I say the max frequency is a gigahertz, then everything is going to run at a gigahertz, because that's slower than the CPU ever wants to go. I'm going to do that once I figure out what that setting is, and once I get home and I'm sitting in front of the computer again. But I'm only in so much of a hurry about it, because I'm in the neighborhood already.

A last question. "Do you have an idea how much you can minimize the C API changes?" How much will the C API have changed by the time this is done, right? Okay, I kind of answered that question in my talk from last year. The answer is that, so far, I have essentially preserved the existing source-code-level C API, so a recompile would get you all the new technology underneath, without touching a line of source code. The problem is that there are semantic changes involved. There are guarantees that the GIL gives you that I don't give you, like what the earlier questioner alluded to: objects going away instantaneously, and being able to rely on that, which you don't have anymore because of buffered reference counting.

Another example of this, and there's literally an example of it presented as a good approach in the CPython extension documentation: say you have an object you want to instantiate lazily, because it's expensive and you only want to create it when it's needed. So you write "static PyObject *foo = NULL;", and then in the middle of your function you say, "if foo is NULL, create foo," and now we can use foo and everybody's happy. The problem is, if you call that from three threads at once, you've got three races: you're probably going to allocate memory uselessly, and you're going to stomp on that value. That used to be safe, because it was protected by the GIL; literally nobody could interrupt you and call into that function again. But without the GIL, it's no longer safe.
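Here is that documented idiom, plus one way (among several) it could be made safe without the GIL, using pthread_once; the helper and type names are invented stand-ins:

    #include <pthread.h>
    #include <stdlib.h>

    typedef struct { int dummy; } fake_object;

    static fake_object *expensive_new_object(void) {
        return calloc(1, sizeof(fake_object));
    }

    /* RACY without the GIL: two threads can both see foo == NULL,
     * both allocate, and one allocation is leaked or stomped. */
    static fake_object *foo = NULL;
    static fake_object *get_foo_racy(void) {
        if (foo == NULL)
            foo = expensive_new_object();
        return foo;
    }

    /* One safe replacement: run the initializer exactly once. */
    static fake_object *foo_safe = NULL;
    static pthread_once_t foo_once = PTHREAD_ONCE_INIT;
    static void init_foo(void) { foo_safe = expensive_new_object(); }

    static fake_object *get_foo_safe(void) {
        pthread_once(&foo_once, init_foo);
        return foo_safe;
    }

    int main(void) {
        (void)get_foo_racy();   /* fine single-threaded, racy otherwise */
        (void)get_foo_safe();
        return 0;
    }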
So I would say, "don't do that," but there's lots of code that does that, and it's a guarantee that the GIL gave you that I've taken away. That's a semantic change: the API isn't changed in the literal Py_INCREF and Py_DECREF sense, those haven't changed, but the semantics around them, the stuff you've been doing for years, have changed. That's why I say I know I'm going to break extensions: these semantics have changed, and I can't do anything about them; the whole point was to break those sorts of things, and I can't fix that. In terms of source-code compatibility, literally my goal is that you can recompile and you won't get any errors; there are no new APIs that you are required to call; your code will continue to work; and there are additional APIs that will make things go faster. Unchanged C API code should simply compile.

Larry Hastings, everyone! [Applause]
Info
Channel: PyCon 2017
Views: 21,055
Rating: 4.9591837 out of 5
Keywords:
Id: pLqv11ScGsQ
Length: 45min 51sec (2751 seconds)
Published: Sat May 20 2017