Modern Dictionaries by Raymond Hettinger

Captions
Let's see, something festive to talk about. How about dictionaries? How many of you like dictionaries? Are they modern and cool? How long have they been around? Hash tables: how long is forever? Over 30 years: The Art of Computer Programming, Donald Knuth, Algorithm D. Python's been around for 26 years. Is there anything new? Has nothing changed? How many of you learned hash tables a long time ago? How many of you think you still have something to learn? Who thinks you know it all already? How many of you learned everything there was to know about hash tables before last year? That's good, because you'd be out of date. There are new things under the sun; they've gotten better and better.

One of the interesting themes of the talk is that we're going to take a journey there and back again, and that "there" starts with what we had before hash tables. We had databases. We had Lisp. We had ancient tech. We're never going back there again, are we? How many of you are going to use Lisp association tables instead of Python dictionaries? How many of you would rather use a database than a Python dictionary? That's too bad, because that's where our journey is going to end up.

All right. So: hosted by Microsoft Reactor. Thank you very much, Microsoft; it is wonderful to donate this space and this time. I've always been a fan of Microsoft, but now I'm a huge fan; what they have done for our community has just been spectacular. Thank Grace and Simeon, the conference organizers, and thank you, band. About me: my mission is to train thousands of Python programmers, not figuratively but literally. I've just crossed the five thousand mark of programmers I've trained personally, and my team has cumulatively trained over 10,000 Python programmers. And we're not talking about hour-long trainings; we're talking week-long training sessions. It is a language that sells itself. If I do anything cool up here, tweet about it with @raymondh; if I don't do anything cool, or I say something weird, misspell one of those letters. Okay, all right, here we go. I don't do slides in the usual way; I will post the slides, I mean the website for this, right afterwards and tweet it out for those of you who are interested.

Our journey: the beginning and the end. I'll start with the punchline. When you tell a joke you shouldn't start with the punchline, but I thought you should just see where we're going. So what's Python got? Python's got dictionaries. Lots of dictionaries. The entire language is wrapped around them: the various namespaces. globals() is a dictionary. locals() isn't actually a dictionary unless you ask to see the locals, at which point it turns into one; internally it's implemented in a more optimized form, but conceptually it's a dictionary, and it behaves like a dictionary that you can't write to. Modules are dictionaries. Classes are thin wrappers around dictionaries. Instances are thin wrappers around dictionaries. Are the dictionaries everywhere? I see dictionaries everywhere. So they're kind of important.

All the code you'll see runs under everything from Python 2.7 (it even runs under 2.6) and 3.0 all the way up to... what's the latest version of Python you're using right now? The one I'm going to have up here is 3.6, and you're right, I run on 3.7 every day; I build a fresh one, and this runs under all of these.

So I came up with a bogus class that takes some values and assigns them to some famous people: Guido, Barry Warsaw, my lovely wife Rachel, Tim Peters, and Sarah, who sent me an email that I didn't get. It has a clever repr so that we can see it. I make three instances of it: one holds colors, one's cities, another's fruits. Blue goes with Guido, orange goes over to Sarah; but in cities, Guido gets Austin and Sarah gets Dallas. You see how it rolls, right? So we've got a class and we've got three instances. A lot of people like to get to the dictionary for an instance by doing dunder dict (`__dict__`). There's a function for getting there; what's it called? vars. How many of you knew vars was there? vars is an unloved function. You should start to love it; it's really kind of awesome once you get used to it, but everyone forgets it's there. That will simply print the three instance dictionaries. And this part, well, I'm the itertools guy, so I'm very functional: I do a map of a vars inside a map of a print inside a list, you know, that sort of thing; it makes me feel very functional and happy. So what do we get? We get three instance dictionaries: Guido's blue, Sarah's orange; Guido's Austin, Sarah's Dallas; Guido's apple, Sarah's banana.
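The demo class itself didn't survive in the captions, so here is a minimal reconstruction as a sketch. The class name `Person` and the last three values of each instance are placeholders of mine; only the vars()/pprint pattern, the `__dict__` sizing, and the blue/orange, Austin/Dallas, apple/banana pairs come from the talk.

```python
import sys
from pprint import pprint

class Person(object):
    'Toy class: each instance dict maps famous Pythonistas to values.'
    def __init__(self, guido, sarah, barry, rachel, tim):
        self.guido = guido
        self.sarah = sarah
        self.barry = barry
        self.rachel = rachel
        self.tim = tim

colors = Person('blue', 'orange', 'green', 'yellow', 'red')
cities = Person('austin', 'dallas', 'tucson', 'reno', 'portland')
fruits = Person('apple', 'banana', 'pear', 'peach', 'grape')

# vars(obj) returns the instance dictionary, the same object as obj.__dict__
list(map(pprint, map(vars, [colors, cities, fruits])))

# The size of one instance dict is the number the talk tracks across versions
print(sys.getsizeof(vars(colors)))
```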
What is the point of this? The point is that this is a stand-in for basically every important Python program that runs on the face of the planet: somewhere there's a class, lots of instances, and they have lots of dictionaries, suggesting that dictionaries are important to us. I'm going to pick these to talk about during the class.

So let's jump right to the end, the punchline. In Python 2.7, how big is each of these dictionaries? Two hundred and eighty bytes each. I know what you're thinking: that's huge. Is huge good when it comes to memory? No, huge is bad. I remember when I was little, at a birthday party, this was about first grade, I got taken out to play miniature golf, and one of my friends was so excited that he got the highest score. We didn't have the heart to tell him. So which is the highest-scoring, best Python of all time, the one that gets the highest score for dictionary size? Python 2.7. By 3.5 it's been shrunk down to a hundred and ninety-six bytes. Guess who was responsible for all of that savings; I'm looking for a name that starts with a Mark and ends with a Shannon. How many of you know Mark Shannon? This comes from the key-sharing dictionary. Something kind of awesome happened when Mark got to it: he realized that Guido had been repeated three times. Is the world big enough for three Guidos? I don't think so. So Mark had this bright idea: how about we store Guido once, his hash once, and then only save the data separately? This was called a key-sharing dictionary, and you've already got it in Python 3.5. Who thinks that's kind of cool? Mark is a smart man, which is why they gave him a PhD.

Then in Python 3.6 they got a little smaller. How did that happen? We compacted them. So I'd like to show you how compact dicts work: how it is that this technology, known since 1970, has been refined by a lot of people and ended up down here at less than half the size. Maybe you think that's a pretty impressive savings, and by the way, this is not some random obscure corner of Python; dictionaries are everywhere, so you're going to get a massive savings. Now, I'm a big fan of Python 2.7. That said, for the last seven years I've devoted all of my evenings and weekends to making Python 3 better, and I would have loved to come out and tell you how awesome Python 3 was all along, but I never did that until today. Python 3.6 comes out in a week. You really want it. It actually is the best Python ever; it's the first Python 3 worthy of the name. Go get it. [Applause]

Beautiful is better than ugly. Is scrambled better than ordered? In Python 2.7 we put Guido in first, but Sarah moves to the top of the list, and poor Guido is last, after Tim, Rachel, and Barry. That was quite interesting: it's scrambled, but deterministic. On any Python 2.7 you will get exactly the same order, which is not the order you put them in. At one time there was an almost articulated principle that scrambled is better than ordered, and because these things came out scrambled, people would learn how hash tables work. Whenever you taught somebody Python, one of the first things they would ask is: do the dictionaries remember the order that you put things in? You would explain no, and explain how the hash tables worked, and it was such an easy explanation, because what raised the question in the first place was that the elements came out in a different order than they went in. I know what you're thinking: what if they accidentally showed up in alphabetical order? Wouldn't that mess up your explanation? You all don't know what to say when I pound on the table; you're supposed to say "there must be a better way." Ready? There is a better way: scrambled is better than ordered, and by Python 3.5 we have SipHash, so the hashes get randomized every time you start Python. Occasionally Sarah gets to go first and Tim goes second, but once in a while, one time in five, Guido gets to go first. Every time you turn it on, it scrambles. So everyone whose doctests relied on the order: what happened to them? Broken. So scrambled was better than ordered, and better than just scrambled, it was randomized.

I know what you're thinking: in Python 3.6 these are ordered; they come out in exactly the same order you put them in. Isn't that nice? You don't like ordered at all? All right, but if you don't like ordered, then you're going to love what Guido said: ordering is not guaranteed; you can't rely on it. Now, he'll change his mind a few years from now, but for now, if you actually want an ordered dictionary, you should import OrderedDict. But I can assure you, I've been working with this for a while, and I really enjoy it; it's very, very nice. And a lot of people want a sorted dictionary, one that's sorted in a particular order. How do you make one of those? It's easy: take your items, sort them, and stick them in the dictionary, and the order is preserved. It's fantastic. Small and ordered: welcome to the third millennium. It's a great time to be alive.
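A minimal sketch of that sorted-dictionary trick, with assumed color values; it leans on CPython 3.6's insertion-order behavior, which, as he just noted, was not yet a guarantee at the time of the talk.

```python
d = {'guido': 'blue', 'sarah': 'orange', 'barry': 'green',
     'rachel': 'yellow', 'tim': 'red'}

# Inserting the items in sorted order yields a dict that also
# iterates in sorted order, because insertion order is preserved.
sorted_d = dict(sorted(d.items()))
print(list(sorted_d))       # ['barry', 'guido', 'rachel', 'sarah', 'tim']
```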
So I've started you at the end of the story, where we're going to end up. Are you ready for the journey? All right, here we go: a journey with code, lots of little tiny code. Would you like it to be bigger? There; it's one of my favorite keys on the Mac, the make-big key. It's the story of evolution. In the beginning there were databases. Oh, that says "no"; it should say "now." So much for spellcheck; we'll just scroll that off the screen like it never happened.

Here's our setup. I'm importing from __future__, and pprint, which is your best friend when you're dealing with nested data structures. I've recorded the keys, the values for each of the dictionaries, and the hashes for all of those. An entry is a tuple of length three: the hash, the key, and the value; and the combined entries put the hash, the key, and all three values in the same list of tuples. So this is our input data.

The question is: how would a database have stored our three instances? Welcome to 1960. In 1960 we had these cute little columns: here was the fruit column, here was the column for the city, here was the color column, and this column used to be known by a fancy name: it was called the primary key, kind of like a main squeeze, you know what I mean. Why would they have stored information this way? Because memory was precious. Unlike now, where it just overflows? No, it doesn't; data got bigger, a lot bigger. So this is how we used to do it: we stored this in a flat file. Whenever you see a flat file like that: 1960s technology. Let's go get rid of it; after all, they had no idea what they were doing back then, and they weren't very smart. They taught our parents, our parents taught us, and now we are the smartest generation of all; we come up with all the clever stuff, and our parents would be dazzled and amazed by what we've accomplished. You guys sound dubious. Yeah; in showmanship, that part's called the promise. Later, we will fulfill the promise.

All right, so how would Lisp do it? Lisp would store these as association lists. It would associate a key with a value, just a list of pairs, and Lisp got very far with just association lists. Now what does that cost us in space? Up here in the flat file, Guido was mentioned once; now he's mentioned three different times. By the time we got to Lisp, we didn't care about memory anymore. Fair enough. So while we're in this period of throwing away memory, let's see if we can make it bigger; after all, bigger is better, and we want the highest score. Who's currently the high-score winner, the database or the association lists? The association list is currently the high-score winner. We can do better.

So we decided to speed it up a little bit. The problem with what we were doing is that it's a linear search: if you're looking for poor Rachel, you have to go down four steps to find the pair she's associated with. A better way is separate chaining, described in the sketch after this passage: we make a set of buckets, each bucket has a list, and we grow the lists. That reduces the linear search by a constant factor. For instance, if I make two buckets: in one bucket we have both Guido and Tim; Sarah, Barry, and Rachel are in the other. What's fantastic about this is that if you want to go look for Rachel, you compute her hash, that's this number right here, take it modulo the number of buckets, and drop into that bucket. With two buckets, one has two people in it and one has three, so to look for Rachel you compute her hash, it says she's in the second bucket, and now you scan through in a linear search: it takes three probes to find Rachel. Formerly it took four. Guido gets found right away, Sarah gets found right away, Barry has to wait one step, and poor Rachel has to wait three. She's improved her lot in life by 33 percent; or was that 25 percent? All right, a nice speed improvement. Good job, Rachel.

You can make more buckets. More buckets: I have four separate lists here. One list has Sarah and Barry in it, but Guido, Tim, and Rachel stand alone. So four people get found instantly, they each take one probe, and then Barry takes two probes. What's the weighted average number of probes? 1.2. A lot faster than what we began with. And how are we doing on memory? Well, not so well. We've now stored the hash value as well as the key and the value, and in addition each of these lists needs room to grow, so it over-allocates. We've thrown more memory away. Who is our new high scorer for maximum memory used? Our new high score is separate chaining. High score's better, right?
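A minimal sketch of separate chaining as just described, using the names from the demo (the color values past blue and orange are placeholders):

```python
def build_buckets(pairs, n):
    """Separate chaining: n buckets, each a growable list of
    (hash, key, value) triples."""
    buckets = [[] for _ in range(n)]
    for key, value in pairs:
        h = hash(key)
        buckets[h % n].append((h, key, value))
    return buckets

def lookup(buckets, key):
    h = hash(key)
    for h2, k2, v2 in buckets[h % len(buckets)]:   # linear scan of one chain
        if h2 == h and k2 == key:
            return v2
    raise KeyError(key)

buckets = build_buckets([('guido', 'blue'), ('sarah', 'orange'),
                         ('barry', 'green'), ('rachel', 'yellow'),
                         ('tim', 'red')], 4)
print(lookup(buckets, 'rachel'))    # 'yellow'
```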
OK, remember: don't tell the person at the birthday party who gets the highest number of points in miniature golf that they're not the best. My friend said "I won," and we didn't have the heart to tell him, so we're not going to tell separate chaining.

Now we can go even further. Since we're wasting space, how can we make it worse? We'll go up to eight buckets. With eight buckets, some of the buckets are empty, but there are no longer any collisions: everybody gets found in one probe, and Rachel has improved her lot in life from four probes to one, a 300 percent improvement. Rachel is pretty happy, don't you agree? What about our memory usage? Well, we have several lists that need room to grow, we've stored the hash, and each of these lists has over-allocation: we have a new record for wasted memory. But we've sped up our search dramatically, several hundred percent.

There is another way. It's called open addressing. In open addressing, rather than having extra space at the end of each of these lists so that we can grow them, the idea is that we make one big table to begin with and insert everything into that one table. So this makes a table of size eight, loops over all the entries, computes the index modulo the table size, and then this part is new logic: it says that if the slot is already occupied, you have to find another slot. So I do open addressing for our entries, and you've got a table of size eight. This time there is no over-allocation at all. Congratulations, you saved some space. Did it cost you something? Poor Timmy. Timmy collided with Sarah: Timmy wanted to be in this slot, so we did a linear probe and wrapped him around. Linear probes are easy to compute; they say: take the next available slot. How bad could it be? We end up with what's known as catastrophic linear pileup: once in a while your application becomes dog slow because everybody wants the same, or close to the same, address.

I know what you're thinking: there is a better way. By the way, we've now gotten you up to Knuth's Algorithm D; you're up to 1972 technology. The concept of multiple hashing was known even back then, and the idea is: if the first hash has a collision, rather than do a linear probe, we do a secondary hash and find another place. You still have collisions, but you're not going to end up with a big 500-car pileup all in one place. In other words, it doesn't speed up the best case; it just keeps the worst case from being catastrophic. Fair enough.

However, Uncle Timmy comes along, Tim Peters, the hero of this story, and Tim gets this bright idea. Tim says: we know all about random number generators, and a very simple one is a linear congruential random number generator. Take a number, multiply it by five, add one, and take it modulo some fixed modulus, and if you recycle that over and over again, it gives you a chain of random numbers. In other words, it probes around randomly, in a different pattern depending on your full hash value. What's cool about this is that Tim found a way to perturb it, shifting in some additional bits, and once you're out of bits, this random sequence is guaranteed, over time, to visit every single cell in your table; so if there's a hole, it's guaranteed to be found. That was kind of important, because if it didn't search all of the slots, at some point it could get trapped in a loop, and your dictionary would tie up your system, putting it in a deadlock, causing it to die just before your IPO, during the due diligence check. Thank you, Timmy.
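A sketch of that probe sequence in the style that CPython's dictobject.c documents: a 5*i + 1 linear congruential recurrence, perturbed by the remaining high bits of the hash. The constants are CPython's; the pure-Python framing here is mine.

```python
from itertools import islice

PERTURB_SHIFT = 5

def probe_sequence(hash_value, mask):
    """Yield slot indices for a table of size mask + 1 (a power of two).
    Once perturb runs out of bits, the plain 5*i + 1 recurrence is a
    full-period generator modulo a power of two, so every slot is
    eventually visited and an empty slot is guaranteed to be found."""
    perturb = hash_value & 0xFFFFFFFFFFFFFFFF   # treat the hash as unsigned
    i = perturb & mask
    while True:
        yield i
        perturb >>= PERTURB_SHIFT
        i = (5 * i + 1 + perturb) & mask

# The first few probe slots for one key in a table of size 8
print(list(islice(probe_sequence(hash('rachel'), 7), 5)))
```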
So Timmy put in something new under the sun, something that wasn't in Algorithm D. Algorithm D had multiple hashing, but if the secondary hash didn't help you, it fell back to linear probes. Instead, he put in the random number generator, and suddenly we no longer had catastrophic linear pileup. Thank you, Uncle Timmy. When did this go into Python? It was already there when I arrived, in Python 1.5.2. Timmy goes way back.

Now what does this structure look like? I go ahead and build one. Did I end up with a few collisions? Barry bumped into Guido, Guido bumped into Rachel, Rachel bumped into Barry, Rachel bumped into Guido, and then Tim bumped into Rachel. There's a lot of bumping going on: five collisions in this tiny little table. We've actually slowed it down a little bit over what we had before. So what did we gain? We saved some space, which is really nice, and more importantly we saved our worst case, because our worst case without the randomization is that it would tie up, sit there probing back and forth between two slots, never find an open slot, and die on you. Thank you, Uncle Timmy, for slowing us down... no, for saving space and preventing disaster.

How long did this technology last? It went on for quite a while, but when I came on the scene, very early on, probably in my first three or four years of contribution, my contribution was to analyze the performance of these dictionaries, and I realized: you know, we're still getting a lot of collisions; maybe we should make the table more sparse. So if you were using Python 2.7 or before, when the dictionaries resized they wouldn't double when they needed more space; they would quadruple. That meant most dictionaries, most of the time, had enough space in them that you had hardly any collisions. All I did was throw away some space, but I topped out at about 50,000 entries: above that, we went back to doubling, once the dictionary got to a size you would actually notice. So when you weren't looking, for all the little nickel-and-dime dictionaries, I was quadrupling your space; but if you quadruple a tiny dictionary, that's not a lot. Then Mark Shannon came along and said "this is terrible," and he turned it back to doubling; you got that in about Python 3.2. He slowed Python down a little bit but saved a little space in little nooks and crannies. Do you see a little tug of war, strengths and weaknesses? Did I mention Mark Shannon is brilliant? He came up with something cool.

Now I'm about to alter time, you know, like in Star Trek: Voyager: "Captain, we've encountered a temporal anomaly." The temporal anomaly is this: the compact dict didn't just happen. I thought of it four years ago, proposed it four years ago, and Guido wasn't in an accepting mood at that point. All you have to do with him is be patient, and eventually he goes: "great idea, put it right on in." In the meantime, PyPy was more aggressive and put it in a couple of years ago, so those of you who are PyPy users have had the compact dict for a while. But prior to that was the key-sharing dict. So the temporal anomaly is that I'll show you the compact dict first, because it's cleaner to show it to you this way.

The compact dict looks very much like the other dict, except that I pull all of the indices out, and the hash table is now an array of bytes: one byte, one byte, one byte, one byte. For empty slots we don't really use zeros or Nones; we use a negative one. So the hash-table part of this is tiny. And notice, up here, this is stored dense; there are no open spaces in it, plus it got stored in the order listed. The first key added was Guido, with blue and his hash code, and that gets appended right at the beginning of the structure; then you add in Sarah, and she goes in right next to it. No wasted space at all. The problem with the structure up above is that it was wide and it had holes in it. The holes aren't so bad on their own, but each hole gets multiplied by the width; do you see the problem? My bright idea was: smush all of these together and put the holes in a tiny little table. Your hash is eight bytes, this pointer is eight bytes, this pointer is eight bytes; that's 24 each, and 24 times 5 is 120, somewhere thereabouts, so 120 bytes for the entries. Down here, though, are the actual indexes into the entries. Remember Guido was in position 5 before, so position 5 refers to entry 0; Sarah is referred to by this 1; the number 2 refers to Barry; and so I have indexes referring to the entries. Which raises the question: how big is this index table? It is 8 bytes total. So all of the entries were 120 bytes, but the hash-table part of this is only 8. Who thinks that's kind of cool? Hash tables used to waste space like crazy; now, for a dictionary of this size, the hashing part takes only 8 bytes.
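Here is a condensed pure-Python sketch in the spirit of the recipe Raymond mentions publishing (linked at the end of the talk). Resizing and deletion are left out, and the byte-sized index array is modeled with a plain list; this is an illustration, not the CPython implementation.

```python
FREE = -1
PERTURB_SHIFT = 5

class CompactDict(object):
    """Sketch of the compact layout: a small index array holding entry
    numbers (the only sparse part) plus a dense entries list kept in
    insertion order.  Resizing and deletion are omitted."""

    def __init__(self):
        self.indices = [FREE] * 8       # would be a signed-byte array in C
        self.entries = []               # dense list of [hash, key, value]

    def _slots(self, h):
        mask = len(self.indices) - 1
        perturb = h & 0xFFFFFFFFFFFFFFFF
        i = perturb & mask
        while True:
            yield i
            perturb >>= PERTURB_SHIFT
            i = (5 * i + 1 + perturb) & mask

    def __setitem__(self, key, value):
        h = hash(key)
        for slot in self._slots(h):
            n = self.indices[slot]
            if n == FREE:                       # empty slot: append new entry
                self.indices[slot] = len(self.entries)
                self.entries.append([h, key, value])
                return
            if self.entries[n][0] == h and self.entries[n][1] == key:
                self.entries[n][2] = value      # existing key: overwrite
                return

    def __getitem__(self, key):
        h = hash(key)
        for slot in self._slots(h):
            n = self.indices[slot]
            if n == FREE:
                raise KeyError(key)
            if self.entries[n][0] == h and self.entries[n][1] == key:
                return self.entries[n][2]

    def __iter__(self):                 # iteration is a dense, ordered scan
        return (key for _, key, _ in self.entries)
```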
How big is a cache line on your machines? Not a single person in the room knows how big a cache line is on their machine? How many bits did you buy at the big store: 16, 32, 64? You've got a 64-bit machine; 64 bytes is the size of a cache line. So when we fetch one byte from memory, you actually get 64 back. Are you trying to tell me that if you go to look up this hash table and load it from memory, it takes exactly one memory fetch? What if this dictionary is bigger? What if it holds 128 entries; how big is the index table? One byte each, times 128... what did you guys drink? I don't want any of that. Okay, 128 bytes. What's 128 divided by 64? Two. Two memory fetches to pull in the entire hash table for a dictionary of 128 entries. Pretty awesome. How long does each of the fetches take? You hear that they take 3 cycles; that's the latency, but the throughput is 1 cycle, which means you issue one fetch and then you issue the other, so it arrives one cycle later. It takes a total of 4 clock cycles to pull in this entire table. Who thinks that's kind of awesome? And if you compare it to what we had before: that thing was huge. Remember, the smallest dictionary was 280 bytes, not something like 8.

So that was my contribution. I proposed it about four years ago; I brought it to python-dev, and somebody said, "this seems good," and Martin is like, "you go back to the drawing board; you study more." And when one of the smartest people on the planet tells you to go back to the drawing board, you know it's all over. And Guido expressed no excitement whatsoever. So what should you do: crawl into your cave and admit defeat, or take a back door? Back door. Is there more than one Python implementation? Yeah: Python written in Python, PyPy. So I got on IRC, I found Maciej and Armin, and I said: I've got this great recipe, a proof of concept; I've already written all the code for a compact dict in pure Python, I've already tested it, and I believe that in PyPy it saves memory and runs faster. So they went and studied it, said this seems like a reasonably good idea, and they did something that felt really good. You remember when Tim invented his own sort routine; what did he call it? Timsort. And after I put so much effort into designing this thing, do you know what it was called? The Raymond dict. They referred to it in all of their documents as the Raymond dict; they sent me notes about the Raymond dict; we discussed the Raymond dict. And then Maciej found that many, many years ago somebody had worked out something like this in Java, so it no longer got called the Raymond dict; it became "idea proposed by Raymond, with significant prior work in Java." Not a Raymond dict anymore: a Java dict. That's okay. We're going back in time, to Java; this is the part of the journey where we stop going forward and start going backward.
Backward for two reasons. One is that I presented this out of order: in fact, prior to the compact dict came the key-sharing dict. I've already discussed the concept of key-sharing: instead of putting Guido in three times, once for each instance, how about we store his key one time, the hash one time, and only save the values? This is wonderful: it means every instance stores just its values, and not the keys. Who thinks that's pretty awesome? That Mark Shannon: one smart dude.

Now, finally, we come to it: Mark Shannon's got a key-sharing dict, I've got a compact dict, and somebody is going to win. Who got there first? Well, in my story I got here first, but in reality, who got there first? Mark. So Mark gets to be first, and my compact dict is going to be squeezed out, never to see the light of day... unless it was a romance. Unless they were destined to be. Maybe they were meant for each other; maybe they were perfect complements; maybe they were two great ideas, individually conceived, that came together and made something wonderful, like chocolate and peanut butter. Reese's! KitKat bars! I like the way you're thinking; five bucks. I don't actually pay up; I just like saying it.

All right, so I won't even show you key-sharing by itself; I'll just show you shared-and-compact, sketched below. We make a nice little table and loop over the positions and entries; we still have Tim Peters' randomization logic; we still check whether the table slot is None; but what we insert into the table is just the position, and then we append the entry. And here's what it looks like: now we have Guido, blue, Austin, and apple, all in one table. Notice Guido is mentioned only one time; notice each of these fields appears only once; notice there's no space in between them. The only thing that indexes them is this little table. How many bytes wide is it? I'm looking for some number that is eight times one... I know the drinks have hit hard. This thing is only eight bytes to index this table.

Now, what if I... oh, I didn't change the number here; that one should be 16. Interestingly, I can make the table more sparse, reduce collisions, and speed it up. Remember the last time we tried to increase sparseness? It ate up memory, but it sped up the program. Remember when we went from quadrupling back to doubling, all because we were afraid of wasting space? Are you still afraid of doubling the size of your dictionary, or quadrupling it? You should no longer be afraid. Let's change it from size 8 to 16. When we do, is there any change in the table of keys and hashes and values? No change at all. What changes is this part down here; I'm looking for a size, some number that is about 16 times one. Doubling the size of this table cost 8 bytes, and the whole thing is still fetched in a single cache-line fetch. Who thinks that's awesome? We can now make the dictionaries much more sparse, arbitrarily sparse, to eliminate collisions, without throwing away any memory.
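A toy sketch of the key-sharing idea under discussion: one keys table shared by all instances, each instance holding only its own values. Here a plain dict stands in for the shared index table, and the city and color values past the ones named in the talk are placeholders.

```python
class SharedKeys(object):
    """One keys table shared by every instance of a class; in CPython
    the keys, their hashes, and the index table are stored once."""
    def __init__(self, keys):
        self.keys = list(keys)
        self.position = {k: i for i, k in enumerate(self.keys)}

class SharedDict(object):
    """Each 'instance dict' stores only its values, in shared-key order."""
    def __init__(self, shared, values):
        self.shared = shared
        self.values = list(values)
    def __getitem__(self, key):
        return self.values[self.shared.position[key]]

people = SharedKeys(['guido', 'sarah', 'barry', 'rachel', 'tim'])
colors = SharedDict(people, ['blue', 'orange', 'green', 'yellow', 'red'])
cities = SharedDict(people, ['austin', 'dallas', 'tucson', 'reno', 'portland'])
print(colors['guido'], cities['sarah'])     # blue dallas
```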
Who learned something new? Of course you learned something new; this is all brand new. You don't get it until you get Python 3.6; it comes out in a week; get it, it's worth it, it's awesome. How many of you think I'm really darn clever for coming up with the compact dict? How many of you think Mark Shannon was really, really awesome when he came up with key-sharing, so that we didn't have multiple Guidos? This is the state of the art of the technology in 2017, modulo an extra thing put in by Victor Stinner that I'll mention shortly.

We've progressed so far. I bet if I show this to my father, he'll go: "Wow, you youngsters are so smart; I'm really glad we taught you how to program." If my grandfather were alive, I could show him this and he'd say these kids are so smart: my grandson invented a compact dict, it got together with a key-sharing dict, and they invented something wonderful. But I've got a question for you: have you ever seen this before? This table; not this part, this. We've just reinvented the technology of 1960. It is exactly that, but augmented by a hash and augmented by a byte-sized index table. It turns out our fathers and grandfathers, our mothers and grandmothers, knew this a long time ago. They'd say: you're just like us, kids. You had all the space in the world to waste, and you used it all up; your megabytes grew into gigabytes and your gigabytes grew into terabytes, and you wasted it; and then suddenly your data turned into petabytes, and now you care once again. I'd like to think that we're all just really clever, but in fact this is the technology of yesteryear, and what hadn't happened before is that no one had come along and said: what we knew about databases a long, long time ago can be applied now, to an object-oriented world where you have many instances.

Now imagine I get a hundred instances. All this thing is going to do is grow out to the right; the index table itself isn't going to change. If you have 10,000 instances, this grows to the right, this stays the same width, and it still has the same lookup speed; and at this size, each person takes exactly one probe, no hash collision. Imagine 10,000 instances: no redundancy, fully as dense as a database, and the lookup table into it makes it fast, loads in a single cache line, with zero collisions. Welcome to 2016. What are you going to download come Tuesday of next week? Starts with a "Py" and ends with a "3.6."

The last page here is my original recipe for the compact dict, the one I presented four years ago. It went into PyPy two years ago; it went into CPython, I think, two or three months ago, at our Facebook sprints. It is not perfect: the implementation inside has not fully realized some of the benefits. It's got the space savings; a benefit that has yet to be realized is this part: the resize is kind of awesome. Currently, the resize logic in Python 3.6 builds a brand-new table by reinserting everything when the dictionary gets larger. But as you saw before, really all that has to change is the index table; we don't have to move the entries.
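A sketch of that cheaper resize under the same toy layout as in the earlier sketches: the dense entries stay put, and only the little index array is rebuilt at double the size (constants as in the probe-sequence sketch above).

```python
FREE = -1
PERTURB_SHIFT = 5

def grow_indices(old_indices, entries):
    """Rebuild only the index array at twice the size; the dense
    entries list of (hash, key, value) triples never moves."""
    size = 2 * len(old_indices)
    indices = [FREE] * size            # the only memory that gets zeroed
    mask = size - 1
    for n, (h, key, value) in enumerate(entries):
        perturb = h & 0xFFFFFFFFFFFFFFFF
        i = perturb & mask
        while indices[i] != FREE:      # walk the probe sequence to a hole
            perturb >>= PERTURB_SHIFT
            i = (5 * i + 1 + perturb) & mask
        indices[i] = n
    return indices
```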
I timed resizing many years ago, and I was shocked to find something out. You have a table this big and you want to make a new table this big: you allocate the new table, and then you start inserting the entries one at a time into the new table to make it more sparse. Which part takes more time: making the table, or inserting the entries? I would have guessed inserting the entries, but the new table is twice as big as the old one. It only takes one memory allocation, but then you have to do a memset to zero it out, and shockingly, when you time it, zeroing the memory took more time than the actual insertions. Who learned something new? After all, you're not inserting into every slot: the table is only two-thirds full. So you've got a table of size 16, two-thirds full, that's 10 entries, going into a table of 32; you're only hitting one-third of the rows, so writing the zeros actually costs all the time. When this next fix gets put in, hopefully at some point in 3.7, what it's going to mean is that on a resize we won't have to move almost any of the data; the only thing that needs to be zeroed is that little index table, which can be done in a couple of clock cycles. In other words, resizing will become basically free.

One other benefit that's kind of cool: you see this thing here? This is __iter__. What does it iterate over? It iterates over the key list. Formerly, on a dictionary of size 16 that's two-thirds full, we loop over the table, and some of the slots are empty, so we have to loop 16 times, and we have an unpredictable branch: has this slot got a key, or is it empty? But now all of those keys are pushed over to the left; they're all dense; all 10 entries are in order. When we finish improving this, you'll be able to iterate over a dictionary as fast as a list. Why? Because inside, it is a list: all consecutive pointers, not a single branch misprediction, not a single wasted space. At a table size of 16, you only have to loop over ten consecutive entries. Who learned something new?

Other cool tech: the width of the index entries auto-resizes, as sketched below. If your table is 128 slots or less, it uses a byte array; if it's a little bit bigger, it goes to a 16-bit array; if it gets even bigger, it starts to use longs; and finally, at the largest, it uses pointer-sized entries. In other words, the table automatically scales the size of the indices based on how big your table is. Most dictionaries, most of the time, are really small, so their indices are one byte each. And that's my story, and I'm sticking with it.
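A sketch of that width scaling using the array module; the size thresholds follow the byte/16-bit/long/pointer progression described in the talk, and the exact cutoffs here are illustrative.

```python
from array import array

def fresh_indices(size):
    """Pick the narrowest signed integer type that can hold an entry
    index (plus -1 for 'empty' slots), then build the index array."""
    if size <= 128:
        typecode = 'b'      # 1 byte per slot
    elif size <= 2 ** 15:
        typecode = 'h'      # 2 bytes per slot
    elif size <= 2 ** 31:
        typecode = 'i'      # typically 4 bytes per slot
    else:
        typecode = 'q'      # 8 bytes per slot
    return array(typecode, [-1]) * size

print(fresh_indices(8).itemsize, fresh_indices(1024).itemsize)   # 1 2
```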
Oh, one other righteous dude needs to be mentioned: Victor Stinner. Victor came up with a fairly neat idea. A lot of times, as we're looping through our programs, we get an attribute: we go look it up, and there's a dictionary lookup to find that attribute, and possibly a chain of lookups when you have an inheritance hierarchy. What if we were to mark each of these dictionaries with a dirty bit to indicate when it was changed? Then all of the internal routines that access these dictionaries, when they go to look up an attribute, can check: have I looked up this attribute before? If so, check the dirty bit; if it's not dirty, if the dictionary hasn't changed, and it almost never changes, they don't have to do the lookup again. In other words, Victor found a way to not just save space but to actually eliminate a lot of the lookups altogether. That is in Python 3.6 as well.

I'd like to come up with a name for all of this. I didn't title it as such here, but essentially the modern dictionary is a confluence of about a dozen different ideas from different people over time. Individually they were nothing much; collectively they've made a force awesome to behold, and we've rediscovered the technology of 1960. Thank you so much for inviting me to speak at your holiday party. Mr. Spock, do you want the conn? Or shall we take questions?

[Audience question about benchmarks for the new dictionary.] I didn't, but Victor did, and he published them; the differences were not huge. Remember, the algorithms are exactly the same as what we had before, and there's an extra indirection step. The worry was that it would make Python slower because we had to go through the indirection table, but then it was realized that that table can be read in a single cache line, or two consecutive lines, and we actually got a net benefit from fewer memory accesses and better cache performance. Since the algorithms were all the same, it was mostly about space savings. That said, when you make something smaller, your cache performance improves and you automatically get some win, so I believe it's a few percent in terms of speed; but mostly it's about space, and it'll really pay off when you get hit with big data, because the size of your data is how much space it actually takes.

David, what have you got? [What happens if you delete an entry in your dictionary and then add another entry: do you lose your order?] I never delete! So, this is a good question: what happens when you delete an entry? It would leave a hole in the table. The original recipe had a plan for deletion, down here in the __delitem__ code: what it does is, if needed, swap with the last entry, to avoid leaving a hole; but that sacrifices the order. There's another way to do it: just leave the hole there, as we currently do. That costs you a branch misprediction and a little bit of space, but when the table gets up to two-thirds full, just like everything else in the dictionary, you do a resize and a compaction. That said, there's a reason I published it this way four years ago. What was the principle: ordered is better than scrambled, or scrambled is better than ordered? Had I submitted this as ordered, Guido would have just said "no way, we're never doing it," which would have been terrible. So, to appease the gods, I said: oh, but when somebody deletes an entry, it changes the order; it scrambles. "It scrambles? That's a good thing." Then, when INADA came along to implement this in C, he went ahead and put the compaction back in, restoring the order, as they had done in PyPy, which was a very simple modification. And I think ordering is a big, big win. That said, it's not guaranteed... until you all decide you like it, until you all defy Guido's will and come to rely on it; and then, if anyone ever discusses taking it away from you, you rise up in rebellion, and a couple of years from now it's "of course it's guaranteed." Remember, that's what we did with Timsort; there are many behaviors that only became guaranteed later.
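A sketch of the hole-leaving deletion just described, on the same toy layout as the earlier sketches. The DUMMY sentinel keeps probe chains intact so later lookups don't stop early; CPython's actual C code differs in detail.

```python
FREE, DUMMY = -1, -2
PERTURB_SHIFT = 5

def delete(indices, entries, key):
    """Order-preserving deletion: leave a DUMMY marker in the index
    array and a hole in the dense entries list; the holes get
    squeezed out at the next resize/compaction."""
    h = hash(key)
    mask = len(indices) - 1
    perturb = h & 0xFFFFFFFFFFFFFFFF
    i = perturb & mask
    while True:
        n = indices[i]
        if n == FREE:
            raise KeyError(key)
        if n != DUMMY:
            eh, ekey, _ = entries[n]
            if eh == h and ekey == key:
                indices[i] = DUMMY
                entries[n] = None       # the hole; skipped during iteration
                return
        perturb >>= PERTURB_SHIFT       # keep walking the probe sequence
        i = (5 * i + 1 + perturb) & mask
```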
A question from a lady in the audience? This is a Guido thing; I'm following up on it. Raise your hand and I'll get you a mic. All right, a question from this gentleman then. [Very nice explanation. My question is: does this work for all kinds of data, big and small? When there is a very small data set, rather than creating a new hash, wouldn't it be better to do linear probing?] And what's your name? Bhushan. Excellent question; 25 bucks. Bhushan said: when data gets big, it doesn't fit into cache anymore, and memory accesses are catastrophically slow, but we fetch an entire cache line at a time. Notice how I'm rephrasing your question. And when we fetch a cache line, we get all of the adjacent bytes along with it. So wouldn't it be great if we did a linear probe rather than searching somewhere else? That's in fact what I talked about at last year's holiday party; it's what I did for sets. Sets actually do a series of linear probes before they jump elsewhere, to take advantage of the cache-line performance. On the other hand, that approach is at odds with this one: were we to try to do linear probing here, we'd actually have to move the keys adjacent to each other, and that would throw away the ordering and throw away all of the space benefit. So you get one or the other, but not both. If you use sets, you get the linear probing: I've designed sets for extremely fast membership testing, but they won't iterate as fast as dictionaries and they won't be as space-efficient. On the other hand, that's not their typical use case. For dictionaries, which dominate all the rest of the use of Python, this wins.

There was one ambiguous pair of terms that you used: big data and small data. My thoughts on the subject: I heard, at a PyData conference, this notion of medium data, and I think it serves as a great concept to separate big from small. What is medium data? Medium data is small enough to fit on your computer, but big enough that if you use a crummy algorithm, you're going to really pay a price for it. Big data is anything bigger than that. Now, if you go to an actual big-data conference, bigness varies: there are the mega-bigs, the giga-bigs, and the peta-bigs, and the peta people think that if you're not saying "peta," it's not actually big data. So "big" is kind of a relative concept; I think the concrete concept is medium.
Yes sir. [Speaking of cache lines: do you ever have the possibility of a cache-line-split access, or is it always aligned to the cache line?] So the question is: are we aligned with the cache lines? The compiler makes all of our structure entries sixteen-byte aligned; a cache line is 64. So the question becomes how much we fit into one of those cache lines, and that depends on whether you're using the key-sharing dictionary, for instances, or the regular non-key-sharing dictionary for everything else. In key-sharing, you have consecutive keys all in a row. In a regular dictionary, you have an entry: a key, a value, and a hash; those are eight bytes each, 24 bytes for the three together, and that fits fairly neatly in the cache line most of the time, but once in a while you get a spill, where one of them falls over the edge, because 64 is not a multiple of 24. That said, for small dictionaries, if you're hitting them often enough, all of the parts are already in cache anyway, and making all of Python more cache-friendly means that you have more cache available. The interesting thing about trying to optimize for cache is that it's shockingly difficult to time, because when you go to time it, you're timing the feature of interest, which puts it in cache anyway, so it's hard to observe the effects. Cache effects only show up in big programs that are constantly blowing something else out of cache, so it's shockingly difficult to get good measurements. In fact, one reason I don't trust a lot of people's measurements is that they focus on what happens when everything's in cache and completely ignore the cache effects. Great question; 15 bucks.

All right, Simeon, do you want to take the... oh, here we go: Cindy.
[Cindy: One of the most common use cases for a lot of people is JSON data, so we're dealing with a lot of nested data structures: a dictionary with a lot of nested dictionaries inside. I was wondering how the cache-friendliness of those kinds of dictionaries holds up.] Okay, so there are a couple of parts to what Cindy proposed: when you're dealing with nested data structures, which are very common with JSON data, is there any interesting cache effect, and how does that interact with all the dictionaries I've shown here? There's actually no special interaction, and there's nothing special about nestedness internally to Python. Python containers don't contain anything: you put something in a list, and it's not really in the list; the list just holds a reference to the original object. So everything in Python is kind of nested anyway; if you look at the little object diagrams, it's just a big pile of spaghetti. From Python's point of view, the JSON data is no different from anything else. Also, unless your JSON data is huge and filling up all of your memory, most Python programs' dictionary usage isn't just the JSON. We think about the dictionaries we name: oh, I'm looking at my data, it has a dictionary, that is "the dictionary." But in fact your modules are dictionaries, your globals are dictionaries, and every attribute lookup is a dictionary lookup. Every time you use a dot in Python, except for a decimal point, you're doing a dictionary lookup. So they're everywhere, whether you realize it or not; a typical Django run is making millions of dictionaries, not just the one the designer thought about and said, "oh, I've got a big dictionary to save everything." So I would say there's no special effect, other than: when it comes to cache, smaller is better, always. Excellent question; 25 bucks.

Yes sir; and I see Eddie pointing at somebody, so raise your hand and I'll get you next. [You've made a good point that you can improve speed, memory utilization, and cache hits, but you've been talking about what's important for the average user. Is there actually a census that gathers demographic data about Python, so that you could model whether a trade-off would be better or worse for the average user?] His question is quite interesting. Sometimes when you're optimizing or improving the language, you're making trade-offs: you make one thing better while making something else worse, and hopefully the thing you're making worse isn't in common use, or doesn't hit some particular user who relies on it and make their life catastrophically bad. Dictionaries are kind of a special case, though: everything in the Python world uses dictionaries, so if you improve dictionaries anywhere, you improve all of Python. I don't know of any cases where we have made anyone worse off; making these things smaller and faster is pretty much a straight across-the-board win. The only case that concerns me personally is that I designed the lru_cache, and the lru_cache hits a stable size, and once it does, it deletes an entry, adds an entry, deletes an entry, adds an entry, and so we end up with lots of holes and resizes. From the measurements I've done so far, I can't tell whether that's made it better or worse by a little, but I know it hasn't made it better or worse by a lot,
so it's roughly in the same ballpark as before. Possibly, if your entire life is an LRU cache, you might have gotten a little better or a little worse; I don't know.

His other, very interesting question is: how do we know what everybody does? Well, some Python core developers develop basically for themselves. The idea is they have an itch and they came to scratch it: I needed this, so I put it in. That sounds terrible and selfish, but it's how the open bazaar works. Do you like the json module? Well, Bob Ippolito wrote it because he needed a JSON module, not because he cares about you; later it's, okay, I care about you all a little bit, I'll donate it to Python and sign it over. A lot of things came into Python that way. So there are the core developers who develop for themselves, and the idea is: what's good for us is good for you. I don't universally believe that's true, so particularly when it comes to API design, I trial APIs on students in classes, I try them out on engineers at real clients, I make blog posts, and I post code in ActiveState recipes so people can try it out and I can get feedback. What I used to use quite a bit to inform design decisions in Python was Google Code Search: I could take any tool in Python, and when the question came up, "how do people use it in the wild?", I could search for Python and that particular tool and see, everywhere in the world, how it was being used, and it informed me of what users were actually doing. Then Google took that away, and now I have something that is not as good, which is GitHub code search. GitHub code search is good in the sense that it's a fine search engine; it's bad in that it has sampling errors: now I get to learn the needs of people who use GitHub, and they are not representative of the programmers of the world, who might have quite different needs; military programmers, for example, would never use GitHub. So I've immediately got a sampling bias. But how else do we learn what's out there? I can search, I can go to conferences and see what people do, I can try out apps, I can see what my own needs are. And there are lots of folks who have Python apps that are performance-intensive, and every time we go into beta, they run tests, and occasionally we'll get a bug report: somebody who maintains Cheetah will come back and say, someone made a change to the regular expression module, it was only a little one-line change to clean up some code, and it slowed us down 15 percent; and we'll revert it immediately. So either someone has to scream, or I have to search and find them and get lucky, or your needs have to match mine, or... your needs don't get taken care of at all. It's a somewhat random process. Excellent question; 25 bucks.

[Audience: Just to crowdsource a random thing; somebody help me, I guess the beer's killed me two brain cells. I recently saw a GitHub project where you run your Python code, it profiles it, and then it suggests that you recompile Python with particular GCC flags, optimizing your Python for your particular problem set. Does that ring a bell for anybody?] Nobody else? All right. I'm guessing that's a form of PGO, profile-guided optimization. The idea is: when you compile Python, the compiler doesn't know how Python is going to be used, and unlike Linux, we don't fill it with a lot of hints on branch prediction, in part because some of the feedback from the Linux community was that what were accurate predictions at one time stopped being accurate later, and they
actually worked to the detriment of the compiler; the compilers are just getting really smart at figuring this out on their own. But with PGO, you compile your Python, run it on your application, the PGO learns how your application gets used, and it recompiles accordingly, which to me is just a slow version of PyPy; PyPy does that for you anyway. That's why we have just-in-time compilers. But in fact it is a great way to go, and the Windows distributions of Python, I think Martin von Löwis might still be building them, or it might be someone else now, are always built with PGO. The good part of PGO is that it makes Python a lot faster; the bad part is that it tunes Python to a specific application and that application's hot spots, so it needs to be tuned to you, and if your needs change, your underlying Python doesn't change with them. I'm kind of a generalist: I believe GCC just has ESP, and it mostly knows what I want most of the time and rarely gets it all wrong, and I go with that. But this sounds like a fantastic project; someday they should meet the PyPy people, who will do it in a just-in-time manner. Excellent question; 7 bucks.

Marty, an old friend. [Very good; thank you for the talk tonight. I wanted to ask, since you've been promoting 3.6 coming out next week: any gotchas for people going from 3.4 or 3.5 to 3.6 that they should be aware of?] His question is: any gotchas you should all be aware of, moving from 3.4 to 3.5 to 3.6? You're putting me in the position of saying something negative about something I love. Keep in mind, I'm unique among core developers in that I did not love Python 3 for most of its existence; I don't even love Python 3.5 much at all. I believe that when you were told "go to Python 3 because it's better than Python 2," trust me, for most of that time it was not true. 3.0 was slower than Python 2.7, it was a memory hog, and the email module didn't work; but so few people used it that people didn't realize it. So we could go around saying the number 3 is bigger than the number 2, therefore you should upgrade, and no one said any different. Now, along the way, the email module got fixed; by Python 3.2 it became usable, and some friends of mine started to use it in production: "am I the first person to ever glue these two parts together?" "Why yes, you are." The people who were happy with it were people teaching classes, people doing toy puzzle problems, writing games; they didn't notice this sort of thing. But as you started to use it commercially, it started to fall apart. So I was not a big fan until basically now: Python 3.6 is the first one I love. I've been working every weekend and night for years to make it good, and it is finally worthy.

So, any gotchas along the way? That effectively translates to: is there anything I don't like about Python 3.6, any obstacle to upgrading? I don't know of any; all the code I've written for the Python 3s goes up almost effortlessly. Here's what's really nice. When a person becomes a new core developer, they tend to be very deprecation-happy: oh, I should clean things up, this thing is old, I'll make a new one that's better and deprecate the old one. I was that way when I became a core developer, and I deprecated a few things, and then I heard the screams: "we have no tests for our code, and you deprecated it; I can't upgrade now." And I realized deprecation causes an enormous amount of pain. But we had a whole bunch of deprecations in Python,
and nobody was using Python 3.3, so by the time 3.5 comes around, lots of things have been deprecated out: you had a warning period, but you weren't using Python 3 during it. So at the last language summit I presented this little model: you have a little fence, and you have a sack of gold, and you want somebody to climb over the fence to get to the gold. How do you incentivize them? Make the sack of gold bigger. The compact key-sharing dict does that: it makes it worth upgrading. Asyncio was the compelling reason to move up to Python 3.5; if you didn't need async, you didn't really need 3.5. Now 3.6 is so much better than Python 2.7 that you really want it; the pot of gold is bigger, so they'll all switch. But along the way, a bunch of the deprecators had been making the fence higher and higher and higher. There have been years of effort put into migrating Twisted; it will happen this year, but it was hard, and along the way, every time they got something working, more deprecations came along, and the divergence between 2 and 3 grew and grew. I think we're now at a point where everybody just needs to switch: forget the greatest Python ever invented, 2.7; move on, move forward; this one is finally worthy of it. In fact, I managed to get this point across, and we've un-deprecated a whole bunch of things. Python 3.6 broke Hypothesis, for example, by deprecating part of the inspect protocol, and we undid that. So I think it ought to be fairly effortless. Excellent question; thirty bucks.

Last question, yes sir. [Do you have a preview or a teaser for something you're working on, or another core dev's work that you think is cool, for 3.7 or 3.8, somewhere far out?] A preview of what I'm working on? I have an aspiration: in about one week, as soon as 3.6 goes out final, I'm going to sleep. We have nurtured and babied this release. In fact, if I'm not exhausted when I get all the way back to Santa Clara tonight, I'm going to be editing the What's New document, because the What's New document is what you're going to read to find out what's inside, and it needs to have good examples rather than whatever spewed out of somebody's keyboard. So I'm going to sleep. But when I wake up: probably, in collections, I will add slicing; Victor has another handful of optimizations up his sleeve; Yuri, who's worked mostly on async, has some brilliant ideas for speeding up Python itself; and INADA is going wild right now, looking at the set and dictionary code for some of the cleanups that I mentioned here. Big-feature-wise, I'm not thinking of any right now.

That said, the floodgates, it seems to me, have opened. I believe that we're adding features too fast, and we're doing it without a lot of user feedback, and I'm concerned that we're adding features that individually seem to make sense but brought together might be terrible. Here's an example. A feature of Python 3.6, to make things easier for the optional static typing, is that the colon can now be used when you're doing a variable assignment: x = 10 can now be x: int = 10. Easy enough, and it will make typed code a lot more beautiful; so now we have colons in multiple places. Another thing we got in Python 3.6 is that in your numeric literals you can put underscores where you would normally put commas.
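The two additions he's describing, as they appear in 3.6 (PEP 526 variable annotations and PEP 515 underscores in numeric literals):

```python
# Variable annotations (PEP 526): a colon in an ordinary assignment.
x: int = 10

# Underscores in numeric literals (PEP 515): visual digit grouping.
ten_million = 10_000_000
```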
The problem is that underscores are used for lots of other things, and colons are used for other things, and now I'm finding, and I've been using Python 3.6 for quite some time, I always work on the head, that occasionally when I make typos, Python used to complain at me because I'd done something syntactically invalid, and now there are lots of random key presses I can make that are valid. Underscore-oh has always been a valid variable name, but oh-underscore was not a valid name and would be an error, and that's an easy transposition error to make; now combinations of keystrokes like that read as numbers, with "commas" inside them, and sail right through. I've found all these little combinations where I go: wow, I can't believe that went through and didn't complain. I was copying a dictionary and I killed one of the curly braces; you'd think that if you leave a curly brace off a dictionary, it's going to be invalid, right? But that colon is now treated as a typing declaration, and it became valid. So I'm worried that the rate of addition of features is now too fast. What we used to do is put something in, watch how it plays out, then propose another one, tease it out a little bit, see how it interacts with everything else; and Python is evolving very fast.

I teach Python for a living; I do Python consulting for a living; and my worry as a consultant is that Python is so big now that nobody on the face of the planet, except for maybe Yuri and Serhiy, knows everything that's in Python. I've had someone propose a feature request, and had Steve Holden, the former chairman of the PSF and a Python instructor, come out and say "we really need this," when it had been there for a long time; and that happens a lot. Lots of core developers have never gotten into asyncio yet; we don't all know how it works yet, and we surely haven't played out all the consequences. Do you go the Beazley route, curio, or do you go the asyncio route? The implications of this are not played out. As a consultant, people are about to ask me really hard questions I don't know the answers to; as a teacher, I have to ask myself: can I still teach Python in a week? I don't think so. It is now a very, very, very big language.

So there are some concerns there. That said, it seems to be working fine for people. IEEE ranked it in the top three languages in the world; it's the number one language taught in universities in the world, number one in high schools, and, for the second half of elementary school, also number one. It dominates many, many communities, so whatever we've done to it apparently hasn't killed it. All right, fair enough. Excellent question; 26 bucks. Merry Christmas, everybody; happy holidays; and thank you again for the honor of inviting me to your party.
Info
Channel: SF Python
Views: 136,070
Keywords: 2016, SF Python, Dictionaries, Raymond Hettinger
Id: p33CVV29OG8
Length: 67min 41sec (4061 seconds)
Published: Thu Dec 15 2016