Raymond Hettinger - Modern Python Dictionaries: A Confluence of a Dozen Great Ideas (PyCon 2017)

Video Statistics and Information

Reddit Comments

I think it's safe to say, even without watching it first, that all of Raymond's talks are recommended, even if you are a very senior developer (especially if you are).

👍︎︎ 61 👤︎︎ u/red_simplex 📅︎︎ May 21 2017 🗫︎ replies

Skimming through the video, it looks very similar to his other talk; are there any differences?

👍︎︎ 14 👤︎︎ u/qria 📅︎︎ May 21 2017 🗫︎ replies

I see Raymond, I make time to watch his talk.

👍︎︎ 11 👤︎︎ u/[deleted] 📅︎︎ May 21 2017 🗫︎ replies

I love this guy.

👍︎︎ 9 👤︎︎ u/nevus_bock 📅︎︎ May 21 2017 🗫︎ replies

He sped up too much near the end... Would've liked it if he explained the more complicated stuff in more detail

👍︎︎ 5 👤︎︎ u/lookatmetype 📅︎︎ May 21 2017 🗫︎ replies

Couldn't even get into this talk because the room was at full capacity. Hettinger talks are always great.

👍︎︎ 3 👤︎︎ u/Sector_Corrupt 📅︎︎ May 21 2017 🗫︎ replies

RayHet's talks are always great. I think it is long past the time that he got the Guido treatment at PyCons, so if he needs an extra 5 minutes, he gets an extra 5 minutes.

Just tell the next speaker to skip their useless and boring preamble in which they try to justify their occupation of the stage, pimp their company and then waste another 2 minutes telling us what they are going to spend the next 25 minutes talking about.

👍︎︎ 3 👤︎︎ u/KODeKarnage 📅︎︎ May 21 2017 🗫︎ replies
👍︎︎ 2 👤︎︎ u/pffcomeonjack 📅︎︎ May 23 2017 🗫︎ replies
Captions
Now welcome Raymond Hettinger. [Applause]

All right, welcome to my talk, "A Brief History of Time," by Stephen... I mean, Raymond Hettinger. In the beginning, the dinosaurs roamed the earth. They died and became extinct because of their inadequate capabilities, and in the void that they left behind the mammals arose, intelligent life came to this planet, and we invented Python dictionaries. Any questions? Would you like a few more details?

All right. My name is Raymond Hettinger, and the theme of my talk is that lots of smart people over time thought about the dictionary problem very, very deeply, came up with some ideas, presented them, and the world accepted them and thought we had good dictionaries. Then someone else came along, thought about it very deeply, and had another innovation, and another innovation, and what we have today is the confluence of about a dozen great ideas that span the course of about fifty years. I'd like to take you from the dinosaur era up to the present, and then slightly into the future. For those of you who know nothing about hash tables: you will know everything about hash tables shortly. For those of you who learned about hash tables in school: now you'll learn how they're done in real, industrial-strength production code. For those of you who already knew that but don't know what the latest innovations are: in Python we have propelled the future quite a bit forward in the last few years, and I'd like to take you on this journey to what I think is a somewhat magnificent ending point, though evolution never ceases. Shall we begin?

One little thing about me. What is my mission in life? It is to train thousands of Python programmers. I've personally trained about 5,000 programmers, and my company, Mutable Minds, Inc., has a team of trainers who have cumulatively trained about 13,000 people. So if you have an interest in such things, call me. And let's begin.

All right, our journey: the beginning and the end. Let's do the overview first, which is: do not reprogram your keys just prior to a presentation. There you go.

The important thing is that Python is built around dictionaries. Globals are dictionaries, locals were at one time actually dictionaries, modules have dictionaries, classes have dictionaries, instances have dictionaries. They are central to the language, which means they're used everywhere and they're pretty darn important.

One example: instances. We have a class with several instance variables, and I've chosen Guido, Sarah, Barry, Rachel, and Tim, all important people in my life, and to represent them we've made three instances: a set of colors, a set of cities, and a set of fruits, so that we know Guido's favorite color is blue, his favorite city is Austin, and his favorite fruit is apple. I just made all of that up. We can loop over the instances and print them out: commonplace, everyday Python code. What I'd like to do is look at what's behind it. Behind each of these three instances is an instance dictionary, courtesy of the make-it-big button (this is my favorite button). So we've got three dictionaries: Guido is blue in one, Guido is Austin in the next, and Guido is apple in the next. That's a consequence of how instances are designed: they have dictionaries behind them. What we'd like to do is see how these dictionaries are implemented over various versions of Python.
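(Editor's note: a minimal sketch of the kind of example being described. The class name and all attribute values other than Guido's blue/Austin/apple are placeholders; the point is simply that each instance is backed by an ordinary dictionary.)

```python
class Preferences:
    def __init__(self, guido, sarah, barry, rachel, tim):
        # each assignment lands in this instance's own __dict__
        self.guido = guido
        self.sarah = sarah
        self.barry = barry
        self.rachel = rachel
        self.tim = tim

colors = Preferences('blue', 'orange', 'green', 'yellow', 'red')
cities = Preferences('austin', 'dallas', 'tucson', 'reno', 'portland')
fruits = Preferences('apple', 'banana', 'orange', 'pear', 'peach')

for instance in (colors, cities, fruits):
    print(vars(instance))        # the plain dict sitting behind each instance
```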
Jumping right to the end, to see where we ended up after all the evolution: in Python 2.7, each one of these dictionaries took 280 bytes, and the ordering of the dictionary was that Sarah would go first and Guido would go last. That is deterministic; it is repeatable in Python 2.7. The keys are scrambled; we actually say "arbitrary," because they're not re-scrambled every time.

In Python 3.5 something had changed: the size of the dictionary had gotten down to 196 bytes, which is a very nice compaction, and this time the keys are actually randomized, so every time you rerun it you get a different key order. They were no longer deterministic.

Then Python 3.6 came out. The size is a little bit smaller, 112 bytes, which is pretty darn amazing, and the key order is now deterministic: it's the order in which you added the keys. Because, after all, in the beginning there was Guido, and from Guido came everything that you see around you, so Guido goes first, always the ruler of the kingdom. In the Python 3.5 universe he was just another schmo; once in a while he'd get to be a benevolent dictator, and once in a while the rest of us would get to tell him what to do. In Python 2.7 he was always at the bottom of the list, and Uncle Timmy took priority. But now he always gets to be king.

So that's where we're going to end up at the end of our journey: with smaller dictionaries that are ordered. By the way, how many of you like small, ordered dictionaries? This is a fine thing; it's one of your reasons to want to upgrade to Python 3.6 straight away. I'm not going to say that I was a Python 3 holdout, because I've been working on making it better for about eight years, but I believe Python 3.6.1 is the first of the Python 3 series that's actually better in many, many respects. It is very, very good, and it is now time to really push hard on other people to go ahead and just switch over. When I step back to 2.7, which was a nearly perfect Python, it feels like a dinosaur to me, and you'll see part of the reason why coming up.

So let's start back in time with the dinosaurs. The dinosaurs were our parents and our grandparents, who had this bright idea of using databases. In a database we had rows and columns, and the columns were fields: the name, the color, the city, and the fruit. Now, our forefathers were entirely clueless. They came up with the bright idea of indexing into this database so that you could do quick lookups for Barry, but clearly our ancestors had no idea what they were doing back in the 1960s when they made databases like this. I should actually say the early '70s: this design was from E. F. Codd, who in 1970 announced relational database theory to the world all at once. So this is the state of the art in 1970, from the dinosaurs who died and became extinct because their skills were inadequate in comparison to our awesome skills of today. They were unequal to us; they would shiver to see our modern code. Just saying. This, we all know, is true: programming has advanced dramatically over fifty years. Do you believe me? It was a trick question. Do you believe me? No? Okay.

So I've got a little setup here: some keys and values, three sets of values, and I create a list of hashes, a list of entries, and combined entries that hold the keys, hashes, and values. These are the data structures I'll use for the rest of the talk. So what is the database way of solving the problem in Python? We would represent it as a list of tuples. The problem with this is that we have to do a linear search, and as it starts to scale, the performance gets really bad.
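(Editor's note: a minimal sketch of that list-of-tuples approach next to the dictionary equivalent; the rows other than Guido's are placeholder values.)

```python
rows = [
    ('guido',  'blue',   'austin',   'apple'),
    ('sarah',  'orange', 'dallas',   'banana'),
    ('barry',  'green',  'tucson',   'orange'),
    ('rachel', 'yellow', 'reno',     'pear'),
    ('tim',    'red',    'portland', 'peach'),
]

def color_of(name):
    # the "database" way: scan every row until the first field matches
    for row in rows:
        if row[0] == name:
            return row[1]
    raise KeyError(name)

# the dictionary way: one hashed lookup instead of a linear scan
colors = {name: color for name, color, _, _ in rows}

assert color_of('barry') == colors['barry'] == 'green'
```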
Interestingly, compared to modern dictionaries, once you get to a list of size two you're already not as good as a dictionary. We tend to think dictionaries only start to win once you've got a million entries, but typically it's at about two or three entries that dictionaries start to win, which is pretty amazing.

I know what you're thinking: that was a trick, so there must be a worse way than this. There is. Unto them was born a crazy group of people known as Lisp programmers, and the saying was that a Lisp programmer knows the value of everything and the cost of nothing. A Lisp programmer would store what are called association lists: key-value pairs. In this case we have a separate list of tuples for each column, and each one of them needs to be searched independently. You can see there's a linear search for each of these, we have multiple lists, and there's redundancy inside, wasting space like crazy, as Lisp programmers are wont to do. Just saying. Remember, the dinosaurs, our parents and grandparents, were crazy people and didn't know all of the things that we know today. That was a trick.

All right, so what was the first innovation after the dinosaurs died off? It was called separate chaining, and the idea is: instead of doing a linear search of one big list, how about doing smaller linear searches of smaller lists? Take that one big list we had before and divide it into two: list number one and list number two. Now if you search list number one, you either get one probe to find Guido or two probes to find Tim; one for Sarah, two for Barry, and three for Rachel. This roughly doubles the performance of the lookups, as long as you know which bucket to look in, and you know which bucket to look in by hashing. If somebody asks you for a simple summary of what hashing is: it is reducing the size of the search space. If I were doing a linear search of this room, I would have to go through 600 people, but if I know which cluster you're in, cluster 1, cluster 2, or cluster 3, I can cut the search space by a factor of three. Easy enough. That's the one- or two-sentence elevator pitch for hash tables: something that reduces the search space by cutting it into smaller clusters, which are traditionally named buckets when we're talking about separate chaining.

There is a better way: if two buckets are good, four buckets would be even better. So now I have four buckets, and what's cool about that is the more buckets you have, the more you reduce the search. We now have four people who are found with one probe, and Barry is a hash collision, so it takes a second probe to find him. That means that, on average, the weighted number of probes across these five people is 1.2. That's pretty darn good: most of the time you look for something, it's in the first place you look rather than the last place. And if four buckets are good, what would be better? Eight buckets. Eight buckets is kind of interesting: we actually still end up with a collision, Guido and Rachel are in the same bucket. (By the way, I've run actual code behind this, so these are literally where the collisions land, using Python 2.7 hashing.) But if I make the buckets big enough, eventually everybody will be in their own bucket, which means I will find them right away; the number of lookups is exactly one. This is dramatically better than linear searching. Does this sound like a really great idea? I think so too.
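(Editor's note: a toy separate-chaining table along the lines described here, assuming Python's built-in hash() picks the bucket; class and variable names are illustrative.)

```python
class ChainedDict:
    """Toy separate-chaining table: each bucket is a small list."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # hashing just picks which small list to search
        return self.buckets[hash(key) % len(self.buckets)]

    def __setitem__(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def __getitem__(self, key):
        for k, v in self._bucket(key):   # linear search, but only within one bucket
            if k == key:
                return v
        raise KeyError(key)

d = ChainedDict()
d['guido'] = 'blue'
d['sarah'] = 'orange'
print(d['guido'])                        # 'blue'
```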
But you'll notice little bits of wasted space in there. In exchange for wasted space you get very fast lookup; this is the advantage of sparsity. The problem is that this dictionary gets bigger and bigger. Even with eight buckets: on this computer, if I put two thousand entries into eight buckets, the average chain would be about length 250, so as the dictionary gets bigger, it slows down. There is a better way. The solution is to periodically resize the dictionary so that it's never more than two-thirds full. Essentially we copy out all the key-value pairs, allocate a bigger number of buckets than we had before, and then reinsert them. Whenever eight buckets aren't enough, go make sixteen buckets; after sixteen buckets, thirty-two. That way, no matter how big the dictionary gets, it retains its performance. Pretty awesome. Some people in the beginning were thinkers, and this is the way they teach it to you in school. But they leave a lot of important things out in school.

Here's an important thing that you typically don't find in the textbooks: we cache the hash value. The idea is, instead of the narrow table up above, we're going to add an additional column to it and actually store the full hash value inside. At first this appears crazy: why do you need a hash value when you're already inside the hash table? Well, the problem is that when you resize, you have to go rehash every key, and hashing is potentially very expensive for a key; you don't want to do it again. The solution is to cache it, so we save it inside. At that point the resize code becomes a little simpler: it loops over the keys, the values, and the known hash values, and reinserts them into the new buckets in a larger table. (By the way, this stuff is not pseudocode; it all runs, and I'll give you links to it. In the end I'll combine it all together into one big recipe and hand it to everybody, so you can play with all of these things.) In textbooks you don't see this, because it isn't essential to the algorithm and it uses up extra space (do you see this third column here?), but what that extra space buys is very, very cheap. I'm astonished by this, even though I'm one of the maintainers of this code: when I timed it, I found that resizing a dictionary came in at about one-fifth the speed of a list copy; it is essentially running at the speed of a copy. It makes no calls to equality, it makes no calls to hash; it essentially just says: take my current hash, modulo some power of two, reinsert into the new table, and loop over the existing keys. It's incredibly fast, almost as fast as copying a list of keys. Who thinks that's kind of cool? Which is why we use this extra column.
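(Editor's note: a sketch of that resize step under the assumptions above, with buckets of (cached_hash, key, value) triples; it never calls hash() or ==, it only re-bins the cached hashes.)

```python
def resize(buckets, new_n_buckets):
    # No calls to hash() or equality here: the cached hash is re-binned
    # directly, which is why resizing runs at roughly the speed of a copy.
    new_buckets = [[] for _ in range(new_n_buckets)]
    for bucket in buckets:
        for cached_hash, key, value in bucket:
            new_buckets[cached_hash % new_n_buckets].append((cached_hash, key, value))
    return new_buckets

# grow from 8 to 16 buckets once the table gets more than two-thirds full
buckets = [[(hash('guido'), 'guido', 'blue')]] + [[] for _ in range(7)]
buckets = resize(buckets, 16)
```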
All right, the next step up is faster matching. This is something they don't teach in school either. In school they say: loop over the keys until you find a match, and if you don't find a match, raise a KeyError. The problem is: how do you know whether a key has matched? You do an equality test: does this key equal the target key? That makes a lot of sense in textbooks, because the thing being looked up is typically a number or something like that, so equality tests are very cheap. But in object-oriented languages, any object can define a dunder eq that's very expensive. For instance, comparing two Decimal objects involves a lot of steps, because we have to line up the number of decimal places first; it's not just a direct comparison of values, so a lot of work can potentially be done there. If you have a database record, you have to compare every one of the fields to determine equality. The moral is: in an object-oriented language you can't assume that equality is fast. It can be very, very slow, and if it's slow, do you want to do it all the time, or do you want to avoid it? I would avoid it.

So there are two fast early-outs. One of them is the notion that identity implies equality. If we have two pointers to the same object, two variable names pointing to identically the same object, we don't even have to look at the object to know that they're equal. You are equal to you. Doesn't that seem obvious? Sir, what's your name? Christian. And the person in the front row on the far right side, what's your name? If I had two references to the same object, did I have to ask him both times, or, if I knew the references were the same, did I even have to ask to know it was the same guy? Identity implies equality. The person sitting in that chair is the person sitting in that chair, regardless of how I refer to the chair. Does that seem logical and obvious to you? It does.

Interestingly, everything in the Python world is the subject of contentious debate. In the world of floating-point numbers there is the concept of a NaN, a not-a-number, and the IEEE 754 specification defines NaNs as not being equal to themselves. Are you equal to yourself? Of course you are, because you're not a NaN; but if you were a NaN, it would say no. So there's this thought that dictionaries break NaNs, because if you use one as a key, we go back later and the key says "I'm not really me." A die-hard for NaNs says a NaN should not be look-up-able: you should be able to put it in a dictionary but never be able to find it later. And then they hold up the specification, IEEE 754 arithmetic. What do I have that can trump an IEEE 754 specification? Something of greater weight, something we care about more: the Zen of Python. I hold up the Zen of Python and I say practicality beats purity. If you put something in a dictionary and can't get it back out, that's very impractical; it's generally a bad thing, and it makes it impossible to reason about the container. And if I make identity imply equality, I can make dictionaries very fast. Do you want dictionaries that are fast and practical, or ones that let you put NaNs in and show your mom a cute trick? So I fight it with "practicality beats purity." Periodically, though, about once a year, somebody comes back and posts: "I put a NaN in a dictionary and later I was able to find it; this is terrible," and I have to fight the battle once again. I'm afraid that if I ever retire from Python core development, this will disappear: people will put things in dictionaries and not be able to get them back out, in the name of IEEE 754, and Uncle Timmy will come back and haunt them from his grave.
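(Editor's note: the identity shortcut being defended here, shown concretely; this runs as-is on current CPython.)

```python
nan = float('nan')
print(nan == nan)            # False: IEEE 754 says a NaN is not equal to itself

d = {nan: 'found'}
print(d[nan])                # 'found': same object, so the identity check wins the lookup
print(d.get(float('nan')))   # None: a *different* NaN object never compares equal
```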
All right. The way hash tables work is that I group all of you into clusters, and I only have to search your cluster for you; but if I'm mistaken about which cluster you're in, I'm never going to find you. This is cluster one, cluster two, and cluster three; if I go to look for you in cluster one and you're actually in cluster three, I'm not going to find you. So we have what is known as the hash invariant, and all hash tables require it: if two objects are equal to each other, then they have to have the same hash. They have to be in the same cluster, so that we can go find an object equal to that object. Does that make sense?

Now I'm going to get mathematical and logical on you, like Dr. Sheldon Cooper from The Big Bang Theory. We will use modus tollens logic, and I will express the contrapositive of this statement; that's where you switch the "if" and the "then" and negate them both. The contrapositive says: if two objects have hashes that are unequal, if you are in two different clusters, then you are not equal to each other. So we flip it around to the contrapositive, and it gives us a fast match algorithm.

Internally, dictionaries and sets use a fast match. First, they check for identity: are the two objects sitting in the same chair? If the answer is yes, we assume they're equal; identity implies equality. This is extremely fast, just a pointer comparison. Next, we already know the hash of the object we're looking for, and we already know the hash stored in the hash table; as a result, if the hashes aren't equal, we don't even have to test the objects for equality, because we know they are not equal to each other. And then, finally, we have the slow test: if those two early-outs don't resolve it, we finally do an equality test. Two fast early-outs in front of a slow test. If you knock these two lines out of Python, it cuts its speed by more than half. These lines are kind of important. So you like identity implies equality, and you're not going to file a bug report about IEEE 754? "He says the world's not safe for NaNs." I see how it is.
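(Editor's note: a sketch of that fast-match logic in pure Python, assuming entries that carry a cached hash; the function and parameter names are illustrative, not CPython's.)

```python
def fast_match(key, key_hash, entry_key, entry_hash):
    if entry_key is key:           # identity implies equality: just a pointer check
        return True
    if entry_hash != key_hash:     # unequal hashes imply unequal objects
        return False
    return entry_key == key        # only now pay for a (possibly slow) equality test

entry_key, entry_hash = 'guido', hash('guido')
print(fast_match('guido', hash('guido'), entry_key, entry_hash))   # True
```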
That second early-out, the hash comparison, is an interesting one. I've known about it for a very long time, but its full impact didn't hit me until we started moving to the 64-bit world. What are the odds that two hashes are equal to each other while the objects are not equal? The answer is about one in 2**64. That means that with this test in place we never, well, almost never, never in your lifetime or in the lifetime of the universe, hit a case where the hashes match but the key isn't actually the target key, so we essentially never do an unnecessary equality test. Every dictionary and set lookup, if it finds a match, does exactly one equality test, even if there are collisions. Who thinks that's kind of cool?

These odds are so good that I conducted an interesting and funky experiment. I replaced this line in Python with "return True," which says: if the hashes are equal, just assume the keys are the same. Interestingly, the entire test suite passed. Django's test suite passed. NumPy's test suite passed. In other words, this is a very, very good test. How many of you think it would be safe if I just took the equality test out entirely? How many think it would be unsafe? Really, a lot of you think it would be unsafe. Now I've got another poll question: how many of you use git? It does exactly this. When it compares files, it doesn't look at the contents: if the hashes are equal, it believes they are the same. This was a really good idea until SHA-1 started to look kind of weak, so they're actively working on switching from SHA-1 to a different hash, one where a person can't easily produce a collision on purpose.

All right, all of this was great, but there's a problem: separate chaining takes a great deal of space, because of all the pointers to many separate lists. So what's the solution? The solution is to make the table more dense: get rid of all the pointers and put all of the buckets in one table. The downside is that you introduce a risk of collisions, and the solution to that is linear probing. This is pure Python code for linear probing: look in position i and check whether you found the key you needed; if not, move to the next location and look there, and the next place, and the next place. With open addressing, here's what the table looks like. Interestingly, Tim was supposed to go in Sarah's slot, but because it was already occupied, he went to the next slot, wrapping around. This greatly improves the density of the table. Pretty good idea. It is, except for one problem: later, if we delete Sarah from the table, Tim becomes unfindable. By the way, is Tim Peters here? No? That's because Sarah got deleted.

There is a better way. The problem is that removing a key leaves a hole. The solution is to mark it with a dummy entry, and the dummy entry says: this space was used by some key; when you're searching, skip past it and keep looking until you find Uncle Timmy. Here's the pure Python code for that. (By the way, I'm going to speed up. Is it truly fifteen minutes until questions? How many minutes before people walk out the door, twenty? All right, good enough.) The idea of the free slot is that we track whether we found any dummies during the search. If we find a dummy, we remember where it is; later, if we find a hole in the table, instead of using the hole we go back and reuse that free slot, the dummy that we found, which lets us reuse slots. On the other hand, if we find a key, we can use the fast match we just discussed, with identity implies equality and the hash check, to do a quick comparison. This is the core logic for dictionaries. It is known as Knuth's Algorithm D, and it was known right at the end of the 1960s; it appears in The Art of Computer Programming. It turns out our grandparents knew a thing or two after all.

Multiple hashing: the problem with linear probing is that we can end up with a catastrophic linear pileup. The solution is, every time you get a collision, to rehash, so the entries don't pile up in the same place; they split off in different directions. There are two tricks there. One is that we use all the bits of the hash: we've got a 64-bit hash and perhaps a table with only eight slots, so we only need three bits at a time, and we gradually shift in five more bits at a time; that's called perturbing. In addition, there's a linear congruential random number generator that says: take the current slot, multiply by five, add one, modulo the number of slots, to find the next one. What's cool about this is that it provably, eventually visits every slot; it means hash tables don't get caught in a loop when there are collisions. I believe Uncle Timmy came up with these ideas, so this code is Uncle Timmy's. Whenever we get a collision, we perturb: shift in additional bits, multiply by five, add one, take it modulo the table size, and update the perturb value. With this structure the probes scatter instead of piling up; here are the collisions that came out for this example as the probing wraps around the table, and I picked these names on purpose to create a lot of collisions. Wow, that is an improvement on Knuth's Algorithm D. Are you trying to say that Uncle Timmy was able to come up with something that Donald Knuth didn't know? Pretty cool.
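(Editor's note: a sketch of that perturbed probe sequence, modeled on CPython's recurrence; the constant and names here are illustrative.)

```python
from itertools import islice

def probes(hashvalue, mask):
    PERTURB_SHIFT = 5
    perturb = hashvalue & 0xFFFFFFFFFFFFFFFF   # treat the hash as an unsigned word
    i = perturb & mask
    while True:
        yield i
        perturb >>= PERTURB_SHIFT              # gradually mix in more of the hash bits
        i = (5 * i + perturb + 1) & mask       # the 5*i + 1 step eventually visits every slot

# first five probe slots for one key in an 8-slot table (mask = 7)
print(list(islice(probes(hash('guido'), 7), 5)))
```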
Now, an early-out for lookups: this one is Victor Stinner's. Victor Stinner is an awesome Python core developer who has been working very hard on improving the speed of Python. One of his ideas is that internally, Python uses dictionaries a lot, and we're often looking up the same method name over and over again, and we have to repeat the dictionary lookup every time because the dictionary could potentially mutate. The better way, in PEP 509, which appeared in Python 3.6, is a private version number added to the dictionary. Every time we update the dictionary, we update the version number. That means anything that wants to avoid repeated lookups can just track the version number and say: is the version number the same? Then I don't need to do the lookup again. That sped up the internals quite a bit. Are you trying to tell me that Victor Stinner came up with something that Uncle Timmy couldn't, who came up with something that Donald Knuth couldn't, which was better than the dinosaurs with their databases? In fact, that's the case.

There was a problem, though, with that design. If you look at this table, it's got holes in it, and the holes are quite wide: all three fields. There's an enormous amount of wasted space inside. This is my big idea: the compact dict. Essentially what I did was compress the entries down and get rid of all the holes, so all of the hash values, keys, and values are stored in a dense fashion, and there's a separate table of indices that's very narrow. Interestingly, because there are only eight possible slots, the indices can be stored in one byte each, so this index table is only eight bytes. It's actually smaller than the space it takes to store Guido's name, which I think is pretty awesome. That probably doesn't seem right to you, because "guido" is only five characters, but string objects are bigger than that; they have some overhead. So the index table is essentially free in comparison to the entries, there's no wasted space, and the dictionaries are compact. Now you've got the idea of how dictionaries got smaller in later versions of Python: we got rid of all the wasted space. Are you telling me that I thought of an idea that Victor Stinner didn't think of, that Tim Peters didn't think of, that Knuth didn't think of, that the dinosaurs with their databases didn't think of? That is in fact true.

That said, there was a terrible problem with my dictionaries. (By the way, I'm getting the history slightly out of order here to make it easier to explain; these things happened in a different order than I'm presenting them.) The problem is that we had a dictionary for colors, a dictionary for cities, and a dictionary for fruits, and notice that all of these have exactly the same hash values and exactly the same keys, repeated over and over again. If you have a thousand instances, you have a thousand dictionaries where two-thirds of the space is completely redundant. This couldn't be solved by people like me or Victor; we're not smart enough. No, it came from a PhD, Mark Shannon, and PEP 412: the key-sharing dictionary. The idea is that we share the first two columns, so all of that compacts down to just the values, which means we've got four dictionaries in one little space with no gaps in between. Do you see how all these ideas dovetail together really well?
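(Editor's note: a sketch of the compact-dict layout being described, a dense entries list plus a tiny index table; for brevity it does no resizing or deletion, uses plain linear probing instead of the perturbed sequence above, and the names are illustrative.)

```python
FREE = None            # marks an empty slot in the index table

class CompactDict:
    def __init__(self, n_slots=8):
        self.indices = [FREE] * n_slots      # one byte per slot in CPython's version
        self.entries = []                    # dense [hash, key, value] rows

    def _slot(self, key, h):
        mask = len(self.indices) - 1
        i = h & mask
        while True:
            idx = self.indices[i]
            if idx is FREE or self.entries[idx][1] == key:
                return i
            i = (i + 1) & mask               # simplified probing for the sketch

    def __setitem__(self, key, value):
        h = hash(key)
        i = self._slot(key, h)
        if self.indices[i] is FREE:
            self.indices[i] = len(self.entries)
            self.entries.append([h, key, value])
        else:
            self.entries[self.indices[i]][2] = value

    def __getitem__(self, key):
        i = self._slot(key, hash(key))
        if self.indices[i] is FREE:
            raise KeyError(key)
        return self.entries[self.indices[i]][2]

    def __iter__(self):                      # iteration follows the dense entries,
        return (key for _, key, _ in self.entries)   # so insertion order is preserved

d = CompactDict()
d['guido'], d['sarah'], d['barry'] = 'blue', 'orange', 'green'
print(list(d), d['sarah'])                   # ['guido', 'sarah', 'barry'] orange
```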
So that takes us up to the present; now let me take you into the future. The future is something I'm working on right now, for dicts and sets. Because this index table is so cheap, only eight bytes, would it cost me very much to make it bigger, to double its size and make the table less dense? The answer is no. (Well, I hear people clapping in the other room; that's a bad sign.) The only difference here is not this data; it's that the index table is twice as big as before. It has sixteen entries, which means it went from 8 bytes to 16 bytes, and it's still smaller than Guido's name; it's the size of two of these hash values; it's nothing in comparison to the total table. But I've doubled the sparsity. What's really nice about this structure is that we've essentially eliminated the collisions: if you make the index table sparse enough, there are almost no collisions at all, and it turns out we can now make it sparse essentially for free. Who learned something new? So that's coming in the future: dictionaries will do even fewer steps per lookup, and the same goes for sets.

By the way, have you seen anything that looks like this before? Really? It's up at the top of the page. Do you remember what those dinosaurs did a long time ago? Here in the third millennium, we've reinvented databases with indexes. That said, these dictionaries are pretty badass: they're compact, they iterate faster, and as a side effect of how we build them, they preserve insertion order. So in Python 3.6, dicts are smaller and faster, space-efficient, beautiful, and key-sharing, and if you have a million instances with a million dictionaries, none of this data over here, these two columns, is repeated at all; the incremental space is just the space to store the values for that individual instance. Who learned something new?

All right, I have some odds and ends at the end; let me say a couple of them real quickly. We still need SipHash in Python. The purpose of that is that some people are diabolical and try to create keys that intentionally collide. We fought back against them by putting in a cryptographic hash that's reseeded every time Python starts up, so it's very difficult to come up with keys that collide on purpose.

Also, sets use a different strategy: they use a mix of multiple chaining and linear probing, because sets are about something different than dictionaries. Dictionaries are about looking something up; sets are about determining whether something is there or not. With dictionaries you typically know the thing is there, you just don't know what the value is. Different use cases, a different balance of strategies.

A popular thing to talk about on Hacker News now is cuckoo hashing, and a number of languages are using it. How well does that mesh with this? The answer is that nothing I've described conflicts with cuckoo hashing; cuckoo hashing is simply a different arrangement of the index table, and the rest of this would survive. We could still do it. I don't think we should, because we have no need for densification at this point: cuckoo hashing is all about living with a denser table and more collisions, whereas making the index table sparse is all about decreasing them, and I think that would be a step in the wrong direction.

Additionally, there's one other piece of logic inside. In a safe world, someone loops over a dictionary and doesn't change it while they're looping. The analogy I like is: you put a ladder up against the side of a house, then you take out a saw and cut the ladder out from underneath yourself. That's what happens when you mutate a structure while you're iterating over it. So we have logic to guard against that, so that dictionaries and sets won't segfault.
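(Editor's note: that guard in action on current CPython.)

```python
d = {'guido': 'blue', 'sarah': 'orange'}
try:
    for key in d:
        d['barry'] = 'green'      # mutating the dict mid-iteration
except RuntimeError as exc:
    print(exc)                     # dictionary changed size during iteration
```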
If you're interested, when I post these slides: all of this combined together is in a single piece of code, something I posted about four years ago. It is pure Python code, a complete implementation of compact dictionaries from scratch, and everything I showed you is in there somewhere: the generation of the probes with Tim Peters's sequence, the dummy entries, the compact indices, the resizing logic, the fast matching. All of that is in there, and I will post it shortly. In the end, here's what we get: the output is this beautiful little dictionary, which is what I posted to python-dev back in 2012. Congratulations, and welcome to the third millennium. [Applause]

Okay, it looks like a bunch of you are going to lunch; that's fine, just be quiet while you're leaving the room. There are plenty of people sticking around who probably want to ask some questions. If you do have a question, there are microphones in the aisles here. Sir? Thank you. All right, let's call it a day and go eat. Thank you, everybody. [Applause]
Info
Channel: PyCon 2017
Views: 69,419
Id: npw4s1QTmPg
Length: 37min 37sec (2257 seconds)
Published: Sat May 20 2017