"Performance Matters" by Emery Berger

Video Statistics and Information

Reddit Comments

The conference happened just 2 days ago, and the talk is already online. Awesome!

๐Ÿ‘๏ธŽ︎ 170 ๐Ÿ‘ค๏ธŽ︎ u/nfrankel ๐Ÿ“…๏ธŽ︎ Sep 15 2019 ๐Ÿ—ซ︎ replies

Better profiling tools, which consider the context of the entire system, would certainly be helpful, but as long as we're relying on humans to optimize software by hand, our programs are going to become more difficult to understand while still being far from optimal.

I wish there was more research into developing automated optimization systems that aim to take a very high level description of what we want computers to do, and translate that into an optimal (to the theoretical limit) implementation of that description. Contemporary "optimizing compilers" seem to focus mostly on code, when most of the optimization opportunities seem to be data related (memory layout, cache locality, etc).
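A small illustration of the commenter's point about data layout: switching a structure from array-of-structs to struct-of-arrays changes nothing about the algorithm, only the memory the scan touches. This is a minimal C++ sketch with invented names, not code from the talk or the comment.

```cpp
#include <cstdio>
#include <vector>

struct ParticleAoS { double x, y, z, mass; };   // array-of-structs element

struct ParticlesSoA {                           // struct-of-arrays layout
    std::vector<double> x, y, z, mass;
};

double total_mass_aos(const std::vector<ParticleAoS>& ps) {
    double m = 0;
    for (const auto& p : ps) m += p.mass;  // strides over x, y, z too
    return m;
}

double total_mass_soa(const ParticlesSoA& ps) {
    double m = 0;
    for (double v : ps.mass) m += v;       // dense, cache-friendly scan
    return m;
}

int main() {
    std::vector<ParticleAoS> aos(1000, ParticleAoS{1, 2, 3, 4});
    ParticlesSoA soa{std::vector<double>(1000, 1), std::vector<double>(1000, 2),
                     std::vector<double>(1000, 3), std::vector<double>(1000, 4)};
    std::printf("%f %f\n", total_mass_aos(aos), total_mass_soa(soa));
}
```

Both loops compute the same sum; the SoA version touches a quarter of the cache lines.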

๐Ÿ‘๏ธŽ︎ 98 ๐Ÿ‘ค๏ธŽ︎ u/GoranM ๐Ÿ“…๏ธŽ︎ Sep 15 2019 ๐Ÿ—ซ︎ replies

Great presentation, some observations:

  • alpha significance is confused with observability percentages

  • most probably the layout randomization does not result in randomly distributed run times, so what he is observing is a sum of normally distributed variables, so not CLT

  • O3 vs O2 optimizations really depend on the source code you're compiling

  • there are better profilers nowadays than the ones from the '70s: perf, stap, the eBPF zoo, lttng, just to name a few

  • sqlite part: "you double the length of all of your critical sections" - maybe Emery was rushing the presentation here, but that's not what it means

Otherwise, remarkable job and a lovely presentation!

๐Ÿ‘๏ธŽ︎ 35 ๐Ÿ‘ค๏ธŽ︎ u/NoInterest4 ๐Ÿ“…๏ธŽ︎ Sep 15 2019 ๐Ÿ—ซ︎ replies

I was at this talk yesterday! Crowd was loving it!!!

๐Ÿ‘๏ธŽ︎ 12 ๐Ÿ‘ค๏ธŽ︎ u/yeahdef ๐Ÿ“…๏ธŽ︎ Sep 15 2019 ๐Ÿ—ซ︎ replies

I was lucky enough to go to St. Louis for Strange Loop. This was the last talk I saw and it was the most entertaining by far.

I think some compiler tools like VC++ Profile-Guided Optimization (POGO) specifically address the layout concerns he mentions. He also completely ignored the amazing work it took to make games like DOOM run on the hardware of the '80s. And finally, he completely ignored I/O, which for the distributed system he describes is probably going to be a dominant factor.

All that said, the talk itself is a good model for how a conference talk should flow. It was great.

Edit: okay, technically DOOM was '90s, so substitute the work the demoscene guys were doing in the mid-'80s.

๐Ÿ‘๏ธŽ︎ 39 ๐Ÿ‘ค๏ธŽ︎ u/BeowulfShaeffer ๐Ÿ“…๏ธŽ︎ Sep 15 2019 ๐Ÿ—ซ︎ replies

Pretty nice talk!

The talk mentions Stabilizer ('szc'), which I imagine is https://github.com/ccurtsinger/stabilizer , which is ancient. Then there's a fork with LLVM 6.0 support (but the author seems unsure if it's actually working) at https://github.com/fusiled/stabilizer .

So I'm wondering, is there a modern working version of that tool?

๐Ÿ‘๏ธŽ︎ 11 ๐Ÿ‘ค๏ธŽ︎ u/eras ๐Ÿ“…๏ธŽ︎ Sep 15 2019 ๐Ÿ—ซ︎ replies

Performance doesn't matter, only the time to launch does!

-- JS crowd

๐Ÿ‘๏ธŽ︎ 9 ๐Ÿ‘ค๏ธŽ︎ u/[deleted] ๐Ÿ“…๏ธŽ︎ Sep 16 2019 ๐Ÿ—ซ︎ replies

I am programming retarded. I'm only at pondskipper level scripting.

But I do distributed systems....and this is EXACTLY how actual optimizations are done.

So good. Great talk.

๐Ÿ‘๏ธŽ︎ 19 ๐Ÿ‘ค๏ธŽ︎ u/Cheeze_It ๐Ÿ“…๏ธŽ︎ Sep 15 2019 ๐Ÿ—ซ︎ replies

This is the best thing I've seen in a long time

๐Ÿ‘๏ธŽ︎ 3 ๐Ÿ‘ค๏ธŽ︎ u/Objective_Status22 ๐Ÿ“…๏ธŽ︎ Sep 16 2019 ๐Ÿ—ซ︎ replies
Captions
Hi everybody, I'm Emery Berger, and I'm going to be talking today about performance matters. This is joint work with my former PhD student, Charlie Curtsinger.

I'm going to tell you a story that happened a short time ago, in a valley far, far away. Not actually this one; in fact, this one. The story is about a character we're gonna call Luke. Luke is our hero, of course; he's a hero in Silicon Valley, because Luke has an idea for an app. Here is Luke's genius idea: basically, you use your smartphone to take pictures of things, and it goes and finds matches and returns them to you. He's got a great name for it as well: he's gonna call it Oggle, and he's pretty sure Oggle is gonna totally disrupt image search. Nobody has told him about Google Image Search yet. Anyway, he sets about building it.

So he makes the prototype. The prototype of Oggle works like this: you take a picture, and it gets sent off to the Oggle data center. If it's new, it gets added to the database, and at the same time it goes and finds similar pictures and sends them back to the user. This is pretty straightforward, and you can abstract the flow as follows. A request comes in, and it essentially spawns two threads. One of them takes the image, compresses it, let's say, and then saves it to a three-and-a-half-inch floppy, as one does. At the same time, the other does some indexing to look up matches (it does a search after doing some feature extraction) and then sends the results back to the user via paper airplane, which is the most effective method known to man. Eventually these two threads join up. This is obviously a simplification; I've elided certain things like locks. When you have some database, for example, you can't be modifying it simultaneously, and so on. But this is essentially, at a high level, how Oggle is going to work.

So Luke ships it off to the app store, it gets approved, users start coming, and Oggle is working; he's disrupting image search. But then, as the number of users mounts, it turns out Oggle starts getting slow. It takes a long time for Oggle to load. This is too bad. He's not happy about it, and he's not really sure what to do.

Before we get into that, let me talk about another version of Luke. This is also Luke: Cool Hand Luke, if you will, with his glasses. Really, this is Luke in the '80s, so he's got the '80s glasses to match. Of course, those of you old enough to know, or who have watched TV, know that there were no smartphones in the '80s. There were instead computers with big CRTs. So this is what Oggle looked like in the '80s: some ASCII art, a modem connection to go search through the database (this is version 1.0 of Oggle, in '84), it comes up with its matches, and this is your user interface. Hope you liked it.

Now, back in 1984, computers were really slow. So when Luke today says "Oggle is too slow," well, things were really bad in 1984. But it turns out things were also kind of better, in a way: performance used to be really easy.
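The fork/join request flow described above is easy to render in code. Here is a minimal C++ sketch, with made-up function names standing in for Oggle's compression, storage, and search steps; it is an illustration, not code from the talk.

```cpp
#include <iostream>
#include <string>
#include <thread>
#include <vector>

std::string compress(const std::string& img) { return "compressed(" + img + ")"; }
void store(const std::string& blob) { /* write to the database (or the floppy) */ }
std::vector<std::string> search(const std::string& img) {
    // feature extraction + index lookup would go here
    return { "match1.png", "match2.png" };
}

std::vector<std::string> handle_request(const std::string& image) {
    std::vector<std::string> matches;
    std::thread saver([&] { store(compress(image)); });  // thread 1: compress + save
    std::thread indexer([&] { matches = search(image); });  // thread 2: index + search
    saver.join();    // both threads must finish before we respond
    indexer.join();
    return matches;  // sent back to the user (ideally not by paper airplane)
}

int main() {
    for (const auto& m : handle_request("cat.jpg")) std::cout << m << "\n";
}
```

Compile with a C++11-or-later compiler and thread support (e.g., g++ -std=c++17 -pthread).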
I'm sure all of you have seen this kind of graph. On the left is the number of transistors, with years along the x-axis; the counts go up, and you'll notice it's a log scale. On the right is clock speed, which roughly corresponds to computing performance. For years and years we had this situation where the number of transistors was doubling roughly every 18 months (this is famously Moore's Law), and this generally had the effect of allowing clock speed to increase. Smaller features meant your programs would actually run faster, and in fact you could literally just wait: if you bought new hardware on the upgrade cycle, every year or so, everything would run much faster. Now, things were still slow back then, so don't get too excited about how great it was; it was bad. 33 megahertz was awesome back then and is kind of terrible today. But it did change the landscape of what it meant to work on performance improvement, because there was really, in a way, no sense in trying to squeeze performance out of something when hardware was going to double in speed in eighteen months. The chance that you, in eighteen months, were going to squeeze out double the performance is pretty slender, so you might as well just sip mojitos on the beach. I can tell you that, the mojitos-on-the-beach part notwithstanding, this was actually a recommended strategy in high-performance computing. People would literally say: we're just going to upgrade the computers next year, or in two years, so focus on something else; don't focus on performance right now.

Unfortunately, that's not the current state of affairs. Today, as all of you know, when you upgrade a computer it does not get faster anymore. Maybe it gets lighter; maybe it has improved battery life; maybe not. Basically, Moore's Law kind of ran out of steam. Except it's not actually true that Moore's Law ran out of steam; technically the problem is called Dennard scaling, and it's Dennard scaling that ran out. Right now we're in a situation where, if we clock chips up, we can no longer dissipate the heat effectively enough to make them run faster. Transistor counts are actually still increasing, which is why we have multiple cores: we have a bunch of transistors, so what are we going to do with them? Let's just stamp out more of the same thing. That's great, but it's also not super awesome, because if you have a program that is not parallel, it's not going to run any faster. Everything now is multi-core; your phone is multi-core. But it turns out performance is still a problem.

These are actual screenshots from App Store updates; practically every App Store update is about performance or bug fixes. Here's one that says "under the hood updates for better performance," as opposed to the over-the-hood updates. Here's "bug fixes and performance improvements," and "bug fixes and important performance improvements." And then this one, which I love a lot: "app engine calibrations," because it sounds cooler. They've calibrated the App Engine.

Okay, so why does this keep happening? Why is every update "oh god, we've got to improve the performance"? Why is this so hard; why didn't they get it right the first time? Code is always buggy, and it's hard to debug programs, but when a thing produces the wrong result, at least that's pretty clear. If something just runs slower, you don't really know where or what to change, and it can get really complicated.
So I'm going to talk about two things in this talk. One is performance analysis: I'm going to explain how to do performance analysis, because it turns out that the way we (including myself) have often done it is not really very rigorous, and you can draw the wrong conclusions. Then I'm going to talk about a new approach to performance profiling that shows how to do it better.

First, performance analysis. Here's Luke. Luke has his code, and Luke is like: it's too slow, Oggle is too slow, what am I gonna do? So Luke has an idea. "I'm going to do this first, I'm gonna change this code, I'm gonna change this function," blah blah blah. He makes a bunch of changes and eventually ends up with A-prime, the new version of A. Now he wants to know: did it work? He made this change thinking it would make things faster, so he takes his code and some set of inputs, some benchmarks, let's say, and runs A. A takes 90 seconds, which is clearly too long for an app, but notwithstanding that, let's say there's some test that takes 90 seconds. Then he runs A-prime: 87.5 seconds. Fantastic, success, 2.8 percent faster! Time for a mojito. Clearly what Luke did had a big impact.

The question is: is it really faster? If you plot it (here's a bar graph, and I'm kind of giving away the game here, with one execution each of A-prime and A), it turns out that A-prime is 2.8 percent faster. Looks good. But there's maybe a problem here, and the problem is variance. When you run a program once, you get some result; if you run it again, maybe the result will be slightly different, and you want to account for that. Great, so now we'll run it 30 times (30 is the magic number, by the way). We run it 30 times and get this graph. The runs are pretty tightly distributed, and you can see it's still 2.8 percent faster. Seems plausible; I think everybody here would probably say A-prime looks faster.
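Luke's experiment is easy to reproduce in miniature. The following C++ timing harness makes the same assumptions (30 runs of one fixed binary); the workload is a stand-in for whatever code is under test.

```cpp
#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>

volatile long sink;  // keep the optimizer from deleting the workload

double run_once() {
    auto start = std::chrono::steady_clock::now();
    long acc = 0;
    for (long i = 0; i < 50'000'000; ++i) acc += i % 7;  // stand-in workload
    sink = acc;
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

int main() {
    const int runs = 30;  // the "magic number" from the talk
    std::vector<double> t;
    for (int i = 0; i < runs; ++i) t.push_back(run_once());
    double mean = 0, var = 0;
    for (double x : t) mean += x;
    mean /= runs;
    for (double x : t) var += (x - mean) * (x - mean);
    var /= (runs - 1);
    std::printf("mean %.4fs  stddev %.4fs\n", mean, std::sqrt(var));
    // Caveat from the talk: all 30 runs share one memory layout, so this
    // distribution says nothing about layout-induced bias.
}
```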
The question you have to ask yourself, though, is why it's faster. You might think: well, of course, the reason is the code change. I, Luke, the developer, am a genius; I had my great idea and it paid off, 2.8 percent. Well, it turns out that changing the code can lead to all sorts of knock-on effects that have nothing to do with your intended change. It could totally be an accident. Let me explain why.

There's a wonderful paper that appeared at ASPLOS in 2009, by Todd Mytkowicz and a number of other people. I highly recommend you read it; the title is something like "how to do things wrong without really trying," something like that. The conclusion is that layout, that is, where the code and data end up in memory, has a pretty big impact on performance. So when you go to measure something, those measurements are biased depending on where things happened to fall. Here are a few things that can have an impact on layout, and I'm going to talk about more. One is link order: if you're in C/C++ land, your makefile has a link step with a bunch of .o files, and depending on how those .o files are ordered, you can get different performance. Another is your environment variables: when you execute your program, the environment variables (whether in C/C++ or even in managed languages) get copied in, and everything else gets shifted. In C and C++ this moves the stack, so it really does have an effect on layout. These two alone can lead to shifts in performance of plus or minus forty percent. That's not great. This is a huge shift; it's literally larger than the impact of -O3 versus -O0. Yes, you laugh, and well you should.

So why is A-prime faster than A? What is going on; why could this happen without actually trying? Part of the problem is that modern processors have become insanely complicated in their zeal to increase speed. They have caches, and data and instructions get mapped into the cache. For good reasons, caches are divided into bins called sets, and if two addresses map to the same set, you can have a conflict. If a lot of hot code maps to the same set, it won't necessarily fit in cache, and your code will run slower. By luck, when you changed A into A-prime, you could have disrupted such a conflict: now one piece of hot code maps to a set where nothing else lives, there's no conflict, and boom, it ran faster. So it could be the cache. It could also be the branch predictor, which again is based on the addresses of your branches; if branches collide, things can run slower. There's also the TLB, the translation lookaside buffer, which maps virtual addresses to physical addresses; if things don't fit in the TLB because they span two pages instead of one, suddenly they get slower. There's also a branch target predictor, there's a prefetcher, there's more. So this is pretty bad; all of these things can happen.
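To make the cache-set point concrete, here is a small C++ sketch of how addresses fall into sets. The geometry (64-byte lines, 64 sets) is a common L1 shape but an assumption here; real CPUs vary, and casting function pointers to integers is implementation-defined, though it works on mainstream toolchains.

```cpp
#include <cstdint>
#include <cstdio>

constexpr uint64_t LINE_BYTES = 64;   // assumed cache line size
constexpr uint64_t NUM_SETS   = 64;   // assumed: 32 KiB, 8-way L1 => 64 sets

uint64_t cache_set(uint64_t addr) { return (addr / LINE_BYTES) % NUM_SETS; }

int hot_a() { return 1; }   // stand-ins for two hot functions
int hot_b() { return 2; }

int main() {
    // If both hot functions fall into the same set, they can evict each
    // other's lines; a one-cache-line shift from link order or an
    // environment-variable slide is enough to create or remove the conflict.
    std::printf("hot_a -> set %llu\n",
                (unsigned long long)cache_set(reinterpret_cast<uint64_t>(&hot_a)));
    std::printf("hot_b -> set %llu\n",
                (unsigned long long)cache_set(reinterpret_cast<uint64_t>(&hot_b)));
}
```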
You might think: all right, link order is fine, the cache thing is a little weird, but hey, it's 2.8 percent faster, I don't care, it's all good. It may not be faster on every machine, but it's faster today. Here's the problem: anything you do can disrupt this. One more malloc changes layout; you've shifted everything over, one more or less. If you upgrade anything in your system, that changes layout. Okay, fine: I'll never change libc, and I guess I'll never malloc again, whatever. All right, then here's something that may be surprising: running the program in a new directory. It turns out your current working directory goes right into your environment variables. So if Vader tries to run your software, it's not gonna run as fast, because his name is one character longer than Luke's. This is a real effect. It can really happen, and it has actually bitten me. I had a student with a long Indian last name; my username is just five letters, "emery." He wrote something and said, "It doesn't run any faster; it actually runs slower." That made no sense at all, and eventually we whittled it down: if I run it as me, it's faster. That's right; the username changes your layout. So the solution is obvious, right? Run everything!

I should add that this whole talk is oriented toward "I'm going to improve my performance," but everything I'm talking about today can be viewed in reverse, for performance regressions: I made a change and things run 2.8 percent slower, oh god, roll it back. Well, maybe not; maybe the next thing you do will undo that change. So basically, layout is super brittle, and as you've seen, layout biases measurement. One of the questions we want to answer is whether it's possible to eliminate the effect of layout, so we can actually understand the performance of things without having to think about one malloc more or less, or Luke versus Vader. The answer is yes. We built a tool that we call Stabilizer. Stabilizer addresses exactly this problem and eliminates the effect of layout. It's a way to measure programs where you can actually know whether the regression you saw is real, or whether the optimization you made is real and not just an accident.

So how does this work; how is it even possible? Stabilizer works by randomizing layout, and it randomizes a lot of things: function addresses, stack frame sizes, even heap allocations. Not only does it do those things, it does them over and over again. While your program is running, Stabilizer is literally re-randomizing, and this turns out to be important; I'll show you a graph that explains why. But basically, if you do this, there's no way layout can bias your measurement, because a completely random layout can't bias the results. That's just how things work; it's why we run randomized controlled trials. You've eliminated layout as a possible cause, so the only cause that remains is whatever change you made.

Let's walk through what you would do with Stabilizer. Again, you're clearly supposed to run your program a bunch of times, but notice what happens to the execution times here: they're no longer tightly bound around one very small measurement. The reason is that when you ran the program 30 times before, it was as if you set out to survey 30 people but asked the same person 30 times, because it's the same layout over and over. You did an experiment with 30 executions, but you really just repeated one of them 30 times. The only noise you captured came from network daemons waking up, or some other random event, maybe some thermal issue in your computer; it was always the same layout. Here it's not, and you get these very nice bell curves.
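The flavor of Stabilizer's heap randomization can be conveyed with a toy allocator that pads each allocation by a random amount, shifting the heap around from run to run. This is only a sketch of the idea, with invented names; the real tool also moves functions and stack frames, and re-randomizes continuously while the program runs.

```cpp
#include <cstdio>
#include <cstdlib>
#include <random>

static std::mt19937 rng(std::random_device{}());

// Allocate n bytes plus a random amount of padding, so every subsequent
// object lands somewhere slightly different on each execution.
void* random_alloc(size_t n) {
    std::uniform_int_distribution<size_t> pad(0, 256);
    return std::malloc(n + pad(rng));  // demo only: never freed, padding untracked
}

int main() {
    for (int i = 0; i < 4; ++i)
        std::printf("allocation at %p\n", random_alloc(64));
}
```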
Now I'm going to ask you a question; it's audience poll time. Is A-prime faster than A? Raise your hands if you think A-prime is faster than A. All right, great; now keep your hands up, don't put them down yet. Okay, put them down if you change your mind. How about now? How about now? Okay, a lot of hands still up; a few hardcore holdouts. What you all are doing is what I like to refer to as eyeball statistics: "looks close to me," "that's too close." It turns out this is not actually a thing; it's not really statistics when you just eyeball results.

This is a bit of a refresher for some of you, but I'm going to walk through how this all fits in with Stabilizer. Today I'm just going to talk about one flavor of statistics, called null hypothesis significance testing (there are others, notably Bayesian approaches; happy to talk about that offline). The way it works is that you assume the two things are the same, and then ask: what is the likelihood of observing this difference by chance? Conveniently, it's very easy to compute these probabilities for the normal distribution, which you all remember from school. These graphs are normal. Awesome. It turns out that Stabilizer happens to produce normal distributions, and I'll explain why. So here's the plan: we run stuff with Stabilizer, and we pick some probability below which we say, okay, good enough. If there's only a 1-in-20 chance of seeing this difference by luck, that's good enough for me. You could be harsher: one in a hundred, one in a thousand. But it's pretty standard to use 1 in 20, which is a p-value of 0.05. The idea is that if the probability is low enough, you reject the null hypothesis (the null hypothesis being that the two versions are the same) and conclude that the speed-up is real, not due to the effect of memory layout.

So why re-randomization? The reason is that randomizing just once doesn't give you enough randomization. This is an actual program, and you can see the distribution is pretty wacky, very far from normal; you can't explore much of the space when you randomize only at startup, as opposed to randomizing during execution. This, by contrast, is the kind of distribution you get when you randomize all the time, and these are normal distributions. Why do I keep saying they're normal? Again, going back to freshman stats: Stabilizer generates a new random layout every half second. From half second to half second, it's effectively a completely independent version of the program, and it's the same program the whole time, so the layouts are identically distributed, and the total run time adds them all up. And there's this key result in statistics: the sum of a sufficient number of independent, identically distributed random variables (so: run the program for long enough) is approximately normally distributed, no matter what the underlying distribution was. This is the central limit theorem. It makes execution times normally distributed, which is cool in other ways too, because then you actually know how likely it is that you'll see some very weird execution; you know what the distribution looks like.
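The central-limit-theorem argument is easy to see numerically. In this C++ sketch, each simulated "run" is the sum of 60 independent draws from a uniform (decidedly non-normal) distribution, by analogy with a run composed of many half-second random layouts; all parameters are invented for illustration.

```cpp
#include <cstdio>
#include <random>

int main() {
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> segment(0.9, 1.1);  // per-layout time
    const int layouts_per_run = 60;   // e.g., a 30s run, new layout every 0.5s
    const int runs = 10000;
    int histogram[20] = {0};
    for (int r = 0; r < runs; ++r) {
        double total = 0;
        for (int i = 0; i < layouts_per_run; ++i) total += segment(rng);
        int bin = (int)((total - 57.0) / 0.3);  // bucket total times near the mean
        if (bin >= 0 && bin < 20) ++histogram[bin];
    }
    // Print a sideways histogram; the star counts trace an approximately
    // normal curve even though each draw was uniform.
    for (int b = 0; b < 20; ++b) {
        std::printf("%5.1f | ", 57.0 + b * 0.3);
        for (int i = 0; i < histogram[b] / 60; ++i) std::putchar('*');
        std::putchar('\n');
    }
}
```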
Great, so now we have this tool in hand, and we're gonna do something insane: we're going to test whether compiler optimizations actually matter (and we know some of them matter). We have a suite of benchmarks to evaluate, individually and then across the whole suite, and I'll show you how we do it. You build the benchmarks with Stabilizer, which is a plugin for LLVM; if you just compile with it, it randomizes everything, though you can also randomize things independently if you want (just code, just heap, just stack). Then we run the benchmarks as usual, we drop the results into one of my least favorite programming languages ever, and we decide what the result is. Again, we don't ask questions like "does this look faster?"; that's eyeball stats. Instead we ask a question like: if the two were equal, how likely is it that we'd observe this much of a difference? I should also say that, in my humble opinion, you should almost never assume normality unless you have very good reasons for doing so. Here we have very good reasons for doing so, and that means we can use really powerful statistical tests like Student's t-test. This is the test you use to measure the difference: if the p-value, the likelihood of the observed event, is less than some threshold, which as I mentioned is 5 percent, we reject the null hypothesis. That is to say: it's not random layout, the difference is real. Everybody's on board, I hope.
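For concreteness, here is roughly what that test looks like in code: a C++ sketch of the two-sample (Welch's) t statistic over two sets of timings. The sample values are placeholders; a complete implementation would consult the t distribution's CDF rather than the rough |t| > 2 rule of thumb, which is approximately right for two 30-run samples.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

struct Stats { double mean, var; };

Stats summarize(const std::vector<double>& x) {
    double m = 0, v = 0;
    for (double t : x) m += t;
    m /= x.size();
    for (double t : x) v += (t - m) * (t - m);
    v /= (x.size() - 1);               // sample variance
    return {m, v};
}

int main() {
    // Abbreviated placeholder data; imagine 30 Stabilizer-randomized timings each.
    std::vector<double> a      = {90.1, 89.8, 90.3, 90.0};   // version A
    std::vector<double> aprime = {87.6, 87.3, 87.7, 87.4};   // version A'
    Stats sa = summarize(a), sb = summarize(aprime);
    double t = (sa.mean - sb.mean) /
               std::sqrt(sa.var / a.size() + sb.var / aprime.size());
    std::printf("t = %.2f  (|t| > ~2.0 => reject 'same speed' at p < 0.05)\n", t);
}
```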
So now we're gonna do it. You'll be shocked to learn that -O2 versus -O1 makes a difference. Good; it would be weird if the result were otherwise. You can see there are statistically significant improvements. On the right there are some effects that are statistically significant but don't matter, and, by god, there are statistically significant performance drops. It turns out that compiler developers run these same benchmarks and overfit their optimizations to them, and some of those optimizations lead to layout changes, so it wasn't actually the code change at all; with this approach we can distill out that effect. By and large, though, -O2 over -O1 looks like a win.

How about -O3 versus -O2? Ready? It's amazing. Okay, I actually have to change the axes so we can see the values: instead of the range being -10 to 20, I'm gonna make it 0 to 1.5. Now we can see them. They're pretty small effects, but some are significant and some are not: a statistically significant 1.75 percent decrease in execution time here (great), a bunch of things that aren't significant, and a couple of decreases, but really very minor effect sizes.

So what do these results mean? You can't just look at an individual benchmark; there are 30 of them, and drawing a conclusion about all 30 requires something different. You have to consider all of them together. You get a bunch of these graphs, and here's what you don't do: "okay, this one is slower, this one's faster, this one's faster." That's just eyeballs everywhere, and they're spooky; nobody wants to see that. To test a bunch of things simultaneously, you use this terribly named technique called analysis of variance; you plug the data into R with an awesome incantation, and then you do the same test: if the p-value is less than or equal to five percent, we reject the null hypothesis. All right, ready? Here we go. Here's the p-value; it has to be less than or equal to 5 percent, or else we conclude that -O3 versus -O2 is nothing. The p-value is twenty-six point four percent. That means roughly one in four experiments would show an effect like this purely at random. We do not consider that enough evidence to reject the null hypothesis, so we cannot reject it; the effect of -O3 over -O2 is indistinguishable from noise.

Okay, so this is all terrible news for people like Luke, who want their optimizations to work. I've actually seen projects (it kind of breaks your heart) committed on GitHub where it literally says -O9, and I'm like: why not -O11? There is no -O9 or -O11; the optimization levels just bottom out. But hope springs eternal.

So what do people do when they can't speed things up? They run a profiler. These profilers all basically work the same way: you run your program, you get some result, and it says, here's where your program spent its time: the number of calls to every function, and the run time of each function. This captures, intuitively for most of us, what a profiler should do. What do I care about? Frequently executed code, or code that runs for a long time; that's where I should focus my optimization efforts. It seems intuitively appealing, and it's the way profilers have been written since prof, back in the late '60s and early '70s.

So would this in fact speed up Oggle? Let's do the experiment: go find the code that runs for the longest amount of time, where the program spends all of its time, and optimize it. We do this, and basically it makes the loading animation flash faster. Well, guess what: that is the frequently executed code, and in fact the code that runs the longest. This is not really great, especially if Luke spent more than a minute optimizing that code. That's a shame. Basically, these profilers were developed in an era when everything was synchronous and there was a single core. That's not today. Today things are asynchronous, or parallel, or concurrent, or a mix thereof, and profilers don't do a good job in these contexts. We need to do better.

What would be really cool is if we could have something like this; it's what I call a causal profile. A causal profile tells you, visually: if I were to speed up this component by this much (on the x-axis), then the whole program would speed up by this much. This is really nice. If I had this graph, I would know that I could spend a little effort on the yellow search component and get a big bang for my buck, though eventually it's going to top out, at like 70 or 80 percent. The red one I could just keep working on: it gets faster and faster the more I optimize. And the blue one I should never optimize, ever. It would be cool to know this; it's essentially an oracle coming to you and saying, "This is the code you should work on, Luke." I don't know where I got that way of talking. Anyway.

So the question is: how would we know this? How would we get this information; how would we know that this change would cause that effect? We can't just optimize the program by arbitrary amounts and test it; that kind of defeats the purpose. So we're going to do something different. We're going to run an experiment, and it requires one special ingredient.
That ingredient is what I'll refer to as the Force. We're going to use magic: we speed things up magically, and then we measure what the effect of speeding up each component by a certain amount is on overall program execution. We just keep doing this, getting more and more points, and then we do it for different components. It turns out that if I could speed up saving things to the three-and-a-half-inch floppy, it wouldn't make a difference, and so on. Unfortunately, we live in the real world, where there is no magic. Sorry. And to be clear, if there were magic, this is not what we would do with it; there are obviously much better things. I can think of people I would like to disappear off the face of the earth, for example. But I could also disappear run time off the face of the earth, because why not.

Since we can't do that, we have to do something else, and our trick essentially takes advantage of a notion of relativity. We do a virtual speed-up. A virtual speed-up "speeds up" code, in scare quotes, by slowing everything else down. Everything that's running at the same time gets slowed down, and that lets us measure what the impact would have been had we actually sped the thing up. For example, if we "speed up" sending the picture results back by a certain amount, we slow down everything running concurrently with it; the resulting relative slowdown tells us the same thing as the result of having actually sped it up. So we really can get points on this graph just by running these experiments. I get a point here; I do it for everything and get more points; eventually I get a whole graph. If I speed up indexing, you get the exact same kind of effect (indexing runs at the same time as the compression), so I get this result, and bang, I get all these results.
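Here is a toy rendition of the virtual speed-up idea in C++, with invented stage names and crude sleeps standing in for Coz's sampling machinery: to estimate the effect of making stage A 20 percent faster, the other stage is delayed by 20 percent of A's elapsed time, and the end-to-end time is compared against a run with no delays. Coz's actual accounting is considerably more careful.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

using namespace std::chrono;

std::atomic<long> penalty_ns{0};  // sleep debt owed by the *other* threads

void stage_a(int iterations, double virtual_speedup) {
    for (int i = 0; i < iterations; ++i) {
        auto t0 = steady_clock::now();
        std::this_thread::sleep_for(milliseconds(10));      // A's "work"
        auto dt = duration_cast<nanoseconds>(steady_clock::now() - t0);
        penalty_ns += (long)(dt.count() * virtual_speedup); // charge the others
    }
}

void stage_b(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        std::this_thread::sleep_for(milliseconds(10));      // B's "work"
        long owed = penalty_ns.exchange(0);
        std::this_thread::sleep_for(nanoseconds(owed));     // pay the debt
    }
}

int main() {
    auto t0 = steady_clock::now();
    std::thread a(stage_a, 100, 0.20), b(stage_b, 100);
    a.join(); b.join();
    auto total = duration_cast<milliseconds>(steady_clock::now() - t0);
    std::printf("total with others slowed: %ld ms\n", (long)total.count());
    // Compare against a run with virtual_speedup = 0 to read off the
    // predicted whole-program effect of a real 20% optimization of A.
}
```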
Now I draw your attention to the one weird blue line. The blue line goes down: it turns out that sometimes optimizing things makes your program run slower. The intuition is that you can get contention on a lock, or contention for a shared resource like disk or network, so speeding one thing up makes the whole system worse. You would like to know this before you get started; that would be a very, very bad day for Luke, one that might necessitate several sequels to recover from.

All right, let's dig into Oggle. What do we care about? Two things. One is how long it takes between a request and a response, a.k.a. latency. Traditional profilers don't measure this at all; they measure total run time. Let me get on my soapbox for one moment: traditional profilers are all about end-to-end run time. You know, the way your servers are all about end-to-end run time, or your browser; if only your browser could quit faster, right? It all comes from a world of "here's the program, I run it at a console, it does something, and it's done." That's not really today. So there's latency, and then the more traditional thing is throughput: how fast results come back. Again, this is something profilers do a bad job of measuring, because they're all about end-to-end execution time.

So how are we going to do this? With our causal profiler, which I'm going to explain in a minute, we introduce what we call progress points. The notion of a progress point is: here's the thing I want to happen faster, or here are the beginning and the end of the thing I want to happen faster. If Luke wants responses to be sent faster (higher throughput), you just mark the start of that component as a progress point, and every time the code passes it, you get another coin. This can happen simultaneously for many requests from many users, and all of them increment a counter, so these progress points measure throughput. Then you run the experiments and see what the effect is on the rate at which progress points fire: if I speed up some component, whatever it might be, what happens to that rate?

Now, what if I care about latency? We do almost exactly the same thing: we set a progress point at the beginning and a progress point at the end, and the only thing that has to happen under the covers is a counter that tracks how many requests are in the system at any one time. It turns out there's an awesome law that holds in a wide variety of circumstances, called Little's law: the average latency is essentially the average number of transactions in the system divided by the throughput. We already know how to measure throughput, so we just take advantage of Little's law and translate it into latency.

So, we have built a causal profiler for Linux, and it already ships with Debian and Ubuntu; if you're using one of those systems you can install it quite easily (the package is coz-profiler). It's quite easy to run: you say "coz run ---" followed by your program and its arguments, it fires the program off, and it starts doing performance experiments. I should add that it's not entirely automatic: you do need to place progress points. If you don't place any, it will act like an ordinary profiler, measuring end-to-end execution time; if you do put in progress points, it will actually do its magic. And a progress point is just a macro, either a single progress point or a begin/end pair.
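A minimal example of what those progress points look like in source, using the macros from coz.h (the header installed alongside the coz-profiler package); the surrounding request handler is invented for illustration.

```cpp
#include <coz.h>

void handle_request(/* request data */) {
    COZ_BEGIN("request");             // a transaction enters the system
    // ... compress + store, index + search, send the response ...
    COZ_PROGRESS_NAMED("responses");  // one more response sent (throughput)
    COZ_END("request");               // transaction leaves; Coz can derive
                                      // latency from these via Little's law
}

int main() {
    for (int i = 0; i < 1000000; ++i) handle_request();
}
```

Run it as "coz run --- ./server"; Coz writes its results to a profile file (profile.coz by default) that you can load in its browser-based viewer. As a sanity check on Little's law itself: with an average of 50 requests in flight and a throughput of 1,000 responses per second, average latency is 50 / 1000 = 50 ms.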
So let's apply this to Oggle. I didn't actually build Oggle (neither did Luke), but we're going to build it out of pieces, like any good programmer would. It turns out there's a suite of parallel applications, PARSEC, that's kind of ready-made for this task: there's a deduplicator (dedup) that does compression, there's an image comparator (ferret), and then there's a database, SQLite, which is not in PARSEC, but we'll use SQLite too. I'm going to show you some fun things we did. This is a well-studied set of applications, people have already tried to optimize these, and we uncovered a bunch of surprising optimization opportunities.

Here's ferret. This is actually an older version of what our causal profiler's output looks like (it now runs in a web browser), and you can see there are these lines of code, and it says: boy, if you speed up these lines of code, you're going to get performance increases. Conveniently, these lines of code happen to be located in separate stages of ferret: the part that does ranking, the part that does indexing, and the part that does segmentation. Why is this convenient? Because I'm not going to have to change any code to make this faster. Ferret has a pipeline model, and it assigns an equal number of threads to every stage in the pipeline, but it turns out one stage really doesn't need that many threads. So we take its threads away, and just by reassigning threads we got a 20 percent speed-up. This is pretty cool, because Coz actually predicted it: we increased ranking, for example, from 16 to 22 threads, which is a 27 percent increase in throughput, and the graph said that would translate to a 21 percent overall improvement, and that's what we got. So Coz actually works.

We were pretty happy with this, so we moved on to dedup. Dedup is pretty hilarious. Here's dedup in action: I have two pictures, Grumpy Cat one and a Grumpy Cat meme. What do you do to deduplicate these? You can see there are chunks that are the same, so you carve out the chunks that are the same and separate them into individual pieces, and then an image is represented by the bits and pieces that make it up. Here, Grumpy Cat one is this piece, and "fun is awful" is this piece, and you've saved a lot of memory. So that's what dedup does: compression via deduplication, using a hash function; it throws everything into a hash table. It's a pretty standard hash table: an array of buckets, you get a bucket number, and you start adding stuff to that bucket. The hash table is accessed concurrently by a bunch of threads, and they're not idiots: there's not one big lock, there's a lock per bucket, which is naive but fine.

Surprisingly, Coz says that the loop that traverses a bucket's list is important. Now, if you know anything about hash tables, you know things generally end up balanced, so it's weird to be in this situation. So we thought, all right, let's just make more hash buckets. We made a thousand of them (we really should have made a million, because, you know, a million). You would think this would lead to fewer collisions, but it had no effect. So what else could be causing the collisions? Any guesses? The hash function, exactly. This is one of those "when all other possibilities have been exhausted, you pick the weirdest one"; not the exact quote, but anyway. You might ask: how can the hash function be broken? We've been using hash functions since before Knuth wrote about them. Well, it turns out people like to roll their own, because it's fun. So we made a histogram of the number of items per bucket (again, there are a thousand buckets), and this is the histogram. Hilariously, what they did is take some number of pixels from the image and add them together. But that's the central limit theorem in action: the pixel values are random, they're independent, and you've summed them, so the bucket indices actually form a normal distribution. That's not the distribution you want for a hash table; you want a uniform distribution. So we literally changed one character: we changed the plus to an xor. And we got this. Okay, I'll take the applause. I mean, it was only a 9 percent speed-up, but it was one character, so I think it's the biggest bang for buck ever recorded in optimization effort.
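The hash-function bug is easy to reproduce in miniature. In this C++ sketch (made-up data and table size, not dedup's actual code), a sum-based hash concentrates random chunks in a narrow band of buckets, per the central limit theorem, while an xor-based fold spreads them out.

```cpp
#include <cstdio>
#include <random>

constexpr int BUCKETS = 16384;  // power of two, larger than any 64-byte sum

unsigned hash_sum(const unsigned char* d, int n) {
    unsigned h = 0;
    for (int i = 0; i < n; ++i) h = h + d[i];          // the original: '+'
    return h % BUCKETS;
}

unsigned hash_xor(const unsigned char* d, int n) {
    unsigned h = 0;
    for (int i = 0; i < n; ++i) h = (h << 5) ^ d[i];   // the spirit of the one-character fix
    return h % BUCKETS;
}

int main() {
    static int sum_hist[BUCKETS], xor_hist[BUCKETS];
    std::mt19937 rng(1);
    std::uniform_int_distribution<int> byte(0, 255);
    for (int item = 0; item < 100000; ++item) {
        unsigned char chunk[64];
        for (auto& b : chunk) b = (unsigned char)byte(rng);
        ++sum_hist[hash_sum(chunk, 64)];
        ++xor_hist[hash_xor(chunk, 64)];
    }
    int sum_used = 0, xor_used = 0, sum_max = 0, xor_max = 0;
    for (int b = 0; b < BUCKETS; ++b) {
        if (sum_hist[b]) { ++sum_used; if (sum_hist[b] > sum_max) sum_max = sum_hist[b]; }
        if (xor_hist[b]) { ++xor_used; if (xor_hist[b] > xor_max) xor_max = xor_hist[b]; }
    }
    // Summing piles items into a normal-shaped band; xor spreads them out.
    std::printf("sum-hash: %5d buckets used, max load %d\n", sum_used, sum_max);
    std::printf("xor-hash: %5d buckets used, max load %d\n", xor_used, xor_max);
}
```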
So what did it predict? We can also measure the accuracy of the prediction. We knew that the blocks per bucket went from 76-ish down to 2, which is a 96 percent traversal speed-up, and going back to the causal graph, that predicted a 9 percent overall speed-up, which is what we got. So it's working.

Finally, I'm going to talk about SQLite. I have no time left, but SQLite is pretty awesome. It's widely used, as you all know, but it has this weird thing: a kind of virtual table that gets set up at compile time, so that you actually indirect through a config to execute a function like pthread_mutex_unlock. You might think: why are you telling me about this? Well, everything looks like this: an indirect call that could be a direct call, which would be faster. An indirect call is not that slow, but it's almost the same cost as pthread_mutex_unlock itself, which means you've just doubled the length of all of your critical sections. That's not great. Coz will highlight all of these lines and say you should definitely optimize them. So we undid all of the indirection, making it so that at compile time SQLite's unlock resolves to a direct call instead of going through the table, and it sped things up by 25 percent. If you look at a traditional profiler, by the way, it says those calls take something like 0.0001 percent of the time; you would never consider trying to optimize that code.

We did this for a bunch of programs, and we got some crazy speed-ups. My favorite: a 68 percent speed-up from replacing a custom barrier with a standard barrier. Again, people should stop rolling their own.

So, to conclude: you can take a picture of this to jump to the work from our lab, which is plasma-umass.org. I talked today about sound performance analysis and effective performance profiling. Everybody should go use Coz. Thanks for your attention. [Applause]
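As a coda to the SQLite example above: the cost of that indirection is easy to probe with a micro-benchmark. This hypothetical C++ sketch (not SQLite's actual code) times a lock/unlock loop where the unlock is either a direct call or a call through a function-pointer table; exact numbers vary by compiler and CPU, and modern indirect-branch prediction can narrow the gap.

```cpp
#include <chrono>
#include <cstdio>
#include <mutex>

std::mutex m;

void unlock_direct() { m.unlock(); }

struct VTable { void (*xUnlock)(); };  // indirection akin to SQLite's config table
VTable vt = { unlock_direct };

template <typename F>
double time_it(F f) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 10'000'000; ++i) { m.lock(); f(); }
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    return dt.count();
}

int main() {
    double direct   = time_it([] { m.unlock(); });    // direct call
    double indirect = time_it([] { vt.xUnlock(); });  // call through a pointer
    std::printf("direct %.3fs  indirect %.3fs\n", direct, indirect);
    // If the indirection costs about as much as the uncontended unlock,
    // every critical section is effectively twice as long.
}
```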
Info
Channel: Strange Loop Conference
Views: 259,811
Rating: 4.9523182 out of 5
Keywords: performance, optimization, moore's law, profiling, memory layout, compiler
Id: r-TLSBdHe1A
Length: 42min 15sec (2535 seconds)
Published: Sun Sep 15 2019