Simulating A Real-World System (Coffee Shop) In Go

Reddit Comments

Watched this talk so often already. #Brewput, love it <3

👍 14 👤 u/littlebluebrown 📅 Jan 31 2019

Really cool, it teaches some principles with a simple example.

👍 6 👤 u/ianovici 📅 Jan 31 2019

Love real world practical examples in Go!

👍 4 👤 u/nwss00 📅 Jan 31 2019

Thanks for sharing! I'll definitely be trying this out

👍 3 👤 u/Gopher128 📅 Jan 31 2019

This is a must-watch for anyone involved with Golang. It is incredible how a great real-life example can help illustrate these principles. Excellent presentation!

👍 3 👤 u/rocajuanma 📅 Feb 01 2019

The original video was previously posted to /r/golang (without any comments however).

👍 2 👤 u/dchapes 📅 Feb 01 2019
Captions
I'm going to start with a little love story. I fell in love in 2009: I fell in love with Go's concurrency model, at a tutorial taught by Rob Pike. You see, I had been programming distributed systems in C++ at Google for several years, and I thought of concurrency in terms of callbacks and locks and thread pools. Go introduced me to a new way of thinking about my programs.

For example, I had a couple of libraries for reading and writing data in replicated Bigtables, a large-scale distributed storage system, and each of my libraries for reading and writing was seven hundred lines of twisty, callback-ridden C++. So I decided to try converting them to Go, and I was able to get them down to under a hundred lines. The code was so much clearer, and it made the concurrent algorithm so much simpler to understand, that I was actually able to combine them. This, to me, was a revelation. After that experience I began contributing to the Go libraries inside Google, and I joined the team full-time in 2011 to make Go a production language. While my work has focused on scaling Go to larger systems and teams, my love of the language has always centered on its concurrency model.

Today I want to talk to you about how studying real-world systems can help us become better Go programmers. Go was my first practical introduction to CSP, the communicating sequential processes model of concurrency. For me, the real charm in this model is how it mirrors the real world. I live in New York City, a place that's constantly striving to scale with increasing population and tourism, and I see concurrency and communication in the myriad systems that keep the city running. When I ride the subway to work, I'm reminded daily of the system's limited capacity and variable latency. Every election day, I'm reminded of the massively parallel computation we run to sum up a few very important numbers: we see concurrency and contention throughout that process, and repeated aggregation and communication as we work towards the final totals. But most of all, I see concurrency in services and systems that serve requests, systems that need to deliver results quickly and reliably, handle incoming load gracefully, and scale to meet the city's ever-growing demands.

Today I'm going to present a simulation, written in Go, of one real-world system: a coffee shop. I created the simulation to explore a few properties of services: latency, throughput, contention, and utilization. I evaluated several implementations of the coffee shop in Go, each one mapping to a slightly different real-world scenario. I hope to show you the power in Go's model and what we can learn about system dynamics with this simulation.

The simulator is a kind of benchmark harness that exercises various implementations and measures their performance. This diagram shows the goroutines and channels that form the harness. The first stage generates load, in this case customers for the coffee shop. The second stage executes the function being measured, in our case an "order coffee and wait" function; this stage measures how long the function takes to execute and reports that duration to the final goroutine, which aggregates the results. We vary the number of goroutines in the second stage, which you can think of as the number of people requesting a coffee simultaneously, and we'll vary the function being executed by that stage to explore the design space and find the best-performing coffee shop.
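To make that structure concrete, here is a minimal sketch of such a harness in Go. It follows the three stages described above (load generator, timed workers, aggregator), but every name and constant, including the stubbed brew function, is my illustration rather than the talk's actual code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// brew stands in for the function under test, the talk's
// "order coffee and wait"; here it just takes ~4ms of wall time.
func brew() {
	time.Sleep(4 * time.Millisecond)
}

func main() {
	const customers = 4 // goroutines in the second stage
	const orders = 100  // total load generated by the first stage

	jobs := make(chan int)
	results := make(chan time.Duration)

	// Stage 1: generate load (customers arriving at the shop).
	go func() {
		for i := 0; i < orders; i++ {
			jobs <- i
		}
		close(jobs)
	}()

	// Stage 2: run and time the function being measured.
	var wg sync.WaitGroup
	for c := 0; c < customers; c++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range jobs {
				start := time.Now()
				brew()
				results <- time.Since(start)
			}
		}()
	}
	go func() {
		wg.Wait()
		close(results)
	}()

	// Stage 3: aggregate the measured durations.
	var total time.Duration
	count := 0
	for d := range results {
		total += d
		count++
	}
	fmt.Printf("%d lattes, mean latency %v\n", count, total/time.Duration(count))
}
```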
Our order-coffee-and-wait function makes a latte. In our simulation this happens in four steps: step 1, grind the coffee beans; step 2, make espresso using the grounds; step 3, steam the milk; and step 4, combine the espresso and the milk to make the latte. By default, each of these steps uses the CPU for one millisecond, and that's a full CPU burn: nothing else can run on that CPU at that time. So if we do the four steps in sequence, the total time to prepare the latte is four milliseconds.

This code shows the ideal brew function, which runs the four steps and returns the latte. The first step grinds the coffee using the grinder and returns the grounds. The second step prepares the coffee using the espresso machine and the grounds. The third step steams the milk using the steamer, and the final step makes the latte using the coffee and milk.

These two charts show the performance of our coffee shop as we vary the number of CPUs from one to six. In the ideal implementation, we can prepare as many lattes in parallel as our CPUs can handle. So with one CPU we can prepare one latte in four milliseconds; our throughput, or rather our brewput, is 250 lattes per second. That's the starting point on the left-hand chart. Throughput scales linearly with more CPUs, so at six CPUs we get 1,500 lattes per second. On the throughput chart, higher values are better. The right-hand chart shows the median time required to make a latte: the latency stays flat at four milliseconds. On this chart, lower values are better. We'll return to charts like these throughout the talk to compare our implementations.

Unfortunately, our ideal coffee shop isn't too realistic. In a real coffee shop, each of the first three steps requires a specific machine: a grinder for grinding the coffee beans, an espresso machine for making the espresso, and a steamer for steaming the milk. In the simulation code, these machines are real data structures that track the latency of each stage. Just as it would not work for multiple people to use the same machine simultaneously, it is a race for multiple goroutines to use the same machine data structure simultaneously. And indeed, if we run the ideal implementation with multiple CPUs and enable the race detector, we get a runtime error indicating the data race. So our first lesson is: remember to test with the race detector. But you all knew that already. There's another useful tip here: if you create a simulation in Go, you can use the race detector to check that you're synchronizing access to shared resources properly.

So how can we prevent multiple goroutines from accessing the same machine simultaneously? One way would be to put a lock on the entire set of machines. This would be, to use a real-world analogy again, like putting all the coffee machines in a small kitchen and only allowing one person in at a time. The code on the right implements this whole-kitchen locking scenario.

When we measure this scenario, we find the exact opposite of the ideal scenario: throughput stays flat at 250 lattes per second, and latency grows linearly with CPUs. Why? Because there's no way for more than one person to make a coffee at a time. Each additional CPU is another person waiting in line to make their coffee. When the nth person joins the line, they must wait four milliseconds for each of the preceding n-1 people.
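Here is a sketch of that whole-kitchen scenario, assuming a single sync.Mutex guarding all four steps; the names and counts are illustrative, and time.Sleep stands in for the talk's 1 ms CPU burn. Removing the lock recreates the racy "ideal" version, which running under `go run -race` will flag, as the talk notes.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

var kitchen sync.Mutex // one lock guarding the entire set of machines

// step stands in for one millisecond of machine work.
func step() { time.Sleep(time.Millisecond) }

// brewLocked is the whole-kitchen scenario: only one goroutine may be
// "in the kitchen" at a time, so the lock is held for all four steps.
func brewLocked() {
	kitchen.Lock()
	defer kitchen.Unlock()
	step() // grind
	step() // press
	step() // steam
	step() // combine
}

func main() {
	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < 6; i++ { // six people wanting coffee at once
		wg.Add(1)
		go func() {
			defer wg.Done()
			brewLocked()
		}()
	}
	wg.Wait()
	// Brews serialize behind the kitchen lock: roughly 6 x 4ms total,
	// and the last person waits for all five people ahead of them.
	fmt.Println("elapsed:", time.Since(start))
}
```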
Of course, we know we can do better than this, because we see better in the real world. In the real world, different people can use different machines simultaneously: as long as the kitchen is big enough, one person can use the grinder while another makes espresso and a third steams their milk. Now, this isn't a traditional coffee shop with baristas behind the counter; this is more like a self-serve coffee kitchen, where anyone can use any machine. We'll come back to the barista model later.

In Go, we can implement this scenario using a mutex for each machine. A goroutine locks the grinder mutex, uses the grinder, then unlocks it; it then locks the espresso machine mutex, and so on. So now, instead of locking the whole kitchen for four milliseconds, we're just locking each of the three machines for one millisecond each. The fourth phase, combining the coffee and milk to make the latte, doesn't need any locks at all.

Now our throughput and latency curves look more interesting as we add CPUs. Up to four CPUs, throughput grows linearly and latency stays flat, just like in our ideal implementation. But with the fifth and sixth CPUs, throughput stays flat and latency starts increasing, somewhat like in our whole-kitchen locking scenario. What's happening here? When there are just three people in the kitchen, each takes their turn at a machine and moves on. Beyond three, each additional person needs to wait their turn, because all the machines are in use. Notice, though, that the latency is increasing much more slowly than in the whole-kitchen locking scenario. This is because the pipeline of people making coffee is advancing every millisecond instead of every four milliseconds. We enabled greater parallelism, and increased our throughput, by minimizing the time we hold the lock on any one resource. We're also avoiding holding any locks at all during the fourth phase, as there's no contention on any shared resource in that step.

But why does throughput flatten after four CPUs? To understand what's happening, let's model how this coffee making schedules onto CPUs. The first coffee runs on CPU one, taking one millisecond per stage. The second coffee waits one millisecond to use the grinder and then can proceed in parallel on CPU two. The third coffee waits again for the grinder and proceeds on CPU three, and the same goes for the fourth coffee on CPU four. But what about the fifth? By the time the grinder is free, CPU one is free again, so it can run there. It can't possibly start sooner, because the grinder is fully utilized. This pattern continues indefinitely: there's no way for this system to use more than four CPUs in parallel, because of contention on the grinder. That is why the throughput of the system flattens at 1,000 lattes per second: four CPUs running at 250 lattes per second each.

So, given this structural limit, how can we increase the performance of our system? Let's think about the real world again. If you wanted to make a coffee and saw people lining up to use the machines, what would you do? Well, it depends on how long the line is; you might just give up. But let's assume you really need that coffee. The problem is that our critical resources, the three machines, are running at capacity. You'd probably find another place to get your coffee, and this suggests the remedy: if there were a second set of machines, or a second coffee shop, more people could make coffee simultaneously. So what happens if we double our machines, so that we have two grinders, two espresso machines, and two steamers? Let's simulate this and find out.

One way we can implement this scenario in Go is by creating a buffered channel of size two for each machine type and putting two machines on each channel. Now, instead of locking a mutex, a goroutine receives from the channel to get access to one of the two machines. When it's done with the machine, the goroutine sends the machine back on the channel.
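Here is a minimal sketch of that machine pool, with illustrative names. Note that a pool of size one behaves like the per-machine mutex of the fine-grained locking scenario, so the same code covers both designs.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// machine is a stand-in for the simulator's machine data structure.
type machine struct{ name string }

// use stands in for one millisecond of machine work.
func (m *machine) use() { time.Sleep(time.Millisecond) }

// newPool puts n identical machines on a buffered channel: receiving
// acquires a machine, sending returns it. With n == 1 this behaves
// like the per-machine mutex in the fine-grained locking scenario.
func newPool(name string, n int) chan *machine {
	ch := make(chan *machine, n)
	for i := 0; i < n; i++ {
		ch <- &machine{name}
	}
	return ch
}

func brew(grinders, espressos, steamers chan *machine) {
	g := <-grinders // acquire a grinder...
	g.use()
	grinders <- g // ...and return it as soon as we're done

	e := <-espressos
	e.use()
	espressos <- e

	s := <-steamers
	s.use()
	steamers <- s

	time.Sleep(time.Millisecond) // combine: no shared machine needed
}

func main() {
	grinders := newPool("grinder", 2)
	espressos := newPool("espresso", 2)
	steamers := newPool("steamer", 2)

	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < 6; i++ { // six concurrent customers
		wg.Add(1)
		go func() {
			defer wg.Done()
			brew(grinders, espressos, steamers)
		}()
	}
	wg.Wait()
	fmt.Println("elapsed:", time.Since(start))
}
```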
With two sets of machines, we see ideal performance up to six CPUs: throughput increases linearly and latency stays flat.

Here's another way to compare the performance of the implementations we've seen so far. The chart on the left shows the throughput of each implementation when running with six CPUs. The chart on the right shows their latency distributions: rectangles indicate the middle 50% of latencies, while the vertical lines indicate the middle 90%. So, for example, the whole-kitchen locking scenario has a throughput of just 250 lattes per second, and its latencies span between 22 and 25 milliseconds, with most of them falling between 23 and 24 milliseconds. You can see that whole-kitchen locking has much lower throughput and much higher latency than our ideal implementation. The fine-grained locking implementation does better, peaking at around 1,000 lattes per second due to the structural limit we saw earlier. We overcame that structural limit by adding more capacity, that is, more coffee machines: with two of each machine, we achieve ideal performance. Doubling again to four of each machine, which is the final column, gains us nothing, because we've maxed out our six CPUs. Now our CPUs are the limiting factor, so to increase performance further we'd need more CPUs. In each case, the limiting factor has moved.

So far we've simulated a shared coffee kitchen, where each person making coffee takes turns using a shared set of machines. But that's not what we see in the real world. In the real world, we usually see a small number of baristas operating the machines. This is more efficient, because there are fewer people moving around the kitchen, and baristas operate the machines much more quickly than the average customer. The next simulation I'll present is a coffee assembly line, in which one person operates each machine: the first person operates the grinder, passes the grounds to the second person, who makes the espresso, and so on.

Let's see how this is implemented in Go. Finally, we have some goroutines and channels! In this pipeline we have three stages: a grinder, a presser, and a steamer. The grinder receives new orders on the orders channel, grinds the beans, adds the grounds to the order, and passes it along to the next stage. The presser makes espresso using the grounds and passes the coffee along to the steamer. The steamer steams some milk and passes it back to the goroutine waiting on the order. That goroutine combines the coffee and milk to make the latte.

When we look at the performance of this pipeline (it's the green line), it looks a lot like the fine-grained locking implementation. The latency is identical, but the throughput is slightly less until we reach six CPUs. Why is this? Latency is the time it takes to run each of the four stages in sequence, so it makes sense that it would remain the same in the locking and pipeline implementations. But the pipeline's throughput is less because it is not utilizing the three contended machines as efficiently.
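Here is a rough sketch of such a pipeline, with illustrative names and time.Sleep standing in for each stage's millisecond of machine work; the unbuffered handoff channels reproduce the blocking behavior discussed next.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// order carries a latte through the pipeline; done is closed once the
// steamed milk comes back and the customer can combine their drink.
type order struct {
	done chan struct{}
}

// work stands in for one millisecond of machine work at a stage.
func work() { time.Sleep(time.Millisecond) }

// stage runs one worker (a barista) who owns one machine: receive an
// order, do this machine's work, hand the order to the next stage.
// The unbuffered out channel means the handoff blocks until the next
// barista is ready (the "standing there holding the grounds" case).
func stage(in <-chan *order) <-chan *order {
	out := make(chan *order)
	go func() {
		for o := range in {
			work()
			out <- o
		}
		close(out)
	}()
	return out
}

func main() {
	orders := make(chan *order)
	steamed := stage(stage(stage(orders))) // grinder -> presser -> steamer

	// Deliver finished orders back to the waiting customers.
	go func() {
		for o := range steamed {
			close(o.done)
		}
	}()

	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < 6; i++ { // six concurrent customers
		wg.Add(1)
		go func() {
			defer wg.Done()
			o := &order{done: make(chan struct{})}
			orders <- o
			<-o.done
			work() // combine coffee and milk; needs no shared machine
		}()
	}
	wg.Wait()
	close(orders)
	fmt.Println("elapsed:", time.Since(start))
}
```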
Here's a real-world analogy to help you understand this. Consider what happens when the person using the grinder finishes grinding the beans. In the locking implementation, that person steps away from the grinder and starts waiting for the espresso machine, so someone else can start using the grinder immediately. But in the pipeline implementation, the person grinding the beans has to wait to hand off the grounds to the person making the espresso before they can start the next grind. If the second person is busy making the espresso, the first person just stands there, holding out the grounds. This is a blocked channel send in our implementation, and it leaves the grinder idle, under-utilizing our resources.

Of course, in the real world, the person would just set the grounds down on the counter and start grinding more beans for the next coffee. That counter space allows the different stages of the pipeline to stay busy even if they're a little out of sync. And in the real world, where grinding beans or making espresso might take more or less time due to natural variance, we need that flexibility. We need it in our computer systems as well, because there's natural variance in these systems too.

So how do we model that counter space in Go? We do so by adding buffers to the channels that connect our pipeline. These buffers absorb the variance between stages, and so increase throughput and utilization. I tested pipelines with buffer size 1 and buffer size 10 (their lines overlap on the charts). Both of these buffered pipelines outperform the unbuffered pipeline, achieving the optimal throughput of 1,000 lattes per second. This is because they eliminate the requirement for the stages to proceed in sync, and so allow more work to proceed in parallel.
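Modeling the counter space amounts to a one-line change in the pipeline sketch above: give each stage's output channel a buffer. This fragment assumes the order and work definitions from that sketch; the buffer sizes are the ones the talk measured.

```go
// A drop-in replacement for the stage function in the sketch above.
const counterSpace = 1 // "counter space" between stages; try 1 or 10

func stage(in <-chan *order) <-chan *order {
	// A buffered handoff lets the upstream barista set the grounds
	// down and keep working, even if the next stage is briefly busy.
	out := make(chan *order, counterSpace)
	go func() {
		for o := range in {
			work()
			out <- o
		}
		close(out)
	}()
	return out
}
```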
But while we see a benefit from buffering, it's important to keep those gains in perspective. These charts compare all the implementations we've seen, plus two "multi-pipe" scenarios. The multi-pipe scenarios run multiple copies of the buffered coffee pipeline, much like the multi-machine scenarios used multiple copies of the various machines; you can think of multi-pipe 2 as a world with two coffee shops next door to each other. As these charts show, the differences between fine-grained locking and the various pipelines, the middle four bars, are all pretty minor. The two big gains came from structural changes. The first was moving from a whole-kitchen lock to fine-grained, per-machine locks. This reduced the time spent in any one critical section, so that more work could run in parallel. The second was recognizing when our existing resources were fully utilized and adding more capacity, in the multi-machine scenarios. This allowed us to max out our six CPUs. The CPUs are now our limiting resource: if we want more throughput, we'll need to add more CPUs to run our simulation.

In preparing for this talk, I tried many more scenarios than the ones I've shown you. I tried changing the number of stages, changing their durations, and adding random noise. I tried making the steamed-milk stage run in parallel with the other stages. And what's remarkable is how little any of that mattered: most of the changes had only small effects on performance, while the structural changes provided major gains. The lesson is to identify and remove the structural barriers to parallelism in your system; removing these barriers will help your system scale. We did this today by reducing the time spent in critical sections. We did it again by adding more replicas of contended resources. And we did it with buffering, which allowed upstream pipeline stages to proceed without blocking on downstream stages. While we focused on the benefits of these changes, it's important to remember that they also have costs, in particular increased resource use. I also encourage you to take inspiration from real-world systems; try to understand why they are the way they are. These insights will help you understand structural performance issues and help you discover new designs.

One last thing: I encourage you all to download the simulator and play with it. The code is straightforward, less than a thousand lines of Go for everything. Try some more scenarios, and dig into the results using Go's profiling tools; the execution tracer is a particularly useful tool here. Or perhaps try modeling a new real-world system in Go: you'll learn more about how that system works, and you'll learn a lot about Go itself. Thank you. [Applause]
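For readers who do download the simulator: one way to capture an execution trace is the standard runtime/trace package, sketched below. runSimulation is a hypothetical stand-in for whichever scenario you want to inspect, and the resulting file can be viewed with `go tool trace trace.out`.

```go
package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

// runSimulation is a hypothetical stand-in for the scenario under study.
func runSimulation() { time.Sleep(100 * time.Millisecond) }

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Record an execution trace while the simulation runs, then
	// inspect it with: go tool trace trace.out
	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	runSimulation()
}
```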
Info
Channel: Coding Tech
Views: 70,772
Keywords: go, golang, go concurrency, go programming, golang programming language
Id: jJS6G7irZSc
Length: 17min 36sec (1056 seconds)
Published: Fri Dec 15 2017