Golang UK Conference 2015 - Evan Huus - Complex Concurrency Patterns with Go

Captions
hello, my name is Evan, I work at Shopify. We're an e-commerce company; we're mostly known as a Ruby on Rails shop, but we do a lot of other stuff in Go as well, so here I am. This whole talk has its roots in me getting fed up with yet another article online that uses pipelines and worker pools to talk about Go's concurrency model. I mean, concurrency is so much more than that. Go's ecosystem tries to sell us on this vision of concurrency as almost a solved problem: now that we have channels and goroutines, these wonderful primitives, these tools are enough. But they're not, they're just primitives. You can build great things with them, but they are not in themselves a solution to a lot of problems. So yes, in theory you can build your pipelines and worker pools, but concurrency doesn't always work like that. And don't get me wrong, Go is a huge step forward. Has anybody here ever tried to use pthreads or one of the older concurrency models? Yeah, I'll take Go any day of the week. But even with goroutines and channels, when your system is complex, questions like this can still be hard, and the answers are not always obvious. So yes, Go makes concurrency easier, a lot easier, and makes it way more accessible to a lot more developers, which is great, but it's still hard. This talk is really quite simply structured: it's basically a case study of a library that we wrote at Shopify. It's open source, it uses a lot of concurrency tricks, and it's written in Go, obviously. I'm just going to give you a little bit of context on the library, what it does and what we use it for, and then we're going to walk through it. We're going to pull it apart, look at some of the patterns we used and some of the trade-offs we made. We learned a ton building this library (it was our first real big project in Go), and hopefully you'll learn some of that too. So let's start with a little bit of context. Does anybody recognize these
people on the screen? No? That's okay, some of them are pretty obscure. On the left there you have Franz Kafka, an author known primarily for The Metamorphosis and a few other books. In the middle you have José Saramago, a Portuguese author heavily influenced by Kafka. And on the right you have Donald Knuth, a computer scientist who probably needs no introduction: The Art of Computer Programming and many other seminal works. You're probably wondering what these people have to do with anything, so let's start with Kafka. Kafka is also the name of a Java-based Apache project. It was started at LinkedIn a couple of years ago, it's for distributed publish/subscribe messaging, and it's almost already the de facto standard for that field. It's very popular, and we use it at Shopify. Now, in Kafka, messages are grouped into topics, which are semantic: if you're an e-commerce company like us, maybe you have a topic for your checkouts. Topics are then subdivided into partitions, which are not semantic; they're just for parallelism, so they're like database shards. And partitions are then individually led by brokers and replicated onto other brokers, where brokers are just your nodes in the cluster. Now, the interesting thing about Kafka is that clients are very thick. It's not just a network protocol you have to implement; there's a great deal of logic in the client that deals with how it fails over between brokers and all sorts of other interesting stuff. At Shopify we wanted to use Kafka, but we didn't want to use the JVM; we're not a Java shop, and we were not keen on putting the JVM everywhere. So we wrote Sarama, and this brings us to our second author: Saramago. It's a terrible, terrible pun, I apologize; the name stuck, and I couldn't do anything about it at that point. So Sarama is a client library for Kafka implemented in Go. It implements the wire protocol and all of the complicated logic on top. The original version, way back in 2013 on Go 1.0, was a proof of concept. It was very simple; we
ignored a lot of the complexity, and we made it as simple as possible just to make it actually work, because the complexity was, to be honest, quite scary at the time, and premature optimization is the root of all evil, right? And that brings us to our third author, because that's his line: premature optimization is the root of all evil, but we should not ignore that critical 3% of optimization opportunities. And it was that critical 3% that did us in, because it turns out that in Kafka, that complexity is there for a reason. The first draft of Sarama was very simple, very easy to understand; it worked, it was correct as far as that went, but it was also an order of magnitude too slow. So we salvaged what we could and we rewrote it. The second draft of Sarama, the one that eventually became our 1.0, had a couple of requirements. It had to be fast, which in terms of Kafka means it had to be able to batch messages together, trading off some latency on some of those messages in return for throughput. It had to be configurable: now that you're making a trade-off, that trade-off has to be configured depending on the needs of your application. If your application needs low latency and is willing to trade off a little throughput for that, we should let you do that. It had to be resilient: Kafka is a distributed system, distributed systems often fail in pieces, and if one node fails, then that failure should not affect the rest of the system. The rest of the system has to keep going, or else what's the point of making it distributed in the first place? And it had to be correct, which is almost obvious, but given the other requirements it's not as simple as it sounds; Kafka has some interesting requirements. So we wrote a producer and a consumer: the producer sends messages to Kafka, and the consumer reads messages back from Kafka. Let's tackle the producer first. This is a diagram of what the producer looks like in overview. It's a fairly scary diagram, despite the smiley face at the top.
That's the user, by the way: the user sending messages into the producer to be sent to Kafka. In this diagram, every arrow is a channel, and every labeled box is a goroutine, so that's more or less the complete graph of all of our goroutines talking to each other. We're going to work our way more or less from top to bottom, following the flow of a message as it is sent from the user, down, and out to the actual Kafka cluster. A quick disclaimer: the code that I'm going to show looks like Go, and it was copied from the actual library, but it's been simplified for the slides so much that it's basically pseudocode at this point. There are a lot of cases I'm explicitly ignoring, so don't read too much into it. Let's start at the beginning. This is the top fragment of the producer; this is where the user sends their message in and it goes through the initial processing. This top layer primarily tackles the third requirement I mentioned, resiliency, which basically means keeping errors isolated. When an error occurs (because errors are going to occur), everything else, all of the other topics, all of the other partitions, all of the other brokers, they all have to keep working, because other things are relying on them. Now, in Kafka, the most granular level at which you can receive an error is the partition level; the partition is the most granular thing here. And the simplest way in Go to isolate something is to put it in its own goroutine. That's a really basic trick, really straightforward. So the very first thing we do here is fan out: we take all of the messages in at the dispatcher, we fan them out from the dispatcher to the topic producers, which are one per topic, and from there to the partition producers, which are one per partition. And the fan-out code is really pretty straightforward; it looks more or less like this. You have a map of your handlers, your output channels; you read a message from your input channel, you look up the appropriate output channel, and you send it on its way; if there is no output channel, you create one first.
But this isn't actually quite enough. This isolates every partition to its own goroutine, great, but there's a problem, and that problem is that every step of the way here, the dispatcher and the topic goroutines and the partition goroutines can all need to make network calls, to fetch metadata from the cluster, various things like that. And network calls can block, network calls can be slow; they can fail in interesting ways, and they can fail in really surprising ways sometimes. So what happens in this situation if our partition goroutine blocks? Well, if it's stuck in a network call, then it's not reading from its input channel; and if it's not reading from its input channel, then the topic goroutine that's trying to feed it is going to block writing to that channel; and if the topic goroutine blocks, then it's not reading from its input channel, and the dispatcher is going to block; and if the dispatcher blocks, then of course the user is going to block, and all of a sudden the entire system has ground to a halt. That entire chain has blocked, and everything else is not receiving any messages, all because of one bad network call, which is really not ideal. Kafka is mostly an in-data-center technology, so you can set your timeouts pretty low, and that helps, but there's actually a better option, and that is circuit breakers. Circuit breakers are becoming a very common pattern these days in distributed systems, and in particular in microservice architectures. They work basically by protecting code that can fail, so that even when it fails, it at least fails quickly. And that's fine for us: things are going to fail, and all we care about is that they fail quickly, so that other messages keep flowing and everything else still keeps working. The top line here is a typical API call: it makes a network request, so it can be slow, it can fail, it can do almost anything, basically. On the bottom, you have the same API call wrapped in a circuit breaker to protect it.
To the breaker you pass a function, in this case an anonymous function, and 99% of the time it just runs that function as-is, so those two snippets of code do exactly the same thing. However, when the method starts to fail, when that API call starts to fail or be slow, then the breaker trips: the circuit opens, just like a circuit breaker in an actual electrical circuit, at which point the method doesn't even get run. The breaker just immediately returns an error saying the circuit is open, something's gone wrong, fail quickly. And that's all we need: the circuit fails quickly, the message is rejected immediately, and all the other messages keep flowing. So once a message has reached the partition goroutine, once it's been isolated to its own partition goroutine, it then passes through that little cloud in the middle of the diagram. It was really small, so I don't know if anybody saw it, and that cloud is basically my super lazy way to diagram a lot of arrows I didn't feel like drawing. It's a dynamic many-to-one mapping between partitions and brokers. Every partition in Kafka is led by a broker, by one node in the cluster, and that mapping can change: a partition can be led by one broker, and then if that broker fails, or if the cluster rebalances for some reason, suddenly that partition can be led by an entirely different node. So you have to manage this mapping somehow; you have to be able to follow it, with the producer saying: all right, I was sending these messages to this broker, and now I'm sending them over here somewhere different. The simplest and safest way to do that is to leave it entirely in the hands of each partition goroutine: make every one of them completely isolated, knowing exactly which broker they care about, and leave it at that. But that actually has a different problem. There's nothing technically wrong with
that, but it makes the speed requirement we have here basically impossible to meet, because the whole purpose behind that speed requirement is to batch messages together: to take messages destined for the same broker and put them in the same network request, to avoid the many, many round trips that you would otherwise need. And if each partition goroutine is figuring out its own broker in an isolated way, then when two partitions end up on the same broker, which is a very common occurrence, they don't know about it. They don't know that those messages can be batched together; they have no idea. So we need some centralized state, which is nothing crazy, just a global locked reference-counted map; not a terribly revolutionary concept. It's a little unusual in a language that is both channel-oriented and garbage-collected, but the lock is really the simplest way to solve this problem, and the reference counting is also kind of the simplest way to solve this problem, which sounds a bit odd, but you end up with an acquire/release pattern. You acquire a broker, and then when you're done with it, when that partition has moved somewhere else, you release it back to the pool before acquiring your next one. And those methods look something like this. On the left you have the acquire method, which takes the lock, looks up the broker in the map, and then increments the reference count before returning it; and of course, if there's no broker in the map yet, it creates a new one. And then in the release method you also take the lock, you decrement the reference count, and if and only if the reference count is zero, you delete the entry. The reference count is important here because, regardless of how many partitions are referencing that broker at any given time, from the perspective of the garbage collector it's still being referenced by the map itself. So if we didn't reference-count and delete here, those brokers that ended up unused would effectively be leaked.
So again, not a revolutionary concept: reference counting has been around for ages, and mutexes have been around for even more ages, I'm sure. A little unusual in Go, but it's just a matter of always using the right tool for the job. So once the message has reached the broker producer, the third component in our little diagram here, this is the final step in the successful path for a message: a successfully produced message ends here. We have all of these messages coming in from multiple partitions that are all headed to the same broker, which is great: we can batch them up, wrap them into a single network request, be very efficient about it, and put them on the wire. This seems like a fairly simple task; it seems like you should be able to do it in a single goroutine with just a little loop. But because it's Kafka, there are always complications. The first complication is that the network requests to actually send these messages can be very slow in the successful case. We're used to network requests timing out after 10 seconds when they're being slow; these are requests that, because of the work they generate on the cluster, can take 10 or 15 seconds to succeed, and that's not an aberration, that's normal behavior. So you have to be able to handle that, again without blocking the rest of the world for 10 or 15 seconds. The other complication here is that you have to be able to handle timer events. That configuration requirement I mentioned involves, in a lot of cases, setting timers up to say: all right, I don't want you to hold anything for more than 500 milliseconds, or what have you. Those timers have to fire, and those events have to be handled. Now, Go does not let you select on a channel and a network request at the same time, which I find a bit funny given the select statement's origin as actually selecting only on network sockets, but it doesn't let you do
that. So we split it out into two goroutines anyway: the aggregator, which does most of the real work, to be honest, and then the broker producer, which is just responsible for putting things on the wire and handling the responses. The aggregator looks something like this. It's a for loop wrapping a big select statement, and it acts like a dynamic buffer or flow controller. It builds the requests: for every input message it gets, it puts that message into the request. And then it uses this really clever trick with nil channels in select statements, which is one of those weird corners of the language that is just super useful sometimes and super confusing other times. Normally when you try to write to a nil channel, that write blocks forever (and if it deadlocks the whole program, the runtime will kill it), so people avoid it for obvious reasons. However, when you have a case in a select statement that writes to a nil channel, that case doesn't block anything; it is simply ignored, as if the select statement didn't have that case at all. Now, of course, a channel variable can be nil or not, and that can change dynamically at runtime, which means that by toggling whether a channel is nil or not, we can actually change the shape of our select statement entirely dynamically, without going into reflection or any of the real dynamic select mechanisms. So what we do here is we leave the output channel nil most of the time, so that the last case, where we feed the request into the output channel to be put on the wire, doesn't trigger 99% of the time. However, when the request is full, when we've gotten enough messages, or when the timer fires and we're going to put it on the wire anyway to avoid undue latency, then we set that output channel to a real value. So the next time through the loop, that output channel can be selected; maybe it's not the next time, maybe it's the subsequent time, but suddenly that case is available for the runtime to send on. And once we've sent it, once we've put that request on the wire, we can then reset the channel to nil, generate a new request, and that case is dead again until we're ready to send the next one. It's a very handy trick.
The broker producer, the other half of this, is really very simple: it takes one request at a time, puts it on the wire, and then handles the response. The interesting thing here is not this code in particular, but the way it interacts with the previous loop: they spin together, almost like gears on a bicycle, because for every n iterations that the aggregator loop makes through its select, this loop makes exactly one. Now, for successful messages, that's it: they're put on the wire, they come back with a success response, the success response requires basically nothing because there's nothing else to do, and the garbage collector does its thing. However, unsuccessful messages are another story. Unsuccessful messages have to be retried, of course; you can't just give up the very first time they fail. And that's the final component of our producer: the messages that are being retried. How do we retry them? Again, it's not as simple as it sounds, because Kafka is kind of weird and kind of crazy. There are a couple of problems with just sending the same message again, which is the obvious solution. The first is that you can have split responses: when you send a request and get your response back, that response can say, okay, well, these messages succeeded, these messages failed, these messages failed because they're not even on this broker anymore and should have been sent over there, and these messages at the end also succeeded. So you have to be able to parse that out and say: the successful ones we can ignore, the failed ones we have to retry, and the ones that have been moved have to end up on some other broker, in some other goroutine in our system. And not only do they have to end up there, they have to end up on that
other goroutine in the right order, relative to all of the other messages that are still flowing down through the top part of our producer, which is another interesting problem. So we looked at two broad options here; you can see both of them on the left and the right. We didn't really like either of them, but let's leave that aside for a second. The first option we looked at was for the partition goroutine to keep a queue, a list of all of the messages it had seen, because it has seen them in the right order: it's the last point before that cloud, that dynamic muxing, that sees all of the messages in the right order. Then, when the broker producer receives an error or something like that, it can just send a message to the partition producer and say: all right, these messages succeeded, you can drop them from the queue; these messages failed, please retry them. And since it has them in the right order, it can do the right thing. The other option we looked at was for each message individually to carry a little bit of metadata about whether it had been retried and on which brokers; then, when a message fails, the broker producer can simply send it right back to the top, right back to the dispatcher, for it to flow through the same path as everything else, with that extra metadata attached, and the metadata can then be used at a later point to shuffle the messages around and make sure they end up in the right order. So these were the two options we looked at, but they both had a problem, and that's why we didn't really like either of them: they both add a loop to the graph. If you think of the diagram I showed earlier as a graph, a directed graph, then both of them add a cycle, and cycles are bad. In the simple case, say you have two goroutines that just try to talk to each other: if A and B both try to send at the same time, they're both going to block; that's
a deadlock (if every goroutine ends up stuck like that, Go will panic and your program will die), which is not great. In this simple case it's solvable with a select statement: you can just put in a select statement like that, and Go will take care of saying, all right, one of you is going to read and one of you is going to write, and then the next time through, the other one can read and the other one can write, and everybody's happy. But this doesn't actually work for us, and it won't work for a lot of cases that look sort of like a pipeline at first glance, because in a lot of those cases, when you read a message in, you then have to write that message out. And if you read a message in, and then you can't write that message out, and you read a second message in, well, suddenly you have two messages and you have to write both of them out. So maybe you put them in a buffer and say, all right, go back to my select statement; but then what happens if you read a third message? You now have three messages sitting in this goroutine, three writes to perform, but you have to go back to your select statement, and maybe you'll read a fourth, and when does this stop? It doesn't. What you've constructed here is an unbounded queue, and unbounded queues are all kinds of awful. They're a serious design smell in a lot of cases, and this is why, for example, Go doesn't provide infinitely buffered channels. You can provide a channel buffer, a constant value, when you make your channel, but you cannot make a channel that has an infinite buffer. This is why unbounded queues are terrible, and if you ask in the forums, on the golang-nuts forum for example, how can I make an infinitely buffered channel, the answer you're going to get is invariably: don't do that, it's a bad idea, something else is wrong with your program if you need to do this. So that's what we did, on the principle that if it's stupid but it works, then it can't be that stupid.
Well, it is pretty stupid, and in hindsight there's a way around this, a little more complicated, where you don't need an infinite buffer at all, and we should have done that. But I only came up with that while writing this talk, like six months after implementing this, so it's too late for us now. We went with option two: a little bit of metadata, with failed messages flowing back through the same path, and we added a retrier goroutine at the bottom. That retrier goroutine is basically just a dumb implementation of an infinite channel: it loops over the messages, stores them in a buffer, in a slice, and then writes them out when it gets the chance. There are practical considerations, to do with how the scheduler works and how Kafka works, that mean this buffer isn't really infinite; there's a fairly weird limit, and you can work out exactly how big it could get, which is not too bad. But this was still a mistake. If we'd known then what we know now, we wouldn't have done this; we would have worked out a way to avoid the infinite queue. So this is the completed producer; hopefully most of it makes sense now that we've gone through all of the little pieces. We'll go on to the consumer, the other half of the puzzle, shortly, but I'll pause briefly now if there are any questions specific to the producer or any of the code I've just shown. [Audience] With the part of it that correlates brokers with producers, you have that mutex acquire and release. An alternative way of looking at that, which you might want to think through, is a pi-calculus style, where you actually send the ends of channels through channels. So you can create a service that issues the ends of channels that talk to brokers, and you can have a dynamic network where the topology of the network is changing, because you're
acquiring the ends of channels to talk to brokers, and then you don't need the mutex at all. [Evan] I'm not 100% sure I understand, but every broker, the resource that's returned from those acquire and release methods, is essentially a channel, so that is a topology change when a broker is acquired. [Audience] What I'm saying is you can actually send the ends of channels through channels. [Evan] Yes, so you could do that: you could implement that map as a goroutine and have it send and receive the channels itself. We looked at that; it basically ends up being slightly slower and slightly more code for exactly the same result. So if you're really set on avoiding mutexes, then it's the way to go, but it depends on the situation, for sure; for us it wasn't really necessary. Anybody else? Cool, moving on to the consumer. The consumer came with its own set of challenges. It looks like this; this is hopefully slightly less scary now. We wrote it second, after the producer was done and in production, so we were able to apply a lot of what we learned doing the producer to the consumer right off the bat, which saved us a lot of time and a lot of headache. It doesn't flow top to bottom the way the producer nicely did; the consumer flows almost bottom to top, or almost kind of inside out, which basically just means that I'm going to jump around a bit, so I apologize, it'll probably be a little harder to follow. But first, a more general principle that we learned from the producer, which saved us a lot of time here, and that is how goroutines are structured. There are a couple of obvious ways to write your goroutines: as an anonymous function or as a named function. Named functions are easier to manage; an anonymous function is indented more and it's embedded in something else, so it's a
little harder to manage as it gets more complex. Typically, as a goroutine grows, it gets a name and gets moved out to its own proper method. But when named functions themselves get too complex, well, we started experimenting with this pattern, which I call structuring; there's probably a better name for it in the literature somewhere. It's a little more code, but it makes it a lot easier to manage very complex goroutines. Anonymous and named functions are very straightforward: you can go func whatever, or you can declare a func with a name and then go that name, and refactoring between the two is very easy. If your goroutine starts at two lines, you make it anonymous; when it becomes ten or fifteen, you give it a name and move it out, and that's fine. But when it becomes very long, when it becomes 80 or 90 lines with multiple nested loops, this is where you want to start moving it out into helper methods. And helper methods, as we found doing the producer, can be tricky to get right, because often the goroutine has five or six state variables inside of it, and you have to pass all of those to your helper method, and then you have to take four or five return values to see which state has been mutated, and you end up with a function signature for that helper method that is exceedingly long and really quite annoying to work with. So the pattern that we ended up with looks kind of like this. For every goroutine we have that's of any significant complexity, we create a struct, and that struct contains the parameters and any state variables associated with that goroutine. We give that struct a run method, which is just the actual contents of the goroutine, but now it doesn't need parameters, and it doesn't need a return path (goroutines can't return values anyway). And then we give it a constructor that takes the parameters the goroutine would have taken originally, constructs the appropriate struct, and then runs the run method. What this lets us do is write our helper methods as methods on that struct: you don't have to pass a bajillion parameters to them anymore, you don't have to worry about returning a bajillion values, and if you want to mutate some state, it just works; it becomes a lot easier to manage.
So we eventually refactored our producer to use this pattern everywhere, which helped a lot with readability, and we started the consumer with it right off the bat. We knew these goroutines we were writing were going to get big, so: give them a struct and a constructor right off the top. It saved us a lot of time and a lot of headache managing all of that code. So now, on to the first concrete piece of the consumer. This is another solution to a problem we've already seen: the partition movement problem again, which we saw in the producer, where a partition can be led by one broker, and if that broker goes away, or if the cluster rebalances for some reason, the partition will move somewhere else. In the consumer's case, this starts fairly simple: every broker just keeps a list saying, all right, these are the partitions I care about, and it can add to or remove from that list dynamically as partitions move around. But again, there's always a complication, and the complication here is that this can again require network requests, to ask the cluster for metadata, to say: all right, I don't own this partition anymore, it's no longer my problem, but I need to know where to put it. And again, those network requests can be slow, they can block, and they can fail in surprising and unexpected ways. So again, it becomes a question of isolation: we have to keep those network requests isolated, and not let them break everything else in the system. So we create a goroutine per partition again; we create a dispatcher here, and the dispatcher's job is simply to say: all right, this partition that I'm dispatching is now in limbo, it's not owned by the
It's not owned by the place it used to be owned by anymore, but I have to figure out where it's owned next, and then make that network request. I can block there, because there's nothing else depending on me; I can take my time, I can retry, I can handle errors. And when I've figured it out, when I know where that partition is heading, I can send it on. In handling the interaction between the brokers (which are just those white boxes at the bottom; we'll look at what's actually inside them in a moment) and the dispatcher, you basically end up with this ownership token. The broker has the token for a given partition, and when the broker is done with it, it sends it up to a dispatcher; the dispatcher then has the token, and when the dispatcher finds out where it should go, it sends the token back to the appropriate broker. Whoever has that magical token is the goroutine that owns the state for that particular partition, and in that way we don't even need a mutex around the state, because only one goroutine can access it at a time, and that goroutine is whichever one has the token. So the dispatcher code looks basically like this. It has a trigger channel for receiving its ownership token on; when it receives the token, it finds the new leader, and if it finds one successfully, it simply sends its ownership token on and it's done. However, if there's an error, it backs off; it doesn't want to flood the cluster with useless requests, so it sleeps for a while and then simply sends itself the ownership token, which is a really clever and simple way of continuing this loop. The channel it listens on has a buffer of one for this purpose, so it sends itself the ownership token, runs through the loop again, and retries as many times as it needs
in order to find a successful new leader, and then it sends its ownership token there. The broker side of this is all in handling the response from the cluster. When its fetch request gets a response back, it loops through that response and asks: which of my partitions failed, which succeeded, do I have messages? If the request succeeded and there are messages, it simply sends those messages to the user. However, if the request failed, it unsubscribes: it deletes that ownership token from its local storage and sends it to the trigger channel of the dispatcher. So again you have this ownership token bouncing back and forth between the dispatcher and the broker, and it serves almost like a mutex, but a little bit more than that. This was again a problem of isolation, of keeping network request errors isolated to just what has actually broken. Continuing briefly with that theme: the inside of the box that handles the broker side is almost exactly like in the producer. There are two goroutines; one is responsible for most of the logic, the other for actually doing the I/O. I'm not going to get too much into that, just a few minor differences, nothing really worth exploring in depth. Now, last but not least, we get to turn a classic problem on its head. As programmers we're pretty used to dealing with user input: all right, I've asked for a number, but the user has given me a frog, or the complete text of the Gettysburg Address, or a pizza. Validating user input is a problem we're pretty used to at this point. What we're not so used to, and what surprised us here, is having to validate user output, or in particular having to deal with a user not even being there to receive their output.
The consumer sends its messages to the user on a channel; it's the nice asynchronous way to provide them with that data. But of course, writing to a channel can block, and the problem we ran into a couple of times was this: we would write a message, "hey, we have a message for you, user, here it is", to the channel, and the user wouldn't be there. The user would be hung in some other loop somewhere, or stuck in a network request of their own, or maybe that goroutine had panicked completely and gone away. We would write to this channel assuming it would work, we would block, then the goroutine that was trying to talk to that one would also block, and suddenly the entire system had ground to a halt. All of our other users, all of the other consumers, had become stuck on some internal transfer somewhere, and that one single goroutine, that one user, had crashed our system, basically, because they didn't read from a channel. So, isolation again: we have to keep that problem isolated, a recurring theme here. The first step is to put it in its own goroutine, so we spun up a response feeder goroutine that simply takes a batch of messages and writes them out to the user on that user's channel. On its own that's not enough, because writing to the response feeder would now block and the whole thing would still propagate. So we stuck a neat little trick into the response feeder. It looks like this: in the simple case, it takes a batch of messages in, loops through them and writes them all out, then sends an ack back to the broker to say "I'm done writing these messages". That's just a wait group; the broker has a wait group that it waits on, and it gets an acknowledgement for every partition in its request. However, the interesting case is the second case in this select: what happens when there's a timeout?
When we've tried to write a message to the user and found that for whatever reason (maybe the user is just being slow, maybe they've gone away; we don't know) we can't write it, then we forcefully unsubscribe ourselves from that broker. We claim the ownership token, so the token is no longer on the broker's goroutine; it's now in our goroutine. We tell the broker we're done, even though we're not, and then we can feed the remaining messages out at our leisure, because the broker doesn't own this partition anymore and isn't going to do anything with it. We've told the broker we're done, so it just keeps feeding the other partitions that are presumably still working, and nothing else depends on us, so we can loop through the remaining messages and write them out. If the user is gone, who cares? Goroutines are cheap; this one will sit here until the program ends and it won't really cost us anything at all. And when we're done, assuming the user was just stuck in a network request somewhere and eventually times out, comes back, and starts reading messages, we still have our reference to that broker, so we can just send it our ownership token again and carry on as if nothing ever happened. The broker-side code for this (this is the clever bit) is even simpler. It just adds the appropriate number of elements to its wait group, feeds each set of messages to the various response feeders, and then waits on the wait group. When that wait call returns, the broker doesn't actually know whether all of the messages have been fed out, but it does know that every subscription has either fed its messages successfully and is ready for more, or has unsubscribed itself, so the next time through the broker's loop it doesn't have to worry about that partition at all. So that's basically it for the consumer.
You've got a response feeder and a dispatcher, one of each per partition, and you've got a subscription manager and a broker consumer, one of each per broker, and they communicate back and forth fairly straightforwardly, I hope. So now, any questions specifically related to the consumer? [Audience] I just wanted to say, you were talking about how to name the pattern of having a struct with a run method attached; it reminds me of an actor. [Evan] Yeah, I'm not super familiar with that; I think it's Scala and other actor-based languages, but that certainly sounds plausible to me. Very interesting. [Audience] You mentioned that goroutines are cheap. They're cheap to create, but they're also roots for a whole lot of your heap, so you don't want to leave them lying around, do you? [Evan] Yeah, and we don't really know at what point, if ever, the user is going to come back. We could presumably provide a second timeout and say "go away entirely after this amount of time has elapsed". It's never been a problem for us so far, because in all of the cases where we blocked it was in reality just a slow network call, so we did come back after 10 or 20 seconds. But that's a reasonable concern; if the user has actually gone away completely and is not coming back, you might want to throw a second timeout in there. Cool. So, we learned a bunch of lessons doing this project. Channels are primitives, and they're for communication; they're your communication primitive between your goroutines. But what we found is that we ended up writing a ton of goroutines and a ton of extra channels just to manage communication between the goroutines doing the quote-unquote real work. We had goroutines with circuit breakers and timeouts and all of these extra patterns just to manage the communication that channels on their own are not complex enough to handle. Which is fine, they're primitives, but be aware of that.
And on the topic of goroutines: structure them, or maybe make them into actors. We found it a very handy pattern for managing that complexity, instead of a hundred parameters and return values in every helper. Don't trust the network or the user; of course we're used to these lessons by now, but don't trust them in any case. Don't trust the network to succeed quickly; don't trust the user to be there to receive the output you're trying to send them. Infinite buffers smell, and I cannot stress this one enough; it's bitten us. If you find yourself writing an infinite buffer, or designing something that seems like it ought to benefit from one, take a step back and re-evaluate. There's usually a way around it if you look hard enough and long enough, and it makes a lot of other things much easier to reason about. And of course, don't be afraid of locks, don't be afraid of reference counting, all of these things we tend to think of as "they shouldn't really be necessary in Go anymore". Sometimes they're still the right tool for the job; channels and goroutines do not solve every single problem in concurrency. Always use the right tool for the job. So, credits for the photos, and thank you. Any final questions? [Audience] Just picking up on the comments about actors: there is quite a strong correlation between this style of writing, and goroutines with state do behave like actors. There's a very big difference, however, which is that in the actor model every actor has a single infinite (smelly) queue of data being fed into it. Obviously with Go you have a lot more control over how the channels are configured, so it's actually a more sophisticated model you get with Go actors. [Evan] Interesting. And what languages would provide that model, with an infinite queue by default? [Audience] Well, I've worked with Akka, which is Java and Scala; its actor model is like Scala actors.
The Akka actors are essentially a way of dressing up callbacks: your actor sits at the end of an infinite queue, and you write the code that pulls things off the queue and processes each item, which might involve putting messages into the queues of other actors. [Evan] Interesting. And that's all you've got? You've got the one input channel, you can't change that input channel, it's just an infinite queue, and that's it? I find that somewhat surprising, but I'll have to look at it more; that's really interesting. [Audience] Someone else, I think, was pointing out that Erlang is infinitely buffered by default as well. [Evan] Yes, interesting to look at; it has pros and cons. The thing about infinite buffers is that there's no such thing: your system only has a finite amount of RAM. If anybody invents a computer with an unlimited amount of RAM, please, I will buy one; but in reality computers have a fixed amount of RAM, so even if your queue is not bounded by software, at some point you're just going to run out of memory and your program is going to crash anyway. So you might as well think ahead and handle that case proactively. [Audience] But the thing is, you often end up trading infinite buffers in a channel for stacking up goroutines, and goroutines are more expensive than buffer space. [Evan] Sure, it's all trade-offs, with pros and cons. [Audience] So with that in mind, how did you look into solving it? [Evan] In the producer case in particular, the solution I eventually came up with, and have not yet worked up the courage to refactor, applies to the first of our two solutions: the one where the partition manager keeps a fixed queue, not an infinite one, and the broker sends back acknowledgements saying "these messages can be dropped; these messages have to be retried in the correct order". The thing about those acknowledgements is that they're ranges, and the ranges are contiguous, one to the next.
So you can do a little math, basically: given two ranges, you can turn them back into one by collapsing them together, and so you don't actually need to keep an infinite buffer of the messages you've read; you keep just one range and keep doing these neat little operations to merge them together. More questions? Anybody else? [Audience] Hi there. How does the Go concurrency compare to the Java concurrency in the original? [Evan] I don't know; I'm not really a big Java expert. Well, okay, the first thing is that the Kafka brokers are actually written partly in Java and partly in Scala, and I don't know any Scala, so they do a bunch of stuff that I don't really understand, to be quite honest. But from what I gather it's a rather different model, and it doesn't look like anything here. When I was writing the first prototype I did look through some of the design documents for the JVM version to get some ideas, and I decided pretty quickly that the underlying models are so different that it was better to start from scratch. More questions? Thank you very much.
Info
Channel: GopherCon UK
Views: 11,008
Rating: 4.5416665 out of 5
Keywords: Go (Programming Language), Concurrency, Computer Science (Field Of Study), Concurrency Pattern
Id: 2HOO5gIgyMg
Length: 53min 45sec (3225 seconds)
Published: Thu Sep 10 2015