GenStage and Flow - José Valim (Lambda Days 2017)

Captions
Thank you. I've realized I've made a huge mistake: my talk title was not as descriptive as it could be, but I'm glad everyone could make it anyway. Who here has already programmed at least a little bit in Elixir? Please raise your hands. Beautiful, that's beautiful. Cool.

So this talk is about GenStage and Flow. Those are abstractions we have been exploring in Elixir for a while, and we are going to go into them in more detail. The reason we have these abstractions in Elixir is that since the beginning of the language, before we reached 1.0, we have had this plan in mind: we want to make it straightforward for developers to take their code from eager to lazy, to concurrent, and then to distributed. That could be the second title of this talk, in a way, because that is exactly what we are going to explore. If you're not familiar with what those terms mean right now, don't worry, we are going to build up to each of them.

In order to do that we are going to use an example which, if you do data processing, you'll recognize as an extremely cliché example. But it is cliché for a reason: it covers the most important scenarios we need to discuss in order to walk that path from eager to distributed. The problem is word counting. We have a text, maybe small, maybe large, and we want to count the words in that text. For example, "roses are red / violets are blue": imagine that this is our text and we want to count the words in it. So if we have that text as an Elixir string as input, we want an output which in Elixir is going to be a map, where the keys are the words and the values are how many times each word has appeared. And how are we going to solve this? The simplest solution is to use the Enum module, which is eager. So how is that solution going to look? The first thing we want to do, for example:
imagine that this poem, or whatever text we want to count the words of, lives in a text file. The first thing we're going to do is call File.read!, which loads the whole file into memory. So now, for our small poem, that's what we have in memory. Then, if we call String.split, we can break that whole file into a bunch of lines. So now, instead of having one huge string, we have a list of strings, each of them being a line of the original text. We are progressing: we had the whole file, and now we have broken it into lines.

The next step is that we want to go over each line and break each of those lines into words, so we can later count those words. For that we call Enum.flat_map. What it does is the following: for each item in this collection, which is a list of lines, we go over the line and break it into words, which gives us a list of words per line, and then everything is flattened together into a single final list of words. Now that we have this list of words, we need to count them. To count them we use reduce, which in other languages would be called fold. We go over each word carrying a map, and at each step we ask: hey map, do you have this word? If the map has the word, we increment the number of occurrences of that word by one; if it doesn't, we introduce that word into the map with a count of one. Concretely, we call Enum.reduce to go over the list of words with an empty map as the initial state; then, for every word, we try to update the map with that word and an initial value of one. If the word is not yet in the map, its count becomes one, because it has just appeared for the first time; if it already exists, we increment it by one, so we count everything correctly. And we get the final result: a map where the keys are words and the values are how many times each of those words has appeared.

Cool, that's the eager solution. The reason it's eager is that, as we saw, at every step we have the whole result in memory: first we loaded the whole file into memory, then we went over the whole contents and broke them into a list of lines, then we built the list of words, and then we built the map. So what are the advantages of being eager? It's very simple: it's the easiest conceptual model we have, and it's very easy to reason about. At each step we know exactly what we had before and what we are going to have after. It is also very efficient for small collections; if you are working with small texts, this is the most efficient approach we have to solve this particular problem in Elixir. But it has one big problem: it is very inefficient for large collections, especially if you have multiple passes. Why is that? Imagine that instead of a really small poem we need to read a really large file, say a five-gigabyte file. Now we are loading that whole file into memory, all five gigabytes, and the next step expands that five-gigabyte file into a big list with all the lines. And it's even worse later on: when we go over each of those lines and break them into words, we build a really big list with all of the words.
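For reference, the complete eager pipeline described above fits in one chain. This is a sketch; the file name is a placeholder:

```elixir
"words.txt"
|> File.read!()
|> String.split("\n")
|> Enum.flat_map(&String.split/1)
|> Enum.reduce(%{}, fn word, acc ->
  # Insert the word with a count of 1, or increment an existing count
  Map.update(acc, word, 1, &(&1 + 1))
end)
```

Every step here materializes its full result: the whole file, the whole list of lines, the whole list of words, and finally the map.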
That list of words is so big that, in some benchmarks I ran later comparing all these different solutions, I was not patient enough to wait for this step to finish: the final solution we get to at the end of the talk took fifteen seconds, and after three minutes with this one I said, okay, I give up, it's not relevant. So eager is simple, really straightforward, but it becomes inefficient in some scenarios: if you have large collections, or if you have a bunch of passes, because on each of those passes we are computing a whole new result over and over again.

So that's the problem: okay, that's straightforward word counting, I can use Enum and compute the result, but if I have a large file that doesn't work. How can we improve it? We can be lazy. That's the next step. So from eager we are now trying to become lazy, and this is a feature that has existed in Elixir since version 1.0. The way we express laziness in Elixir is with something we call streams. Every time you see the word "stream" in Elixir, what it means is that it is lazy; you can think of it as a recipe. So now we use File.stream! instead of File.read! to read the file. (Just one question: can you at the back see the code quite well? Beautiful, okay, so I'm not going to be worried about that.) Instead of calling File.read!, which reads the whole file into memory, we call File.stream!, and File.stream! returns a stream, which is a recipe. What it says is: only when you need the contents of the file am I going to open it, and when I open it, I'm going to give you what is in the file line by line. This is really good, because we are no longer loading the whole file into memory; we have a recipe that knows how to give us that file line by line. This is great. And now, instead of
calling Enum.flat_map, which as we saw in the eager version would build the whole list of words, we make a very simple change: we switch from Enum to Stream. What happens with Stream.flat_map is that we are incrementing the recipe. We are saying: whenever you need to iterate over this, we are going to open the file, give you the lines one by one, and for each line break that line into words. The way it works is that we get the first line, break it into words, and emit those words one by one; when the first line is over, we go to the next line, break it, and serve those words one by one as well. This is really good: we are not loading the whole contents into memory anymore. What we have here is a stream, a recipe that knows how to give us each word of the document, one by one, without loading it into memory. So now we can go to the final step, where we actually need to open the file, get the words, and put them into a map, which we do exactly as before: we call Enum.reduce, because now we want to convert that recipe into something real, we give it the same reducing function as before, and we get the same result.

So, lazy: it folds the computations into a recipe. It allows us to describe how we want to process things, and then at the end we go item by item, word by word. It uses less memory, because we are not loading everything into memory anymore, but it does have a cost in computation, because we now have this extra abstraction for expressing things lazily. If you have large collections or multiple passes, though, then it's worth it.
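The lazy version changes only two calls relative to the eager one: File.stream! instead of File.read!, and Stream.flat_map instead of Enum.flat_map. Again a sketch, with a placeholder file name:

```elixir
"words.txt"
|> File.stream!()                    # a recipe that emits the file line by line
|> Stream.flat_map(&String.split/1)  # extend the recipe: each line becomes words
|> Enum.reduce(%{}, fn word, acc ->  # only here is the file actually read
  Map.update(acc, word, 1, &(&1 + 1))
end)
```

Only the final Enum.reduce forces the recipe to run; everything before it just describes the computation.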
It allows us to work with large or even infinite collections. If something is infinite and you don't need to compute a final result, you can use streams to model it: for example, you can use a stream to model reading from one socket and writing into another socket, and have that running forever, because it never loads everything into memory. Cool.

When we added streams to the language, we came up with one question: look, we have these streams, which are recipes for how to do a computation. What if we could take those recipes and say, I want to run this part of the recipe on one core, and this other part of the recipe on another core, and so on? How can we use those recipes to leverage concurrency? The answer to that question is Flow, which is one of the things we are going to talk about today. For now I'm just going to give you an idea of how Flow looks; don't worry about the details, because we will come back to Flow and explain everything part by part. So how can we convert the code we just saw, which is lazy, into something concurrent? The first step is the same as before: we still need to know how to get the contents of the file lazily, so we start with File.stream!. Then we convert that stream into a flow, and we use operations similar to the ones before: we call Flow.flat_map, we call Flow.partition, which we will understand later, and we call Flow.reduce, and at the end we have the same result as before. So now we went from Enum, which is eager, to streams, which are lazy, to flows, which are concurrent. We will go over this code line by line later, so don't worry about it for now.
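The Flow version being described here, in full. This is a sketch with a placeholder file name, following the shape of the example in Flow's documentation:

```elixir
"words.txt"
|> File.stream!()
|> Flow.from_enumerable()            # wrap the stream in a producer stage
|> Flow.flat_map(&String.split/1)    # several stages break lines into words
|> Flow.partition()                  # route equal words to the same stage
|> Flow.reduce(fn -> %{} end, fn word, acc ->
  Map.update(acc, word, 1, &(&1 + 1))
end)
|> Enum.into(%{})                    # run the flow and collect the result
```

Like a stream, nothing runs until the final Enum call; unlike a stream, the work is then spread across multiple stages.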
But what do we gain, and what do we lose, when we think about concurrency? First, we give up ordering, and we give up the idea of locality. That's exactly what we asked for: if we want things to run concurrently, it means they are no longer running in the same place; they are running in different places, and in Elixir we would say they are running in different processes, those very cheap, very lightweight threads of execution. So now the work is running in a bunch of different places, and because we have a bunch of processes doing the work, what comes out of them is out of order. We no longer have ordering, because they are all working at the same time; as those different processes receive and emit items, the order is going to be however it happens to be. We can always reimpose an order later on, but initially ordering is lost. And we lose the idea of locality: things are now in different places. That's fine, though, because in exchange we gain concurrency, which is exactly what we wanted. And Flow, like streams, allows us to work both with finite, bounded data and with infinite collections, and we are going to see how.

However, similar to what I said before, where Enum is very good for small collections and lazy is very good for large or infinite collections, the same caveat applies to Flow. Imagine that you watch this talk, go back to your Elixir code, and replace all your Enum calls with Flow: that is very likely not going to make your code faster, because there is now an overhead in setting up how those concurrent processes talk to each other. It's not magic; there is an overhead when data flows between processes. So it requires volume: you need a lot of data, or you need to be relying a lot on IO-bound or CPU-bound work, for it to be worth it.
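The loss of ordering is easy to observe even in a tiny flow. In this sketch, with several stages, the output contains the same elements on every run, but their order depends on scheduling:

```elixir
1..8
|> Flow.from_enumerable(max_demand: 1, stages: 4)
|> Flow.map(&(&1 * 2))
|> Enum.sort()
# Sorting at the end restores a deterministic result: [2, 4, 6, 8, 10, 12, 14, 16].
# Without the Enum.sort/1, the order of the doubled numbers would vary from
# run to run, because four stages emit results as they finish.
```

This is the "we can always order later on" point: ordering is given up inside the flow and, if needed, reimposed at the end.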
Before we get into the details of the talk, I just want to share a couple of statistics about Flow. The Flow library is actually quite small: it's just about one thousand two hundred lines of code, and it has more lines of documentation than lines of code. This was very interesting, because it became very clear that the problem with Flow is not primarily a technological problem; it's rather a domain problem. If you want to use Flow properly, you need to learn how to think in this domain where you lose ordering and you lose locality. So given that, I want to split this talk into two parts. The first part answers: if Flow is only 1,200 lines of code, how is it implemented? How can the implementation be so small, given everything we can do with it? And the second part, since the documentation is actually the important bit, asks: how can we reason about Flow? What do we need to know in order to write our flows for concurrent programming efficiently, or rather, effectively? Cool.

So let's answer the first question: how is Flow implemented? Flow is implemented with something called GenStage. So what is GenStage? In Elixir we have processes, which are those very lightweight, very cheap threads of execution; they are primitives that come with the virtual machine. But when we are writing software in Elixir or in Erlang, we are not always spawning those processes directly; we are using existing abstractions, and those abstractions typically come with this "gen" prefix, where "gen" stands for generic. Something very common is the GenServer, a generic server: it is all about a client and a server, and it allows you to write a process that is going to receive requests from other processes
and respond to those requests. GenStage is quite similar: it is a generic computation stage. It's a new behaviour: the same way we have GenServer, we now have GenStage, a behaviour that allows you to model how our processes are going to work, and it's all about exchanging data between different stages transparently, and with back-pressure. And we have three kinds of stages: they are producers, or they are consumers, or they are producer-consumers.

For example, here's how we can have a GenStage pipeline with multiple stages connected to each other: we start with a producer that sends events to a producer-consumer, which may send events to another producer-consumer, then to another producer-consumer, and the pipeline ends on a consumer. All we need to do with GenStage is tell the stages how they connect to each other, and GenStage is going to take care of sending the data between them. But every time we have a pipeline like this, we start to ask some questions. For example, what happens if the producer-consumer in the middle is slow? Imagine that the producer is sending data really fast, but that producer-consumer needs to do a task that is slow, and it cannot keep up. If the producer just keeps sending data, without thinking about the rest of the pipeline, we have a problem: we keep sending more and more work to that producer-consumer, its queue of work keeps growing and growing, and if we are not careful we can even run out of memory, because we are getting data from external systems, putting it into this pipeline, and something slow in that pipeline just keeps receiving work it eventually cannot keep up with. This scenario is one of the main reasons why GenStage was designed the way it was: we need a way for those stages, those computational processes, to signal to each other "I can do more work" or "wait, I'm busy, I cannot do more work right now".

The way we do that is that GenStage is demand-driven. What that means is that the producer cannot just start sending data like crazy to consumers. It works like this instead. The first step is that the consumer subscribes to the producer. Even after this step, the producer cannot send data immediately: after subscription, the consumer first needs to say "you can send me ten items", or "you can send me a thousand items". We say that the consumer is sending demand to the producer: here is how many items I am willing to process. So the consumer can say "I'm going to ask for 10", and then the producer can send events to the consumer, but only in a way that does not exceed the demand. The sending of events and the sending of demand are asynchronous: the consumer can ask "give me 10", and the producer can say "I only have 5 for now, so I'm sending you 5", and then send two or three or however many more later, as long as it never exceeds the demand. The consumer can ask for 10, and then decide "actually I can process 20, so send me 10 more". This conversation is ongoing.

And what is really nice about this demand-driven aspect is that, if we go back to that pipeline, we can push the demand from the consumer all the way to the beginning of the pipeline; we say that we push the demand to our system boundaries. So for example, in this case, C is going to ask B for ten items, and then B is going to ask A for ten items, and then A can get those items from somewhere, from an external source or an internal one, and send them downstream, always in a way that
we're not going to exceed the original demand. So being demand-driven is a message contract: these are asynchronous messages that the stages exchange with each other, saying "hey, give me 10", and then "look, here are five; here are five more", and so on. It's really nice because, as we saw, if we have a pipeline, we can communicate this demand all the way to the system boundary, so we are pushing the back-pressure to the system boundary. That is good because, if you are getting data from external systems, you know exactly how much data you need to fetch from them. And GenStage is actually just one implementation of this contract of sending messages between those processes. So if you leave this talk saying "I don't really like GenStage, I think I have ideas for a better abstraction", you can come up with your own abstraction and connect it to a GenStage pipeline, exactly because it is just a message contract.

Okay, let's see an example, a very simple one. We have a producer which is a counter: what this producer does is emit events by counting, so it's going to emit 0, 1, 2, 3, 4, 5, and go on like that forever. And we have a printer, a consumer which prints the events as they come. Here's how the producer looks; here is some Elixir code. As usual in Elixir, the huge majority of the code lives inside modules, so here is a module that is a producer, and we want to bring in GenStage and implement the GenStage behaviour, so we say use GenStage. Now we need to implement two callbacks, which we implement with two functions. The first function is init, which is going to be called when the stage starts, and it receives the counter, the value we want to start counting from, which is usually going to be zero. And then we return: okay, this stage is a producer
and it has an initial state of counter. So that is init, and for producers we implement another callback, handle_demand. handle_demand is called every time a consumer asks for a given number of items: it is called with the demand, how many items the consumer asked for, and with the state, which is the counter, where we are in our counting from 0 to 1 to 2 to 3 and so on. What we do is generate the events from counter up to counter plus demand minus one (we are going to see exactly how this works), and then we say: okay, every time there is demand, I'm going to emit those events and change my state.

It works like this. Imagine that we start this producer with a counter of 0, and then a consumer asks for 10 items. handle_demand is called with the initial demand of 10 and the state of 0, and it returns a no-reply tuple with a list of exactly as many events as were demanded: 10 events, in this case 0, 1, 2, 3 up to 9, which gives us exactly 10, and the state for the next time around is going to be 10. The next time we receive demand, say for 5 items, handle_demand returns 10, 11, 12, 13 and 14 as the events, and the new state is going to be 15, and so on. So the producer stage sits in this loop where it receives demand, builds events from that demand, always counting upwards, and returns those responses.

The consumer is quite similar. We define a module called Printer (we could give it any name we want), and it also says use GenStage. It needs to implement the init callback too, with two differences: instead of saying it is a producer in the first element of the tuple, we say this is a consumer; and this consumer is stateless, so I pass :ok, because the state does not matter here, and it could be whatever we want. And while the producer needs to implement handle_demand for every time a consumer asks for something, the consumer needs to implement handle_events, which is invoked every time the producer sends events. handle_events is called with the events the producer sent, with an argument that says which producer sent those events, and with the state, which in this case does not matter. What we do is sleep for one second, to simulate this consumer doing some work, then print the events, and then say: okay, there is nothing more to do; this consumer is the end of the pipeline, so there is nothing to emit from here.

After we've done that, we have defined a producer and defined a consumer, and now we can connect them. What we do is start the producer: we call GenStage.start_link, starting a counter that is going to count from zero, and that returns a producer process, a producer stage. Then we also start a consumer, passing :ok as the argument, because the state does not matter, so whatever we pass there is fine; and this is the printer. So now we have a producer here and a consumer here; they are running, but they don't know about each other yet. The last step is to tell the consumer, which is the printer, to subscribe to the producer, which is the counter. As soon as we do that, we are going to wait one second and see events printed: the first time around we will see 500 events printed, then we wait another second and see 500 more, and so on.
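Putting the whole example together, the producer, the consumer, and the wiring described above look roughly like this (following the shape of the example from the GenStage announcement):

```elixir
defmodule Counter do
  use GenStage

  def init(counter) do
    # This stage is a producer; its state is where the count currently stands
    {:producer, counter}
  end

  def handle_demand(demand, counter) when demand > 0 do
    # Emit exactly `demand` events: counter, counter + 1, ...
    events = Enum.to_list(counter..counter + demand - 1)
    {:noreply, events, counter + demand}
  end
end

defmodule Printer do
  use GenStage

  def init(:ok) do
    # This stage is a consumer; its state does not matter
    {:consumer, :ok}
  end

  def handle_events(events, _from, state) do
    # Pretend each batch takes a second of work, then print it
    Process.sleep(1000)
    IO.inspect(events)
    # Consumers are the end of the pipeline, so they emit no events
    {:noreply, [], state}
  end
end

{:ok, counter} = GenStage.start_link(Counter, 0)
{:ok, printer} = GenStage.start_link(Printer, :ok)
GenStage.sync_subscribe(printer, to: counter)
```

Once the subscription is made, batches of events start printing, one batch roughly every second.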
So, for the producer we defined how it should behave when it receives demand; for the consumer, how it should behave when it receives events; and GenStage took care of the whole communication, of asking for more, and of handling all the back-pressure. All we needed to worry about was the business logic of what each stage needs to do at each particular moment.

However, you may be wondering: why 500? Where did that number come from? To understand exactly why we got 500, we need to talk about the subscribe options. In this code we called sync_subscribe, saying we want the consumer to subscribe to the producer, and we can pass a bunch of options when subscribing. We are going to go over two of those options. One is max_demand, the maximum number of events to ask for, whose default is 1000: by default, when you connect a consumer to a producer, the consumer says "hey, give me 1000". The other is min_demand, the value at which, once reached, we ask for more events.

In order to understand exactly how this works, let's ignore min_demand for now. Imagine that we connect the consumer to the producer with a max_demand of 10 and a min_demand of 0. What happens is the following: the consumer asks for 10 items; the producer produces them and sends them to the consumer; the consumer eventually receives the 10 items, processes all 10 of them, and then asks for 10 more. This is bad. Why is this bad? Because the consumer says "give me 10" and then waits doing nothing; then it receives those ten, processes all of them, asks for 10 more, and waits again. The consumer is spending a lot of time waiting for the producer to send data, and similarly the producer spends a lot of time idle, because the consumer is not asking for anything while it is busy processing. This is not good. We want to find a way where we keep both busy, both doing work, and that's why we have min_demand.

If we set min_demand to 5 instead of 0, this is how it works: the consumer asks for 10 items and receives 10 items, but it knows that as soon as it has processed half of those items, once it gets down to 5, it can ask for more. So the consumer processes only 5 of those items, and after it has processed those initial 5, it asks for 5 more. Now we are in a state where the consumer has processed half, still has the other half to process, and has already asked the producer to start working on the next 5. So while the consumer is processing the remaining 5, the producer is already doing the work to give us the next 5. This is good because we no longer have this stop-and-go relationship: ideally, if this is well balanced, they are both working all the time. Cool.

There is a bunch more I could talk about regarding GenStage and the other goals we have for it; we wrote an announcement when it first came out, so I recommend checking that out and learning more if you are interested in this particular aspect.
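The two options just described are passed at subscription time; their defaults (max_demand: 1000, min_demand: 500) are why the printer example showed batches of about 500 events. A sketch, using the counter and printer stages from before as the variable names:

```elixir
# Explicitly setting the defaults: ask for up to 1000, re-ask at 500
GenStage.sync_subscribe(printer, to: counter, max_demand: 1000, min_demand: 500)

# A smaller window: ask for at most 10, ask for more once down to 5
GenStage.sync_subscribe(printer, to: counter, max_demand: 10, min_demand: 5)
```

Shrinking the window trades throughput for smoother, more evenly overlapped work between producer and consumer.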
Going back to our topics: we have answered the first question, how Flow is implemented. At the core of Flow we have GenStage, with a really small amount of code on top of it. So now we are ready for the next question: how do we reason about flows? Cool. At the beginning of the talk we started with that cliché example, "roses are red / violets are blue", and everyone wants the word counting. That's our input at the top, our output at the bottom, and we showed the concurrent solution using Flow: we had this code, and we said that all flows are lazy. A flow is also a recipe that tells us how to do some work, and at the end, when we want the result, we can compute that final result concurrently, and it is the same result as we had for eager and lazy. What we are going to do now, for the rest of the talk, for this second part, is go over this flow, over all the lines of code we have written, line by line, and understand how it works and how it gives us concurrency.

So let's start. The first step is File.stream!, which as we know returns a stream, a recipe that knows how to read the contents of a file line by line. The next step is that we convert that stream into a flow. A flow, like a stream, is a recipe, but now we can think of it as a concurrent recipe: a recipe that, later on, we are going to break apart and execute concurrently. When we call Flow.from_enumerable with a stream, what happens is that we create one process, one GenStage producer, such that every time other processes ask it for items, it emits that file line by line as the items. So we did an upgrade: we had a stream that knows how to read the file line by line, and now we have put that stream inside a stage, its own process, and it can do all the things we talked about with GenStage; it knows how to send items to a bunch of other processes. Cool.

So now that we have Flow.from_enumerable, with the stream running inside a process, we can call Flow.flat_map. What Flow.flat_map is going to do is the following. It says: okay, I have this stage here, which is the producer, and now I want to execute this computation concurrently, which means I am going to start a bunch of other stages to do
precisely this computation. And that's how the flow is going to work. So let's imagine that we execute this flow as it is right now, without all the other parts. How does it work? It works like this. We have the producer process, the producing stage, and it knows how to emit those lines one by one. So the producer sees the first line, "roses are red", and says: OK, I am going to send this line to one of those four stages. It can send that line to stage one, and stage one will now do the job of breaking that line into words. But at the same time that stage one is breaking the first line into words, another stage asks for the second line — it could be, for example, stage four — so the producer sends that line to stage four, which breaks that line into words as well. So what happened when we started using Flow is that, instead of having one thing consuming the stream, we now have many things consuming the stream. That's why we lose ordering: they are all consuming those lines coming from the file at the same time, so the end result is not guaranteed to be "roses are red, violets are blue" — the lines are broken apart, everything happens at the same time, so they can be mixed and so on. OK, so that's what we have so far: we have this flow that knows how to get the data from a producer and send it to a bunch of different stages that break each of those lines into words. And you can see that each of those stages here at the bottom, they don't have any state: what they do is get each line, break it into words, and move the words forward.
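What each of those stateless mapper stages does to a line is plain stdlib Elixir — a minimal sketch (the `split` name is just for illustration):

```elixir
# One mapper stage's job: take a line, emit its words.
# String.split/1 splits on whitespace.
split = fn line -> String.split(line) end

split.("roses are red")     # => ["roses", "are", "red"]
split.("violets are blue")  # => ["violets", "are", "blue"]
```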
But that's not actually what we want, because what is our end result? What we want at the end is a map with the word counting. So let's change that: for each stage, instead of sending the words forward, we want each stage to count the words, to put the words into that same map we talked about before. So after Flow.flat_map, let's call Flow.reduce and see what happens. Now, instead of each stage sending the words forward, each stage gets those words, puts them into a map, and computes this final result. So we have that same schema: we start with "roses are red", and we send that to stage one, which breaks it into words and puts them into a map with the words "are", "red" and "roses"; then we get the next line, "violets are blue", and send that to another stage, which does the same thing and breaks that line into words. So, in a way, it works: we are now computing those maps concurrently. But we have one big problem here. What is the problem? This is the problem: the word "are", for example — we can find it in one stage here, and we can also find it in the other stage there. Why is this not good? Because if you want to do a final word counting, where you get all the words, it means you need to go over all the stages, get all their maps, and merge them together into a final map. And when we are merging them together, we need to check for duplicate words, because if we have duplicate words, we need to sum them together. That is not good, because if we break the work apart and then later put it together into a final map, this final merge step is serial. Imagine that we had forty-two processes: we would have to merge everything into one final map, and that's going to be slow — that's the opposite of what we want, it's not going to be concurrent. You need this final map to be built in one place.
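The serial-merge problem can be made concrete with a small stdlib sketch (the per-stage maps below are the ones from the walkthrough):

```elixir
# Without partitioning, each reducer stage builds its own map, and the
# word "are" is counted in two different places:
stage1 = %{"roses" => 1, "are" => 1, "red" => 1}
stage4 = %{"violets" => 1, "are" => 1, "blue" => 1}

# The final step must walk every stage's map and sum the duplicates —
# a serial merge that grows with the number of stages:
merged = Map.merge(stage1, stage4, fn _word, a, b -> a + b end)
# merged["are"] == 2
```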
So that's not good: having the same data — the count for a single word — scattered around all the different stages does not work. That's not what we want. So how do we solve that? Let's go back to our problem. We have the stream, we make it into a flow, we call flat_map, and we know that flat_map gives us all the words — it splits into words concurrently — but we know that we cannot simply change those stages to build the map; we need to do something first. We cannot just call reduce — we tried that, and it didn't work. So what we do is call partition, and the whole idea of partition is that it looks at each of those words and guarantees that if two different stages see the same word, they send that word to the same process at the bottom. When we call Flow.partition, we start a new layer of stages, and it is guaranteed that if, for example, stage 1 and stage 4 see the same word, they route it to the same stage below — for example, stage C or stage B. This is good, because now the data is not going to be scattered around anymore. So now that we are partitioning — each stage at the top, stages 1, 2, 3 and 4, is connected to all the stages at the bottom — we can call Flow.reduce, because the reducing will be happening here at the end. Let's do the simulation, which is going to make this very clear. We start again: "roses are red". We send that to stage 1, which now only has the responsibility of breaking that line into words and routing each word to the proper process. So it says: hmm, I can see that, when I partition, "roses" should go to stage A — so it sends "roses" to stage A, and stage A puts it in its map. And it can see that the word "are" should go to stage C.
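The guarantee this simulation relies on — every mapper routes the same word to the same reducer — can be sketched with the Erlang term hash (`:erlang.phash2/2` is real; using it here, and the four-stage count, are just illustrations):

```elixir
stages = 4
# Map a word to one of the reducer stages, 0..3.
route = fn word -> :erlang.phash2(word, stages) end

# Because every mapper applies the same function, "are" seen by stage 1
# and "are" seen by stage 4 land on the same reducer index:
route.("are") == route.("are")  # => true
```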
Stage C gets that word and puts it in its map. And stage 1 can see that the word "red" also goes to stage C, so that goes into the same map as well. Now, when the other stage is processing the line "violets are blue" at the same time, it knows that the word "are" in "violets are blue" should also go to stage C, and that's exactly what happens: the word "are" goes to stage C, and now "are" is correctly counted as two on stage C — it is no longer scattered around all those stages. That's why partitioning is important: it guarantees that if we see the same kind of data, data with the same shape, we can route it to the proper place, and then we can keep all those partitions disjoint; we don't have data scattered around. And look at what we have here at the end: we have a producer, we have the stages at the top, one to four, doing really straightforward work — they receive an input, compute something, and emit the output — and then we have those stages at the bottom that are accumulating a final state. We can call all those stages at the top, one to four, the mappers, because all they do is map: apply a function to the input and send the results onward. And those stages at the bottom are reducers, because they have state, and they are receiving those inputs and folding them into the state they are building up. So this is MapReduce — we are doing MapReduce concurrently. So now we understand what the flow does; those are all the steps we have so far. We start with a file stream, we upgrade it to a flow with from_enumerable, which gives us a producer; then Flow.flat_map gives us a bunch of mappers that receive lines and break them into words; and then we partition, creating a new layer of stages that compute the final state.
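The same map and reduce phases, written eagerly with just the stdlib, look like this:

```elixir
lines = ["roses are red", "violets are blue"]

# Map phase (what stages 1..4 do concurrently): break lines into words.
words = Enum.flat_map(lines, &String.split/1)

# Reduce phase (what the partitioned stages do): fold words into counts.
counts =
  Enum.reduce(words, %{}, fn word, acc ->
    Map.update(acc, word, 1, &(&1 + 1))
  end)

# counts == %{"are" => 2, "blue" => 1, "red" => 1,
#             "roses" => 1, "violets" => 1}
```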
And now the last step. If you don't want the final result in memory — if you want to write to a file or to a web service — you could just continue with more Flow calls and write to the external service. But if you want a final result in memory, we can call Enum.into to put everything into a final map. That's really straightforward now, because we no longer need to check for duplicate words across different stages. And that gives the same result that we had at the beginning of the talk. Cool. So we know that reduce collects all the data into maps, and when it's done, we stream those maps into Enum.into, collecting everything into our final result. Going back to the beginning of the talk: we started with eager, then we made it lazy, and now we are able to make everything concurrent with Flow. We started with Enum, which is our eager code; with very few changes we were able to make it lazy; and then, by learning a little bit about the domain of concurrent processing, we were able to make it concurrent with Flow. And I said that I ran some benchmarks. I don't remember the dataset size, but with the eager one I was not patient enough — it needed to build a huge list of words, and I did not wait for it to finish. The stream version took 60 seconds on my machine, one minute, and the flow version took 18 seconds on my machine, a machine with four cores. That's really good: we almost got a four-times improvement. There is a little bit of overhead there, but it's really good — this is what we want. So, what about Flow? As we saw, it provides map and reduce operations, as well as partitioning, but it does a bunch of other stuff that we haven't talked about. If you have two flows, you can merge them together; if you have flows of data that you need to join, like database joins, we can do that with Flow as well — we have left join, inner join, right join and so on.
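Putting all the steps together, the whole flow discussed in the talk can be sketched like this (a sketch that assumes the `flow` package is available; a two-line list stands in for the file stream so the example is self-contained):

```elixir
result =
  ["roses are red", "violets are blue"]
  |> Flow.from_enumerable()
  |> Flow.flat_map(&String.split/1)
  |> Flow.partition()
  |> Flow.reduce(fn -> %{} end, fn word, acc ->
    Map.update(acc, word, 1, &(&1 + 1))
  end)
  |> Enum.into(%{})

# result == %{"are" => 2, "blue" => 1, "red" => 1,
#             "roses" => 1, "violets" => 1}
```

With `File.stream!` in place of the list, this is the pipeline walked through above: producer, mappers, partition, reducers, and a final Enum.into.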
It also has a configurable batch size: by configuring the minimum and maximum demand, you can control exactly how many items the stages exchange when they communicate. And it has a bunch of data-windowing features, like triggers and watermarks, that I didn't talk about; but as I said, we have about 1,200 lines of documentation, so you can always read the documentation to find more information. OK, but there is one last question: what about distribution? I said it goes from eager to lazy to concurrent to distributed. When you look at the Flow API, you can see that it has feature parity with frameworks like Apache Spark; however, there is no distribution and no execution guarantees. If something goes wrong in your flow, there is no checkpointing — you cannot restart the flow and come back from where you were before. So does that mean that this talk was for nothing, that this is not relevant because there is no distribution? Not really — quite the opposite. There are a bunch of interesting papers. For example, one says that small inputs are common in practice: 40 to 80% of Cloudera customers' MapReduce jobs, and 70% of jobs in a Facebook trace, have less than 1 gigabyte of input. Another paper builds on top of that, saying that for between 40 and 80% of the jobs submitted to MapReduce systems — in this case, distributed MapReduce systems — you'd be better off just running them on a single machine. And there is a very interesting paper called COST, which measures that for many of the problems people are using distributed systems to solve, if you solve the problem on a single machine — removing all the coordination related to distribution — you can solve it something like 40 times faster. Some problems are even faster to solve on a single core, because the coordination between multiple cores already introduces enough overhead to make multiple cores not worth it. So what do I mean
by this? The single machine still matters. I like to say that this is a medium-data problem: if you have medium-data problems — and that's what most of us are likely to have, not big-data problems — this is going to be a great way, and actually the most efficient way, to solve them. And the positive news is that the gap between concurrent code and distributed code in Elixir is small, because of the Erlang virtual machine. So what we need to focus on, if we really want to have distribution, is those durability and checkpointing concerns, so that if something goes wrong — if you are running a distributed computation and a node goes down — you don't have to restart everything; you want to pick up from where that node left off. That's where most of the work should go if we eventually want to be distributed. All right, that's what I had to talk about, so I just want to share a little bit about our inspirations. The whole back-pressure contract we talked about in GenStage is inspired by Akka Streams. We got a lot of inspiration, API-wise, for Flow from Apache Spark. As I said, we have a bunch of features regarding windowing, triggers and watermarks that come from the Apache Beam project. And we also have something called notifications, an internal mechanism that is really neat, which allows us to compute results without having a single source; the first time I saw it was in the Microsoft Naiad project. A bunch of GenStage was built and designed as part of my work at Plataformatec, which is the company I work for, the company that created Elixir. And that's what I have to share today. That's it. [Applause] We have time for just two questions. Audience question: is Flow able to fuse two stages together? Because processes communicate a
synchronously, and in the case of, for example, an add operation, it's faster to do the operation on a single process than to pass data between two processes. So the question is whether Flow by default takes care of that automatically, when the computation is faster on a single core. For example — let me get a different microphone — say we have two simple operations, like "add one" and, in the next stage, "add two" to each element; in that case, is Flow able to measure and merge those? Oh, beautiful, beautiful, yes, thank you. So if you say Flow.map and after that you say Flow.map again, each doing something small, Flow is going to merge the computations. By default, Flow doesn't create new layers: if you call Flow.map, Flow.filter, Flow.flat_map, that's all going to run in the same layer. It's just that the examples here always had one mapping function per layer. So the answer is: we don't need to merge, because by default we are not creating multiple layers. Only when you do an operation like Flow.partition does it create a new layer of stages, because it has to. Excellent question, thank you. We have one more question, I believe — and I'll be around, so feel free, after the talk or around the event, to come and ask. Audience question: how does the partitioning work underneath? Do you need to keep, somewhere, a hash of which words were sent to which reducer? Yes — so, for partitioning, by default we calculate a hash of the event. In this case our events are the words, so we calculate a hash of the word, but if you check the documentation for the Flow.partition function, you can pass in exactly how you want to hash. So by default we hash on the term, on whatever you are sending, but if you need key-based
partitioning — because, for example, you need to partition by a group ID or something — you can just pass an anonymous function to Flow.partition and configure the partitioning. Follow-up: and how are these hashes shared? They don't need to be shared, because what happens is that you compute the hash, and if you have four reducer stages, all the producers have access to the same hashing function. They all invoke the same hash function and compute the same value. Let's say we have four stages at the bottom, A, B, C and D: the hash function says exactly whether an event should go to A, or to D, and so on. You don't need to share the hashes, because all the producers have access to the same hashing function. And remember that the flow is a recipe: when we are calling Flow.partition and so on, we are not starting those processes yet. Only when you run it does the flow materialize; then we build the whole thing, and when we build it, we pass to all the producers what the partitioning function is — that's how they know how to partition the data. Cool. So, for any other questions, just find me around, or come here right after. Thank you once again, take care, thanks.
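The key-based partitioning described in that last answer can be sketched like this (assumes the `flow` package; `Flow.partition` accepts a `:key` option — a function or the `{:key, field}` shortcut — and the `:group_id` data here is made up):

```elixir
# Events with the same :group_id are guaranteed to reach the same
# reducer stage, so each group's sum is built in one place:
result =
  [%{group_id: 1, n: 1}, %{group_id: 2, n: 2}, %{group_id: 1, n: 3}]
  |> Flow.from_enumerable()
  |> Flow.partition(key: {:key, :group_id}, stages: 2)
  |> Flow.reduce(fn -> %{} end, fn %{group_id: g, n: n}, acc ->
    Map.update(acc, g, n, &(&1 + n))
  end)
  |> Enum.into(%{})

# result == %{1 => 4, 2 => 2}
```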
Info
Channel: Erlang Solutions
Views: 17,932
Rating: 4.9711537 out of 5
Keywords: José Valim, Lambda Days, Elixir
Id: XPlXNUXmcgE
Length: 53min 32sec (3212 seconds)
Published: Wed Mar 08 2017