Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis

Video Statistics and Information

Captions
Please welcome Torsten. — Thank you. Okay, so as I was just introduced, I'm going to talk about parallel and distributed deep learning, and the idea is to give all of us an overview of that rather complex field. To organize this a little, we wrote a survey about the field; it has only 60 pages, so in the next 45 minutes I will summarize the most important aspects of it. I want to credit my wonderful postdoc, Tal Ben-Nun, who has been leading the effort to read more than 300 papers in order to understand the state of the art in the field; I myself read about a hundred of those, and we will see how that goes in a couple of minutes.

First of all, let me motivate this a little. I'm not sure we need much motivation after Dave's wonderful talk, but while Dave was focusing on artificial intelligence as a whole, I want to focus on a subset of it: deep learning. As we know, this field basically started, or at least had its first big breakthrough, in 1989 with handwritten digit recognition by Yann LeCun, which beat all previous benchmarks. It quickly continued in 2012, when the ImageNet challenge for object classification was won, and that was quickly followed by better image segmentation algorithms. The difference between classification and segmentation is that in classification you want to label an image, to say what is in it, while in segmentation you also want to locate the objects in the image itself. Image captioning was a more advanced topic, basically saying "this looks like a dinner table, or maybe a lunch table." Then we all heard about AlphaGo and AlphaGo Zero, and later, in 2017, the neural computer idea came up: you can also use deep learning to implement computation, with Turing-machine-like approaches.

Furthermore, deep learning is used in basically every cloud industry today. Every speech recognition or translation system you can use, whether Google Translate or Skype Translator, is using deep learning, and it is using the methods I am going to explain. So it is a very promising area of research. It is also a very active area of research, because of course it is the new hype: from 2012 to 2017, the number of papers on arXiv in the two categories artificial intelligence and computer vision, most of them related to deep learning, grew to the point where, as of yesterday, about 23 papers a day are coming out in that area. It is really hard to follow the field if you want to stay up to date on everything, and this is why we decided to write this survey. We looked at those 300 papers in detail, all pre-filtered to issues related to parallelism in deep learning. The growth is roughly quadratic; if you fit a function to it, a quadratic fits nicely. We are now in 2018, which means there are many more papers a day than in 2017, which also means that parts of the survey are already outdated, but that is how it goes in this field.

So let me quickly summarize the technicalities of deep learning. How does it work?
We have a data set, and here I want to talk about image recognition, or image classification, only, but it applies to pretty much everything else that is done in the context of deep learning, so just bear with me for this image example. We have a data set with lots of labeled images, and we have some kind of network structure that we run this data set through. For example, we take this cat picture; the cat picture is an input to a function f(x), and the function f(x) is the network itself. Once you apply this function to the input image, it outputs a probability distribution over various labels: in this case a 54% probability of this thing being a cat, a 28% probability of it being a dog, and so on. In supervised deep learning we also have a true label, which basically says that with 100% probability we assume this is a cat; some oracle, usually a human, labeled it as a cat. This also means the oracle is not always right, so this may in fact be a dog (no, it's not), but the labels are noisy; this is something you need to understand, they are not 100% accurate. Once we have done this so-called forward propagation, which gives us a probability distribution, we see that there is an error, because it is definitely not a dog, not an airplane, not a horse, and not any of these other things, and we do a layer-wise weight update on the way backwards, which is called backward propagation. We repeat this process iteratively to compute the weights for the network, in order to improve the function f(x) that the network implements.

To give you a little more data: these input data sets are what we call big data. ImageNet-1k is about 100 gigabytes of images, and ImageNet-22k is a few terabytes of data, and in industry this is much larger. If you look, for example, at Facebook's Instagram, there are about a billion users, and if we assume that each of these users uploads several pictures a week or a month, we easily have a trillion pictures that we may want to process. Furthermore, the networks themselves now seem small in comparison, but they go 100 to 200 layers deep. Here we have GoogLeNet, which is not that deep, about 20 layers, but there are networks that are much deeper. They have a hundred million to two billion parameters, and these parameters take up to eight gigabytes of storage. That sounds reasonable; the full network itself still fits on a modern GPU. However, we can see some growth in the networks themselves. This is a plot showing the number of operations required (more computationally intensive to the right) against accuracy (the higher the better), where the size of each dot is the number of parameters. What we see is interesting: the number of parameters grows somewhat with top-1 accuracy, but it is not necessarily the largest network that achieves the highest accuracy. There are effects in there that are not easy to explain, which is also due to the lack of theory in this field; it is usually studied very empirically.
Furthermore, if you look at the label space, we have 10,000 to 20,000 labels, and this label space is growing: if you think about face recognition, each human would be a label, so my name would be a label. And it takes weeks to train these networks. If you now look at all these parameters, the conclusion we can draw is that deep learning is supercomputing: we need very high computational power. This is something we all know and agree on, and it is why we are here at an HPC show.

Let me get a little more formal here. Assume we have a universe D that we draw examples from, a subset of it that we call our input data set X, and a single example out of that data set called x. Then we have the label domain Y, which is all the labels we map to, and a true label, coming from an oracle, that we call ℓ(x); here this is our cat with probability 1. The network itself is a function that maps an example, drawn from the domain of possible examples, to a particular label. So far very simple, but what is now interesting is that this function is made up of two components. One is the network structure itself, all these boxes in the figure, which is what a human typically designed in the past. The second part of the function f(x) are the weights, which are not shown in this figure, but each of these boxes has parameters, or weights, that are needed to evaluate the function f(x). In the end, when we evaluate this function we get a probability distribution, and we have the true label, so we need to define a loss function that gives us the difference between the actual guess that f(x) makes and the true label that the oracle provided. There are various loss functions; I don't want to spend too much time on this, because this is where the survey gets to 60 pages. We can have a squared loss, which is very simple; we can have a 0-1 loss, which gives a 0 (no loss) if the label is correct and a 1 if the label is incorrect; but what is most commonly used is the cross-entropy loss, which is slightly more complicated and has a softmax term in it. Just assume that these loss functions somehow express how good my guess, my forward propagation f(x), is; there are many more than the ones shown. In the end, since we have a fixed network structure, what we want to do is find the optimal set of weights w*, and the optimal set of weights is really easy to define: it is the set of weights, drawn from the space of all weights, for which the expected loss is minimized. So this is really just an optimization problem; the learning part of deep learning is an optimization problem.
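As a concrete, purely illustrative example of the loss functions just mentioned (not from the talk), here is a small numpy sketch of the squared loss and the softmax cross-entropy; the logits and labels are made up:

    import numpy as np

    def softmax(z):
        z = z - np.max(z)                 # shift for numerical stability
        e = np.exp(z)
        return e / e.sum()

    def cross_entropy(logits, true_label):
        p = softmax(logits)
        return -np.log(p[true_label])     # negative log-probability of the true class

    def squared_loss(probs, onehot):
        return np.sum((probs - onehot) ** 2)

    # Example: network outputs for [cat, dog, airplane, horse], true label "cat" (index 0)
    logits = np.array([2.0, 1.4, -0.5, 0.1])
    print(cross_entropy(logits, 0))
    print(squared_loss(softmax(logits), np.array([1.0, 0, 0, 0])))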
But how does this look in practice? We have this prescribed network structure, with for example a convolutional layer, another convolutional layer, a pooling layer, another convolutional layer, and a fully connected layer; we can talk later about why it has this particular structure. In the end, it is really a function that is composed of multiple functions: each of those layers is one particular function. For example, f1(x) is the first convolutional layer, the output of that layer is forwarded to the second function, f2(f1(x)), and so on, and the last layer gives the overall function of the network. This is how we decompose the network into multiple function applications, using the chain rule. It is still extremely simple.

Now let's go a little more into the details of how we optimize this function. As I mentioned, we have an optimization problem: find the best set of weights for your network, and somehow we need to actually do this in practice. There are many ways to do it, but the most prevalent is the stochastic gradient descent (SGD) method, which is what is used by pretty much all deep learning frameworks today. It is an iterative method: it has T steps, or some other stopping condition, but typically a predefined number of steps. In each step you take one random element of your examples, which is why it is called stochastic. Then we do the forward propagation: we apply the first layer, and then in a loop we apply all the other layers to the output of the previous layer. Then, because we want to learn, we do the backward pass: we take the gradient of the loss, whatever loss function we chose, with respect to the true label, we compute the gradient with respect to the data (the o_i, the outputs of the previous layers), and we also compute the gradient with respect to the weights, ∇w, for each of the layers. And then the magic comes in: we update the weights for the next iteration using an update function u. This function u is the major differentiating factor between learning approaches, and there are many different functions u; I don't have enough time to explain all of them. The simplest one just uses a learning rate, a parameter η between 0 and 1, that says how much I learn from the new example versus the old state, but you can also make it adaptive, add momentum tuning, and all kinds of other functions; again, look at the survey for more details.

What does this mean from a computational perspective? This is a quick and dirty summary, and I'm glossing over all kinds of details, but I want you to have a basic understanding of what is actually happening in deep learning. From a storage perspective we need to store four values at each layer: the weights; the output of the layer, because you need this output for back-propagation; the gradients of the weights; and the gradients of the data. Which means that if you have 8 gigabytes of parameters, you suddenly have about 32 gigabytes of values to store, and that does not fit on your GPU anymore. That is a bit of a problem, and we will talk later about how to solve it.
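To tie the pieces described so far together, here is a minimal, self-contained sketch of per-example SGD on a toy two-layer network; the layer sizes, toy data, and the plain learning-rate update u are illustrative choices of mine, not code from the talk or survey:

    import numpy as np
    rng = np.random.default_rng(0)

    W1 = rng.normal(0, 0.1, (4, 8))   # weights of layer 1
    W2 = rng.normal(0, 0.1, (8, 3))   # weights of layer 2
    eta = 0.1                         # learning rate: the simplest update function u

    def forward(x):
        h = np.maximum(0, x @ W1)     # layer 1: linear + ReLU activation
        z = h @ W2                    # layer 2: linear (logits)
        return h, z

    for step in range(1000):          # T iterations
        x = rng.normal(size=4)        # draw one random example (the "stochastic" part)
        y = int(x[0] > 0)             # toy true label from an "oracle"
        h, z = forward(x)             # forward propagation through the composed layers
        p = np.exp(z - z.max()); p /= p.sum()
        # backward pass: gradient of the cross-entropy w.r.t. the logits, then chain rule
        dz = p.copy(); dz[y] -= 1.0               # gradient w.r.t. the last layer's output
        dW2 = np.outer(h, dz)                     # gradient w.r.t. layer-2 weights
        dh = W2 @ dz                              # gradient w.r.t. layer-1 output (the data)
        dW1 = np.outer(x, dh * (h > 0))           # gradient w.r.t. layer-1 weights
        W1 -= eta * dW1                           # weight update u: plain learning rate
        W2 -= eta * dW2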
But now let's get a little more into the details of the survey itself. We read about 300 papers; only 227 of them actually applied to the study. The first statistic we looked at is the hardware used, by year, from pre-2010 (deep learning really only took off around 2010 to 2012 in the broader community) until basically today. What we found is a rather interesting trend: we are moving away from CPUs. It is a little noisy, but the number of papers using CPUs for experiments, and again this represents research, not industry, is declining. Interestingly, the number of specialized architectures is also declining, even though we just had the discussion that there may be more specialized architectures coming; at least in research papers in the parallel context there are fewer, and GPUs are dominating the field. This is just the data we collected. The second piece of data, again from pre-2010 until 2017, is that the papers increasingly use distributed-memory machines: more than 50% of the papers today report experiments with multiple nodes involved in deep learning. This is an interesting development; it basically shows that deep learning is largely a distributed-memory problem today, so we have to care about distributed deep learning.

But distributed deep learning is a little more complicated than deep learning, because now we need to talk about things like node counts (how many nodes are we using) and communication (how are we communicating between these nodes). For the node count we have another statistic, pre-2013 to 2017, showing the distribution of node counts across the papers in each year bin. The first interesting data point is DistBelief, where on the order of 5,000 CPU nodes were used to train networks at Google scale. What you can see from the pre-2013 bin (2012 to 2013) is that the node counts then started to decline, and you can guess what happened: this was when people discovered that with GPUs you can lower the number of nodes by two orders of magnitude and still achieve about the same performance. That was the GPU impact. It kept declining, but now we see a new surge going up. Why is that? We have now basically switched to GPUs completely, as you can see in the previous statistic, but we have realized that you can actually do more things with deep learning, and you need to scale this up to achieve that performance. For example, the Titan supercomputer, with more than 10,000 GPUs, was employed for very large learning problems in the last year.

The next question is what is used as the communication layer in these papers, and again this is research, not necessarily industry. From pre-2013 up to 2017 we can see that MPI is dominating the game here, which is also somehow interesting: it is another piece of evidence that deep learning is converging towards HPC and HPC is converging towards deep learning. These are in fact very similar problems computationally, and I'm trying to make that point a little more strongly. So let me look again at the stochastic gradient descent I showed you before, because I lied to you, unfortunately, two slides earlier, but now we have a little more background so that I can explain it in more detail.
I told you that stochastic gradient descent, in its pure form, picks one random element from X, and that is actually true, but it is not what anybody does, because picking one element at a time is very slow. What everybody does today is pick a set of elements, a random subsample of all the elements, called a mini-batch, and run the same algorithm on the mini-batch. The difference is that we are not updating the weights after every single element but only after B elements, which has two interesting effects. First, it seems like it should give a worse result than updating after every single element, but in practice that is not true, because in practice my data is noisy, and by averaging over the set of B elements I am averaging out the noise; that is at least part of the theory. So the accuracy actually increases slightly as I increase B. However, if I increase B too much, I lose the stochasticity: if B is the size of the whole data set, I only have one batch, it is no longer a mini-batch method but a batch method, and when I lose the stochasticity I again lose convergence and quality.

What we basically found in this study is that there are three zones with respect to the mini-batch size. The validation error, which is basically the quality of your model, starts rather high, because of the noise in the data, when the mini-batch size is too small; it improves with larger mini-batch sizes, going into this green area; and if the mini-batch size gets too large it declines again, because you lose stochasticity. Hardware performance, on the other hand, is simpler: it is just very bad if you have a single element, and it keeps increasing until it saturates, because your parallelism increases towards the right. So what you want to do is operate in the middle zone. The reason we don't have specific numbers in this plot is that this is incredibly problem dependent: the actual sizes depend on your update function, your data, your network structure, and your loss function, so it is very complex to define the exact mini-batch size. A couple of years ago people said the mini-batch size should be ten-ish, but we are actually moving to mini-batch sizes up to the ten thousands today. There were papers that said 64 is the maximum useful mini-batch size, but we can push this quite far today with advanced methods like momentum tuning or Nesterov momentum. There has been a huge push to increase the mini-batch size, because that is our parallelism: the larger you go to the right, the more parallelism you have and the more nodes you can use. So the idea was really to make this as large as possible, and people invented new update functions to make that possible. And you don't lose hardware performance if you scale more: with a larger mini-batch your performance will always go up, it just increases less and less, because at some point you hit the maximum that your hardware allows. So one curve is the statistical property, the validation error, and the other is the hardware performance, how well you are running on your system, and you really want to operate in the zone where the validation error is not too bad but the performance is near its maximum. Again, the exact number is rather complicated to tune.
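As a sketch of what such update functions u can look like, here are plain SGD and a momentum variant side by side; the hyperparameter values and the random stand-in gradients are illustrative only:

    import numpy as np

    def sgd_update(w, grad, eta=0.1):
        # the simplest update function u: a plain learning rate
        return w - eta * grad

    def momentum_update(w, grad, velocity, eta=0.1, mu=0.9):
        # momentum: accumulate a decaying sum of past gradients; in practice this
        # is one of the techniques that tolerates larger mini-batch sizes
        velocity = mu * velocity - eta * grad
        return w + velocity, velocity

    w, v = np.zeros(10), np.zeros(10)
    for g in (np.random.randn(10) for _ in range(5)):   # stand-ins for mini-batch gradients
        w, v = momentum_update(w, g, v)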
So let me now give you a little background on parallelism, because we will need it in a couple of minutes; this is a bit of a theory slide. Every computation can be modeled as a directed acyclic graph, where the nodes are arithmetic operations (plus, minus, times, whatever) and the arrows are data flow between the nodes, i.e. dependencies between the nodes. Given this graph we can define two metrics. One is called the work, which is just the total number of arithmetic operations; very simple, in this case if you count the vertices you get 39. The second metric is called the depth, which is basically the longest path from any of the input vertices to any of the output vertices; here it is seven, one two three four five six seven, and all paths are the same length in this particular diagram. One interesting insight from parallel computing theory is that the average parallelism in that model is the total work divided by the depth; it is a good metric for the average number of compute elements you can employ to solve the problem, and we will see how that relates to deep learning soon.

The second thing we need to understand is a little communication theory, because the parameter updates I will talk about in a couple of minutes need global reductions. The idea of a global reduction is very simple: we have lots of values x1 to xn and we want the global sum of these values. There are many ways to do this. The simplest is a parallel tree: we have four different values here, x1, x2, x3, x4, the colors are different compute elements, and we want to reduce them, so we sum these two and these two, and we have one value; but at the end all the nodes need that value, so we need to broadcast it again. You can write down the time this takes: L is the latency of the network, log2 P appears, P is the number of processing elements, gamma is the inverse bandwidth, and m is the size of my message. You can easily halve that with a more advanced algorithm, the butterfly algorithm, which basically gets rid of the factors of two. That is very simple, but the problem is that you still have the bandwidth term, gamma times m, multiplied by log2 P, which is not good. There is another algorithm that sends the data along a ring in a pipelined mode, where the bandwidth term is multiplied only by (P-1)/P, basically a constant, basically one for large enough P, but the latency term is now linear, so that is also not great. This was, by the way, the algorithm that Baidu was very happy about; they released multiple press releases about it as the deep learning allreduce. But there is a better algorithm known from the HPC field, often called Rabenseifner's algorithm, a reduce-scatter followed by an allgather, which has a logarithmic term in the latency and a constant term in the bandwidth.
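For reference, here are the textbook cost models for these allreduce variants, written as functions of the latency, the inverse bandwidth gamma, the message size m, and the process count P; treat the exact constants as approximations rather than the talk's exact expressions:

    from math import log2

    def tree_allreduce(lat, gamma, m, P):        # binary-tree reduce followed by a broadcast
        return 2 * lat * log2(P) + 2 * gamma * m * log2(P)

    def butterfly_allreduce(lat, gamma, m, P):   # halves the tree's factors of two
        return lat * log2(P) + gamma * m * log2(P)

    def ring_allreduce(lat, gamma, m, P):        # pipelined ring: near-optimal bandwidth, linear latency
        return 2 * (P - 1) * lat + 2 * gamma * m * (P - 1) / P

    def rabenseifner_allreduce(lat, gamma, m, P):  # reduce-scatter + allgather
        return 2 * lat * log2(P) + 2 * gamma * m * (P - 1) / P

    # small messages favour the latency-optimal variants, large messages the bandwidth-optimal ones
    for m in (1e3, 1e9):
        print(m, tree_allreduce(1e-6, 1e-10, m, 1024),
                 ring_allreduce(1e-6, 1e-10, m, 1024),
                 rabenseifner_allreduce(1e-6, 1e-10, m, 1024))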
So, basically, for small vectors you want to use the latency-optimal algorithms, and for large vectors the bandwidth-optimal ones; in fact, if you look at the theory, you will find that the reduce-scatter plus allgather algorithm is optimal in latency within a factor of two and strictly optimal in bandwidth. Great, so we actually don't need to do much research here to improve this. [Audience question about whether this changes if the reduction is done in the network switches.] Yes, the structure stays the same, just that these elements are switches and not compute nodes; they are still switches here, and the bandwidth does not necessarily go down. The data is still reduced in the same way, you just reduce by a constant, and the switch can do a much faster reduction, but the math doesn't really change that much. So there is an algorithm that reaches the lower bound, and this is something we discovered in the HPC field, so there is a good opportunity for technology transfer into the deep learning field.

But now let me get into more detail about what these networks actually do. This is GoogLeNet, one of these networks; there are many of them, and this is just one that is relatively easy to show. Looking at it from left to right, as I mentioned, this is the function f(x) that implements the transformation, and if you look at a subset of it you find all kinds of different boxes: the output of the previous layers, convolutions, max pooling, concatenation, all kinds of different layer types. There is in fact a huge flurry of layer types, and I want to talk about some of them in terms of the parallelism within the layer. Usually you process these layers on a GPU; the GPU provides thread-level parallelism, SIMT, and you exploit that parallelism within the layer. For example, an activation layer is a very simple layer, just a pointwise function that you apply, and we are looking at work and depth here: the work is proportional to N, the mini-batch size, times C, the number of channels in your picture (this is again specific to pictures), times H and W, the height and width of the picture, basically the number of pixels, all multiplied by the mini-batch size. You can see that if the mini-batch size gets larger, my parallelism gets larger; the depth is 1, so my average parallelism goes up like crazy. Let's do this for the other layer types rather quickly, because I'm running out of time. For the fully connected layer it is very similar: here we have the channel counts of the previous and the next layer, C_in and C_out, and again the mini-batch size N, so the work is linear in the mini-batch size, and the depth is only logarithmic. The convolution is very similar, with more terms in the equations: Kx and Ky are the size of the convolution stencil in the x and y directions, and the depth is a sum of logarithms. The pooling layer is similar: work N·C·H·W and a logarithmic depth. And the last one is the batch normalization layer, which is applied to, well, normalize the batch: essentially you sum the values up and divide to get the mean over the batch, which gives you a logarithmic depth, and the work is again N·C·H·W.
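To make these work-depth figures concrete, here is a small sketch that turns the rough expressions above into average parallelism W/D; the exact log terms are my own approximations of the survey's formulas, so read the numbers as order-of-magnitude estimates:

    from math import log2

    def activation(N, C, H, W):
        work, depth = N * C * H * W, 1                      # pointwise: depth 1
        return work, depth, work / depth

    def direct_convolution(N, Cin, Cout, H, W, Kx, Ky):
        work = N * Cout * Cin * H * W * Kx * Ky             # multiply-adds
        depth = log2(Cin * Kx * Ky)                         # tree-sum over the stencil (assumed)
        return work, depth, work / depth

    # e.g. one mid-sized convolutional layer on a mini-batch of 256 images
    print(direct_convolution(N=256, Cin=64, Cout=128, H=56, W=56, Kx=3, Ky=3))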
The basic conclusion from this is that you have a huge amount of parallelism at each layer, but this parallelism is extremely fine-grained; this is why GPUs work so well for the layer computation. Let me look at one particular layer here, the fully connected layer, because this is where we come back to something very close to LINPACK: the way we compute a fully connected layer is really a matrix-matrix multiplication. We want this output vector, we arrange the weights going from x1 through the first neuron, the second neuron, and so on, in a nice way, and we get a matrix-matrix multiplication for fully connected layers. Of course, in deep neural networks there are not that many fully connected layers, so there is a rather small amount of matrix-matrix multiplication in here, but it is perfect for GPUs.

The second layer type that is seen very often, and that actually dominates the performance of these deep neural networks, at least for image recognition, is the convolutional layer. There are four different major ways to evaluate convolutional layers. The first category is direct convolution, where you just apply your convolution operator point by point and compute the convolution; very simple. You can rearrange this in a clever way to make it more amenable to GPUs; that is called the im2col method, where you change the data layout and basically reduce the I/O complexity. Furthermore, you can apply convolutions by transforming the convolution filter as well as the image itself into an alternative space: for example, you can transform them into the reciprocal space with an FFT, where the convolution can then easily be applied as a Hadamard (element-wise) product between the two operands; of course you have two forward FFTs, you apply your convolution as that product, and then one FFT backward. This is very beneficial if you have very large convolution operators that touch a large part of the spectrum. And you can use the Winograd convolution, which is similar in spirit to the FFT approach: transform to a different space, apply the operation there, and transform back. If you want all the details you can look at the paper, but what I want to point out is again the work-depth analysis, and here we have a very similar result to before: there is an incredible amount of parallelism in the work and very little depth, for all of the methods. Again, perfect for parallel computing. And we can also see that N, the mini-batch size, is basically a multiplicative constant in all of those equations.
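As an illustration of the FFT approach, here is a plain numpy sketch that transforms the image and the filter, multiplies them element-wise (the Hadamard product), and transforms back; it computes a circular convolution and ignores the padding and striding details that real frameworks handle:

    import numpy as np

    def fft_conv2d(image, kernel):
        K = np.zeros_like(image)
        K[:kernel.shape[0], :kernel.shape[1]] = kernel      # zero-pad the filter to image size
        F_img = np.fft.fft2(image)                          # forward FFT of the image
        F_ker = np.fft.fft2(K)                              # forward FFT of the filter
        return np.real(np.fft.ifft2(F_img * F_ker))         # element-wise product, then inverse FFT

    image = np.random.rand(32, 32)
    kernel = np.ones((3, 3)) / 9.0
    print(fft_conv2d(image, kernel).shape)                  # circular convolution of image and filter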
So far I have talked about the parallelism at the layer level, the work-depth analysis for each of the layer types; now let me talk about parallelism at the overall framework level: how can we parallelize the deep learning problem for real? Here we again have our input examples, and the first type of parallelism, of three in total that we can exploit in deep learning, is called model parallelism. The idea of model parallelism is that we take an example, or a set of examples, a mini-batch, run it through the network, and parallelize each layer by itself; the different colors here are different compute elements. This is, for example, very suitable for vectorization or SIMT parallelism on GPUs; this is what everybody does within a node. In a distributed-memory system, not so much: model parallelism is used rarely there, because it requires very low latency and very tight coupling between the compute elements at this level. But if you do want to do this in a distributed-memory system, it is rather interesting, because you can now distribute the parameters across different processors, which means that the 32 gigabytes you would need to hold a full model on a single GPU become 16 gigabytes per GPU with two GPUs, and 8 gigabytes with four. So it is a way to reduce your state per processing element. The problem is, of course, that your mini-batch has to go through all processors, and the backpropagation requires all-to-all communication at each of the layers, so this is not used as often.

The second kind of parallelism we call pipeline parallelism, and it is used more often. Again, of course, you take an example from a mini-batch, but now each of the layers is processed on a different processing element: the green one is a different GPU than the blue one, which is a different GPU than the brown one. The big benefit is that you can still distribute the layers, just along the other dimension, across the different compute elements, and you have a very sparse communication pattern, because GPU one talks only to GPU two, not to GPU three, and you can arrange this as a very convenient pipeline. The downside is, of course, that the mini-batch still needs to be copied through all the processors.

The last mode is the most used mode of parallelism, and it is called data parallelism. It is relatively intuitive: the idea is that you take multiple examples of your mini-batch, sub-batches, and run them completely independently through multiple instances of your network on different GPUs. At the end, of course, you need to update all the parameters, and this is where the allreduce comes in that I mentioned before. This is a very efficient solution and very easy to implement, which is why it is the prevalent parallelization mode, even though everybody mixes all of them. The problem is, of course, that you have all the parameters, the full network, the full state, at every single compute element, so if it doesn't fit on your GPU you are out of luck and need to do something else. But obviously you can combine those three modes of parallelism: you can use hybrid parallelism, which combines for example data parallelism and model parallelism, and even pipeline parallelism; most advanced frameworks today offer ways to combine these.

Wonderful. So now let's talk about data parallelism, because that is the most prevalent way to distribute the updates. How do we actually implement this? One way is to keep all the weights in a central position, on a central node: we have four training agents here, and we put all the weights into a parameter server. Each training agent basically sends its weight gradient to the parameter server and receives the updated weights back from it. This was the first mode in which distributed deep learning was run.
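A toy push/pull sketch of this centralized pattern, with MPI rank 0 acting as a single, unsharded parameter server and random numbers standing in for the local gradient computation (real systems shard the parameters across several server processes, as discussed next):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, P = comm.Get_rank(), comm.Get_size()
    n, eta, steps = 1000, 0.1, 100

    if rank == 0:                                           # the parameter server
        weights = np.zeros(n)
        status = MPI.Status()
        for _ in range((P - 1) * steps):
            grad = comm.recv(source=MPI.ANY_SOURCE, tag=0, status=status)  # pushed gradient
            weights -= eta * grad                           # apply the update
            comm.send(weights, dest=status.Get_source(), tag=1)            # return new weights
    else:                                                   # a training agent
        for _ in range(steps):
            grad = np.random.rand(n)                        # stand-in for backprop on a sub-batch
            comm.send(grad, dest=0, tag=0)                  # push the gradient
            weights = comm.recv(source=0, tag=1)            # pull the updated weights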
If you come from an HPC background, you cringe a little when you hear "parameter server", because you see immediately that this could be a bottleneck: imagine you have 10,000 of these training agents and they all send to one server. And that is exactly the problem. Of course, the parameter server is sharded, which means there are actually multiple physical machines forming one logical server, each responsible for a subset of the parameters. You can again write down the computational, or rather communication, complexity in this case, and you see a linear term in the data volume, so it is a concern.

The other way to do this is decentralized, the HPC way: we again have these four training agents, but instead of sending my data to some centralized location to sum it up, we just employ an allreduce, like you would do in MPI, to get the global sum of the data. We can employ one of the algorithms from before, and in fact, if we are clever, we use a good MPI library and don't even worry about the allreduce, because we rely on our vendor to implement a good MPI allreduce for us. We can use other optimizations as well: MPI topologies, neighborhood collectives, RMA; we can use all of these things to optimize the performance of these networks. In fact, the parameter server can also be optimized: it can be made hierarchical, which gets us a logarithmic term in the bandwidth, but it introduces another logarithmic term in the latency, so it is also problematic. From a theoretical perspective, the best option for large scalability is really the decentralized one.

But now there is another dimension to this problem that we need to explore: how do we actually do this communication? Let's assume the centralized mode, the parameter server, for a minute. We could send the gradients and then wait until the parameter server sends us the new weights; this is called the synchronous method. This may be slow, because we need to wait for the parameter server, and if one of the training agents comes late, we need to wait for the last one; we have a straggler problem. Another way to handle this is to not wait at all: we always send our data to the parameter server, and the parameter server just sends us back whatever weights it currently has cached. But that is also not the greatest idea, because it may happen that one of the training agents is extremely slow and we never get updates from that agent in time. So there is an intermediate way, called the stale-synchronous method: the idea is that we bound the asynchrony; if one training agent is more than k steps ahead of the slowest one, it will wait until the slowest one catches up. So we can bound the staleness. These methods have been used since 2011, when DistBelief moved this to distributed-memory systems, and this is basically what we do today. The fully asynchronous method trades away statistical performance, because we may never get the right data, but it is the most hardware-efficient way, because we are always busy doing something and never wait; the synchronous method is the most statistically efficient, because we always have the correct data, but the least hardware-efficient, because we may always be waiting for somebody.
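A minimal sketch of the synchronous decentralized variant, discussed next, using mpi4py's Allreduce (which maps onto MPI_Allreduce); the gradient computation is again stubbed out with random numbers:

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    P = comm.Get_size()

    weights = np.zeros(1000)                 # every agent holds the full model (data parallelism)
    eta = 0.1

    for step in range(100):
        local_grad = np.random.rand(1000)    # stand-in for backprop on this agent's sub-batch
        global_grad = np.empty_like(local_grad)
        comm.Allreduce(local_grad, global_grad, op=MPI.SUM)   # blocking, synchronous allreduce
        weights -= eta * global_grad / P     # average the P agents' gradients, then update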
We can do something very similar for the decentralized method, and the idea is basically the same. We can have a synchronous decentralized method, which employs a blocking MPI allreduce; we can have a bounded-asynchrony method, which could employ MPI_Iallreduce, the non-blocking allreduce I introduced to the MPI specification a while ago; or we can have a fully asynchronous method where you just never wait. We can also talk about completely different models here, for example gossip protocols, where not even all the nodes may be reached in an update. And we can take this spectrum from consistent to inconsistent even further: we have synchronous SGD, stale-synchronous, asynchronous, but we could use an even less synchronous approach, so-called model averaging, where whenever we get an update from somewhere we decay that update, say with a spring model or some other physics-inspired model, according to its lateness, so we reduce the importance of late updates. And if we go fully inconsistent, we can do so-called ensemble learning: we learn multiple networks completely independently and then employ an averaging function for the actual prediction; this could in fact be another network, or a network of networks, and you can do this recursively as long as you want. That is the most asynchronous end of the spectrum.

Now, to give you some ideas about communication optimizations in the last minutes: there are many different ways to optimize these updates. One of them is sufficient factors, but I don't have enough time to talk about that, so let me skip it. The most prevalent one is to use lossy compression when sending the data, and to accumulate locally the error that we introduce by compressing, before sending it later. One way to do this is quantization: I have a 32-bit weight, but let me quantize it down to a smaller range of numbers, say 16 bits, or 8 bits, or even one bit. Low precision is kind of standard now, but for sending the weight updates even one-bit SGD has been employed, where you basically send one bit per weight, just the direction. This works, and I'm not kidding, for some cases; it doesn't work for all cases. The second way to reduce the communication volume, which is the major bottleneck, is sparsification. The idea here is that we don't send all the weights: we say, let's just send the top ten gradient entries; we have a thousand, we pick the largest ten and send them, and we accumulate the other ones locally until they become part of the largest ten, and then we send them. You can prove that this will eventually converge, but you only ever send one percent of your data. This is actually something that we have experimented with in my lab.
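A sketch of top-k sparsification with local error accumulation: only the k largest-magnitude entries are sent, the rest are kept in a local residual and added back before the next selection; the value of k and the one-percent figure are illustrative, not the lab's exact settings:

    import numpy as np

    def sparsify_topk(grad, residual, k):
        acc = grad + residual                        # add back what was not sent previously
        idx = np.argsort(np.abs(acc))[-k:]           # indices of the k largest-magnitude entries
        sent = np.zeros_like(acc)
        sent[idx] = acc[idx]                         # sparse update that actually gets communicated
        residual = acc - sent                        # keep the remainder locally for later
        return sent, residual

    grad = np.random.randn(1000)
    residual = np.zeros(1000)
    sent, residual = sparsify_topk(grad, residual, k=10)   # send only ~1% of the entries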
One complication you necessarily have to deal with is that each of the nodes, here four nodes, may have a different subset of the parameters that are large. This is your parameter space; the white entries are the zeros you are not sending and the black ones are the non-zeros you are sending, and each node has a different subset, so as you go up the reduction and sum these vectors, the result gets denser and denser, and at the end it is relatively dense, depending on the distribution of your parameters. We found in a study that this works well: we did this on Piz Daint, where you can get a speedup of nearly two, and if you have a really bad network, in fact on EC2, you can get a speedup of twenty. So this is one of the major messages: if you do communication optimization and want to show that your communication optimization is great, go to a very bad network and you will see a lot of impact; on Piz Daint you don't see such a big impact. The accuracy is basically the same, so this is a very interesting result, and you can find it at the bottom, on arXiv.

Then there are two more things I want to cover very quickly, because they are also very important for parallelism in deep learning. One is architecture search. I mentioned at the beginning that the structure of these networks is usually designed by humans; AlexNet, for example, was designed by Alex Krizhevsky, who named it after himself. But you can also employ different optimization strategies to find a good network structure automatically. Then, however, the computational cost explodes, because remember that you need to train each network before you can tell how good it is, and if you do reinforcement learning or evolutionary algorithms to develop these networks, you may have to train many, many networks, each of them taking weeks to converge; the computational cost is very high. For example, Oak Ridge did this in an experiment: they ran 24 hours on the full machine to train networks and found one that gave them two percent higher accuracy than the baseline network, and you can now compute what 24 hours of the full Titan machine actually costs; it is quite expensive.

With that I want to close my talk, and I basically want to make this a call to action. The idea of the survey was to really connect the deep learning community to the HPC community, so that we have a clear analysis of what the computational pieces of this deep learning field are, and interestingly, this didn't quite exist before we did it. Unfortunately it ended up being 60 pages, but we have a full analysis of parallelism, distribution, and synchronization, much more than I could cover in this talk; the talk was actually rather shallow compared to the full survey. There is additional content like GANs, autoencoders, and RNNs and LSTMs; I didn't talk about recurrent networks at all, but those are actually used for voice recognition and translation. The idea is really that we should try to understand how to bridge the gap between the field of deep learning and the field of HPC. I think it is happening naturally right now, but it would be very good to accelerate it a little. So thank you very much, and I'm open for questions.
Info
Channel: InsideHPC Report
Views: 9,623
Rating: 4.9832635 out of 5
Keywords: hpc, supercomputing, HPC Advisory Council, ETH Zurich, Swiss HPC Conference, insideHPC, Deep Learning, AI
Id: xtxxLWZznBI
Length: 44min 22sec (2662 seconds)
Published: Mon Apr 09 2018