Efficient Distributed Deep Learning Using MXNet

Video Statistics and Information

Captions
...professor at UCI, and she is also affiliated with Amazon at the moment. She has done a lot of amazing work using tensor spectral methods for probabilistic graphical models, but in this talk she is going to talk about distributed deep learning.

Thank you, Lee. If you have followed the series of talks in this workshop, I think this will be the most practical talk you'll hear, so I hope it offers a different perspective. When we say computational aspects of machine learning, a big part is how we actually enable these computations at scale: the infrastructure support, the software support, and being able to quickly take algorithms and productionize them. I'll discuss the practical side of these aspects and also describe why MXNet is a very compelling package for distributed deep learning at scale.

We are all very excited about the developments in machine learning and artificial intelligence, but if you think about the crucial ingredients that enabled this revolution, it is many things in addition to the core algorithms. An important part was having very robust software packages; it is the software support that enabled people to quickly innovate and try out algorithms at scale. Another was compute power: most of the algorithms have been around for decades, but what changed the equation was being able to run them at scales that were not possible even five years ago. The third important component was the creation of data; access to large open-source datasets enabled us to really push the boundaries of what is possible in machine learning. The question today is how we can build an ecosystem that actively enables algorithm writers to easily launch their software at scale.

When it comes to deploying any of these deep learning models, they are enormous. This is the Inception v3 network, one of the state-of-the-art networks, and you would not want to program it from scratch, so we need a rich enough set of primitives and enough software support to make that possible. The other thing is that the number of computations even for one image to go through this network is enormous, billions of operations per image, so enabling this is important, and this is where the question of what hardware to use also comes into play. GPUs have been very successful because they provide many more FLOPS than a comparable CPU, so they are the hardware of choice for deep learning, and now we are seeing growth in ASICs and other dedicated hardware. If we want to go to the next level in scaling up deep learning, hardware and architecture become important considerations as well.

The other consideration is memory. As models get bigger and bigger, how do we support them, while at the same time we want to run on smaller devices, your smartphone, IoT devices, a Raspberry Pi, where there are strong memory constraints? How do we still get accurate, state-of-the-art results there? If you naively do the computations, the memory consumption grows linearly as you add more and more layers.
State-of-the-art networks have hundreds of layers, so the questions of how to easily program such large networks, how to do computations at scale, and how to manage memory consumption are, I would say, the three important aspects when it comes to deploying deep learning.

When it comes to software packages, today there is a whole host of choices, so how do we judge them in terms of the desirable attributes for running deep learning at scale? As I said, the first consideration is programmability: whether there is software support, a rich enough set of primitives, that simplifies defining the networks and the optimization procedures and provides a flexible programming environment in which people can optimize for computation and memory. The second is portability. Today there are many platforms on which we may want to run our machine learning algorithms, and the more portable the code, the easier it is to try it across different platforms; this is also where the ability to adapt to different memory requirements is useful. The third is efficiency. As researchers in machine learning we have so far not paid much attention to this, thinking of computation as something out there that will solve our problems, but as we scale up, efficiency becomes important because we cannot simply keep adding a hundred or a thousand more GPUs. Our tasks are getting more complex and the models are getting bigger, so efficiency gives us a big win over time. At Amazon we are enabling machine learning applications on the cloud infrastructure, so there is a strong motivation for efficient packages, because efficiency also means a large reduction in cost for customers running on the cloud. We want to worry about efficiency both during training and at inference.

As I said, there is a whole host of choices for deep learning software, and I think that is a good thing; arguably that availability is what allowed research to really take off. What the AWS deep learning group is doing is actively developing MXNet and supporting it extensively on the AWS infrastructure, so it is the framework of choice for us and what we recommend to our customers. There are two important justifications. One is that it is the most open of these packages: it is an Apache open-source project with an active ecosystem of developers outside of Amazon. In fact it started out as a university project; some of the key contributors, such as Mu Li, who was a student at CMU and is now at Amazon, and Tianqi Chen, who was at the University of Washington, put a lot of university effort into building MXNet. The second is that we are actively ensuring it integrates well with the AWS infrastructure and gives very efficient scaling for many deep learning applications.

Let us now see what MXNet offers in terms of programmability. Before the programming models, there is first the choice of languages. There is a whole host of programming languages, and depending on your background you may prefer one or the other.
For a lot of data scientists Python is probably the top choice, perhaps followed by R for statisticians and MATLAB for some others, so you have a whole host of choices for the front end. The problem is that performance varies if you just run these front ends as they are. What MXNet offers is to take any of these front-end languages and, through a C++ interface, connect it to efficient back-end libraries, whether the CUDA deep learning libraries on GPU or the MKL libraries on CPU. The idea is that your language of choice should not matter for the end performance. You can also use Scala if you come from a Spark background and integrate your Spark jobs with MXNet in this way.

The next part is the programming model, and there are mainly two forms. The simple one is the imperative style: a simple Python program is interpreted line by line, there is no compilation step, and it is easy to write and easy to debug. The downside is that it is bad for optimization. Especially on a parallel system, if you want to run the code optimally you need to plan memory allocations and so on, and there is no way to do that with purely imperative code. MATLAB and Python have this imperative style, which is also why it is bad for performance on its own.

The other approach is the declarative style. Here you declare all the variables and compile the program into a computation graph, and because of this computation graph there is knowledge about how to optimize memory and computation. In the simple example on the slide, the variable C is overwritten in the very next step by D, so you do not need separate memory for both C and D; you can reuse the memory of C for D. You can plan memory allocations better because you have the computation graph, and that is why many state-of-the-art deep learning packages use the declarative style: it enables better performance. The downside is that it is more involved to write than the imperative style and leads to much longer programs.

So what does MXNet do? It lets you mix both styles. MXNet has the NDArray API, which is simply a multi-dimensional array: you can declare it in the imperative style, do any set of operations on it imperatively, and then import the result into your symbolic executor. The idea is that parts of the computation that are not critical to optimize for memory can stay imperative, while the rest is brought into the symbolic executor. In the example on the slide, the network architecture is written in the symbolic style, and intuitively it is the architecture, the forward and backward passes, that matters most for performance and is most important to optimize. But something like an operation on an array, say the gradient update step, can be written in the imperative style, so you can mix the two paradigms in MXNet. You can also do the opposite: write a module in the declarative style and bring it into your imperative program. In the example here, a module written in symbolic style is brought in and the parameter update is done imperatively, which leads to much shorter programs than writing everything declaratively.
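To make the mixed style concrete, here is a minimal sketch using MXNet's symbolic and NDArray APIs. It is not the code from the slides: the layer sizes, names, and input shapes are invented for illustration, and it assumes an installation of MXNet's classic (pre-Gluon) API.

```python
import mxnet as mx

# Declarative part: the network architecture is defined as a symbolic graph,
# so MXNet can plan memory and schedule the forward/backward passes.
data = mx.sym.Variable('data')
fc1  = mx.sym.FullyConnected(data=data, num_hidden=128, name='fc1')
act1 = mx.sym.Activation(data=fc1, act_type='relu', name='relu1')
fc2  = mx.sym.FullyConnected(data=act1, num_hidden=10, name='fc2')
net  = mx.sym.SoftmaxOutput(data=fc2, name='softmax')

# Bind the graph to concrete shapes and a device to get an executor.
batch_size = 32
exe = net.simple_bind(ctx=mx.cpu(), data=(batch_size, 784))

# Imperative part: NDArray operations run eagerly, line by line.
exe.arg_dict['data'][:] = mx.nd.ones((batch_size, 784))
exe.arg_dict['softmax_label'][:] = mx.nd.zeros((batch_size,))

exe.forward(is_train=True)
exe.backward()

# The SGD step is a plain imperative loop over the graph's parameters.
lr = 0.1
for name, grad in zip(net.list_arguments(), exe.grad_arrays):
    if name in ('data', 'softmax_label'):
        continue  # inputs and labels are not updated
    exe.arg_dict[name][:] -= lr * grad
```

The forward and backward passes go through the optimized symbolic executor, while the update rule stays as a few lines of ordinary imperative code.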
That is the idea: MXNet gives the developer the flexibility to decide when imperative programming is needed and when symbolic execution is needed. Where obtaining the computation graph is essential for efficient memory allocation, you write that part in the declarative style; in other parts of the execution you write in the imperative style. This gives you the most flexible programming framework of all the deep learning packages.

Let us now see how MXNet makes it easy to port across different platforms. The idea is that you can fit all the core libraries and dependencies into a single source file, so you can easily port from one platform to another and compile on any platform. One example was a simple smartphone app for recognizing objects, done in a pretty straightforward manner using MXNet on JavaScript; because of the JavaScript support you can easily run applications on your smartphone. So this is another thing MXNet supports that makes it easy to port applications to different devices and architectures.

The other thing I said is important is memory optimization. Our models are growing bigger and bigger; one option is to keep growing the hardware as well, but there are limits, and model parallelism across different machines is still quite challenging. So are there ways to train big models while doing more intelligent memory allocation? Consider the computation graph for the standard forward and backward pass (the slide has a typo here). The input is at the start; in the forward pass you compute the activations of each layer, and during the backward pass you update the weights of each layer, and for that you need the activations of that layer, which are shared across from the forward pass. This is the standard computation graph, and if you simply allocate memory for every layer, the memory requirement is linear: if you store all the activations from the forward pass, you need memory that grows linearly with the number of layers.

What else can you do instead? One option is not to store everything, but to recompute as needed. In the example here we divide the network into two segments and store only the head node of each segment; the other activations computed within a segment are not kept in memory. Now the memory requirement is much lower, because only the head nodes are stored, and what then needs to be done is that during the backward pass you recompute the results.
When I need the information about a particular activation, I recompute it on the fly and then proceed. So there is a trade-off between computation and memory that you can play with: how many times do I recompute versus storing everything in memory? Depending on the memory budget I have, I can play this game. One interesting question is what to store. If you only stored, say, the first few activations, that would not be very effective, because to recover a later activation you would have to run through everything after them; that is why it is more useful to segment the network into parts, so you can start from a segment's head node and recompute only from that point onwards. With this, one can show that the memory requirement can be reduced to roughly the square root of the number of layers, while requiring in total only one extra forward pass, because in each segment you start from the head node and proceed only a few steps until the needed activation is recomputed. In this way you get a sublinear memory requirement without doing any compression; at the end of the day you get exactly the same results as with the full-memory model, but you can run it with sublinear memory. There are lots of interesting questions about how to trade off memory and computation, and MXNet lets you play with these trade-offs quite easily; there is a paper if you want to learn the details.

I had some numbers to show how much memory can be saved. Take a ResNet with a thousand layers and a batch size of 32: if you store the activations of the entire batch in every forward pass, across a thousand layers, that is about 130 GB. Storing only the head nodes gives a large reduction. The same goes for LSTMs. So you can think about different schemes that trade off computation and memory using this framework, and that way you can run even on smaller devices, or enable larger models with more layers, without incurring a huge memory cost.
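As a rough illustration of the recompute-by-segments idea (not MXNet's actual implementation), here is a small Python sketch; the layer function, sizes, and segment length are invented for illustration.

```python
import math
import numpy as np

def layer(x):
    # Stand-in for one layer's forward computation (e.g. a ReLU).
    return np.maximum(x, 0.0)

def forward_with_checkpoints(x, n_layers, seg_len):
    """Run n_layers forward steps, storing only one activation per segment."""
    checkpoints = {0: x}              # head node of the first segment
    for i in range(n_layers):
        x = layer(x)
        if (i + 1) % seg_len == 0:    # head node of the next segment
            checkpoints[i + 1] = x
    return x, checkpoints

def activation_at(layer_idx, checkpoints, seg_len):
    """Recompute the activation after `layer_idx` layers from its segment head."""
    head = (layer_idx // seg_len) * seg_len
    x = checkpoints[head]
    for _ in range(layer_idx - head):
        x = layer(x)                  # extra forward work traded for memory
    return x

n_layers = 1000
seg_len = int(math.sqrt(n_layers))    # ~sqrt(n) checkpoints, so sublinear memory
out, ckpts = forward_with_checkpoints(np.random.randn(32, 256), n_layers, seg_len)

# During the backward pass, activations are rebuilt on demand instead of stored.
a_507 = activation_at(507, ckpts, seg_len)
```

With segments of length about the square root of the depth, the stored checkpoints and the per-segment recompute buffer both stay around the square root of the number of layers, at the cost of roughly one extra forward pass in total.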
The third aspect is performance: how do we ensure good performance even as we scale up to hundreds of GPUs? The challenge is that there is a lot of overhead in the control layer, in getting the data to the GPUs, and in the communication cost involved in combining the results. I'll show some benchmarks of how MXNet performs on different measures. Optimizing parallel programs is a very classical problem, but the reason we need new frameworks now is that the networks have become very complex; even with just two GPUs and a CPU, laying out all the forward and backward pass computations gets very messy. There is MPI, but MPI is too low-level to enable parallelizing deep learning applications; we really do not want to worry about the low-level details of which device each computation goes to and when to get the result back.

MXNet parallelizes based on the philosophy of a hierarchical parameter server. At the lowest level are the workers, which are the GPUs if you are running on GPUs. This is the data-parallelism model, where each part of the data goes to a different GPU and each GPU computes local gradient updates. Then there is a level-one server, which is one of the GPUs: it takes in the gradient updates, combines them, and sends the result back to the workers, and this repeats. The idea is that most of the communication happens among the GPUs, over the high-bandwidth PCIe switch, which is much faster than going to the CPU or, for that matter, all the way over the network. You try to minimize communication to the higher levels as much as possible, and most of the iterative operations happen at the level of GPU-to-GPU communication, which is faster.

You can think about scalability at multiple levels. One is within a single GPU, which has lots of cores, so how do you scale across the cores? The next level is across multiple GPUs on the same machine; the largest instance we have on AWS has up to 16 GPUs, the K80s, which is the P2.16xlarge instance. And of course you can take many such nodes and parallelize across them as well. We have the benchmarks on GitHub, so you can go and verify this yourself. What we see is that on a single P2.16xlarge instance with 16 GPUs you get about 91% scaling efficiency for the state-of-the-art networks, ResNet and Inception v3. The ideal is, of course, that as you double the GPUs your throughput also doubles, and you can see we get quite close to that. The next level of scaling is to ask about multiple such instances: here we took 16 instances of P2.16xlarge, which is 256 GPUs, and we do see some drop in efficiency, but not a whole lot. As I said, you will find the details in the GitHub repository, but just to mention a few: this was done on ImageNet, with Inception and ResNet, and the batch size is quite small. (An audience member asked whether the curve was superlinear; the ideal is linear and we are below it; the axis doubles at each step, so it is not a linear scale.) One thing to note is that you can get much better efficiency with large batch sizes, but that is bad for generalization, so you do not want too large a batch size; doing this with smaller batch sizes is the real challenge, and we are doing well even there. So that is the idea: you get high efficiency even as you scale up to hundreds of GPUs, so if you want to run these systems continuously, MXNet is a very strong candidate. (Asked what the throughput is measured in: it is the number of images per second, at the batch size we feed in.)
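A minimal sketch of what data-parallel training looks like with MXNet's Module API, under the assumption of a GPU build on a 16-GPU machine such as a P2.16xlarge; the tiny placeholder network and synthetic data stand in for ResNet or Inception v3 on ImageNet.

```python
import mxnet as mx

# Placeholder network: in practice this would be ResNet or Inception v3.
data = mx.sym.Variable('data')
net = mx.sym.SoftmaxOutput(
    data=mx.sym.FullyConnected(data=data, num_hidden=10), name='softmax')

# Synthetic data stands in for the real ImageNet iterator.
train_iter = mx.io.NDArrayIter(
    data=mx.nd.ones((256, 3, 224, 224)),
    label=mx.nd.zeros((256,)),
    batch_size=32)

# Data parallelism: the same symbol is replicated on every listed GPU and
# each device processes its own slice of the batch.
mod = mx.mod.Module(symbol=net, context=[mx.gpu(i) for i in range(16)])

# kvstore='device' aggregates gradients GPU-to-GPU over PCIe on one machine;
# 'dist_sync' (with a cluster launcher) extends the same code to many instances.
mod.fit(train_iter,
        optimizer='sgd',
        optimizer_params={'learning_rate': 0.1},
        kvstore='device',
        num_epoch=1)
```

The only changes needed to move from one GPU to sixteen, or from one machine to a cluster, are the context list and the kvstore setting.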
We have an ambitious roadmap to grow MXNet to encompass other aspects of machine learning and to offer more machine learning algorithms on this platform. In terms of ongoing work, an important addition that will soon be available is sparse data types: current deep learning mostly requires only dense operations, but if you want to consider other applications, or schemes that compress the activations in different layers, you need sparse support, and that is an important addition we are working on.

The other part, closer to my own research, is how to enable tensor operations at scale. I'll show some preliminary results on neural network architectures where enabling tensor operations leads to better performance. The idea is that tensors are the right framework for thinking about multi-dimensional data. Multiple channels go into a neural network: at the input layer there are width, height, and the channels, which are RGB, and as you go through the activation layers you have some number of filters, so a variable number of channels. What you really have is a tensor being processed in each layer of the network, but we do not treat them as tensors; we treat them as matrices. The question is: by learning weights that operate directly on the tensor, can we get better accuracy, or a model with fewer parameters that does almost as well? That is the idea of adding tensor contraction layers to neural network models.

Take the popular AlexNet model: there is a set of convolutional layers, convolution, pooling, and so on, and at the end there are the densely connected layers, which are always the most expensive because they hold most of the parameters. In those densely connected layers we take the tensors, flatten them into matrices, and learn the weights. Can we change that and learn tensor weights that contract the tensor directly? In the usual AlexNet you go from the input through many convolution and pooling layers, get a tensor, flatten it, and feed it to the fully connected layer; instead we do a tensor contraction and see how that changes the result. The intuition is that you want to think of a contraction in all the dimensions simultaneously: can we directly learn contractions that result in a low-rank tensor? Without going into the details of the layer, the main idea is that you learn weights that contract the tensor along each of its dimensions; you are essentially transposing the tensor and learning a weight matrix for each dimension. As a preliminary result, in the best case we get almost 44 percent savings in the number of parameters in these dense layers, and in fact even better accuracy than the baseline. Because you are now exploiting the multi-dimensional structure of the tensor coming from the convolution layers, you can get better accuracy with fewer parameters. This is just a preliminary result; I think tensors can be used in a whole range of neural network architectures, and it is an interesting direction I hope to have more results on in the near future. Any questions? (Asked whether the reported accuracy is top-1: yes, top-1; that is the harder one, and that is what we wanted to aim for.)
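To make the contraction concrete, here is a small NumPy sketch of a Tucker-style tensor contraction in place of the flatten-plus-dense step; the shapes and ranks are invented for illustration, and in practice the factor matrices would be learned parameters.

```python
import numpy as np

# Activation tensor coming out of the last convolutional block:
# (batch, height, width, channels), as in the AlexNet example from the talk.
batch, H, W, C = 32, 6, 6, 256
X = np.random.randn(batch, H, W, C)

# One factor matrix per non-batch mode, projecting each dimension to a
# smaller rank instead of flattening X into a single long vector.
rH, rW, rC = 3, 3, 64
U_H = np.random.randn(H, rH)   # contracts the height mode
U_W = np.random.randn(W, rW)   # contracts the width mode
U_C = np.random.randn(C, rC)   # contracts the channel mode

# Tensor contraction layer: contract every mode simultaneously.
Y = np.einsum('bhwc,hi,wj,ck->bijk', X, U_H, U_W, U_C)   # shape (32, 3, 3, 64)

# Parameter count versus a flatten + fully connected layer producing the
# same number of output units.
dense_params = (H * W * C) * (rH * rW * rC)   # ~5.3 million weights
tcl_params = H * rH + W * rW + C * rC         # ~16 thousand weights
print(dense_params, tcl_params)
```

The contraction keeps the spatial and channel structure while using far fewer weights than a dense layer over the flattened tensor, which is the source of the parameter savings described above.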
Now, in the remaining ten minutes, I want to give you an overview of how, at AWS, we enable you to quickly launch deep learning jobs without worrying about setting up the infrastructure or keeping libraries up to date; these are mundane details, but they take up a lot of a data scientist's time. One piece is the Deep Learning AMI (AMI stands for Amazon Machine Image): these images are pre-configured to get you started with deep learning quickly. The other piece, as I said, is the GPU instances; the P2 instances are the powerful ones, and you can get lots of GPUs for your deep learning. To give a sense of how large-scale the computations AWS enables are: the experiment we ran with 16 P2.16xlarge instances is about 1.1 petaflops, while the world's fastest supercomputer, which I believe is in China, is about 93 petaflops. So you can get within reach of even the world's fastest supercomputers by going through the cloud infrastructure, and if we have to scale up machine learning even further, I think the cloud is the answer to a lot of these problems.

The other aspect is the AMIs. With the Deep Learning AMI, all the packages come pre-built and up to date, not just MXNet but whichever package you choose, so you do not have to worry about updates breaking your build; the drivers and everything else are pre-installed. It is a one-stop shop where you can quickly start doing deep learning rather than worrying about setting up your infrastructure. I had a very quick video showing that it takes just a few clicks to get deep learning started on AWS; it takes a couple of minutes and you do not have to worry about all the details, but it shows how easy the process is. You start from the EC2 console and create a stack; as I said, there is a CloudFormation script you can take from one of the existing templates. After that you decide what kind of instance you want, say a P2 instance, and details such as how many workers you need and what you would like to name the stack (a similar demo was shown a bit earlier). There are a few more things, such as permission controls, and that is pretty much it: you have created a stack for running your deep learning, and you can track on your console whether the stack has been created. Once its status shows as running, the workers are ready to launch deep learning jobs: you SSH to the master instance, launch training from one of the MXNet repositories, and that's it, the training is running. Anybody can very quickly, as you saw, get started with running jobs rather than worrying about setting up the infrastructure or the libraries. That was just a very quick demo to show how we make it easy for you to worry more about the algorithms than about the details of setting up the infrastructure. In addition to the MXNet effort in our group at Amazon, we have a number of other services as well.
Rekognition is for production-level image recognition; Lex is natural language understanding, which helps chatbot developers easily build interfaces for interacting with customers; and Polly is real-time text-to-speech. Let me run through them quickly. With Rekognition you get real-time object recognition as well as face detection; it returns category labels with confidence levels, and you can also get a range of facial analysis from the service. The Lex chatbot framework was launched recently, in fact at the AWS Summit two weeks ago. The idea is that you can harness the power of Alexa, the knowledge graph behind Alexa, add your own domain data, and combine the two to build a good experience for your customers. It supports both text and voice interactions, and there are a number of enterprise connectors as well. To give a sample example: for flight booking you want to put several pieces together. A voice module first recognizes that the customer is talking about booking a flight; then a knowledge graph component understands what flight booking means and that "London" means London Heathrow, tying the request to known entities; then the system decides what information is missing, asks the customer for it, and completes the loop. This is how chatbots work, and the question is how the developer should specify this behavior. You can use the knowledge graph from Alexa and the Amazon domain, combine it with your own training data, and build different chatbots; the details are on the website, so I will not go through all of them.

The last service we recently launched is Polly, which is real-time text-to-speech. Generating human-like voices is a big challenge, especially in real time and across multiple languages. I will not go into a lot of detail, but here is a sample of how it sounds: "Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, and what is the use of a book, thought Alice, without pictures or conversations?" All of that was generated automatically. There are many issues in generating human-like voices. One is understanding context; here, for instance, "F" means Fahrenheit: "Today in Seattle, Washington it's 11 degrees Fahrenheit," quite cold. Another is changing the pronunciation of a word depending on the context, as in a sample about live music at Madison Square Garden, where "live" and "life" must be pronounced differently and the system has to recognize that. "How much wood would a woodchuck chuck if a woodchuck could chuck wood?" It can probably say that better than I can. And, depending on the context, you also want to change how things are read: the fact that something is a telephone number changes how you pronounce the digits, so "Richard's number" can be read as one long number ("two billion...") or digit by digit as "2 1 2, 2 3 4, 1 2 3 7."
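As a rough sketch of how one might call Polly programmatically, here is an example using the AWS SDK for Python (boto3); it assumes configured AWS credentials, and the voice name and the SSML say-as markup are illustrative assumptions rather than anything shown in the talk.

```python
import boto3

polly = boto3.client('polly')  # assumes AWS credentials are configured

# Plain text: Polly decides on its own how to read the digits.
plain = polly.synthesize_speech(
    Text="Richard's number is 2122341237.",
    OutputFormat='mp3',
    VoiceId='Joanna')

# SSML lets the caller override the reading, e.g. force a digit-by-digit
# telephone-number pronunciation.
ssml = polly.synthesize_speech(
    TextType='ssml',
    Text='<speak>Richard\'s number is '
         '<say-as interpret-as="telephone">2122341237</say-as>.</speak>',
    OutputFormat='mp3',
    VoiceId='Joanna')

# The returned audio stream can be written straight to a file.
with open('number.mp3', 'wb') as f:
    f.write(ssml['AudioStream'].read())
```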
There are definitely a lot of research challenges still outstanding in this area, on how to produce very realistic pronunciation, but Polly gets you at least part of the way. And at the end, if there are still pronunciations you are not happy with, you can go ahead and customize them: "My daughter's name is Kasia" versus "My daughter's name is Kaia," so you can change the pronunciation yourself. That is mostly what I have. There is a whole range of AI services and tools we are developing at Amazon. One is the MXNet package, enabling data scientists to run deep learning at scale, and making it easy to launch deep learning jobs with the Deep Learning AMI and the CloudFormation templates. Another is a set of AI-enabled services that allow enterprises and other customers to quickly get the power of AI without having to train large models themselves. At the same time we have a lot of academic engagement. My goal is also to increase the amount of research coming out of Amazon, and we are actively hiring and growing the team. We also have credits available for research and for education: if you have a project where the compute infrastructure is the barrier, we do not want that to be the case, so you can apply to the grants program, and if you are teaching a course and need credits for student projects, you can apply for that as well. Thank you. [Applause] (An audience member asked about synthesizing speech in a particular person's voice or style.) At the moment it is text-to-speech; I think you could combine Alexa, which does speech recognition, with Polly, but currently we do not have that kind of style transfer. That is definitely something that would be good to have. [Applause]
Info
Channel: Simons Institute
Views: 3,702
Rating: 4.8545456 out of 5
Keywords: Simons Institute, Theory of Computing, Theory of Computation, Theoretical Computer Science, Computer Science, UC Berkeley, Anima Anandkumar, Computational Challenges in Machine Learning
Id: ScRtj2bNMJE
Length: 45min 18sec (2718 seconds)
Published: Tue May 02 2017