BayLearn15-Keynote3

Reddit Comments

Very timely. Making me really wish I had gotten tickets to Baylearn in time. Thanks for posting.

👍︎︎ 6 👤︎︎ u/say_wot_again 📅︎︎ Nov 10 2015 🗫︎ replies

Surprise question by Yann LeCun at ~37 minutes.

👍︎︎ 8 👤︎︎ u/econometrician 📅︎︎ Nov 10 2015 🗫︎ replies

What kind of projects is TF appropriate for?

Let's say I'm building an online retailer. I want the suggestions my site gives to customers to be intelligent - is this too simple a problem for TF?

👍︎︎ 2 👤︎︎ u/IgorAce 📅︎︎ Nov 10 2015 🗫︎ replies

In for later. This looks awesome!

👍︎︎ 1 👤︎︎ u/PetrolEng 📅︎︎ Nov 10 2015 🗫︎ replies

Ooh, bunch of videos I need to watch now.

👍︎︎ 1 👤︎︎ u/thecity2 📅︎︎ Nov 10 2015 🗫︎ replies

Thanks for posting. Really curious about TF now.

Jeff reminded me to look at some math concepts again.

👍︎︎ 1 👤︎︎ u/BulletSea 📅︎︎ Nov 12 2015 🗫︎ replies

If I could be like Jeff.

👍︎︎ 1 👤︎︎ u/pohatu 📅︎︎ Nov 10 2015 🗫︎ replies
Captions
We've now arrived at our last invited speaker of the day, and it's a pleasure for me to introduce Jeff Dean, who works with me at Google. Jeff has been at Google for more than 15 years now. He has basically invented all the cool infrastructure we have at Google: that includes MapReduce, that includes BigTable, protocol buffers, and the list is very long. But recently he decided to revolutionize deep learning as well, at least at scale, and so, together with others, he created a big group where I work with him, and I think he's going to talk about that now.

Yep, so thank you. So yes, as Samy said, this is joint work with our team, the Google Brain team, and a bunch of other teams at Google. One of the things I'd like to stress is how pervasive these kinds of techniques have become over the last few years. We started a few years ago just kind of exploring how these kinds of models could be used in a few different domains, and over time what we found is that more and more teams have been amenable to trying these kinds of techniques, getting very good results, and then using them in their actual products. We've also been doing a lot of interesting research that is not necessarily designed to influence products today, but down the road: if we can solve these problems, we know that would be generally useful. And so you can see a fairly steep ramp in the number of directories that contain model description files. It's a proxy for how many different people or teams are using these kinds of systems, and it's pretty pervasive across a bunch of different domain areas at Google.

So the outline of the talk that I'm gonna give today is to describe a little bit about the different kinds of infrastructure we've built to help us in doing deep learning research: basically, systems that allow us to train on large datasets at scale and to turn around experiments quickly. I'll describe two different generations. The first is DistBelief, which we wrote about in NIPS 2012, and I'll
describe a little bit about a second-generation system we've been putting together that we think is cleaner and nicer, based on what we've learned in building and using the first-generation system. Then I'm gonna give an overview of some of the ways in which we do research at Google, where we find a thread and then figure out different ways in which that thread of research can influence different kinds of things. And then finally I'm going to conclude with a new approach for training, but it's for people, not models.

So as I said, the project started in 2011, actually. Andrew Ng was spending a day a week at Google, and I bumped into him in a micro-kitchen and I said, "Oh, what are you working on?" and he said, "Neural nets." I'd actually done a thesis as an undergrad on parallel training of neural nets, back in the first grand era of neural nets, and they were, I thought, a really interesting abstraction for these kinds of problems, but they were not computationally ready at that time. They would do interesting things on small datasets, but I kind of moved on and did all kinds of other stuff. But I have returned, because I believe in this model.

We had a big emphasis when we started the project on using large datasets and large amounts of computation, to see what we could really do if we pushed the boundaries in these two directions. There's no shortage of interesting datasets across lots of different kinds of domains, and I think one of the most interesting things about these problems is that often you can fuse data from different kinds of domains. The talk we just heard, about fusing different kinds of sensor data in cars, is a great example, where these models learn how to integrate all those kinds of information in a very cohesive, end-to-end trained way. So how can we build systems that really can take advantage of this raw data? There are lots of different things you'd want to do. If you had images, you want
to be able to localize objects and tell what they are, and this is obviously hard, because the bottom three things are all tractors but they look completely different, yet we as humans can easily tell that those are tractors. We have a bunch of more specialized visual tasks. You want to be able to find house addresses, and they all look completely different in the world: some are slanted and colored, and you want to be able to read them. That can help us improve our maps. More generally, we'd like to be able to take cluttered scenes like this and first find all the text in the scene, then read it and understand it, and use that to help people understand the physical world around them.

Text understanding is another big thing that we care about deeply. If you just read this paragraph, you know that's a terrible review of this movie, but it has lots of positive-sounding words: "okay", "looks good", "I was in awe". It's actually a little bit hard to tell, even as a human, that this is a pretty negative sentiment. So understanding the true meaning of all these words in paragraphs and sentences in the world is a pretty ambitious but important task.

There's a really nice property that neural nets have: the results tend to get better if you have more data and you can use a bigger model to capture the more subtle patterns that occur in that data, and obviously to do that you need more computation. Of course, better algorithms, new insights, and better techniques always help too, and you need all these things in order to really make progress.

So one of the things we've been focusing on a lot in our work is good turnaround time for experiments. There's a very different feel: when you use a slow compiler and you're writing software, it's very frustrating compared to when you're using a fast compiler, and this is the same kind of thing. You want turnaround time for the experiments you're doing to be
measured in minutes or hours. That's a very different feel than even something that's a few days of turnaround time, and that's even another level of productivity over something that takes many weeks. And if it's many months to do an experiment, you're basically not gonna make any progress, because you have to run so many experiments in parallel, and then you're like, "Oh, what was that experiment I started four months ago? It's not ready yet, and I can't even remember." So we focus a lot on techniques that get us really good turnaround time on these experiments. I think a good catchphrase is "train in a day what would take a single GPU card six weeks". That's kind of a good motto, because a day of turnaround time, or even half a day, feels qualitatively different.

So one of the things we started looking at is how we can parallelize these kinds of training processes, and clearly there's lots of parallelism to exploit in neural nets. I'll talk about two kinds, model parallelism and data parallelism, that we use all the time, both independently and in conjunction with each other.

Model parallelism is basically this idea that you have some sort of deep network. Maybe it's a convolutional one, so you have local receptive fields, and that's kind of nice, especially if you're going to partition this model. The idea is we're gonna partition this model across a bunch of machines, or maybe a bunch of devices, or maybe a bunch of devices on a bunch of different machines, and then allow those partitions to do the computation in parallel. That's very helpful, because now you can dramatically speed up the time it takes to run a single batch of examples through one of these models.

The second kind of parallelism is a little more subtle and finicky to use, but is somewhat easier to scale. The idea is you're gonna have a bunch of replicas of your model, and you're gonna have them reading independent data examples, so they have effectively partitioned the data.
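The model-parallel partitioning described above can be sketched in a few lines. This is only an illustration under assumptions that are not from the talk: it uses a single fully connected layer split by output units (rather than a convolutional model), it simulates "devices" with a thread pool, and the names `dense`, `partition_columns`, and `model_parallel_dense` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def dense(x, weights):
    """Compute x @ weights for one example x (list of floats) and weights (list of rows)."""
    n_out = len(weights[0])
    return [sum(x[i] * weights[i][j] for i in range(len(x))) for j in range(n_out)]


def partition_columns(weights, num_devices):
    """Split the weight matrix into contiguous column blocks, one per (simulated) device."""
    n_out = len(weights[0])
    base, extra = divmod(n_out, num_devices)
    blocks, start = [], 0
    for d in range(num_devices):
        width = base + (1 if d < extra else 0)
        blocks.append([row[start:start + width] for row in weights])
        start += width
    return blocks


def model_parallel_dense(x, weights, num_devices):
    """Each simulated device computes its own slice of the outputs; slices are concatenated."""
    blocks = partition_columns(weights, num_devices)
    with ThreadPoolExecutor(max_workers=num_devices) as pool:
        # pool.map preserves the order of the blocks, so concatenation is safe.
        partial_outputs = list(pool.map(lambda block: dense(x, block), blocks))
    return [value for part in partial_outputs for value in part]


x = [1.0, 2.0, 3.0]
weights = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 2.0],
]
single_device = dense(x, weights)
three_devices = model_parallel_dense(x, weights, num_devices=3)
print(single_device == three_devices)
```

The partitioned result matches the single-device result because each output unit depends only on its own column of weights; in a real system the cost to watch is the communication needed to ship inputs and outputs between partitions.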
And you're gonna have a set of parameter servers that keep track of the current state of your model. One of these replicas will download the current set of parameters for the model and process a mini-batch of examples, so it might process 100 images or a thousand images, and compute some gradient for the adjustments it would like to make to the model. It won't apply the adjustments locally. Instead, it will send them to the centralized service, which might itself be spread over tens or hundreds of machines, and that service will apply this update: it'll do new parameters equals the old parameters plus the gradient. Then, before the next batch of examples, this model replica will essentially do the same thing, and we'll get another parameter update, and that'll get applied. What's really happening is all of these replicas are applying these updates independently.

So you can either do this asynchronously, where there's no synchronization between the model replicas. That can cause a bit of an issue, in that the gradients each model replica computes are relative to the original set of parameters that it got, and those parameters may have moved in the meantime because other asynchronous replicas have said, "Oh, please move the parameters over here." Or you can do this more synchronously. If you do it synchronously, you have a bunch of replicas all getting parameters in lockstep fashion, so they all get the same set of parameters, and then they each apply a mini-batch, so now you have N replicas and N times the batch size of one replica. The nice thing is you don't get any noise from this asynchrony, because it's all completely synchronous, but it's a bit less fault-tolerant: basically, if any of those replicas die, you have to do some recovery, and the whole system needs to deal with that. Or you can do it asynchronously, as I showed in the previous slide, where there's no synchronization and you get
noise in the gradients that you might think would be very disruptive, but it turns out it generally kind of works, and you can get anywhere from ten to a thousand independent replicas working asynchronously on updating the parameters in the model, depending on the structure of the model. Sparser models, where only some of the parameters are touched by any given set of examples, tend to be more in the thousand range, and dense ones tend to be more at 10 to 100 replicas. You can also do a hybrid of this, of course, where you have M asynchronous groups of N synchronous replicas. There's a whole sliding scale between these.

One important consideration here is that if you're trying to share the set of parameters between a bunch of model replicas, you want the computation time for the model to be large relative to the amount of time it takes you to transfer these parameters over your network. So models with fewer parameters that reuse every parameter multiple times in the computation tend to do better. One level of reuse you get is through mini-batching: if you download the set of parameters and then process a batch of size B, you're effectively reusing every parameter at least B times in that example. But certain kinds of model structures that are very successful these days, including convolutional models and recurrent models, tend to reuse each parameter many times within the same example as well, in addition to that batch-size factor of B. For example, convolutional models might reuse the same convolutional filter in hundreds or thousands of different spatial positions in an image, and that gives you another factor of a thousand reuse of that parameter. Recurrent models you tend to unroll through time, maybe 50 or 100 steps or something, and that gives you a factor of a hundred reuse of the set of dense LSTM parameters in an LSTM. So we've observed this trend internally of focusing on
model types that tend to have fewer parameters and reuse those parameters more. I don't know if that's because we're doing lots of distributed training and so those tend to work better in that kind of setup, but that is an observation that we've been making. For example, the latest image model that we've been using has about 6 million parameters, even though it's much more accurate than one we were using a few years ago that had about 60 million parameters, because we've essentially removed the fully connected layers, which had a ton of parameters that were each used only once per example, and replaced them with convolutional parameters.

Okay, so now let me take you through a thread of research, how it has evolved over the past couple of years in different directions, and how it can be reused in other scenarios. Sequence-to-sequence models were originally developed by Oriol Vinyals, Ilya Sutskever, and Quoc Le in our group, looking at basically a general framing of a problem where you want to take a sequence and map that to another sequence. Now, that sounds kind of abstract, but it turns out this kind of problem crops up all the time. Basically, you can take a sequence and then map that to a high-dimensional vector, so the representation of the sequence that's been consumed so far you can view as some point in high-dimensional space. Typically we use deep LSTMs to process the elements of this time series in order to construct the high-dimensional representation, and this high-dimensional representation might be, you know, four thousand activations of a bunch of different LSTM layers.

So if you connect these two things, you can actually get a machine translation system. That was the first use that Ilya, Oriol, and Quoc put this sequence-to-sequence model to. Essentially you read in English, you get a representation of the English sentence, and then you train the model to start with that representation and spit out the
corresponding French. So that's kind of cool, and it works well. It was actually better than the state of the art at the time this was published, and there's been a bit of follow-on work about how to deal with rare words in these kinds of models that you can find on arXiv.

You can actually connect this slightly differently, or use different data, essentially, and get a chatbot. Oriol had the idea, I think, of taking our internal Tech Stop chat logs. We have this system where you can say, "My laptop's not working, can you help me?" So here's an example; we call it Brain Stop instead of Tech Stop. "Hi, I have a problem with my machine." "Hi, this is Shiva, how are you doing today?" The blue is what the model generated. "Hi, how are you?" "I'm fine, thanks. How may I assist you today?" "I want to access using VPN." And then the model knows enough to ask a follow-up question: "Currently, is it connected to the corp network?" "No." "Check this solution," and then it spits out a URL, and they say, "Thanks, bye," and Brain Stop says thank you. So that shows you that these things really are understanding fairly long-term context in the sequence of data that they're processing.

Oriol then focused on using this for parsing. The idea is you want to take an arbitrary sentence and then generate a parse tree, and you can encode the parse tree as another kind of sequence, and that works pretty well. That's going to be in NIPS this year. The neat thing is it's a completely learned parser with no parsing-specific code. It's essentially trained on the output of a state-of-the-art parser, and then we can actually do better than the state-of-the-art parser.

Another cool thing you can do is feed in a bunch of points and then train the system to generate a bunch of graphical properties of those points. For example, you can feed in a bunch of points and train the model to output the convex hull of that set of points, that is, the subset of the points that form
the convex hull, or you can do a Delaunay triangulation or a travelling salesman tour. These actually work reasonably well. So that's kind of cool: from that one basic sequence-to-sequence model, you can massage your input data or massage the kind of problem you're trying to solve. This one actually had a few tweaks to the model to allow it to refer back to other points in a slightly different way, but basically these kinds of directions are this meandering tour you take in a research space of how we can really deal with sequences and understand them well.

Obviously there have been really big improvements in object recognition over time using CNNs. This is top-1 accuracy for ImageNet. As I said, the latest model, which was developed at Google by a bunch of vision researchers, essentially replaced all the fully connected layers that had been used in the state-of-the-art models with just more convolutions. One of the things is that at every layer you have convolutions of different sizes that can pick up on smaller or bigger patterns at that level of representation. These models are actually really good at giving fine-grained classifications, so they're actually better than humans in some cases. I would say "flower", but it does better than that. They're good at generalizing: these things have no real pixels in common in some sense, they're both very different-looking things, but you'd want to call them both "meal", probably. They make kind of sensible errors, right? It's sort of understandable that you might say "snake" for that. I know that's not a dog; I actually had to look closely to tell if it was a goat or a donkey, and it might be one of each, I'm not really sure. What do we think? Donkey? Yeah, it's a little hard to tell. That's the whole point.

And so we've been working with a number of product teams to actually take these kinds of systems
that can do a pretty good job of recognizing what's in actual images, just from the raw pixels, and deploy them in different ways. One of the ways that we can deploy this is to allow users to search their photos without tagging them. So this is a public post a user made, and he said, "Wow, this is really cool. I didn't tag these and I was able to find my statue photos." Another user said, "Wow, I could find my drawings, that's cool." And I was pretty happy that it found the macramé Yoda, because that's not like most other Yodas you see.

For that cluttered street scene, one of the first tasks is to be able to find the text in the street scene. This is work that my summer intern from a couple of years ago, Matt Zeiler, did in conjunction with some of the people on the Street View team, to basically find text in raw images. You can see that it's doing a good job of finding text in different character sets, and there's actually some text in that window that's pretty hard to pick up on, because it's in an unlit neon sign. You see things in all kinds of different fonts, and they're different sizes and scales, and these things actually just work quite well.

You can also take the sequence-to-sequence model and rip off the first sequence part, and instead initialize the state of the model with the output of a convolutional network. So you can essentially take pixels in, form some internal state, and then use that to try to generate a human-level caption. Given training data of the form "picture, and sentence about the picture", the model can actually absorb the pixels and generate a representation that is really good at generating what a plausible caption might be. So for example, this is a test image that it has never seen before, and it generates that sentence. It's actually a generative model, so on the same image I've seen it also generate "a close-up of a child and a teddy bear", or
something. So it generates multiple plausible sentences about the same image, and that works pretty well, with considerable improvements over the previous state of the art. That was a CVPR paper. Here are a couple more examples. When it fails, it's kind of amusing. Yeah, if you squint just right, I'm sure you can see the man flying through the air riding the snowboard. And you see in the upper left that it kind of knows that tennis is going on, but it doesn't know that the person is serving, right? A human would probably give a more specific description than "a man holding a tennis racket". That is true, but it's not particularly interesting. And there are three kinds of pizza on the stove, but it's all right.

Okay, so let me switch gears again and tell you about a system called TensorFlow. TensorFlow is the new system we've built for doing all of our research and our production training of deep neural nets. Essentially, we've taken the lessons from the first system we built. The motivation was that the first system was really good for scalability and for production training of all kinds of models. If you wanted to train a network that had basic feed-forward and then feed-backward-for-gradients kinds of paths, and was a fairly standard, plain vanilla thing, maybe with convolutions or an LSTM, it was great for that. If you wanted to do something more exotic, like some kind of weird reinforcement learning where you do a bit of extra magic to generate your gradient and it's not a simple up/down thing, it was not as flexible as we wanted. So what we really wanted was a system that has the scalability and production-readiness properties of DistBelief, our first system, but that was much easier to express models in and much more flexible for research purposes. This better understanding of what we actually should be building allowed us to make a bunch of dramatic simplifications in the system as well, and so I'm gonna talk about that. So the first part of TensorFlow
is the execution core, which I'll talk about in a minute, and it has very low overhead. It's essentially a system that takes computational graphs and executes them. There are different ways that you can specify those computational graphs, and we're working on additional ways of enabling reusable modules for making this even easier. Right now we have Python and C++ front ends that allow you to specify the graph in those two different languages. We expect more languages to be added down the road; we've had some interest internally at Google in Go front ends, for example.

So this is a lot of code for a slide. I just want to give you a flavor of the kinds of things you would specify in a TensorFlow model. Typically the first thing you do is say, "I want to import the TensorFlow library." This is the Python front end. You say, "I want a new graph." Then you say, "I have some examples," and I've already loaded them in some data set here, and "I have some labels" that I've also already loaded. Then I create a weight matrix, a variable W, which is rows times columns by number of labels, and then I create a bias, which is just the number of labels. This model is going to do batch logistic regression, so not very deep, but it had to fit on a slide. A real deep neural net would be like ten more lines of code or something. Then I'm gonna compute some logits, I'm gonna have a loss, and then I'm going to use a particular optimizer to optimize that loss. That's essentially what you say to specify the model, and that gives you a computational graph.

Then you can say, "Okay, now with this graph, create a session for my computational engine," and I want to initialize all my variables first. That's going to run some initialization statements for variables. Like, you see that truncated normal; that's gonna initialize that W matrix with a bunch of normally distributed numbers. And then I say, for step in number of steps, run the graph, and run it up to the
optimizer and loss part of the graph. That's essentially all I do. I repeatedly do that, and every so often I'm going to print the step and the loss. So it's a pretty much wholly self-contained example. I didn't show you how to load the training data set and labels, but other than that it's pretty much complete.

The computation you get is a dataflow graph. You have a graph of nodes, which we call operations or ops (we haven't really settled on a single name), and everything that flows along the graph edges is a tensor, so arbitrary n-dimensional arrays. We've added notions of state to the model, so variables are stateful. Biases is a variable, for example, and then there's a bunch of graph computations that are stateless, and then there's a minus-equals operation that's going to update the biases. So some nodes are stateful; most are stateless.

It's also distributed, so we can take that graph, which is an arbitrary specification, and map it onto multiple different devices. If I have just a CPU with a bunch of cores in it, that's fine, that's very easy to map. If I have a GPU card in my machine plus the CPU, the system might decide to put some of the computation on the GPU, because it thinks that'll be faster, and then put other computation on the CPU (maybe we don't have a GPU implementation for some of it), and it'll take care of moving the data between these different devices. If we have a big distributed system, the same thing will happen: it'll map the computation onto the set of devices on all the machines we have, and insert the appropriate kind of communication between elements in the graph to move data around. The user can also specify hints, like "I would really like this op, I need this operation, on a GPU," or "I really want this on machine 1 and this on machine 2."

Okay, we also can run this same graph on, say, a mobile phone, which is a nice property. You can essentially move from a big production training environment in
a data center to running that model for inference purposes on a phone. In fact, that's what that slide says, right? I don't need to say more.

One of the things we really like about this model is that it's quite flexible. All the deep learning kinds of things that we have built are a library on top of that core set of graph execution primitives. We know it's also useful for some other kinds of machine learning algorithms, and we think it's a fairly general-purpose framework that could be used for a lot of different kinds of numerically intensive computations. Another thing that's nice is that it abstracts away the underlying devices and computational hardware you have. You can give hints where needed, but you don't have to. It's also extensible, so we have a whole bunch of primitives implemented for things like matrix multiply and ReLU and things like that, but you can extend it with additional operations and kernels for your particular thing that maybe isn't as easy to support in the current set of primitive operators.

We use TensorFlow for neural nets all the time. Basically, we're in this transition phase where every project that is using DistBelief is migrating to using this system, and all of our new research is being done in it. A typical neural net layer in some sense maps to one or more tensor operations, and then we have a whole bunch of libraries of these kinds of operations that are specialized for neural nets. So we have convolutions and pooling and softmax, and different kinds of losses and different optimizers. One of the things we have, similar to Theano, is automatic differentiation, which has been a big help. It makes things a lot easier: you can just say, "This is the loss I want to optimize." So we've found it pretty easy to experiment flexibly with a lot of different kinds of models. In, like, the sequence to
sequence work that I was telling you about, it's pretty easy to stitch an image model together with an LSTM for doing sequence prediction, to try all kinds of weird combinations like that, and to train them quickly.

Another simplification we were able to make is that, unlike DistBelief, which had a separate parameter server notion, we just have stateful nodes in the graph, and the job is to map the computation onto the set of devices that you want to use, possibly with some hints from the user. So this is a graph where you've replicated the computation that you care about three times. You have three replicas, and this is asynchronous, so each one of those graphs can be driven independently without any synchronization between them. If you want to do synchronous training, you construct almost the same graph, except you drive the graph synchronously: you run the computation through three copies of the model, each on different examples, so now your batch size is three times as big, and you add up the gradients and then you update the parameters. So it's very flexible in this way. You could easily do the M-of-N hybrid thing, where you have, say, three groups of these that are synchronous internally and asynchronous with respect to each other.

Okay, on to training. One of the things that we've found is that this area is very exciting. There's a lot of really interesting work being done in both the perceptual domains and the language understanding domains. Trevor's talk earlier on robotics, connecting these things to real-world robotic systems, I think is really interesting. There's tons of interesting stuff going on, and we're always looking for people who want to learn about this area and to do great research in it. Because this resurgence of deep learning has only happened in the last few years, there hasn't been a lot of formal curriculum, except in a limited number of academic institutions. So one of the things we've
decided to try, as a bit of an experiment, is what we're calling the Google Brain Residency Program. Basically, it's a one-year immersion program in doing deep learning research. The idea is you're gonna work with research scientists in our team to perform independent research. You get a real job, a one-year fixed-term employment with salary and benefits, and the goal is for people in the program to have conducted several different research projects after that one year. One of the things that's great about summer internship programs is that we bring in all these people with new ideas, and what we often find is that three months is just kind of short: by the time they get up and running and really get rolling on their project, the summer is over. I mean, internships are still great, but I think a one-year thing really will allow people to get a lot more experience and to get a lot more productive research done. Another nice thing is we have interesting problems, of course, the TensorFlow system I just described is basically the thing we're using for all of our research, and we have a bunch of computers.

The kinds of people we're imagining would apply are people with a bachelor's or master's in computer science, probably, but math or statistics is fine; people who understand math, have programming experience, and are really interested in learning how to do research in deep learning. There's more info. The timeline: applications are open. This is the first public announcement of this, so as of today we are accepting applications, until January 15th. It roughly coincides with grad school application timelines. We're gonna get a bunch of applications, we hope, and review them in January, and maybe do interviews of some form, phone calls, maybe video conferencing, or maybe on-site interviews for people who are local, and then make a decision on people in the program and
accept the first cohort of people and then the program would start in early June so that's kind of it there's more information there g.co/brainresidency where you can send us email I'll leave that up for a moment and that's all I have so I will take questions laughs plenty of time for questions yes Mike can you do loops in TensorFlow I'm sorry can you do loops yes so we do have a handful of control flow primitives that we've added they're less fully utilized like we haven't been using them very much we don't have a lot of internal experience but we have the control flow primitives in the system and we were sort of learning how they're useful and when it makes sense for example to use loops rather than unrolling an LSTM for training is TensorFlow a completely in-house system or are you going to open it up to the public TensorFlow is in-house we've had our interns who have come and then have left are somewhat sad to have left so we're pondering what to do but we don't have anything to announce are there other questions for Jeff yes there's one fish so one of the things that interests me is not just how we identify stuff or solve problems but also how we fail and how we know we failed and what we do then so for example the street recognition you know I look at it say okay it's a street oh wait there's a lot of New York in there so it must be New York right and then I will try to narrow it down is it Manhattan is it the Bronx so I'm curious whether you see yourself going in that direction or is this something that you're working on is that something you've thought about the problem of understanding how the model how these kinds of models fail in various ways is that what you knowing that you failed knowing that you failed and then if you have failed understanding why right yeah I think there's a bit of interesting work about having models also emit confidence levels so or how confident are you in this response I think that's certainly an
interesting direction in terms of when you do fail understanding why it's sort of for human understanding so that you can sort of improve the model or understand the general kinds of mistakes it's making I think that's a pretty active area of research we have a number of people on our team who are working on visualizations and other kinds of sort of debugging systems for understanding the internal structure of models particularly for this example you know what kinds of things or what patterns are being picked up on in this example that are causing it to be wrong or right I think that's a pretty pretty interesting area that will be fruitful for research you know for many years down the road because these models are not going away they're becoming much more important and prevalent and they're going to be trained to do more end-to-end kinds of actions you know drive the car's accelerator directly from visual inputs and that kind of thing as the previous talk was just discussing and I think it's really important to understand what kinds of properties these models have because the end can you comment on what kind of optimization algorithm you use for distributed training over multiple nodes so let's say you have a node with multiple GPUs that's pretty easy to parallelize right but when you go over multiple nodes what kind of methods do you use sure see so I mean I think there's no one answer unfortunately it kind of depends on the model structure you know for models with kind of embeddings of kind of things like textual entities or query words or things like that you know we'll tend to use Adagrad because those tend to have the property that you see some examples or some particular words a lot and some not so often and so the Adagrad property where sort of the natural center of mass of an embedding sort of stays a bit more fixed for things after you've updated a bunch of times is a kind of a nice stabilizing property for
the models we also use a lot of just standard SGD with momentum kind of things so how do we distribute so in TensorFlow it's a lot more flexible because we have the ability to for example insert additional aggregation nodes say within a machine boundary before we send data over the network and so you can essentially do like the synchronous replicas maybe within a machine for a bunch of GPU cards and then have a bunch of asynchronous copies of that is one possible variant but it really depends on the particular model and you know how much hardware you want to throw at it in order to get the training time down if you don't care about training time it's more efficient to just use a single GPU card but typically that's untenable for big problems you don't get that much of a speed-up if you go across nodes you know it's not like you can use something like a hundred nodes and get a significant speed up unless you in algorithms no but let's say you're training a fairly large ConvNet which I mean your computer network it depends on your computer network of course it does but but but it also depends on your algorithm a lot so you know let's say you use InfiniBand or something like that and you have I don't know you know a rack full of nodes with you know four GPUs each which is you know what we can build nowadays like how do you distribute the optimization over multiple nodes that's a big question it's an unsolved problem you know you can tell us it's unsolved yes well I mean obviously it's clearly not completely solved but there's you know we would basically distribute the set of parameters in that model across a bunch of the nodes and have each model replica get sort of gather the current set of parameters process some examples aggregate the gradients locally if possible and send them back you know and then send back the little bits for the different parts of the parameters right which doesn't scale very well with more than if you know
it's basically it depends on the model structure but yeah just in the back is TensorFlow fault tolerant at all does it checkpoint itself does it have any other fault tolerance yes it does checkpointing and you can control how often that happens saving the model state and restoring it is actually just another little piece of the graph on the side that you can choose to execute every so often which is kind of a nice property it's not like a completely separate code path in the system it's just some graph execution where you happen to be executing save ops and restore ops and we've thought a little bit about making nodes fault tolerant by periodically replicating them to memory of other processors so you take a parameter matrix and have multiple copies of it that you keep loosely in sync we haven't actually explored that too much but that's probably a cheaper way than checkpointing hi great talk so you have shown that you feed images it generates descriptions and then you have the chatbot so you know how far away to really a sophisticated sort of experience for example you know with the scene description when you compare it with a professional you know writer actually will look at the video and then write the screenplay and also the kind of the chatbots when you're kind of carrying out an intelligent conversation right so how far away from there you know where the technology and missing pieces is that gonna you know incrementally improving deep learning or are there gonna be some other fundamental shift in order to get to the next level so I believe for the captioning work you could probably have just a much bigger training set and possibly a slightly bigger model and you would actually improve that a lot because the public training data set from MS COCO was actually not that big Sammy was involved in that work so he said and I think for more general textual understanding I think there's still a lot of work to do there I think
we're gonna need both some algorithmic and model representation breakthroughs in order to really make good progress on that as well as kind of the ability to scale things and you know ingest large amounts of text and use that in your model to sort of boost the understanding and general generalization ability of the model there's another question here go ahead first yeah sorry okay so I need to look over here so did you compare the performance of TensorFlow in terms of training speed also compared to the existing scripting frameworks for deep learning like Torch Theano and so on so we haven't done a careful comparison because most of our training is in a distributed setting and so but at sort of a single node it's sort of in the ballpark for most models comparable to a lot of them because most of those just use the same GPU primitives we're using in a single machine compared to our old system TensorFlow is actually a fair bit faster in a distributed setting because we you know we just really the general graph model makes it so you actually have fewer things and more time is spent in the inner loops of that system and you can spend a lot more time sort of focusing on making that system perform really well whereas DistBelief tended to be a little bit more organic and the parameter server communication was like a different path than the model level communication and so it's actually quite a bit faster for training on our big distributed set up than our earlier system maybe one last question here two quick questions about the residency program the first being is this research going to be entirely internal or are you going to have actual publications coming out of it no no the hope is that we would publish papers in places like NIPS and you know posting papers on arXiv and then submitting them wonderful and then the second thing is notably absent is PhD students or people reasonably yeah that's fine too we yes we will not hold a PhD against
anyone with that I think let's thank Jeff again [Applause]
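The Adagrad remark in the optimization answer above (frequently seen words end up with more "fixed" embeddings, rare words still move a lot) can be illustrated with a minimal sketch. This is plain Python rather than TensorFlow, and it is not the speaker's code: the function name adagrad_step, the two toy words, the constant gradients, and the values of lr and eps are all assumptions chosen only for illustration.

```python
import math


def adagrad_step(params, accum, grads, lr=0.1, eps=1e-8):
    """Apply one Adagrad update in place (hypothetical helper).

    params, accum: dict mapping a word to a list of floats.
    grads: dict with gradients only for the words seen in this batch.
    """
    for word, grad in grads.items():
        for i, g in enumerate(grad):
            # Accumulate the squared gradient for this coordinate.
            accum[word][i] += g * g
            # The effective step shrinks as the accumulator grows.
            params[word][i] -= lr * g / (math.sqrt(accum[word][i]) + eps)


# Toy embeddings: one very frequent word and one rare word (assumed data).
params = {"the": [0.0, 0.0], "zygote": [0.0, 0.0]}
accum = {word: [0.0, 0.0] for word in params}

for step in range(100):
    grads = {"the": [1.0, -1.0]}          # "the" is seen in every batch
    if step == 99:
        grads["zygote"] = [1.0, -1.0]     # "zygote" is seen only once
    before = {word: list(vec) for word, vec in params.items()}
    adagrad_step(params, accum, grads)

# Size of the last update applied to the first coordinate of each embedding.
step_sizes = {word: abs(params[word][0] - before[word][0]) for word in params}
print(step_sizes)
```

With identical gradients, the word that has already been updated many times takes a much smaller final step than the word updated for the first time, which is the stabilizing behaviour described in the answer.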
Info
Channel: Bay Learn
Views: 40,175
Rating: 5 out of 5
Keywords: Machine Learning (Software Genre), Artificial Intelligence (Industry), Jeff Dean, Google (Award Winner)
Id: 90-S1M7Ny_o
Length: 44min 49sec (2689 seconds)
Published: Mon Nov 02 2015