"Large-Scale Deep Learning with TensorFlow," Jeff Dean

Video Statistics and Information

Captions
Hello, and welcome to today's ACM Learning Webinar. This webcast is part of ACM's commitment to lifelong learning and to serving the over 100,000 computing professionals and students who are ACM members. I'm Stephen Ibaraki, chair of the ACM Professional Development Committee and a member of the ACM Practitioners Board, and it's my pleasure to welcome you today. You will find more on my background in the bio widget on your screen.

For those who may be unfamiliar with the ACM or what it has to offer: ACM offers educational and professional development resources that bolster skills and enhance career opportunities. Our members can stay competitive in the constantly changing world of computing with a range of ACM Learning Center resources at learning.acm.org; you can see more of the highlights on your screen. ACM recognizes the role of computing in driving the innovations that sustain competitiveness in a global environment. ACM provides timely computing information published by the ACM, including Communications of the ACM and Queue magazine; access to the ACM Digital Library, the world's most comprehensive database of computing literature; international conferences that draw leading experts on a broad spectrum of computing topics; and support for education and research, including curriculum development, teacher training, the ACM Turing Award, and the ACM-Infosys Foundation Award. ACM enables its members to solve critical problems using new technology that enriches our lives and advances our society in the digital age.

Before we get started, I'd like to quickly mention a few housekeeping items shown in the slide in front of you. First, the slides will advance automatically throughout the event. On the bottom panel you'll find a number of additional widgets and resources. If you're experiencing problems with the slides or audio, press the F5 key on Windows or Command-R on a Mac to refresh your browser; on a mobile device, you can close and relaunch the presentation. To control the volume, adjust the master volume on your computer. If you have questions during this webinar, please type them into the Q&A box at any time and click the submit button; we'll organize the questions as Jeff Dean speaks, and he'll reserve time at the end of the presentation to address them. This session is being recorded and will be archived; you will receive an automatic email notification when it becomes available, and you can check learning.acm.org for updates on this and other upcoming webcasts. At the end of the presentation you'll see a survey open in your browser; please take a minute to fill it out to help us improve our webinars. You can also use the Facebook and Twitter widgets on the bottom panel to share the presentation link with your friends, as well as tweet comments and questions using the hashtag #ACMLearning; we'll be watching your tweets.

Today's presentation is "Large-Scale Deep Learning with TensorFlow for Building Intelligent Systems" by Jeff Dean. Jeff is a Google Senior Fellow, where he leads Google's deep learning research team working on systems for speech recognition, computer vision, language understanding, and other related tasks. He co-designed and implemented five generations of Google's crawling, indexing, and query serving systems, and co-designed and implemented major pieces of Google's initial advertising and AdSense for Content systems. He is also a co-designer and co-implementer of Google's distributed computing infrastructure, including MapReduce, BigTable, Spanner, protocol buffers,
LevelDB, systems infrastructure for statistical machine translation, and a variety of internal and external libraries and developer tools. Jeff, without further ado, take it away.

Okay, thank you. What I'm going to be talking about today is some work that we've been doing at Google over the past several years to build intelligent systems using deep learning. This is joint work with many, many people at Google, so this is not all my work; I'm presenting the work of many people. Let's get started.

One of the things that's happened over the last decade, or perhaps fifteen years, is that we've gotten a better handle on how to build systems that can store raw data and manipulate it in various ways. Tools like MapReduce and BigTable and Spanner, and their open-source equivalents such as Hadoop and HBase and Cassandra, make it possible to take large data sets and do computations on them. But what we really want from these systems is not just the ability to deal with raw data; we want systems that can actually understand data at a very high level.

So what do I mean by understanding? Here's an example. If I showed you this image, as a human you would have no trouble telling me what it is a picture of; you could write a sentence about it, like "that's a cute little girl holding a teddy bear." If I showed you this street scene, you would be able to read all the text in it and know what kinds of stores exist there: that this second store is a place where you might go to get a men's suit, and that this yellow-bannered place claims the lowest prices compared to the place next door. If I'm speaking into a microphone, computers have traditionally had a relatively difficult time understanding what people say; speech recognition is a field with a long history, but it has made dramatic improvements in the last few years, and that will have big implications for how we build mobile devices especially, because speech is a very natural interface for such devices.

In terms of things Google cares about a lot, consider search queries. One of the problems we have with search is that we get a query like "car parts for sale" and we need to decide which documents are most relevant to it. In the past, information retrieval techniques have done fairly simplistic things, like counting how often different words occur and in what proximity, and matching the words in the query to the words in the document under a fairly strict interpretation. But if you look at these two example documents, you see that although the first document has a much higher prevalence of the query words, it's probably not a great match for the query; the second document, even though it contains very few of the query words, is a much better match if you really understand the language in the query and in the documents. That's the kind of thing we want to build: computer systems that do a better job of matching and understanding the language in queries and documents, and that know the second one is a better match.

In the future, I think queries are going to get much more complicated. If you think of queries people might make of traditional database systems, those are often very simple structured queries.
Things like "count how often this thing occurs" or "give me everyone who lives in Iowa, sorted by zip code." The queries I'll show next are the kinds of things that I think are now starting to become possible; some of them are possible today, some of them are still a little bit away, but I think they're going to be quite interesting down the road, where we're able to understand things at a higher level and answer queries like these. Which of these eye images show symptoms of diabetic retinopathy, a degenerative eye disease? Find me all the rooftops in North America from a large collection of satellite imagery. Describe this video in Spanish. Find me all documents relevant to reinforcement learning for robotics and summarize them in German; that's a set of three or four very difficult tasks all strung together, but if we were able to do all of those, that would be pretty amazing. Find a free time for everyone on the Smart Calendar project to meet, and set up a video conference; this is an example where you not only need to understand the query, you also need to be able to take actions in the world in order to accomplish the task.

One of the things that has happened in the last five or six years is that neural networks, which have actually been around for quite a long time, have had a big resurgence. Part of what I'm going to talk about today is why that's the case, and what kinds of things neural networks are able to accomplish, as a way of showing how some of the queries on the previous slide can be handled and helped by neural networks.

One way to think about neural networks is that they can learn a very complicated function from observations of data. As one example, they can take in any image, that is, the pixel values of an image, put it through a sequence of layers, and accomplish something like telling you the dominant thing in the image; so you can take in pixels and put out labels like "cat" or "garbage truck" or "leopard." These systems are composed of layers that are all simple, trainable mathematical functions, and I'll talk in a minute about what I mean by simple and trainable. They are compatible with many different variants of machine learning. Mostly I'll be talking today about supervised learning, where you have examples of both the input and the output you want from the system: that picture of a cat plus the label "cat," and millions of other such images. They can also be used for unsupervised learning, where you only have the input, say a bunch of pictures without knowing what's in them, or for reinforcement learning, where you might have to take some actions in the world, and only after taking a whole bunch of actions do you find out whether that was a good idea or a bad idea; then you make little adjustments to all the actions you took, to make them more or less probable depending on whether the outcome was favorable.

These systems are loosely based on what we know about how real brains behave; in particular, we know the most about the visual pathways in human brains. Although these are fairly abstract representations of how we think real brains behave, they do have similar characteristics, in that they tend to build hierarchical abstractions.
The abstractions at the beginning of these models tend to be fairly simple, and the ones farther to the right of this diagram tend to be more complex things, composed of a bunch of primitive features that are automatically learned; as you make your way from left to right in the diagram, the features get more and more complicated. Over the past five or so years we've used deep learning and deep neural networks; the term "deep learning" often refers to very deep neural networks, networks with many, many layers, that can build up these hierarchical abstractions. We've found that they're useful across a really broad set of application domains: lots of things in computer vision, language understanding, and speech recognition, but also things like drug discovery, robotics research, and translation. This chart shows the number of different directories in Google's shared source code repository that contain model description files for one of our neural network training systems, and as you can see, we've had tremendous growth in the number of teams and directories using these kinds of approaches, simply because these techniques are so broadly applicable to many different areas.

The fundamental building block of these models is a very abstract notion of how real neurons behave: an artificial neuron. Each neuron has a bunch of inputs, denoted here by x1 through xn, and each input has a weight, a floating-point number, W1 through Wn, associated with that particular input. Those weights can be positive or negative, and they indicate how much importance the neuron places on that particular input feature. As I said, this is an abstract notion of how real neurons behave, in that they take some inputs, weight them, and produce some output; we're not simulating things at the neurotransmitter or chemical level, but at a higher, more abstract mathematical level. So we compute the weighted sum of the inputs and the weights, and then we put that weighted sum through some nonlinear function f. In the early days of neural networks, many years ago, that function f was typically some smooth function like a sigmoid or a tanh. More recently, people have had more success with what's called a rectified linear unit, the function max(0, x): the neuron outputs zero until the weighted sum crosses a threshold, and as soon as the weighted sum becomes bigger than zero, it outputs the value of the weighted sum. That turns out to be a good nonlinear function for composing together lots of neurons and having them cooperate to accomplish a task. To make this concrete: we have a bunch of weights and a bunch of inputs, which are also just numbers, and we compute the output y of the neuron as the rectified linear function, max of 0 and the weighted sum.

In graphical form, here's how this looks for a particular neural net. This one takes input images and tries to determine whether the input image is a cat or a dog (it's actually a little bit hard to tell here; that turns out to be a dog). We feed the pixels of the input image through the first layer, and some of those neurons will activate based on their weighted sums.
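To make the rectified-linear neuron concrete, here is a minimal numpy sketch of the computation just described: a weighted sum of the inputs passed through max(0, x). The specific numbers are made up for illustration.

```python
import numpy as np

def relu_neuron(x, w):
    """Artificial neuron: weighted sum of inputs passed through max(0, x)."""
    weighted_sum = np.dot(w, x)
    return np.maximum(0.0, weighted_sum)

x = np.array([0.2, -0.5, 1.0])   # inputs x1..xn
w = np.array([0.8, 0.3, -0.4])   # weights W1..Wn, positive or negative
print(relu_neuron(x, w))         # weighted sum is -0.39, so the output is 0.0
```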
We then propagate the activated values upward, which may cause other neurons to fire and emit values, and we continue all the way up, trying to say whether we think the image is a cat or a dog. Say the model we're training says "cat." Because this is actually a dog, and this is supervised learning, we can make little adjustments to the model so that it's more likely to say "dog" the next time it sees this image. That's the basis of training these models: you try to make them do a better job of emitting the proper output values for the training examples you have. We make little adjustments back through the model, so the next time we see this image we're more likely to say dog rather than cat.

The learning algorithm for these kinds of models is actually pretty simple. While we're not done, that is, in a loop for as long as we want to train: pick a random training example, an input together with its desired output; run the neural network forward on the input; then adjust the weights on the edges to make the output closer to the desired output. Backpropagation is a technique that uses derivatives of the functions computed by the neural net to make those little adjustments: the derivative points us in the direction of the error we're making, and we can follow the gradient of the error with respect to the connections, which lets us adjust the weights so that the next time we're more likely to give the correct answer. We can then use the chain rule from calculus to back-propagate these adjustments, not just at the top layer of the model but all the way through it, because the functions involved are all differentiable. So the way that might look is: here's our input neuron, we've got these weights, and based on the gradient we might decide to make little adjustments to the weights; then we go on to the next training example and do this again, making little adjustments to the weights in order to improve the model.

The loss surface of these models is actually quite complicated. This slide shows a very complicated two-dimensional function, but real neural nets are functions of hundreds of millions of variables, and what we're trying to do is get to the global minimum of this function: we have some loss function that says how well we're doing, an assessment of accuracy. It turns out to be very difficult to get to the global minimum, but there are lots of local minima that the gradient descent algorithm I just described will end up in, and most of those tend to be fairly close to the global minimum, or at least good enough for solving real problems, which is what we really care about.
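As a toy illustration of that "while we're not done" loop, here is a sketch of stochastic gradient descent on a single linear neuron with a squared-error loss; the real models adjust millions of weights through many layers via backpropagation, but the structure of the loop is the same. The data and learning rate here are fabricated for illustration.

```python
import numpy as np
rng = np.random.default_rng(0)

# Toy supervised data: the desired output is y = 2*x1 - 3*x2.
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -3.0])

w = np.zeros(2)                  # the weights we will adjust
lr = 0.1                         # how big each little adjustment is
for step in range(2000):         # "while we're not done"
    i = rng.integers(len(X))     # pick a random training example
    pred = X[i] @ w              # run the network forward
    grad = (pred - y[i]) * X[i]  # gradient of 0.5*(pred - y)^2 wrt w
    w -= lr * grad               # nudge weights toward the desired output
print(w)                         # converges to approximately [2, -3]
```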
There's plenty of raw data in the world; the slide shows just a few examples. So how can we build systems that really understand this data? One really nice property of neural nets is that the results tend to get better as you give them more data, but in order to give them more data and have them take advantage of it, you often need to make the model bigger, because what you're trying to do is pick up not just the obvious trends that show up when you have a small amount of data, but also the more subtle patterns that show up only when you have a large data set. The combination of those two factors, needing to train on a much bigger data set and training a larger model on that data set, means you need quite a lot more computation. In some cases better algorithms and new insights help as well, but a really nice property is that you can scale these things up and they tend to do better. As an aside, neural nets are not that new: they were developed twenty to thirty years ago, and a lot of the fundamental training algorithms we're now using successfully were developed in that time frame. What changed is that we now have sufficient computational resources to train large models, and we have large, interesting data sets that allow us to train models that can do interesting tasks rather than more contrived ones.

Okay, so let's go quickly through some ways that deep learning is having a big impact at Google. One of the first things we did was work with our speech recognition team to improve their models using deep neural nets rather than the earlier kinds of machine learning models they were using. Essentially, these models take an acoustic input, an audio waveform over time, basically a recording from a microphone, and feed it through a deep neural network; the output of that network is a transcript of what the user said. So "how cold is it outside" is the output of this model given the acoustic input of someone saying that. These approaches have reduced the word error rate in the speech recognition system by more than 30 percent through a series of improvements, and that's been really important: as speech recognition works better, users start to rely on it more, and you can count on it as a fundamental part of your user interfaces and user experience, whereas if it doesn't work as well, that limits its utility.

Another big area where neural nets have made huge improvements in the last five years or so is image recognition. There's a public contest called the ImageNet challenge, which has about a million images, each labeled with one of a thousand different classes, so there's a correct answer for each image; for the image in the upper right, the correct answer is "leopard." The goal of the models in this contest is to produce five guesses about what the label should be for a given image, given just its pixels, and you can see some examples of the kinds of images in the contest. You get an error on an image if none of your top five predictions is the correct label.
People have been making a lot of improvements using neural nets in this contest. Here's just one example, done by a team at Google, of a fairly complicated, very deep architecture, more than 20 layers deep, where each one of those boxes is tens of thousands of neurons with a particular connection pattern. Before people used neural nets in this contest, they were using more traditional computer vision techniques, and the error rate was hovering at about 25.8 percent. Then Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, who were at the University of Toronto and subsequently came to Google, used neural nets for this task and got a very significant reduction in the error rate, down to 16.4 percent. Over succeeding years you see dramatic improvements in the error rate, as people have scaled up these approaches and developed a bunch of interesting new techniques to train these kinds of models better.

One thing you see here is the row labeled Andrej Karpathy. Andrej is a grad student at Stanford who helps administer the ImageNet contest, and he and his colleagues wondered what the human error rate would be on this task: you're given an image, you have to make five guesses about the true label, and how well could you do? Andrej did the scientifically correct thing: like a machine learning algorithm, he sat down with a bunch of training data and trained himself. I think he did about a hundred hours of training over a weekend; then, on a test set of images he hadn't seen, where he had to give the proper label, he got an error rate of 5.1 percent. He had another colleague who was perhaps not as diligent and did only about fifteen hours of training, and that person got about a twelve percent error rate. He has a great blog post about this experience. So you see that we're now actually better than human-level accuracy on this particular task. That's not to say computers are as good at vision as people, but for certain fairly specific tasks, they're actually better than humans.

These models give good fine-grained classification: for example, they can know that this is a hibiscus and that is a dahlia, whereas I would just say, well, they're both flowers, not being much of a flower expert. They generalize well: if your training data represents both of these images as "meal," they can say that these are both meals, even though visually they're not that similar. And they make fairly sensible errors: for the image on the left, the model might say "snake," and you can sort of understand why it might have said snake, even though that's a banana slug; for the image on the right, it might say "dog," and that's actually a donkey in the back, and I think a donkey in the front, although it might be a goat. So you can see these are pretty hard examples to classify correctly.

One of the areas where we've been using this kind of approach is Google Photos search, where we can now take photos users have uploaded and understand what's in them, for the purpose of organizing the photos for the user. Knowing that an image contains an ocean, you can now search "show me my pictures of oceans" and find them. We can group images users have uploaded by the categories of things they contain: we can do "my photos of Siamese cats" and it will retrieve Siamese cats, and maybe other cats; you can type fairly abstract things like "drawing" and it will retrieve those. And it's very important that you be able to retrieve your photos of Yoda, even if it's a macrame Yoda, which looks nothing like most of the others.

The same basic approach can be used to look at board positions in the game of Go.
Our DeepMind group in London had a breakthrough result in beating a human champion at the board game of Go, which was one of the last unsolved board games where computers were still worse than humans; as of last March, we've been able to beat a human champion using neural nets plus some reinforcement learning, and I recommend reading the paper listed at the bottom of the slide.

One of the things we found is that we can reuse the same general model for completely different problems. I'll give you some examples where, given an image, we'd like to find the interesting parts of that image, and show how that can be used in different contexts. One of the first things we wanted to do, once our vision models were working better, was to find and read text in Street View imagery. Google has driven cars around the world taking pictures of street scenes to improve our Maps product, and it would be really helpful to be able to read business names, for example, to know what businesses exist in the world, and to confirm that street names are what we believe they are. One of the first tasks in doing that is simply finding the text in the street scenes. Here's a basic model that my summer intern and I worked on, in conjunction with people on our Street View team, a few years ago, where we wanted to emit a heat map of which pixels contain text and which do not. This model is trained on Street View scenes where we've had people draw polygons around the bits of the image that contain text, and you can see that it works well on lots of different character sets, different font sizes, and different font colors, purely from learning from the examples we gave it.

You can take the same general approach and use it on satellite imagery. Here's a product that a few people at Google put together because they thought it would be interesting to locate roofs in satellite imagery of houses and then understand how much area each roof has and at what angle it generally faces the sun, so you can understand how much solar power you might be able to generate by putting solar panels on those roofs. It's the same general problem: you have satellite imagery, you want to first find the house rooftops, and then do some further processing. A third domain where this is useful is medical imaging, where we have images of retinas and we want to find the parts of the eye that indicate symptoms of disease, in this case diabetic retinopathy. There are subtle hints of diabetic retinopathy if you look closely at these kinds of images, and being able to identify where those signs are is quite helpful.

Let's get back to our language understanding problem. We have this query, and we want to be able to say which document is a better match for it. One of the things we'd like is a smoother understanding of what words like "car" and "parts" really mean, rather than just looking for exactly the letters C-A-R. One thing you can use neural nets for is what we'll call an embedding function, where we represent different words as points in a high-dimensional space: each unique word in our vocabulary is going to be a point in this space.
We'll move words around in this space as we find they're similar to or different from other words. The hope is that we end up with an embedding function where words like "porpoise" and "dolphin" are near each other in this very high-dimensional space (I'm showing a three-dimensional space here, but really we would use a hundred dimensions, or perhaps a thousand, for each word), and that unrelated words, like "camera" and "porpoise," are not anywhere near each other.

You can train these embeddings the same way you train other parts of a neural net, with backpropagation. There's a model called word2vec where, essentially, we take a small window of words and try to use the center word in that window to predict one of the other nearby words. So I'm going to use the word "automobile" to try to predict the presence of the word "parts": we look up the embedding for "automobile," so now we have, say, a hundred-dimensional vector for "automobile," and we use the contents of that vector, which is just a hundred floating-point numbers, to try to predict the word "parts" from our vocabulary. If we're able to do that, great; if not, we make little adjustments to the model using backpropagation, including moving the embedding representation for "automobile" around a little bit, so that the next time we see "automobile" we're more likely to predict "parts."

If you do that, you get the property that nearby vocabulary items are related. Here I'm showing a model trained on Wikipedia, where each word or phrase in the vocabulary is represented by a 500-dimensional embedding vector. What you see is that the nearest neighbors of "tiger shark," by absolute distance in the embedding space, are all shark-related things. "Car" is perhaps even more interesting: you get "cars" very nearby, but you also get synonyms and other words you might use to refer to a car; "automobile" and "pickup truck" are all nearby in the embedding space. So we've really captured the essence of the meaning of "car," rather than just the fact that it contains the letters C-A-R.

It turns out that directions in these high-dimensional spaces are also meaningful. For example, if you take all of these country names and capitals, it turns out that in the high-dimensional space you go in roughly the same direction and distance to get from a country to its corresponding capital, for any of these country-capital pairs. (This is the 500-dimensional space projected down to two dimensions using a technique called principal components analysis: we've essentially selected the two most important dimensions and projected onto them, so that we can visualize the space better.) You also see that verbs, the present and past tenses of verbs, make the same kind of structure regardless of which verb it is. This actually allows us to do interesting things, like solve analogies with simple vector arithmetic on the embedding vectors: "king" is to "queen" as "man" is to what? can be solved with simple arithmetic on the vectors.
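Here is a toy sketch of that analogy-by-vector-arithmetic trick. The four 3-dimensional embeddings below are fabricated for illustration (a real table would be learned from data, with hundreds of dimensions, for example by word2vec), but the arithmetic is the same: the word nearest to vec("woman") - vec("man") + vec("king") should be "queen."

```python
import numpy as np

# Made-up embeddings, purely for illustration.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "man":   np.array([0.1, 0.2, 0.1]),
    "woman": np.array([0.1, 0.2, 0.9]),
}

def analogy(a, b, c):
    """Solve a : b :: c : ? by finding the word nearest vec(b) - vec(a) + vec(c)."""
    target = emb[b] - emb[a] + emb[c]
    candidates = (w for w in emb if w not in (a, b, c))
    return min(candidates, key=lambda w: np.linalg.norm(emb[w] - target))

print(analogy("man", "woman", "king"))   # -> "queen"
```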
We've also been exploring the use of neural networks in our search ranking function. Last year we launched a system we call RankBrain, which is essentially a deep neural network that, given a query and a document, says how good a match that document is for that particular query. Looking back at the example I showed you, we'd like to say that a document containing "rebuilt transmissions" is actually a good match for the query "car parts for sale." It turns out that when this system launched, it became the third most important search ranking signal, out of hundreds of different signals.

Okay, I'm going to move quickly ahead. There are a lot of interesting things about recurrent models, which are basically about how you model sequences of data. There's a simple model of memory that you have in a computer, where you can write a memory cell, read a memory cell, or erase, that is, forget, what's in a memory cell; you can think of that as a model of how DRAM works, for example. There's a really fundamental piece of work by Sepp Hochreiter and Jürgen Schmidhuber, developed in 1997, called long short-term memory (LSTM), which is essentially a way of taking that idea of a memory cell and turning it into a building block for neural networks that is differentiable. The way to think about it is that, unlike a real memory cell where you discretely read or write or erase, here you can read a little bit of what's in the memory, or write 0.6 of a value into the memory while keeping part of whatever was already there, or forget a little bit of what's in the memory. By making these memory cells differentiable, they turn out to be really good building blocks for recurrent models of sequences: they can hold on to information for a long time and then return it when you need it, and I'll show you some examples of this. Here are the fundamental equations of the LSTM cell that Hochreiter and Schmidhuber developed. I don't want you to necessarily read all this code, but in our TensorFlow system, one of the properties that researchers really like is that there's a rough correspondence between the kinds of equations you might write in a machine learning paper and the code you would write in TensorFlow to implement those equations. So just remember the LSTM cell as a building block.

My colleagues Ilya Sutskever, Oriol Vinyals, and Quoc Le decided to use LSTM cells to model the problem of taking one sequence and transforming it into another sequence. This seems like a rather abstract problem, but it turns out to be quite useful in a bunch of different domains that I'll talk about. You have an input sequence, say A B C D, and some training data saying that A B C D maps to X Y Z Q, and we want to train the model so that when it sees A B C D it emits X Y Z Q. That's actually the problem of translation, if you frame it in this context: you have a French sentence, and you want to train the model so that when it sees the French sentence, it emits the corresponding English sentence. We take in French one word at a time, and when we see the end-of-French token, we train the model to emit, one word at a time, the English that has the corresponding meaning. We have training data consisting of English and French sentence pairs, and we just repeatedly train the model to do a good job of emitting the correct words given the input data.
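Since the talk shows the LSTM equations side by side with the corresponding TensorFlow code, here is a rough numpy sketch of one step of a standard LSTM cell. The gate ordering and the packed weight-matrix layout are my assumptions, not the slide's; the point is just that every operation is differentiable, so the "read a little, write a fraction, forget a little" behavior can be trained by backpropagation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One step of a standard LSTM cell.
    x: input vector; h: previous output; c: previous memory cell contents.
    W has shape (4*H, len(x)+H) and b has shape (4*H,), packing the input,
    forget, and output gates plus the candidate value to write (assumed layout)."""
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # gates take values in (0, 1)
    c_new = f * c + i * np.tanh(g)  # forget a fraction, write a fraction
    h_new = o * np.tanh(c_new)      # read a fraction of the memory
    return h_new, c_new
```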
That's how the model is trained. Then at test time you get a completely new sentence you've never seen before, and you can explore the set of possible sentences the model would like to generate, because each of the outputs is a probability distribution over words, and you can sample from those distributions in order to find a maximally likely sentence.

One of the things that happens in this kind of model is that after we've looked at all the words in the input sequence, we have the state held in all of these LSTM memory cells, and you can think of that as another one of these high-dimensional vectors, which you can visualize the same way we projected things down to two dimensions before. If we look at the state after absorbing different input sequences, and map that state, which is really something like a 4,000-dimensional vector, down to just two dimensions, what you see is that phrases with similar meanings end up nearer each other than phrases with the same words but the opposite meaning. "John is in love with Mary" is close to "John admires Mary," and both are far away from "Mary admires John," which is what you would hope, because we care about mapping similar meanings to points that are near each other in this high-dimensional space. Similarly, in these other examples, the three phrases at the top all have the same meaning, and they're all far apart from the phrases at the bottom that have the opposite meaning.

One of the ways we've been able to use this in our products is a system called Smart Reply. In 2009 we actually made an April Fools' joke about this: we said, ha ha, Gmail is going to start responding to your email automatically for you. But it turns out that in November last year we were able to launch a real product around this. The way it works is essentially the sequence-to-sequence model: we take in the incoming email one word at a time, and we try to predict a likely response given the context of the incoming email. This works quite well: more than 10 percent of replies on our mobile Inbox application for Gmail are now generated using these automatically generated responses. Here's how it works. We first use a small neural network to take the message the user received and predict whether this is the kind of message where a short response might make sense, where we'd have a reasonable chance of generating a good response. If yes, then we activate a much more expensive model, the sequence-to-sequence model, to predict likely responses given this input. Here's an example of a message one of my colleagues received from his brother; it says, "Hi, we wanted to invite you all for an early Thanksgiving dinner. Please bring your favorite dish. RSVP by next week." We activate the Smart Reply model and it generates three plausible responses: "count us in," "we'll be there," or "sorry, we won't be able to make it." You can select one of those and automatically reply, and you can also edit the response after you've selected it.

The sequence-to-sequence work has spawned a huge set of follow-on work; the slide shows just a few examples of work done at Google and elsewhere using this model or variations of it, and it can be applied to many different kinds of problems.
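To make the decoding step concrete, here is a toy sketch of sampling an output sequence from per-step probability distributions over words. The tiny vocabulary and the dummy step function (which just returns a uniform distribution) are stand-ins for a trained LSTM decoder; in practice one would sample or search over many candidate sentences and keep the most likely one.

```python
import numpy as np
rng = np.random.default_rng(1)

vocab = ["count", "us", "in", "we'll", "be", "there", "<eos>"]

def dummy_decoder_step(state):
    """Stand-in for a trained decoder: emits a distribution over the vocabulary."""
    probs = np.full(len(vocab), 1.0 / len(vocab))
    return probs, state

def sample_sentence(step_fn, max_len=10):
    words, state = [], None
    for _ in range(max_len):
        probs, state = step_fn(state)
        word = vocab[rng.choice(len(vocab), p=probs)]  # sample one output word
        if word == "<eos>":                            # end-of-sentence token
            break
        words.append(word)
    return " ".join(words)

print(sample_sentence(dummy_decoder_step))
```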
I'll show you just one more. Remember that cute picture of the little girl? It turns out that if, instead of accepting input one word at a time, you rip off the front part of the sequence-to-sequence model and instead initialize the state of the decoding part of the model with the high-level features from one of these convolutional image models, you get image captioning. We have an image model: we take in pixels, we get high-level features out, and we use those to initialize the decoder state. Those high-level features carry a really good understanding of what kinds of things are in the image, because the image model was trained to be able to say "little girl" and "teddy bear." It turns out you can then use a little bit of data, where you have human-written captions for images, to train the model to do a good job of generating human-like sentences. This is a test image the model was not trained on, along with an example caption that a human actually wrote for it. It's a generative model, so it can generate multiple plausible sentences; here are two it generated given these pixels as input: "a close-up of a child holding a stuffed animal," and "a baby is asleep next to a teddy bear." This is, I think, pretty advanced for a computer vision model; I would not have thought this was possible a few years ago, but it turns out to do a reasonable job of producing quite good sentences about images it has never seen before. Here are a few more examples. It's not as sophisticated as what a human might say: a human might say "a tennis player getting ready to serve," while this says "a man holding a tennis racket on a tennis court"; yes, that's true. And when it fails, it's kind of funny. I assure you there is actually a snowboard down below there, but you can't see it, and there is a man there, but oh well.

You can put all these things together. With vision and translation, we have an app that lets people, when they're traveling, hold up the camera viewfinder to signs that are perhaps in a language they don't understand or speak, and we superimpose their language onto the image over the text we find on the sign. So we have to find the text in the camera-view pixels, translate that text, and then superimpose the translated text onto the image.

Okay, so let me very briefly talk about the tool we've been building, a system called TensorFlow. There's a huge difference in turnaround time if you're trying a machine learning experiment and you get results in minutes or hours rather than weeks; it's a fundamentally different feeling doing research when you can launch an experiment and get results that afternoon. So one of the things we focus on a lot is how we can make research happen quickly: essentially, train in one day what would take a single GPU card six weeks. There are a few different ways we can do this, and I'll go through them quickly, but essentially there's lots of parallelism present in the neural-net training process, and we can exploit that parallelism.
Model parallelism is the idea of taking a large model and performing the computations of that model in parallel on different computational devices. For example, we could partition the computation here into three pieces, and then have three different computers, or three different GPU cards, work on the computation needed to perform the operations of that neural network. Data parallelism is the idea of having many different replicas of the model, along with a shared set of parameter servers that hold the parameters of the model, the numbers we're making little adjustments to as we do gradient descent. We train by having each model replica pick up a small batch of examples, get the current set of parameters, run through its small batch of examples, and compute a gradient it would like to apply to those parameters; instead of applying the gradient locally, it sends it back to the centralized parameter servers, which adjust the parameters. Now we have a new set of parameters, and on the next step we can fetch the new parameters, make little adjustments to those, and apply them again. You can do all of this in parallel, with many different replicas training on different data examples and making adjustments to the shared parameters, and it helps quite a bit: with training time in hours on the x-axis and accuracy on the y-axis, it takes a lot longer with one GPU to reach a given level of accuracy than with 50 replicas of the model on 50 GPUs. You get about a 30x speedup using 50 GPUs, and that allows us to do the research we want to do much more quickly.

There are a bunch of things you might want in a machine learning system: being able to express lots of different machine learning ideas; being able to scale to large data sets and run experiments quickly; being able to run on a variety of platforms and reproduce results; and having research be ready to go directly into real products without having to re-implement it. We've been working on a system called TensorFlow, which we've now open-sourced, and we think it does a good job of meeting the requirements on the previous slide. We released it in November last year; it's available on GitHub, and we have our own website with a bunch of tutorials and pointers to the GitHub code. It has a very flexible Apache 2.0 license, so people can take it and do pretty much whatever they want with it, and use it to solve machine learning problems they might have. There's been a really good reaction to it in the community. We have a white paper about it that I won't go into; as you can see, it's a pretty significant group effort on the part of many people at Google, and the paper describes the inner workings of how TensorFlow operates. We've had quite strong internal and external adoption of TensorFlow. This chart shows TensorFlow measured in GitHub stars and forks, a couple of different GitHub measures of how active the community is and how interested people are in a particular repository, compared with three other machine learning and deep learning packages that are also available on GitHub and have been around for a lot longer than TensorFlow. In fact, TensorFlow was the most-forked repository on GitHub in 2015, despite only being available starting in November. So there's been tremendous community interest in TensorFlow, and we're really excited that we now have this flourishing community of machine learning people using TensorFlow for lots of different things.
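Here is a serialized toy sketch of the data-parallel scheme he describes: model replicas compute gradients on their own mini-batches and send them to a parameter server, which applies them to the shared parameters. In the real system the replicas run concurrently on separate machines; the linear model and data below are fabricated for illustration.

```python
import numpy as np
rng = np.random.default_rng(0)

# Toy regression task; the "model" is a 2-weight linear predictor.
X = rng.normal(size=(1000, 2))
y = X @ np.array([2.0, -3.0])

params = np.zeros(2)             # held by the parameter server

def replica_gradient(params, Xb, yb):
    """A model replica: run its mini-batch forward, return a gradient (not new params)."""
    return Xb.T @ (Xb @ params - yb) / len(yb)

lr, n_replicas = 0.1, 4
for step in range(500):
    for r in range(n_replicas):  # pretend these iterations run in parallel
        idx = rng.integers(0, len(X), size=32)      # the replica's mini-batch
        g = replica_gradient(params, X[idx], y[idx])
        params -= lr * g         # the server applies the replica's update
print(params)                    # approximately [2, -3]
```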
One of the things that I think has been really helpful about TensorFlow is that we put together a bunch of tutorials that describe the kinds of models I talked about earlier and show you how to use TensorFlow to accomplish some of those things. External users have been doing lots of different things in TensorFlow and posting their results on GitHub; about five of these repositories are from Google, and the rest are from external community members.

I'm going to skip over the details of how TensorFlow works and go on to a trend that I think is going to happen, which is that we're going to have much more heterogeneous hardware, because general-purpose CPU scaling has slowed quite significantly; specialization of hardware is going to be really important going forward. One of the things Google has done is invest in building the right kind of hardware for machine learning applications: we had a custom ASIC design team at Google design a chip, called the Tensor Processing Unit, that is designed to make machine learning algorithms run quickly. It has been in production use in Google data centers for 14 months; it's used on every search query, and it was used in the AlphaGo match against Lee Sedol in Korea a few months ago. We think this kind of hardware is quite important for making machine learning algorithms, and expensive deep neural nets, actually practical.

I'm going to skip over the next part and go directly to a final video. One of the things we've been using TensorFlow for is to train robotic systems, so I'll just hit play. These are robots we've set up in a lab that are learning to teach themselves, using the cameras you see above the robots, to pick up different objects. You can turn this into a supervised problem, because if you close your gripper all the way, you failed to pick something up, and if you close it partway and pick something up, that's a success. So these models can learn, from the pixel input of the camera, how to grasp the different objects they're seeing, and we've been able to train systems using this collection of robotic arms to grasp things quite well. That's a fun and interesting use of TensorFlow. With that I'm going to conclude; here are a few references to other kinds of problems, and I'll turn it back over to Stephen.

Thank you, Jeff; just a wealth of information there. Let's now move on to the questions and answers. Our first question: are any of the modern deep network breakthroughs useful for discovering features for solving particular mathematical operations and algorithms, and is there any hope they might be used to discover new factoring methods, for example? It's possible. There's some interesting work starting on using neural nets to help with theorem-proving kinds of problems, and I think that's going to be an interesting line of research in the next few years. I'm not aware of work specifically related to factoring methods, but certainly neural nets can do impressive things: there's a paper by Ilya Sutskever and Wojciech Zaremba where a system was given little snippets of a subset of Python, together with the outputs of those snippets, and was trained on those example inputs and outputs.
It was actually able to learn to behave like a Python interpreter purely from the examples it was given; that's an example of learning a very complicated function purely from observing examples.

Next question: does Google have any plans to make Tensor Processing Units commercially available, or available for academic research use, or are TPUs specialized for Google's specific hardware infrastructure only? At the moment they're internal only. We are releasing a Cloud Machine Learning product around TensorFlow, so that people can give us TensorFlow programs they would like to run and we can optimize those TensorFlow models for them; we don't have plans to use TPUs in that system at the moment. Our Cloud Machine Learning service is in alpha, and people can access it through our normal cloud APIs.

Jeff, is it possible to write custom TensorFlow operations? Yes, that is certainly possible. The model is one of operations, and then you have implementations of those operations that can run on different devices, so it's easy to plug in your own operations, and also your own implementations of those operations; the TensorFlow codebase has lots of examples of this.

A bit of a broader question: so far we've been learning functions for matching or recognition, for performing a particular task; how do we move from learning functions to consciousness? Well, that's a difficult question. I think systems that are able to move from raw data to higher levels of understanding are the main path forward to making the kinds of connections between disparate pieces of information that humans are really good at, and being able to do that in computer models. If you were able to do that really, really well, that would be, in some sense, kind of like consciousness. I don't want to get into the philosophical debate of what consciousness in a computing system is, but I think higher levels of understanding, and being able to perform actions based on that understanding, are going to lead to very intelligent systems that will improve significantly over the next 10 to 15 years.

A separate question along the same lines: the ReLU, f(x) = max(0, x), is not differentiable at zero; does that create any problems? You're right that it's not differentiable at zero; it doesn't seem to create problems in practice.

Next, a question about convolutional networks: where you have a large image of a cat, what is really the representation taken as input to the next layer? The actual input representation might be a 200 by 200 array of pixel values, where each pixel value is just three numbers, for red, green, and blue. That feeds into the next layer, which might have a whole bunch of neurons at each spatial position; you might have 128 neurons at each spatial position, and the outputs of those are just numbers again, arranged in a grid. So essentially that's a three-dimensional array, which you can think of as a tensor, not in the physics sense but in the more informal sense of a multidimensional array. That's actually how TensorFlow got its name: it's very good at representing the kinds of tensors that flow around in these kinds of models.
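A quick numpy illustration of the shapes he describes: the input image is a rank-3 tensor, and a convolutional layer with 128 filters produces another rank-3 tensor, with one 128-number vector per spatial position (assuming, for this sketch, that padding keeps the spatial size at 200 by 200).

```python
import numpy as np

image = np.zeros((200, 200, 3))       # height x width x (red, green, blue)
features = np.zeros((200, 200, 128))  # 128 neuron outputs at each position
print(image.ndim, features.shape)     # 3  (200, 200, 128)
```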
How many layers does it take to call a neural network deep? "Deep learning" is more of a term invented by the popular press, so I don't have a specific answer. Twenty or thirty years ago we were training models that were two or three layers deep; now we routinely train ones that are 10 or 50 or a few hundred layers deep. I would say anything more than five or six layers you could call deep, if you want.

This is probably our next-to-last question: do you see any possibility of using deep learning in reinforcement learning settings with very little feedback? Yes; I think this is actually a big research challenge for this area of work. Generally, neural nets tend to need a lot of examples to do a good job, but what we actually care about is being able to generalize very well from a very small number of labeled examples. Humans are incredibly good at this: I can see one or two examples of a whiteboard marker and all of a sudden I just know those are whiteboard markers, whereas a neural net might need hundreds of example images of them. I think the general areas of transfer learning, multitask learning, and unsupervised learning are the ways forward to building systems that can solve a new task with relatively few examples.

Okay, this is going to be the last question: can you talk a little more about TPUs and whether they're inspired by the neurological architecture of the human brain? Sure; I can't say too much about TPUs. They're not particularly inspired by the human brain; they really come from looking at the kinds of computations you want to perform in deep neural networks in particular, and from the realization that, for example, some of those computations tend to be quite forgiving of a bit of noise. You can take advantage of those properties in building a chip that is customized to run those sorts of algorithms: you can do things like use significantly lower precision than traditional GPUs or CPUs. I think it's useful to look at biology for inspiration, but on the other hand, biological systems and silicon systems have different strengths and weaknesses, so it's important to build systems that do a good job given the capabilities we have in silicon today, while also taking inspiration from biology.

We're flooded with so many questions, but we could ask you one more. People are asking how TensorFlow compares to some of the other platforms, like Theano or Caffe, in terms of performance and capabilities. That's a pretty long question to answer, and there are a lot of actual comparisons on the web. Soumith Chintala, who works at Facebook, has a public GitHub site that does a performance comparison of a bunch of these packages, and I would say that Torch and TensorFlow tend to be at about the same level of performance. In terms of ease of use, I think people tend to find TensorFlow and Theano fairly easy to use from Python, and both TensorFlow and Theano feature automatic differentiation, which means you don't have to differentiate your functions yourself; the framework does it for you, which is a nice property for these backpropagation-based learning algorithms. We feel that TensorFlow has a good mix of production readiness, ease of use, performance, and scalability, and that's why we built it and open-sourced it.
The other packages have different strengths and weaknesses in those areas, with somewhat different trade-offs; we feel TensorFlow is a good trade-off for the kinds of problems we work on, and we hope other people agree.

Jeff, again, I want to thank you; I'm afraid we've run out of time today. A special thank-you to Jeff Dean for his informative presentation and insightful answers to the many questions, and special thanks to each of you for taking the time to attend and participate today. This webinar was recorded and will be available online in a few days at learning.acm.org/webinar, and you'll find announcements of upcoming webinars and other ACM activities at learning.acm.org and acm.org. Also, please fill out our quick survey, which you should see on your screen, where you can suggest future topics or speakers. Again, on behalf of the ACM, Jeff Dean, and myself, Stephen Ibaraki, thanks for joining us, and I hope you will join us again in the future. This concludes the webinar.
Info
Channel: Association for Computing Machinery (ACM)
Views: 20,634
Rating: 4.8923078 out of 5
Keywords: Jeff Dean, ACM, ACM Learning, Webinar, Learning, tensorflow, deep learning, large-scale, google, engineering
Id: vzoe2G5g-w4
Length: 65min 8sec (3908 seconds)
Published: Tue Aug 09 2016