Deep Learning: Practice and Trends (NIPS 2017 Tutorial, parts I & II)

Captions
So, thanks for joining us and being here so early. Hopefully those coming from the East don't feel too bad. First, I would like to point out that our field is indeed growing; we've probably all seen the plots showing the growth in registrations. As a result, I think we haven't quite figured out how to handle registering thousands of people in such a short amount of time. For those coming in a bit later, we will of course put up the slides, so hopefully they don't miss too much. Otherwise, welcome, and thanks for queueing very early today. It's great to have you here, and we're very happy to be talking about deep learning practice and trends.

We did a bit of a survey on Twitter, asking what people want to take away from tutorials, and we found that most people are actually interested in bleeding-edge research, which sounds a bit counterintuitive given that this is a tutorial. So the second part, about trends, covers more recent work that we'll try to explain in depth. The first part, the practice part, is more about giving you the gist of deep learning and the tools available to a researcher or an industry practitioner alike: how to get into deep learning, and what it entails to get a model into production or to try out a research idea. For the second part we picked five trends that we consider to have been important in the past years or that will be important in the next few years. Obviously there are many more trends and we cannot cover all of them, but we're happy to discuss them during the questions.

So let's begin. I like to think of deep learning as a toolbox enabler: there is a gigantic supply of papers, source code,
tutorials. I think this is great; it lowers the barrier to entry tremendously. The way to see this is that you have all these tools available to you, frameworks and whatnot, and at the modeling level you get to decide certain aspects: your neural network architecture, how you optimize it, what the task is, what the inputs and outputs are. You piece this together and there you go, you have a model.

You can also zoom out a little from this picture, and there are other choices that are also quite important. For instance, even before thinking about the model: how are you going to train it, and how are you going to deploy it? You can have specialized hardware, you can have generic hardware like GPUs and CPUs, or you can decide to use a cloud-based approach. These decisions can affect quite a bit what your model looks like or what you can do in your research, so they are very important early decisions. The second is probably the framework. There are many frameworks available, almost too many it feels, and you have to understand the differences between them. Perhaps more importantly, as a researcher a framework might limit the kinds of things you can try easily, and for deployment some frameworks have better tools for certain platforms. All these decisions are crucial to get right, otherwise you might regret them later. Lastly, there is a vast number of datasets you can decide to work on. If you have a company and want to deploy something, maybe you have your own datasets, but deciding what they look like, how big they need to be, and so on are all quite important decisions.

So this is zooming out on the aspects of deep learning. We can also go the other way and zoom in, and if
we zoom in, you start seeing the life of a practitioner or researcher. There are many choices, starting from neural network architecture details, like which nonlinearities you are going to use. There are also optimizers: how do you optimize your model? This is really the engine, the core of what you're trying to do here, which is to get the weights to settle on a low loss or a high reward. There are things like connectivity patterns that you have to choose, or mix and match. The loss you choose is also quite important, and we are seeing, especially through reinforcement learning, quite a variety of losses that do not need to be differentiable anymore, so you can truly optimize things end to end, which is great. And then perhaps the elephant in the room, which I'm not sure anyone here likes: hyperparameters. They are crucial to making models work, and the list goes on and on (it runs off the slide, so you don't see many of them). Tuning these is almost part of your day-to-day if you train many models. All these and many more are details that you have to start caring about once you get into deep learning.

For the talk we're going to use this notation. We'll talk about a few topics, and generally they are either about inputs and outputs, about architectures, or about losses. You'll see these on the slides, and it might be useful to keep in mind what each part of the talk is about: is it about a loss, is it about a modality that we haven't been able to use before in deep learning, and so on.

Let's start with some very basic things that we were already doing with machine learning and that are still very important and relevant, which are what I call vectorized inputs. This is perhaps the most classic collection of datasets, which many people may have used,
especially those who have trained support vector machines and other kinds of models: the UCI Machine Learning Repository. These datasets are basically a mix of continuous and categorical attributes that you need to make predictions from. Here is an example from a dataset called Adult. You see things like a person's age, their education, and whether they are in a relationship, and you can predict any other attribute from these, for instance their income. So this is a very simple supervised learning task, and the input is a bit unstructured, mixing continuous and categorical things. Honestly, neural nets are perhaps the least suited to this kind of dataset; things like SVMs or boosted decision trees are perhaps state of the art nowadays.

What neural nets and deep learning really started to make good progress on is more structured, perceptual signals. Images are much higher-dimensional than the UCI repository datasets in general, and you can do all sorts of things with images. We've seen many examples. For instance, there is the classification problem where you want to classify images into a thousand classes; this is ImageNet. We've also seen a lot of very cool progress on generative models, and it is almost unbelievable how good these models have become. There, a model is just trained to reproduce the modality that you give it. Here it is celebrity faces: the model captures the joint distribution of all the pixels and is able to sample realistic-looking images. There are also some applications that were quite surprising to most of us, and I think this is why this field is so great: so many people are thinking about novel applications that we perhaps hadn't thought about.
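To make the vectorized inputs described above more concrete, here is a minimal sketch of turning one record with mixed continuous and categorical attributes into a flat numeric vector using one-hot encoding. The attribute names, the category lists, and the age scaling are illustrative assumptions, not the actual schema of the UCI Adult dataset.

```python
# Hypothetical category lists, loosely modeled on Adult-style attributes.
EDUCATION = ["HS-grad", "Bachelors", "Masters"]
RELATIONSHIP = ["Single", "Married"]


def one_hot(value, categories):
    """Return a list with 1.0 at the position of `value` and 0.0 elsewhere."""
    return [1.0 if value == c else 0.0 for c in categories]


def vectorize(record):
    """Concatenate a scaled continuous feature with one-hot categorical features."""
    age_scaled = record["age"] / 100.0  # crude scaling, chosen only for illustration
    return (
        [age_scaled]
        + one_hot(record["education"], EDUCATION)
        + one_hot(record["relationship"], RELATIONSHIP)
    )


example = {"age": 39, "education": "Bachelors", "relationship": "Married"}
features = vectorize(example)
print(features)  # [0.39, 0.0, 1.0, 0.0, 0.0, 1.0]
```

A vector like this could be fed to an SVM, a boosted tree model, or a small neural net; the sketch only covers the encoding step, not any particular model.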
One application I love is style transfer: you take a picture and a style image, and then you can combine them. This is one that is actually used commercially; there are apps that do this and people have lots of fun. It's perhaps Photoshop empowered by deep learning, and I think we're seeing more and more of these applications, which is great.

Then, perhaps the modality that I've personally worked on more is sequences. Sequences in a way extend images, because you can think of an image as a sequence, and you also have videos, which naturally are sequences of images, which themselves are sequences of pixels. Text is also an important domain, one where deep learning took perhaps a bit longer to become really state of the art and get deployed in production. There are also other things that are quite fun to work with, like programs and sequential decision-making problems, which inherently have a sequence underneath.

To finish this section on inputs and outputs, I'm going to give some generic advice. When you're faced with a new dataset, you typically have an idea, or perhaps you just want to train a baseline model. What you have to do first is make the model run. This could mean anything from getting your code to compile to the graph in your framework not having cycles, and so on. Then you want to run the model and see the loss going down. More and more, we see people using cloud services to then do the annoying or slightly difficult hyperparameter search, which is so crucial for deep learning. You iterate there quite a bit, and then if you're a researcher you might want to write a paper, and if you're in industry maybe you deploy your model. But these are definitely the steps
to see something succeed, not just in deep learning but in machine learning generally. Obviously this is simplified, but it's important to understand.

I'm going to start by explaining three key building blocks that are going to be used heavily in the more advanced trends section. The first is convolution. In retrospect, all these architectures have a common characteristic, which is that they have the right inductive biases. In deep learning we don't like to have hand-tuned or engineered features, but what we definitely want is this inductive bias. I'll repeat this term a lot during the talk, and I hope you will relate to it.

Before getting there: what we're talking about here is perhaps the simplest form of image classification. The inputs are image pixels, the outputs are perhaps a class label, the architecture is centered around convolutional neural networks, and the losses are fairly standard: cross-entropy for classification, or perhaps you regress to a continuous target. Convolutional nets have been around for quite a long time. How I learned about them was through this paper, which is very cool and has very good results on MNIST. Things were a bit more complicated when I started working on machine learning: the frameworks did not make it possible to train a model on some dataset in literally 10 lines of Python code, so it was a bit overwhelming to work with these models in the early days. We also didn't have GPUs, which are so good at the patterns of computation explained in the next few slides.

As I was saying, the key inductive bias that convolutional neural networks have is the idea that in images we care about two sorts of invariances:
one is locality: we believe that things that happen nearby in pixel space are correlated and form a group, so to speak. The other, and this is application dependent, is that whether something is at a position like top left or bottom right shouldn't change much, at least in terms of knowing what the object is. If you move the object, it's still the same object. These two key inductive biases are critical, and they are precisely how you go, when designing the architecture, from a fully connected architecture, shown here on the left, which connects all the pixels to the next layer and makes no use of these inductive biases, to convolutions.

You can start by saying: let's not think about convolutions yet, but let's make each output of the next layer depend only on a local region. For instance, this green point does not connect to all the points in the image (this is an N-by-N image); it only connects to, say, a three-by-three region on the top left. This purple one connects to this other three-by-three region, and so on. These are not quite convolutions, but they are almost convolutions, and this model was actually used at some point as well; it amounts to untying the weights in a convolution. It is not so good for GPUs, though, because if the weights are untied you cannot batch as many computations into one kernel call.

The second assumption says that whether you are on the top left or the bottom right, you want to use the same filter to analyze the image. The filter is the set of weights denoted here by these lines. By tying, or sharing, the weights you get what a convolution in two dimensions looks like, and that's where the name comes from. To be more specific about what this means: you have an input that's perhaps four by four, so this is a very low
resolution image, and with a three-by-three kernel you create a second image that's two by two. Here we assume no padding, no stride, and so on; we're not going to go into much detail there. This kind of operation is, first, parallelizable on GPUs, which is great, and second, it has the right inductive biases, which help the model learn quite a bit.

To extend this and explain what a convolutional network actually is: you generally don't have a single input plane and a single output plane. You generally have many planes as input. For instance, you have RGB channels, and later in a convolutional neural network you have feature layers with maybe sixteen or thirty-two channels. This is all fine, because all you need is a multiplication that, instead of three by three, is in this case three by three by three. So here we have twenty-seven filter weights, and that's basically the building block of convolutional neural networks. You can also have several output channels, which is useful: in most cases you do want to expand the image from the three color channels to channels that will eventually become class discriminative for classification.

If you put all this together, you just stack many layers of these convolutions. This is a particular architecture that does this, namely AlexNet. You also have some pooling layers, which are essentially convolutions with fixed weights that just sum over these 3x3 patches, or perhaps over all the inputs. What happened was quite amazing. We already knew this was very good for MNIST, but on ImageNet it made a discrete jump in performance on this very important competition, which was run every year, including this year. The competition allowed researchers in computer vision to test themselves on a standardized benchmark where you also didn't have
access to the test data, so it was a very well run and well thought out way of testing progress in machine learning. The first architecture that used convolutional neural networks in this competition was in 2012. Before it, the error rate had gone from 28% to 26%, which looked like a pretty mild slope, and it then reduced the error rate significantly. Ever since, the error rate has decayed substantially, to levels of almost human performance at recognizing these thousand classes.

I described AlexNet before, but it's a very important architecture; it really put deep learning in the mainstream. Personally, I already knew deep learning was very good for speech recognition, which came a bit before computer vision: that was 2009, and this was 2012. The other thing that happened is that, just by adding layers, every year this competition saw the error rate reduced at a pace that seemed almost unstoppable. It went from 16 to 11 to about 6 or 7 to 3.6, and now it's even below 3. So the revolution was very clear for images.

This is very important, but training a deep net is not easy, so I'm going to describe a few things that made this possible. If you had just done this naively when AlexNet was proposed, it wouldn't have worked right away. The problem with depth is twofold. First, computationally things get expensive. You can parallelize convolutions, because you use the same weights as you move them around the image, but you cannot parallelize depth: each layer requires the previous layer to be computed, so the sequence of computation cannot easily be parallelized, and things get slower as you increase the depth. The second issue, which is perhaps more fundamental, because people were willing to wait a bit longer to train these models, is optimization. Optimizing these models is not easy, and in fact there are lots of issues
with vanishing gradients and so on, which are also present in recurrent neural nets. To deal with depth, the computational issue, and the explosion in the number of parameters, people nowadays use almost exclusively 3x3 convolutions, which, if you stack them in depth, give you a larger receptive field of 5x5, 7x7, and so on. This gives you far fewer parameters for a large receptive field than if you had a single 7x7 convolution. This insight is very important, and we see it in a lot of architectures nowadays.

Optimization-wise, there was batch normalization, which was also critical in the times of Inception, and people started investigating all sorts of tricks to do weight initialization properly. These were some of the breakthroughs we had, and later on the idea of residual connections came about, which enabled us to train much deeper models. This slide puts together Inception, which has some very clever ideas on how to set up the architecture, with batch normalization, and you can see in blue that you get much faster training. This was the time when we worked on modeling and optimization jointly.

Then, perhaps even more impressively because it's even simpler, there is the idea of adding residual or skip connections to your model. The idea is so simple that it fits on this slide. You have these deep layers of convolutions, and all you have to do is skip: instead of having the output of this small neural network here be f(x), which would be a bunch of, say, 3x3 convolutions, you have it be f(x) plus the original x. That simple idea works very well, not only with convolutions but also in residual LSTMs and all sorts of related ideas. Highway networks are also a very interesting approach with the same idea of skipping ahead in the computation, and
these really enabled something that wouldn't work if you just tried it naively. Here is what happens without residual connections: as you go from 20 to 32 to 44 to 56 layers, the training loss gets worse. In principle this should not happen, because a network with more depth has more expressivity and should be able to represent a network that is shallower, but it was not possible to train without the residual connections. When you add residual connections, voilà, adding depth actually improves training performance, and luckily it also generalizes to the test set. I think this was a great result, very simple and very influential, and it again reduced the error rate by half in 2015. With that, ImageNet is a benchmark where results are getting so good that I believe the competition might not be run anymore, or people in the field are going to consider other challenges beyond classification.

There are two related ideas that I will quickly explain for completeness. One is DenseNet, a very good paper, which proposes to add skip connections between essentially everything and everything. It generalizes ResNet, and it works very well for classification. The other paper, which is extremely cool as well, is the U-Net architecture. It proposes skip connections in a neural net that reduces the resolution as it goes through a bottleneck and then increases the resolution again, to do things like image segmentation; it adds a shortcut from the encoder to the decoder at each matching resolution. That's also very cool.

I'll leave this summary here, and when we put up the slides there are some additional resources. But I wanted to move on to a different kind of dataset, which is data like sequences, and here the relevant models are recurrent networks and
attention. Here again, the inputs and outputs can be images, text, or audio waveforms; we've seen a wide variety of applications of sequences. The architectures are recurrent over time and/or space, and they have attention. Recently we've seen a switch to attention-only architectures, which are pretty cool, and I think Scott will go into a bit more detail later. The losses here are again pretty straightforward: cross-entropy loss.

I think there were two key ingredients in what happened, not in images but in language, and these are almost discoveries that together have brought deep learning into the toolbox for natural language processing. The first is to embed. In text we have inputs that are discrete: they are words, and there can be many words. We don't have a full vocabulary, so we can also go to characters and describe words as sequences of characters. But the key insight here, which is fairly straightforward in retrospect, is that a word can be represented as a one-hot encoding. Say we have a vocabulary of size 10,000: we can represent a word as all zeros except for a single one at the position of the integer we chose to encode the word. That is a very simple idea, but it's how we go from this discrete representation to the vector space that we like our neural nets to operate on, namely dense vectors with maybe 256 or 512 dimensions. So that first key insight allows us to take text and convert it into a vector, which is very important.

The second one is recurrent language models, which right away outperformed other approaches to language modeling. This was another piece of empirical evidence that made people start to believe in recurrent neural networks working very well for language. For me, the first time I saw this result, which was amazing,
was from Tomas Mikolov at Interspeech 2010. To go into a bit more detail, the key insight here is to vectorize context. We have a context of previous words. For instance, here the task is to predict that the next word in this sequence is "mat", and the previous words are "the cat sat on the". We one-hot encode these words and embed them via a matrix multiplication, which you can do very efficiently because it is a sparse multiplication; you don't do it densely unless the one-hot vector is very small. Perhaps the simplest thing to think about is encoding a fixed number of words. Maybe you have a window of five previous words. You embed the words, and with, say, a hundred-dimensional embedding per word you get a vector of size five times 100. Then you apply a matrix multiplication and pass the result to a softmax that tries to predict the next word. In this case you have a very nice loss that you can set up, you go ahead and train a model, and you get fairly reasonable language model performance.

But this is a bit annoying, because you need this fixed window length, and it's also not very natural: it doesn't have the right amount of invariance. The invariance we would like is that as I see a word, I update my state of what I believe the next word is going to be. I don't want this fixed-length assumption, where anything beyond five words ago cannot be input to the model; that is basically a very heavy Markovian assumption. Recurrent neural networks solve this issue in a very natural way: you embed the words one at a time, and you keep a hidden state that you update with a function as simple as this one. You take the word embedding and multiply it by a matrix, and you take the previous state of your network,
the hidden state, the memory so to speak, and multiply it by another matrix. You sum the two, apply a nonlinearity, and this defines the next hidden state. From each hidden state you can also predict the next word. You can repeat this operation, and as you read words you keep adding them to your working memory, so to speak. That gives you far more invariance and removes the independence assumptions that you needed otherwise. Recurrent neural nets really are state-of-the-art language models in terms of the log probabilities they achieve on test sets.

A slight extension of this: we can generate language, say unconditionally, but now we also have a way to have this working memory read in a sequence and then output a sequence. A bunch of papers proposed this in several forms and on several datasets, and nowadays people call this approach sequence to sequence. There was a tutorial at ICML, so I'm not going to dwell on this, but the main idea is so simple that it's almost impossible not to show it. Instead of generating the next word given only the previous words, we use as the previous words some sequence that we did not generate ourselves. For instance, we can take French, read it all into the state, and then start generating the translation in English. That was a very simple insight that, thanks to the power of recurrent neural networks, people felt bold enough to try, and it changed the paradigm of what would otherwise be statistical machine translation.

As I alluded to before, a nice thing is that with today's frameworks there is very simple code you can write for all these complicated architectures. For instance, these few lines of Python code (no need to read them, obviously) represent the LSTM architecture, which was critical for sequence to sequence and
was introduced quite a long time ago. The first time I actually used it, in 2007 or so, implementing all the gradients and so on was very cumbersome. Nowadays it's as simple as either using a library or writing it from scratch like here, and it's very simple. So these frameworks have definitely helped quite a bit.

As a result, maybe not as impressively as on ImageNet (though convolutions had been around for much longer than ImageNet), in neural machine translation we see a similar trend. These are BLEU scores, and higher is better. Starting from a strong baseline, the state of the art based on statistical machine translation, we had the first sequence-to-sequence papers, where a single model didn't outperform the state of the art, although you could ensemble the models and then they did. More recently we've seen more and more papers that, almost by means of scaling up and adding attention, which I'll describe in a second, actually defeat traditional methods. Of course combining both is even better. This switched the paradigm in machine translation to a neural approach, which was nice to see.

But there is a very strong limitation in sequence to sequence. When we designed it we didn't think too much about this, because we trained an extremely large model, with 8,000 hidden units, but there is a very strong bottleneck. Here I'm showing, from the paper that we wrote a while back, sequence-to-sequence performance as a function of the length of the input sequence. The more words you try to put into the memory, the more you might expect the memory to eventually be overwhelmed, because it has a fixed dimensionality. Our dimensionality was 8,000, which is quite large for neural nets. However, another paper from
Montreal saw that as you increase the length, performance actually decreases because the memory gets overwhelmed. I believe their model had a thousand dimensions, or maybe 512; I forget exactly how many. So this was a problem, and you can visualize it by looking again at this graph. Here you're encoding "A B C D" into this box, which represents the vector of your hidden state, and everything has to be decoded from there into the target sentence. This is a very strict bottleneck that is perhaps unnatural, and perhaps there's a way to have, again, a model with the right inductive biases for translation, or for whatever we need to do sequence to sequence.

That's precisely what attention does. Attention relieves this bottleneck and says: look, for translation, for some languages, it makes sense to think about translating parts of a word, or words, or groups of words onto other words. There is an idea of alignment, which statistical machine translation actually uses quite heavily. The plain sequence-to-sequence network does not care at all about this alignment between inputs and outputs; it has no structure that induces the alignment explicitly. Of course, internally it might know that if it reads the word "cat" it should translate it to the word "gato" in Spanish, and so on, but there's no explicit alignment. The attention mechanism, by contrast, lets you latch onto the fact that there is a natural alignment between inputs and outputs, which is shown very nicely in this figure from the Bahdanau et al. paper from 2015, one of the best papers of the year, no question, because the attention mechanism is now taking over in many other fields. With alignment, the model has the ability to align, or to attend, to the input, but we don't actually have supervised data telling it what should be aligned to what. It learns to align just
by looking at a massive amount of data, which is extremely cool. Here you can see an alignment that goes reversed, because French and English don't align monotonically all the time, and it was very cool to see that this emerged from the data just by adding the right kind of inductive bias. I'm going to explain the mechanism a little bit; it's quite simple. You take a sequence-to-sequence model and you add a mechanism between the encoder and the decoder that allows the decoder to query the encoder at every time step for information that it might need to decode which word goes next. Here the decoder has seen this word, which is the start of the sentence, and from its own state and from all the states of the encoder it's going to produce a query and a vector that is maximally useful for predicting the next word, and this goes on and on. So this is a function of all the inputs, and it is computed for every time step here. As for how it's used: this embedding that almost reads from the input gets fed back to the decoder to predict X, and it also gets fed into the state, because maybe something you have read does not need to be read again, and so on. This mechanism is very simple and it's all differentiable, at least in the way it was done in Bahdanau et al. It produces these beautiful attention masks, and it also improves translation quality quite a bit, especially if you have a very small bottleneck, that is, if your hidden states are not too big. That's all I wanted to say about this attention mechanism. Just to be a bit more specific about how to actually do this: say the inputs are "I am a cat". You have a word embedding or an LSTM embedding for each of the positions in the sequence, e1, e2, e3, e4, one for each of the four words, and the decoder at some point has a state h_i. The way you do this is you do a dot product, or some
function that takes h_i and all the inputs, and computes a strength, or an alignment, or an attention, over these four inputs. In this case a simple dot product, assuming the dimensionalities match and whatnot, is what some people do. You take h_i transposed times e1, which is a single number; you do this for each input, so you have four single numbers that represent the strength of the connection, that is, which input this h_i is querying. Then you simply normalize these with the softmax, which is again differentiable, and then you read out from the input something that's relevant. So they produce this strength vector, like 0.0, 0.05, 0.9, 0.05, and perhaps for this particular h_i this is asking you to attend to the word "a", because maybe that's the word you need to produce next. So it's a very natural mechanism and it works really, really well. Again, this is taking over sequence-to-sequence models, and it's been out there for a while. In terms of dealing with sequences, if they're very long, as I said, you can have a bigger state, but attention really helps; it has the right inductive biases. You can also use some tricks like reversing the input, which works in some applications like translation, and there are other tricks here that I'll leave on the slides for you to read later. They're tricks of the trade, almost, that if you deal with sequence models, LSTMs and so on, you definitely don't want to miss, and every paper should ideally report these tricks in some form or another. As a consequence of these ideas of sequence prediction, sequence to sequence and so on, a plethora of models came out that you can basically drop into your models. You can have, for instance, very cool things like read-write memories, which were presented in the Neural Turing Machine; you
have key-value memories, which was done at Facebook and is very beautiful, where you can have a memory but also attach a value to it, and so on. There are lots of extensions to these, and I recommend you read about them if you're interested in sequences. I'll leave the additional resources; there was a full tutorial this year at ICML, so I didn't want to extend this section too much. We're going to move now to trends. If you have an urgent question you can ask now; otherwise we can leave it for the questions at the end, if you can remember the section. But if there's a question now about convolutions, recurrence, basically the basic model components, this would be a good time. All right, otherwise you can also come in the break.

So this is something that is very hard to do. The next section will be trends; of course there are trends that we like to work on as researchers, and also trends that the community is taking on. One thing we did is make this kind of word cloud of all the abstracts of the papers submitted to ICLR about a month and a half ago. Obviously you identify the typical "deep", "neural", "networks", "learning" and so on. In black here I show things that I just discussed, like sequences, recurrent, attention, convolutional; these are pretty established methods right now. But there are some that we're going to discuss in the next section, like graphs, which are maybe trending a bit now, and also adversarial: obviously there's a lot of attention and interesting work being done on generative adversarial models and generative models, which Scott will talk about in a bit. There are also other words that we're just not going to cover, because there have been recent tutorials that did an excellent job. For instance, deep reinforcement learning is quite a hot topic, and you can see "agent", "environment" or "reinforcement" as words
that appear a lot in abstracts of papers, and I suggest you go to last year's tutorial at NIPS, which was excellent, by John and Pieter. With that, Scott is going to come up and talk a bit about more of the trendy topics, like autoregressive models, and then we'll take a break in a few minutes and continue afterwards with more trends. But for now I'll stay [Applause]

Okay, now let's talk about autoregressive models. First I'm going to talk about where they fit in the landscape of generative models. There are many different kinds besides autoregressive ones, so I'll just quickly go over them. There are latent variable models like the VAE and variations like the deep recurrent attentive writer. There are implicit models like GANs, which can generate samples but don't give you likelihoods. There are models that learn invertible transformations from simple distributions to, let's say, images. And then there are many different kinds of autoregressive models. There's been a really good tutorial at UAI on deep generative models that covers the first three in a lot of detail, and there's also a NIPS tutorial from last year on GANs that you should check out if you want to dive more into those. I want to talk about GANs more in the domain alignment section, but for this part I'll just talk about autoregressive models. So what's the idea of autoregressive models? We want to make use of the chain rule of probability. If we have some joint distribution that we want to learn, we can factorize it by ordering and possibly grouping the variables. As long as we're consistent in this ordering and don't violate causality, we can learn the joint distribution. Each factor can be parametrized by some theta, which could be, for example, a deep neural network; there can be one such theta per factor, or you can share them over the factors. So the main modeling choices are how you order and group the variables and how you parameterize each factor. The building blocks that we'll be using in this
section: for the inputs and outputs, I'll talk about models for images, for text, and for raw audio waveforms. These things can be inputs to the model and outputs of the model, and you can also view them as conditioning variables; for example, some problems involve more than one of these, like text-to-image synthesis or text-to-speech synthesis. In terms of architectures, they really span everything we've talked about. We have recurrent networks over both space and time; previously we've mostly seen recurrence over time, or over sequences of words, but we can also think about it in terms of sequences over pixels in space. We'll use causal convolutions, convolutions with attention, and even architectures that only have attention, without any convolutions or recurrence in them. For the losses, you can use the cross-entropy loss, or in the continuous case you can use mixtures of Gaussians or mixtures of logistics. First, audio. Here is just an infographic of what a waveform that we want to model looks like. One way to model these is to use causal convolutions. The input to the network is just the raw waveform. For example, in WaveNet you go through several layers of convolutions, and the key thing to notice is that each output only depends on inputs from prior time steps; for example, this node is not going to have any information about the future in its receptive field. If you want to get more context for every prediction, an important tool is dilated convolutions. Even though the convolution kernel has the same number of weights, you can increase the extent of time that is able to affect the prediction at any given time step. The larger your dilation rate, the more quickly you can expand the context that is used to make a prediction at a particular moment in time. This turned out to be very important for modeling audio. You can also stack these things multiple times: you have several layers of dilated convolutions, with a schedule of dilation
rates that repeat, so you can have one, two, four, eight, and then several stacks of this repeating. So how do you train these things? The simplest thing you can do in terms of a loss is the cross-entropy loss. Given the preceding observations at these time steps, you compute logits over y, and these are used to compute the probability of the intensity value being each of the possible quantized values. You apply a softmax to normalize these, and then the objective is the negative log-likelihood. In TensorFlow there's a useful function for this, softmax_cross_entropy_with_logits. This is the simplest thing you can do, but there are other ways. One disadvantage of the cross-entropy loss is that if you have many possible values, the memory consumption is very large. A different approach was taken in the PixelCNN++ paper, where they proposed a discretized mixture of logistics loss. What's the motivation for this? On the left is a plot from that paper of the marginal distribution of sub-pixel intensity values on CIFAR-10. These are images that were discretized with 8 bits, so there are 256 bins here on the x-axis, and you can look at the frequency of each pixel value appearing in the data set. You see peaks around zero and peaks around 255. So they use a mixture of logistics loss. Here are the logistic PDF and the CDF; you can see the CDF is actually a sigmoid function, the same logistic sigmoid that you can use as a nonlinearity in neural networks. So what's the actual loss? We model the value as coming from this mixture of logistics, which you parameterize by a mixture component pi and location and scale parameters mu and s. You model the data as being generated like this, and when you want to compute the likelihood, the probability of a particular intensity value x given these mixture-of-logistics parameters, all you need to do is compute the CDF at the right side of this bin minus the CDF at the left side of this
bin; that tells you the probability that you assigned to the intensity value being x. As for sampling from these models: if you're using the standard WaveNet you have to go in sequential order. At the beginning of sampling you don't have any waveform, so there's nothing to depend on, and naively you have to do one network evaluation to generate one sample. You can see, moving from left to right in the sampling procedure, that every network evaluation gives you one more time point. How do we speed this up? There was a paper that just came out called Parallel WaveNet, where you distill a student network from a teacher network so that sampling goes from O(n) to O(1). The idea is that you pre-train a WaveNet teacher in the usual way and then train a student network, kind of like the generator in a GAN, that takes noise as input and generates all of the waveform samples in parallel. The objective for the student is to get a high likelihood under the teacher's distribution and also to maximize its own entropy. For more details, check out the Parallel WaveNet paper. Here's an animation of what sampling looks like: instead of going sequentially from left to right through the waveform, you can feed in the noise all at once and produce the waveform in parallel. This is actually what's being used now, and here are some mean opinion scores showing that it's quite good compared to non-neural-net-based systems, and it's fast enough to actually be used in production. Now I'll talk about modeling text, which, similar to audio, is also a sequential problem. Instead of modeling samples in a waveform, we can model words, or just the characters in the text, or words and characters together, or even bytes and bits. Words are useful because they give shorter sequences and the units are semantically meaningful. At the character level we have longer sequences, and the units are less
meaningful, but the vocabulary size is much smaller, so there are trade-offs between them. I'll show how the receptive fields grow when predicting a word for deep RNNs compared to something like an autoregressive model like ByteNet or WaveNet. One of the advantages of using convolutional autoregressive models for text is that the architecture is parallelizable along the time dimension: you don't need to unroll your RNN during training, so it can be a bit faster, and you have easy access to many states from the past by using dilated convolutions, just like we do for audio. This can be plugged into applications like neural machine translation. You can also use causal convolutions in neural machine translation with attention. There's a paper called Convolutional Sequence to Sequence Learning that came out recently where, in addition to having causal convolutions over the output sentence in the target language, at every time step you can attend to words in the source sentence. Unlike in the recurrent attention models, you can actually batch the attention, because during training you have access to all of the target-language words. You don't need to have done the previous attention lookups to produce the next attention lookup; everything can be batched at training time. You can also have models that are autoregressive over time that are neither convolutional nor recurrent. This was in the paper Attention Is All You Need. This Transformer model gives the inputs a positional encoding, and the only care you need to take is to mask the dot-product attention over the inputs to preserve causal structure. Here's what the self-attention procedure looks like compared to what you would do in convolution. In convolution you have this kernel that you're sliding over the input to produce the features of the next layer, and it's always the same kernel with the same weights
that are being trained. With something like self-attention, to produce the hidden unit activation in the next layer you have access to the whole spatial extent of the previous layer, and the weights are actually adaptive: you can see that the shading of these arrows can change depending on the particular inputs and outputs. So it's potentially a much more flexible architecture. People have also done things similar to the WaveNet distillation for text. There was a paper on non-autoregressive Transformer networks for machine translation, and the idea is to use an old idea from machine translation called fertilities. On the left you see the input sentence, and on the right you see it trying to predict the output sentence in German. What they do is, for every word in the input, predict what's called a fertility value, which says how many times this word will be repeated for the translation network. If a word has a fertility of two, it's repeated twice; if the next one has a fertility of zero, it's omitted. Then you basically learn a mapping from this fertility-augmented input to the target sentence, and you can do this in parallel; you don't need to model the causal dependencies over time. But the big question is where these fertilities come from. One thing you could do is pre-train an autoregressive teacher network, like they do in Parallel WaveNet, or you could train a model with attention and then use the attention values to somehow learn to predict these fertilities. It's interesting that an idea similar to the one we use for parallel sampling in audio can also carry over to text. Now I'll talk about modeling images. It may seem like a strange thing to do with autoregressive models, because we don't have the very obvious temporal structure that we had for audio and text, so the question is how we come up with an ordering among pixels. One way to come up with an ordering is to do raster order, or interlaced, like in this
figure, where you just decide on some way of ordering pixels, like going from left to right and top to bottom, for example. Or you could do it group by group: you have some way of grouping pixels and then ordering the groups. You get different models from each approach. Pixel by pixel is the simplest way. Here's the familiar chain rule, and the assumption is that for every factor this theta is a shared neural network. PixelCNN and spatial LSTMs use this kind of approach, and the key component that allows you to do this is causal convolutions. On the left you can see a spatial mask for a convolution kernel. The ones mean that information from the positions that have a one will be passed forward to the next layer, and everything with a zero is hidden from the network. So if you want to predict, let's say, the pixel at the center, the network is not allowed to see anything from the future. In addition to spatial masking, the convolution kernels need to be masked over the channel dimension. If you're predicting color images you might want to make the green channel depend on red, and blue depend on red and green, but not any other ordering. What would the receptive field look like? If this is a big image and this is one convolutional layer, and we want to predict this black pixel, then after one convolution that is masked in that way you can see information from these four green pixels. After several convolutional layers, the receptive field grows to contain everything above it in the image and everything to the left. A little bit of care needs to be taken not to have any blind spots, and the PixelCNN paper goes into detail about how to do that properly. Okay, what about modeling images group by group? The equation for the joint distribution looks almost the same, except that now, instead of a single pixel being predicted, you have a group of pixels depending on all previous groups. The group structure then encodes conditional
independence assumptions. If you factorize it in this way, you assume that all of the pixels in the current group x_g are independent given all the previous groups. That's going to limit the expressive power of your model, but it allows you to predict them in parallel, so there's an inherent trade-off in terms of your model. If G is very small compared to n, then sampling becomes way cheaper than pixel by pixel. How could we do this in 2D? One reasonable way is to interleave four groups. You divide the image into two-by-two blocks and take the upper-left corners; those are the group 1 pixels. Then you predict everything to the right, which is group 2, then everything to the lower left, and then the lower right, until you've filled in the whole image. Each of these transitions can be parameterized by a neural network: group 1 predicting group 2; 1 and 2 predicting 3; 1, 2 and 3 predicting 4. Each of those can be a neural network, so we went from O(n) factors to O(1). But where do these group 1 pixels come from? If you predict all of those in parallel, it's going to be very difficult to actually produce compelling images. If you have enough context to model them as independent, then generating them in parallel is fine; maybe if you have something like a really detailed segmentation, or previous frames in a video, it's feasible to predict these in parallel. Otherwise you can apply the same procedure recursively: now we have to generate an image of half the resolution, so we can do the exact same thing and factorize it into these four groups. If you do this procedure recursively you'll end up with O(log n) factors, which is much better than O(n) in terms of sampling. We can do the same thing in three dimensions, so instead of four groups you could have eight. There was just a paper on this that used autoregressive models for scan completion.
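To make the counting for the group-by-group scheme concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions, not code from any of the papers mentioned: it assumes a square image whose side is a power of two, it assumes the coarsest 1x1 image is sampled in a single step, and the names group_schedule and num_parallel_steps are hypothetical. It labels every pixel with the sampling step at which it would be generated under the recursive two-by-two interleaving, and then counts the steps.

```python
def group_schedule(side):
    """Return a side x side grid of sampling-step indices.

    Assumes `side` is a power of two. Pixels that share an index are
    treated as conditionally independent and sampled in parallel.
    """
    if side == 1:
        return [[0]]  # assumption: the single coarsest pixel takes one step
    coarse = group_schedule(side // 2)  # group 1, generated recursively
    steps_used = max(max(row) for row in coarse) + 1
    grid = [[0] * side for _ in range(side)]
    for r in range(side):
        for c in range(side):
            if r % 2 == 0 and c % 2 == 0:
                grid[r][c] = coarse[r // 2][c // 2]  # upper-left corners
            elif r % 2 == 0:
                grid[r][c] = steps_used              # group 2: to the right
            elif c % 2 == 0:
                grid[r][c] = steps_used + 1          # group 3: lower left
            else:
                grid[r][c] = steps_used + 2          # group 4: lower right
    return grid


def num_parallel_steps(side):
    """Number of sequential sampling steps for a side x side image."""
    return max(max(row) for row in group_schedule(side)) + 1


if __name__ == "__main__":
    for side in (2, 4, 8, 32):
        n = side * side
        print(side, "x", side, ":", n, "pixel-by-pixel steps vs",
              num_parallel_steps(side), "group steps")
```

Under these assumptions each halving of the resolution adds three steps, so the count grows like 1 + 3 log2(side), which is logarithmic in the number of pixels n, whereas raster order needs n steps.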
So if you have some room, you can generate data by virtually scanning it and using a 3D reconstruction algorithm. Unfortunately these things are filled with holes due to sensor occlusions, so the task is how to condition on this and get back a clean volume of the room that you're scanning. You can use these 3D autoregressive models for this purpose, and they end up doing a pretty good job of filling in sensor occlusions, and it's a scalable model. To summarize autoregressive models: we talked about fully sequential models that factorize per pixel, or per sample in a waveform, so things like PixelCNN++, WaveNet, and previous autoregressive models. They typically have very fast scoring, in that one network evaluation will tell you the likelihood of any image under your model, but sampling is O(n) and sequential because of the assumptions made by the model. If you make some conditional independence assumptions, you can reduce the cost of sampling to O(1) or O(log n), depending on how strong your assumptions are. And then the new class of distilled models can actually have the best of both worlds; this is like Parallel WaveNet or these parallel machine translation papers. Their scoring may be more expensive, but if you don't care about scoring and just want a sample, they can do very fast sampling. So there's a kind of duality between the simple autoregressive models and these distilled models. Okay, so now I think we'll take a break for 10 minutes. [Applause]

Okay. In this section I will talk about domain alignment, in particular unsupervised or weakly supervised domain alignment. I think this is one of the most promising and exciting things happening in unsupervised learning. So what are the building blocks? At least in these slides, the building blocks are sets of images with some shared structure, but with no direct alignment labeling of which pairs of images in one domain and the other correspond, and
for text you could think about text corpora in different languages, where you don't have matching sentences with the same meaning but you still want to learn to align them across the two languages. In this section the architectures I'll talk about are really nothing fancy; these models are all about hooking up simple pieces in clever ways. For images they'll be just convolutional nets; for text we'll see pretty vanilla convolutional networks, maybe with attention. I think the game is really to wire things together with a loss so that an alignment emerges between two domains. These could be losses in the latent space of a neural network, where you want the two domains to be indistinguishable to a neural network, or in pixel space, or in some raw observation space. There are things like cycle consistency that we'll talk about. What I also find interesting about these models is that there are cases where adversarial objectives are used and cases where very simple maximum likelihood is used, and in several very different models you can start to see the emergence of alignment across domains without supervision. So what do I mean by visual domain alignment? Here are examples. You could have these Street View house numbers and MNIST digits. Clearly they're about roughly the same thing, numbers, but the actual style and structure of the images is very different. To a human it's easy to see which ones match up, but to get a computer to get that without manually labeling it is very tough. For faces, you can think about photos of faces and cartoon avatars; or pictures of the same scene in day and night, or summer and winter; or photographs of buildings and sketches of buildings. To humans it's very easy to match these things up. There's been work on doing the alignment in a weakly supervised way. Suppose you have images, sketches, clipart and other modalities that are talking about
the same thing but have a different structure. If you have some kind of weak labeling, like the class of the image (is it a cat or a dog, a plant or a castle, that kind of thing), what you can do is have a modality-specific encoder followed by a set of layers that are shared across modalities. If you just train this to optimize some downstream task like classification, with some regularizers to encourage alignment across domains, what you find is that it learns neurons that activate for the same semantic concept across different domains, with no supervision. Then you can query the model: you can plug in, let's say, a photograph and retrieve very similar images in other domains like clipart, spatial text or sketches, and in some cases they even have very similar spatial structure. So there were actually neurons that were spatially sensitive, that cared about where things are happening and semantically what is happening, but were insensitive to the specifics of the domain. Another approach people have used is adversarial learning. Suppose again we have several domains. You share an encoder, shown here in green, across all the domains, and again there's some downstream task that's shared, like classification. What we add to this now is a domain predictor: a network that's trying to use these shared features to predict whether the input is from the domain of photographs or from sketches. Instead of just optimizing this thing, we compute the gradient and then flip it; this is called gradient reversal. By reversing the gradients and using them to train the shared encoder, we end up learning feature encodings that are invariant to the domain, so that even a neural network, as good as they are at function approximation, wouldn't be able to figure out whether it's a photograph or a sketch. So by construction you are aligning the domains. This was used for classification, but we can
also do it for image generation. In this setup you could think about trying to learn an encoder-decoder architecture that can produce face sketches, where you don't have any actual aligned pairs of face and sketch. You can do the following: you learn an encoder f that can see both sketches and face photographs. The encoder again is shared over domains, but then you have a decoder that just produces sketches. For a sketch that was input, you just want to make sure that you can reconstruct it: if you encode a sketch and decode a sketch, you'd better get a sketch back, and it should look the same. If you encode Brad Pitt's photo and then decode, what you hope is that you get a sketch of Brad Pitt. So you can encode that again using the same shared encoder and make sure that the latents actually look the same. You can also pass both sketches to a discriminator that tries to detect whether the domains match, that is, whether the domain of sketch or photo is distinguishable to a discriminator. Here's what some samples look like. Given a Street View number it can produce the MNIST digit that corresponds to it, or given a face it can produce the cartoon avatar of the face. They're not perfect, but in many cases they're quite good, and it's quite remarkable that you can do this with no supervision for producing these really high-resolution outputs. Now we'll talk about cycle consistency. The idea is that we have two domains X and Y, and we want to learn a function G going from X to Y and F going from Y back to X. The property you want to maintain is that if you translate from X to Y and then from Y back to X, you should recover the same thing. That's what's shown in this picture: you should make sure that these two blue points are close together in X. And the same thing for Y: if you go from Y to X and then back to Y, you should make sure that those
two red dots are close together. In addition to that, you have the usual domain classifier adversarial loss: a domain classifier for Y that tries to detect the difference between actual samples from Y and generated samples of Y, and a domain classifier for X that tries to differentiate actual samples from X and generated samples of X. This actually works remarkably well. You can do things like translate zebra pictures to horses, or summer to winter, using these with their code; they're calling them CycleGANs. A similar but slightly different idea is in this paper called unsupervised image-to-image translation using encoder-decoder networks. Here, instead of doing cycles from one pixel space to another pixel space, what they do is learn a shared latent space. You take an image from domain 1 and use the domain 1 encoder to get to some shared latent space z, and there are two paths it can take. It can use the domain 1 generator to reconstruct itself, and that should just match; or it can take the other pathway and use the domain 2 generator to try to translate it into the other domain, and then it has to go to a discriminator. The D2 discriminator will say, okay, is this from domain 2 or not, and it has to fool that. You can do the same for an image from domain 2: you can encode and then reconstruct, or encode and then translate, with a loss to fool the other discriminator. Here are some samples from this model. I really like these; I was really impressed by these dogs, actually. You get this dog as input and you want to translate it to the domain of sheepdog or husky, and what's cool is that the pose of the dog's face is roughly preserved, and even some parts of the background, like this grass, but clearly the breed of the dog is changing, and again there's no supervision to do this. The last one I'll talk about is called DiscoGAN (these things have some cool names sometimes), where again you are learning an encoder-decoder, or a
mapping from domain A to domain B. You have some kind of pixel-wise reconstruction loss, and you also have domain classifiers that try to say whether a sample is from domain A or domain B, and you try to fool these classifiers. An experiment from this paper that I really like is car to face. They have a series of these where you have two things that are actually really different; it's not two dog breeds, it's two completely different things, but the network invents some kind of alignment. It figures out that, okay, faces and cars both have a front and a back, so there's this rotation manifold that you can use to align both datasets, and it finds these alignments unsupervised. There are also some practical uses emerging from these. This model is called GraspGAN, and the idea is to use generative models to cross the reality gap. It's easy to generate synthetic images of these arms grabbing things in a bin and doing manipulation tasks, but we actually want to learn on a real robot, and we can't really use graphics engines to produce photorealistic videos of these arms. So what do we do? We use a GAN to try to synthesize images, going from simulation images to photorealistic images, so that the policy we learn can actually work in reality. So what does GraspGAN look like? You have some synthetic rendering; it's not realistic, but it has the right content. The generator, which is some kind of U-Net, has to translate it into some photographic-looking image, but it also has to preserve the relevant content, so there's a segmentation that says where the background is, where the bin is, where the arm is, and where the objects are. Then the discriminator decides whether that sample looks real or fake, as usual. The actual model is more complicated and has more pieces, but this is the generative modeling portion. Now I'll talk
about text quickly. Typically in neural machine translation you need some paired data, at least some that says the same sentence in both languages. But some language pairs don't have much parallel corpora, so ideally we want one model that can be trained with non-aligned text: some books in one language and some books in the other. We learn the model unsupervised and get a translation model out of it. The first paper is the simplest; it's just doing maximum likelihood, as far as I can tell. We have again a shared encoder over the domains, language one and language two. The first objective is just denoising: you take a sentence from one language, encode and decode it, and do maximum likelihood on the decoded sentence, so you try to reconstruct it. Then you do back-translation: you take a sentence in language one, encode it, and decode it by sampling in the other language, and given that sample you back-translate. It turns out that if you always do this back-translation with the latest model that you have, this converges to an actually pretty decent translation model. It's not quite as good as the fully supervised one, but it actually works, that's the point. You can also do things like semi-supervised training. You can also add GANs into the mix. This one is very similar in terms of overall design, but one piece that's different is that the sentence encoding can be fed into a domain classifier (is this sentence embedding from language one or language two?) and you want to fool this adversary. But overall it's a similar idea. [Applause] So now for a slightly different topic, although these generative models are getting really good. There's actually going to be a symposium and a workshop on meta learning, so I'm just going to give you a bit of the taxonomy and what we mean by meta learning. Again, inputs and outputs and architectures are not really the main element of what meta learning is about. Meta learning by and large is about the loss, and you can see it
as a loss that models another loss. I'll explain this in a second. There are three ways to do meta learning that people have come up with in maybe the last couple of years, although the term and the definitions are much older. So what is learning to learn, or meta learning? I'm pretty sure it would be hard for us to agree on a definition, so maybe as an exercise ask a colleague and see what they say. But for me, learning to learn or meta learning is going beyond the train/test paradigm of training a machine learning model, where you have a training set and a test set coming from the same distribution. Instead of having this distribution, a sample from it, and a model that hopefully generalizes, you have a whole range of different tasks. When you sample a new task, what you want your neural model to do is adapt and learn very quickly on the new task, without needing to go through many steps of stochastic gradient descent, which we know is necessary for training things like ImageNet. With a great problem you need great datasets, and luckily both Brenden and, more recently, we in a paper introduced a couple of datasets that are perfect for doing something called one-shot learning, which is an instance of learning to learn and meta learning. You can see the datasets; they're kind of a transposed MNIST in a way, where there are many classes but very few examples per class. So what is meta-learning? What is different between learning to learn and just learning a model? Here is what learning a model typically looks like: you have some training data and some test data, and you fit a model such that the likelihood of your model is maximized, and you estimate this likelihood by taking an expectation over batches of data from these datasets, right? So
you sample a batch of data, this green one, and you feed your model; you sample another batch of data, you feed your model, and so on until you converge. Meta-learning adds a twist to this, and it's almost only additive to this paradigm that we are so familiar with. In a picture, what meta-learning does is treat a whole set of training data and test data as a single instance. So you can think of a full dataset of training and test examples as now a single training example, okay? That's what Hugo and [name unclear] also call a meta-training set. It's a training set of training sets, if you will; that's why it's meta. Then you're going to test your algorithm on a meta-testing set, where there are going to be, let's say, some classes or categories or tasks that you've never seen. In the more traditional way of posing the likelihood maximization problem, you have a model that not only maps from x to y but also takes in a set S, which we call the support set. It is a training set, but it was too confusing to call it a training set, so we decided to call it the support set. The model takes this support set and tries to fit, for a batch of data, this likelihood. So this is exactly the same equation, except there's this S, and S is the training set that you should use. Here you have, let's say, a five-way classification problem with five new classes, and this is going to be your S. Given this set S, what you want is a model that will immediately classify these unseen examples very well, right? You want the model to generalize, essentially. So what you do is first sample a task T, and you sample a few labels. Let's say these five categories are sampled from the thousand categories of ImageNet. Then given these five categories you sample a support set, which is going to be your training set, or support set, in the meta algorithm, and then you also sample a bunch of images that
you must classify correctly. This is how you would train one-shot learning from a learning-to-learn perspective. That's the training procedure. But now, what kinds of models do we have? I distinguish three rough classes. The first is a model-based approach, and it's kind of cool because you can see it as a sequence-to-sequence model. My model is going to be conditioned on a support set. The support set is going to be, for instance, these two images of dogs and which classes these dogs are: let's say this is a red kind of dog and this is a blue kind of dog. So your model reads the dog image, the label, the next dog image, the label, and so on. It ingests this training data, and then at test time you input an image and ask the model to classify correctly whether it's the blue dog or the red dog. This is trained end to end, and there's a family of models that use this idea of model-based meta learning. There's another family, which has a better inductive bias for classification, meaning they usually train faster and slightly better, and these are metric-based. Here the gist of the idea is that the support set is now a training set that I keep in memory, so I know that this dog is blue, this dog is red, and maybe some other dogs. I'm now going to learn a metric, or a function, that allows me to compare a new image, let's say of the red dog, so that when I do the nearest-neighbor computation I get the right label out, red. I train this end to end, so it's kind of a differentiable nearest neighbor where there are deep features everywhere. Again, there's a bunch of papers that use this metric-based approach to meta learning. The last one, which is perhaps the closest to learning to learn and is quite cool, is the idea that I'm going to get these
pairs from the support set, like this doggy is blue, this doggy is yellow, and so on. If I was doing traditional deep learning I would maybe fine-tune a model with these new labels, although it's a bit tricky because you would need a new softmax layer and so on. But let's say you want to do that. If you just apply gradients on these five images repeatedly, you're going to terribly overfit to these images; you're not going to generalize. So what these models do instead is compute gradients, knowing that gradients probably have the right information to fit a new model, but instead of applying the gradients to the model directly you learn a controller that takes in these gradients and hopefully knows how to apply them to your model, such that after a few gradient steps on this small training set, of let's say four images, the model classifies a new image that it hasn't seen, and it generalizes. So the three have slightly different flavors, but there's a bunch of papers and a lot of progress is being made, which is quite exciting. Another, kind of different, application of this idea of one-shot learning, learning from essentially a novel demonstration, is this paper from OpenAI, which took a single trajectory or demonstration and said: look, I have a policy that needs to act in the world, and I want it to act like this demonstration that I'm giving you. So it's very similar to one-shot learning, except it's for reinforcement learning. And the results, if the internet works, are pretty good actually. Given a single demonstration of, you know, touching some objects and whatnot, the policy learns by having many such examples of a single trajectory and what you intended to do, and fitting your model. When you give it a novel demonstration at test time, it's able to generalize and work quite well.
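Editor's sketch: the episodic setup described above (sample a task, then a support set S, then examples to classify) and the metric-based "soft nearest neighbor" classifier can be illustrated in a few lines of plain Python. Everything named here is an assumption made for illustration: the function names `sample_episode` and `classify`, the toy 2-D points standing in for deep features, and the fixed squared-distance metric. In the models discussed in the talk, the features and the metric are learned end to end; this sketch only shows the data flow.

```python
import math
import random

def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Sample one task: a support set S plus query examples to classify."""
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label in classes:
        examples = rng.sample(data_by_class[label], k_shot + n_query)
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query

def classify(x, support):
    """Soft nearest neighbor: softmax over negative squared distances to S."""
    scores = [-sum((a - b) ** 2 for a, b in zip(x, s)) for s, _ in support]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in scores]
    total = sum(weights)
    probs = {}
    for w, (_, label) in zip(weights, support):
        probs[label] = probs.get(label, 0.0) + w / total
    return probs

# Toy data (hypothetical): three well-separated classes of 2-D points.
rng = random.Random(0)
centers = {"red": (0.0, 0.0), "blue": (5.0, 5.0), "yellow": (-5.0, 5.0)}
data = {
    name: [(cx + rng.gauss(0, 0.1), cy + rng.gauss(0, 0.1)) for _ in range(10)]
    for name, (cx, cy) in centers.items()
}

# One 2-way, 1-shot episode: one labeled example per class in the support set.
support, query = sample_episode(data, n_way=2, k_shot=1, n_query=3, rng=rng)
predictions = []
for x, true_label in query:
    probs = classify(x, support)
    predictions.append((max(probs, key=probs.get), true_label))
```

Training in the meta-learning sense would repeat `sample_episode` many times and adjust the feature extractor so that the probability assigned to each query's true label goes up; that learning step is deliberately left out here.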
So I think these models, as I said, are a kind of active area of research, to the point that there's both a symposium and a workshop, so I invite you to check those out because I'm sure there are going to be interesting talks. Other aspects of meta learning, like learning model architectures, are also quite interesting and quite relevant nowadays. Going beyond meta learning, and this is kind of cool because it goes back to what neural networks cannot do or deal with very well, and whether we should fix it: graphs are just generalized sequences and generalized trees, so they are a very generic data structure. There are many tasks where it's important for your neural network to be able to deal with a graph. But I think the main motivation is that there's a natural progression: fixed input representations, these tensor inputs like images or videos; sequential inputs, which are maybe a special case of tensor inputs; and then graphs, a structure that is kind of hard to represent as a tensor. However, there are many models and proposals for dealing with graphs, which I'm going to describe in a second. Also, of course, there are probabilistic graphical models, which are very interesting for medical diagnosis for example, and there are some algorithms that operate on graphs that we would like to get some inspiration from and maybe add as an element of our neural networks. If they could do reasoning on a graph, they might be able to do things that they can't yet. So again, this is going to be all about graphs and trees, and the architecture that I'll describe the most is what we call message passing, which is the kind of architecture that deals very naturally with graphs. Going back to inductive biases: we have the spatial inductive bias for image models, the sequence kind
of inductive bias of recurrence. For graphs, the kind of bias that we would like our neural networks to inherently have, and not have to learn, is the fact that if I rename the nodes in a graph, the graph is the same. If I have a function that takes some assignment of, let's say, v1, v2, v3, and I embed this graph, which is given to me as input, into a vector, or I predict something from this graph, I would like this prediction to be invariant to renaming the nodes, because the graph is inherently the same, in the same way that an object you move is still the same object, right? This is perhaps the main property we want to preserve. There's actually an oral at NIPS about sets, called Deep Sets, that I also recommend you attend. I'm going to describe a model that generalizes many models that have been proposed, which is great, because anytime you think about empowering a neural net to deal with graphs you can think about this framework and then see how you can improve it. We called it message-passing neural networks, and I'm going to explain through an example how it operates on a given input graph. Okay, so here there's this graph. Let's say this is a molecule and it has certain bonds. Each node of the graph stores a vector, and this vector is a state; imagine it's a hundred-dimensional vector, let's say. So we have h1, h2, h3, h4 and h5, vectors that are stored in the nodes of the graph. Then we potentially also have edge features, which are perhaps another vector or maybe a property, like this is a strong bond or a weak bond. So the edges can also be parameterized by a vector, or maybe a category or a property, depending on the model. Message-passing neural networks have essentially a single phase that you repeat
over and over, which is to pass messages to your neighbors. It's very simple and obviously invariant to node renaming, which is the desired property. The first element is the message. The message for a node v is the sum over all its neighbors, so the incoming messages, of some function M. This M is a neural network, so it could be many things: it could be as simple as summing the embeddings, or maybe an LSTM, or something else. It takes as input the representation of node v, so my own vector, the representation of my neighbor, and some representation of the edge, okay? This creates a message, perhaps a vector, that is incoming to my node. Now the node itself, h, needs to update its state. It has the message m and its previous state, and it just updates the state. This could be overwriting it, or it could be a GRU, and so on. Finally, after repeating this process, you read out an answer from the graph. So here is how it looks. Let's say h1 gets two messages, which are two vectors passed to it through a neural network, and then it updates its state: it was yellow, and now given this message it decides to go orange. This happens throughout the graph. All neighbors communicate with each other through messages, then they update their state, and you repeat this a certain number of times; this is a hyperparameter. Once you've repeated this, let's say, five times, you take the node states, the representation of the graph, and with these you compute the answer, let's say a chemical property, and this property here is 1.6, whatever. This framework actually turns out to generalize a bunch of other frameworks. In our case we applied it to chemical discovery, but there are many, many
papers, which I'm not going to go into in detail, that deal with graph data and propose certain algorithms. For instance, convolutional networks on graphs turn out to have a specific form of message rule, of update rule and of readout rule, right? And so do many models that we see in the literature. Interaction networks are another one that has this specific form, very generic, but it doesn't iterate message passing per se: you just do one round of message passing and that's it. Gated graph neural networks, a paper from Yujia, also use a GRU to update the state of the nodes, and so on. So many of these models can be expressed within this framework, which is quite useful because it allows us to not be too confused about models that deal with graphs. I think this framework is general enough that you can place your model in it. I will finish by saying that with graphs especially, batching is extremely tricky, because when you sample ten graphs to form a batch of size ten, these graphs are going to have different sizes and different connectivity patterns. So batching becomes quite tricky, and here is where choosing the framework and the right abstraction is what's going to make it or break it for you. Implementing these models and batching them to be more efficient is important, and there are certain frameworks that provide things like while loops, as in TensorFlow. PyTorch is also very nice for graphs because the computation graph is dynamic, so you can essentially load a graph and generate the neural network, and so on. These are important technical aspects of graphs, although if you don't care about speed you can just use batch size one and be good with that, and then do the message passing stuff. So I'll leave the summary here, and further reading, and then Scott is going to conclude with some
very interesting topics on program induction, and then we're going to have some questions at the end. I think there are going to be 15 minutes or so, or 10. Thank you. [Applause] Okay, program induction with neural networks. The research landscape that I'll cover looks basically like this. The simplest approach is to think of the neural network as a program, so you somehow embed the program into the weights of your network. The cartoon version is that you have the inputs two and three, and the network learns that two plus three is five: you give it the inputs and the network directly gives you the outputs. There have been a lot of models that do this over the past couple of years. The second major type is a neural network that generates source code. You show the network an example of an input and an output, or a set of inputs and outputs, and the network actually produces the tokens of the program. The nice thing about this is that if that program is correct, it's always going to work; it will generalize perfectly to any input, whereas if your neural network is the program, there's a question of how well it's going to generalize to new inputs that it didn't see during training. But there's lots of work in both directions. Then there's the whole field of probabilistic programming, which I won't go into, but people are starting to integrate it with neural networks and deep learning, and there's a variety of very cool-looking frameworks to do that. So what are the building blocks? We'll again have discrete symbols, like we had for natural language, but we can also think about the program itself, the text; we can look at execution traces of the program, so what the program actually did when it was running; and we can also look at inputs and outputs of programs. These can be mixed with perceptual data, so it's not just bits, not necessarily just symbols. As for architectures, they're mostly recurrent, but sometimes you see convolutions, especially
if there are pixels involved, like a visual front-end for example. As for the loss, oftentimes if you're predicting discrete outputs and everything is differentiable, you can just use cross-entropy as if you were doing language modeling, but if it's not differentiable then you have to do RL. One of the first papers that got people really excited about baking programs into neural networks was called Learning to Execute. It's a very bold paper. The idea is that you give it a simple Python program, which can even have loops and simple arithmetic, and in some cases it can correctly predict the output of this program. People were really impressed with this. Of course it can't completely learn Python, but the fact that it worked at all got people very inspired. Some people also thought about how to learn more parallel programs, so instead of using just recurrent networks you can think about how to use convnets. There's an architecture called the Neural GPU that is based on repeated iterations of convolutional gated recurrent units, and it can learn algorithms like binary addition and multiplication that generalize to bigger problems than it saw during training. If you look at the animations of this thing running, it looks like a cellular automaton. You can find them on YouTube; it looks very trippy. Another important paper was the Neural Turing Machine, or now they're being called differentiable neural computers. Here we start to see the addition of more structure to the neural network used for program induction, and the most important kind of structure here is a memory module. At every time step of processing for some problem, you get to write to a memory and also read from memory, and the network learns a policy of how to read and write in order to solve these problems. Other people took the approach of actually learning the interpreter of a program. In this paradigm the user will fill in some parts of the
program. This would be bubble sort in the Forth programming language: you supply most of the lines, but maybe there is some blank that you want the network to fill in. What they do is make this programming language differentiable, so that where there are these blanks the neural network learns to fill in the missing behavior to reproduce the desired behavior. It has interesting features, like a soft version of a program counter that models its uncertainty about where it actually is in the program execution. This was at ICML. Another way to add structure to these models that embed the program in the network is to look into hierarchies. What this Neural Programmer-Interpreter model does is basically act as a router from programs to subprograms, conditioned on the environment. We showed that it can learn simple algorithms, like addition or sorting, that generalize to problems beyond the size it saw during training, and on the right you can see the hierarchical execution trace as this thing does addition. One drawback is that you of course need supervision about which subprogram you're going to call at each time step, so you need detailed execution traces. There's been steady progress since that paper was written on reducing the supervision, exploiting recursion and also exploiting flat traces rather than detailed hierarchies, to train these things in a much more efficient way. There's also been progress in deploying these ideas on actual robots. A slight variation of this model, called Neural Task Programming, allows teaching pick-and-place manipulation algorithms to robots, and again they show very nice generalization properties. On an actual robot they show that this hierarchical program induction model can do a much better job than flat models at sorting objects, and it can sort many more of them, so
you see much better generalization. So these things are actually starting to bear fruit on practical problems. Now I'll move on to approaches that actually generate code in some form or another. There are several very cool papers that have some kind of domain-specific language that they use to attack a particular problem, and they generate programs in this DSL. In the DeepCoder paper they're interested in programs that manipulate arrays of numbers, and these are the kinds of things you might see in simple programming contests, I'd say. The programs being considered are straight-line programs, just a sequence of API calls, basically, that manipulate these arrays. You want to find the right sequence of API calls that maps this input to this output, or a set of inputs to a set of outputs. The approach they take is to train a neural network to predict attributes of the program: does this program have a call to filter, or sort, or reverse, or does it contain this particular little substring? It's a long attribute vector characterizing things about the program. It's not a full specification, but the point is that when you want to use classical search-based methods to find the right program for this input-output mapping, you can use the neural network's attribute predictions to search much faster, to prune the search space, basically. They show that you can get big speedups in the time it takes to find a program that correctly maps inputs to outputs. So this is called DeepCoder. Another approach generates the tokens of the source code directly. Again you have inputs and outputs; you could imagine these are some rows in Excel and you want to canonicalize these names and put them in last name, comma, first name order, let's say. Given these inputs and outputs, you want to be able to actually
generalize to some more names. The program is a sort of string manipulation program, in a domain-specific language that these authors created for this problem, and the task for the neural network is to look at these input-output pairs and then generate this program. It's called RobustFill. The model is actually fairly simple and elegant: you encode all of your I/O pairs and the previous tokens of your program, and then predict the next token of your program, where the tokens come from this domain-specific language. The really cool thing they show comes from comparing against Excel's previous FlashFill, which is not based on neural networks. If you have no mistakes in your little training set, maybe FlashFill is a bit better, but this neural net model is almost as good. If you have one mistake, FlashFill will just break, like typical old-fashioned brittle methods, because they're not robust to noise, but the neural net methods degrade very gracefully. To me this was very inspiring because it's the first proof that neural program induction can actually be useful. Beyond those approaches, people are more and more integrating neural nets into probabilistic programming. If you think about graphics programs that produce things like 3D faces, 3D shapes, or 3D human poses, we can use these graphics programs to generate lots of data. But suppose that you have an image and you want to find the program that would reconstruct it. Once you have that program you can tune the knobs and re-render with different poses or different lighting, which you might want to do for things like animation. So where do the neural networks fit in this pipeline? For these probabilistic programming languages we have a scene language and a renderer, which may not be differentiable; it could be some very complex system that produces renderings. Then you have some observed
image that you want to match. You can use features of a deep neural network to see how well these renderings match up with the actual image that you have, and to help guide your search over programs that would reconstruct your image using this graphics program. So that's it for neural programming. Now I'll just go over some brief conclusions and thoughts about the future. One thing that we saw is that deep autoregressive models and convnets are now ubiquitous in production, and they're already useful in a lot of consumer applications: anything with images, like image search, or, with autoregressive models, machine translation and text-to-speech. These are all in products. We also saw that inductive biases are really useful. In deep learning now we're all really against hand-engineering things like specific features, but it's still really valuable to think about architectures with the right biases that still have some flexibility to learn the weights, like spatial invariance in CNNs, time recurrence in RNNs, or permutation invariance in graphs. If you're careful about choosing the inductive biases, you can gain a lot in performance. We saw that simple tricks like residual networks and skip connections can give us a quantum leap in performance, and I think there will be new ones discovered that in hindsight may look obvious but that we don't have yet. I think adversarial networks and unsupervised domain adaptation will have more and more interesting applications, on phones, for fun things like style transfer. We think that because of meta learning, more and more of the life cycle of a model (training, validation and testing) will be just part of an end-to-end training process, so everything we do now will start to look like the inner loop of some more general thing, and then some more general thing, and so on. And then I think
we'll see that program synthesis, and program synthesis combined with graph networks, will be very important and hopefully find more real-world applications, like we saw with RobustFill. Okay, thank you. [Applause] Do you have questions or comments? We have some time. Otherwise stop by [inaudible]. I think there are mics on the sides if you want to ask questions. All right, so then it was very clear. All right, last chance? No? [Inaudible audience question.] We will repeat: yeah, the slides, we'll make them available soon. We'll tweet something, so look for it. Okay, all right, so have some coffee. Thanks for coming so early. Thank you. [Applause]
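Editor's sketch: the message-passing scheme described in the graph portion of the talk (each node sums messages M(h_v, h_w, e_vw) from its neighbors, updates its state, and after a fixed number of rounds a readout produces the answer) can be written down in plain Python. This is a minimal sketch under stated assumptions: `toy_message` and `toy_update` are hand-written stand-ins for the learned message network and the learned update (for example a GRU), the sum readout is just one permutation-invariant choice, and the three-node graph with its edge weights is made up for illustration.

```python
import math

def toy_message(h_v, h_w, e_vw):
    # Stand-in for the learned message network M(h_v, h_w, e_vw):
    # here simply the neighbor state scaled by the edge feature.
    return [e_vw * x for x in h_w]

def toy_update(h_v, m_v):
    # Stand-in for the learned update (e.g. a GRU):
    # blend the previous state with a squashed summed message.
    return [0.5 * a + 0.5 * math.tanh(b) for a, b in zip(h_v, m_v)]

def message_passing_step(states, edges):
    """One round: each node sums its incoming messages, then updates its state."""
    new_states = {}
    for v, h_v in states.items():
        m_v = [0.0] * len(h_v)
        for (w, target), e_vw in edges.items():
            if target == v:
                msg = toy_message(h_v, states[w], e_vw)
                m_v = [a + b for a, b in zip(m_v, msg)]
        new_states[v] = toy_update(h_v, m_v)
    return new_states

def run(states, edges, steps):
    """Repeat message passing `steps` times, then apply a sum readout."""
    for _ in range(steps):
        states = message_passing_step(states, edges)
    return sum(sum(h) for h in states.values())

# Hypothetical 3-node "molecule"; each undirected bond is stored in both directions.
states = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
edges = {(1, 2): 1.0, (2, 1): 1.0, (2, 3): 2.0, (3, 2): 2.0}
prediction = run(states, edges, steps=3)

# Renaming the nodes leaves the prediction unchanged.
rename = {1: "c", 2: "a", 3: "b"}
renamed_states = {rename[v]: h for v, h in states.items()}
renamed_edges = {(rename[w], rename[v]): e for (w, v), e in edges.items()}
renamed_prediction = run(renamed_states, renamed_edges, steps=3)
```

Because messages are summed over neighbors and the readout sums over nodes, nothing in the computation depends on how nodes are named, which is the invariance property the talk asks for. The number of rounds (`steps`) is the hyperparameter mentioned in the talk.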
Info
Channel: Steven Van Vaerenbergh
Views: 27,861
Keywords: machine learning, presentation, nips, nips2017
Id: YJnddoa8sHk
Length: 104min 25sec (6265 seconds)
Published: Fri Dec 08 2017