Lecture 7: Introduction to TensorFlow

Video Statistics and Information

Captions
[MUSIC] Stanford University. >> What's going to happen today is that Nishith and Barak are going to be giving an introduction to TensorFlow. So TensorFlow is Google's deep learning framework, which I hope everyone will be excited to learn. And at any rate, you have to learn it, because we're gonna be using it in assignments two and three. So this should really also help out for the second assignment. Before we get started with that, I just want to make a couple of quick announcements. The first one is on final projects. This is really the time to be thinking about final projects. If you've got ideas for final projects and want to do the final project, you should be working out how to talk to one of me, Kevin, Danqi, Richard, Ignacio, or Arun over the next couple of weeks. Obviously you've got to find the time, and it's hard to fit everybody in, but we are making a real effort to have project advice office hours. There are also some ideas for projects that have been put up on the projects page, so I encourage people to look at that. Now, people have also asked us about assignment four, so we've put up a description of assignment four as well; look at that if you're considering whether to do it. Assignment four is going to be question answering over the SQuAD dataset, and you can read more details about that there. Then there are two other things I wanted to mention, and we'll also put up messages on Piazza, etc., about them. The first one is that for assignment three, we want people to have the experience of doing things on a GPU. We've arranged with Microsoft Azure for people to use their GPUs for that and for the final project, and we're trying to get that all organized at the moment. There's a limit to how many GPUs we can have, so what we're gonna do for assignment three and for the final project is allow teams of up to three, and it's really in our interest, and the resource limit's interest, if as many people as possible team up. So we'd like to encourage people to team up for assignment three, and we've put up a Google form for people to enter their teams. We need people to do that in advance, because we need everything set up at least a week ahead so the Microsoft people can create accounts and people can use Azure. So please think about groups for assignment three and then fill in the Google form. And the final thing is, for next week, we're gonna make an attempt at reorganizing the office hours and getting some rooms for them, so they can hopefully run more smoothly in the countdown towards the deadline for assignment two than they did for assignment one. So keep an eye out for that, and expect that some of the office hour times and locations will vary a bit compared to what they've been for the first three weeks. And that's it from me, and over to Nishith. >> Hi everyone. Hope you had a great weekend. So today we are gonna be talking about TensorFlow, which is a deep learning framework from Google. So why study deep learning frameworks? First of all, much of the recent progress in deep learning and machine learning can be attributed to these frameworks. They've allowed researchers to iterate extremely quickly, and they've made deep learning and other ML algorithms much more accessible to practitioners. So if your phone is a lot smarter than it was three years ago, it's probably because of one of these deep learning frameworks.
So deep learning frameworks help to scale machine learning code, which is why Google and Facebook can now scale to billions of users. They can compute gradients automatically. Since you all must have finished your first assignment, you know that gradient calculation isn't trivial; the framework takes care of it automatically, and we can focus on the high-level math instead. It also standardizes ML across different places: regardless of whether I'm at Google or at Facebook, we still use some form of TensorFlow or another deep learning framework, and there's a lot of cross-pollination between the frameworks as well. A lot of pre-trained models are also available online, so people like us who have limited resources in terms of GPUs do not have to start from scratch every time. We can stand on the shoulders of giants and on the data that they have collected, and take it up from there. They also allow interfacing with GPUs, which is a fascinating feature, because GPUs speed up your code a lot thanks to parallelization. That's why studying TensorFlow is almost necessary in order to make progress in deep learning: it can facilitate your research and your projects. We'll be using TensorFlow for PA two, three, and also for the final project, which is an added incentive for studying TensorFlow today. So what is TensorFlow actually? It's a deep learning framework: an open source software library for numerical computation using data flow graphs, from Google. It was developed by their Brain team, which specializes in machine learning research. And in their words, TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. So now I'll let Barak take over and give a high-level overview of how TensorFlow works and the underlying paradigms that so many researchers have spent so much time thinking about. >> Thanks, Nish, for starting us off. I'm gonna be introducing some of the main ideas behind TensorFlow, its programming paradigm, and some of its main features. So the biggest idea of all the big ideas about TensorFlow is that numeric computation is expressed as a computational graph. If there is one lesson that you take out of this presentation today and keep in the back of your mind, it's that the backbone of any TensorFlow program is going to be a graph, where the graph nodes are operations, shorthand "ops" in your code; they have any number of inputs and a single output. The edges between our nodes are going to be tensors that flow between them, and the best way of thinking about what tensors are in practice is as n-dimensional arrays. The advantage of using flow graphs as the backbone of your deep learning framework is that it allows you to build complex models in terms of small and simple operations. This is going to make your gradient calculations extremely simple when we get to that; you're going to be very, very grateful for the automatic differentiation when you're coding large models in your final project and in the future. Another way of thinking about a TensorFlow graph is that each operation is a function that can be evaluated at that point, and hopefully we will see why that is the case later in the presentation. So let us look at an example of a neural network with one hidden layer, and what its computational graph in TensorFlow might look like.
So we have some hidden layer that we are trying to compute as the ReLU activation of some parameter matrix W times some input x, plus a bias term. If you recall from last lecture, ReLU is an activation function standing for rectified linear unit, in the same way that a sigmoid is an activation function. We are applying some nonlinear function over our linear input; that is what gives neural networks their expressive power. The ReLU takes the max of your input and zero. On the right, we see what the graph might look like in TensorFlow. We have variables for our b and W, we have a placeholder (we'll get to that soon) for the x, and nodes for each of the operations in our graph. So let's actually dissect those node types. Variables are stateful nodes which output their current value; in our case, that's just b and W. What we mean by saying that variables are stateful is that they retain their current value over multiple executions, and it's easy to restore saved values to variables. Variables have a number of other useful features. They can be saved to your disk during and after training, which is what facilitates the use Nish talked about earlier, where people from different companies and groups can save, store, and send their model parameters to other people. Also, gradient updates will by default apply over all of the variables in your graph; the variables are the things that you want to tune to minimize the loss, and we will see how to do that soon. It is really important to remember that variables in the graph, like b and W, are still operations. By definition, all of the nodes in your graph are operations. So when you evaluate the operation that is one of these variables at run time, and we will see what run time means very shortly, you will get the value of that variable. The next type of node is the placeholder. Placeholders are nodes whose value is fed in at execution time. They are for inputs into your network that depend on some sort of external data, where you don't want to build your graph around any particular real value. So these are placeholders for values that we're going to feed into our computation during training; this is going to be our input. For placeholders, we don't give any initial values; we just assign a data type and a tensor shape, so the graph still knows what to compute even though it doesn't hold any stored values yet. The third type of node is mathematical operations. This is going to be your matrix multiplication, your addition, and your ReLU. All of these are nodes in your TensorFlow graph, and it's very important that we're actually calling TensorFlow mathematical operations as opposed to NumPy operations. Okay, so let us actually see how this works in code. We're gonna do three things: we're going to create weights, including initialization; we're going to create a placeholder for our input x; and then we're going to build our flow graph. So how does this look in code? We're gonna import our TensorFlow package, and we're gonna build a Python variable b that is a TensorFlow variable, taking an initial value of zeros of size 100, a vector of 100 values. Our W is going to be a TensorFlow variable taking uniformly distributed values between -1 and 1, of shape 784 by 100. And we're going to create a placeholder for our input data that doesn't take any initial values; it just takes a data type, 32-bit floats, as well as a shape.
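A minimal sketch of that setup in TensorFlow 1.x code; the shapes follow the numbers mentioned in the lecture, with the batch dimension left flexible as an assumption:

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API, as used in this lecture

# Variables: stateful nodes holding the model parameters.
b = tf.Variable(tf.zeros((100,)))                       # bias vector of 100 values
W = tf.Variable(tf.random_uniform((784, 100), -1, 1))   # weights uniform in [-1, 1]

# Placeholder: a node whose value is supplied only at execution time.
x = tf.placeholder(tf.float32, (None, 784))             # batch of inputs, fed in later
```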
Now we're in a position to actually build our flow graph. We're going to express h as the TensorFlow ReLU of the TensorFlow matrix multiplication of x and W, plus b. You can see that the form of that line, when we build our h, looks essentially the same as it would in NumPy, except we're calling our TensorFlow mathematical operations. That is absolutely essential, because up to this point we are not actually manipulating any data; we're only building symbols inside our graph. No data is actually moving through our system yet. You cannot print h and see the value it expresses. First and foremost, because x is just a placeholder, it doesn't have any real data in it yet. But even if x weren't a placeholder, you cannot print h until we run it; we are just building a backbone for our model. But you might wonder now, where is the graph? If you look at the slide earlier, I didn't build a separate node for this matrix multiplication, a different node for add, and a different node for ReLU; I've only defined one line for h, but I claim that we have all of these nodes in our graph. So if you actually try to analyze what's happening in the graph, and there are not too many reasons for you to do this when you're actually programming a TensorFlow application, but if I call on my default graph and then I call get_operations on it, I see all of the nodes in my graph, and there are a lot of things going on here. You can see in the top three lines that we have three separate nodes just to define what this concept of zeros is. There are no values initially assigned yet to our b, but the graph is getting ready to take in those values. We see that we have all of these other nodes just to define what the random uniform distribution is. In the right column, we see we have another node for Variable_1, which is probably our W. Then in the bottom four lines, we actually see the nodes as they appear in our figure: the placeholder, the matrix multiplication, the addition, and the ReLU. So in fact, the figure that we're presenting on the board is a simplification of what TensorFlow graphs really look like. There are a lot of things going on behind the scenes that you don't really need to interface with as a programmer, but it is extremely important to keep in mind that this is the level of abstraction that TensorFlow is working with above the Python code; this is what is actually going to be computed in your graph. It is also interesting to see that the last node, ReLU, is pointing to the same object in memory as the h variable that we defined above; both of them are operations referring to the same thing. So in the code before, what this h actually stands for is the last node in the graph that we built. So, great. Question? The question was about how we're deciding what the values and the types are. That is a purely arbitrary choice; we're just showing an example, it's not related to anything, it's just part of our example. Okay. Great, so we've defined a graph, and the next question is how do we actually run it? The way you run graphs in TensorFlow is that you deploy them in something called a session. A session is a binding to a particular execution context, like a CPU or a GPU. So we're going to take the graph that we built, and we're going to deploy it onto a CPU or a GPU.
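Continuing the sketch above, building h and inspecting the default graph might look like this; the node names in the comment are typical TF 1.x defaults, not taken from the demo itself:

```python
# Build the flow graph: h = ReLU(xW + b). No data moves yet; these are just graph symbols.
h = tf.nn.relu(tf.matmul(x, W) + b)

# Peek at the nodes TensorFlow actually created behind the scenes.
for op in tf.get_default_graph().get_operations():
    print(op.name)   # e.g. zeros, random_uniform/..., Variable, Variable_1,
                     # Placeholder, MatMul, add, Relu
```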
And you might actually be interested to know that Google is developing their own integrated circuit, called a tensor processing unit, just to make tensor computation extremely fast. It's in fact orders of magnitude faster than even a GPU, and they did use it in the AlphaGo match against Lee Sedol. So the session is any hardware environment that supports the execution of all the operations in your graph. That's how you deploy a graph, great. So let's see how this is run in code. We're going to build a session object, and we're going to call run with two arguments, fetches and feeds. Fetches are the list of graph nodes whose outputs we want returned; these are the nodes that we're interested in actually computing the values of. The feeds are a dictionary mapping from graph nodes to actual values that we want to run in our model; this is where we fill in the placeholders that we talked about earlier. So this is the code that we had earlier, and we're gonna add some new lines. We're first going to build a session object with tf.Session. It's gonna take some default environment, most likely a CPU, but you're able to pass in as an argument what device you want to run it on. Then we're going to call, first of all, session.run on initializing all the variables. There is a concept in TensorFlow called lazy evaluation. It means that the evaluation of your graph only ever happens at run time, and now we can give an interpretation to run time in TensorFlow: run time means the session. Once we build the session, we're ready to actually call on the TensorFlow run time, and it is only then that we actually assign the values we initialized our b and W with onto those nodes. After those two lines, we're finally in a position to call run on the node we're actually interested in, the h, and we feed in our second argument, a dictionary mapping x, our placeholder, to the values that we're interested in; for now just some random values. Question? Initialize all variables will initialize all the things that are formally called variables in your graph, like b and W in this case. And the next question was: what is the difference between variables and placeholders, and when might we want to use which? Variables, in most cases, will be the parameters that we're interested in; you can almost think of that as the direct correspondence. x, our data, is not a parameter we're interested in tuning in the models we are working with. Additionally, it's important that our parameters have initializations in our model to begin with: they have a state. Our input doesn't really have a state as part of our model. If we're gonna take our model and export it to somebody else, there's no reason for it to include any real data values. The data is arbitrary; it's the model parameters that are the foundation of your model. They are what makes your model interesting and compute what it computes. Great, so what have we covered so far? We first built a graph using variables and placeholders. We then deployed that graph onto a session, which is the execution environment. And next we will see how to train the model. So the first question that we might ask in terms of optimization is: how do we define the loss? We're going to use a placeholder for the labels, as data that we feed in only at run time, and then we're going to build a loss node using our labels and prediction.
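A sketch of deploying the graph in a session and fetching h; the random input and batch size are just for illustration:

```python
sess = tf.Session()                          # bind the graph to an execution context (CPU/GPU)
sess.run(tf.global_variables_initializer())  # lazy evaluation: variable values are assigned only now

# Fetch h; the feed dictionary fills in the placeholder x with real values.
h_value = sess.run(h, feed_dict={x: np.random.random((64, 784))})
print(h_value.shape)                         # sess.run returns NumPy arrays
```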
The first line in the code here is a Python variable that is the prediction at the end of your neural network. It's the output of some softmax over whatever your neural network is producing; it could be a probability vector, it could be a regression output. That first line is the end of the feed-forward stage of your neural network: it's what your network is trying to predict. We're then going to create a variable called label that is a placeholder for the ground truth that our model is trying to train against. Now we are ready to create our cross-entropy node, which is just like in assignment one: it's going to be the sum, over each row, of the labels times the TensorFlow log of the prediction. Just an interesting point: the sum and log do need to be TensorFlow functions, but TensorFlow will automatically convert addition, subtraction, and element-wise multiplication into TensorFlow operations. Question. >> [INAUDIBLE] >> Yep. >> [INAUDIBLE] >> It's going to sum each row altogether, which is what we want, since each row of the labels is a one-hot vector. So you multiply that by our prediction, and it picks out the value at the target index, and when we sum that, it gives us the correct result; everything else in that row is zero. So it's squashing each row into a single value: since axis 0 is the rows, axis 1 is the columns, it's gonna collapse the columns. Yes? The question was: are the feeds just for the placeholders? Yes, that is correct. The feeds are just a dictionary used to fill in the values of our placeholders. Great, all right, so we've now defined the loss and we are ready to compute the gradients. The way this is done in TensorFlow is that we first create an optimizer object. There's a general abstract class in TensorFlow called Optimizer, where each of its subclasses is an optimizer for a particular learning algorithm. The learning algorithm that we've already used in this class is gradient descent, but there are many other choices that you might want to experiment with in your final project; they have different advantages. So that is just the object that creates an optimization node in our graph. We're going to call a method on it called minimize, and it takes as its argument the node that we actually want to minimize. This adds an optimization operation to the top of our computational graph, and when we evaluate that node, the variable I wrote in the top line called train_step, when we call session.run on train_step, it is going to actually apply the gradients to all of the variables in our model. This is because the minimize function actually does two things in TensorFlow. It first computes the gradient of its argument, in this case the cross entropy, with respect to all of the things that we defined as variables in our graph, in this case b and W. And then it actually applies the gradient updates to those variables. So I'm sure the question in the back of all your minds now is: how do we actually compute the gradients? The way it works in TensorFlow is that every graph node has an attached gradient operation, a prebuilt gradient of its output with respect to its input. So when we want to calculate the gradient of our cross entropy with respect to all the parameters, it is extremely easy to just backpropagate through the graph using the chain rule.
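Put together, the loss and training-step nodes from this part of the talk might look like the sketch below; the softmax output layer and the number of classes are stand-ins, since the lecture deliberately leaves the prediction abstract:

```python
# Ground-truth labels, fed in at run time just like x.
label = tf.placeholder(tf.float32, (None, 10))            # 10 classes, as an example

# A stand-in prediction: a softmax layer on top of the hidden layer h from before.
W2 = tf.Variable(tf.random_uniform((100, 10), -1, 1))
prediction = tf.nn.softmax(tf.matmul(h, W2))

# Cross entropy: sum label * log(prediction) over the columns (axis 1),
# then average over the batch so the loss is a scalar.
cross_entropy = -tf.reduce_mean(tf.reduce_sum(label * tf.log(prediction), axis=1))

# minimize() both computes gradients w.r.t. all variables and applies the updates.
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
```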
So this is where you actually get to see the main advantage of expressing this machine-learning framework as a computational graph: it is very easy for the framework to traverse backwards through your graph and, at each node, multiply the error signal by the predefined gradient of that node. All of this happens automatically, behind the programmer's interface. Question? The question was whether the gradients are computed with respect to all of our variables. The argument to the minimize function is the node whose gradient is being computed, in the numerator, with respect to, automatically, all of the things we defined as variables in our graph. Note that you can add another argument specifying which variables to actually apply gradients to, but if you don't, it will just do it for everything defined as a variable in the graph. Which also answers a question from earlier about why we wouldn't want to make x a variable: because we don't actually want to update it. So what does this look like in code? We're just going to add the top line to the previous slide. We create a Python variable called train_step that takes our GradientDescentOptimizer object with a learning rate of 0.5, and we call minimize on it over the cross_entropy. You can see that that line encapsulates all of the important information about doing optimization: it knows what gradient-step algorithm to use, gradient descent; it knows the learning rate; and it knows what node to compute the gradients over and minimize, of course. Okay. So let's actually see how to run this in code, the last thing we have to do. Question. Let me answer that. The question was: how does the session know what variables to link this to? The session deploys all of the nodes in your graph onto the runtime environment. Everything in the graph is already on it, so when you call minimize on this particular node, it's already there inside your session to compute, if that answers it. Okay, so the last thing we need to do, now that we have the gradients and the gradient update, is to create an iterative learning schedule. We're going to iterate over, say, 1,000 iterations; the 1,000 is arbitrary. We're going to call on our favorite dataset and take the next batch; the data is just some abstract data in this arbitrary program. So we get a batch for our inputs and a batch for our labels. We then call sess.run on our training step variable. Remember, when we call run on that, it applies the gradients to all the variables in our graph. And it takes a feed dictionary for the two placeholders that we've defined so far, the x and the label, where x and label are graph nodes. The keys in our dictionary are graph nodes and the values are NumPy data. And this is a good place to talk about just how well TensorFlow interfaces with NumPy, because TensorFlow will automatically convert NumPy arrays into tensors when we feed them into our graph. So we can insert NumPy arrays, batch_x and batch_label, into our feed dictionary, and we also get NumPy back as the output of sess.run: if I defined some variable like output = sess.run(...), that would also be a NumPy array of whatever the fetched nodes evaluate to. Though fetching train_step itself just returns None, since it's an operation rather than a tensor.
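The training loop described here, as a minimal sketch; the data object, its next_batch method, and the batch size are stand-ins for whatever dataset you plug in:

```python
for i in range(1000):                                      # 1,000 iterations, chosen arbitrarily
    batch_x, batch_label = data.next_batch(64)             # hypothetical data source
    # Running train_step computes gradients and applies the updates to all variables.
    sess.run(train_step, feed_dict={x: batch_x,            # keys are graph nodes,
                                    label: batch_label})   # values are NumPy arrays
```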
Are there any other questions up to this point before we take a bit of a turn? Yes. So I believe there are some ways to create queues for feeding in data and labels; that might be the answer to your question. I can't testify to whether this is the best method, but it certainly is a simple one, where you can just work with NumPy data, which is what Python programmers are used to, and that is the insertion point into our placeholder. One more question, yes? Your question was: how does the cross entropy know what to compute? The cross entropy takes the prediction; I haven't defined what the prediction is fully, I just wanted to abstract that part. The prediction is something at the end of your neural network, where all of those are symbols inside your graph, and everything before it consists of nodes in your graph as well. I think this might be a better answer to your question: when you evaluate some node in the graph, like if I were to call session.run on prediction, it automatically computes all of the nodes before it in the graph that need to be computed to know what the value of prediction is. Behind the scenes, TensorFlow traverses backwards through your graph and computes all of those operations, and that happens automatically inside my session. So the last important concept that I wanna talk about before we move over to the live demo is variable sharing. When you wanna build a large model, you often need to share large sets of variables, and you might want to initialize them all in one place. For example, I might want to instantiate my graph multiple times, or, even more interestingly, I might want to train over a cluster of GPUs. We might not have the ability to do that in this class because of the resource limitations we talked about, but especially moving on from this class, it's often the case that you wanna train your model on many different devices at one go. So how does this work if we're instantiating our model on each of these devices, but we wanna share the variables? One naive way you might think of doing this is creating a variables dictionary at the top of your code, a dictionary from strings to the variables they represent. In this way, if I wanna build code below it that depends on these parameters, I would just use this dictionary: I would call variables_dict and look up those keys. That might be how I would share my variables. But there are many reasons this is not a good idea, and it's mostly because it breaks encapsulation. The code that builds your graph in TensorFlow should always have all of the relevant information about the nodes and operations that you are using. You want to be able to document, in your code, the names of your nodes, the types of your operations, and the shapes of your variables, and you kind of lose this information if you just have this massive variables dictionary at the top of your code. So TensorFlow's solution for this is something called variable scope. A variable scope provides a simple namespacing scheme to avoid clashes, and the other relevant function to go along with it is called get_variable. get_variable will create a variable for you if a variable with a certain name doesn't exist, or it will access that variable if it finds that it already exists. So let us see some examples of how this works.
Let me open a new variable scope called foo, and I'm gonna call get_variable with the name v. This is the first time I'm calling get_variable on v, so it's going to create a new variable, and you'll find that the name of that variable is foo/v. So calling this variable scope foo is kind of like working inside a directory that we're calling foo. Now let me close that variable scope and reopen it with another argument, reuse set to true. Now, if I call get_variable with the same name v, I'm actually going to find the same variable; I'm gonna access the same variable that I created before. So you will see that v1 and v are pointing to the same object. If I close this variable scope again and reopen it, but set reuse to false, your program will crash if I try to run that line again, because you've set it not to reuse any variables, so it tries to create a new variable, but one with that name already exists. The uses of variable scope will become apparent in the next assignment and over the course of the class, but it is something useful to keep in the back of your mind. So in summary, what have we looked at? We learned how to build a graph in TensorFlow that has some sort of feedforward or prediction stage, where you are using your model to predict some values. I then showed you how to optimize the parameters in your neural network, how TensorFlow computes the gradients, and how to build this train_step operation that applies gradient updates to your parameters. I then showed you what it means to initialize a session, which deploys your graph onto some hardware that provides the runtime environment for your program. And I then showed you how to build a simple iterating schedule to continuously run and train our model. Are there any questions up to this stage before we move on in this lecture? Yes? It doesn't, because in feed_dict, you can see that feed_dict takes graph nodes as keys; we're not addressing feed_dict by the names of those variables. But whenever you create a variable or a placeholder, there's always an argument that allows you to give that node a name, and that's a name as a TensorFlow symbol, not the name of my Python variable. That's a great question: the naming scope changes the name of the actual symbol of that operation. So if I were to scroll back in the slides and look at my list of operations, the names of all those operations would be prefixed with foo, as we created earlier. Maybe one more question before we move on, if there's anything? Yes? Yes, if you load a graph using get_variable, it will refer to the same variable across devices; this is why it's so important to introduce this idea of variable scope for sharing across devices. One more question. >> Can we share variables- >> The question was: can we share variables across sessions? I believe the answer is yes, but I might be wrong; I'm not entirely sure as of this time. Okay, so we just have a couple of acknowledgements. When we created this presentation, we consulted with a few other people who have done TensorFlow tutorials. Most of these slides are inspired by Jon Gauthier and a similar presentation he gave. We also talked with Bharath and Chip; Chip is teaching a class, CS20SI, TensorFlow for Deep Learning Research. So we are very grateful to all the people we talked with to create these slides.
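The foo/v example, roughly as it would look in TF 1.x code:

```python
import tensorflow as tf  # TF 1.x

with tf.variable_scope("foo"):
    v = tf.get_variable("v", shape=[1])       # first call: creates a variable named "foo/v"

with tf.variable_scope("foo", reuse=True):
    v1 = tf.get_variable("v", shape=[1])      # reuse=True: fetches the existing "foo/v"
assert v1 is v                                 # both Python names point to the same variable object

# With reuse=False (the default), asking for "v" again would crash,
# because a variable named "foo/v" already exists:
# with tf.variable_scope("foo", reuse=False):
#     v2 = tf.get_variable("v", shape=[1])    # raises ValueError
```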
And now we will move on to the research highlight, before we come back for the live demo. >> Hi everyone. Can you guys hear me okay? Hi, my name's Alan. Let's take a break from TensorFlow and talk about something also very interesting. I'm gonna present a paper called Visual Dialog. Here's a brief introduction. Basically, in recent years we are witnessing rapid development and improvement in AI, especially in natural language processing and computer vision. And many people believe that the next generation of intelligent systems will be able to hold meaningful conversations with humans in natural language, based on visual content. So, for example, such a system should be able to help blind people understand their surroundings by answering their questions, or you could integrate it with AI assistants such as Alexa to understand people's questions better. Before I move on to the paper, let's talk about some related work. There have been a lot of efforts to combine natural language processing and computer vision. The first category is image captioning. Here I'm gonna introduce two works. The first one is a paper called Show, Attend, and Tell, which is an extension of another paper called Show and Tell with some attention mechanisms. The second one is open-source code written by Andrej Karpathy. In both cases, the models are able to give you a description of the image. And for the second case, that's a typo right there; it should be Video Summary. Basically, the model is able to summarize the content of the video. So imagine you're watching a movie and you don't wanna watch the whole thing; you wanna see the main content of the movie. This model would be pretty useful. Also in this category is Visual-Semantic Alignment. Instead of giving a description for the whole image, this model gives a description for each individual component in the image. And as we can see on the right, the data collection process is very tedious, because you actually need to draw a lot of bounding boxes and give a description for every single one. The next one is more related to our paper, and it's called visual question answering. Basically, given an image and a question, the model answers the question based on the visual content. In this case, as you can see, the answers are either binary, yes or no, or very short: one number, or a circle, or different types of shapes. And this paper, Visual Dialog, actually tries to solve the issue I just mentioned. It proposes a new AI task called Visual Dialog, which requires an AI agent to hold a meaningful conversation with humans based on visual content. It also develops a novel data collection protocol, and in my opinion this is the best invention ever, because you make contributions to science, make money, and socialize with people all at the same time. It also introduces a family of deep learning models for visual dialog. I'm not gonna go into too many details today, because we are gonna cover deep neural networks later in this class. This model encodes the image using a convolutional neural network, and encodes the question and the chat history using two recurrent neural networks. It then concatenates the three representations together as a vector, followed by a fully connected layer and a decoder, which generates the answer based on the representation. And here's some analysis of the dataset. As you can see, this dataset is much better than previous work, because there are more unique answers.
And also the questions and answers tend to be longer. Here are some results. They actually showcase the model in the form of a visual chatbot: basically, you can chat with a robot online, in real time. If you guys are interested, please try it, [LAUGH] and that's it. >> [APPLAUSE] >> All right, let's get started then. So we're gonna start with linear regression. I'm sure, if you have taken CS 221 or CS 229, you have heard of and coded up linear regression before. This is just gonna be a start to get us even more familiar with TensorFlow. So, what does linear regression do again? It takes all your data and tries to find the best linear fit to it. So imagine house prices versus time, for example, or location; it's probably a linear fit. We generate our dataset artificially using y equals 2x plus epsilon, where epsilon is sampled from a normal distribution. I won't really go much into how we obtain the data, because that, we assume, is normal Python processing and not really TensorFlow, so we will move on and actually start implementing linear regression and the function run. So in this first function, linear_regression, we will be implementing the graph itself. As Barak said, we will be implementing and defining the flow graph. So let's get started. First, we're gonna create our placeholders, because we're gonna see how we can feed in our data. We have two placeholders here, x and y. Let's start with creating x first. This is gonna be of type float, so we make it float32, and it's gonna have a shape; we're gonna make this slightly more general and give it a shape of None. What this means is that you can dynamically change the number of examples that you send to your network, or in this case, your linear model. And it's just a row vector here. All right, and we're gonna name it x. We create y, which is the label, and which will also be of the same type and shape, and we name it y. All right, so now that we have defined our placeholders, we're gonna start creating the other variables. We start by defining our scope. So let's say tf.variable_scope, and we're gonna name it just lreg, for linear regression, and we're gonna call it scope, all right? Now that we're here, we create our weight, which is w. We call tf.Variable, and since it's just a linear regression, it'll just be a single number, not an integer, my bad, just one number. And we randomly initialize it from the normal distribution with NumPy, and we call it w. Now we actually build the graph, now that we have defined our variables and placeholders. We define y_pred, which is just the prediction, and it's given by tf.mul(w, x). So far so clear, any questions? Yes. Yeah, so as I mentioned earlier, None in this case is so that you can dynamically change the number of examples that you send to your network. So imagine I'm doing hyperparameter tuning; I don't want to go and change the shape to 10 or 32 or 256 later on. You can think of it as dynamically saying, okay, I'm gonna change the batch size that I send to my network. Does that answer your question? Yes. So as we mentioned, it'll just go into the variable scope and then define the name as it pleases, so, yeah. All right, so let's go ahead.
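A sketch of the graph-building function being described; the function and variable names are my reconstruction of the demo code, not the exact file, and tf.multiply is used where older TF versions spelled it tf.mul:

```python
import numpy as np
import tensorflow as tf  # TF 1.x

def linear_regression():
    # Placeholders with shape [None] so the number of examples can vary.
    x = tf.placeholder(tf.float32, shape=(None,), name="x")
    y = tf.placeholder(tf.float32, shape=(None,), name="y")

    with tf.variable_scope("lreg") as scope:
        # A single scalar weight, randomly initialized from a normal distribution.
        w = tf.Variable(np.random.normal(), name="w", dtype=tf.float32)
        y_pred = tf.multiply(w, x)                      # the prediction

        loss = tf.reduce_mean(tf.square(y_pred - y))    # mean squared error

    return x, y, y_pred, loss
```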
So now that we have our prediction, the next logical thing is to actually compute the loss, and we're gonna do that with the two-norm. So let's first get the squared error itself: that's given by square, so we just do square(y_pred - y). And since we want a single number, we take a reduce_sum over it, or rather a reduce_mean, all right. Okay, so with this we have finished building our graph, and we return x, y, y_pred, and the loss from our linear regression model. Now we're gonna actually start computing what's in the graph. We first start by generating our dataset, and I'm gonna fill in the code here that defines the training procedure. All right, so let's get started on that part. First we get what we call the model: we make an instance of it, and that's just given by this. Once we have that, we create our optimizer, and this is given by, as Barak mentioned earlier in his slides, GradientDescentOptimizer. We define the learning rate to be 0.1, and we minimize over the loss that we just got from our model. All right, any questions so far? We just created our optimizer. Okay, now we start a session: with tf.Session() as session. And we are first gonna initialize, yeah, that's one thing I forgot, we first initialize our variables, as someone asked about earlier. Why would we do that? So this is actually a new function, slightly different from what Barak mentioned, and this shows how quickly TensorFlow changes: between the time Barak made the slides and I wrote the code, it's already been updated, so we're going to use the new one. So this is just initializing the variables here, global_variables_initializer, all right. So we created a session, and now we run the init op, which just initializes the variables. Now we create our feed_dict, and what we feed in is essentially just x_batch and y_batch, which we got from our generate dataset function; x gets x_batch and y gets y_batch. All right, now we're gonna loop over our dataset multiple times, because it's a pretty small dataset; 30 epochs is just arbitrarily chosen here. We get our loss value and optimize; I'll explain this step in a second. So now we call run, and what we want to fetch is the loss and the optimizer, and we feed in our feed_dict. Does anyone have any questions on this line? All right, and we just print the loss here, and since this is an array, we just want the mean, because we have around 101 examples. All right, so now that we're done with that, we can actually go and train our model, but we'd also like to see how it actually ends up performing. So what we're gonna do is actually look at what it predicts, and the way we get that is, again, calling session.run on y_pred: we fetch y_pred, and our feed dictionary here is just this. All right. Yes. So the optimizer was defined as a GradientDescentOptimizer here, and you can see we're not returning anything for that fetch, which is why I just ended up with a blank there; it's just syntax, so over here you see I'm assigning nothing for it. Yeah, all right, so we can actually go and start running our code and see how it performs, okay? All right. So let's go and run our script, and see how that performs.
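The training side of the demo, sketched under the same assumptions; generate_dataset and the epoch count are stand-ins mirroring what the transcript describes:

```python
def run():
    x_batch, y_batch = generate_dataset()          # hypothetical helper: data from y = 2x + noise
    x, y, y_pred, loss = linear_regression()

    optimizer = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

    with tf.Session() as session:
        session.run(tf.global_variables_initializer())
        feed_dict = {x: x_batch, y: y_batch}

        for epoch in range(30):                    # 30 passes over this tiny dataset
            # Fetch the loss and run the optimizer step; the optimizer fetch returns None.
            loss_val, _ = session.run([loss, optimizer], feed_dict)
            print("epoch", epoch, "loss", loss_val)

        # Fetch the model's predictions on the inputs to plot the fitted line.
        y_pred_batch = session.run(y_pred, {x: x_batch})
    return y_pred_batch
```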
So you see the loss decrease, and we can actually go ahead and see how it turns out. Okay, I guess it didn't like my tmux. Anyways, you can see we fit a linear line to the data. All right, so far so good. So now we're actually gonna go and implement word2vec using skip-gram, which is gonna be slightly more complex; the linear regression was just to see how we create very simple models in TensorFlow. All right, let's go ahead. Any questions so far? Yes? All right. So, let's refresh our understanding of word2vec. Say you have the following sentence, a completely unbiased statement here: the first CS224N homework was a lot of fun. Suppose we were to make a dataset out of this sentence, with a window size of one. Remember that skip-gram tries to predict each context word given the target word, and since the number of context words here is just two, because our window size is one, the task becomes to predict the context words from each target word: for example, predicting "the" and "CS224N" from "first", and "fun" from "of", and so on. So we're basically decomposing the sentence into a dataset. That's just clarifying what word2vec actually was. All right, so let's go ahead and start implementing it. I've already written the data processing functions here, so we won't have to deal with that; we have our batches. This function load_data loads the pre-processed training and validation sets. The training data is a list of batch input and label pairs, and we have about 30,000 of those. The validation data is just a list of all validation inputs, and the reverse dictionary is a Python dictionary from word index to word. Right? So let's go ahead and implement the skip-gram graph first. We are, again, going to start by defining our placeholders. The first one is batch_inputs, and we define a placeholder here, but in this case, since we just have integers, we can make it int32, and the shape is going to be the batch size. So we have that, and we can skip naming here, because we're not going to create multiple variable scopes; that will be fine. Then we create our batch_labels, which is again a tf.placeholder, int32, of shape batch size by one. And finally, we create a constant for our validation set, because that is not going to change anytime soon; it is given by val_data, which we previously loaded, and we have to define what type it is, which is int32 again, just like our training set. All right. Now that we have defined those, yes? So, since I'll be applying transposes later, I just wanted to make sure the extra dimension is one; it doesn't really make that big of a difference. In this case, I'll be calling transpose on the labels, which is why I wanted to make sure that it transposes fine. I just wanna make it absolutely clear that it's a row vector, not a column vector. >> [Question] >> Yeah, exactly. All right. So now we can go and create our scope for the model. This is where we'll define it. First, we create the embeddings, as we all did in our assignment. That's going to be a huge variable, initialized randomly from a uniform distribution, and it's going to take vocabulary size, which we have previously defined at the top.
And it's going to take embedding size. So this is going to be the number of words in your dictionary times the size of your embeddings. Since it's a random uniform distribution, we also give the parameters for that. So far so good? All right, so we just created our embeddings. Now, since we want to index into them with our batch, we create batch_embeddings, and we use a function which is actually going to be pretty handy for the current assignment: we do an embedding_lookup on the embeddings, and we pass in the batch_inputs. All right. Next, we create our weights, and we call tf.Variable here. We're going to use a truncated normal distribution, which is just a normal distribution cut off at two standard deviations instead of going off to infinity. Okay. This is also going to be of the same kind of size as before, vocabulary size by embedding size, because it interacts with our input directly. Since this is a truncated normal, we need to define the standard deviation, and this is given by one over the square root of the embedding size itself. Okay. Finally, we create our biases, which are also variables, initialized with zeros of size vocabulary size. All right. Now we define our loss function, now that we have all our variables. In our assignment we used the softmax cross entropy, the negative log likelihood; in this case, we'll be using something similar, and this is where TensorFlow really shines: it has a lot of loss functions built in. We're going to use one called NCE, noise-contrastive estimation; I forgot the exact name for a moment, but it is very similar, in the sense that the words that should come up with higher probability are emphasized, and the words which should appear with lower probability are de-emphasized. So we call tf.nn; nn is the neural network module in TensorFlow, and this loss takes a couple of parameters; you can look up the API. Yes? The embeddings? That's the word vector representation, which is what you're trying to learn. No, W is the weight matrix; it is a parameter that you're also trying to learn, but it interacts with the word representations. Effectively, you can think of these embeddings as sort of semantic representations of those words, right? Yes? Right, so our embeddings matrix is defined by the vocabulary size. Let's say we have 10,000 words in our dictionary; each row is then the word vector for the word with that index. And since our batch is only a subset of the vocabulary, we need to index into that matrix with our batch, which is why we use the embedding_lookup function, okay. All right, so we're just gonna go and use this API; obviously everyone should look it up on the TensorFlow website itself, but what it does is take the weights, the biases, and the labels, which I defined as batch_labels. It also takes an input, which is the batch embeddings. Okay. And here's where TensorFlow really shines again: the num_sampled argument. In our dataset, we only have positive samples, in the sense that we have the context words and the target word; we also need noise words as negatives, and this is where num_sampled comes in. We defined num_sampled to be 64 earlier.
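Here is roughly how that skip-gram graph might be assembled in TF 1.x; the hyperparameter values are either the ones mentioned in the talk or marked example values, val_data is a stand-in for the output of load_data, and the nce_loss keywords follow the standard TF 1.x signature rather than the demo file itself:

```python
import math
import numpy as np
import tensorflow as tf  # TF 1.x

vocabulary_size = 10000   # example value; the talk just says "number of words in your dictionary"
embedding_size = 128      # example value
batch_size = 128
num_sampled = 64          # negative (noise) samples per batch, as in the talk

val_data = np.array([0, 1, 2], dtype=np.int32)   # stand-in for validation word indices from load_data()

batch_inputs = tf.placeholder(tf.int32, shape=[batch_size])
batch_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
val_dataset = tf.constant(val_data, dtype=tf.int32)

with tf.variable_scope("word2vec") as scope:
    # One row per vocabulary word, initialized uniformly in [-1, 1].
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    # Pull out the rows corresponding to this batch of input words.
    batch_embeddings = tf.nn.embedding_lookup(embeddings, batch_inputs)

    # Output weights and biases for the NCE loss.
    weights = tf.Variable(tf.truncated_normal(
        [vocabulary_size, embedding_size],
        stddev=1.0 / math.sqrt(embedding_size)))
    biases = tf.Variable(tf.zeros([vocabulary_size]))

    # Noise-contrastive estimation: true context words vs. 64 sampled noise words,
    # averaged over the batch so the loss is a scalar.
    loss = tf.reduce_mean(tf.nn.nce_loss(
        weights=weights, biases=biases,
        labels=batch_labels, inputs=batch_embeddings,
        num_sampled=num_sampled, num_classes=vocabulary_size))
```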
And what it essentially does is look up 64 words which are not in the training pair, noise words, and these serve as negative examples, so that our network learns which words are actually context words and which are not. And finally, num_classes is defined by our vocabulary size again. All right. With that, we have defined our loss function, and now we have to take the mean of it, because the loss is given per example in the batch. We get that with reduce_mean. This is gonna look slightly nasty. So the loss is given for that particular batch, and yes, exactly, it's given for multiple samples, and since we have multiple samples in a batch, we want to take the average of those. Exactly. Okay. Great, now we have completely defined our loss function. Next, if you remember from the assignment, we take the norm of these word vectors, so let's do that first. That will be a reduce_sum over the squares; this is just API calling, which is documented in detail on the TensorFlow website itself. All right. keep_dims=True: I've added an argument called keep dimensions, which is for when you sum over a dimension and you don't want it to disappear, but just leave it with size 1. Okay. And now we divide the embeddings by the norm to get the normalized_embeddings: embeddings / norm. Great. And now we return batch_inputs and batch_labels, because these will be our feeds, along with the normalized embeddings and the loss. All right. With this done, we can come back to this function later; there's a small part missing which we'll get back to. Yes? Thank you. All right. So now we go and define our run function. How are we doing on time? Okay, we have 20 minutes, great. We make an instance of our model, just by calling the function we just wrote, which was called skipgram, and getting the batch inputs, batch labels, normalized embeddings, and loss from it. Okay, and now we initialize the session. And over here, again, I forgot to initialize our variables, so we just initialize all of our variables to their default values, as Barak mentioned. Now we're gonna go and actually loop over our data to see if we can train our model. So let's do that: for each step and batch in the training data. For each iteration of this for loop, we obtain a batch, which has its input data as well as the labels. Okay, so we have inputs and labels from our batch. Great. And we can now define our feed dictionary accordingly, where batch_inputs maps to the inputs; so this is a dictionary, and batch_labels just maps to the labels. Any questions so far? Okay. We go ahead and evaluate our loss, and we do this by calling session.run, where we fetch the optimizer and the loss, and we pass in the feed dictionary we already defined above. Okay. We get the loss, and since we are trying to report an average, we add it up first and then divide by the number of examples that we've seen. All right. So we're just gonna put in a couple of print statements now, just to see if our model actually trains: print the step and the average loss. Since the average loss will be zero at the very first step, we can just guard against that. All right, so now we have our average loss, and we reset it again, so that it doesn't keep accumulating across every iteration of the loop. Okay. So we have almost finished our implementation here. However, one thing that's missing, yes?
I forgot to define that, good call. So we can define the optimizer at the beginning of the run function: a GradientDescentOptimizer, we pick a learning rate, and we minimize the loss. All right, thanks for that, okay. One thing we're still missing here is that we haven't really dealt with our validation set. Although we are training on our training set, we wanna make sure that it actually generalizes to the validation set, and that's the last part that's missing; we're gonna do that now. But before we do that, there's one more step: once we have the validation set, we still need to see how similar our word vectors are to the validation words, and we do that in the flow graph itself. So let's go back to our skipgram function; here we can implement that, okay. So, we have our val_embeddings: we index into the embeddings matrix to get the embeddings that correspond to the validation words, and we use the embedding lookup function here, embedding_lookup, on the embeddings, passing in the val_dataset. We'll actually use the normalized_embeddings, because we care about the cosine similarity and not about the magnitude. Okay, and the similarity is essentially just a cosine similarity. The way this works is that we matrix-multiply the val_embeddings, which we just obtained, with the normalized_embeddings. And since they won't multiply directly because of dimensional incompatibility, we have to set transpose_b; that's just another flag. All right, we also return this from our function; again, this is just part of the graph, and we need to actually execute the graph in order to obtain values from it. So back in run, we have the similarity. Okay, and since this is a computationally expensive process, let's do it only every 5,000 iterations. All right. The way we do this is by calling eval on the similarity tensor; since we have the session in scope, it actually goes and evaluates it, which is equivalent to calling session.run on similarity and fetching it, okay. So we call that and get the similarity, and for every word in our validation set, we find the top_k words that are closest to it; we can define k to be 8 here, and we then get the nearest words. So let's do that: we sort by similarity, and since the closest word will be the word itself, we want the next eight words and not the first eight, which is why it's top_k+1. Any questions so far? Yes? Right, so your embeddings matrix is the number of words you have in your vocabulary times the size of your word embedding for each word. So it's a huge matrix, and since the batch that you're currently working with is only a subset of that vocabulary, this function, embedding_lookup, indexes into that matrix for you and obtains the word vectors. It's equivalent to some complicated Python slicing that you could do with matrices, but it's nice syntax for doing it. Okay, all right, almost there, okay. So, we have our nearest k words; I have a helper function in my utils, you can check it on the GitHub that we'll post after the class is over, and you can play with it as you wish. We pass in the nearest words and the reverse_dictionary, just so we actually see the words and not just numbers, all right.
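A sketch of the similarity computation and the training loop with the periodic nearest-neighbor check, continuing the skip-gram sketch above; the learning rate, k, and the step interval are either values mentioned in the talk or assumptions where the recording is unclear, and train_data / reverse_dictionary are stand-ins for the output of the load_data helper the speaker describes:

```python
# Inside the skipgram() graph, after the loss has been built:
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm
val_embeddings = tf.nn.embedding_lookup(normalized_embeddings, val_dataset)
# Cosine similarity between every validation word and every vocabulary word.
similarity = tf.matmul(val_embeddings, normalized_embeddings, transpose_b=True)

# In run():
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)   # learning rate is an assumption

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    average_loss = 0.0
    for step, (inputs, labels) in enumerate(train_data):             # list of (inputs, labels) batches
        feed_dict = {batch_inputs: inputs, batch_labels: labels}
        _, loss_val = session.run([optimizer, loss], feed_dict)
        average_loss += loss_val

        if step % 5000 == 0:                      # the expensive check, done only occasionally
            if step > 0:
                print("step", step, "avg loss", average_loss / 5000)
                average_loss = 0.0
            sim = similarity.eval()               # same as session.run(similarity)
            for i in range(len(val_data)):
                top_k = 8
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]   # skip index 0, the word itself
                print([reverse_dictionary[idx] for idx in nearest])

    final_embeddings = normalized_embeddings.eval()
```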
Finally, we compute our final_embeddings, which are the normalized embeddings at the end of training, and we just call eval on that again, which is equal to calling session.run and fetching it. All right, we are done with the coding and we can actually run it and see how this performs. Okay. python word2vec... Oops, I must have missed something. I missed a bracket again. So, we first load up our dataset, then it iterates over it and we use our skip-gram model. Oops, let's see that. All right, where did this go? Okay. Perfect. So, as you can see here, we have 30,000 batches, each with a batch size of 128. Ah, man, [LAUGH] let's see. All right, so as you can see, the loss started off around 259, ends up at 145, and keeps decreasing; I think it goes down to somewhere around 6. Here you can also see, as it's printing, the nearest words for each validation word; this gets better with time and with more training data. We only used around 30,000 examples, so there's a lot of potential to do better, and in the interest of time, I only ran it for around 30 epochs, yes. So TensorFlow comes with TensorBoard, which I didn't show in the interest of time. Essentially, you can go to your localhost and see the entire graph and how it's organized, and that'll actually be a huge debugging help; you can use that for your final project. TensorBoard, yeah. All right, well, thank you for your time. >> [APPLAUSE]
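For reference, hooking a graph up to TensorBoard in TF 1.x is typically just a matter of writing the graph to a log directory and pointing the tensorboard command at it; a minimal sketch, with an arbitrary log path:

```python
# Write the current graph so TensorBoard can visualize it.
writer = tf.summary.FileWriter("./logs", tf.get_default_graph())
writer.close()
# Then, from a shell:  tensorboard --logdir=./logs   and open the printed localhost URL.
```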
Info
Channel: Stanford University School of Engineering
Views: 218,413
Rating: 4.8942809 out of 5
Keywords: TensorFlow
Id: PicxU81owCs
Length: 72min 33sec (4353 seconds)
Published: Mon Apr 03 2017