[MUSIC] Stanford University. >> What's happening today is that Nishith and Barak are going to be giving
an introduction to TensorFlow. So TensorFlow is Google's
deep learning framework, which I hope everyone
will be excited to learn. And at any rate, you have to learn it, because we're gonna be using it
in assignments two and three. So this should really also help out for
the second assignment. And so before we get started with that, I just want to do a couple
of quick announcements. So the first one was on final projects. So this is really the time to be
thinking about final projects. And if you've got ideas for final projects
and want to do the final project, you should be working out how to
talk to one of me, Kevin, Danqi, Richard, Ignacio, or
Arun over the next couple of weeks. And again, obviously,
you've got to find the time, so it's hard to fit everybody in. But we are making a real effort to
have project advice office hours. There were also some ideas for projects
that have been put up on the projects page. So I encourage people to look at that. Now, people have also asked
us about assignment four. So we've also stuck up
a description of assignment four. And so look at that if you're considering
whether to do assignment four. So assignment four is gonna be doing
question answering over the SQuAD dataset and
you can look in more details about that. So then there are two other
things I wanted to mention. And we'll also put up messages on Piazza,
etc, about this. I mean, the first one is that for
assignment three, we want people to have experience
of doing things on a GPU. And we've arranged with
Microsoft Azure to use their GPUs for doing that and for
people to use for the final project. And so we're trying to get that
all organized at the moment. There's a limit to how
many GPUs we can have. So what we're gonna be doing for
assignment three and for the final project is to
allow teams of up to three. And really, given the resource limits, it's in everyone's interest if as many people as possible team up. So we'd like to encourage people
to team up for assignment three. And so we've put up a Google form for
people to enter their teams. And we need people to be
doing that in advance, because we need to get that set up at
least a week in advance so we can get the Microsoft people to set up accounts
for people so that people can use Azure. So please think about groups for
assignment three and then fill in the Google form for that. And then the final thing is, for next
week, we're gonna make some attempts of reorganizing the office hours and
get some rooms for office hours so they can hopefully run more smoothly in
the countdown towards the deadline for assignment two than they did for
assignment one. So keep an eye out for that and expect
that some of the office hour times and locations will be varying a bit
compared to what they've been for the first three weeks. And so that's it from me,
and over to Nishith. >> Hi everyone.
Hope you had a great weekend. So today we are gonna be
talking about TensorFlow, which is another deep learning
framework from Google. So why study deep learning frameworks? First of all, much of the research progress in deep learning and machine learning can be attributed to these deep learning frameworks. They've allowed researchers to iterate
extremely quickly and also have made deep learning and other algorithms in ML
much more accessible to practitioners. So if your phone seems a lot smarter than it was three years ago, it's probably because of one of these deep learning frameworks. So the deep learning frameworks help
to scale machine learning code, which is why Google and Facebook
can now scale to billions of users. They can compute gradients automatically. Obviously, since you all must have
finished your first assignment, you must know that gradient
calculation isn't trivial. And so
this takes care of it automatically, and we can focus on
the high-level math instead. It also standardizes ML
across different spaces. So regardless of whether I'm at Google or
at Facebook, we still use some form of TensorFlow or
another deep learning framework. And there's a lot of cross-pollination
between the frameworks as well. A lot of pre-trained models are also
available online, so people like us who have limited resources in terms of GPUs do
not have to start from scratch every time. We can stand on
the shoulders of giants and on the data that they have collected and
sort of take it up from there. They also allow interfacing with GPUs, which is a fascinating feature, because GPUs can speed up your code a lot through parallelization. Which is why studying TensorFlow is sort
of almost necessary in order to make progress in deep learning, just because
it can facilitate your research and your projects. We'll be using TensorFlow for
PA two, three, and also for the final project, which also is an added
incentive for studying TensorFlow today. So what is TensorFlow actually? It's just a deep learning framework,
an open source software library for numerical computation using
data flow graphs, from Google. It was developed by their Brain team, which specializes in
machine learning research. And in their words, TensorFlow is
an interface for expressing machine learning algorithms, and an implementation
for executing such algorithms. So now I'll allow Barak to sort of take
over and give a high-level overview of how TensorFlow works and
the underlying paradigms that so many researchers have spent so
much time thinking about. >> Thanks, Nish, for starting us off. I'm gonna be introducing some of
the main ideas behind TensorFlow, its programming paradigm, and
some of its main features. So the biggest idea of all of
the big ideas about TensorFlow is that numeric computation is
expressed as a computational graph. If there's one lesson to take away from this presentation today and keep at the back of your mind, it's that the backbone of any
TensorFlow program is going to be a graph where the graph nodes are going to be
operations, shorthand as ops in your code. And they have any number of inputs and
a single output. And the edges between our nodes are going
to be tensors that flow between them. And the best way of thinking about
what tensors are in practice is as n-dimensional arrays. The advantage of using flow graphs as the
backbone of your deep learning framework is that it allows you to build
complex models in terms of small and simple operations. And this is going to make
your gradient calculations extremely simple when we get to that. You're going to be very, very grateful for
the automatic differentiation when you're coding large models in your
final project and in the future. Another way of thinking about a TensorFlow
graph is that each operation is a function that can be
evaluated at that point. And hopefully we will see why that is
the case later in the presentation. So let us look at an example of a neural
network with one hidden layer, and what its computational graph
in TensorFlow might look like. So we have some hidden layer that we are
trying to compute, as the ReLU activation of some parameter matrix W times
some input x plus a bias term. So if you recall from last lecture, the
ReLU is an activation function standing for rectified linear unit in the same way
that a sigmoid is an activation function. We are applying some nonlinear function over our linear input; that is what gives neural networks their expressive power. And the ReLU takes the max
of your input and zero. On the right, we see what the graph
might look like in TensorFlow. We have variables for our b and W, we have
a placeholder, we'll get to that soon, with the x, and nodes for
each of the operations in our graph. So let's actually dissect
those node types. Variables are going to be stateful
nodes which output their current value. In our case, it's just b and W. What we mean by saying that variables are stateful is that they retain their
current value over multiple executions, and it's easy to restore
saved values to variables. So, variables have a number
of other useful features. They can be saved to your disk during and after training, which is what facilitates the use Nishith talked about earlier, allowing people from different companies and groups to save, store, and send their model parameters to other people. And gradient updates, by default, will apply over all of the variables in your graph. The variables are the things that you
wanna tune to minimize the loss, and we will see how to do that soon. It is really important to remember
that variables in the graph like b and w are still operations. By definition, if there can be such
a thing as a definition on this, all of your nodes in
the graph are operations. So when you evaluate the operation that is
these variables in our run time, and we will see what run time means very shortly,
you will get the value of those variables. The next type of nodes are placeholders. So placeholders are nodes whose
value is fed in at execution time. If you have inputs into your network that
depend on some sort of external data that you don't want build your graph
that depends on any real value. So these are place folders for
values that we're going to add into our computation
during training. So this is going to be our input. So for placeholders,
we don't give any initial values. We just assign a data type, and
we assign a shape of a tensor so the graph still knows what to compute even though
it doesn't have any stored values yet. The third type of node
are mathematical operations. This is going to be your
matrix multiplication, your addition, and your ReLU. All of these are nodes in
your TensorFlow graphs. And it's very important that
we're actually calling on TensorFlow mathematical operations
as opposed to NumPy operations. Okay, so let us actually
see how this works in code. So we gonna do three things. We're going to create weights
including initialization. We're going to create
a placeholder variable for our input x, and
then we're going to build our flow graph. So how does this look in code? We're gonna import our TensorFlow package,
we're gonna build a python variable b, that is a TensorFlow variable. Taking in initial zeros of size 100. A vector of 100 values. Our w is going to be a TensorFlow
variable taking uniformly distributed values between -1 and 1, of shape 784 by 100. We're going to create a placeholder for
our input data that doesn't take in any initial values, it just takes in a data
type 32 bit floats, as well as a shape. Now we're in position to
actually build our flow graph. We're going to express h as the TensorFlow ReLU of the TensorFlow matrix multiplication of x and W, and we add b.
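For reference, a minimal sketch of that graph construction, assuming the TensorFlow 1.x-style API used in this lecture (the shapes are just the example values from the slide, and the batch dimension of x is left flexible here):

```python
import tensorflow as tf

b = tf.Variable(tf.zeros((100,)))                      # bias vector of 100 zeros
W = tf.Variable(tf.random_uniform((784, 100), -1, 1))  # weights, uniform in [-1, 1]
x = tf.placeholder(tf.float32, (None, 784))            # input placeholder: dtype and shape only
h = tf.nn.relu(tf.matmul(x, W) + b)                    # h = ReLU(xW + b)
```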
So you can actually see that the form of that line, when we build our h, looks essentially the same as how it would look in NumPy, except we're calling our TensorFlow mathematical operations. And that is absolutely essential because, up to this point, we are not actually manipulating any data; we're only building symbols inside our graph. No data is actually moving through our system yet. You cannot print h and actually see the value it expresses. First and foremost, that's because x is just a placeholder; it doesn't have any real data in it yet. But even if it weren't, you cannot print h until we run it at runtime. We are just building a backbone for
our model. But, you might wonder now,
where is the graph? If you look at the slide earlier, I didn't
build a separate node for this matrix multiplication node, and a different node
for add, and a different node for ReLU. Well, ReLU is the h. We've only defined one line, but I claim that we have all
of these nodes in our graph. So if you're actually try to analyze
what's happening in the graph, what we're gonna do, and
there are not too many reasons for you to do this when you're actually
programming a TensorFlow operation. But if I'm gonna call on my default graph,
and then I call get_operations on it, I see all of the nodes in my graph and
there are a lot of things going on here. You can see in the top three lines that we
have three separate nodes just to define what is this concept of zeroes. There are no values initially assigned yet to our b, but the graph is getting
ready to take in those values. We see that we have all of these
other nodes just to define what the random uniform distribution is. And on the right column we
see we have another node for Variable_1 that is probably
going to be our w. And then at the bottom four lines, we actually see the nodes as they appear in our figure: the placeholder, the matrix multiplication, the addition, and the ReLU.
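As a sketch, inspecting the graph like this takes just a couple of lines in the TF 1.x graph API:

```python
# Print every operation node TensorFlow has registered in the default graph
for op in tf.get_default_graph().get_operations():
    print(op.name)   # e.g. zeros, Variable, random_uniform/..., Placeholder, MatMul, add, Relu
```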
So in fact, the figure that we're presenting on the board is a simplification of what TensorFlow graphs look like. There are a lot of things going on behind the scenes that you don't really need to interface with as a programmer.
mind that this is the level of abstraction that TensorFlow is working
with above the Python code. This is what is actually going
to be computed in your graph. And it is also interesting to see that
if you look at the last node, ReLU, it is pointing to the same object in memory as
the h variable that we defined above. Both of them are operations
referring to the same thing. So in the code before,
what this h actually stands for is the last current node
in the graph that we built. So great, we've defined... question? So the question was about how we're deciding what the values are, and the types. This is a purely arbitrary choice; we're just showing an example, it's not related to anything, it's just part of our example. Okay. Great, so we've defined a graph. And the next question is
how do we actually run it? So the way you run graphs in
TensorFlow is you deploy it in something called a session. A session is a binding to a particular
execution context like a CPU or a GPU. So we're going to take
the graph that we built, and we're going to deploy it on to a CPU or
a GPU. And you might actually be interested to
know that Google is developing their own integrated circuit called
a tensor processing unit, just to make tensor computation extremely fast. It's in fact orders of magnitude faster than even a GPU, and they did use it in the AlphaGo match against Lee Sedol. So the session is any
hardware environment that supports the execution of all
the operations in your graph. So that's how you deploy a graph, great. So lets see how this is run in code. We're going to build a session object, and we're going to call run on two arguments,
fetches and feeds. Fetches are the list of graph nodes whose outputs we want returned; these are the nodes that we're interested in actually computing the values of. The feeds are going to be a dictionary mapping from graph nodes to the actual values that we want to run in our model. So this is where we actually fill in the
placeholders that we talked about earlier. So this is the code that we have earlier,
and we're gonna add some new lines. We're first going to build
a session object called tf.Session. It's gonna take some default environment. Most likely a CPU, but you're able to add in as an argument
what device you want to run it on. And then we're going to call,
first of all, session.run on initialize_all_variables. This is a concept in TensorFlow called lazy evaluation. It means that the evaluation of your graph only ever happens at runtime, and now we can give an interpretation to what runtime means in TensorFlow: it means the session. Once we build the session, we're ready to actually call on the TensorFlow runtime, and it is only then that we actually assign the values that we initialized our b and W with onto those nodes. After those two lines, we're finally in a position to call run on the node we're actually interested in, the h, and we feed in our second argument, a dictionary mapping x, our placeholder, to the values that we're interested in. For now, just some random values.
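Continuing the sketch above, deploying and running the graph might look roughly like this (the random batch is just illustrative; newer TensorFlow versions rename the initializer to tf.global_variables_initializer):

```python
import numpy as np

sess = tf.Session()                        # bind the graph to an execution environment
sess.run(tf.initialize_all_variables())    # only now are b and W given their initial values
# Evaluate h, feeding random data into the x placeholder
h_value = sess.run(h, feed_dict={x: np.random.random((64, 784))})
```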
all the things that are formerly called variables in your graph like b and
w in this case. [BLANK AUDIO] And so the question was. What is the difference between variables
and place holders, and why we might, we might want to use which. So, place, sorry, variables are in most
cases will be the parameters that we're interested in, you can almost think
of them as the direct correspondence, X are data is not a parameter,
we're interested in tuning. In the models we are working with. Additionally, it's important that our
parameters have initializations in our model to begin with. They have a state. Our input doesn't really have
a state as part of our model. If we're gonna take our model and
Export it to somebody else. There's no reason for it to actually
include any real data values. The data is arbitrary, it's the model parameters that
are the foundation of your model. They are what makes your model interesting
and computing what it computes. Great, so what have we covered so far? We first built a graph using variables and
placeholders. We then deploy that graph onto a session
which is the execution environment. And next we will see
how to train the model. So the first question that we might
ask in terms of optimization is how do we define the loss? So we're going to use placeholder for
labels as data that we feed in only at run time and then we're going to build a loss
node using our labels and prediction. The first line in code here is we're going to have this Python variable that is the prediction at the end of your neural network. It's going to be the output of some softmax over whatever it is that your neural network is outputting; it could be a probability vector, it could be a regression. That first line is the end of the feed-forward stage of your neural network; it's what your network is trying to predict. We're then going to create a variable called label that is a placeholder for the ground truth that our model is trying to train against. Now we are ready to create our cross entropy node, which is just like in our assignment one: it's going to be the sum of the labels times the TensorFlow log of the prediction, along each row. So just an interesting point: the sum and log do need to be TensorFlow functions, but TensorFlow will automatically convert addition, subtraction, and element-wise multiplication into TensorFlow operations.
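Continuing the earlier sketch, the loss portion might look like the following; the extra output layer U and the leading minus sign are my own additions to make the example concrete, not something from the slide:

```python
U = tf.Variable(tf.random_uniform((100, 10), -1, 1))   # hypothetical output-layer parameters
prediction = tf.nn.softmax(tf.matmul(h, U))             # end of the feed-forward stage

label = tf.placeholder(tf.float32, (None, 10))          # one-hot ground-truth labels
# Sum label * log(prediction) over each row; axis 1 collapses the columns
cross_entropy = -tf.reduce_sum(label * tf.log(prediction), axis=1)
```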
>> Yep. >> [INAUDIBLE]
>> It's going to sum the row altogether which is what we want
to do since label in the label each row >> Is going to be a one hard vector. So you wanna multiply
that by our prediction. And it's going to multiply it at
the point of the target index. And when we sum that,
it's going to give us the correct result. Everything else will be a zero in that
row, so it's squashing it into a column. Since zero access is the rows,
axis 1 is the columns. So it's gonna collapse The colons. Yes? The question was are the feeds just for
the placeholders? Yes, that is correct. The feeds are just used as a dictionary to
fill in the values of our placeholders. Great, all right, so
we've now defined the loss and we are ready to compute the gradients. So the way this is done in TensorFlow
is we're first going to create an optimizer object. So there's a general abstract class
in TensorFlow called Optimizer, where each of the subclasses is going to be an optimizer for a particular learning algorithm. The learning algorithm that we already use in this class is gradient descent, but there are many other choices that you might want to experiment with in your final project; they have different advantages. So that is just the object to create
an optimization node in our graph. We're going to call on the method of it,
it's called minimize, and it's gonna take in as its argument the node that we actually want to minimize. So this adds an optimization operation to the top of our computational graph. When we evaluate that node, the variable I wrote in the top line called train_step, that is, when we call session.run on train_step, it is going to actually apply the gradients onto all of the variables in our model. This is because the minimize function actually does two things in TensorFlow. It first computes the gradient of our argument, in this case cross entropy, with respect to all of the things that we defined as variables in our graph, in this case b and W. And then it's actually going to apply
the gradient updates to those variables. So I'm sure the question in the back of
all your minds now is how do we actually compute the gradients? So the way it works in TensorFlow is that,
every graph node has an attached gradient operation, has a prebuilt gradient of
the output with respect to the input. And.
So when we want to calculate
the gradient of our cross entropy with respect
to all the parameters, it is extremely easy to just backpropagate
through the graph using the chain rule. So this is where you actually
get to see the main advantage of expressing this machine-learning
framework as this computational graph, because it is very easy for
the application to step backwards, to traverse backwards through your graph,
and at each point, multiply the error signal by
the predefined gradient of our node. And all of this happens automatically, and it actually happens behind
the programmer's interface. Question? The question was whether the gradients are computed with respect to the cross entropy, with respect to all of our variables. So the argument into the minimize function is going to be the node whose gradient it's computing, in the numerator, with respect to, automatically, all of the things we defined as variables in our graph. Note that you can add as another argument which variables to actually apply gradients to, but if you don't, it's just going to automatically do it to everything defined as a variable in your graph. Which also answers a question earlier about why we wouldn't want to call x a variable: because we don't actually want to update it. So how does this look in code? We're just going to add the top
line in the previous slide. We're gonna create a python variable
called train_step that takes in our GradientDescentOptimizer object with a learning rate of 0.5, and we're gonna call minimize on it over the cross_entropy. So you can kinda see that that line encapsulates everything, all of the important information about doing optimization. It knows what gradient step algorithm to use, gradient descent; it knows what learning rate; and it knows what node to compute the gradients over and to minimize, of course. Okay.
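Sketched out, that one line (building on the cross_entropy node above) is just:

```python
# Adds gradient computation + parameter-update ops to the graph; nothing runs yet
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
```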
So let's actually see how to run this in code, the last thing we have to do. Question? Let me answer that. The question was, how does the session know what variables to link it to? I think this answers it: the session deploys all of the nodes in your graph onto the runtime environment. Everything in the graph is already on it, so when you call minimize on this particular node, it's already there inside your session to compute, if that answers it. Okay, so the last thing we need to
do now that we have the gradients, we have the gradient update. It's just to create
an iterative learning schedule. So we're going to iterate over, say, 1,000 iterations; the 1,000 is arbitrary. We're going to call on our favorite dataset and take our next batch; data here is just any abstract data in this arbitrary program. So we're gonna get a batch for our inputs and a batch for our labels. We're then going to call sess.run on our training step variable. So remember, when we call run on that, it applies the gradients onto all the variables in our graph. And it's gonna take a feed dictionary for the two placeholders that we've defined so far, the x and the label, where x and label are graph nodes. The keys in our dictionary are graph nodes, and the items are going to be NumPy data.
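A sketch of that loop, where `data` and its next_batch method are a hypothetical stand-in for whatever batching helper you have, not a real API:

```python
for i in range(1000):                              # 1,000 steps, arbitrarily chosen
    batch_x, batch_label = data.next_batch()       # NumPy arrays from your data pipeline
    sess.run(train_step,
             feed_dict={x: batch_x, label: batch_label})
```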
And this is actually a good place to talk about just how well TensorFlow interfaces with NumPy, because TensorFlow will automatically convert the NumPy arrays we feed into our graph into tensors. So we can insert into our feed dictionary NumPy arrays, which are batch_x and batch_label, and what we get as an output from sess.run is also NumPy: if I defined some variable like output = sess.run(...), that would be a NumPy array of what the nodes evaluate to. Though train_step would return you the gradient update, I believe. Are there any other questions up to that point before we take
a little bit of a turn? Yes. So I actually believe there are some ways to create queues for inputting data and labels; that might be the answer to your question. I can't testify that this is the best method, but it certainly is a simple one, where you can just work with NumPy data, which is what Python programmers are used to. And that is the insert
point into our placeholder. One more question, yes? Your question was, how does a cross
entropy know what to compute? So the cross entropy is going to take in... I haven't fully defined what prediction is; I just wanted to abstract that part away. The prediction is going to be something at the end of your neural network, where all of those are symbols inside your graph, and everything before it is also nodes in your graph. I think this might be a better answer to your question: when you evaluate some node in the graph, like if I were to call session.run on prediction, it automatically computes all of the nodes before it in the graph that need to be computed to actually know what the value of prediction is. Behind the scenes, TensorFlow is going to traverse backwards in your graph and compute all of those operations, and that happens automatically
inside my session. So the last important concept that I
wanna talk about before we move over to the live demo is the concept
of variable sharing. So when you wanna build a large model,
you often need to share large sets of variables and you might
want to initialize them all in one place. For example, I might want to
instantiate my graph multiple times or even more interestingly, I want to
train over, like, a cluster of GPUs. We might not have the ability to do that in this class because of the resource limitations we talked about, but especially moving on from this class, it's often the case that you wanna train your model on many different devices in one go. So how does this work if we're instantiating our model on each of these devices, but we wanna share the variables? So one naive way you might think of doing this is creating a variables dictionary at the top of your code, a dictionary from strings to the variables that they represent. And in this way, if I wanna build blocks below it that depend on these parameters, I would just use this dictionary: I would call variables_dict and index in with those keys. And that might be how I would
want to share my variables. But there are many reasons
this is not a good idea. And it's mostly because it
breaks encapsulation. The code that builds your graph in TensorFlow should always have all of the relevant information about the nodes and operations that you are using. You want to be able to document, in your code, the names of your nodes. You wanna be able to document
the types of your operations and the shapes of your variables. And you kind of lose this information
if you just have this massive variables dictionary at the top of your code. So TensorFlow's solution for this is something called variable scope. A variable scope provides a simple
name spacing scheme to avoid clashes, and the other relevant
function to go along with that is something called get_variable. So get_variable will create a variable for you if a variable with
a certain name doesn't exist. Or it will access that variable
if it finds it to exist. So let us see some examples
about how this works. Let me open a new variable
scope called foo. And I'm gonna called
get_variable with the name v. So this is the first time I'm calling
get_variable on v, so it's going to create a new variable and you'll find that
the name of that variable is foo/v. Calling this variable scope foo is kind of like accessing a directory that we're calling foo. Let me close that variable scope and reopen it with another argument, reuse, set to true. Now if I call get_variable with the same name v, I'm actually going to access the same variable that I created before. So you will see that v1 and v are pointing to the same object. If I close this variable scope again and reopen it, but I set reuse to be false, your program will absolutely crash if I try to run that line again, because you've set it to not reuse any variables, so it tries to create this new variable, but it has the same name as the variable we defined earlier.
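Here is roughly what those examples look like (a sketch; the shape argument is only needed when the variable is first created):

```python
with tf.variable_scope("foo"):
    v = tf.get_variable("v", shape=(1,))   # first call: creates a variable named "foo/v"

with tf.variable_scope("foo", reuse=True):
    v1 = tf.get_variable("v")              # reuse=True: retrieves the existing "foo/v"
assert v1 is v                             # both names point to the same object

# With reuse=False (the default), asking for "v" again would crash, because a
# variable with that name already exists:
# with tf.variable_scope("foo", reuse=False):
#     tf.get_variable("v", shape=(1,))     # raises ValueError
```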
The uses of variable scope will become apparent in the next assignment and over the class, but it is something useful to keep in the back of your mind. So in summary, what have we looked at? We learned how to build a graph in
TensorFlow that has some sort of feedforward or prediction stage where you are using
your model to predict some values. I then showed you how to optimize
those values in your neural network, how TensorFlow computes the gradients, and
how to build this train_step operation that applies gradient
updates to your parameters. I then showed you what it
means to initialize a session, which deploys your graph
onto some hardware that creates like the runtime
environment to run your program. I then showed you how to build some
sort of simple iterating schedule to continuously run and train our model. Are there any questions up to this stage
before we move on in this lecture? Yes? It doesn't, because in feed_dict, you can see that feed_dict takes in the node itself; feed_dict doesn't care what the names of those variables are. Also, whenever you create a variable or a placeholder, there's always an argument that allows you to give a name to that node, not the name of my Python variable, but the name as a TensorFlow symbol. That's a great question: the naming scope changes the name of
the actual symbol of that operation. So if I were to scroll back in the slides
and look at my list of operations. The names of all those operations will be
appended with foo as we created earlier. Maybe one more question before
we move on if there's anything? Yes? Yes, if you load a graph
using get_variable, it will access the same variable across devices. This is why it's extremely important to introduce this idea of variable scope, to share across devices. One more question.
>> Can we share variables- >> The question was, can we share variables across sessions? I believe the answer to
that question is correct. I might be wrong. But I'm not entirely sure,
as of this time. Okay, so
we just have a couple of acknowledgements. When we created this presentation, we consulted with a few other people
who have done TensorFlow tutorials. Most of these slides are inspired by Jon Gauthier and a similar presentation he gave. We also talked with Bharath and Chip. Chip is teaching a class, CS20SI,
TensorFlow for Deep Learning Research. So we are very grateful to all the people
we talked with to create these slides. And now, we will move on
to the research highlights, before we move on to the live demo. >> Hi everyone. Can you guys hear me okay? Hi my name's Alan. And let's take a break from TensorFlow and talk about something
also very interesting. I'm gonna present a paper
called Visual Dialog. Here's a brief introduction. Basically, in recent years we are witnessing rapid development and improvement in AI, especially in natural language processing and computer vision. And many people believe that the next generation of intelligent systems will be able to hold meaningful conversations with humans in natural language, based
on the visual content. So for example it should be able
to help blind people to understand their surroundings by
answering their questions. Or you can integrate them together
with AI assistants such as Alexa to understand
people's questions better. And before I move on to the paper,
let's talk about some related work. There have been a lot of efforts trying
to combine natural language parsing and computer vision. And the first category
is image captioning. Here, I'm gonna introduce two works. The first one is a paper called Show,
Attend, and Tell. Which is an extension of
another paper called Show and Tell with some attention mechanisms. And the second one is an open-source
code written by Andrej Karpathy. In both cases, the models are able to give you a description of the image. And for the second case,
that's a typo right there. It should be Video Summary. Basically, the model's able to
summarize the content of the video. So imagine if you are watching a movie and
you don't wanna watch the whole movie. You wanna see what's the main
content of the movie. This model would be pretty useful. And in this category is
Visual-Semantic Alignment. So instead of giving a description for
each image, this model actually gives a description for
each individual component in the image. And as we can see on the right, the data
collection process, it's very tedious. Because you actually need to draw
a lot of bounding boxes and give a description to every single one. And the next one is more related to our paper, which is called visual question answering. Basically, given an image and a question, the model answers the question
based on the visual content. And in this case as you can see
the answers are either binary, yes or no, or very short. So one number, or a circle or
different types of shapes. And this paper, Visual Dialog, actually
tries to solve the issue I just mentioned. And it proposes a new AI
task called Visual Dialog. Which requires an AI agent to
hold meaningful conversations with humans based on the visual content. It also develops a novel data collection protocol. And in my opinion,
this is the best invention ever. Because you make contributions to science,
make money, and socialize with people
all at the same time. And it also introduces the family of
deep learning models for visual dialog. And I'm not gonna go into
too many details today, because we are gonna cover deep
neural networks later in this class. This model encodes the image using
a convolutional neural network, and encodes the question and the chat history using two recurrent neural networks. It then concatenates the three representations together as a vector. This is followed by a fully connected layer and a decoder, which generates the answer based on the representation. And here's some analysis of the dataset. As you can see, the dataset is much
better than the previous work, because there are more unique answers. And also the question and answers tend
to be longer, and here are some results. They actually show the model in
the form of a visual chat bot. Basically you can chat with
a robot online, in real time. And if you guys are interested,
please try it, [LAUGH] and that's it. >> [APPLAUSE] >> All right, let's get started then. So we're gonna start
with linear regression. I'm sure that if you have taken CS 221 or CS 229, then you have heard of and coded up linear regression before. This is just gonna be a start to get us
familiarized with TensorFlow even better. So we're gonna start at,
what does linear regression do again? It takes all your data and
tries to find the best linear fit to it. So imagine house prices with time for
example or location, it's probably a linear fit. And so we generate our data set
artificially using y equals 2 x plus epsilon where epsilon is sampled
from a normal distribution. I won't really go much into
how we obtain the data. Because that, we assume, is normal Python
processing and not really TensorFlow so we will move on. And actually start implementing linear
regression and the function run. So in this first function,
linear regression, we will be actually
implementing the graph itself. As [INAUDIBLE] said, we will be
implementing and defining the flow graph. So let's get started. So, first,
we're gonna create our placeholders, because we're gonna see how
we can feed in our data. So, we have two placeholders here,
x and y. So, let's just start with creating x
first, and this is gonna be of type float, so we are gonna make it float32. And it's gonna be of some shape; we're gonna make this slightly more general and have it of shape None. What this means is that you can dynamically change the batch size, the number of examples that you send to your network, or in this case, your linear model. And it's just a row vector here. All right, and we're gonna name it x. We're gonna create y
which is the label and which will also be of the same type and
shape as well. All right, and we're gonna name it y. All right, so
now that we have defined our placeholders, we're gonna start
creating other variables. So we start with first
by defining our scope. So let's say tf.variable_scope, and we're gonna name it just a lreg,
because linear regression. And we're gonna call it scope, all right? So now that we are here,
we're gonna create our matrix which is w. So we're gonna call it tf.Variable and
since it's just a linear regression, it'll just be a single integer or not
an integer, my bad, but just one number. And we're gonna randomly initialize
it with np.random.random; let's use the normal distribution rather, np.random.normal. Let's do that, yeah. And we're gonna call it w. Now we're gonna actually build
the graph now that we have defined our variables and placeholders. We're gonna define y_pred
which is just prediction and it's gonna be given by tf.mul(w, x). So far so clear, any questions? Yes. Yeah, so as I mentioned earlier, none in
this case is so that you can dynamically change the batch size that you send to your network. So imagine, like, if I'm doing hyperparameter tuning, I don't want to go and change the shape to be 10 or 32 or 256 later on. You can imagine that you're dynamically saying, okay, I'm gonna change the number of examples that I'm gonna send to my network. Does that answer your question? Yes. So as we mentioned,
it'll just go into the variable scope and then define the name as it pleases,
so, yeah. All right, so let's go ahead. So now that we have our prediction, the next logical thing is to
actually compute the loss. So this we are gonna do with just the L2 norm. Let's first get the norm itself, and that's gonna be given by square, and we just do (y_pred - y). And since we want it to be a single value, it's gonna be a reduce_sum over that; let's use reduce_mean rather, all right. Okay, so now, with this we have finished building our graph. And so now we'll return x, y, y_pred, and we'll return the loss as well from our linear regression model.
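Assembled, the graph-building function just described might look roughly like this (tf.multiply is the newer name for the tf.mul said in lecture; treat this as a sketch of the demo code, not the exact file):

```python
import numpy as np
import tensorflow as tf

def linear_regression():
    # Placeholders; shape (None,) so the number of examples fed in can vary
    x = tf.placeholder(tf.float32, shape=(None,), name='x')
    y = tf.placeholder(tf.float32, shape=(None,), name='y')

    with tf.variable_scope('lreg') as scope:
        # One randomly initialized scalar weight (cast to float32 to match x)
        w = tf.Variable(np.random.normal(), name='w', dtype=tf.float32)
        y_pred = tf.multiply(w, x)                     # prediction: w * x
        loss = tf.reduce_mean(tf.square(y_pred - y))   # mean squared error

    return x, y, y_pred, loss
```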
Now we're gonna actually start computing what's in the graph. We first start by generating our dataset, and I'm gonna fill in code here which will define the training procedure for it. All right, so
let's get started on that part. So first we get what we call the model. We make an instance of it, and
that's just gonna be given by this. All right, so once we have that,
we are gonna create our optimizer. And this is gonna be given by,
as Barak mentioned earlier in his slides, the GradientDescentOptimizer. And we are gonna define the learning rate to be 0.1, and we are gonna minimize over the loss that we just got from our model. All right, any questions so far? We just created our optimizer. Okay, now we are gonna start a session. So, with tf.Session() as session. And we are first gonna initialize,
yeah, that's one thing I forgot. We are gonna first initialize our
variables, as someone earlier asked why we would do that. So this is actually a new function, so it's slightly different from what Barak mentioned. And this sort of shows how quickly the TensorFlow API changes: since the time Barak made the slides and I wrote the code, it's already been updated, so we're going to change that. So this is just initializing the variables here, with the new initializer, all right. So we created a session. And now, we are gonna run the init op, which
is just initialization of variables. Now, we are gonna create our feed_dict. And so what we are gonna feed in
is essentially just x_batch and y_batch which you got from our
regenerate dataset function. And y here would be y_batch. All right, now we're gonna actually just
loop over our data set multiple times, because it's a pretty small dataset. 30s is just our arbitrary chosen here. We are gonna get our loss value and
optimize it. I'll explain the step in a second. So now we're gonna call run and
what we want to fetch is the loss and the optimizer and
we are gonna feed in our feed dict. Does anyone have any
questions on this line? All right, and we are just gonna print for
the loss here. And then since this is an array, we are just going to want the mean
because we have almost 101 examples. All right, so now that we're done
with that, we can actually go and train our model, but we'd also like to
see how it actually ends up performing. So, what we are gonna do is we are gonna
actually see what it predicts, and how we get that is, again,
calling the session.run on y_pred. So we are gonna fetch y_pred. And our feed dictionary
here will be just this. All right.
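And a sketch of the run function as described (generate_dataset stands in for the NumPy data helper from the demo; its exact name is an assumption):

```python
def run():
    x_batch, y_batch = generate_dataset()            # toy data: y ≈ 2x + noise
    x, y, y_pred, loss = linear_regression()
    optimizer = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

    with tf.Session() as session:
        session.run(tf.global_variables_initializer())
        feed_dict = {x: x_batch, y: y_batch}
        for _ in range(30):                          # 30 passes over the small dataset
            loss_val, _ = session.run([loss, optimizer], feed_dict)
            print('loss:', loss_val)                 # already a scalar, thanks to reduce_mean
        y_pred_batch = session.run(y_pred, {x: x_batch})   # predictions, e.g. for plotting
    return y_pred_batch
```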
Yes. So the optimizer was defined as
a GradientDescentOptimizer here. So you can see we don't care about what run returns for that fetch, which is why I just ended up with a blank there; it's just syntax. So over here you see I'm discarding that return value. Yeah, all right, so we can actually go and
start running our code, and see how it performs, okay? All right. So let's actually go and
run our [INAUDIBLE]. Let's see how that performs. So you see the loss decrease, and we can actually go ahead and
see how it turns out. Okay, I guess it didn't like my tmux. Anyways, you can see we fit a linear line over the data. All right, so far so good. All right, so
now we are actually gonna go and implement word2vec using Skip-gram which
is slightly gonna be more complex. This was just to see how we create
very simple models in TensorFlow. All right, let's go ahead. So now. Any questions so far? Yes? All right.
So, in this part, let's refresh our understanding of word2vec again. Say we have the following sentence, a completely unbiased statement here: "the first CS224n homework was a lot of fun." If we were to make a dataset out of this sentence, consider a window size of one here: we would pair each target word, like cs224n, with its neighboring context words. So we are just basically decomposing the sentence into a dataset. Remember that skip-gram tries to predict each context word given the target word, and the number of context words here is just two, because our window size is one. And so the task becomes to predict, for example, 'first' and 'homework' from 'cs224n', and 'fun' from 'of', and so on. And so this is our dataset. So that's just clarifying what
So let's go ahead and start implementing that. I've already made the data
processing functions here. So we won't have to deal with that. We have our batches. And this function load data already
loads the pre-processed training and validation sets. The training data is a list of batch input and label pairs, and we have about 30,000 of those, which we are going to train on here. The validation data is just
dictionary from word index to word. Right? So let's start and go ahead and
implement Skipgram first. All right. So we are, again, going to start
by defining our placeholders. And so this is going to be batch inputs. And we are going to define
a placeholder here. But in this case, since we just have integers, we can define it with int32, and the shape is going to be just batch_size. So we have that. And we can avoid naming things here, because we are not going to call multiple variable scopes; that will be fine. Then we go and create our batch labels, which is, again, a tf.placeholder of int32. This will also be of
the same shape as previous. And finally, we will go and
create a constant for our validation set. Because that is not going
to change anytime soon. And that is going to be defined by a val_data which we previously loaded. And we have to define what type it is. And the type for that is int32 again. Just like our training set. All right. Now that we have defined, yes? So since I'll be applying
transposes later, I just wanted to make sure of the shape; it doesn't really make that big of a difference. In this case, I'll be calling transpose on the labels, which is why I just wanted to make sure that it transposes fine. You wouldn't have to; it's just that I wanna make it absolutely clear that it's a row vector, not a column vector. >> [Question]
>> Yeah, exactly. All right. So now we can go and
start creating our scope for. All right. So, this is where we'll define our model. And first, we are going to go and
create an embeddings matrix, as we all did in our assignment. That's going to be a huge variable, and it's going to be initialized randomly with a uniform distribution. This is going to take vocabulary size, which we previously defined at the top, and it's going to take embedding size. So this is going to be the number of words in your dictionary times your embedding size. And since it's a random uniform distribution, we're also going to give the parameters for that. So far so good? All right, so
we just created our embeddings. Now, since we want to
index with our batch, we are going to create batch embeddings. And we are going to use this function, which is actually going to be pretty handy for your current assignment: we do an embedding_lookup on the embeddings, and we put in the batch inputs here. All right. Finally, we go and create our weights, and we are going to call tf.Variable here. We are going to use a truncated normal distribution, which is just a normal distribution cut off at two standard deviations instead of going off to infinity. Okay.
All right. This is also going to be of
the same size as previously, vocabulary size by embedding size, and this is because it interacts with our input directly. Since this is a truncated normal, we need to define what the standard deviation is, and that is going to be given by one over the square root of the embedding size itself. Okay. Finally, we go and create our biases, which are also going to be variables, initialized with zeros of size vocabulary_size. All right. Now we define our loss function,
now that we have all our variables. So, in our assignment we used the softmax cross entropy, or the negative log likelihood; in this case, we'll be using something similar. And this is where TensorFlow really shines: it has a lot of loss functions built in. We are going to use the one called NCE, noise-contrastive estimation, I forget the exact name. But it is very similar, in the sense that the words that should come up with a higher probability are emphasized, and the words which should not appear are pushed toward a lower probability. And so we are going to call tf.nn.
in TensorFlow, our module. And this is going to take
a couple of parameters; you can look up the API. Yes? The embeddings? All right, those are the word vector representations, which are what you're trying to learn. No, w is the weight matrix; that is a parameter that you're also trying to learn, but it interacts with those representations. Effectively, you can think of these
So, our embeddings is defined
as the vocabulary size. So let's say we have 10,000
words in our dictionary. And each row is now the word
vector that goes with that word. Index of that word. And since our batch is only
a subset of the vocabulary, we need to index into that EH matrix. With our batch, which is why we used
the embedding lookup function, okay. All right, so we're gonna go and just use, this API obviously everyone would need to
look up on the TensorFlow website itself. But what this would do is now take
the weights and the biases and the labels as well. Okay. I defined them as batch_labels. And they also take an input,
which is batch_inputs. Okay. And so here's where TensorFlow really
shines again: the num_sampled. In our dataset, we only have positive samples, in the sense that we have the context words and the target word. We also need noise words alongside the context words, and this is where num_sampled comes into use; we defined num_sampled to be 64 earlier. What it essentially does is sample 64 words which are not the actual context words, which are noise words. And this would serve as a sort
of negative examples so that our network learns which words are
actually context words and which are not. And finally, our num_classes is
defined by our vocabulary size again. All right. With that,
we have defined our loss function. And now we have to take the mean of that, because the loss comes back per example in the batch, and we get that with reduce_mean. This is gonna be slightly nasty. So the loss is given for that particular batch, and yes, exactly, it's given for multiple samples; since we have multiple samples in a batch, we want to take the average of those. Exactly. Okay.
And, so great. And now we have completely
defined our loss function. Now we can go ahead and actually,
if you remember from the assignment, we take the norm of these word vectors. So let's go ahead and do that first. So that will be reduce_mean, this is just API calling, which is very valuable and detailed on the TensorFlow website itself. All right. Keep_dims=True. So this is where, in this, I have added
an argument called keep dimensions. And this is where, if you sum over a
dimension, you don't it to disappear, but just leave it as 1. Okay. And now we divide the embeddings with the
norm, to get the normalized_embeddings. embeddings/norm. Great. And now we return from,
we get batch inputs, we return batch labels because
this will be our feed. We have normalized embeddings. And we have loss. All right. With this done,
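Putting the pieces together, the skip-gram graph function sketched from this walkthrough might look as follows (batch_size, embedding_size, vocabulary_size, num_sampled, and val_data are assumed to be defined earlier, as in the demo; the norm line follows the standard TensorFlow word2vec tutorial formulation):

```python
def skipgram():
    batch_inputs = tf.placeholder(tf.int32, shape=[batch_size])
    batch_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    val_dataset = tf.constant(val_data, dtype=tf.int32)

    with tf.variable_scope('word2vec') as scope:
        # One row per vocabulary word, initialized uniformly in [-1, 1]
        embeddings = tf.Variable(
            tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
        batch_embeddings = tf.nn.embedding_lookup(embeddings, batch_inputs)

        # Output weights and biases used by the sampled (NCE) loss
        weights = tf.Variable(tf.truncated_normal(
            [vocabulary_size, embedding_size],
            stddev=1.0 / embedding_size ** 0.5))
        biases = tf.Variable(tf.zeros([vocabulary_size]))

        # Noise-contrastive estimation with num_sampled negative words, averaged over the batch
        loss = tf.reduce_mean(tf.nn.nce_loss(
            weights=weights, biases=biases, labels=batch_labels,
            inputs=batch_embeddings, num_sampled=num_sampled,
            num_classes=vocabulary_size))

        # Cosine-similarity machinery for the validation words
        norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
        normalized_embeddings = embeddings / norm
        val_embeddings = tf.nn.embedding_lookup(normalized_embeddings, val_dataset)
        similarity = tf.matmul(val_embeddings, normalized_embeddings, transpose_b=True)

    return batch_inputs, batch_labels, normalized_embeddings, loss, similarity
```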
we can come back to this function later. There's a slide part missing,
which we'll get back to. Yes? Thank you. All right. So now we go and define our run function. How are we doing on time? Okay, we have 20 minutes, great. We actually make a object of our model. And that's just by calling. Embeddings. And loss from our function, which was
just called word2, or skipgram, rather. Okay, and now we initialize the session. And over here, again,
I forgot to initialize our variables. We can call that; we just initialize all of our variables to their default values, as Barak mentioned again. Now we are gonna go and actually loop over our data to see if we can actually go ahead and train our model. So let's do that first step: for each batch_data in train_data. So for each iteration in this for
loop, we are gonna obtain a batch, which has its input data
as well as the labels. Okay, so we have inputs and labels from our batch. Great.
And we can now define our feed_dictionary accordingly, where batch_inputs maps to inputs, so this is a dictionary, and our batch_labels would just map to labels. Any questions so far? Okay. We go ahead and
call our loss function again. And we do this by calling session.run, where we fetch the optimizer again and
the loss. And we pass in our feed dictionary,
which we already defined above. Okay. We'll get the loss. And since we are trying to get
the average, we're gonna add it first and then divide by the number of
examples that we just saw. All right. So we're just gonna put a couple of
print statements now just to make sure, to see if our model actually goes and
trains. And see. Print loss. Step [INAUDIBLE]
and then average loss. Since the loss will be zero in the first step, we can just [INAUDIBLE]. All right, so
now we have our average loss. And we reset our average loss again, just so that we don't keep accumulating it for every iteration of the loop. Okay.
So we have almost finished
our implementation here. However, one thing that... yes? I forgot to define that, good call. So we can define that at the beginning of the run step: a GradientDescentOptimizer, we'll set a learning rate, and we're gonna minimize the loss. All right, thanks for that, okay. One thing that we're missing here is we haven't really dealt with our validation set. So, although we are training on our training set, we would wanna make sure that it actually generalizes to the validation set. And that's the last part that's missing, and we're just gonna do that now. But before we do that,
there's only one step missing. Where we, once we have the validation set, we still need to see how similar
our word vectors are with that. And we do that in our flow graph itself. So, let's go back to
our skip gram function. Anyway here we can implement that, okay. So, we have our val_embeddings against
index into the embeddings matrix to get the embeddings that
correspond to the validation words. And we use the embedding look up
function here, embedding_lookup, embedding and we call in train data set or
val data set. We'll actually use the
normalized_embedding because we are very concerned about the cosine similarity and not necessarily about
the magnitude of the similarity. Okay, and we use val_dataset here. Okay, and the similarity is
essentially just a cosine similarity. So, how this works is we matrix multiply the val_embeddings which
we just obtained and the normalized_embeddings. And since they won't work, you can
just may just multiply both of them because of dimensional incompatibility,
we'll have to transpose_b. And this is just another flag. All right, since we also
returned this from our function, again this is just a part of the graph. And we need to actually execute the graph,
in order to obtain values from it. And here we have similarity. Okay, and let's do,
since this is a pretty expensive process, computationally expensive,
let's do this only every 5,000 iterations. All right. So, the way we're gonna do this is by calling eval on the similarity matrix; what this does is, since we have this node, it actually goes and evaluates it. This is equivalent to calling session.run on similarity and fetching that, okay. So we call it, and we get the similarity, and for every word in our validation set, we're gonna find the top_k words that are closest to it. And we can define k to be 8 here. And we will now get the nearest words. So let's do that. And we'll sort it according
to their magnitude. And since the first word that will
be closest will be the word itself, we'll want the other eight words and
not the first eight words and this will be given by top_k+1,
any questions so far? Yes? Right, so your embedding matrix is the number of words you have in your vocabulary times the size of your word embedding for each word. So it's a huge matrix, and since the batch that you're currently working with is only a subset of that vocabulary, this function, embedding_lookup, actually indexes into that matrix for
you and obtains the word vectors. This is equivalent to some complicated Python slicing that you could do with matrices, but it's just nice syntax that does it for you. Okay, all right, almost there, okay. So, we have our nearest keywords; we'll just go and call this function I have in my utils, you can check it on the GitHub that we'll post after the class is over, and you can play with it as you wish. We pass in nearest and the reverse_dictionary so we actually see the words and not just numbers, all right. Finally, we obtain our final_embeddings, which will be the normalized_embeddings at the end of training, and we're just going to call eval on that again, which is equivalent to calling session.run and fetching it. All right, we are done with the coding.
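Sketched end to end, the run function described in this demo could look roughly like this (train_data, val_data, reverse_dictionary, learning_rate, and top_k are assumed to come from the loading and setup code shown earlier; treat it as an approximation of the demo, not the exact file):

```python
def run():
    batch_inputs, batch_labels, normalized_embeddings, loss, similarity = skipgram()
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

    with tf.Session() as session:
        session.run(tf.global_variables_initializer())

        average_loss = 0.0
        for step, (inputs, labels) in enumerate(train_data):
            feed_dict = {batch_inputs: inputs, batch_labels: labels}
            _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
            average_loss += loss_val

            if step % 1000 == 0 and step > 0:
                print('step', step, 'average loss', average_loss / 1000)
                average_loss = 0.0

            if step % 5000 == 0:                      # expensive, so only occasionally
                sim = similarity.eval()               # same as session.run(similarity)
                for i, val_word in enumerate(val_data):
                    nearest = (-sim[i, :]).argsort()[1:top_k + 1]   # skip the word itself
                    neighbors = [reverse_dictionary[idx] for idx in nearest]
                    print(reverse_dictionary[val_word], neighbors)

        final_embeddings = normalized_embeddings.eval()
    return final_embeddings
```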
We can actually see and visualize how this performs. Okay: python word2vec... oops, I must have missed something. I missed a bracket again. So, we'll first load up our dataset, and then it will iterate over it and we will use our skip-gram model. Oops, let's see that. All right, where did this go? Okay. Perfect. So, as you can see here, we have 30,000 batches, each with a batch size of 128. Ahh, man, [LAUGH] let's see. All right, so as we see,
the loss started off at 259, all right, drops to 145, and then keeps decreasing; I think it goes somewhere to around 6. Here we can also see it printing the nearest words, things like 'leaders' and 'orbit'; this gets better with time and with more training data. We only use around 30,000 examples, so
to actually get better. And in the interest of time,
I'm only limited to around 30 epochs, yes. So TensorFlow comes with TensorBoard, which I didn't show in
the interest of time. Essentially, you can go to your localhost and then see the entire graph and how it's organized. And so that'll actually be a huge debugging help, and you can use that for your final project. TensorBoard, yeah. All right, well, thank you for your time.