The following content is
provided under a Creative Commons license. Your support will help
MIT OpenCourseWare continue to offer high-quality
educational resources for free. To make a donation or to
view additional materials from hundreds of MIT courses,
visit MIT OpenCourseWare at ocw.mit.edu. JEREMY KEPNER: All
right, welcome. Thank you so much for coming. I'm Jeremy Kepner. I'm a fellow at
Lincoln Laboratory. I lead the Supercomputing
Center there, which means I have the
privilege of working every day with pretty
much everyone at MIT. I think I have the
best job at MIT because I get to help you all
pursue your research dreams. And as a result of that,
I get an opportunity to see what a really wide
range of folks are doing and observe patterns between
what different folks are doing. So with that, I'll get started. This is meant to be some
initial motivational material, why you should be
interested in learning about this mathematics,
this mathematics of big data and how it relates to machine
learning and other really exciting topics. It is a math course. We will be going over
a fair amount of math. But we really work hard to make
it very accessible to people. So we start out with a really
elementary mathematical concept here, probably one that
hopefully most of you are familiar with. It's the basic concept
of a circle, right? And I bring that up
because many of us know many ways to state
this mathematically, right? It's all the points
that are an equal distance from a particular point. There's other ways
to describe it. But this is a basic
mathematical concept of a circle that many of us
have grown up with. But, of course, the other thing we know, and this is the big idea: although I can write down an equation for a circle, which is the equation for a
perfect, ideal circle, we know that such things don't
actually exist in nature. There is no true perfect
circle in nature. Even this circle that we've
drawn here, it has pixels. If I zoomed in on it, if
I zoomed in on it enough, it wouldn't look
like a circle at all. It would look like
a series of blocks. And so we have that approximation process, right, where we have a mathematical concept of an ideal circle that we know doesn't really exist in nature, but we understand that it is worthwhile to think about these mathematical ideals, manipulate them, and then take the results of the manipulation back into the real world. That's a really productive
way to think about things and, really, the basis for a
lot of what we do here at MIT. This concept is essentially the basis of modern and ancient Western thought on mathematics. If you remember your history courses, this concept of ideal shapes and ideal circles is the foundation of Platonic mathematics from some 2,500 years ago. And at the time they were developing that concept -- this idea that there are ideal shapes out there and that thinking about them and manipulating them was a more effective way to reason about the real world -- there was a lot of skepticism. You could imagine
2,500 years ago someone is walking
around and saying, I believe there are
these things called ideal circles and ideal
squares and ideal shapes. But they don't actually
exist in nature. That would probably
not be well-received. In fact, it was
not well-received. Many of those philosophers
who were thinking about this were very negatively received. And, in fact, if
you want to learn about how negative the
response was to this, I encourage you to go and read
the Allegory of the Cave, which is essentially the story of
these philosophers talking about how they're
trying to bring the light of this knowledge
to the broader world and how they essentially
get killed because of it, because people don't
want to see it. So that struggle they
experienced 2,500 years ago, it exists today. You as people at MIT will try
and bring mathematical concepts into environments
where people are like, I don't see why that's relevant. And you will experience
negative inputs. But you should rest assured
that this is a good bet. It's worked well for
thousands of years. You know, it's what
I base my career on. People ask me, well,
what's the basis of it? Well, I'm just
betting on math here. It's been a good tool. So this is why we're beginning
to think this way when we talk about big data
and machine learning. So really looking at the fundamentals: what are the ideals that we need in order to effectively reason about the problems that we're facing today in the virtual world, right? And the fact that this mathematical concept describes the natural world so well, and also describes the virtual world, is sometimes called the
unreasonable effectiveness of mathematics. You can look that up. But people talk about math. Why does it do such a good job
of describing so many things? And people say, well,
they don't really know. But it seems to be a good bit of
luck that it happens that way. So circles, that gets
us a certain way. But in most of the
fields that we work with, and I would say that, in
almost any introductory course that you take in college,
whatever the discipline is, whether it be chemistry
or mechanical engineering or electrical engineering
or physics or biology, the basic fundamental
theoretical ideas that they will
introduce to you will be the concept of a linear model. So there we have a
linear model, right? And why do we like
linear models? And again, it can be physics. It can be as simple as F = ma. Or, in chemistry, it can be some kind of
chemical rate equation. Or in mechanical
engineering it can be basic concepts of friction. The reason we like these
basic linear models is because we can
project, right? I know that if that
solid line represents what I believe
to-- you know, if I have evidence to support
that that is correct, then I feel pretty good about
projecting maybe where I don't have data or into a new domain. So linear models allow
us to do this reasoning. And that's why in the
first few weeks of almost any introductory course they
begin with these linear models, because they have proven
to be so effective. Now, there are many
non-linear phenomena that are tremendously important, OK? And as a person who deals
with large-scale computation, those are a staple
of what people do. But in order to do non-linear
calculations or reason about things
non-linearly, it usually requires a much more complicated
analysis and much more computation, much more data. And so our ability
to extrapolate is very limited, OK? It's very limited. So here I am talking
about the benefits of thinking mathematically,
talking about linearity. What does this have to do with
big data and machine learning? So we would like to be able to
do the same things that we've been able to do in other
fields in this new emerging field of big data. And this often
deals with data that doesn't look like the
traditional measurements we see in science. This can be data that has to do
with words or images, pictures of people, other types
of things that don't feel like the kinds of data
that we traditionally deal with in science
and engineering. But we know we want
to use linear models. So how are we going to do that? How can we take this
concept of linearity, which has been so powerful
across so many disciplines, and bring it to this field that just feels completely different than the kinds of data that we have? So to begin with, I need to
refresh for you what it really means to be linear. Before, I showed you a line
and, hence, the line, linear. But mathematically, linearity
means something much deeper. And so here's an equation
that you may have first seen in elementary school. We basically have
two times (three plus four) is equal to (two times three) plus (two times four). That is called the
distributive property. It basically says multiplication
distributes over addition. And this is the
fundamental reason why I would say mathematics
works in our world, right? If this hadn't been true, then very early on, in the earliest days of inventing mathematics, it would not have been very useful, right? To say that I have two of (three plus four) of something, OK, and then be able to rearrange it and compute it in this other way, that's really what makes mathematics useful. And from a deeper perspective,
the distributive property is basically what
makes math linear. This is the property that,
if this property holds, then we can reason
about a system linearly. Now, you're very familiar
with this type of mathematics, but there's other
types of mathematics. So if you'll allow me, let me just replace those multiplication symbols and addition symbols with this funny circle times (⊗) and circle plus (⊕). And we'll get to why
I'm going to do that. Because it turns out that, while you have spent most of your careers with traditional arithmetic multiplication and addition, the kind you would do on your calculator or have done in elementary school, there are other pairs of operations that also obey this property,
this distributive property, and, therefore, allow
us to potentially build linear models of very
different types of data using this property. So, as I mentioned,
the classic two are: circle plus is just equal
to regular arithmetic addition, as we show on the first
line, and circle times is equal to regular
arithmetic multiplication. So those are the standard ones. And, by far, this pair,
this is the most common pair that we use across
the world today. But there are others. So, for instance, I can
replace the plus operation with max and the multiplication
operation with addition, OK? And the above
distributive equation will still hold, right? That's a little confusing. I often get confused that multiplication is now addition. But this pair, sometimes referred to as max-plus -- you'll sometimes hear about it as the max-plus algebra -- is actually very important in machine learning and neural networks. The back end of the rectified linear unit is essentially this operation. If you didn't understand
what that meant, that's OK. We'll get to that later. It's very important in finance. There are certain
finance operations that rely on this
type of mathematics. There are other pairs, also. So here's one. I can replace addition with
union and multiplication with intersection, right? Now, that also obeys
that linear property. This is essentially
the pair of operations that, anytime you make
a transaction and work with what's called a
relational database, that's the mathematical operation
pair that's sitting inside it. It's why those databases work. It allows us to reason about
queries, which are just a series of
intersections and unions, and then reorder them in a more efficient way. In databases, this is
called query planning. And if that property
wasn't true, we wouldn't be able to do that. So this is a deep
property of that. So we can put all different
types of pairs in here and reason about them linearly. And this is why many, many of the systems we use today work.
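To make that concrete, here is a tiny sketch in Python (my own illustration, not from the slides) that checks the distributive property a ⊗ (b ⊕ c) = (a ⊗ b) ⊕ (a ⊗ c) for the three pairs just described; the helper function name is made up.

```python
# Check a (x) (b (+) c) == (a (x) b) (+) (a (x) c) for a given pair.
def distributes(oplus, otimes, a, b, c):
    return otimes(a, oplus(b, c)) == oplus(otimes(a, b), otimes(a, c))

# Standard arithmetic: circle plus = +, circle times = *
print(distributes(lambda x, y: x + y, lambda x, y: x * y, 2, 3, 4))  # True

# Max-plus: circle plus = max, circle times = +
print(distributes(max, lambda x, y: x + y, 2, 3, 4))                 # True

# Sets: circle plus = union, circle times = intersection
print(distributes(lambda x, y: x | y, lambda x, y: x & y,
                  {1, 2}, {2, 3}, {3, 4}))                           # True
```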
And so this class is about really exposing that:
the mathematics that allows us to think linearly
about data that we haven't really thought of
as maybe obeying some kind of linear model. This is essentially the
critical point of this class. So it goes beyond that, though. So hopefully you'll allow
me to replace those numbers with letters, right? So that's basic algebra there. Just for a refresher,
in the previous equation, we had A = 2, B = 3, C = 4. But these variables, or these letters, are not limited to being just simple scalar numbers, in this case, real numbers or integers or something like that. They can be other things, too. So, for instance, A, B, and
C could be spreadsheets. And that's something we'll
go over extensively in the class, so that
I can basically have A, B, and C be whole
spreadsheets of data and the linear equation
will still hold. And, in fact, that's probably the key concept in big data: the necessity to reason about data as whole collections and to transform whole collections. Going and looking at things
one element at a time is essentially the thing that is
extremely difficult to do when you have large amounts of data. A, B, and C can be
database tables, right? Those don't differ too
much from spreadsheets. And as I talked about
in the previous slide, that union/intersection
pair naturally lines up and we can reason
about whole tables in a database using
linear properties. They can be matrices. I think, for those of you
who have had linear algebra and matrix mathematics,
that would have been the first example, right, when
I substituted the A, B, and C and had these linear equations. Often, in many of
the sciences, we think about matrix
operations and linearity as being coupled together. And through the duality
between matrices and graphs and networks, we can
represent graphs and networks through matrices. Any time you work
with a neural network, you're representing that
network as a matrix. And, of course, all these
equations apply there as well and you can reason about
those systems linearly.
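As a quick sanity check of that last point, here is a minimal sketch (values made up) verifying that the distributive property A(B + C) = AB + AC still holds when A, B, and C are whole matrices.

```python
import numpy as np

# The distributive property with whole matrices instead of scalars.
rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((3, 3)) for _ in range(3))

print(np.allclose(A @ (B + C), A @ B + A @ C))  # True (up to roundoff)
```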
So that provides a little motivation there. As we like to say,
enough about me, let me tell you about my book. So this will be the text that
we will use in the class. We are not going to go
through the full text, but we have printed out copies
of the first seven chapters that we will go through. And we will hand those out
later as we go through the class. So let me now switch
gears a little bit and talk about how this
relates to, I think, one of the most
wonderful breakthroughs that we have seen, or
I've seen in my career, and many of my colleagues
here at MIT have seen, which is what's been going
on in machine learning, right, which is-- it's not hype. There's a real there there
and it's tremendously exciting. So let me give you a little
history, basic history of this field. So in a certain sense,
before 2010, machine learning looked like this. And then, after 2015, it
kind of looks like this. So when people talk about
the hype in machine learning, or AI, really deep
neural networks are the elephant inside
the machine learning snake. It has stormed onto the
scene in the last five years and basically allowed us to do
things that we had almost taken for granted were impossible. Just the fact that you're
able to talk to computers and they can understand you,
that we can have computers that can see, at least in a way that
approximates the way humans do, these are really almost
technological miracles that, for those of us
who have been working in this field for fifty years,
we had almost literally given up on. And then all of a sudden
it became possible. So let me give you a little
sense of appreciation for this field and its roots. So machine learning,
like any field, is defined by a set of
techniques and problems. When you ask what defines
a field, you ask, well, what are the problems that they
work on that other fields don't really work on? And what are the techniques
they employ that are not really being employed by other fields? So the core techniques,
as I mentioned earlier, are these neural networks. These are meant to crudely
approximate maybe the way humans think about problems. We have these circles
which are neurons. They have connections
to other neurons. You know, those connections
have different weights associated with them. As information
comes in, they get multiplied by those weights. They get summed together. And if they pass certain
thresholds or criteria, then they send a signal
on to another neuron. And this is, to
a certain degree, how we believe the
human brain works and is a natural starting
point for, how could we make computers
do similar things? The big problems that
people have worked on are these classic problems in machine learning: language -- how do we make computers understand human language; vision -- how do we make computers see pictures or explain pictures back to us the way we would like; and strategy and games and other types of things like that. So how do we get them
to solve problems? This is not new. These core concepts trace
back to the earliest days of the field. In fact, these four
figures here, each one is taken from a paper
that was presented at the very first
machine learning conference in the mid-1950s. So there was a machine learning
conference in the mid-1950s. It was in Los Angeles. It had four papers presented. These were the four papers. And I will say
that three of them were done by folks at MIT
Lincoln Laboratory, which is where I work. And so that was basically
neural networks, language, and vision. And we didn't play
games, so that was it. And you might say,
well, why is it? Why was there so
much work going on in Lincoln Laboratory
in the mid-1950s that they would want to
pioneer in these directions? At that time, people were
first building computers and computers were
very special purpose. So different organizations
around the world were building computers
to do different things. Some were doing them to simulate
complex fluid dynamics systems, think about designing
ships or other types of things like
that or airplanes. Others were doing them to,
say, like what Alan Turing was doing, break codes. And our task was
to help people who were watching radar scopes
make decisions, right? How could computers enable
humans to watch more sensors and see where they're going? How could we do that? So at Lincoln Laboratory, we
were building special purpose computers to do this. And we built the
first large computer with reliable, fast memory. This system had 4,096 words
of memory, which, at the time, people thought was too much. What could you possibly
do with 4,096 numbers? The human brain, of course! Right, that's enough, right? Most of us can remember five,
six, seven digits, right? So a computer that can
remember 4,096 numbers should be able to do things
like language and vision and strategy. So why not? So they went out
and they started working on these problems, OK? But Lincoln Laboratory, being
an applied research laboratory, we are required to get answers
to our sponsors in a few years' time frame. If problems are going to
take longer than that, then they really are the
purview of the basic research community, universities. And it became
apparent pretty early on that this problem was
going to be more difficult. It was not going to
be solved right away. So we did what we often
do, is we partnered. We found some bright young
people at MIT, people just like yourselves. In this case, we found a young
professor named Marvin Minsky. And we said, why don't you go
and get some of your friends together and create
a meeting where you can lay out what the
fundamental challenges are of this field? And then we will figure out how
to get that funded so that you can go and do that research. And that was the famous
Dartmouth AI conference which kicked off the field. And the person leading this
group, Oliver Selfridge at Lincoln Laboratory, basically
arranged for that conference to happen and then subsequently
arranged for what would become the MIT AI Lab that was
founded by Professor Minsky. And likewise,
Professor Selfridge also realized that we would
need more computing power. So he left Lincoln
Laboratory and formed what was called Project MAC,
which became the Laboratory for Computer Science. And then those two entities
merged 30 years later to become CSAIL. So that was the initial thing. Now, it was pretty clear that,
when this problem was handed off to the basic
research community, there was a feeling that
these problems would be solved in about a decade. So we were really
thinking the mid-1960s is when these problems
would be really solved. So it's like giving someone
an assignment, right? You all are given
assignments by professors and they give you
a week to do it. But it took a little longer. In this case, it took not five weeks but five decades to solve this problem. But we have. We have now really,
using those techniques, made tremendous progress
on those problems. But we don't know why it works. So we made this
tremendous progress but we don't really
understand why this works. So let me show you a little
bit what we have learned, and this course will explore
the deeper mathematics to help us gain insight. We still don't
know why it works. At least we can
lay the foundations and maybe you can figure it out. So here I am, fifty
years later, a person from Lincoln Laboratory
saying, "All right. Question one has been answered. Here's question two." Ha. Why does this work? And hopefully you can be the generation that figures it out. Hopefully it'll take
less than fifty years. Historically, with this sort of thing, once we know that it works, it usually takes about twenty years to figure out why. But maybe, you know, some people are smarter and they'll figure it out faster. So this is what a neural
network looks like. On the left you have your input,
in this case, a vector, y zero. It's just these dots
that are called features. What is a feature? Anything can be a feature. That is the power of neural networks: they don't require you to state a priori what the inputs can be. They can be anything. People have said,
well, you know, neural networks,
machine learning, it's just curve fitting. Yeah, but it's curve fitting
without domain knowledge. Because domain knowledge
is so costly and expensive to create that having a
general system that can do this is really what's so powerful. So the inputs: we
have an input feature. It could be a vector,
which we call y sub zero. And that can just be an image,
right, the canonical thing being an image of a cat, right? And that can just be
the pixel values just rolled out into a vector,
and they will be the inputs. And then we have a
series of layers. These are called hidden layers. The circles are often
referred to as neurons, OK? And each line
connecting each dot has a value associated
with it, a weight. And the strength
of the connection between any two neurons
is given by that weight. And then, ultimately,
the output, in this case, the output classification,
the series of blue dots there, are the different
possible categories. So if I put in a cat picture,
one of those dots would be cat, maybe one would be dog, maybe
one would be apple or orange, whatever I desired. And the whole idea is that,
if I put in a picture of a cat and I set all these
values correctly, then the dot
corresponding to cat will end up with the
highest score, right? And then I mentioned earlier
that each one of these neurons collects inputs. And if it's above a
certain threshold, it then chooses to pass on
information to the next. And that's where these b
values, which are vectors, are just the
thresholds associated with each one of those. It's a vector, with one value associated with each one of those neurons. This entire system can
be represented relatively simply with one equation, which is that y_{i+1}, the vector at the next layer, OK, can be computed from the previous vector, y_i, matrix multiplied by the weight matrix, W_i. So whenever you
see transformations from one set of neurons to the
next layer, you should think, oh, I have a matrix that
represents all those weights and I'm going to multiply it by
the vector to get the next one. Then we apply these
thresholds, all right? So we add these,
the b_i's, and then we have a function that
we pass it through. Typically, this h function
has been given the name rectified linear unit. It's much simpler than that. It's just: if the value that comes out of this matrix multiply is greater than zero, don't touch it. Just let it pass through. If it's less than zero,
make it zero, right? You know, it's a
pretty complicated name for a very simple function. That's actually critical. If you didn't have
that h function, this nonlinear
function there, then we could roll up all
of these together and we would just have one
big matrix equation, right? So that's really considered a
pretty important part of it. So that's pretty
much what's going on. When you want to know what the
big deal is of neural networks, that's all that's going on. It's just that equation.
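To make that one equation concrete, here is a minimal sketch in Python with NumPy (my own illustration, not from the lecture): a forward pass y_{i+1} = h(y_i W_i + b_i), where h is the rectified linear unit. The layer sizes and random weights are made up; training is what would actually set them.

```python
import numpy as np

def relu(x):
    # h, the rectified linear unit: pass positive values through,
    # clamp negative values to zero.
    return np.maximum(x, 0)

rng = np.random.default_rng(0)
layer_sizes = [784, 128, 10]             # e.g. input pixels ... output categories

# Random weights W_i and thresholds b_i (training would actually set these).
Ws = [rng.standard_normal((m, n)) * 0.01
      for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
bs = [np.zeros(n) for n in layer_sizes[1:]]

y = rng.standard_normal(layer_sizes[0])  # stand-in for an input image vector
for W, b in zip(Ws, bs):
    y = relu(y @ W + b)                  # y_{i+1} = h(y_i W_i + b_i)

print(int(y.argmax()))                   # index of the highest-scoring category
```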
The challenge is we don't know what the W's and the b's are. And we don't know how many
layers there should be. And we don't know how
many neurons there should be in each layer. And although the features
can be arbitrary, picking the right ones does matter. And picking the right categories does matter. So when people talk about,
I do machine learning or I'm off working
on-- they're basically playing with all
of these parameters to try and find
the ones that will work best for their problem. And there's a lot
of trial and error. And you'll hear
that there are now systems that try to use machine learning to do that process automatically. You know, how do you
make machines that learn how to do machine learning? The basic approach is a
trial and error approach. I take a whole bunch
of pictures that I know have cats in them,
OK, and other things, right? And I randomly set all those
weights and thresholds. And I put in the vector
and I see what the system-- I guess what I think the
number of layers and neurons and all that should be and
I run it through the system and I get an estimate
or a calculation for what I think these
final values should be and I compare it with the truth. That is, I just
basically subtract it. And then I use those corrections
to very carefully adjust the weights. Basically, with the
last weights first, I do what's called back propagation of these little changes to try and make a better guess on
what these weights should be. So if you hear the
term back propagation, that's that process of
taking those differences and using them to adjust
these weights by about 0.01% at a time. And then we just do
this over and over again until eventually
we get a set of weights that we think does the problem
well enough for our purpose. So that's called back
propagation, all right? Once we have the set
of weights and we have a new picture that
we want to know what it is, we drop it in
there and it tells us it's a cat or a dog or whatever. That forward step
is called inference. These are two words
you'll hear frequently in machine learning, back
propagation and inference. And that's all there is to it. There's really nothing else to that.
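Here is a toy sketch of that loop (all data made up, shrunk to a single layer with no h function, so the corrections can be computed by hand): a forward inference step, a comparison with the truth, and a small backward adjustment to the weights, over and over.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 4))        # 100 "pictures", 4 features each
true_W = rng.standard_normal((4, 2))
Y = X @ true_W                           # the truth we compare against

W = rng.standard_normal((4, 2)) * 0.01   # randomly set weights to start
lr = 0.01                                # adjust weights a little at a time

for step in range(2000):
    Y_hat = X @ W                        # forward step: inference
    error = Y_hat - Y                    # basically subtract from the truth
    grad = X.T @ error / len(X)          # propagate the corrections back to W
    W -= lr * grad                       # make a slightly better guess

print(np.abs(W - true_W).max())          # approaches zero as W is learned
```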
If you can understand
this equation, you'll be way ahead of most
people in machine learning, you know? You know, there's
lots of people who understand all the software and the packages and the data. All of them are just doing that. And I'd say one of the most powerful ways to get ahead in your field is to actually understand the mathematical principles. Because then the software
and what it's doing is much clearer. And other people
who don't understand these mathematical principles,
they're really guessing. They're like, oh,
well, I do this and I throw this module in. They don't really know
that all it's doing is making adjustments to
these various equations, how many different layers
there are and stuff like that. Now, why is this important? You're like, well,
what does it matter? As I said before,
we have this system. It works but we don't know why. Well, why is it
important to know why? Well, there's two reasons. One is that, if we want
to be able to apply this incredible innovation to
other domains-- so many of you probably want to do that. Many of you want to say,
how can I apply machine learning to something else
other than language or vision or some of these other
standard problems? I kind of need some
theory to know. Like, OK, if I have a problem
that's like this one over here and I change it in
this way, there's a good chance it'll work. There's some basis for why I'm
going to try something, right? Right now there's a
lot of trial and error. It's like, well, it's an idea. But if you can have
some math that says, you know, I think that
will probably work, that really is a great way
to guide your reasoning and guide your efforts. Another reason is
that-- so here's a picture of a very
cute poodle, right? And the machine learning
system correctly identifies it as poodle. One thing we realized
is that the way you and I see that picture is
actually very, very different than the way the neural network
sees that picture, all right? And, in fact, I can make
changes to that picture that are imperceptible to you
or me but will completely change how the
neural network-- that is, given our neural network,
I can basically make it think anything, right? And so, for instance,
this is a famous paper. And they got the system
to think that that was an ostrich, right? And you can basically show
this for anything, right? So what's called robust AI,
or robust machine learning, machine learning that
can't be tricked, is going to become more
and more important. And again, having a deeper
understanding of the theory is very, very critical to that. So how are we going to do this? What's the main concept
that we are going to go through in this class? This has mostly
been motivational. But how are we going to
understand the data at a deeper level? You know, what's the big idea? And the big idea is
captured now in this, I apologize for this
eye chart slide, which is what we call
declarative, mathematically rigorous data. So we have this
mathematical concept called an associative array. And its corresponding algebra basically encompasses the data you would put in databases, the data that you would put in graphs, the data that you would put in matrices, and it
makes it all a linear system. And the key operations are
outlined there at the bottom. If you recall, we have
our basic little addition and multiplication. And then what's going
to be very important, probably the real
workhorse for this-- and I didn't show it before-- is called essentially array multiplication or
matrix multiplication. And that's the far
one on the right there, which we often abbreviate
just with no symbol, just A B. But if we really want to explicitly call out that it's matrix multiplication, a combination of both multiplication and addition, we put in what we call the punch-drunk emoji, which is a plus dot times (⊕.⊗). You're probably all young
enough that you don't even remember emojis when
you had to type them out with just little characters and we didn't have icons, right? So that meant you went to the bar and lost the fight, right?
to be the workhorse of what we're doing here.