Stanford CS229: Machine Learning | Summer 2019 | Lecture 1 - Introduction and Linear Algebra

So we're going to start off today with some introductions of the teaching team. That's me, Anand. I'm a fourth-year PhD student in computer science; I work with Professor Andrew Ng on machine learning and on applying machine learning to different problems, mostly in healthcare. We also have a wonderful team of teaching assistants; you'll meet the rest of them through the quarter in office hours and such.

OK, so the goals of this course. By the end of the course we want you to be an expert in machine learning, which means you understand the internals of how machine learning algorithms work: not just using them, but understanding what's happening inside the algorithms. That's goal number one. Goal number two is to enable you to build machine learning applications, which means having a good understanding of which algorithms to use in which scenarios: how do you decide whether a given problem is a good machine learning problem (not all problems are), and once you know it is, how do you treat it as a supervised learning problem or an unsupervised learning problem? We'll go through all of those things. The third goal is to enable you to start doing machine learning research, which means making you familiar with many of the terms and concepts you will encounter when you pick up and read a machine learning research paper. Those are the top three goals for this course.

Prerequisites. Machine learning is part mathematics and part computer science; it's really the culmination of computer science and statistics. You will need basic computer science principles: for example, you need to know what big-O notation is, and you should understand the difference between computation and memory. You'll need to be comfortable writing non-trivial programs in Python and NumPy. It would be great if you're comfortable with recursion; some of those ideas will be useful when we translate algorithms into kernel methods, which we'll go into later. We will assume all of you are comfortable writing programs, especially in Python, or are at least willing to put in the time and effort to pick it up. The other big prerequisite is probability: you should have taken some kind of probability course at some point, so concepts like random variables and distributions should be familiar. You should know the difference between a random variable and a distribution (they are distinct things), and you should know what expectations are; we'll be using these concepts very liberally throughout the quarter. The last prerequisite is linear algebra and multivariate calculus: you should already be comfortable with things like gradients and Hessians (you shouldn't be seeing them for the first time), as well as eigenvalues and eigenvectors.

What about the honor code? You are strongly encouraged to form study groups; in fact, forming study groups is a key step for being successful in this course, especially if you're not so comfortable with the prerequisites. Please do form study groups and work together in them. However, you should still write up your homeworks and write your code independently. It is fine to discuss and work on solutions together, but once you're done with your study group meeting, put aside all the material you used during it and write up your homeworks from scratch, by yourselves, in your own words. Please do not refer to the material from your study group session while writing up your homeworks. Another important note: it is a violation of the honor code to refer to homeworks and solutions from previous years, whether they are the official solutions released by this course or solutions written by other students on GitHub or anywhere else. We are very serious about the honor code; although we will not actively look for violations, we trust that you will follow it.

The course structure: there are three homeworks, each about two weeks long, and each counts for about 20% of your final grade. There will also be a final exam. It will be a take-home final because you need a computer for it (there will be some code; more details toward the end of the quarter). It will very likely be a 24-hour take-home exam; we haven't decided exactly when it will start, but it will be during the finals slots.

Logistics. The course website is up at cs229.stanford.edu. On it we have the office hours calendar and the course calendar: one shows when and where each TA holds office hours, and the deadlines are part of the course calendar. When you open the site there are three big buttons: the syllabus, a link to Piazza, and the calendars. You can add the office hours calendar to your Google Calendar if you wish; the exact office hours of every TA are there. If you scroll further down you'll see the course calendar: we are in this lecture right now, and you'll find when each problem set is released and when it is due. All the deadlines are in the course calendar, and you can add it to your personal calendar to keep track of them. Then we have the syllabus page, which lists a collection of topics; we're probably not going to cover all of them, but we will cover a pretty big subset. Further down is a lecture-by-lecture schedule of what we'll cover in each lecture, with the corresponding slides or notes.
The syllabus page will be updated through the quarter depending on what progress we actually make. Then there's the Piazza forum: all of you should have received an invite to Piazza already; if not, click the Piazza link and enroll right away. Piazza is probably the most important platform for the course in terms of logistics, because all announcements will be made there, the homeworks will be released there, and anything you need to know about, say, the final exam logistics will be announced there. So please sign up on Piazza and monitor it. Gradescope: all submissions will be done on Gradescope. If you haven't used it before, spend a few minutes getting familiar with it; you will upload your homeworks into Gradescope directly and we will grade them there. You should have received Gradescope invites by now as well; if you have not, create a private Piazza post.

For any questions you have through the quarter, your first destination is Piazza. If it is a question about the course content or the course logistics, post it on Piazza; it's very likely someone has already asked something similar, so you can either help them out if you know the answer, or post the question if you don't find it there. If it is a question specific to you, for example if you know upfront that you won't be able to submit a homework on time and you want an extension, or you have an OAE letter, or anything else you don't want the rest of the class to see, create a private Piazza post. If there is something beyond that you want to tell only me, and not even the TAs, send me an email directly with "CS229" in the subject line. That's it for logistics; any questions? I'll take that as a no. Moving on.

All right, so what is machine learning? The term machine learning was coined by Arthur Samuel in 1959, and one of the very early programs that got people interested in machine learning was a checkers-playing program written by Samuel. Most of you probably know checkers: it's a two-person game played on a board that looks like a chessboard, with white and black pieces and certain rules. Samuel wrote the program in the years leading up to 1959, when it became well known. The program would learn to play by playing against itself, without any kind of strategy being explicitly programmed into it. It improved its game-playing performance as it experienced more and more games against itself, and eventually it learned to play checkers better than Arthur Samuel himself. I don't know how good a player he was to begin with, but it got pretty good through self-play alone: no strategy was coded in by Samuel; only through experience did the program learn to play better. That was the event that sparked widespread interest in machine learning.

A more common definition, by Tom Mitchell, a professor at CMU, is that machine learning is the study of computer algorithms that improve automatically through experience. Experience is a loose term; what do we mean by it? Most of the time, experience means prior data: examples from the past. The machine learning algorithm itself is a very generic template algorithm; you feed it data from the past, what we call training data, the algorithm analyzes the data it's given, looks for patterns in it, and improves its performance at whatever task it is designed to operate on. The key here is data, experience from the past; that is what distinguishes machine learning from other fields.

Machine learning is also related to the field of artificial intelligence. AI is a much broader term, and we see AI and machine learning conflated a lot, especially in the news and media, but strictly by the definitions, machine learning is a subset of AI. AI is about building programs that operate at the level of performance of, say, a human being (again, a very loose definition); it deals with algorithms that perform at a cognitive level similar to a human. Machine learning is one approach to implementing such programs. You could, for example, implement an AI program that never looks at any training data: you just use your smarts and write a very clever program that works really well and behaves at the level of a human in terms of performance. That's still AI, but it's not machine learning. In machine learning, data is the key: you look at past data and use it as a training set, or you build a simulator from which your program can build experience by interacting with its environment, and a program that improves its performance over time by looking at data or gathering experience in a simulator is doing machine learning. That's one way to make programs intelligent. However, most of the recent attention on artificial intelligence is due to machine learning. AI has been around for a while, and so has machine learning, but some recent advances have made machine learning very successful, and therefore interest in artificial intelligence has been reinvigorated. Even within machine learning there is a specific type called deep learning, which has actually seen the most advances recently. So it's good to have this clear picture: AI is the much broader field dealing with programs that operate at a cognitive level similar to a human being, and it does not tell you how to implement such programs; within AI, machine learning is a subfield that prescribes one way to implement them, namely to look at prior examples and learn from them; and within machine learning, deep learning is a subfield that uses something called neural networks, which we will see later in the course, and that is the area that has made a lot of progress in the last five to ten years.
OK, here are some examples of where machine learning has made a lot of progress. Computer vision and image recognition: this is probably the field that has undergone the most progress in recent years. There is a famous computer vision dataset called ImageNet, and a kind of model called the convolutional neural network, which together more or less revolutionized computer vision; this is all recent, roughly since 2012-2015, within the last decade. It has also significantly improved autonomous driving, where machine learning techniques are used to sense where pedestrians are, where the stop sign is, where the traffic lights are; it's a key component of autonomous driving. Machine learning has also significantly improved speech recognition, and it has made possible things like voice assistants: Apple's Siri, the Google Assistant, and so on. Language translation: Google Translate now uses machine learning, or from what Google tells us, it uses deep learning. In fact, there was also a paper from Facebook, I think last year, on unsupervised translation: you give it a corpus of English sentences and documents, and separately a corpus of, for example, French documents, and the algorithm is never told which English and French sentences match; just by looking at the two corpora it learned to translate sentences from one language to the other. Similarly, Google Translate had a paper a few years ago where they built a single model that could translate between various pairs of languages, and that model automatically learned to translate between new pairs for which it had never seen a training example. Those are examples in language translation, a very exciting field.

There is also a lot of progress happening in reinforcement learning, or deep reinforcement learning, and most of it has been in game playing. For example, DeepMind over the last few years built reinforcement learning models that play Atari games at the level of humans: you feed the model the pixels from the display of an Atari game, and just from how the game looks, the agent controls the different actions, and it learned to play at a superhuman level across some 30 to 40 games. There is also AlphaGo. Go is a board game that was widely considered extremely difficult for computers, because it requires a lot of strategy and a lot of planning; after chess was conquered by computer programs in the late 1990s, Go was thought of as the next big board game that would be really hard to beat, and AlphaGo, built by DeepMind at Google, plays better than the world's top Go players. Those are some of the recent advances.

Here is a preview of what we'll be doing through this quarter. We're going to look at supervised learning algorithms, unsupervised learning, deep learning, some learning theory, and some reinforcement learning. Most of the time will be spent on supervised and unsupervised learning, with a good amount on learning theory, and maybe a little on reinforcement learning.
In supervised learning, we're going to look at problems and classify them as regression problems versus classification problems. What are regression problems? Problems where the output of your model is some kind of real-valued output. For example, suppose you're trying to predict the wind speed tomorrow at Stanford, and you build a machine learning model that predicts it from today's weather conditions as the input: the output is a real-valued number in kilometers per hour or miles per hour, so that's a regression model. A classification model is one where the output is binary or some kind of class: will it rain tomorrow, yes or no? Given today's conditions, if you want to predict whether it will rain tomorrow, the answer is yes or no, and that's a classification problem. You can also classify machine learning algorithms along many other dimensions, such as generative versus discriminative, or probabilistic versus non-probabilistic; any given algorithm can be placed somewhere on this hypercube of dimensions.

Taking a step back, what is supervised learning? It is the kind of machine learning where the training data comes in pairs: each pair has an input and an output. The input is what you feed into the learning algorithm, and the output is what the algorithm should produce. It's called supervision because for each example you are telling the algorithm what the right output is. In unsupervised learning, you give the algorithm some data with no explicit supervision, and you ask it to look for patterns or interesting structure in the data; the output of the algorithm is that structure or pattern. Again, we roughly classify unsupervised algorithms into those that look for clusters versus those that look for subspaces. It's fine if you don't fully understand these terms now; this is just a flavor of what's to come in the rest of the quarter. Deep learning is also commonly called representation learning. You can use deep learning in a supervised setting, an unsupervised setting, a reinforcement setting, and so on: you are trying to learn representations of your data, whereas in non-deep-learning approaches you manually feed the right representation to the supervised, unsupervised, or reinforcement learning algorithm. With deep learning, you let the algorithm learn the representation itself, along with the final task.

Then we'll cover learning theory: super important concepts like the bias-variance decomposition, the bias-variance trade-off, generalization, and uniform convergence. Why do we expect a model that was trained on a particular training set to perform with any level of accuracy when you start using it in the real world, in production? You trained it on this specific training set, so why should it work well out in the world? Questions like that are what we answer in learning theory, and learning theory is probably the part of the course that makes it most interesting, in the sense that you learn principles that are common across learning algorithms, including algorithms that have not yet been invented: the foundational principles of why machine learning works. And then we'll spend some time on reinforcement learning as well, though not a lot.

Here's an example of a supervised learning problem: image classification. These are a few images of handwritten digits from a very famous dataset called MNIST. The input is a set of pixels (the images are 28 by 28, grayscale, so you have a pixel value at each location); you feed the pixels as input to a learning algorithm, and the output is one of the digits zero through nine, so there are ten classes. It's a multi-class classification problem: classification, but not binary; the output is one among ten classes. You want to learn a model from this training data that can predict the handwritten digit for images it hasn't seen before. In fact, you will be working with this dataset and implementing a neural network to build a handwritten digit classifier in one of your homeworks.

Now an unsupervised example: what's commonly known as the cocktail party problem. Imagine there are n speakers, in this case two, at a cocktail party, just the two of them, and there are two microphones placed at specific locations in the room. The two microphones record what the two speakers are saying, and each recording is a different mixture of the two voices. The task is to take these two audio clips, each a different mixture of the two speakers, and have the machine learning algorithm output two waveforms in which the voices are separated. All you feed the algorithm is the two clips; there is no supervision telling it what speaker one's voice or speaker two's voice sounds like, no supervision whatsoever, just two clips of mixed audio, and the model outputs the separated voices. Let's see if the audio works. [audio clip plays] That's from one microphone, with two speakers. [audio clip plays] That's from the other microphone; the sound of one speaker is a little louder than the other, or vice versa. And when you separate them: [separated clip plays: "one two three four five six seven eight nine ten"]. All you gave the algorithm was these two audio clips and nothing else, just two WAV files; it analyzes the waveforms, looking for structure, and it's able to separate the voices with no supervision whatsoever. You're going to do this in one of your homeworks as well: it won't be two but a set of five audio clips mixed together, you're given five audio files, and you will implement the ICA algorithm that we'll cover in the course and separate the audio into five distinct clips.
And finally, reinforcement learning. Here's an example where the goal of the algorithm is to control what's commonly called the inverted pendulum, or the cart-pole. Imagine you have a vertical stick and you're trying to balance it on your hand; we're trying to control this cart-pole agent to keep the stick placed on it balanced. When you start the learning algorithm from scratch, each trial is an episode: once the stick falls, you start the next episode. This is episode zero, and it's falling; now episode one, falling again; episode two; episode three. You keep training the algorithm for more episodes and eventually it gets better. I stopped this run at about 130 episodes, but if you let it continue, it will learn to balance the cart-pole for potentially indefinitely long. There you go; oops, I think it falls off the table or something. Anyway, this is going to be on your homework as well: you're going to write a reinforcement learning algorithm that learns, in a simulator like this, how to balance a stick on top of a cart.

Yes, question? The question, for those watching the lecture online, was: is it necessary to have as many audio clips as the number of speakers? Yes, that's exactly right: you need as many audio clips as speakers.

OK, so that's about it for the introduction. For the rest of today we'll be doing a review of linear algebra, in case it's been a while since you took it, to brush you up and familiarize you with the concepts that are most important for this course. In the next class we'll cover matrix calculus and some probability and statistics, and from then on we'll get into actual machine learning models. Yes, question? The question was: in unsupervised learning you're just given a dataset and asked to find structure in it, but I also spoke about the language translation example where a model was given unpaired corpora and still learned to map between them. The answer to that is a little technical; let's take it offline, come up after the lecture and we can talk about it.

All right. I'm going to assume all of you are familiar with the basics of linear algebra, in the sense that you know what a matrix is, what a vector is, and what you can do with vectors and matrices. We'll review them, but I'm also going to assume some familiarity from all of you. First, some notation. What's a vector? For the purposes of this course, a given vector, let's call it v, is an element of R^d. Let's make sure we all understand what that means: "is an element of" comes from set theory, so v is a vector, it is an element of this set, and the set R^d is the d-dimensional real space. What's a d-dimensional real space? Imagine the real line, and x, y, and z axes.
That picture is an example of R^3, because there are three dimensions. So a vector is an element of a d-dimensional real space, and when we say it is an element, we mean it is a point that lives in that d-dimensional space. A vector can also be written out by its components, v1, v2, ..., vd. In this course a vector is generally considered a column vector, so we write it as a column; you can also write a row vector as v transpose, which is [v1, ..., vd]. What do these numbers mean? They are the coordinates of the point: this length is v1, and similarly v2 and v3, and that point over here is v.

We also have matrices. We write matrices with capital letters (for vectors we use lowercase), so call one A, and we say A is a member of R^(m x n). You can interpret a matrix A as an element living in an m-by-n-dimensional real space. What does that actually mean? A matrix is a grid of real numbers; through this course we hardly ever deal with complex numbers, so everything is real. Think of it as an m-by-n grid of real numbers; that's a matrix. You can interpret it as n column vectors of dimension m each, or as m row vectors of dimension n each. There is a special matrix called the identity matrix, which has ones along the diagonal and zeros everywhere else (I just write a big zero to indicate that all the other cells are zero). You can also have a diagonal matrix, where all the cells except the diagonal are zero and the diagonal entries can be any value. There is also the symmetric matrix, where A equals A transpose. What is the transpose? Switching rows and columns, exactly: take the first column and treat it as the first row of the transpose, take the second column and make it the second row, and so on. The property the transpose satisfies is that A_ij = (A^T)_ji, where i and j are the row and column indices of the matrix; when you transpose, the row number becomes the column number and the column number becomes the row number. The trace of a matrix is the sum of the diagonal of that matrix: take a square matrix and sum up the diagonal entries.

Let's look at some basic operations you can do with vectors. Given two vectors, you can combine them in two different ways. The first is called the inner product. Technically the inner product and the dot product are distinct concepts, but in this course they will be the same. Take two vectors x, y, both in R^d; to perform the inner product the two vectors have to be of the same dimension, d. The inner product is the sum over i = 1 to d of x_i y_i, also written x^T y. That's the mathematical definition. You can also picture it like this: one vector written as a row vector, the other as a column vector, both of the same dimension, and when you perform the inner product you get a scalar. You feed in two vectors and the output is a single number; you've lost the angles, the orientation of the vectors, once you perform an inner product.
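Here is a minimal NumPy sketch of the inner product; the specific vectors are made up purely for illustration.

```python
import numpy as np

# Two made-up vectors in R^3; the inner product requires equal dimensions.
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# Inner (dot) product: sum_i x_i * y_i, also written x^T y. The result is a scalar.
inner = x @ y                      # same as np.dot(x, y)
assert inner == np.sum(x * y)      # the element-wise-multiply-then-sum view
print(inner)                       # 32.0
```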
The other thing you can do with two vectors is the cross product, but the cross product is not very interesting for us; there is also something called the outer product, and the cross product and the outer product are different things. In the cross product you're given two vectors and you find a new vector that is perpendicular to the plane in which the two vectors lie. Here, though, we're going to talk about the inner product and the outer product. In the outer product you can have x in, let's call it, R^d and y in R^p: for the outer product the two vectors need not have the same dimension, whereas for the inner product they must. Mathematically we write the outer product as x y^T. You could also form a different outer product, y x^T; for the inner product, x^T y and y^T x are the same, so you can switch the order, but for outer products switching the order gives you something a little different.

The way to think of the outer product, at least pictorially, is this: we have a column vector and a row vector, the column vector d-dimensional and the row vector p-dimensional. With the inner product we took a row vector and a column vector and turned them into a scalar; with the outer product, you take each and every possible pair of elements from the column vector and the row vector. For example, take the first element of the column vector and, say, the third element of the row vector, multiply those two elements, and that becomes the first-row, third-column cell of the matrix. So you take two vectors and construct a matrix out of them: pick the i-th element from one vector and the j-th element from the other, multiply them, and that becomes the (i, j) element of the matrix. For the outer product you don't need the vectors to have the same dimension, and a matrix that you construct from one column vector and one row vector is called a rank-one matrix. Why rank one? One way to think of it is that it is made of one pair of a column vector and a row vector. You could also take two rank-one matrices and add them up: the column vectors have to be of the same dimension (say d) and the row vectors of the same dimension (say p), but the two need not be equal. You get one rank-one matrix, another rank-one matrix, you sum them element-wise, and you get another matrix. Take two rank-one matrices, add them up, and you get a rank-two matrix, assuming the vectors are linearly independent.
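A small NumPy sketch of the outer product and the rank-one picture, again with invented vectors; np.linalg.matrix_rank is just used to check the ranks numerically.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])        # x in R^d with d = 3
y = np.array([4.0, 5.0])             # y in R^p with p = 2 -- dimensions need not match

# Outer product x y^T: a d-by-p matrix whose (i, j) entry is x_i * y_j.
A = np.outer(x, y)                   # shape (3, 2)
print(np.linalg.matrix_rank(A))      # 1 -- a rank-one matrix

# Adding a second outer product built from linearly independent vectors
# gives a rank-two matrix (and the rank can never exceed min(d, p)).
u = np.array([0.0, 1.0, 1.0])
v = np.array([1.0, -1.0])
B = np.outer(x, y) + np.outer(u, v)
print(np.linalg.matrix_rank(B))      # 2
```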
Now what happens if we add k of them? What's the rank of that matrix? The rank is actually going to be less than or equal to the minimum of d, p, and k: the rank of a matrix cannot be bigger than its smaller dimension. So by adding linearly independent rank-one matrices you are increasing the rank of the resulting matrix, but you can only go as high as the smaller of the two dimensions. Yes, question? We will define the rank precisely in a few minutes; for now, think of a rank-one matrix as a sheet of paper, and as you add more sheets you're increasing the rank. And just by looking at a matrix, if you're just given a grid of rows and columns of numbers, it's pretty much impossible to tell what its rank is; you cannot tell superficially; it's an inherent property. There are simple cases, like a diagonal matrix where everything off the diagonal is zero, where you can tell the rank, but in general you cannot tell the rank of a matrix just by looking at its values.

OK, those were vector-vector operations; now let's look at matrix-vector operations. To recap: for the inner product both vectors have the same dimension and you get a scalar; for the outer product they can have different dimensions and you get a rank-one matrix. For matrix-vector operations, first consider a matrix, call it A, and a vector, call it x, with A in R^(m x n) and x in R^n; then Ax has dimension m. How do you think about this operation? First, you can think of the matrix as a set of rows, each row with n elements, and the vector you're multiplying by is a column vector. All you're doing is inner products: take the first row, compute its inner product with the vector, and you get a scalar; take the second row, do the inner product, and you get another scalar; and you get as many scalars as there are rows in the matrix. So Ax is in R^m; you could write that as m by 1, but we assume vectors are column vectors, so if I write R^m, just read it as m by 1 if you want to think of it that way.
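Here's a quick NumPy check of the row interpretation, with a made-up 2-by-4 matrix: each entry of Ax is the inner product of one row of A with x.

```python
import numpy as np

A = np.array([[1., 2., 3., 4.],
              [5., 6., 7., 8.]])     # A in R^(m x n) with m = 2, n = 4
x = np.array([1., 0., -1., 2.])      # x in R^n

# Row interpretation: entry i of Ax is the inner product of row i with x,
# so the result has as many entries as A has rows.
row_view = np.array([A[i] @ x for i in range(A.shape[0])])

assert np.allclose(row_view, A @ x)
print(A @ x)                         # [ 6. 14.]
```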
The other interpretation of matrix-vector multiplication uses the same matrix, except we're going to think of it in terms of its columns. Say A is m by 4, so x has four elements, and we're computing Ax. Over there we did inner products; here, instead, pick the first element of x and scale the first column of A by it, then pick the second element of x and scale the second column of A by that number (by scaling I mean doing an element-wise multiplication of that number with all the entries of the column, just rescaling it), pick the third element of x and scale the third column, and so on, and then sum up the scaled columns. So Ax is again a sum of m-dimensional column vectors, each scaled by the corresponding value in the vector. Whether you do the operation this way or the other way, you always end up with the same result.

That's matrix-vector; then there's matrix-matrix. Again there are two interpretations of how to visualize the product of two matrices. Assume you have two matrices, m by k times k by n; it is important that the number of columns of the first and the number of rows of the second are the same. In interpretation number one we go with the inner product view: you have m rows of dimension k each and n columns of dimension k each, so the dimensions of a row and a column match, and naturally we want to do inner products. Take every pair of a row and a column (that gives you m times n pairs), perform the inner product, and that becomes the cell value of the corresponding i-th row and j-th column. For example, take the second row and the fourth column, do the inner product, and that is the entry in the second row, fourth column of your output matrix. So matrix multiplication is essentially a whole bunch of inner products done concurrently, in parallel.

You can also view the same operation, m by k times k by n, differently: now we have k column vectors (from the left matrix, each m-dimensional) and k row vectors (from the right matrix, each n-dimensional); m and n are different, but we have k of each. Now pick them pairwise, not all possible pairs: the first column and the first row, take the outer product, and you get a rank-one matrix; plus the second column and the second row; and so on, up to column k and row k. The columns come from the left matrix and the rows come from the right matrix, and when you add them up you get another matrix. The matrix you calculate this way and the matrix you calculate with inner products are exactly the same; they're just two different interpretations of the same mathematical operation.

Any questions so far? Yes: the question is whether this should be a rank-one matrix. When you take a column vector and a row vector and form the outer product, that single term is a rank-one matrix, and the next term is another separate rank-one matrix, and so on; when you add k of them you get a matrix whose rank can potentially be up to min(m, n, k). If the rows and columns are linearly independent, it will not be rank one: you keep increasing the rank as long as you're adding linearly independent rank-one pieces, but you cannot keep adding rank-one matrices and increase the rank forever; you have an upper bound, which is the smaller of the two dimensions.
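A sketch of the other two views, with arbitrary example matrices: Ax as a sum of scaled columns, and AB as a sum of k rank-one (outer-product) terms.

```python
import numpy as np

A = np.array([[1., 2., 3., 4.],
              [5., 6., 7., 8.]])              # m x k with m = 2, k = 4
x = np.array([1., 0., -1., 2.])
B = np.arange(12.0).reshape(4, 3)             # k x n with n = 3 (arbitrary values)

# Column interpretation of Ax: scale each column of A by the matching entry
# of x, then sum the scaled columns.
col_view = sum(x[j] * A[:, j] for j in range(A.shape[1]))
assert np.allclose(col_view, A @ x)

# Matrix-matrix product as a sum of k outer products:
# (column j of A) times (row j of B), summed over j.
outer_view = sum(np.outer(A[:, j], B[j, :]) for j in range(A.shape[1]))
assert np.allclose(outer_view, A @ B)
```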
There was another question about what the rank is; before we get to that, let's look at why we even need to learn linear algebra for machine learning, why linear algebra is relevant at all. Here are some of the situations where we'll end up using it through the rest of the quarter. First: representing data. Suppose, in a supervised learning setting, you want to predict the price of a house, and the inputs you're given are things like the number of bathrooms, the number of bedrooms, the area, and so on. You would then represent your training data as a matrix X, also called the design matrix, where each row is a different example and the columns correspond to what are called features: the rows could be house one, house two, house three, house four, and the columns could be number of bedrooms, area, number of bathrooms, and so on. Your design matrix X is most conveniently represented as a matrix, and similarly your supervision values, the actual prices of those houses, are most conveniently represented as a vector. So just to represent your data it is convenient to use concepts from linear algebra like matrices and vectors, and then we're going to do a whole lot of manipulation with these, like multiplying X with another parameter vector called theta, and so on (there's a small sketch of this right after this list of uses).

Another situation where we'll need matrices is in probability, to represent what are called covariance matrices; you'll get a probability review tomorrow and we'll go over what a covariance matrix is. Covariance matrices tend to be symmetric; symmetric means the matrix and its transpose are the same. We're also going to use linear algebra for multivariate calculus: gradients (and I'm expecting you to know what a gradient is) tend to be vectors, which we represent as column vectors; Hessians are matrices, basically the second derivative in the multivariate setting, and they are symmetric; Jacobians are derivatives of a vector-valued output with respect to a vector-valued input, so they are matrices too, and a Jacobian will likely not be symmetric. We're also going to use linear algebra very liberally in kernel methods; you're not expected to know what a kernel method is yet, I'm just listing it so you know that linear algebra has uses in all these scenarios. And by calculus I also mean optimization in general: given a loss function, we need to minimize it, and we'll use gradients and Hessians there. So linear algebra is important, and it's very important that you are comfortable manipulating these concepts, matrices and vectors: multiplying them, taking inverses, and so on. Any questions before we move on to a few more interesting concepts? No questions.
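Here is the small sketch mentioned above: a tiny hypothetical design matrix for the house-price example. The feature values, prices, and the parameter vector theta are all invented for illustration, not real data or a fitted model.

```python
import numpy as np

# Design matrix X: one row per example (house), one column per feature
# (bedrooms, area in square feet, bathrooms). Values are made up.
X = np.array([[3., 1500., 2.],
              [2.,  900., 1.],
              [4., 2200., 3.]])

# Supervision: the actual price of each house, collected in a vector y.
y = np.array([600_000., 350_000., 900_000.])

# A hypothetical parameter vector theta; a linear model's predictions
# are just the matrix-vector product X @ theta.
theta = np.array([10_000., 350., 5_000.])
predictions = X @ theta          # one predicted price per row of X
print(predictions)
```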
We saw matrix-vector multiplication over there; now let's do a geometric interpretation of it. Linear algebra is one field of mathematics where it's very easy to visualize things, which makes it really fun and also fairly easy once you get the core intuitions. So we saw matrix-vector multiplication: call the matrix A, in R^(m x n), and say you have a vector x in R^n; when you do the matrix-vector multiplication you get Ax, which is in R^m. One way to think of Ax is through the row interpretation or the column interpretation, but an even more useful way is to think of A as some kind of function. We've been looking at the matrix as a set of numbers; stop thinking of it as a set of numbers and think of it as a function, a function that takes x as input and outputs that vector Ax.

Now assume A is in R^(3 x 3), which means it takes as input values from a three-dimensional space and its output is also in a three-dimensional space (three and three just to make the visualization simple). We're thinking of A as a function whose inputs are vectors and whose outputs are also vectors: x is a vector, Ax is a vector, and in general they live in different dimensions; it takes as input an n-dimensional vector and produces an output which is an m-dimensional vector. So picture the input space with x, y, z axes and the output space with x, y, z axes, and A is the thing that does the multiplication. Here's a vector x living in the three-dimensional input space, with its column representation; multiply it by A, and say it maps to this point in the output space. Now take another vector, run it through A, and it gives a different output, perhaps over here; take a third vector, run it through A, and it probably lands somewhere else. What this picture represents is that we have this function A; we don't think of it as numbers anymore; we take some point in the input space, feed it into the function, and it outputs a new vector in a possibly different-dimensional space, in this case three and three.

Now, if A is full rank, which means its rank is three in this case, there is going to be a one-to-one mapping between every point in the input space and every point in the output space: a unique mapping from each input point to a corresponding output point. If A is a matrix that makes such a one-to-one mapping possible, A is a full-rank matrix. This is the input space and that is the output space. You could also imagine that there exists some other matrix, call it B, that takes these output values, say the red points, as input, runs them through B, and outputs the original points back in the input space. If A is a full-rank matrix, then such a B will exist: it does the reverse mapping, and this B is also called A inverse. So A is the function that takes vectors as input and maps them to some output, and as long as it is full rank, there will be another matrix, the inverse of A, that takes the outputs as input and returns the original points as output.
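A minimal numerical sketch of that picture, with a made-up full-rank 3x3 matrix: the inverse B undoes the mapping that A performs.

```python
import numpy as np

A = np.array([[2., 0., 1.],
              [1., 3., 0.],
              [0., 1., 1.]])            # an arbitrary 3x3 example
print(np.linalg.matrix_rank(A))         # 3 -> full rank

x = np.array([1., 2., 3.])              # a point in the input space
y = A @ x                               # A maps x to a point in the output space

B = np.linalg.inv(A)                    # exists because A is full rank
print(np.allclose(B @ y, x))            # True: B maps the output back to x
```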
That's the case of a full-rank matrix. Now what happens if A is not full rank? If a matrix is not full rank, it's called rank deficient. Again, this is the input space with x, y, z axes, and that is the output space with x, y, z axes, and we have an A in R^(3 x 3), but now let's assume it is a rank-2 matrix. The previous one was full rank, a rank-3 matrix, so it could uniquely map every point and you could go back through the inverse. If A is rank deficient with rank two, it means there exists a two-dimensional subspace in the input space. What does a subspace mean? Assume the ambient space is three-dimensional; think of this room as the ambient three-dimensional space. Any two-dimensional plane, for example this sheet of paper extended indefinitely in all directions, is a subspace that lives in this three-dimensional space. If the rank of the matrix is two and it's a 3x3 matrix, then in this three-dimensional ambient space there exists a two-dimensional subspace specific to this matrix, and that subspace must pass through the origin. There is also a corresponding subspace in the output space, also two-dimensional (which is what makes it a rank-two matrix), and that one also has to go through the origin, and there exists a one-to-one mapping between points on the input subspace and points on the output subspace: one point on the subspace gets mapped here, another point gets mapped there, a third point maps somewhere else. These subspaces extend to infinity in all directions in both the input and output spaces, and A gives you a unique one-to-one mapping between elements of these subspaces.

But A is a 3x3 matrix: think of it as a function to which you can feed any input. Just because a subspace exists for which a one-to-one mapping holds doesn't mean you can only use points from it. For example, I may take this point that lives outside the subspace: think of this as the ambient three-dimensional space, this as the two-dimensional subspace, and the point where the pen is sitting is not in the subspace, it's outside it. Take such a point; you can still feed it as input to this function, and it's going to map it somewhere. But where will it map it to? The way to think about that is that any given point in the input space of a matrix can be decomposed. Assume this is x and x lives outside the subspace. You can decompose x into two parts, where one part lives in the subspace and is, in a sense, nearest to the given x, and the other part is the residual from that nearest point to the actual point. That is, assume the origin is in that corner and this is the two-dimensional subspace that passes through it, and we have a point over here; this point is represented by a vector that starts at the origin and ends at the point, and you can decompose it into two parts: one part that lives strictly in the subspace and extends all the way to the point in the subspace that is nearest to x, and a second part that is perpendicular to the subspace and goes from that nearest point to x. It's like the Pythagorean theorem: you have two components at right angles, and any vector can be decomposed into a component that lies in the subspace and a component that is strictly perpendicular to the subspace.

I'm sorry, what's the question? Yes, you could call it a "resolution," but the more commonly used term is decomposition: you decompose that vector into two parts. And when I say decomposition, I don't mean taking the vector and splitting its coordinates into a two-dimensional piece and a three-dimensional piece; you should not confuse this decomposition with that kind of splitting. What we're talking about is that x can be written as (I'm going to use the word projection here) the projection of x onto the subspace, for which the technical term is the row space, plus the projection onto the null space. Now what does that mean? We saw two terms there, a row space and a null space. The row space is exactly the subspace which gets mapped bijectively to the corresponding output subspace: the subspace in the input space for which the bijection exists. It's called the row space because it is made up of precisely those points which can be represented as linear combinations of the rows of A. Yes, question: are all projections orthogonal? I'll come to that in a moment. So the row space is made up of all points that can be represented as linear combinations of the rows of A, and it is precisely the subspace for which a bijection exists between the input space and the output space. The corresponding subspace in the output space is called the column space, which is also called the range, and it is precisely the set of all points which can be represented as linear combinations of the columns of A.

Now, the point over here that did not lie in the row space can, we said, be decomposed as the sum of two components: one component that lies in the row space and is nearest to x, and another component, what's left, the residue, which is orthogonal to the row space. And the question was: is the projection always perpendicular to the space onto which you're projecting? The answer is yes. The way to think of it: assume this is the two-dimensional subspace in the ambient three-dimensional space, and you have some point over here, where this black pen is, and you want to project it onto the subspace. The way you do the projection is: search over all the points in the subspace, calculate the distance from each of them to this point, and choose the point that has the smallest distance. That point, the one with the smallest distance, is the projection of this point onto the subspace, and the line connecting the true point to the projection point is always going to meet the subspace at an angle of 90 degrees; it's always perpendicular to the subspace itself.
Now, back to the question from a moment ago: is the projection always perpendicular to the space you're projecting onto? The answer is yes. The way to think about it: suppose this is a two-dimensional subspace inside an ambient three-dimensional space, and you have some point out here, where this black pen is, and you want to project it onto the subspace. The way you do the projection is to consider all the points in the subspace, compute the distance from each of them to your point, and choose the one with the smallest distance. That nearest point is the projection of your point onto the subspace, and the line connecting the original point to its projection always meets the subspace at an angle of 90 degrees; it's always perpendicular to the subspace itself. That's fairly intuitive: the point on the plane that is nearest to the point above it is the one sitting directly below it, so the connecting line has to be perpendicular. Does that answer your question? Okay. Yes, question: do subspaces always pass through the origin? Yes, by definition they pass through the origin, because every point in the subspace is some linear combination of the rows or the columns, and one allowed combination is all zeros: zero times column one plus zero times column two, and so on. So the origin is always part of both the input subspace and the output subspace. Yes, question: what is the row space again? The row space is precisely the set of all points that can be represented as linear combinations of the rows of A. The column space is the set of all points that can be represented as linear combinations of the columns of A, which is the same as the set of all points you can obtain by multiplying some vector by the matrix, because A x is by definition a linear combination of the columns of A; take the transpose and the same construction gives you the row space. Yes, question: are the points of different colors that I'm drawing the rows or the columns of the matrix A? Good question, thank you for asking. The points I'm choosing here need not be a row or a column of A; the points on this subspace are simply points that can be represented as some linear combination of the rows of A. What is true is that if A has three rows, there will be three particular points in the subspace corresponding to those three rows. One of the allowed combinations is one, zero, zero: one times the first row plus zero times the second row plus zero times the third row must lie in the subspace, which means the first row itself is a point in the subspace, and similarly for the others. Does that make sense? So that's the row space. Now, back to the question of what happens if we take some x that does not lie in the row space of A and multiply it by A. This is where the decomposition helps us find an answer. A x is going to be, abusing notation slightly, A (x_r + x_n), where x_r is the projection of x onto the row space and x_n is the residue. The direction in which the residue lives is called the null space of A, and any point that lives in the null space of A always gets mapped to zero in the output space; that's what makes it the null space. So any point in the input space can be broken down into two components, a component that lives in the row space and a component that lives in the null space. Because A is a linear function, A x = A x_r + A x_n, and the second term is zero, because x_n is a vector in the null space and multiplying it by A always sends it to zero.
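To make this decomposition concrete, here is a small NumPy sketch, not from the lecture; the rank-deficient matrix A and the input x are made up, and the projector onto the row space is built from the pseudoinverse, which is one standard way to compute it.

```python
import numpy as np

# A hypothetical rank-2 matrix (same style of example as above).
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [5.0, 7.0, 9.0]])

x = np.array([2.0, -1.0, 4.0])           # an arbitrary input point

# Orthogonal projector onto the row space of A.
P_row = np.linalg.pinv(A) @ A
x_r = P_row @ x                          # component in the row space
x_n = x - x_r                            # residual component in the null space

print(np.allclose(A @ x_n, 0))           # True: the null-space component is annihilated
print(np.allclose(A @ x, A @ x_r))       # True: A only "sees" the row-space part
print(np.isclose(x_r @ x_n, 0))          # True: the two components are orthogonal
```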
And so A x is just A x_r. The operation you want to picture in your mind is this: given any point in the input space, which may or may not lie in the row space, multiplying it by A effectively projects that point onto the row space and then carries the projected point over to the corresponding output. In other words, the operation prunes away anything in the input that lies outside the row space, finds the projection onto the row space, and carries that point forward into the output space. This concept of projection is super important; we're going to come across it again when we do linear regression. So how do we actually calculate a projection? So far I've treated it as some abstract operation and just used the resulting vector, but how do you compute it? That's not too hard either. Let's start with the simplest case: you have a vector, call it v, and another vector, call it b, and we want to project b onto the subspace spanned by v. v is a single vector, and it induces a one-dimensional subspace made up of all scalar multiples of v: 2v is part of that subspace, 0.5v is part of it, minus v is part of it, so this entire line is the subspace spanned by the vector v. Now we want to project b onto that subspace. For a given v there is something called the projection matrix: the projection matrix of v is v v^T / (v^T v). We'll decipher this in a minute. It's a projection matrix in the sense that if you take any vector b and multiply it by this matrix, the resulting point is the projection of b onto the subspace spanned by v. Why is that the case? Let's look at what's happening. I'm going to rewrite v v^T b / (v^T v) as (v / ||v||)(v / ||v||)^T b. That works because v^T v is the square of the length of v, its squared norm, so I can distribute one factor of the length to each copy of v. When you divide a vector by its length you rescale it into a unit-length vector, so call this v~, a vector along v with length one. Notice that it doesn't matter which specific vector along the subspace you started with: any vector on that line, rescaled by its own length, gives the same v~ and therefore the same projection matrix. So the projection matrix is v~ v~^T, and applying it to b gives v~ (v~^T b).
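Before unpacking why this works, here is a quick numerical check of the formula, again just an editorial sketch with made-up values for v and b.

```python
import numpy as np

v = np.array([1.0, 2.0, 2.0])             # direction spanning a 1-D subspace
b = np.array([3.0, 0.0, 4.0])             # vector we want to project onto that line

# Projection matrix onto the span of v:  P = v v^T / (v^T v)
P = np.outer(v, v) / (v @ v)

b_proj = P @ b                             # projection of b onto the line spanned by v
residual = b - b_proj

print(np.isclose(residual @ v, 0))         # True: the residual is perpendicular to v
print(np.allclose(P @ (2.5 * v), 2.5 * v)) # True: anything already on the line is unchanged
```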
Now, when you take the inner product between a unit-length vector and any other vector, you get the length of the projection of that vector along the unit-length direction. So v~^T b is the magnitude of b along the direction of v, and v~ (v~^T b) just rescales the unit vector v~ by that scalar. So this projection matrix projects whatever input is fed into it onto the subspace spanned by v. Is this clear? Any questions? Now, what happens if instead of one vector v we have a collection of vectors, a bunch of column vectors? For example, suppose we have a subspace spanned by a collection of three vectors, and that is the subspace onto which we want to project. We follow something very similar: in place of v, which was a single vector, we will have a matrix X whose columns are the vectors that make up the subspace onto which we want to project. Yes, question: do they need to be linearly independent? Yes, by definition: to describe the subspace we take a set of linearly independent vectors that span it, which also makes X^T X invertible. Wherever there was a v, we plug in X. The result looks a little different from the one-vector formula, and that's because v^T v was a scalar, and you can divide by a scalar, but X^T X is a matrix, and you cannot divide by a matrix; that's not even a meaningful operation. The right way to think about it is to rewrite the one-vector formula as v (v^T v)^{-1} v^T, which is the same thing, and it is this form that generalizes to the matrix case: the projection matrix for the subspace spanned by the columns of X is X (X^T X)^{-1} X^T. We're going to come across this again when we do linear regression, so remember it. Yes, question: could the (X^T X)^{-1} have gone anywhere, or why is it in the middle rather than to the left or to the right? I don't know if this is a fully satisfying answer, but it's the only placement where the dimensions work out: X is m by n, (X^T X)^{-1} is n by n, and X^T is n by m, so the middle is the only place it fits so that the multiplication makes sense. I know that's not a deep answer, but it's a useful sanity check.
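Here is the matrix version of the same check, again an illustrative sketch rather than lecture material; the columns of X are made-up linearly independent vectors so that X^T X is invertible.

```python
import numpy as np

# Columns of X span the subspace we project onto: a 2-D plane inside R^3.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 2.0, 7.0])

# Projection matrix onto the column space of X:  P = X (X^T X)^{-1} X^T
P = X @ np.linalg.inv(X.T @ X) @ X.T

b_proj = P @ b
residual = b - b_proj

print(np.allclose(X.T @ residual, 0))      # True: residual is orthogonal to every column of X
print(np.allclose(P @ P, P))               # True: projecting twice changes nothing (idempotent)
```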
This whole idea of projecting things, finding the point for which a bijection exists, and carrying it over is going to be a recurring theme; it's something you want to understand really well. For a given matrix A that is not full rank, there exist subspaces in its input and output spaces, and the dimension of those subspaces equals the rank of the matrix. Whenever you multiply a vector from the input space by A, what you are effectively doing is projecting that vector onto the row space, the input subspace for which a bijection exists, and then carrying that point forward to the output space. This also means that if we pick a point in the output space that lives outside the column space and ask what point in the input space would give us this as the output, which is effectively asking for A inverse, the answer is that it does not exist: from no point in the input space can we reach a point that lies outside the column space. That point is unreachable. When we cover linear regression we're going to ask exactly this question: when we have a target point that is unreachable, what are we going to do about it, and how do we find a suitable input? That's basically linear regression, which we'll cover on Friday. But this whole set of ideas, subspaces, row space, null space, decomposition, bijection, these are things you want to digest and absorb really well.
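As a small editorial preview of that unreachable-target situation, not part of the lecture itself: the sketch below uses a made-up rank-deficient A and a target b outside its column space, and NumPy's standard least-squares routine to find the input whose output is the nearest reachable point.

```python
import numpy as np

# Rank-deficient matrix: its column space is only a 2-D plane inside R^3.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [5.0, 7.0, 9.0]])

b = np.array([1.0, 0.0, 0.0])             # a target outside the column space

# No x satisfies A x = b exactly; least squares finds the x whose output
# is the projection of b onto the column space (the nearest reachable point).
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(A @ x_hat, b))           # False: b itself is unreachable
residual = b - A @ x_hat
print(np.allclose(A.T @ residual, 0))      # True: the residual is orthogonal to the column space
```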
Let's see, we have about eight more minutes, so we might be able to squeeze this in; we might run a little over time today, but I'll try to wrap up as much as possible and we can continue the rest next time. All right. So far we saw a visual representation of what the matrix A does to the input and the output. Next we're going to limit ourselves to square matrices, which means the input and output spaces have the same dimension, and for the purpose of visualization we'll consider a three-dimensional space. In the earlier diagram we had two separate pictures for the input space and the output space, but because A is now a square matrix, we can overlay the input and output spaces onto the same picture. Now let's ask the following question. We saw what happens when you pick some point, run it through A, and get an output point. What happens if we take the unit sphere around the origin, by which I mean just the points on the surface of the unit sphere, think of it as a soccer ball centered at the origin, and run every point on that surface through A to get a corresponding output point? What does the resulting shape look like? It's going to look like an ellipsoid, almost like an ellipse in three dimensions. So what exactly happened? We have a three-dimensional input and output space; instead of a single input we took a whole collection of points, namely those on the surface of the unit sphere; we took this point in the input, ran it through A, and got a corresponding output point, say this one, and every point on the input surface maps to some point on the output surface of that shape. We could have done this with any input shape, but a sphere is easy to analyze. Similarly, take another point, say this one maps over here, and, let me pick a color, green, this point maps over here. So instead of thinking of running a single point through A, imagine running a whole shape through A, which essentially means running every point of the shape through A and collecting the outputs, which form some other shape. Think of A as taking a shape as input and producing a different shape as output. If the input shape is the unit sphere around the origin, the output is going to be an ellipsoid. Now, assuming A is also symmetric, this ellipsoid tells us essentially two things. Look at an input point and its output: two operations happened here. There is a change in magnitude, the magnitude came down a little bit, and there is a change in direction. So A effectively does two things to a point: it changes its direction and it changes its magnitude. But there are going to be some points on the unit sphere for which the only change is a change in magnitude and no change in direction, and those are your eigenvectors. For a full-rank three-by-three matrix there are going to be three eigenvectors, and this is going to be one of them, and if the matrix is symmetric those eigenvectors are going to be perpendicular to each other. Oh sorry, I used the wrong color; let me fix that. So this is going to be one eigenvector, and there's going to be another eigenvector, which maps a point on the input to a point on the output where the only thing that changes is the magnitude, not the direction. Yes, question: does it necessarily have to have three eigenvectors? It will have three eigenvectors, but some of the eigenvalues may be zero; I'm going to come to that. So this is eigenvector number one, the one for which the scaling is the largest, and the eigenvalue is the ratio of the length of the output vector to the length of the input vector. If the output is stretched a lot along the eigenvector, you have a large eigenvalue; if points along the eigenvector shrink closer to the origin, the eigenvalue is less than one; if they are stretched out, it is greater than one; and the point may also get reversed in direction, in which case the eigenvalue is negative. So that's the eigenvector, and the eigenvalue is the ratio of the output magnitude to the input magnitude. It may also happen that when you feed the unit sphere, the soccer ball, through A, the output is not a solid ellipsoid but a flat ellipse, which means you lost a dimension: the shape got smooshed into a two-dimensional ellipse instead of a proper three-dimensional ellipsoid. That means the eigenvalue along the third eigenvector was zero, so along that third eigenvector all the lengths got mapped to zero. Does that make sense?
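Here is a small NumPy sketch of these eigenvector facts, again with a made-up symmetric matrix rather than anything from the board.

```python
import numpy as np

# A made-up symmetric 3x3 matrix (symmetric, so its eigenvectors are orthogonal).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.0],
              [0.0, 0.0, 2.0]])

eigvals, eigvecs = np.linalg.eigh(A)       # columns of eigvecs are unit eigenvectors

for lam, v in zip(eigvals, eigvecs.T):
    # Along an eigenvector, A only rescales: A v = lambda v (no change in direction).
    print(np.allclose(A @ v, lam * v))     # True for each of the three eigenvectors

# The eigenvectors of a symmetric matrix are mutually perpendicular.
print(np.allclose(eigvecs.T @ eigvecs, np.eye(3)))   # True

# These eigenvalues are the stretch factors along the eigenvector directions,
# i.e. the semi-axis lengths of the ellipsoid the unit sphere is mapped to.
print(eigvals)
```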
So now, back to a question that was asked earlier: what is the meaning of the rank of a matrix? In this square, and here symmetric, setting, the rank of the matrix is precisely the number of nonzero eigenvalues. If the matrix has full rank, the unit sphere always comes out as an ellipsoid with the full number of dimensions; if the matrix is rank deficient, the ellipsoid that comes out has lost some dimensions.
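One last editorial sketch tying rank to the eigenvalues; the construction of the rank-deficient symmetric matrix is an assumption made for illustration.

```python
import numpy as np

# Symmetric rank-deficient matrix: built as B^T B where B has rank 2,
# so one eigenvalue of A is exactly zero.
B = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
A = B.T @ B                                # 3x3 symmetric, rank 2

eigvals = np.linalg.eigvalsh(A)
num_nonzero = np.sum(np.abs(eigvals) > 1e-10)

print(np.linalg.matrix_rank(A))            # 2
print(num_nonzero)                         # 2: rank equals the number of nonzero eigenvalues
```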
Info
Channel: stanfordonline
Views: 44,970
Keywords: Anand Avati, Machine Learning, CS229, Stanford, Introduction, Linear Algebra
Id: KzH1ovd4Ots
Length: 111min 13sec (6673 seconds)
Published: Tue Apr 13 2021