Ali Ghodsi, Lec [3,1]: Deep Learning, Word2vec


Captions
Okay, let's start. So far we have seen feed-forward neural networks, and we have seen regularization for networks in general: how we can avoid overfitting and come up with the right model. Today I am going to show you a model that has been quite popular since 2013, when the first paper was published; it is now called word2vec. I showed you a couple of examples of word2vec in the first lecture, and those examples even got news coverage. The favorite example, the one that made the news, is that with this vector representation of words, "king" minus "man" plus "woman" is approximately "queen": you have a representation such that when you take the vector corresponding to "king", subtract the vector corresponding to "man", and add the vector corresponding to "woman", the result is close to the vector for "queen". Basically, a king who is not a man but a woman is a queen. In 2013 two papers were published, and the two algorithms they describe are now both called word2vec. Word2vec is basically a feed-forward neural network, and we are going to see the details of how it works before we move on to other types of neural networks.

If you want to do language modeling, you need a representation of words in a vector space. Most of the machine learning algorithms we are familiar with take vectors as input: we turn images into vectors, and we have to turn text into vectors as well. There is a wide variety of tasks in natural language processing, and in almost all of them you need some sort of vector representation of the words in order to use the tools we are already familiar with. Take translation, for example. Word-by-word translation is easy: you can just consult a dictionary and find the corresponding word in the other language for each word. But what about the order of those words? How are you going to order them? Most translators actually compute probabilities: the probabilities of different candidate sentences, which are then compared. You want a model in which, for example, the probability of "the cat is small" is greater than the probability of "small the is cat"; one is the right sentence and the other is wrong. Or when you are not sure which word to choose: you can say "walking home after school" or "walking house after school"; which one is correct? The probability of the right choice should be larger in both cases, and in many other tasks you likewise need a representation of words.

The simplest, almost trivial, way of representing words as vectors is the one-hot vector. Here you have a very sparse vector whose length is the cardinality of your vocabulary: if you have 1,000 words, the length of the vector is 1,000; if you have one million words, the length is one million. The vector is 1 in one spot and 0 everywhere else. This is a very easy representation, and it has been used, and is still in use, in much of text mining and natural language processing. Most likely you are familiar with the concept of a term-frequency matrix: in a term-frequency matrix, the rows of the matrix are different words and the columns of the matrix are different documents.
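To make the one-hot and term-frequency ideas concrete, here is a minimal sketch in NumPy; the toy vocabulary and documents are made up for illustration and are not from the lecture.

```python
import numpy as np

# Toy vocabulary and documents (made up for illustration).
vocab = ["book", "table", "library", "cat", "small"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Sparse representation: 1 at the word's index, 0 everywhere else."""
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

docs = [["book", "library", "book"],
        ["cat", "small", "cat", "table"]]

# Term-frequency matrix: rows are words, columns are documents.
# Each column is just the sum of the one-hot vectors of the words in that document.
X = np.stack([sum(one_hot(w) for w in doc) for doc in docs], axis=1)
print(X.shape)   # (|vocab|, number of documents)
```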
Say I have ten documents. I go to the first document and look at the number of times each word is repeated in it: the word "book" is repeated 10 times, the word "table" does not appear, and so on; that column is the representation of this document. This representation is basically a summation of one-hot vectors: we represent each word by a sparse vector which is 1 at one particular spot and 0 everywhere else, we represent all of the words in the document by vectors of this type, and to represent the whole document we add them up. So we are counting the frequency of the words in the document and representing the whole document as the sum of these one-hot representations. Instead of a summation it could be a logical OR over all of these word vectors; that is another way to represent a document. [Student: Is this the bag-of-words model?] Yes, it is the bag-of-words model.

There are a couple of problems with this type of word representation. One is that, semantically and syntactically, there is no relation between words: if you have the word "book" and the word "library", there is no similarity between their vectors. A position is assigned to each word essentially arbitrarily, the vector is 1 at that position and 0 everywhere else, so we cannot compare two vectors and say the words are similar or dissimilar; the representation is not informative in that sense. The other problem is that when the vocabulary is large, these vectors are large, and they become intractable for many tasks.

So what is the solution? How can we reduce this dimensionality? The first idea is to reduce dimensionality using, say, singular value decomposition. I have these word vectors that are high dimensional, D is quite large, and I can do a singular value decomposition: the D-by-n matrix decomposes into three matrices, where the first contains the eigenvectors of X X^T, the last contains the eigenvectors of X^T X, and the middle one is a diagonal matrix of singular values, the square roots of the eigenvalues. Where does this idea come from? You are familiar with the notion of dimensionality reduction, and we briefly mentioned the concepts of a subspace and a manifold in one of the previous lectures. This approach is motivated by the assumption that the data lies on a subspace: all of these words live in a subspace whose intrinsic dimensionality is lower than that of the original space. The original space has the size of the vocabulary, but we assume there is a subspace of lower dimensionality, and using singular value decomposition we can map the words into that lower dimensional space. Many techniques use this idea; the most famous are latent semantic indexing and latent semantic analysis, which take this input matrix and do a singular value decomposition. Latent semantic indexing is exactly the same as PCA except that you do not remove the mean of the data. So basically you are still finding a subspace of the original space.
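As a rough sketch of the LSI idea, not the exact pipeline from the lecture, a rank-k truncated SVD of the term-frequency matrix X from the snippet above could look like this:

```python
import numpy as np

def truncated_svd(X, k):
    """Rank-k truncated SVD: X is approximately U_k @ np.diag(S_k) @ Vt_k."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], S[:k], Vt[:k, :]

# X is the (|vocab| x #documents) term-frequency matrix from the previous sketch.
U_k, S_k, Vt_k = truncated_svd(X, k=2)

# Columns of diag(S_k) @ Vt_k are k-dimensional document representations;
# rows of U_k * S_k can be read as k-dimensional word representations.
doc_embeddings = (np.diag(S_k) @ Vt_k).T   # one row per document
word_embeddings = U_k * S_k                # one row per word
```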
This gives the new representation: if you had n words here, you still have n words, but in k dimensions instead of D dimensions, and k is much smaller than D, so we have reduced the dimensionality. [Student question.] Yes, if you run it on the term-frequency matrix as we defined it, these vectors will be representations of your documents, because each column is a document. If you apply latent semantic analysis or indexing to this term-frequency matrix, the result is a vector representation of your documents in a lower dimensional space, not of the words. If you want to do it for words, you apply it to the transpose of the matrix: consider each row as the representation of one word, so each word is represented by the frequency of that word across the different documents. Or you can directly build a matrix out of the sparse 0-1 vectors and do the decomposition on top of that, but then you are doing SVD on a massive matrix, and that is one of the problems with SVD. [Student: Is the other way easier because n is greater than D?] Yes.

[Student: Why not remove the mean?] In latent semantic indexing you want to interpret the data in the lower dimensional space. If you take the mean out, you lose this interpretability, because the frequencies can become negative, and what does it mean for a word to be repeated a negative number of times? It is clearer if you think about images: you can do a singular value decomposition on a set of images, take the first k singular vectors, and the product is a rank-k reconstruction of the original matrix; if you look at each column, it still looks like a real image. But if you center the original matrix first and then do the SVD, the reconstruction does not look like an image anymore; you have to add the mean back, because otherwise you have an image with negative intensities. There are many variations: normalize the data this way, do not normalize it, normalize it another way. LSI is a pretty old method, and it is not the only way to do this matrix decomposition. More recently, non-negative matrix factorization has been applied to this matrix. With SVD you decompose the matrix into factors whose entries can be negative, even though the original matrix X is non-negative, since frequencies are non-negative. One way to interpret the decomposition: suppose I merge the central diagonal matrix with one of the two other factors; then I have a product of two matrices, and the interpretation is that each column of the original matrix is reconstructed as a linear combination of a set of basis vectors. That is how people interpret LSI: when it is applied to a term-frequency matrix, the bases are topics and the coefficients are the proportions of the topics, which is how it is used for topic modeling. Suppose the original matrix has one article per column: one column is an article, the next column is a different article, and so on; I decompose it and I get k bases.
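As an illustration of the non-negative factorization idea mentioned above, here is a minimal sketch using simple multiplicative updates, a standard NMF heuristic rather than any specific algorithm referenced in the lecture:

```python
import numpy as np

def nmf(X, k, n_iters=200, eps=1e-9):
    """Factor a non-negative X into W @ H with W, H >= 0
    using Lee-Seung multiplicative updates (Frobenius loss)."""
    d, n = X.shape
    rng = np.random.default_rng(0)
    W = rng.random((d, k))
    H = rng.random((k, n))
    for _ in range(n_iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Columns of W can be read as "topics" (non-negative word loadings);
# columns of H give each document's mixture over those topics.
W_topics, H_mix = nmf(X, k=2)
```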
Every single document in the original matrix is then a reconstruction from these k bases. Say k is three: everything is a reconstruction of these three. If you look at the words in these three bases, you can see some sort of clustering: some of the words are related to politics, some to, say, sports, some to art, and the reconstruction means that the first article is 50% politics, 20% art and 30% sports. So by decomposing the matrix we decompose the topics of the corpus and cluster them. The problem is that these bases can have negative values, and they are not easy to interpret: what is a basis in which a word is repeated minus five times? So instead of singular value decomposition you may use non-negative matrix factorization, so that both factors are non-negative and you have non-negative bases. In terms of images it is easier to imagine: suppose the columns are images, and I have three, five, ten bases which are non-negative, so each can be interpreted as an image, and I have non-negative coefficients for these bases; each image, each column, is just a summation of the bases. We cannot subtract, we can only add. So what would these bases be? One answer is that they are segments of the original image, and you just add them up. In PCA or singular value decomposition it does not work this way: you may have negative values, you add and subtract, and the bases are not interpretable.

It is not only singular value decomposition and PCA, which are linear methods that find a subspace; nonlinear techniques have also been applied to this type of matrix. Here is a result of locally linear embedding, a nonlinear technique which finds a nonlinear manifold in the space, and you can see the words mapped into a two-dimensional space: there is a two-dimensional vector representation of each word, and if you look at this cloud of points, it corresponds to the words "fought", "fighting", "captured", "killed", and so on; semantically, these words are related. In the original space we had no semantic relation between words, only sparse vectors, but here we do. And it is not just LLE: any dimensionality reduction technique you apply will give you this type of property.

One problem with SVD is that you cannot apply it to huge matrices, because its cost grows quadratically with the smaller dimension of the matrix, so if the matrix is huge, which is often the case, SVD becomes problematic. The other problem is that it is hard to apply to the out-of-sample problem: if you have a new word, you need to map it into the low dimensional space. For plain SVD this is actually not hard, because the mapping is linear: you have a transformation you can apply to the new point; if the method is nonlinear, it is going to be hard. Later on we will see that maybe these criticisms of SVD are not that relevant; this is the classical criticism of SVD, and the classical argument for why we need a different type of method, a neural network type of method, to learn low dimensional representations. Once we have learned about word2vec we will come back to this and see how relevant these criticisms are to the earlier techniques.
Because of these problems with SVD, some people tried to learn vector representations directly, and there is a list of papers that do this. Today we are going to talk about this one: Mikolov published two papers in 2013, and both of them are known as word2vec. One of them is the continuous bag-of-words model and the other is the skip-gram model; both are called word2vec. We are also going to see a little bit of the main concept behind GloVe, a paper published after word2vec, which has been claimed to produce vectors of almost the same quality.

What is the idea behind GloVe and word2vec? The idea is that you can predict the surrounding words of a word given that word, or vice versa: if I give you some words, you can tell me what the central word is, or if I tell you the central word, you can tell me what the surrounding words are. I have a sequence of words; suppose I give you the surrounding words; can you predict what the missing word is? We as human beings can do this: if you do not hear one word in a conversation, you have a pretty good guess of what that word should be, and if it is missing from a text you have a pretty good guess of what it is supposed to be; and the other way around, given one word, you can say something about the others. This is the idea of the continuous bag-of-words model and of the skip-gram model that we will see: if you come up with vectors that are good at predicting surrounding words, that is a good representation. This idea is itself motivated by the distributional hypothesis, which says that the meaning of a word is determined by the words surrounding it: words that appear together tend to have related meanings. That is the main idea.

Based on this idea, these techniques use the concept of a co-occurrence matrix, GloVe directly and word2vec indirectly and implicitly. Suppose my text is "silence is the language of God, all else is poor translation". I can make a co-occurrence matrix as follows. First I need to define a window, and the window has a length parameter; in this example I assume a window of length 1. With a window of length 1, the question is whether a given word appears within one position of another word. Given a word, anything else inside that window is called the context of that word, and the question is whether a particular word is in the context of this word or not. "Silence" is not in the context of "silence", because "silence" never appears right beside "silence"; but "is" appears right beside "silence", so I put a 1 there; "the" does not appear next to "silence", "language" does not appear, and so on. If I look at "is", it is next to "silence" and next to "the", and so on. This is the co-occurrence matrix. Many algorithms, like GloVe, use it directly to build word representations; word2vec uses this information indirectly and implicitly, as we are going to see.
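A minimal sketch of building such a co-occurrence matrix with a symmetric window; the tokenization and window size are just illustrative choices:

```python
import numpy as np

text = "silence is the language of god all else is poor translation"
tokens = text.split()
vocab = sorted(set(tokens))
idx = {w: i for i, w in enumerate(vocab)}

def cooccurrence(tokens, window=1):
    """Count how often each word appears within `window` positions of another."""
    C = np.zeros((len(vocab), len(vocab)))
    for t, w in enumerate(tokens):
        lo, hi = max(0, t - window), min(len(tokens), t + window + 1)
        for t2 in range(lo, hi):
            if t2 != t:
                C[idx[w], idx[tokens[t2]]] += 1
    return C

C = cooccurrence(tokens, window=1)   # symmetric: looks both before and after
```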
If you look at this co-occurrence matrix, it is very similar to our term-frequency matrix, except that instead of one document we have one window: not an entire document but a window of the document, one window represented by the word "silence", another window represented by the word "is", and so on; these different contexts play the role of different documents. [Student: Do you build one per sentence?] No, you build one for the whole corpus. [Student: How do you choose the window length?] It is just a parameter; as with any other technique, you play with it to get the best result and then claim that you beat everyone else. Intuitively, if it is too big it is meaningless: if a word appears a hundred words away from this word, you cannot expect any semantic relation between them, so the window length should be something reasonable. Five, six, seven are common, up to ten or so. If it is as small as one, as it is here, it is similar to the bigram models in traditional language modeling, where you only look at adjacent words. You can go up to six, seven, ten, but making it very large intuitively does not make sense. There are also variations that weight by distance: a word right beside this word gets a higher weight, a word two steps away gets a lower weight, and so on, so the entries are not restricted to 0 and 1. [Student: Could you build separate matrices for different subgroups of the text?] We are trying to avoid any kind of manual clustering of the text; we want everything to be automatic, so usually we just build one co-occurrence matrix for the whole corpus. [Student: Do a word and its plural count as the same word?] In a term-frequency matrix they are usually taken to be the same word, but it depends on your application. We are going to see that word2vec, for example, does not treat them as the same, because it is able to capture the difference between singular and plural quite nicely in the vector space, in the distances between them; but in more traditional techniques, "book" and "books" are usually treated as the same. It also depends on the language: in English you have "book" and "books", but in Arabic the singular and the plural can be completely different words, coming from the same root but in different forms, so part of this is language dependent. [Student: If "silence" occurred before "the", do we capture that order?] No, here we do not capture the order; as long as a word is in the context of another word, it is counted the same way. You go through the corpus one word at a time. [Student: Is the matrix symmetric? You can have a window that only looks forward, or one that also looks backward, which makes it symmetric.] All of these variations exist in the literature, but the most common one is this:
when you define the window, or the context, of a word w_t, the context is all the words w_{t-c}, ..., w_{t+c}, that is, c words before and c words after. That is the most common way of doing it.

I have shown you this example many times; here is another example from word2vec: "Montreal Canadiens" minus "Montreal" plus "Toronto" is "Toronto Maple Leafs". Quite amazing. How does this happen? It sounds very mysterious that a vector representation of words has this kind of property, but if you think about it, it may not be that mysterious. I would like to have this relation: I want king minus man plus woman to be almost equal to queen, which means king minus man should be almost queen minus woman. If I put all semantically related words close to each other, if I have a map in which "king" is close to "man" in the same way that "queen" is close to "woman", "king" is close to "queen" because they are semantically related, and "woman" is close to "man", then I get this property, because the offset between "king" and "man" is about the same as the offset between "queen" and "woman". So when I showed you the results of LLE or latent semantic indexing, where the words are mapped into a two-dimensional space and semantically related words end up together, it is possible that they too exhibit this kind of relation. Word2vec is much more accurate than those methods, but in principle this is something that could happen with any method based on singular value decomposition or dimensionality reduction, any technique that puts similar words close together in a lower dimensional space. I think part of the success of word2vec is this very nice demo: if you present it as "see how close king and queen are, and how close man and woman are", it is not that impressive, but presenting it as "king minus man plus woman is queen" is; they devised a clever way of presenting the result compared with the traditional way of presenting this type of vector representation. (A small sketch of this analogy arithmetic appears a little further below.)

As I told you, there are two models, both published in 2013, same authors, two different papers. The first is the continuous bag-of-words model: given the context, you would like to predict a word. In this sequence of words, assume one of the words is missing and I do not know it; can you predict that word for me? That is continuous bag-of-words. The skip-gram model is the other way around: I know this word; can you predict the context? I know the middle word; tell me the two words before and the two words after it. Again, this is intuitively possible: if I tell you a word, the probabilities of all the other words are not the same; some words are more related to this word than others.

So let's start with the details of the continuous bag-of-words model. I have a context and I would like to predict the word in the middle, and we train a neural network to do this. For simplicity, assume the length of the context is one: I give one word to the network and I would like the network to predict, say, the word that comes after it. If I give you one word, what would be the word that comes after it? I train a network for this.
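Here is the small sketch of the analogy arithmetic referred to above. The `vectors` dictionary is a hypothetical stand-in for trained embeddings; the query is just vector arithmetic followed by a nearest-neighbour search under cosine similarity.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(vectors, a, b, c):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    query = vectors[a] - vectors[b] + vectors[c]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], query))

# vectors = {"king": np.array([...]), "man": ..., "woman": ..., "queen": ..., ...}
# analogy(vectors, "king", "man", "woman")   # ideally returns "queen"
```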
This single-word setup can easily be extended to the case where the context is longer. So this is going to be my neural network. It actually has only one hidden layer; it has been recorded as one of the successes of deep learning, but it is not really deep, it is very shallow, just one hidden layer, even though you see it described in the deep learning literature. As input I give one of these long sparse vectors representing a word: x is a D-dimensional vector, and it is sparse in the sense that x_k is 1 and x_{k'} is 0 for every k' not equal to k. It is 1 in only one spot, and that spot identifies a single word in my vocabulary. Then there are weights between all of the input nodes and the hidden layer; I call this set of weights the matrix W. This is my output layer, and the weights between the hidden layer and the output layer I call W'. Each output y_i is the result of a nonlinear function applied to a linear combination of the hidden nodes: each output node is a linear sum of the nodes in the hidden layer, with something like a sigmoid applied on top.

[Student: Is W just the weights going into the first hidden neuron h_1?] No, W takes care of all of these weights, the weights between x and h. I have D nodes here and P nodes there, so W is a D-by-P matrix and W' is a P-by-D matrix. When I want to find h, which is a vector, h is W^T x; alternatively, you can see each hidden node as a linear sum of all the inputs, which is one row of the matrix W. [Student: Our word vectors are very large; is W stacking them, so that it has an extra dimension?] No, W does not stack the inputs; W is just a transformation matrix that maps x to h. [Student: Each input we feed in is a large vector, and each neuron gives a scalar, so is W three-dimensional?] No: x_1 through x_D are scalars, not vectors; all of them together form one vector in D dimensions. Each of them is a scalar. [Student: I saw a figure where the first layer had w_{t-c} through w_{t+c}.] That was a different W. Let me warn you: I try to be consistent in my notation as much as I can, but the notation for word2vec in the literature is really bad. I have tried to come up with a better one many times, and each time I reach some step where I see it is not a good notation, so let me do it a different way.
I will try to be consistent at least during this lecture, but if at any point there is any ambiguity, let me know. It is quite possible that the earlier W is not the same as this one: there, the small w denoted a word; here, the capital W is the matrix of weights. So the input to the network is just a single word, one single word in that sparse representation where we indicate the index of the word in the vocabulary. [Student: Do we have a bias node for each layer?] In this model they do not have any bias terms; you could add them, but here they do not. As for the output: in this simplified continuous bag-of-words model, where the length of the context is one, I give you one word and I want to predict the next word; but the output is not the next word itself, it is the probability of each possible next word. D is the cardinality of my vocabulary; that is how we represented each word, zero everywhere except in one place, so the length of the input vector is the size of the vocabulary. The output has the same length, but each entry represents the probability that the next word is the word with that index: y_1 is the probability of observing the word with index 1, y_2 the probability of observing the word with index 2, and so on up to y_D for the last word of the vocabulary. [Student: At the input, one entry is 1 and the rest are 0, so when we multiply by the weight matrix, only one row survives.] Yes, that is right. [Student: Do all the y's add up to one?] Yes, that is the idea. Let me denote the word to be predicted by w and the context by c. In this model I want to predict the word given its context: these are the context words and this is the word. In the simplified version I have only one context word as input and one word to predict, so c is the context word and w is the word whose context it is. With this notation, I would like y_i to be the probability of observing word w_i given the context c, and the sum of the y_i over all D entries is one. Is it clear so far? Now, h = W^T x: we transform x to the hidden layer using this matrix. W is D by P, so W^T is P by D, x is D by 1, and h is a P-by-1 vector. But as some of you noticed, x is sparse: it looks like (0, 0, 1, 0, ..., 0). So what happens when I multiply the matrix by this vector? Because everything else is zero, the product picks out exactly one column of W^T, which is one row of W.
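A tiny NumPy check of this row-selection effect; the dimensions are arbitrary, chosen only for illustration:

```python
import numpy as np

D, P = 6, 3                       # vocabulary size, hidden size (arbitrary)
rng = np.random.default_rng(0)
W = rng.standard_normal((D, P))   # input-to-hidden weight matrix

k = 2
x = np.zeros(D)
x[k] = 1.0                        # one-hot input for word k

h = W.T @ x                       # hidden layer
assert np.allclose(h, W[k, :])    # h is exactly the k-th row of W
```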
If the 1 is in the k-th position, the product captures the k-th column of W^T, which is the k-th row of W; so h is going to be the k-th row of W. In other words, h = W^T x, where W is a matrix and x and h are vectors; since x is sparse and nonzero only at position k, W^T x is the k-th row of W. I write it with a MATLAB-like notation, W(k, :): row k, all columns. [Student question.] Yes, what I showed was a column of W^T, which is a row of W, because we multiply W^T by x. So this is the k-th row of W, and let me give it a name. What do we give as input? We give the context, so let me call this vector v_c: v_c is a vector, and h = W^T x = v_c; v_c is just a name for h.

Now I am at the hidden layer and I have computed h. To go from h to the output layer, I multiply by W'^T and then apply a nonlinear function to get y. Suppose y_i is a nonlinear function phi applied to u_i, where u_i itself is just a linear transformation of h to the output layer: u, which is a vector, is W'^T h. Since h is P by 1 and W'^T is D by P, u is D by 1. If I take one element of this vector, u_i is one row of W'^T times h, which is one column of W', transposed, times h; a column of W' is a P-dimensional vector. So just as we called that row of W v_c, let me call column i of W' the vector v_w. Now, y_i is a nonlinear function applied to u_i, but I want to interpret y as a probability, so the outputs must sum to 1. Which function should I use? Softmax. So y_i = e^{u_i} / sum over i' of e^{u_{i'}}, where I use a different index i' so as not to confuse it with this i; written this way, it can be interpreted as a probability. And what is u_i? u_i = v_w^T v_c, which is a column of W' dotted with a row of W. So y_i = e^{v_w^T v_c} / sum over i' of e^{v_{w_{i'}}^T v_c}.
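Putting the pieces together, here is a minimal sketch of this simplified CBOW forward pass: one context word in, a softmax over the vocabulary out. The sizes and the random weights are made up for illustration.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())              # subtract max for numerical stability
    return e / e.sum()

D, P = 6, 3                              # vocabulary size, embedding size (arbitrary)
rng = np.random.default_rng(1)
W = rng.standard_normal((D, P))          # input-to-hidden weights (rows are v_c)
W_prime = rng.standard_normal((P, D))    # hidden-to-output weights (columns are v_w)

def cbow_forward(context_index):
    """p(word | context) for a single context word (context length one)."""
    v_c = W[context_index]               # h = W^T x: the row of W for the context word
    u = W_prime.T @ v_c                  # u_i = v_{w_i}^T v_c for every word i
    return softmax(u)                    # y sums to 1: a distribution over the vocabulary

y = cbow_forward(context_index=2)
print(y.sum())                           # 1.0
```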
Over what do I sum here, over w or over c? Remember, c was the context and w was the word: in the continuous bag-of-words model the input is the context and the output is the word, so I sum over all the outputs, over all possible words as output. The sum is over all words w', not over c; the input c is fixed, I gave it to you. So this is my output y_i, and the w' in the denominator ranges over all the words in the vocabulary; remember D is the number of words, so you sum over all i. Now I would like to train this network, which means I need to find the weights, that is, I need to find these v_w and v_c, because the v_w and v_c are the rows and columns of these two matrices; training the network means learning these vectors. [Student question about the notation.] I just wanted a notation different from the earlier one: these small w's have nothing to do with the capital W; the small w indicates a word, in contrast with a context, and the capital W is for the weight matrix. Believe me, it is still better than the notation in the original paper. We had a group meeting during the summer and spent a fair amount of time reading these papers, and the original papers are quite confusing in their notation and explanations; still, it is not a bad idea to look at them.

So I want to train my network. To train it I need an objective function, and then I need to optimize that objective function. What would be a good objective function here? This is the scenario: you are given a text, and in this text, for each word, you can empirically compute the probability that one word comes after another. What we have here is a model of the probability of a word given its context. I need to find the weights of the network, and the weights of the network are these v_c and v_w, in a way that matches the empirical probabilities I can compute from the data; this is my way of modeling that probability. [Student question.] Yes, we want the correct word to have the largest probability, and we need to define an objective function that achieves this. [Student: Do we go through the corpus word by word, like "the dog is sleeping"?] It depends on the objective function and on how we can optimize it. If it turns out that we have to train it with backpropagation, with gradient descent or stochastic gradient descent, then most likely yes, you go word by word; if it turns out to have a closed-form solution, I would find it in closed form. So it depends entirely on the objective function we define and the tool we have to optimize it; if it is gradient descent, then yes, it will work the way you describe. Actually, one good objective function here, because at the end of the day we have a probability, is the likelihood: maximum likelihood, trying to maximize the likelihood of the model.
The parameters of the model are the v_w and the v_c, the weights in these two layers, and I can treat the output as a probability, so a good objective is to find the parameters that maximize the likelihood. I can define the likelihood of the model, with theta being the v_w and v_c: the likelihood is the product of p(w | c) over all words w in the text, or I can work with the log-likelihood, which is the sum of log p(w | c) over all words in the text. In the next half of the lecture we will see how to do this.
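As a rough sketch of that objective, assuming the toy cbow_forward model above and a hypothetical list of (context, target) index pairs extracted from a corpus, the log-likelihood to be maximized could be computed like this:

```python
import numpy as np

def log_likelihood(pairs):
    """Sum of log p(word | context) over all (context, word) pairs in the corpus."""
    total = 0.0
    for context_index, word_index in pairs:
        y = cbow_forward(context_index)   # distribution over the vocabulary
        total += np.log(y[word_index])
    return total

# pairs = [(2, 3), (3, 1), ...]   # hypothetical (context index, target index) pairs
# Training maximizes log_likelihood(pairs), e.g. with (stochastic) gradient ascent
# on the entries of W and W_prime.
```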
Info
Channel: Data Science Courses
Views: 36,526
Rating: 4.9559231 out of 5
Keywords: Ali Ghodsi, Deep Learning, Text mining, machine learning, neural network, Glove
Id: TsEGsdVJjuA
Length: 73min 29sec (4409 seconds)
Published: Fri Oct 16 2015