Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (Paper Explained)
Video Statistics and Information
Channel: Yannic Kilcher
Views: 19,459
Keywords: deep learning, machine learning, arxiv, explained, neural networks, ai, artificial intelligence, paper, nlp, natural language processing, attention, attention mechanism, linear, linear transformer, linformer, reformer, idiap, epfl, queries, keys, softmax, kernel, routing, inner product, rnn, recurrent neural network, transformer, bert, autoregressive, dimensions, topic modeling, language model
Id: hAooAOFRsYc
Length: 48min 5sec (2885 seconds)
Published: Sat Jul 04 2020