Explained: The conspiracy to make AI seem harder than it is! By Gustav Söderström

Captions
Hey everyone, my name is Gustav Söderström, I'm co-president at Spotify. I was asked by my colleagues to do a deep dive on AI for all of you, from engineers to executives at Spotify, specifically on this new type of generative AI, and to try to explain how these things actually work. How is it that we have services like ChatGPT where you can create an entire novel, or services like Stable Diffusion or Midjourney that can create beautiful images, even music, out of just text or white noise? Now, for us as employees and executives in the tech industry it is quite literally our job to understand this, but I think that even if you're not in the tech industry it is almost an obligation to understand what is going on right now, because this is a big thing. People will talk about 2023 a hundred years from now, because this is the year that computers started passing the Turing test, meaning that they could pass for being human to someone who doesn't know whether they're speaking to a computer system or to a person. I think people will talk about this a hundred years from now just as I talk to my kids about splitting the atom almost a hundred years ago, and I feel that if I had been there in the 1930s when we split the atom, I would have liked to understand what was going on. I wouldn't have liked to realize twenty years later that it happened and I was there but was actually unaware. So I think it's important for everyone to at least get an intuition for what it is that has happened and how it works. My bold upfront promise to you is that after this presentation you will feel like you do understand what is going on, even if you don't know a lot of math. Unfortunately, I've found that it's pretty hard to actually get a grip on what is going on, and I think it's partially because of this: George Bernard Shaw had this notion of professions as conspiracies against the laity, the classic example being a priest class, and what it really means is that any profession tends to raise barriers against other people entering that profession. This can be deliberate, by creating certification authorities and rule systems around the profession, but also less deliberate, by simply creating very complicated vocabulary and lingo around it. You all know what I mean: you take something like finance or legal and you often get the feeling that it seems very complicated and hard to understand, and then when you actually do understand it you ask yourself, why couldn't they just have said that? Often the vocabulary itself makes it seem harder than it actually is. It's usually not deliberate; it is simply a fact that specialized groups tend to create specialized vocabularies, because it's more effective for them to talk to each other at a higher level, but that also creates barriers for everyone else to understand what is going on. I think the thing that makes this seem more complicated than it is, is that people confuse theory with practice. While the practice of getting these things to really work is very complicated and actually does require a lot of math, the theory isn't. I think it's entirely possible to build intuitions about what is going on without having to understand the practical problems. So let's try to see if we can expose this conspiracy. Now, for those of you who know this stuff well, you will see that I take a bunch of shortcuts here and there and that the actual numbers and percentages don't always make full sense, but you'll have to indulge me, as I'm trying to keep it simple and just keep it true enough to create largely correct intuitions.
Are you ready? Let's go. So, what is an LLM? Well, LLM stands for large language model, and this is the thing that powers something like ChatGPT, so after this section hopefully you'll understand how ChatGPT actually works. But there are a bunch of steps we have to go through. The first step is to understand how you even get a computer, which literally only understands numbers, to actually understand language. There is a pretty straightforward intuition for roughly what's going on. Let's say that you as a human take the English dictionary and just start at the first word, so "ace", then "amazing", then "appreciative", "aromatic", all the way to the last word. There are about 600,000 words in the English dictionary, and you just give each of them a number: the first word "ace" gets number one, the second word "amazing" gets number two, the third word "appreciative" gets number three, and so forth. So now you can literally take a sentence in English, let's say "hey how are you", and just look up every word and see what number it has. For a computer, a sentence like "hey how are you" isn't actually a sentence, it's just a sequence of numbers: the word "hey" has, for example, the number 25, the word "how" has the number 30, the word "are" has the number 5, and the word "you" has, for example, the number 75. You may notice these are very small numbers, but that's just to keep it simple. So it's just a lookup table: you have this long list of 600,000 words, a unique number for every word, and you translate a sentence into a sequence of numbers. Let's go into a bit more detail on how this actually works. Let's start with a super simple model. Say we have an Excel sheet where we have every word in the English dictionary as a row, and then every word in the English dictionary as a column, so for every word we have a percentage for how likely every next word is. Let's say we get the word "are"; to the computer that means we get the number 5, and now the job of the language model is to say what the most likely next word is after the word "are", or more correctly, what the most likely next number is after the number 5, which represents the word "are", according to all the words, or all the sequences of numbers, it has seen on the internet. You could imagine that as a human you might give an equal percentage to a lot of words. It could be "are you", because maybe the sentence is literally "hey how are you", but it could be "are they", it could be "are things" because the sentence may be "how are things", it could be "are fine" because the sentence may be "they are fine". Because you only have one word to guess from, one word of context, it is going to be hard to guess correctly; you literally can't guess correctly, you don't have enough information, but you can make a good guess. So maybe words like "you", "they", "things", "fine" have about the same percentage, but you're going to have words with a very low chance of being right: "are animals" is a very uncommon phrase on the internet. It happens, it's just uncommon. So again, the computer is not guessing words. What the computer would say is: based on all the text I've seen on the internet, or more correctly, based on all the sequences of numbers I've seen on the internet according to this translation from words to numbers, after the number 5 I usually see the number 75, which represents "you", or after the number 5 I usually see the number 42, which represents "they", or after the number 5 I've usually seen the number 97, which represents "things".
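To make that lookup-table idea concrete, here is a minimal sketch in Python; the toy vocabulary and the specific numbers (25, 30, 5, 75 and so on) are just the illustrative values from above, not a real tokenizer:

```python
# A toy "dictionary as lookup table": every word gets a unique number.
# Real systems use a learned vocabulary of roughly 50,000 tokens, not the full dictionary.
vocab = {"hey": 25, "how": 30, "are": 5, "you": 75, "they": 42, "things": 97}

def encode(sentence):
    """Turn a sentence into the sequence of numbers the model actually sees."""
    return [vocab[word] for word in sentence.lower().split()]

def decode(numbers):
    """Turn numbers back into words."""
    reverse = {number: word for word, number in vocab.items()}
    return " ".join(reverse[n] for n in numbers)

print(encode("hey how are you"))   # [25, 30, 5, 75]
print(decode([25, 30, 5, 75]))     # "hey how are you"
```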
So it's doing something very similar to what you would do as a human. As a human you have intuitions about the percentages, and I think it's easy to understand that if a computer could just look at all the statistics over all of Wikipedia and all of the internet, it could also get a pretty good idea about the percentages and the statistics of the next word. All right, so this is as far as we get with this one table of 600,000 rows and 600,000 columns. Now let's say we get one more word of context, so instead of "are" we get "how are", which to the computer again is the numbers 30 and 5. Well, now the percentages change. For you as a human, when you just had the word "are", the words "you", "things", "they", "fine" were roughly equally likely, but when you get two words, "how are", all of a sudden "you" is probably more likely, maybe 50 percent likely now, because it's more common to say "how are you" on the internet than "how are they", for example. So if you had to guess, you'd probably go with "how are you", and it's even more likely to be "how are you" or "how are they" than "how are things", perhaps. And if you take something like "fine", which was pretty likely when you just had one word, because "are fine" can happen if you say "they are fine", well, no one says "how are fine", so now all of a sudden the word "fine" has a very low percentage. Translating this to numbers, what it means is that the large language model now has two numbers to guess from, and just like you as a human would guess much better with two words, the large language model, based on all the numbers it has seen on the internet, is going to guess much better with two numbers as well. You can imagine what comes next: what if you had three numbers, or three words? Now the context is "hey how are", or the numbers 25, 30 and 5, and your guess is going to start to get really good. "Hey how are you" is now very likely, let's say 70 percent. "Hey how are they" is maybe even less likely: you might say "how are they", but you seldom say "hey how are they", because "they" are not in front of you, so that might go down to five percent. Something like "hey how are things", which was a bit less likely than "how are they", now shoots up, because "hey how are things" is something you would see on the internet quite often. And then "hey how are fine" and "hey how are animals" get a very low percentage. So the whole point is: the more context you get as a human, the more words you have, the better your guess, the more sure you're going to be about the next word. And it's the exact same thing for a large language model: the more numbers it gets, where the numbers represent words, the better its guess is going to be on the next number. So in a sense, this is all that a large language model does: it just guesses the next word, or more correctly, the next number, from the previous numbers. It's actually a bit of a misnomer that we call them large language models; they should probably be called large number models, or large sequence models, which they sometimes are, because it turns out that you can turn words into numbers, but everything is numbers. You can turn pixels into numbers, or actually, pixels are numbers, so you can put the pixels of an image into these large language models and, based on the previous pixels, the previous pixel numbers, the RGB values, it is going to get very good at guessing the next pixel. You can put in audio samples, for example from someone speaking; those are just numbers, and if you have those numbers and you train on them, a large language model will get very good at guessing the next sample in an audio sequence.
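Here is a toy sketch of how such next-word percentages could be collected by plain counting; the four-sentence corpus stands in for "all the text on the internet", and real models do not literally build this table, but the intuition is the same:

```python
from collections import Counter, defaultdict

# A tiny stand-in for "all the text on the internet".
corpus = [
    "hey how are you",
    "how are you",
    "how are things",
    "they are fine",
]

# Count which word follows each context of one or two previous words.
follow = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i in range(1, len(words)):
        for n in (1, 2):                 # one word and two words of context
            if i - n >= 0:
                follow[tuple(words[i - n:i])][words[i]] += 1

def next_word_percentages(context):
    """Turn raw counts for this context into percentages."""
    counts = follow[tuple(context)]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(next_word_percentages(["are"]))          # one word of context: guesses are spread out
print(next_word_percentages(["how", "are"]))   # two words of context: "you" clearly dominates
```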
So remember, it's not really a language model, it's actually just a number model, and everything in the world can be translated into numbers. Anything that is a sequence: if you have lots of data, these models can learn the statistics of those sequences and correctly guess the next number. So in a sense you now actually understand what a large language model does. But there is a problem here that you should understand: why hasn't this happened before, if it's so simple? Well, even though it's simple in theory, quote unquote just statistics, just guessing the next number, it turns out that guessing the next number for long combinations of context, for many words, is very computationally intensive. Go back to your Excel sheet, where for every word you have a percentage for how likely every other word in the dictionary is. Now, I said there are about 600,000 words in the English dictionary, but let's simplify and just take the most popular 50,000, because 50,000 is roughly the size of the vocabulary that a large language model actually uses. So now we made it smaller: you have 50,000 rows of words and 50,000 columns for every such word, so that's fifty thousand times fifty thousand, which is actually a pretty big Excel sheet, about two and a half billion cells. Pretty unwieldy, but still sort of doable; it's a lot of cells, but it's doable. But this is just with one word of context, and remember, when you just had one word to guess from, your guess was going to be pretty bad. Now let's say you get two words to guess from. Now you're going to have fifty thousand rows times fifty thousand columns, and then times fifty thousand again, because you have combinations of two words. So now you have something like 125 trillion cells in your spreadsheet, and it's getting beyond what is solvable. So it turns out that while the theory, as I said, is deceptively simple, the practice is actually very hard; it's just very, very hard to do these statistics in practice.
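Just to check the arithmetic on how fast that table blows up, under the 50,000-word vocabulary assumption from above:

```python
vocab_size = 50_000

one_word_context = vocab_size * vocab_size         # rows x columns
two_word_context = vocab_size ** 2 * vocab_size    # every two-word combination x every next word

print(f"{one_word_context:,}")    # 2,500,000,000        (~2.5 billion cells)
print(f"{two_word_context:,}")    # 125,000,000,000,000  (~125 trillion cells)
```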
Enter the Transformer. There's this paper that came out from machine learning scientists at Google in 2017 called "Attention Is All You Need", and it suggested a specific machine learning architecture called the Transformer, which is what you see in this image. Don't worry, you don't have to understand the image at all; it's just going to look cool if you can say "oh, I recognize that, it's a Transformer". The only thing you really need to understand is that they managed to find a clever way of solving this problem of how to handle a lot of context. It turns out a Transformer can handle thousands of words, in fact tens of thousands, and recently even something like a hundred thousand words of context, just to guess the next word. Think about that: a hundred thousand words literally means that the job you would have as a human is that you get to read an entire book where only the last word on the last page is hidden, and you have the entire book as context to guess that missing word. Your guess would be amazing, right? You would be almost completely correct, because you would have so much context, even if the missing word was something very unlikely. Let's say the story ends "and then she went to", and maybe the correct word is "Pluto". You would know that, because you know that the rest of the novel was about space and she was on her way there. So you would be able to make even very unlikely but correct guesses, because you have context. This is the problem that the Transformer machine learning model solved. The reason the paper is called "Attention Is All You Need" is that the way it solves this is by allowing the model to literally pay attention, mathematically, to put different weights on different words. It doesn't put the same weight on all the words, so the problem doesn't blow up in the same way as your Excel sheet did; it can pay attention to different words depending on the guess. If you want to know more about this you can read the paper, and don't be afraid: the math isn't that hard, it's mostly a bit of linear algebra and a little bit of calculus, not the hard stuff; this is not quantum mechanics. But this is what everyone is talking about. This is what the Transformer model is, and it simply allowed machines to do these statistics at internet scale. All right, so now we've come pretty far: we have this thing that is very good at guessing the missing word. Take this sentence for example: "My dog's name is Ben. He's a big dog with large paws. Ben likes to play fetch with me." Let's say the large language model hides the word "fetch", so it is on the word "play" and it's supposed to guess the next word. You can imagine that in a model where you just get the word "play", the best guess would maybe be "play soccer", the most common continuation on the internet, or "play basketball" maybe; "play fetch" is pretty uncommon. But now that it has context, the language model can ask who is playing. Well, it's Ben. And then it goes even further back: "my dog's name is Ben, he's a big dog", so now it knows that Ben is a dog, and all of a sudden, with that context, the most likely next guess probably is "fetch", because that's the most common word, or number, to follow that sequence of numbers. So now we have this attention-based Transformer that is very good at guessing the missing word, and it can train itself on the entire corpus of text on the internet.
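For those who want a slightly more concrete picture of what "paying attention" means mathematically, here is a minimal sketch of the scaled dot-product attention at the core of the Transformer, using NumPy; the tiny made-up vectors stand in for word representations, and a real Transformer stacks many such layers with learned weights:

```python
import numpy as np

def attention(queries, keys, values):
    """Scaled dot-product attention: each position takes a weighted average of all
    the values, with weights based on how well its query matches each key."""
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])      # relevance of every word to every other word
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ values, weights

# Three "words", each represented by a 4-dimensional vector (made-up numbers).
x = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 0.0]])

output, weights = attention(x, x, x)   # self-attention: queries, keys and values all come from x
print(weights.round(2))                # each row shows how much attention one word pays to the others
```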
So we're almost there. Now we can generate language. How do you generate language? Well, let's say we have this sentence, "how are", the numbers 30 and 5. All you do, once this model is trained, is ask: what is the most likely next word, according to everything you've seen on the internet? It's probably going to say that the most likely next word after "how are", or the most likely next number after 30 and 5, is "you". Okay, so then you add the word "you", and then you take the text you just generated and put it back into the model and say: now, given this text, "how are you", what is the most likely next word? And the model says: okay, after "how are you" I think the most likely next word is "I". So we add that, and then we take the sentence and feed it back into itself again and say: after the sentence "how are you I", what is the most likely next word? It's probably going to say the most likely next word is "am", and then we add that, feed it into itself again, and ask what the most likely next word is; it's probably "fine". This is how you build sentences, and you can just keep going forever. So what you do with these language models is you put in some text, some numbers, and you ask the model to fill in the likely next number, take that, feed it in again, get the most likely next number, take that, feed it in again, forever. This is what is called an autoregressive model, by the way, if you ever hear that word: it just takes the context and tries to guess the most likely, or one of the most likely, next words. So now we can do something like this: you could, for example, feed half of a Shakespeare text into the large language model, which to the language model again will be a long series of numbers. But it will have seen these numbers on the internet, because there's a lot of Shakespeare, or Shakespearean, text on the internet, so it's going to say "hey, I recognize these numbers, I know what comes next", and if it just takes the most likely next numbers again and again, you are going to get something that is either very close to, or quite literally, the rest of that Shakespeare play. So that's pretty useful: now we have this thing that can take a piece of text and complete it with something very likely, and often, if it existed on the internet, not just something very likely but even the exact same thing, because if it has seen a lot of the same Shakespeare play, those exact words of the Shakespeare play are the most likely next numbers.
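Here is a minimal sketch of that generation loop; the hand-written probability table is invented purely to make the loop concrete, and stands in for a trained model:

```python
# A stand-in for the trained model: given the context, return next-word probabilities.
# These numbers are made up for illustration.
def next_word_probabilities(context):
    table = {
        ("how", "are"): {"you": 0.6, "they": 0.2, "things": 0.15, "fine": 0.05},
        ("are", "you"): {"I": 0.5, "doing": 0.3, "today": 0.2},
        ("you", "I"):   {"am": 0.8, "was": 0.2},
        ("I", "am"):    {"fine": 0.7, "good": 0.3},
    }
    return table.get(tuple(context[-2:]), {"<end>": 1.0})

def generate(prompt, steps=4):
    """Autoregressive generation: guess the next word, append it, feed the text back in."""
    words = prompt.split()
    for _ in range(steps):
        probabilities = next_word_probabilities(words)
        best = max(probabilities, key=probabilities.get)   # temperature zero: always the most likely word
        if best == "<end>":
            break
        words.append(best)
    return " ".join(words)

print(generate("how are"))   # "how are you I am fine"
```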
Okay, but what about creativity? We just said that ChatGPT doesn't just repeat things it has seen on the internet, that it can write poems that never existed, that it can write new text. How does that work? We need to go one level deeper to understand this, but bear with me, because we're almost there. You need to understand something called temperature. What we said previously was that we have the words "how are", or to the computer the numbers 30 and 5, and that we took the most likely next word, which is "you". That's what happens if the temperature is zero; let's say zero means you take the most likely next word. Then you take the most likely next word after that, which is "I", the most likely next word after that, which is "am", the most likely next word after that, which is "fine", and so you're going to get this sentence, and you're actually going to get this sentence every time, because these words will always be the most likely. There's this misunderstanding that large language models are probabilistic, meaning that they give different answers every time, because when you use ChatGPT you actually do get different answers every time, even for the same query. But in reality large language models are very much deterministic if the temperature is zero: if you always take the most likely next word, the statistics don't change, so you will always get the exact same sentence. What is happening in these large language models is that when you want them to be a little bit creative, and not just boring, not just repeating what you already know from the internet, you do something different: you pick something that is very likely, but not the most likely. Think of it as, instead of taking the word that is the most likely, which in this case would be "you", and it would be "you" every time, you randomize a little bit around the most likely words. You're going to pick one of the likely words, but not necessarily the most likely. You're still going to take a word that is very likely, and that's important, because if you take a word that is very likely, the sentence is still going to make sense grammatically, because the word is likely, and it's still going to make sense semantically, because the word, or the number in the case of the machine learning model, is likely. So let's say that we take not the most likely word, "you", but the second most likely, which is "they". Now you have actually created text that never existed on the internet. The statistics are likely, the sentence is going to make sense, because you're picking something that statistically comes after "how are", but it's not a copy of what existed on the internet; it is by definition something new. This is called raising the temperature. The more you raise the temperature, the more you randomize around the most likely words, and you can imagine that if you stay very close to the top, it's still going to be quite similar: it will be new, it will not have existed on the internet unless by chance, but it's going to be very close to what exists on the internet. And the more you raise the temperature, the more you randomize, the further you go from the most likely, the more creative the model is going to get. But if you go too far, it's going to start to pick words that are unlikely, so if you raise the temperature too much it might just say "how are animals", and quite literally, if you raise the temperature too much, the model goes from seeming very creative to starting to seem a bit unhinged and insane. I think there's a very interesting analogy to humans here: it tends to be the case that the most creative humans are somewhere on the borderline, very creative and sometimes seeming a bit crazy. Maybe that's a coincidence, maybe it's not; maybe they just have a higher temperature than the rest of us. We'll know someday. All right, so now we have a large language model that can not only complete a piece of text with the most likely text, but, if you raise the temperature a little bit, can actually complete the text with something novel that never existed before.
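And here is a small sketch of what raising the temperature can mean in code: at temperature zero you always take the top word, and as the temperature rises the sampling spreads out over the other likely words. The candidate words and their probabilities are the toy ones from the example, not real model output:

```python
import math
import random

# Toy next-word distribution after "how are" (invented numbers for illustration).
candidates = {"you": 0.60, "they": 0.20, "things": 0.15, "fine": 0.04, "animals": 0.01}

def sample_with_temperature(probabilities, temperature):
    """Low temperature: almost always the most likely word. High temperature: more randomness."""
    if temperature <= 0.0:
        return max(probabilities, key=probabilities.get)   # deterministic, same word every time
    # Reshape the distribution: p ** (1 / temperature), then renormalize by sampling proportionally.
    reweighted = {w: math.exp(math.log(p) / temperature) for w, p in probabilities.items()}
    r = random.uniform(0, sum(reweighted.values()))
    cumulative = 0.0
    for word, weight in reweighted.items():
        cumulative += weight
        if r <= cumulative:
            return word
    return word

print(sample_with_temperature(candidates, 0.0))                        # always "you"
print([sample_with_temperature(candidates, 0.8) for _ in range(5)])    # mostly the likely words
print([sample_with_temperature(candidates, 2.0) for _ in range(5)])    # occasionally even "animals"
```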
So now we're pretty close. This is actually GPT-3, which came out about a year and a half ago: it was sort of an autocomplete on steroids. It could take some text and complete it very believably, and if you raised the temperature it could complete it with something that was never written before. But it wasn't ChatGPT; for those of you who had access to it, it was quote unquote just an autocomplete on steroids. So a way to think about where we are now, and where OpenAI and other companies were about a year and a half ago at this GPT-3 stage: it's called a base model, and you can think about it as maybe a kid, or sort of an unhinged teenager. It has a lot of knowledge about the world, it can complete text, it can generate text that never existed, but you can't really steer it, you can't really format it. What you would like to have is something like ChatGPT, where it doesn't just complete sentences, it actually answers your questions, and you would also like to be able to steer it to stay away from certain areas and talk about other areas; maybe you would like it to not answer questions about how to create a bomb, for example. So how do we get to that last stage? Well, that requires two things: something called supervised fine-tuning, SFT, and something called reinforcement learning from human feedback, RLHF. Again, conspiracies against the laity: SFT and RLHF, that's exactly how you get people to stay away from your profession. So let me dig into what it actually means; again, it's not very complicated. Let's start with supervised fine-tuning. We said that what we have now is this machine that can take a sequence of numbers and guess the best next numbers, so it's your iPhone autocomplete on steroids. Now, you can imagine that if you take a bunch of documents from the internet that happen to be Q&A documents, where there's a question and someone else answered it, and you train the model on autocompleting those, it will learn the pattern that every time it saw a question, those question-type numbers, there was always an answer. So if you're lucky, even with the base model, if you pose a question to it and it has seen a lot of question-answering dialogues on the internet, it may actually give you an answer, just because that's a pretty common numerical, or language, pattern on the internet. You could say "Q: what is the width of the Earth" and it could respond "A:" and some number. But it could also just answer with another question, because that's also common in the documents it has seen: question after question, for example a math test with no answers. So it kind of has some of this knowledge about questions and answers, but it's not deterministic; sometimes you get what you want if you ask it to autocomplete the right structure, sometimes you don't. So what you do with supervised fine-tuning is this. Whereas the base model was self-supervised, meaning it literally trains on all the text on the internet by hiding one word from itself at a time, or one token (and by the way, for those of you who know this, one token is not exactly one word, but for the purposes of this presentation let's say one token is one word is one number), so the base model trains itself on all of the internet, trillions of tokens of text, now you do something different.
You create a small supervised data set. What does supervised mean? Well, it means that unlike the self-supervised training, this is supervised by humans. You create a bunch of documents that are in exactly the format you want: always a question, always followed by an answer, and you create a pretty small number compared to the internet, let's say 10,000 of these documents. So you ask humans to go and create 10,000 documents of questions and answers, and then you take this base model, which trained on the entire internet, and you do what is called fine-tuning: you train it a little bit more. You don't start over, it's already trained, you just train it a little bit more, only on this data set that is always a question and an answer. What happens, as an intuition, is that the language model keeps all its base knowledge, but it learns this behavior: it's going to start answering everything as an answer to a question, just based on this little extra data set at the end. You can almost think of it as remembering what you did at the end, and if you overrepresented "everything should be an answer to a question", it's going to start mimicking that behavior. So now it takes all the world knowledge it has, but it always responds as if it were answering a question. So now we're really close: we went from an iPhone autocomplete on steroids to a question-answering machine, where everything you put in will be formatted and answered as if it were an answer to a question. Now we have an assistant. But this assistant still doesn't have any real behavior or values; it's just going to reflect whatever values and behaviors and statistics are on the internet, some good, some not good at all. So there is one more missing step, which is: when you ask that question of "how do I create a cheap bomb out of chemicals from my store", how do you get this thing to not actually answer that question? We just trained it to always answer questions. This is where reinforcement learning from human feedback comes in. This is yet another step where you use humans. So now we have this language model, you can ask it questions and it gives answers, and sometimes it gives answers we like, sometimes it doesn't, both in the content but also maybe in the formatting; maybe it answers a question with too much text or too little text. What we do now is a really clever step: we take a bunch of humans, and again we're talking not millions but a few thousand, and we take a single question and ask the language model this question, and we get a bunch of different answers. Remember, the temperature is not zero, so we don't get the same answer every time; we raise the temperature a little, and so we're going to get a lot of different answers for the same question. So: one question, a bunch of answers. Then you ask these humans to rank the answers: for this same question, which of these answers did you like the most and the least? You rank them, you give them a score. Let's make it simple and say there are 10 answers, and you as a human are supposed to score them from ten points, nine, eight, seven, all the way down to one. So now you get a new type of data set, where for the same question you know what a good answer looks like according to a human, you know what a bad answer looks like according to a human, and everything in between.
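As a sketch of what these two human-made data sets might look like, with invented example questions, answers, and scores:

```python
# 1) Supervised fine-tuning data: always a question followed by a good answer.
sft_examples = [
    {"question": "What is the circumference of the Earth?",
     "answer": "Roughly 40,000 kilometres around the equator."},
    {"question": "Who wrote Romeo and Juliet?",
     "answer": "William Shakespeare."},
    # ... about 10,000 of these, written by humans
]

# 2) Preference data for the next step: one question, several model answers,
#    each scored by a human from 10 (best) down to 1 (worst).
preference_examples = [
    {"question": "How do I get better at running?",
     "ranked_answers": [
         {"answer": "Build up distance gradually and rest between hard sessions.", "human_score": 10},
         {"answer": "Run.", "human_score": 4},
         {"answer": "Running is a popular sport.", "human_score": 1},
     ]},
]
```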
Now what you do is take another machine learning model, technically actually also a language model, and train it on a slightly different task. You take the question that you had and one of the answers that you had, you put them in together, and you ask this machine learning model to guess how a human would have scored it: was this the ten-point answer to the question, or the one-point answer, or the five-point answer? And you train it until it gets really good at predicting that this type of answer a human would have scored ten, and this type of answer a human would have scored one. So now you have built something called a reward model: a model that is good at guessing, based on supervised data from humans, what a human would have thought about a given answer. Okay, so now we're almost there. We have the large language model, we supervised-fine-tuned it to always answer as if it had been asked a question, and it produces answers, but answers of varying quality. And now we have this other model, the reward model, that can look at an answer and say what a human would have thought about it, whether it was good or bad. Now you just take these two models, hook them together, and let it go. This is the reinforcement learning part. Now it can supervise itself: it takes a question, generates an answer, scores its own answer and says "that was bad, I should do better, I should do this instead", scores another answer and says "that was good according to a human, I should do more of this". So now it can train itself against the reward model; this is the reinforcement learning, and it's a closed system where you don't need humans, so you can do this millions, tens of millions, hundreds of millions of times, until it gets really good at answering not just in the format a human wants, but also in the style, and literally, if you choose, with the values that you want. I think this is interesting, because people ask "what does this model think, what are its values?", and the truth is that in the base layer, the values of the model are just an average of the entire internet: it is what the internet thinks is good and bad. But the reinforcement learning step is what actually inserts a certain behavior, and that comes from a pretty small group of humans, so that is where a lot of the responsibility lies for how this model behaves.
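A very compressed, pseudocode-style sketch of how the two models are hooked together; the method names (generate, score, update) are placeholders, and the real training algorithms used in practice involve a lot more machinery:

```python
# Sketch of the reinforcement learning loop described above.
# `language_model` is the fine-tuned Q&A model and `reward_model` is the model
# trained to predict human scores; both are assumed to already exist.

def rlhf_training_loop(language_model, reward_model, questions, steps=1_000_000):
    for step in range(steps):
        question = questions[step % len(questions)]
        # Temperature above zero, so the model produces varied answers.
        answer = language_model.generate(question, temperature=0.8)
        # The reward model stands in for a human judging the answer.
        score = reward_model.score(question, answer)
        # Nudge the language model towards answers that score well
        # and away from answers that score badly.
        language_model.update(question, answer, reward=score)
    return language_model
```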
All right, that is ChatGPT. Now you understand how it works: you've gone from back in history all the way to 2023, and you understand how these things are passing the Turing test. It wasn't that complicated, was it? At least you can imagine how you as a human would solve these things if you had infinite time and infinitely big Excel sheets. So now one question is: why is everyone so surprised, and why did quote unquote no one see it coming? Some people claim they did, but most didn't; even machine learning scientists are in general very surprised at how quickly this happened, and the machine learning models themselves aren't that new. The Transformer architecture in 2017 was clearly an innovation, but language models, and modeling language, have been around for a long time. So what was it that surprised people, including the experts? Well, it was scale and speed. The notion of guessing the next word was not recently invented; people have been trying to do this for a long time, and when the Transformer architecture came along it became easier to do at scale. But what wasn't obvious was that just doing this simple thing much more would start giving completely new behaviors, what are called emergent behaviors. You could see these large language models not being very good at certain things, like math for example, and then, without changing the architecture, just by scaling up, all of a sudden they started getting good at them. That was surprising: that you would get these emergent capabilities that many people thought would require some new mathematical or architectural innovation. Just that scale alone improved performance was very surprising. The other thing that I think surprised most machine learning people is this creativity thing. The temperature notion was not surprising to people in machine learning, it's been around forever, but even machine learning people were surprised that it actually scales to something that looks very much like human creativity. It is very much unknown whether this is what we do, but certainly the result looks like what we do, so either we got the type of creativity that we have, or we managed to simulate the type of creativity that we have, and that surprised people. And lastly, the thing that was missing was the ability to steer it: the supervised fine-tuning to give it a certain behavior, to be an assistant, and then the reinforcement learning to be able to use it practically. GPT-3 was really cool as an autocomplete, but it was a bit unhinged; supervised fine-tuning made the interface workable, because it became a Q&A machine, but it was still unhinged; then reinforcement learning from human feedback made it practically useful in reality. Those were the unlocks. And I think what is really interesting here isn't that we're surprised that we finally managed to crack intelligence and how complicated it turned out to be; what surprises people is actually the opposite, that we kind of cracked intelligence and it was so simple, it's almost provocatively simple. About a year ago, around ChatGPT or GPT-3, or actually around GPT-2, these models were often called quote unquote just statistical parrots, and that was meant as a derogatory term: they just parrot back, as I said, statistics from the internet, this isn't real intelligence, maybe useful but not impressive. But then as GPT-3 came along, then GPT-3.5, ChatGPT, GPT-4, this question of "aren't these things just statistical parrots?" turned into "oh crap, what if we are just statistical parrots?". It really puts a mirror up to ourselves, and it gets a lot of people to start thinking about what they are, which I think is very exciting. All right, so we did the first part: we actually went through what a large language model is and how ChatGPT works, and now I think you have as good an intuition as most people about what it is that actually happened. Hopefully you have at least an intuition for how something like GPT works, why it works the way it does, how it can understand questions and answers, and why, when you ask it how to create a bomb, it doesn't actually tell you and instead says "as a large language model, I'm not going to answer that question"; that is the reinforcement learning from human feedback part. All right. So you've understood that language is sort of just statistics, because language can be represented as numbers, and maybe it is even to us, actually, and numbers are just statistics. But I want to teach you one more thing that I think is really cool, again made overly complicated; it's not actually that hard to understand, but once you understand it, your mind is a little bit blown.
You may have heard about something called vectors, or vector space, or embeddings, or codes, or distributed representations, all of these fancy words, without necessarily understanding what they are. I'm going to explain what they are, and again show you that it's a very straightforward concept, but still very, very cool. We said that language can be represented as numbers, and that you can simply give every word in the dictionary its own number. But it turns out that instead of giving a word just one number, which you can do, you can also do a little better than that: you can take a word and, instead of just saying the word "are" is the number 5, represent it with three or four numbers. What do I mean by that? Let me show you. Let's take a very simple world. Let's say we live in a universe that only has three dimensions in it: things are either royaltyness, masculinity, or femininity. It's a very simplified universe, there are only three dimensions to everything. Now, in this universe you can take a word like "king" and, instead of just saying "king" is the number 29, you can say how much of each of these dimensions is in the word "king". You can say that the word "king" has almost 100 percent royaltyness, so 0.99 (nothing in statistics is a hundred percent; one means 100 percent, zero means zero percent, so 0.99 means almost 100 percent royaltyness), because a king is almost always a royal. The word "king" is also very high on masculinity, almost 100 percent: as far as we know, most kings so far have historically been men. And then you have femininity, which is low, probably not zero, nothing is ever zero, but low statistically on average, so 0.05. So now, instead of a single number for the word "king", you have three numbers, which correspond to dimensions of how much of something that word is. Now let's take another word, "queen". A queen is also about 100 percent royalty, right, a queen is almost always royal; she is very low on masculinity, let's say 0.05, almost zero; and very high on femininity, 0.98 for example. Let's take another word, "woman". A woman could be royalty, but statistically across the population it's pretty low, almost zero percent; very low masculinity; and very high femininity. And let's take a word like "princess". A princess is interesting because she is almost always royalty, so very high on royalty, usually not masculine, and very high on femininity. All right, so now we have these four words described not as single numbers but as several numbers, and these numbers represent how much of some dimension is in each of these words; in this simple universe we just have these three dimensions, how much something is royalty, masculinity, and femininity. But you could easily imagine, as we did before, that instead of these three dimensions you literally take the entire English dictionary as dimensions. Maybe you could have a dimension that is age, and you could say, if 1 means a hundred percent old and 0 means young, that a king is maybe 0.7, 70 percent, in terms of age, he's older; a queen maybe 0.6, a bit younger on average; a woman literally 0.5, fifty percent, right in between; whereas a princess would usually be young, so maybe 0.1.
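Written out in code, those toy vectors might look like this; the dimensions and numbers are just the illustrative ones from above, not real embeddings:

```python
import numpy as np

# Each word is three numbers: (royaltyness, masculinity, femininity).
word_vectors = {
    "king":     np.array([0.99, 0.99, 0.05]),
    "queen":    np.array([0.99, 0.05, 0.98]),
    "woman":    np.array([0.02, 0.01, 0.99]),
    "princess": np.array([0.98, 0.02, 0.99]),
}

for word, vector in word_vectors.items():
    print(f"{word:>8}: {vector}")
```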
And you could just go down the English dictionary: as a human you could intuit, you could try to put a percentage on how much of every word in the English dictionary is in every other word of the English dictionary. Does that make sense? You take the word "king", and in the worst case you take those 600,000 words, and you try to put a percentage on how much "king" is royalty, masculinity, femininity, age, car (that one would be zero percent), and so on. Most of these would actually be zero percent, but you can imagine a very long vector where we have a percentage for how much of every word is in every other word, so it would be a vector of 600,000 numbers. In reality that's not how you do it; these models usually have about 1,000 dimensions, and they sort of pick the most useful dimensions, and I'll come back to how they pick them later, but it's good to know if you're thinking that 600,000 seems unwieldy: that's true, in practice you would have about a thousand dimensions that describe every word and how much of something is in those words. But to keep principles separate from practice, let's go back to our simple universe where there are only three dimensions. So now we have these words described by how much they have of these three dimensions, and now we can do something really cool: we can actually do mathematics with language, because it's represented as numbers. Let me show you. We have our universe with the three dimensions, and we take a word, the vector for that word: we take "king", which was almost 100 percent royalty, almost 100 percent masculinity, and almost zero percent femininity, and then we quite literally subtract a man. So we take a king and we subtract a man. Let's do the math. What happens to the royalty? We had 0.99 royalty in the king, minus 0.01 royalty in the man, which means 0.98 royalty. We had 0.99 masculinity in the king, minus 0.99 masculinity in the man, so now we have zero percent masculinity. And we had 0.05 femininity minus 0.05 femininity, so we have zero percent femininity. So we took a king, we subtracted the man from the king, and we got a new word vector. What word do you think this is? What is it that is 100 percent royalty but genderless? It is royaltyness, pure royaltyness. Okay, so now we got a new word. What happens if we take pure royaltyness and we add a woman? Let's do the math: we have 0.98 royaltyness plus another two percent royaltyness in the woman, which means we get to literally 100 percent royaltyness. We had zero percent masculinity plus 0.01 masculinity, so almost zero percent masculinity. We had zero percent femininity plus 0.99 femininity, so almost 100 percent femininity. So now we have this new word vector which is almost 100 percent royalty and almost 100 percent femininity. What is that? That's a queen. So all of a sudden you can take a king, subtract the man, add a woman, and you have a queen. You are quite literally doing math with words, and this is why vectors are so interesting and useful: because they encode how much of something a word is. This can be really useful, and I'm going to show you how, but first a question might be: how would you go about doing this in practice? In theory, again, you as a human could just sit and guess at these percentages; actually, we just did, intuitively. If you can do it, the computer could probably do it too. But how would you do it statistically? Well, again, there is the internet.
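Here is that same arithmetic done in code with the toy numbers, plus a simple cosine-similarity lookup to see which known word the result lands closest to; all numbers are illustrative, and the "man" vector is the one implied by the subtraction above:

```python
import numpy as np

# (royaltyness, masculinity, femininity) -- the toy numbers used above.
words = {
    "king":  np.array([0.99, 0.99, 0.05]),
    "man":   np.array([0.01, 0.99, 0.05]),
    "woman": np.array([0.02, 0.01, 0.99]),
    "queen": np.array([0.99, 0.05, 0.98]),
}

def closest_word(vector):
    """Find the known word whose vector points in the most similar direction."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(words, key=lambda w: cosine(words[w], vector))

result = words["king"] - words["man"] + words["woman"]
print(result.round(2))        # roughly [1.0, 0.01, 0.99]: pure royaltyness plus femininity
print(closest_word(result))   # "queen"
```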
Here's one way of doing it. You take the entire internet, or maybe just Wikipedia, and for any word you want to learn, in this case the focus word "learning", you just look at how close the other words are to it in a sentence. For example, in the sentence "an efficient method for learning high-quality distributed vector representations", you can see that the words "for" and "high" are right around the word "learning", so the computer will give them a high score, because they are literally, physically close to the word "learning", whereas words like "an", "efficient", "distributed", and "vector" are further away, so they will get a lower score. If you take that simple method and go through all the documents on the internet and ask, for this word, which words come up really close to it, those words are probably related to it, so they get a high percentage, and words that are far away from it across all the documents on the internet get a low percentage. So now, just like with the LLMs, you have a scalable statistical method of learning these statistics and learning these vectors. Hopefully you now have an intuition not just for what a word vector is, but also for how you could automatically learn word vectors for essentially every word on the internet, just based on how close it is to the other words in sentences. Okay, so now you kind of know what a vector is, but why is it useful? Well, think of this in our simple three-dimensional universe, where there are only three dimensions, royaltyness, masculinity, and femininity, and you have this vector for "king", for example, that is almost one on royalty, almost one on masculinity, and almost zero on femininity. You can think of that literally as a vector in this simple three-dimensional space that points in a certain direction, or, another way to think about it, as sitting at a certain place in that space. It turns out that words that are similar, that have a lot of the same dimensions, are going to be pointing in roughly the same direction, or be close to each other, in this vector space. The words "king" and "queen", we just said, are going to be close to each other, because they're both going to be high on that royalty value, and they're going to have a bunch of other dimensions in common. And if you can intuit that words can be close to or far away from each other in this vector space, then it's pretty intuitive that sentences, which are just combinations of words, can be too: if you literally take the vector for every word in a sentence and sum up how much royaltyness, how much masculinity, and how much femininity are in all of those words, you get a summed value for the entire sentence, so now you can also say how close sentences are to each other in this very simplified three-dimensional world. It turns out that a sentence like "the lion is the king of the jungle" is going to be pretty close in this vector space to "the tiger hunts in this forest", and that kind of makes sense to you as a human, I think, because lion and tiger are somehow similar: they're both animals, so they will be high on an animal dimension; they're both sort of majestic, so maybe they're high on that majestic, royaltyness dimension; and jungle and forest should be close, right, there are trees in both, they're similar. So it kind of makes sense to you as a human that "the lion is the king of the jungle" is probably close to "the tiger hunts in this forest".
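A toy version of that counting idea; this is roughly the intuition behind word-vector methods like word2vec, although the real algorithms are cleverer than plain counting, and the two-sentence corpus is just for illustration:

```python
from collections import Counter

corpus = [
    "an efficient method for learning high quality distributed vector representations",
    "machine learning methods for learning word vectors from text",
]

def co_occurrence_scores(focus_word, window=2):
    """Count how often each word appears within `window` positions of the focus word."""
    scores = Counter()
    for sentence in corpus:
        words = sentence.split()
        for i, word in enumerate(words):
            if word == focus_word:
                neighbours = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
                scores.update(neighbours)
    return scores

print(co_occurrence_scores("learning"))
# neighbours such as "for" get high counts because they sit right next to "learning"
```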
And if you look at the vector dimensions, it will be, because it scores high on the same dimensions, whereas a sentence like "everybody loves New York" is probably going to be further away in this vector space from those two sentences. So again, now you have this way of taking language and not just turning it into a single number, but turning it into several numbers, which lets us understand that certain words or sentences are more or less similar to each other, or, as it is called, close to each other. We thought about it in a simple three-dimensional world, but the real version would be taking every word in the dictionary, 600,000 dimensions; it doesn't matter, think of it as three dimensions, it's the same concept. Okay, why is it helpful to understand whether things are close to or far from each other? Let me give you an example that I think will drive this home. Here's another world. This world has three other dimensions: how much something is rock, how much something is classical, and how much something is EDM, electronic dance music. Again, a simple world; there are only rock, classical, and EDM as dimensions in this world. And now we take something that isn't a word but a song: "Here Comes the Sun" by The Beatles, "Für Elise" by Beethoven, "Levels" by Avicii, and "Bohemian Rhapsody" by Queen, and we try to give percentages for how much rock, classical, and EDM there is in each of these songs. Let's see if we can agree. Let's start with "Here Comes the Sun" by The Beatles. It is pretty much a definition of rock, so maybe it's 0.98 rock; there's not a lot of classical in there, let's put it close to zero, 0.02; and there's very little EDM, it wasn't even invented then, so 0.01. So now we have a vector for "Here Comes the Sun" by The Beatles. What about "Für Elise"? Well, not a lot of rock in there, so about zero percent, 0.01; a lot of classical, the definition of classical, so 0.99; and almost no EDM, 0.05. Then Avicii comes along, and there isn't that much rock in "Levels", let's say 0.02, not a lot of classical, 0.01, but a lot of EDM, it kind of defined EDM, so let's say 0.99. And then "Bohemian Rhapsody" is interesting, because it's not just one of these things; I think most of you would agree that "Bohemian Rhapsody" is a pretty unique song. It has a lot of rock in it, so probably 0.99 rock, but it actually has a lot of classical in it as well, maybe 0.99 classical, though it doesn't have a lot of EDM, so maybe 0.05. So now you've taken songs, and if you imagine that you as a human would sit and take all the songs on Spotify and score them on these dimensions, it's now really useful to understand which of these songs (think of each song as a word) are close to each other in vector space. You can say that this user listens to "Für Elise", so they may be interested in "Bohemian Rhapsody", and those two will be close to each other in vector space because they both score high on classical; but they may not be interested in "Levels", because it doesn't overlap at all on any of these dimensions, so "Levels" and "Für Elise" are far away from each other in vector space, as is "Here Comes the Sun"; those three will all be far away from each other. But "Bohemian Rhapsody" will be quite close to both "Für Elise" and "Here Comes the Sun", because it's high on both classical and rock. This is actually how recommendation systems like Spotify's work.
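The same idea in code, using the made-up scores from the example; a real system learns these numbers rather than having humans type them in:

```python
import numpy as np

# Each song is three numbers: (rock, classical, edm) -- the toy scores from above.
songs = {
    "Here Comes the Sun": np.array([0.98, 0.02, 0.01]),
    "Für Elise":          np.array([0.01, 0.99, 0.05]),
    "Levels":             np.array([0.02, 0.01, 0.99]),
    "Bohemian Rhapsody":  np.array([0.99, 0.99, 0.05]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def most_similar(title):
    """Rank all other songs by how close they are to this one in the toy vector space."""
    others = [(other, round(cosine(songs[title], vec), 2))
              for other, vec in songs.items() if other != title]
    return sorted(others, key=lambda pair: pair[1], reverse=True)

print(most_similar("Für Elise"))
# "Bohemian Rhapsody" comes out on top: it is the only other song with a lot of classical in it.
```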
But one question is this: I explained how you could learn these vectors for words, by going through the internet, taking the text on Wikipedia, and noting that this word is close to these other words across all the documents on the internet. How would you go about doing this for songs? You almost wish there were a lot of documents where songs sit close to each other. That's what a playlist is. Spotify has a few billion of these playlists, and if you think of a playlist as a sentence, you can literally take the song in the middle and ask how close this song is to all the other songs in this playlist, or, if you think of all the playlists as one big document, songs that are in the same playlist are probably close to each other, so they would score high relative to each other. So you can see how you could build a vector where you kind of understand how much of every song on Spotify is in every other song on Spotify, as a percentage, and now you have a vector representation of every song on Spotify. And because it's a vector and it lives in this multidimensional world, you can do recommendations. A taste profile on Spotify is actually, if you simplify a little bit, just all the songs that you listen to, with all those dimensions added together: you get a score for how much classical there is in all the songs you listen to, you sum that up, you divide by the number of songs, and then you have a classical score for that user. You do the same for rock, the same for EDM, the same for jazz, and now you have a taste profile for that user, and you understand where that user is in the vector space. And you can say that these two users have almost the same vectors, they're close to each other, they have the same music taste. So now not only do you understand what vectors and vector space and embeddings and codes and distributed representations are (it's all the same thing, it's all what I just showed, just different names that make it seem harder than it is), but you also understand why they're useful, and you happen to have accidentally understood how a recommendation system works.
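And a sketch of the taste-profile idea: average the vectors of the songs a user listens to, then compare users. It reuses the same toy song vectors, and real taste profiles have far more dimensions:

```python
import numpy as np

songs = {
    "Here Comes the Sun": np.array([0.98, 0.02, 0.01]),
    "Für Elise":          np.array([0.01, 0.99, 0.05]),
    "Levels":             np.array([0.02, 0.01, 0.99]),
    "Bohemian Rhapsody":  np.array([0.99, 0.99, 0.05]),
}

def taste_profile(listened_titles):
    """Sum the vectors of everything the user listened to and divide by the number of songs."""
    return np.mean([songs[title] for title in listened_titles], axis=0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

user_a = taste_profile(["Für Elise", "Bohemian Rhapsody"])
user_b = taste_profile(["Für Elise", "Here Comes the Sun"])
user_c = taste_profile(["Levels"])

print(round(cosine(user_a, user_b), 2))   # fairly high: both profiles are heavy on classical with some rock
print(round(cosine(user_a, user_c), 2))   # low: user C's profile is almost pure EDM
```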
All right, so now you hopefully understand what a large language model is, how GPT works, what a word vector is and why it's useful. But I also promised to explain how it is that you can make images from text, or images from noise, and even music from noise. In order to do that we need to go one step deeper, and I'm going to explain in a simplified way what a neural network actually is. Again: in practice very complicated, in theory not that complicated. The neural network is loosely based on the biological neuron that we have in our brain, and it looks something like this in a cartoon. You have these little arms on the yellow part called dendrites; those are the inputs. Let's say they get electrical signals from your retina: light hits your eyes, it becomes electrical signals, and they go into these dendrites in the yellow part. These signals combine, and let's say that this particular neuron is looking for vertical lines, or horizontal lines, in front of your eyes. Maybe this particular neuron, when it gets a certain pattern of these electrical signals on its dendrites, hits some threshold that says "hey, I think I'm seeing a vertical line here", and then it sends a spike along the axon to the right, which goes to the next layer of neurons, which goes to the next layer of neurons. This is all that a neuron does: it takes a few input signals, it combines them, and if it hits a certain threshold value it says "hey, I'm seeing something here" and sends a signal, a spike. What computer scientists did was make a very simplified, idealized mathematical version of the biological neuron, called the artificial neuron. The arrows pointing in are the equivalent of the dendrites, so you have a1, a2, a3; these would be the electrical signals from the eyes, and in a computer world they would be the pixel values from a camera. They come in and get multiplied by these things called w1, w2, w3, which are weights; I'll talk about those later. You take the input signals and multiply them by the weights, and if the cell body, the circle in the middle, hits a certain threshold, for example the cell says "I think I see a vertical line", it sends a spike, or signal, to the right, called z. So it's a very simplified version of what the biological neuron does. If you didn't fully get that, don't worry, you'll see it in practice. Now let's say that we have a picture of a cat, why not, the internet is full of cats, and let's say you take a camera, a photo or video camera, and point it towards this cat. What the computer is going to see are pixel values; remember, to the computer everything is numbers, and the pixel values are just numbers. Maybe we simplify and say it's grayscale, so the number 0 is black and the number 255 is white, and in between there are shades of gray. Just numbers. Now you set up a bunch of these artificial cells, and each of their inputs gets a pixel value. You can imagine that maybe the top neuron is going to spike if it sees a diagonal line in one direction, and it says "hey, I see a diagonal line", and the neuron below it is maybe looking for a diagonal line in the other direction, and it's only going to spike and send a value to the rest of the network if it sees that. And now you add a second layer to the network; this is why they're called deep neural networks, you add more and more layers. The second layer of neurons can get spikes from the first layer, which say the first layer saw a diagonal line like this and a diagonal line like that, and a second-layer neuron is going to spike only when it sees both of those in the first layer; so maybe that is actually the shape of the tip of a cat's ear, and maybe this other combination is the shape of a whisker. You go one more layer, and it combines all of these signals, and at the very end of this network, which can be very deep, the last neuron would actually say "I see, from all the inputs from all of these layers, what is actually a cat", and now you have what is called a cat classifier. So how does this work? Well, if you look at this, you can almost intuit that if you have all of these weights, the w1s and w2s and w3s, and you had infinite time, you could imagine sitting and tweaking all of those numbers so that these neurons happen to hit their thresholds and spike exactly for the shape of a cat, from many different directions: some of these neurons are looking for the tips of ears, some are looking for whiskers, some are looking for eyes and paws, and so forth. You could imagine that, with infinite time, you could tweak all of these parameters in the network so that it only spikes, all the way through to the end, when there is a cat, but not when there is a car or an airplane or anything else. So again, in principle not that hard to intuit, in practice pretty complicated: how would you tweak these parameters, given that there can literally be many billions, almost up towards a trillion now, of parameters in a network like this?
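A minimal version of that artificial neuron in code; the inputs, weights, and threshold are made-up numbers, just to show the multiply, sum, and threshold steps:

```python
def artificial_neuron(inputs, weights, threshold):
    """Multiply each input by its weight, sum them up, and 'spike' (output 1)
    only if the total crosses the threshold."""
    total = sum(a * w for a, w in zip(inputs, weights))
    return 1 if total >= threshold else 0

# Three input signals (e.g. pixel values scaled to 0..1) and three weights.
pixels = [0.9, 0.1, 0.8]
weights = [1.0, -0.5, 1.0]

print(artificial_neuron(pixels, weights, threshold=1.5))            # 1: this neuron "sees" its pattern
print(artificial_neuron([0.1, 0.9, 0.1], weights, threshold=1.5))   # 0: it stays silent
```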
But we're on the theory level, where things are simple, so there is really only one thing you need to understand here. Scientists came up with something a long time ago called backpropagation, and what it means is that they found a way for the model to teach itself all the right parameters. The way to think about it is this: you have a lot of pictures of cats and of things that are not cats, and instead of a human sitting and tweaking the parameters, you just initialize all of them as random numbers. Completely random. The first time you show this network a cat image it is going to get it completely wrong, it's random, by definition. But it also means that, by random chance, it will sometimes guess correctly that a cat is a cat. Then, through backpropagation, you say: hey, wait a minute, you happened to guess correctly, keep those values, in fact move all the weight values a little bit in that direction, because you were right. And when it guesses wrong, you do the opposite and move them a little in the other direction. You do this for literally tens of thousands, hundreds of thousands, millions of images. You ask, is this a cat, it says yes, and you say good network, move all these values a little more in that direction. You show it an airplane, it says cat, and you say bad network, move all the values a little in the other direction. You just keep reinforcing it, and if you do this millions of times, all of these parameters, because you keep reinforcing the good behavior, end up finding exactly the combination of shapes, through multiple layers, that represents a cat but not a dog and not anything else. So this is what a neural network does. Again, it's quite simple in theory, even though it was hard and took a long time to get working in practice. This is important to understand, because once you get this notion of taking numbers, multiplying them, seeing if they cross a threshold, then taking those numbers and multiplying them again, you can understand how these image generation networks actually work.
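To give a flavor of what "nudge the weights in the right direction" looks like in code, here is a deliberately tiny Python sketch. It is a single layer trained with gradient descent on invented data, not literal backpropagation through a deep network, but the loop structure is the same idea: show examples, measure the error, move the weights a little, repeat.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 200 "images" of 4 pixels each, plus a constant bias input.
# Label 1 means "cat", 0 means "not cat"; here a "cat" is simply any image
# whose first two pixels are bright, a stand-in for real learned shapes.
pixels = rng.random((200, 4))
labels = (pixels[:, 0] + pixels[:, 1] > 1.0).astype(float)
inputs = np.hstack([pixels, np.ones((200, 1))])   # append a bias input of 1

weights = rng.normal(size=5) * 0.01               # start from (almost) random parameters
learning_rate = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(5000):
    guesses = sigmoid(inputs @ weights)           # the network's current guesses (0..1)
    errors = guesses - labels                     # positive when it wrongly leans "cat"
    gradient = inputs.T @ errors / len(labels)    # which way each weight should move
    weights -= learning_rate * gradient           # nudge the weights toward better guesses

accuracy = np.mean((sigmoid(inputs @ weights) > 0.5) == labels)
print(f"training accuracy: {accuracy:.2f}")       # well above chance after enough nudging
```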
Here's another concept that I think is very interesting to understand: intelligence is compression. This is not a proven fact, but it is a theory a lot of people hold, that one way to think about intelligence is as compression. What do I mean by that? Back to intuition: if you speak to someone who knows something very well, they're usually very good at explaining it, whereas someone who doesn't know it well finds it hard to explain. The person who knows something well can explain it in a simple way, and if they can explain it simply, they probably understand it better than someone who can't. You can already see there's something around compression here: that person probably spent a lot of time on the problem, learned to take all the details and compress them into what they actually mean, understood it deeply, and then all of a sudden they're able to explain it. There is actually a competition on the internet called the Hutter Prize, where you're supposed to take all of Wikipedia and compress it as much as possible, including the extractor itself, without losing any of the information in Wikipedia. The idea is that in order to compress Wikipedia effectively, the system that compresses it has to understand a lot about the world. You have to be smart in order to compress; you have to understand the dimensions of the world really well. And like I said, this is intuitive: humans who can explain things simply usually understand more, so the system that compresses well had to develop understanding. Think of intelligence as a side effect, a necessary evil, of being able to compress information. In the Hutter Prize you can actually make money for every percentage by which you can compress Wikipedia, losslessly in that case, but the general concept is this: if you can compress something and retain most of the value, you probably understood it really well, because you could represent the same thing with less information, and representing something with less information requires understanding on the part of the system doing the compressing. All right, let's get a little more practical. Remember, to a neural network everything is just numbers: language is numbers, pixels are numbers, audio samples are numbers, DNA sequences are numbers, anything is a number. So take this neural network that looks a bit funky, and let's say we take a sentence like "a cat jumping out of a window", which again is represented as a sequence of numbers, say six numbers. This neural network takes those six numbers and, through these neurons, multiplies and adds them so they're represented by fewer numbers, then multiplies and adds again, and in the middle it only gets three numbers to represent those six. I know it seems weird, but hang on. We force the network to take six numbers that represent the full sentence and represent them with only three, and then we ask the network to do the opposite: from those three numbers it multiplies and adds and tries to turn them back into the original six. All we did was tell the network: here's a sequence of numbers that to us means "a cat jumping out of a window"; you have to compress those six numbers down to three, and then expand, without any new information, from just those three numbers back into the same sentence, or as close as you can get. The entire training task is to take a sentence, compress it, and try to recreate the same sentence; in a perfect world the output would be "a cat jumping out of a window", exactly the same as the input. You train this network again and again to compress these numbers and recreate them, and it's never going to be able to do so perfectly, because when you go from six numbers to three you lose information by definition. So it does the best it can, and that means it has to pick those three numbers in the middle, back to dimensions again, it has to pick the three dimensions that best describe the world. What do I mean by that? Well, for something like "a cat jumping out of a window", maybe the first of those middle numbers represents something like "pet" rather than specifically "cat", which is too detailed; maybe the second represents something like going in and out of something; and maybe the third represents something like an entity or a house.
So what you get on the output is something conceptually similar to the sentence you put in, but not quite the same, because information was lost. If you put in "a cat jumping out of a window", maybe what you get out is "a pet leaving the house" or "a dog leaving the house", because the system had to lose information, it had to abstract, it had to pick the most important dimensions to do as well as it could. It had to compress. What this means is that this thing in the middle, the embedding, the code, again the vector, if you do this right over a lot of sentences, ends up being the set of numbers that best represents the world the network is trained on, which in this case is the text of the internet; it's a textual world for this network. This is called embedding: you take the sentence and embed it from six numbers into three, and if the training goes right the network chooses the dimensions that carry the most information about the world and complete the training task the best. Now, this seems like a pretty useless task, especially for language: why would you want to get a slightly different version of the same sentence out? But it's important to understand for the diffusion models, which is why we're going through it. If you instead think about images, or maybe video, imagine on the left side an image or a video that takes a lot of space. You know that with images you can compress them, actually lose a lot of information, and they're still good enough; that is what all image formats do, they compress the image so you can send it over the internet using much less data than the original. One way to create that compression would be to take an image, for example of a cat, force the network to take all those pixel numbers and represent them in the middle with far fewer numbers, and then try to recreate the same image of the same cat as well as it can. If you do this right, it gets very good at recreating almost the right image, and in doing so it finds the dimensions that are most important to keep in the middle for a human to still think it's the same picture of a cat. The network learns by itself to compress down to the most important dimensions; for example, humans tend to care a lot about the low-frequency content of images and not as much about the high frequencies, so it will probably learn to keep some of the low frequencies and drop the high ones, and so on. But the details aren't important. The whole concept is: you take a bunch of numbers, in this case the image's pixel values, force the network to pick the best possible representation with far fewer numbers, keeping as much of the information in the original image as it can, and then recreate it. If you do that, you now have a compression algorithm: instead of sending the image over the internet, you just embed it and send the small code in the middle, and the receiver, who has the other half of the network, decodes it back into the full image. You've saved a lot of bandwidth. This is called an autoencoder, and it is actually how some compression schemes in use today work.
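Here is a minimal sketch of that encoder, bottleneck, decoder structure, assuming PyTorch is available. The sizes are the toy six-to-three example from above, the training data is just random placeholder numbers (real sentences and images have structure the bottleneck can exploit), and a single linear layer on each side is the smallest possible version of the idea, not a realistic model.

```python
import torch
from torch import nn

# Encoder squeezes 6 numbers down to a 3-number code; decoder tries to
# rebuild the original 6 numbers from that code alone.
encoder = nn.Sequential(nn.Linear(6, 3))
decoder = nn.Sequential(nn.Linear(3, 6))

optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-2
)
loss_fn = nn.MSELoss()

data = torch.randn(256, 6)   # placeholder for sentences/images encoded as numbers

for step in range(500):
    code = encoder(data)                 # 6 numbers -> 3 numbers (the embedding)
    reconstruction = decoder(code)       # 3 numbers -> back to 6 numbers
    loss = loss_fn(reconstruction, data) # how far off is the reconstruction?
    optimizer.zero_grad()
    loss.backward()                      # backpropagation computes the weight nudges
    optimizer.step()

# To "send an image" cheaply, you would transmit only encoder(x) and let
# the receiver, who has the decoder half, reconstruct the full thing.
print(loss.item())
```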
So now you not only understand what a vector and a word vector are, you also understand how you would actually create such a vector, and what an embedding is: a network that, through training, takes a piece of text, an image, or something else and automatically creates this vector for you, and, importantly, automatically chooses the best dimensions. Instead of having six hundred thousand dimensions of the world, you force it to choose far fewer, and in forcing it to choose fewer you force the network to become intelligent, at least according to some definitions of intelligence, in how it picks those dimensions. All right, we're almost there. So what about image generation, music generation, video, and so on? There's only one last thing you need to understand, now that you understand all of this, in order to understand how something like Stable Diffusion or Midjourney works, and that is the notion of diffusion models. Diffusion models are also, I think, conceptually very intuitive. Remember this neural network. Let's say you take an image, the one on the left here called t0, and you just add a little bit of noise to it, so you go from the first image to the second. Now you train a neural network simply to find that noise and remove it again. If you look at those two images, the difference between the first and the second is very small; I think it's intuitive that you could remove that noise yourself if you just had the time, so why couldn't a neural network? It doesn't seem that hard, because only a little was added. That's the first step: think of it almost as a separate neural network that removes a little bit of noise, taking the second image back to the first. Now you take the second image, add a little more noise, and train a network to remove just that additional noise, from the third image back to the second, not all the way back to the first, just one step at a time. Every step here only removes a little bit of noise. You take the third image, add a bit more noise, train a network to remove just that noise, and you keep going: add more noise, train the network to remove the additional noise, and so on. And when I say remove, I mean you train the network to identify which noise was added and simply subtract it from the picture to recreate the previous image. Hopefully it's intuitive that you could do this, that you could have a network that takes an image at any stage and removes the additional little bit of noise, because at every stage it is a deceptively simple little task. But by the end you've added so much noise that there is only pure noise left; there is no image anymore. So on the far right, the task is actually to remove the last piece of added noise, going from complete noise back to the almost complete noise of the second-to-last image. All right, again this seems like a stupid task: why would you take a perfectly good image, destroy it slowly, and train a network to remove a little bit of noise at a time? Well, the really cool thing is what happens when you take this chain of networks, trained going from a good image towards noise one small step at a time, and run it backwards.
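Here is what the training side of that idea can look like in Python, heavily simplified and assuming PyTorch: take a clean image, add a known amount of noise, and train a network to predict exactly the noise that was added. Real diffusion models use a carefully designed noise schedule and a large U-Net rather than this toy linear network; this only shows the shape of the idea.

```python
import torch
from torch import nn

image_size = 64                        # a 64-pixel toy "image", flattened
denoiser = nn.Sequential(              # stand-in for the real (much larger) network
    nn.Linear(image_size + 1, 128), nn.ReLU(), nn.Linear(128, image_size)
)
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

clean_images = torch.randn(512, image_size)     # placeholder training images

for step in range(1000):
    batch = clean_images[torch.randint(0, 512, (32,))]
    t = torch.rand(32, 1)                       # how far along the noising process we are
    noise = torch.randn_like(batch)
    noisy = (1 - t) * batch + t * noise         # more noise the larger t is (toy schedule)
    predicted_noise = denoiser(torch.cat([noisy, t], dim=1))
    loss = ((predicted_noise - noise) ** 2).mean()   # did we guess the added noise?
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```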
Instead of starting with the network furthest to the left, the one that removes a little bit of noise from the nearly clean image, you start with the one on the right, the one that removes a little bit of noise from the pure noise image. What you can do is this: you start with just random noise and run the chain backwards. Essentially, you've trained a network that is desperately looking for, say, face-like noise in an image. It takes this pure noise image and, even though there's nothing there, it says: hey, I've been trained to find face-looking noise in here, or if you want to simplify, I've been trained to find the rough outlines of a face in here, and I think I see something. So it removes a bit of noise that actually makes the image look a little more like a face. In the next stage, the next network (it's really the same network, but think of it as a separate one) takes that image and says: I've also been trained to find face-like noise and remove it, and I think I see the outlines of a face, even though there's hardly anything there. In the first stage there was nothing there, just noise; in the second stage there is actually a little bit of a face, because the network itself removed exactly the pixels that made it look more like a face. The next step says: hey, wait a minute, I see the outlines of a face, I've seen this before, I'm going to remove this noise, which makes it look even more like a face. The next one says: I clearly see the outlines of a face, I know exactly what kind of noise to remove to make this look even more like one. You just keep going, and at the end, all the way to the right, you will have created a face out of pure noise, meaning a face that never actually existed. To be clear, this is not a copy of an existing face: you trained this diffusion model to remove noise across millions of different faces, so it didn't learn to create one specific face, it learned to create faces in general, and it is looking for general face-like noise in that first random noise image. This is how you get something like thispersondoesnotexist.com, a site that, every time you visit it, generates very believable faces of people who never existed. For those of you who know more about this, that particular site doesn't actually use a diffusion model; it uses something called a GAN, a generative adversarial network, which came before, but the idea is the same, and diffusion models have by now largely taken over from GANs. So now we know how to generate at least faces out of pure white noise, one particular kind of thing. But I promised to explain not just how to create one kind of thing, but how to do what Stable Diffusion or Midjourney do. On those services you can do more than get many versions of the same thing like a face: you can type in text and ask for, say, a picture of an astronaut riding a horse on the moon, which is something you can Google. So how does that work, how does this text conditioning work? Well, it is a diffusion model, so it does what I just showed you, but it also does something more, and this is why you just learned about vectors. Let's go back to the idea that intelligence is compression, and to how you actually condition on text.
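As code, "running it backwards" is just a loop that starts from pure noise and repeatedly subtracts a little of whatever noise the network predicts. This sketch assumes the `denoiser` and `image_size` from the previous training sketch; real samplers (DDPM, DDIM and friends) are much more careful about how much noise to remove at each step, so treat this only as the shape of the loop.

```python
import torch

def generate(denoiser, image_size, steps=50):
    """Start from pure noise and repeatedly remove a little predicted noise."""
    with torch.no_grad():
        x = torch.randn(1, image_size)                  # start from pure noise
        for i in reversed(range(steps)):                # walk from very noisy to clean
            t = torch.full((1, 1), (i + 1) / steps)     # tell the network how noisy we are
            predicted_noise = denoiser(torch.cat([x, t], dim=1))
            x = x - predicted_noise / steps             # remove a little of the predicted noise
    return x        # a brand-new "image" that never existed in the training data

# sample = generate(denoiser, image_size=64)
```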
I showed you this network that we said looked pretty useless: you take a sentence, compress the numbers in it down to fewer numbers that hopefully capture the important dimensions of that sentence, and then try to expand it back into the same sentence. We said it was pretty useless, but now it turns out to be pretty useful. Once you've trained this network, you cut off the right-hand part and keep only the left: now you have a machine that you can give a sentence in English and ask it to embed it, to compress it into the numbers that represent what is in that sentence as well as possible. Now imagine you're a service, say a social network or a search engine, that has a lot of examples of images together with captions: for example, a picture of a cat staring at you with the caption "a cat staring at me". What you can do is the following. You take that caption, "a cat staring at me", you take this encoder, you take that sentence, which again is just a sequence of numbers, say five numbers, and you ask the network to compress it into three numbers that hopefully capture what it means to be a cat staring at me. Now we take the diffusion model: we take the image that belongs to this caption, add a bit of noise as before, then a bit more, then a bit more until it's complete noise, and we put the neural network in between that tries to remove the noise at each step, exactly as we did with the faces. But if we only did this, we'd build a diffusion model that always finds cats staring at you, which is not what we want; we want something steerable. So we do one more step. We take that other network, the one that takes the sentence "a cat staring at me" and embeds those five numbers into three numbers, the code, the pink thing, and as the diffusion model is trying to remove the noise between the second and the first picture, we give it those three numbers as a clue. Remember, we're giving it the picture of a cat staring at you, and we're giving it the three numbers that represent "a cat staring at me". So you can think of the diffusion model as now having a clue about what kind of noise it's looking for: it's not just looking for one type of noise, it says, hey, these three pink numbers, I've seen them before, they mean there's probably a cat in here, there's probably cat-like noise here. At the next step you give it the same clue, you're still looking for cat-like noise, and at the step after that the same clue again. Remember, to a neural network everything is numbers: the pixels are numbers, but these sentences are also numbers; it doesn't really understand that one is images and the other is text, it just learns that whenever it saw these pixel numbers together with these three numbers on top, there was always cat-like noise in there, so let me see if I can find a cat staring at me in here. You do this for this image of a cat staring at you, and then you have other, similar images with different captions: instead of a cat staring at you, maybe another image has the caption "a cat jumping out of a window", and now the encoder at the top embeds that sentence, which is six words, six numbers, into three numbers as well.
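In code, "giving it a clue" can be as simple as concatenating the caption's embedding onto the denoiser's input. This Python sketch reuses the toy sizes from the earlier examples and a random placeholder where the real text embedding would go; it is the concatenation idea only, not how Stable Diffusion actually wires text into its U-Net (which uses attention).

```python
import torch
from torch import nn

image_size, text_dim = 64, 3     # toy sizes: 64-pixel image, 3-number text code

# The denoiser now also receives the caption embedding (the "clue") as input.
conditioned_denoiser = nn.Sequential(
    nn.Linear(image_size + 1 + text_dim, 128), nn.ReLU(), nn.Linear(128, image_size)
)

def denoise_step(noisy_image, t, caption_embedding):
    """One denoising step, guided by the embedded caption."""
    clue = torch.cat([noisy_image, t, caption_embedding], dim=1)
    return conditioned_denoiser(clue)

# In a real system caption_embedding would come from the cut-off encoder half,
# e.g. something like text_encoder("a cat staring at me"); here it's random.
noisy = torch.randn(1, image_size)
t = torch.full((1, 1), 0.9)
caption_embedding = torch.randn(1, text_dim)
predicted_noise = denoise_step(noisy, t, caption_embedding)
```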
Those three numbers would probably be similar, because there are still cats in there, but there's also the concept of jumping and not the concept of staring, there's the concept of a window, and so forth. So this vector will be similar along some dimensions but a little bit different. Now you run the diffusion process on that image, and again the network gets this clue, these three numbers, as it's trying to remove the noise, so it can tell what kind of noise it's looking for. Remember, in the first process we always removed the same kind of noise, face-like noise, but now we're building a diffusion model that gets a clue about what kind of noise to look for. We've been very focused on cats here, but this could be anything: it could be a picture of an airplane with the text "an airplane flying", and then it learns what airplane-like noise looks like, or rather how to remove airplane-like noise. And you can do this with something like a song. A song is an audio wave, but it turns out you can take a piece of audio and transform it into what is called a spectrogram. A spectrogram is a visual representation of a song: it says how much of every frequency is in the song at each point in time, and with what magnitude. The details don't really matter; the point is that you can take a piece of audio, transform it into a spectrogram, and thereby represent audio as an image. So now you can take a song, for example "Ob-La-Di, Ob-La-Da" by The Beatles, take the spectrogram of that song, and take a textual description, literally "Ob-La-Di, Ob-La-Da by The Beatles", or maybe just "a song by The Beatles", and embed it. Now the code will somehow represent the concept of a song, the concept of The Beatles, and some other things; we don't fully know what, the network finds the best representation of that sentence. You give that embedded sentence as the clue to the diffusion network as it is trying to recover the spectrogram, and you are teaching it, first of all, to find spectrograms, and, if you show it many spectrograms of many different kinds of music, to find a certain type of spectrogram for a certain type of description. So what happens now is that once you've trained this network on millions of different types of images, cats, dogs, airplanes, images of music, whatever you want, you can take the network, turn it around, and start with pure white noise: no structure, literally nothing in the picture. Now you take a sentence that never existed, say "an Avicii song in the style of The Beatles". If the training worked, the encoding process will have captured the dimensions of the world: it will roughly know what Avicii is and what that means, it will know that "song" means it should produce a spectrogram and not a picture of a cat, and it will know roughly what The Beatles represent and what those kinds of spectrograms look like. It embeds this sentence into the code, which in our simplified example is just three numbers (in reality it's many more, but let's keep it simple). Now we take the pure white noise and give this code to the diffusion model as a clue for what kind of noise to look for, and the interesting thing is that we're now telling it to look for a kind of noise that never existed before, "an Avicii song in the style of The Beatles" kind of noise, and it's going to try its hardest to find that kind of structure in the pure white noise.
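The audio-to-image step is standard signal processing, and a rough sketch of it might look like the following, assuming the librosa library is available. The file name is hypothetical, the parameters are illustrative, and real text-to-music systems use more sophisticated representations and vocoders than plain Griffin-Lim, but the principle of going audio to spectrogram and back is the same.

```python
import numpy as np
import librosa

# Load a song (hypothetical file) and turn it into a spectrogram: a 2-D grid
# saying how strong each frequency is at each moment, i.e. an "image" of audio.
audio, sample_rate = librosa.load("some_song.wav", sr=22050)
spectrogram = np.abs(librosa.stft(audio, n_fft=1024))

# ... a diffusion model could now be trained on (spectrogram, caption) pairs,
# or asked to generate a new spectrogram from a caption that never existed ...

# Going back: Griffin-Lim estimates the missing phase so a (possibly generated)
# spectrogram can be turned into a listenable waveform again.
reconstructed_audio = librosa.griffinlim(spectrogram, n_fft=1024)
```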
So it removes noise that makes the image look a little bit more like a spectrogram of an Avicii song in the style of The Beatles, and at the next step we give it the same clue and say, try harder, really look for the structure of an Avicii song in the style of The Beatles, and remove some more noise, and we do it again, and again. If you're interested, in reality these diffusion models use around 50 of these steps, and at the end of the line, at the 50th step, you get the spectrogram of a song that never existed, which hopefully is an Avicii song in the style of The Beatles once you transform it back from a spectrogram into audio. So now you're finally there. You be the judge, tell me what you think, but hopefully you now have some intuitions about how it is actually possible to create new novels, poems, images, even music out of just text, or even out of white noise, and hopefully you feel we managed to debunk this conspiracy a little bit. Thank you very much for paying attention.
Info
Channel: Spotify R&D
Views: 349,585
Id: 2eWuYf-aZE4
Length: 90min 41sec (5441 seconds)
Published: Mon Jul 31 2023