Can OpenAI Codex Recreate Itself?

Video Statistics and Information

Reddit Comments

Interesting. As far as I can tell, the answer to your question of whether Codex could create itself is a little inconclusive. It was certainly not a complete failure, so the question can't definitively be answered in the negative.

As far as I can see, it mostly generates syntactically valid code and got close to generating a proper "Hello World". Do you think the model you designed and Codex built would have got there with more training time and some tweaking of the GPT-2 model parameters?

An alternative title could have been "Can you bootstrap a GPT-3 using GPT-2?". Unfortunately, you had to abandon that question because of limited resources. There's been a lot of discussion suggesting that just making models bigger isn't a silver bullet, so it would be significant if anyone could distill the Codex knowledge into a much smaller model. Although distillation happens frequently (e.g. with ImageNet models and GPT-J), being able to DIY it with your own model would be cool.

👍 1 👤 u/mhummel 📅 Sep 19 2021 🗫 replies
Captions
Hey everyone, I hope you're all doing well. Today we're taking another look at OpenAI Codex. If you haven't heard of Codex before, it's an ML model OpenAI released that is essentially able to generate code in various languages; it's a language model that was trained on all sorts of code from GitHub. Here you can see all the languages you can use it for, and it is very impressive. If you haven't seen any videos on Codex before, I think you'll be pleasantly surprised at how competent it is and how many things it can actually do. Just to show you a quick example: if I write "print hello world", it does it with ease, multiple times even. It writes the code for us; all we have to do is put in a comment.

So what are we going to be doing with Codex today? We're actually going to try to recreate Codex with Codex. Take that with a grain of salt: let's be real, with one single comment OpenAI Codex is not going to reinvent itself. It was made by many talented engineers and research scientists, and it's not something that's easy to create. But what we can probably do in this video is create an AI that can generate code, even if it's not quite as good as OpenAI's Codex. I hope that's exciting enough for you all. As we go, I'll share my thought process with you. This is unscripted again; I kind of like the casual format of these, so hopefully you don't mind. The last thing I'll say is do consider subscribing to the channel if you like this type of content, but without further ado, let's get into it.

Let's start by selecting our model. davinci-codex is what we want, since it's their best Codex model, and let's up the response length a little bit (this controls how much code we generate at once). The next thing to do is type in the comment that describes what we want (let me make sure I'm zoomed in enough for everyone to see). We essentially want a program that creates and trains a model to generate Python code. We're doing Python specifically to keep things simple, and we'll use GPT-2 small as a base model. If you haven't seen the GPT family of models, they can do all sorts of language modeling, so starting from GPT-2 as a pre-trained model will make this process a lot faster and easier for us. For data, I found a 150k Python dataset: essentially 150,000 Python files that were scraped from GitHub. I'm going to give it this data; we can describe the format the data is in, and hopefully Codex can pick it up and use it without ever actually seeing it. This is a general description, and we could generate right from here, though we probably won't get great results yet.
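For reference, the prompt comment took roughly this shape. The wording below is a paraphrase of what appears in the video, not an exact transcription:

```python
# This program creates and trains a model to generate Python code.
# It uses GPT-2 small as a base model for transfer learning.
# The training data is a dataset of roughly 150,000 Python files
# scraped from GitHub, stored as .py files under the ./data directory.
```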
Generating from this alone, it imports some things, but the issue is we still haven't specified a lot. This is actually already really good; it's importing all the right things, GPT-2 and so on. But I want to specify more, so let's delete this and flesh out the prompt before we start generating. I'll also help it out as we go: if I spot errors, I'll rewrite comments and have it regenerate code. I think that makes sense, because this is such a difficult problem.

I actually had some notes written; let me find those really quickly. Okay, I found them. I reworded things very slightly, but it's basically the same plan. Step one: download and load a pre-trained version of GPT-2 small. This loads a model that's already been trained, just on general text rather than code specifically. That transfer learning will speed up our training process; I only have so much compute power, and I can't retrain this whole thing on my weak little GPU. We're using the small model, and even small GPT-2 is still fairly large, so I think it will be fine for at least a proof of concept. Step two: load data from the data directory. The data, as I said, is just those Python files; if we go into the data folder (this is where I'll actually be pasting and running the code), there are Python files at all sorts of depths. We could put them wherever we want, but I'm going to specify the data directory, because that's where I put them. Step three: clean the data and use the GPT-2 tokenizer to prepare it for training. Once we have all those Python files, we might want to clean up some weird stuff (get rid of newline characters, maybe, if that really matters; actually, I'm not sure we want to do that), and then we need to tokenize it: break it down into the format the GPT-2 model expects. Step four: split the data into training and testing partitions, so we get an unbiased estimate of how our model is performing. Step five: train the model in a semi-supervised fashion with the following task: feed the model a portion of the code from an example and have it predict the next token. This is one way these models are very commonly trained. You have some text that says "she went to the market", you feed the neural network "she went to the", and you have it predict "market". It's essentially autocomplete on steroids, and to be fair, that's basically what Codex is, as far as I'm aware. Step six: test the model on the next-token generation task and report the metrics.
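The video appears to use the Hugging Face transformers library; assuming that, a minimal sketch of step one would be:

```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# "gpt2" is GPT-2 small, the smallest checkpoint in the family
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# The LM-head variant adds the next-token prediction head needed for generation
model = GPT2LMHeadModel.from_pretrained("gpt2")
```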
Awesome, so that's the plan. All we have to do now is generate. I'm going to keep the response length fairly low, because we can always adjust it and generate more. Let's see what this produces: some imports, some PyTorch (I should have specified PyTorch, but it looks like it figured that out). It loads up the tokenizer, good, that's what will tokenize our input, and it loads up the model with the language-modeling head, which I believe is the type of GPT-2 model we wanted. Then it opens the data directory, something like ./data/python_data, and this is an issue: if we go back to our data, you can see it's broken up into many folders of varying depths, and the Python files are inside those. So what do we do about that? We can just delete what it's generated so far (this is why I didn't want to make the response length too long) and specify in more detail. We can say: load data from the data directory; the data directory contains many nested folders; use glob to recursively get all .py files. glob is a module we can use to collect a bunch of file names matching a pattern. I do have to specify to a large degree what exactly we want. Maybe I don't have to specify this much, but the more I specify, the better the idea it has of what to do. Let's see if it can do this: it imports glob, makes a data list... I'm not actually sure the pattern it wrote will work; it might require exactly one folder level in between. We'll see, and either way we're not training on all the data, so it's probably okay. Then it opens each of these files, for file in files, with open(file, "r") as f, reads the contents, and appends them to the list. That works.

Next it wants to clean the data, which, remember, we asked for. Remove the newline characters: that isn't something we actually want, since we want the model to predict newline characters, so let's get rid of that. Tabs: we want to keep those too. Double spaces we could replace with single spaces, but remove the single spaces? That's definitely something we don't want. Remove empty lines? I think empty lines are actually a good thing, because they can help with formatting. This whole cleaning section is not very helpful, so let's just delete the clean-the-data part and remove it from the prompt up top as well. Instead of cleaning the data, we'll let it go as it is, and that's okay; Codex hasn't looked at the data, so it won't necessarily know what we'd need to do to clean it. I'll give it a pass on that. Where the cleaning step was, we'll just say: use the tokenizer to prepare the data for training. It continues with a tokenize function built around the tokenizer's encode, decode, and padding calls, so I guess we'll just keep going.
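A sketch of the recursive search described here; the exact pattern Codex wrote isn't fully visible on screen, and the errors="ignore" guard is my addition:

```python
import glob

# "**" with recursive=True matches .py files at any folder depth under ./data
files = glob.glob("./data/**/*.py", recursive=True)

data = []
for path in files:
    with open(path, "r", errors="ignore") as f:  # skip undecodable bytes in scraped files
        data.append(f.read())
```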
Hopefully that all works out. Now it's splitting into training and testing partitions. I'm not sure this part will work as expected, so let's copy what we have so far and see if it runs, because if it doesn't, we want to nip these issues before they get too big to deal with. Over in Visual Studio Code, I copied the code over, opened a terminal, and ran python test.py. One thing I'll add: I put a lot of data in here, so let's put a manual limit on it, with for i, file in enumerate(files) and a break once i is greater than a thousand. We can remove it later, but for now it caps us at a thousand files. Let's see if this hits any errors... okay, it downloaded something; that must be the GPT-2 model and the tokenizer, and it actually has a lot to download. Let's also write some print statements, printing the zeroth instance of the train data after each step, to see whether it looks like what we'd expect. It's very easy to do some formatting on data, not realize it's wrong, and then have to deal with it much later, so checking now will hopefully save us a lot of pain. The download was pretty quick, and now it should be loading the data... and we ran into an error. Good. pad_sequence complains that a "list object has no attribute size", so something is off with how it's using torch.nn.utils.rnn.pad_sequence. You know what, I'm really not a fan of how it's doing this whole thing, so I'm just going to regenerate it, and I'll be right back to see if we get something that works on the next go.

Okay, I'm back, and I figured everything out. I want to be fully transparent, because the whole idea of this video is to give you a sense of how well this actually works: I only did one more generation, and this is roughly what I got. I did have a little trouble, and it ended up being the fact that I had told it to use the glob library to get all these files; as it turned out, that was not super useful. So I reworded it slightly. I said, essentially, go through all these folders at varying depths to get all the .py files, and I didn't mention the glob module at all. I just let it do its own thing, and that ended up working best. Sometimes it's actually better not to get in its way, it seems. This is, for the most part, what it came up with by itself on its next generation: it grabs all the files, then defines a class for the code dataset (very simply, whenever we fetch an item, it encodes the text with the tokenizer and returns it), then does the train/test split, which is a lot simpler than last time, which is nice, and creates data loaders.
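Reconstructing the regenerated version from this description, with the thousand-file cap from earlier; the class and variable names are assumptions, and the truncation argument reflects a fix that only gets added later in the video:

```python
import torch
from torch.utils.data import Dataset, DataLoader, random_split

class CodeDataset(Dataset):
    def __init__(self, texts, tokenizer):
        self.texts = texts[:1000]  # manual cap of about a thousand files for the proof of concept
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Encode on access; truncate to GPT-2's 1024-token context window
        ids = self.tokenizer.encode(self.texts[idx], max_length=1024, truncation=True)
        return torch.tensor(ids)

dataset = CodeDataset(data, tokenizer)
train_size = int(0.9 * len(dataset))
train_data, test_data = random_split(dataset, [train_size, len(dataset) - train_size])
train_loader = DataLoader(train_data, batch_size=1, shuffle=True)  # one example per batch, as in the video
test_loader = DataLoader(test_data, batch_size=1)
```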
I've copied this over to my test folder; I actually also moved it into an IPython notebook so it's easier to see, ran it all, and it's working fairly well. We download the model, we load the dataset, and I wrote a small check myself so we can see an example of what's in the dataset: take the train data loader and print the first thing that comes out once loading has finished. There we go, these are indeed token indices. There is a warning here, "token indices sequence length is longer than the specified maximum length", and that may be an issue later; we can deal with truncation then, but for now I just want a working example. So that's pretty great, we've got quite a bit going. Now, the next step: it's used the tokenizer, it's split the data, so now we can actually train the model. Let's do another generation and see. It defines a loss function and an optimizer, and loops for 10 epochs. Okay, loss function and optimizer, we do need those, so that's good. The learning rate is very low, but I guess you typically do use low learning rates when fine-tuning these large models, so maybe that makes sense. It loops through batches in the train data loader, zeroes the gradients, and indexes batch[0]. I'm not sure what's actually in there; honestly, it "knows" better than I do, and there's probably a reason. Oh, I see: for each example it takes the last token as the target, and for the input data it takes everything up to the last token. Makes sense. Then it gets the model output for the input data, which is hopefully the predicted next token (I'm not actually sure), computes the loss with the loss function, backpropagates, and steps the optimizer. It looks like generation didn't finish, so let's let it finish... and then it starts the test phase. Now, I'm going to be honest with you all: I have no clue if this will work, and it looks like it might not. That is the end of the program, though; if I hit submit, it doesn't generate any more, which means it's done. Let's copy this in and see if it works; I'm very curious. Oh dang it, I didn't mean to do that, we need to recopy this. Copy all of it, go over here, paste it in. I want to format this nicely so it's easier to work with: define those at the top, then put the test and train epochs in different cells so we can run them separately. Also, are we using the CPU? Let's see whether it says GPU or CUDA anywhere... yeah, this is going to run on the CPU, which is going to be incredibly inefficient. I should have specified to train on the GPU; that's kind of my bad, but we can adjust it. Let's just run this and see what happens.
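A reconstruction of what this first loop roughly did, not the literal generated code; note that it trains on only the final token of each sequence, which is the flaw that gets it scrapped shortly:

```python
import torch.nn as nn
import torch.optim as optim

loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-5)  # the very low learning rate noted above

for epoch in range(10):
    for batch in train_loader:
        optimizer.zero_grad()
        input_data = batch[:, :-1]   # everything up to the last token
        target = batch[:, -1]        # only the final token is used as the label
        logits = model(input_data).logits
        # Score only the prediction at the final position; this throws away the
        # supervision available at every other position in the sequence
        loss = loss_fn(logits[:, -1, :], target)
        loss.backward()
        optimizer.step()
```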
We get in there... "too many indices for tensor of dimension 1". This doesn't look like it will be super easy to debug, so let me take a look and get back to you all. Okay, I am back, and I'm not going to lie, I just spent a long, long time figuring out what was going wrong, but I found the issue. Actually, two issues. One: the sequences were too long, and we can truncate those, no problem. Two: it wasn't using the model the proper way in the first place, so this version was never going to work. That doesn't mean the idea can't work, so I think the best thing to do now is scrap what we have at this point and regenerate, this time giving it a helping hand by specifying how it should go about it. I'm going to add some more comments to explain to the model what I think it should be doing. What it should be doing is passing the input sequence into the model as both the input and the labels. The reason is that's how the model is built: the labels should be the same as the input, because the model predicts the next token at every single input position instead of just the last one. Before, you can see it was breaking each example into the input data plus just the final token; instead we want to pass everything in for both the input and the labels. So the comment becomes: pass the input sequence into the model as both the input and the labels; the model will output a loss that can be used for training. Let's see if it gets this, fingers crossed... oh gosh, oh no, it's just repeating the same thing over and over at this point. What we can do is turn up the frequency penalty; the higher it is, the less likely the model is to repeat itself. Let's try again... okay, it only repeated twice this time. model.train() puts the model into training mode; then it runs the model on the batch with the batch as the labels, which I think should actually work; then loss = outputs[0], and I'm not sure that's how we get the loss, but it might be; then backward and step. This is looking better. After completing an epoch, it goes into eval mode and runs through the test loader the same way. This isn't looking too bad, so let's try it: it prints the test results and then saves the model. This might actually work. It's kind of funny, it printed the loop twice and then gave up, but we'll copy it as-is and see how it goes. Let's get rid of everything we had for training before and put this bad boy in. Do we want to copy anything else? I guess this is fine.
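The regenerated loop, as described, comes out roughly like this. GPT2LMHeadModel shifts the labels internally and returns the loss as the first element of its output, so passing the batch as both input and labels is indeed the intended usage; the epoch count and file name are assumptions:

```python
num_epochs = 10

for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        # Same tensor as input and labels: the model shifts the labels by one
        # position internally, so every token learns to predict its successor
        outputs = model(batch, labels=batch)
        loss = outputs[0]  # first element of the output is the language-modeling loss
        loss.backward()
        optimizer.step()

    # After each epoch, measure the loss on the held-out files
    model.eval()
    with torch.no_grad():
        for batch in test_loader:
            test_loss = model(batch, labels=batch)[0]

torch.save(model.state_dict(), "model.pt")  # the generated code saved the model at the end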
I don't think this loss right here is going to work, but let's just see what happens... "index out of range". Right, I should have told it to limit the number of tokens. I happen to know off the top of my head that GPT-2's maximum context is 1024 tokens, so I'm just going to truncate to 1024. Sorry, I've been at this for about an hour trying to debug, and I'm feeling a bit tired, so this is a bit scuffed. We don't need that print anymore, because it looks like it's working... okay, maybe I shouldn't say this too early, but I think it would have crashed by now if there were an error. Epoch 0, batch 0, the loss is 2.749. Not too bad. There are still a few issues here. One, this is very scuffed; for example, training one example at a time is very inefficient. Two, this is running on the CPU, which is incredibly slow, so I'm going to pause the video real quick and move it to the GPU... yeah, that's definitely faster, and the loss is down to 1.6. It's still taking a while, though; oh, that's because it only prints once every 100 batches, so it actually wasn't that bad before. So we're starting at this loss, and let's see where we go. It went down, but not by much, so we can't say anything for sure yet. I'll take a break and check back on whether it has improved... oh, never mind, there's an issue; I'll fix it, run it, and show you how it does.

We're back. I made some minor changes, for example making sure it wouldn't run for too long, but nothing big, and the code is fully working now. As you can see, we have working training: the loss is going up and down. I'm not sure if that's just because we haven't trained long enough, which might well be the case, or what exactly is going on, but I only have so much time, and whether or not this works, you'll see the truth. I also ran the test. This isn't the best way to do testing, but anyway, it all runs, and the model is trained, which means we can finally test whether it actually writes code. That's the last thing we'll generate. Everything over here looks good, so I'll write one final comment: use the model.generate function to get test outputs. We submit it and... hmm, this isn't quite what I wanted. It calls model.generate on the batch with a max length and a temperature. I'm not sure this will work, but let's copy it over and try it.
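Folding in the two by-hand fixes from this section, 1024-token truncation and GPU placement, the generated test step looks roughly like this. The prompt and output lengths are my assumptions; in the video, the initial max_length of about 20 cut the outputs short:

```python
# By-hand fixes: cap sequences at GPT-2's 1024-token context and move to the GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

model.eval()
for batch in test_loader:
    prompt = batch[:, :64].to(device)  # prime with the first tokens of a held-out file
    # Continue each test sequence and print the decoded result
    output = model.generate(prompt, max_length=128, temperature=1.0)
    print(tokenizer.decode(output[0]))
```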
Now all we're doing is feeding in actual code and looking at the actual code that comes out. We've been able to watch the losses, and lower losses are better, but it's hard to tell what's really going on unless we test it ourselves. So let's paste this in and call model.generate on the batch. Oh wow, this is interesting. Well, we should print out what was in the batch, because we're generating a continuation of it: this is what comes after the batch, but what came before? Also, max length is set to 20, so we should fix that up a little. Okay, new example: it primes with "import unittest" and a "try", and then what it generates is... well, that's not very good. What about this example? We give it this, and it just repeats the same thing over and over and over. Also not very good. One thing we can do is set the repetition penalty to 1.2; I don't know if that's a good value, but let's give it a go, and let's make the generation a bit longer. New example: we give it this copyright notice and ask it to make something. Okay, "this program is distributed under the MIT license", it does some imports, it writes a function... [Laughter] This is not great, I'm not going to lie. Okay, one final thing: we'll set our input code to the comment "print hello world", encode it with the tokenizer, unsqueeze(0) to add a batch dimension, and run it through model.generate. Hopefully it will print hello world; I don't know if that will actually work, but I guess we'll find out. So "print hello world" goes in... and it imports some random stuff from django, then def main, print hello world. Well, it's definitely doing some print-hello-world stuff. Then it spirals into something like "from pyface.face.face.face import face", repeating forever. Well, you know, I've tried my best, and I think that's all we can do for now. This really just needed more time to train, and unfortunately I don't have the time or the GPU power for that, but I hope this was at least interesting and gave some insight into how this all works and where the limits of OpenAI's Codex are.

In the last video, we saw Codex work almost flawlessly to build a vision model, which was quite impressive; it could do MNIST very well. This time we tried to generate Python code, and we clearly ran into some shortcomings. At first it didn't generate quite the right thing; we specified more and it generated something workable, but at the end of the day it very much needed to be modified by hand to work. The weak outputs are probably more a result of a lack of training time, so I wouldn't blame Codex for that. Take it for what you will: it still generated over 100 lines of code that mostly worked.
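The final hello-world test amounts to something like this; the generation parameters are approximations of what's shown in the video:

```python
input_code = "# print hello world"
batch = torch.tensor(tokenizer.encode(input_code)).unsqueeze(0).to(device)  # add a batch dimension

output = model.generate(
    batch,
    max_length=100,
    repetition_penalty=1.2,  # discourages the degenerate repetition seen earlier
)
print(tokenizer.decode(output[0]))
```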
So I think it worked okay, and I'm still very impressed by it. Anyway, that's all I have for this time. If you liked this type of content and appreciated the video, do consider subscribing; it means a lot, and it lets me know you want more content like this. I hope you all have a great day, and I hope to catch you next time. Thank you so much for watching.
Info
Channel: Edan Meyer
Views: 7,834
Keywords: openai codex, github copilot, AI, codex ai, machine learning, openai copilot, ai singularity, self improving ai, ai that codes, ai that codes for you, self programming ai github, meta machine learning, nlp, nlp for code, GPT, gpt-4, gpt-3, machine learning model, openai codex demo, openai codex tutorial, codex demo, what is openai codex, how to use openai codex, two minute papers, openai, codex, self-replicating ai, codex openai, open ai codex
Id: 7QWVJ5rWy2s
Length: 31min 24sec (1884 seconds)
Published: Sat Sep 18 2021