The Do's and Don'ts of OpenAI Codex

Captions
Hey everyone, today we're going to be talking about OpenAI Codex, but a little differently from the last videos: we're going to cover the do's and don'ts of how to use OpenAI Codex to get the most out of it. If you haven't heard of Codex before, it's essentially a code generation model. I can type in something like "print hello world", and since I have this set to Python, it should generate, yes, print("hello world") in Python, and it does, many times over.

Although we have Codex here in this playground, Codex has actually been available for a while, even before the API and this playground existed. Previously it was available in the form of GitHub Copilot, which is, as you can see here, a programming autocomplete tool. I've been using Copilot for a few months now and I really love it, because it has sped up my development time by quite a bit. That said, as with everything, it has its strengths, but it also has weaknesses: places where you really shouldn't be using something like Copilot or Codex. So today I want to talk about those specific use cases, based on what I've found working with GitHub Copilot over the past few months.

To take you through the do's and don'ts of Codex, I'm going to show you examples from a real project I've been working on over the last few weeks, a project for decoding signals in the brain. There's a bit more to it, but that's not important. What is important is that as I've gone through this project I've been using Codex, and it has been immensely helpful. All the examples of do's and don'ts I'm giving today come directly from my own use: I'll go back through the code and show you places where Codex worked very well, and other places where it was suboptimal.

Before we get to the first example: if you want to help out the channel, which I really do appreciate, consider hitting the subscribe button, liking the video, and hitting the bell icon for notifications. Without further ado, let's get to these tips, starting with the do's of Codex.

Do #1: the thing I love most (and oh my gosh, this is so great) is autocompletion for documentation. Documentation is boring, right? Yes, it's mostly boring, and that's exactly why Codex is so great here: writing docs isn't hard, but it does take a fair bit of time. As an example, here we have a get_token_embeddings function I wrote. You pass it a list of tokens and it gets the embeddings for those tokens. The docstring you see here was itself, I think, generated by Codex while I was working on this, and we're going to try regenerating it. It lists all the arguments and the return value, along with the shape of the embeddings that come back, which is what I like to see in my docstrings. If we delete it and start typing, look, it's already suggesting that we want a docstring. Let's follow that. It says "Return a tensor of shape (len(tokens), embedding_dim)", and that's correct: it's already getting the shape right, which is awesome. It wants to stop there, but we can push it a little farther and have it fill in the arguments: tokens, the list of tokens; embedding_dim, the dimension of the embeddings; device, the device to use. Finally, we can have it give us the return value: a tensor of shape (len(tokens), embedding_dim). This is all correct. It's not exactly what I had before, but it's very similar, and it even gets the shapes right.
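For reference, here's a rough sketch of what that completion looked like. The function name and the shape information come from the video; the exact wording of the docstring and the signature details are my approximation, and the body is elided since the docstring is the point:

```python
def get_token_embeddings(tokens, embedding_dim, device):
    """Return a tensor of shape (len(tokens), embedding_dim).

    Args:
        tokens: List of tokens to get embeddings for.
        embedding_dim: Dimension of the embeddings.
        device: The device to use.

    Returns:
        Tensor of shape (len(tokens), embedding_dim).
    """
    ...
```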
Now, I'll mention that this is not the most in-depth docstring, but it will do, especially for personal projects; I think it's really not bad. A funny thing: if you go to my GitHub and look at a lot of my previous projects, you'll see I have a really bad habit of awful documentation. But look at the past few months, once I started using GitHub Copilot, and you'll notice my code has been much better documented. Surprise, surprise, it's due to this. I really love it.

Do #2: use Codex to generate the short, simple, common things you always forget how to do in your programming language of choice. The example I have here is in the main file of this program. In this main file we start by loading a config file: I specify a path to the config file as an argument, and then down here I load it in. Whenever you do this sort of thing, though, you should resolve the path against the absolute directory of the file you're in, because if you call this main file from anywhere except its own directory, a local path to the config file can cause all sorts of issues. The whole thing gets confusing when you mix relative paths with absolute paths. One thing I always forget how to do (again and again, and I don't know why I can't remember it) is getting the current directory of this file. You can see I have a comment right here, and the reason is that this is the comment I originally used to generate this code: "Get the path of the config file in relation to this main.py file." If I go to the next line, we get a suggestion, and it actually did it in one line, unlike what I had before: config_path = os.path.join(...), which gets the directory of this file and then joins on the relative config file location. This is great. Something as simple as this goes a long way: all those short moments spent looking things up really add up, and you can save a good bit of time by using Codex this way.
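A minimal sketch of that pattern, assuming argparse is used to pass the relative path in (the video doesn't show how the argument is defined, so the flag name and default below are hypothetical):

```python
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--config", default="config.yaml")  # hypothetical flag and default
args = parser.parse_args()

# Get the path of the config file in relation to this main.py file,
# so it resolves correctly no matter where the script is run from.
config_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), args.config)
```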
Do #3: use Codex to generate repetitive code after you've entered the first iteration of what you're doing. If that sounds a little confusing, here's what I mean. This is a file that handles some of the training, and as part of training this model we need to calculate losses, so I have a function here that calculates them. You can see another docstring generated by the lovely OpenAI Codex. There are actually three modules in this model, and I need to calculate an individual loss for each of them: you can see it calculated for the first module here, the second here, and the third here. What I did when making this (let me go ahead and delete the rest) was first type out the loss calculation for the first module by hand. Then, instead of typing out all the others, which I can't just copy and paste because each module has variables specific to it, I had OpenAI Codex generate them for me. If I go down here, you can see it's already generating the comment, and if we follow it, it generates everything we want, not even line by line but all the lines at once, which is even better. Taking a quick look through, you can see it has replaced sc with mc: sc is the single-channel encoder we calculated the loss for, and mc is the multi-channel encoder. It has essentially replicated the first block, except with all the variables I needed replaced.

Be careful when you do this, though, because sometimes you will get mistakes, and this is a great example of a don't: do not blindly accept what Codex gives you. Here, the blocks are almost exactly the same except for the replaced variables, but in this case I actually don't want this unsqueeze, even though it was in the previous example. That's something Codex really couldn't know, so I can't fault it, but as programmers we should be very aware of what these code completion tools are giving us. Always be checking. Finally, there was one more module to do this for, and again we can just have Codex generate the whole thing. Give it a second... there we go. Here it has just unsqueeze(2), but we actually want the second unsqueeze, and I only know that because I've already written this code. If you were doing this yourself, it would be well worth looking it over for a second to make sure it generated the right thing. The rest looks correct to me: it's using the right variables for the calibration model, replacing the mc variables from before with the calib variables. So instead of writing all this code, I had it auto-generated. There were minor adjustments to make, and we had to look it over, but that's still a huge time saver.
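Here's a sketch of that pattern under some loud assumptions: the sc/mc/calib names come from the video, but the actual loss math is never shown, so the MSE losses, the batch layout, and the function signature below are all invented for illustration:

```python
import torch.nn.functional as F

def compute_losses(batch, sc_encoder, mc_encoder, calib_model):
    # Written by hand: the loss for the first module (single-channel encoder).
    sc_loss = F.mse_loss(sc_encoder(batch["sc_inputs"]),
                         batch["sc_targets"].unsqueeze(1))

    # From here, Codex can complete the analogous blocks by swapping sc for
    # mc and calib. Watch out, though: in the video the completion carried
    # the unsqueeze over into a block where it wasn't wanted.
    mc_loss = F.mse_loss(mc_encoder(batch["mc_inputs"]),
                         batch["mc_targets"])  # no unsqueeze needed here

    calib_loss = F.mse_loss(calib_model(batch["calib_inputs"]),
                            batch["calib_targets"].unsqueeze(1))
    return sc_loss, mc_loss, calib_loss
```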
So, overall, the do's of OpenAI Codex: filling in documentation; generating the short, simple, common things you forget how to do; and generating repetitive code after you've entered the first instance of the thing that needs repeating. And finally, don't forget to check over the code it generates. Don't blindly accept what Codex gives you and expect it to be right, because I promise you, you will run into trouble, and not necessarily just programming trouble. There's even the chance of legal trouble, considering it could reproduce copyrighted code. In my experience I don't think you'll see that happen very commonly, if ever, but you should always be vigilant, just to be safe.

Okay, now we're on to the don'ts, what you should not be doing with Codex, and to be honest there are quite a few things it's not very good at. Don't #1: do not generate very long portions of code. While Codex can generate long segments, the longer the segment you let it generate, the larger the chance it derails, and the harder it is for you to notice. For example, here I have a wav2vec model. wav2vec is a pretty popular type of model that came out of a research paper, and there's an implementation of it online (I think several, actually), so this could be something Codex knows about. Not to mention, I tell it that the model starts with a 1D convolutional layer followed by a transformer. So if I give it all these input parameters, we'd hope it could generate something like what I want. Let's go ahead and delete all this, which is quite a good chunk of code, and have it generate. What it produced is, I must say, very good. However, there are a few issues. The first is that I did, in fact, give it all these parameters; when you're building something out like this, you might not know all the parameters yet, and if I don't include them here, this kind of completion is much less likely to happen. Another issue: if we go back to what I originally had, there are lots of things it didn't do. For example, I have these lines calculating a variable called embed_sequence_length, which my whole program needs in order to figure out the size of the generated embeddings. What that means doesn't matter; what matters is that it's a very important variable, and OpenAI Codex didn't generate it, and it really couldn't have known to. It also left out other things I didn't explicitly specify, like positional encodings, which are very important when dealing with transformers. It didn't create them, and without them the model wouldn't know what order its inputs came in, which could very much impair performance. So while it honestly did a fairly good job for how long a segment of code I asked for, it still left out things that could have degraded performance or stopped my model from working altogether, and because the code was auto-generated, it would have been much harder to debug later if I'd found an error and needed to come back and fix it.
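To make those missing pieces concrete, here's a minimal sketch of a conv-plus-transformer model in the spirit of the one described. The embed_sequence_length calculation and the positional encoding are the details from the video; everything else (hyperparameters, layer choices, names) is illustrative, not the author's actual model:

```python
import torch
import torch.nn as nn

class Wav2VecStyleModel(nn.Module):
    """A 1D convolutional layer followed by a transformer encoder."""

    def __init__(self, in_channels, embedding_dim, kernel_size, stride,
                 input_length, num_heads=4, num_layers=2):
        super().__init__()
        # embedding_dim must be divisible by num_heads for the transformer.
        self.conv = nn.Conv1d(in_channels, embedding_dim, kernel_size, stride=stride)
        # The kind of detail Codex left out: the sequence length after the
        # convolution, which the rest of the program depends on.
        self.embed_sequence_length = (input_length - kernel_size) // stride + 1
        # Also left out: positional encodings, without which the transformer
        # has no notion of input order.
        self.pos_embedding = nn.Parameter(
            torch.zeros(1, self.embed_sequence_length, embedding_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embedding_dim, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x):
        # x: (batch, channels, time) -> conv -> (batch, embedding_dim, time')
        x = self.conv(x)
        x = x.transpose(1, 2)  # (batch, time', embedding_dim) for the transformer
        x = x + self.pos_embedding
        return self.transformer(x)
```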
Don't #2: do not generate code where the goal of what you want to do is ambiguous. The example for this is just a little farther down. We've defined this whole model up here, and we have this forward function. Looking at it, there's actually a lot going on: we check for the convolutional layer, transpose the inputs, apply the convolution, re-transpose, apply a hook function, and do a bunch of checks to see whether things are available before applying them. What I want to show you is that if Codex doesn't have the context for what should be done, we're probably not going to get anything close to what we want. So I'll delete all of this, give it just the function name and the parameters, and see what it generates. It looks like it's... oh, there it goes, it's generating something. First of all, it's not even generating the docstring in the same format I had before. A little weird, but okay. Let's see: first it applies the embed hook function. That's already not good, because the embed hook should be applied after the convolution, and here it looks like either the embed hook or the convolution is being applied first, which is not what I wanted at all. Then it finally applies the transformer afterwards, but not only is the order wrong, it also isn't formatting the inputs correctly, so this would almost certainly give us an error. It's all sorts of wrong, and honestly we should expect it to be wrong, given that all we did was give it the name of the function and the parameters and essentially ask it to generate the most likely code. This looks like something that might be likely, but it's not what I want to do. As programmers there are all sorts of things we might want to do, and unless we make it very clear, Codex has no way of knowing which one we mean. So don't use it in cases where the goal is underspecified or ambiguous. We could maybe try this with a very good docstring, but even then this is fairly complex, so I'm not sure it would get it. This is another case where you want to steer clear of OpenAI Codex.
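For illustration, a sketch of the ordering the original forward method used: convolution first (with transposes around it), then the embed hook, then the transformer. The embed hook and the transpose-conv-transpose pattern are mentioned in the video; the other names and checks here are hypothetical stand-ins:

```python
def forward(self, x, embed_hook=None):
    # Sketch of a method on a model like the one above; self.conv and
    # self.transformer are assumed to exist on the module.
    if self.conv is not None:
        x = x.transpose(1, 2)  # (batch, time, channels) -> (batch, channels, time)
        x = self.conv(x)
        x = x.transpose(1, 2)  # back to (batch, time', embedding_dim)
    if embed_hook is not None:
        # The hook runs AFTER the convolution; the Codex completion
        # applied it first, which was one of the bugs described.
        x = embed_hook(x)
    return self.transformer(x)
```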
Don't #3: do not have Codex suggest parameters, especially important parameters for something like a machine learning model. You do not want Codex touching those, at least in most cases. Here we're back in the main file, and if you remember from earlier, we have a config file being loaded that essentially specifies all the parameters for my model. I want to manually override some of those values, see what Codex suggests for us, and tell you how valid the suggestions are based on what I know about this problem and the model I've been putting together. My guess is the numbers won't be great, but let's see. What I've set up is overriding two values from the config file, the embedding dimension and the max primary input length, using two variables that aren't defined yet; we'll have Codex define them. If we go up here and try to define the embedding dimension, it suggests 128. For lots of problems that might work well, but for this specific one I can tell you 128 would work terribly. That's way too large: we're using very small embeddings because our data doesn't have many dimensions, so something like 128 would almost certainly lead to overfitting. Now the next variable, the max primary input length: it suggests 100, which in our case is way too small. As it turns out, we're working with low-dimensional data but very long sequences, so we want a minimum of around 200, and probably something closer to 500 or 1000. Again, we really can't expect Codex to know what we want here. It can only work from what we've already told it and infer from that, and we haven't told it anything about our model or what our data looks like. So don't use Codex in these scenarios where you need to pick parameters: if you don't know what the parameters should be, Codex isn't going to know either, so you'd better do a little work to figure them out yourself without relying too much on Codex.

Don't #4, the fourth and final thing you should not do with OpenAI Codex: don't let it handle a whole lot of raw data that needs some sort of processing. You really cannot expect Codex to know what your data looks like, given that it can't open the file and browse around in it, and I've found that even when you do specify what the data looks like, Codex just doesn't handle data preprocessing very well. I'm not sure why, but it has failed every chance I've given it so far. Here I have a preprocess_data function. It takes in MEG or EEG data, which are neuroimaging types of data, and does some analysis: an independent component analysis, which essentially breaks the signal up into different components (no need to worry about the details); it excludes certain channels; and it scales the data so the input is in a format that works for neural networks. I actually haven't tried this one yet, so let's delete the body and see what happens. The function is called preprocess_data, and the comment does say it takes the target portion of the raw data and preprocesses it. Letting Codex go wild, we get a solution. Let's see what's going on: first it picks out whether we're using MEG or EEG data, which is actually good. Then it gets the data, puts it in a pandas DataFrame, removes bad channels, filters the data with a low-pass and a high-pass filter, and passes it through a StandardScaler. You might ask, what's the problem with this? It looks pretty good, and I'll say it has identified several things that are genuinely common when preprocessing this sort of MEG or EEG data; it has the right general idea. But there are issues. Primarily, the bad channels it assumes and the low-pass and high-pass frequencies it wants are not actually specified in the data, so I'm pretty sure running this would give you an error. So while it has the right idea, and it even gets the scaling right (giving it a little credit), it doesn't know the exact preprocessing that's best for our data, because it hasn't seen our data. We shouldn't expect it to.
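For context, here's a rough sketch of the kind of preprocessing the original function does (ICA, channel exclusion, and scaling), assuming MNE-style Raw objects. The channel names and component count are hypothetical; the video doesn't show the real values:

```python
import mne
from sklearn.preprocessing import StandardScaler

def preprocess_data(raw):
    """Sketch: ICA, channel exclusion, and scaling for MEG/EEG data."""
    # Independent component analysis: break the signal into components.
    ica = mne.preprocessing.ICA(n_components=20, random_state=0)
    ica.fit(raw)
    raw = ica.apply(raw)

    # Exclude certain channels. Which ones is data-specific -- exactly the
    # kind of information Codex cannot infer without seeing the data.
    raw = raw.drop_channels(["BAD1", "BAD2"])  # hypothetical channel names

    # Scale the data so the input is in a range that works for neural nets.
    data = raw.get_data()                      # (n_channels, n_times)
    data = StandardScaler().fit_transform(data.T).T
    return data
```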
I guess what this whole video, at least the don'ts section, boils down to is this: you shouldn't expect Codex to do something unless you've given it enough information to know exactly what should be done. It can sometimes infer minor things, but you shouldn't leave large questions unanswered when generating this type of code. Anyway, that's the final don't, which fully covers the do's and don'ts of OpenAI Codex. This is by no means an exhaustive list, but it's what I've found over the past few months to be the most important things you should and should not be doing. I hope this has been helpful. If you want to help out the channel a little bit, do consider subscribing; I really do appreciate it. Thank you so much for watching, and I hope to catch you next time.
Info
Channel: Edan Meyer
Views: 1,409
Keywords: openai, codex, openai codex, github copilot, AI, codex ai, machine learning, openai copilot, ai singularity, self improving ai, ai that codes, meta machine learning, nlp, natural language processing, nlp for code, GPT, gpt-4, gpt-3, gpt model, machine learning model, openai codex demo, openai codex tutorial, codex demo, what is openai codex, how to use openai codex, two minute papers, how to use github copilot, how to, python
Id: c09QV6yljnY
Length: 19min 21sec (1161 seconds)
Published: Mon Oct 18 2021