StarCoder: How to use an LLM to code

Captions
StarCoder is a brand new large language model that has been released for code generation. Ever since its release it has gotten a lot of hype, and a lot of AI experts claim that it is one of the best large language models out there for code generation. So in today's video I'm going to be talking about what exactly StarCoder is and how it compares with other state-of-the-art large language models, specifically for code generation. We are also going to be testing out StarCoder in Visual Studio Code to see exactly how it works and whether it is actually worth all the hype. So let's get started.

StarCoder has been created by an open-source AI organization called BigCode, which is a collaborative effort between Hugging Face and ServiceNow. The amazing thing about StarCoder, to start off with, is that it covers over 80 programming languages, and it has taken data from Git commits, GitHub issues, and Jupyter notebooks. It is able to do that because it has been trained on a dataset called The Stack, an extremely large dataset consisting of code from GitHub. Additionally, StarCoderBase outperforms existing open-source code large language models on very popular programming benchmarks, and it also surpasses closed large language models such as code-cushman-001 from OpenAI, the original Codex model that powered early versions of GitHub Copilot.

StarCoder also has a context length of over 8,000 tokens. Other closed large language models out there can take in a much larger number of tokens, but given that StarCoder is open source and completely free, this is definitely on the higher end for such a model.

Next up, let's take a look at the evaluation metrics of StarCoder and how it
performs compared to other similar large language models. First off, they found that StarCoder outperforms much larger large language models, for example PaLM, LaMDA, and LLaMA. Despite being significantly smaller, StarCoder has managed to outperform these models.

Additionally, it has been evaluated on something called the HumanEval dataset, which is a really popular way of evaluating code generation among large language models. HumanEval consists of 164 handwritten programming problems, and it's a great way of testing large language models, especially since a lot of these LLMs are trained on GitHub datasets. On HumanEval, StarCoder received a score of about 40%, which is significantly higher than the other large language models we see here, even though they are much larger in size and parameter count than StarCoder. So that is absolutely amazing. However, GPT-4 received a score of 67% on HumanEval, which is significantly higher than StarCoder's, so this is something to note.

The StarCoder paper goes in depth into the type of datasets they used, what type of pre-processing they have done, and their model architecture. Some of the most important contributions of StarCoder are that it's open source and completely transparent in the way it has been created. In addition to that, they have also incorporated a new attribution tool in their VS Code demo, which I'll also be showing you how to use. The cool thing about it is that it can help you detect whether the code that has been generated for you has been taken from the dataset or from any other source. This is a really common worry that a lot of developers have when they're using large language models for code generation, so
this is a great way to identify whether your code has been used elsewhere, and it's also a great way to give attribution.

To test out StarCoder in a very basic setting, we can make use of the StarChat Playground, which has been created by Hugging Face, and we're able to give it any sort of prompt. For example: how can I write a Python function to generate a Fibonacci number? If we give it that, it is going to give us the exact code as well as some explanation. I'm going to test out: how can I extract data from a website using Python? It says there are several libraries available and the most common one is Beautiful Soup, and then it also gives a simple code example of how exactly we can use the Python library Beautiful Soup to extract data from websites. This is really amazing.

Now, to make this much more intuitive, we can also use StarCoder directly in, for example, Visual Studio Code as well as Jupyter Notebook. I'm going to show you how to use StarCoder in Visual Studio Code. Open Visual Studio Code and create a Python file, which I've already done; I've called it starcoder.py. Then go into Extensions and search for HF Code Autocomplete, which stands for Hugging Face Code Autocomplete. Once you find it, go ahead and install it. This extension makes use of the StarCoder model, and we're going to test it out once it has been installed.

Once you have installed this extension, we have to do two main things: get our API token and set it inside VS Code. You can find your API token for Hugging Face in your Hugging Face account, and once you've found it you can just copy it. To set the token inside VS Code, press Cmd+Shift+P or Ctrl+Shift+P. Once you do that, you should see a menu like this, and you can go ahead and click on the Hugging Face set API token command in
order to do that. Once you click on that, you should see something like this, and you can paste your token here. Once you have pasted your token, just press Enter to confirm.

To test out the StarCoder model, let's start by typing in a comment so that it can autocomplete from it. I want to type out how to import a CSV file into a pandas DataFrame. This is one of the most basic things you have to do as an ML engineer, so let's go ahead and do that. You can see it loading right here, and that's how you know it's working. As you can see, it has come up with this, and you can press Tab to accept. And there you have it: that is exactly how you would import a CSV file in Python.

Next up, we're going to try something a bit more complicated: creating a very basic linear regression project in Python, to see what code it comes up with. This is what it has come up with; I'm going to press Tab and let it autocomplete. The first thing is importing some libraries, and then importing the dataset. It has come up with this 50 Startups dataset. Let's see what it continues to come up with; I'm going to press Tab again. So it's importing a dataset and creating the X and y values, and it's also making use of a label encoder and then a one-hot encoder to convert categorical labels into numerical labels. This is very similar to what you as a programmer or developer would do.

Now let's go ahead and figure out whether this code has been reused, that is, whether it is found in the dataset that the model was trained on. What you have to do is select the code that you want to test and then press Cmd+Shift+A. Interestingly enough, we have found that this code is actually in The Stack, the dataset on which StarCoder has been trained. So this is something to note, and it's great that StarCoder
has implemented this functionality of being able to identify code that it has been trained on. As developers, you can use it with a lot less stress or hassle, because you can easily check whether the generated code is found in the training dataset. That's definitely very user friendly and super convenient.

StarCoder also has a Jupyter Notebook plugin that you can download, which works very similarly to the Visual Studio Code one. It's very convenient because I'm sure a lot of people who are doing data science or ML are using Jupyter Notebook, so it's great to have it directly there as well.

StarCoder can also be used as a technical assistant, and inside the StarCoder paper we see some examples of how exactly it can do that. These are the instructions that it was given, and this is StarCoder's response right here. The first instruction is: "I need to integrate a Python function numerically. What's the best way to go about doing it?" StarCoder responded that there are a few options available, depending on whether you have access to libraries like SciPy or NumPy, which makes sense, and it then gives an entire code snippet showing exactly how you can go about doing that. There are a lot of other examples as well of how it can act as a technical assistant. So StarCoder is definitely a very helpful tool for developers. It helps you break down your project or your code file into much simpler steps, and it can even give you definitions of different functions that you might need in your projects. As for actually generating code, it does a pretty good job of that as well.

There are definitely some limitations for StarCoder. One of the biggest limitations, as we have seen, and this has been pointed out in the report as well, is that there has been
a failure case of the model in which it produces comments saying "solution here" instead of the code. It does this because the dataset that it was trained on, The Stack, the large dataset comprising code from GitHub, also has a lot of these comments saying "solution here" or "code found here". It has learned from this dataset, and that's essentially why it sometimes produces comments saying "solution here" instead of the actual code. That's definitely a limitation.

The biggest strength of StarCoder is that it's a completely open-source and extremely transparent model, so you can easily find out what sort of training has been done for StarCoder, what sort of datasets it makes use of, and the entire process involved in building it. The amazing thing is that it also has a really easy-to-use interface with the plugins, so as a developer you can easily download them and make use of them when you're coding.

Let us know what you thought about this video and about StarCoder, and whether you plan on trying it out when you're coding. Thank you guys for watching, and subscribe for more AI-related content.
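
A note on the HumanEval scores mentioned in the captions (about 40% for StarCoder, 67% for GPT-4): results on this benchmark are usually reported as pass@k, most often pass@1. The sketch below is not from the video. It shows the standard unbiased pass@k estimator, assuming n samples are generated per problem and c of them pass the unit tests. The function names and the sample counts are hypothetical and chosen only for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that passed the tests, k: draw size.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb(n - c, k) is 0 when k > n - c, so the estimate becomes 1.0
    # whenever every draw of k samples must contain at least one pass.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over per-problem (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem results: (samples generated, samples that passed).
results = [(10, 4), (10, 0), (10, 10), (10, 1)]
print(f"pass@1 = {benchmark_score(results, 1):.3f}")
print(f"pass@5 = {benchmark_score(results, 5):.3f}")
```

With k = 1 the estimate for a problem reduces to c / n, the fraction of samples that pass, which is why a pass@1 score reads directly as a percentage of solved attempts.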
Info
Channel: AssemblyAI
Views: 9,861
Id: 1PH3oDly1bc
Length: 12min 54sec (774 seconds)
Published: Mon May 22 2023