Fine-Tuning an LLM (OpenAI GPT) with Custom Data in Python

Captions
Hello everyone, I hope you are all doing well. For those who are new to my channel, my name is Avkash, and if you are interested in learning how to take any existing large language model, whether open source or proprietary, and extend it for your own documents, this video is for you. You are going to learn how to use the GPT Index (LlamaIndex) open-source library to fine-tune a large language model by writing just three or four lines of code, and then to load the generated index and query it with whatever questions you have in another two or three lines. So in under 10 lines of code you can fine-tune a large language model and ask questions of it.

In this demo we are going to use a Gradio-based application to first fine-tune the OpenAI large language model with this arXiv paper, "Toolformer: Language Models Can Teach Themselves to Use Tools," and then ask questions based on that document. First we need the OpenAI API key, so I set the key here. In the next step I take this PDF, which I have already downloaded, and drop it into the application. Once the PDF is uploaded, I use the "Create Index" option: the PDF is transformed into text, the text is fed to the GPT large language model available in the OpenAI cloud, and an index is created from it. That index is downloaded to the local file system, and that is the index name you see here.

If we want to ask questions based on this document, we have to load the index. When I run this command, you can see that the index is available; I use this index file as the source and load the index. Once the index is set, I can ask questions about this content. First I ask, "What is this document about?" At this point the request goes to the large language model hosted in the OpenAI cloud, which I am connecting to through the API using the OpenAI API key, and here is the answer: the document is about Toolformer. We can ask other questions, such as "Who are the authors of this document?", and as you can see, the authors are returned as well. This application is designed to process not only PDFs but also plain text documents; the idea is that whatever source content you provide is converted to text, and that text is then used to fine-tune the large language model, which is how you get answers from it.
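For readers who want to picture what such an application looks like in code, here is a minimal sketch of a Gradio app along the lines of the demo. This is not the video's actual application (that is built in part two); it assumes gradio and gpt-index 0.4.x are installed, the folder and label names are my own, and PDF parsing may require an extra dependency such as PyPDF2, which SimpleDirectoryReader can use if present.

import os
import shutil
import gradio as gr
from gpt_index import SimpleDirectoryReader, GPTSimpleVectorIndex

os.environ["OPENAI_API_KEY"] = "sk-..."  # use your own key here

def create_index(uploaded_file):
    # Copy the uploaded document into a folder SimpleDirectoryReader can scan.
    os.makedirs("source_data", exist_ok=True)
    shutil.copy(uploaded_file.name, "source_data")
    documents = SimpleDirectoryReader("source_data").load_data()
    index = GPTSimpleVectorIndex(documents)
    os.makedirs("index_data", exist_ok=True)
    index.save_to_disk("index_data/index.json")
    return "Index created."

def ask(question):
    # Load the saved index and run the query against the OpenAI-hosted model.
    index = GPTSimpleVectorIndex.load_from_disk("index_data/index.json")
    return str(index.query(question))

with gr.Blocks() as demo:
    uploaded = gr.File(label="Drop a PDF or text document")
    status = gr.Textbox(label="Status")
    gr.Button("Create Index").click(create_index, inputs=uploaded, outputs=status)
    question = gr.Textbox(label="Your question")
    answer = gr.Textbox(label="Answer")
    gr.Button("Query").click(ask, inputs=question, outputs=answer)

demo.launch()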
Before we dive into the coding part of this demo, let's spend a few minutes on the fine-tuning method and on why we need a third-party or open-source library to fine-tune a large language model, whether it is hosted at OpenAI, downloaded to your local machine, or hosted at a third-party repository such as Hugging Face. So before we look at the code and learn how to extend any large language model that can be enhanced with an external data source, let's look at the need for fine-tuning and at how a large language model can be fine-tuned for external data sources.

We already know that, depending on its capabilities and the data it has been trained on, a large language model can range in size from a few million parameters all the way to a few billion parameters. The larger the model, the better its ability to answer your questions, and the wider the range of topics on which it can answer well. The basic task types are general question answering, text summarization, and generating something new from the model.

We also know that a basic large language model is designed to take text as the input prompt and, based on that prompt, produce text output. However, the model is not limited to producing raw text: what you are actually getting is a numeric representation of the output text. Both the input text prompt and the output text are generated from a common embedding, or token, dictionary for the source language, so you can think of the numeric text representation as the encoded value of your output text.

Here is an example where we ask the large language model to write a haiku about rain (a haiku is a three-line poem from the Japanese tradition). The output is: "Raindrops falling fast / Nature's music fills the air / Life is renewed, refreshed." The numeric representation of this text output looks something like this: every word is encoded in a fixed format, the vector representation of your text. That is what you are really getting from the large language model based on the text input you give it.
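As a concrete illustration of that numeric representation (my own aside, not code from the video), OpenAI's tiktoken library can show the integer token IDs a GPT-style model actually works with; the choice of encoding here is an assumption.

import tiktoken

# Load a tokenizer; "cl100k_base" is one of OpenAI's published encodings.
enc = tiktoken.get_encoding("cl100k_base")

haiku = "Raindrops falling fast\nNature's music fills the air\nLife renewed, refreshed"

token_ids = enc.encode(haiku)  # text -> list of integer token IDs
print(token_ids)               # the model sees these numbers, not the words
print(enc.decode(token_ids))   # decoding the IDs round-trips back to the text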
Large language models can also be multimodal, meaning that beyond text content as input you can pass an image as a prompt, audio as a prompt, or video as a prompt. The very recent GPT-4 model, available through the paid ChatGPT Plus service, is a multimodal version of ChatGPT, just to give you an example of a multimodal large language model. The output is still text, because these models are inherently language models: even when the multimodal inputs include image, audio, or video, everything is translated or preprocessed into a text prompt internally, that text is fed into the multimodal large language model as text input, and the output is text. In the future, these large language models will be combined with other generative AI models, where you use image, audio, or video as input and the output can also be non-text, whatever your target output type is, because a large language model can be combined with other models programmatically. But large language models are created to process and generate language, which is why they are called large language models; in the future things could change quite a bit.

We know that large language models are trained on a very large amount of data, web-scale data, and the bigger the training set, the bigger the model; that is why we say large language models have billions of parameters: they are trained on very large input data. So the question is how this very large body of input text is transformed into a large language model. Your content is not sitting inside the model; the model is trained on how the content is written, and later, when your prompt is processed, the model uses its knowledge of the output language, say English, to generate tokens. It is not reading stored text and composing an answer from it; everything is created on the fly. The words, as tokens, are created on the fly based on the semantics of the language, and that is how the question you asked is answered. The objective of this video is not to show how this training happens, but to explain why we take the route of fine-tuning.

The most popular technology for building large language models is the Transformer. I won't go deep into it, but the Transformer, which came out of Google, is the groundbreaking research behind the creation of these models. We take a very large amount of data and, with the help of very large compute resources and the Transformer architecture, break down that source content, distribute it, and train on a distributed platform; that training is what produces these large language models. The first step is converting the input data to text embeddings, basically encodings, so that it can be processed properly by the machine learning and AI algorithms, and at that point your source data is no longer needed: the embedded text is the source for training, and that is what generates the large language model. In a nutshell, a large language model is created from the text embeddings of the source data. The reason I emphasize this word is that it is an important aspect when you enhance a large language model through publicly available libraries so that you can ask questions about text the model was never trained on. We have already gone through how you get the answer to your prompt, based on this quick understanding of training with very large data.
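To make the text-to-embeddings step tangible (again my own illustration, not from the video), here is roughly how you could obtain an embedding vector from the OpenAI API using the pre-1.0 openai Python client that was current when this video was published; the model name is an assumption.

import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# Turn a piece of text into a dense numeric vector.
response = openai.Embedding.create(
    model="text-embedding-ada-002",
    input="Toolformer: Language Models Can Teach Themselves to Use Tools",
)
vector = response["data"][0]["embedding"]
print(len(vector))  # a list of floats; ada-002 vectors have 1536 dimensions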
One of the biggest examples right now is ChatGPT, and there are several other open-source and proprietary large language models available from Facebook, OpenAI, Google, and other third parties: open-source organizations such as BigScience have BLOOM, there is GPT-Neo, and a variety of others. A significant amount of research has gone into these models, which you can download and try locally if you have the resources on your machine; models like ChatGPT are hosted on the creator's infrastructure (for OpenAI, wherever that is hosted), and you access them through the API.

Most of the time, in the most popular recent usage, the model is hosted in the cloud and you connect to it through the API. So if the model is hosted in the cloud and you want to train it - "train" may not be the right word - to fine-tune this large language model with your own custom data, how can you make that possible? That is where fine-tuning a large language model through various third-party libraries comes in. The concept is still the same: how do we connect these large language models with external data sources so that we can fine-tune them? The fine-tuning process is quite similar to the training we saw earlier. First, we take the external document and break it down into indexes; these indexes contain the encoded content along with other information that converts the source document into indexes, and after that we no longer need the source document at all. The indexes are used to fine-tune the large language model, and once that fine-tuning is complete, you have the model in conjunction with this external document index: a model that can answer questions based on your own document. That is where the fine-tuning part takes over. To answer a question such as "What is this document about?", you are asking the large language model to refer to the external document index that was just used for fine-tuning; with that index, ChatGPT or any large language model can still use its own knowledge of how to write the language and how to interpret the input question, refer to the external document index, and, combining all the knowledge it already had from training, generate an answer to your question. That is what fine-tuning a large language model achieves.

After this brief introduction, it is time to learn how to use just a few lines of Python code to first fine-tune a large language model by generating the indexes, and then load that index to ask whatever queries, summarizations, or questions we have of the model connected through the index we created. The coding will be done in two separate files: first the fine-tuner, which is just three or four lines of code, and then the query indexer, which is also a few lines of code; so in about 10 lines of code you will see how to fine-tune a large language model with your own custom source index.

I have PyCharm open; let me open the terminal. The configuration I am using is Python 3.10, and the runtime environment is a Conda runtime, as you can see here from the Python 3.10 and pip info. I already have gpt-index installed on this machine ("pip show gpt-index"), version 0.4.30, which is the version we are using, so you must have that package installed. I will clear this code and create the main file for the training part, which is fine_tune.py, and then we will create query_index.py. fine_tune.py handles training: reading the source document and generating the index from it.
query_index.py reads the index and answers your questions. So first, gpt_index is what we are going to use, and at this point we can look into the GPT Index documentation to see what we need to import. From gpt_index we need the readers you can see here, such as SimpleWebPageReader and SimpleDirectoryReader; because we are going to read data from a directory, let's use SimpleDirectoryReader. We also need GPTSimpleVectorIndex: the document has to be read from the local file system and then converted to a vector index. These are the two modules we need from gpt_index.

First we use SimpleDirectoryReader; you can look into the documentation if you are interested, and here is the example. You pass your input directory to SimpleDirectoryReader and then load data from it. So SimpleDirectoryReader is going to read from a location; we create a new folder here called source_data, and SimpleDirectoryReader("source_data").load_data() reads the data from that folder. We assign the result to loaded_content, which holds the text after the files in the folder have been read. Then we use GPTSimpleVectorIndex, which is what generates the index: it takes loaded_content as input and produces the output index. That index is then saved to disk; as you can see, there is a save_to_disk method that takes a path, so we call output_index.save_to_disk with the path "index_data/index.json", meaning the output is saved into a folder called index_data as index.json. These three lines of code are all you really need to read the data and save the index as JSON.

Now we need to make sure we have access to the OpenAI backend, which performs this job. Before we even get there, let me run this code and see what happens. I put a breakpoint here; we are at the very first line, and at this point I drop a raw .txt file into the source_data folder - there you go, source_data now has a file, just some text content I copied in. Now let's run the code. As you can see, loaded_content is nothing but the text file, and this step does not need any access to the OpenAI library; access is only needed when we convert the loaded text to the output index. Because we do not have access to an LLM yet, stepping to the next line raises an exception, and as you can see it says it did not find an OpenAI key: it needs the key and asks us to use the variable OPENAI_API_KEY. The way to set it is through an environment variable, so we import os and set os.environ["OPENAI_API_KEY"] to your key. Let me find my key... OK, I am using this key right now; I will change it after this tutorial, so it will not be usable - please do not try it.

With the OpenAI key set, let me close this debugging session. As you can see, it is using the text-davinci-003 model as the default LLM. Let's clear and run the code again: OPENAI_API_KEY has been set, loaded_content we have already seen, and now when we step forward, the processed content is sent to the OpenAI backend and the output index is generated. Looking at the output index, you can see it has all the information the large language model needs to answer your questions once this index is loaded. I will close this, and we save the index to index_data/index.json - and here is our index.json. The index has been saved to the local file system, and that is your fine-tuning: with just three lines of code you have created the index.
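Putting that walkthrough together, a minimal fine_tune.py consistent with the steps above could look like this; it assumes gpt-index 0.4.x, where these class and method names existed, and the folder names follow the video.

# fine_tune.py - read documents from a folder and build a vector index.
import os
from gpt_index import SimpleDirectoryReader, GPTSimpleVectorIndex

# The key must be set before building the index, because GPTSimpleVectorIndex
# calls the OpenAI API (text-davinci-003 is the default LLM at this version).
os.environ["OPENAI_API_KEY"] = "sk-..."  # use your own key

# Read every file found in the source_data folder.
loaded_content = SimpleDirectoryReader("source_data").load_data()

# Send the content to OpenAI and build the vector index from it.
output_index = GPTSimpleVectorIndex(loaded_content)

# Persist the index so the query application can reuse it.
os.makedirs("index_data", exist_ok=True)
output_index.save_to_disk("index_data/index.json")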
Now we are going to create the query application; let's call it query_index.py. We do the same thing here: we need to read the index first, so we need gpt_index and os, and we need to make sure the API key is available when we use the query index, so that is set for us. We used GPTSimpleVectorIndex before, so again: from gpt_index import GPTSimpleVectorIndex, because that is what we used to build the index. Now the index has to be loaded; among load_from_dict, load_from_disk, and load_from_string, we load from disk and pass the same complete path as before, which gives us the index: loaded_index = GPTSimpleVectorIndex.load_from_disk(...). If you inspect it, you can see a query method, so once the index is loaded you call query on the loaded index and ask your question. I set prompt = "What is this text about?", and that prompt is passed to query - and that's pretty much it.

Let's put a breakpoint here and debug. On loading, the request goes to the OpenAI backend through the API, and using the saved file you now have the loaded index available - you can take a look, that is your loaded index. Then, using the loaded index, you ask the question: here is the prompt, and now the request goes back to the OpenAI API, where we are working with the text-davinci-003 model. We have just got the output: the text is about how artificial intelligence can transform DevOps, which is what the document was about. We can keep running this code, reusing the same index again and again, and print the output: add print(...), save it, remove the breakpoint, and run. Here is our prompt; we can say "give me the top five questions about this document," and here are the five questions found in the document, such as "What are the objectives of DevOps and how can AI help?" - and then we can exit.
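And a matching query_index.py along the lines shown in the video, under the same gpt-index 0.4.x assumptions:

# query_index.py - load the saved index and ask questions against it.
import os
from gpt_index import GPTSimpleVectorIndex

os.environ["OPENAI_API_KEY"] = "sk-..."  # use your own key

# Load the index that fine_tune.py saved to disk.
loaded_index = GPTSimpleVectorIndex.load_from_disk("index_data/index.json")

# Each query combines the relevant index content with the prompt via the OpenAI API.
prompt = "What is this text about?"
print(loaded_index.query(prompt))

prompt = "Give me the top five questions about this document."
print(loaded_index.query(prompt))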
The knowledge we have gained - fine-tuning as well as using the index to run queries - is going to help us build the Gradio UI-based application, where we will train, or fine-tune, the OpenAI model on text documents as well as PDFs and then ask questions; that will be part two of this presentation. So this was part one of our tutorial. In part two of the video we will start immediately writing the Gradio UI-based application with multiple tabs, so that we can first use various text and PDF documents to generate the index (the index is built from the source documents) and then use the index to run whatever queries you have with the help of the OpenAI large language model. Thank you so much for your time; I hope you have enjoyed the video, and I am looking forward to seeing you in my next video. Until then, thank you so much.
Info
Channel: Prodramp
Views: 12,075
Keywords: ChatGPT, OpenAI, Anthropic, Claude, LLM, Large Language Model, GPT, GPT-4, GPT-3, Prompt Engineering, Fine Tuning
Id: 6aOzoJKNLKQ
Length: 29min 48sec (1788 seconds)
Published: Sat Mar 18 2023