How To Create Datasets for Finetuning From Multiple Sources! Improving Finetunes With Embeddings.

Captions
Hey YouTube, today we're going to talk about the intricacies of setting up datasets for fine-tuning your large language models. We'll start with the considerations we need to make before even constructing the dataset, then talk about what the total pipeline looks like and whether we need embeddings or not, then cover how we can structure datasets from raw text like books using LLMs, and finally go over a couple of examples, specifically a coding example and a medical appeals process that uses embeddings. If you'd like to skip any of this, feel free to check out the chapters below, but otherwise let's get started. I apologize for my voice today, I have a pretty bad cold.

Before we start to build our dataset for fine-tuning, we want to ask ourselves what we want our model to do and how we want to fine-tune it. This determines how we're going to approach creating our fine-tuning dataset, and also whether we're going to build any additional infrastructure to support the fine-tune. There are a few broad categories a fine-tune can fall into: a question answerer, which might answer specific questions about a book, or a broader model that can answer questions about a topical area such as a field of science, for example something that explains medical topics and can be asked about ailments; a text generator that produces documents, for example medical appeals; a summarizer for scientific papers or legal documents; a translator into or from various languages; a creative writer that can produce fantasy novels, stories, sci-fi, or even comedies; or a coding assistant that can create examples for query languages like Cypher or SQL, or give better examples of languages like Java and Python.

For a few of these we're probably going to want to support the model with embeddings so we can provide examples. For instance, if we want to create medical appeals, we want to teach the model how to write a good appeal, and what a good appeal is will hopefully come from a dataset of appeals that have worked over the years; we can then supplement it with additional information from an embedding model. So now let's look at how we could set up a framework that supports asking questions and supplying embeddings to give additional context to the model.

Before we get into how we can use embeddings with our fine-tunes, let's review what embeddings are. If you'd like a deeper dive, check out this video here, but let's do a quick refresher. Embeddings relate concepts in an n-dimensional space, and all this means is that things that are closer to each other in this space are more related. In this case we have a three-dimensional embedding space; this happens to be word2vec with a 10,000-word vocabulary. These concepts over here could be car concepts, for example an engine, a hood, a trunk, and so forth. Anything that has to do with a car could be up there, including names and various other things, but the general idea is that the closer two points are in the space, the more related they are. Our massive text embedding models are very high-dimensional, typically 700, 1,000, or more dimensions, but it's the same idea, just projected much higher.

The power behind transformers is that, while this example embedding space is static, transformers can move around the space as vectors: they can take a whole sentence and, using context, place it in a different part of the space than the individual word combinations alone would imply. For example, a sentence like "I like to walk my dog" would be tokenized, and after being run through a sentence or phrase embedding you would get back a single vector, which is just a group of floating-point numbers with the same dimensionality as the embedding space.
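To make this concrete, here is a minimal sketch of producing sentence embeddings and comparing them, assuming the sentence-transformers library and the intfloat/e5-large-v2 checkpoint (the video only names E5-Large, so the exact checkpoint and the example sentences are assumptions):

```python
# Minimal sketch: turn sentences into vectors and compare them.
# Assumes sentence-transformers; the checkpoint name is an assumption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/e5-large-v2")

# E5 models expect a "query: " or "passage: " prefix on their inputs.
sentences = [
    "query: I like to walk my dog",
    "passage: Taking the dog out for a walk every morning",
    "passage: Gleevec is a tyrosine kinase inhibitor",
]
embeddings = model.encode(sentences, normalize_embeddings=True)

# Each input comes back as a single fixed-size vector (1024 floats for E5-Large).
print(embeddings.shape)

# Cosine similarity: related sentences land closer together in the space.
print(util.cos_sim(embeddings[0], embeddings[1]))  # higher (both about dog walking)
print(util.cos_sim(embeddings[0], embeddings[2]))  # lower  (unrelated topics)
```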
We can leverage these sentence-level embeddings to provide more information to our LLM, and in the next section we'll get into more detail on how, but essentially we can provide it with additional context. In the medical ailments and appeals example, we could supply extra information about medical ailments, which lets us keep the model more general: we don't have to show the model every possible scenario, we instead teach it how to take information and process it. As for what we can use, there are English-focused embedding models like Instructor-XL and E5-Large, and there are also multilingual ones, which we'll include in the description below.

Now let's move on to how we can set up this pipeline and get it to work. In this scheme, fine-tuning works just like we're used to: in the case of medical appeals, we'll have a corpus of appeals that have been successful, and we'll fine-tune a foundational model on that corpus using LoRAs or QLoRAs. The twist is that we then slot this fine-tuned LLM into a pipeline and feed it additional context about medical ailments and ways to bolster an appeal to a rejection, because there's no way we can teach the model every single possible appeal and every single possible medical condition. Instead, we'll have a large embedding database of example medical ailment documentation that it can lean on to help it write better appeals.

Here's how it works. First we chunk our documents, because just like LLMs, these embedding models have a token limit; it might be 768 or 2,048, they all vary, so make sure you align with the token limit of the embedding model you're using. For example, if your embedding model has a 768-token limit and each of your documents has 7,680 tokens, then each document gets broken into 10 chunks, and each chunk is run through the embedding pipeline and stored in a vector database. I'll put a list of vector databases in the description below, but you have several options: ChromaDB is a popular example, Postgres (with pgvector) is another, and Pinecone is a very powerful hosted option. Then, when we run a query about an example denial, we also embed the query, pull relevant information out of the vector DB, and supply the LLM with those examples. We'll demonstrate this with SuperBooga and its embeddings on a trained model, have it generate appeals based on this, and, if we wanted to, we could also return the examples that were used so we can make sure they're relevant.
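The chunk, embed, store, and retrieve loop described above could look roughly like this, assuming ChromaDB with its built-in default embedding function; the collection name, chunk size, and documents are all illustrative assumptions:

```python
# Rough sketch of the retrieval side of the pipeline, assuming ChromaDB.
import chromadb

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path=...) to keep data
collection = client.create_collection("ailment_docs")

def chunk(text: str, max_tokens: int = 768) -> list[str]:
    # Naive whitespace chunking as a stand-in for a real tokenizer-aware splitter.
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

# Pretend these came from our corpus of medical-ailment documentation.
documents = {
    "hodgkin_overview.txt": "Hodgkin's lymphoma is a cancer of the lymphatic system ...",
    "gleevec_usage.txt": "Gleevec (imatinib) is a tyrosine kinase inhibitor used to treat ...",
}

for doc_id, text in documents.items():
    for i, piece in enumerate(chunk(text)):
        # ChromaDB embeds the text with its default embedding function on add().
        collection.add(ids=[f"{doc_id}-{i}"], documents=[piece])

# At query time, embed the denial the same way and pull related chunks,
# which then get injected into the prompt alongside the fine-tuned model.
denial = "Coverage for Gleevec was denied for a patient with Hodgkin's lymphoma."
results = collection.query(query_texts=[denial], n_results=3)
print(results["documents"][0])
```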
But now let's move on to how we can structure the documents we plan to fine-tune with, and how we deal with the more difficult ones, like books. Not all datasets are created equal, especially when we're talking about raw text versus structured formats like JSON or XML, and one of the data types I get asked about the most is how to turn a book into something useful for fine-tuning, specifically into a structured format. What I like to do is leverage an LLM, because I don't have the time to convert a book into a dataset by hand, and I'm sure most of us don't either. So I've created a script that uses oobabooga's web API; all you have to do in oobabooga is enable the API, and the script will communicate with oobabooga and whatever LLM you're using to help convert your text into a structured format. The script will be available in the description below. All it does is give a command that says: "You are an API that converts bodies of text into a single question and answer in a JSON format. Each JSON contains a single question with a single answer. Only respond with the JSON and no additional text." You can modify this into whatever structure you prefer; it may take some tweaking to get it to work correctly, but this one works pretty well. It doesn't work perfectly, so I have some try/catch logic here that gives it a few retries if it doesn't produce valid output and then moves on to the next chunk if it doesn't seem like it's going to work. I'm printing "oops" here, but in the repository it will print proper statements so you can debug and know if it's failing too often.

The example we'll be training with is Twenty Thousand Leagues Under the Sea, and what I'd like to do is turn it into a Q&A format and see how well that performs. This will take a while to run, so let me start it and we'll come back once it's done. When it finishes, we should have a JSON output just like this, with questions and answers that we can now parse and turn into some other format. In my case, I'll be turning it into my usual instruction, input, and output structure, and if you'd like to see how we actually go through the fine-tuning process, check out these videos here on LoRA and QLoRA. Let's take a look at how it performed. In this case it asks, "What is the title of this book?" — "Twenty Thousand Leagues Under the Sea by Jules Verne." "What is the Conclusion, Part One, Chapter One about?" — "Conclusion, Part One, Chapter One is about a mysterious and puzzling phenomenon which occurred in 1866." So it did a pretty decent job of creating a rough draft of a Q&A dataset, and if we wanted to put some time into refining it, we could; at least now we have a basis on which to fine-tune our model. This approach should work well for a variety of books, if not all of them. One of my viewers asked about doing the King James Version of the Bible, and that should work here as well, as should comedies, parsing scripts, and so on; it just depends on what output format you're looking for.
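The video's own script is linked in its description, so as a rough illustration of the flow it describes (prompt an LLM for a single Q&A JSON per chunk, retry on malformed output, then map into instruction/input/output), here is a sketch where generate() is a placeholder for whatever call your LLM backend exposes, and the question/answer keys are assumptions about the output format:

```python
# Sketch of the book-to-Q&A flow described above. `generate` is a placeholder
# for a call to your LLM backend (e.g. the oobabooga web API); the retry count
# and JSON keys are assumptions mirroring what the video describes.
import json

SYSTEM_PROMPT = (
    "You are an API that converts bodies of text into a single question and "
    "answer in a JSON format. Each JSON contains a single question with a "
    "single answer. Only respond with the JSON and no additional text."
)

def generate(prompt: str) -> str:
    raise NotImplementedError("call your LLM backend here")

def text_to_qa(chunks: list[str], retries: int = 3) -> list[dict]:
    pairs = []
    for piece in chunks:
        for _ in range(retries):
            reply = generate(f"{SYSTEM_PROMPT}\n\n{piece}")
            try:
                pairs.append(json.loads(reply))
                break
            except json.JSONDecodeError:
                continue  # model returned extra text or malformed JSON; retry
    return pairs

def to_instruction_format(pairs: list[dict]) -> list[dict]:
    # Map Q&A pairs into the instruction/input/output layout used for LoRA training.
    return [
        {"instruction": p["question"], "input": "", "output": p["answer"]}
        for p in pairs
    ]
```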
Now let's go on and look at how these models performed after running them through some fine-tuning. The first fine-tune we're going to look at is a model I fine-tuned to write Cypher queries, which are used by graph databases like Neo4j. If you're not familiar with it, that's okay; we just want to see that the language model can learn to write in different programming and query languages. The base model I fine-tuned is StableLM Base Alpha 7B, and I'm just going to ask it to write a Cypher query to find all users in the database. The response we get is not exactly what we would want it to be. So let's look at what Cypher should look like. This is the dataset I trained on, a fairly small dataset of just a few hundred examples, and you get this kind of visual query language: you ask to match on a node of a certain type with some metadata, and you can give relationships, which you can think of as joins in SQL, but written in a more visual way. If we go to the model we have fine-tuned and ask it the same thing, to write a Cypher query to find all users in the database, we should get a much different answer this time, and we do: we get at least the structure we expect. By the way, this application is called H2O LLM, and I'm going to go over it in a separate video or live stream because it's actually pretty great, with a lot of features, but this video is already going to be pretty long. The point is that it doesn't take a lot of samples to start getting your language model to recognize the style.
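The transcript describes Cypher's pattern-matching style without showing a query, so here is a rough illustration of the kind of query the dataset targets, wrapped in the official Neo4j Python driver; the connection details, node labels, and relationship types are assumptions, not the video's actual training data:

```python
# Illustrative Cypher queries of the kind the fine-tune is meant to produce,
# sent through the Neo4j Python driver.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# "Find all users in the database": match nodes by label and return them.
find_all_users = "MATCH (u:User) RETURN u"

# Relationships play the role joins do in SQL, written as visual arrows.
users_and_orders = (
    "MATCH (u:User)-[:PLACED]->(o:Order) "
    "RETURN u.name AS user, o.id AS order_id"
)

with driver.session() as session:
    for record in session.run(users_and_orders):
        print(record["user"], record["order_id"])

driver.close()
```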
Now let's see how well we can get a medical application to perform. This is a lightly fine-tuned model for writing Medicare and Medicaid appeals; nothing too deep, I didn't have a ton of samples to give it, but what I'd like to show you is the difference between how the model performs with and without embedded information. In this first round, I ask it about a Mr. Alan Jordan, who has been declined coverage for treatment of his Hodgkin's lymphoma with Gleevec, and it generates a fairly decent appeal. It says, "Dear Medicaid Appeals, I am writing this on behalf of Mr. Jordan," explains what lymphoma is, notes that he was prescribed Gleevec and that it is a decent treatment for this condition, argues that denying this life-saving treatment is unacceptable and that the doctor feels it is medically necessary, and urges them to reconsider.

Now we're using SuperBooga, a very powerful extension that lets us drag and drop files to give the model embedded context. I've taken the MedQuAD dataset and broken it out into just its informational content, so SuperBooga will be able to reference that information through these tags up here, specifically this injection point, where it injects the data into the prompt for the model. It'll take me a while to get the data uploaded, so I'll come back and show you how it looks once it's done. About 30,000 samples were uploaded to oobabooga's vector database for embeddings, and to use this in SuperBooga, all we have to do is drag and drop the file we'd like to use here and hit "load data"; it handles the rest for us. Now we want to see how well this performs using the embedded data and whether it performs better than before. What we should see is a lot more supporting detail being given to justify the appeal, and that is what we seem to be seeing here. Usually it will give somewhat more information (like I said, I didn't give it a ton of training), but we do see additional details, like the American Society of Clinical Oncology having issued a statement about off-label use, and so forth.

This is the power behind leveraging an embedding database alongside a fine-tune: it lets you get much more particular behavior without having to worry about covering every possible sample, and that applies to coding examples, comedic examples, book examples, and so forth. Hopefully this helps with how we approach a more general design in our fine-tunes. If this was helpful, please like and subscribe, and let us know in the comments below what you'd like to hear about next. Join us in our live stream today, when we'll be going over how to broaden context in local models up to 32,000 tokens of context, and we'll be going over the results of our book training and our comedy sets. See y'all later.
Info
Channel: AemonAlgiz
Views: 22,797
Keywords: Machine Learning, Artificial Intelligence, Language Models, Fine Tuning, Data Preprocessing, Embeddings, NLP, Text Generation, Dataset Creation, Coding, Medical AI, AI in Healthcare, AI Development, AI Modeling, Language Processing, Open source, LLM, LoRA, QLoRA, Data Science, AI Training, Natural Language Understanding, AI Techniques, Deep Learning, Python Programming, Oobabooga, H2O LLM, llama
Id: fYyZiRi6yNE
Length: 17min 21sec (1041 seconds)
Published: Tue Jun 06 2023