Data Extraction with Large Language Models

Captions
I'm going to walk you through data extraction using large language models. We're going to extract names and organizations from short and really long pieces of text. We'll look at extracting into JSON objects and also into YAML format, and I'll talk about the cost of different approaches: GPT-3.5, GPT-4, and then the best open-source model for extraction, in my opinion, OpenChat 3.5 7B.

Let's take a look at the agenda. We'll talk a little about motivation, then I'll show you the two different formats we're going to use for extraction. I'll go through some of the key problems and why extracting data is difficult, then I'll talk about some of the best models: the best open-source model, and the best private model I've tried, which is GPT-4, unsurprisingly. I'll then go through a demo where I compare performance using OpenChat 3.5 versus GPT-3.5 and GPT-4. I'll show you short examples, which are more illustrative, and then long examples, along with a technique where you concurrently make requests to a server you've set up; that gets very high throughput, so you can do extraction very fast in parallel by pinging the server with different chunks of the text you want to extract from. If I have time, I'll do a little demo on Notux, which is a fine-tune of the Mixtral model (OpenChat is a fine-tune of Mistral 7B and Notux is a fine-tune of Mixtral, so there's some similarity there). Then I want to talk a little about costing: GPT-4 is quite expensive, so I'll go through the cost of extracting per 100,000 tokens of text for GPT-4, GPT-3.5, and OpenChat.

In terms of motivation for data extraction, it's quite simple. If you have a very long piece of text and you don't use a language model, it's going to be very hard to write a program that just extracts names. You would have to do some kind of pattern matching, maybe find common names like Michael or Michelle and see if they occur by direct text comparison. You could also do some kind of vector search, where you embed your text (basically convert it into vectors), define some terms you're looking for, get the vectors for those terms, and then, by comparing the set of vectors in your text against the terms you're looking for, determine whether the search terms appear within your database. That's generally not perfectly exact, so it's also difficult to pull off. This is really the beauty of using a language model: it's extremely flexible. You can give it a prompt asking for the names to be returned in a certain format, and it's able to use meaning much more precisely to get you the names or the organizations. I'll show just names and organizations today, extracting both at once, which is definitely more difficult and less robust than extracting one type of data, but I wanted to challenge the models a little. You could of course tweak the examples if you want to extract numbers, maybe stock prices, counts of objects, or other data formats; that would all be possible in an analogous way.

Now, the problems you run into when trying to extract data fall into three rough buckets. But before I talk about the problems, it's probably helpful to show the YAML and JSON structures, so I'll jump right into an example. Let me first show you an example of JSON extraction.
JSON is one data format that can be used, and here you see a JSON object. From this long transcript, which is the Berkshire Hathaway meeting from last year, I've asked GPT-4 to extract a list of names and organizations and return it in a JSON object. Just take in the rough format: it is perfectly formatted, and because it has this predictable structure we can easily write a Python or other script and manage and control this data however we wish from here. So that's what the JSON format looks like. I'll also show you the YAML format. YAML is actually a little simpler and cleaner: you've got the heading "names", a colon, and then just a dash for each of the entries, and then the same for "organizations". Comparing YAML and JSON, you probably use up fewer tokens with YAML, so if you really want to minimize costs you'd probably use a YAML-type approach. As you'll see when I go through some live examples, in some cases the YAML performs a little better and in some cases the JSON performs better, and it will depend on how the model has been trained. I'm not saying I understand exactly where one performs well and the other doesn't, but hopefully I can give you a somewhat nuanced understanding just by showing you some live examples.

What we're going to do next is generate some prompts, and maybe it's helpful if I show you some of the exact prompts I've used, so let's scroll right up to the top. Let's go through the YAML format. The prompt is as follows: "Extract names and organizations from the provided text and return them in YAML format. Use the following schema." I then give an example of the schema I want used: there are properties for names, which is an array of strings, and for organizations, also an array of strings, and we require that the language model return both names and organizations. Continuing with the prompt, I say "here is an example of a response in YAML format." So not only do I describe the schema, I also give an example of a response: names with sample strings, and organizations with sample strings, in exactly the format I want the model to respond in. Putting in that one example makes it a one-shot request to the language model, and that definitely helps with performance.

Then I give a little extra guidance: do not include anything that is not explicitly mentioned in the text; analyze the text carefully to ensure all requested data is extracted; include each name and organization only once, avoiding repetition in the list of organizations (which is a problem you'll see); adhere strictly to the response format without adding extra spaces or text. What you want is for the model to respond with just the object. You don't want it to say "oh, here's the object, you're very welcome," or give the object and then ask whether you have any other questions. You want the response to be very predictable. It's true that if there's something before or after, you could just slice that off programmatically, but it still adds an area of unpredictability, and that's not what you want when you're doing extraction.

After this pre-prompt comes the actual text, and if you're going over the context length, this would not be all of the text but rather a chunk of it.
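As a rough illustration, here is a minimal sketch in Python of how a prompt like the one just described could be assembled. The wording, the schema layout, and the build_prompt name are my own for illustration; they are not necessarily what the Trelis scripts use.

```python
# Minimal sketch of assembling a YAML-extraction prompt (illustrative wording only).
YAML_SCHEMA = """names:
  type: array
  items: string
organizations:
  type: array
  items: string
required: [names, organizations]"""

YAML_EXAMPLE = """names:
  - Jane Doe
  - John Smith
organizations:
  - Acme Corp"""

def build_prompt(text_chunk: str) -> str:
    return (
        "Extract names and organizations from the provided text and return them "
        "in YAML format. Use the following schema:\n"
        f"{YAML_SCHEMA}\n\n"
        "Here is an example of a response in YAML format:\n"
        f"{YAML_EXAMPLE}\n\n"
        "Do not include anything that is not explicitly mentioned in the text. "
        "Include each name and organization only once. "
        "Adhere strictly to the response format without adding extra text.\n\n"
        "TEXT START\n"
        f"{text_chunk}\n"
        "TEXT END\n\n"
        "Now answer immediately and only in YAML format."
    )
```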
By repeating this entire prompt multiple times in parallel, you can treat each chunk separately and then accumulate your results across those chunks at the end. Anyway, after the pre-prompt we start the text and inject it right here; this is a really long piece of text. When we get to the end of the text, we close things off with "TEXT END", which just gives a little delimiter to tell the model: here is the chunk of text, here's where it starts, here's where it ends. That adds a bit of clarity to the whole process.

What I like to do then is give a final prompt after the text has been injected. You could put all of the prompt before the text, but models pay particular attention to the very start and the very end of a prompt, so it's nice to put this little piece at the end; it can improve performance and reiterates what the model is to do. I say "now answer immediately and only in YAML format," again encouraging the model to provide just the object as its response. And with that, we have the YAML output being provided by GPT-4.

Let me go a little more quickly through the JSON example. I've used the exact same piece of text, and you'll see the prompt is very similar: "Extract names and organizations from the provided text and return them in JSON format. Use the following schema." But now the schema is a JSON-type schema, and the example I provide is also a JSON-type example. Then I say "do not include anything that's not explicitly mentioned in the text," reiterating the same guidance: trying to ensure the model doesn't miss anything by saying all the requested data must be extracted, trying to avoid repetition by saying not to repeat anything, and trying to ensure nothing comes before or after the response. Again I put in the entire text wrapped in TEXT START and TEXT END to provide clear delimiters, and I finish off with the one-sentence prompt at the end, "now answer immediately and only in JSON format," which GPT-4 does obligingly.

There's one other thing I want to show you here. When I run these models, you can see I'm running with a GPT-4 Turbo model, and I also have examples where I run with GPT-3.5 Turbo. You'll also see (it's quite small on my screen) that I've set the temperature to zero. I would definitely recommend setting the temperature to zero: you want the highest-probability tokens, and you don't want sampling variance pulling the model away from the format you asked for.
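For reference, the equivalent call through the OpenAI Python client, rather than a chat UI, might look roughly like this. The specific model string is an assumption about which GPT-4 Turbo snapshot you would pick.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract(prompt: str) -> str:
    # temperature=0 so the model picks the most likely tokens,
    # keeping the JSON/YAML output as predictable as possible
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",  # assumption: any GPT-4 Turbo snapshot works
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```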
Now, if you're running through ChatGPT itself, controlling the temperature can be more difficult, which is why I'm using chat.Trelis.com. This is just a fork of chatbot-ui (I believe chatbot-ui is coming out with a new version that will support the GPT-4 API). What it allows you to do is put in an OpenAI API key, which means, sadly, that you will be paying per token, but when you create a new chat you have a lot of flexibility. It's open source and available, so you could either run your own instance or use chat.Trelis.com and input your API key; I don't have any access to your data, and all of these conversations are stored locally in your browser. The beauty is that I can select longer-context models than are possible even with a Plus subscription: the Plus subscription allows for 32k tokens as of now, whereas this goes up to 128k and further. You can use GPT-3.5 Turbo and even the Instruct model if you want, and you can also lower the temperature so that you get more focused and deterministic output, in other words get the model to pick the most likely token at each stage.

With that, I've gone through the main formats and shown you the examples in GPT-4 and GPT-3.5. I haven't commented yet on accuracy, which is hard to judge without either reviewing manually as a human or comparing to some other model, so we'll get to the comparison later. First, I just want to highlight a few problems, many of which are addressed by the prompts I showed, though there are some other factors we'll come upon too.

Some of the main challenges with extracting data: first, you get incorrect syntax. I showed you some prompts that worked, but occasionally, and definitely depending on the model, it will respond with a slightly different syntax that isn't correct JSON or isn't the exact syntax you wanted, which makes it very difficult to parse because you can't predict exactly what syntax will be provided. Another problem, and I see this a lot with the Mixtral model (which I won't be running; I'll run the fine-tune, Notux, which is a little better), is that Mixtral tends to add a lot of context before and after the object, and that makes parsing difficult. A third key issue is repetition: you ask the model to extract, say, the names, and it just keeps saying something like "Warren Buffett, Warren Buffett" repeatedly. This is a common issue for language models; they get stuck in a pattern where they keep reproducing the same output tokens, and this uses up the full maximum number of tokens you've specified for the output, so you get slow responses and a needless amount of the same name being returned. Moreover, once the model gets stuck on a given name, it's not going to find any other names in the passage, so it fails to extract all of the names or organizations. These are the key issues, and they're what we'll try to address as we go through the live demo.
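To deal with the incorrect-syntax problem, each chunk's response can be validated before it's accepted. Below is a minimal sketch of that idea; the helper name and the cleanup heuristics are my own, not necessarily how the repo's scripts are written.

```python
import json
import yaml  # pip install pyyaml

def parse_response(raw: str, output_format: str = "json"):
    """Return (parsed, error) for one chunk's response.

    Strips code fences or stray text around the object, then checks that
    the result is well-formed JSON or YAML with the expected keys.
    """
    cleaned = raw.strip().strip("`")
    try:
        if output_format == "json":
            # slice from the first '{' to the last '}' in case the model
            # chatted around the object
            start, end = cleaned.find("{"), cleaned.rfind("}")
            parsed = json.loads(cleaned[start:end + 1])
        else:
            parsed = yaml.safe_load(cleaned)
        if not isinstance(parsed, dict) or not {"names", "organizations"} <= parsed.keys():
            return None, "missing expected keys"
        return parsed, None
    except (ValueError, yaml.YAMLError) as exc:
        return None, f"invalid {output_format}: {exc}"
```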
Very briefly, before I start the demo on OpenChat, I want to talk about a few of the models I've looked at. There are many I've tried, and unfortunately many have not worked as well as I'd hoped, but maybe somewhat surprisingly, one of the best models I've found is a 7B model: OpenChat 3.5. You'll see in the testing that it's comparable to GPT-3.5 Turbo (it's somewhat coincidental, and a bit misleading, that the naming is similar). OpenChat is a fine-tune of the Mistral 7B model; in some cases it's almost as good as GPT-4, and for really long passages it can give a much lower cost of extraction than GPT-4 at somewhat similar quality. On a technical note, OpenChat has released two more updated models, one in December and one in January, so there are three models available on the Hugging Face Hub. We're going to use the original one: for some reason I find the more recent models perform a little worse on function calling, which is an area I'm familiar with from other videos, and they also tend to hallucinate a little more, for reasons I don't understand. So we'll stick with the base OpenChat 3.5 model.

Another model that works quite well is CodeLlama 34B. Because it's much bigger, about five times the size of a 7B model, it's quite a lot slower to inference, which makes it more expensive as a solution. It tends to provide correct syntax, but it often misses entries within the input, so in many cases I still think the 7B model is somewhat better. When I chose models to look at, I gravitated towards coding models: they're often very good at recognizing positions and at retrieval tasks, I think because training on code helps a lot with positioning and structured responses. So I'm always tempted to try the DeepSeek Coder models, which perform very well; unfortunately the drawback with the DeepSeek Coder instruct models is that they are instruction fine-tuned and guided towards only answering code questions, and that's true even if you leave out the system prompt that tells the model to respond only to code questions. You could use the base model, but it becomes less predictable, somewhat like Mixtral, and doesn't necessarily complete sentences cleanly; it tends to blab on even when you want it to stop. As I've said a few times now, with Mixtral the syntax is typically wrong, which means it's not a good model for this application. Notux 8x7B, as we'll show at the end, generates decent syntax but is a bit inconsistent on accuracy, by which I mean it often leaves out entries, and it's very sensitive to the prompt: small changes in the prompt tend to make big changes in the output. You'll see this in general with weaker models. The beauty of GPT-4 is that it's quite robust: even if your prompt isn't quite right, has typos, or isn't optimally structured, GPT-4 handles it well in most cases (and we'll see some cases where it doesn't). OpenChat 3.5 is quite robust as well, but on some of the bigger stretch cases you'll see that small changes in prompts can make a difference.

For the performance comparison I'm going to be using the Advanced Inference scripts. You can purchase lifetime access to that repository, which now contains a lot of scripts: server setup on EC2, RunPod, or Vast.ai, API setup guides for vLLM and TGI, and a number of scripts for function calling, speed tests, and, just added today, data extraction. If you just want the data extraction scripts, you can purchase those on an individual basis rather than buying lifetime access to the repo (which includes further updates). And if you don't want to purchase at all, you can of course make use of the prompts; hopefully I've given enough detail in the earlier part of the video for you to copy those prompts into a piece of code yourself.

So here I am in the Advanced Inference repo, and I'm going to focus on the data extraction portion. I'll go over to VS Code, where I have it open, and go to the README file. The README has a set of instructions for getting started, and the first step is to get an API running for OpenChat. I've provided a one-click template you can use, and it's freely accessible; in fact, you can find a number of templates in a public repo called one-click-llms.
There you'll find one-click APIs for many models in TGI, vLLM, and also llama.cpp, and you can run them on RunPod or on Vast.ai. The ones I'm going to get going are the OpenChat 3.5 model and then Notux 8x7B. I've clicked the link to the OpenChat template, and typically I like to pick an A6000, so I click to deploy; everything should already be configured. If you're going to run it, I recommend running in full 16-bit precision. You can run in AWQ, which would let you get by with a smaller amount of VRAM, but this is a 7B model, so you only need about 15 GB of VRAM; an A4000 or an A6000 will fit it. I recommend 16-bit because it's actually one of the fastest formats if you're going to concurrently ping the API, which is what we want to do: we want to process many shards, many chunks of the text, at once to do things really quickly and get really good cost-effectiveness. Also, 16-bit gives better precision than 8-bit or 4-bit, which is a benefit for performance too. One other tweak: I often have "speculate 3" set on these one-click templates. This turns on speculative decoding based on the prompt: the server uses the prompt to guide its guesses for some future tokens, and that can provide a decoding speed-up of maybe 30%. However, if you're going to ping the API concurrently with a lot of requests, it's actually better not to speculate, because speculation takes up more of the GPU; removing it can speed things up a little when pinging concurrently, which is relevant if you have a very long piece of text, maybe over 10,000 tokens.

Here are my pods: I have OpenChat and I have Notux 8x7B. I'm using AWQ for Notux just so I can fit it on an A6000, and you can see it's already loaded and ready to go; OpenChat should be pretty quick to download. So I'm copying the ID of the pod and going over to VS Code, specifically to the .env file, where I put in my API endpoint. I've set up this endpoint in RunPod to hit the OpenChat pod, and I've set my model to OpenChat 3.5. Just a word of warning: if you're using TheBloke's AWQ quant of OpenChat 3.5, there's currently no tokenizer chat template, so you should always set the base OpenChat model here, because the model setting is what sets up the chat format. If you're going to use Notux, we'll set that one up a little later in the video.
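As a rough sketch of what that configuration amounts to in Python, loading the values from a .env file might look like this; the variable names API_URL and MODEL are illustrative, not necessarily the ones used in the repo's .env file.

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # pulls variables from the local .env file

# Illustrative names: point at the TGI endpoint for the RunPod pod,
# and name the base model so the right chat template gets applied.
API_URL = os.environ["API_URL"]  # e.g. the pod's TGI endpoint URL
MODEL = os.environ.get("MODEL", "openchat/openchat_3.5")
```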
Let's go back to the README. This is the README for the overall Advanced Inference repo, but what I want is the README for data extraction, so I'll close the first README and go to the data extraction one. I've got my pod launched and the .env file adjusted so that I'm hitting that RunPod pod. By the way, if you're just using the RunPod template by yourself without access to the repo, there are instructions in the pod's README for how to make requests using the pod ID. Now I'm going to cd into the data extraction folder, so let's open the terminal and cd into data extraction. Next I want to activate a virtual environment; I've got one called extractEnv. The very first time you run, you'll have to create it with python -m venv extractEnv, then source that environment and install the requirements. I've already installed the requirements, so I'll just source the environment, and here we are inside it.

With that, we're ready to make calls to the API. We can make JSON calls and also YAML calls just by providing a json or yaml flag; that sets the output format. We also have the chunk length, measured in characters; it's currently 8,000 characters, which is about 2,000 tokens. I recommend this length because it's about half the context length of a 4,000-token model, and you want something fairly short: the longer the context going into a language model, the harder it is to extract data from it. We'll see that even GPT-4 has trouble extracting all the data from very long contexts. So typically I slice my text into chunks of 8,000 characters and send those chunks through the GPU in parallel, which the GPU is very efficient at, and that's what the batching=True parameter is: it's just concurrent requests. Rather than sending one block of messages for decoding, I send blocks in parallel, each with a copy of the prompt but a different chunk of text inside, then get the chunks back and assemble them into a final list: a final JSON object or YAML.

The last parameter is the input file name. I'm putting in one of three files, and I'll call them out because the font size is small on screen. The main file is berkshire23, the transcript of the Berkshire Hathaway 2023 meeting; it's about 60,000 tokens, so about 240,000 characters long. Then I've got one shortened to 60,000 characters and one shortened to about 12,000 characters, which is below 4,000 tokens in length.

I'm going to start with the shortest file because I want some quick responses for demo purposes. Back in the README, I'll run this TGI command using JSON. Before I press enter, let's check that the pod is running: here's OpenChat, the container is up, and when you see "Invalid hostname" in the logs it means the API is ready to accept requests, which is why the GPU memory is up as it should be. OK, pressing enter... let's see if I did something wrong: it looks like I missed changing the input file, so we're actually processing the 60k file here, but that's fine; it'll probably work, and it just did. I validate after every chunk that the returned format is valid, and at the bottom it tells me the error rate was 0%. That doesn't mean it got all the entries; it means there were zero errors when validating the JSON format. Now I've modified this because I want a quick example on the shorter 12k file, with a shorter context, and I'll show you what's happening in the terminal. Because it's about 12,500 characters long and I've specified a chunk length of 8,000, there are two chunks: one of 8,000 characters and one of about 4,500.
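Here's a minimal sketch of that character-based chunking. The chunk_text name is my own and the splitting is simplified (no attempt to break at sentence boundaries), so treat it as an illustration of the idea rather than the repo's implementation.

```python
def chunk_text(text: str, chunk_length: int = 8000) -> list[str]:
    """Split text into chunks of roughly chunk_length characters.

    8,000 characters is roughly 2,000 tokens, i.e. about half of a
    4,000-token context window, leaving room for the prompt and output.
    """
    return [text[i:i + chunk_length] for i in range(0, len(text), chunk_length)]

# Example: a ~12,500-character file becomes two chunks (8,000 and ~4,500 characters).
```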
Here you can see exactly what has been fed into the LLM; I'll show you the second chunk, and it matches very closely what we put into GPT-4. It's JSON: we have the prompt formatted for OpenChat, then "extract names and organizations from the provided text and return them in JSON format," then the schema, then an example of the response, then some guidance, then the text, and at the end of the text, "now answer immediately and only in JSON format." I've printed the response to the screen, and it is indeed JSON with names and organizations. The tokens generated were about 139, at about 26 tokens per second for that second chunk; the first chunk came back at about 22 tokens per second. And you can already see, in this second chunk when using JSON, that there is a little bit of repetition within the list of organizations.

There's one little tweak I want to show you on that point. In the chat completion script, which is where we send the prompt to the language model, there are parameters including the maximum number of tokens to return and do_sample, which is false; that's equivalent to setting temperature to zero, because the model isn't sampling across the distribution, it's just picking the most likely token. There's one more parameter called repetition_penalty, which adds a penalty for repeating the same output. We can set this to 1.1 (by the way, setting it to 1 is not a penalty at all, since it just multiplies everything by one), so 1.1 penalizes repetition a little, and we can rerun the script and see whether that helps. Indeed, with the repetition penalty in place there's no repetition in the responses, which is good, but on the downside you can see that the first JSON that came back didn't include any names or organizations at all. So I'll caution that you can tweak performance a little with the repetition penalty; I find it works a bit better when doing JSON than when doing YAML. For YAML, adding the repetition penalty often, in my experience, leads to worse performance.
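Those generation parameters (max_new_tokens, do_sample, repetition_penalty) are what get passed to the TGI endpoint on each request. A rough sketch using the text_generation Python client is below; the endpoint URL and the exact parameter values are illustrative rather than taken from the repo.

```python
from text_generation import Client  # pip install text-generation

client = Client("http://localhost:8080")  # illustrative: your TGI / RunPod endpoint

def send_chunk(prompt: str) -> str:
    response = client.generate(
        prompt,
        max_new_tokens=500,      # cap the output so repetition loops can't run forever
        do_sample=False,         # greedy decoding, equivalent to temperature 0
        repetition_penalty=1.1,  # mild penalty; tends to help JSON, often hurts YAML
    )
    return response.generated_text
```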
Next I'll run the script for YAML to show you that performance. It's a very similar command, this time with the yaml flag and the model responding in YAML format. Actually, I've run the one for the 60k file, so let's run the same thing on the shorter input text. Now it's responding in YAML: names and organizations, and we're getting quite a nice response for both. These outputs are saved in output.yaml for the YAML example, which we can bring up to see the names and the orgs that have been extracted, and in output.json you can see the names and the orgs as well. What I'm going to do now is rerun the JSON example, but with the repetition penalty off, for a like-for-like comparison; so I run JSON on the short text, and that updates the JSON file. Notice that when I combine the JSON objects from different chunks, I do de-duplication, removing duplicate answers, which definitely helps clean up the response a little. So here we have the response from the short text using the JSON method.

The next thing I want to do is compare performance against GPT-4 and GPT-3.5, and the best way to do that is to copy the exact prompt we need. I'm going to copy this JSON prompt; strictly speaking I could do this in chunks as well, but rather than chunking, I'm just going to put it in as a single prompt. So we go back over to the chat UI, create a new chat, and paste in the prompt. I want to put in the full text, not just a chunk; I know this isn't exactly apples to apples, but it's the quickest way to show a demonstration. So I copy all of the short text, paste it in, and ask GPT-4 Turbo for a response. You can see it's fairly slow, probably just a few tokens per second coming out of this model, but it will let us compare performance nicely. That's GPT-4, so I copy its answer into the GPT-4 JSON file. Next I do the same thing for GPT-3.5: go back to the chat, paste in the exact same prompt, and here we have the response from GPT-3.5, which is a little less neat, and I paste it into the 3.5 file.

Next we do the same for YAML. To get the YAML prompt I can just rerun the script so I have it lower down in the terminal, then copy the exact prompt, making sure I get everything up to "now answer immediately." Again, create a new chat, do 3.5 first, and because we're doing everything at once rather than in chunks, copy in the entire text and paste it in. GPT-3.5 should be quite a bit faster, but it's not giving exactly the format I'd like; you can see it's actually not returning correct YAML. So GPT-3.5 Turbo is already less robust here, but there's a little tweak I know from playing around: you can just ask it to answer in a code block. Let's see if that helps. It does, so I copy that over into the YAML 3.5 file and save. Then I take this whole YAML prompt, create a new chat for GPT-4 at zero temperature, and ask it (I've asked for a code block here too; GPT-4 doesn't necessarily need that, but it does no harm). Here we have the names and organizations, and I paste them into the GPT-4 YAML file.

So now we have the results in JSON and YAML for GPT-4 and GPT-3.5, and we also have the results of OpenChat 3.5 in JSON and YAML. There's a nice comparison script in the README: simply by running python compare.py json, we get a comparison of performance across all of these models. Let me expand my screen a little so we can get a better view.
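Before going through the results, here's a rough sketch of the consolidation and de-duplication step mentioned above, where per-chunk results are merged into one list. The function name is my own.

```python
def consolidate(chunks: list[dict]) -> dict:
    """Merge per-chunk extraction results and drop exact duplicates.

    Note this only removes literal duplicates; 'Munger' and 'Charlie Munger'
    would still appear as separate entries.
    """
    combined = {"names": set(), "organizations": set()}
    for chunk in chunks:
        combined["names"].update(chunk.get("names", []))
        combined["organizations"].update(chunk.get("organizations", []))
    # sort for a stable, alphabetical output like the one shown in the video
    return {key: sorted(values) for key, values in combined.items()}
```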
Here's the comparison of JSON performance, first on the names. We have our baseline model, OpenChat, then GPT-4, then GPT-3.5, and basically they're all pretty much on par. It looks like GPT-3.5 is missing one or two answers; Ron Olson, for example, is missing from GPT-3.5. The output of OpenChat has made a little mistake by including GEICO: GEICO is not a person, it's a company, which I guess could be hard to know, but it's still a mistake. In a sense there's another mistake here, because OpenChat doesn't recognize that two different renderings of Ajit Jain's name refer to the same person, which GPT-4 is aware of. You can also see that "Munger" is included in OpenChat's list alongside "Charlie Munger". You could call that an error, but within the transcript he is often referred to as just Munger; with extra knowledge, a model would be able to recognize that this refers to Charlie Munger. Overall I'd say performance on the names is relatively similar, with perhaps a slightly higher error rate in the output of OpenChat, and maybe a tiny benefit over GPT-3.5 in terms of getting all of the entries.

Next up, organizations, and here in JSON format we're definitely seeing some difference. This is more challenging, of course; the second list is harder because both are being extracted at once, and if you want better performance, asking for only one data type at a time is probably a better idea. Still, there's a fairly close match between OpenChat and GPT-4. GPT-4 includes "City", which is not a company; it could be referring to Citi, since Jane Fraser is mentioned, so that's somewhat incorrect and should really be "Citi". Meanwhile, GEICO is not recognized by OpenChat, and Occidental Petroleum is also missed by OpenChat in JSON format. With GPT-3.5 there are a few missing too: Occidental Petroleum is just written down as "Oxy", which is the ticker, and it looks like it's also missing BNSF and CNBC. So GPT-4 is definitely performing better in this case.

Now let's look at the YAML results. Before I run that, note that GPT-4 currently finds 15 names and 12 organizations in this initial text. Comparing the YAML results, GPT-4 predicts the same number of names in YAML, at least in total, although the individual answers are a little different (we'll get to that), and for organizations we have 12, though again the entries differ a little between YAML and JSON. So even the most powerful model is not exactly consistent, and the result depends on whether you ask in JSON or in YAML. The performance of OpenChat in YAML is quite good: it's largely getting all of these. You can even see it outputs "Mike" in one case, which probably refers to Mike Santoli, although with so many Mikes in the world I'm not sure you could really criticize that. The spelling differs in places: both models think Debbie has a different surname, and instead of "Warren Buffett", OpenChat just has "Buffett". But really, quite good performance from OpenChat here on YAML.
Meanwhile, GPT-3.5 Turbo is definitely missing a few by comparison, and interestingly it sometimes doesn't capitalize the surnames, which I don't fully understand. So with YAML, OpenChat is actually outperforming the GPT-3.5 model and doing really well compared to GPT-4. Even looking at the organizations, there are a few questionable items, like whether the Omaha convention center is really an organization; maybe it is, but probably not, since it's a convention center owned by some company. Other than that, GPT-4 includes "City" again, and it includes CNBC twice because it also lists cnbc.com, which is really part of CNBC and probably not a distinct entity. Occidental Petroleum is also missed here; I will note that with some small tweaks you can sometimes get Occidental Petroleum to appear, while GPT-3.5 actually renders it as "Accidental Petroleum". Broadly speaking, though, OpenChat is not far off GPT-4 here, and in YAML format we're getting a little better performance.

Next I want to push the model by looking at a really long context. For this we'll look at the entire Berkshire Hathaway meeting transcript, which I think is about 60,000 tokens long, and we'll run it in YAML on the OpenChat model. Let's go back and find a command we can run: scrolling up, I'll grab the YAML command but adjust it so that we're looking at the full file, and paste that in. What's happening now is that within the TGI data extraction script there is a concurrency mechanism: multiple threads are executed in parallel, so we're sending all of these messages into the GPU, and Text Generation Inference, the API, is batching them together so we're operating in parallel across the GPU with all of these requests. That gives really fast inference: as you can see, we've already processed about 45 requests, probably in about 30 seconds, and we have an error rate of about 6 to 7%. In one case there was a repetition problem, so we got a lot of "Coca-Cola" from one invalid piece of YAML, but more than 90% of the chunks were fine. If you do start pinging with a very large number of requests, it's difficult to avoid all errors, but the results still compile into an output file.
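That concurrency is conceptually straightforward: fire one request per chunk from a thread pool and let TGI batch them on the GPU. A minimal sketch, reusing the hypothetical chunk_text, build_prompt, and send_chunk helpers sketched earlier, might look like this:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_from_text(text: str, max_workers: int = 16) -> list[str]:
    """Send every chunk to the TGI endpoint concurrently and collect raw responses."""
    chunks = chunk_text(text, chunk_length=8000)
    prompts = [build_prompt(chunk) for chunk in chunks]
    # TGI continuously batches these requests on the GPU, so throughput scales
    # well even though each individual request is no faster on its own.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send_chunk, prompts))
```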
So we have output.yaml here, and we've extracted much of the data: here are all of the names, and there are quite a few, and here are all of the organizations, listed alphabetically because that's how the extractor combines them. This is a massive target, a massive project, so let's put that whole context into GPT-4 Turbo: an expensive endeavour, but let's try it. I'll go back to the earlier YAML example and just swap out the text, so instead of the short piece of text I put in the full transcript and let the long context of GPT-4 Turbo handle it (we'll talk about cost towards the end of this video). You can already see it's definitely slow with this length of context, but let's look at how the answers come out. What I can tell you is that even with a very powerful model like GPT-4 Turbo, it's going to leave out certain answers, although it does a good job of getting quite a lot, and there are very many organizations in this example, so it's a very long answer.

The key point I want to show is that GPT-4 is not able to get all of the entries. For example, if I pick a name that OpenChat found and check it against the input text, you can see that Wally Weitz does indeed appear in the input (I think in a listing of the directors), but neither "Wally Weitz" nor "Wally" appears in the list that GPT-4 outputs. The same is true for other people, like Meryl Witmer: that name appears in the output of OpenChat but does not appear in the output of GPT-4. So just using a really long context, even with a very powerful model, is not going to get you strong accuracy, although if you chunk the text and send chunks to GPT-4, you'll no doubt get better performance, much as we're doing here with parallel chunking and smaller chunks. And as you can see, when you use smaller chunks you do get more repetition: I've de-duplicated and removed exact repeats, but I'll still get repetition in the sense that "Howard" and "Howard Buffett" appear as different people, so you may want to run the list through an LLM again to reduce it further if you see names that are very similar and want to filter them out.

Now, I said I'd take a quick look at the Notux model, so let's do that. I'll close some of my files and adjust my .env file: I go back over to RunPod, grab the pod ID from the Notux pod, which is up and running, and paste in that API endpoint. I also need the name of the repo so the chat template can be picked up, so I set the model to TheBloke's AWQ quant (always a hat tip to TheBloke), and we're ready to do a quick run. I won't run the really long text; I think I'll try the 60k text using Notux, first in YAML format, and see what it comes up with. You can see the prompt format for Notux is like Mixtral, which is like Llama: the [INST] instruction format.
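For reference, the two chat formats look roughly like this. The templates below are my recollection of the published OpenChat 3.5 and Mixtral-Instruct formats; in the repo they're applied automatically based on the MODEL setting, so double-check against the model cards.

```python
user_message = "Extract names and organizations from the provided text..."  # the full extraction prompt

# OpenChat 3.5 chat format (from the model card, to the best of my recollection)
openchat_prompt = f"GPT4 Correct User: {user_message}<|end_of_turn|>GPT4 Correct Assistant:"

# Mixtral / Llama-style instruct format, which Notux follows
mixtral_prompt = f"<s>[INST] {user_message} [/INST]"
```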
We'll see how it performs. It's much slower because it's a larger model: 8x7B, roughly 56 billion parameters instead of 7, so about eight times larger than the 7B. When that's done, we'll also run a quick example using JSON. We're just getting back the start of the batch of responses from Notux now, and I can see already from the errors that the error rate is about a half: the YAML syntax was broken in about half of these eight cases. If you scroll up you can see some of the issues: in one of these objects, "names" wasn't provided at all, and instead of responding with "names" and a list of names, it responded with "properties" and "items", so it's getting confused by the schema I provided. You might argue that leaving the schema out could have helped in this case. Without too much further ado, let's run a quick example with JSON. I'm running it on the shorter piece of text, so there will just be two requests sent in parallel, and again you could consider increasing the repetition penalty a little for JSON to reduce repetition in the responses. Here we do get a reasonable set of answers, so that's not too bad at all, although the list looks a little shorter than what we got with OpenChat, so I'm not sure it's including everything; likewise here we've got eight entries, whereas there are actually about twelve. Still, we're getting some sensible responses from Notux. Sorry, these are not the consolidated results; what I should do is compare the consolidated answers, so let me run the comparison: python compare.py json should do it, and I'll make a little more space.

Here we go: this is again comparing against the results of GPT-4 and GPT-3.5. It has a different spelling for Ajit Jain, but when we do consolidate (I forgot there were two parts that needed consolidating), the performance of the Notux model is not bad. Just looking at it visually, there's that extra repetition of Jain, and "Buffett" is added as an entry separate from "Warren Buffett", so that person appears twice, and there's also a "Mike" on its own; basically it's matching GPT-4 with the addition of some single names. When we look at organizations, it's doing quite well and matching most of them; it's not getting Occidental Petroleum, it's adding in Omaha, Nebraska, which is a city rather than an organization, and it's also got the convention center, like OpenChat did previously. However, it's performing quite a bit better than GPT-3.5, so in JSON format the Notux model performs reasonably well, although the performance was not as good in YAML. Let me just double-check that I have no repetition penalty set; so it shows you don't necessarily need a repetition penalty for JSON to run well. But because Notux is quite a bit bigger as a model, the token generation rate is going to be quite a bit lower; let's scroll up and see how fast those tokens were generated.
It's about six tokens per second instead of 22, so it will probably take at least four times as long if you run with this larger model, and because of that there's definitely a cost advantage to running with the OpenChat model, which brings us to cost.

To pull this all together, I want to talk about the costs of the different approaches. Starting with GPT-4: the price is currently about 1 cent per thousand input tokens. You do pay additionally for output tokens, but you'll typically have far fewer output tokens, so most of the cost is associated with the input. GPT-3.5, meanwhile, is about a tenth of that cost. So, as an example, if you're trying to extract from 100K tokens of text, that's going to cost about a dollar with GPT-4 and about 10 cents with GPT-3.5. Meanwhile, if you run an A6000 as I did on RunPod, that's 60 cents an hour; on Vast.ai you can sometimes run one for 30 cents, or you could run an A4000 for maybe as low as 10 or 15 cents on Vast.ai (I will note that the user experience is better on RunPod, which is why I end up using it more often). Either way, you're looking at something under a dollar per hour to run the server, and as you saw in my demo, processing about 50,000 tokens takes about a minute, maybe less, about 30 seconds in what I showed. That makes it somewhere around half a cent to one cent per 50,000 tokens, so for 100K tokens (which I've got written incorrectly on the slide) it's going to be just about 2 cents.

Bringing that together as a comparison: extracting from a 100K-token context costs about a dollar with GPT-4, about 10 cents with GPT-3.5, and about 2 cents with an A6000, and you could probably get that down to half a cent with a cheaper GPU; it was also actually a bit quicker than my estimate here. This is kind of unique, because it's quite rare to be able to beat the cost of just using the OpenAI API. The reason is that OpenAI has the massive advantage of receiving so many customer requests that they can send very large batches to their GPUs, and GPUs are very effective at processing batches: they're constrained by the rate at which the model weights are read into the GPU, and once the weights are read in, it's almost free to do the computation many times over. You don't get bottlenecked by computation until you're at larger batch sizes, like 16 or 32 or much larger. So large batch sizes are very favourable, and most companies and individuals won't have them because they're only making single requests. However, if you hit your own API concurrently, as I showed in this tutorial, with many, many chunks at once, you're able to fill up your GPU and get very good utilization, and it's only because of that that you can get down to a cost competitive with the more expensive services. Of course, the quality, as we showed, is probably not quite as good as GPT-4: the best performance for data extraction will come from using short contexts, maybe about 2,000 tokens, on GPT-4 and sending the chunks in parallel, but you can get quite close using the 7B OpenChat 3.5 model.
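To make that back-of-the-envelope comparison concrete, here is the arithmetic as a small script; the prices are the rough early-2024 figures quoted above and will drift over time.

```python
# Rough cost comparison for extracting from 100K tokens of input text,
# using the approximate prices quoted in the video (early 2024).
TOKENS = 100_000

gpt4_cost = TOKENS / 1_000 * 0.01   # ~$0.01 per 1K input tokens -> ~$1.00
gpt35_cost = gpt4_cost / 10         # roughly a tenth of GPT-4    -> ~$0.10

a6000_per_hour = 0.60               # RunPod A6000, USD per hour
minutes_per_100k = 2                # ~1 minute per 50K tokens in the demo
self_hosted_cost = a6000_per_hour * minutes_per_100k / 60  # -> ~$0.02

print(f"GPT-4:       ${gpt4_cost:.2f}")
print(f"GPT-3.5:     ${gpt35_cost:.2f}")
print(f"A6000 (TGI): ${self_hosted_cost:.2f}")
```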
And that's it for data extraction in JSON and YAML format. If you have any questions about how this works, just leave me some comments down below. If you want access to just these scripts, you can find the link under Advanced Inference on Trelis.com, where you can also buy lifetime access to the GitHub repo for Advanced Inference. All right folks, all the best.
Info
Channel: Trelis Research
Views: 9,893
Keywords: json extraction, yaml extraction, data extraction llm, llm data extraction, extracting data with llms, data extraction with language models, json extraction llm, extract names and addresses, extraction information llm
Id: zmf1Kujygt8
Length: 55min 43sec (3343 seconds)
Published: Thu Jan 25 2024