Use LLMs To Extract Data From Text (Expert Mode)

Captions
When it comes to extracting a piece of text, your best friends may be your eyesight and copy-and-paste. However, Microsoft isn't spending billions of dollars so you can sling your mouse around with Ctrl+C and Ctrl+V across a few PDFs. Language models to the rescue. If you're watching this video, I'm guessing your text-extraction skills are rare to medium rare; after this video you're going to be immediately bumped up a few notches. I'm also going to share a business idea that I launched on Twitter and that has over 80 signups.

Today we're going to be learning about the Kor library, which was created by Eugene Yurtsev and is built on LangChain. Kor has a super easy interface: you pass in a piece of text along with a structured configuration, and you get structured data out the other end. For this exercise, I want to show you how to extract the company and tool information you find within another company's job descriptions. For example, here we see that this Okta job description lists Spring Boot, AWS, GCP, Elasticsearch, and Docker, and that's exactly what we're going to extract.

All right, the first thing we're going to do is import our packages. The Kor-specific pieces are extraction and nodes. Then you pass in your OpenAI API key. Normally you'd do this as an environment variable, and please, if you're doing this in production, check out security best practices, because this isn't it. Then we set up our language model, a chat model. Yes, GPT-3.5 would be a whole lot better from a cost standpoint, but GPT-4 is better from a reasoning standpoint and gives a little better output, so that's what I'm using today.

For our Kor hello-world example, I want to go over what an Object in Kor is. It's the high-level object that holds the configuration of the text we want to extract; that's a long way of saying it's basically a configuration object where you specify what you want to get out. I'm going to give this object an ID of "person". This could be whatever you want, but it's really helpful for remembering what the thing is. The description is information about your object, and it helps the language model understand what types of information it's going to be collecting. The cool part is the attributes: within attributes you put the different fields, the different pieces of data, that you want the language model to extract from your text. I'm going to have a Text node called "first_name" and give it a description, which also helps the language model understand exactly what you want to pull out: I want the first name of the person. Then, what's cool is you can pass in examples. This is where you give the language model sample sentences or pieces of text along with the output you want it to extract. I'm saying "Alice and Bob are friends", and for the output of "first_name", the node we just defined, I want it to give me Alice and give me Bob. I'll go ahead and run that, and then we create the extraction chain, which also comes from Kor; we pass in our language model and the person schema. You'll notice that I didn't pass in any text, so it's not doing anything quite yet. The sketch below pieces this all together.
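Pieced together from the walkthrough above, a minimal sketch of the setup and the person schema might look like this. The import paths and node API match the Kor version current at the time of the video and may have shifted since; variable names are illustrative:

```python
import os

from langchain.chat_models import ChatOpenAI
from kor.extraction import create_extraction_chain
from kor.nodes import Object, Text

# In production, load the key from an environment variable or a
# secrets manager rather than hard-coding it.
llm = ChatOpenAI(
    model_name="gpt-4",  # GPT-3.5 is cheaper; GPT-4 reasons a bit better
    temperature=0,
    openai_api_key=os.environ["OPENAI_API_KEY"],
)

person_schema = Object(
    id="person",  # a memorable ID for the object
    description="Personal information about a person",
    attributes=[
        Text(
            id="first_name",
            description="The first name of the person",
        ),
    ],
    examples=[
        # (input text, expected extraction) pairs teach the model the format
        (
            "Alice and Bob are friends",
            [{"first_name": "Alice"}, {"first_name": "Bob"}],
        ),
    ],
)

# The chain bundles the LLM with the schema; no text has been passed yet.
chain = create_extraction_chain(llm, person_schema)
```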
Next I define my text: "My name is Bobby, my sister's name is Rachel, my brother's name is Joe, and my dog's name is Spot." I take this text and pass it through the chain with predict_and_parse, then print it with a little pretty-print helper I made up at the top. Here we have "person", with a first_name each for Bobby, Rachel, and Joe. What's cool to see is that it did not include Spot, because Spot isn't a person, it's a dog; the language model is smart enough to leave that one out. The next thing we'll look at is "The dog went to the park." Print that output and you'll notice it gives us an empty list back, because there were no people mentioned, so it can handle empty data as well, which is pretty cool.

Now say you have multiple fields you want to pull out instead of just the one first name. In this case I'm going to build a plant schema to pull out different plants, and for the attributes I'll have a plant type, which is the common name of the plant, the color, and the rating of the plant. So I have three different nodes here, three different fields, and notice that the rating is a Number. For the examples, I'm going to say "Roses are red, lilies are white, and an 8 out of 10." That's even a little confusing for me to go back and look at, but with the output I'm telling it: for plant type give me roses and for color give me red, because those two are connected to each other, and it's going to know that; then plant type lily, color white, and a rating of eight, because it's an 8 out of 10. Sketches of both steps follow below.
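A sketch of running that chain, assuming the predict_and_parse interface from the same Kor version (the exact output shape may differ slightly between releases):

```python
text = (
    "My name is Bobby, my sister's name is Rachel, "
    "my brother's name is Joe, and my dog's name is Spot."
)
output = chain.predict_and_parse(text=text)["data"]
print(output)
# Roughly: {'person': [{'first_name': 'Bobby'},
#                      {'first_name': 'Rachel'},
#                      {'first_name': 'Joe'}]}
# Spot is left out: the model knows a dog is not a person.

print(chain.predict_and_parse(text="The dog went to the park")["data"])
# Roughly: {'person': []} -- no people mentioned, so the list is empty
```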
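And a sketch of the multi-attribute plant schema, with a Number node for the rating; the example tuples mirror the ones described above:

```python
from kor.nodes import Number

plant_schema = Object(
    id="plant",
    description="Information about plants",
    attributes=[
        Text(id="plant_type", description="The common name of the plant"),
        Text(id="color", description="The color of the plant"),
        Number(id="rating", description="The rating of the plant"),
    ],
    examples=[
        (
            "Roses are red, lilies are white, and an 8 out of 10.",
            [
                {"plant_type": "Roses", "color": "red"},
                {"plant_type": "Lily", "color": "white", "rating": 8},
            ],
        ),
    ],
)

chain = create_extraction_chain(llm, plant_schema)
output = chain.predict_and_parse(
    text="Palm trees are brown with a 6 rating. Sequoia trees are green."
)["data"]
# Expect the palm tree to carry the rating and the sequoia to have none.
```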
Let me go ahead and run that. For the text, the new example is "Palm trees are brown with a 6 rating. Sequoia trees are green." We put our plant schema into our chain and run it. In the output, the first plant is the palm tree: it's brown and has a 6 rating, which is exactly right. Then sequoia trees are green, with no rating, because I didn't specify one. That's sweet.

Now say you had a list of data. In this case I'm actually going to embed one object inside another. Let me start down here with a car schema. I want information about the car: "The BMW is red and has an engine and a steering wheel." What I want it to do is give me the type of the car, which is BMW, its color, red, and its parts, engine and steering wheel. My nodes are a Text node for the type of the car and a Text node for the color, but then I pass in parts, and parts is actually another object that we have up at the top: it holds the name of the part, so "The jeep has wheels and windows" gives wheel and window. Let me run that and see what we got. Cool, so we have the car, whose type is Jeep, exactly right, its color is blue, and then we have three parts: rear view mirror, roof, and windshield.

Do you want to see the actual prompt that was sent to your language model? You can give it your text, call to_string, and print out the prompt. This is what Kor created for us on the fly and passed over to the language model. What's really interesting about this space is that the instructions we gave even just a couple of months ago were pretty straightforward, human-readable instructions. These are still human readable, but they're getting pretty nuanced and pretty technical, and that's the level of capability these language models can handle. That's why it's really nice to not have to do any prompt engineering yourself: you offload that to Kor, because Eugene and team are going to be experts at exactly which prompt is needed to extract the information. I like to leave it to the experts. A sketch of the nested schema and the prompt inspection follows below.
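A sketch of the nested car schema and the prompt inspection, assuming Kor's support for embedding one Object inside another; the jeep input text is reconstructed from the output described above, and the exact shape of nested example outputs may vary by Kor version:

```python
parts = Object(
    id="parts",
    description="A part of a car",
    attributes=[
        Text(id="part", description="The name of the part"),
    ],
    examples=[
        (
            "The jeep has wheels and windows",
            [{"part": "wheel"}, {"part": "window"}],
        ),
    ],
)

car_schema = Object(
    id="car",
    description="Information about a car",
    attributes=[
        Text(id="type", description="The make or brand of the car"),
        Text(id="color", description="The color of the car"),
        parts,  # one Object embedded inside another
    ],
    examples=[
        (
            "The BMW is red and has an engine and a steering wheel",
            [
                {
                    "type": "BMW",
                    "color": "red",
                    "parts": [{"part": "engine"}, {"part": "steering wheel"}],
                }
            ],
        ),
    ],
)

chain = create_extraction_chain(llm, car_schema)

# To inspect the exact prompt Kor builds and sends to the model:
text = "The blue jeep has a rear view mirror, roof, and windshield"
print(chain.prompt.format_prompt(text=text).to_string())
```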
One of the other cool applications of this is the ability to structure user intent. Say you're creating an app and collecting natural-language responses from a user, and that user is telling you what they want to do. You might be able to hand that output to an agent and it might figure it out, but say you didn't want to trust an agent, or you wanted something with a little more reliability. In that case you can parse the user's response into structured output for the types of actions your app should take. I'm going to pretend I'm building a forecasting app, so my description is: users are controlling an app that makes financial forecasts, and they will give a command to update a forecast in the future. One of the text fields I want is the year pulled out of the user's command: the user says "Please increase 2014's customers by 15", and I want 2014. I also want a metric, the unit or metric the user would like to influence: from "Please increase 2014's customers by 15" you get customers. And I want the amount: from the same command you get 15 (a sketch follows below). Let me run this, and then I say "Please add 15 more units sold in 2023." What we get back is our forecaster: the amount is 15, 15 of what, units sold, and in what year, 2023.

Now for the real-world example of how this works. We're going to look at a list of job descriptions and pull out the technologies that come out the other end. I actually did this for a side project and launched it on Twitter. What we have here is that list of companies; I threw it into Airtable and put a quick front end on top, but basically you're able to extract all the different tools from different companies, and what's cool is that if you do this across a bunch of jobs, you get a pretty good idea of the tech stack a company is working with. In order to do that, we create our chat model again using GPT-4, and then I create a quick function that pulls jobs from Greenhouse. Half the magic of this is getting the data in the first place, and luckily Greenhouse has a public API, which is cool. I'm going to try this out for Okta; all you need to do is pass the board token to this function and you get a list of jobs back. Let that load for a second. Cool, the status was 200, which is good, and it found 142 jobs. I'm going to look at job number one to start, and, actually, that's not the job ID, that's the job index. So let's look at the first job: we get a response back with a bunch of cool information, the absolute URL, some data-compliance stuff, the internal job ID, and the location, Melbourne, which is sweet. Next I'm going to look at a specific job, because I checked this one beforehand, and its job ID is this one. I'll print it out: with Okta we're looking at a staff software engineer, last updated April 11th, and here's the link for it. Go take a look: there's our staff software engineer, and here's the content of the job description itself. As you can see, it has a bunch of HTML in there that I want to get rid of, so I'm going to use Beautiful Soup to parse that content and get the text, which returns cleaned-up HTML, and then I convert that to markdown, because this reduces the number of tokens we have, which makes it cheaper to run our model. Let me run that. Here we go: "Get to know Okta", and you can see how that's bolded at the top and bolded right here in the markdown. Our clean markdown text is stored in the text variable. A sketch of the Greenhouse pull and the cleanup also follows below.
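A sketch of the forecaster schema, using Kor's attribute-level examples; this is a reasonable reading of the walkthrough above rather than a verbatim copy of the original notebook:

```python
forecaster_schema = Object(
    id="forecaster",
    description=(
        "User is controlling an app that makes financial forecasts. "
        "They will give a command to update a forecast in the future."
    ),
    attributes=[
        # Attribute-level examples: (input command, value to extract)
        Text(
            id="year",
            description="Year the user wants to update",
            examples=[("Please increase 2014's customers by 15", "2014")],
        ),
        Text(
            id="metric",
            description="The unit or metric a user would like to influence",
            examples=[("Please increase 2014's customers by 15", "customers")],
        ),
        Text(
            id="amount",
            description="The quantity of the forecast adjustment",
            examples=[("Please increase 2014's customers by 15", "15")],
        ),
    ],
)

chain = create_extraction_chain(llm, forecaster_schema)
output = chain.predict_and_parse(text="Please add 15 more units sold in 2023")["data"]
# Expect roughly: {'forecaster': [{'year': '2023',
#                                  'metric': 'units sold',
#                                  'amount': '15'}]}
```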
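And a sketch of the Greenhouse pull plus the HTML-to-markdown cleanup. The jobs endpoint and its content=true parameter are part of Greenhouse's public job-board API; the use of markdownify here is an assumption about which markdown converter the video used:

```python
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md  # pip install markdownify

def pull_from_greenhouse(board_token):
    """Pull all public job posts for a company from Greenhouse's job board API."""
    url = f"https://boards-api.greenhouse.io/v1/boards/{board_token}/jobs?content=true"
    response = requests.get(url)
    response.raise_for_status()  # status 200 means we're good
    jobs = response.json()["jobs"]
    print(f"Found {len(jobs)} jobs")
    return jobs

jobs = pull_from_greenhouse("okta")

# Greenhouse returns the description as escaped HTML. get_text() unescapes
# the entities (leaving plain HTML), and markdownify converts that HTML to
# markdown, which cuts the token count before we hit the model.
job_description = jobs[0]["content"]
soup = BeautifulSoup(job_description, "html.parser")
clean_text = md(soup.get_text())
```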
Next we create our Kor object. I give it an ID of "tools", and for the description I say: a tool, application, or other company that is listed in the job description. I also include some negative examples, because I saw it giving me "analytics" and "e-commerce" beforehand. It has only one attribute, the tool, which is the name of the tool or company. For the examples, I wanted to load it up so it has more information about what I'm looking for: "Experience working in NetSuite or Looker a plus", cool, I want you to extract NetSuite and extract Looker, nice. "Experience working with Microsoft Excel", there's Microsoft Excel. "You must know AWS to do the job", I want AWS. "Troubleshooting customer issues and debugging from logs like Splunk", and there's Splunk pulled out right there. Let me run that, give it my tools schema, create my chain, and run it, printing the cleaned-up output. Let's see what it comes back with. Nice, here we have the list of tools it found within the job description itself. In this case "Okta" is a little bit of a whoopsie, because it's the company itself, but I'll cut it a break. An interesting one here was Spring Boot, so let me see if that's actually in here, and yes, Spring Boot is in there, and if we look up what the heck Spring Boot is, well, it's a tool, basically. Awesome.

The other thing I saw that's kind of interesting is salary information. I believe there's a new law in California that says you have to list salary information on your job description. We can create an object that pulls out this salary data for us, with two attributes, both numbers: one is the low end of the salary and one is the high end. So, "This position will make between 140 000 and 230.000", and you can see I intentionally used a really sloppy format there; for the output I want the low end, 140,000, and the high end, 230,000. I'll run that, and then say: jobs, go pull from Cruise, because I know they have a good one for me, and specifically there's one job ID that I want. Cool, it found 219 jobs. I pull out that job ID, grab it, and here's the content, with the cleaned-up content down below. Then I pass it the salary-range schema, and again I say, hey, give me the output, and print it. Let's see what it gives us: the high end of the salary is 165,000 and the low end is 112,000, and here's the quote from the job description it pulled it from, exactly right. If we want to triple-check this, let's go look: one one two, and there we go, the position is 112 to 165. Sketches of both schemas follow below.
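A sketch of the tools schema, with the negative examples folded into the description as described above; the exact wording is paraphrased from the video, and clean_text is the markdown job description produced earlier:

```python
tools_schema = Object(
    id="tools",
    description=(
        "A tool, application, or other company that is listed in a job "
        "description. Analytics and e-commerce are not tools."  # negative examples
    ),
    attributes=[
        Text(id="tool", description="The name of a tool or company"),
    ],
    examples=[
        (
            "Experience in working with NetSuite, or Looker a plus.",
            [{"tool": "NetSuite"}, {"tool": "Looker"}],
        ),
        (
            "Experience working with Microsoft Excel",
            [{"tool": "Microsoft Excel"}],
        ),
        (
            "You must know AWS to do the job",
            [{"tool": "AWS"}],
        ),
        (
            "Troubleshooting customer issues and debugging from logs like Splunk",
            [{"tool": "Splunk"}],
        ),
    ],
)

chain = create_extraction_chain(llm, tools_schema)
output = chain.predict_and_parse(text=clean_text)["data"]
print(output)  # the list of tools found in the Okta job description
```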
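And a sketch of the salary-range schema; the deliberately sloppy example string is paraphrased, and the specific Cruise job ID is elided since the video doesn't spell it out:

```python
salary_schema = Object(
    id="salary_range",
    description="The salary range for a job in US dollars",
    attributes=[
        Number(id="low_end", description="The low end of the salary range"),
        Number(id="high_end", description="The high end of the salary range"),
    ],
    examples=[
        (
            "This position will make between 140 000 and 230.000",
            [{"low_end": 140000, "high_end": 230000}],
        ),
    ],
)

jobs = pull_from_greenhouse("cruise")  # 219 jobs at the time of the video
# ...pick out the specific job ID and clean its content as above...

chain = create_extraction_chain(llm, salary_schema)
output = chain.predict_and_parse(text=clean_text)["data"]
# Expect the low end around 112,000 and the high end 165,000 for this posting.
```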
Nice. Finally, I imagine you're going to have a lot of information you'll want to parse, so cost is going to be pretty important when you do. One way to find the cost of your parsing is to use LangChain's get_openai_callback: you pass it your query and it tells you the stats on how much that query cost (see the sketch at the end of this transcript). For the query I just did on the Cruise job, that's going to cost me about five cents, which may not sound like a ton, but if you're doing this for thousands or hundreds of thousands of records, you're going to get yourself in trouble pretty quickly if you don't have a big wallet. Keep that in mind before you run wild with this.

If you wanted to take this project and run with it, there are a few to-dos I'd suggest. I found a lot of success reducing the amount of HTML I actually passed to the language model, so if there's any opportunity to cut out low-signal text that you know won't contain your data, do that. You'd go grab a list of about a thousand companies, however many you wanted, and you'd want to run through most jobs, but if a company has 5,000 jobs, you don't need to run all 5,000 to get the information you want; you likely only have to sample from each department, because you're going to start seeing repetitive data pretty soon. You'd want to store the results, and if you snapshot this daily, that would be pretty sweet. Oh yeah, a really good follow-up here is to go follow Greg on Twitter for more tools and if you want to chat about this project. And finally, if you really want to do this, please talk to me, because with those 80 signups I mentioned, I emailed every single one of them and asked, hey, what's your use case for a tool like this, and I got some interesting feedback. I wanted to share one that came from an investor. This is an investor persona, and they said: hey Greg, thanks for reaching out, I always thought job posts were gold mines of information and often suggest identifying targets based on them. That's pretty cool, and I'll let you read the rest on your own time, but basically this investor said they want to look at a company's job descriptions and extract information they'd find valuable for investment. And that is using Kor for structured output. Please let me know if you have any questions or leave a comment. Thank you.
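As referenced above, a minimal sketch of cost tracking with LangChain's get_openai_callback; this import path matches LangChain releases from that era and has since moved in newer versions:

```python
from langchain.callbacks import get_openai_callback

# Any chain calls made inside the context manager are tallied by the callback.
with get_openai_callback() as cb:
    result = chain.predict_and_parse(text=clean_text)

print(f"Total tokens: {cb.total_tokens}")
print(f"Total cost: ${cb.total_cost:.4f}")  # ~$0.05 for the Cruise query above
```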
Info
Channel: Data Independent
Views: 25,155
Id: xZzvwR9jdPA
Length: 15min 28sec (928 seconds)
Published: Thu Apr 13 2023