Solving Real-World Data Science Problems with LLMs! (Historical Document Analysis)

Captions
Hey, what's up everyone, and welcome back to another video. Today we'll be doing another Solving Real-World Data Science Problems project walkthrough. If you're unfamiliar with this format: we walk through a Python project, and at various points in the video I pause and present a task for you to try on your own. When you solve the task, or if you can't figure it out and want to see how I would approach it, you resume the video and the project continues. So we have some real-world tasks for you to try as we walk through this full project.

Today we'll be doing historical document analysis using large language models and natural language processing techniques. The context is that we have a ton of historical documents from just after the American Civil War. These documents are specifically from the Bureau of Refugees, Freedmen, and Abandoned Lands, often just called the Freedmen's Bureau for short. The Freedmen's Bureau was set up at the end of the Civil War to help aid formerly enslaved individuals during the Reconstruction era — it helped provide schooling, housing, jobs, etc. We have documents from the various records of the bureau, and our goal is to see if we can analyze them and find any sort of meaningful insights.

One thing that's super exciting about this project is that when I say real world, I really mean real world: this is an actual freelance project that I worked on and got paid for through Upwork, so it's super cool to be able to present some of it in video form to all of you. As part of this project there's a corresponding Kaggle dataset, which I'll link in the description. I encourage you to check it out and contribute notebooks with any of your research findings to help move this research forward. It has all of the information you'll probably need about the dataset and its contents, very well documented — quick shout-out to Trent Self, who helped put together and document this Kaggle dataset. One thing I specifically want to call out is the website corresponding to the research and the ongoing insights that have come out of it: we can go to freedmensbureau.info, also linked in the description.
This site has some more background and also some of the insights that have come so far from analyzing these historical documents. One thing that might be helpful as you start diving into this data is that there are specific categories, with insights drawn from specific sets of documents; we'll be working specifically on the indentures of apprenticeship documents in this video.

To get started, you can either click New Notebook on Kaggle or clone the repo on GitHub — everything is linked in the description of this video. I'll start on Kaggle, and I'll probably bounce back and forth a bit, because certain things we can really only do locally and certain things are better to do in Kaggle. I'll create a new notebook and call it something like "research findings". The first thing you'll want to do is run the starter cell, which loads in all the data. What the heck is Corgi mode? Oh wow, there's a corgi running across the screen — that's cool, didn't know about that. Can I toggle Kitty mode and Corgi mode? Fascinating. Okay, to not be too distracting I'm going to untoggle Corgi and Kitty mode.

So we have all these documents, and the first thing you'll want to be able to do is just make sure you can read them. We'll do a quick df = pd.read_csv() with the /kaggle/input path — we can just copy one of these links — and we'll be looking specifically at the contract records in this tutorial. Then df.head(): cool, we can see what we have there. So we've loaded in our first CSV file.

But what we actually want to do first, for task number zero — I'm calling it zero because it comes before we really get into the video — is configure a large language model to use with Python. We're going to do this in two different ways, because I want to make this as accessible as possible to all of you: one is using the OpenAI API, and the other is using Ollama to run Llama 2 locally. That was a bit of a mouthful, but one uses OpenAI and one uses an open-source model, so use whatever makes sense for you. You'll probably get better performance with the OpenAI models, but you won't be charged anything with Llama 2. The specific task — how you'll know you've completed this — is to generate a story about a data scientist finding all sorts of cool things in historical documents; that's what you'll prompt the LLM to do. Feel free to pause the video, try to configure this on your own, and resume when you want to see how I would go about doing it.

All right, so how would we configure an LLM? As I said, we'll start with OpenAI. If you don't have an account already, go to platform.openai.com; you might have to sign in or create an account. Once you're logged in,
from the docs overview page you'll want to click on the API Keys option and create a new secret key. I'll call this one "YouTube tutorial". Copy it, then go back to Kaggle (a local notebook works too). One thing I want to do is add it as a secret, because I don't want it just stored in my notebook: I'll label it "open AI key", paste in the value (I'll blur this out if I need to), and save. Perfect.

Now, how do we access those secrets? I've listed a link in the GitHub README; to access them within Kaggle you can use the kaggle_secrets snippet. Our secret label was "open AI key", and I'll print just the first three characters so you can see it loaded in properly. Cool, we have a secret key there.

Then we can follow the OpenAI API docs: set up your API key, install the Python library. These docs will probably be helpful — I'll add them to the README on GitHub. We can do something like their example, except I don't think we actually need environment variables here. So, from openai import OpenAI — and I think by default Kaggle already has openai installed, which is good to know. Then we want the client code (I use Command-/ to uncomment multiple lines). We don't need an environment variable; what we actually need is that secret value. Run that... no: "No module named openai". If you run into this, you can run actual pip commands via the notebook by doing !pip install openai — the exclamation point runs it in the terminal and installs the library in the notebook. There might be other ways in Kaggle too, but this is what I do if I'm in a Jupyter notebook already. Cool, we have that. Run this again: perfect.

Now our goal is ultimately to generate a story about a data scientist finding all sorts of cool things in historical documents. How can we do that? This looks like some nice code to copy and paste into the next cell. "You are a creative genius able to write short stories with a bunch of humor" — that's our system message. A system message basically defines how the chat client will operate for the duration of a conversation. Say you wanted it to always translate whatever was input into Spanish: you would tell it that in the system role, because then every user message that comes after will generate an assistant message translating into the other language. So if you want it to follow a rule the whole time, use a system message. And for the user message, our task: "Create a short, fun story about a data scientist that makes a huge discovery when analyzing historical documents." Okay, I'm going to run this... no. One thing that might help: I think we can import openai and set the OpenAI key for the duration of the program — I'm going to look up something like openai.api_key.
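For reference, the setup these cells converge on looks roughly like this — a minimal sketch, assuming the openai v1-style client and a Kaggle secret labeled "open AI key" as above (the model name here is an assumption; use whichever you have access to):

```python
# Sketch of the task-zero setup: Kaggle secret -> OpenAI client -> story.
from kaggle_secrets import UserSecretsClient
from openai import OpenAI

user_secrets = UserSecretsClient()
secret_value = user_secrets.get_secret("open AI key")  # label assumed from above

client = OpenAI(api_key=secret_value)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # model name assumed
    messages=[
        {"role": "system", "content": "You are a creative genius able to write short stories with a bunch of humor."},
        {"role": "user", "content": "Create a short, fun story about a data scientist that makes a huge discovery when analyzing historical documents."},
    ],
)

# Pull just the generated text out of the response object
print(response.choices[0].message.content)
```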
Okay, I think we can do this. Let's see if it works now... no. I mean, I guess I don't have to create the client again — we already have that. So if I run this, let's see if it generates our story... no. What happened? "You exceeded your current quota." This is good to know, and it's part of the reason it's sometimes frustrating to use the OpenAI API: I think you get three requests per minute on the free tier. Honestly, if I ran this again in a minute it would probably be fine, but that's going to be really tough for a production-grade project like the one we're working on.

So what we want to do here is go to Settings > Billing and add a payment method. Under Usage you can increase limits, but you must be on a paid plan to manage usage limits. Everyone do what you're most comfortable with: if it's not a big deal for you to buy credits, go ahead — it will probably make your life easier throughout the video — but also note that we'll show how to use open source so you don't have to pay anything. I just want to make this as realistic as possible, and in the real world right now a lot of people are doing this and buying credits. I'm going to add $20 to start — actually, let me add $25 — and I'm not going to automatically recharge. Confirm payment. Honestly, even the $5 minimum would probably get you a long way; I'm just covering myself a little here. What we now see under our rate limits is a lot more requests to work with: we're now on usage tier one. If you look at the OpenAI rate limits docs, you indeed only get three requests per minute on the free tier, but tier one gets way more — something interesting to know.

Let's go back to Kaggle and run this again... come on, what's our story... no. What happened this time? We definitely increased our quota. How long does it take for the quota to update? What we might try real quick is just creating another API key. I'll call it "YouTube tutorial new" and create the secret key — this should carry the updated information — then add it as a new Kaggle secret, paste the new value, save, done. Run that, and now let's see what happens... come on, you know we have an increased quota... there we go! Cool: we created a new API key and we now have an answer.

I'm going to make this a little easier to read by grabbing specifically the content out of the response — maybe I'll move this to the next cell. "Once upon a time..." — you can read through this if you want. Wow, a long-forgotten civilization; that would be awesome if that's what we discovered, though I don't know if it's going to happen. Awesome, so now we have OpenAI working.

Okay, now let's say you don't want to deal with the OpenAI route at all; you want to use open source. Let's show how we can do that too. In this case I don't think we can get this to run in Kaggle — maybe I'm wrong; if you know a way to do this in Kaggle with Ollama, drop a comment down below. But basically there's this awesome new framework — I guess we can look at their documentation to see what they call themselves — this new platform, Ollama, that basically makes it really easy to run large language models locally.
There are downloads for Mac, Windows, and Linux; I'm on Mac right now, so I'm going to download that. Once you have it set up and you've refreshed your terminal, you can open a terminal window and do ollama run with any of a bunch of different models. I think the best option is probably Llama 2, which should be accessible to most people. Note that to run the 7-billion-parameter models you should have 8 GB of RAM on your machine — that's kind of the minimum requirement — and 16 GB to run the 13-billion-parameter models, etc. We'll just do Llama 2 because that will be most accessible to everyone. This is the trade-off: you use more of your own machine to run these large language models locally, versus maybe paying a little money to use the OpenAI API. These are the trade-offs we deal with in data science. So you can run ollama run llama2, and once that's up and running you can type something like "hello there" — it's as simple as that.

One thing that's super cool, though, is that this is now accessible from your Python code. If you look into the ollama Python library, you can pip install ollama. And if you clone the repository of code linked in the video description — it's at github.com/KeithGalli/historical-docs-analysis — all the libraries we'll need are listed in the requirements.txt file. This is a nice little trick in the Python world: you can do pip3 (or pip, whatever works for you) install -r requirements.txt to install all of the requirements we'll need for this video. I'll do that, which installs the ollama library — though you could also have simply done pip3 install ollama. We'll see that we already have it; we'll import it (and probably also import openai — we'll tweak these imports).

If we go to the ollama Python documentation, what I recommend is to copy some of their example code, paste it into one of these cells down below, and see if it works. It's a little bit slower because we're running it locally, but it answers the question. That was not the goal of task number zero, though. The goal was: "Please write me a short story about a data scientist that makes a huge discovery when analyzing historical documents. Make it kind of funny." Why not — we've got to laugh. Okay, let's see what it generates... "The Great Document Dumpster Fire". Geez, okay. I'm going to zoom out so you can read this... how do I make this accessible for everyone to read? I'm just going to paste it into a sticky note... oh man, that's even harder to read... I want the text black... there we go. Feel free to pause the video if you want to read the story. Wow, this is pretty ridiculous: "like the time Sir Lancelot accidentally donned a unicorn costume to a royal banquet, thinking it was a fashion..." — it's kind of a fashion statement. It's kind of funny; I'm pleasantly surprised, I'm happy about that. And that was an open-source model, free of charge: Llama 2. Cool stuff — thanks, Meta. Now how do I close... what the heck... how do I deal with these sticky notes? Okay, cool.
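For reference, the ollama Python usage boils down to something like this — a minimal sketch, assuming the Ollama app is running locally and llama2 has already been pulled via `ollama run llama2`:

```python
# Minimal sketch using the ollama Python library (pip install ollama)
import ollama

response = ollama.chat(
    model="llama2",
    messages=[
        {
            "role": "user",
            "content": (
                "Please write me a short story about a data scientist that "
                "makes a huge discovery when analyzing historical documents. "
                "Make it kind of funny."
            ),
        }
    ],
)

print(response["message"]["content"])
```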
So that's us running it with Ollama. All right, awesome: we've now run code with both OpenAI and Ollama — an API endpoint and an open-source one. You can expand on this and play around with different open-source models, and you could also play around with different paid API models: one I've tried is Cohere's, and I think Google Bard has an API too, so check that out. There are a lot of different options, and part of doing this type of work is figuring out what's the best option for you. Note that as we fill this notebook out, you can always go to the completed analysis to see filled-in sections for task number zero, etc. I'll also make sure there's a good notebook on Kaggle that you can build off of.

Okay, let's make some more cells real quick (press B to insert a new cell below). Task one: use LLMs to parse simple sentence examples. Now that we've done the warm-up task, our first task is going to be to use LLMs to parse some simple sentence examples. What I mean by this: if you go into the code and look at the analysis notebook file (.ipynb), you'll see some example inputs and outputs. This is gearing us up to get comfortable with what's actually in the contents of these historical documents, but I'll just paste them into this notebook. Ultimately we have sentences like "John James agrees to pay $50 a month to RJ Hampshire for work on the farm" — loosely following things you might see in some of these historical documents — and we want to use large language models to parse them into meaningful information. When we analyze these massive dumps of text, we ultimately want to pull out names, numbers, frequencies — what are known as entities — so that, once we've pulled out these entities for all of our documents, we have something concrete to analyze.

So: pull out the entities from these sentences using large language models. I'm going to run this cell. The goal is to get a list-of-dictionaries format with the fields payer, recipient, amount, pay frequency, and description, and you want to use an LLM to do it. Feel free to pause the video, try this out, and resume when you want to see how I would approach it. If you have no idea how to approach this, feel free to watch a little more; I'll give a bit more context before I get into it.

Cool. The way I'd think about this, maybe at first using the OpenAI approach: let's paste in exactly what we had before, and I'll call the result parsed_output. Now I want to give it a good system message, and my recommendation is that it's going to be tough to type exactly what you want inline, so I'm going to define a variable I call system_message. What might be our first stab at this task? We could say something like: "Grab the payer, recipient, amount, pay frequency, and description from any sentence you're given." That might be the simplest first thing to do. So now our content is the system message, and for our user message we can just pass in one of these examples (Command-C, paste). Then I can see what it creates for the output; I'll print the output down below and run these two lines.
Make sure the setup cell up top has run, then, as we saw before, grab choices[0].message.content to make this more readable. Okay, that's decent: if we just wanted to extract things, this would be a pretty dang good start. I think the issue is that this is only the first output we get — what happens if we run it again? That's pretty consistent, but will it keep being consistent? Even little things: "Monthly" is capitalized in this one and lowercase here; amount is "$50", and I might just want it to say 50 and know that it's dollars. It's also interesting to pass in a different type of example, like this one (oops, I meant to paste it in here — that makes a lot more sense). Now we get both John Smith and Jane Smith for the recipient — yes, that's correct — and it's pretty impressive how well this does right away, but we need a little more of a uniform format.

So we might tell the system message — and one little trick is that you can use triple quotation marks to make a multi-line string — to "output the following JSON object". One little quirk: I think curly brackets can be special syntax here, so you may need to double the curly braces for the JSON. Then I might fill in payer: "name of payer" (as a string), recipient: "name of recipient", amount: "amount in USD" — you're just filling in dummy data here — frequency: "frequency of payment", and what was the last field we had in the examples? description: "what the payment was for". Let's see what it does now... look at that, we get a nice little JSON. That's pretty good; let's keep running it and see if the frequency changes each time. "Per day" — that's pretty good; it's staying pretty stable.

However, here's a little trick: say you need a specific format, and sometimes the sentence won't say "per day", it will say "each day". Watch what happens when we change that and run it: now the frequency is "each day". The trick is that in the little example output you can give it a comment, something like: can only be one of hourly, daily, weekly, monthly, yearly, or other. Now if we run this, we hope the frequency changes to "daily", which will make things a little more uniform for future analysis down the road. See — we get "daily" now, that's awesome. And now watch what happens if I change the sentence back to "per day" and run it: we still get "daily". So you can use little tricks like annotating your example output with comments to help improve it.

One other thing: ideally I'd have this be a list of JSON objects — a list with an entry for each person — so I might add something like: "for each person that is paid in the example; if there are multiple people paid, return the list of JSON objects within an array". You're giving it more and more detail here. Okay, this looks great — it's so impressive how good this model is out of the box. If we run this again, will it do the same thing? This time it added a new line, but it still looks pretty good.
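Putting the pieces of that prompt together, the cell looks roughly like this — a sketch, assuming the client from task zero and an inputs list holding the example sentences (both from earlier cells):

```python
# Sketch of the refined parsing prompt. The parenthetical inside "frequency"
# is the comment trick from above: it pins the model to a fixed vocabulary.
# (If you build this string via .format() or an f-string, double the braces.)
system_message = """For each person that is paid in the example, output the
following JSON object. If there are multiple people paid, return the list of
JSON objects within an array:
[
  {
    "payer": "name of payer",
    "recipient": "name of recipient",
    "amount": "amount in USD",
    "frequency": "frequency of payment (can only be one of: hourly, daily, weekly, monthly, yearly, other)",
    "description": "what the payment was for"
  }
]"""

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": inputs[0]},  # inputs: example sentences from earlier
    ],
)
print(response.choices[0].message.content)
```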
One trick that might help, definitely for the Llama 2 model, is to pass in some examples — fill in the conversation with a worked example. The user content would be inputs[0], and the corresponding output would be a message with role assistant whose content is string_outputs[0], or something like that. Now you're giving it samples of what you expect it to output. Let me quickly check what string_outputs[0] is... yeah, that looks right. I'd hope this makes the format have fewer of those new lines and look more like our example. Let's see. Basically — just to understand what we did here — we're pretending this already happened in the conversation: "oh, you already did this example and produced this output." We're tricking the AI into thinking it already knows about this, in order to help it more accurately annotate the second example. If this works, we'd hope to see fewer newline characters in our output. Oh no — I should probably use commas — and look at that: much closer to how we formatted outputs[0]. Pretty dang cool.

Finally, because this is a YouTube video we only have limited time, but if you had all the time in the world I'd recommend actually creating a training set and a test set. In this example I just used one of our inputs to help produce outputs, but what you might actually want to do is confirm that certain values are the same as what you expect. To do that we want the output in Python syntax, because if we try to grab the first item from this right now we'll just get a character of the string — it's a string, not a Python object. Try to figure out how to make this a Python object on your own, and resume when you want to see how I would do it.

Okay, how I would approach this: there's a library called ast — it stands for abstract syntax tree — with a method called literal_eval. We can pass our string into ast.literal_eval, call the result actual_output, and set expected_output to the matching entry in outputs — this example was index 2, so we actually want outputs[2] as our expected output. Now we can compare specific fields: actual_output[0], the first item in the list, versus expected_output[0] (oops, spelled that wrong — and I want to print both, with a newline in between). Cool. Then you could actually test whether the fields are the same. The description field is going to be really hard to match exactly — there's no one right answer there — but for the fields you define properly there is a correct answer, so you can do comparisons and validate that these are what you expect them to be. That's how we can parse out some entities using OpenAI.
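As a sketch of that few-shot-plus-validation flow (assuming the client, system_message, and the inputs/outputs example lists from the notebook; the video keeps string versions around as string_outputs, and str() stands in for that here):

```python
import ast

# Few-shot trick: pretend the model already answered one example by inserting
# a user/assistant pair ahead of the real query, then validate the result.
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": inputs[0]},
        {"role": "assistant", "content": str(outputs[0])},  # the worked example
        {"role": "user", "content": inputs[2]},             # the real query
    ],
)

# The model returns a string; ast.literal_eval turns it into an actual
# Python list of dicts so fields can be compared programmatically.
actual_output = ast.literal_eval(response.choices[0].message.content)
expected_output = outputs[2]

print(actual_output[0])
print()
print(expected_output[0])

# Spot-check a field with a single right answer (description won't match exactly)
assert actual_output[0]["payer"] == expected_output[0]["payer"]
```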
Now let's try the same thing using the Ollama Llama 2 interface. One thing I'm going to recommend here — you don't have to do this, but I think it's worthwhile and we'll use it throughout the rest of the video — is a library that I really like for LLM stuff: LangChain. You can look up LangChain Ollama and actually use Ollama within LangChain. LangChain basically gives you all sorts of nice helper methods for working with large language models; it also gives you access to things like agents and lets you chain language-model commands together, which can be super helpful as you get deeper into this space.

For this exercise of classifying these examples with Llama 2, we'll go ahead and use ChatOllama. We'll want a human message, a system message, and I think there's also an AI message — pretty similar to how we were interacting before. We can try copying that same exact system message from above (I think I copied the whole thing — there we go), and I'll move this LLM call down a bit. So: a SystemMessage with content equal to our system_message, then pass in the text as a HumanMessage with content inputs[0] (I don't know why it says "binputs" — oh, I accidentally typed a b; inputs[0]). Let's see what it outputs... it doesn't seem to be exactly what we want. Maybe we accidentally added something weird: it's outputting multiple things — this part looks correct, but it's getting thrown off by this — so I'm going to remove that stuff and run it again. Okay: "Sure, here is the JSON object based on the sentence you provided..." — that's great, but we only wanted the JSON object. One cool thing you can do is use words like IMPORTANT: "IMPORTANT: return only the JSON and nothing else." Oh wow, that was pretty good.

Now let's try to access fields from that: chat_model_response.content, and then the payer field. What happens? Okay, this is still a string, so we're having issues. In addition to literal_eval, one other useful thing is the json library — I always forget whether it's dumps or loads; it's loads in this case, because we're loading a string into a Python dictionary. So we'll load this in, call it response_dict, and then grab, for example, the payer field: "John James". Perfect — and is that correct? "John James agrees..." — yeah, he's the payer. Let's run this again and see if it works again. It does.

However, what we really wanted was to be able to output a list, because I think it's going to get confused on example number three (index 2, inputs[2]). Let's see what happens... yeah, it tries to format things as one entry when we wanted an individual entry per person. Ultimately you can make this decision based on what works better for your future analysis — the reason we're doing this warm-up exercise is that we want to be able to make parsings like this across all the documents in this database, so you can decide what schema to adhere to — but my gut was to make it a list with a single entry for each person. So, to help it understand, since it's pretty good at outputting a JSON object, I'm going to make the top level "results" and have a list in there, to let it know there might be multiple people.
I don't know why it gave that output, but okay — new note added: "If there are multiple people being paid, each should have a full entry in the results list." Let's try this... it's pretty good, but it's still putting multiple names in a single recipient. So how could we get it to produce a full entry per person? I bet we can trick it by passing in a human message (the example) and then an AI message (the actual output). Obviously it would be cheating to pass this in and then have it try to output inputs[2] again, so we'll have it output a similar one where there are multiple payees — something like: "The local sports club agrees to pay $75 each to coaches Sarah Miller, Danny Glover, Alex Reed, and Jamie Fox for conducting a weekend sports clinic." That should have four people in it. Counting backwards, this is index -1, so that would be inputs[-2]. Let's see how it does with some fed-in information: payer "local sports club", recipient "Sarah Miller", amount 75, weekly... look at that — it knows to do the four entries. That's awesome. So if you need it to adhere to a certain format, try leveraging this tricking method where you give it: here's your system message, but, you know, the human said this and the AI said this. You're basically tricking it into understanding the format you're going for, and that ultimately gives you better performance when you run a new example.

That is pretty dang good. We'll load that in... oh no, why did that not work? Is there something weird here? Oh — it gave the list but it didn't return the JSON wrapper. That's weird. One additional trick up our sleeve with the Llama 2 method: you can pass format="json" into the ChatOllama constructor, and this forces the output to be JSON. I think that will help us — these are the things you just learn through frustrating trial and error in this world. Okay, load this: now we can get results, then the first entry, then the recipient. Cool — so this is parsing with Llama 2. If you really want to get into this, you could be strict and try to write some unit tests to check these things, but with task number one I really just want us to have a sense of how to do this, so that when we approach our documents we have a strong grasp of things.

A quick tangent that's worth mentioning: one thing I really like about LangChain is that it makes it so easy to switch models. All I would have to do to switch this to OpenAI is use LangChain's ChatOpenAI — let's actually do another line here so I don't accidentally delete something. I might have to pass in a key... actually, I have a key defined already. You would need to pass an API key if you don't have one defined in your environment variables; because I'm running this locally, I do have it defined, so I think I can just run this. And look: all I had to do was change this one line to ChatOpenAI, and now we have OpenAI working with all of this. Again, you might have to pass the API key in here (if I pass in some dumb value, let's see — yeah), but it's picked up automatically if you have OPENAI_API_KEY (all caps) defined in your environment variables.
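Here's roughly what the LangChain version looks like, including the format="json" trick and the one-line provider swap — a sketch; the import paths vary a bit between LangChain versions, and system_message/inputs/outputs are the variables from the earlier cells:

```python
import json

# Import paths as of roughly this era of LangChain; newer versions may differ
from langchain_community.chat_models import ChatOllama
from langchain_core.messages import SystemMessage, HumanMessage, AIMessage

# format="json" forces the model to emit valid JSON, which fixes the
# "gave the list but not the JSON" flakiness
chat_model = ChatOllama(model="llama2", format="json")

messages = [
    SystemMessage(content=system_message),  # the schema prompt from above
    # optional few-shot pair, the same trick as with the OpenAI API:
    # HumanMessage(content=inputs[-2]), AIMessage(content=str(outputs[-2])),
    HumanMessage(content=inputs[2]),        # the multi-person example
]

chat_model_response = chat_model.invoke(messages)
response_dict = json.loads(chat_model_response.content)
print(response_dict["results"][0]["recipient"])

# The provider swap is a one-line change:
# from langchain_openai import ChatOpenAI
# chat_model = ChatOpenAI(openai_api_key=secret_value)
```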
Feel free to leave a comment if you don't understand how to do that, and I can probably try to provide some help.

All right, for task number two we want to grab the apprenticeship agreement rows from our contracts CSV — contract_records.csv, I believe. From this file we want to grab the "Apprenticeship Agreement" subcategory. This is a pretty straightforward task — it's kind of the warm-up for the actual task number two, which is going to be connecting documents that are similar to one another — but we'll start with just grabbing these rows. Feel free to pause the video and resume when you want to see the solution.

To do this in the Kaggle world, you can look at our data and note, from the first lines we wrote, that the contract records are here. So: df = pd.read_csv (make sure you have pandas imported), run it, then df.head(). We get some rows, and the first ones are apprenticeship agreements — but are they all apprenticeship agreements? To check, you can do df.subcategory.unique() (if I can spell it), and we see the subcategories. One thing we notice: what the heck, why is there both "Apprenticeship Agreement" and "Apprenticeship Agreements"? No matter how much you try to clean data, things like this slip through the cracks. If you find yourself in a situation like that and don't know which one to use, I recommend the value_counts method: we see that "Apprenticeship Agreement" (non-plural) has 245 and "Apprenticeship Agreements" has just one. Ideally we would just fix that one entry, resave, and re-upload to Kaggle; I'm not going to worry about it too much for now. What we do want is "Apprenticeship Agreement", so: aa = df[df.subcategory == "Apprenticeship Agreement"], run that, then aa.head(). As a sanity check, compare len(df) with len(aa) — that looks like the right count of apprenticeship agreements we saw before, so that looks good.

So what is in these apprenticeship agreements? If this is all we saw, the transcription text — the actual contents of these documents — is kind of hard to read. So real quick, an aside: if you're developing locally and using Visual Studio Code, here's a nice little trick I realized the other day. Make sure pandas is imported (oh, what am I doing), and locally the file is at data/contract_records.csv, I think. Run that, then df.head() — and again, it's so hard to read this. One cool trick: if you have GitHub Copilot installed — I don't know if it's just Copilot or if you need Copilot Chat for Jupyter notebooks within Visual Studio Code — you can hit Command-I to open a Copilot chat window, and ask something like "change pandas display settings to not cut off text data in a column when I call df.head()". It'll give you something like pd.set_option("display.max_colwidth", None).
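So far, task 2a in one place — a sketch, with the local file path assumed (on Kaggle you'd use the /kaggle/input/... path instead):

```python
import pandas as pd

df = pd.read_csv("data/contract_records.csv")  # path assumed; adjust for Kaggle

# Spot the near-duplicate subcategory labels and their counts
print(df["subcategory"].value_counts())

aa = df[df["subcategory"] == "Apprenticeship Agreement"]
print(len(df), len(aa))  # sanity check: 245 apprenticeship agreement rows

# Stop pandas from truncating the long transcription text on display
pd.set_option("display.max_colwidth", None)
aa.head()
```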
If I run it now, look at how much better this is to read. It's still kind of hard, though, because we have these weird characters popping up everywhere. Let me quickly look up what this character is... a carriage return. Somehow we got these carriage-return characters. What I'd recommend: it's just a return character, not a big deal. You might be able to load in the CSV a different way so that you don't see it, but another option is just to replace it — set df["transcription_text"] equal to df["transcription_text"].str.replace() with the carriage-return character swapped for a space (the autocomplete even suggests it, which is kind of nice). Run that, and now our data is way easier to read. You could replace other characters too. We can do something similar in Kaggle land — let's do it here. I'll rerun this: we got rid of the character, and now we can read a lot more too. Cool.

And if you really wanted to, you could add in that other plural-labeled row — there are more elegant ways to do this — but, oh no: each condition should be surrounded in parentheses when there are multiple conditions, so maybe I have to use the weird syntax like this. Cool — and now len(aa) includes that other row. Whatever solution works for you works. That would be completing task 2a.
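That cleanup step, sketched out (same assumed column and file names as above):

```python
# Strip the stray carriage returns and pick up both spellings of the label
df["transcription_text"] = df["transcription_text"].str.replace("\r", " ")

# Each condition needs its own parentheses when combining with |
aa = df[
    (df["subcategory"] == "Apprenticeship Agreement")
    | (df["subcategory"] == "Apprenticeship Agreements")
]
print(len(aa))  # now includes the one plural-labeled row
```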
All right, for task number two proper, we're going to connect pages from this apprenticeship agreements data frame that belong to the same document. What do I mean by this? Let's look back at our documents. If we read the text here — it's hard to read; I'll make it a little bigger — we see "Mrs. Kate Shamblin" in this one, and in the next document we see "Mrs. Kate Shambliss". I guess the handwriting makes the name ambiguous, but we can assume this is the same person. Are there any other similar names — is "Betty Taylor" mentioned at all? We see Betty Taylor, so I can assume these two rows are part of the same document. One thing that's cool: if you actually want to see what these documents look like, you can use the document URL to go to the Smithsonian website where they live, and we see the actual page that's been digitally transcribed into the text we can work with (it would be a lot harder to try to analyze the image directly). If we go to the next page with Kate Shamblin, we see the same names — and this handwriting is so hard to read that the transcribers are going to make mistakes too. This is an extremely messy dataset, so details like this happen. But ultimately, this document should be associated with that document for analysis: we want to group the same pages together. Maybe they weren't from the exact same physical document, but when we do our analysis we probably shouldn't be double-counting and treating them as separate things. Our goal in the upcoming tasks is to pull out the people who became freed through these indentures of apprenticeship and do some analysis on their ages, their locations, etc., so we want to group similar documents together. That's our goal in this task: by the end, we want to take rows that we consider part of the same document, merge them into a single document that has all of the text, and then save that.

We can use large language models to help us with this task — that's the hint: we can use either OpenAI or Llama 2 to help us figure out, is this page from the same document as this one? Feel free to try this on your own and resume the video when you want to see how I would approach it. And to help you out: it's probably going to be pretty tough to figure this out on the full data to start, so I recommend a reduced version. I'm going to call it aa_small (short for apprenticeship agreement small — ideally we'd spell the name out fully, but I just don't feel like passing that long name around everywhere within this analysis), and we can just do aa.head(30). Now len(aa_small) is 30. So maybe just do it for the first 30 documents, because that'll be easier than the full task. All right, feel free to pause the video and resume when you want to see how I would approach the solution.

Okay, how might we approach this? Given that the hint was that we could use our large language models, I think we want to define another system message, something like: "Your job is to determine whether two pages are from the same document or not. You should determine this by checking if similar names, places, dates, etc. appear in both documents. If they do, return only the boolean value True. If they do not, return only the boolean value False. IMPORTANT: only return a boolean and nothing else." Now, how would we actually package this? I'm going to use LangChain and kind of copy what we did previously in the local setup: copy that code, go into Kaggle — we won't have LangChain there yet, so we'll probably have to pip install langchain and langchain-openai (I don't know if you have to do both or not, but it doesn't hurt to try; I don't know what this error is — I don't like the error, but okay). We can get rid of the old system message, we don't need that anymore, and we also don't need this other stuff. Our new human message is just going to be an f-string: "Document 1:" then the variable, then a few new lines, then "Document 2:" and the doc_two variable. Feels like something's off... oh — let's move these imports to the top, just for good practice, and uncomment this line. All right, let's see what happens if we try this. Okay, it says we can pass openai_api_key into this — we didn't have to do that locally because it was set as an environment variable, but the error helps us here. Note that secret_value was defined at the very top of this notebook using Kaggle secrets. See what happens if we run this again... oh yeah, we don't have any doc_one or doc_two defined yet; we can just use dummy ones for now.
doc_one will be "There is a cat named Hugo that was very silly", and doc_two will be "Hugo was so silly that he went on a snowboard lift and shredded down the mountain". I am terrible at generating funny little things on the spot — that's why I have ChatGPT, I guess — but these are clearly from the same document, so let's see what it produces: True. Look at that, that's pretty cool. Now let's try something like "Keith is cool": that should probably say False. Not really that clear, but look at that — it does. And if I add something like "There was also a dog named Jamal" — I don't know why the dog's named Jamal, but shout out to Jamal — "Jamal was even sillier": these two are also clearly from the same... pages, I guess I should call them page one and page two, because they're pages that are part of the same document; that's what we're considering semantically. Does it say True for this? No. If I said something like "Hugo and Jamal were friends", I bet it would — so maybe then we'd have changed up the system message, because it didn't like that we didn't use the word Hugo the second time. But pretty good.

Now, instead of filling in page one and page two by hand, page one should be aa_small's transcription text at index 0 and page two should be aa_small's transcription text at index 1. Let's print those values out real quick — and I'll annotate them so we know clearly where page one ends and page two starts. This is what we already saw: Kate Shamblin, Betty Taylor, Mary Taylor, and then page two has some more details. And the output is True — that looks good. So it figured out that this page is associated with that one. Not too bad — and again, we could have used the trick of providing it some examples for context.

Now, our actual task is to connect pages that belong to the same document, and we want to do this for every row in aa_small. Basically, we need a way to identify which rows are part of the same document. You can kind of think of the algorithm — maybe pause the video and try to think of one on your own — but our goal is: go through our table, check a row's text against the next row's text, and see whether they correspond to one another. If they do, give them an ID; then check row two against row three, and if row three is part of row two's document, give it that same ID; keep going, and when we find that row three doesn't match row four, row four starts a new document, and we check four against five, and so on. Note that this table is ordered, so we only need to check consecutive rows. Admittedly, it might sometimes help to check a couple of rows apart — if row one clearly matches row two, row two clearly matches row three, and row four doesn't clearly match row three but does clearly match row one, we'd like that connection — but to simplify things, let's just look one at a time, give unique IDs based on which merged document each row will belong to, and then actually do the merging. It's kind of confusing as I'm describing it, but hopefully it makes sense when you see the solution.
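Before wrapping this in a function, here's roughly what the comparison cell looks like — a sketch using ChatOpenAI (ChatOllama works the same way), with secret_value from the top of the notebook; import paths again depend on your LangChain version:

```python
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

system_message = (
    "Your job is to determine whether two pages are from the same document "
    "or not. You should determine this by checking if similar names, places, "
    "dates, etc. appear in both pages. If they do, return only the boolean "
    "value True. If they do not, return only the boolean value False. "
    "IMPORTANT: only return a boolean and nothing else."
)

doc_one = "There is a cat named Hugo that was very silly."
doc_two = ("Hugo was so silly that he went on a snowboard lift "
           "and shredded down the mountain.")

chat_model = ChatOpenAI(openai_api_key=secret_value)
response = chat_model.invoke([
    SystemMessage(content=system_message),
    HumanMessage(content=f"Page 1: {doc_one}\n\n\nPage 2: {doc_two}"),
])
print(response.content)  # hopefully "True"
```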
Okay, so let's encapsulate this in a function: def is_from_same_document(page_one, page_two). It runs the pages through the chat model, and we want to return chat_model_response.content — actually, we want ast.literal_eval of response.content. We might also want to surround this with a try/except in case things fail — LLMs output all sorts of wild stuff, so you never really know what will come back — and we'll just have the except return False. It's not the end of the world if we don't group a pair of documents together, but ideally we group as many as possible. In the real world you'd definitely want to be stricter here — you'd want unit tests to make sure this works exactly as you want it to — but there's only so much we can cover in this tutorial. Cool, so now we have a function is_from_same_document (oh, this shouldn't be the hardcoded values — page_one and page_two are passed in now). Let's check that it kind of works: is_from_same_document("Hugo is a silly cat", "Hugo is a lot of fun") should produce True — it does. If I said something like "Python programming is super cool" for the second page, that should say False — it does.

All right, now we need to iterate through our sample documents, and to make sure your results are similar to what my code produces, we're going to load a file from the GitHub repo (github.com/KeithGalli/historical-docs-analysis): it's in data/, the task 2 apprenticeship agreement sample. If you click Raw and take that URL, you can load the specific URL directly — a nice little trick in Kaggle. So now we have our test data frame: test_df.head(). One thing that's nice is you can see what we're expecting from this step: rows with the same expected ID should be part of the same document. The first and second, as we went through... oh, actually, is this the same set? This is a slightly different sample, but you see the name James Connelly here and James Connelly here, and then this row is different from that one. We want to produce the same expected IDs down the column — basically, start a new ID each time you're at a new document. So we need to write some code that does that.

How could we do it? for index, row in test_df.iterrows() (did I type that right? — good old autocomplete). I'll just print(row) to see what this gives us — and something I sometimes do: if index == 5: break, so it only prints the first five rows real quick. Okay, we can access transcription_text that way, and what happens if I access expected_id? Cool, we get that. So basically we want to add a merge ID to this table, and at the end of the day we hope our merge ID equals the expected ID. I'm going to make a copy of the table real quick with test_df.copy(), just so we can change things without affecting what we're iterating over — I don't know if this is necessary or not, but maybe it is. Basically we want to look at two rows at a time: page_one = test_df.loc[index] transcription text (maybe I didn't need the index, row syntax after all), and page_two = test_df.loc[index + 1] — making sure index + 1 stays in bounds of len(test_df); let me reason through that in a second.
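Before reasoning through the bounds, here's that comparison helper in one place — a sketch with the try/except fallback (assumes the chat_model and system_message from the previous sketch):

```python
import ast

def is_from_same_document(page_one, page_two):
    """Ask the LLM whether two pages belong to the same document."""
    try:
        response = chat_model.invoke([
            SystemMessage(content=system_message),
            HumanMessage(content=f"Page 1: {page_one}\n\n\nPage 2: {page_two}"),
        ])
        # "True"/"False" as a string -> an actual Python bool
        return ast.literal_eval(response.content)
    except Exception as e:
        # LLMs occasionally return something that isn't a clean boolean;
        # failing to group a pair is the safer miss, so fall back to False
        print(f"Error comparing pages: {e}")
        return False
```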
If we had a three-item array — one, two, three — the final index would be 2, because indexing starts at 0, 1, 2. With length 3 and index + 1 in play, the index has to be strictly less than the length, I think, which just means we skip the comparison on the last iteration. I'm kind of talking to myself right now, but hopefully this makes sense. Okay, that gets our page two. Now we compare them: if is_from_same_document(page_one, page_two), then basically we want to give them the same merge ID. So I think we set a value merge_id = 0 up top; if the two pages are from the same document, then test_df at index gets merge_id, and index + 1 gets that same merge_id. What if they're not from the same document? Then basically we want to set only the first item and increase merge_id by one. Okay, this will make sense in a second — and, you know, this is not expected to be easy. I've thought about this problem multiple times, so it's a little easier for me this iteration, but a lot of this would be trial and error in the real world, and probably still will be.

Let's think through the logic. Say the first and second documents are different: we want to set just the first row to merge ID 0, and the second document would get merge ID 1. On that first iteration we look at the first two rows, see they're not the same, set only the first row to 0, then increase merge_id by one, so now merge_id equals 1. Next iteration, we look at the second and third documents; say they are the same — then they both get set with the same merge ID, which is now 1, so they get 1 and 1. That logic seems to make sense. Let's see if this code runs — it would be crazy if it worked on the first go. I should have printed out some progress; it's going to take a little while because it has to make a bunch of OpenAI calls. We will get a print message if there's an error with any of those calls, so I'm kind of happy we're not seeing any print messages yet. I'll fast-forward through this little section. One thing I might have done to make this clearer is print the index each time we iterate, just so I know where we're at — I always like that when I'm running code, because sometimes you realize it's not getting past the first iteration and something is seriously wrong; if you see the index increasing, you're usually on the right track.

Then we can go ahead and check test_df.head() down here... I'm a little bit concerned that this didn't complete. Come on... it doesn't appear to be running; I'm going to give it one more minute. This is why you should be printing out the index. It still doesn't appear to be running, so I'm going to stop it — this is why we should be printing as we go. One thing I recommend: if you're using LangChain, we can pass a timeout into the invoke function; I'm going to give that 20 seconds. We might have just been spinning and spinning trying to make a call that never returned, and this way at least we'll see an error when the timeout is hit. Rerun the function, rerun the load — and also, I think this should be outside of the if statement, and we should definitely print out our progress:
"processing index {index}", and we can make it an f-string: index of length. Okay — I think either the timeout variable we added or this printing should help us see how it's doing. Okay, now it's going so quick — why was it spinning for ages before? I don't know what it got caught in, but visualizing things like this is helpful; you'd know if it lost internet connection or something. Okay, I think it's on the last one... I guess on the last one it doesn't have a comparison to make — is it running into some error, is it still spinning? Oh, it ran. And I guess the print should have said "of length minus one", because the index only goes to 26.

Okay, let's see: did it give us the merge IDs? Oh sweet... oh no. Okay, so we have some issues here: we expected these to be the same. Let's look at more of our document — actually, this part is still good; why did I only show 5 rows? I meant 50. Does it ever assign the same merge ID? I don't think it does. Are we ever getting inside this if? Let's print in there... oh, interesting: this line is going to reset the ID no matter what, the way we put this. We should be continuing when we hit the same-document case and going back to the top of the loop, not executing the code below — that might do the trick. It was basically getting inside the function, but then it was resetting the current index's merge ID and adding one, and that plus-one would set the merge ID for the next row, so I think we basically just got consecutive values. Hopefully that made sense. It's taking a sec — these models are sometimes weird, which is why I like the timeout: at least it gives us an output if a call gets stuck — but it looks like it's gone through now.
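To recap where the loop landed after that continue fix and the progress print — a sketch, assuming test_df has a default RangeIndex and that is_from_same_document handles its own timeouts and errors:

```python
import pandas as pd

test_df = test_df.copy()
test_df["merge_id"] = pd.NA

merge_id = 0
for index in range(len(test_df) - 1):  # skip the comparison on the last row
    print(f"processing index {index} of {len(test_df) - 1}")
    page_one = test_df.loc[index, "transcription_text"]
    page_two = test_df.loc[index + 1, "transcription_text"]

    if is_from_same_document(page_one, page_two):
        # Same document: both pages share the current id, then skip ahead.
        # Without this continue, the code below clobbers the ids we just set.
        test_df.loc[index, "merge_id"] = merge_id
        test_df.loc[index + 1, "merge_id"] = merge_id
        continue

    # Different documents: this row closes out the current id,
    # and the next document starts with a fresh one
    test_df.loc[index, "merge_id"] = merge_id
    merge_id += 1

# If the last row never matched its predecessor, it still needs an id
if pd.isna(test_df.loc[len(test_df) - 1, "merge_id"]):
    test_df.loc[len(test_df) - 1, "merge_id"] = merge_id
```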
Maybe if I just give it another clear indicator, telling it that a strong signal is when the first few words of page two are a continuation of the last few words on page one, that will fix things. I'm doing this in real time, fingers crossed. And note there are infinite ways to do this; what I'm showing is definitely not the only way, but this is how you have to approach these problems: you've got to iterate, you've got to be creative, you've got to figure out which edge cases it's missing. 0, 0, 1, 1, 2, 2, 3... oh no, why is it struggling so much? Did I rerun this cell? Maybe I didn't rerun the... oh my gosh, I don't know what the deal is. That was one approach; you could try a different approach here.

I want to change the language: "one of the key ways to determine this is by checking if the same names, locations, and dates appear in both pages." Yeah, "pages" is probably the better word. The input format will be a user message passing in the page one content, three new lines, then the page two content. Another thing we could try here is the textwrap library. I don't think this will fix it, but it's a good trick to know: you don't need to pip install anything, since textwrap is part of the Python standard library, and it has this dedent function (see the short sketch at the end of this section) that strips the common leading whitespace from a multi-line string. That sometimes helps, because indented text in a prompt is not great for performance.

So we've tried a couple of different things; let's see if any of this helps us at all: 0, 0, 1, 1, 2, 2, 2... it got one more! Though this expected ID might itself be off: is this the same orphan, or a different one? Other than that, it did pretty well; it's only missing one value right now. And wait, I was the one who labeled these expected IDs, so it's possible I'm off too. I think as long as you're close enough, that's pretty good; ideally, if you have more time, you want to get more and more precise, but I don't want to spend too much time here since we're pretty close to having something workable, so I'm going to continue onward.

I'll put up resources on other ways I went about solving some of these problems in the past. To share one quickly: when I was trying this task before the video, I defined a from_same_document function that also had some retry logic in it, built with the prompt-template syntax, which is a bit different from the system-message-plus-examples approach we used here. In that version, essentially only a system-style prompt is ever used, and it's invoked with the two documents inserted directly into the template, instead of passing separate AI and user messages; some additional libraries are imported up top. So that was another approach. The goal is to get it close enough; it doesn't have to be exactly perfect.
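To close out this stretch, here is the textwrap trick from above in isolation; textwrap ships with Python, so there's nothing to install.

import textwrap

# A triple-quoted string defined inside an indented block keeps its leading
# spaces, and those spaces end up in the prompt you send to the model.
prompt = textwrap.dedent("""\
    You will be given two historical document pages.
    Respond with only 'True' or 'False'.
""")
print(prompt)  # the common leading indentation is stripped from every line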
Cool. All right, now that we finally have these merge IDs in the data frame, we actually need to go ahead and merge the pages: remove the duplicate rows and combine the similar documents. To do that, let's look up our document again. Basically we want to group by the merge_id (you could use expected_id if you really wanted; they're pretty dang close); sorry, I misspoke there, it should be group by merge_id. Then we want to aggregate those grouped rows: we'll take the transcription text from each merge_id group, call the .agg function on it, and join those similar pages with newline characters using "\n".join; that's some slightly special syntax. Then we can reset the index. What does that give us? Okay, cool: these are the merged documents, and they're longer now. If you wanted to, you could join with two newlines instead, to make it clearer where a page has been appended. Hmm, I guess it's not showing the newlines; if we used display, or print, I'd think we'd see the newline characters... maybe not. I don't know why it's not showing the newlines, but it's not the end of the world.

Then we basically want to get this back onto our original data frame, so let's assign this result to aggregated_df, and do output_df = test_df.drop_duplicates, dropping any rows with duplicate merge IDs. I guess the document URL is no longer strictly valid for a merged row either; mostly we just want to keep the category and subcategory. We'll do that drop_duplicates, then we want to merge the two frames together: output_df = pd.merge of output_df with aggregated_df, with on='merge_id', and since we want to keep the stuff in output_df, how='left'. Anything else we need? Because both frames have this transcription_text column, we should add suffixes: no suffix needed for the output_df side, but we should have '_aggregated' appended onto the new one. Let's see what the output data frame looks like after this. Cool, good.

Now we can drop a bunch of these columns: final_df = output_df.drop. I was going to drop project ID, but it should be the same across pages. Maybe we keep the document URL, because even though it's not exactly right for a merged row anymore, it at least lets you find a document near the other ones. So I think this is probably good enough, but we do want to drop the original per-page transcription_text; it would be confusing if we kept that. final_df.head()... cool.

All right, these are the merged documents, and I know this wasn't perfect, but it at least gives you a sense of the process. Ultimately, to do the analysis, we want all the similar pages together so we can analyze each document as a unit. There are other strategies to go about it, but this one is helpful across various document categories here, so I think it's worth seeing.
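Pulled together, the merge step looks roughly like this; the column names follow what's shown on screen, and the rest is a sketch of the steps just described rather than the exact notebook code.

import pandas as pd

# Combine the pages of each merged document into one text blob.
aggregated_df = (
    test_df.groupby("merge_id")["transcription_text"]
    .agg("\n".join)  # join each group's pages with newline characters
    .reset_index()
)

# Keep one representative row per document for the metadata columns...
output_df = test_df.drop_duplicates(subset="merge_id")

# ...then attach the combined text. The suffixes disambiguate the two
# transcription_text columns that would otherwise collide.
output_df = pd.merge(
    output_df,
    aggregated_df,
    on="merge_id",
    how="left",
    suffixes=("", "_aggregated"),
)

# Drop the per-page transcription so only the merged text remains.
final_df = output_df.drop(columns=["transcription_text"])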
At this point we should run this for all of the rows in our original apprenticeship agreements data frame, but because that would take a really long time, for simplicity's sake I put a pretty good output version of it in the GitHub repo: go to the historical docs analysis repo on github.com, then into data, and it's the merged apprenticeship agreements file. If you copy that raw URL, we can go back to Kaggle and move on to task number three, which is to parse out values from the merged documents, loading a data frame with pd.read_csv on that link. This is the full version, so you don't have to run the merge on every single row yourself; but to get to this point, you would have run that chat step on everything.

All right, we're already pretty far into this video, so to make things a bit simpler, I'll have everything available on Kaggle, and I encourage you to push your work there, because I want to see your research findings and I'd love to see how you analyze these docs; but just for simplicity's sake, I'm going to play around with the rest in Visual Studio Code.

Okay, so we want to get the apprenticeship details from these documents. If we look at one, we have this aggregated text, and honestly... how do I display it with the newlines actually showing up? Let's see if Cmd+I helps me here... that didn't work. Whatever, not the end of the world. Basically, what we want to grab is information like who the apprentice is, so their name; in this case it's the minor orphan freedman James. We might want to grab the apprentice's age; do we see James's age? 14. We might want the location, Carroll Parish, Louisiana, so we want the county and the state; and these are American states, so I realize "La" is Louisiana, but that might not be immediately clear to non-US residents. I'd also want the person taking the apprentice in; "mentor" is how we've listed it previously, so you have the apprentice and the mentor, and in this case that's Mrs. Kate Chamblin. This is awkward because her name is listed different ways in different places, and I think this spelling is probably the more correct one. Maybe we could also grab an official or something like that, but I think this is a good starting point.

The format here should strike you as very similar to what we did at the start of this video, so let's copy all of that code, go down here, create a new cell, paste it in, and change what we grab: apprentice_name; apprentice_age, with no quotes needed because an age should be a number; mentor_name, which I'll describe as "name of person taking in the apprentice" and which should be a string; county, "the county where the contract was made"; and state, "the state where the contract was made". And one thing we might want to do here, since we saw "La" before and know it should be Louisiana, is explicitly say: write out the full state name, not just the abbreviation. Okay, that's good. The instructions become something like: "Your job is to parse out information about apprenticeship agreement contracts in the United States. Output the following JSON object. Note: if there are multiple apprentices in a single document, create a dictionary entry for each one in the results list." There we go; decent. Let's see how this works: we can just take a snippet from one of these documents, pass it in as the input text (probably using triple quotes to make that easier), set our LLM to ChatOpenAI, and see what it produces.
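Here's a hedged sketch of that extraction setup. The field names and prompt wording are my reconstruction from what's described on screen, and get_output is a name I'm introducing for convenience.

# A sketch of the JSON-extraction call. Field names and prompt wording are
# reconstructed from the walkthrough; get_output is a name introduced here.
import json
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

SYSTEM_PROMPT = """\
Your job is to parse out information about apprenticeship agreement contracts
in the United States. Output the following JSON object:

{"results": [{"apprentice_name": "name of the apprentice",
              "apprentice_age": 0,
              "mentor_name": "name of person taking in the apprentice",
              "county": "the county where the contract was made",
              "state": "the state where the contract was made; write out the full state name, not just the abbreviation"}]}

Note: if there are multiple apprentices in a single document, create a
dictionary entry for each one in the results list.
"""

llm = ChatOpenAI(model="gpt-3.5-turbo", timeout=20)

def get_output(input_text: str) -> str:
    """Return the model's raw JSON string for one document's text."""
    response = llm.invoke([
        SystemMessage(content=SYSTEM_PROMPT),
        HumanMessage(content=input_text),
    ])
    return response.content

# Usage on one document snippet:
# parsed = json.loads(get_output(snippet))
# print(parsed["results"][0]["apprentice_age"])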
Hmm, we screwed something up... oh, the input_text variable. Why did it... oh man, that is actually pretty good right from the get-go. So how did we get at the results before? response.content, then json.loads. And for context, the reason this is important to us: if we produce a ton of these JSONs, then we can analyze the ages, we can see what's in the documents. We're pulling out concrete information and making it actionable, which is quite cool. So just show me the results... grab item zero... apprentice age 14. Perfect.

Okay, we're processing one document; now we need to run it on many. How would we do that? We have this data frame, and we're passing in input text. I don't think we want to go through the whole data frame, so I'm going to say df_small = df.head(50), the first 50 documents or so. Then for index, row in df_small... actually, we probably want to make this a function now: def get_output. And since we're doing this for each row, we should ultimately store the JSON either in a new data frame or right on df_small, because then we can process it all together.

So: for index, row in df_small.iterrows(), we grab input_text = row['transcription_text_aggregated'], feed that into our function as output = get_output(input_text), and write it into an output_json column. I recommend timeout=20, and I recommend using a try/except here just in case things fail; in LLM world, things fail often. We'll just return None in that case. You could also add in retries or something, but None is fine for now. I think it's always good to print the index as you iterate through, and sometimes you might need a time.sleep to avoid hitting your rate limits, but I think we should be good. Let's see what happens; I'll add a new code cell below this.
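The batch loop might look something like this; it assumes the df loaded above and the get_output sketch from a moment ago, and the sleep interval is an illustrative guess.

import time

# Work on a copy of the first 50 rows so the original frame is untouched.
df_small = df.head(50).copy()
df_small["output_json"] = None

def safe_get_output(input_text):
    """Wrap the LLM call so one bad document doesn't kill the whole loop."""
    try:
        return get_output(input_text)
    except Exception as e:
        print(f"call failed: {e}")
        return None  # a retry loop could go here instead

for index, row in df_small.iterrows():
    print(f"processing index {index}")  # progress, so hangs are easy to spot
    input_text = row["transcription_text_aggregated"]
    df_small.loc[index, "output_json"] = safe_get_output(input_text)
    time.sleep(0.5)  # gentle pacing to stay under API rate limits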
One note: we should use df.head(50).copy() there, so we don't actually affect the original data frame at all. And I'm only doing df_small because it would take a long time to run this on everything; if you have the time to spare, you can run it on all rows. Let's just restart this from scratch... and look at that: it ran on those 50, and we see we have this output_json column in our data frame. Cool. Because we only ran it for those 50, if you want to see example output for a larger chunk of the data set, pretty much all of the apprenticeship agreements, you can go to the GitHub repo and grab the task-4 raw file.

I will say it again: I tried to do all of this computation in a really short time span, and most of the work in a research project like this is actually spent double-checking and triple-checking the values you produce at these stages. That's where the real, time-consuming effort needs to go. So with that in mind, note that some of these values won't be perfect, but hopefully you understand the process in general, and you can make it more specific and exact as you move forward and build upon this. Anyway, we'll use that task-4 file for our analysis step; I'll just paste it in here. But real quick: when we have these parsed JSON outputs, that's pretty cool, so what can we do with them? That's the analyze-results stage.

All right, the final task is to analyze our results. I'll also save this df_small so you can reference it too; I'll call it something like task_4_parsed_small.csv (sketched below), and that will also be in GitHub; I'll add it in a bit. So we have this df_small with the output JSONs, and honestly, just to make things simpler for us, let's grab only the output_json column.
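For completeness, saving and later reloading that sample might look like this; the filename matches what I said above, and the raw-URL path is hypothetical.

# Save the 50-row sample; index=False keeps the pandas row index out of the file.
df_small.to_csv("task_4_parsed_small.csv", index=False)

# Anyone can later reload it straight from the repo's raw URL
# (path below is hypothetical):
# df_small = pd.read_csv(
#     "https://raw.githubusercontent.com/<user>/<repo>/main/data/task_4_parsed_small.csv"
# )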
Okay, what we can do is something like info_list = [], then for index, row in df_small.iterrows(). I don't like just appending the whole lists; I want to get at the individual list items and add those individual objects, because I think that will be easiest to turn into something meaningful. So first we do output_dict = json.loads on the row's JSON, then results_list = output_dict['results'], and then, for each result in results_list, we add it to our info_list. I'd wrap each row in a try, with except Exception as e: print(e), because with all this stuff you never know if the output is going to be perfectly formatted, so it's always good to use a try/except so we don't break.

Let's see what happens if we run this now and look at our info_list; grab the first item... perfect, that looks very easy to parse. So I'd say the task here is: from our df_small, get the average age of the apprentices. Feel free to pause the video and resume when you want to see how I would approach it.

Okay, here's an example for loop to get the average age, and we should expect it to be somewhere between 0 and 21, so let's see what we actually get. What did I do wrong... what is the average age, exactly 10? Ten years old on the nose? I just get nervous when I see such a perfect number. What did I do... okay, it seems like it actually is about 10. We can make this more interesting, though: age_counts equals an empty dictionary; we get the age; if it's an instance of an int, then if the age is already in age_counts we add one, else we set it. Now we run that and look at age_counts... this seems a little too perfect; that doesn't seem right. Looking back at our info_list, there are a ton of people in here; it looks like it repeated entries way too much, so the output is not actually what we wanted it to be. Dang. Oh my gosh... oops. I realize the prompt still had that test input_text baked into it; there's input_text right there, so we just need to make sure we delete it so this runs properly. Let's rerun. Awesome, we got 50 there.
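That flattening step, in sketch form, assuming each output_json cell holds a JSON string with a top-level "results" list:

import json

# Flatten the per-document JSON strings into one list of apprentice records.
info_list = []
for index, row in df_small.iterrows():
    try:
        output_dict = json.loads(row["output_json"])  # parse the raw JSON string
        for result in output_dict["results"]:         # one entry per apprentice
            info_list.append(result)
    except Exception as e:
        # malformed or missing JSON on this row: report it and move on
        print(f"row {index}: {e}")

print(info_list[0])  # e.g. {'apprentice_name': ..., 'apprentice_age': 14, ...}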
We'll go ahead and save that as task_4_parsed_small.csv with index=False; run it, and okay. All righty: for task four we will actually analyze our results. And again, we went quickly through this, so we didn't do all the testing and checks that we ideally should. If you were to do this in a more involved project, if you wanted this to be more of a portfolio-type project, I recommend really testing the values along the way and seeing if you're getting the right outputs. I recommend writing some unit tests: check that for, say, 25 examples you always parse the correct output, and if that holds for 25 examples, you can be pretty confident applying it to the rest. What we have right now is not perfect, but really this is to showcase the process you can use to do this type of analysis.

To analyze results, let's just start with the small file. You can grab it similarly to how you've grabbed the other GitHub files, by filling in the file name in the raw URL; I haven't pushed it yet, so I'll have to push it. We have our df_small, and let's look specifically at the output_json column: we see we have a bunch of apprentice information here. So let's set the concrete task: in this df_small, with its output JSON list of 50 entries, find the average age of the apprentices.

How could we go about doing that? I think we want to get it into an easier format to work with, so the first thing we'll want to do is iterate through all of our rows and add the individual dictionary objects into an info_list; I think that will be easier to process. We load each row in as a dictionary, output_dict = json.loads(...), so now we have it in Python form, then take results_list, the results object within that, and iterate through the items in that list. Then let's see what's in our info_list; hopefully that works... oh no. Okay, we can always just wrap this in a try: if a row doesn't parse, we do nothing to the info_list, with except Exception as e: print(e). We got a few errors, but I think we should have enough in our info_list to get a rough average age. Grab the zeroth element... cool, we have a nice age there.

So I'm going to do average_age = 0, then for info in info_list. We should probably always use info.get('apprentice_age') rather than indexing directly, and if the value isn't there, or it's a None or something, we skip it. So our first check is age = info.get('apprentice_age'), and then if isinstance(age, int) (I like how the autocomplete is helping me get along here), we add the age in.
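In code, that average-age pass might look like the following, with apprentice_age being the field name assumed from the extraction prompt earlier:

# Average apprentice age over the flattened records, skipping missing
# or malformed values.
total_age = 0
count = 0
for info in info_list:
    age = info.get("apprentice_age")  # .get avoids KeyError on missing fields
    if isinstance(age, int):          # ignore None, strings, and other junk
        total_age += age
        count += 1

average_age = total_age / count if count else None
print(f"average age: {average_age} (from {count} records)")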
We also do count += 1, with count initialized to zero, and then we print. Let's see what we get: 12.453125. That seems reasonable; we know most of these apprenticeships end at age 21, so 12 seems like a plausible average. So that is the average age from df_small.

Another interesting thing to do would be age counts: make age_counts a dict, do roughly the same loop (wow, the autocomplete just knows what I want to do now), and look at age_counts. Look at that: from our 50 documents, this is what we get for counts per age. It might be interesting to histogram that, and I'm going to use Copilot to help me here: from a dictionary mapping values to the count of those values, make a histogram of that information. In our case the data is age_counts, so I'll just set data = age_counts, and... wow, do we have a bunch of 18-year-olds? Yeah, 12 of them. Here's a histogram of that information. Cool.

What might be more interesting, though, is to use the full data instead of df_small: load df = pd.read_csv on the full spreadsheet, which is the parsed apprenticeship agreements file on the GitHub repo; go to raw, get the link, paste it in, and I'll call it just df, why not. Does this load? Cool. Now we're going to make a big info_list... actually, real quick, let's look at what's in that df. Is the column called output_json here, or do we have a different name? It's called output_json, okay, and there's a results key, so I think that's probably good. Let's see what's in info_list. Okay, it's still got things, and oh, this data added an additional "official" field.

Let's see what happens with these age counts. Very different: we got a lot more values, and we got a nice histogram here. But is there really an 80-year-old? This doesn't seem right. Interestingly enough, let's just drop any data above a cutoff; I'm kind of cheating here, and maybe there really are older apprentices, but I don't think there would be an 80-year-old apprentice; that seems off. There: that's a more reasonable histogram. So that's the rough age histogram across all of our apprenticeship agreements; it might not be perfect, because we didn't double-check this.

Interestingly enough, we can kind of tie this together by going to the Freedmen's Bureau website, into categories, into indentures of apprenticeship, and seeing what we get there. From what looks like a slide presentation at some point, there's a rough apprentice age distribution, and ours looks a bit different. This is why you double-check things: maybe we're counting entries we shouldn't be, or maybe the code we have right now, which hasn't been tested a ton, isn't properly parsing tables. But it is nice that we generally see a fairly similar trend to the rest of the data, a kind of normal-ish distribution, which I think makes sense: you're not going to see many zero-year-old apprentices, and once you get close to 21, it's probably not as common either.
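As a sketch, the tally-and-plot step might look like this, with matplotlib assumed for the chart and the age cutoff standing in for the outlier filtering just described; the same counting pattern works for the mentor tally coming up next.

import matplotlib.pyplot as plt

# Tally how many apprentices appear at each age; the cutoff drops
# implausible outliers like the 80-year-old mentioned above.
age_counts = {}
for info in info_list:
    age = info.get("apprentice_age")
    if isinstance(age, int) and 0 <= age <= 25:
        age_counts[age] = age_counts.get(age, 0) + 1

# Bar chart of the distribution.
ages = sorted(age_counts)
plt.bar(ages, [age_counts[a] for a in ages])
plt.xlabel("Apprentice age")
plt.ylabel("Count")
plt.title("Apprentice age distribution")
plt.show()

Swapping info.get("apprentice_age") for info.get("mentor_name"), and dropping the int check, gives the mentor tally discussed next.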
Other things we can look at on the website include the people with the highest number of apprentices, and it would be interesting to analyze that too. So how would we get this in our code? I guess we can just do it down below: copy the age-counts code, but instead of age_counts we'll call it mentor_counts and use info.get('mentor_name'). This no longer has to be an instance of int (shift-tab to dedent), and it no longer needs the age cutoff. If the mentor is in mentor_counts... okay, run this and output mentor_counts with a quick iteration. Hmm, it didn't work. Did we call the field mentor, or mentor_name? What did I do wrong here? For info in info_list, mentor = info.get('mentor_name'), if mentor in mentor_counts... oh, okay, I was modifying the wrong dictionary. Cool.

Then we could do something like: for each value, if it's less than 10 or 15 (I want to see the highest numbers in here), do mentor_counts.pop(key), iterating over a copy of the keys so Python doesn't give us the dictionary-changed-size error. Run this, and then let's make another histogram, this time of mentor_counts.

Okay, so here are some names, and these are things to check. One we see is James Connelly, and I think it's interesting that if we look at the website, James Connelly was the biggest one there too. Another thing to notice: H.H. Foster appears a ton in our counts, but I don't see him in the actual website's counts, which is surprising. Well, if we look into the data real quick and search for the name H.H. Foster (we'll wrap this), we can see right here a good example: H.H. Foster was an assistant superintendent of the Freedmen's Bureau, the Bureau of Refugees, Freedmen and Abandoned Lands. So, given the way he's referenced in these documents, that person should be considered an official, not a mentor. That's why he's getting mistakenly counted in our tally, and why he wasn't appearing in the actual Freedmen's Bureau results on the website. There are a lot of checks that need to be done to make this right, so if you want to take this to the next step, I really encourage you to iterate on this and try to really validate that your outputs are what you're expecting.

It was a very long process, but I'm hoping you now have a sense of what we did here. We had a bunch of documents; we connected the documents together when the pages seemed closely related to one another, so now we have bigger, complete documents; from those bigger documents we pulled out specific names and entities, because specific details like that are much easier to analyze than a big chunk of text; we pulled out apprentice names, we pulled out their mentor names, we pulled out their ages; and now we have a massive list of these dictionary-type objects, which makes everything way easier. We can iterate through those objects and calculate the average age, calculate a histogram of ages, calculate the names that appear most frequently, and that's how we can get insights on key people and on how these arrangements were handled in that time period.

Hopefully this process makes sense. I would love for people to take this to the next step: keep at it, contribute notebooks to Kaggle, find some cool information from the research. Even if it's just finding unique stories within the data, I would love to see that on Kaggle, and I really encourage you to do it. Maybe we can do some livestream Kaggle sessions, or livestream reviews of the notebooks
that you post on Kaggle; I'd be totally down to build off of this, so let me know your ideas in the comments. But with that, I think we'll call it: there were a lot of real-world skills in this video, and this was a real data science project that I worked on, so hopefully you enjoyed seeing that. I want to say thank you to the entire Freedmen's Bureau team that helped out on this; it was a really interesting project to work on, and it's cool what can be done with these large language models. Hopefully you learned a lot in this video. If you did, make sure to throw the video a thumbs up, and subscribe if you haven't. More tutorials coming. Until next time, everyone: peace.
Info
Channel: Keith Galli
Views: 13,190
Keywords: Keith Galli, python, programming, python 3, data science, data analysis, python programming, python project, llms, large language model, large language models, llama, llama 2, openai, chat-gpt, chat gpt, gpt 4, gpt 3, openai api, gpt4, gpt3, chatgpt, open ai, langchain, ollama, pandas, python pandas, nlp, natural language processing, llm, python3, programming project, data project, data, jupyter notebook, data scientist, data science full course, data science for beginners, real world
Id: MeyVptCRubI
Length: 159min 33sec (9573 seconds)
Published: Wed Mar 20 2024