Train GPT-3 on Any Corpus of Data with ChatGPT and Knowledge Graphs - SCOTUS Opinions Part 1

Captions

Hey everybody, David Shapiro here, back after a hiatus. I've got a lot going on, and you'll be excited for some news coming up. First I want to address a big elephant in the room: I just put most of my videos back online. This comes after, of course, making a recent video explaining why I took them down. I did leave some of my videos and some of my code down, but most of them are back up, both videos and repositories. After talking with people, I figured out how to strike the balance between creating tools that will help people rather than replace people. It's inevitable that things are going to change, but a tool is a tool, right? Anyway, I don't need to get lost in it; go watch the other video.

So now that I'm back, it's time to get our hands dirty again. One of the questions that pops up a lot is that people want to train GPT-3: how do I fine-tune a question-answering bot so that I can talk about, I think someone asked about the case law in Argentina or something? I don't have that data, but the short answer is: you don't. Fine-tuning doesn't work that way. Fine-tuning is about teaching it a structure. You don't teach it new knowledge with fine-tuning; what you do is teach it patterns. ChatGPT is a pattern: I ask a question and it writes a response, then you ask a follow-up question and it writes another wall of text. That's the pattern. ChatGPT was not taught anything new. It's only taught new stuff when you retrain the underlying model, and you can't do that; it's way too expensive to retrain the underlying model. So I figured, let's pick something that will be a good exemplar of this.

Just to do a quick recap, the genesis of this whole project was that people ask how you fine-tune a question-answering thing that will do case law or any kind of knowledge base. It's all the same under the hood,
right? You have a collection of documents, wherever they happen to be: how do I do QA against that with GPT-3? So here we go. I already have one that was about answering complex questions from multiple documents, but this is a little bit different, because there are going to be a few steps to this.

To show you what I mean, I went over to ChatGPT and said, "What is the kind of law system where law is established by precedent?" It says this is the common law system, as opposed to civil law. Common law means that a Supreme Court decision kind of sets the law of the land. So if you want to understand the American legal system, you really need to understand case law, and more often than not it comes down to Supreme Court decisions, because that is the highest court in the land and they set the tone for everything. Supreme Court decisions really teach you how it works.

So I went over to the Library of Congress and found that you can download all Supreme Court opinions. They're grouped by case topic; they're also grouped by volume or justice, but topic is going to be more relevant. Imagine that you're an antitrust lawyer and you want to say, "Give me everything about antitrust law. I need to know everything there is about this so that I understand the legal precedent." On the one hand there are established procedures, procedural things. Oh, and I know all this because my fiance's cousin is training to be a lawyer, and when they come visit this is what we talk about, because we're nerds. So there's all kinds of procedural stuff that I don't even remember, but the idea is that when you have rule by law, it is all about procedure and protocol rather than emotions. We actually have a very stoic system: we're going to think through this, we're going to look at the letter of the law, we're
going to have an impartial system. Of course, when you have an impartial system that requires expert navigation, that automatically privileges people with access to lawyers, a.k.a. people with training or money. Privilege is a whole other topic. Anyway, the system is there, and it's a very sober system, where it's about, let's read through the established protocols, if you're a friend of the court and stuff. I watch LegalEagle too; LegalEagle is great. So all that kind of stuff is fine, but interpreting established law, common law or case law, is a very specific topic.

So let's take antitrust law. Let's see, how many did it say? Antitrust: there are 362 documents. They're all available online as PDFs; they've been scanned, and I believe they've also all been OCR'd. Let's take a quick look, and close some of these superfluous ones. So, PDF 661. Yeah, you don't have something that's this long. And you highlight it and you see that it's OCR'd, so that means we should be able to scrape it even though it was scanned. We should be able to get this information. Let me go over here to "opinions PDF", and we'll save this one. Actually, I put it in the wrong folder.

You might have seen I had a recent document scraping video. This is the lead-up to that; this is why you need something like document scraping. Oh, I forgot the whole reason for this: I went and asked ChatGPT to tell me about this case, and it said, "I don't know what you're talking about; this sounds like it's a real case." So, okay, cool. It tells me about the identification. I said it was a Supreme Court case decided in 1953, and it still doesn't know it, because it's not connected to any external data source. One of the biggest weaknesses of ChatGPT is that it's a mind in a bottle: it has no contact with the
outside world. The only way that ChatGPT can interact with anything is via this chat interface. From an architectural standpoint that's not actually that difficult to fix, but you introduce a whole lot of new problems, especially when you consider that there are, like, billions of terabytes of text data out there to search, and a lot of it isn't accessible because it's in PDFs or private databases or something. So you need a link between the language model, which can read anything, and the stuff that you want it to read. That's what we're working on here.

Okay, so now that you're caught up: I wanted to show that this is one of the greatest flaws of ChatGPT. It's not connected to anything; it's in a vacuum. So now what? Well, we've got our data here. It's in text, but it's not necessarily machine readable. So the first thing we've got to do is take our PDF and use this script that I wrote. Let me just show it to you real quick. It just takes everything in the PDFs folder and converts it. Let me go ahead and run this; it should go pretty quick, and then we'll look at "converted". So here it is. Ta-da, there we go. This repo is public, by the way.

One thing that I did was add a little thing so that it keeps the new pages. I might remove that. Actually no, let's keep it, because it's a helpful demarcation. I added this little token because when you read a PDF you have to read it page by page, and sometimes knowing where there's a page break is helpful. So we'll keep that; that's fine. All right, let's come back to "converted", copy this, bring it back over to opinions_text, and paste it there.

I'm going to download a bunch of these. I'm going to pause the video, since you don't need to watch me downloading, but this is what I'm going to do: I'm
not going to spend the time to download all 300. I'll sort them by most popular or whatever, and we'll have a whole bunch of Supreme Court case law about, what was this, antitrust. So we'll be right back.

Okay, I downloaded files until I got rate limited. So be kind to your data sources and don't abuse them. Many websites will do this if they detect that you are scraping or whatever. If they don't offer a bulk download, there's probably a reason for it. Anyway, it didn't give me a warning that I had violated any terms of service; it just said, "We're temporarily rate limiting you." It didn't say that there were any consequences. So that's fine. This is all public information anyway, it's from the Library of Congress, so I think it's more of a technical thing.

What I'm doing here is converting it all to text. Let's go to "converted", delete the ones that we don't need, and this is infinitely more case law than I ever want to read. I'm not going to read one of them, let alone 22 of them. Let's go ahead and copy these over to my repo; I'm going to replace that one. Okay, so now we have 1.7 megabytes of antitrust case law. This goes back to the late 80s. So if we build something that understands this, we should have the ability to interact with a machine that can explain the current common law of antitrust for America. Hey, who knows, maybe LegalEagle will watch this and want to do a collaboration or comment on how accurate it is. That would be cool. What's his name, Devin? Someone please watch this and get Devin to check it out and comment on, one, my accuracy, but also the value of this tool.

Okay, so what do we do next? Here's the thing: the biggest limitation is the token limit of
large language models. It's this weird paradox, right? The model itself, I don't remember how big they are, but they're many gigabytes. I think GPT-3 takes something like 700 gigabytes of VRAM. It's enormous. But despite how big it is, it can't take in that much information. It's like blowing information in through a straw. Same thing with your brain: it's three pounds of neurons, 100 billion neurons, 7,000 synaptic connections per neuron, but you can only speak at a few bytes per second. Your input and output rate is very slow compared to the processing power of your brain and the amount of information in it. So the UI, the API, is very slow. The same is true of GPT-3 and all language models right now.

Not only that, they have a very short memory. They can only do one task at a time. So it is not possible for the machine to tell us all about this, because even ChatGPT, which is GPT-3.5, the most recent thing, is still limited. Even if you go up by a factor of a hundred, there's still too much information here for the model to learn. So this is a problem that we're going to be contending with for the foreseeable future, until there's some fundamentally different kind of AI model that can read all of this, or until it's easier to fine-tune something. Honestly, the easiest thing would be to include all of this data in the baseline model, the foundation model, and then it just knows it intrinsically. But they are really expensive to retrain, so until we get to that point we're going to have to figure out ways of using external databases or knowledge bases.

So that's the problem statement. We've got 1.7 megabytes of text here; what do we do with it? This is really dry stuff, super dry, so what can we do with
it? Well, one thing we can do: I've got this handy-dandy thing where I've got it broken up by page. You see that in many cases the sentence will continue across the break, so the page barrier is not necessarily a good semantic barrier. What we mean by a semantic barrier, or a logical barrier, is that you might still cut something off right in the middle of an idea or a thought. But it is still a good enough place to break, because when you look at how long this is, this is 20,000 characters long, so this is probably about two windows' worth. So we can have GPT-3 read most of this. Actually, let me pause for a second and do a quick experiment. Instead of telling you, I'll show you.

Okay, so we put this in here, and it's 5,800 tokens long. Our maximum is four thousand. It's twenty thousand characters, so if we split it in half and summarize it that way, we may be able to do something with it. But the problem then is we don't know exactly what we want out of it. So let's think about this. What kind of information do we want? If we wanted to make something like a Wikipedia, maybe that's a good way to go. What are the implications here? In this case, you know, boxing matches, sued Don King (oh, this is fun) for RICO charges. And they refer to other codes. So basically what this is doing is using language to build a web of reasoning and logic. This actually sounds kind of like a knowledge graph. So I'm wondering, what if we use this to build a knowledge graph? I've never built a knowledge graph; this is fun. So maybe the goal here is: let's build a knowledge graph. Let's go back over to ChatGPT in just a second and ask it what a knowledge graph is and how to build one.

Okay, I was able to get logged
into ChatGPT. "What is a knowledge graph?" Let's see what it says: "A knowledge graph is a data model that represents a collection of interconnected data and concepts, typically organized around entities and their relationships. It is used to represent and organize large volumes of structured and unstructured data in a way that allows for easy querying and visualization of relationships." Okay, and then it looks like it froze: "...including search engines, recommendation systems, and natural language..." It's going to freeze up. So once this is unfrozen, the next question I'll ask is... or can I hit Escape? You cannot abort. So the next question I'll ask is what kind of format it is. I'll pause the video until it unfreezes.

Okay, I think it was just frozen, because I refreshed the screen and it's fine. So I'm asking, "How can I code a knowledge graph?" It says: manually build a knowledge graph if you have a small amount of data; you can build it by creating nodes for the entities. You can also use a tool like Graphviz or Gephi to visualize and edit your knowledge graph. Interesting. Use an NLP tool: that's exactly what I'm going to do. Use a graph database, okay. Use a pre-existing knowledge graph, cool. I wonder what kind of format these take, and I wonder if it knows. So, Neo4j or Amazon Neptune, cool.

Let's see: "What file format is a knowledge graph? Can I use JSON or something?" It says there are a number of different file formats you can use to represent a knowledge graph. Some common ones are GraphML, an XML-based file format (okay: Graphviz, Gephi, yEd). RDF, the Resource Description Framework. I don't know anything about knowledge graphs other than the theory. JSON-LD is a lightweight linked data format that can be used to do that. CSV, really? CSV is simple: one row per relationship, with columns for the source and target nodes. Okay. I am
personally a big fan of JSON because it's human readable. CSV is human readable too, but it's a little on the messier side, especially when things get really complicated. So: "Can you give me an example of a JSON-LD knowledge graph? Let's say, for instance, I want to see some nodes about the history of France." I'm kind of a Francophile; I've visited France and I really love it there. "Sure, here's an example of JSON-LD." All right, so it looks like each node is actually pretty simple: it's got an ID, a type, a name, and a description. That's actually really simple. Nationality, oh, interesting. It looks like some of the things are kind of arbitrary: French Revolution, start date, Napoleon Bonaparte. But yeah, I really like it. France: one, the culture. So, while this is running, I'll tell you a little bit about France. I visited France 10 years ago, in 2012, and what I really like... okay, here we go. Let's see: "How does JSON-LD establish relationships? I don't see any examples of connections in the above example." Okay, so while it's telling me... the @id, oh, okay. All right, it'll explain.

Anyway, the culture in France is somewhat similar to America in that we both think very highly of ourselves, but there are some really stark differences, namely the pace of life in France. Sure, if you go to the big cities like Paris it's rush, rush, rush. But if you get outside of Paris, even in some of the larger cities, people just have a different attitude towards life. The meal portion sizes are smaller, and other things like that, and people are less in a hurry. And I hear that Italy is even worse, where nothing happens quickly in Italy, so maybe it's just a European thing. Anyway, it's very refreshing to see a modern, powerful nation, because France is the number three exporter
of, like, military hardware or something, I don't remember. But this is a powerful, modern country that has a much slower pace of life and a different attitude towards enjoying things.

Okay, let's see what it says about how these things link. It says, "In the example I provided, I used the @id property. For example, in the following snippet..." So the nationality property is set to the @id of France. Oh, okay, so this is the connection, if you're referring to another thing. Got it. So nationality is like a property; the properties attached to each node are arbitrary, and you can also just have one connect back to another. Got it. Okay, cool.

I wonder if we can just have GPT-3 rewrite this as a JSON-LD knowledge graph. If ChatGPT knows it this well, and text-davinci-003 is the same underlying model, GPT-3.5, it's entirely possible this will work. Let's see: "Convert the following SCOTUS opinion document into a JSON-LD formatted knowledge graph." Then we'll add some vertical white space, just to be friendly to the thing. It's just a little bit too long, so let's cut this roughly in half. So, start... let's see, "new page"... "In addition"... blah blah, okay. We'll just save that there, give it some more vertical white space, and then we'll put "JSON-LD knowledge graph". Okay, cool.

Also, one thing I have discovered is that I actually prefer to turn the temperature down lately. The reason is that I've found that, especially with the most recent models, the instruct-aligned ones, they do almost exactly what you want, so with a temperature of zero you get really good, consistent results. So I have changed my default temperature to zero, and everything else is just zero, zero, zero. It's pretty well aligned. Okay, so let's see if this works. It looks like it's going to do, like, the
whole thing: vehicle, true, vehicle, vehicle. Okay, so that's not quite what I had in mind. What I want is the opinions and, what do you call these, where you reference something. So let's give it a little bit more instruction about what I want: "Specifically, focus on dates, decisions, opinions, and reasoning. The purpose of this knowledge graph is to be searchable by lawyers for legal precedent and case law," and let's say specifically by trial lawyers. Basically I'm telling it this is a research tool. Here, I'll just tell it: "This is a research tool for preparing for trials before the Supreme Court." I'm just trying to think, what would Devin say on LegalEagle?

Okay, so let's try this again and see if this changes how it composes this thing. Decision, opinion, reasoning: excellent. "...requires no more formal legal distinction between person and enterprise..." Okay, that's interesting. It's still not quite it; I'm still missing something. What is it that I want from this? Maybe we can't go straight to this. Hang on, I think someone's moving around; let me close my door. I'll be right back.

Okay, sorry about that. So it's breaking it down into one thing, but I guess I need to think: what nodes do I want out of this? Here, let me go ahead and save this prompt, because it's pretty good. I think the first thing we need to do is get the whole thing rewritten so that it can fit inside a single prompt window, because if we have the whole thing a little more condensed, then we should be able to get a proper result. But we also need to think about what kind of nodes we want. So, you know, which aspect: the
Second Circuit did this, RICO requires this, and this other case said that. So I guess each node is going to be every case cited. Yeah, okay: the case cited and why. I think that's each node. All right, cool. So let's see: "Each node should be a case citation, precedent, or prior opinion." I'm probably using the wrong term. "Include..." What the heck was the term? Why is my brain doing this? I need more coffee. Unique identifier... property, not parameter. "Each node should have several properties, such as date, case number, involved parties, reasoning for including in this opinion, and other relevant information." Okay, so let's see if we can get the nodes that we want, because if we can go ahead and convert each thing to nodes, that might save us a step, but I suspect we're going to have to summarize it first.

This is really cool. I was really skeptical about ChatGPT, but I'm becoming less skeptical. Oh, this is good. Yes, it's working, it's working! Okay, so let's save this prompt, because this worked really well. I'll save this as, let's go up here, "prompt JSON-LD citation nodes", and this is an example, so we'll say "example prompt". Okay, so we got the nodes that we want, I believe. Oh man, this is going to be fun, because then I can try to figure out how to take all this and visualize it. I wonder if we can visualize it with Python.

All right, but let's pause for a second, because this is only half the document. That's not good enough. Do we want to just read it raw and go straight to it? Let's try summarizing it, and see if we can get the whole opinion in one document. Now here's the thing: some of these opinions are, like, 200 pages long, so how are we going to do that,
right? Because in order for the thing to make sense you kind of do need the whole thing, but you also don't want to lose detail. So let's think about this for a second. Let's see: "Rewrite the following SCOTUS opinion as a list of assertions"... no, we'll say "summarize", because "summarize" implies that you want to reduce word count. "Remove superfluous language while retaining specific details." Let's see if that works. Summary... okay, yeah, those are good notes, but it's not retaining the information that I want to see, such as the nodes.

Okay, so rather than read it multiple times, I think what we'll do is break it into chunks of, let's see, how long is this? We'll do chunks of 13,000; that seems good. So we'll do chunks of thirteen thousand and just go straight to knowledge graphs, because that worked exceptionally well. So let's clean this up, come down here and put in "chunk", and then "JSON-LD knowledge graph", and then we'll do File, Save As, "prompt JSON-LD citation nodes".

I need to take a quick bio break; I'll be right back. I'm sure you wanted to know that.

All right, actually I just realized this video is running long. It's already 30 minutes, and I'm a bit fried. We've got our bearings, so when we come back for the next video we will start doing the data prep, because that's a lot of fun, let me tell you. That's why I don't want to do it right now. We'll take all of these opinions and split them into chunks while keeping some of the essential information with each chunk. We've got to do a little figuring about how to format the knowledge graph correctly, because, yeah, there are some problems to solve. But we'll split it into chunks, we'll prepare the data, we'll do some experiments
with generating a knowledge graph, and that's probably all that part two will have. Then part three will be, like, let's actually load this into a database or visualizer. All right, gang, thanks for watching. It's good to be back, and take care.
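
The chunking plan described in the captions (break each converted opinion on its page markers, then pack pages into chunks of about 13,000) could be sketched roughly as follows. This is not the script from the video: the marker string `<<NEW PAGE>>`, the function names, and the reading of "13,000" as characters are all assumptions made for illustration. The token estimate only reuses the ratio quoted in the video (about 20,000 characters measured as 5,800 tokens).

```python
# Hypothetical page marker; the video shows a page-break token but not its exact text.
PAGE_TOKEN = "<<NEW PAGE>>"


def split_into_chunks(text, max_chars=13000, page_token=PAGE_TOKEN):
    """Greedily pack whole pages into chunks of at most max_chars characters."""
    pages = [p.strip() for p in text.split(page_token) if p.strip()]
    chunks = []
    current = ""
    for page in pages:
        # A single page longer than the budget is hard-split into pieces.
        pieces = [page[i:i + max_chars] for i in range(0, len(page), max_chars)]
        for piece in pieces:
            candidate = piece if not current else current + "\n\n" + piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                # Adding this piece would overflow, so close the current chunk.
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


def estimate_tokens(text, chars_per_token=20000 / 5800):
    """Very rough token estimate from the characters-to-tokens ratio seen in the video."""
    return round(len(text) / chars_per_token)
```

Breaking on page markers is the "good enough" boundary from the video: a chunk can still end mid-thought, but every chunk stays inside the size budget.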
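
The JSON-LD linking behaviour discussed in the captions (a node's property pointing at another node's `@id`) can be illustrated with a small sketch. The nodes below are made up for illustration and are not ChatGPT's actual output from the video; the `ex:` prefix and the `list_edges` helper are hypothetical.

```python
import json

# Illustrative graph only: two nodes, with one property linking them.
graph = {
    "@context": {"@vocab": "https://schema.org/", "ex": "https://example.org/"},
    "@graph": [
        {"@id": "ex:France", "@type": "Country", "name": "France"},
        {
            "@id": "ex:NapoleonBonaparte",
            "@type": "Person",
            "name": "Napoleon Bonaparte",
            # A relationship is just a property whose value references another node's @id.
            "nationality": {"@id": "ex:France"},
        },
    ],
}


def list_edges(doc):
    """Return (source_id, property, target_id) for every @id reference in the graph."""
    edges = []
    for node in doc["@graph"]:
        for prop, value in node.items():
            if prop.startswith("@"):
                continue  # skip JSON-LD keywords such as @id and @type
            values = value if isinstance(value, list) else [value]
            for item in values:
                if isinstance(item, dict) and "@id" in item:
                    edges.append((node["@id"], prop, item["@id"]))
    return edges


if __name__ == "__main__":
    print(json.dumps(graph, indent=2))
    print(list_edges(graph))
```

Extracting edges this way is one possible starting point for loading the nodes into a graph database or visualizer later, which is the direction the video points to for part three.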
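
The prompt layout dictated in the captions (instructions, vertical white space, the chunk, more white space, then a "JSON-LD knowledge graph" cue, run at temperature zero) might be assembled along these lines. The saved prompt file is not shown in the video, so the wording below paraphrases what was dictated; the function names, the `max_tokens` value, and treating "everything else zero" as the two penalty settings are assumptions. No API call is made here; the settings are only a plain dictionary.

```python
# Paraphrase of the instructions dictated in the video; the exact saved prompt is not shown.
INSTRUCTIONS = (
    "Convert the following SCOTUS opinion document into a JSON-LD formatted "
    "knowledge graph. Each node should be a case citation, precedent, or prior "
    "opinion. Each node should have several properties, such as date, case "
    "number, involved parties, reasoning for including in this opinion, and "
    "other relevant information."
)


def build_prompt(chunk):
    """Place the chunk between the instructions and the output cue."""
    return f"{INSTRUCTIONS}\n\n\n{chunk.strip()}\n\n\nJSON-LD knowledge graph:"


def completion_settings(prompt, max_tokens=1000):
    """Collect generation settings; temperature 0 favours consistent output."""
    return {
        "model": "text-davinci-003",  # model named in the video
        "prompt": prompt,
        "temperature": 0,
        "max_tokens": max_tokens,  # assumed value, not stated in the video
        "frequency_penalty": 0,  # assumed reading of "everything else zero"
        "presence_penalty": 0,
    }
```

Ending the prompt with the output cue nudges a completion-style model to begin writing the graph directly rather than continuing the opinion text.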

Info

Channel: David Shapiro

Views: 87,718

Keywords: ai, artificial intelligence, python, agi, gpt3, gpt 3, gpt-3, artificial cognition, psychology, philosophy, neuroscience, cognitive neuroscience, futurism, humanity, ethics, alignment, control problem

Id: E_sMa3N44u4

Length: 33min 45sec (2025 seconds)

Published: Sat Dec 17 2022