ChatGPT Data Extraction: A quick demonstration

Video Statistics and Information

Captions
Hello, my name is Brandon Roberts. I'm a data journalist, and I'm going to show you how to use ChatGPT to extract data from messy documents.

Some background first. I extract a lot of data from PDFs. The government is required to share public documents with citizens and journalists, but they're not required to clean them up, turn them into spreadsheets, or do anything like that. That is the job of the data journalists. So reporters often come to me with problems that look like this: "Hey, could you turn this into a table or spreadsheet?" Normally what I do is open up a Python script and start breaking this apart. So I'd break this section up, this section up, then I would separate out these. But it's not easy, because a text file is oriented in lines but these are in blocks. So I'd have to somehow split these up into blocks and then split out the addresses from the numbers, and they're not all the same. Montana only has a number; it doesn't have an address. So let's see if we can get ChatGPT to do what I would normally have to do.

The first step here is to turn it into a regular text file. I'll use a program called pdfplumber to do this, but we'll copy it here, and then we'll use backticks, paste in our document, and then for our prompt we'll ask: "Can you return a JSON representation of the above text?" It's very slow, so this is a good time to drink some water.

OK, and it did a pretty good job. You can see it split up the two main blocks here, and that would have saved me a lot of time. Didn't try a Python script, just asked it, boom, done. That's great.

Here's a document that more closely resembles the kind of work I normally do. This is a police use-of-force spreadsheet, or table, or printout, however you want to refer to it. It's in a really weird format, and trying to write a parser for this is really hard. One of the reasons is, let me zoom in here, if you look carefully: we have these field names and then we have this row of values, but then the next field over here, the data sign is down below it, and then kind of in between these two we have the value for this. So what a normal PDF-to-text algorithm would do is put these on three separate lines, and then you kind of lose the orientation between the fields and the names. So writing a parser for this would be kind of annoying. Let's see if ChatGPT can do it.

So here is our text version of our PDF. We'll copy it. This has a lot of information on it. It has complaint information; we don't need that. And we know that if we ask ChatGPT to do too much, it'll stop halfway through, because it'll exhaust the amount of space it can output. So we're going to ask it to return a JSON representation of the text, omitting information about the complaint.

OK, and it did a pretty good job. This is pretty impressive given how complicated this is. It got the fact that there could be multiple officers per complaint, and it pulled out all the fields, so that's really impressive. And it didn't truncate it. That's surprising; sometimes it'll do that when it tries to output this much information.

What we can do now is this: it created its own output format and it chose its own fields. Like these two, it could combine these two. So what we can do is, when we copy our data, we can give it a list of explicit fields. OK, and there we go. It used the fields that we asked for and it got all the information. That's pretty good. It's not perfect, but it's pretty good. It shows null for one value where it should have shown something else, so it's a mistake, but that only took less than a minute.

OK, so this comes from a project I worked on in the past, and it's kind of a weirdly formatted table. Normally you would get up-and-down and horizontal lines; here you don't. And the number and then the percentage are all split up, all weird. This wouldn't be too bad if it was just one, but when I was working on this project there were hundreds of PDFs with these, and sometimes there was a table split between two pages, or it'd be in a different format, or rotated in a weird way. Trying to write a Python script that could parse all these was just an absolute nightmare, and that project actually basically died on the table. We weren't able to complete it. It was just too hard to write a script in the amount of time that we had.

OK, so here we have a text representation of the table. Copy it, let's omit the totals, and paste in our prompt. Backticks are important. And then, instead of just asking for specific fields: ChatGPT understands programming, so that means it understands JSON schema. So we're going to ask it to return a JSON representation of the text that strictly follows this JSON schema. So it's going to be question as a string, strongly disagree as an integer, and then also strongly disagree percentage as a string, and we're going to do that for all the options. So basically what we're going to do is ask it to, for each one of these questions and each one of these options, split up each of these, so that'll be an integer and that'll be a string. Don't forget to drink water.

OK, so it did pretty good. Each one of these is an integer field, so we got the first one, 371, right here, and then we got the percentage field, and it also removed the parentheses. Very nice. And it did it for each of the questions. But since there's a limit on how long the response can be, it truncated it. So if we were going to do this for real, we would split this up, maybe do half and then half, and we'd ask it to follow the JSON schema each time, and we would get this back and combine them at the end. But this is pretty good. If I had had this back then, I think that project probably would have seen the light of day.

OK, so far we've only focused on single documents. That's fine, but what if you have lots of documents like this? Here's an example project that I have. It's police memos, and there's thousands of them; the combined number of pages is in the multiple thousands. There's no way that we're going to be copying and pasting this into ChatGPT. So I came up with a script, a Python script here, called ChatGPT document extraction. Basically what it allows you to do is have your input data, so get this data as either a text file or JSON data, and then decide on a JSON schema, which we have here, and then it'll do everything that we have done so far. It'll go through each record and it'll say, "Hey, for this record, give me a JSON object that matches this schema." So we use it like this: GPT extract, input type JSON, key doc "data raw", key ID "name", and then our input file, and then our schema file name, and then our results output file.

All right, that's it. Hopefully you learned something. One thing to be aware of that I didn't talk too much about is that ChatGPT actually does introduce mistakes into data extraction, so this isn't something that you can just completely rely on or send out to readers. That would be a huge mistake. I'm going to talk more about this in an OpenNews article that's coming out; I'm going to put a link to it in the description. You can visit my website, bxroberts.org, for more information about journalism, technology, and data journalism. I'll be posting stuff there. And that's it. If you're trying this out, good luck and have fun.
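The prompt pattern described in the transcript (document wrapped in backticks, followed by a request for JSON that follows a schema, then splitting long inputs in half and combining the results) could be sketched roughly as below. This is only an illustration, not the script shown in the video: the function names, the exact prompt wording, and the schema property names are assumptions, and no call to ChatGPT is made. Because the speaker warns that ChatGPT introduces mistakes, the sketch also includes a simple type check on each returned record.

```python
import json

FENCE = "`" * 3  # the backticks the video wraps the pasted document in

# Hypothetical schema for one survey question. The property names are
# assumptions modeled on the fields read out in the video.
SCHEMA = {
    "type": "object",
    "properties": {
        "question": {"type": "string"},
        "strongly_disagree": {"type": "integer"},
        "strongly_disagree_percentage": {"type": "string"},
    },
}

# Only the two JSON schema types mentioned in the video are handled here.
PY_TYPES = {"string": str, "integer": int}


def build_prompt(document_text, schema):
    """Wrap the document in backticks and ask for schema-shaped JSON."""
    return (
        f"{FENCE}\n{document_text.strip()}\n{FENCE}\n\n"
        "Can you return a JSON representation of the above text that "
        "strictly follows this JSON schema?\n\n"
        f"{json.dumps(schema, indent=2)}"
    )


def split_in_half(text):
    """Split a long document on line boundaries into two smaller requests."""
    lines = text.strip().splitlines()
    middle = (len(lines) + 1) // 2
    return "\n".join(lines[:middle]), "\n".join(lines[middle:])


def combine(responses):
    """Merge several JSON-array responses into one list of records."""
    records = []
    for response in responses:
        records.extend(json.loads(response))
    return records


def check_record(record, schema):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for name, spec in schema["properties"].items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        expected = PY_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for {name}: {value!r}")
    return problems
```

A check like check_record only catches structural slips such as a number returned as a string. It cannot tell whether a value was copied correctly from the document, so the extracted data would still need to be compared against the source by hand.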
Info
Channel: Brandon Roberts
Views: 8,310
Keywords: chatgpt, artificial intelligence, data extraction, journalism, data
Id: wsSqRv-y1r4
Length: 8min 27sec (507 seconds)
Published: Thu Mar 02 2023