Scraping the data from graphs with {metaDigitise}

Video Statistics and Information

Captions
Good evening, and welcome to a new video, in which I'm going to very briefly discuss the {metaDigitise} package. It's quite an interesting package. I've been looking at this PDF that was released in 1991, I think, or 1990. It's a very interesting PDF because it's a compilation of historical statistics of the Grand Duchy of Luxembourg, which became independent from the Kingdom of the Netherlands in 1839. So these are statistics from the beginning of what we could maybe call the modern state of Luxembourg until 1989, and the statistics after 1989 are of course readily available.
The PDF is available online on the website of the national statistical institute. However, the data in it is trapped. I'm quite sure that if I were to ask, and if I were patient, I would get the data delivered, but it would probably take them quite some time. I'm fairly certain that most of this data is still archived on magnetic tapes and things like that, so it wouldn't be very easy to get it out in a tidy format. So I thought: why not try? It's 600 pages, but getting a table out of it every once in a while would be nice.
At first there are all these pages that present the subdivisions of the country. Nowadays it's a little bit different: these are called communes, and some of them have joined together to form a new one, or some have maybe been separated as well, so some of these don't exist anymore. And here you actually have a very interesting one: this is where I lived before. I didn't know that it became its own commune in 1849 and then merged with these three communes into the commune of Luxembourg, the capital. It's interesting to look at things like that; I really enjoy it. The first statistic that you really find, however, is not a table but this graph.
And this is a bit of an issue, because how do I get the data out of it? I'm fairly familiar with scraping procedures, scraping from PDFs or from HTML tables or websites in general, tables or not, but I've never scraped pictures. This is where {metaDigitise} comes into play. This package is really interesting because it allows you to do just that: scrape data that is in pictures like this one. It's a very interactive process, which is why I thought I would make a video instead of a blog post. It's relatively easy to use. It is of course a bit time-consuming because, as I said, it's an interactive process, so you have to do it by hand, picture by picture. But it really works well, and it's much faster than reading off each point one by one. So let me show you how this works.
I already digitized this picture, but I will start from scratch. By the way, I will of course link this vignette in the description. Read it, because it explains everything; this video is really just me showing it to you. There was just one thing. It's not that it isn't explained in the vignette, but as a user I was not really expecting the behavior that I saw at first, so I thought that in this video I would immediately show you what interests me and, I guess, most people. If you follow the vignette, what you get at first are only the summary statistics of your variables. The raw data is in there, so you can get to it, and it's not that you have to rerun everything, but there is a faster way: by typing, let me zoom in, `summary = FALSE`, as you see here, you get the raw data immediately. If you don't have that, you first get the summary statistics, and then you have to rerun, not the whole digitization process, but the function, to get the raw data. It's a small detail, but I thought that I could immediately do it like that.
To use this package, I highly recommend a dual-monitor setup. I don't know how it's going to work here, since I'm recording just one screen, but if you have a dual-monitor setup it's much better. Use a mouse as well: if you have a touchpad, get a mouse for this. You will see why.
Another detail: I took a screenshot of this graph, because the function `metaDigitise()` expects a folder that contains all the pictures you want to go through. So this is my folder, and in there I have only this picture, but I could have ten of them, and then it would go over them one by one. Let me first start R; I can start it here, it doesn't really matter. As I said, it's a very interactive process, so let's run the code and see what happens.
Do you want to process new images, import existing data, or edit existing data? In my case I'm starting from scratch, because I removed the folder. {metaDigitise} creates a new folder right next to your pictures which contains all the metadata; if you wanted to import existing data, this is where it would be saved. So in my case I press 1. Then: are all the plots the same? Well, I just have one, so yes, but this matters if you have more of them. Then the plot type: it's a scatter plot. Well, it's a line plot if you want to be pedantic, but it's a scatter plot. And this is the plot. Now that you see this plot, this is why I'm saying a dual-monitor setup is much better: you can put the plot on the other monitor and maximize it. In my case I won't do it, because I want you to see everything, but this is really where a second monitor is useful.
Do you want to flip or rotate? In my case I continue. By the way, I guess it depends a little on your workflow, but if you have a lot of pictures that are rotated, maybe you can rotate them first using ImageMagick; there's actually also an R package for that. Especially if they're all rotated the same way, it's much easier: you do a batch rotation with ImageMagick, you get all your pictures rotated, and then you can start digitizing them. In my case I will continue.
What is the y variable? It's just asking for the name, so I will do it in French: "pluie", for rain. What is the x variable? It's the year or, in French, "année". And here is where the magic happens. You get four steps. Step one: click on one known value on the y-axis, and this will be your first point of reference. Here on the y-axis we have 200 millimeters (are those millimeters? I guess they are), so that's the minimum, and I could click here. This is again where a mouse and a very big screen are useful, because where you click needs to be very precise: this is what {metaDigitise} uses to calibrate the picture and then get all the other points. Then y2. I'm over here, so this is 300; maybe this one, which looks to be something like 550. Well, it doesn't really matter that I'm not on the exact same line, but it would have been cleaner. Then the same for x, which is easier: this is 1940, and then I could go over here to 1960.
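The two reference clicks per axis described above can be thought of as a linear calibration from pixel positions to data values. The following is only a minimal sketch of that idea in base R, not {metaDigitise}'s own code: the helper `pixel_to_value()` and all pixel positions are made up for illustration, and a linear (non-log) axis is assumed.

```r
# Hypothetical helper: convert a pixel coordinate to a data value using
# two calibration points on a linear axis.
pixel_to_value <- function(pixel, pixel_ref, value_ref) {
  # Data units per pixel, derived from the two reference clicks
  slope <- (value_ref[2] - value_ref[1]) / (pixel_ref[2] - pixel_ref[1])
  value_ref[1] + slope * (pixel - pixel_ref[1])
}

# Made-up pixel positions of the two clicks on the x-axis
x_pixels <- c(412, 538)
# Values typed in for those clicks (the years used in the video)
x_values <- c(1940, 1960)

# A point clicked halfway between the two reference clicks
year <- pixel_to_value(475, x_pixels, x_values)
print(year)
```

This also shows why precise clicks matter: with these invented numbers one pixel corresponds to roughly 0.16 years, so a calibration click that is a few pixels off shifts every extracted point.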
And then: what is the value of y1? As I said, 300. y2: I guess 550. Again, if you have this in full screen it's much easier. It's approximate, of course; it will never be as precise as the real data, but I think it's close enough. x1 is 1940 and x2 is 1960. My cat is visiting.
Then: are any axes on a log scale? No. Do you want to recalibrate, in case you think you might have misclicked or anything? In my case I will say no. You can then also specify groups, if you have multiple groups. This is actually very well explained in the vignette, where you have three species of iris flowers. In my case I don't have any of these, so I can just press Enter, and now I can click on every point that I want to add.
For some reason... if I maximize the window, yes, this is the first one, you see: if I maximize the window there's no issue, and I can click on every point. If I don't maximize the window, as you see, the first ones over here don't work. This might also be an issue with my window manager: I use a tiling window manager, so the window is tiled, and maybe that plays a role. I don't know. I'm not doing this very cleanly, as you can see, but it doesn't matter; I think you understood. Basically, you now click on everything, and the software is able to determine the values, actually with great precision. It really depends on how well you calibrated with those first points. If you did a good job, you will have a result that is really not bad. When I did this in full screen, I really think I got something very clean, but for demonstration purposes this will be good enough.
Now that I'm done, I can click on this red square, and that's it. So here I have 78. I get asked if I want to add more, or delete, or whatever; I will just continue. I can add another group if there is one; in my case there isn't, so I will just finish. Do we want to enter a sample size? No. And finally: congratulations, it looks like you have finished digitizing. Great.
So how do I get to the data now? If I look at the object `pluie`, which means rain in French, I get my raw data right here. But this is the print method: it shows you what the data looks like, but `pluie` is not a data frame, so you cannot immediately start working with it. It's a list, and this element, accessed with `$` and the name of my picture (a .png file), is the data. If I didn't have `summary = FALSE`, instead of seeing the raw data I would see some summary statistics. That's totally fine, because you can then get back to the raw data, as I said. Anyway, if I want to get this as a table, I can do something like this, and this should give me a nice... oh, no. And you know why? This is because I have this scatter plot element here. This is how you would get to the data if I had `summary = TRUE`, but now that I have `summary = FALSE` it's easier: I just need to get to this element. I don't even need to convert it, because it is already a data frame. And there you have it: this is the data.
Now, of course, years are integers in our plot, but here they're not. If I look at the data frame, I have values that are not entirely integers. But again, if you click with high precision you'll get something very close to integers, and you can just round. You know that this is 1900, and you know that this is 1949, or 1950 rather. This one was not really well extracted, so you may have to correct things a little. But this really saves a lot of time. It is an interactive process; I don't think it would be easy to automate this with some artificial intelligence or whatever. I guess it could be feasible, but it would be tricky.
I think this works really well. It's quite easy to use, as you saw, and it works with a lot of different plots: not just scatter plots, but histograms as well, and some others. Again, read the vignette. So this is really nice, and it's a good way to get to this very old data that, again, is probably not lost but archived somewhere where it wouldn't be too easy to get to. So I will have a lot of work; I think I will do it slowly, one table at a time. Most of it is tables, and this should be easy because, as you see, the PDF has been OCR'd, so it should not be too complicated. Well, this table will be tricky. But I find this fascinating. If you find PDFs like that, documents of historical statistics, please send them to me in the comments below. I would be really interested in looking at that for other countries. Oh, this graph... well, that's really disgusting: 3D graphs, in the 90s already. Very interesting.
So anyway, thanks for watching. Again, I hope you find this useful, and if you find historical data sets like that, please send them to me; I would be really interested in studying them. Have a nice evening and a good week ahead.
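The retrieval and rounding steps from the video can be sketched roughly as follows. This is a hedged sketch, not the code shown on screen: the folder name `screenshots/` is hypothetical, the interactive call is commented out because it needs someone at the mouse, and the layout of the returned list (a `scatterplot` element holding one data frame per image, with columns `x` and `y`) is an assumption to check against the vignette. The data frame below is an invented stand-in so the rounding step runs on its own.

```r
# Interactive step, commented out because it cannot run unattended.
# "screenshots/" is a hypothetical folder containing the screenshot(s).
# library(metaDigitise)
# pluie <- metaDigitise(dir = "screenshots/", summary = FALSE)
# pluie_df <- pluie$scatterplot[[1]]  # assumed layout: one data frame per image

# Invented stand-in for the extracted points (not real Luxembourg data).
pluie_df <- data.frame(
  x = c(1900.2, 1900.9, 1949.6),  # digitized years, slightly off whole numbers
  y = c(712.4, 655.1, 803.7)      # digitized rainfall values
)

# Years are integers in the plot, so snap x to the nearest whole year.
pluie_df$annee <- round(pluie_df$x)

# Two points rounding to the same year would point to a misclick to fix by hand.
stopifnot(!anyDuplicated(pluie_df$annee))

print(pluie_df)
```

Keeping the unrounded `x` column next to the rounded one makes it easy to spot points that were clicked far from a whole year and may need manual correction.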
Info
Channel: Bruno Rodrigues
Views: 510
Rating: 5 out of 5
Keywords: rstats, scraping graphs
Id: VhDrH2weyAk
Length: 16min 12sec (972 seconds)
Published: Tue May 25 2021