Scrape HTML tables easily with Pandas and Python

Video Statistics and Information

Captions
Hi everyone, welcome, John here. In today's video I'm going to show you how we can use pandas to get data from a URL directly into a DataFrame: no requests, no BeautifulSoup, no loops, no scraping code, straight from the URL into pandas.

The first thing we need to do is import pandas, so we'll do `import pandas as pd`. If you've watched some of my web scraping videos before, you'll have seen that quite often I use `DataFrame.to_csv` or `read_csv`, so reading brings the information in. Pandas actually has a function called `read_html` that we can give a URL, and it will go out to that page and scrape it directly for us. It looks like this: we'll say `df = pd.read_html(...)`, there it is right there, and now we just need to give it a URL.

There are a few caveats to this. The first is that under the hood it does use BeautifulSoup, although sometimes it uses other parsers, and that's fine. What it will do is go out and look for table data on that web page. By table data I mean the actual HTML tags: `table`, and the others are `tr` and `td`. It's going to look for these on the page, and it's going to return a list of however many tables it finds. It works well on some websites, not so well on others, and you may need to do some data cleaning once you've got it, but for some websites, like the ones I'm going to show you now, it's really quick, really easy, and could be really useful for getting the data straight out.

Okay, so let's have a look at some demo sites that I've picked out. This one is a fastest laps website; it basically has a table of data with a vehicle, a driver, and a lap time for one specific track, the Le Mans Bugatti circuit. So if I copy this URL, go back to our code, put it in here, and then just do `print(df)` and run that, we should get back a DataFrame. There. What it's done is gone out and returned a list of all of the tables it has
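The one-liner described above can be sketched like this. To keep the example self-contained and runnable offline, it parses a small inline HTML snippet instead of the live fastest-laps URL from the video; the vehicles and lap times are made-up placeholders, and in practice you would pass the page URL string straight to `pd.read_html`.

```python
from io import StringIO

import pandas as pd

# Inline HTML standing in for the live page used in the video.
# The real call would be: tables = pd.read_html("https://...")
html = StringIO("""
<table>
  <tr><th>Vehicle</th><th>Driver</th><th>Laptime</th></tr>
  <tr><td>Radical SR8</td><td>A. Driver</td><td>2:03.9</td></tr>
  <tr><td>Porsche 911 GT3</td><td>B. Driver</td><td>2:10.5</td></tr>
</table>
""")

# read_html scans the document for <table>/<tr>/<td> markup and
# returns a *list* of DataFrames, one per table it finds.
tables = pd.read_html(html)
print(len(tables))   # how many tables were found
print(tables[0])     # the first (and here, only) table
```

Note that `pd.read_html` needs an HTML parser installed (lxml, or html5lib plus BeautifulSoup) to do its work.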
found on that page for us. If we look at the first one, it's exactly the data we were after, and the second one is something else, so that means there must be another table on that page. If we look at the page, the first one we got was all of this, and the second one, I think, is underneath somewhere; yes, this looks like a table of something else. So basically all we need to do is index it, zero being the first one. Run that again, and there's our DataFrame. We can see it's 177 rows, with all the information in it, and we've got the column headers, all in basically one line of code. This is really cool and can be really useful; from there we could manipulate this data in any way we like, et cetera.

Another really good website this works for is Wikipedia. Wikipedia is basically one big table, and we can scrape so much information from it. So again, if I come to a Wikipedia page, we've got the records for the land speed record, and it's got this table here with all of the data in it: the date, location, driver, et cetera. We can do exactly the same thing to scrape all of this data directly into our pandas DataFrame. Again, copy the URL, get rid of that for now, put the URL in here, and again `print(df)`, because this is going to return a list. Run that, and we can see it has got back two DataFrames again: the first one has all of the data we were looking at, which was the first table, and the second one is another table that's down at the bottom. If we go to the page again, this is the first table, and this was the second table. So to get just the data from the first table, again we'll index it with a zero, and I'll do `.head()` so we get a snippet of the data. Run that, and we can see it automatically fills in NaN, for "not a number", for anything that doesn't have data. So if we look at the first row, under one of the columns for miles per hour or kilometres per hour it now has NaN, and that will be
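The indexing and missing-data behaviour described above can be sketched as follows. Again this uses a tiny inline page with two tables rather than the live Wikipedia URL, and the rows are invented placeholders; the point is that `read_html` returns both tables, `[0]` picks the first, and an empty cell comes back as NaN.

```python
from io import StringIO

import pandas as pd

# A page with two tables, like the Wikipedia example in the video.
# The mph cell in the second data row is deliberately empty.
html = StringIO("""
<table>
  <tr><th>Year</th><th>Driver</th><th>mph</th></tr>
  <tr><td>1898</td><td>Chasseloup-Laubat</td><td>39.24</td></tr>
  <tr><td>1899</td><td>Jenatzy</td><td></td></tr>
</table>
<table>
  <tr><th>Footnote</th></tr>
  <tr><td>source list</td></tr>
</table>
""")

tables = pd.read_html(html)   # a list with one DataFrame per table
df = tables[0]                # index 0 = the first table on the page
print(df.head())              # snippet of the data
print(df["mph"].isna().sum()) # the empty cell was filled in as NaN
```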
this column here, where it found no data.

There are a few arguments we can give it. If we're dealing with dates, we can put in `parse_dates`, like this, I believe, equal to True; it's boolean, and that will parse the dates for us and turn them into actual datetime objects. We could also do skip rows: if the first few rows of your table weren't the information you were after, you could put `skiprows`, I think it's one word, and then the number of rows to skip; so here I skip the first two rows. Another thing you can do is use `match` and pass in some regex to find the table for you: if the page you're looking at has lots of tables, you could just use the `match` argument with regex to find the right one. But the times I've used it recently, I've just indexed the list and it's been absolutely fine, so I've not bothered with `match` or any of the other arguments; I've just been able to do it like this.

If we get rid of our print statement, we'll just export this straight to CSV: `df[0].to_csv("landspeedrecord.csv")`. Do I need to remove the index? I'll leave the index in for now and we'll see what that looks like. Run that, and if we come here, here's our landspeedrecord.csv file, and we can see we've got all of the data.

So hopefully you guys have found this useful, a nice short video for a Wednesday. It's a really cool function. If the website you're looking at has basic data and it's got tables (remember, it has to have the actual table tags: `table`, plus `tr` and `td` for the rows and cells), then this should work. So give it a go. Thanks for watching, guys, and see you next time. Bye.
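The arguments mentioned above (`match`, `parse_dates`, `skiprows`) and the CSV export can be sketched together. As before, this runs against an inline two-table page rather than the live URL, and the drivers and speeds are illustrative placeholders; `match="Driver"` keeps only tables whose text matches that string, and `parse_dates=["Date"]` turns the named column into real datetimes (passing `skiprows=2` in the same call would instead drop the first two rows, as mentioned in the video).

```python
from io import StringIO

import pandas as pd

html = """
<table>
  <tr><th>Date</th><th>Driver</th><th>Speed (mph)</th></tr>
  <tr><td>1927-03-29</td><td>Henry Segrave</td><td>203.79</td></tr>
  <tr><td>1935-09-03</td><td>Malcolm Campbell</td><td>301.13</td></tr>
</table>
<table>
  <tr><th>Footnote</th></tr>
  <tr><td>source list</td></tr>
</table>
"""

# match filters the returned list to tables containing matching text,
# so the footnote table is skipped; parse_dates makes Date a datetime.
tables = pd.read_html(StringIO(html), match="Driver", parse_dates=["Date"])
df = tables[0]
print(df.dtypes)

# Export straight to CSV; index=False drops the row-number column
# that was left in during the video.
df.to_csv("landspeedrecord.csv", index=False)
```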
Info
Channel: John Watson Rooney
Views: 34,650
Keywords: web scrape with pandas, python web scraping, extract html table data, pandas read html, simple web scraping, simple html table scraping, scrape wikipedia tables
Id: ODNMNwgtehk
Length: 6min 45sec (405 seconds)
Published: Wed Sep 23 2020