Much Better Web Scraping with Pandas - Automatically Extract All Table Elements From a Web Page!

Captions
Hi everyone! Today I will show you a much faster way of web scraping databases, and believe it or not, instead of using a traditional web scraping library, we will do this with Pandas. That's why I'm wearing my fancy pandas t-shirt! But before we get too excited: this shortcut only works for HTML tables. If you're looking to scrape images or any other kind of element, you will still need to do it the long way, and if that's your case, I'm including a bunch of very handy tutorials in the description. However, if HTML tables are exactly what you seek, with this method you will not need to open the developer tools even once, because your entire table will be automatically scraped for you. So are you ready? Let's do this!

We will begin by installing Pandas with conda install -c anaconda pandas, or with pip install pandas if you're not using Anaconda, and we will confirm with "y". Additionally, we will need a library called lxml. We can install it with conda install -c anaconda lxml, or with pip install lxml, and of course we will confirm with "y". We can now go ahead and start coding.

In my case I'm using Jupyter Notebook, so I will run it with the jupyter notebook command; you can use any other IDE, it's up to you. We will create a brand new Python 3 notebook, which we will call scraper_pandas. First things first, we will import pandas as pd, and then the shortcut goes as follows: we will create a new variable called scraper, and we will assign it to pd.read_html(), where inside the round brackets we will include a URL of our choice. In my case I'm using the exact same URL we've used in the Mechanical Soup tutorial, so I'll just copy it and paste it as a string inside the round brackets. Then, at the very bottom of our code, we will also print the scraper object. Officially, this should do the trick. Unofficially, for some of us, this may result in a URL error, one that has to do with your SSL certificate.

On this computer, if I run this code, everything is perfect: we are getting an entire list of tables which we have scraped from the web page. However, if I do the same on my Alienware, I'm going to get a very nasty error as a result. If you're getting the exact same error, you can fix it by first importing ssl, and then typing ssl._create_default_https_context = ssl._create_unverified_context. Once we run this code, the error should be fixed.

The only problem is that we are getting too much information in return. This seems to select all the tables that we see on the page, while we're actually interested in only one of them, so let's go ahead and narrow down this list. Instead of printing our scraper, we will simply type for index, table in enumerate(scraper):, and we will first print a separator, which is a very long line of asterisks, and then right below we will print the index and the table. Now we can rerun this code with Shift+Enter and have a look. The table we are interested in is the distribution table, which exists under index 3, so to select it, in the cell below we will simply type scraper[3]. We will run this cell, and there you go: here's an extra organized Pandas DataFrame with all the values we wanted to select.

All we need to do now is load it to SQL, and that's it. I'm not going to repeat the SQL commands here; you guys can just go back to my Mechanical Soup tutorial, where I show you exactly how to do it. Using Pandas here is not only saving us lots of time and lots of typing; we are also making our code much more reliable, because we are selecting the entire table rather than individual items inside the table, which may change over time. However, this Pandas method also has its limitations: not only are we restricted to table elements, but we are also not able to use Pandas to interact with the web page. So if we need to log in, or if we need to press on a bunch of buttons, we will still need to use a traditional web scraping library.

Thank you guys so much for watching, and an extra special thank you to Dito for suggesting this read_html method in the comments. I really hope it saves you lots of time with your projects! If you found this tutorial helpful, please give it a like; if you have anything to say, please leave me a comment; and if you want to be extra awesome, you can always subscribe to the channel, turn on the notification bell, and of course share this video with the world. Thanks again, and I'll see you soon!
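The whole workflow from the video can be sketched in one short script. Since the video's actual URL isn't reproduced here, this sketch parses a small inline HTML snippet (the table contents are made up for illustration) instead of fetching a live page; a real run would pass the URL string straight to pd.read_html. It requires pandas plus a parser backend such as lxml, and the SSL workaround line is only needed on machines that hit the certificate error described above.

```python
import ssl
from io import StringIO

import pandas as pd

# Workaround for the SSL certificate error some machines raise when
# pd.read_html fetches a URL; harmless if you don't need it.
ssl._create_default_https_context = ssl._create_unverified_context

# A live call would look like:  tables = pd.read_html("https://example.com/page")
# Here we parse an inline snippet so the sketch runs without a network.
html = """
<table>
  <tr><th>name</th><th>value</th></tr>
  <tr><td>alpha</td><td>1</td></tr>
  <tr><td>beta</td><td>2</td></tr>
</table>
"""

# pd.read_html returns a LIST of DataFrames, one per <table> element found.
tables = pd.read_html(StringIO(html))

# Enumerate the list to find the index of the table you actually want
# (the video's page happens to keep its distribution table at index 3).
for index, table in enumerate(tables):
    print("*" * 40)
    print(index)
    print(table)

# Select a single table by its index; our snippet only has one, at index 0.
df = tables[0]
print(df)
```

From here the DataFrame can be loaded into SQL exactly as shown in the Mechanical Soup tutorial, for example via df.to_sql().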
Info
Channel: Python Simplified
Views: 65,134
Keywords: pd, pandas, read_html, read html
Id: oF-EMiPZQGA
Length: 5min 28sec (328 seconds)
Published: Tue Jan 04 2022