Web Scraping With Selenium And A Raspberry Pi - All You Need To Know

Video Statistics and Information

Captions
Howdy, tinker nerds! In my last video, we talked about the basics of web scraping, but let's be honest, that was just scraping the surface. Sorry if that scraping joke was too abrasive; I shouldn't let that one scrape by. Okay, that's it, I'm done. So in this video, we're going to go a little beyond the basics and learn how to scrape data from behind logins, forms, or pagination. So let's open it, dump it out, and scrape the bottom of this barrel, because it's tinker time!

If you want to keep those knowledge gears greased, please be sure to subscribe and ring that notification bell. And if you haven't seen my last video on the basics of web scraping, scrape it... I mean, watch it, because we're going to pick up right where we left off. In it, we talked about the Beautiful Soup library and how it can parse HTML so that you can search it for whatever you want. The problem is that Beautiful Soup doesn't really interact with web pages, so if you need data that's behind a login or spread across multiple pages, it loses its usefulness real quick. In those cases, what we need is something that can type words into fields and click on buttons or links, all through code. Oh hi, Selenium, welcome to the conversation! Selenium is another Python library that can extend the functionality of Beautiful Soup, or replace it altogether when it comes to scraping data. It can automate web page interactions, allowing you to programmatically navigate through websites to get the data you want.

All right, let's get back to coding. The website we've been scraping is quotes.toscrape.com, a site built for testing web scrapers. If you look at it, it has a login option at the top of the page. When you click on it, you can use a username of "admin" and a password of "1234" to log in, and then it just sends you back to the page with all the quotes on it. What we want to do in our code is repeat this exact same process of clicking the link and typing in the information, but programmatically. This is the code that we ended up with last time; you can find it, along with the code we're about to write, on my GitHub page (link in the video description). What we want to do is edit this code to run with Selenium instead of Beautiful Soup. The first thing we need to do is install Selenium, so open up a command line and use pip3 install selenium to install it for Python 3.
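Here's a minimal sketch of the setup we're about to build, so you can see where we're headed. It assumes the chromium-chromedriver package from apt, which installs the driver at its usual Raspberry Pi OS path; treat that path as an assumption and adjust it for your system:

```python
# pip3 install selenium                   # Selenium for Python 3
# sudo apt install chromium-chromedriver  # Chromium's driver on Raspberry Pi OS

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Point the Service at the driver binary (apt's default path; yours may vary)
service = Service("/usr/lib/chromium-browser/chromedriver")
driver = webdriver.Chrome(service=service)

# driver.get() replaces requests.get(): it loads the page in a real browser
driver.get("https://quotes.toscrape.com")
```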
We're going to be replacing the need for Beautiful Soup and requests, so let's comment those out of our code and then import Selenium's webdriver and Service modules. Note that these imports are case sensitive. To set up Selenium, we first have to point it to the driver for the browser we're going to be using; basically, that just means giving it the path where the browser driver executable lives. On a Raspberry Pi, the browser is Chromium, so we need to point it to a Chromium driver. The driver bundled with Chromium on the Raspberry Pi sometimes has a few issues or glitches, though, so what I'm going to do is install the chromium-chromedriver package from the command line and then point our Selenium Service to that path. Now, instead of using requests, we can point our variable to our driver service and then use .get() to load the website.

Let's start getting rid of the Beautiful Soup code and replacing it with our Selenium automation code. The first automation step is to click the login link on the web page, so let's right-click on it, select Inspect, and see what its HTML looks like: it's basically an anchor tag with some text saying "Login". Selenium has a great way of finding different elements in the code using its By class, which we have to import. This lets us search the HTML for all sorts of different things, like class name, ID, CSS selector, and link text. That's what we're going to do here: we'll search by the link text "Login", which is exactly what we need to find. Then all we have to do is tell it to click that link. Once it does, it's going to load the login page, so we need to import the time module and wait about three seconds for the page to load.

On the login page, there's a username field that we'll need to fill in with our username. Let's right-click and inspect that field to see how it's referenced in the HTML: it has a tag ID of "username", and likewise the password field has a tag ID of "password". So let's add a username variable to our code and set it to find the username element by ID, then add a password variable and do the same thing. The next step is to populate those fields, which we can do with the send_keys command.

In general, I hate including passwords in code, and I don't like hard-coding usernames either, but I'll do it in this instance because it's just a test. It's very bad practice and very insecure. For the password, I'm going to use a library called getpass that prompts us to enter the password at run time and then uses it in the code, so we don't have to store the password in the source. But if you want your code to be completely automated with no user intervention, or you're using this in practice, please use some kind of Python keyring instead to securely store your passwords and login information. For now, just go ahead and use this getpass code.

At this point, all we need to do is click the login button, so let's inspect that too. This time we're going to find it by CSS selector and then tell the code to click it. Once we're logged in, we're taken back to the original quotes page, so the rest of our code should work as before, except we'll need to replace the Beautiful Soup find_all commands with Selenium find_elements commands for both the quotes and the authors variables.
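Putting those steps together, the login flow might look something like this sketch, continuing from the driver setup above. The "username" and "password" IDs are what the inspector shows on quotes.toscrape.com; the submit-button selector and the "text"/"author" class names are my assumptions from the same page, so verify them against the live HTML:

```python
import time
from getpass import getpass

from selenium.webdriver.common.by import By

# Click the "Login" link, found by its visible link text
driver.find_element(By.LINK_TEXT, "Login").click()
time.sleep(3)  # crude wait for the login page to load

# Locate the form fields by their HTML id attributes
username = driver.find_element(By.ID, "username")
password = driver.find_element(By.ID, "password")

# Fill them in; getpass prompts at run time so the password
# never lives in the source code
username.send_keys("admin")
password.send_keys(getpass("Password: "))

# Find and click the submit button with a CSS selector (assumed selector)
driver.find_element(By.CSS_SELECTOR, "input[type='submit']").click()
time.sleep(3)

# Back on the quotes page: find_elements replaces Beautiful Soup's find_all
quotes = driver.find_elements(By.CLASS_NAME, "text")
authors = driver.find_elements(By.CLASS_NAME, "author")
```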
All right, now let's test our code. Unlike before, it opens up an actual browser window with a message saying "Chrome is being controlled by automated test software", then it goes to the login page and prompts us for a password. I'll type that in, and the rest of the code executes just as it did before. And there you have it: you can now scrape data from pages that require a login first.

But wait a tick! If you look at the bottom of the page, you may notice a Next button. This is known as pagination, and it means that your data is spread across many different pages. How many pages? I've got no idea. What I do know is that we've only been scraping the quotes on the first page. If we want to scrape all the quotes, we have to programmatically click that Next button and keep doing so until there's no Next button left to click. Let's first look at the HTML for the Next button: it's basically a link with a whole lot of extra code in between the link tags. We could find the element by link text as we did before, but we would then have to include all that extra code as well. So instead, we're going to search by partial link text and just use "Next" as the partial text to find.

We're going to need to loop through this, which we can do with a while True loop, putting all of this code inside it. We'll also need to move the quotes and authors variables inside the loop to make sure we're finding the quotes and authors on each of the pages. Now we need to stop this code when there are no more Next buttons to click, because in essence that should mean there's no more data. So let's put our find-the-Next-button code in a try statement. Selenium has some nice exception handlers, and one works great for this situation: NoSuchElementException. Import that and add it to the try/except, so if there's no such Next button element, the code breaks out of the loop.
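For reference, a sketch of that loop, continuing from the snippets above. The CSV file name and column layout are placeholders standing in for the export code from the previous video:

```python
import csv
import time

from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

with open("quotes.csv", "w", newline="") as f:  # illustrative file name
    writer = csv.writer(f)
    writer.writerow(["quote", "author"])

    while True:
        # Re-find the quotes and authors on every page of results
        quotes = driver.find_elements(By.CLASS_NAME, "text")
        authors = driver.find_elements(By.CLASS_NAME, "author")
        for quote, author in zip(quotes, authors):
            writer.writerow([quote.text, author.text])

        try:
            # Partial link text, so the Next link's extra markup doesn't matter
            driver.find_element(By.PARTIAL_LINK_TEXT, "Next").click()
            time.sleep(3)  # give the next page a moment to load
        except NoSuchElementException:
            break  # no Next button left: this was the last page

driver.quit()
```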
All right, let's give this thing a try, and... oops, I forgot a colon. All right, try it again. Now I'll enter the password, and off we go. The code is now grabbing the quotes and authors off each page, clicking the Next button, and repeating that process until there isn't a Next button to click anymore, which means we're at the last page. The code looks to finish without any errors, and since we're still exporting the data to CSV within our code, let's open up that saved file and see if it saved everything. It looks like we have 100 entries, which sounds about right, so it works!

With all the automation capabilities that Selenium offers, it really opens the door to programmatically navigating your way through the web and collecting any data you want. Let me know in the comments if this works for you, and share your experiences with scraping data. You can click here to watch more videos like this, and please remember to support me by sharing, liking, subscribing, or commenting. And until next time, keep tinkering!

Info
Channel: Tinkernut
Views: 70,066
Keywords: tips, tricks, tutorial, tinkernut, how to, weekend hacker, gigafide, tinker, raspberry pi projects, raspberry pi 4 projects, raspberry python projects, raspberry python programming, selenium tutorial for beginners, selenium python, selenium python tutorial, selenium webdriver tutorial, selenium automation testing, python selenium, chrome driver, web automation, web scraping tutorial, python web scraping
Id: tRNwTXeJ75U
Length: 9min 49sec (589 seconds)
Published: Sat Dec 11 2021