How I WEBSCRAPE Websites with LOGINS - Python Tutorial

Video Statistics and Information

Captions
Hello everyone, welcome. John here, and today we're going to cover how to web scrape sites that require a login, using requests and a requests session. Using the inspect element tool in our browser, we can see where the login request is actually sent and mimic that in our program. The session part allows us to stay logged in and access all the pages that are behind the login.

There are a few things we need to do before we write our code, however. We need to find out the login URL and what parameters are sent with that POST request, and of course we need the login credentials. In this example I will share the login information with you, because we're using a dummy site. I'll also show you a way to separate out your credentials at the end, to make it a bit safer when you're sharing your script or uploading it to GitHub or wherever.

So this is the site we're going to use, and it's at this URL; I'll put a link to that in the description. As you can see, we've got a simple login form with a username and password required. If we log in now using the information given to us here, we'll see that when we log in correctly we get a message saying we are logged in, and we land on a secure area. This is what we want to get to with our Python program, and then scrape the pages within it, although this is a demo so there's no real meaningful information here.

Okay, so let's log out now. The way that we find out what's going on with the requests is by using the inspect (or inspect element) part of your web browser, and the tab we're most interested in is the Network one. As you can see here, if we click the login button with no credentials, we get a load of requests popping up, and this is what we just sent to the server. We can see here that one of them is called login, and there's an authenticate one too. Now, this looks like a POST request to me, so if we click on it here... that was a GET
request, so we want the one above, which is a POST request. A POST request sends data, here the form fields, from the web browser to the server, whereas a GET request simply asks the server for a page. What we need to find out is the URL that is being posted to with the username and password, and any other information that goes along with that. We can see right away here that the request URL is this one, so let's copy and paste that over here for safekeeping, because that's where we're going to need to send our POST request from our script.

If we now clear this up and click Preserve log, we'll be able to see everything come in. So let's use exactly the same super secret password and login. I typed that wrong; let's clear that again and get the password right this time. Great, so we logged in correctly. Now what we can do is see on our request here that it was a POST request, and somewhere down here it should give us a response. The response didn't load. Okay, here we go, here's our form data, and this is what was sent along with our request to the URL, so we need to make sure that we use the correct matching information here. Sometimes you might find there's a bit more information down here, such as other parameters, and you need to make sure that those go along with the request as well. But we can see here there's only a username and password, so that's all that we need for logging in.

As well, we can see that we got redirected back to secure, and this should be the GET request that we got sent back, so we need this URL as well; I'll just put that in here. Okay, great. I'm going to close out the browser now, and we'll get onto our editor and start writing our code.

The first thing we need to do, as always, is import requests, and we need to set our URLs. Let's call this one login URL, and it's equal to this. This is where the information was posted to, not the URL that we
actually visited to get the login form. Then let's call this one our secure URL. There we go, so that's pasted in, and this is the URL that we want to get to once we have logged in.

Okay, so now we need to work on our POST request, and we need to send the username and the password along with it to get authenticated with the server. To do that we need to send some kind of payload, and because we have two parameters we make that into a dictionary. So we'll do payload is equal to, and create a Python dictionary. The first key is username, which is what we saw in our POST request in the browser, and that was Tom Smith, and then the password was this password, just like that. Okay, so now we've created our payload to send along. If there were any other parameters that needed to go with the request, they would also need to go in here and match what we looked at on the inspect element Network tab in the browser.

The next thing we need to do, and let's ignore the session for now, is just see if we can get authenticated with the server. So we set r equal to a requests
post, passing the login URL that we set, and then data is equal to the payload. What this does is use requests to post this information to this URL, and the payload is what we created. If we print out r.text, hopefully what we should get back is the secure page. There we go: Secure Area. So this shows that we did actually manage to log in to the secure area.

Okay, so that's great. Now we might think that, because we've authenticated with the server, if we were to navigate to a different page within that login area we could just access it as is. Let's try that: say r2 is equal to requests.get, and let's try to get the same page back, the secure URL. When we sent the POST request, we got information back, and within that information was a redirect to this page here, the secure area. So if we do this POST request and then also get the same page again (this could be a different page, but this is the only one that's there), we'd hope to get this information back again. But we won't: it will send us back to the login page, because we are not authenticated. If we print the text out from that request, we can see that we're back at the login page.

What has happened is that we authenticated with the server, but because we don't have a session, the login cookie isn't carried over to the next request, so we're not staying authenticated and we're not going to get anything. So what do we need to do? Well, we need to use a requests session. I'm going to remove these, and we're going to keep these for now. Also, to make it a bit easier to see what's going on, I'm going to import BeautifulSoup as well, so we can make the output a bit nicer. Okay, so we need to keep the same payload, and we're
going to use a context manager in this case. A context manager is very useful because it allows us to stay connected and logged in for as long as we remain within our with statement; when we come out of that, the session is closed again. It's always good Python practice to use a context manager when you're opening files or creating a session like this, because it means you don't stay connected to something after you've finished with it.

So let's do with requests.Session, with the double brackets there, and we'll do that as s, just to give it a name. Then we do s.post, exactly as we did before, with our login URL and then data is our payload. This is basically just opening the session and calling it s, which is why it's s.post here. Let's create our soup variable with BeautifulSoup... actually, I'm getting a bit ahead of myself here. Let's just see what we get back: if we assign that to r and print r, we should get Response 200 back, because that's the status code, and if we print the text we should get our secure area back, which we do. Great, so that proves that we've logged in.

Okay, so we can get rid of this, and let's try to load that page again as we did before; when we did it without the session, we were not logged in, so we couldn't get the page. So now let's do r is equal to requests.get and then the secure URL. This sends a request directly to the URL, which will only give us the page if we are still logged in. In this case I am going to create a soup variable so it's easier to see: BeautifulSoup, with capitals, of r.content, and we use the HTML parser, like that. Then let's print soup.prettify so it's a bit easier to read.

Okay, so with this we keep our session open, so when we post our login information, which we've created here, to the authenticate URL, which
came from the inspect element tool in the browser, we should then stay connected with our session, which means that when we request the secure URL we should get the information back from that page. Okay, well, we didn't, so we've done something wrong. I can see straight away what it is: we haven't used our session. We've used requests.get as opposed to our session variable. If we change this to s.get, there we go: welcome to the secure area.

So what we've managed to do is log in to the website using the post, with our session as a context manager, and then we've used our session's get on the secure page URL and got the response back. This could be anything: you could log in to whatever website and then go directly to another URL that you can only access when you're logged in, and get that information.

What I want to show now is what I mentioned at the beginning of the video: how you can hide your username and password from your main script, which is always good practice. We're going to create a new .py file, and within that we're going to have username is equal to Tom Smith and password equal to the password, like this. We save that as another .py file; I'm going to call it creds.py, and it's going to be in the same folder, the same directory, as our main script. Then we can import that file into our main program, and by doing that we can use those variables: creds.username and creds.password. What that does is go to this file and get that information. You could then exclude this file from your GitHub upload and just upload the main script, which means no one can see your username and password. Let's just check that works. And there we go, straight back to the secure area.

So that's it, guys. We managed to log into a website
using requests and a session to keep the login alive, and then accessed pages only available behind that login. I've also shown you a way to hide your credentials from your main file, so make sure you get into that habit just by
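
The steps walked through in the captions can be sketched roughly as follows. This is a minimal sketch rather than the exact code typed on screen. The two URLs and the credentials are assumptions: the captions never spell them out, so the values below assume the demo site is the public practice login page at the-internet.herokuapp.com. The names `fetch_behind_login` and `main` are hypothetical, and the BeautifulSoup pretty-printing step is left out to keep the example short.

```python
"""Sketch: log in once with a session, then request a page behind the login."""

# ASSUMED values, not stated in the captions: a public demo login site.
LOGIN_URL = "https://the-internet.herokuapp.com/authenticate"  # where the form POSTs to
SECURE_URL = "https://the-internet.herokuapp.com/secure"       # page behind the login
PAYLOAD = {"username": "tomsmith", "password": "SuperSecretPassword!"}


def fetch_behind_login(session, login_url, secure_url, payload):
    """POST the login form, then GET a protected page on the SAME session.

    `session` is expected to behave like requests.Session: cookies set by the
    login response are stored on it and sent with every later request, which
    is what keeps us authenticated.
    """
    login_response = session.post(login_url, data=payload)
    login_response.raise_for_status()  # only catches HTTP errors (4xx/5xx)

    # Use the session here, not requests.get: a bare requests.get would not
    # send the login cookies and the site would bounce us back to the form.
    page = session.get(secure_url)
    page.raise_for_status()
    return page.text


def main():
    # Third-party dependency (pip install requests), imported here so the
    # helper above can be exercised without it.
    import requests

    # The with-block closes the session, and drops its cookies, on exit.
    with requests.Session() as s:
        print(fetch_behind_login(s, LOGIN_URL, SECURE_URL, PAYLOAD))
```

Calling `main()` runs the sketch against the live site. To keep credentials out of the main script as described at the end of the video, the payload could instead be built from a separate `creds.py` module, for example `{"username": creds.username, "password": creds.password}`, with that file left out of version control. Note that `raise_for_status()` will not notice a failed login on a site that answers a bad password with a normal 200 page, so checking the returned text for something only the logged-in page contains is a sensible extra step.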
Info
Channel: John Watson Rooney
Views: 10,411
Rating: 4.9422383 out of 5
Keywords: webscraping, webscraping with python, webscraping logins, learn python, python tutorial
Id: cV21EOf5bbA
Length: 13min 54sec (834 seconds)
Published: Wed Jan 22 2020