Scrape Websites Behind a Login | Web Scraping for Beginners

Captions
"You're a wizard, Harry." "I'm a what?" "You're a wizard, Harry." I don't know what my intro is about. There we are. So, welcome to this video on web scraping. In this one we're going to log in to a website and then scrape some stuff from behind the login. Wow, crazy, because this is a right pain: normally you try to scrape something and you've got a login. Now, how do you do that? Well, in this video I'm gonna learn you that.

Right, so what's first? Let's have a look at the website we're going to be logging into and scraping. It's this one here. It's just a website that gives us some details about publicly listed companies, and it tells us what percentage of their shares is being sold short. If you don't know what that means, have a look at my Python for Finance series, where I go into all sorts of nitty-gritty details like that. But that's irrelevant for this tutorial, because all we're going to be doing is logging in and scraping.

If you have a look, we're on this watch list page, and this is specific to every user. I'm logged in at the moment, so if we log out and come back to that watch list... oh, we've got a login. Oh no, this is terrible. I'm just going to get rid of all the stuff that's in the log here for a second.

In order to actually view this page with Python web scraping, we need to log in, so what we need to do is look at the requests that go through when we log in. How are we going to do that? We're going to use the Chrome developer tools, on the right of the screen. We go to the Network tab, click "Preserve log" like that, log in, and it's going to show us what happens when we log in. OK, so I've got a username, I've got a password. You're gonna see the password, by the way. Also, by the time this video's out I will have changed the password, so, mm-hmm, not that it even
matters anyway, because I don't really use this website for anything. So we'll log in, and we can see things happening on the right.

Right, so what have we got? I guess it's this one at the top called login. It's a POST method, and it's got a 302 status, which basically means that after the request was fulfilled it redirected us, and what it's done is redirect us to this watch list. So let's look at this one here, and we can see there are some headers. The bits we're interested in are the request headers and the form data, and the form data is down there. Is there anything else we need to be aware of? No, not really, to be fair. Only one other thing, actually, thinking about it: it's called a CSRF token, which is a cross-site request forgery token. All that is is a random token generated by the server on the first request you make to a website, and it's kept with every subsequent request you make to the website after that point. It's just a way of preventing cross-site request forgery. That's just something to be aware of for this particular website, and not all websites have it. If I make it a little bit bigger so you can see it better, that's that CSRF middleware token.

You'll notice here we've got this thing called next. That is so that we can be redirected to the page we want to go to. We don't need to concern ourselves with that, so don't worry about it. We've got the login here, and we've got the password, which is "password". Oh, how clever is that. So this is the form data. The actual request URL is up here: it's accounts/login, and it's a POST method. So we need to do a few things. Let's go over to Python and we'll actually start doing all that, to start us off. OK.
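As a small illustration of the form data described above, here is a sketch that takes a URL-encoded form body, such as the one the Network tab shows for the login request, and splits it into field names and values. The token value and the `next` path are made-up placeholders, and it is an assumption that the site sends a standard URL-encoded form; only the field names follow what is read out in the video.

```python
from urllib.parse import parse_qsl, urlencode


def form_fields(raw_body):
    """Split a URL-encoded form body into a {field: value} dict."""
    return dict(parse_qsl(raw_body, keep_blank_values=True))


# Placeholder body: the token and the "next" path are invented for illustration.
raw = "csrfmiddlewaretoken=abc123&next=%2Fwatchlist&login=my-login&password=password"

fields = form_fields(raw)
print(list(fields))  # ['csrfmiddlewaretoken', 'next', 'login', 'password']
```

Listing the keys this way makes it easy to check that the payload built later in Python uses exactly the same field names as the browser's request.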
We'll open the terminal and we will save this as main, like that. First thing we need to do, let's do your standard gubbins: we're going to say from bs4 import BeautifulSoup as bs, and then also import requests. There we go. So let's get the URL, which is just short tracker, forward slash, like this. Then we want to have the login route, which was accounts/login, like that.

There's another thing we need to do, and I did actually forget to mention this, even though I said there was nothing else. These headers, the request headers, tell the website, the server, that we're actually viewing this using Chrome, that we're a real person and not just a bot, and we can mock those using a headers object. So we can take this User-Agent bit of the headers and put it in our own headers. We're gonna create a little object here called headers, and we'll say User-Agent, that's user dash agent, like this, and we'll paste that in as a string. I didn't even copy it. Oh well, let's copy it and try again. Second time's a charm; no one ever says that.

There's also another thing as well. Again, there are actually two things I should have mentioned that I've forgotten about for this specific website. When I was looking at this earlier, there was another thing we needed to add in here, and that is an Origin URL. We actually might not need the Origin; we were going to try it without because I can't see it here. But one thing we do need is the Referer, for when we're logging in, so we need to make sure we have a Referer. So the Referer is going to be... actually, you know what, we're putting the Origin in just to be safe. Origin: this is where we've come from to make the request. Not all websites require this, so check your
headers in the Chrome console if you're looking at a different site, and try to match them up as best as possible. But do it in a minimal way: don't just take every single header out there, because it's a waste of time. So the Referer is going to be the URL plus the login route. The reason for the Referer is so the website knows where we've actually come from, because we could log in from many different pages. It's not the most important thing, but this particular website will throw some wobblies without it. So yeah, we do actually need the Origin and the Referer. They're in the Chrome console: you see the Origin is here and the Referer is under that. We don't need the next watch list thing for what we're doing. We know what page we're going to go to next; we just need to log in so we can get some cookies.

So we've got our headers, great. The next thing we're going to do is use requests to create a session object. We'll just call it s: we'll say s is equal to requests dot session. That's not a capital S there, it should be a normal s. What a session does is hold a lot of the data that we normally get back. If we have a look over here, we've got cookies that come back; we get many little things that come back in the response headers, which are over here. All the session does is keep all that in memory, like a normal browser does, so we don't have to carry on posting stuff. It's just going to remember that we're logged in, essentially.

The next thing we need to do is actually make an initial request to the website. The reason for that is the CSRF token I mentioned: we need to get that first. We can't just copy this token from here, because like I say it's randomly generated by the server, so it would stop working pretty quickly. So what we need to do is just make a standard GET
request to get the token. So we're gonna say csrf_token is equal to s.get. We don't need to say requests.get, because we're going to be using that session object for everything from now on. If you've seen the previous tutorials, we used requests.get, but now we're using s.get, i.e. we're using the session. Then we're gonna get the cookies, and because this is a dictionary-like object, we're just going to say I want the key, the csrftoken. That's wrong... that's right, there we go. So that's just gonna get us that token. The reason I know what it's called is because it comes back in the response headers. Where is it? It's in here somewhere. There we go: the Set-Cookie is csrftoken, and it's in our response headers. I should also mention that we get what are called request headers, which are what we pass to the website, and response headers, which are what it sends back. I forgot to mention that, but that's what happens there.

OK, so we've got the token. What do we need to do next? We need to create a login payload object, which is basically going to be the same thing that we have here in this form data. All these things usually are is a sort of key-value store. So we say login_payload, just to keep things nice and simple so we know what's in it. We're gonna say login, and the login is this, like that. We're going to say password and put in the password. Like I say, that password's gonna be changed by the time this video is uploaded. And then we're also gonna add the CSRF... I'm just gonna say "the token" from now on, because I keep messing the letters up. My god.

Right, so we have got the login payload, and we're going to then post this to that login route. Something I really want to make sure you're aware of is that these keys here (login, password, and the token) have to match exactly with what this says in here,
what the form data says. For example, on the website you're using it might say email or username here, as opposed to login. Just make sure it matches exactly, because otherwise you'll be getting loads of errors, and it'll look like you've followed the tutorial step by step but it won't work. I know this because I had a complete issue with this nonsense the other day when I was trying to do something very similar.

So let's go and make the request and log in to the website. OK, login_req is equal to s.post, then the URL plus the login route. I don't know why I've named it that; why didn't I just name it login? It doesn't really matter until we try to run the program and it doesn't work. Next, headers is equal to headers, and data is going to be equal to login_payload. If I quickly print login_req, what we're expecting it to say is Response 200. The reason we're expecting that is because it should have gone through correctly, and 200 is an OK response; it's like, yes, all is well. If it comes back with something like a 400 error, we know something's gone wrong. OK, so we save that, and we'll run it. Oh, we got Response 200. What did I say? I said that's what we'd get. Wow, that's surprising. As anyone who watches these videos knows, the first time I run things it usually just doesn't go well. So hey, that's working.

We could use that response. We could take the status code, for example, and use it in an if statement: if it's not 200, then we need to loop back and do it again, or maybe throw an error or report it to the user, or something like that. So you can use that status code we just had there. The next thing we need to do is get the cookies
that came back with the login response. So we say cookies is equal to login_req.cookies, and that just grabs the cookies that came back. I noticed when I was testing this that sessions are sometimes a bit special: they don't always do what you want them to do. You would expect them to keep the cookies, but for some reason they're not doing it all the time, so I'm just going to explicitly give them the cookies when we're doing a request.

The next thing we're going to do is use BeautifulSoup. We're not going to go into depth on scraping here, because we've done the login, which is the important bit. I'm just going to show you that it actually does work. So we say soup is equal to bs of s.get of the URL plus the watch list, dot text, and then we say html.parser, because that's the parser we want to use. Other parsers are available; we don't have to use that one.

As for what we're going to look for, let's inspect this: right-click, Inspect. You can see here that there's a table with the ID companies. If we get that, we're expecting to see in the body the company, a PLC, then 7.1 or something like that for the percentage, five funds that are short it, and then the most recent change. That's what we're expecting. So if I quickly say tbody is equal to soup.find, and we say table, we can say id is equal to companies, like that. See previous videos for more examples of how to actually scrape, because there are other ways of doing it. If you want to get elements by class, say, we do that in a different way, and we can also filter elements based on things like the type, again in a different way with BeautifulSoup. So see
previous videos. OK, so all I'm going to do now is print the tbody, and I expect to see the HTML that we're seeing here. We're going to be looking for 7.1%; that's what we're looking for. Right, we'll do that and run it. And there we go: we can see that we've logged in, because we've got the 7.1%, we've got five funds short, and we've got the PLC. These are the companies that my account is watching, which is really good, because obviously if we log out and go to the same page again, look, there's none of that there; the companies I'm watching aren't shown. And you don't have to be logged in in Chrome for this to work either: we can run it again and it will carry on working, because we're logging in with Python.

So hopefully you found my inane rambling somewhat entertaining, and you found this video somewhat useful. If you have, leave a like down below. Also leave comments, because I do read them all, especially if you have questions, because I will pretty much answer every single question that's reasonably well formed. So yeah, thank you very much for watching. Have a good one.
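The steps narrated in the captions can be put together roughly as follows. This is a sketch under stated assumptions, not the code shown on screen: `BASE_URL`, `WATCHLIST_ROUTE` and `USER_AGENT` are placeholders, the cookie name `csrftoken` and the form field `csrfmiddlewaretoken` are inferred from the narration (they match Django's defaults), and the helper function names are invented for illustration. The functions accept any session object with `get` and `post` methods, which in practice would be a `requests.Session()`.

```python
# Placeholder values: the video's real site URL and paths are not reproduced here.
BASE_URL = "https://example.com"
LOGIN_ROUTE = "/accounts/login"   # login route as read out in the video
WATCHLIST_ROUTE = "/watchlist"    # hypothetical path of the watch list page
USER_AGENT = "Mozilla/5.0"        # stand-in; paste your browser's User-Agent string


def build_headers(base_url=BASE_URL, login_route=LOGIN_ROUTE, user_agent=USER_AGENT):
    """Return the minimal request headers copied from the browser's dev tools."""
    return {
        "User-Agent": user_agent,
        "Origin": base_url,
        "Referer": base_url + login_route,
    }


def build_login_payload(login, password, csrf_token):
    """Return form data whose keys mirror the login form's field names."""
    return {
        "login": login,
        "password": password,
        "csrfmiddlewaretoken": csrf_token,  # assumed field name (Django default)
    }


def log_in(session, login, password):
    """Fetch a CSRF token with a GET, then POST the credentials."""
    headers = build_headers()
    # The first GET makes the server set the CSRF cookie (assumed name: csrftoken).
    csrf_token = session.get(BASE_URL, headers=headers).cookies["csrftoken"]
    payload = build_login_payload(login, password, csrf_token)
    response = session.post(BASE_URL + LOGIN_ROUTE, headers=headers, data=payload)
    if response.status_code != 200:
        raise RuntimeError(f"Login failed with status {response.status_code}")
    return response


def fetch_watchlist_html(session):
    """Request the login-protected page using the same session."""
    return session.get(BASE_URL + WATCHLIST_ROUTE, headers=build_headers()).text


def find_companies_table(html):
    """Return the <table id="companies"> element, or None if it is absent."""
    from bs4 import BeautifulSoup  # local import: only this step needs bs4
    return BeautifulSoup(html, "html.parser").find("table", id="companies")
```

With `requests` and `beautifulsoup4` installed, usage would look like `s = requests.Session()`, then `log_in(s, "my-login", "my-password")`, then `print(find_companies_table(fetch_watchlist_html(s)))`. Two caveats: `requests` follows the 302 redirect by default, which is why the script sees a 200 where the browser showed a 302, and many sites also answer a failed login with a 200, so finding the table on the protected page is a better sign of success than the status code alone.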
Info
Channel: Shane Lee
Views: 13,435
Rating: 4.9287834 out of 5
Keywords: Python, web scraping, web scraping login, BeautifulSoup4, Requests, Login with python, Login with python requests, Sendex
Id: SA18JCBtlXY
Length: 17min 59sec (1079 seconds)
Published: Tue Jan 21 2020