Web Scraping sites with Session/Cookie authentication using Nodejs Request

Captions
Today we're going to look at a website where you have to log in before you can scrape the content we want. The website also expects you to have a cookie set, and to pass some data from that cookie into the login form. So it's a slightly trickier example of logging in with web scraping, but I'm going to show you step by step how to do it.

First, the website. It's an internship website for Indian students; I think one of my students asked me how to scrape it. First you need to register if you don't have a user there. If you don't want to use your own email, you can register with a disposable email, like a temp mail service. So go ahead and make a user.

Then we go to the login form. Let me show you what we're doing here. I open up Chrome developer tools, go to the login form, and watch the Network tab. Now I log in, and it shows internship postings. These are the postings we want to scrape. So how do we log in to this website using Node.js request? I'm going to show you that now, step by step.

Now that we've logged in and have the Network tab open in Chrome tools, let's take a look at the requests. This is the login request. You can see that it's a POST request, it has a URL, and it has a lot of headers and cookies being set. You can see my form data for the form here, and yes, that's my password. I'm OK with that; I will change it, and I don't care much about this account anyway. So that's the form data that's been sent, and those are all the headers.

A quick tip: if you want to test this request in Postman without copying all the headers and form data manually, right-click the request and choose Copy, then Copy as cURL. Then go into Postman, open a new request, choose Import, then Paste Raw Text, and paste in the cURL request. Postman fills out the request with all the headers and the body already set, which is really convenient. Then we can test that the request works in Postman, and we can see it still works. It says "you're already logged in", which is fine, because I am already logged in; it sees that the cookie is already logged in for this session.

What we want to do now is replicate this in Node.js request. So let's see how much we need for this request to work. If I disable the cookie header here, and also delete the cookies in Postman by clicking Cookies up in the corner and clearing them all out, so that no cookie is set on the request, then we get a 403 Forbidden. The response says something about CSRF: the request is not allowed. So we need cookies set on this request, and the cookies are initially set when we go to the front page of this website. If I log out, the cookies are set just by visiting the site. That's sometimes the case with authenticated websites: you have to visit one page, have a cookie set, and only then can you log in. So sometimes you have to do that in Node.js request as well, and let me show you how.

First let's make an empty folder for the project. So mkdir, let's call it cookie-auth-scraper, and open it in Visual Studio Code. Here we have a totally empty directory opened in Visual Studio Code. Let's add request and request-promise. Now let's create an index.js file and start making some requests. So const request = require('request-promise'), and let's make an async function. Let's call it main, and call it down here at the bottom.

Then we can say const result = await request.get. We need to get the front page first so that our cookies are set inside request before we do the login request, because remember, if the cookie isn't set, the login returns 403 Forbidden. The site is internshala.com, and we await that. Then let's also take the login URL we found in Chrome tools, this POST here, and do that afterwards: const loginResult = await request.post. We have the URL, and we also need to pass in my email and password. So after a comma we add form, with email set to my email and password set to this one.

Let's see how that fares: node index.js. We get status code 403. Why do we get a 403? Because we're not saving the cookies at all inside request. So let's call defaults, open a curly brace, and set jar to true. When jar is set to true, request saves the cookies from request to request. So when we visit this page first, request stores the cookies in its cookie jar, which is what they call the place where the cookies are stored, and it sends the same cookies with the next request, in this case the POST login request. So we save the cookies by setting jar to true.

Now let's run node index.js again. It looks like we get the same kind of response. It says something about CSRF: "The action you have requested is not allowed." If we look at our Postman request, we can see there's also a CSRF parameter being sent in the POST request body, and if that isn't set, the site returns this response as well. If you only have the email and password set, it returns this response with 403 Forbidden.

This isn't the case for every website. Here they have some sort of CSRF protection, so they check for it when you do the POST request. Sometimes you would already be done with the login once you enable the cookie jar, but this site does want the CSRF token set. In the next lecture, let's look at how to pass this CSRF token into the POST request.

If you take a closer look at the login request in Chrome or in Postman, there are lots of cookies being set, and one of them is called csrf_cookie_name. Its value is exactly the same as the one in the form we sent for the login. This CSRF cookie has already been set by the time we've visited the front page of the site. So the question is: how do we get this CSRF cookie out of request and use it in the form? Because without this form data, the request returns 403 Forbidden. Let me show you how.

First we need to make a separate cookie jar so that we can get the cookies out of request. As it stands, we can't say something like request.getCookies; there's no method like that on request. So: const cookieJar = request.jar(), and let me put it down here. Calling it like that just makes a cookie jar for you to use with request. Then we say request = request.defaults with jar set to cookieJar, and we have to change request to a let instead of a const. Let's remove the defaults up here, because we set them down here instead. Now we have a separate cookie jar, and we can call things like cookieJar.getCookieString, getCookies, setCookie and so on.

Now let's see what happens after we visit the first page. We call cookieJar.getCookieString, and with this method we have to pass in the site whose cookies we want. So we call getCookieString with the URL, and console.log the result to see what we have. Let's comment this out and run node index.js. Here we have csrf_cookie_name with its value, and this is the value we need to pass in the form. We can see csrf_test_name in the form, and we need to use the value there.

So we need to do some string splitting to get this exact cookie key and value. Let's put this into a cookieString variable. Then const splitByCsrfCookieName: we take cookieString and split it by csrf_cookie_name. There's really no easier way of doing this when you're just using the default request cookie jar. If you want to do more serious cookie manipulation, you can use an external cookie jar, but for now I'm using the default one in Node.js request, and trust me, it's going to work fine.

Now let's use Run and Debug to test it, with a breakpoint at the bottom. We have cookieString, then splitByCsrfCookieName. The item we want is the value right after csrf_cookie_name. You can see the cookie name here, and when we split by it, the second item in the split array holds the csrf_cookie_name value.

Now we need to split again. Let's see what the form field name was: csrf_test_name, so let's call the const that. We take splitByCsrfCookieName and split again. Why is it giving that error? Ah, I can't call split on it directly, because I have to pick the second item of the split array first. So we have this one, where the second item in the array is the cookie value, and then this one, where we split by the semicolon to get the value only, and that's the first item in the array.

Let's go through it again. Here's the cookie string. We want only this key and value. We split by the name, and the second item in the array is what comes right after the name, which is the cookie value we want. To get rid of the semicolon, the space and the other values, I split again by the semicolon, and we need the first item in that array, so we add [0] at the end. Then we have the exact value we're looking for, 9a9a58..., which matches the raw cookie string.

That's not something I usually have to do when automating a login to a website. It's only in cases like this, where they have CSRF and pass the token into the form. Forms sometimes do that with values, but not always. Sometimes you can get away with just enabling cookies and you can authenticate right away. This one is a little trickier, but I think we've got it now.

So now I pass the CSRF value in the form. Let's run node index.js, and also console.log the login result. Here it says success true, success page dashboard. That means we're now logged in. The server has recorded in this session that this cookie is authenticated, and we can now scrape the rest of the pages behind the authentication.

So we can go to this URL, which is the matching preferences page, and say something like const matches = await request.get with the URL. Notice we don't have to pass anything else, because the cookies are already set inside request. Now let's save this page to check that we get the right page back. We say const fs = require('fs') to save the file from Node.js, and then fs.writeFileSync with matches.html and the matches we got back. Run node index.js. Now let's check it in Chrome, so Open With Google Chrome, and indeed we can see all of the internship matches that we downloaded with Node.js request.

From here we can do any sort of regular web scraping to get all the items on this page. I think I've already shown you that in the course, so this example was more to show how to log in and authenticate on a website that uses cookies and session authentication with a bit of CSRF tokens. I hope you got a lot out of this. If you have any questions or suggestions, please let me know, and I'll see you around.
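The flow described in the captions can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the video: `getCookieValue` and `loginAndFetch` are hypothetical helper names, the option names (`baseUrl`, `loginPath`, `protectedPath`) are placeholders for whatever URLs you find in the Network tab, and the spellings `csrf_cookie_name` and `csrf_test_name` are assumed from how the names are read out. The cookie lookup here splits on `;` and matches the cookie name, which is a slightly different approach from the two chained splits used in the video. The client is passed in as a parameter, so the block itself has no package dependencies.

```javascript
// Sketch only: names, paths and cookie/field spellings are assumptions.

/**
 * Pull one cookie's value out of a "k1=v1; k2=v2" cookie string.
 * Returns undefined when no cookie with that name is present.
 */
function getCookieValue(cookieString, name) {
  for (const pair of cookieString.split(';')) {
    const trimmed = pair.trim();
    const eq = trimmed.indexOf('=');
    if (eq !== -1 && trimmed.slice(0, eq) === name) {
      return trimmed.slice(eq + 1);
    }
  }
  return undefined;
}

/**
 * Log in with a CSRF cookie/form pair and fetch a page behind the login.
 * `request` is a request-promise style client (jar, defaults, get, post).
 */
async function loginAndFetch(request, options) {
  const { baseUrl, loginPath, protectedPath, email, password } = options;

  // A separate jar lets us read the cookies back out later.
  const jar = request.jar();
  const client = request.defaults({ jar });

  // 1. Visit the front page so the server sets its initial cookies.
  await client.get(baseUrl);

  // 2. Read the CSRF cookie that the front page set.
  const csrf = getCookieValue(jar.getCookieString(baseUrl), 'csrf_cookie_name');
  if (csrf === undefined) {
    throw new Error('CSRF cookie not found; check the cookie name');
  }

  // 3. Post the login form, copying the cookie value into the form field.
  await client.post(baseUrl + loginPath, {
    form: { email, password, csrf_test_name: csrf },
  });

  // 4. The jar now carries the authenticated session cookie.
  return client.get(baseUrl + protectedPath);
}
```

With the request and request-promise packages installed, this could be called as `loginAndFetch(require('request-promise'), { ... })` and the returned HTML written to disk with `fs.writeFileSync`. Passing the client in also makes it easy to swap in a stub when testing the flow without touching the network.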
Info
Channel: ReactNativeTutorial
Views: 17,058
Rating: 4.9326925 out of 5
Id: nfbTyKFy6VU
Length: 19min 48sec (1188 seconds)
Published: Mon Apr 13 2020