The Biggest Mistake Beginners Make When Web Scraping

Captions
If you're trying to take the data from the front end of a website, there's a good chance that you're doing it wrong and you're not going to get what you need. Modern websites are made up of a front-end and a back-end system, and it's the back end that has all the data that we want. So why would we make a request to the front end when it's the back end that's actually got the data?

To work this out, we need to talk a little bit about how a modern website works, including CORS, which is cross-origin resource sharing. The front-end website that we load up in our browser is pretty much always JavaScript, probably whichever framework is most popular at the time. When you go to the page, it will use something like Ajax with Axios to make a request to an endpoint on the back end, which will be completely separate. The back end then sends that data to the front end so it can be displayed and rendered properly for us, the end user. What we want to do is go straight to the back end and get the data, but it's not going to allow us to do that unless we pretend that we are coming through the front end, through CORS, which is generally going to involve a cookie.

So I'm going to walk you through an example that I've done here. I'll show you the code and tell you why I've made some of these decisions and what they mean, and also how you can take a cookie from a headless Chrome loaded up with something like Playwright, which is what I use in this case, and then send it to requests. That way we can get a new cookie every time we want to do this, because cookies do expire.

Before we get to that, today's video is sponsored by Skillshare. Skillshare is an online learning community with thousands of classes ready to help you explore your creativity and inspire you, whether you have a specific skill you're trying to learn or, like me, you like to use the breadth and depth of classes to help with the other parts of personal growth that support your side projects. This week I've been watching "Creativity Unleashed: Discover, Hone, and Share Your Voice Online" by Nathaniel Drew. Nathaniel is a YouTuber who I am very familiar with, having followed him online for several years now, and I was very excited to take his class. I believe there's great value to be had in watching and learning from someone who's out there creating and making stuff every day, and this was exactly that. The first 1000 people to use the link in the description below, or my code, John Watson Rooney, will get one month of free access to Skillshare. So once again, click that link in the description below or use my code, John Watson Rooney, and thank you to Skillshare for sponsoring this episode.

So let's move over to the actual website, which I've got here. You'll see that when you load this up for the first time, especially in private browsing, it tells you that you need to accept cookies. This is very common, and this is exactly what we need to do. I'm going to hit "Accept all", and it's going to load up the page with all the information on it. Here's the list, and it's all done in a nice fancy way: you click on it and it loads up more stuff, and so on. We're all familiar with how these websites work.

Now we're going to go to the Inspect Element tool, go to the Network tab, and hit reload. We can see that the front end is making requests to the back end for the actual information. There are quite a few here, but the one I'm going to show you is the page data. In this one we have the request headers and the response headers, and we can see that the actual response, even though it has been truncated in this case (I'll come back to that), has the information from the website that we are after.

So what we want to do is make this request ourselves. But it's not that simple, because we need to obey the rules of CORS, cross-origin resource sharing, so we need to have a cookie to mimic the front end and be a part of this. In my previous videos, if you've watched any of those, I've said just copy it as cURL and use Postman or Insomnia. That's great and it works, but when you get to the point where you need a new cookie, you have to make a new request.

What I did is copy as cURL and open up Insomnia, which I've got here. I've been through the header section of the request and unticked all of the ones that I don't think we need, except for the cookie. When I run this it takes a second because, as I said, the response is quite big. On the other side you'll see that we get this neat JSON data with all of the information that we could possibly want. This is the information that the back end has sent to the front-end part of the website, which has rendered it all nice and neat. You can click through, and every time you click on a person's name it makes a new request. That is its own endpoint, but we're still using the same cookie, so if you wanted to you could expand on this and get the information from each one of these as well.

Let's go back to Insomnia, or Postman, or whatever you're using. If I untick the cookie and tick everything else, so we don't send the cookie (you can see here's our CORS and everything like that, it's basically all of the information that's being sent over), and we send this, we get this blank page. Basically, there'll be some JavaScript in the response, which Insomnia is not loading, telling us that we need to have a cookie, or need to accept the cookie, or something similar. So let's unselect all of these again, tick the cookie back on, and run this. Now we get all the information back. So the cookie is the most important header; this is what's identifying us.

What I like to do from here is use my API tool to generate some code for me. Because I've only got the cookie header selected, that's the one that's come out, and this is the one that we need. As I said before, we could use this code exactly as it is, paste it into VS Code or whatever, and it would give us that JSON data. But as soon as this cookie expires, and that's different for different websites, this will no longer work. We need to make it more repeatable, and that's where we're going to use Playwright to load a browser up.

If we go back to our code, you'll see that I'm using Playwright to load up my Chromium browser, and I'm asking for the context, because the context is where the cookie information is. If we come back to one of my working files, which is just the Playwright part, and I print out the cookies from the Playwright context, you'll see that it loaded the browser up, because we needed it to do that. I'd normally have headless set to true, but it's false at the moment so I could see what's going on. You'll see that we get this dictionary back with all the cookies, with all the headers rather, and this is the one that we are interested in. It should be very similar to the one I was passing off into requests.

So we want to take this out and move it into requests. The reason I wanted to do that was the size of the JSON response I was getting. If you're trying to do this on a different site and the JSON response that you're after is not that big, you could stop right here and get the response.json. But because the JSON that we're getting back from this website has so much information (you can see it's super long), it was too big and it was causing my Playwright to fail. That led me on to pushing the cookie into requests, which I think is quite valuable.

So we can go back to it here, and we can see that I'm taking the cookie for requests from the cookie context at number three, which was the third index of the list, and grabbing the value. Taking the code that Insomnia had generated, we can see that the cookie is in this format here, which is specific to how requests is going to send it over; they're just formatted slightly differently. All I did was copy this into here and use an f-string to add in the actual cookie part, with all the information I was getting back from Playwright. That means we can use the same cookie, and we could have a session in here: if we wanted to make the other requests like I showed you down here, the ones with all the extra specific information, we would use a requests session to use the same cookie over and over again. From here it was just a case of printing out the JSON, and I've specifically indexed it down here; this is actually all the information.

What I liked about this was using Playwright to do one thing, grab me the cookie, and then pass it off to requests to make that request. If we didn't have the cookie to send through with requests, our request would fail, like I showed you when we were doing it in Insomnia.

I'm going to put this code in the description down below for you to have a look at and have a play with. What I was trying to show you here is that if you're trying to get data from a website, you're trying to grab it from the front end, and it's a modern website, you really want to put your efforts into grabbing it from the back end directly, using the cookie that you can grab this way, or from the actual request you made in your browser initially, if that works for you.

If you've enjoyed this video, I think you're going to like this one here, which goes into this method in a slightly different way, but with more in-depth coding, so that might be more useful to some of you.
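
The cookie hand-off described in the captions can be sketched roughly as follows. This is not the code from the video, which is not reproduced on this page; it is a minimal sketch under stated assumptions. `PAGE_URL` and `API_URL` are hypothetical placeholders (the video does not name the site or its endpoint), and the function names are made up for illustration. The video picks one cookie out of the list by index and drops it into an f-string; this sketch instead joins every cookie Playwright returns into a single `Cookie` header, which avoids relying on list order.

```python
from typing import Any, Dict, List

# Hypothetical placeholders: the video does not name the site or the endpoint.
PAGE_URL = "https://example.com/"
API_URL = "https://example.com/api/page-data"


def cookies_to_header(cookies: List[Dict[str, Any]]) -> str:
    """Join Playwright-style cookie dicts into one Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


def fetch_cookies(page_url: str) -> List[Dict[str, Any]]:
    """Load the page in headless Chromium and return the context's cookies."""
    # Imported here so the pure helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context()
        page = context.new_page()
        page.goto(page_url)
        # A real site will probably need its cookie banner accepted at this
        # point; the selector for that is site-specific and not shown here.
        cookies = context.cookies()
        browser.close()
    return cookies


def fetch_json(api_url: str, cookie_header: str) -> Any:
    """Request the back-end endpoint directly, sending only the Cookie header."""
    # Imported here for the same reason as Playwright above.
    import requests

    response = requests.get(api_url, headers={"Cookie": cookie_header}, timeout=30)
    response.raise_for_status()
    return response.json()


def main() -> None:
    """Run the whole flow: browser for the cookie, requests for the data."""
    cookies = fetch_cookies(PAGE_URL)
    data = fetch_json(API_URL, cookies_to_header(cookies))
    print(type(data))
```

Calling `main()` runs the whole flow once. Because a fresh cookie is fetched on every run, an expired cookie stops being a problem, and for the per-item endpoints mentioned in the captions the same header could be reused through a `requests.Session`.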
Info
Channel: John Watson Rooney
Views: 104,570
Keywords: web scraping, john watson rooney, web scrapping, python web scraping
Id: G7s0eGOaRPE
Length: 10min 20sec (620 seconds)
Published: Wed May 04 2022