Scraping Dynamic Websites WITHOUT Selenium

Video Statistics and Information

Captions
[Music] Over here, this is you, and over here, this is a web server. When you type a URL into your search bar, you send an HTTP request to some server somewhere, and the server responds with an HTML file.

When we talk about dynamic websites, there are two things that can happen after this point. The first scenario is what we call DOM manipulation. What that means is that there are instructions on the web page to alter the HTML in some way. It takes the file you have and updates it locally, on your computer. Basically, it takes data that would be over here in your file and puts it over there instead. So the data you're looking for is actually in the original HTML; you just have to look for it in a different place.

The second situation is called AJAX, which stands for asynchronous JavaScript and XML. What that means is that part of the instructions in your file ask it to make another request to the server, the server comes back with more data, and once it does, the file gets updated on your end.

So in the first case, the DOM manipulation case, the data is already in the initial response you get, while in the second case the data is not in the initial file; it's somewhere else. In other words, you can't rely on just the usual HTML you visit. You have to figure out where the data is coming from and request it from there instead.

When you visit the page and look at it in your browser, you're looking at the file at this point here, when it has loaded all the additional data and moved stuff around; it's the final version you're seeing. Whereas when you are scraping with something like Scrapy or
Beautiful Soup or whatever, you're usually dealing with this one here, the first version that comes back. That's why you might sometimes see something in your browser that you can't find when you're using Scrapy or Beautiful Soup: it's not there yet. Either the file gets changed by the browser after it gets loaded, or it requests additional data and puts that in.

When you're using Splash or Selenium and those kinds of things, you are basically waiting until this final version of the HTML is ready, and then you're working on that version. As you can see, there's a gap here, which means it takes some time to load. If you're loading just a handful of pages, that's probably okay, but if you're doing a very big project where you're scraping maybe thousands, tens of thousands, or hundreds of thousands of pages, this extra time could be detrimental and really slow your spider down.

Also, in some cases it can be easier to detect you if you're using something like Selenium or another headless browser. So if you're dealing with websites that have some kind of anti-bot protection, you might end up getting caught by that protection if you're relying on a headless browser rather than just using the initial HTML.

I've got an example here of a website that we're going to look at. This is an Austrian grocery store, and the reason I'm showing you this one is that it has both kinds of dynamic behavior in there, so I can show you both cases and how I would investigate and then handle each of them. So to start off, we're going to begin
with the DOM manipulation. An example of DOM manipulation is up here in the mega menu. Let's say I want to scrape all the products on this website, which means I probably want to get down to this link down here. Usually you would right-click it, inspect it, and see where it is in the page.

I'm just going to open this up in a Scrapy shell. While I'll be doing a lot of this using the Scrapy shell and Scrapy tools, you can use the same approaches if you're using Beautiful Soup or anything else. The way you request websites could be different, but the way you investigate and solve the issues will be quite similar.

So now I've got my shell open, and I'm going to import scrapy_gui, because I can use that to better visualize what I'm doing. Then I'm going to do scrapy_gui.load_selector(response), and when I do that, a window opens. I've zoomed in on this view just to make sure you can read what I'm typing.

Basically, I'm trying to get this link here, and I can see that this link is in an a element with a nav item button class. So let's go ahead and copy this (clicking on this can be super annoying). Now I go back here, I enter a dot followed by that button class, and I'm going to get the href out of that. Now I run my query and I get this pop-up telling me that there are no results for this query. So even though I can see here on the page that this exists, it's telling me that this does not
exist, and that's because, again, this is a dynamic website. Every time you open this mega menu, it builds the panel you see. So the data is on the website; it just isn't always in the same place. Now we need to figure out where on the page this data actually is.

I'm just going to close this for a second and open up the page source. Then I'm going to copy the link and search for it here. I just need to change it, because what I copied included the domain. And here we go: you can see that this path appears once on the page, here in this giant block of JavaScript. So what we need to do is figure out how to get these links out of this giant block of JavaScript.

For this we are going to use a package that a coworker of mine is maintaining, I believe, called chompjs. Here's the PyPI page for it, and for people who are watching this on YouTube later, I'll have a link in the description. It's a great package to download, especially for these kinds of dynamic pages.

You can see here that the section I'm looking for looks a lot like a list of dictionaries, which is true for these objects: because this is JavaScript, it's using JSON objects, and JSON has pretty much the same syntax as Python. What chompjs does is this: you can feed it these kinds of filthy JavaScript things, and it's going to figure out where the arrays or the JSON objects start and pull out just the bits you need. It also fixes the content inside of them to make sure that you can actually parse it as a JSON object. So it's going to turn these giant blocks of JavaScript into
dictionaries, which are a lot easier for you to work with. Let me show you how we're going to do that. I'm going to go back to my shell and reopen scrapy-GUI, and I might just change the size a little bit.

If I move this over to this side, you can see that this is inside a script, and inside the script there is this string. So I'm going to look for a script element that contains this string: we do script, then contains, and then the string right here. I'm going to get the text as well. Now I run my query, and you can see that I am getting this text out of here.

What I need to do now is get this specific JSON array out of it instead. So I'm going to reopen scrapy-GUI and use its 'use function' section to create a Python function that pulls out the bits I need. I go up here and say: from chompjs import parse_json_object.

The way this works in scrapy-GUI is that this results argument is a list of all the results, which in this case is just one thing. So I'm going to say that my text is equal to results[0].

Another thing that's important is how chompjs works: if I feed in just this string, it's going to look for the first curly bracket or the first square bracket and treat that as the start of what it's supposed to parse, and then it's going to end where that bracket is closed. This means that if I throw this text directly into chompjs, it's not going to bring out the part I'm looking for. It's going to bring out just this one here, because this has a curly bracket
that opens and then closes over here, off the screen. Obviously I don't want that, so I need to edit this text so I'm getting just the bit I want. I'm going to say that start_index is equal to text.find, and I'm going to find this text here (there's an extra quote there). That finds where it is, and that's going to be the start of my slice.

Then I'm going to say that my data is parse_json_object of the text, sliced from start_index to the end of the string. So basically it's going to start here at the main navigation text and take everything from there down to the end. But then chompjs is going to see that there's a square bracket right here, treat that as the start of the object, and keep going until it gets to the matching close all the way over here. So it's only going to parse this one line. Then I just return data, because this is going to be a list of dictionaries.

So I run my query, and I'm getting an error. It can't import that; let me just check, maybe I misspelled it. Oh yeah, it's not parse_json_object, it's parse_js_object: it's not parsing JSON, it's parsing JavaScript. Let's run this. Oh, and I have to change it down here as well. There we go.

You can see here now my list of dictionaries. I'm going to take the first one, close this down, and paste it in over here in Python. You don't need to worry about the red lines here. They show up because in a JSON file all the fields should have double quotation marks, but Python gives them single quotation marks. That's only a style error; the actual structure of the dictionary is still going to
be valid. I can see here that there is a URL, which links to the first category, and then there are children: a list of all the children of that category. Each one of those has its own URL and sometimes its own children. So at this point I would probably need to create some kind of recursive function to go through this list of dictionaries and pull out all of the URLs. Once that's done, I'd have all my URLs and I'd be able to continue with my scraping.

So that's a pretty complex example of the first case, the DOM manipulation case. A couple of things I'll point out before we move on: what I've shown you here is a very complex example, and most of the time they aren't this bad. Usually you're dealing with stuff like this, which is JSON-LD, and with this you can basically just take the text, throw it straight into chompjs, and it comes out as JSON. This is usually what you're dealing with, and it's a lot easier.

Another common one you see a lot is meta tags. Sometimes these meta tags can be used to get information about the product. Something I've seen a lot in my work is that when I'm trying to scrape products, there might be a meta tag that has the name of the product and the price in it, and then some kind of dynamic stuff later that actually puts the price into the page. I can just get it from the meta tags instead.

So I'd say step one when you're dealing with DOM stuff is to look for meta tags. The second one is then you look for, usually, different script tags, and
inside the script tags you may have just JSON, in which case all you have to do is throw it into something like chompjs and you're good. Or you may have an actual piece of JavaScript code, and then you might have to do a little bit of string manipulation to get the bit you're looking for. That's all you have to do with DOM manipulation, and it's probably the more common of the two; at least in my experience, I see DOM manipulation a lot more than I see AJAX. So I'll take a drink and then we'll move on to the next one.

For this next one, if I go to one of these links, this one here, I want to grab all of these products. Again, usually you right-click and inspect the element, and if you scroll up and keep hovering, you will eventually see that all of these are in an li with the tile class, and there's an article with the product class involved as well.

So I'm going to fetch that page in my shell and then open up scrapy-GUI. Here we have an li with the tile class, and down below that is an article with the product class, and that will allow us to select our products. So I'm going to enter .tile .product in here. This should be giving me all of these products, but when I actually run it, I get no results. The reason I get no results is, again, that these classes are not in the original page when I load it. So let's do the same thing as last time: we're going to open the source and see if we can
find it in there. Go to the source, and usually what I do is just take a bit of the name. Here there's a banana product, so I'm just going to search for 'banana', and you can see there is no banana in this page at all. Even though I can see it written here on the page, the search gives me no results.

This is because what we're looking at here is the output of an AJAX request. When you first load the page, it doesn't actually have any products on it. If I bring my diagram back up: you load the page, it gets to this point, and it says, 'Oh, I need products.' So it goes to the server and asks, 'Hey, what products should I have on this page?' The server comes back with a list of the products, and the page takes that list and builds itself. That's how it's working there.

So if I open this back up and now go to the network tab instead, I'm just going to clear this stuff out and reload the page. You may have seen that, just for a second, there was a loading animation. That loading animation is usually a hint that there is some kind of API request going on: the page sends the request out and the animation lets you know that it's waiting for data. Nine times out of ten, if you load the page and see that animation, you know that you're dealing with APIs.

Now I'm going to find the API requests. I'm going to move my face down here a bit. There we go, that's much better; I probably should have done that a long time ago, but now you can better see what I'm doing up here. I'm looking for this banana product, so I just type here 'banan', and you can see
that it's coming from this request here. I can see that I'm getting this response, and in it what looks to be a list of dictionaries, with a bunch of stuff about products; I can see the prices and so on. So this is almost certainly what I want.

Now I want to go to the headers. Here I can see the initial request that was made as well as the headers that were part of the request. Here are the query string parameters, which basically just means all this stuff at the end of the URL. Sometimes, if it's a POST request, there will be a body there or something similar.

Actually, I think this other one might be a better one. This first one looks like it's requesting specific products, and I'm more interested in this one, because I can see that it's looking for a specific category. The first one could be just loading the first page, or maybe loading the stuff up here, whereas this one, because it has a category section, is probably more relevant to figuring out how to get a category's products.

You can also see in this case that it's coming back with a 200, and it's a GET request. That's a good sign, because with a GET request usually all you really have to do is copy it into your browser and it should work. Let's have a look. Yeah, this works: I put it straight into my browser and I got out this gigantic dictionary, basically.

That usually doesn't work if you have a POST request, because it's expecting a body along with it and your browser won't send the body. So in that case you
might want to use curl, or the requests library, or a Scrapy request, depending on which system you're using. Sometimes the headers are important as well: there might be some headers in here that the server uses to validate the request in some way. You might want to check these headers, include them with your request, and try to figure out which ones are mandatory and which ones are optional. Lucky for us, all we need is this URL here.

As I mentioned, there's this category part in the link. So I know from there that when I want to get these pages, instead of going to this website URL up here, I take out this section and put it in the category= part, and that will allow me to request multiple pages. There's also a store ID that might be interesting: if you're selecting a specific store, you might want to take the store ID into account and change that. This page size is probably a default, so you want to leave that in there.

So this shows us how to get a single page of products, but we probably want to get multiple pages of products. Now we need to figure out what it actually does to get the next page. I'm going to scroll down and go to the next page. Here we've got a different product first, so let's search for that. Here we have another request, although it looks to be the same as the last request. Let me just bring up Notepad so you can see this a little better. Oh, I was looking at the first one. This one, I think, is what I want to look at.
Yeah, here we go, perfect: this is the second page. I'm going to put it in Notepad so it's a bit easier for you to see. You can see from the second page that there's now a page=2 query argument in there, and the page size is 24. Based on this, I think if I go like this, I should get just the first page. Let me get rid of this, and it looks like it did; the bananas are there, so that's good. Let's see if I can be sneaky and just put this at the end. Yeah, that worked: I get the second page's first product up top. So now I can see that I'm able to get my pages.

You also want to see what happens when the pages end. If we go back down to the bottom here, we can see that we have three pages. So what happens when you ask for page four? I get something that's empty; there are no tiles. So now I know that I've reached the end of my shelf when I get a response with no tiles in it. And that's pretty much all you have to do for this kind of API stuff.

Let me show you what it looks like in code. For example, here in my shell I can just fetch the URL, and then I'm going to import json so I can convert the response into a dictionary: data = json.loads(response.text). You can see that there's this tiles key in there, so I could say something like products = data.get('tiles'), and now you can see that we have 60 products. That's because here I set a page size of 60.
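The paging pattern worked out above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the site's actual API: the endpoint URL is a placeholder, the query parameter names (category, page, pageSize, storeId) are guesses at what the network tab showed, and fetch_json and iter_products are hypothetical helper names. The loop asks for page after page and stops at the first response that has no tiles.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder endpoint (assumption); the real URL comes from the network tab.
API_URL = "https://example.com/api/products"


def fetch_json(url):
    """Send a plain GET request and decode the JSON body."""
    with urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


def build_url(category, page, page_size=24, store_id=None):
    """Build the request URL; the parameter names are assumed, not confirmed."""
    params = {"category": category, "page": page, "pageSize": page_size}
    if store_id is not None:
        params["storeId"] = store_id
    return f"{API_URL}?{urlencode(params)}"


def iter_products(category, fetch=fetch_json, page_size=24, max_pages=1000):
    """Yield products page by page until a response contains no tiles."""
    for page in range(1, max_pages + 1):
        data = fetch(build_url(category, page, page_size))
        tiles = data.get("tiles") or []
        if not tiles:  # an empty page means we are past the last page
            break
        yield from tiles
```

Passing the fetch function in as an argument keeps the loop easy to try out offline with canned responses; in a Scrapy spider the same stop condition would sit in the callback that schedules the next page.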
We know now from our manual testing, though, that we want to set this page size to 24 instead, because otherwise it won't line up with the same pages. Although I guess in theory you could make this really big as well and do it all in a single request. It depends on whether you want to match the pages a customer would see or get all the data really fast. You could go ahead and change it to a thousand or something, and it might get everything in a single request.

If we go to the first product, you can see we've got our banana product from earlier. Once you've got a product in your data, you can pull the URL out of it, you can pull the name out of it, you can pull the price out of it. You also have the unit price, which might be pretty handy. So we have everything there, and that covers our second case, AJAX.

Just a couple of extra notes here. This was a really easy example because it was just a GET request. Sometimes, especially if it's a POST or a PUT request, there's probably going to be a body involved, so in that case you want to look for the body. At other times the headers might be important: maybe there's an API key in the headers, which is a common one, or maybe the server checks something in the headers to try to distinguish a bot from a real user. So you want to check the headers as well.

Usually, if you get the body right and the headers right, it'll probably work. Not always: if they have really good anti-bot protection in place, they might be able to tell that this isn't being requested from the page, and it can be quite difficult sometimes. If you're lucky it's just a GET request; otherwise it might be quite a bit of work, but
usually this will do it. What's handy about AJAX dynamic pages is, first of all, that they're usually super fast, because people want APIs to be fast. The AJAX request can actually be a lot faster than loading even just the HTML; it will load in fractions of a second.

Also, it can sometimes be nicer to parse. You might be doubting me a bit when you see this gigantic bunch of code rather than your nice HTML sections, but remember, this is just a dictionary. If I want to get, say, how many are available, all I have to do is pull out the data field and then pull the availability field from that. So instead of having to mess around with HTML selectors, which can get quite complex depending on the website, I've got these dictionaries and I can pull values straight out. It's a lot nicer.

When I first started web scraping and got something like this, I would basically panic, because I'd be thinking, 'Oh, I have to use Selenium now, it's going to be a huge effort.' But nowadays when I see this, I love it, because it means I get to avoid messing around with CSS. All I need to do is find out how the page is getting this JSON, and I have a nice little dictionary to work with instead.

That's really all I wanted to show you today: the DOM stuff and the AJAX stuff. If you're doing any more scraping in the future and you want to figure out how to scrape a dynamic page without having to rely on something like Selenium or Splash, now you know that all you need to do is figure out what kind of dynamic it is: DOM manipulation or AJAX. If it's DOM manipulation, you just need to find where on the page the data actually is and pull it from that other part instead, while if it's AJAX, you
just need to check your network tab, look at the requests, and try to figure out how you want to make the request. Once you've got that figured out, you have a fast and nice way to pull the data you need from that JSON instead of having to mess around with this. [Music]
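To recap the DOM manipulation case in code, here is a minimal sketch under stated assumptions. The script text, the mainNavigation marker and the url/children layout are all made up to mirror the menu data shown in the video. The standard library's json.JSONDecoder.raw_decode stands in for chompjs.parse_js_object, which is what the video uses; the stand-in only copes with strict JSON, whereas real inline JavaScript will generally need chompjs.

```python
import json

# Made-up stand-in for the text of the <script> element found on the page.
SCRIPT_TEXT = (
    'var settings = {"lang": "de"}; '
    'var mainNavigation = [{"url": "/fruit", "children": ['
    '{"url": "/fruit/bananas", "children": []}]}, '
    '{"url": "/dairy"}]; init();'
)


def extract_array(text, marker):
    """Parse the first JSON array that appears after `marker` in `text`."""
    start = text.find(marker)
    if start == -1:
        raise ValueError(f"marker {marker!r} not found")
    # Skip ahead to the opening bracket so parsing starts at the array itself.
    start = text.index("[", start)
    # raw_decode parses one value and ignores the JavaScript that follows it.
    data, _end = json.JSONDecoder().raw_decode(text, start)
    return data


def collect_urls(nodes):
    """Recursively gather every url from a list of category dictionaries."""
    urls = []
    for node in nodes:
        if "url" in node:
            urls.append(node["url"])
        urls.extend(collect_urls(node.get("children") or []))
    return urls


menu = extract_array(SCRIPT_TEXT, "mainNavigation")
print(collect_urls(menu))  # ['/fruit', '/fruit/bananas', '/dairy']
```

With chompjs installed, the raw_decode line would be replaced by a call such as chompjs.parse_js_object(text[start:]), which is the slicing approach described in the video.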
Info
Channel: Further_Reading
Views: 2,440
Rating: 4.9333334 out of 5
Keywords: scrapy, webscraping, webpages
Id: 6niJQLKUt94
Length: 36min 7sec (2167 seconds)
Published: Tue Oct 13 2020