Helping my subscriber to scrape INFINITE SCROLL / DYNAMICALLY LOADED pages fed by AJAX with SCRAPY

Video Statistics and Information

Captions
What is going on, guys? Welcome to yet another on-demand web scraping tutorial. This time I'm going to help a subscriber scrape an infinite scroll page from pdfdrive.com. We have a specific book's HTML page where we can see similar books, and if we want more similar books we can click the "Load more similar files" button and get another batch of data. I'm calling this infinite scroll, but it doesn't really matter whether this kind of dynamic content loading happens on a scroll-down event or on a button click event: from the technical perspective of how it works internally, it's the same thing.

Before we run the scraper, let me explain what's going on under the hood. I invoke the developer tools by pressing Ctrl+Shift+I and go to the Network tab. Now, if I click "Load more similar files", we get an AJAX API call. Here is the request URL; we're making a POST request, which is important, and it's also important to bear in mind that there is form data that goes along with the POST request. In return we're getting chunks of data only, which are really easy to scrape eventually. I already have a tutorial on how to do this using requests and BeautifulSoup, so in this video I'd like to show how to do it using Python's Scrapy framework. If you're interested in learning how to scrape these dynamically loaded pages, please feel free to follow along.

Just one more thing before I start. Many of you ask me to make tutorials on Selenium, or Scrapy with Selenium, or other Selenium-related libraries, and I keep telling you in the comments that I don't do that, because it violates my programming principles. If this were a Selenium tutorial, you would need to mimic the browser's infinite scroll, or in this particular case the button click, and that's really weird, because there is a much simpler way of achieving the same result: you can just reverse engineer these API calls and do things much faster and in a much cleaner way.
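To make "reverse engineer the API call" concrete, here is a minimal sketch of replaying that captured call with the requests library from my earlier tutorial. The endpoint path and the form field names below are placeholders, standing in for whatever your own DevTools Network tab shows:

    import requests

    # Placeholder endpoint and form fields: copy the real values
    # from the captured AJAX call in the DevTools Network tab.
    url = 'https://www.pdfdrive.com/similar-books'
    form_data = {
        'start': '0',         # offset into the similar-books feed
        'name': 'some-book',  # slug of the book we are viewing
    }

    # The captured call is a POST with the payload in the request
    # body, so we replay it the same way.
    response = requests.post(url, data=form_data)
    print(response.text)  # chunks of HTML, easy to parse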
Now let me open my folder. In my scrapy tutorials GitHub repository I've created this "ajax" folder. I'm not calling it "infinite scroll" or "dynamic content"; I just call it "ajax", because that's the mechanism we're going to be working through, and this file, ajax.py, is where the code is going to live. Before I start, let me explain in a docstring what the script does in particular: "script to scrape dynamically rendered content via AJAX POST requests" (hold on a sec, guys, I had to check the spelling of "dynamically" in Google Translate).

Okay, so now let's get to the code. The very first thing to consider: we need to import some packages. First we import scrapy; then from scrapy.crawler we import CrawlerProcess, to be able to run this spider as a regular Python script, which for me is more convenient than the scrapy command line. We also need, from scrapy.http, the FormRequest method, which is the easiest way of making HTTP POST requests where you need to pass some sort of data in the request body. Building a POST request with the raw scrapy Request object is quite complicated, and there is no easy way of doing it the way you would with the requests library, so FormRequest is the easiest option there is. I'll also import json to pretty-print the response, and that's enough for now, I guess.

Now let's define the spider class. Let's call it... dynamic... ajax... I don't know what to call this; okay, let's name it AjaxScraper. So I define the class, call it AjaxScraper, and it inherits from scrapy.Spider, the general purpose spider, a crawler. Then we need to give the spider a name: let's call it simply "ajax", and that's it. Now we need to provide the URL we're supposed to be crawling, and the URL will always be the same, because if we have a look at the headers, we always get this same request URL without anything attached, so it's just this one. We won't even be scraping the regular page endpoint, because we don't really need it; we'll only make use of this URL.

And then the most important part: we need to provide the form data parameters, so I just grab all of these from DevTools. Now, why is this called form data, and why are we using FormRequest? Probably the most widely encountered case of using a POST request is when we're dealing with submitting a form: say, a username and password, and then we click login or submit, things like that. That's how a POST request usually gets made, and the data we entered within the form is known as form data. But that's just from the GUI, the user's perspective; that's why they call it form data. In reality, form data is simply some sort of data that is passed along within the request body: when we make a POST request, this data travels in the POST request body in particular. That's how it's done at the lower levels. Scrapy's FormRequest method gives us a more high-level way of doing this (they call it "submitting forms"), but even though we're not literally submitting a form here, on the HTTP request level there is literally no difference between a form submission and making a POST request to fetch data from an API. It might sound a bit weird to those of you who are not familiar with the HTTP protocol itself, but trust me: it's exactly the same technology, we're making those requests and passing some data within them. I only explain this so you don't get confused about why this is called form data, why this method is called FormRequest, and what kind of form I'm talking about; I hope my explanation makes it slightly clearer.

So here I just create the params variable and paste all the parameters we got, and I would be starting from the very first element.
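Putting that together, here is a minimal sketch of the imports and the spider skeleton up to this point. The API URL, the User-Agent value, and the form field names are assumptions standing in for what DevTools actually showed:

    import json

    import scrapy
    from scrapy.crawler import CrawlerProcess
    from scrapy.http import FormRequest


    class AjaxScraper(scrapy.Spider):
        name = 'ajax'

        # Static AJAX endpoint captured in DevTools (placeholder URL).
        api_url = 'https://www.pdfdrive.com/similar-books'

        # A real browser User-Agent copied from DevTools (example value).
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

        # Form data sent in the POST body; the field names are assumed.
        # Paste the real ones from the "Form Data" section in DevTools.
        params = {
            'start': '0',         # offset of the first entry to fetch
            'name': 'some-book',  # book we want similar results for
        }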
Then the offset increments by 20 each time: if we trigger the next AJAX request, you can see it starts at 30, and 10 plus 20 is 30, because the page shows 10 elements by default, the first request fetched some more elements, and the next POST request added another 20 books. So I guess we'll start with an index of zero here; it's some sort of index into the data entries we're fetching from the API. Let me keep it as a string. We don't get any total page count here, and the other very important part is the name of the book. We're currently at the URL of this particular book, Half Girlfriend by Chetan Bhagat. Here we could download the PDF itself, but that's not what's happening: the point is that by pressing "Load more similar files" we're not downloading that PDF, only the similar books, and to fetch only the PDFs similar to this one we obviously need to provide the name of the particular book whose URL we're currently at. Just bear that in mind as well.

Before I start creating the crawling logic itself, I want to provide the main driver, so I write if __name__ == '__main__'. I'd also like to use custom headers, so let me grab the User-Agent from the browser and copy it. To be honest, I'm not sure whether FormRequest sends proper headers on its own, but better safe than sorry. Then I create the process variable, equal to CrawlerProcess(): I'm creating an instance of CrawlerProcess, and then I can call process.crawl() with AjaxScraper as the argument, followed by process.start().
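In code, the driver is the standard CrawlerProcess pattern (continuing the same script as above):

    if __name__ == '__main__':
        process = CrawlerProcess()   # lets the spider run as a plain script
        process.crawl(AjaxScraper)   # pass the spider class, not an instance
        process.start()              # blocks until crawling finishes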
Okay, and now it's time to provide what is known as the crawler's entry point. We define the standard start_requests method, which is the default entry point for the scrapy.Spider base class we're inheriting from. For testing purposes, at the moment I just want to mimic a single AJAX call, so here I can simply say yield FormRequest(...). Let's figure out the arguments: the URL is always static in this case, so it's just that endpoint; then headers=self.headers; and then the most important thing, we need to provide the argument that is called formdata, which takes this kind of data, so formdata=self.params. This dictionary gets automatically converted to a string and goes out as the request body; all of that happens automatically under the hood. The last thing we need to provide is the callback function, because this is an asynchronous request powered by the Twisted networking library that is used within the Scrapy framework: callback=self.parse, and this is it.

From now on I can provide this parse method, which takes two arguments: the self instance, followed by the response object. I just want to print response.body, or better, response.text. Actually, I don't even want to print it; instead I'd like to first store it locally and use the local copy to extract the data from. That way I'll be able to develop the parsing against a local HTML copy and avoid torturing the target site. The code I'm about to write I'll comment out later, but I'll leave it in for the history: open a file, res.html, for writing, and write response.text into it.

Okay, now hold my breath and try to run this: python3 ajax.py. Uh-oh, an error inside FormRequest: to_bytes must receive a unicode, str or bytes object, got integer. So we probably need to convert that numeric parameter to a string. Let's try. Okay, now it worked and we got our res.html response, so let's have a look at this particular response in the browser to make sure. Perfect: this is the data we've just scraped. Would it be twenty elements? Let me inspect the element and count... it's nineteen or twenty; yeah, basically nineteen elements, so the next response will start with a count equal to 20. We'll handle this a bit later on: our pagination, infinite scroll, dynamic content rendering, call it whatever you want, it's all really the same from the HTTP request perspective. It will be handled here by making this FormRequest inside a loop, over a range from zero to whatever number, shifting the start index: from 0 to 20 would be page 1, from 20 to 40 would be page 2, and so on. I don't know how many pages are available; we'll crawl quite a few just to demonstrate how this works.
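Inside the AjaxScraper class, the entry point and the temporary caching callback might look like this (a sketch, still using the placeholder names from above; res.html is just my choice of cache file):

    def start_requests(self):
        # A single AJAX call for now. Note that all formdata values
        # must be strings, otherwise Scrapy's to_bytes() raises the
        # TypeError hit in the video.
        yield FormRequest(
            url=self.api_url,
            headers=self.headers,
            formdata=self.params,
            callback=self.parse,  # async callback run by Twisted
        )

    def parse(self, response):
        # Cache the response locally so the selectors can be developed
        # against a file instead of hammering the target site.
        with open('res.html', 'w') as f:
            f.write(response.text)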
But for now I actually want to be able to extract the data from that local copy, so I create a variable which I also call response (we're just replacing the original response variable with this one), and it will be equal to Selector(text=...), with the contents of the local file as the text. If I print this response now, I hope to see a Scrapy Selector object instance printed in the terminal in return. And look: yeah, exactly. Perfect, so now we can call response.css() and so on.

Now we actually need to provide our data extraction logic, and I create a variable called features for the fields we need. The first thing to consider is the title. I'm also wondering: this should be an unordered list, right, with every book as a list item? We have this onclick, but yes, we have list items, so it's a really simple structure and it's incredibly easy to write the selectors here. So, one more time, first we need to extract the title. Let me inspect: within the list item, where exactly is this text? Is it within a link? Oh, it's within this a tag, so we just need to extract all of its text recursively. I guess we should loop first, so let's call the list items cards, and then we can search within each particular list item selector: so not response.css, but card.css, because I don't want to search the whole page. Here, let's find an a tag followed by a space and a double colon and text, 'a ::text', which extracts all the text recursively from this a tag, and .getall() the elements. And before proceeding: I'm using json.dumps to pretty-print the features dict as we crawl; the indentation is just to make it more readable.

Okay, now we get the title. The title is fine, but we've got lots of unnecessary newline characters, so let's try to get rid of those. Right, we don't need to strip anything, we just need to remove the newline characters, and we also need to get rid of the empty elements. Perfect. Hold on a sec, yeah, one more thing to consider: we actually need to join this list; let's join with an empty string for a while. Okay, yeah, it seems like the title is pretty fine now. Perfect, so let's move on.

The next thing to consider is this file data: pages, year and so on. Let's have a look: these are all spans inside the file info element, pages, year, and each of them has its own class name. Here I also grab the book's URL from the link; I just want to test this link to see if it really works, so copy it, go there, and it works. And now we can go after the pages: take the span's text with ::text and .get() the first and only element. Let's see if we extract this... the number of pages now extracts just perfectly. Then the year; let's consider the year as well. This is literally the same, the only difference being that the selector class is slightly different, so instead of the page count class we use the year class. Let's check: the year is there as well. What else? Now we've got the file size. Perfect. And now the downloads, so I just change the class, and hopefully we get the downloads. Right, okay, now we've got the downloads, perfect. What else have we got there?
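Here is the offline parsing pass and the first selectors, sketched out. The file-info class names (fi-pagecount and friends) are my guesses at the site's markup, so substitute whatever your own inspection shows:

    import json

    from scrapy import Selector

    with open('res.html') as f:
        response = Selector(text=f.read())  # shadows the live response

    for card in response.css('li'):         # one list item per book card
        features = {}

        # All text inside the link, recursively; drop empty chunks,
        # remove newlines, and join back into a single title string.
        title_parts = card.css('a ::text').getall()
        features['title'] = ''.join(
            part.replace('\n', '') for part in title_parts if part.strip()
        )

        # Book URL from the link's href attribute.
        features['url'] = card.css('a::attr(href)').get()

        # File info spans; these class names are assumptions.
        features['pages'] = card.css('.fi-pagecount::text').get()
        features['year'] = card.css('.fi-year::text').get()
        features['size'] = card.css('.fi-size::text').get()
        features['downloads'] = card.css('.fi-hit::text').get()

        print(json.dumps(features, indent=2))  # pretty-print while testing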
This seems to be the description, I guess, next to the file info. Let's try to make use of the b tags: select all the b tags and try to extract the text with getall(). Save... but hold on a second, something is wrong, it extracts nothing. Let me check; this works elsewhere, so something is off. Oh, this b tag is not where I thought it was. Let's try without the child elements... no. Let's try 'b ::text' to extract all the descendant text... it's still not right, it's just extracting some weird stuff, and I'm not sure why. Let me look for its class and try to specify the parent... I don't understand why, but it looks like it returns all the same data for every card. I'm searching within each card, so is this really the same author for every one? Hold on, I just need to make sure I'm searching within the right place. So let me just take all of the card's text recursively: okay, we also get the pages here, and yes, I did grab the description, but above it we get something else too. So let's look at this last card, Revolution 2020: Love, Corruption, Ambition. This here seems to be the description, and everything we need comes after the word "Downloads", basically. I'm still wondering why the b tag selector doesn't find the proper thing; maybe it's not treated as a regular tag there, which doesn't seem to make much sense. So let's just extract the description a different way: join all of this list into a single string, save, and now take everything after the word "Downloads". I can simply split on it and take the very last element. Okay, perfect. I'd also love to replace the newline characters... okay, and now it says Revolution... yeah, it seems like the description is fine, and we can strip it too, which is pretty nice. Perfect.

Probably the last thing to extract here is the image URL. It seems like we've got an img element and its source, so let's go for it. Can I open it in a new tab? Okay, nevertheless we can still grab the src and check: yeah, we've got the thumbnail, which is perfect.

From now on, what we can actually do in order to store all this is yield the features, so here I simply say yield features, and that's it. We don't need the local-copy code anymore, so I comment it back out, and from now on this response is the actual live response we're dealing with. Okay, now let's store the data to CSV. We provide some custom settings here, so I can simply add a custom_settings dictionary pointing the output at books.csv. Let's run this again... oh, why JSON? It should be CSV; regardless of the file extension, the format setting is what should say CSV.
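Here is the rest of the extraction and the CSV feed, sketched out. The split-on-"Downloads" trick is what the video lands on after the b-tag selector failed, and the feed settings use the FEED_URI/FEED_FORMAT pair, one common way to configure output in Scrapy of that era (newer versions prefer the FEEDS setting):

    # Inside the loop over cards, continuing the sketch above.

    # Everything after the word "Downloads" in the card's full text
    # turned out to be the description.
    all_text = ''.join(card.css('::text').getall())
    features['description'] = (
        all_text.split('Downloads')[-1].replace('\n', ' ').strip()
    )

    # Thumbnail URL from the card's <img> tag.
    features['image_url'] = card.css('img::attr(src)').get()

And on the spider class itself:

    class AjaxScraper(scrapy.Spider):
        # ... name, api_url, headers, params as before ...

        # The format key, not the filename extension, decides the
        # output format, which is why 'csv' is spelled out here.
        custom_settings = {
            'FEED_URI': 'books.csv',
            'FEED_FORMAT': 'csv',
        }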
Okay, now it seems like we got our data extracted, so I just want to open this in LibreOffice to make sure it looks proper. "Can you show me how to..." Thanks for the comment, by the way; I'll answer it while this runs. I've got to say, lots of your comments get lost on me: I can't even see them, because YouTube doesn't always notify me for some reason. Okay, I'll just run this again since I changed the extension. So, just to make sure that I actually read your comment: if you have some important request regarding a quick, on-demand web scraping tutorial or something like that, please feel free to email me. You can always find my email within the About section of the channel page on YouTube, so please do email me if there is something important, to make sure I actually respond.

So here is our CSV: we've got the title, URL, pages, year, size, downloads, description, and everything is populated. This is just perfect.

Now, the very last thing to consider, in order to make this a production-ready scraper (maybe not the best ever word, but still): we need to loop over the range of pages. But don't get confused when I say "the range of pages"; you might wonder what kind of pages Code Monkey King is talking about, because we don't have pages here. Instead we have this infinite scroll and the "Load more similar PDFs" button. By saying "page" I actually mean these 19 elements, the 19 book entries that fit one batch; that is, one more mimicked click of that button, as if I were clicking it, with one more POST request going out, and so on. So I'll say "loop over the page range", but I'd better put "page" in quotes, because it's not really a page but rather the next slice of the feed. I'm explaining this just to head off the question of what I meant by "page". And this is for those who never watch the videos but just blindly grab the code and try to make use of it without understanding what's going on underneath: this code is not made for you to grab and use as-is, even though that's possible; it is rather for clarity and educational purposes, to make it clear how things work. Just bear that in mind.

So I can simply say: for page in range(0, 4), because literally these are pages, just following one after another in the infinite scroll. We'll crawl through a few of these pages, and the yield will live inside this loop, obviously. Now we need to calculate that shift index; let's call it the next page's start index. To show you this, for now I'll get rid of the request, so we're not going to be making any requests at this moment. I simply set the start parameter equal to page multiplied by 20. Well, I'm bad at mental math, so let me just check, and I also need str() to convert it, and then print it.
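The final start_requests with the batch loop, sketched under the same assumption that the offset field is called start:

    def start_requests(self):
        for page in range(0, 4):  # four batches of ~20 books each
            # Start index of this batch: 0, 20, 40, 60.
            self.params['start'] = str(page * 20)
            yield FormRequest(
                url=self.api_url,
                headers=self.headers,
                formdata=self.params,  # formdata is serialized into the
                callback=self.parse,   # body when the request is created
            )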
What I want to get in return is 0, 20, 40, 60, et cetera; that's what I'm supposed to get, so let's see what I actually got. Yeah, exactly: 0, 20, 40, 60. Starting from index 0 it loads those 19 entries by default, then starting from the next index we get another 19 entries, or maybe even 20, but it doesn't matter; these are just the start indexes from which we begin fetching the data. Since we now update our parameters on every iteration, the data should be different each time. I get rid of the print again, hold my breath, and now I hope to scrape three or four pages. Yeah, that was really fast. Let's have a look... oops, yeah, we have much more data. This is literally the same as if I had clicked that "Load more similar PDFs" button a couple more times, three times maybe; it's slightly different in the GUI, and more automated here within Python's Scrapy framework. Okay, let me check: yeah, we're at around 73 or 74 rows, so literally the same. We've got the title, the URL, pages, year, size, downloads, descriptions, things like that. If you want to scrape more, simply increase that range. I don't know what the limit is there; to be honest, I have no idea, so feel free to use the trial-and-error method to discover it. If the response carried a number of found results we could calculate it, but I don't see one.

Okay, and before I finish this tutorial, let me show you one more thing about what we've done here. If you're going to be scraping pdfdrive.com yourselves one day: if you go to the initial search page, scroll down, and again try to load more files, please note this is quite different from what we've just done in this tutorial. You can see it uses a different request URL, and the method is also different: this is a GET request, so instead of form data it carries query string parameters. It's really similar otherwise. The difference is that you provide a different URL, you pass query string parameters instead of the POST parameters, and instead of FormRequest you use the regular scrapy Request, probably using urllib to build those parameters as well (there's a short sketch of this after the transcript). I've written that sort of scraper numerous times on this channel, so there's nothing new there. And again, if I have a look at the response, it's pretty similar; but since it's not using the POST method, it's a somewhat different story, so I'll leave scraping that data as an exercise for you guys. Obviously, if you just want the similar books, feel free to search for a particular book and then use this "load more similar books" request, whose responses contain the HTML.

I hope you learned something interesting out of this tutorial. That's it from my side, guys: learn web scraping, land your freelancing jobs, and everything is going to be fine. See you next time, take care.
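For completeness, here is a minimal sketch of the GET-based variant mentioned above, for the search page's infinite scroll. The URL and query parameter names are placeholders, and this is the exercise left to the viewer, not tested code:

    from urllib.parse import urlencode

    import scrapy


    class SearchScraper(scrapy.Spider):
        name = 'search'

        def start_requests(self):
            # The search feed paginates via query string parameters on
            # a GET request, so we build the URL with urlencode and use
            # a plain scrapy.Request instead of FormRequest.
            for page in range(1, 4):
                params = {'q': 'python', 'page': page}  # assumed fields
                url = 'https://www.pdfdrive.com/search?' + urlencode(params)
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            # Same card-parsing logic as in the POST version.
            pass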
Info
Channel: Code Monkey King
Views: 1,510
Rating: 4.909091 out of 5
Id: pd6HnomLc0Y
Length: 53min 35sec (3215 seconds)
Published: Tue May 19 2020