Scraping and forwarding data from multiple levels of a website and from an API simultaneously | Scrapy

Captions
Hey, what's going on guys, this is yet another on-demand web scraping tutorial, and this one was requested by one of my great subscribers, Dr. Pi; if you're listening to this, hello there man, this is exactly what you asked me for in one of the comments. Today we're going to be scraping urbandictionary.com, which is a fun site with entertaining definitions of everyday words. Let's try one of the words here: we get a nice description of what it means in particular, and even some visual representation. I think this is pretty interesting from a web scraping perspective, not only because the site itself is fun, but because it's easy to scrape, and because it lets me introduce an interesting technique called forwarding data between the different depth layers of a crawl. Let me quickly explain what I mean by that. The first thing to notice is that if you just hover the cursor over a word, the site shows a short description in a tooltip, and that's the first thing we're going to scrape. Open the developer tools and go to the Network tab: as soon as the mouse moves over one of the words, the page makes an API call, and if you open the Preview of the response you'll see the exact string that appears in the tooltip. For some reason it doesn't fire for every word; let me try another one... there we go: "a pregnant untruth", exactly the same description as the one in the tooltip. So the plan is: first scrape this short description, then go recursively into each particular word's page and grab the full description from there as well.

We probably won't parse it into some proper dictionary structure; we'll just grab the entire text, and that's fine. I'm going to use the famous Scrapy framework, because it's pretty nice: it makes concurrent requests, so it's really fast. So let's actually start. First, create the script; let's call it urban.py and save it. We need to import scrapy itself, and then, from scrapy.crawler, import CrawlerProcess, since we want to run the scraper from within a Python script; we may need something else later, but for now let's quickly draft a class, a basic general-purpose Scrapy crawler. We need to give it a name and the start URLs; we could list all 26 letter pages right away, but it's better to start with a single one, and we'll formulate the next pages later, when it comes to the crawling process. We probably don't need any headers or anything, I believe it should work as is. Then we create the process variable, an instance of CrawlerProcess, pass our spider class to process.crawl, and finally call process.start to actually invoke the crawl. The parse method, our callback, takes two arguments: the spider instance and the response object. Let's print the response object itself to make sure it's a 200 and we actually got the correct response, and run this from the terminal in the current working directory.
Hold on guys, it was just a little typo in the request class name, and this time it works fine: here we have the 200 response, so we successfully fetched the data from the link. Usually it's good practice to store the response and parse the stored version, but here we don't need to parse that much data, so I'll just be making a live HTTP GET request every time while debugging the data-extraction logic. If you're interested in how to extract data from a stored file, feel free to have a look at my other scripts; they use that technique all over.

Now let's inspect the elements. We need all of these words first, so let's have a look: each word is a list item with the class word, inside an unordered list with the class no-bullet. So we loop over that list: for item in response.css(...), using CSS selectors with the no-bullet class name. Note that we don't call get() here, because we want the selector objects themselves; let's print each item to make sure these are selectors. Right, and we don't need getall() either, because we're not extracting data from the list directly, we just iterate over the selectors.

Within each list item we're looking for the a tag, and we want to extract its href, because that address leads to the word's page; I'm not entirely sure yet whether the quick descriptions from the API are always the same, and the API also uses some sort of a key, but we'll get to that. First the href: a::attr(href)... hmm, for some reason this looks strange... ah, it's attr, not attribute, just a typo. Next, the recursive part: let's define a parse_link method with exactly the same signature, the spider instance and the response, and have it print the response status, just to make sure we actually hit this second callback. While crawling all of these links I'll break out of the loop as soon as the first request is yielded. Hold my breath and run this again; I hope to see the recursive response printed... and yes, we successfully retrieved data from the followed link.

Now we need the full description from that page. Let's see what the container is called: there's a row, and a def-panel, and the def-panel is just fine, but what we really want is the meaning block; the examples sit separately, and we don't need them, so let's take the meaning element, extract its text and get() it, since there's only one there. Call it in parse_link, print the meaning text, and run this again to see whether we extract the data or not.
Hmm, the output looks odd. It seems it extracted only part of the text, probably because of the element I selected... here it is, this is far more reasonable, to be honest. But wait, does it extract all the text? No, just some of it; hold on a sec... okay, we just need getall() to extract all of the text fragments. Run this one more time: now we have a list, and the only thing left is to join this list into a single string, so that we get plain text. There we go, now it's plain text, great. So that's the full description; but we also want the quick description from the API, and once we have it, we'll need to forward that short description into the parse_link method. We'll be using Scrapy's meta argument for that, which is passed as a dictionary; but first we need to figure out the API call itself, so let me look for some answers on Stack Overflow and poke at the requests a bit more. I want to see whether the API key is always the same or not; by "key" I mean this parameter here in the query string. Let me copy it. I was wondering whether it's available somewhere in the page source, so that we could scrape it, but unfortunately not; so things are getting a little more interesting here, and we need to find a way of extracting it. Let me think... okay, I've just realized that this key always seems to be the same, so let's simply try to reuse it as a constant.

I'm not sure whether that works or not, but let's actually try. The other thing to specify is the term, which is exactly the same as the text of the link, so we need to extract that as well; let me copy this quickly. Let's set the recursive crawling logic aside for a while and get back to looping over the list items; I want to loop over all the data now, not just one item, using the same no-bullet selector as before. Within each item we extract the href... wait, no href? You're kidding me... oh, it's just a typo in attr(href) again. Okay, great, and we can also print the text: that's the term itself, the keyword for the API call; "brilliancy", for instance. So now we can create a list of items, where the word is the item's text and the link is its href, and import the json module to pretty-print these items. Great: now we have both the keywords and the links. Before following the links, we still need to make the API call to that URL, using the keywords. And I've just realized that instead of trying to handle this with Scrapy's asynchronous requests, it's probably easier to just import the good old requests library and make blocking calls, since what we need back is JSON anyway.
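Collecting the word/link items and attaching the API URL might be sketched like this; the endpoint and the key below are placeholders, since the real ones are simply copied from the dev-tools network tab in the video:

```python
import json
from urllib.parse import quote

API_BASE = "https://api.example.com/tooltip"  # placeholder endpoint
API_KEY = "0123456789abcdef"                  # placeholder key


def build_api_url(term: str) -> str:
    # quote() percent-encodes spaces, so multi-word terms don't break the call
    return f"{API_BASE}?term={quote(term)}&key={API_KEY}"


# items as extracted from the listing page (sample data)
links = [{"word": "pregnant pause", "link": "/define.php?term=pregnant+pause"}]
for item in links:
    item["api_url"] = build_api_url(item["word"])

print(json.dumps(links, indent=4))  # pretty-print, as in the video
```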
The idea is that we first make sure we got the page response, and only then make this second, blocking request for the short description; it's better to show than to explain, so let's go the following way. For each word we can simply do a requests.get on the API URL with the keyword substituted in, and then take the JSON of the response, because that's where the string we need lives. Let's first print the entire JSON to see what we've got there, and then move on. Oops, missing a comma... fixed. So now we got our short description. Let me put a break in so that we only process one word while testing, and, to keep the code a little briefer, take this part and paste it over here... is that syntactically correct? Huh, it didn't get the right response; I was really wondering why that happened, but hold on a sec... it's actually fine: the request crawled just fine, I'm not sure what that status code means, but it returns the string correctly, and this syntax seems to be fine too. So this is the string we need; I'd also like to get rid of these nasty bracket tags around it, so let's strip them... beautiful. Actually, I don't want to strip characters one by one in this particular case: it seems only the opening and closing tags appear, so we can simply replace them, along with the \r\n newline characters that keep recurring all the time. Well, now at least we have our short description.
Now we need the trick itself. We don't really need to yield this link item anymore; what we need is to forward this short description into our recursive link request, so let me show you how this is done. As soon as the API job is done for a word, we follow the link; here, before doing that, we loop over the links themselves, not the words or descriptions, and I'll keep this break statement for now and remove it a little later. In response.follow we specify the link, the callback, and, in the meta argument, a Python dictionary for the short description: the key would be short_description, and the value this string; copy and paste. Now, as soon as we hit the parse_link callback recursively, we can read the short description back out of the response and print it. Hold my breath and run this one more time... what's wrong here, integers? Oh, sorry, not like this: here we should use the item's link key, since each item is a dictionary holding the link and the description. Sorry for the inconvenience, guys. Hmm, it seems it did fire, but I can't see what was supposed to be yielded... hold on... I've just realized that I forgot to append the items to the links list; we didn't yield anything because the list was actually empty. I can simply append the dictionary here, and now let's try again. Okay, the short description is forwarded! Although a word containing a space breaks the call: without encoding, the space makes it look like one mangled word, which is wrong, so let's handle that to avoid the disaster.
Still, it works as a proof of concept: we forwarded this short description using the meta argument, down to the second depth level, into the recursive parse_link callback. And here, finally, we can create our main dictionary, the one we'll yield to JSON or CSV or whatever. We already extract the full description at this point, so instead of yielding just the full description, I should probably forward not only the short description but the word itself too; let me pass both. Now let's summarize the data: the word comes from the forwarded values, so does the short description, the full description comes from the page, and we can json.dumps the item with an indent to check it. Hold my breath, hoping to see all the results gathered in one place; we also need to make sure this keeps working across the asynchronous requests... another typo... okay, it seems like we got it: the word, the short description and the full description. Perfect.

Now let's write the output; let's go with JSON, though we could also do CSV here, it doesn't matter, it's for you to choose, guys: output to urban.json. Let's run one more time with only one item; I hope to see this urban.json file appear... it's not written yet, but it should be in a moment... okay, here are the results, and json.dumps of the loaded data confirms they're fine. Now let me comment the limiting part out; this is kind of nice, and from now on we can actually move on to all of the links.

I really hope that the way Scrapy handles requests won't break the logic we've built here, but we'll see whether that's true or not; and if this is all fine, we'll also be able to loop over all the letters. That's what scaling a scrape is, basically, and not only this scrape: web scraping in general is always about being able to scale. First you extract items from a single page, then you add all the recursive stuff, then you collect the results from all the pages; that was exactly the shape of one of my recent jobs, by the way. So, while the spider is running, let me try to explain what's happening: it loops over all of these words; it makes an API call to get the short description of every single word, the tooltip that appears to the right of the cursor; and it also makes the recursive call to get the full description; and in theory, at least, once all this data has been gathered, it should be yielded to the JSON file. It takes some time, because there are a lot of words, and running this through all the letters would take really a lot of time; I'll comment that code out, and those of you who want to scrape the entire result set will be able to just uncomment it. It takes real time even for this first page, but it's a very interesting task from the perspective of scaling in web scraping.

As I was already mentioning, scaling is really important, so let me clarify one more time what I mean by it. The first step is grabbing a single page and extracting the data from it; then we start doing two things: the first task is going recursively through the API calls, and the second is looping over the recursive pages for every single item; and then we bring all the data together by forwarding it through the meta argument within the Scrapy follow. That's how we're able to gather all the data into one dictionary at the very end node; and by "end node" I mean that, for a particular given keyword, parse_link already has the short description from the API, forwarded from the upper level, and the word itself, also forwarded from the upper level, and here it just extracts the full description and yields the whole item. In theory, anyway; we'll see how exactly it works on the real deal. We should already have some output: I removed the old urban.json, and it's appending the data here... it's done for these words; let me wait until it finishes... seems fine, full descriptions and all. We also need to check that the data was forwarded correctly; I hope so, to be honest. Still quite a few words left... finally, we're done, so now let's pretty-print the output. Reload... oh my god, what have I done? To be honest, I have no idea why this could have happened; just hold on, I have to find out. Okay guys, I've just encountered some sort of an issue with Scrapy's standard JSON file writing; there are probably custom exporters or something to avoid this sort of disaster, but in this tutorial I'll go the more easy and straightforward way and use the JSON Lines format, which is JSON with newline-separated items. Also, I've limited the scraper to the first two words, just for testing purposes.
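JSON Lines itself is simple enough to write by hand: one json.dumps per item plus an explicit newline, which is the workaround used here after the built-in exporter misbehaved (the sample items below are made up):

```python
import json

items = [
    {"word": "brilliancy", "short_description": "...", "full_description": "..."},
    {"word": "belieber", "short_description": "...", "full_description": "..."},
]

# One JSON object per line; the explicit "\n" is what keeps items separated.
with open("urban.json", "w") as f:
    for item in items:
        f.write(json.dumps(item) + "\n")
```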
So, one, two, and that's it; now let's try to run this one more time. Scrapy's item exporters are another interesting topic, so maybe one day I'll cover them as well, but for now we just need to make sure this works... it seems like I also need to add a newline after each item, so let's do that and try one more time. And it works; so now I need the English alphabet. Let me look it up... this is the one we need, copy. The scrape is done, so let's have a look: it seems much better now, we have the word, the short description, the full description and all this stuff. Oh, and instead of lowercase I simply want uppercase, just to make sure we get all the letter pages right; building the string by hand isn't great, but the Stack Overflow answer suggests string.ascii_uppercase, which is exactly what we need. So here we have all our characters, and we need to build a URL for each one: take the base URL and substitute the uppercase character, and let's see how it works. Great. Now we need to feed these URLs to the spider: let's call the method next_page, and here we can yield the request. I'm not going to run the full crawl, because it would take a lot of time; on the other hand, I can just post the complete results alongside this video. But okay, let's at least test it: let's do all the pages, but take only a single word, the first word, from each page: the first word for A, then the first word for B, and so on. Let me kick this off from scratch with a fresh output file; it will take only 26 pages. Hmm, the output looks strange at first... ah, I see: I should make sure the output file is cleared every time.
The results come in a different order, because that's how the asynchronous crawling gets scheduled, so it's not in alphabetical order; but it still goes through all of the pages, taking the very first word everywhere. Let me wait until it runs through them... so, for those of you who want to scrape the entire data set, you just need to remove this first-word limit. Okay, we're done here; let's reload, and now it seems like we have the data for every single letter. I'll probably run this scraper myself to extract all the results and share them with you guys; I don't believe most of you really need that, but, well, Dr. Pi, maybe you're the only one who does, so I'll make it especially for you, man; I hope you appreciate that. And that's basically it, guys, so let's summarize what we've done here: we've been crawling multiple pages, and within every single page we've been looping over all of the words, extracting the links, following them recursively and getting the full description out of them; in the meantime we've been scraping data from the API to get the short description; and we forwarded both the word itself and the short description to the recursive request that gets the full description of the particular word. I hope you understand what I've just said, because I barely understood it myself. If you have any questions, please feel free to leave them in the comments below this video, and feel free to order web scraping tutorials on demand; it's always welcome. That's it from my side, so until next time, and take care.
Info
Channel: Code Monkey King
Views: 1,286
Rating: 5 out of 5
Keywords:
Id: 1vOVsVcqx9k
Length: 64min 46sec (3886 seconds)
Published: Tue Mar 17 2020