Web Scraping Tutorial - Complete Scrapy Project | Infinite Scroll | AJAX JavaScript 'Load More'

Captions
So if I click "Load More" and scroll down, we can see, as we checked earlier in the video, that the pn parameter in the request URL increments. The first page isn't actually requested with pn=1, but I want to treat it as if it were, so I've copied that URL into the spider and split it up, because what I want to do is increment the page variable each time: one, two, three and so on. I've commented that out for now, because first I want to demonstrate and test it in Scrapy shell.

In Scrapy shell, let's do a fetch on the first page so we can get the first batch of thumbnails and extract the links from them. I particularly want to test whether pn=1 works, because as you saw, the site went from /healthy straight to pn=2. If this works, this will be good. Yes, OK, so the approach I've just shown you will work: the start URL is effectively the base URL plus str(page), with page starting at one.

Now we want to extract the links from that response. There's loads and loads of JavaScript in here. You could copy and paste it all into Notepad and do a Ctrl+F, but what I'm going to do is view the page source. Let's go back up to the page with no pn parameter, which is still effectively pn=1. Rather than read through everything, all we really need to do is inspect the element and see if we can get an href from it. If I move across the source, you can see... it looks like jQuery... ah, can you see this? This is significant: a script tag with type="application/ld+json". That's even better. Counting what we've got here: one, two, three, four, five, six, seven, eight. So it's loading eight per page, which is odd, because I thought it was loading ten; that will probably be because of the zoom level I'm using in my browser.

What we want is to load that into a dictionary. If you've watched my previous videos, you'll have seen me extract JSON data using XPath, where script is the tag and @type is the attribute. So let's run something similar here: response.xpath, the usual format with two forward slashes, then script rather than div or span or p, then @type="application/ld+json". Remember to get your quotes correct and don't mix them: if you use double quotes inside, use single quotes outside. And I just want the text... hmm, OK, so that's got the selector object but not the text, because I also need to use .get().

Let's just review that. At this point, response is everything in that entire page, and as you saw, I want to get the JSON out of that little section we saw in the browser just now, because we know it's going to provide us with a dictionary which will provide us with those eight links. Once we've got the eight links, we can go and visit them and get the ingredients and the recipe, which is nice. This is good, this is progress. What we've achieved so far is that we've identified a suitable selector which will enable us to load a dictionary, which will enable us to get the URLs.
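As a quick reference, here is that shell session in one place, a minimal sketch; the exact food.com listing URL is only shown on screen in the video, so treat the fetched address as an assumption:

    # Scrapy shell session (listing URL pattern assumed from the video)
    # fetch('https://www.food.com/recipe/all/healthy?pn=1')  # pn=1 behaves like the un-numbered first page
    raw = response.xpath('//script[@type="application/ld+json"]/text()').get()
    # without .get() you only have a SelectorList, not the JSON string itself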
It's the URL we want at this point, so let's copy that selector into our code. Let's turn it into a shorter name, res, and I've also taken the liberty of importing json, which you may not always see in tutorials that stick to the basic Scrapy framework and do a sort of basic books.toscrape.com-style demo. So we've loaded the JSON here: res = json.loads(res). Make sure you put the s on the end, because loads means "load string"; if you don't use the s, json.load expects a file object, and that's no good to you here. I've also imported pretty-print, from pprint import pprint, so what I might just do is a pprint of res, which is now the JSON loaded as a dictionary.

Ctrl+S to save and run. Sorry if the output's a bit small down here; I can see it's run, so let me make it bigger and put it on a white page. If I scroll down, this is the output from what I've just run in Atom, and this is very good: it's the JSON, loaded. We've got @context, which we're not interested in, and we're not interested in the next key either. What we are interested in is the third key, called itemListElement, and then we want to iterate through these URLs.

So let's do that: iterate through each batch of recipes, eight on the very first page and really eight on every page, with for i in range(8). We know it's eight each time; it might break if there aren't eight on the very last page, but we'll cross that bridge when we get to it. I can put a try/except in there somewhere.

Back in Scrapy shell, I need to try and get the code right here, because it needs to pick out itemListElement, that third key. Looking at the document, there's @type ListItem and url. Let's just peek at pprint(res) again; it's very case-sensitive, isn't it. So: link = res['itemListElement'][1], then print(link). I'm making a bit of a meal of this, sorry. OK, no error. If I increment that index to 2, let's see if we get a different recipe; we should get Chinese fried rice. Yes. But that's getting the whole dictionary for the item, and we need just the value from the url key, so let's specify that we want to select ['url'], and then print(link) gives us exactly what we want.

Right, sorry about that. This is the kind of thing we want to be looking for, but that index will need to be i, because we will iterate through. So let's copy that and try it in the code: for i in range(8), and to begin with just print res['itemListElement'][i]['url']. Let's run it, make the output white, and I'll show you on the big screen. I think that's worked: that's what we had before, that's just the response, and this is a selector which picks out itemListElement, item i, then the url. This is good; this is what we want. We're in business.

Let's do link = the same expression; we know that works, and we can actually start to yield it. Just to test this, we can begin writing the CSV. We need to keep the yield inside the for loop: yield {'link': link}. I've already made my items.py file, because I know already that I just want link, then the recipe, and then the ingredients.
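Putting that together, a sketch of the spider so far, under the assumption that the class boilerplate and start URL look roughly like this (the listing URL is assumed, as before):

    import json
    import scrapy

    class FoodSpider(scrapy.Spider):
        name = 'food'
        start_urls = ['https://www.food.com/recipe/all/healthy?pn=1']  # pattern assumed

        def parse(self, response):
            raw = response.xpath('//script[@type="application/ld+json"]/text()').get()
            res = json.loads(raw)      # loads = "load string"; bare json.load expects a file object
            for i in range(8):         # eight per page; a short final page would raise IndexError
                link = res['itemListElement'][i]['url']
                yield {'link': link}   # the URLs are already absolute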
I may actually have to break that recipe field down into more parts, HowTo step one, two, three, I don't know, because if you remember there were several how-tos, or several steps, in the recipe. We'll cross that bridge when we get to it. Let's just run that and see if it makes the feed results. Yes, and it was very quick; it's done it. So we've got our links, and the good thing is they're absolute URLs, so that's even more of a bonus.

So that is one part of the spider done: we've got the links. Next we will need to visit those links with a callback, then get the recipes and the ingredients, and return back to parse so that it can do all eight; once it's done all eight, it will go to the next page, so we will then need to put in our next-page code, which will follow down here. We won't actually yield here eventually, because all of the yields to the CSV will be done in one place, which will be down in the detail callback. In fact, I don't really know if we need to save the links at all, so we won't even need to use cb_kwargs or meta or anything, because the link is only for our own use. I suppose it could be useful to put it in the CSV, but with potentially half a million results, leaving the link in would make the file much, much bigger, and that wasn't part of the spec. So the link is for our use; let's just comment that out for the time being, take a break, and we'll return to look at parse_details.

OK, I've just noticed I left in a pass statement, which is not needed. And we can leave the next-page code; we're not worried about that yet. We'll do it last, for the very reason that once we activate it, we are in for a very long test. What we want is to drop down from parse to parse_details next, so outside of the for loop we will need something to take us to parse_details. I've added (not just now, but earlier) from scrapy import Request. What we need to do here is set request equal to Request, and I need to indent that, because whilst we're on each one of those eight recipes we need to go and visit it: url=link, headers will just equal, as ever, headers=self.headers, and callback=self.parse_details. Remember to put self in front, because we're inside the class; I nearly typed object there.

Next, we actually need to know whether this request is working, so just as a demo I'm going to add a print. We should end up with it going through for each of the eight links. I can comment out the link print now as well; we don't really want to see that in our output each test. Right, let's test this and see what happens. Hmm, we appear to have... ah, I know, I left out the yield. I've created a variable but I haven't used it, so: yield request. Now try again. I think I just saw some OKs flash up there; it's looking better. Yes: one, two, three, four, five, six, seven, eight. Perfect.

Do you notice how the OKs aren't in sequence with the URLs? Scrapy is asynchronous, which is good because it means it runs faster, because it effectively tries to do two things at once, but it's not always good if you're trying to run things in a specific order, so bear that in mind. For now, though, this is exactly what we want.
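So the parse method now hands each link off to the detail callback. A sketch (headers and parse_details are the names used in the video; the stub body of parse_details is mine):

    from scrapy import Request

    def parse(self, response):
        res = json.loads(response.xpath('//script[@type="application/ld+json"]/text()').get())
        for i in range(8):
            link = res['itemListElement'][i]['url']
            request = Request(url=link, headers=self.headers, callback=self.parse_details)
            yield request          # creating the Request alone does nothing; it must be yielded

    def parse_details(self, response):
        print('ok')                # temporary proof the callback fires; arrival order is not guaranteed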
We are now dropping down to parse_details, and we're doing that for each link on our main page. So next we need to look at getting the recipe and getting the ingredients. How are we going to do that? Well, if you remember, we've already identified some JSON which contains both, so let's go back and refresh ourselves about that. Here we are back on the recipe page, and effectively what we've done is identify eight of these, click on one, and drop down a level. So parse_details is working on this page, and what we want to extract, I believe from my notes earlier, is recipeInstructions, with a capital I, and the text we want is in there somewhere.

Good. OK, I'm just going to call up the notes I made when I ran Scrapy on the directions and the ingredients. For testing, we want a detail URL, and we want to fetch it: 200, good. And I just want to check my code from my notes from earlier; in fact, it was the very same syntax that we just used on the top-level page, so I'm just going to paste that in, exactly what you saw just now, and do pprint(res). That was really good: effectively we used identical code to what we just used in the for loop to get the eight links. Here we're using the same XPath selector to extract the recipe and recipeInstructions, so if you understood what we did to get the eight links, we've effectively done the same thing here: we've loaded a JSON object from the script tag into a variable called res, and we've just done a pprint on it.

I've already imported json, so let's do, what should we call it, recipe = json.loads(res), and then we want to print recipe. We're actually going to pick out recipeInstructions, and from earlier we identified that we want the index to be 5... I think that was just one of the steps. So if we change that to 4... yes, we're just getting individual steps, so we are going to need to put in a bit of logic to iterate through all of these.

OK, let's get back to our code, and under "get recipe" we can reuse this. The reason this works is that inside parse_details we're actually working with a different response. Why are we working with a different response? Because we've requested a different URL. The response in parse contains the eight thumbnails; when that method finishes, everything in it gets cleared, and here the response takes on a life of its own: it's now the page whose JSON object contains the recipe and the ingredients.

So how can we handle this? We might need a while loop, actually, because there's no fixed number of steps, is there. Let's just do recipe = json.loads(res) as before, then print recipe['recipeInstructions'][0]. recipe is the name of our dictionary, recipeInstructions is our first key, and let's just use 0 to start with; we'll be putting an iterable in there next, but let's just run that and see what happens.

OK, let's look at it in big. Good: for each of the recipes, we've gone off and got the first line of the recipe instructions. What was the first recipe? Award-winning chili: "heat olive oil, add ground meat and cook". OK. The second one was gingerbread, where the first line of the recipe is baking powder, ginger, cinnamon.
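So parse_details starts exactly like parse, just against the recipe page's own JSON-LD. A minimal sketch:

    def parse_details(self, response):
        # this response is the recipe detail page, not the listing page
        res = response.xpath('//script[@type="application/ld+json"]/text()').get()
        recipe = json.loads(res)
        print(recipe['recipeInstructions'][0])   # one step; indexing by hand for now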
Ginger for gingerbread: that's looking good. We'll just check the third one; there's no certain way of tying it up, but I think we can be fairly confident that we've extracted the first line per recipe. So now we need to handle getting all of the lines per recipe.

OK, so I've refined it down to recipe['recipeInstructions'], and with pprint we get the full structure. The problem I've got is that I don't know how many steps there will be. You can see we've got note4, note3, note2, note1, and I'm expecting this could change based on each recipe: we've got four here, and if we go to this other one and see the full recipe, I don't think it's even going to have one. So let's try a fetch on that instead: paste the URL, 200 response. And again we can just do the same... no, we can't; we need to load the new response again. So with the new response we load it into res again, and then recipe = json.loads(res). Before I did that, we were still getting the crock-pot recipe, whereas we need the crêpes. That "note" syntax we saw earlier, note0, note2, note3, is just what the poster, the person who left the recipe, used; what we're interested in is the text key. So we need to get all of the text. I think what we'll do is extract the text and then put a new line after each chunk of text that we extract.

So we need to specify that we want to extract the text. I get a slice error here. OK, so I think we're going to need to iterate through, because this is a selection of dictionaries inside a list. Let's have a look at that.

OK, so I just went and grabbed a cup of coffee and had a little think about things, and really what we're dealing with is a list, so I needed to establish the length of the list. The list is recipeInstructions, so print(len(recipe['recipeInstructions'])) (we don't even need pprint there), which gives me 10. That's good; now that we know the length, we can iterate through it. So let's copy that, and we can use a while loop. I can turn this into variables: rcpl equals the recipe length, then while i < rcpl, print the i-th recipeInstructions entry, and then we need to do i = i + 1, otherwise we'll get an infinite loop. I'm doing a bit of tidying up here as well; that's a lot of test code we've been looking at. So if you're ready to see this tested, let's go for it.

Finished in 3.79 seconds; let's see what we've got. I briefly saw it flash up and it looks pretty good. From line 40, if you can see the line numbers just here: "slice one onion and place in the crock pot". I know from memory that, yes, it's still mentioning crock pot down here, and if you remember we had the notes in the crock-pot recipe, so from line 40 down to there, that looks to me like the whole of that recipe (I forget its name, but it's all there). The next recipe was salsa, and that looks pretty good too; it looks to me like a salsa recipe, it mentions it there. But we're actually still extracting the @type, HowToStep, and the text here, so we need to just slightly refine our selector. I'm just going to drag the minimap down a bit; it looks like we're in business. As I say, Scrapy is asynchronous, so you can often see jumbled-up output. I think what I'm going to do is try to pick out just the text, and then we'll yield it from parse_details and see what we get in our results.csv file.
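Here is that length-driven while loop as a sketch:

    rcpl = len(recipe['recipeInstructions'])     # number of steps varies per recipe (10 here)
    i = 0
    while i < rcpl:
        print(recipe['recipeInstructions'][i])   # still the whole HowToStep dict at this stage
        i = i + 1                                # without this we loop forever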
So let's go back to our code and test this in Scrapy shell. Does this work? No. If I put in index 2, 3... ah, then we need to add ['text'] after it. Do you see what's happening? Before I put this on the end, we were getting the key and the value; now we're just getting the value. I'll take the pprint out again, just because I've cursed it, and there we go. So let's try another one: copy ['text'] onto the end of here and run it again, and this time I'm hoping it's picked out just the text and not the start of the dictionary. Yes, perfect. That is fantastic.

So we are now ready to assign that to our recipe variable. How are we going to do that? We should probably build it as a list and then append to the list, I think. So: ls = [] (empty square brackets), and instead of print we'll assign to rcp_tx, and we just want to do ls.append(rcp_tx). If we clear that, we know that works, so let's test the spider. Although we'll see the output on the screen, it'll be much shorter this time, because it's just the OKs now; the text isn't printed, and the results file should be empty. Yes.

So now what we need to do is put in our yield, and we need to do it here, not indented inside the loop. Let's look at items.py to refresh what the field was called: yes, it was recipe. So here we need to say yield, open curly braces, 'recipe': ls. Let's test that... looks OK... no, not good. OK, what's gone wrong? Let's have a look... stored in food results... hmm, it must be something quite simple.

OK, so I actually have got all of the recipes in the CSV now. As you can see, rows two down to nine; obviously row one is the header, the key. We don't have the steps split or separated, so it does just look like one big recipe, but it's all there. So how did I do that? We used the length of recipeInstructions to iterate through a while loop, we appended each recipe instruction to the list, and then we yielded the list. So we've got the recipes, eight so far. The next-page code we'll leave to the end, because that could potentially make our output file very big and our test very long. So, if you're ready, let's begin with getting the ingredients.

OK, so now we've got the recipes and we've got the links; the final bit of our spider is to get the ingredients. If you're thinking what I'm thinking, it'll probably be quite a similar process to what we've used so far to get the recipe, or the directions as they're called here. Now, on this page I've already clicked the button which expands the ingredients, so let's go back to one that hasn't been clicked on, so to speak. Let's go to the salsa; what I'm looking for is one that's not fully displayed on the page, and we're already expecting it to be very similar to the directions. Let's just click to see the full recipe, and that's expanded it. If we scroll up, we could look through all of that, but instead let's just view source and do Ctrl+F for, I think it was vinegar I just saw, wasn't it. And it's loaded: you can see vinegar there, and vinegar was one of the ingredients which was not initially displayed. So although it's not displayed on the web page, it's actually loaded in the background, which is perfect, because that means we can extract it from a good old JSON dictionary. As before, let's go into Scrapy shell and just work out exactly how we're going to do our extracting. I'm already thinking I might just be able to adjust the existing line and change the key to the ingredients one.
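So the finished extraction loop inside parse_details, building a list and yielding it, looks roughly like this:

    ls = []                                      # the instruction strings for one recipe
    i = 0
    while i < rcpl:
        rcp_tx = recipe['recipeInstructions'][i]['text']   # just the text, not the whole dict
        ls.append(rcp_tx)
        i = i + 1
    yield {'recipe': ls}                         # 'recipe' matches the field name in items.py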
That would be too easy, wouldn't it? Let's just print recipe; I'm thinking the ingredients are in the same dictionary. Yes: we want recipe['recipeIngredient']. OK, no s on the end; that's better, we're in business. So I think that's it, actually; this is going to be a lot more straightforward than the instructions, or the "recipe" as the site calls the directions.

So, for the ingredients: we're getting into the swing of this now. Let's just move this down, because the yield needs to be at the very end. For getting the ingredients we're actually using the same resource: we've already loaded the response into a variable called recipe up here, so we just need to create a new variable; we've already done a lot of the work. So let's go for ingredients = recipe['recipeIngredient']. Let's see if we can remember what it's called in our items.py file... let's just go in here and add ingredients. So that's the name of our field in Scrapy, and ingredients is the name of the variable we've just created; check it's spelt correctly. I'm going to test that... and looking at the results we only have recipes, so something's not quite working. Never mind; I'm getting in a muddle here with too many open untitled test files. Let's get back to Scrapy shell. So that's returning us a list; let's just run recipe['recipeIngredient'] again. Yes, see, we're getting the ingredients here. I don't know why that wasn't working; was it because I had a space? It may have been. Well, I think we're there: the ingredients are now in the CSV. All we did really was pick out this value, using the recipeIngredient key, from our response, which was loaded from JSON into a dictionary. So this was a nice little quick section.

The one last thing to do is the infinite-scroll code, for which I will need to check the JSON for something called next page, or a variable such as has_next, and whether has_next equals true or false; I've done a little bit of research on that. So let's get on to the next page and finish the project.

OK, so the final section: let's try and work out the infinite scroll. We've already established that we can handle it using the API by incrementing the page number each time. The only issue, as I just alluded to, is: how do we know when we've gone up through all the pages and got to the last page? There's normally something such as has_next or a next-page link. So I've just done a view-source here, and I'm going to do Ctrl+F and search for "next", and luckily, in the entire page, you can see down here, we've got one of one match. And you can see it here; in fact, this was here all the time: a link tag with rel="next". So what I'm thinking is that when we get to the last page, this will probably be missing entirely, or there won't be a next link, or the href will be empty. So we need to write something like: if the link[rel="next"] selector comes back with nothing, then that's the end of the spider. If you remember, with our next-page code we normally just use something like next_page = response.xpath(...) with the hyperlink to the next page. So what I'm going to do is fetch the second page (or the third, it doesn't matter). Let's just try this.
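The ingredients step really is that small. A sketch, reusing the recipe dictionary already loaded in parse_details:

    # note the JSON-LD key is singular: recipeIngredient
    ingredients = recipe['recipeIngredient']     # already a plain list of strings
    yield {'recipe': ls, 'ingredients': ingredients}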
First, we need to change our stored response in Scrapy shell, because our current response is from the actual detail recipe page. We want to go back up to the top level and fetch a listing page, so that we can look at the response and start creating a selector: response.xpath, and we need to try and find that next link.

OK, so I've had a bit of a think about this, and what I've done is identify where the source says link rel="next" (to show you: I clicked View Source and did Ctrl+F to find "next"). So: if the next link exists, we want to go there and follow the href; if it doesn't exist, that means we're on the last page. Let's just test that. The selector gives me one match; if I change it to some nonsense, it gives me zero. So really what we want to be saying is: if response.xpath('//link[@rel="next"]') finds something, then we want to follow to the next page.

So let's copy that (not that one, the good one) and go back up to our parse function: if the match count equals one, then next_page = response.follow... actually, we don't want to use the start URL for this. What we could do is create a new variable, which we'll call list_url, and then we can exchange that in. Oops, too many quotes; that's a very bad habit with Atom, it auto-closes the quotes when you don't necessarily want it to. So we want list_url as our next-page base. We then need to increment the page number... we don't need to do any of this; I'm talking rubbish. Pardon me, I'm getting hungry. We just want to visit the link... or do we? Scrap that, back to plan A: list_url plus pn plus one. Now, is that an integer plus an integer? We want the string of that. This is not pretty; we might need to tidy it up in a minute. So the URL is the base URL plus the new page number, headers=self.headers, callback as before. Save that; do I want to test it this way? OK, let's go for it. I need two equals signs there (==). Hmm, I need to just have a look at that; apologies, it's getting near my lunch time and I'm starting to flag a bit, I'm afraid. I've just remembered I don't really need headers in there either, so let's just save that... why is that not working? OK, something happened there, but I'm not sure what; it stopped after the first page, so the next page, the infinite scroll, hasn't worked. Why is that?

This is a little bit messy; I think I need to fix it. I'm just going to sort out the list_url variable and print it, and I think I need to use self here. I can see we've had an error: pn is not defined. I probably need self.pn, or actually just to define it; in fact it wasn't pn at all, the variable is page. So if we do page + 1: one plus one is two, str of two, it should go into the URL and take us to page two. That's not looking great either; what's going on? Page is not defined: self.page. OK, that's looking better: at line 38 you can see we now have the correct URL to go to. Have I actually used it, though? I haven't; I've only printed it. So now we just set that to a variable, and we'll call it next_page. Here we go; this is much better.
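The next-page logic being assembled here looks roughly like this (list_url and page are the variable names used in the video; the exact URL layout is assumed):

    # at the end of parse(), after the eight detail-page requests
    if len(response.xpath('//link[@rel="next"]')) == 1:   # the tag is absent on the last page
        next_page = self.list_url + str(self.page + 1)    # list_url is the base ending in '...pn='
        yield Request(url=next_page, callback=self.parse)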
And if you're ready, I think there's a good chance this will now work, so we're going to do a potentially very large bit of web scraping here. Let's go (it helps if I press the right buttons to start with). Hmm. OK, page number one, and it still finished after eight results; it hasn't gone on to get the next pages. I'm just going to put a print in here and see if it actually reaches this bit; the count should equal one, so print "get second page" and try again. No, so that condition isn't evaluating to true, and I just need to confirm that that's the case. Ah, do you see that? What happened there is that the XPath count comes back from Scrapy as '1', a one in quotes, and in my code I was just comparing with == 1, the integer, so it's giving me False; if I put quotes around it, it gives me True. So let's put quotes around it, and we might actually be close to a working spider.

I just saw "get second page" run, and 16 items flash up. Let me just do a find for 403 errors... no, that's good. Let's have a look in here. This is progress: we've now scraped the second page, and we've got a recipe involving mashed bananas and blueberries. As you can see, this goes way off the edge of the page, but if we open it in, let's say, LibreOffice, we'll be able to shrink that column and get a bit more sense out of it: Documents, projects, food... food.com, food results .csv, and open it once LibreOffice decides to load. OK, so we've got the ingredients... let's put ingredients in the second column, so we've got the recipe in the first. There we go. And I've just realised there's something missing: we need to collect the name of the recipe.

So, two things left to fix: we need to go beyond the second page, because here we've gone to page one and page two but for some reason failed to get to page three; and we need to get the name of the recipe. And I've already spotted why we're only getting the second page: page equals one, always. The first time through, it goes off and gets the start URL using page one; once it's gone through, it returns and builds the next page as self.page + 1, and since self.page is still one, next page always equals two. What we need to do is, after the yield, set the stored page to be the current result of self.page + 1. OK, so let's go up here and create a variable to store it, nxp, and then we'll use self.nxp to store the current page: self.nxp = next_page, and it will then become two. Let's change that over... right, let's test it.

OK, so nxp: we need to change that to a string... no we don't, we need to change this initial value; I think it's a one, it might be a zero. Let's run it. Wow. Stop that. That's definitely working, wow, we're in business: it is iterating, well, infinite-scrolling. Ctrl+C is not doing very much here; Scrapy's busy. Whoa. So we are now scraping all of the healthy recipes, we are extracting the ingredients and the recipes, and we are building quite a substantial CSV as I'm speaking. OK, how many did we get just then? 472 recipes (I had to force-quit it). So there we have it, and that's amazing. Brilliant.
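The fix, in sketch form: persist the page counter on self so each pass through parse advances it (the video stores it via a helper variable, nxp; the net effect is the same):

    if response.xpath('//link[@rel="next"]').get() is not None:
        self.page = self.page + 1                         # remember progress between calls
        next_page = self.list_url + str(self.page)
        yield Request(url=next_page, callback=self.parse)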
I hope you've liked this. I'm just going to pause it there; I'll come back, add the actual recipe names in a third column, and then do a little summary. So get yourself a cup of tea, and we'll be back for a last little round-up.

Right, we're on the finishing straight now. What I want to do is extract the actual recipe name, and because I'm trying to be efficient, I just want to see if I can extract it in the same way as the instructions and the ingredients. I've just fetched that potato recipe, so let's do the usual, where we load the JSON into res, and once we've got it as res we do recipe = json.loads(res), then print something. OK, moment of truth... well, that's the ingredients; I wouldn't have expected the name there. Let's just try the name key: recipe['name'] gives "Scalloped Potatoes". Perfect. So in actual fact, all we need to do is copy those two lines into a "get recipe name" section (I'm not sure the capitals matter), and just change the key, because we've still got the same response object, and therefore the same dictionary; we're just picking a different key, and the key is name. As simple as that. I'm going to put it at the start, although Scrapy doesn't always put columns in order. Oops, that's wrong; that should be a colon. Let's also go and update items.py: I haven't used some of these fields, so let's tidy them up. We're not going to use id, so that can go; we're not even using link; we are only interested in recipe, ingredients and name. Let's keep it neat and tidy.

I think that's ready to run, so if you'd like to see the spider run and get the name: as we know, my code deletes food results each time, so I'm going to rename the old one, because I want to keep it; I'm a hoarder. Let's go... and I can see some errors. That's not good. And if you were paying close attention: I was so busy talking that I forgot to rename this variable to name, so hopefully that's a simple little fix, and let's try again. It's looking good; I'm going to quit it rather than let it run much longer. OK, food results: name, recipe, ingredients. We've got Wonderful Salsa, we've got Crepes, Pulled Pork Crock Pot, Perfect Pork Tenderloin. How many rows have we got? 177, remarkably quickly, before I quit it, and that was probably about 17 seconds, so I think we're getting about 10 recipes a second.

What am I doing? Let's view the CSV: close the previous one, and we want the CSV made at approximately 14:30, yes, 14:29. Open with LibreOffice; I'm not expecting any surprises here, to be honest... ah yes, there's a splash screen behind. Here we go. Now, what you'd probably do is swap columns B and C around. I thought that if we wanted to, we could change the order of those in items.py (I can't remember: name, recipe, ingredients, how do they appear in items?), but no, they don't appear in the CSV in the order they do in items. Of course: it's the order in which we yield them in our code. Well, that's a very easy fix, so let's just do that, with name first. Although I think you'd always want to see the ingredients before the recipe... but we're just splitting hairs now, really. It's nearly time to wrap this up, but I'm just going to run it once more.
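And the name really is just one more key. A sketch of the final yield in parse_details:

    name = recipe['name']                        # e.g. 'Scalloped Potatoes'
    yield {'name': name, 'ingredients': ingredients, 'recipe': ls}
    # CSV column order follows the key order of this yield, not the field order in items.py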
As you can see, it's nicely running in the background, and what I don't want to do is hammer the server, because I don't want to get banned. I don't think I will, but let's just close that, because we've got enough to be getting on with. How many did that get just then? The same number, 177. Save, and let's open up our CSV one last time.

Now, how not to get banned: I'm going to run through in a minute a few settings that you can set, sort of a best practice for using Scrapy. I mean, there's no point in being aggressive and trying to do it all in a hurry and getting banned, because you're going nowhere if you do that, so it's always better to be conservative. The first thing you want to check is settings.py; while this opens, as you can see, settings.py is there. You can edit settings.py itself, or you can put custom settings into your spider code. The four that I change are the HTTP cache directory (which you set to a local directory alongside your code), the download delay, the AutoThrottle start delay, and concurrent requests, and once we're happy with this run, we'll go and take a look. Meanwhile: we've got ingredients in the second column now, which is excellent, and we should have rather a lot of rows. Yes, the last time I ran it, I let it run for getting on for 30 seconds and got over 300 results, so we've got 361 recipes, all in a CSV. If the requirement is to collect 500,000 results, which it was, then the next step would be to write the output to a database. That's beyond the scope of this video; I think a lot has been covered here, so possibly in a future video I'll look at writing the output of this project to a database. If you've got any questions, drop me a comment. Thanks for watching; I hope you've learned a lot. I've learned a few things making this video, so it's all good. Thanks for watching, have a nice day, and don't forget to subscribe.

And just to tidy up the loose ends, I was discussing settings. When you use scrapy startproject from the command line, you'll get this settings.py file, but most of the options are commented out. So if you come in here, the first thing I'll do is HTTPCACHE_ENABLED = True; for the cache directory you can accept the default. Then AUTOTHROTTLE_ENABLED = True: I'm not going to go into it here, but it's disabled by default and you would want to enable it; Scrapy then varies the speed at which it makes requests based on the delay of the responses and so on. So there we go, read up on that; it's beyond what I want to cover here. DOWNLOAD_DELAY = 5: I'm still experimenting with these settings, and the default is zero; there are links to all of the documentation, and these are just the settings I'm using at the moment, so don't treat them as gospel. CONCURRENT_REQUESTS I've reduced right down to one; now, I fully expect that to make my spider take a lot longer, but I want to start conservative and then gradually speed up. ROBOTSTXT_OBEY = True: sometimes you will need to set that to False, and you may have seen me do that in some previous videos, but for this spider I've left it as True and I'm perfectly able to get the bits that I want. So this is all in settings.py; and if you look at food.com, the main spider where we've done all the work, there's custom_settings, where you can put custom settings to override, or in addition to, what's stored in settings.py.
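For reference, the settings discussed, as I currently use them; a conservative starting point rather than gospel (the values are the ones named in the video):

    # settings.py, or custom_settings = {...} inside the spider to override per-spider
    HTTPCACHE_ENABLED = True       # cache pages locally while developing
    AUTOTHROTTLE_ENABLED = True    # off by default; adapts request rate to server response times
    DOWNLOAD_DELAY = 5             # default is 0; still experimenting with this value
    CONCURRENT_REQUESTS = 1        # deliberately slow to start; speed up gradually
    ROBOTSTXT_OBEY = True          # left True here; the pages I need are permitted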
Now, I'm not going to give any hard and fast rules; a lot of these depend on the site you're scraping and whether you're using a proxy. I personally use a VPN, so if I did do anything which the website didn't like, if I scraped it too quickly or it decided to ban my IP address, then with a private VPN I could potentially change my connection, to UK Manchester or Ireland or Paris or wherever, and in doing so I would obtain a new IP address, which would mean I could start scraping again. I'm not getting any money from these people; in fact, I pay them money. They're very good, it works, and I've never had any problems, so I'm not really promoting them, but if you are looking for a VPN, I can at least give you some sort of confirmation that PrivateVPN works fine, and I've been using them for, I don't know, about a year now, probably since the 5th of November 2019. I'm just looking at the site: the current plan is 4.65, so it's like four English pounds per month, just under 50 pounds per year. But if you're web scraping professionally, you might buy some proxies, and I think Crawlera is the main proxy company that lots of people use, billed as the world's smartest rotating proxy. I'm not sure how much they cost, and as I say, I've not actually needed to use them yet, but there we go: the Crawlera proxy API, on scrapinghub.com, with Basic, Advanced and Enterprise tiers; you're probably encouraged to go for the Enterprise one, but as yet I've not needed them.

I'm going to set this up on my Raspberry Pi, or attempt to, and then I may do a future video on that, and if it takes days to run, I really don't care; if it takes three days, that's fine by me. So I'm going to sit here with a calculator and work out how long it will take to get me half a million. In fact, shall we do that now? I got 300 in 30 seconds, so 10 per second, times 60 is 600 per minute, times 60 is 36,000 per hour. 500,000 divided by 36,000 is, what's that, 14. Yes, 14 hours: at the current rate, on this Ubuntu VM using my private VPN, it would take about 14 hours to scrape half a million results.

Now, the page that I've been working on has only been the healthy section. If you remember, from the recipes page we had this dropdown, which offered trending, popular, quick and easy, healthy, editor's pick, and new. If you want to collect all six, the problem is that you might end up overlapping and getting loads of duplicates, so I'm not sure how you would get all the recipes in one fell swoop. If you look as I move the mouse, the "all" still remains in the URL, and if we remove what appears after the "all", we get an error. In which case, we may need to run six spiders, one per category, just as you've seen with healthy: what we've done is extract all of the healthy recipes, and if we wanted to take this spider and run it on something else, we could change healthy to quick-and-easy or popular and run that as well. We would end up with potentially six CSVs, which we would then potentially need to merge or something, but that would be a good starting point.

So next I'll be attempting to run this on a Raspberry Pi, and I will also be looking at databases and saving the records to a database. That is it for now; that is me signing off. I hope you've enjoyed this; it's taken a lot of work, and I hope you subscribe. Good luck with your
web scraping. Goodbye!
Info
Channel: Python 360
Views: 836
Rating: 5 out of 5
Keywords: scrapy, load more, infinite scroll, how to run scrapy as a script, web scraping, ajax to json, xpath, os.remove, crawlerprocess, Python3, how to webscrape javascript, python scrapy tutorial, upwork web scraping job, code monkey king, dr pi, webscrapping, scraping the web, parse, parse_detail
Id: 07FYDHTV73Y
Length: 115min 30sec (6930 seconds)
Published: Fri Jul 10 2020