This is a Scraping Cheat Code (for certain sites)

Captions
Apparently sitemaps are still very much a thing, even if their usefulness is a bit more debatable now: given how good Googlebot is at indexing your site, you probably don't strictly need one. Either way, a lot of sites still have them, especially e-commerce sites with lots of products, and in this video I'm going to show you how we can use Scrapy and its SitemapSpider to crawl and parse a sitemap, filter the links against specific rules, and scrape just the pages we're after.

If I come to this site, you'll see a standard e-commerce site, and to find the sitemap you just type /sitemap after the URL. There's a load of links there; that's not exactly what we're after, but it's close. If we go to /sitemap.xml instead, we get the XML that holds all the product links, all the information we want to give to our SitemapSpider. The browser has formatted it for some reason, so if I go to View Page Source (once it eventually loads, because this is a massive sitemap) we can see the raw XML the SitemapSpider is going to parse. Zooming in to check out some of the links, you can see the standard structure: a <url> opening tag and a <loc> tag with the actual URL inside. There are a huge number of them, and if I search for this particular brand we get 317 results across different links, and some of those are products. So we can use Scrapy's SitemapSpider to find all the links that match that rule.

If I go to the Scrapy documentation and search for SitemapSpider, we can get an idea of how it works. I've used the CrawlSpider a lot, probably more often than the regular Spider, but this is another good one to know you have access to if you use Scrapy. It's fairly straightforward: you give it sitemap_urls, which is a list, and a list of sitemap_rules, where the first part of each rule is the pattern to match URLs against and the second part is the callback function. Looking at our sitemap, all the product URLs follow a specific format: the SKU, then a slash, then the brand name followed by the product name. I'm going to use that to my advantage so I can say: scrape me all the products from this brand.

So let's get started. I have my projects folder open here; I'll create a new directory called sitemap-example, cd into it, and create a new virtual environment, because we're going to be installing Scrapy, which has lots of dependencies. Activate it ("act" is just a shortcut of mine), then pip3 install scrapy. Our plan is to scrape all of the links that match a specific brand and dump them into a database, only adding the ones that are new to us rather than inserting the same rows over and over again. Once that's installed I run scrapy startproject and call the project ecom to keep it nice and easy, cd into that folder, and run scrapy genspider products with the site's URL. That creates the spider template for us, which we'll obviously need to change a lot.
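As an aside (this isn't something done in the video), if you ever want to sanity-check how many sitemap entries actually match a brand before writing any spider code, a rough sketch like the following works with just the standard library. The sitemap URL and brand slug here are placeholders, not the site from the video:

```python
# Quick, illustrative sanity check of a sitemap - not part of the video's workflow.
import re
import urllib.request

SITEMAP_URL = "https://www.example.com/sitemap.xml"  # placeholder URL
BRAND_SLUG = "some-brand"                            # placeholder brand slug

with urllib.request.urlopen(SITEMAP_URL) as resp:
    xml = resp.read().decode("utf-8")

# Pull every <loc> URL out of the sitemap, then keep only those containing the brand.
all_urls = re.findall(r"<loc>(.*?)</loc>", xml)
brand_urls = [u for u in all_urls if f"/{BRAND_SLUG}" in u]

print(f"{len(all_urls)} URLs in sitemap, {len(brand_urls)} match the brand")
```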
While we're here I'm going to start up tmux and give myself a new session, so I can have another window over here as well; however you choose to run this stuff is entirely up to you, I just use tmux and split my sessions like this. Right, let me open up the project file. Actually, let's check it first: running tree -I to exclude the virtual environment, here's our Scrapy project, and you can see the key Python files we're going to need: items, pipelines, settings, and the products spider. That's why we're creating a project, it makes our lives easier, although you can run Scrapy from a single script if you want to. My editor of choice at the moment is Helix; I've been using it a lot and comparing it to Neovim, but use whichever editor you want, it doesn't matter.

In our project folder, the first thing to do is the settings. We're going to change a couple of things here, starting with a more realistic user agent: I'll grab the user agent from my browser, paste it in, and format it properly. I'm going to leave the rest as it is for the moment; we might need to change some other things as we go along, but that was the main one. Save that and open up our products spider.

We don't want the default Spider import; instead, from scrapy.spiders we import the SitemapSpider class and change our ProductsSpider to use it. That means a few other changes: it's no longer start_urls but sitemap_urls (I don't know whether we need the www., but I'll put it in anyway), pointing at /sitemap.xml. Then we create our sitemap_rules, which is a list of tuples: we're going to match the brand and send the matching responses to a parse_product callback, so I'll create that parse_product function; I believe the plain parse name is reserved as the default callback.

Now that's created, let's do a couple of other things before we look at running it properly. Actually, first let's save all of this and just check that it runs at all: scrapy crawl products, and we'll make sure it's actually working and visiting the links we were looking for. We're not pulling any information out yet, but you can see on screen that the links being requested all contain that brand, which means our rule set is working.
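At this point the spider looks roughly like the sketch below. The domain, sitemap URL, and brand pattern are placeholders (the video never names the site), and the class name follows what scrapy genspider products would generate:

```python
# ecom/spiders/products.py - a minimal sketch of the spider as described so far.
from scrapy.spiders import SitemapSpider


class ProductsSpider(SitemapSpider):
    name = "products"
    allowed_domains = ["www.example.com"]                    # placeholder domain
    sitemap_urls = ["https://www.example.com/sitemap.xml"]   # placeholder sitemap URL

    # Each rule is (regex matched against the URL, callback name).
    # Only URLs containing the brand slug are passed to parse_product.
    sitemap_rules = [
        ("/some-brand", "parse_product"),                    # placeholder brand slug
    ]

    def parse_product(self, response):
        # Item extraction comes later; for now just confirm we hit the right pages.
        self.logger.info("Visited %s", response.url)
```

Running scrapy crawl products with something like this in place should just log the matching URLs, which is all the test run is checking.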
Next I want to look at actually getting the items out of these pages, so I'm going to open up my items.py file, add a few extra imports, and then create the item. We need the input and output processors, so from itemloaders.processors we import MapCompose and TakeFirst, and from w3lib.html we import remove_tags. When we create the fields for our item data, we'll use MapCompose to run remove_tags over all of them, so all of our processing happens here rather than in the spider file. It's up to you how you do this, but I quite like this approach because you can then use the item loader and everything flows neatly through, and as we move on to adding things to a database we can push it all into a pipeline that handles the inserting and the checking. If you're doing a lot of processing on the data you're pulling out, it might be worth creating a dedicated pipeline for that, but if you're only doing a little bit of manipulation you can also do it here: MapCompose accepts functions you've written yourself and applies them to the fields.

So let's create the item. Uncomment the template, and the first field will be brand: a scrapy.Field with an input processor of MapCompose(remove_tags) and an output processor of TakeFirst. TakeFirst just pulls the first value out of the extracted list. (I'd written that wrong at first; these need to be keyword arguments with equals signs.) Now let's copy that line down and create the rest of the fields, and format it a bit better, I'm still getting used to this editor. We want brand, title, SKU, regular price, and sale price, and I'll show you why there are two price fields in just a second. We don't need the leftover one, so remove it, and I'll shrink this down so it's a bit easier to manage and understand what's going on. These are the fields we're pulling data into, and each one runs through the same input and output processors. Again, if you've written your own function to do some data processing, you can put it in here; sometimes I do that if I want to strip a single character out of some of the data, or something along those lines.
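A sketch of what items.py ends up looking like; the exact field identifiers (sku, regular_price, sale_price, and so on) are my own naming choices for illustration:

```python
# ecom/items.py - item with input/output processors on every field.
import scrapy
from itemloaders.processors import MapCompose, TakeFirst
from w3lib.html import remove_tags


class EcomItem(scrapy.Item):
    # Every field strips HTML tags on the way in and keeps only the first value.
    brand = scrapy.Field(
        input_processor=MapCompose(remove_tags),
        output_processor=TakeFirst(),
    )
    title = scrapy.Field(
        input_processor=MapCompose(remove_tags),
        output_processor=TakeFirst(),
    )
    sku = scrapy.Field(
        input_processor=MapCompose(remove_tags),
        output_processor=TakeFirst(),
    )
    regular_price = scrapy.Field(
        input_processor=MapCompose(remove_tags),
        output_processor=TakeFirst(),
    )
    sale_price = scrapy.Field(
        input_processor=MapCompose(remove_tags),
        output_processor=TakeFirst(),
    )
```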
Now that the item is set up, let's take a quick look at the website and I'll show you what I meant about the prices. Take this product, open it up, and inspect the price element. When an item is on sale on this site, the price is split out across a sale-price element like this; but if we find a product that isn't discounted and look at its price, it sits in just a regular-price selector. So I'm basically going to pull one or the other, whichever is there, and the missing one can be none or empty or whatever.

If you're not sure how to pull the data out, the best thing to do is copy a URL you want information from, come into your Scrapy folder, and run scrapy shell with that URL pasted in. That opens the page up so you can query it with your selectors: you can do response.css and ask for the title or something like that and see it working. What I tend to do, though, is call view(response), which opens the response in a browser, and what you see there is exactly what your spider sees. If the information you want is visible on that page, the spider can see it too, and it's going to work just fine. We don't need the shell any more, so let's go back to our spider (and I'll move into the right folder so things are a bit neater).

In the spider we import our item, from ..items, because we want to use the item we just created with an item loader, so we also do from scrapy.loader import ItemLoader. The loader basically takes a field from the item and a CSS selector, runs the selector against the response, squishes the two together, and gives us back data that has been put through our input and output processors. So our loader is an ItemLoader where the item is the EcomItem we created and the response is the response the spider got here; that gives us the response we can query with CSS selectors or XPath, and the item the data goes into, the one we defined in the other file. From there it's basically just loader.add_css — you can use add_xpath if you prefer XPath — with the field name, brand, and then the selector. I'll grab the selectors in a minute, but we have brand, title, SKU, and the two prices, so let's add all of those while we're here. Then we yield loader.load_item(), which gives us back the item loaded with all the data (there's a sketch of the full callback at the end of this section). I've just grabbed the selectors from the site offscreen; there's no need to walk through them all, but if you use the inspect element tool you'll see the ones I've used: product name, regular price, and so on and so forth. Any information that's on the page can be parsed out; we're keeping it simple and just using these fields to exercise our CSS selectors.

Now is a good time to run it again and check that we've got the selectors right, so scrapy crawl products for the products spider. We should start to see items coming through if we've done it correctly, and it looks like we have; you can see them here. What I'm going to do now is open up the settings again and enable the AutoThrottle extension, which is a bit further down, just to slow things down and be a bit kinder on the site; I'll just use the defaults. It also means that while it's running I can talk and show you what's happening rather than everything flashing by on the screen. We'll also disable cookies, I don't think we're going to want those (and I didn't mean to close that window). Let's have another quick go and check it's all working properly: we can see the URLs it's visiting, and there's some information coming through. One thing to note is that we're actually visiting every page whose URL contains that brand slug, and some of those aren't product pages.
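For reference, the fleshed-out callback might look something like this sketch. Every selector here is a placeholder, since the real ones aren't shown on screen, and the URLs are the same stand-ins as before:

```python
# ecom/spiders/products.py - the callback filled out with an ItemLoader.
from scrapy.loader import ItemLoader
from scrapy.spiders import SitemapSpider

from ..items import EcomItem


class ProductsSpider(SitemapSpider):
    name = "products"
    sitemap_urls = ["https://www.example.com/sitemap.xml"]   # placeholder
    sitemap_rules = [("/some-brand", "parse_product")]        # placeholder

    def parse_product(self, response):
        # Pair each EcomItem field with a selector run against this response;
        # the values then flow through the item's processors (remove_tags + TakeFirst).
        loader = ItemLoader(item=EcomItem(), response=response)
        loader.add_css("brand", "a.product-brand::text")       # placeholder selector
        loader.add_css("title", "h1.product-title::text")      # placeholder selector
        loader.add_css("sku", "span.product-sku::text")        # placeholder selector
        loader.add_css("regular_price", "span.regular-price")  # placeholder selector
        loader.add_css("sale_price", "span.sale-price")        # placeholder selector
        yield loader.load_item()
```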
We have a couple of options here. We can either try to refine our sitemap rule so we don't visit those pages at all, which might be a little more difficult, but we'll have a look, or we can add a check so we don't try to pull anything that doesn't exist when we go to add it to the database. We're going to run into a couple of little issues there, but we'll handle them now. First, back to our products spider and our sitemap rule: if we go back to the sitemap over here, once it loads, and look at one of those pages, the URLs all look a bit odd, so I don't think refining the rule is going to help us much; we're still always going to pick up that first non-product page. So we'll build out the rest of the spider and figure out how to handle it as we go. You might find that in your case the sitemap rule can be tweaked and these pages handled a bit better.

Now let's create our pipeline, which is where we're going to save the data into the database. The way a pipeline works is that once an item has been scraped it heads through the pipeline and out the other end; at the moment there are no pipelines, so items go straight through and we just see them in the terminal. We want to store this data, and I'm going to use SQLite for that, so we import Python's built-in sqlite3 module to make our lives easier in that respect. We'll use the existing pipeline class that Scrapy generated; you should probably give it a better name, but I'm going to leave it as is for now. We do need to initialise a few things, so in __init__ we'll do some SQL setup: our connection, self.con, is sqlite3.connect pointed at items.db. Because this is SQLite it's just a file: it'll be created if it doesn't exist, or connected to if it does. Then our cursor, self.cur, is self.con.cursor(), and the cursor is what we use to execute commands against the database.
The first thing to execute is the table creation, so I'll use a triple-quoted string so we can free-type some SQL: CREATE TABLE IF NOT EXISTS products, with the columns brand, title, sku, regular price, and sale price. You don't need to tell SQLite what data type each column is, because types are never enforced, so I wouldn't bother. That's not indented properly, so let me fix it, and we don't need the ItemAdapter import, so I'll comment that out for now. The formatting is a bit rough; let's make it somewhat better. So what this does is: every time we run this pipeline it initialises our connection to the database, creating the file if it doesn't exist, creates the cursor we can then use to execute statements, and creates the products table with those columns if it isn't already there. Nice and simple, straightforward.

Then we fill in process_item, where we take the item data and add it into the database. It's basically the same sort of thing: self.cur.execute with INSERT INTO products, listing the fields we want to insert, brand, title, sku, regular price, and sale price, and VALUES with placeholders that we fill from the item: item['brand'], then the title, the SKU, the regular price, and the sale price. After the insert we call self.con.commit() so it's committed to the database each time. So every item that comes through this pipeline, once we enable it in just a second, gets inserted into the database and we'd be good to go, except that we first need to check whether we already have that item in the database; otherwise we'll just add the same thing over and over and end up with loads of duplicates, which is absolutely not what we want.
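A first-pass sketch of the pipeline as described so far, before any duplicate handling. EcomPipeline is the default class name Scrapy generates for a project called ecom, and the column names mirror the item fields I chose earlier:

```python
# ecom/pipelines.py - first pass: connect, create the table, insert every item.
import sqlite3


class EcomPipeline:
    def __init__(self):
        # Connect to items.db (SQLite creates the file if it doesn't exist yet)
        # and make sure the products table is there. SQLite never enforces
        # column types, so none are declared.
        self.con = sqlite3.connect("items.db")
        self.cur = self.con.cursor()
        self.cur.execute(
            """
            CREATE TABLE IF NOT EXISTS products
            (brand, title, sku, regular_price, sale_price)
            """
        )

    def process_item(self, item, spider):
        # Insert the scraped values via placeholders and commit straight away.
        self.cur.execute(
            "INSERT INTO products (brand, title, sku, regular_price, sale_price) "
            "VALUES (?, ?, ?, ?, ?)",
            (
                item["brand"],
                item["title"],
                item["sku"],
                item["regular_price"],
                item["sale_price"],
            ),
        )
        self.con.commit()
        return item
```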
To do that we put a check in before the insert and search the database first: self.cur.execute with a simple SELECT, which should really be in capitals, SELECT * FROM products WHERE sku = ?, again with a placeholder, filled with item['sku']. We use the SKU because it's the unique product identifier, so we're saying: look this SKU up in our database, and if it exists do one thing, otherwise carry on. We grab the result with result = self.cur.fetchone(), and then, on another line to keep it neat, if result: we raise Scrapy's DropItem exception with a message like "item already exists in database" plus the SKU in an f-string (once I type it correctly), so every time it happens we can see which item already existed. We need to import that, so from scrapy.exceptions import DropItem; it's basically a neat way of handling this. But we don't want the insert to happen when the check finds a match, so I'll put the insert inside an else and indent it in. We'll test this in a minute: as an item comes through the pipeline we search for it first, and if it exists we drop it, else we insert it.

Now we need to make sure the pipeline is enabled in the settings, so go down to the pipelines section and uncomment those lines; because we didn't rename the class, this is the default pipeline name, and if you have multiple pipelines you'll probably want to change that. Save it, run it, and we'll figure out what errors we've made and how to fix them; hopefully none, but you never know.

Okay, so we have a KeyError here, which I'll come back to in a second, and I've also messed up my SELECT statement, so I'll fix that now: the placeholder value needed to go in as a tuple. That should be fine now, so let's try again. It was failing on our check before; we'll see the KeyError again first, which, as I said, is expected, and there it is. If I stop this and scroll up, the issue is that we're trying to add an item that's expected to have a sale_price key, but this product isn't on sale, it only has a standard price, so we get a KeyError. There are a couple of ways we could handle this: you could change the execute statement so it only inserts what exists, or you could give these fields default values. For the sake of learning a bit more about pipelines, I'm going to create a default-values pipeline that sits in front of our database pipeline and fills in a default for either of those two price fields when it's missing. That's going to create another problem for us, but we'll look at that in a minute as well. So this is DefaultValuesPipeline, a class with a process_item method taking self, the item, and the spider, and in it we just call item.setdefault for regular_price with a default value. I don't know whether a blank value would work, probably not, so I'm just going to put in the string "None" as text; that's probably not ideal, but we'll use it as an example for now, and we do the same for sale_price. Then we return the item so it carries on out of this pipeline, and we can go and register it.
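Putting the duplicate check and the new default-values pipeline together, pipelines.py might look roughly like this sketch. The class name DefaultValuesPipeline and the literal "None" text follow the walkthrough, but the exact identifiers are my guesses, and the __init__ is repeated from the first pass above for completeness:

```python
# ecom/pipelines.py - default values first, then insert-if-new into SQLite.
import sqlite3

from scrapy.exceptions import DropItem


class DefaultValuesPipeline:
    """Runs first: fills in the price fields when a page doesn't have them."""

    def process_item(self, item, spider):
        # "None" as literal text, as in the walkthrough - not ideal, but simple.
        item.setdefault("regular_price", "None")
        item.setdefault("sale_price", "None")
        return item


class EcomPipeline:
    """Runs second: drops items whose SKU is already stored, inserts the rest."""

    def __init__(self):
        # Same setup as the first-pass sketch.
        self.con = sqlite3.connect("items.db")
        self.cur = self.con.cursor()
        self.cur.execute(
            "CREATE TABLE IF NOT EXISTS products "
            "(brand, title, sku, regular_price, sale_price)"
        )

    def process_item(self, item, spider):
        # Look the SKU up first; if a row exists, drop the item instead of inserting.
        self.cur.execute("SELECT * FROM products WHERE sku = ?", (item["sku"],))
        if self.cur.fetchone():
            raise DropItem(f"Item already exists in database: {item['sku']}")
        self.cur.execute(
            "INSERT INTO products (brand, title, sku, regular_price, sale_price) "
            "VALUES (?, ?, ?, ?, ?)",
            (
                item["brand"],
                item["title"],
                item["sku"],
                item["regular_price"],
                item["sale_price"],
            ),
        )
        self.con.commit()
        return item
```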
Back in the settings, in the pipelines section, I'll add a new entry for DefaultValuesPipeline, and it needs a lower number than the database pipeline so that it's executed first (the full configuration is sketched at the end of this section).

Right, let's run the spider again, clearing the screen first. We have default values now, but when we hit a page that isn't an actual product page we're still going to get a KeyError, because it's trying to find fields that simply aren't there. Let this run a little, then we'll check the database and talk about how to handle that last problem. Okay, this all seems to be working fine, so I'm going to stop it and have a look at the database. To look inside your SQLite database I usually do it from the terminal, but you can also use something like DB Browser, which is a good GUI option. I'll do it from the terminal: open items.db, and if we list the tables we should have our products table; SELECT * FROM products shows we have some data, and "None" has been put in where a price didn't exist. Let's exit out of this and run the spider again, and this time we should hit the path where an item doesn't get added because it already exists and gets dropped, and then we'll handle the remaining error.

Okay, so this is still not working quite the way I wanted, but we do have our dropped item, "already exists in database", so let's stop it and look at the error we saw further up. You can see we're crawling a page that matches our rule set, the rule we asked for, but it isn't a product page; because we set default values for regular_price and sale_price those fields exist, but the others don't, so we can't do what we wanted with them in the first place and we still have a bit of an issue. Again, it's up to you how you want to handle this; I'm going to do it like this. Back in our pipelines, we could set default values for everything, and that's probably the best option here, so let's copy those lines and add defaults for brand, SKU, and title as well. That gives us an entry that's "None" across the board for those non-product pages, and because our insert check searches by SKU, once a SKU of "None" is in the table every later one will be ignored. It does mean we'll have one row in the database that is just "None" everywhere; you could add an if statement in the database pipeline saying, hey, if the brand is None, just dump the whole thing, but I'm going to leave it and live with that one row. I'm not sure what the best approach is, it will depend on your use case. You can see the warning here, dropped item, already exists in database, so we're getting rid of the ones we already have, and here, I think, are the first new ones being added, and adding, and adding. So this is just going to keep ticking through the links that match the rule we asked for.
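For reference, the pipeline registration described above ends up in settings.py along these lines. The module path assumes the project was created as ecom, and the exact numbers don't matter as long as the default-values pipeline has the lower one:

```python
# ecom/settings.py - the lower number runs first, so defaults are filled in
# before the SQLite pipeline sees the item. Numbers are arbitrary within 0-1000.
ITEM_PIPELINES = {
    "ecom.pipelines.DefaultValuesPipeline": 200,
    "ecom.pipelines.EcomPipeline": 300,
}
```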
The SitemapSpider is parsing the sitemap for us, going through all of those links, and building each product with the item and the ItemLoader; that then flows through our pipelines, the default-values pipeline first because it has the lower number, then into our SQLite pipeline, which pushes everything into the database wherever the SKU doesn't already exist. So we'd end up with a full data set of all the products, and we could just run this every now and again to pick up any new ones that match our criteria. I'm going to stop it there; no need for it to carry on running.

There are a few things in here that I think could be a little tidier, but hopefully this has given you a good idea of how this sort of thing can work, and some of the ways you can work around the issues you run into with Scrapy, like the defaults I added because my database insert needed them, though again you could handle that differently. Dropping an item if it already exists is pretty handy to know how to do, and of course there's the SitemapSpider itself, which is a bit more niche but still very useful. You can also have multiple sitemap URLs and multiple sitemap rules that point to different callback functions as you need, as sketched below.

So hopefully you've enjoyed this video; it's a bit more of a deep dive into Scrapy and how you can use it to scrape data based on a rule set and a sitemap spider. If you have, definitely subscribe and like; there's loads more scraping content on my channel, amongst other things. Join the Discord, there are loads of people in there now, well over a thousand members, which is fantastic, people helping each other, all that good stuff, so come and join in, say hello, and come and tell me what I could have done better here, I always like to know so I can tidy my own stuff up as well. Thank you very much for watching, I'll see you again in the next one. Bye.
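A hypothetical illustration of that last point, multiple sitemap URLs and rules each with their own callback; every name and pattern here is a placeholder:

```python
# Sketch only: several sitemaps and rules routed to separate callbacks.
from scrapy.spiders import SitemapSpider


class MultiRuleSpider(SitemapSpider):
    name = "multi_rule"
    sitemap_urls = [
        "https://www.example.com/sitemap-products.xml",  # placeholder
        "https://www.example.com/sitemap-blog.xml",      # placeholder
    ]
    sitemap_rules = [
        ("/some-brand", "parse_product"),   # placeholder pattern
        ("/blog/", "parse_blog_post"),      # placeholder pattern
    ]

    def parse_product(self, response):
        self.logger.info("product page: %s", response.url)

    def parse_blog_post(self, response):
        self.logger.info("blog page: %s", response.url)
```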
Info
Channel: John Watson Rooney
Views: 4,399
Id: 4j39AOVhKtQ
Length: 32min 7sec (1927 seconds)
Published: Sun Mar 10 2024