Scrapy Basics - How to Get Started with Python's Web Scraping Framework

Captions
In this video I'm going to run through a real basic Scrapy project. Scrapy is the web scraping framework for Python. So hi everyone, welcome, my name is John, and let's get straight into it.

The first thing you'll notice is that I am using a virtual environment. I would highly recommend you use one for any kind of Scrapy project, so make sure you've got that started and pip install scrapy within your virtual environment, and you'll be at the same point as I am on this screen.

The first thing you want to do is start a new project, so we're going to run scrapy startproject, which sets everything up for us, and give it a name. I'm just going to call this one drones, because of the website I'm going to be scraping: it's a camera website, but drones is the specific section. We let that run and it creates the project files for us. It says we can go straight into the project folder, so I'll cd into drones and show you what it's created. The tree command shows us all of the files and folders that have been generated. This is the folder we're in now; there are some .py files that were created for us and a scrapy.cfg. We'll talk about those in a later video, but for now we're just going to look at this one: spiders. This is where we'll be creating our spider file.

Before we do all that, though, we want to use the Scrapy shell to load up a web page, so we can interrogate the response that comes back and work out how to select the right elements on the page. That will become really clear in just a second. On my website I'm just going to copy the URL, then type scrapy shell and paste in the URL, just like this. What this does is go out and perform the request part of a web scraper: it fetches the page and returns it to us, and we can then use CSS selectors to interrogate that response and find the information we want. Once we've done that, we can copy those commands over into our spider file and run it.

If we have a quick look at the output, the important line is the request: it's done a GET request on the URL we gave it, and it came back with a 200 response, so that's good.

Now, to actually start looking at the elements and text on this page, we use the response object. The response is basically what has come back from the request; if we were using requests and Beautiful Soup separately, this would be like our r variable. To start finding elements we call .css on it. I always use CSS selectors when using Scrapy; you can use XPath if you want to, I've just found that CSS selectors work really well. So I'm going to start everything with response.css, open a bracket, and type in title. This finds the title element on the page, and we can see it has returned it for us. If we want to extract the information from that selector we call .get. Think of get as being a bit like find in Beautiful Soup: there's get and there's getall, and get returns the first match. We can see we've got the HTML back for that.
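As a quick reference, here's a minimal sketch of the session so far. The store URL is a placeholder, not the actual site from the video:

    # terminal: create the project inside an activated virtual environment
    pip install scrapy
    scrapy startproject drones
    cd drones
    tree

    # open an interactive shell against the page (placeholder URL)
    scrapy shell "https://example-store.com/drones"

    # inside the shell
    response                      # <200 https://example-store.com/drones>
    response.css('title')        # selector list for the <title> element
    response.css('title').get()  # HTML of the first match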
If we want just the text, which we most likely do in a lot of cases, we keep the title selector and add two colons and the word text, and that returns just the text of that element. It's worth noting that get returns the first match, but getall always returns a list, even when there's only one match.

Next, let's have a look at some of the header tags. Without looking on the website I'm just going to do response.css again with h3 and hit enter, and we can see a lot of data come up. If I do .get, it returns the first h3 tag it finds on the web page, which says "an exclusive" etc. Again we could add ::text here to get just the text, but if we want all of them we do .getall, which returns a list of all of them. So by not calling get or getall we get the selectors themselves back, and when we call get or getall we get the information inside those elements. You could also call getall and index into it to cycle through them, as you can see at the bottom of the screen, and again we can add the double colons and text to pull the text from each one; I'm just indexing through all of the h3 tags on this page. So we're starting to build up an understanding of how to select elements on the page and how to extract the information from them.

What I'm going to do now is go to our page, inspect element, and have a look at the information there. This is what we'd do in any scraping method to find where the product information lives. I'll zoom in so you can see that over here we've got all the divs and their classes, and we can use the CSS selector method we just looked at to start picking out pieces of the page. This part here, under our products, is a div with a class of details-pricing, so I'm going to copy that and run response.css again. It's a div, and to select by class we put a dot in front of the class name (an id would use a hash instead) and paste it in. That finds all of the elements on the page that match, and we can see it's returned some data. Again, .get brings back just the first one and the information inside it; it's a bit big on my screen, but you can see roughly what it's returning.

Now I'm going to save this response.css selection, the div with the class of details-pricing, into a variable, and I'm going to call it products. That stores all of those selectors for us, and it means we can then call products.css to search inside the elements we've saved.
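A small sketch of those shell experiments, assuming the same details-pricing class from the example site; everything else is illustrative:

    response.css('h3').get()           # first <h3> as an HTML string
    response.css('h3').getall()        # list of every <h3> on the page
    response.css('h3::text').getall()  # just the text of each one
    response.css('h3')[2].get()        # index into the selector list

    # select by class with a dot (an id would use # instead)
    products = response.css('div.details-pricing')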
Let's have a look over here again. If we go to the name, we can see that under this h4 tag there's an a tag containing the text of the product name. So if I select h4 inside products, you can see them all coming up, and if I do .getall first, it returns a list (it's difficult to see on this side of the screen, but it's a list). If I just do .get, it returns the first one; I'll clear the screen so we can see it easier. Now we've got the actual information from inside that element. Again, this is .get, so it's just the first one it found, and it's the one whose a tag has the text in it. But because we're only selecting the h4 tag here, trying ::text gives us nothing: there's no text directly inside the h4, it's actually inside the a tag. So instead of the h4 we select the a, and we do get the first one up. With .get, the first link within this products variable is the one with the name. If I do .getall we get all of the links, and you can see some of the other products come back too, because we saved all of the products in that variable. I'll clean that up, and again, .get for the first one plus ::text gives us the text of the name. That's quite useful to know, and we'll want to put it into the spider we're going to create.

We might also want the actual link to the product, and since we're already at the a tag, to get the href attribute we keep our double colons and type attr, then put the attribute we want inside brackets, in this case href. You can see over here we're at the a and we want this href, and if we run that, we get back the link to the product, which is also useful.

What if we wanted something else, say the prices as well? If we hover over the price we can see it's in a p tag with a class of "price larger", so we copy that, run products.css again with the p tag, and paste the class in after a dot. Notice there's a gap, a bit of whitespace, between the words price and larger. In a CSS selector that space means something different, so as it stands this returns nothing; these are really two classes, and to match an element that has both we close the gap up with a dot. Run that and we get all the elements back, and .get plus ::text, like before, gives us the text of the price of that item.

Let's recap what we've done so far. We've used the Scrapy shell to fetch all of the data from this one URL, and it's all stored in the response object. If we type response we can see the URL and that it was a 200, a good response; it's all saved there for us to interrogate. If you type response.text, all of the text from the page pops up. We've started using CSS selectors to find the information we want: the starting point is response, then we call .css with a selector for the information we're looking for. We could do h1, which finds all the h1 elements on the page (there's actually only one here). Calling get returns the information inside that element; if there were more than one, getall would return all of them, and it always returns a list. If we keep .get and put ::text inside the selector, we get the text of that element. Then we found the div where the details and prices were for each of the products, saved that into a variable, and found the specific information inside each of those.
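Putting that together, the selectors built up in the shell look roughly like this; the class names come from the example site, and products is the variable saved above:

    products = response.css('div.details-pricing')

    products.css('h4 a::text').get()            # the product name
    products.css('h4 a::attr(href)').get()      # link to the product
    products.css('p.price.larger::text').get()  # the price text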
The next thing we want to do is take what we've just done and create a spider to do it for us, so I'm going to come out of the shell now. If you're following along, or creating your own spider, it's worth copying out the selector lines you wrote so you remember what you found, or keeping the shell open in a separate tab. I know what mine are, so we'll exit and create our first spider.

To generate a new spider it's always best to use the actual Scrapy commands. You could just create a blank .py file under spiders and call it whatever you like, but if you use the genspider command, Scrapy uses its default template and fills in a lot of the information for you, so that's what we'll do. We run scrapy genspider and give it a name (every spider has to have a name), so I'm going to call this drones, and then we give it a domain; I always just put the first part of the URL. This is the site you're scraping, and you'll see where this information ends up in a minute when we open the .py file in VS Code. This just creates a new spider file from the default template. Except I've made a mistake: you can't create a spider with the same name as your project, which makes sense, so I'll call it drone_spider instead of drones, and then we can check it out.

Okay, it's created that for us. If I run tree again, we can see we now have this new .py file. I'll open VS Code, and you can see I've opened it inside our project folder. These are all the files we looked at before, here's the config file, and under spiders we have our drone spider. Clicking on it, you can see it's been created from the default template. If you were doing this by hand, you'd need to make sure you import scrapy, and the class always inherits from scrapy.Spider. It's been given an actual name, which is very important, and we have an allowed_domains list and a start_urls list; the start URL is the page we want to scrape. Underneath, there's a parse function that just says pass for now. It has to take self and response as its parameters, and this is where everything we just did in the shell goes: when we were running response.css and finding that information, this function is where it lives.
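For reference, the template genspider generates looks roughly like this before we fill anything in; the class name follows Scrapy's naming convention, and the domain is a placeholder:

    import scrapy

    class DroneSpiderSpider(scrapy.Spider):
        name = 'drone_spider'
        allowed_domains = ['example-store.com']    # placeholder domain
        start_urls = ['http://example-store.com/']

        def parse(self, response):
            pass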
So we're going to fill that in, copying back what we wrote in the shell. If you remember, we had products = response.css('div.details-pricing'); that's where all the product information was saved. Now we want to loop through each of those, like we would in our other web scrapers, so we create a simple for loop: for product in products. Inside the loop is where the name, the link, and the price go; I'm just going to do the name and the price for the moment, because it's simpler. The name was in the a tag, so I'll copy that over, but we need to change it slightly: in the shell the variable was called products, and here we're looping through it, so inside the loop it's product. The next one was the price, so I'll copy that in too, again on product, because we're searching inside each individual one.

At the end of our parse function we need to return something, to give the information back out, and in Scrapy that's done with yield. This is going to return what it finds, so it's a bit like our output step. Now, you can't yield multiple separate values, so we couldn't yield the name and the price individually like this, but we can create a Python dictionary. So I'm going to say item, create our dictionary, wrap these selectors in it, turning the name and the price into the keys, and then yield the item. What we've done is basically take what we had in the Scrapy shell, loop through it, and yield the resulting dictionary. I'll make sure that's saved, then go back to our terminal and clear it.

To run our spider we call it with scrapy crawl; crawl is what makes the spider work, and we give it the name we set in our code, which I called drone_spider. If I hit enter, the spider goes out and works, and we get output in our terminal of the information it scraped. Scrolling back up, amidst all of the output we can see the 200 response and the name and price of each of the products on that page.

That's all well and good, but what if we want to save that information somewhere? Instead of writing a whole new output function, we can just run our crawl again with -o for output and save straight to a CSV file, just like this. If we run it again, everything it scrapes goes into this drones.csv file. We hit enter, let it run, and somewhere around here we should see it: there we go, nine items stored in the CSV. Back in VS Code we can open drones.csv and see the name and the price for everything.

One thing I've noticed is that some values are wrapped in quotation marks. That's because those prices contain a comma, which is also the CSV delimiter, so the CSV has to quote them. I'm going to quickly change our spider: on the price I'll add .replace to swap the comma for an empty string. I'll save that and run it again from the terminal. It's worth noting that I'm running it with the same CSV file name, because, as you'll see, Scrapy doesn't overwrite the file, it appends to it. So we run it again, get the same data back, and it goes onto the bottom of the CSV file, which is worth knowing.
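Here's roughly what the finished parse method looks like, along with the run commands; the selectors are the ones from the example site, and the dictionary keys are my own naming:

    def parse(self, response):
        products = response.css('div.details-pricing')
        for product in products:
            item = {
                'name': product.css('h4 a::text').get(),
                'price': product.css('p.price.larger::text').get().replace(',', ''),
            }
            yield item

    # then, from the terminal:
    #   scrapy crawl drone_spider
    #   scrapy crawl drone_spider -o drones.csv

(In a real spider you might want to guard against .get() returning None before calling .replace on it.)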
It runs through again, we can see the response counts and all of that, and it saves to the CSV file. Go back to it, open it up, and you can see it's been appended to the bottom, and this time with no bad data. You can also use other output formats: JSON, JSON Lines, and I think XML, but I suspect CSV is probably going to be the most common one.

Just to summarize: I always find it best to use CSS selectors, and to use the Scrapy shell to find the information while you're working, then copy that over to your spider. The framework will do a lot of the lifting for you; you just have to make sure you obey its rules. And don't be afraid of the terminal, it's really powerful, and I really like the way it works. There's a lot more under the hood: we haven't even looked at the item classes, or the pipelines, or all the middlewares, and there are plenty of extensions too. But hopefully this has given you a good overview, a top-down look at how it works.

If you've watched any of my videos already, you know I tend to structure scraping around three main functions: request, parse, and output, or extract, transform, load. This is essentially the same thing. It looks more complicated, but if you look at the code we actually wrote, this was it; it's just a basic loop. You just have to understand that you yield instead of return, and if you use the Scrapy default template to create your spider, it does all of the rest for you.

So thanks for watching, guys, there's more to come. I'm going to do a follow-up video on pagination, where we'll build another web scraper, a bit more in depth. Hopefully you found this useful, so make sure you like, comment and subscribe. I've got lots more web scraping content on my channel already and more on the way, especially more Scrapy, where we'll dive deeper. Thanks for watching and I'll see you next time. Bye bye.
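On those output formats: Scrapy's feed exports pick the format from the file extension, so the same -o flag covers all of them. A quick sketch (and note that in newer versions of Scrapy a capital -O overwrites the file instead of appending, which avoids the duplicate rows seen above):

    scrapy crawl drone_spider -o drones.json   # JSON
    scrapy crawl drone_spider -o drones.jl     # JSON Lines
    scrapy crawl drone_spider -o drones.xml    # XML
    scrapy crawl drone_spider -O drones.csv    # overwrite instead of append (newer Scrapy)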
Info
Channel: John Watson Rooney
Views: 6,775
Rating: 4.9510202 out of 5
Keywords: scarpy python, scrapy basics, scrapy crawler, scrapy example, scrapy crawlspider, scrapy how to, scrapy guide, scrapy beginners, scrapy python tutorial, scrapy python 3, scrapy shell, scrapy web scraping, scrapy web scraping tutorial
Id: hCARQVJy_mk
Length: 20min 29sec (1229 seconds)
Published: Sat Oct 03 2020