Web Scraping Tutorial | Complete Scrapy Project: Scraping Real Estate with Python

Captions
Hello, and welcome to this tutorial. We're going to use Scrapy to extract structured data from a real estate property website, Craigslist. We'll pull out items such as the link, name, price, longitude and latitude from the listings, and output them to a CSV file.

Here is the site we'll be extracting data from: newyork.craigslist.org. It shows multiple thumbnail images, each with a price, description and location — "hood", as Craigslist calls the neighbourhood — and what we want to end up with is a structured list of that data, which, as always, is the whole point of web scraping. How are we going to do it? We'll be using Scrapy, and we'll be working through approximately 3,000 items.

Before we look at any code, a quick word on where you can find the documentation for this project. I've put an explanation page on redandgreen.co.uk/scrapy-yield — the emphasis is on using yield several times in a method. As the page scrolls down you can see samples of the code with explanations of each method: start_requests and parse, and on page two an explanation of parse_detail and main, with FEEDS used at the end to output the CSV and tidy it up.

So what does this video actually cover? It covers Scrapy, which is a framework for crawling multiple pages — let's just leave it at that if you're not familiar with it; the documentation is excellent, you run it with Python, and it works on Windows, Mac and Linux. If you've got Scrapy installed you'll also be able to run scrapy shell, and we'll use that to test the selectors — the selectors pick out the bits of information we need. We'll quickly touch on CSS versus XPath; I've already covered that in a previous video, so I'll put a link up to it. In this video I'll also be using an ItemLoader, and that's optional: you could just output your extracted data to a JSON file however you want, but the Scrapy framework offers the ItemLoader and I've stuck with it for this purpose.

As with most property websites, you'll typically want to collect the property name, the location, the price, the area and maybe even the geo data. From the main listings page we've already got a good idea of what we're able to extract and how many items there will be: approximately 3,000 properties (obviously that can change by the day).

Now let's inspect one element. Choose the first listing, right-click and Inspect Element, then hover and move the mouse until you reach the level that shows you all of the detail for a single listing, and you'll find the tag. Here I can see it's a <p> tag, and its class is what will uniquely identify each and every individual advert — each listing — that we want to iterate through. I'm going to copy that.
This part becomes a little bit repetitive: each time, we right-click and inspect. I'm using Firefox, but many people use Chrome; if you use Chrome, there's a useful tool called SelectorGadget which uniquely identifies the CSS for whatever you're inspecting. There's a bit of trial and error here — you scroll through the HTML and filter down to the item you want to extract data from.

For instance, we want to extract the link, because the link takes us to the details page, and invariably you'll need to follow it to extract more information about the particular listing. Here we can see class="result-title hdrlnk" on the <a> element. If you right-click you can copy the inner HTML, the outer HTML, the CSS path or the XPath; I prefer to copy the outer HTML and paste it into a notepad, because that's the full HTML we'll be picking apart, shall we say. Once we have that information we can either form our own CSS selector — we have the class, so we could use something like a::attr(href) — or form an XPath, and I'll show you how to do that next.

Let's collect a couple more pieces of information. Price: right-click over the $3,100 figure, Inspect Element, and there we see <span class="result-price">3100</span>. Copy the outer HTML into the notepad — that's all you need to create your selector to pick out the price. One more and then we'll get on with it: the date it was listed. Inspect again and there's a <time> tag with class="result-date"; copy that outer HTML too.

Just to summarise: we have what will end up being the link, the price and the date. (Because we're only handling one date that name is fine; if we were dealing with a "date listed" and a "date modified" we'd have to make the titles more specific.) So we already have three of the columns for our final CSV. Next we'll look at creating the selectors.

Right, that's better — I have a brew. What we need to do now is start building our selectors. What is a selector? A selector is a line of code that tells Scrapy what to pick out and return as useful text, which we can then store in our fields, which will then appear in our CSV. If you want a column in your final CSV called price, we need a selector that will extract the price data from the HTML that Scrapy fetches for us.

If you've got Scrapy installed on your machine — Windows, Linux or macOS, whatever flavour you prefer; I won't go into the details of installing it, it's essentially pip install scrapy — then fire up scrapy shell and press Enter. You do have to type it properly; that is one of the prerequisites.
Once you've typed it properly you'll get a screen that looks very much like this — yours might be white on black or whatever; I'm just rocking the green-and-black look today. Then we want to load the URL of the listings page, so I'm going to paste it in because I don't want to make a mistake and mess up this video: newyork.craigslist.org/d/real-estate/search/rea. Press Enter and you should hopefully get a response 200 — and there it is. 200 is the success code, and the shell now has all of the HTML for this page in memory, so we can test our selectors before writing any actual Python.

Just to ease into this gently, let's test a price selector first. The two pieces of information we need are that it's a <span> tag and what its class is; that's all so far. The format to create a selector is always response. followed by either .css() or .xpath() — you have a choice and both will work; I'm going to use XPath. Without going into the full details of the DOM (the Document Object Model): open a parenthesis, then a quote — whatever quote you use here, you must use the opposite kind around the class name — then two forward slashes, then the tag name, span, then a square bracket, the @ symbol, class= and the class name result-price, close the square bracket, and because the price was just text (3100) add /text() on the end. Close the quote, close the opening parenthesis, and then add .get().

And would you believe it — the price has changed since we looked earlier, from 3100 to 3200. I'm fairly sure it's the same advert; it may not be, but it doesn't matter. So what you've seen is response.xpath(), which you'll need for every single selector you write: open parenthesis, single quote, double forward slash, then the tag — if it had been a div rather than a span, you'd put div — then the class, and you really always want to use the class name where possible, in double quotes, close the square bracket. Most of the time you'll be extracting the text, because that's what you want to appear in your final CSV, JSON file or database; sometimes you'll want an attribute instead, such as the href, in which case you write @href rather than text().

In some tutorials you may see extract() and extract_first(): extract_first() has been replaced by get(), and extract() has been replaced by getall(). If I put getall() there — there we go, that's extracted all of the prices from the HTML returned for the first page of the Craigslist listings. So get() gets just one, getall() gets all of them, and extract()/extract_first() still work, so don't worry if that's what you see in a tutorial or on Stack Overflow, but going forward it's probably best to start using get() and getall(); I always do. The shell session is sketched below.
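For reference, a minimal sketch of that shell session. The span tag and result-price class reflect the Craigslist markup as shown in the video and may well have changed since; the example price is illustrative.

# inside: scrapy shell "https://newyork.craigslist.org/d/real-estate/search/rea"

response.xpath('//span[@class="result-price"]/text()').get()      # first price on the page, e.g. '$3,200'
response.xpath('//span[@class="result-price"]/text()').getall()   # every price on the page

# older spellings you may still see in tutorials and on Stack Overflow:
response.xpath('//span[@class="result-price"]/text()').extract_first()   # same as .get()
response.xpath('//span[@class="result-price"]/text()').extract()         # same as .getall()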
OK, so we now have a selector to extract the price information; next let's look at another one — how to extract the link. So far we've not written a single line of Python, and practising selectors in scrapy shell is a gentle way to ease into this if it's not something you've done before. We want the link to the detail listing of the property. Instead of span, the tag here is a, and we again use @class= — if you remember our reconnaissance, our fact-finding adventures, we found the href was available on an element with the class result-title hdrlnk. What we're always trying to do is uniquely identify the information on the page. First, though, let's get the text of the link with /text(), because that will form part of what we want in our final CSV — it could form the description column. Press Enter, and there's the text of the link.

Now the next bit is fairly straightforward: instead of collecting the text, we just change /text() to @href. Press Enter and... we get an error. Invalid expression — why is that? Let me find out what's going on here... and that is an error between chair and keyboard: I've left in the parentheses after href. With text() you need the pair of parentheses; with an href you just write @href. Apologies for that. As you can see, the result is actually an openable link — the full link, not just a shortened path — which is exactly what we'll need to follow to get the details.

One more, then: the date. If you remember, the tag was time and the class was result-date. Back in scrapy shell: response.xpath, as always, open parenthesis, single quote, double forward slash, the tag time, then a square bracket — the square-bracket parts are called predicates, if you ever want to read up on them, that's the term to look up — class equals result-date, then /text() because we just want the text (you'll start to get familiar with this), close the whole shooting match, and .get(). (And what have I done there? Ah, the square brackets.) There we see "Jun 4" — the date listed was June 4th; if you did getall(), all of those were listed on June 4th. What a lot of properties.

So, to conclude, we've tested the XPaths for the date, the href of the link, the text of the link (the description) and the price; they're collected in the sketch below.
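Again as a sketch against the Craigslist markup shown in the video (the class names result-title hdrlnk and result-date may differ today):

response.xpath('//a[@class="result-title hdrlnk"]/text()').get()   # link text -> the description column
response.xpath('//a[@class="result-title hdrlnk"]/@href').get()    # the full link (note: no parentheses after @href)
response.xpath('//time[@class="result-date"]/text()').get()        # listing date, e.g. 'Jun  4'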
Hopefully that's given you enough of an intro to creating the selectors and testing them in scrapy shell; next we'll start writing the code. I'm going to try to demonstrate the correct, or official, way of doing this. First, a virtual environment: sudo virtualenv craigslist_demo. What this does is make a self-contained project for this instance, where all the Python modules can live and I can install things without modifying the main Python installation on this machine. cd into craigslist_demo, source the activate script, and there we go — you can see craigslist_demo in front of the username and hostname, so we've got the virtual environment running, which is good practice. You can see I'm running Python 3.8, so to follow this project you'll want Python 3.8 (preferably), plus Scrapy, plus an operating system of your choice — those three things to start with, and then you'll install some libraries as required.

The first thing we need to do is scrapy startproject craigslist_scrape (Craigslist — there's an s in it). This is the top level that everything Scrapy does will reference. Why is that not working? Ah — sudo: on a Linux machine you'll often need sudo (substitute user) so we actually get permissions. Scroll down and you can see "New Scrapy project ... using template directory ... created in ...", and it tells you that you can start your first spider with cd craigslist_scrape and generate one with scrapy genspider example example.com — I don't tend to use genspider, but it is an option.

If we do ls we can see the folders it created. Inside the directory Scrapy made, it creates another directory with the same name, and inside that you can see __init__.py, items.py, middlewares.py, pipelines.py, settings.py and a spiders directory. If you have multiple spiders, each of them lives inside that spiders directory, and any of those spiders — say we were writing Craigslist Spain, Craigslist Germany and Craigslist Portugal spiders — could use the same settings file, the same pipelines, the same middlewares and the same items.py (which we will be using). cd spiders, and inside there you'll see nothing at the moment. The commands so far are sketched below.
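Roughly the commands used in this section, assuming a Linux machine where sudo is needed for the directories involved (as it was here); the environment and project names are the ones chosen in the video:

sudo virtualenv craigslist_demo            # self-contained Python environment
cd craigslist_demo
source bin/activate
sudo scrapy startproject craigslist_scrape
cd craigslist_scrape/craigslist_scrape     # items.py, middlewares.py, pipelines.py, settings.py live here
cd spiders                                 # empty for now -- this is where the spider goes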
This is where we want to create our spider, so I'm going to start up Atom — use whatever your favourite text editor is; you could do it in micro or nano or VS Code, whatever you wish. Create a new file, then add the project folder: with Atom, Add Project Folder, pointing at ~/craigslist_demo. There you can see the bin directory created by the virtual environment, and inside it craigslist_scrape, which was created when we ran startproject. Actually, what I'll do is remove the project folder and re-add it a couple of levels further down, which makes it a lot tidier — and now we're down at the correct level, with items.py, middlewares.py, pipelines.py, settings.py and the spiders directory. Much neater. So we're ready to start writing some code; get yourself a cup of tea, a coffee or a can of beer, and we'll begin.

I've just put in the title of the project and a docstring describing what it does, and I had a little issue saving it within Atom: permission denied. What I had to do was sudo chown -R username:sudo . — once I'd applied that I was able to Ctrl-S and save the project. Just a little tip if Atom or your editor won't let you save.

Now let's write our first line of code. I know it's been a wait, but I've tried to be as detailed as possible; I don't want to skip over anything and leave you scratching your head or having to go off and research other things — I want this to work in its entirety for you. First, import scrapy (why has that changed into capitals? That's bad — Caps Lock). That gives us access to the framework. Then from scrapy.loader import ItemLoader — this is what allows us to use the items file. Now, you could get away with just creating a plain dictionary, but I'm trying to do this as best practice; I realise that in the real world you may wish to do things a little differently, but this is what you'll see in the documentation, so I thought I'd use it as the example — once you know the proper, official way, you can do your own thing. Next, from scrapy.crawler import CrawlerProcess: that's what allows us to run the script as a standard Python file, so you can run it with Ctrl-B or F5 or whatever the run command is in your editor or IDE. At the end of the file you'll see something like: create the instance, process = CrawlerProcess(...), then process.crawl(YourSpiderClass), then process.start(). That's the generic pattern; I'll probably use my own names.

One more thing: I'm actually going to move items.py to the same level as the spider — it's just something I do to save import issues when you run it directly; a bit of a modification, so bear with me. You may sometimes see from ..items import ..., which tells Python the module is in the directory above. The Scrapy framework created the items file for us and named its class after the project, so we copy that class name and import it from items (...and I've dropped the mic — drop the mic like a boss; not quite like a boss, more like an idiot). So that's all of our imports; they're collected in the sketch below. Next we'll create the class — one class that is a subclass of scrapy.Spider. I hope you're OK with this so far; any questions, put them in the comments. (I hope the sound is OK too — one day, if I ever make any money from this channel, I'll buy a proper microphone.)
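The imports as described, plus the os import that gets added a moment later. The class name CraigslistScrapeItem is the one Scrapy generates for a project called craigslist_scrape, and the bare from items import ... assumes items.py has been moved to sit alongside the spider, as is done in this walkthrough.

import os                                  # used to delete any old results file before a run
import scrapy
from scrapy.loader import ItemLoader       # the optional "official" container approach
from scrapy.crawler import CrawlerProcess  # lets the spider run as a plain .py file

from items import CraigslistScrapeItem     # items.py sits next to the spider in this setup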
So, we've imported the packages. One other thing I'll do, because I'll be saving the output to a file: import os. That will allow me to check whether the output file already exists — if it does, I'll remove it and create it again from scratch. I don't want to keep appending data; especially when you get 3,000 results, you don't want to keep adding more and more to the same file.

Right, all the packages are imported; next let's write the class. We'll call it RealEstateSpider and create it as a subclass of scrapy.Spider. Whatever you're doing in Scrapy, writing a spider always begins with class, then the name you choose to give the class, then scrapy.Spider. Sometimes you'll see a CrawlSpider instead — that's for when you go off and follow as many links as you wish. We're not extracting loads of unknown links; we know what links we're dealing with today, so we're using scrapy.Spider rather than the crawler. The only unknown links today are the pages past page one, and we know what they'll be because we'll handle them by identifying the "next" button — but that's to come, so let's get on with it.

name is the name of the spider that Scrapy uses — the name you would use if you called it from the command line, which I'm hoping we won't need to do — so name = 'realestate', and the format for this is always lower case. Next, as always with Scrapy, you need to specify your start_urls, to tell it what you actually want to look at to begin with. I've already got that, so let's copy and paste it inside the single quotes and check it: https://newyork.craigslist.org/d/real-estate/search/rea — good, happy with that.

The next thing to consider is the output file. We're going to be using a feed, so I'm not going to write any code to open a file for output and all that; I just want a small try block: try os.remove on the file — remove it if it exists — and if you get an OSError just carry on, because that only means the file didn't exist, which is good. Well, it's good at the start of the program; at the end of the program it's not as good.

Finally, I'm going to need two variables that will allow me to pass data between pages — or, more precisely, between two methods, because when one method closes it clears its local data; to make a value available across methods in a class you store it as self.something. Thinking ahead: I'll be collecting most of the data from the main advert, but I also need the geo data, and if you've seen my previous video I've already identified (via View Page Source) that the geo data — longitude and latitude — is available from the details page. That's the page you get when you click a listing, and I'll call it the details page throughout this video. So when the class is instantiated we set self.lon and self.lat to empty strings. You won't see this on all Scrapy projects, which is why I think it's quite interesting to show it here. The shape of the class so far is sketched below.
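A sketch of the class as described up to this point. Exactly where the video places the output-file clean-up and the empty lon/lat values isn't shown on screen, so here they are written as class-level statements (referenced later as self.lon and self.lat); treat the placement as an assumption.

class RealEstateSpider(scrapy.Spider):
    """Scrape Craigslist New York real-estate listings to results.csv."""

    name = 'realestate'
    start_urls = ['https://newyork.craigslist.org/d/real-estate/search/rea']

    # start each run with a clean output file -- the feed would otherwise append
    try:
        os.remove('results.csv')
    except OSError:
        pass

    # carried between parse() and parse_detail() as self.lon / self.lat
    lon = ''
    lat = ''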
Next, start_requests. This method is implicit — if you don't write it, it happens anyway — but for completeness I want to show it, as this is meant to be a tutorial. It yields a scrapy.Request: it goes off to the start URL (you could have many start URLs) and then calls back to the next method, self.parse. start_requests is a built-in method of the Scrapy framework, and the callback doesn't have to be self.parse, but that's the convention and I want to stick with it here. Then we define parse with response as a parameter: the response from that page is passed into parse, and parse uses it — and this is where it starts getting interesting, so we'll pause here and then really dive into the nitty-gritty.

Just thinking ahead a bit — and you'll see this quite often — I'm going to stub out the methods now, because I already have an idea in my head of the ones I'll need: parse, which will effectively handle the main details and write the data to the items, and a second method, parse_detail, which will handle the extraction of the geo data and populate those two variables with longitude and latitude. I think that's the grand total of the methods I'll need.

Next, something you won't always see in Scrapy tutorials or documentation but will see across many Python projects: the dunder main block, if __name__ == '__main__':, i.e. if this code has not been imported and is being run as the main file. Inside it we create an instance — I'll call it cl, as in Craigslist (and it's case sensitive): cl = CrawlerProcess(...). This is where we configure the output, and there's a new way of doing it that I've recently come across. We used to set FEED_FORMAT and FEED_URI — you might have seen those listed as custom_settings — but now we just pass a FEEDS setting giving the name of the output file, results.csv, together with its format, csv (it could be json if you want, or xml, I believe, not that I've used that). You may not see it in every bit of Scrapy documentation you look at, but I've tested it, it works, and I believe it's the newly recommended way, so I want to get used to it, since it's what will be appearing in tutorials and on Stack Overflow going forward. That's the basic structure of the code, sketched below; I'm going to save it, pause, get a drink, and be back for some more fun in a second.
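Continuing inside the same class, a sketch of start_requests, the two method stubs and the __main__ block with the FEEDS setting (which needs Scrapy 2.1 or later); the variable name cl is the one used in the video.

class RealEstateSpider(scrapy.Spider):
    # ... attributes from the previous sketch ...

    def start_requests(self):
        # implicit in Scrapy; written out here for completeness
        yield scrapy.Request(self.start_urls[0], callback=self.parse)

    def parse(self, response):
        pass            # main listings page -- filled in below

    def parse_detail(self, response):
        pass            # detail page -- grabs longitude and latitude


if __name__ == '__main__':
    # lets us run plain `python3 craigslist_spider.py` instead of `scrapy crawl realestate`
    cl = CrawlerProcess(settings={
        'FEEDS': {'results.csv': {'format': 'csv'}},   # replaces the older FEED_URI / FEED_FORMAT
    })
    cl.crawl(RealEstateSpider)
    cl.start()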
We can see from the main page that there are 120 property listings per page — 120 out of roughly 3,000; if we scroll down, that looks about right. For each of those thumbnails we'll be collecting, as you've already seen, the date, the price, the description and the link, but we can't extract the geo data from the thumbnail (the result-info block). For that we're going to have to go to the detail page — what you get when you click on the thumbnail — and extract the geo data from its source, which we've already looked at. So for each listing we go to the detail page, get the coordinates, and pass them back along with the price, the time, the description and so on to the ItemLoader, which fills the containers in items.py. We do that 120 times on this page until we've run out of thumbnails; then the for-loop finishes and we check whether there is a "next" button — which there will be, until we've done it approximately 25 times — and eventually, on the last page, there won't be a next button. That's when the code finishes and the feed exports everything to the CSV file. So that's the logic: thumbnail, go across for the detail, come back, next thumbnail, 120 times, then go to the next page and do 120 more. When we run Scrapy we'll expect to see the item count flashing up on screen going 120, 240, 360, 480 and so on.

Let's look at how to start writing this code. First we need a for-loop over all the adverts on the page, which means we first need a selector that picks out all of them. So: all_ads = response.xpath(...), with single quotes, and a leading period character — I won't go into exactly what that does; it pins the expression to the current node in the DOM model, and that's a whole subject in itself, so excuse me if I gloss over it — then the p tag, then [@class="result-info"], all in quotes. We want to pick out everything that is a <p> tag with the class result-info. That response, which should be 120 results, gets assigned to the variable, and then we write for ads in all_ads: — that's our loop created.

So what do we want to do 120 times per page? Inside the loop we need to get the details link, go and get the geo data, then populate all of our item containers with the information from the listing and the detail page — that's where the loader will come in — and after the loop we need to navigate to the next page.
Once we've got the details link we have the URL we can follow to fetch the geo data, and then we come back and load all the information into the containers for that item. Thinking ahead, what do we want parse_detail to do? Just two lines: set self.lon and self.lat from its response. Once those two have been set, it's done, it's finished — and because we're using yield here, control then comes back into parse and resumes. This logic takes a little bit of getting your head around, because it's not the typical pattern you'd use with Scrapy; that's because we're extracting the majority of the information from the thumbnail of the listing itself.

OK, I've put it off long enough: let's go and get the details link. We create a variable called details_link, and because we're iterating through all of the ads (if I'd written for i in all_ads it would be i), for each iteration we call .xpath() on ads: single quote, period character — concentrate — .//a[@class="result-title hdrlnk"]/@href. You saw me extract this earlier in the video; we tested it in scrapy shell, if you remember. And remember not to put the two parentheses after @href — you use those with text(). Then .get(): this gets one result, not all of them, because we're already inside the loop and only want the details link for this one item (with the older syntax you'd use extract_first()).

So we've got the details link; now we need to go and get the geo data — the longitude and latitude coordinates — and this is where the fun begins. We say yield response.follow(details_link, callback=self.parse_detail). You may see yield scrapy.Request(...) elsewhere, but if you use response.follow it saves you doing a urljoin — Code Monkey King showed me this, which was really cool. The callback is the method we need to go to next, parse_detail, where we'll extract the geo data. The loop so far is sketched below.
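A sketch of the loop as described so far, with the class names taken from the Craigslist markup shown in the video. The all_ads expression is shown in its final form, without the leading period that gets removed during debugging later on.

class RealEstateSpider(scrapy.Spider):
    # ... as sketched above ...

    def parse(self, response):
        # every listing on the page sits in a <p class="result-info"> block
        all_ads = response.xpath('//p[@class="result-info"]')
        for ads in all_ads:
            # relative XPath (leading dot) scoped to this one listing
            details_link = ads.xpath(
                './/a[@class="result-title hdrlnk"]/@href').get()

            # follow to the detail page; response.follow saves a urljoin
            yield response.follow(details_link, callback=self.parse_detail)

            # ItemLoader code for this listing goes here (sketched later)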
Inside parse_detail we set self.lon first — let's do it in alphabetical order... well, longitude first; why? I know why, actually: because we're going to need to use indexes to identify them in a minute. self.lon = response.xpath(...) — and note that I've not used the period character here; I didn't need it. The best way to understand that is either to read lots about the DOM, the Document Object Model, or just to experiment by putting the period in and seeing what happens: if you find you're getting too many results, try adding it — it stops you getting everything at the same level and pins it down to the current node — but here we're only getting one result anyway. The tag isn't a span or anything like that, it's meta, with name="geo.position" (which is what it's called on the page), and — I had to look this up — you need @content to actually get the data out of it, then .get().

Then we need to split it, because if we go back to the page, into the details, and view the source (let me zoom in on that for you), there's the meta tag, and its content — which we just typed into the selector — is a load of numbers, a semicolon, and some more numbers. You could just extract that and pull it into one column, but I want them separately, so we say .split(';'), splitting where the semicolon appears, and now we have a list of two items: index [0] for the longitude and [1] for the latitude. If anybody's into geography they might tell me I've got those the wrong way round, but from a code point of view that's trivial to fix — this is about working out the logic and writing Scrapy code, which is, after all, the business we're in. (I'm just going to copy and paste the second line; there's so much typing.)

I did take the liberty of testing these in scrapy shell earlier. Just to show you: find an advert, grab its URL, bring up scrapy shell, fetch the URL (making sure it's in quotes), and we get a 200. Extract the content attribute and there we go — longitude and latitude separated by a semicolon, which is why I put the .split(';') in. Apply the split and it's turned into a list, and once it's a list we can isolate each part: 40.67..., then 73.81... and so on. So we've proved the selectors are good and usable.

Note that parse_detail doesn't need to return or yield anything; it just runs. For every single thumbnail on the first page it gets called from within the for-loop, and once it's done and those variables are loaded, we return and resume after the yield. The method is sketched below. And you know what, I think it's time to do the loader code, so I'm going to pause and we'll be back to do that.
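The corresponding parse_detail sketch. The meta tag name geo.position and its semicolon-separated content are as shown in the video's page source; the split-and-index approach mirrors what is described above.

class RealEstateSpider(scrapy.Spider):
    # ... as sketched above ...

    def parse_detail(self, response):
        # <meta name="geo.position" content="40.67...;-73.81..."> on the detail page
        geo = response.xpath('//meta[@name="geo.position"]/@content').get()
        # the video stores the first half as lon and the second as lat;
        # geographically it is the other way round, but relabelling is trivial
        self.lon = geo.split(';')[0]
        self.lat = geo.split(';')[1]
        # nothing is returned or yielded -- the values just sit on the spider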
You may have seen that I've actually done this part already — we've covered it in some previous videos — so just to keep this short and sweet I'm going to paste in the loader code. The loader is the ItemLoader: some code which transfers all of the values, via (I think) the pipeline, and gives you the option of handling them with processors. What it really does is give you some containers and save you having to write a dictionary yourself. Let me paste that in and tidy it up — as you can see I'm cheating slightly, because I've already done part of this in a previous video — and note that CraigslistScrapeItem, where I've highlighted it, needs to match the name of your class in items.py.

If you've been watching my previous videos you'll know I've already written the items.py file, and in actual fact, just to keep it all consistent, I'm going to paste in my ASCII logo too. It imports scrapy and defines a class based on scrapy.Item — Scrapy generated all of this for us — and all we need to do is write name = scrapy.Field() for each field. The name is what the column will be called in your output CSV file, and I already know I want date, title, price, hood, link and miscellaneous (I don't think I'll really need miscellaneous, but I'll put it in as an optional field for the future). That's all there is to it: it's a bit like creating a header row if you use csv DictWriter — each one of these is the title of a column in your final CSV. They're very important, because that is what we're in the business of collecting. Let's save it as well — and excuse me, there's the permission problem again; in fact I'll let you see how I fix it. I need to go up one level and run that same command again — sudo chown -R username:sudo . — it's to do with permissions, a Linux "feature", shall we say. It changes the owner, recursively, for everything at this level; when I ran it earlier it only applied to files within spiders, but now I've applied it to the entire directory, and Ctrl-S saves. A little digression I wasn't planning on, but there we go.

Now let me talk through the loader code. loader = ItemLoader(...): you'll always have to do this if you're using items and you want to yield load_item() afterwards — there is another way of using a loader, but this is the proper, official way, and it allows you to use item processors. The item argument references the class in the items.py file; you could have several classes inside items.py, which is why you need to identify which one you want. The selector argument is ads, because we're inside a for-loop and need to say which listing we're working on (if you'd been calling the response res throughout your project, you'd pass res — but we're not). Then, for each column name — each piece of information we want to extract for every property — what we have here needs to match what we have in items.py. Each add_xpath selector is the same as the ones we identified earlier, e.g. span[@class="result-price"]/text(); the only difference is that you do not use .get(), because loader.add_xpath() handles that. We use add_xpath for the information we're extracting from the main listings page, and add_value for the values that can't be extracted with an XPath here — the ones already set by parse_detail, which ran up above. You'll see later on that it does work.

So that's the loader, which you may have already seen in previous videos. You don't have to use one — you can just create a JSON dictionary yourself to hold the values, and that would be equally fine. (The items.py file, as it ends up, is sketched below.)
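For reference, the items.py file described above, roughly as it ends up. The field names are the ones mentioned in the video (with the link field renamed details_link, as happens during debugging later), plus lat and lon since the loader fills those too; the exact spellings are my assumption.

# items.py -- one scrapy.Field per column in the final CSV
import scrapy


class CraigslistScrapeItem(scrapy.Item):
    date = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
    hood = scrapy.Field()
    details_link = scrapy.Field()
    lat = scrapy.Field()
    lon = scrapy.Field()
    miscellaneous = scrapy.Field()   # spare field, unused for now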
A good reason to use the loader together with the items.py file is if you're creating multiple spiders that share common fields: if we were scraping Craigslist Italy, Craigslist Spain and Craigslist USA and wanted to stick to a standard format, we could reuse the same items and containers. It's recommended by Scrapy, so I'm showing you the official way, but you don't have to use it — you can go and write your own JSON dictionary, and there's nothing wrong with that at all; it works very well. If you have a minimalist outlook, as I do — and I think Code Monkey King does too — there are very good reasons to write your own dictionary and skip all this. It depends whether you're writing a one-time scraper or lots of spiders that use the same common fields.

Anyway, next page: let's handle the next page. You may have already seen this in a previous video, and it's very common syntax, so I'm just going to paste it in. We create a variable called next_page, which is the XPath of the "next" button — the button has a class of button next — and we want the href of that button. On page one the href of that button says "go to page two", on page two it says "go to page three", and so on. Notice we're back to using response.xpath here, not the loader: the loader was specific to populating items.py, and we've gone past that. We've collected the geo data, resumed, added all of the data to the items, and done that throughout the loop, which has happened 120 times; once we've done 120 on page one, we next need to find page two to go to, and that href is its URL.

Let me demonstrate that in scrapy shell as well — you'll get very familiar with scrapy shell. Grab the main URL again, fetch it, 200 response, and now run the XPath that looks for the link hidden behind the next button. What it returns includes s=120, which means start at result 120: presumably page one went from 0 to 119, so 120 will be the first result on the second page — the site knows to redo the search and go from 120 onwards. And because we're using response.follow we don't need to mess about with urljoin, which is really good. The loader block and the next-page handling are sketched below. I think we're getting close to testing now, so I'm going to pause, save it, and we'll come back and start some testing.
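Pulling the pieces together, a sketch of the complete parse method with the loader and the next-page handling added. The result-hood selector is my assumption for the neighbourhood column (the video doesn't show that selector on screen); the other XPaths are the ones tested earlier in the shell.

class RealEstateSpider(scrapy.Spider):
    # ... as sketched above ...

    def parse(self, response):
        all_ads = response.xpath('//p[@class="result-info"]')
        for ads in all_ads:
            details_link = ads.xpath(
                './/a[@class="result-title hdrlnk"]/@href').get()
            yield response.follow(details_link, callback=self.parse_detail)

            # one loader per listing; `ads` scopes the add_xpath calls to this <p>
            loader = ItemLoader(item=CraigslistScrapeItem(), selector=ads)
            loader.add_xpath('date', './/time[@class="result-date"]/text()')
            loader.add_xpath('title', './/a[@class="result-title hdrlnk"]/text()')
            loader.add_xpath('price', './/span[@class="result-price"]/text()')
            loader.add_xpath('hood', './/span[@class="result-hood"]/text()')   # assumed class name
            loader.add_value('details_link', details_link)
            loader.add_value('lat', self.lat)    # stashed by parse_detail
            loader.add_value('lon', self.lon)
            yield loader.load_item()

        # after the loop: follow the "next" button until the last page
        next_page = response.xpath('//a[@class="button next"]/@href').get()
        if next_page:
            yield response.follow(url=next_page, callback=self.parse)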
OK, the moment of truth is nearly upon us. Let's close everything else and look at craigslist_spider.py — I think we've got everything we need: os, scrapy, the ItemLoader, CrawlerProcess, and the items import I mentioned earlier, which should pick up items.py from the folder above (what we've written lives in spiders/; items.py resides one level up). If we get any issues, they're likely to be with that. cd spiders — and remember, the whole point of using CrawlerProcess was so that we could run our code with python3 rather than having to use the Scrapy command-line syntax. Let's run it; I don't think it'll work first time, and if it does that'll be a bit of a surprise.

OSError — hmm, ah, the colon: the colon goes at the end of the try line. Let's try again (I could have edited this out, but I'll leave it in). Right — as anticipated, we've got the issue with the items import; let's try to fix it. I saw another way of doing this, prefixing the project name — you're probably thinking "he can write all this but he can't even import a module" — so, the project name: I think it's craigslist_scrape; if not, it'll be craigslist_demo. OK, let's try craigslist_demo... if not, I'll do my usual trick. This is what happens when you start using a framework: you have to abide by its rules. So yes, this is the issue I always run into, and what I usually end up doing is this: move my spider out of the spiders directory and up one level, so it's sitting at the same level as items.py. I think this may just be specific to using CrawlerProcess; it's my workaround — I don't know if it's the "correct" way, but I've found it to work. (Ah, and I need to go up a level to run it.) I thought I'd leave this in so you can see how I go about using the whole Scrapy framework with items.py and still getting it to run with sudo python3 rather than the official scrapy syntax.

Right, that's good — items.py is now working. Next, line 36: we've got an issue with result-info, which — if you remember back that far — is the all_ads selector. Back into the code; I don't straight away see anything wrong with it... ah, I think it's the period: I didn't need to use it there. The period singles things out, whereas here we want everything at that level — with the period it picks out one; without it, it picks out all, and we want everything that has the class result-info. (If you want to learn, sometimes you just have to watch people troubleshooting, rather than only seeing the polished version.) Why is that still not working? Ah — did I put a colon at the end? Don't need it there. So: all_ads = response.xpath('//p[@class="result-info"]') — you must think I'm stupid, and you might be close to the truth — and spell response correctly, note to self. Run again.

Right, what are we expecting now? "url can't be None" — so this is an issue on the parse_detail line. Let's check: url=details_link... and the next page: next_page = response.xpath(...), which we've proved works, and if there's a next page, response.follow(url=next_page, callback=self.parse). Hopefully you're just seeing how to go about this. I'm going to pause, and once I've fixed it I'll come back and tell you what the fix was.

OK, I identified a couple of issues. The observant among you may have noticed there was a typo down there, but the main thing — which I was wondering about even while I was talking — was a field name mismatch.
I'd called the field details_link here, in the loader, but in items.py — probably left over from earlier, when I was first writing it — I'd still called it link. Whatever you have in the loader must match what you call it in items.py: create all your field names, column names, whatever you want to call them, and make sure your column names there match your scrapy.Field names, because to all intents and purposes the data for those columns is being passed through there on its way to your final output CSV. I've now renamed it in items.py so the two match.

You can also see, if you're observant, that I already have a results.csv — and the reason is that the spider ran successfully when I tested the fix just now. So, if you're ready to see what a customised Scrapy spider looks like when it's running — bearing in mind I've used CrawlerProcess here, and I know I said I was sticking to the textbook — what we're going to do is run the spider using sudo python3 craigslist_spider.py. We can do that because it's already using FEEDS, so we can tell it where to store the output file, and I just find it neater and self-contained; if you watch Code Monkey King he does the same thing, using CrawlerProcess so you can run your spider just like a normal Python file.

So, without further ado, let's run the spider. It takes a while to get going. I know that when I first started using Scrapy I didn't really know what to expect on the output screen — to be honest, having used BeautifulSoup previously, where there's very little drama on the command line, I was rather overwhelmed by the output I was seeing. Eventually you get used to it: the screen flickering, lots of output, and trying to pick out whether it's working. Depending on how fast your connection is, what your settings and delays are, and how fast your computer is, you may or may not see the output stall and give you an opportunity to read what it's extracting. Quite often you'll see DEBUG in capitals appearing on the screen, and you'll see "Scraped", a timestamp, and hopefully your containers — the items, the fields you're extracting. If you watch carefully you should also see the page count increase every so often — 120, 240, 360 and so on — so I know it's working and going through the next-page logic; the navigation is working, and because you can see that, you get a feel for how long the overall spider is going to take.

I know this has 3,000-odd results, and I can see the count going up... here we go: "Stored csv feed (3000 items) in results.csv". That's the important thing — that's what we wanted to see. Looking for errors, I can't see any, which again is perfect. I'm already in Atom, so I can see results.csv; click on it, and on initial inspection it looks perfect. Because it looks good, I'm now going to go back up to the project directory — home, craigslist_demo, craigslist_scrape, craigslist_scrape — and find results.csv there.
Opening results.csv with LibreOffice gives a little import preview, which is handy because it shows the data straight away: miscellaneous was expected to be empty, and lat and lon are probably going to be empty for the first few results — I think that's because they're not really proper apartments; we'll see in a second, but I'm not concerned. (The real estate agents — realtors, I think; in England we call them estate agents — aren't always the most popular of people, but that's nothing to do with me.) We're only slightly down the sheet, so let's drag the scroll bar all the way to the bottom, and there we go: 3,001 rows — how about that? We've got the date, the URL to the details page, the hood, the longitude and latitude, the price and the description. I forget what column H was — ah yes, the title: if you remember, that was the text we extracted from the link, the text of the href, if that makes sense. Let's scroll down a bit, because some listings didn't have coordinates — that one has — and let's try to find that actual one... I'm going to regret doing this, because Ctrl-F on that coordinate isn't finding it, so I'll probably edit that out, having made myself look stupid.

But hopefully you don't think I'm completely stupid, because what I have done is scrape the Craigslist real estate section: I've gone through 3,000 listings and extracted the price, the href, the title, the hood and the link, and we've also extracted the geo data from the second (detail) page, and we've used yield three times. I'll put the link in the description to my web page where I've written all of this up, so you don't have to watch the video — you can read about it there instead. I hope this has been interesting; if not, sorry, and I'll be back with other stuff soon. If anybody's interested, I may see if I can get this to run on my Raspberry Pi, and I may also schedule it to run for seven days or so and make a video on that. If that's not your thing, there are plenty of cat videos out there — there's a good one where the cat sits on the piano while the person plays; go and watch that. See you later, thanks for watching, and subscribe — you have to say that if you're a YouTuber.
Info
Channel: Python 360
Views: 3,635
Rating: 4.9101124 out of 5
Keywords: Scrapy, Python, Webscraping, Tutorial
Id: 3A36TOm7NC8
Length: 107min 46sec (6466 seconds)
Published: Fri Jun 05 2020