Scraping data from tables with Scrapy

Captions
Hello! Let's see how we can go about scraping this one. We see this URL here, which is a great starting point, so let's start the scraping project. I'm in the directory over here; if I run scrapy I get some help. So: scrapy startproject, and let's call it fairfax. OK, it gives us some structure here and tells us: cd fairfax, then write a spider with scrapy genspider. That takes the name of the spider, so let's name it something easy, maybe ffcounty, and then the domain, where we can put fairfaxcounty.gov. Good. So right now if I do scrapy crawl ffcounty it will run a scraping job, and we can see some output, although of course much of it doesn't make sense at all yet.

Let's see what this process did. It created quite a few files for us, but actually only one is interesting for us: ffcounty.py. Let's open it with our favorite text editor, go into the spiders directory, and see what's in there. All this code has been generated for us. You can see the name, the allowed_domains, which must match this URL up here, and here we have start_urls, which is quite wrong: we want to use this page as the start URL, and that's why the crawl we just did gave an error. We just copy-paste the real URL in here and see what happens when we run scrapy crawl ffcounty again; ffcounty is the name of the spider, and all the rest is auto-generated code. We see lots of log output, nothing particularly interesting, and we can actually add -L WARN to remove all that log. Then, if we don't do anything interesting, we just see no output, which is fine because that output was a bit confusing.

Now, when the page load completes, this parse function gets called. Let's see what happens if we do print(response.body) here. Wow, we see lots of text, and this text corresponds to the HTML of the page: if you right-click here and View Page Source you'll see lots of HTML, overwhelming of course, but we don't need to deal with all of it; we can confirm it's the same thing. We can see complaint numbers here and lots of interesting stuff, and we can also see that there is a way to get to the second page and back.

Let's try now to understand how this page works, because I would like to extract those numbers. How can I do that? The trick is to right-click and Inspect. This is an interesting case: we see that the text is inside an anchor, an a, but that is inside a td, and all of this is inside a table with a class that sounds like gvTable. Let me type some notes about what we want: the first column of a table with class gvTable, for each row, and out of that, the text inside the a; I want to extract the text from an anchor element. Now let's see how this is done. Scrapy is very helpful here because the response has a css() method, so I can very easily ask it for the table with that class, using the usual CSS expression: table, a dot, and the name of the class. That gives us a reference to that table: table = response.css('table.gvTable').
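For reference, the spider at this point looks roughly like the sketch below. The spider name, the gvTable class, and the placeholder start URL are my reconstruction from the audio, so treat them as assumptions rather than the exact values on screen:

    import scrapy

    class FFCountySpider(scrapy.Spider):
        name = 'ffcounty'
        allowed_domains = ['fairfaxcounty.gov']
        # Placeholder standing in for the complaints page URL copied from
        # the browser; the real one is not spelled out in the audio.
        start_urls = ['https://www.fairfaxcounty.gov/complaints-list']

        def parse(self, response):
            # Grab the results table by its CSS class and dump its HTML
            # so we can confirm we selected the right element.
            table = response.css('table.gvTable')
            print(table.extract())

Run it with scrapy crawl ffcounty -L WARN to keep the log quiet.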
Let's try to print it; you'll see it's a very interactive process. OK, it found a selector, which is good, and if I call table.extract() it gives me the HTML. But we don't really need the table itself; we need to go through the rows of this table, and we can see that each row is a tr, actually a tr with class gvRow. Let's name these rows: I'll go and find the table rows with class gvRow and see if that prints something when I run scrapy crawl again. Good, we have some output. So let's do a for loop, for row in rows, and print each one; all the outputs should be similar. Nice.

Now from every row I want the first td, so again I do the same trick: .css('td'), and take element zero because I want the first column. That seems to work. Out of this first td element I want to extract the a, the anchor, so again .css('a'), and that gives me a selector. I could go to the Scrapy documentation to see how to extract the text, but it's quite easy: I just write ::text, two colons and another selector, and at the end of it all extract_first(). This is standard; you will see it used again and again. And here are my numbers.

So what I did is: I went to this table with class gvTable, went through the rows with class gvRow, and for each one I extracted the text of the first link of the first column. But we can see there is actually a mistake here, because this page has two row layouts: there is also gvAlternateRow, and we miss those numbers; we only get the rows with a white background. How do we fix that? It's easy: we just get rid of the class in the row selector. But then we also get other stuff, both the even and the odd rows plus the header. To get rid of the header I start iterating from the second row, and in Python that's a slice, [1:], meaning we skip the first row; and indeed we get the expected output. We can see, though, that we still have a row too many, probably the gvPager, so we'd also like to skip the last row, and we do that by putting -1 there: [1:-1]. These are small Python tricks we use all the time; on some pages we might need to drop a couple of rows instead. So now we have these numbers, from 3375 all the way down to the last one, which is all the numbers we would like to have.

This was quite fast because I knew what I was looking for, but you can find all of this in Scrapy's selectors documentation: if you search there for CSS you'll find it explained in quite a bit of detail. As a matter of fact, a lot of this is just basic Python, indexing into arrays, plus CSS, which is a very common thing in web development.
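Putting those steps together, the parse method at this stage is roughly the following sketch (the gv* class names are again my reading of the audio):

    def parse(self, response):
        table = response.css('table.gvTable')
        # No class on the tr selector, so we match both gvRow and
        # gvAlternateRow; [1:-1] drops the header and the gvPager row.
        rows = table.css('tr')[1:-1]
        for row in rows:
            # First column (td), the link inside it, then its text node.
            print(row.css('td')[0].css('a::text').extract_first())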
Now that you have seen all that, any other case would probably be extremely similar. But the thing is, we actually want what these links point to, not just the numbers. So let's go to the Network tab, which is very useful for us, and see what happens when I click a complaint. I'll click the 3375 complaint, and we see that tons of things happened. We can see at the end that this form was returned to us, which is good, but how did that happen? Usually it's the very first call that matters: when I click something, there is a call to another URL, and you can see all the details of that call over here. The request method is POST, the status is fine, and we can see the parameters for it, a key, a count, a server number, and in the Response tab the response body. This looks quite sophisticated, so let's repeat it with a second case and see what exactly happens. I'll click the 4345 case now. Again we see a POST request, and there is a long viewstate here, so we look through the headers and try to find any evidence of 4345 being used. If I copy all of it and paste it into a text editor, I would expect to find 4345, but I can't find it anywhere, which is very weird. I wouldn't say we couldn't figure this out, but let's try an alternative path.

We have the output here for 4345, and let's see what happens if I use the other URL: I click "search by complaint number" and we see the same output here. Let's grab that URL; you can see it has 4345 at the end, and the page has exactly the same information. So my question now is: can I form URLs with that format myself? Let's see. I have the case IDs from before, and for debugging purposes I will not iterate; I'll just use the first one, which is 3375 in this case. I take the URL that ended in 4345, chop the number off, concatenate the case ID with +, and name the result url. Let me also format it a little better, because it now spans multiple lines. OK, no error. I would like the spider to visit this URL, so let's yield a new request. If we go to the documentation and search for "Requests and Responses", we can copy a yield Request example from there; other examples are there as well. It takes the URL, and it also wants a callback: instead of self.parse I will call it parse_details, so I define a function, def parse_details(self, response), and make it print 'hi'.

Let's see if this works... it doesn't, and I can't immediately tell what the problem is, so let's debug. I remove the -L WARN at the end so I can see more information, and now we can easily see that it says "Filtered offsite request" to www.fairfaxcounty.gov. So that was the problem: allowed_domains didn't match the www host the requests actually go to, so let's fix it to match. We go back to our terminal, run again, and now we can see that it prints 'hi'. So here we have the detail page.
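In code, that step looks roughly like the sketch below. The base URL is a hypothetical stand-in for the "search by complaint number" URL with the number chopped off the end, and allowed_domains carries the offsite fix:

    import scrapy

    # Hypothetical stand-in for the 'search by complaint number' URL.
    BASE_URL = 'https://www.fairfaxcounty.gov/complaint-search?id='

    class FFCountySpider(scrapy.Spider):
        name = 'ffcounty'
        # Adjusted to match the www host, since the auto-generated value
        # was getting the detail requests filtered as offsite.
        allowed_domains = ['www.fairfaxcounty.gov']
        start_urls = ['https://www.fairfaxcounty.gov/complaints-list']

        def parse(self, response):
            rows = response.css('table.gvTable tr')[1:-1]
            # Debugging: use just the first complaint instead of iterating.
            case_id = rows[0].css('td')[0].css('a::text').extract_first()
            yield scrapy.Request(BASE_URL + case_id,
                                 callback=self.parse_details)

        def parse_details(self, response):
            print('hi')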
Now, if we really fetched this page successfully, what we'd expect to find in it is, for example, the street address. So it's a similar story to before: we have a table here, with a class that sounds like tbBorder, and more or less two columns, where the first one is the name of the field and the other is the value. Let's find that table: same as above, rows equals the table with class tbBorder, and for every row, without any slicing limits for now, let's extract the text with extract(). So what do we do here? We find this table, we go through its trs in turn, and notice that we call extract() in this case, not extract_first(): extract_first() gives us just the first text node, while extract() gives us, for each row, an array whose first element is the field name, then some whitespace, and whose third element is the value we want. The second row is the same: name, then some spaces, then the value. So we could conclude that the pattern is key, some random stuff, value, and then some random stuff again. This is Python again, so let's print the key and the value, where key is the name of the field and value is that third element. We can see that this didn't work, and the reason, if we go back a little to check, is that some row has more or fewer values; indeed, we can see here Notice of Violation, which is a bit unusual.

So we go back and try another approach. It seems to me we are trying to extract the text too early; let's see what happens if I call extract() on the tds themselves, and whether it gives more structured results. Now we extract the two columns independently, and we see one, then two, so this looks somewhat better, and we'd expect this array to have two elements. Let's name those columns: key_column = columns[0] with extract_first(), and value_column is the second one. We get some error here, because indeed these are still tds, so let's extract just the ::text from them and see if that works any better. With this iterative approach, trying the two or three tricks of the trade, it looks way better. We can still see lots of newlines, which is why it doesn't look nice and structured; there is a string manipulation function in Python called strip() that gets rid of all those weird invisible characters like the newline character. With those removed we can see the key, Street Address, so the first part works, but it doesn't look like the second part does. Again, it's an iterative process: let's comment the first part out and print only the second part. We can see that it is a td with a span inside it, and we can also see that in the inspector: the first column is just a td with some text in it, while the second one has a more complex structure. So let's add span to the selector. Good, now we get the complaint number, and we have all the data from that form with just a few lines of Python. Let's add a separator... OK, we have the complaint number, and we have effectively extracted this whole table.

One thing we can do now is use a dictionary, which is an effective way to build this table out of key-value pairs. We create table = {}, then it's very easy to write table[key] = value, and then return table. By doing this we return to standard Scrapy terminology: when we return a dictionary, Scrapy knows what to do with it, and more specifically, if we add -o fairfax.csv it knows how to create a CSV file out of it. And as you can see, here's the CSV file it just generated for us; I double-click it and there's a very nicely formatted table.
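The parse_details we arrived at is, give or take, the following sketch; tbBorder is my best reading of the class name in the audio:

    def parse_details(self, response):
        table = {}
        for row in response.css('table.tbBorder tr'):
            columns = row.css('td')
            if len(columns) < 2:
                continue  # rows like Notice of Violation are shaped differently
            # The first td holds the field name; the value sits in a span
            # inside the second td. strip() removes the stray newlines.
            key = columns[0].css('::text').extract_first(default='').strip()
            value = columns[1].css('span::text').extract_first()
            table[key] = value
        return table

Because this returns a plain dict, scrapy crawl ffcounty -L WARN -o fairfax.csv is all it takes to get the CSV out.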
This is where the magic happens; this is where the Scrapy framework really is very beneficial for us, because as soon as our output respects those conventions, we don't need to write anything ourselves to export a CSV. Now, earlier I wrote a return statement so that we'd use just the first row of the original table, but if we want all of them, all I need to do is remove that return statement and iterate. I'll also delete the CSV file so we start from scratch, and then rerun. You'll notice it takes a few seconds more, but let's see what we have... we have all the rows from this page in the CSV; all those detail tables are in there. And you can see this is just a few lines of code, I would say no more than 15 or 20. So this is a good starting point, I think, and there are probably way too many things to digest, so please come back with questions and we'll take it from there. I hope you enjoyed this video. Thank you so much and have fun!
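To recap, with the debugging return removed, the listing side of the spider is just a loop that yields one detail request per row, something like this sketch (same assumed names as above):

    def parse(self, response):
        for row in response.css('table.gvTable tr')[1:-1]:
            case_id = row.css('td')[0].css('a::text').extract_first()
            yield scrapy.Request(BASE_URL + case_id,
                                 callback=self.parse_details)

Note that -o appends to an existing file, which is why the old fairfax.csv gets deleted before the rerun.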
Info
Channel: Dimitrios Kouzis-Loukas
Views: 6,116
Rating: 5 out of 5
Keywords: python 3, scraping, scrapy
Id: A8mHjz3-7iM
Length: 25min 2sec (1502 seconds)
Published: Tue Jul 10 2018