Python and Scrapy - Scraping Dynamic Site (Populated with JavaScript)

Captions
So welcome to another live coding session. This is a task which I found on one of the freelancing sites, and the task is very simple: you need to go to this site and extract this information. Here is the site, and the good thing about this site is that all the schools are listed on one single page, so there is no pagination. That saves us one step, and when we click on any of the schools, it goes to a new page where all the required information is available.

First of all, before we proceed, we need to determine whether the page is static or generated by JavaScript. Just press F12 to bring up the developer toolbox, go to the Network tab, ensure that "Disable cache" is checked, and press Ctrl+Shift+P. This will bring up the Command Palette. Type "disable JavaScript" and then reload the page. Now JavaScript is disabled and you don't see any content here, so that means this page is generated using JavaScript, and we need to handle JavaScript in some way. So let's enable JavaScript again, reload the page, and see what happens.

Now the page is loaded, so let's go to XHR. XHR stands for XMLHttpRequest; without going into details, just consider that there is a separate request sent to the server. So there is "settings" - if we click on Preview we can see what information was received - and then "get-all-schools". Now this looks interesting. Let's look at the response, and we can see that this is a JSON response. This is very good, because now we don't have to go to the page at all. We are not concerned with how this information is presented, and we will not be concerned with how to create CSS or XPath selectors; we directly have access to this JSON, the real data.

Now what happens when we click on any of these schools? We can see that a new request is made, which is again an XHR request, and we can see that a school code was passed, and here we have all the data related to the school. So again, we are not at all concerned with how this information is presented; we already have the data. In the beginning this may sound more complex, but actually it is very good: we don't have to deal with any of the presentation logic, we directly have access to the data.

Now what about this school code - how do we get it? If we look at the previous request, we can see the school-code field right here. So from the first request we will take this school code, and we can construct the next URL very easily; we just pass on this parameter.

One more thing we need to examine is which headers were sent, so let's click on Headers. The first thing is that it is a GET request, and we have the request URL. In the request headers you can see Accept: application/json - whenever a request asks the server for JSON, the Accept header is set like that - so this is what we have to do in our code. The second thing to note is that there is a cookie being passed, and if a cookie is being passed, that means we need to do something about it. Now Scrapy will handle cookies by default, but it has to receive the cookie in the first place. So how will we ensure that the cookie is received? Very simple: instead of starting our scraper with this particular API URL, we will start with this page.
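Before wiring this into Scrapy, the same idea can be probed with a quick throwaway script. This is only a sketch: the endpoint URL below is a placeholder standing in for the "get-all-schools" XHR request seen in the Network tab, not the real address from the video.

```python
import requests

# Placeholder endpoint -- the real one is copied from the XHR request in DevTools.
API_URL = "https://example.com/api/get-all-schools"

headers = {
    "Accept": "application/json",          # ask the server for JSON, as seen in the browser
    "User-Agent": "Mozilla/5.0 (probe)",   # a user agent is also worth sending
}

resp = requests.get(API_URL, headers=headers)
resp.raise_for_status()

schools = resp.json()            # the body is already structured data, no selectors needed
print(len(schools), "schools")
print(schools[0])                # inspect one record to learn the field names
```

If a call like this comes back empty while the browser request works, that is usually the cookie issue mentioned above, which is exactly why the Scrapy spider starts from the page itself.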
So let's generate a spider with scrapy genspider - let's call it schools - and we have to supply the fourth parameter as the start URL. But if you have been following along with my previous videos, you know that Scrapy's guess is usually not good, so we have to manually update the URL anyway; I'm just typing anything and we will take care of it. Then we open the generated Python file in Visual Studio Code with the code command. We have the structure ready, and these are the two things that we need to take care of first. Let's copy this URL; this is going to be our start URL. A reminder that if you have gone through my free course and all the videos, this is a fairly easy task. I am going to remove allowed_domains - not required - and for start_urls I am just copying and pasting this first URL that we will be starting with.

In the parse method we are going to call this API, so let's copy the link address; the URL that we will be calling is this one, very simple. And if you remember, when we were looking at this request there were certain headers to be passed. Let's collapse the response headers - not important - and look at the request headers. The most important one is Accept: application/json. The cookie will be handled automatically, and a user agent is also something we should provide. I'm going to take a shortcut to save time: I copied these headers and created a dictionary, so I'm going to paste it here, and you can see that it's simply a copy.

Now we have to create a new request. A request is going to be scrapy.Request; it takes the URL, a callback method that we still have to create - let's call it parse_api - and then we have to pass the headers, so this is self.headers. Finally we have to yield this request. Here is our parse_api(self, response), the standard signature for all callbacks; I am just going to write pass for now. We could of course take a shortcut and yield directly from here; that's fine too.

Remember that whenever we are processing a JSON response - and this is a JSON response - we will not be dealing with CSS selectors or XPath selectors. Python already has a built-in module for this, but we need to import it: import json. response.body is the string we need to convert into a JSON object, so let's call it raw_data, and let's call the converted result data. Now the json module has a function called load and one called loads; here we are talking about loads - load is different and out of context right now - and notice that its first argument is the string. What string do we have? raw_data. If we look at the type of raw_data it will be a string, and if we look at data it will be a JSON object.

So now we have the JSON object, and what is it? If you look at it, it is a list, and if we treat it as a list we can run a loop over it. So let's write a for loop: for school in data. At this point we are only interested in the school code, so I'm just going to copy its key, and the school code is taken from the item like that - this will return us the value of the school code, and that's all we need to create the next request. This is the next URL: right click, copy link address. Let's call it base_url, and I'm going to remove this last part. Okay, there was one goof-up that I made: it has to be school, not data, because we are dealing with a list and running a loop, so for each item we get its school code. We can combine this with the base URL, so let's call it school_url, and this is going to be base_url plus school_code.
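Putting those pieces together, the spider the video builds up to looks roughly like the sketch below. The URLs, the header values, and the JSON key for the school code are placeholders (the real ones are read straight off the Network tab), and parse_school is filled in in the next step of the walkthrough.

```python
import json

import scrapy


class SchoolsSpider(scrapy.Spider):
    name = "schools"

    # Start with the page itself so the session cookie is received; placeholder URL.
    start_urls = ["https://example.com/school-finder"]

    # Request headers copied from the browser's Network tab (placeholder values).
    headers = {
        "Accept": "application/json",
        "User-Agent": "Mozilla/5.0",
    }

    # Placeholder API URLs -- in the video these are copied from the XHR requests.
    api_url = "https://example.com/api/get-all-schools"
    base_url = "https://example.com/api/school-details?schoolCode="

    def parse(self, response):
        # The start URL only exists so Scrapy picks up the cookie; the data
        # comes from the JSON endpoint, requested with the JSON headers.
        yield scrapy.Request(self.api_url, callback=self.parse_api,
                             headers=self.headers)

    def parse_api(self, response):
        raw_data = response.body
        data = json.loads(raw_data)          # the endpoint returns a JSON list
        for school in data:
            # "schoolCode" stands in for the real field name seen in DevTools.
            school_code = school["schoolCode"]
            school_url = self.base_url + str(school_code)
            yield scrapy.Request(school_url, callback=self.parse_school,
                                 headers=self.headers)

    def parse_school(self, response):
        pass  # sketched after the next part of the transcript
```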
Now we have to create a request: scrapy.Request with school_url and a callback that we will create later - let's call it parse_school. Again, this is a JSON request, so we have to pass the headers, and finally we have to yield this.

Next we need to create parse_school. We are on this per-school page again, and we have the JSON, so let's copy these lines: again response.body will be raw_data, and we'll have to call json.loads to convert it into data. We need to get the name - where is the name? It's here, this is the name of the school, so it comes from data. The physical address and the postal address are sort of nested dictionaries, so the physical address will come from the physical-address part of data, and the display address inside it is the only field that we need; we don't need anything else. The same thing applies to the postal address, so paste it here, adjust it, and done. Let's make it pretty and ensure that there is a comma. Awesome, so we have the postal address.

What other information do we need? The email - where is the email? Here it is, so email again comes from data. Is there anything else we need? We need the phone number of the school - this is where it is - so phone also comes from data. School management again has two parts, and I am sure you can do that yourself, so I am just going to stop here and see if my spider has any errors.

Let's go to the command prompt and run the spider with scrapy runspider. I'm going to run it just to see whether there is an error, and if there is I will correct it. No errors, so I am just going to press Ctrl+C to stop it right now, and let's send the output to all_schools.csv - the switch is -o, the letter O, not zero. I'm going to run it; it's not going to take long because there are not many records to process. Awesome, let's look at the item scraped count - here it is: 208 schools have been processed and we don't see any errors. Here is the CSV file: we have the name, physical address, postal address, email, and phone number for all 208 schools.

Let me look at the real recording time: it took me less than 25 minutes, and I'm going to edit it down a little bit. A hundred dollars for this is very easy - 100, 150, 200 depending on how much you can bargain. Not bad for half an hour of work. See you in the next video.
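Continuing the sketch above, parse_school would pull that handful of fields out of the per-school JSON and yield them as an item. The key names below are illustrative guesses at the nested structure described in the video (taken from the Preview pane of the per-school XHR response), not the site's real field names; this method slots into the SchoolsSpider class sketched earlier.

```python
    def parse_school(self, response):
        data = json.loads(response.body)

        # All key names here are placeholders for the fields seen in DevTools.
        yield {
            "name": data.get("name"),
            "physical_address": data.get("physicalAddress", {}).get("displayAddress"),
            "postal_address": data.get("postalAddress", {}).get("displayAddress"),
            "email": data.get("email"),
            "phone": data.get("phone"),
        }
```

Running it would then look like `scrapy runspider schools.py -o all_schools.csv` (lowercase -o writes the scraped items to the named feed file), matching the CSV with name, addresses, email, and phone produced at the end of the video.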
Info
Channel: codeRECODE with Upendra
Views: 37,248
Rating: 4.865922 out of 5
Keywords: CSS Selector, Python Scrapy, Python Web Scraping, Scrapy Spider, browser scraping, how to scrape data, python scraping, python scrapy tutorial, python web scraping, python web scraping tutorial, scrape web pages, scraping, scrapy, scrapy for beginners, scrapy javascript, scrapy shell, scrapy splash, scrapy tutorial, screen scraping, selectors in scrapy, web scraping, web scraping python, web scraping with python, web scrapping, webscraping, website scraping
Id: Pu3gmdWsLYc
Length: 15min 40sec (940 seconds)
Published: Thu Feb 06 2020