Python and Scrapy - Scraping Dynamic Site (Populated with JavaScript)

Captions
So welcome to another live coding session. This is a task which I found on one of the freelancing sites, and the task is very simple: you need to go to this site and extract this information. Here is the site, and the good thing about this site is that all the schools are listed on one single page, so there is no pagination. That saves us one step, and when we click on any of the schools, it goes to a new page where all the required information is available.

First of all, before we proceed, we need to determine whether the page is static or generated by JavaScript. Just press F12 to bring up the developer toolbox, go to the Network tab, ensure that "Disable cache" is checked, and press Ctrl+Shift+P. This will bring up the Command Palette. Type "disable JavaScript" and then reload the page. Now JavaScript is disabled and you don't see any content here, so that means this page is generated using JavaScript, and we need to handle JavaScript in some way. So let's enable JavaScript again, reload the page, and see what happens.

Now the page is loaded, so let's go to XHR. XHR stands for XMLHttpRequest; without going into details, just consider that there is a separate request sent to the server. So there is "settings" - if we click on Preview we can see what information was received - and then "get-all-schools". Now this looks interesting. Let's look at the response, and we can see that this is a JSON response. This is very good, because now we don't have to go to the page at all. We are not concerned with how this information is presented, and we will not be concerned with how to create CSS or XPath selectors; we directly have access to this JSON, the real data.

Now what happens when we click on any of these schools? We can see that a new request is made, which is again an XHR request, and we can see that a school code was passed, and here we have all the data related to the school. So again, we are not at all concerned with how this information is presented; we already have the data. In the beginning this may sound more complex, but actually it is very good: we don't have to deal with any of the presentation logic, we directly have access to the data.

Now what about this school code - how do we get it? If we look at the previous request, we can see the school-code field right here. So from the first request we will take this school code, and we can construct the next URL very easily; we just pass on this parameter.

One more thing we need to examine is which headers were sent, so let's click on Headers. The first thing is that it is a GET request, and we have the request URL. In the request headers you can see Accept: application/json - whenever a request asks the server for JSON, the Accept header is set like that - so this is what we have to do in our code. The second thing to note is that there is a cookie being passed, and if a cookie is being passed, that means we need to do something about it. Now Scrapy will handle cookies by default, but it has to receive the cookie in the first place. So how will we ensure that the cookie is received? Very simple: instead of starting our scraper with this particular API URL, we will start with this page.
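Before wiring this into Scrapy, the same idea can be probed with a quick throwaway script. This is only a sketch: the endpoint URL below is a placeholder standing in for the "get-all-schools" XHR request seen in the Network tab, not the real address from the video.

```python
import requests

# Placeholder endpoint -- the real one is copied from the XHR request in DevTools.
API_URL = "https://example.com/api/get-all-schools"

headers = {
    "Accept": "application/json",          # ask the server for JSON, as seen in the browser
    "User-Agent": "Mozilla/5.0 (probe)",   # a user agent is also worth sending
}

resp = requests.get(API_URL, headers=headers)
resp.raise_for_status()

schools = resp.json()            # the body is already structured data, no selectors needed
print(len(schools), "schools")
print(schools[0])                # inspect one record to learn the field names
```

If a call like this comes back empty while the browser request works, that is usually the cookie issue mentioned above, which is exactly why the Scrapy spider starts from the page itself.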
So let's generate a spider with scrapy genspider - let's call it schools - and we have to supply the fourth parameter as the start URL. But if you have been following along with my previous videos, you know that Scrapy's guess is usually not good, so we have to manually update the URL anyway; I'm just typing anything and we will take care of it. Then we open the generated Python file in Visual Studio Code with the code command. We have the structure ready, and these are the two things that we need to take care of first. Let's copy this URL; this is going to be our start URL. A reminder that if you have gone through my free course and all the videos, this is a fairly easy task. I am going to remove allowed_domains - not required - and for start_urls I am just copying and pasting this first URL that we will be starting with.

In the parse method we are going to call this API, so let's copy the link address; the URL that we will be calling is this one, very simple. And if you remember, when we were looking at this request there were certain headers to be passed. Let's collapse the response headers - not important - and look at the request headers. The most important one is Accept: application/json. The cookie will be handled automatically, and a user agent is also something we should provide. I'm going to take a shortcut to save time: I copied these headers and created a dictionary, so I'm going to paste it here, and you can see that it's simply a copy.

Now we have to create a new request. A request is going to be scrapy.Request; it takes the URL, a callback method that we still have to create - let's call it parse_api - and then we have to pass the headers, so this is self.headers. Finally we have to yield this request. Here is our parse_api(self, response), the standard signature for all callbacks; I am just going to write pass for now. We could of course take a shortcut and yield directly from here; that's fine too.

Remember that whenever we are processing a JSON response - and this is a JSON response - we will not be dealing with CSS selectors or XPath selectors. Python already has a built-in module for this, but we need to import it: import json. response.body is the string we need to convert into a JSON object, so let's call it raw_data, and let's call the converted result data. Now the json module has a function called load and one called loads; here we are talking about loads - load is different and out of context right now - and notice that its first argument is the string. What string do we have? raw_data. If we look at the type of raw_data it will be a string, and if we look at data it will be a JSON object.

So now we have the JSON object, and what is it? If you look at it, it is a list, and if we treat it as a list we can run a loop over it. So let's write a for loop: for school in data. At this point we are only interested in the school code, so I'm just going to copy its key, and the school code is taken from the item like that - this will return us the value of the school code, and that's all we need to create the next request. This is the next URL: right click, copy link address. Let's call it base_url, and I'm going to remove this last part. Okay, there was one goof-up that I made: it has to be school, not data, because we are dealing with a list and running a loop, so for each item we get its school code. We can combine this with the base URL, so let's call it school_url, and this is going to be base_url plus school_code.
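Putting those pieces together, the spider the video builds up to looks roughly like the sketch below. The URLs, the header values, and the JSON key for the school code are placeholders (the real ones are read straight off the Network tab), and parse_school is filled in in the next step of the walkthrough.

```python
import json

import scrapy


class SchoolsSpider(scrapy.Spider):
    name = "schools"

    # Start with the page itself so the session cookie is received; placeholder URL.
    start_urls = ["https://example.com/school-finder"]

    # Request headers copied from the browser's Network tab (placeholder values).
    headers = {
        "Accept": "application/json",
        "User-Agent": "Mozilla/5.0",
    }

    # Placeholder API URLs -- in the video these are copied from the XHR requests.
    api_url = "https://example.com/api/get-all-schools"
    base_url = "https://example.com/api/school-details?schoolCode="

    def parse(self, response):
        # The start URL only exists so Scrapy picks up the cookie; the data
        # comes from the JSON endpoint, requested with the JSON headers.
        yield scrapy.Request(self.api_url, callback=self.parse_api,
                             headers=self.headers)

    def parse_api(self, response):
        raw_data = response.body
        data = json.loads(raw_data)          # the endpoint returns a JSON list
        for school in data:
            # "schoolCode" stands in for the real field name seen in DevTools.
            school_code = school["schoolCode"]
            school_url = self.base_url + str(school_code)
            yield scrapy.Request(school_url, callback=self.parse_school,
                                 headers=self.headers)

    def parse_school(self, response):
        pass  # sketched after the next part of the transcript
```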
Now we have to create a request: scrapy.Request with school_url and a callback that we will create later - let's call it parse_school. Again, this is a JSON request, so we have to pass the headers, and finally we have to yield this.

Next we need to create parse_school. We are on this per-school page again, and we have the JSON, so let's copy these lines: again response.body will be raw_data, and we'll have to call json.loads to convert it into data. We need to get the name - where is the name? It's here, this is the name of the school, so it comes from data. The physical address and the postal address are sort of nested dictionaries, so the physical address will come from the physical-address part of data, and the display address inside it is the only field that we need; we don't need anything else. The same thing applies to the postal address, so paste it here, adjust it, and done. Let's make it pretty and ensure that there is a comma. Awesome, so we have the postal address.

What other information do we need? The email - where is the email? Here it is, so email again comes from data. Is there anything else we need? We need the phone number of the school - this is where it is - so phone also comes from data. School management again has two parts, and I am sure you can do that yourself, so I am just going to stop here and see if my spider has any errors.

Let's go to the command prompt and run the spider with scrapy runspider. I'm going to run it just to see whether there is an error, and if there is I will correct it. No errors, so I am just going to press Ctrl+C to stop it right now, and let's send the output to all_schools.csv - the switch is -o, the letter O, not zero. I'm going to run it; it's not going to take long because there are not many records to process. Awesome, let's look at the item scraped count - here it is: 208 schools have been processed and we don't see any errors. Here is the CSV file: we have the name, physical address, postal address, email, and phone number for all 208 schools.

Let me look at the real recording time: it took me less than 25 minutes, and I'm going to edit it down a little bit. A hundred dollars for this is very easy - 100, 150, 200 depending on how much you can bargain. Not bad for half an hour of work. See you in the next video.
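Continuing the sketch above, parse_school would pull that handful of fields out of the per-school JSON and yield them as an item. The key names below are illustrative guesses at the nested structure described in the video (taken from the Preview pane of the per-school XHR response), not the site's real field names; this method slots into the SchoolsSpider class sketched earlier.

```python
    def parse_school(self, response):
        data = json.loads(response.body)

        # All key names here are placeholders for the fields seen in DevTools.
        yield {
            "name": data.get("name"),
            "physical_address": data.get("physicalAddress", {}).get("displayAddress"),
            "postal_address": data.get("postalAddress", {}).get("displayAddress"),
            "email": data.get("email"),
            "phone": data.get("phone"),
        }
```

Running it would then look like `scrapy runspider schools.py -o all_schools.csv` (lowercase -o writes the scraped items to the named feed file), matching the CSV with name, addresses, email, and phone produced at the end of the video.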
Info
Channel: codeRECODE with Upendra
Views: 37,248
Rating: 4.865922 out of 5
Keywords: CSS Selector, Python Scrapy, Python Web Scraping, Scrapy Spider, browser scraping, how to scrape data, python scraping, python scrapy tutorial, python web scraping, python web scraping tutorial, scrape web pages, scraping, scrapy, scrapy for beginners, scrapy javascript, scrapy shell, scrapy splash, scrapy tutorial, screen scraping, selectors in scrapy, web scraping, web scraping python, web scraping with python, web scrapping, webscraping, website scraping
Id: Pu3gmdWsLYc
Length: 15min 40sec (940 seconds)
Published: Thu Feb 06 2020