Indian pharmacy WEB SCRAPING tutorial | Scraping INFINITE SCROLL pages | Python Scrapy

Video Statistics and Information

Captions
What's going on, guys. In this video we're going to be scraping medicines from PharmEasy (pharmeasy.in), an Indian site, with a bit of a twist — I've picked the personal care products category. Bear in mind that this video was inspired by one of the subscribers of this channel, so if you're watching this, man, this one is for you personally. Have fun and learn something from it.

Just like OLX — the Indian version of which we scraped recently — this site fetches its data from an API, so it uses infinite scroll. That's another thing to consider: API scraping, compared to regular web scraping with CSS or XPath selectors. If I open the developer tools, go to the Network tab and scroll the page down, a request called something like get-category-products appears, and if you look at its response you can see a data object with all the useful stuff: name, manufacturer, product attributes, images — basically everything we could ever need, for up to 20 products per page. So we'll be using this API to fetch the data and store it to CSV. One last thing to show you: if we take that URL and paste it into the browser (this one happens to be page number three, but that doesn't matter), it returns a plain JSON response, which we'll parse later on. Just keep that in mind.

If you're interested, let's actually start writing some code. In the current working directory I'm creating the file pharmeasy.py, and we'll be using Python's Scrapy framework to write this little scraper. First we need to import scrapy itself, then from scrapy.crawler import CrawlerProcess, and also json, to be able to parse the JSON response, and the csv module to store the results to a CSV file. Of course we could use Scrapy's built-in item export to CSV, but that's a little more involved than plainly appending to a CSV file, so we'll go with the latter in this tutorial.

Now let's draft a class called PharmEasy that inherits from scrapy.Spider, the general-purpose spider. Don't forget to specify the name attribute, otherwise it won't work, so let's set it to pharmeasy. Before proceeding with the class methods, let's set up the run part: create a CrawlerProcess instance, call process.crawl() with our PharmEasy spider as the parameter, and don't forget process.start(), which actually starts the crawling process. Then open a terminal in the current working directory, run python3 pharmeasy.py, and just make sure everything works.

The next thing to consider is the URL we're going to extract the data from — the base API URL. We don't need to include the page number in it; we'll append the page number dynamically within a loop.
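Here is a rough sketch of that starting point, assuming Scrapy is installed. The API_URL value is a placeholder, not the real endpoint — copy the actual get-category-products URL from your browser's Network tab:

import scrapy
from scrapy.crawler import CrawlerProcess
import json   # used later for parsing the API response
import csv    # used later for appending rows to the CSV file


class PharmEasySpider(scrapy.Spider):
    # the name is required, otherwise Scrapy refuses to run the spider
    name = 'pharmeasy'

    # base API URL without the page number (placeholder value; copy the real
    # get-category-products URL from the browser's Network tab)
    API_URL = 'https://pharmeasy.in/api/.../get-category-products?...&page='


if __name__ == '__main__':
    # run the spider straight from the script via CrawlerProcess
    process = CrawlerProcess()
    process.crawl(PharmEasySpider)
    process.start()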
We also need to specify request headers, because if you have a look at the robots.txt file — and it's always a good idea to check robots.txt — the API endpoint we're going to scrape is disallowed, so we need to supply a user agent in order to get past that. To get a user agent, go back to the developer tools, scroll down to the request headers and copy the user-agent value. I'll take the one that belongs to my browser; you can use your own, it doesn't really matter, or use mine as well. We put the user agent into a Python dictionary — and don't forget the quotes, because it's dictionary syntax.

Now let's define the start_requests method. It's a Scrapy-specific method that gets invoked once the spider instance has been created, and it takes only the self instance. Eventually we'll build a next_page URL here to crawl all the pages, but for now let's set next_page to the base URL plus page 0, so we cover the first page. Let's print next_page just to check it — don't mind the error, that's just Scrapy complaining because we haven't yielded any request yet; what matters is that it prints the PharmEasy API URL ending with page zero, so the next_page part is fine.

Now we can yield the request: scrapy.Request with url set to next_page, headers set to our headers dictionary, and a callback function — bear in mind that Scrapy performs its HTTP GET requests asynchronously, which is why we need to specify the callback. The callback will be self.parse, which we haven't created yet, so let's do that right now: def parse takes two arguments, the self instance and the response. Let's print the response status just to make sure we get a proper HTTP status code back — this should return 200, meaning the request succeeded; otherwise we'd need to add some more logic. I hold my breath and run this again. Okay, first a "PharmEasy has no attribute next_page" error, because I wrote self.next_page instead of the plain local next_page variable — sorry, fixed. And now the status is 200, which means our scraper actually retrieved the data.
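A minimal sketch of the start_requests and parse methods described so far; the user-agent string below is only an example, so substitute whatever your own browser sends:

    # inside the PharmEasySpider class

    HEADERS = {
        # any real browser user agent works; this one is just an example
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36',
    }

    def start_requests(self):
        # start with page 0 only; looping over more pages comes later
        next_page = self.API_URL + '0'
        yield scrapy.Request(
            url=next_page,
            headers=self.HEADERS,
            # Scrapy issues requests asynchronously, so the response is
            # handled by a callback rather than a return value
            callback=self.parse,
        )

    def parse(self, response):
        # 200 means the request went through successfully
        print(response.status)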
Now, in order to avoid torturing the target site while debugging our spider, we want to store this response to a file. That's generally good practice: during debugging, don't make the request every time — store the response to a file once, write the data extraction logic against that file, and when it works well, enable the real requests again, crawl the pages, store the data and you're done. So here we can simply say with open('response.json', 'w') as json_file and write response.text into it. If I run this again, a new response.json file should appear — and there it is, which is pretty nice.

From now on we don't need the crawling to happen, so let's comment out the crawling process for a while, and the start_requests yield too. Now we want to read the file we've just created. Let's create a variable called data, which starts out as a plain Python string but will later become a Python dictionary, and say for line in json_file: data += line. Now we can print data, and to exercise the extraction logic we call PharmEasy.parse directly — the first argument is the self instance, so we pass the class there, and we fake the response object with a plain string, because we don't really have a response while debugging. Run this again — here's our data, but now it's been taken from the response.json file, not from the URL, not from an HTTP request.

The next thing to consider is converting this data into Python dictionary format, and for that we simply say json.loads(data). To pretty-print it we can say json.dumps(data, indent=2), and we get a nicely formatted version of the API response. Of course it's still pretty difficult to step through all of this and work out exactly what we need, which is why the Chrome developer tools preview is a far easier way of doing things. Looking at the preview: the response has a data key, and inside it the products, so we need to loop over all the products within the data key. Let's try it: for product in data['data']['products'], and print product['name'] just to make sure we're looping over the right data set. It seems we are, so from now on we can start on the data extraction logic.
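The debugging workflow described here might look roughly like the sketch below; the response.json file name and the data/products key layout are taken from what's shown in the developer tools, so double-check them against the real response:

import json

# one-off, inside parse(): cache the API response locally
#     with open('response.json', 'w') as json_file:
#         json_file.write(response.text)

# while developing the extraction logic, work from the cached file instead
data = ''
with open('response.json', 'r') as json_file:
    for line in json_file:
        data += line

data = json.loads(data)                     # plain string -> Python dictionary
print(json.dumps(data, indent=2))           # pretty-print to inspect the structure

for product in data['data']['products']:    # key layout as seen in DevTools
    print(product['name'])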
So let's have a look at what data in particular we need. The product name is going to be our 'name' field. Let's start building the items extractor: I create a variable called items, which is a Python dictionary, and set 'name' to product['name']. At the very end of the loop we print json.dumps(items, indent=2) to see what's been extracted from the API response. Let's try it — now we get all the names extracted.

Next up: I'm not sure what this slug is all about, but let's keep it as well, so 'slug': product['slug']. Now we have the slug too. The manufacturer is also worth having: 'manufacturer': product['manufacturer'] — copy, paste, and now we've got the manufacturer.

What else? There's mrp_decimal — hold on a second, I need to understand what this data refers to. Okay, mrp_decimal is most likely the price before the discount, and the actual price should then be sale_price_decimal; there's also a calculated discount percentage, but we'll only use sale_price_decimal. So let me grab that key and call the field 'price': product['sale_price_decimal']. Run it one more time, and here we get our prices, which is brilliant.

Another quite important field is the availability, so product_availability_flags — and let me spell "availability" correctly, I had it slightly wrong; quick check, fixed. It's a nested object, so we retrieve is_available from inside it: product['product_availability_flags']['is_available']. Run this one more time — true, true, true; if a product weren't available it would be false, respectively. So that's that.

The very last field to consider is the images — I think it's always good to include the images — so 'images': product['images']. Checking again: we've got our images, but you can see that images comes back as a list, and we want a plain string; if there's more than one URL we want them separated by commas, so we can simply say ', '.join(product['images']), comma plus space. One more look — seems like that's it. Let me quickly open one of the image URLs — copy, paste (my mouse is going crazy) — and yes, it looks just fine. So at this point we've successfully extracted all of this.
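Continuing from the cached data above, the extraction loop could look like this; all key names (slug, manufacturer, sale_price_decimal, product_availability_flags, is_available, images) are read off the API response in the video, so verify them against the actual JSON before relying on them:

for product in data['data']['products']:
    items = {
        'name': product['name'],
        'slug': product['slug'],
        'manufacturer': product['manufacturer'],
        # sale_price_decimal is the current price; mrp_decimal looks like the pre-discount price
        'price': product['sale_price_decimal'],
        # is_available sits inside a nested product_availability_flags object
        'availability': product['product_availability_flags']['is_available'],
        # images come back as a list; join them into one comma-separated string
        'images': ', '.join(product['images']),
    }
    print(json.dumps(items, indent=2))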
There's one more thing to consider: we now need to store this data to a CSV file, and that's done very easily. So here — store, or rather append, the results to a CSV file: with open('pharmeasy.csv', 'a') as csv_file, and then create the writer object: writer = csv.DictWriter — DictWriter, not the plain writer, because we're writing a Python dictionary; if it were a list of values we'd use the simple csv.writer instead. It takes csv_file as the first argument, and we also need to specify the fieldnames, which will be items.keys() — the keys of the items dictionary — and those serve as the field names, i.e. the column names, of the CSV file. And note the file mode: not 'w' but 'a', because we're not writing the file stream, we're appending to it — otherwise every row would overwrite the previous data, sorry about that. Then we simply say writer.writerow(items), and that's basically it.

Let's test whether this actually creates a CSV file. Okay, there is some sort of a CSV file, but it's not that great — why is there only one row? Ah, that's because of the wrong indentation level, of course; this has to run for each item, inside the loop. One more time — and now we've got all our data. Let's also quickly check it with a spreadsheet program; in my case that's LibreOffice Calc. This looks just fine, and whatever ends up in items from the next pages will be appended as well.

One last thing: at the very beginning, when the scraper starts, we want to write the column names once. To do that, let's first print items.keys() so we don't have to type the names by hand — no need to hit the site for that, just grab the keys from the output. Then we create a constructor — __init__, to use the object-oriented programming term — which takes the self instance and is executed before everything else, before the spider goes off to its first request. Inside it we say with open('pharmeasy.csv', 'w') — this time we definitely want to write, not append — and write the column names as a plain string, since that's all a CSV header is, making sure there's a newline at the end. This only takes effect when we actually run the crawler, because otherwise the constructor never gets invoked, so we can't really test it right now.

So let's uncomment everything we commented out — we don't need to debug any more, we just re-enable the crawling process and the yield of the request, so the spider makes the HTTP GET request to the API endpoint, fetches the data, extracts it and writes it to the CSV file. Save, I hold my breath, and try this one more time... and now let's go back to the CSV. Great, guys — we have the name, slug, manufacturer, price, availability and images, which is exactly the result we've been expecting.
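A sketch of the CSV part; note that the store_to_csv helper name is mine, added for readability — in the video the same code sits inline at the end of the product loop in parse():

    # inside the PharmEasySpider class

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # the constructor runs before any requests are made, so it is a
        # convenient place to write the CSV column names exactly once
        with open('pharmeasy.csv', 'w') as csv_file:
            csv_file.write('name,slug,manufacturer,price,availability,images\n')

    def store_to_csv(self, items):
        # 'a' (append) rather than 'w', otherwise each row would overwrite the file
        with open('pharmeasy.csv', 'a') as csv_file:
            # DictWriter because items is a dict; a plain csv.writer would be
            # used for a list of values instead
            writer = csv.DictWriter(csv_file, fieldnames=items.keys())
            writer.writerow(items)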
The very last thing to consider is crawling multiple pages, which mimics the infinite scrolling on the page. So back in start_requests we add a loop, with a comment saying "specify the page range you would like to scrape data for": for page in range(0, 3), which requests pages 0, 1 and 2. If you want more than three pages — ten, twenty, however many there are, I don't know, you'll have to find that out yourself — consider it an exercise for you guys; just raise the upper bound of the range. I'll leave it at three pages being scraped. We also build the URL with str(page) instead of the hard-coded zero, and all the rest stays the same.

So now I hold my breath and run this one last time — I hope it's the last time. We don't need to print the keys anymore, so let's have a look at pharmeasy.csv. It looks like data from multiple pages has been appended. Let me quickly check this in LibreOffice — where is LibreOffice, man... no, that's the terminal... okay, here it is. Great. So here is our data: name, slug, manufacturer, price, images, availability — everything is available — and all the rest. So this is kind of it, guys.

I hope you've learned something interesting from this tutorial. Technically it's pretty much the same as what we did with OLX, and it's quite nice that these Indian websites seem to let you fetch data straight from their APIs — even though robots.txt suggests it isn't allowed, it's still possible, which is really nice. In the description below this video you'll see how to make a request for a custom web scraping tutorial on demand. If you want me to create a web scraping tutorial covering data extraction from a site of your choice, please feel free to make those requests in the comments, and I'd be happy to cover them when I have time — at the very least I'll put the requests into my schedule and make the tutorials from time to time. The idea is to help you learn something, to help you master Python programming through web scraping, and hopefully you'll eventually become freelancers and start making money by programming — that's kind of my goal at this point, at least. I hope you enjoyed this tutorial. I wish you all the best, guys: learn programming, learn web scraping, practice more, and you will definitely succeed. Until next time, and take care.
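To recap the paging change described above, the adjusted start_requests might look like this; bump the upper bound of range() to crawl more pages:

    # inside the PharmEasySpider class

    def start_requests(self):
        # specify the page range you would like to scrape data for;
        # range(0, 3) requests pages 0, 1 and 2 of the category API
        for page in range(0, 3):
            yield scrapy.Request(
                url=self.API_URL + str(page),
                headers=self.HEADERS,
                callback=self.parse,
            )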
Info
Channel: Code Monkey King
Views: 3,144
Rating: 4.7966104 out of 5
Id: c9Z60JLc-i4
Length: 29min 56sec (1796 seconds)
Published: Sun Mar 01 2020