How to Scrape Reviews with Python Scrapy | Freelance Gig | Reusable Script

Captions
In today's video we are going to talk about how we can scrape Amazon reviews. I decided to create this video because I see the same job posted on multiple freelancing sites again and again, so we'll write the script once and you can reuse it later.

The product URL is typically long, but what actually matters is the ASIN number. You can find it in the product information section, and it is also part of the URL, so you can strip out everything else, keep just /dp/ and the ASIN, and construct a short URL for the same product page. Similarly, the product reviews page has a short form: take the long URL, remove everything else, and keep just amazon.com/product-reviews/ followed by the ASIN.

So let's start writing the spider. Open the command prompt and create a project: scrapy startproject amazon. Now cd into the amazon folder and generate the spider: scrapy genspider reviews x. The fourth parameter is supposed to be the start URL; I'm just putting x as a placeholder because I'm going to remove it and rewrite the code anyway.

I'm opening the whole folder in Visual Studio Code, but of course you can use any code editor you want. We now have the typical Scrapy project structure. If you are not familiar with it, the only two files that matter for this project are the generated (empty) spider and the settings.py file, where we will be providing some settings.

Let's start with the spider. First I'm going to remove the allowed_domains and start_urls attributes (start_urls is what that fourth parameter generated). Then I take the reviews URL and put it outside the class; the change I make is to cut out the ASIN and leave curly braces in its place. The ASINs go into a separate variable, and I'm actually creating a list. Why a list instead of a string? So that we can provide multiple ASINs and get all the reviews in one go.

Instead of start_urls we are going to use the start_requests method. There we run a loop: for asin in the ASIN list. For each ASIN we create a URL by filling the reviews URL with the format method, and then we simply yield scrapy.Request with that URL. We don't need to provide a callback, because the default callback is parse anyway. For now, parse just prints something like "I'm in parse"; I'm not returning anything yet, because I want to show you what happens if you don't send headers.

Back in the command prompt, always run scrapy list as the first command, even if you have only one spider: it will surface any typos or other errors, or simply list your spiders. Then, because our spider is part of a project, we run it with scrapy crawl reviews, and let's see what we get.
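Here is a minimal sketch of the spider at this point, reconstructed from the steps above. The ASIN in the list is a placeholder and the variable names are my own; the transcript only describes the structure.

```python
import scrapy

# Short-form reviews URL; {} is filled with an ASIN.
REVIEWS_URL = "https://www.amazon.com/product-reviews/{}"

# A list rather than a single string, so several products can be
# scraped in one go. This ASIN is a placeholder, not from the video.
ASIN_LIST = ["B07X6C9RMF"]


class ReviewsSpider(scrapy.Spider):
    name = "reviews"

    def start_requests(self):
        for asin in ASIN_LIST:
            url = REVIEWS_URL.format(asin)
            # No callback needed: the default is self.parse.
            yield scrapy.Request(url)

    def parse(self, response):
        print("I'm in parse")
```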
We get a huge log, and if you're new to the Scrapy world it may seem a little daunting, but it gives us very important information. Let's try to figure it out. The crawl starts here, and the first thing Scrapy does is look for robots.txt; the result is 200. Why does this matter? It means we are respecting the crawling rules: we should never crawl pages which are disallowed by robots.txt, and Scrapy takes care of that for us. Next it tries to fetch our reviews page, and that fails with a 503 error. You may get errors in the 500 or 400 range if you are not sending the correct headers.

So this time we are going to send the correct headers. Where do we get them? Open the page and press F12; in fact, doing this in an incognito window is even better, because it is like opening the page for the first time. There are a lot of requests listed, but right now we are only interested in the first one, because that is what contains our actual response with all the reviews. Look at the request headers, not the response headers. A lot of request headers are being sent; you can ignore the ones which start with a colon, and the rest are the headers you should be sending. Usually just the User-Agent is sufficient, but I typically recommend sending all of them, so I'm copying the whole block with Ctrl+C.

Now let's come back to the code. Instead of writing a headers variable in the spider, I'm going to settings.py. Scroll down and look for "default": DEFAULT_REQUEST_HEADERS is the setting we need to pass, so copy it out, and let me delete everything else so the file stays readable. There is also ROBOTSTXT_OBEY, which is currently set to True; set it to False and Scrapy will not go and look for robots.txt.

What we want is to paste all the header text we copied into a triple-quoted string and convert it into a dictionary, because DEFAULT_REQUEST_HEADERS takes a dictionary. To make that easier, download the scraper-helper package; you can install it with pip install scraper-helper. Yes, this is one of the packages I have published, and it contains a lot of helper functions. Import it first: import scraper_helper as sh. Remember that settings.py is just a regular Python file, so any Python code you write in it will execute. Now call sh.get_dict() on the whole string we copied from the browser, and it creates a dictionary. From now on these headers will be sent with every request; that is the purpose of DEFAULT_REQUEST_HEADERS.

Once this is in place, go back to the command prompt and run the spider again. Now we see the "I'm in parse" print, and the product-reviews page returns 200, which is success. And because we turned the robots setting off, we no longer see a request to the robots.txt page.
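For reference, a sketch of the relevant part of settings.py. scraper-helper is the author's own package and get_dict is the helper named in the video; the header lines below are placeholders rather than the ones copied in the recording, so paste your own from the browser's Network tab.

```python
# settings.py
import scraper_helper as sh  # pip install scraper-helper

# Stop fetching/obeying robots.txt (the video disables this;
# decide for yourself whether that is appropriate).
ROBOTSTXT_OBEY = False

# get_dict() converts the header block copied from the browser
# into the dictionary this setting expects. Placeholder values:
DEFAULT_REQUEST_HEADERS = sh.get_dict(
    """
    accept: text/html,application/xhtml+xml
    accept-language: en-US,en;q=0.9
    user-agent: Mozilla/5.0 (placeholder)
    """
)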
Now let's move on to the next part, which is actually extracting the reviews. Go back to the page and try to understand how it is structured. These are the reviews; the top positive review and the top critical review we can ignore for now, because we are going to scrape all the reviews anyway. What we need is a selector which contains the complete review, so let's use the element picker. This div looks like one, and in fact so does this one; these are all the reviews. The first thing we want is a selector which gives us all of these reviews and only the reviews; then we will run a loop over them. Looking at this particular attribute, data-hook="review" looks like a good candidate. Press Ctrl+F in DevTools and type the attribute in square brackets, creating a CSS selector (XPath would work just as well), and we can see that it matches exactly 10 reviews. It's important to check the first and the last match; all looks good.

Take this selector and come to our code. Here we run a loop: for review in response.css(...), pasting in the selector. Note that I am not calling get() or getall(), and the reason is that I am going to chain further selectors off each review. What is the purpose of chaining? Let me create a selector and then it will be easier to explain. This is the title of the review, and you can see there are 12 matches: 12 because it includes the two top reviews as well as the 10 inside the main container. So we cannot use the title selector directly; we actually have to use both selectors, container and then title, which gives us only the titles inside the main container. We can write that in two ways (consider this a scratch pad): as one combined selector, response.css(...).getall(), or broken in two, response.css(container).css(title), which does exactly the same thing. This is called chaining selectors, and the good thing is that you can even mix CSS and XPath selectors in one chain.

So we take each review as the parent, and then, with some quick selectors I created earlier, we get the profile name, the stars, the title, and the actual review text; finally we yield the item. You will notice that I used XPath for the review body. Why? Because the body contains a lot of empty spaces, tabs, and newline characters, and XPath's built-in normalize-space() function handles that very well. There is one more reason I did this: to show that you can use CSS and XPath together. The actual selector being created is response.css(first part) chained with .xpath(second part), and mixing them up is perfectly fine; CSS selectors are translated into XPath anyway. Format the file and run the spider, and very quickly we see that 10 items were scraped, with the reviewer's name, the stars, the title, and the complete review. If you want more fields, you just have to create more selectors; a sketch of the parse method so far follows below.
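The outer data-hook="review" selector and the normalize-space() trick are stated in the video, but the inner selectors for name, stars, and title flash by on screen ("some quick selectors I created earlier"), so the ones below are assumptions based on Amazon's data-hook markup; verify them in DevTools before relying on them.

```python
def parse(self, response):
    # Outer selector: matches exactly the 10 reviews in the main
    # list, excluding the top positive / top critical pair.
    for review in response.css("div[data-hook=review]"):
        yield {
            # Inner selectors are assumptions -- check in DevTools.
            "name": review.css("span.a-profile-name::text").get(),
            "stars": review.css(
                "[data-hook=review-star-rating] span::text").get(),
            "title": review.css(
                "[data-hook=review-title] span::text").get(),
            # Chaining CSS into XPath: normalize-space() collapses
            # the tabs, newlines, and repeated spaces in the body.
            "review": review.css("[data-hook=review-body]").xpath(
                "normalize-space(.)").get(),
        }
```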
The next part remaining is getting the next-page link, and it is actually pretty easy. All you have to do is create a selector for the next page, so let's have a look at how. The a-last class on the link is not very reliable: it may take you to the next page or to the last page. Instead, we will look for the text "Next page" and find the element which contains it. I'm going to use XPath here, though you could use CSS as well; it doesn't matter. Let me zoom in so that you can read exactly what I'm doing: //a with XPath's contains() method, or in fact we can compare text() directly. We get exactly one match, the anchor tag whose text is "Next page", and from it we extract the href. That gives us the next-page URL, but it is a relative URL, so we need to convert it into an absolute one.

Outside the for loop, write next_page = response.xpath(...).get() with that selector. Then check whether next_page exists: on the last page there is no next link, so it will be None, which makes this a conditional. If it exists, yield scrapy.Request, and remember to use the response.urljoin() function, which converts relative URLs into absolute ones. Again, the default callback is parse, which is what we want, so we don't have to provide a callback explicitly. This code executes only while there is a next page.

You can output the results to a CSV file, for example, and the spider will keep running until it has collected all the reviews we want. If you face a problem, there is one more thing you can do: go to settings.py and set AUTOTHROTTLE_ENABLED = True, which introduces random delays between the requests. With the spider running and the CSV file being written, if I stop it right now we can see that 520 reviews have already been scraped, and the CSV file contains all of them. That's all; I hope you found it useful. Until next time, see you!
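Putting the pagination step into code, here is a sketch of the end of parse(); the "Next page" link text is taken from the video but worth verifying on the live page.

```python
    # ...continuing parse(), after the for loop over reviews.
    # Match the link by its text; contains() also works:
    #   //a[contains(text(), "Next page")]
    next_page = response.xpath('//a[text()="Next page"]/@href').get()
    if next_page:
        # urljoin() turns the relative href into an absolute URL;
        # the default callback is parse again, which is what we want.
        yield scrapy.Request(response.urljoin(next_page))
```

To write the results to a CSV file, Scrapy's feed exports can be used from the command line, e.g. scrapy crawl reviews -o reviews.csv (the exact command isn't shown in the video), and AUTOTHROTTLE_ENABLED = True goes in settings.py as described above.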
Info
Channel: codeRECODE with Upendra
Views: 2,642
Rating: 4.90 out of 5
Keywords: scrape amazon, python web scraping, web scraping with python, web scraping, amazon review scraping, amazon review scraper python, learn python, how to scrape amazon, amazon scraper, save amazon reviews to excel, scrape amazon reviews python, scrape amazon python, scrape amazon product reviews python, how to scrape amazon reviews, web scrape amazon, web scraping amazon, web crawler, python scrape amazon, amazon asin scraper, scrape amazon products, scrape amazon reviews
Id: R-9UWqyFtNQ
Length: 18min 34sec (1114 seconds)
Published: Fri Feb 12 2021