How I Scrape Amazon Reviews using Python, Requests & BeautifulSoup

Captions
Hi everyone, welcome, John here. In today's video I'm going to show you how to scrape Amazon reviews using Requests, BeautifulSoup and Splash. I did a demo of the basics of Splash in my last video, so if you haven't seen that yet go check it out, because it will help you with this one. Basically, Splash is a lightweight browser with an HTTP API: we send pages to it, it renders them for us and sends the results back. That means we can then interrogate the result as if it were plain HTML, because Splash has already executed all the JavaScript for us.

So this is the Amazon page right here, this camera, and we can see that it's got 253 ratings. What I'm going to do is come down to the bottom of the page and find the button right at the bottom that says "see all reviews". We click on that and it takes us to the review page; let me make that a bit bigger so you can see. What we're going to do is get the title, the rating that each reviewer gave, and then scrape the review text as well. Getting review text like this is pretty interesting, because we can do some cool stuff with text analysis, which I will get into in the next video. For this one it's just about getting the data out, so we're going to be scraping those three things.

First we want to open up Docker and start Splash, and we can see it says it's running on port 8050. We can check that by browsing to localhost on port 8050, where it says "Splash is working", so that's great. I'm going to go ahead and copy this page URL, but before I do I'm just going to see how it handles pagination, so we can make sure we get the URL right from the start. When you go to the next page, the URL changes and adds pageNumber=2 at the end. I'm going to set that back to 1, which takes us back to the first page, and copy that.

Now we want to import the modules we need: we need requests, so I'll do import requests, and from bs4 import BeautifulSoup (I can never type that word right, I always get it wrong). The URL is the one I just copied; it's a nice long one, so we'll just leave it there. Now we want to send this off to our Splash service to render the page and send the HTML back. To do that we do r = requests.get, but instead of requesting the Amazon URL directly we request our Splash URL, the localhost one I just showed you: http://localhost:8050. The Splash endpoint we want is the render one, so we use render.html to hit the right point in the API, and then we add our params, which is the URL that we want to scrape, the one we just copied. After that I'm going to add a wait parameter with a value of 2, so Splash gives the page a couple of seconds to finish loading before sending the HTML back.
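Here is a minimal sketch of that first request, assuming Splash is running locally on port 8050; the review-page URL below is a shortened placeholder, not the exact one from the video:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder review-page URL; use the full URL copied from the Amazon reviews page
url = "https://www.amazon.co.uk/product-reviews/EXAMPLE-ASIN/?pageNumber=1"

# Ask Splash (running locally in Docker) to render the page for us.
# "wait": 2 gives the page a couple of seconds for its JavaScript to finish.
r = requests.get(
    "http://localhost:8050/render.html",
    params={"url": url, "wait": 2},
)

print(r.status_code)
print(r.text[:500])  # quick look at the rendered HTML Splash sent back
```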
If I go ahead and print r.text and run that, we get back a big load of HTML, which is what Splash has sent back to us; we can see it all here. If I do it without Splash, just r = requests.get(url) and print r.text again, it comes back much quicker, but up at the top it says "we just need to make sure you're not a robot", which is not what we want. So we don't want that one; we'll keep the Splash version.

Okay, let's create our soup object. We do soup = BeautifulSoup(r.text), the text we just verified came back from Splash with all the information we want, and we also pass "html.parser", because that's how we want to parse the information. Coffee break, that's good, I should probably have drunk that before I started. To check that the soup is working, I like to print soup.title.text, because then we can tell from the page title that we got back the exact page we think we're getting. And we did: look, we're on the Amazon.co.uk customer reviews page for this particular product, so we know it's working.

Now we need to do a bit more digging to find out where and how this information appears on the page, so we go to Inspect Element; I'll make this a bit bigger so we can see. Using the select-element tool we can hover over the whole review block and the individual bits. If I hover over the whole thing and start expanding and collapsing the divs here, we can see that these ones look like they might each be a review: when I hover over the first one the whole review is highlighted, and if I scroll down to the second one and hover over that, the highlight moves. So this is where we want to be; let me zoom in one more step so we can all see a bit clearer.

If we look at these elements, they've all got different ids, so that's no good for us, and the class has got spaces and various names in it. But this attribute here, data-hook, says "review" for each and every one, and we can use that: with BeautifulSoup we can find all the divs that have data-hook="review", and those are the ones we want. So back in our code we say reviews = soup.find_all, it was a div, and in the curly brackets we put data-hook, because that was the attribute the information was under, and the value was "review" (not "reviews"). That finds all of those elements on the page. find_all always returns a list, which we can then loop through; find would only return the first match, so we want to make sure we use find_all here. It's finished now, there we go.

What we want to do next is start looping through each one of those reviews. We'll do a nice easy for loop: for item in reviews. You can call the loop variable whatever you like; I tend to use item, old habits die hard. Inside the loop let's print out some of the information first with item.find, and to know what to look for, let's go back to our page and expand the first review so we can start finding the actual bits of information that we want.
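Continuing from the request above, a rough sketch of the soup setup and the review lookup described here (the data-hook values are the ones shown in the video):

```python
soup = BeautifulSoup(r.text, "html.parser")

# Sanity check: the page title should mention customer reviews for our product
print(soup.title.text)

# Every review on the page sits in a div carrying data-hook="review"
reviews = soup.find_all("div", {"data-hook": "review"})
print(len(reviews))  # typically 10 reviews per page
```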
If we hover over the title bit with the inspect tool, we can see that we're within an a tag, with a span tag nested inside it. This a tag has got data-hook="review-title", so we can absolutely use that to find the element the same way as before: it was an a tag, and we do just like we did with the reviews, curly brackets with data-hook and the value "review-title", and then I ask for .text. If I haven't got anything wrong we should get all of the review titles; there we go, we can see them all coming through. The first one says it's absolutely astonishing, blah blah blah. Great, so that works, but there's a lot of whitespace and extra stuff around it, so we can just add .strip() on the end, run it again, and that removes all of the whitespace for us. Now we've got the titles of the first ten reviews. I'm going to assign that to a variable called title, and remove that extra bracket as well.

The next piece of information to find is the rating. If I do the same thing again with the tool, we can see we're in an i tag, I haven't seen too many of those, and underneath it there's a span holding the "5.0 out of 5 stars" text. We could probably hit that span tag, but I'm going to keep it consistent and use the data-hook again, which here is "review-star-rating", because that way we get all of the text inside the outer tag: you can see it starts here and ends here, so when we ask for the text within it, it goes straight to that rating string. So: rating = item.find, because we want that specific element, it was an i tag, again with data-hook "review-star-rating", and then .text, which goes outside the bracket there. If I print rating and hit run, we print out the rating for each of these reviews, and they come back in the text format we were expecting.

Now, when we want to work with this data outside of this program, we'll want the rating as an actual number, a decimal number, so I'm going to turn it into a float. To do that we need to remove the excess text: "out of 5 stars" is on the end, so we call .replace on the string, type "out of 5 stars" just like it shows down here, and replace it with nothing. Then I'll do .strip() afterwards to remove any extra whitespace, and finally wrap the whole thing in float(), like this. It's got quite long, but you can see what's happening: we get the element, ask for its text, replace the end part of the string we don't want with nothing, strip off the whitespace, and then turn the result into a float, which is a decimal number. Let's run that, and we get the ratings back as decimal numbers. That means when we export the data it comes out as a number, and we can then, I don't know, count how many four-star and five-star reviews there were; it just makes our lives a lot easier.

The last piece of information I'm going to grab is the text of the review itself. Hover over it again here and we can see a similar thing: a span tag with the data-hook attribute again, and that's where everything is.
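Before moving on to the review body, here is roughly what the loop looks like so far, continuing from the previous snippet and assuming the data-hook values named above:

```python
for item in reviews:
    # Review title: an <a> tag with data-hook="review-title"
    title = item.find("a", {"data-hook": "review-title"}).text.strip()

    # Star rating: an <i> tag with data-hook="review-star-rating";
    # its text looks like "5.0 out of 5 stars", so strip the suffix and cast to float
    rating = float(
        item.find("i", {"data-hook": "review-star-rating"})
        .text.replace("out of 5 stars", "")
        .strip()
    )

    print(title, rating)
```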
So, just like we did with the others, let's set body = item.find; I'd forgotten what sort of tag it was, it's a span, and the data-hook is "review-body". Again .text, and I'm also going to .strip() this one to remove all of the extra whitespace. I've missed a bracket there, it should go here so these two match. I'm just going to print that out so we know it's working, although it's going to be an awful lot of text for the terminal. There we go, it's all there. There are probably some extra characters in here that we'd want to remove as well, but I'm not going to cover that in this video; that will be in the next one, where we analyse this information a bit more and see what we can get out of it.

Now that we know all of the info is coming out right, we want to turn what we've got into a Python dictionary. We remove our print statement, call the dictionary review, and build it around the whole thing, turning title, rating and body into our keys, with the actual data that we're scraping as the values. We need a comma at the end of each of those lines so it works properly. After that we can just print the review we've created and check that it works; I should probably remove that other print. There we go, we're getting a title, the rating and the body text.

I'm actually going to add in one more piece of information, and that's the product name. I'll call it product, and we'll take it from soup.title.text; but if you remember when we looked at the title earlier, it had some extra text at the beginning, so let's go and find out exactly what that is. I'll comment the loop out for a moment; I should probably have done this first and saved myself the trouble. I'm going to remove this "Amazon.co.uk: Customer reviews" bit, which gives us an easy-ish way to get the product name, in this case from the page title. We replace that prefix with nothing and then .strip() to remove the whitespace.

Cool, so that's how we get the information, but obviously there are multiple pages, so we need a way to loop through all of them and get all of the reviews. There's also another thing to take into account: Amazon, regardless of your location, tends to show you reviews from other countries as well, and they come up slightly differently. I think this might be one, yes, it says "translate to English"; that review actually sits in a different HTML element with different tags. What I'm going to do is just skip over those, because they would throw up an error, and it's debatable whether they're useful to us anyway since I can't translate the text. We'll deal with that when it pops up.

So we need a good way to run the code for each page without having to find out how many pages there are first. If we look right at the bottom here, we have a next-page button. If we keep clicking on that and head towards the end, it's 203 reviews at 10 per page, so let's go to page 10.
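Here is a sketch of the dictionary being built inside the loop; the exact title prefix that gets stripped is an assumption based on what the video shows, so adjust it to whatever soup.title.text actually returns:

```python
# Product name taken from the page title; the prefix string below is an assumption,
# check soup.title.text and adjust it to match your page
product = soup.title.text.replace("Amazon.co.uk:Customer reviews:", "").strip()

for item in reviews:
    review = {
        "product": product,
        "title": item.find("a", {"data-hook": "review-title"}).text.strip(),
        "rating": float(
            item.find("i", {"data-hook": "review-star-rating"})
            .text.replace("out of 5 stars", "")
            .strip()
        ),
        "body": item.find("span", {"data-hook": "review-body"}).text.strip(),
    }
    print(review)
```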
Okay, and there we can go down to the bottom again and we still have a next-page button, so there must be a few more pages; there are, and eventually we can see it gets greyed out. So I'm going to go back to a previous page real quick, scroll to the bottom, and do Inspect Element on this button. We can see it sits under a list element with the class "a-last", and it contains a link. Now if we go to the last page, where it's greyed out, and hover over that element, the li class is now "a-disabled a-last". So what we can do is set our loop up so it always goes on to the next page until it finds this specific element on the page. We just need to remember that for now.

First of all, though, I'm going to turn this into something a bit better that we can actually use and loop through, so I'm going to split it up into a couple of functions. The first one is going to be our get_soup function, so up at the top we define get_soup, and it takes a URL. I'm going to indent all of this code into it and remove the hard-coded url line; let's just move that up and out of the way for a second. This function takes in the Amazon page URL and returns the soup for us. The next one is get_reviews: we define a new function, get_reviews, give it the soup we've just returned from the first function, and indent all of the review code into it.

To deal with the international reviews, which like I said earlier sit in slightly different elements, we could work out where they are and handle them too, but I'm just going to focus on the English reviews from the amazon.co.uk site. So I'm going to add a quick try/except here: it's not a perfect solution, but it will work for this case. We try to build the review, and if it doesn't work we just pass and move on, because we know what this code is doing. It's not the best way to do error handling, but for this sort of application it's fine. We also need to remove our print statement here, because we want to actually save the review somewhere: outside our functions, at the top, I'm going to create a quick reviewlist, and instead of printing the review we do reviewlist.append(review) to add it to that list.

Now we want to test these functions to make sure they work. I'm going to comment the url line out and copy it, then go up here, open the terminal, and run python3 -i (in my case; it could well be just python in yours) followed by the name of our script. That brings us into the interactive Python terminal, so we can call our functions directly.
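At this point the script might look roughly like the sketch below. It follows the structure described in the video, with the crude try/except skipping the international reviews that don't match these tags; the title-prefix replace is again an assumption:

```python
import requests
from bs4 import BeautifulSoup

reviewlist = []

def get_soup(url):
    # Render the Amazon page through the local Splash instance and return the soup
    r = requests.get("http://localhost:8050/render.html",
                     params={"url": url, "wait": 2})
    return BeautifulSoup(r.text, "html.parser")

def get_reviews(soup):
    # Pull product, title, rating and body out of every review div on the page
    reviews = soup.find_all("div", {"data-hook": "review"})
    try:
        for item in reviews:
            review = {
                # prefix string is an assumption; match it to your page title
                "product": soup.title.text.replace("Amazon.co.uk:Customer reviews:", "").strip(),
                "title": item.find("a", {"data-hook": "review-title"}).text.strip(),
                "rating": float(item.find("i", {"data-hook": "review-star-rating"})
                                .text.replace("out of 5 stars", "").strip()),
                "body": item.find("span", {"data-hook": "review-body"}).text.strip(),
            }
            reviewlist.append(review)
    except Exception:
        # Crude error handling: international reviews use different tags, so skip them
        pass
```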
If we just call get_soup on this page URL it's going to print everything to the screen, so instead we want to save what it returns into a variable; I'm just going to call it soup, for lack of a better word. That saves all of the output into a variable, which is what we want so we can pass it into our next function. Now we do get_reviews(soup), and hopefully, if we print the length of our reviewlist, we should have 10. There we go: it's returned 10, so we know those functions work. We can also print reviewlist and look at the first item, which works, and then the second one too, so we've got both of them there. Now we know our functions work.

So now we want to add them to the bottom of our code so they run when we execute our program, but we also need to deal with the pages. The easiest way to deal with pages is to build the URL with an f-string and slot in the value we give it, and the easiest way to do that in a loop is with a range. So we could do for x in range(1, 10) for now, and if we print x it just runs through and prints the numbers from 1 up to 9, since range stops before 10. We can use this number to generate the URL for each page we want. If we look at the actual URL of the reviews we were looking at, it has the page number at the end, so we can substitute our new number in each time round the loop.

So what we want is soup = get_soup(...), running the function we created that returns the whole page's HTML via Splash, and we pass it our URL as an f-string: we paste the URL in, come all the way across to where it says pageNumber=1, and that 1 is where we put our x, because in an f-string it gets replaced by whatever value x holds, and since we're doing x in range, x becomes our page number. It's a decent way of doing it, maybe not the best; another option would be to go to the bottom of the page, find the next-page button, take its URL and hit that, but I quite like this approach because it's straightforward and easy to understand.

Once we've got our soup, we call get_reviews on it, which runs our function on the HTML from the URL we just generated with x as the page number. I'm also going to add a quick print statement in here so we can see it counting up as it runs: I'll print the length of our reviewlist at this point, because at the end of each get_reviews call we append to the reviewlist, so it should grow by 10 each time.

Then we need the stopping condition: if the element we spotted on the last page is there, break out of the loop. Remember the greyed-out next-page button we saw down here; the moment we don't want to go to the next page is when we find that element in our soup. So I'm going to copy that class, and we say: if not soup.find, it was an li, a list element, with that class, then carry on, so we say pass.
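Put together, the page loop sketched here looks something like this; the review URL is again a placeholder, and "a-disabled a-last" is the class identified on the final page:

```python
for x in range(1, 999):
    # Placeholder URL: substitute the real product-reviews URL, with the f-string
    # dropping the current page number into pageNumber=
    soup = get_soup(
        f"https://www.amazon.co.uk/product-reviews/EXAMPLE-ASIN/?pageNumber={x}"
    )
    get_reviews(soup)
    print(len(reviewlist))  # should grow by roughly 10 each page

    # On the last page the next-page button's <li> carries the class
    # "a-disabled a-last"; once we see it, there are no more pages, so stop.
    if not soup.find("li", {"class": "a-disabled a-last"}):
        pass
    else:
        break
```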
In other words, when it goes through the soup and grabs all of the reviews with this function, if it doesn't find that element it just carries on with pass, and then we can do else and break: if it does find it, it breaks out of our loop. That way we can put basically any number in our range, all the way up to 999 pages, and as soon as it finds this element it breaks out, so we don't end up wasting time. I'm going to try that for now. At the end we could print the total, but actually we don't need to, because we're already printing the length of the reviewlist each time. So when I run this, hopefully we'll see a new number come up each time, 10, 20, 30, as it loops through the pages.

I got to 27. This is the point I mentioned where it starts hitting the international reviews and becomes an issue, because they don't follow this format. I've chosen not to pull them out; if you wanted to, you could totally write that in as well, with another try/except and the equivalent lookups for the international reviews, but I'm leaving them out for this example because it's already a long video and I can't do anything with them anyway, since I don't understand the languages. So that means we only got 27 for this product, but we do go through all the pages and collect everything that matches.

Okay, now that we know that works, I'm going to export the information we do get using pandas; I can remove this url line. So import pandas as pd at the top, and then underneath the loop we do df, for DataFrame, because why not, equals pd.DataFrame with the capital letters, using our reviewlist as the data. Then we do df.to_excel rather than to_csv, because there might be some funny characters in there, and we'll call the file a6400-reviews.xlsx. We set index=False, which means we don't get the DataFrame index on the left-hand side, we just use Excel's own row numbers, and I like to put a print statement at the end just so I know it finished. Along with that I'm going to add another print statement in the loop so we get a bit more output to the terminal: print, and we'll make it an f-string, "getting page" with x in the curly braces, and close the bracket.

So now I'm going to run this, and we should end up with an Excel file with all the reviews for this item nicely split out. I'm just going to let it run: "getting page 1", and off it goes. It's reasonably quick, but remember we're sending a new request off every time we do this, so bear that in mind if you're trying to do lots and lots of scraping on Amazon: send too many requests and it will probably block your IP for a certain amount of time. This is fine, it's not that quick. What we could do is run this on multiple products and get lots of different reviews at the same time; we could save them into a database, which might be quite useful, and then pull them all out again. This was a camera, so you could do it on 10 different cameras, get all the reviews, and compare and contrast; or you could do it on your own products, or on your competitors' products, for data analysis.
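For reference, the export step might look like this sketch, continuing from the loop above; the filename follows the one mentioned in the video (a Sony a6400), and to_excel needs an Excel writer such as openpyxl installed:

```python
import pandas as pd

# Turn the list of review dictionaries into a DataFrame and write it to Excel.
# index=False drops the DataFrame's own index column so the sheet only has our data.
df = pd.DataFrame(reviewlist)
df.to_excel("a6400-reviews.xlsx", index=False)
print("done")  # simple marker so we know the script finished
```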
Now we can grab the file. If I open it up we can see we've got the product name, which we took from the page title, the rating, and all the review text as well, and we can go from there; that's pretty cool. There are only 27-odd reviews, though, so I'll run it again on another product that hopefully has a few more, and we can see it working in that case as well. What's going to have loads of reviews? These headphones, they've got to have loads, right? Ten thousand, okay, that's probably going to get us some. Let's go to the bottom of this one, get to the reviews page, paste that URL in here instead, and call this one sony-headphones. Cool, let's run that and see how far we get. All right, that eventually finished, and over here we can see we got 1127 results; there they all are.

So that's it for this one, guys, thanks for sticking around and watching. It's quite a cool way of getting out some potentially useful data. I'm really liking Splash at the moment, so you'll probably see some more videos coming up using it. Don't be intimidated by the fact that you need to use Docker; it's dead simple to install and run Splash, so definitely check out my last video if you need to know how to work it, because we're only using the very basic parts of it and it's pretty easy. Don't forget to like the video if you liked it, drop a comment below, and subscribe for more web scraping content; there's plenty on my channel already and more to come. Thank you very much and I will see you in the next one.
Info
Channel: John Watson Rooney
Views: 10,638
Rating: 4.9459457 out of 5
Keywords: amazon review scraping, amazon review scraper python, python web scraping, python tutorial, learn python, learn web scraping, python splash, python requests, scrape amazon, web scraping tutorial, how to scrape amazon, amazon scraper, save amazon reviews to excel, scrape amazon reviews python, scrape amazon python, scrape amazon product reviews python, how to scrape amazon using beautifulsoup, how to scrape amazon using python, how to scrape amazon reviews, web scrape amazon
Id: DIT8rwyPEns
Length: 30min 19sec (1819 seconds)
Published: Wed Nov 04 2020