Web Scraping Amazon With Python and Selenium

Captions
Welcome to NovelTech Media. In this video we are going to build a web scraper using Python and Selenium. As you can see, here is our web scraper running on Amazon: it is looking up phones, in this case the iPhone 11, fetching some prices, and storing those prices in a .csv file. Here is our code, and this is what we are going to build in this tutorial. We are also going to look at the possibilities of price tracking and how you could get notified if the price of a certain item drops. What you will learn is how to web scrape Amazon, one of the biggest websites there is, and you will be able to apply this to your own projects or get the info that you need. So if this sounds interesting to you, let's get started.

I have prepared my editor here; I'm using Visual Studio Code as always, and I've created a folder named "source". We're also going to be using Anaconda, and I have my Anaconda prompt open; we'll only need it to install our dependencies and to run our application. First of all, I'm going to create a file called bot.py, a Python file which is going to contain all our necessary code.

Let us do some imports. Since we are planning on exporting all of this into a CSV file, we import csv. After that, from selenium we import webdriver, and from selenium.webdriver.chrome.options we import Options (note the capital O; don't worry, I will explain what all of this means). Then from webdriver_manager.chrome we import ChromeDriverManager. Why am I importing this one? It is going to fetch the adequate version of the Chrome driver for us. You could download chromedriver yourself as an executable file, but then you would run into the problem that whenever your current Chrome version — which updates every few weeks or so — no longer matches the driver, you would need to download it again. The driver manager finds the right version of the driver, installs it the first time, and keeps it updated for future versions.

Let's start by creating a class. Our class is going to be CrawledArticle, which has an __init__ function taking self, a title, and a price, since we want to scrape Amazon items according to a title and a price. We set self.title to title and self.price to price; that's basically it, there is nothing too complex about that. Next we create a class Bot, which is going to be our bot; it doesn't take any arguments here. Let us first define our function: article, which as always takes self, plus a name. (If you are not sure what self and __init__ mean, you could watch some of my other videos that go much deeper into Python, or check out noveltechmedia.com for my Python course.) Let us continue: count is equal to 1, page is equal to 1, our page_increment is equal to 10, and our max_retrieves is equal to 100.
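In code, the data class and the pagination counters described so far might look like the minimal sketch below. The names CrawledArticle, page_increment, and max_retrieves follow the narration; the visited list is only added here to make the counter behaviour visible and is not part of the video's bot:

```python
class CrawledArticle:
    """One scraped item: a title and its price."""
    def __init__(self, title, price):
        self.title = title
        self.price = price

# count          = index of the article within the current page
# page           = current results page
# page_increment = articles per page
# max_retrieves  = cap on the total number of items
count = 1
page = 1
page_increment = 10
max_retrieves = 100

visited = []  # (page, count) slots the loop would scrape
while True:
    # Stop once we would go past the maximum number of items.
    if page_increment * page > max_retrieves:
        break
    # Past the last article on this page: reset count, move to the next page.
    if count > page_increment:
        count = 1
        page += 1
    visited.append((page, count))
    count += 1
```

With these settings the loop walks articles 1-10 on pages 1-10; note that, because the break check runs before the page roll-over, it also touches the first slot of page 11 before stopping.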
So what does this mean? count is the index of the article we are currently on, page represents the number of the page we are crawling, page_increment means how many articles per page we have, and max_retrieves represents the maximum number of items we want to retrieve.

Let us give it our base URL: our url is going to be "https://www.amazon.com/s?k=". This is basically the URL you get when you navigate to amazon.com and search for an article there. Depending on your location you might get the .de site, the .in site for India, and so on, but here we are going to use the .com site. Now we concatenate our name onto it, and we also want to append the page; since we're concatenating strings here, we need to convert the number with str(page). Let me show you what I mean: if you go to amazon.com and type in something like "iphone 11", you're going to see this kind of URL. You could remove the ref parameter — that is something Amazon uses to know where your traffic is coming from — so you can safely ignore it.

Let us continue with our code: now we need to define our options. One thing I forgot, sorry: you of course need to install the dependencies first, so run pip install selenium and pip install webdriver-manager; those are the only two packages you will need. (In my case the requirements are already satisfied, since I have installed them before.)

We set options.headless to False. Headless means we can choose whether a browser window opens while our bot is scraping, or whether the crawler does its thing without one. In order to show you how everything works and how you can inspect the page, it's much better if we set headless to False, but in a production environment you would set this to True.

The next thing we do is add an experimental option, "detach": True. What does this do? It keeps the browser from closing down when our bot quits. We only want this so that we can inspect elements afterwards and look at the HTML and CSS classes we need, because they are not necessarily the same for all browsers: if I open Firefox and go to Amazon, some of the HTML structure is not going to be the same as when I open it in a specific version of Chrome. That's the reason we add it here.

We also want to maximize our window, to have consistency across everything, so that the HTML doesn't change — because if it changes, we can't retrieve the same elements. Then we want to navigate with the browser (I will declare the browser in a second) to get our url, and we call set_page_load_timeout(10), which means the browser waits up to 10 seconds before giving up on loading the elements we want. Now let us create our browser: browser is equal to webdriver.Chrome — the webdriver we imported above — to which we pass ChromeDriverManager().install() and our options. That's how we initiate our browser.

Now we need some logic where we loop through everything and find the items we actually need. We make a while True loop and give it a try/except block, with the except at the bottom. What we need inside are a few if statements. First of all we check whether we have crossed the maximum number of retrieves: if page_increment times page is greater than max_retrieves, we have reached our maximum and we break. The second if statement checks whether count is greater than page_increment; if it is, we set count back to 1 and increment page by 1. Sorry if this is maybe not the fun part, but we need to get the edge cases right before we can actually start scraping — and you could of course implement different logic if you'd like something else.

Now we come to the interesting stuff. First we want to get the title. If you go to Amazon, you will see that it's not that easy to retrieve things. Let me show you: on Amazon, say we wanted to get this — sorry, that's the price — say the title. As you can see, there is nothing too special about it; a class like "a-size-medium" could or could not be the right selector. One thing you can always do is right-click the element and choose Copy XPath; that is always correct and you will get the right element, but keep in mind that if anything in the page changes, your script is most probably not going to find it anymore. That is how web scraping works, and in a lot of cases this is what you will need to fall back to. I will also show you a situation where we have a much better selector, when we search for the price. On GitHub you can find the working code uploaded, and this is the XPath we are going to need, so we set x_path_title equal to it. I have copied it from GitHub because we are not yet in a state to run our application. But let me show you what I mean: as you can see, we have an id "search" — this is basically what you get when you copy the XPath — and then we have these divs: the first article has div[1], the second article div[2], the third div[3], and so on. And when we switch pages, we need to reset this counter for the next page, and so forth; that's the logic behind using count, page, page_increment, max_retrieves, and so on.

Now that we have the XPath, we can get our title: title is equal to browser.find_element_by_xpath, to which we provide our XPath. Then we extract title_text, which is equal to title.get_attribute("innerHTML"). Note that find_element doesn't give us the title text directly but the element — a Selenium WebDriver element — and from that element we still need the text, so we take the inner HTML, call splitlines(), and keep the first line. It would probably work without that, but let's use it as a precaution.

What we want to do with our title element now is click on it. When we click the title, we navigate to the item page — just like when I click anywhere on this listing, it opens — and we want to get the price from there, because in most cases you can only see the price on the item page. I also want to use it because its XPath is much nicer. So for the price we have an x_path_price — don't worry, I will show you how to get it once we launch the browser; in this case let me just copy it. Here the selector is simply the id of the price element inside the buy box. Okay, nice. What we can do now is copy a few things from above.

For the price we get the element by x_path_price, with the same logic, and price_text is equal to price.get_attribute("innerHTML"); we don't need the splitlines here. After we have done this, we want to navigate back to our main page, so we set our url again and call browser.get again; we always need to call get in order to return to the correct page. We again set the page load timeout to 10 seconds, because it could take some time, and if our request isn't met within that time, this would fail. We could also add a check for the time that has passed, but to keep it as simple as possible we do it like this.

Now we put our info into a new CrawledArticle, passing our title_text and price_text. One thing I forgot: we also need some kind of array, so let's create an array a into which we push this — so not a equals, but a.append(info) — and then we increase our count with count += 1.

Okay, nice. Up here we have our try, and the except — it's called except in Python, I only said try/catch to keep it tidy — catches an Exception as e so we can print a stack message: print(e). We want to increase our counter here as well, because if we hit an exception we still want to continue: exceptions are going to happen, for example if an article doesn't exist on a page or a page has fewer than 10 entries, and we want the same error handling procedure here too. We also repeat our url code here, because after an exception we want to return to the search page.

So far so good. One thing we would like to do at the end is close our browser — browser.close() goes here — and we also return our array. That's basically it.

Now we can call this: fetcher is equal to Bot(), so we instantiate our class, and now we can save the results to a file. We use with open, call the file "results.csv", open it for writing with "w", specify newline='' and an encoding of "utf-8" (always very important to specify that) as csv_file; this is a quite common way of writing CSV files. Next, article_writer is equal to csv.writer, to which we pass our csv_file, the delimiter — the standard setting for creating Excel-readable files — the standard quote character, and quoting equal to csv.QUOTE_MINIMAL. Then: for article in fetcher.article("iphone 11") — remember, our function is called article, and we want to search for an iPhone 11 — article_writer.writerow with our article.title and article.price; this should be a list here.

That should actually be it, so let's test it out. I'm sure there are going to be a few problems, but let's try: I'm in the source folder, so python bot.py. Okay, that shouldn't be too hard to fix, we just need a plus sign here; let's run this again. Okay, I really forgot the pluses — I fixed all of them now; because I copied everything over, I had to edit it everywhere. Now it should be working; let's try it out. Okay, great: this opens our browser — headless is set to False, which means the browser window opens, and getting the browser in full size was one of the first things we specified — and now it's doing its job. It navigates to an article, grabs its data, and navigates back: we get the title, we go to the item page (I'm not clicking anything here myself), we get the price, and we return, and it just keeps doing that. Of course you could find much more efficient ways to do this, but I wanted to show you as many functionalities as possible: navigating to a different page, clicking on elements, extracting data, and so on.

Let me show you a few more things. When we want to get this price, for example, we can just inspect it (sorry, the page is in German), and as you can see this is our XPath. If we copy the XPath and paste it here, it's pretty simple: it's just the id of the price inside the buy box, and that is most probably not going to change anytime soon. But for some elements you just get an ugly XPath. For example, let's copy the XPath of this one — okay, this one was actually also quite fine — but if you go to the main page on amazon.com, search for something like an iPhone 11 Pro case, and try to copy the XPath of a title, you're going to get some ugly stuff like this. You just need to use that from time to time; there is no way around it. You can of course think smartly about how you want to crawl the site, but eventually you will need to fall back to XPath, and therefore, if you want to build something like this for production, you should always be ready to update and adapt your selectors.

So we've come to the end of the video. I hope you enjoyed it and learned a lot along the way. If you want to check out more content like this, head over to noveltechmedia.com, or you can always contact me by email or through noveltechmedia.com. I'm happy to see you over there. Thank you for watching, and have a great one!
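To make the pieces above concrete, here is a sketch of the search-URL construction and the CSV export, with the Selenium fetch replaced by a stub so it runs without a browser. The function and field names follow the narration, the stubbed articles are invented sample data, and the real script writes to results.csv instead of an in-memory buffer:

```python
import csv
import io

def build_search_url(name, page):
    # Amazon search URL as in the video: /s?k=<query>&page=<n>.
    # str() is needed because we concatenate the page number into a string.
    return "https://www.amazon.com/s?k=" + name + "&page=" + str(page)

class CrawledArticle:
    def __init__(self, title, price):
        self.title = title
        self.price = price

def article_stub(name):
    # Stand-in for the Bot's article() method: the real method drives
    # Selenium; these sample items are invented so the CSV logic can run
    # anywhere.
    return [CrawledArticle("iPhone 11 64GB", "$499.00"),
            CrawledArticle("iPhone 11 128GB", "$549.00")]

# Write rows the same way the video does: csv.writer with a comma
# delimiter, the standard quote character, and minimal quoting.
buffer = io.StringIO()
article_writer = csv.writer(buffer, delimiter=",", quotechar='"',
                            quoting=csv.QUOTE_MINIMAL)
for article in article_stub("iphone 11"):
    article_writer.writerow([article.title, article.price])

csv_text = buffer.getvalue()
```

In the real script you would swap the buffer for `open("results.csv", "w", newline="", encoding="utf-8")` and the stub for the Bot's article method.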
Info
Channel: NovelTech Media
Views: 9,103
Keywords: web scraping with python, python selenium, amazon web scraping, amazon bot, web scraping with python and selenium, web scraping with selenium, python web scraping, python automation, python tutorial, python selenium tutorial, selenium
Id: RMPpS6KBkgg
Length: 25min 56sec (1556 seconds)
Published: Tue Dec 01 2020