30 Days of Python - Day 18 - Price Tracking & Monitoring on Amazon Python TUTORIAL

Captions
Hey there, welcome to day 18. In this one we're going to create a price monitoring service: we'll open up an e-commerce web page, find products we're interested in, and then monitor their prices, so we can run this at any time and it will keep the price list for those items up to date. We're going to use amazon.com for this, and of course Python, so let's jump in.

To get started I'll jump into VS Code, into our project, and make a new folder for day 18. Inside day 18 we'll make a virtual environment, so open up the terminal with Ctrl+` and cd into day 18. There are a few things I absolutely want to install for this: I typically use pandas, requests, requests-html, selenium, and Jupyter notebooks for the web scraping portion, or at least for preparing it, because monitoring pricing is all about web scraping — we have to open up a web page, scrape the price, and report it back, and those packages help me do exactly that. So run `pipenv install jupyter pandas requests requests-html selenium` and hit enter.

We'll start off with something pretty basic: pick one product on amazon.com and grab whatever that product's price is, then turn that into a function so we can run it for that single product whenever we want. After that we'll expand a little further and grab links from any given category — say you go to a category page and want all of the popular products on it; we'll grab all those links and monitor their prices too. I'm not sure how practical that is unless you're really adamant about buying the latest popular items, but this is much more about getting comfortable with web scraping and the various methods for doing it. I'll let the install finish and we'll come back.

Now that it's done, run `pipenv run jupyter notebook`. Inside of here I'm going to make a new folder called notebooks, and in there a new notebook that will be just a single product scraper. So let's find a single product: I'll go onto amazon.com and look for a fairly simple physical product — down into Sports, then Exercise & Fitness — and grab something rather large like a treadmill or a bike. Whatever it ends up being, the main thing is that the page shows a title and the price is actually displayed; it shouldn't say "Add to Cart to see the price" or point you to other sellers. That page is going to be the URL I end up using, so we'll designate that. Of course you can pick any product, because what we're going to do is pretty simple.

Inside here I want to grab the elements for the title and the price. First the title: inspect the element and we see an ID of productTitle — simple. So that's going to be our title lookup, `#productTitle`, which is really just a CSS selector — you could call it a selector if you like, but I'll leave the name as title lookup.
Then we'll define the price lookup, also with a hash: if we inspect the price element we see an ID of priceblock_ourprice, so that lookup is `#priceblock_ourprice`. Every once in a while there will be other price-like elements on the page — a shipping message price, maybe even a struck-out list price — but priceblock_ourprice is the price the customer would actually end up paying. So now I've got a URL and a couple of things to look up.

Let's do some imports. I normally start with Python requests, and from requests_html we import the HTML class. Then I request the URL with `r = requests.get(url)`, and my HTML string is `r.text`. We can print that string out just to make sure, and one of the things Amazon gives right off the bat is a warning about automated access. That warning is a good indication that I'm probably not going to want to use plain requests here, but let's keep going with it for a moment and I'll explain why. We make an HTML object with requests_html, passing our HTML string to the `html` parameter, and then we can call `.find()` with our lookups. The first one I try is the title lookup, and since there should only be one of them I pass `first=True`. I hit enter and nothing shows up. Amazon isn't going to make scraping super easy by default — they'd rather you use their APIs, and they do have APIs for all sorts of services — but I want something I can use across services, not just Amazon, changing only how my lookups work while keeping the same methodology. So plain Python requests is obviously not working, and I'll comment that part out with a note saying so; that doesn't mean the rest of the approach won't work.
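As a rough sketch, that requests-based attempt looks something like this (the product URL is just a placeholder — use whichever product page you picked):

```python
import requests
from requests_html import HTML

# Placeholder product URL; use any Amazon product page whose price is visible.
url = "https://www.amazon.com/example-product/dp/B000000000/"

title_lookup = "#productTitle"
price_lookup = "#priceblock_ourprice"

r = requests.get(url)
html_str = r.text  # typically includes Amazon's automated-access warning

html_obj = HTML(html=html_str)
print(html_obj.find(title_lookup, first=True))  # None -> plain requests gets blocked
```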
What I need to use instead is selenium. If you're not already familiar with it, go check out day 16 where we cover the installation, because it takes a little more than pip install — you also need the web driver (the Chrome driver) for it to work. We'll import webdriver from selenium and initialize it, and I'm also going to make it headless, which I didn't cover before: headless essentially means it runs without a browser window opening. If you went through day 16, a browser opened and we watched the automation happen in real time; this time I'll import the Options class for Chrome, create `options = Options()`, and call `options.add_argument("--headless")`. You can add all sorts of other options in here too — you could emulate a mobile browser, for example — it's just not something I'm going to do right now. Then the driver is `webdriver.Chrome(options=options)`. Make sure everything imports; if you have errors here it means selenium isn't installed correctly, and we absolutely need it.

Now I want to actually open that URL, so `driver.get(url)` — it opens the page in an emulated browser as if a real person were visiting. Then the body element is `driver.find_element_by_css_selector("body")`, and my HTML string becomes that element's `get_attribute("innerHTML")`. I print out that innerHTML just to verify it really is HTML content, and sure enough it is — and notice the automated-access warning goes away. If I grabbed the entire page, maybe that warning would still be in there somewhere, but my sense is it won't be, because Amazon no longer sees a bare request; it thinks a Google Chrome session is grabbing the page. Now that we've got that HTML string, we can run the other cells again, and this time an element actually comes through: calling `.text` on it gives me the title, so that's our product title, and the product price works exactly the same way, just using the price lookup instead. Print them both out and there we go — we've got a product title and a product price.
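Here's a minimal sketch of that working headless selenium version, reusing the url and lookups from above (this is the Selenium 3 style `find_element_by_css_selector`; newer Selenium 4 releases use `find_element(By.CSS_SELECTOR, "body")` instead):

```python
from requests_html import HTML
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")          # run Chrome without opening a window
driver = webdriver.Chrome(options=options)  # needs chromedriver installed (see day 16)

driver.get(url)
body_el = driver.find_element_by_css_selector("body")
html_str = body_el.get_attribute("innerHTML")

html_obj = HTML(html=html_str)
product_title = html_obj.find(title_lookup, first=True).text
product_price = html_obj.find(price_lookup, first=True).text
print(product_title, product_price)
```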
Now let's scrape products from an individual category. The first thing I'll do is duplicate the first notebook and rename the copy to 2 - category products, just so I don't have to copy and paste a bunch of setup; I'll delete the last few cells but keep the initial driver setup that I still need. Over on amazon.com I'll go into Best Sellers and click into a category — Toys & Games is the very first one, so hit "See more" to get to the full Toys & Games category page and copy that URL. I'll call this list `categories` and make that URL the first entry. Then back to Best Sellers for Electronics, copy that whole URL and add it, and one more — let's skip a few and grab Clothing — copy that and drop it in as well. Run the cell with Shift+Enter and type out `categories` just to make sure we've got all of them.

Now, I know from my experience building web applications that we don't need these entire URLs. You'll want to play around with this, but typically, whenever you see something like `ref=...`, where some argument equals some value, there's a really good chance we don't need it, and we definitely don't need anything after the question mark. I can't say that definitively for every website, but a quick manual test on one of these URLs shows we get the exact same page with the shorter URL, so I'll delete everything from `ref=` onward in each one. One of the downsides of web scraping, of course, is that if they change how their URLs work we'd have to revisit this — that's something you'd absolutely deal with in a real project.

With those URLs in place, grab the first one: `first_url = categories[0]`, then `driver.get(first_url)`, which gives me a page I can work with. Again the body element comes from `driver.find_element_by_css_selector("body")`, the HTML string from `get_attribute("innerHTML")`, and we wrap it as an HTML instance just like before. Now I'm going to grab all of the links with `html_obj.links`, and what we see is literally every link on that page. What we actually want are product pages specifically, and it's hard to tell at a glance which links are product pages and which are not — this is where a regular expression will eventually make a big difference. Before we get to regular expressions, though, I can still iterate through all of these links and look for that same product title and price block on each one. First I want to reformat the links so they include amazon.com: I'll build `new_links` as a list comprehension, keeping only the links that start with a slash. Printing that out, it might be a different-sized list or the exact same size — the goal was just to narrow down the options with that one condition, which is what the inline comprehension does. From those new links I'll then build `page_links`, prefixing each one with https://www.amazon.com (no extra slash needed, since each link already starts with one). Hit enter and now the page links are full URLs.
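Putting that together, the category link collection looks roughly like this — the category URLs below are placeholders standing in for whatever best-seller pages you copied, with the ref portion already trimmed:

```python
# Placeholder best-seller category URLs with the ref= portion trimmed off.
categories = [
    "https://www.amazon.com/Best-Sellers-Toys-Games/zgbs/toys-and-games/",
    "https://www.amazon.com/Best-Sellers-Electronics/zgbs/electronics/",
    "https://www.amazon.com/Best-Sellers-Clothing/zgbs/fashion/",
]

first_url = categories[0]
driver.get(first_url)
body_el = driver.find_element_by_css_selector("body")
html_str = body_el.get_attribute("innerHTML")

html_obj = HTML(html=html_str)
links = html_obj.links  # every link on the page, mostly relative paths

# keep only the relative links, then turn them into full amazon.com URLs
new_links = [x for x in links if x.startswith("/")]
page_links = [f"https://www.amazon.com{x}" for x in new_links]
```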
Let's open up that very first link: it's a product review page — it literally says product reviews in the URL — but we could still try looking for a price on it. To do that I'll take what we built in the single product scraper and turn it into a function. I'll call it scrape_product_page — a slightly better name than scrape_page — and it grabs all of the product-related items, essentially turning the work from that first notebook into a function that returns the product title and product price; all it needs passed in is the URL. I'm also importing time and having it sleep for about 1.2 seconds on each call, because we don't want to overload the Amazon servers. By overload I mean two things: we don't want to make it obvious that a machine, not a human, is hitting the site, and we genuinely don't want to add extra load to their system for no good reason.

Now I'll grab my first link — or at least something I think might be a product link — as `page_links[0]` and scrape it. The first run gives a "not defined" error because I ran that cell before the rest; rerunning it, I then get "title_lookup is not defined" — those are the two lookups I deleted from this notebook, which in retrospect I shouldn't have. So I'll add `title_lookup` and `price_lookup` back as keyword arguments with defaults, in case they ever change. Rerun those cells and now I get a different error: NoneType has no attribute text. That's actually good — it means this particular link isn't a product page, so I can probably just skip it. In other words, I'm going to loop through all of these links and call the scrape inside a try block: for each link, set title and price to None, try to scrape, except pass, and if the title and price are both not None, print them out. This is going to take a little while because of the time.sleep, but we should get at least some titles and prices back. I should probably also print the link itself, so we can start to look for patterns in how product links are formatted. Running it, it goes through every single link we collected — which isn't really what we want. What would be preferred is to first verify which of those links are probably valid product URLs.
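Here's roughly what that function and loop look like, assuming the driver, lookups, and page_links from above:

```python
import time

def scrape_product_page(url, title_lookup=title_lookup, price_lookup=price_lookup):
    time.sleep(1.2)  # be polite; don't hammer Amazon's servers
    driver.get(url)
    body_el = driver.find_element_by_css_selector("body")
    html_str = body_el.get_attribute("innerHTML")
    html_obj = HTML(html=html_str)
    product_title = html_obj.find(title_lookup, first=True).text
    product_price = html_obj.find(price_lookup, first=True).text
    return product_title, product_price

for link in page_links:
    title, price = None, None
    try:
        title, price = scrape_product_page(link)
    except:
        pass
    if title is not None and price is not None:
        print(link, title, price)
```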
Now, the vast majority of e-commerce applications have a very standardized way of linking their products. A big part of the reason is that they don't want product links to change very often, because people share them all the time; the baseline link — much like what we saw when we removed the `ref=` part — stays stable precisely because of that shareability. It's a Uniform Resource Locator, that's what URL stands for, so they keep those as consistent as possible, and being an e-commerce platform, their URLs have to follow some pattern or method. Right off the bat I can see I can get rid of the product review links, so I'll redeclare new_links as `[x for x in new_links if "product-reviews" not in x]`. That alone will change some things for me, which we'll see once the scraping finishes. But that's only one piece of the equation; what's more important is figuring out exactly how each product URL is designed — reverse engineering it — because somebody made a very conscious decision about how these product links would be structured, so we'll do our best to work backwards from that.

In the meantime, the loop is pulling back some prices, titles, and URLs, so it is working as is; it just takes a while because it opens every single link, waits a second or two, and moves on to the next one. While that's still running I want to take note of a couple of these links, so I'll grab two of them and paste them in, commented out, just to look at them in more detail. There are a number of pieces I can get rid of for sure, and it goes back to that same `ref=` we saw before: strip that and the URL will most likely give me the exact same product. Copy and paste the shortened URL and we see a product — a LEGO Classic medium creative set with its price — and back in our scraper output, there it is at the top: $27.99 in both places. So that's the URL and its actual format. To parse this out — and this comes from my experience building web applications and designing these URLs — the first part is just the base URL, the amazon.com piece we added ourselves. The next part is called a slug. Slugs are often unique, but in Amazon's case they might not be, because there are other pieces in the URL: if I go to this link without that last part I get a "Sorry, page not found," and if I also drop the `dp` I still get page not found, so the entire URL is necessary.
So we've got base URL, slash, slug, slash, `dp`, slash, and then one more piece. A quick scan shows that a lot of these links are slug/dp/..., so I can be fairly confident about the `dp` part — that might depend a bit on where in the world you're located, but I'm trying to find the pattern that holds on any system, not just something consistent on mine. The last piece is the product ID; for Amazon these are also known as ASINs (Amazon Standard Identification Numbers). That last segment I know for sure is the product ID from the research I did for this series, and it's the genuinely unique part of the URL: different product, different product ID. That's essentially how this URL is designed. It takes some research and some prior experience before you can just look at a URL and know how to parse it, but that's exactly why I'm showing you this — over time you gain that experience, and breaking apart these URLs really isn't that complicated. It's not always true, of course: sometimes an application changes its URLs, or they're unique to how that system works, and if they don't care about the shareability of the URL the structure can look very different.

If I scroll back up, new_links no longer contains the product review links, so that one filter alone saves a good amount of time going forward — those are links I can just ignore, since they're obviously not product links. Things like the rewards card page or the customer help display probably aren't needed either, but those are guesses, and rather than hard-coding a bunch of exclusions we're better off figuring out the actual pattern of a product URL. That's the piece we still need, because the scraping itself did succeed: if I looped through the other categories too — which I haven't done yet — I'd have all of those products, prices, and titles as of the moment I scraped them, which is pretty cool. But since we want to parse the URL into its parts, we need to go into that in a bit more depth before we finish this off.

I'm going to duplicate the last notebook — the category products one — because I'll be building off of where we left off, and rename the copy to "parse URLs with regular expressions", or just regex. If we scroll to the bottom, I have a note on how this URL pattern works. Regular expressions aren't really that hard, but if you've never worked with them before they can feel challenging because they look strange, so I'm going to do a very basic version based on what I already showed you.
I'll rerun this entire notebook and let it do the scraping while we work. What I showed you were a couple of things I'd already recognized about those two particular URLs from this category — yours might look different, and there's a really good chance they do, because I already know of several different shapes these URLs can take; I didn't mention that earlier because it doesn't matter until we start parsing. With the pattern in mind, here's how I want to parse it: the URL is https://www.amazon.com/, then something, then /dp/, which I noticed on all of those links, and then finally the product ID. Those two blocks — the slug and the product ID — are what I need to extract from all of these URLs, and if I can extract them (that is, if they exist), I can feel pretty confident the link is a product, given how the pattern works.

So what do we actually write to capture that? This is where regular expressions come in. I'll put an `r` in front of the string to make it a raw string, and I'm going to use named groups: the first one named slug and the second one product_id. Inside parentheses we write a question mark, a capital P, and the group name in angle brackets, followed by the expression itself — in this case `[\w-]+`. I'll use that identical expression for both the slug portion and the product ID portion. I'm making some assumptions here, but essentially this matches every character a–z, lower and upper case, every digit 0–9, underscores, and also a dash, one or more times. I knew to include the dash because looking at the slug it has numbers, dashes, capitals, and lowercase letters; the brackets define the set of characters we're looking for, the plus makes sure at least one of them exists, and the parentheses put it into a group named slug. The pattern then skips over the /dp/ part and looks for another match, which goes into the product_id group. As far as I know the product ID never actually contains a dash, but to be on the safe side we'll leave it in, because dashes inside URLs are very common, and dashes in product IDs can be common too — that may or may not be true for Amazon, but this same method works for roughly any valid URL. It's not always enough — sometimes you need to allow periods as well — but I won't worry about that right now; it's a very baseline regular expression. To try it out, I'll set `my_url` to one of these URLs as a regular string and call the pattern `my_regex_pattern`.
Next I'll import the regular expression library, `re`, at the very top. My machine is still doing the scraping, so I'll hit stop — I don't need it to keep going since I already know it works from the last notebook — and after that interrupt I just need to rerun a few cells; the main one is down here: the import, the string we want to parse, and the pattern describing how to parse it. Hopefully you can already see the matching structure between the two. Now I compile the pattern: `regex = re.compile(my_regex_pattern)` — whatever variable name I use here is what I have to use next. Once I've done that I can look for matches against any given string: `regex.match(my_url)` gives me back a match object, so it is matching. If I try a different string, like "abc", I don't get a match back at all, so I can set `my_match` to the result and print it, and for "abc" my match is None. That's pretty cool: I already have a way to go through URLs and drop the ones that don't match this regular expression. I'm not going to do that quite yet; first let's look at the named groups. Remember I named those two parts of the expression — one block is slug, the other is product_id — so if I do have a match I can index into it with brackets: `my_match["product_id"]` gives me the product ID, and `my_match["slug"]` gives me the slug.

That's some basic regular expression usage; if it's interesting to you, definitely look into learning more, because when you do web scraping you end up reaching for regular expressions from time to time, and this is a good example of why. The reason I'm doing it here is partly so that when I load this data into a pandas DataFrame it's a little cleaner — having a product ID, a slug, and a URL is a lot nicer than raw links — and partly so we're not scraping a bunch of pages we really don't have to.
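A sketch of the pattern and the match lookup, with a made-up product URL standing in for a real one:

```python
import re

# A made-up product URL in the slug/dp/product-id shape described above.
my_url = "https://www.amazon.com/Example-Product-Name/dp/B000000000/"
my_regex_pattern = r"https://www\.amazon\.com/(?P<slug>[\w-]+)/dp/(?P<product_id>[\w-]+)/"

regex = re.compile(my_regex_pattern)
my_match = regex.match(my_url)       # a Match object, or None if the URL doesn't fit
if my_match is not None:
    print(my_match["product_id"])    # B000000000
    print(my_match["slug"])          # Example-Product-Name
```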
Now, it probably won't surprise you that this pattern isn't the only way a URL can get to an individual product — I found a couple of other forms, so I'll paste those in as a list of regex options. They share a very similar structure, and at minimum every one of them contains a product ID, so we can feel pretty confident that URLs matching any of them are real products. We can be even more confident because once a page is scraped we grab the product name and price anyway, so the pattern being slightly inaccurate isn't a disaster — this is just about making things more efficient and, ideally, pulling out the product ID before we ever scrape the page. Over time you'd want to improve this if you were really going all-in on scraping Amazon products, but for what we've done so far the idea is simply "here's a group of products I want to monitor." It's not meant to be a monitor-all-of-Amazon-all-the-time kind of project; that's possible, but it's a much bigger project than an hour or so of work.

So let's parse URLs against these regex options. What I'm actually looking for is just the ability to extract the product ID — I no longer really care about the slug; it would be nice if every pattern had one, but with the way these options look there's no way to rely on that. I'll define `extract_product_id_from_url(url)`: right off the bat set `product_id = None` and return it at the end. In between, I loop through each regex pattern string for that specific URL — exactly what we did above: compile the pattern, grab the match, and if the match is not None, try to read `match["product_id"]` inside a try block, specifically so a bad value doesn't break my loop or whatever function is calling this; instead it just falls back to None. There's a very high chance the product ID exists whenever the match isn't None, but it's better to be safe, and if a URL ever fails here, this would actually be a really good place to log it somewhere so you can review URLs that may or may not work anymore.

Now that I've got these regex options and this function, I can use it to clean up my new links. I'm also going to move the actual scraping into a function — `perform_scrape(page_links=[])` — so it doesn't just run automatically; I'll probably add a few more things to it later, but for now it just takes the page links.
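A sketch of that helper; note the regex_options list here only shows the main pattern plus one plausible variant (the /gp/product/ form) — the exact alternates you end up with depend on what your category pages actually contain:

```python
# Main pattern plus one plausible variant; adjust to whatever link shapes you find.
regex_options = [
    r"https://www\.amazon\.com/(?P<slug>[\w-]+)/dp/(?P<product_id>[\w-]+)/",
    r"https://www\.amazon\.com/gp/product/(?P<product_id>[\w-]+)/",
]

def extract_product_id_from_url(url):
    product_id = None
    for regex_str in regex_options:
        regex = re.compile(regex_str)
        match = regex.match(url)
        if match is not None:
            try:
                product_id = match["product_id"]
            except:
                pass  # a good spot to log URLs that matched but had no usable id
    return product_id
```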
Now, instead of looping over every page link, I only want the ones where a product ID can actually be extracted — essentially `[x for x in page_links if extract_product_id_from_url(x) is not None]`. If you're newer to Python, that comprehension can look a little strange even if the earlier ones didn't, so let's write it as a plain loop: for each URL in page_links (page_links is just an arbitrary name, so we can call the loop variable url), grab the product ID using the extraction function, start with an empty list called final_page_links, and if the product ID is not None, append that URL. We can go a little further and append a dictionary instead, with the url and the product_id, which is actually better because it moves us toward what we eventually want: a list of dictionaries we can load into a pandas DataFrame later.

The final page links are looking pretty good. As a sanity check I'll compare the length of page_links to the length of final_page_links — there's no way those should be equal, because I know for sure that not every page link has an extractable product ID; we all looked at them. If they come out equal, that's usually a sign something stale is hanging around — an example of the naming problems you run into during development. Next I want to remove everything from the notebook that's no longer necessary: I still need the category URL, the driver work on it, the HTML object, and extracting all the links; I no longer need the one-off experiments, so I'll keep new_links, keep the scrape_product_page function, and comment out or delete the examples, including my_match. I definitely need the regex options and the extraction function, and I'll wrap this cleaning step into its own function too: `def clean_page_links(page_links)` (or whatever the links argument is), run through all of it, and return the cleaned list — then `cleaned_links = clean_page_links(page_links)`. The page_links construction itself I can move back up next to new_links, since it's really just grabbing the links, so the whole thing runs as one operation instead of being spread across cells.
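That cleanup, written out as a function:

```python
def clean_page_links(page_links=[]):
    final_page_links = []
    for url in page_links:
        product_id = extract_product_id_from_url(url)
        if product_id is not None:
            final_page_links.append({"url": url, "product_id": product_id})
    return final_page_links

cleaned_links = clean_page_links(page_links)
print(len(page_links) == len(cleaned_links))  # should be False: plenty of links get filtered out
```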
Now that I've removed that example code, let's restart the kernel and run all. I actually expect to see an error here, and I get one right away — which also tells me the earlier True result was coming from something Jupyter still had in memory. The error says my_url is not defined, and sure enough it's inside extract_product_id_from_url: I copied and pasted that line from the regex example, and it needs to use the url that's passed into the function instead. With that fixed, restart and run all again, this time with no errors — the length comparison now gives False, and if I look at the cleaned links I see URLs with their associated product IDs; anything without a product ID simply isn't appended.

That also means the final step is updating perform_scrape. Right now it loops over page_links, but it should really take the cleaned items — let's call the parameter cleaned_items, defaulting to an empty list so it doesn't error out — and loop `for obj in cleaned_items`, where each obj is just one of those dictionaries, so the link is `obj["url"]` and the product ID is `obj["product_id"]`. I'll also add a list called data_extracted, and no matter what the title or price turn out to be, I build a product_data dictionary with the url, the product_id, the title (which can be None, and that's okay), and the price, and append it to data_extracted. Keeping the None rows means I can go back later, inspect those URLs manually, and figure out why the title or price was missing. Finally, perform_scrape should return data_extracted.

Before running it on the cleaned links, I'll check the lengths: the original page_links had about 170 entries and the cleaned links 51, which is a lot more manageable to walk through. With that, maybe I'll drop the time.sleep down a bit — let's try half a second; that's still fast for a human paging through products, but let's see if it still works, since this is exactly the situation where we don't want to over-request. I kick it off, realize I messed something up, interrupt it, and assign `extracted_data = perform_scrape(...)` so we can print what comes back — and of course that only works if the function actually returns the extracted data, which is yet another reason to do all of this testing.
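And the updated perform_scrape, roughly:

```python
def perform_scrape(cleaned_items=[]):
    data_extracted = []
    for obj in cleaned_items:
        link = obj["url"]
        product_id = obj["product_id"]
        title, price = None, None
        try:
            title, price = scrape_product_page(link)
        except:
            pass
        product_data = {
            "url": link,
            "product_id": product_id,
            "title": title,   # can be None; worth inspecting those rows later
            "price": price,   # can be None; worth inspecting those rows later
        }
        data_extracted.append(product_data)
    return data_extracted

extracted_data = perform_scrape(cleaned_items=cleaned_links)
```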
So we run that, and now it's not scraping any longer — perhaps the product scrape is failing. I'll restart the kernel and run all again; my guess is that when I interrupted the scraping event it affected the selenium driver, maybe closing it, which is why nothing happened afterwards. Eventually it might make sense to create the driver inside the perform_scrape function itself, but that's not something I want to cover yet. I'll let this finish and then look at the extracted data. And here it is: for each item we have the URL, a product ID, a title, and a price, which is looking really good. You could add more data to this, and if you want more, now is the time to do it — it's really hard to draw trend lines from historical data you never collected. Good examples would be the number of ratings and the star rating at the time of the scrape. For me, though, I'm just doing price tracking, and at this point I already have a bunch of product URLs I could track from directly; I don't strictly need to keep scraping the category pages, but in my case I'm going to put it all together anyway.

So let's put this all together. I'll copy this notebook — you can work off the one you've already been using if you prefer — and rename it "putting it all together". The first thing I want to do is turn the category portion into its own function, and also change how the categories are stored: instead of bare URLs I want a list of dictionaries, each with a name (I'll just use the slug) and a url. With that done, I'll loop through each category to extract all the product links I might want; those will eventually feed into the same extracted data — I'll have the URL and product ID for each, even if I don't have a title or price yet. So the function is `scrape_category_product_links(categories=[])`, and inside it I use exactly the same method as before: for each category, get the URL with `category.get("url")`, have the driver open it — and on each loop I add a little time.sleep, because pretty much every time I do a get call I like to add a bit of extra time — then grab the HTML string from the body, build the HTML object, and collect the page links. I'll delete the leftover cells I no longer need, and since I'm pretty sure I'll need to clean those page links again, I'll move the regex cells and clean_page_links above this function, call `cleaned_links = clean_page_links(page_links)` inside the loop, accumulate them with `all_product_links += cleaned_links`, and finally return all_product_links. Note that this doesn't scrape any products — it just collects the product links. Let's test it: call it and store the result as all_product_links (really all product items at this point, but I'll keep the name "links" to avoid making it more confusing than it already is), restart and clear the kernel output, run the first few cells to make sure the category scraping works — it actually went a lot faster than I anticipated — and print out all of those links.
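A sketch of that category function, with the categories now stored as dictionaries (the URLs are again placeholder best-seller pages):

```python
categories = [
    {"name": "toys-and-games", "url": "https://www.amazon.com/Best-Sellers-Toys-Games/zgbs/toys-and-games/"},
    {"name": "electronics", "url": "https://www.amazon.com/Best-Sellers-Electronics/zgbs/electronics/"},
    {"name": "fashion", "url": "https://www.amazon.com/Best-Sellers-Clothing/zgbs/fashion/"},
]

def scrape_category_product_links(categories=[]):
    all_product_links = []
    for category in categories:
        time.sleep(1.5)  # a little extra pause before each page load
        url = category.get("url")
        driver.get(url)
        body_el = driver.find_element_by_css_selector("body")
        html_str = body_el.get_attribute("innerHTML")
        html_obj = HTML(html=html_str)
        new_links = [x for x in html_obj.links if x.startswith("/")]
        page_links = [f"https://www.amazon.com{x}" for x in new_links]
        all_product_links += clean_page_links(page_links)
    return all_product_links

all_product_links = scrape_category_product_links(categories)
```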
Now I want to write these product links out, and this time I'm going to use pathlib instead of os — if you're not familiar with pathlib, this will give you some exposure to it. So `from pathlib import Path`, and I want my current working directory: `BASE_DIR = Path.cwd()`, then `DATA_DIR = BASE_DIR / "data"`. That's equivalent to `os.path.join(BASE_DIR, "data")`, just a lot simpler to read. Then, if the data directory doesn't exist, `DATA_DIR.mkdir(exist_ok=True)` — the os equivalent being an `os.path.exists` check and `os.makedirs(..., exist_ok=True)`. With that in place I'll define two output paths: a category links output, `DATA_DIR / "category-products.csv"`, and a general products output, `DATA_DIR / "products.csv"`. To be more efficient you might use multiple CSV files, especially with a lot of categories, but I'll leave it at these two.

Next, I turn all of those product links into a DataFrame: `category_df = pd.DataFrame(all_product_links)`. The first run complains that pandas is not defined, so import pandas as pd and try again — that works, and `category_df.head()` shows a url column and a product_id column. Cool. So I'll save that to CSV, writing it into the category products output with `index=False`, since we don't need to store the index.
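The pathlib setup and the first save, roughly (the file names follow what's described above):

```python
import pandas as pd
from pathlib import Path

BASE_DIR = Path.cwd()
DATA_DIR = BASE_DIR / "data"          # like os.path.join(BASE_DIR, "data")
if not DATA_DIR.exists():
    DATA_DIR.mkdir(exist_ok=True)     # like os.makedirs(..., exist_ok=True)

product_category_links_output = DATA_DIR / "category-products.csv"
products_output = DATA_DIR / "products.csv"

category_df = pd.DataFrame(all_product_links)
category_df.head()                    # columns: url, product_id
category_df.to_csv(product_category_links_output, index=False)
```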
After running that, if I look at the notebook's file browser I can see a new data directory with category-products.csv inside. Let's delete some of the older note cells, and since running the categories and saving the result is really its own step, wrap it into a function as well: `def extract_categories_and_save(categories=[])` — it collects all the product links, builds the DataFrame, and writes the CSV, so the whole thing runs in one call, and we can run it again whenever we want.

Now I need to go through everything in category-products.csv and extract and save the final data. In other words, I'm deliberately not doing both things at the same time: when I get a category's product links, I am not also scraping each product's data right then and there — those are two separate processes. So the next step is grabbing every link inside that newly created CSV file with `pd.read_csv(...)`. The first attempt throws "No columns to parse from file", and sure enough the CSV is empty — which means something about my extraction and saving was wrong, and the culprit is up above: I never passed the original categories into the function call. Pass those in, run it again (this time it should take a little while), and then trigger the read_csv again; now I can see all the URLs.

Once that's done I just need to run through every row, which is very close to what pandas `apply` is for. I'll copy the scraping method, bring it above, and turn it into `row_scrape_event(row, *args, **kwargs)` — it runs on each row, and each row is roughly one of those product dictionaries, but I only need part of it: the URL. Scrape it, then set `row["title"]`, `row["price"]` if there is one, and also `row["timestamp"]`, which I'll fill in properly in a moment, and return the row. Then in the main cell I call `df.apply(row_scrape_event, axis=1)` — making sure the row_scrape_event cell has been run so it's in memory — and apply goes line by line, eventually scraping every one of those URLs, with a placeholder for the timestamp.
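Roughly, the row-level scrape and the apply call look like this (the timestamp gets filled in properly in the next step):

```python
df = pd.read_csv(product_category_links_output)

def row_scrape_event(row, *args, **kwargs):
    link = row["url"]
    title, price = None, None
    try:
        title, price = scrape_product_page(link)
    except:
        pass
    row["title"] = title
    row["price"] = price
    row["timestamp"] = None  # placeholder; replaced with a real timestamp below
    return row

df = df.apply(row_scrape_event, axis=1)  # runs the scrape on every row, one at a time
```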
While all of that's running, I'll import datetime so we can create a real timestamp for each scraping event: `datetime.datetime.now().timestamp()`. Going row by row will take a good amount of time, mainly because of how scrape_product_page works — it pauses before every product page — but once it's done we'll have all of those products. I actually stopped it a little early so I could look at the results, and what I got back is the URL, the product ID, and the title for each row. One thing I should have done is assign the DataFrame to the result of that apply call — that's something I'll still do — but the point is that we now have the timestamp column.

What we need next is a products DataFrame that we keep appending to. Initially there's nothing in the products output file, so I'll create it once by writing the DataFrame out with to_csv and index=False, and then `products_df = pd.read_csv(products_output)` — at this point it probably looks the same as the categories one, since it doesn't have scrape results yet. Then, after the apply runs, I concatenate: `final_df = pd.concat([products_df, df])`, which adds the new results onto the original products data, and write final_df back out to the products output with index=False; later I can do additional cleanup on that data.

Rather than scraping everything at once, I'll take a small slice: `df_sub = df.head(40)` grabs the first 40 rows (tail would give the last 40, and len or `df.shape` — rows by columns — tells you the size), so the point is we can chunk it. Run the apply on that sub-frame, then do the concat. My first concat just showed the exact same data twice, which is no surprise, and I also needed to rerun things so the apply had a working driver — I'd interrupted it earlier — so: restart and run all. I ended up printing some of the data from inside the row apply because it was taking longer than it should — that's one of the downsides of selenium; sometimes it works really well and sometimes it doesn't — and I also added about a second and a half back to scrape_product_page to give each page a little more time to load.
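Sketching out the timestamp plus the append step — the exists() check here is my own small guard for the very first run:

```python
import datetime

# inside row_scrape_event, instead of the placeholder:
#   row["timestamp"] = datetime.datetime.now().timestamp()

# create products.csv on the very first run, then read it back
if not products_output.exists():
    df.to_csv(products_output, index=False)
products_df = pd.read_csv(products_output)

# scrape a manageable chunk, then append the results to the running products file
df_sub = df.head(40)                            # first 40 rows; df.shape gives (rows, columns)
df_sub = df_sub.apply(row_scrape_event, axis=1)
final_df = pd.concat([products_df, df_sub])
final_df.to_csv(products_output, index=False)
```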
It's possible that Amazon was, you know, throttling the requests I was making, so I also made the chunk size a little bit smaller, and as I printed things out I could see that it actually went through pretty quickly and produced what I wanted. If I look at final_df.head, or actually let's look at the tail, what we see is the data coming through here. You could change the timestamp to a date/time string if you wanted, but I'm just going to stick with the timestamp, and I can just keep adding these scraping events. Realistically, I'm not too worried about products being repeated here; what I'm mostly concerned with is that I'm actually getting the data and that I have a timestamp associated with it. One thing I didn't do was pass the category down to any of those links. I did scrape them, but you do want to bring that category name in, so let's make sure we do that. That was inside of the all product links function; I just want to add one attribute to the clean links it builds. In clean_page_links I'll add category as a keyword argument defaulting to None, and then pass the category in here. Cool, that's definitely something I intended to have and just missed. As for the category, you might actually end up using the category URL instead of the category name itself, so that's something else to think about, but I'm going to leave it as the category name. I'm doing that because I already have the reference to the link here; if I didn't have that reference to the link, then I would certainly want that URL as well. Now with that, I'm going to run this again, and this time I'm not going to do a sub-version of it; instead I'm just going to say df.copy and hopefully all 150 of them or so will actually run. When I rerun this later it's going to pick up new items from these categories, so a month from now those items will probably change, well, not all of them, but a lot of them will, so the category products CSV file, this one right here, is certainly going to change; it's going to be wiped out and rewritten constantly. The products CSV will also change, but not nearly as much. There is another thing I should probably consider, and that is what happens when I loop through every single link: if I loop through every row, I'm going to redo it, so I'm never actually adding much new data, I'm just replacing the data that's already in there. So in this row apply I want to track a scraped flag. We'll say scraped equals False, then try scraped equals the row's 'scraped' value, and except, pass. Scraped being False is actually a little awkward with the CSV file, so I'm going to use zero or one instead; after all of the scraping happens I'll just set the row's scraped value to 1. That also means that up here I can say: if scraped equals 1, or scraped equals the string '1', then we skip it, in other words I just return the row; otherwise it goes through and does the scraping, so it shouldn't re-scrape those old ones. That's the point here.
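Here is a rough sketch of those two changes as I understand them. Again, the function names, the link-dict keys, and the 1.5 second delay are assumptions based on the narration rather than the exact notebook code.

```python
import time

def scrape_product_page(url, driver=None):
    # Placeholder for the real selenium lookup; the extra sleep just gives
    # each page (and Amazon) a little more breathing room between requests.
    time.sleep(1.5)
    return "Example title", "$99.99"

def clean_page_links(page_links, category=None):
    # Carry the category name down onto every product link we keep.
    return [{"url": url, "category": category} for url in page_links]

def row_scrape_event(row):
    # Skip rows that were already scraped on a previous run. The flag is
    # stored as 0/1 because booleans round-trip awkwardly through a CSV.
    scraped = 0
    try:
        scraped = row["scraped"]
    except KeyError:
        pass
    if scraped == 1 or scraped == "1":
        return row
    title, price = scrape_product_page(row["url"])
    row["title"] = title
    row["price"] = price
    row["scraped"] = 1
    return row
```

The 0/1 flag is just a pragmatic choice: once the data frame has been written out and read back from CSV, a boolean may come back as the string "False", so comparing against 1 or "1" is simpler.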
Okay, so let's restart this all again: restart and run all. All right, I actually interrupted this process because I started noticing in my print statements that prices and titles were not coming back, even though if I look at any one of these products directly I can see that the price and the title are certainly there. There are a couple of reasons this could have happened, and I suspect it was because of my driver: the Chrome web driver might have been making requests too quickly for any given page, since each page has to load fully. The other thing is that Amazon has safeguards against this. Sure, they encourage you to buy products from there, but they're probably not going to want you to hammer their servers or any product page over and over again, and instead of blocking our IP address it just throttles it, so any sort of machine that's trying to grab pages will be throttled. So the purpose of this day was not so much to blindly scrape a bunch of things on Amazon's website; it was much more to introduce you to even more concepts in web scraping, put them to practical use, and also see some of the limitations we have. You can't just web scrape everything all the time as fast as possible: there are limitations to your machine and limitations to what these services will allow you to do, which is actually a really good thing. We don't want to overload any one system, and we don't want people overloading ours whenever we build web applications either, so that's also an important part. Now, what I actually challenge you to do is to convert all of this into one or a couple of methods that take just a list of products and do something very similar, but instead of blindly grabbing all kinds of products, grab the specific ones that you're interested in, because at this point you should know how to do that; a few videos or a few sections ago you probably already knew how to do that. That's the challenge I'll leave you with. Hey there, thanks so much for watching day 18. We covered a lot in this one, but there might be other services or websites where you're considering doing price tracking. Please let me know what those are; I would love to check them out and give you some pointers, if I can, on how you might most effectively do that web scraping. Of course, if you have pointers for other people, please let them know in the comments as well. Thanks again and we'll see you next time.
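For the challenge mentioned at the end of the transcript, one possible shape, purely a sketch with names of my own choosing, is a single function that takes the product URLs you care about plus the page-scraping function built earlier and appends timestamped prices to a running CSV:

```python
import datetime
import pandas as pd

def track_products(urls, scrape_fn, output_csv="tracked_prices.csv"):
    """Scrape a hand-picked list of product URLs and append one
    timestamped row per product to a running CSV of prices."""
    rows = []
    for url in urls:
        title, price = scrape_fn(url)
        rows.append({
            "url": url,
            "title": title,
            "price": price,
            "timestamp": datetime.datetime.now().timestamp(),
        })
    new_df = pd.DataFrame(rows)
    try:
        # Keep whatever history is already on disk and append to it.
        old_df = pd.read_csv(output_csv)
        new_df = pd.concat([old_df, new_df])
    except FileNotFoundError:
        pass
    new_df.to_csv(output_csv, index=False)
    return new_df
```

Here scrape_fn would be whatever product-page scraper you ended up with (for example the scrape_product_page function from the video), so the tracker itself stays independent of selenium or requests-html.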
Info
Channel: CodingEntrepreneurs
Views: 16,504
Rating: 4.9565215 out of 5
Keywords: djangourlshortcfe2018, virtualenv, Mac OS (Operating System), Python (Software), installing django on mac, pip, django, beginners tutorial, trydjango2017, install python, python3.8, django3.0, python django, install python windows, windows python, mac python, install python mac, install python linux, pipenv, virtual environments, 30daysofpython, beginner python, python tutorial, scraping, price tracking, automated tracker, web scraping
Id: 3woopezpZas
Length: 77min 47sec (4667 seconds)
Published: Fri Apr 24 2020
Related Videos