Best Web Scraping Combo? Use These In Your Projects

Captions

In this video we're going to be using my two new favorite tools for web scraping: httpx to actually make the request to the server, and selectolax to parse the HTML that's coming back. This is probably one of the simpler methods for web scraping, but I do tend to come back to it when I need to. There are a couple of downsides, which I'll explain as we go, around selectors and how they might change on the website, but generally speaking this is a great place to start. I'll show you how to identify websites that can be used with this method, and we'll then progress and export the data that we get out to a CSV. So let's get started.

The first thing we need to do, to make sure this method is actually going to work on the website you're looking at, is go to View Page Source. It's important that you do View Source and not Inspect Element, because Inspect Element shows the DOM representation of whatever JavaScript has done, whereas View Source is more accurate to what we're actually going to see when we request the page from our code. What I like to do is check by searching for some of the text in the source itself. If you find it within the actual HTML, with HTML tags and everything like that, then there's a good chance this method is going to work. So just make sure you do that first, before you get all the way in and then can't complete it.

I found this website here, and looking at the URL I can see that there is a bargain section and it has page numbers in it, and what looks like a number of items per page, which is a hundred. So that's great. What we're going to do now is come back over to our code editor, and I'm going to create a virtual environment using Python. I'm going to do python3 -m venv venv. That's going to create a virtual environment, so when we install our packages they are all kept separate from the rest of our Python and it's all neat and tidy. So I'm going to activate this
environment with source venv/bin/activate. This is a little bit different on Windows, but it's easy enough to find the command; I think it starts with Scripts or something like that. Now that this is activated, you can see we have our venv here, so I'm going to do pip3 install httpx selectolax. These are the two packages we're going to use: httpx is going to actually make the request to the server, and selectolax is going to parse the information that comes back.

In my folder I'm going to create a new file called scraper.py, and we're going to open it up here. I'll make this a bit bigger so we've got a better chance of actually reading it. The first thing we want to do is import what we need, so I'm going to do import httpx, and then from selectolax.parser import HTMLParser. This is going to allow us to go through all the HTML that we grab and get out what we need.

Now I'm going to think a little bit further ahead here. We're going to be dealing with data in Python, and if you are dealing with data you're going to want to have some kind of structure for it. I like to use dataclasses. You can use other things too, like pydantic and marshmallow, but for the simple version I'm going to use a dataclass. What this means is that we can put our data into this special class which is designed for data, and it gives us access to things like asdict, so we can manipulate the data and work with it how we want to. So I'm going to import dataclass from dataclasses, and I know I'm going to use asdict later as well.

Let's go back to the site and have a look. What we're going to do is just go through the main search pages here, and we'll grab some information like maybe the name, which I think is in two parts. I'm actually going to use Inspect Element here, because I know that this page is mostly HTML, so this is now fine. I'm
going to go and hover over this and, yeah, that's a separate thing. So we've got a manufacturer, a name and a price somewhere down here as well. Those are the bits of information that we're going to grab. I'm going to copy the URL whilst I'm here, come back to the code, and make this our URL. There we go.

Now that I've identified the few pieces of information that we want, I'm going to create a new dataclass and call it Product. We're going to say it has a manufacturer, and this is going to be a string, so we're just defining our data here. The title will also be a string. I'm going to make the price a string for the moment as well, because when you're pulling data from a website like this it will quite often have the pound sign or something like that, so we'll leave it as a string for now and might choose to do something with it later. Now that that's done, I know exactly where I'm going to be putting my data when we get it, so we can start to construct some functions.

I know that this part of the URL represents the page, so if I change it to a two or a three it's going to get us the next page along. Pagination is one of those things where you really need to understand the website to decide how you want to deal with it. Sometimes it's easiest just to give it a new page number and get a certain number of pages, if you know how many there are; or, if there are loads and loads, quite often you can pull the next-page link from the page itself. What I'm going to do in this case is just say give me pages one, two, three and four, and we'll do it that way.

So I'm going to create a new function to actually get the data and return the HTML that we need. I'm just going to call this one get_html. Now, I know that I'm going to have my URL inside of here, so what
I'm actually going to do is specify the page number that I want when I run this function. What I can do then is turn this into an f-string, which allows me to insert that variable, whatever I choose, into the URL string, so we can change the page number as we go through. It's quite a neat little trick.

Now that I've got that working, we need to actually make the request. So let's say our response is equal to httpx.get and the URL. I'm going to try this without any extra headers; I'm not going to put in a user agent or anything like that, and we'll see if it works. Otherwise you might want to add in some specific headers, like a user agent, if you need them. I've got other videos on that on my channel too. It's pretty simple: you can just do headers equals a dictionary with User-Agent and then the string.

From here I'm just going to put response.text into our HTML parser. I'm going to say html is equal to HTMLParser, give it response.text, and then return the html. We can in fact cut this part out and just return it directly, and that will work just fine for us.

As we're dealing with more and more functions in this code, we're going to want a main function that runs everything for us. It helps keep the code neat and tidy and will help us change it as we need to further down the line. So I'm going to create a main function here, and in it I'm going to run the other functions that I need. First I'm going to say that our html is equal to the function that gets the HTML, and I'm going to ask for just page one for now. Then I'm going to print it out, just so I know that it's working and we can quickly check it here. Now, when you do this with your functions you could just
call your main function like this and it would run, but the preferred Python way is to do if __name__ == "__main__", with the double underscores just like that. All this means is that whatever we put in here is only going to run if we call this file directly. If you were importing this file elsewhere, none of these things, like the print statement, would happen, because it's not being run directly. Okay, so let's save and run.

We can see that we have our HTMLParser object, which is great. Let's double-check that this will work for us by just asking for a CSS selector; we'll do title and see if that works, and we need .text() on here too. We'll run this and, of course, I'm getting ahead of myself: I asked for a list. There we go. So now I know that this is actually working. I'll explain all this in just a second when we get to parsing the page.

Now that this is working, I'm going to go ahead and think about the pages. I want to create a function that's going to parse every page of HTML that we get out of here, and we'll create that in just a minute. But to loop through all the pages I'm just going to do a for x in range here. This is going to give me the numbers up to but not including four, so pages one, two and three. Then we can put our x value in here and print out the same thing that we did before. I'm just doing some testing here to make sure that this is working. It was html.css_first, then title, then .text(). This is a really easy way to work out whether this is going to work for us before we start going through all the parsing and pulling out the individual bits of data. And as you can see, this has worked perfectly: we now have pages one, two and three.

Okay, so let's move on to the parsing part of our code, our function here. I'm going to call this one parse
products, with an underscore, so parse_products, and we're going to be giving it the HTML of the page. The HTML that we're giving it has already been put into our HTMLParser, so we don't need to do that again; we can simply work with what we need.

Let's go back to the page. We need to find an element that represents every object that we want to get the information from. We couldn't, for example, just go and get the title, because if we did that we wouldn't then be able to search within it to get the rest of the information. So we go back up a bit, and up a bit further, because you can see the 173, the price, isn't highlighted yet. There we go. If you look at the HTML, we have a div with a class of product, and that is great because it's super identifiable. If you check all of the other ones you'll see that each has a div with a class of product too, and you can see it as I hover over it. So I'm going to copy that and come back to our code.

We're going to say our products is equal to html.css. With selectolax it's CSS selectors only, so that's why we're doing .css; you can't use XPath or anything like that. Then we give it the CSS selector. As you saw down here, I used css_first. When you do that it will just return the first match, whereas .css is going to return a list of elements that match your selector. This is really important, because even if there's only one element that matches your selector, it's still going to be returned in a list. So I want to do div.product like this. Now my products variable is going to contain a list of every element that matches this, and if you remember from the page, those are the ones that have all the information in.

From here I'm just going to create a results list for everything on this page, and then we're going to do for item in products. So we want to loop through every item that was within this, which is still our
chunk of HTML, which we can then parse through, and we want to do something with it. This is where our dataclass comes in. We want to populate our dataclass with the three bits of information that we're going to pull from the page, so we're going to say that our new item is equal to Product. Not that one; Product, which is our dataclass. You can see there that it's telling us it needs a manufacturer, a title and a price, each of which should be a string.

Okay, let's go back to the page. The first bit of information is a span with the class title__manufacturer (title, double underscore, manufacturer). So we'll say manufacturer, and the other upside of a dataclass is that it gives our code editor autocomplete, is equal to... and we're in item now, since we're saying for item in products, so we want item.css_first with the span and its manufacturer class, and then .text(), because we want the text from that object.

I'm just going to go ahead and do the next one, which was our title. We'll do item.css_first again with .text(), and let's go grab the selector for that: title__name (title, double underscore, name). Perfect, let's put that in there.

The final one was price: item.css_first, then .text(), because you're going to want the text. Let's go and find where the price element is. Product price; let's go down a bit further and we should find it. Okay, you can see that this has two strings together, one on top of the other, so if we go for the product price element it will give us everything in there and we'll get the information that we want. Let's come back here and put that in.

What I'm going to do now is print out the product as we got it, just to see how this is actually working. That should be new_item, apologies. Under here we can now remove this, because we don't need it, or you can leave it in if you want to, and we're going to do parse
products for the HTML. There we go. So that should have been span; I'm sure you spotted that, whereas I didn't. Now, you'll notice that we're getting a whole load of new lines in the price information, which is a bit of a pain, so let's try .strip() and see if that works. Great, there we go: that's stripped away all of the extra whitespace. You can see here that for every item on the page we're now printing out our Product dataclass with all the information we've been looking at: the manufacturer, the title and the price.

Now that we know this is working for us, we can start to think about finishing this up. Before we do that, I just want to show you that we've imported asdict, and this is one of the great things about dataclasses: we can just wrap asdict around this new item like this, and when we run it now we get a dictionary out. So what I'm going to do now is results.append, and I'm going to copy this because I want to append the dictionary to my results. I'm going to remove the print statement, since I no longer want that, and then return the completed list for that page.

Let's bring my terminal down. We now have our dataclass for our information, we're getting the HTML from each page, we're parsing the information from each page here, and then we need to do something with the results. Let's just make sure that this is going to work for us: let's say res is equal to this and print it out for the moment. I'm just going through every page and printing it out as we go, and you can see all the information zipping by there. Great.

Okay, so the last thing we want to do is actually export this. I'm going to create a CSV file and we're going to save everything to that as we go. So I have a new function, we'll call this one to_csv, and we're going to pass it some information, which is going to be a list of dictionaries.
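
The parsing step described up to this point, where each page ends up as a list of dictionaries, might look roughly like the sketch below. It is an illustration rather than the code shown on screen: only div.product is named clearly in the captions, so the other three selectors are guesses, and stripping every field (not just the price) is a choice made here. The function expects the selectolax HTMLParser object that has already been built from response.text.

```python
from dataclasses import dataclass, asdict

# Assumed selectors. "div.product" is mentioned in the captions; the other
# three are guesses based on the class names read out loud and will need
# checking against the real page.
PRODUCT = "div.product"
MANUFACTURER = "span.title__manufacturer"
TITLE = ".title__name"
PRICE = ".product__price"


@dataclass
class Product:
    manufacturer: str
    title: str
    price: str  # kept as a string because it may carry a currency sign


def parse_products(html):
    """Return one dictionary per product found in an already-parsed page.

    `html` is expected to be a selectolax HTMLParser (or any node that
    offers the same css / css_first methods).
    """
    results = []
    # css() returns a list of matches, even when only one element matches.
    for item in html.css(PRODUCT):
        new_item = Product(
            # css_first() returns just the first match inside this product.
            manufacturer=item.css_first(MANUFACTURER).text().strip(),
            title=item.css_first(TITLE).text().strip(),
            # strip() removes the stray newlines around the price text.
            price=item.css_first(PRICE).text().strip(),
        )
        # asdict() turns the dataclass instance into a plain dictionary.
        results.append(asdict(new_item))
    return results
```

One thing to keep in mind with this shape: css_first returns None when nothing matches, so if the site changes a class name the .text() call will raise an AttributeError. That is the selector downside mentioned at the start of the video.
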
So our plan is to go through every page, as you can see here, and instead of printing out the results we're going to append them to the CSV file. As we go page by page, we're going to get page one and put that in the CSV, get page two and put that in the CSV, rather than waiting for it all to be finished and then sticking everything in in one go.

To do that we're going to use with open, which is a context manager. You should always use a context manager when you're working with files, so that they get closed properly. I'm just going to call this one results.csv, without capitals, and I'm going to open it as a, which is the append mode. If you wanted to just write it you would use w, but we're going with a, so every time we open it we're just going to add our new lines to the bottom. Then I'm going to say our writer is equal to csv.DictWriter, not DictReader, there we go, and we need to give it our f here. This has imported csv at the top for me. Then we can do writer.writerows and give it our results there. Now, instead of printing this, we can call our to_csv.

We need to put in some field names here as well, so the writer knows the columns. I'm going to do these manually. You can get these from the keys and do it that way, but we only have three columns so there's no point; I'm just going to type them in like this.

So let's run this. Every time we get a new page we're going to be getting our HTML, printing out the title, parsing the information with our parse function, and then saving it all to that CSV file. When I run this we should get three pages, one, two, three, and our results.csv should be 300 lines long. And it is. Perfect. So now we have that information pulled out.

There are a few things that were quite interesting in this one. I've used a dataclass, which I definitely recommend using more and more if you can. You can use something else like pydantic too; that works just as well,
but this is nice and simple. It gives us access to asdict, or we could keep the object as it is, as a class, and work with it and do other things with it if we needed to. I have a function to get the HTML from each page, which we give a page number; we're parsing the information out; and then we're appending the results to a CSV file. This is a particularly good method if you have loads and loads of pages, because if you wait until the end to save and something goes wrong on the way, you get nothing, whereas if you append to the file at incremental steps, like every page, you keep the data you've collected as you go. It also means that you can have one big CSV file that you just add to every now and again. Don't make it too big, though; if it gets too big, use a database.

So that's going to do it for this one. Hopefully you've enjoyed it and got something out of it. If you have, let me know in the comments section down below. Liking this video always helps, and if you're interested in more stuff like this, more Python, more web scraping, more web tech, then I suggest you hit that subscribe button too and keep up to date with all my new videos. Thanks for watching, and goodbye.
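
The append-per-page export described in the captions could be sketched along these lines. This is a minimal version under a few assumptions: the field names are taken from the three dataclass fields mentioned in the video, the path parameter and the newline and encoding arguments are additions made here, and no header row is written, which is consistent with the file coming out at 300 lines for 300 products.

```python
import csv

# Column names typed out by hand, as in the video. They must match the keys
# of the dictionaries produced for each product.
FIELDNAMES = ["manufacturer", "title", "price"]


def to_csv(results, path="results.csv"):
    """Append one page of results (a list of dictionaries) to a CSV file."""
    # "a" is append mode: each call adds rows to the bottom of the file
    # instead of overwriting it, so earlier pages survive a later failure.
    # newline="" is what the csv module documentation recommends for files
    # handed to a csv writer.
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writerows(results)
```

In the main loop this would be called once per page, in place of the print statement, with the list returned by the parsing function. Because the file is opened in append mode, running the script a second time adds another copy of the rows, so delete or rename results.csv between runs if duplicates are not wanted.
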

Info

Channel: John Watson Rooney

Views: 42,183

Keywords: web scraping, web scrapping, scrapping data, john waston rooney, selectolax, httpx, python html parsing

Id: HpRsfpPuUzE

Length: 20min 12sec (1212 seconds)

Published: Tue Dec 06 2022