Web Scraping with AIOHTTP and Python

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
In this video we're going to be taking a look at aiohttp. This is a client- and server-side Python library that lets us make asynchronous requests, which is particularly useful when we're web scraping because it means we can get data from multiple web pages at the same time. With a synchronous scraper, like when we use the requests and Beautiful Soup libraries together, we have to wait for the response from the server before we can parse that data, which leaves a lot of time where our code just sits there doing nothing, waiting on an external server. When we make those requests asynchronously with something like aiohttp, we can greatly increase the speed at which we can scrape. Async code is also very common nowadays in Python web apps, so it's definitely worth getting your head around how it works, and aiohttp is a good place to start. So let's jump on the computer and run through how to create a simple request, then how to request multiple URLs, and I'll do my best to explain everything as we go along. If you're enjoying this video and you get anything out of it, please drop me a like and consider subscribing. Let's get started.

If you haven't already, you need to pip install aiohttp. I'm going to come over to the documentation, which has a quickstart showing how to make a request. I know it looks a bit more complicated than a simple r = requests.get(), but we're doing a little more work here, and if you're interested in why, come over to the aiohttp Request Lifecycle page, which I'll link down below; it explains why there's more to it and how it differs from the requests library itself. What we're going to do is copy the first code snippet, run through it and see what we get out, and then I'll show you how to use it in a more practical web-scraper setting.

So I'm going to import aiohttp and also asyncio, which is Python's asynchronous library and which we do need, and paste in the code. What this is saying is that we need to create a coroutine; these are the awaitables, the functions we write that are asynchronous. We define one by putting async in front of def, and you can see async in front of the other functions too, plus the await keyword further down. The first line just defines our function. The next line, async with aiohttp.ClientSession() as session, is a context manager, like when you open up a CSV file in your code using with, and the ClientSession is basically a session object (I've got a video on session objects already if you don't know much about them; I'll put it down there somewhere so you can find it). We're saying we're going to use the session to get the data from this website, store the response in resp (I normally use r, but this is their example), and print the status code and the text of the response.
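The snippet pasted from the docs looks roughly like this; it's reconstructed from the quickstart rather than copied from the video, so treat the exact URL as illustrative:

    import aiohttp
    import asyncio


    async def main():
        # ClientSession is the session object; the async context manager
        # makes sure its connections get closed cleanly.
        async with aiohttp.ClientSession() as session:
            async with session.get("http://httpbin.org/get") as resp:
                print(resp.status)        # HTTP status code
                print(await resp.text())  # response body (httpbin returns JSON)


    # The docs example at the time drove the event loop explicitly like this;
    # asyncio.run(main()) does the same job on modern Python.
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())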
The next thing that's important is the loop down at the bottom. We're creating the event loop, which is basically what runs all the tasks and controls everything. Generally you don't need to reference the loop yourself, but they've left it in their example, so I'll leave it in and run the code. It sends a request to the server, and you can see we've got some information back, a JSON response, which is good.

Now that we've seen it work, I'm going to remove that code and create something that looks a little more like what we'd actually use. Again I'll do async def, call this one get_page, and say it needs the session and a url. Basically we're going to need three main async functions: the first just gets the data off a page, the second puts all the tasks together, and the third is the main function that controls everything and is the one we give all the URLs to.

This first one just gets the data off the page, so it looks very similar to what I just removed: async with session.get(url) as r, then return, with the await keyword because this is a coroutine, r.text(), because we want the HTML text off the page. That's just making a simple request for that page's data.

The next function is also asynchronous, so async def again. I'll call it get_all; I'm not particularly great with my function names, so make sure you name yours so you know what they do. We give it the session and our urls list. This is the one that creates the tasks for all the coroutines to go into the loop and bring the data back. Under it we create an empty tasks list, and then for url in urls we do task = asyncio.create_task(), and the task we want to create is a call to get_page for each URL in our urls list, which we're still yet to create. So we reference that first function inside this one, get_page(session, url), and then tasks.append(task). This part of the code is saying: create a task, each one a coroutine controlled by the event loop, for our get_page function with the session and each of the URLs; the session and the urls list themselves get passed in at the end. I'm still just structuring my code out at this point.
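Written out, that first coroutine is a short sketch along these lines (r is just my name for the response object):

    async def get_page(session, url):
        # One coroutine per page: ask the shared session for the URL
        # and hand back the raw HTML text.
        async with session.get(url) as r:
            return await r.text()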
The next thing we want to do, outside of the for loop, is say results = await asyncio.gather(*tasks), and then return the results out of this function. We're basically saying gather all the tasks together, and we need the asterisk because gather takes positional arguments, so we unpack our tasks list into it. Then we return all the results out of the function.

Now we can write our final function and put everything together. This is our main function, so async def again, and it's the one we give our urls list to. Inside it we do async with aiohttp.ClientSession() as session, then data = await get_all(session, urls), and return the data. This one creates our session and, with that session, saves everything that comes back from get_all into that variable. It's kind of like get_page sits inside get_all and get_all sits inside main, and that's what lets us do everything we want in one go.

The last thing before we get some data back: I'll add if __name__ == "__main__" so this runs, and we need a urls list, so I'll grab some of the test-site URLs, page one, page two and page three. The final piece of the puzzle is to run everything: results = asyncio.run(main(urls)). As I said earlier, we let asyncio.run manage the event loop for us, so we don't need to do anything with it ourselves. I'll print the length of the results first, because it should return a list and printing all the data would just be a bit of a mess, and we get 3. Printing the results themselves gives us all the HTML back; there we go, we can see it all, loads of information. So what we've done is create three async functions: one that gets the HTML off the page, one that creates the tasks for all the pages, and one that runs everything and controls the session.
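Put together with the get_page coroutine above, those two functions and the run block end up looking something like this sketch; the pagination URLs here are placeholders, since the exact test-site addresses aren't spelled out in the captions:

    import aiohttp
    import asyncio


    async def get_all(session, urls):
        # One task per URL so every request runs concurrently,
        # then wait for all of them to finish.
        tasks = []
        for url in urls:
            task = asyncio.create_task(get_page(session, url))
            tasks.append(task)
        results = await asyncio.gather(*tasks)
        return results


    async def main(urls):
        # A single session shared across all the requests.
        async with aiohttp.ClientSession() as session:
            data = await get_all(session, urls)
            return data


    if __name__ == "__main__":
        # Placeholder URLs, just to show the shape of the list.
        urls = [
            "https://example.com/page/1",
            "https://example.com/page/2",
            "https://example.com/page/3",
        ]
        results = asyncio.run(main(urls))
        print(len(results))  # should print 3, one HTML string per page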
Now that we've actually got our HTML back, we can parse it separately, because HTML parsing is CPU-intensive and most of us have decent computers these days, so that part doesn't take long; in web scraping the time lost is generally in waiting for the response from the server. So I'm going to write a real quick non-async function at the end. I'll call it parse and give it the results, and because, as I showed you earlier, we got a list back with one entry per page, we do for html in results to loop through that list. Then soup = BeautifulSoup(html), and we print soup.find(). If I come to the page here, the little bit I want to print out, so we can easily see that we've actually got different bits of data back, is, I believe, under a form with the class form-horizontal; I'm not sure why, it doesn't look like a form to me, but it doesn't matter. So we find the form tag, put the class inside a dictionary as form-horizontal, take the .text and strip any whitespace that may come off it as well. Now we can call parse(results); in fact we don't need a print around that, because we've got a print statement inside the function, so I'll just end the function with return. (I'll put a short sketch of this parse function below the recap.)

So if I run this we should see some actual data. Let's kill the terminal, and we've got back some information. There's a warning here because I didn't specify which parser I wanted in Beautiful Soup, but you can see this is the information I've scraped: 1 to 20, 21 to 40, 41 to 60.

So where would you want to go from here if you were going to turn this into a proper web scraper? What I'd suggest is you first get all the links together off your pages. If you think you're going to be doing lots of pages from a website, you could scrape the listing pages and get the links, create your urls list that way, then go asynchronously to each product page, for example, pull that data back out and then parse it.

Just to recap: we import aiohttp and asyncio, because asyncio is the async library in Python; we create our three async functions that kind of nest into each other; we use tasks that we gather together with asyncio.gather; and we use the coroutine async and await keywords to make everything run neatly inside the event loop, which we let Python look after by just doing asyncio.run. So that's it, guys, hopefully you've enjoyed this video and got something out of it. I've done my best to explain async and await in a requests format that might be helpful to you, so let me know how you think I've got on down below, and stick around for more videos coming up; there's loads of web scraping content on my channel already. Like, subscribe, comment. Thank you very much, guys, goodbye.
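As promised, here's a minimal sketch of that parse step, assuming the scraped pages really do carry the summary text inside a form with the class form-horizontal, and passing an explicit parser so Beautiful Soup doesn't warn:

    from bs4 import BeautifulSoup


    def parse(results):
        # Plain synchronous function: parsing is CPU-bound,
        # so there is nothing to await here.
        for html in results:
            soup = BeautifulSoup(html, "html.parser")  # explicit parser, no warning
            # The element holding the "1 to 20" style summary on the test site.
            print(soup.find("form", {"class": "form-horizontal"}).text.strip())
        return


    # Called once the event loop has finished and handed back the HTML list:
    # parse(results)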
Info
Channel: John Watson Rooney
Views: 3,980
Rating: 4.967742 out of 5
Keywords: aiohttp, python async, aiohttp requests, async web scraping, aiohttp tutorial python, aiohttp with asyncio, aiohttp python tutorial, aiohttp webscraper, web scrapping, python web scrapping, web scraping with python, async requests
Id: lUwZ9rS0SeM
Length: 13min 43sec (823 seconds)
Published: Sun May 23 2021