Python Async Web Scraping - Day 27 - 30 Days of Python

Video Statistics and Information

Captions
Hey there, welcome to day 27. In this one we're going to be doing asynchronous web scraping. Back on day 12 we did synchronous web scraping, which is fundamentally easier than asynchronous web scraping but quite a bit slower. Asynchronous web scraping lets us take advantage of Python 3's built-in asynchronous capabilities with asyncio.

Through this whole series, what we've been writing is synchronous code, that is, blocking code. It essentially says: if you call a function, finish running that function, then go to the next one, and so on. Something like having a function aaa, a function zzz, and all of the functions in between: if we call aaa, it runs first, then everything in between, all the way down to zzz. They're never running at the same time or overlapping at all. Now think of that in web scraping terms. If you want to scrape 10 pages, synchronous code does page 1, then page 2, then page 3, and so on until the whole thing is done; if each page takes a second, that's about 10 seconds to complete. If you do it asynchronously, you can run all of those requests concurrently, kind of like switching back and forth between each page, so pages 1 through 10 are all in flight at roughly the same time and the entire program takes roughly as long as the slowest page, which we'll see in just a moment. The general idea is that in synchronous code each function blocks the next one from running; asynchronous code doesn't work that way.

Let's look at a working synchronous example. This is on our GitHub, and I recommend you download this synchronous code, or pause and type it out yourself. The idea is simple: we have a list of iteration times, a sleeper function that sleeps for however many seconds are passed to it, and a main function that simulates each page being scraped or opened up, and at the end we get the full run time. It's a simple, synchronous, blocking function, much like blocking.py: the loop calls the sleeper for each of those times, the sleeper waits that many seconds, so the whole thing runs roughly 10 seconds, which I know because that's what the iteration times add up to. If we run it with python sync.py, we can see the iterations happening and we can see how each call blocks the next. So how do we turn this into asynchronous code and speed up those 10 seconds of running time?
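For reference, here's a minimal sketch of what that synchronous sync.py might look like (the exact iteration times and print formatting here are assumptions):

    import time

    iteration_times = [1, 3, 2, 4]  # seconds; adds up to roughly 10

    def sleeper(seconds, i=-1):
        if i != -1:
            print(f"{i}\t{seconds}s")
        time.sleep(seconds)
        return seconds

    def main():
        for i, seconds in enumerate(iteration_times):
            sleeper(seconds, i=i)

    if __name__ == "__main__":
        start = time.time()
        main()
        # roughly 10 seconds, because each call blocks the next
        print(f"finished in {time.time() - start:.2f}s")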
It's actually really easy. Copying all of this code into async.py, it's literally the same code, except I removed the call to main at the bottom. To change it to asynchronous code we just write async in front of every function definition, and there you go, you now have asynchronous code. How cool is that? Well, not quite. What I actually want to do is ignore the iteration for now, so we'll comment that out, and I'll just call sleeper one time with one second and an iteration index of zero. I'm also going to take the async off the sleeper function, so we can focus on running a single asynchronous function first.

Now I'll run Python in interactive mode with python -i async.py. Can I just call main() to run it? What I get back is something called a coroutine object. It's no longer executed like a plain function call; it's now an object, which means we still need something else to actually run this main coroutine. So how do we execute a coroutine? There are a few ways, but if it's our primary function, the main function that's going to run all of our other coroutines, then what we want is asyncio. We import asyncio, which is built into Python, and we call asyncio.run() on a coroutine. If I pass just main, that's not a coroutine, that's only a reference to the function itself, and we get an error: a coroutine was expected but it got a function. We need to create the coroutine by calling the function, asyncio.run(main()), and now it actually runs: it ran an asynchronous function and a synchronous function side by side with no issues. So the pattern is: first declare a coroutine, then run that coroutine with asyncio.run. There are other ways to run coroutines, so keep that in mind, but for our purposes asyncio.run is going to be perfect, and it will be for the vast majority of tasks you'll be challenged with.

Now let's turn the sleeper function back into an async function. I'll add async to it, exit the interactive session, and since asyncio.run is already in the file, run it again outside interactive mode. What we get is: RuntimeWarning: coroutine 'sleeper' was never awaited. This shows another issue you'll come across when calling coroutines. Remember how we had to use asyncio.run to run the main coroutine? To run coroutines from within other coroutines or asynchronous functions, we use await. That calls the function and waits for it to complete before doing whatever comes next, so it blocks whatever follows it inside this main coroutine. Run it, and now it works, so we're ready to try out that loop again.

I'll remove the single call and put the await back inside the loop. Just like many other functions, this sleeper returns a value, so to update our runtime variable we still have to put await in front of the sleeper call: the result of the await expression is what we add to the runtime. It behaves much like calling a regular function, where you know it waits for the call to finish, but because of the async keyword it's now a coroutine, so we need the await keyword, and Python will always warn us at runtime when we forget it. So we save this and run it.
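A minimal sketch of that first async step, assuming the same sleeper shape as before:

    import asyncio
    import time

    async def sleeper(seconds, i=-1):
        if i != -1:
            print(f"{i}\t{seconds}s")
        time.sleep(seconds)  # still the blocking sleep at this stage
        return seconds

    async def main():
        runtime = 0
        # a coroutine called inside another coroutine must be awaited,
        # otherwise Python warns: "coroutine 'sleeper' was never awaited"
        runtime += await sleeper(1, i=0)
        print(f"runtime: {runtime}s")

    # main() on its own only creates a coroutine object;
    # asyncio.run() is what actually executes it
    asyncio.run(main())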
We run through each iteration much like before, but unfortunately what we see is zero performance increase. Literally none. We should have printed out the runtime, but as we saw, nothing about the speed actually changed. There are a couple of reasons for this, but the primary one is time.sleep: it is not a coroutine, so it runs exactly as it did before, and the way this is structured nothing can overlap. So let's change the sleep into a coroutine as well, because what we want is coroutines calling coroutines wherever that has to happen. With asyncio we use asyncio.sleep, which is a coroutine, so we can await it. Run it again and... same thing, still about 10 seconds.

This is where it can start to feel like, what's the benefit of writing asynchronous code? The benefit is that we can change how these calls are scheduled altogether by building a task list. I'll create an empty list called tasks. Each one of these coroutines can be considered a task that asyncio needs to run. What I'm saying is: iterate through all of the iteration times, initialize the coroutines, and turn them into a queue, a list of tasks that asyncio should run; it doesn't matter how or when they run, they just need to be run, and then I'll return the resulting data. So inside the loop, all I need is asyncio.create_task with the coroutine passed in; that creates the task, and we add it with tasks.append. Now we have a bunch of tasks we want asyncio to run, and the final step is asyncio.gather on all of those tasks. We unpack them with the star, it runs all of them, and this call does need to be awaited. It also gives us a result, which I'll explain in a second, so let's set results equal to that await and print out what it is.

The general rule is that the vast majority of the time, when you call an asynchronous function you need to await it. The exception here is asyncio.create_task, which is not awaited, because you're essentially building a queue of asynchronous tasks that should be run; asyncio.gather is what actually executes those tasks. Let's see it run: notice the print statements came out immediately, and then all of the results came out shortly after. Those are the return values from each of those calls, and the only reason there's data is that this coroutine actually returns something. If it didn't return anything, results would be empty; since it does, gather returns a list of those results, one per iteration.

That means I can update my runtime based on those results: for runtime_result in results (plural, not result), do runtime += runtime_result. Save that and run it again.
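A sketch of the create_task / gather version described above (details assumed):

    import asyncio
    import time

    iteration_times = [1, 3, 2, 4]

    async def sleeper(seconds, i=-1):
        if i != -1:
            print(f"{i}\t{seconds}s")
        await asyncio.sleep(seconds)  # non-blocking sleep so the tasks can overlap
        return seconds

    async def main():
        tasks = []
        for i, seconds in enumerate(iteration_times):
            # create_task schedules the coroutine; it is not awaited here
            tasks.append(asyncio.create_task(sleeper(seconds, i=i)))
        # gather is awaited and returns each coroutine's return value as a list
        results = await asyncio.gather(*tasks)
        print(results)

    start = time.time()
    asyncio.run(main())
    print(f"wall-clock time: {time.time() - start:.2f}s")  # roughly max(iteration_times)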
What we should see is a huge performance increase, but we don't; it still reports 10 seconds. The whole thing says it ran for 10 seconds, yet it certainly didn't feel like 10 seconds, and that's because of this calculation right here: summing every result gives the total of all the sleeps, not the elapsed time. What we actually want for the elapsed figure is the largest of those results, so if a runtime_result is greater than the current runtime, set runtime to that result. Let's run it and count it out: one, two, three, four... I probably counted way too fast, because that's what I do, but there you go, you get roughly four seconds, and you can try this out on your own. The number I originally showed you, the sum, is really the total compute runtime, what it would have been if it had run synchronously, so let's keep that as its own global and report the elapsed runtime alongside that total, plus runtime divided by the total compute runtime. Run it again: each call still takes the same amount of compute, but it all happens in roughly 40% of the time.
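One way to compute that comparison, condensed into a small script (the exact reporting format is an assumption):

    import asyncio

    iteration_times = [1, 3, 2, 4]

    async def sleeper(seconds):
        await asyncio.sleep(seconds)
        return seconds

    async def main():
        tasks = [asyncio.create_task(sleeper(s)) for s in iteration_times]
        results = await asyncio.gather(*tasks)
        total_compute_runtime = sum(results)  # what it would have cost run synchronously
        runtime = max(results)                # roughly the elapsed time, since the sleeps overlap
        print(f"ran for about {runtime}s of {total_compute_runtime}s of compute "
              f"({runtime / total_compute_runtime:.0%})")

    asyncio.run(main())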
This is the magic of asynchronous code, and also the challenge of it: remembering all of this does get a bit tricky, and that's where practicing with a real-world example comes in. So let's do that.

You might remember back on day 12, when we did synchronous scraping, we used the package called requests. Unfortunately we can't use requests for asynchronous scraping, at least not yet, maybe in the future. For now what we need is something called aiohttp. It gives us two features that are really cool: one is making requests to other URLs asynchronously, and the other is running asynchronous code on the server side, so we could build our own asynchronous web application with aiohttp as well. Of course, we just want to use it to grab a URL, so let's see how to do that. First we need to install it, with pipenv install aiohttp, or pip install aiohttp, it's completely up to you. After that installation you can do from aiohttp import ClientSession. We want to initialize a client session and make some requests, so I'll define our main function, async def main, and use ClientSession inside it. There are a couple of different ways to go about this; I'll do a very basic one first, and I'll also set a URL. The URL I'm going to use is from Box Office Mojo, very similar to what we did before: boxofficemojo.com, and then the yearly box office reports. 2020 is of course not a great year for the box office, so let's go into 2019's URL first. I don't need the referrer query string on the end, all I need is the path down to 2019.

What I want to do is use ClientSession to read the HTML body content from that page. Do keep in mind this will not work on JavaScript-heavy websites; in other words, if you see a loading icon when the page loads, where the skeleton of the page is there but content is still filling in, we won't be able to handle those pages just yet. I'll point you to a resource for learning how to do that later, but for now we'll use a site that isn't JavaScript-heavy.

So we've got this URL, and now we write async with ClientSession() as session. You might be familiar with the with statement: we run some code inside the block, and once it's done, the client session is closed automatically. To actually make the request we use another async with, this time on the session: async with session.get(url) as response. You could use post or other HTTP methods here, but we're going to get that URL and bind the response as response. So there's the session, and then the actual response. The reason we absolutely want a session is that we're going to make multiple requests, not just one, so it's nice to have a session open that we can call session.get on many, many times. The final piece is html_body = await response.read(); that's our HTML body. I could declare it as an empty string up top and assign it down here, and then finally return that html_body from inside the response block.

Let's run it: import asyncio, then print(asyncio.run(main())) with the main coroutine. Save it and run python a_scrape.py, which is the name of my module. It's pretty fast, and we get a bunch of body data back: that's our HTML, which is pretty nice.

Before going any further, I want to store that HTML data, and I'll do it using pathlib. So import pathlib, keep html_data as the response body, and set output_dir to pathlib.Path().resolve(), which gives me the current directory, joined with a folder we'll call snapshots. With that I can build output_file from output_dir, and I'll name it 2019.html since that's the year, then write the content with output_file.write_text(html_data). I already know I need to decode this from bytes, so I call .decode() on it; I'll show you that in a moment. If you're not familiar with this way of doing things, it's essentially identical to with open(path, "w") as f: f.write(data). I don't actually have the snapshots folder created yet, and that's one more thing pathlib makes easy: output_dir.mkdir(parents=True, exist_ok=True); I fumbled the keyword names at first, but they are parents=True and exist_ok=True. Now I'll run this again; this time it won't print anything out.
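Putting that together, a sketch of the single-page a_scrape.py; the exact Box Office Mojo yearly URL format used here is an assumption:

    import asyncio
    import pathlib
    from aiohttp import ClientSession

    OUTPUT_DIR = pathlib.Path().resolve() / "snapshots"

    async def main():
        url = "https://www.boxofficemojo.com/year/2019/"
        async with ClientSession() as session:
            async with session.get(url) as response:
                html_body = await response.read()  # bytes
        return html_body

    if __name__ == "__main__":
        html_data = asyncio.run(main())
        OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
        output_file = OUTPUT_DIR / "2019.html"
        output_file.write_text(html_data.decode())  # write_text wants str, hence the decode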
What we should see is a folder being created, snapshots, with 2019.html inside it. A big part of the reason to store the raw HTML you request comes down to a couple of things. One is that since asynchronous code runs so much faster, I can always come back to this data later, and that really is the main thing: I'm getting quite literally a snapshot in time of what that HTML was, so I don't always have to go fetch it again, which helps me down the line. It's not strictly necessary for what we're doing here, but we might as well build in some better practices as we go.

That's just one URL; now we actually need to make multiple requests. It might be as simple as you think, doing the asyncio tasks and the gather call inside something like this, but I want to break it apart a little. Picking up where we left off, I'm going to work in a module called a_scrape_multi, just to make it easier to reference what we just did. What I'll do is define async def fetch, taking our url and our session, and all it contains is the request code we just wrote. The reason to have a separate function for grabbing the data absolutely has to do with the iterations we want to run. Let's say years_ago is, I don't know, five, and our start_year is 2020. Of course you could use datetime objects and all sorts of things to make this more robust, but the idea is that I now want to iterate: for i in range(0, years_ago), my year is start_year minus i. It counts up from zero, so the first one, the zeroth element, should be 2020, and by the end we'll hopefully have gone back five years. We can print out each year as we go. Much like what we saw before, we build the list of tasks we want asyncio to run, so tasks.append, and again asyncio.create_task around the coroutine we want, which is our fetch call for the URL we're using. The URL itself should now be a string substitution, so I'll turn it into an f-string with the year dropped in instead of the hard-coded URL we had before. I already worked out the logic behind this URL, so if you're confused about it, by all means go back to day 12.
Essentially, I already know that from this URL pattern we can go back to different years, so we've got the year interpolated into the URL. The next argument is our session, and fetch returns our HTML body content, which means the gather call will give us a list of those bodies. So this variable is our pages' content: pages_content = await asyncio.gather(*tasks), and then we return pages_content. Every once in a while you might see it written a bit differently, with the await applied at a later point; that's okay, you just need to make sure that however you do it, you are awaiting that coroutine.

Now we have a number of pages we can scrape, and our output file, our snapshot, is going to be a little different this time. Let's print out what the data looks like: we should see a list of body data. I run this with python a_scrape_multi.py, it shows me all the different years, 2016, 2017, and so on, and it happens about that fast. This HTML data is what we want to see... and VS Code froze. I think it froze because we loaded a lot of content into memory by printing it all, so I'm just picking back up.

What I actually want to do is take this a bit further and change the results themselves. Inside my task I'm also going to pass in the year, because of the outputs I want, so I bring year into fetch as a parameter, which also makes fetch more useful in other places. Now I can return a dictionary from fetch: body is the body, and year is the year. So my results should be a list of items, each with a body holding some data and a year holding the year, which is probably going to be a number like 2020.
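A sketch of a_scrape_multi.py at this point (the URL pattern and the dictionary keys follow the walkthrough; the rest is assumed):

    import asyncio
    from aiohttp import ClientSession

    async def fetch(url, session, year=None):
        async with session.get(url) as response:
            html_body = await response.read()
        return {"body": html_body, "year": year}

    async def main(start_year=2020, years_ago=5):
        tasks = []
        async with ClientSession() as session:
            for i in range(0, years_ago):
                year = start_year - i
                url = f"https://www.boxofficemojo.com/year/{year}/"
                print(year, url)
                tasks.append(asyncio.create_task(fetch(url, session, year=year)))
            # a list of {"body": ..., "year": ...} dictionaries, one per year
            pages_content = await asyncio.gather(*tasks)
        return pages_content

    if __name__ == "__main__":
        results = asyncio.run(main())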
It's a really simple change, but it means my results are going to be a little easier to store, much like what we had before. I can use that same output directory and iterate: for result in results (I had those two names backwards at first), grab the key-value pairs, so current_year = result.get("year") inside an f-string for the filename, and html_data = result.get("body"). That should give us better results. I'm not going to print it out this time; instead it goes straight into writing out the HTML content. Save it, run python a_scrape_multi.py again, it still goes through all those years, and inside my snapshots folder, what do you know, there they are. That was fast. Doing it synchronously would be one by one; it would still probably be fairly quick because Box Office Mojo has quite the server behind it, since they get a lot of traffic, but as we can see, we've now done asynchronous web scraping.

There is still one part missing, though, and that's handling the sheer amount of speed that asyncio and aiohttp allow for. We need to add something called a semaphore, which means we're not going to just overload their server. Could you imagine if we requested a hundred of their pages, or 2,000? The semaphore prevents us from crushing their server, which is absolutely something we don't want to do, for several reasons: one, when you're web scraping you should be a good internet citizen, and two, you might get your IP banned from even opening their web pages normally. So keep that in mind, and let's implement the semaphore now.

Before I discuss what a semaphore is, let's talk about the potential problem with our current lookup. In this for loop we have an arbitrary range, which means I could essentially have a million or more tasks that asyncio is going to create. Even assuming your computer can handle that number of tasks, it also means that concurrently we'd be hammering this server with those million tasks. Sure, it switches back and forth between requests, so boxofficemojo.com wouldn't get a million requests at literally the same instant, they'd be slightly staggered, but only slightly. We don't want to do that to our computer, and we don't want to do it to theirs. What we need is something in the middle that prevents too many tasks from running at once in a logical way, without having to write our own logic; I don't want to write conditional statements like "is a task running right now? then don't start another until it's done." Instead, I can use a semaphore. An analogy for this is an amusement park: if everyone trying to get on a roller coaster got on at once, it just wouldn't work, it would either break or people would fall out of it. So they have a ride attendant, and that attendant only lets a certain number of people onto the coaster; everybody else has to wait.
A semaphore does essentially the same thing. Bringing one in with asyncio is really simple: sem = asyncio.Semaphore(limit), where the limit is roughly how many of these tasks will actually run at any given time. You could say a hundred, you could say two, it's up to you; I'm going to leave it at 10, so roughly 10 fetches run at any given time, which gives me a bit of an interval between the actual requests. That means I need to add one more piece: I'll copy the original fetch, call it fetch_with_sem, and take in one more argument, the semaphore itself. Inside, all I do is async with sem, and then return await fetch(url, session, year). Naturally I could merge these two functions, but it makes more sense to keep them separate, because every once in a while you might not need the semaphore, and it's good for fetch to work on its own. In fact, we might as well default year=None as well, in case somebody doesn't pass a year, since the year is a fairly arbitrary thing specific to this project, whereas fetch and fetch_with_sem are definitely going to be reused for other kinds of asynchronous scraping.

With that in place, I scroll down and use fetch_with_sem as the target of create_task instead, and run python a_scrape_sema.py. Oops, we missed a required argument: we need to pass in the semaphore. In fact I'll reorder the parameters a bit to sem, session, url, and then year, only because the first three are required and the last one isn't, and then pass year=year. After fixing a couple of small typos, one more run and there we go.

To actually measure these results and see that it's working, you could use the same timing we did with the synchronous version: add start and end times inside any given coroutine or function, or time it outside of main. What you most likely won't see, with something this small, is the exact effect of the semaphore on your application; you'd see it at a much larger scale, and especially if you had access to the back end, where you'd see the load arriving on the server side. We don't have that access, so there's no real way to show it beyond the analogy: it's a ride attendant preventing too many web scraping events from happening all at once. And really, 10 or even 20 concurrent requests is plenty for the vast majority of web scraping. I could go up to 20 years now, and I'm not sure the site even has more than 20 years of these reports, but let's try 20 and see roughly how fast it goes. It doesn't take long, maybe 10 seconds or so, and now I have all of those pages.
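A sketch of the semaphore version, a_scrape_sema.py, including the snapshot-writing loop from earlier (the argument order and names are assumptions based on the walkthrough):

    import asyncio
    import pathlib
    from aiohttp import ClientSession

    OUTPUT_DIR = pathlib.Path().resolve() / "snapshots"

    async def fetch(url, session, year=None):
        async with session.get(url) as response:
            html_body = await response.read()
        return {"body": html_body, "year": year}

    async def fetch_with_sem(sem, session, url, year=None):
        # only the semaphore's limit of fetches run at any given time; the rest wait here
        async with sem:
            return await fetch(url, session, year=year)

    async def main(start_year=2020, years_ago=20):
        sem = asyncio.Semaphore(10)
        tasks = []
        async with ClientSession() as session:
            for i in range(0, years_ago):
                year = start_year - i
                url = f"https://www.boxofficemojo.com/year/{year}/"
                tasks.append(asyncio.create_task(fetch_with_sem(sem, session, url, year=year)))
            pages_content = await asyncio.gather(*tasks)
        return pages_content

    if __name__ == "__main__":
        results = asyncio.run(main())
        OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
        for result in results:
            current_year = result.get("year")
            html_data = result.get("body")
            output_file = OUTPUT_DIR / f"{current_year}.html"
            output_file.write_text(html_data.decode())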
Okay, that's it for semaphores. Let me know if this isn't entirely clear; I know it wasn't for me at first, and I had to practice a lot before I got the full benefit. Initially I didn't use them at all and was really just hammering other people's servers, which is not a good idea; we really want to use those semaphores.

Hey there, thanks so much for watching, and hopefully you got something out of this one. Asynchronous programming is a little challenging, because this idea of switching back and forth between functions just isn't that intuitive; I think synchronous programming is a more approachable way of building applications. The other thing we didn't really go into is that debugging asynchronous code is also much more challenging than debugging synchronous code. But if you're ready to go even deeper into asynchronous scraping, and specifically web scraping on JavaScript-enabled sites, then I highly recommend our project Supercharged Web Scraping with Asyncio. It will reiterate some of what we covered here, but it also covers something we didn't: JavaScript-heavy websites. Those are typically handled with something like Selenium, which loads the page and performs whatever actions you need on it; with asyncio we need something different from Selenium. There are packages that let Selenium work here, but it's better to use an asynchronous approach, and that project goes through all of that for you and helps reinforce everything we learned here. So thanks again for watching, and I look forward to seeing you next time.
Info
Channel: CodingEntrepreneurs
Views: 13,268
Rating: 4.9610138 out of 5
Keywords: djangourlshortcfe2018, install django with pip, virtualenv, Django Web Framework (Software), Mac OS (Operating System), Python (Software), web application development, installing django on mac, pip, django, beginners tutorial, trydjango2017, install python, python3.8, django3.0, python django, web frameworks, install python windows, windows python, mac python, install python mac, install python linux, pipenv, virtual environments, 30daysofpython, beginner python, python tutorial
Id: 6ow7xloFy5s
Length: 38min 53sec (2333 seconds)
Published: Thu Sep 03 2020