Fastest Python Web Scraper - Exploring Sessions, Multiprocessing, Multithreading, and Scrapy

Video Statistics and Information

Captions
In this video we are going to create a simple scraper using Requests and Beautiful Soup, and we will try to make it really, really fast. We will look at concepts such as multithreading and multiprocessing, and we will also see how it compares with Scrapy. So let's get started.

The first thing I'm going to do is build a very simple scraper using Requests and Beautiful Soup. If you're already comfortable with Requests and Beautiful Soup, you can look at the markers in the description and jump to the next section, where we start optimizing it. This is a page which contains links to all the circulating currencies, and what we want to do is save each of those pages to the local disk. So we simply call requests.get and supply this URL. Beautiful Soup needs two things: a string which contains the HTML, and the parser that we want to use. I'm going to use the lxml parser, and the complete HTML string we can get from response.text. With those two things we have a soup object.

From the soup object there are multiple methods available. Typically you will see find and find_all being used, but I'm going to use the select and select_one methods, because they can take CSS selectors. I'm not going to spend much time on how I created this selector. Basically, this is the element: I press Ctrl+F in the browser's element inspector and paste the selector, and at the bottom you can see that it matches 262 items. Let me save this selector in a local variable — you don't have to, but I'm doing it for convenience. The select method returns all the matching elements in one list, so this gives me all the link elements.

Next I create an empty list. For each link element — the loop variable could be anything, even x, but best practice is an easy-to-remember name — I call the element's get method to read the href. Right now this link is relative, so we want to convert it into an absolute link. This is where we use the urljoin function, which needs two strings: the URL of the current page, and the partial URL. It combines them into an absolute URL, so we overwrite the link with the result and append it to the list we are building — so it should be links.append, not link.append. Finally, the list is returned from the function. That's the simple function which generates the links, so I'm going to collapse it now and keep it out of sight, out of mind.
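Here is a minimal sketch of that link-gathering function. The page URL (the video appears to use the Wikipedia list of circulating currencies) and the exact selector string are assumptions, since neither is fully readable on screen; everything else follows the steps described above.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Assumed page URL and selector; the video matches 262 currency links,
# but the exact selector string is not shown clearly on screen.
PAGE_URL = "https://en.wikipedia.org/wiki/List_of_circulating_currencies"
SELECTOR = "td:nth-child(1) > a"

def get_links():
    response = requests.get(PAGE_URL)
    soup = BeautifulSoup(response.text, "lxml")   # HTML string + parser name
    links = []
    for link_element in soup.select(SELECTOR):
        href = link_element.get("href")           # relative URL from the href attribute
        links.append(urljoin(PAGE_URL, href))     # convert to an absolute URL
    return links
```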
Now let's write the function we will use to actually extract data from all these pages. For lack of a better word, let's call it fetch. It takes one link at a time, calls requests.get on it, gets a response, and saves that response to a file. Why save it to a file? Because I want to mimic a real-life scenario. In a real web scraping project there are two places with long-running work: number one is the GET request, where there is network delay, and number two is the disk write, because disk I/O also takes a lot of time. We will take a simple approach and just dump all the contents into a local folder — nothing fancy, that is not the objective; we just want to see how to optimize when disk I/O is involved.

We need the file name first. If we look at a link, we can extract its last part and use that as the file name. So I call link.split — a reminder that this is not best practice, but it gets the job done quickly — and then append ".html". I want all the files inside an output folder, so don't forget the forward slash. Ideally you should use the os module and os.path.join to build the actual path, but I'm taking a shortcut: in the current folder it looks for the output folder, takes the last part of the URL — "United_States_dollar" in this particular case — and appends ".html". That's the file name. Then I open this file in write-binary mode, to make it a bit more efficient, and call f.write. Since we're writing binary we need bytes, so we cannot use response.text here; instead we use response.content, which gives us the actual bytes. Very quickly all those bytes are written to the file, and that's all this function does.

Running code at module level is not good practice, so we use the if __name__ == "__main__" guard. First we get all the links using the get_links function — that's just one function call and I'm not going to touch it. Then we run a loop: for each link in all the links, we call fetch. This is the part which takes a lot of time, because we have 260-plus links to process, so this is the part we want to optimize — focus on it for the rest of the video.

Before we can optimize, we need to measure how much time it takes, so we import the time module. First we take the start time by calling time.time(), which returns the current time in seconds. At the end we print how much time it took; I use "\n" so it is always printed on a new line and "\t" so it is printed after a tab character — this is only for aesthetics — followed by the total time, which is time.time() minus the start time. Inside the loop we add one print. I don't want to print a lot of text, so I just print a dot, and since print normally appends a newline character, I pass end="" to suppress it.
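A sketch of that fetch function and the unoptimized driver loop, continuing the same script (it reuses the get_links function from the sketch above). The way the file name is derived follows the split shortcut described above.

```python
import time
import requests

def fetch(link):
    response = requests.get(link)                      # network delay happens here
    file_name = "output/" + link.split("/")[-1] + ".html"
    with open(file_name, "wb") as f:                   # disk I/O happens here
        f.write(response.content)                      # raw bytes, not response.text

if __name__ == "__main__":
    all_links = get_links()
    start_time = time.time()
    for link in all_links:
        fetch(link)
        print(".", end="")                             # progress marker, no trailing newline
    print(f"\n\tTotal time: {time.time() - start_time:.2f} seconds")
```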
Now I run the complete program, and you will see the problem I mentioned: it prints nothing for a long while and then dumps all the dots at once. It just completed, and you can see it took 125 seconds, so let's record that: no optimization, 125.14 seconds. All the code will be available; just check the links in the description.

Now let's start the first phase of optimization, which means changing the fetch function. Whenever we use requests.get, it initiates a brand-new connection to the web server every single time, and a lot of things happen behind the scenes when connecting to a web server. That overhead is not visible when you work with one or two URLs or browse normally, but if you want to fetch a lot of URLs from the same server, instead of calling requests.get directly you can use a session. So in the main block we call requests.Session() and store the session object in a variable, then modify the fetch function so the session is passed in, and instead of requests.get we call s.get. It reuses the same connection again and again. That is the only change: create one Session instance, pass the same session to fetch every time, and use session.get instead of requests.get — just two lines of change.

Let's see how much time it takes now. By the way, we also need to fix the printing: we pass the additional parameter flush to print and set it to True. The first session-based run took 26 seconds but still dumped all the dots in one go; running it again with flush=True, the dots appear one by one. Whenever you print something to the console it is kept in a buffer, and that buffer is flushed to the console outside your manual control; when you set flush=True it is flushed immediately, which is useful in this
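A sketch of the session-based version, again continuing the same script. The only changes from the plain version are the Session object created once in the main block and passed into fetch, plus flush=True on the progress print.

```python
import time
import requests

def fetch(link, s):
    response = s.get(link)                             # reuses the session's open connection
    file_name = "output/" + link.split("/")[-1] + ".html"
    with open(file_name, "wb") as f:
        f.write(response.content)

if __name__ == "__main__":
    all_links = get_links()
    start_time = time.time()
    with requests.Session() as s:                      # one connection pool for every request
        for link in all_links:
            fetch(link, s)
            print(".", end="", flush=True)             # flush so each dot appears immediately
    print(f"\n\tTotal time: {time.time() - start_time:.2f} seconds")
```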
kind of scenario. As you can see, it took 24.10 seconds, as opposed to the 125 seconds before any kind of optimization. Let me note that number down: this is using sessions. So far, what we have understood is that you should use a session if you are processing a lot of pages from the same website.

Now I'm going to change the fetch function back to the original one — we will use plain Requests, not the session — because I want to show you a few other tricks, so let's comment the session version out. The next technique is multiprocessing. What is multiprocessing? Think about all the cores in your CPU. Let me show you on the console: I'll run IPython, clear everything, and import cpu_count from the multiprocessing module, which is part of the standard library (you don't need to install it). Running cpu_count() gives me the number eight — not eight CPUs, but an eight-core CPU. That means we can create eight different processes, and each process can run on a different core.

This doesn't require many changes. From multiprocessing we need cpu_count and Pool, because we are going to create a pool of processes. We use the with keyword, and if we look at the hint, we just need to provide how many processes we want to create, which depends on the CPU count — so pass cpu_count() as the argument (don't forget the opening and closing brackets, since it is a function) and call the pool p. Now we have a function and a list of links, so we call the pool's map method, which takes a function and an iterable. The function we want is fetch — just provide the name of the function, don't call it — and the iterable is our list of links. So now we don't have a for loop; instead of the for loop we have the map call. That is all we need: it creates eight different processes, pools them, and maps the function over the links, and as soon as one process is done with one link it is given the next one.

Let's see how much difference it makes. I exit IPython, clear the screen, and run the same code once again — I'm not fast-forwarding, and you can see it is running much faster. Earlier we had around 28 seconds, and now we got 14 seconds. Let me copy that number and note it down: this is multiprocessing.
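A sketch of the multiprocessing version. It assumes get_links and the plain-requests fetch from the earlier sketches are defined in the same script; the pool size simply follows the machine's core count.

```python
import time
from multiprocessing import Pool, cpu_count

# get_links() and fetch() are the plain-requests versions defined earlier in this script.

if __name__ == "__main__":
    all_links = get_links()
    start_time = time.time()
    # One worker process per CPU core; each fetch call runs in its own process.
    with Pool(cpu_count()) as p:
        p.map(fetch, all_links)
    print(f"\n\tTotal time: {time.time() - start_time:.2f} seconds")
```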
The next thing we are going to try is multithreading. In multithreading we create a lot of threads — say eight, similar to the eight processes. When a process is waiting for the network to reply, it is just waiting, not doing anything, so there is still waiting involved with processes. With threads, the processor can switch from one thread to another while one of them is waiting, in whatever way suits your machine best. It sounds difficult, but the execution is very simple, and you will see that we have to make only a minimal change. Let's keep a copy of the multiprocessing code and comment it out, but leave it there.

Instead of multiprocessing there is another module called concurrent.futures, and from it I simply import ThreadPoolExecutor. Only a few changes are needed: instead of a Pool we use ThreadPoolExecutor, and instead of the number of processes we provide max_workers. With threads you are not bound by the number of cores or CPUs in your machine; you can choose any number that works best for you. I'm starting with eight just because we had eight processes, so let's compare eight processes with eight threads. There is no other change: we are still using the map function, still supplying the fetch function, and still supplying the iterable that contains all the links.

Let's run this and see how much time it takes. It's hard to tell without seeing the numbers, but it looks like it's more or less in the same range — it took 12.75 seconds, so a little faster, but not by much. Remember, though, that we are not limited by the number of cores, so let's go higher: 25. It took 12.75 seconds with eight threads; with 25, you can see it is running much faster — I'm not fast-forwarding — and it took 4.29 seconds. What if I set it to 100? It took only about 2.5 seconds with a hundred threads.

There are a few things to remember. Number one, as the parameter name suggests, it is max_workers: it does not mean a hundred threads are actually created; it can use up to a hundred threads. Number two, a higher number does not always mean a faster time. There were about 260 URLs to process, so let's see what happens if we set it to 260. You can see — I'm not fast-forwarding, and I'm not going to edit this part out — that it is stuck. With 260 threads it just hangs; probably the overhead of creating that many new threads was too much, so I'm going to kill it. The number of threads you should use is something you need to tune based on your requirements, your scenario, and your machine, so this number needs to be tweaked.
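A sketch of the thread-based version. The only difference from the multiprocessing code is the executor and the max_workers parameter, which is the knob you tune.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# get_links() and fetch() are the plain-requests versions defined earlier in this script.

if __name__ == "__main__":
    all_links = get_links()
    start_time = time.time()
    # max_workers is an upper bound, not a guarantee; tune it for your machine and workload.
    with ThreadPoolExecutor(max_workers=25) as executor:
        executor.map(fetch, all_links)                 # same map-style call as the Pool version
    print(f"\n\tTotal time: {time.time() - start_time:.2f} seconds")
```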
That was all about optimizing the Beautiful Soup and Requests version. One thing I want to mention: we did all this while keeping the code very simple. We do not have multiple parameters, we are not using sessions in the threaded version, and we are writing to different files, not the same file. There are a lot of places where things can go wrong, so unless you have a very good understanding of how threads work, you may end up doing something wrong. This is an easy approach for simple scripts, but as your scripts get complicated it is not the best approach, at least for beginners.

Now let's do the same thing using Scrapy, and this time we are not going to do any further optimization — we will write a very standard Scrapy spider: scrapy genspider currencies (I'm just putting an x in the name to be safe). So we have regular.py and currencies.py, and let me empty the output folder. Now we have one spider; I'm going to put the two files side by side and copy a lot of code across.

The starting URL goes into start_urls — I've put it in a simple variable first, because I've zoomed in a lot for the video and can't see the whole URL, and I don't want to mess it up. I want to use the same selector, so once the start URL is processed we land in the parse method, and I paste the same selector there. If I call response.css with this selector and then getall(), it returns the whole elements, but what I actually want is the href, so instead I chain the CSS selector with the href attribute and then call getall(). This gives me all the links.

I can copy the same loop from before, with a small rename: for each link in the links, I no longer need the urljoin function, because response.urljoin is directly available, and I don't have to return anything from this method. Nothing complicated so far. The one Scrapy-specific thing is that for each link I yield a scrapy.Request for that link with a callback method, self.parse_link — a bad name, but let it be.

Of course we don't have parse_link yet, so let's define it. This method takes self and response, and it does exactly what fetch did, so I copy that code over and edit it. We already have the response available; the file name is what we need to create, but instead of link we use response.url, which gives the current link, and then the same logic creates the output file. The last thing we want is to print the timer, but as you know, Scrapy spiders cannot be run directly as a script, so I import one more module, scraper_helper as sh.
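A sketch of the spider described above. The start URL and selector string are the same assumptions as before (not fully readable on screen), and instead of the video's scraper_helper module this sketch runs the spider with Scrapy's own CrawlerProcess, which, as described below, is what that helper wraps.

```python
import time
import scrapy
from scrapy.crawler import CrawlerProcess

# Assumed start URL and selector, as in the earlier sketches.
PAGE_URL = "https://en.wikipedia.org/wiki/List_of_circulating_currencies"
SELECTOR = "td:nth-child(1) > a"

class CurrenciesSpider(scrapy.Spider):
    name = "currencies"
    start_urls = [PAGE_URL]
    custom_settings = {"LOG_LEVEL": "WARNING"}         # only print warnings, not the full log

    def parse(self, response):
        # Chain the CSS selector with ::attr(href) to get the raw hrefs, not whole elements.
        links = response.css(SELECTOR + "::attr(href)").getall()
        for link in links:
            yield scrapy.Request(response.urljoin(link), callback=self.parse_link)

    def parse_link(self, response):
        file_name = "output/" + response.url.split("/")[-1] + ".html"
        with open(file_name, "wb") as f:
            f.write(response.body)                     # raw bytes; Scrapy's equivalent of response.content

if __name__ == "__main__":
    start_time = time.time()
    # The video calls scraper_helper's run_spider, which wraps this same CrawlerProcess logic.
    process = CrawlerProcess()
    process.crawl(CurrenciesSpider)
    process.start()
    print(f"\n\tTotal time: {time.time() - start_time:.2f} seconds")
```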
If you look at scraper_helper, there is a run_spider method: it simply imports time and uses CrawlerProcess and process.crawl, so it's standard stuff. You can write all of that manually, or be lazy like me and keep it in scraper_helper. Outside the spider class (which I'll collapse), I call sh.run_spider and pass the class name, after importing scraper_helper as sh at the top. If everything is okay, we should have something wonderful, so I run it with python3 currencies.py.

There is one more thing I want to do: by default Scrapy prints a lot of things to the console, so I add one more setting, the custom_settings attribute, and set LOG_LEVEL so that only warnings are logged — I don't want the whole log. The first run shows a lot of warnings: here, response.content should be response.body. In Scrapy, response.text contains the text, response.json() contains the JSON, and response.body contains the raw bytes, which is what we need here. Let's clear everything and run again. The initial log is not printed unless Scrapy finds a warning, and you can see it took only 3.3 seconds. It is not quite as fast as spawning a hundred threads, but it is very, very close. If you were using Requests without any optimization it took around 120 seconds, and now we are getting 3.3 seconds, so it is still really fast.

Moral of the story: we did not do anything special — this is a standard Scrapy spider — but Scrapy by design is written in such a way that it runs about as fast as it can. In the description you will find a link to all the code. That was all for this video. As you can see, there are ways to optimize your scrapers using multiprocessing, multithreading, and sessions — or, better yet, without worrying about any of that, just use Scrapy. That's all for today; I'll see you in the next one.
Info
Channel: codeRECODE with Upendra
Views: 1,129
Rating: 5 out of 5
Keywords: python scrapy tutorial, scrapy for beginners, Python Web Scraping, web scraping python, how to scrape data, scrape web pages, website scraping, python scraping, scrapy tutorial, data scraping, Python Scrapy, Scrapy Spider, scrapy splash, web scrapping, web scraping, webscraping, web scraping with python, concurrent futures, web scrape faster, faster web scraping, multiprocessing, multithreading, python, python web scraping tutorial, scrapy python, data mining, scrapy python 3
Id: qQDB6SE0a9c
Length: 29min 28sec (1768 seconds)
Published: Mon Jul 12 2021