Refactoring A PDF And Web Scraper Part 1 // CODE ROAST

Captions
There's something very interesting happening in this piece of code. Take a look at this line: you'd expect the file request variable to contain a FileRequest object, right? Well, let's find out. Ooh, that's interesting: the FileRequest initializer doesn't create a FileRequest. We're going to have a lot of fun untangling this mess. Let's dive in.

In this code roast episode, I'm going to analyze and refactor a PDF and web scraping script that analyzes academic papers. Thanks to John Fallot for supplying the code for this roast. I'll start by explaining how the example is set up, and then I'll refactor the project. I'll be using Tabnine for this, which is also the sponsor of this video. Tabnine is an AI assistant that provides smart code completion in your IDE, including both PyCharm and VS Code. It supports all modern programming languages, including Python, and detects which language you're working in from the current file type. Tabnine offers a free-forever plan; the free model is trained on a limited, curated selection of trusted open-source repositories. The pro version offers many more capabilities, such as the team learning algorithm, which provides personalized suggestions based on your team's code, project preferences, and patterns. You can adjust code completion suggestions to your preference, whether that's inline, full-line, snippets via pop-up, and more. Start using Tabnine now to increase your team's productivity; use the link in the description of this video to get a discount on the pro plan.

The example I'll be looking at today is a scraping script that extracts bibliographic data from academic papers. You can supply PDF files, a CSV file with data, and various other inputs, and the script processes that data, detecting things like word frequency. Let's scroll down, take a look at the code, and see how it's structured. There's a bunch of imports, then configuration settings: the current date, export directories, error messages, logging configuration, a couple of URLs, and other things the file needs. Then we have a bunch of classes. There's a ScrapeRequest, and then there are subclasses: a SciHubScrape, a JSONScrape that scrapes information from a JSON file, and a PDFScrape that can extract data from a PDF file. Scrolling down, there are also different types of requests: a FileRequest that gets the data from a file, a DOIRequest that takes a CSV, and a PubIDRequest class that takes it from a DataFrame. These are all different processing options that are wrapped in classes. Then there's a FolderRequest that takes data from a folder containing PDF files, for example. Finally, there's the exporting function that exports the data and the main function that uses all of these classes. So that's how it's set up.

Let me run this code to show you what it actually does. You can see it's processing pages from a couple of PDF files. These files are part of the repository; they're research papers, and the program analyzes the text in them. Once the program finishes exporting, it has created a folder, scraper export, which contains a CSV file with data like word score and frequency, and the commonly used keywords detected in these papers. So that's what this script does.

An important aspect of refactoring code is the diagnosis: before you start changing things, it's important that you understand where the main problems are.
That's why I generally start these code roast videos with an analysis of the code, trying to identify the main problems we need to address in order to improve the design.

I'm going to scroll way to the top, and here you already see the first issue: in the first 20 or 30 lines of this file, there is not a single line of actual code. It's all help text, thank-you messages, and so on. Of course it's important to thank contributors, and it's important to provide helpful information about what the script does. The problem is that if you put this at the top of the main code file, the file you'll be editing all the time, you'll be scrolling a lot, because for a developer working on the code this is not very useful information. It's much better to put it in a separate README file: you still want the information to exist, it just doesn't need to be in the main code. So instead of keeping it in the main file, let's create a README file, copy all this text over, and put it there. I won't do the formatting here; you can obviously clean this up and turn it into really nice Markdown, but let's leave it like this. Also, a comment announcing that these are comments we don't need to write explicitly, and the same goes for the version and author notes: once the code is in a git repository, these things are clear from the git metadata, so I wouldn't put them in the code itself. Let me delete all of this; that already saves us a lot of scrolling.

If we scroll down further, we see the configuration settings for this project. It's really nice that configuration settings are at the top of the file, because then they're easy to find and easy to change. But these are not all the configuration settings. If I scroll down further, there's a definition of which file to use for temporary data. There are more examples: the export name of the file that will contain the data is hard-coded in the export function, and in the main function we also have the target folder where the papers are. These are also configuration settings that ideally belong together with all the other settings at the top.

There's another problem: these configuration settings are not simple values. Strings are fine for a configuration setting, but we also have things like logging enums, formatting that's already being applied, a datetime object, and a path object. Generally, configuration settings should be relatively simple: strings, boolean flags, integers, those kinds of things. The reason is that ideally you don't want to store these values inside the code itself but in a separate file, and if your configuration setting is a path object or a datetime object, storing it in a text file is probably going to break things. Make sure your configuration settings are basic values that any ordinary human can understand and edit. If you gave the script to a user who wants to change something, that user may not understand what realpath is, why we're using it, or what this formatting does. Putting the settings in a separate file also limits what the user of your script sees, so they know what they're supposed to change and what they're not supposed to change.
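To give you an idea of the direction, here's a minimal sketch of what that could look like: a settings file with only plain values, loaded into a small dataclass. The file name and the keys here are illustrative, not the script's actual settings.

```python
# A minimal sketch, assuming a config.json that contains only plain values
# (strings, booleans, numbers). The file name and keys are illustrative.
import json
from dataclasses import dataclass


@dataclass
class ScraperConfig:
    export_dir: str
    paper_folder: str
    log_level: str


def load_config(path: str = "config.json") -> ScraperConfig:
    """Read simple settings from a text file that anyone can edit."""
    with open(path, encoding="utf-8") as file:
        return ScraperConfig(**json.load(file))
```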
I'll fix those settings in part two of this code roast. Now, on to the interesting bit. We have a class here called ScrapeRequest, and you'll remember from the beginning of the video that the initializer of a class in this example doesn't actually create an instance of that class. This is where that happens. You see a bunch of pretty low-level Python dunder method overrides: there's an __init_subclass__ that stores each subclass in a registry, and then when __new__ is called, depending on some boolean variable and the lookup code you provide, it looks up the subclass in the registry, creates an instance of that subclass, and returns it. So what it basically does is create subclass instances. The interesting thing is that ScrapeRequest itself doesn't do anything else; there's a download method here that's not implemented. The only thing ScrapeRequest is responsible for is creating instances. It's like a factory, in fact, but it's not a factory pattern: it's doing something very unexpected by changing the way a fundamental programming concept in Python works. If you call a class initializer, you expect it to return an instance of that class, not something else. That's highly confusing.

You're not only redefining what the class initializer is supposed to do, you're also introducing a lot of coupling, because these lookup codes are hard-coded strings in here. If you want to add a different type of scraper, you have to dive into low-level dunder methods and check that the lookup code is actually defined there. At the same time, the whole idea of inheritance is that the superclass doesn't know anything about its subclasses; that's why inheritance works. This design breaks that, because now ScrapeRequest has to know about its own subclasses.

And it's not even clear why you need any of this. What ScrapeRequest does is dynamically create either a SciHubScrape or a JSONScrape. But when you're in the main function, you generally know what you want to do: you know whether you want to analyze a JSON file or a PDF file, or take data from a folder or from a website. So in the main function you're going to have to be specific anyway, and there's no reason for all this abstraction logic. If you really want to create a class dynamically, use an abstract factory, or even more simply, maintain a dictionary that maps string names to class initializers; that's all you need. In short, whenever you feel you need to change the core of how a programming concept works to get the job done, think very carefully about whether that's really needed, and then don't do it. I've been developing software for more than 25 years, and I've never encountered a situation where the only solution was to change the meaning of a programming concept. It generally leads to a big mess, so don't do it.
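To make that simpler alternative concrete, here's a minimal sketch of the dictionary-based approach. The class bodies and the lookup-code strings are stand-ins, not the script's actual ones.

```python
# A sketch of the dictionary-based alternative to overriding
# __new__/__init_subclass__: map lookup codes to class initializers.
class SciHubScrape:
    """Stand-in for the script's Sci-Hub scraper class."""


class JSONScrape:
    """Stand-in for the script's JSON scraper class."""


SCRAPERS = {
    "sci-hub": SciHubScrape,
    "json": JSONScrape,
}


def create_scraper(lookup_code: str):
    """Look up a scraper class by name and return a fresh instance."""
    try:
        return SCRAPERS[lookup_code]()
    except KeyError as err:
        raise ValueError(f"Unknown scraper type: {lookup_code!r}") from err
```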
There are a couple of other issues in the code as well. One is naming. The ScrapeRequest class has a download method, so that's going to download the data, I'm assuming. For SciHubScrape this makes sense, because that's a website as far as I understand, so we have a download method that sends requests to that website. But JSONScrape reads a file, so you're not really downloading anything, and PDFScrape also reads a file, so again, not really downloading. In fact, you can ask yourself whether these classes should be in a hierarchy at all, or whether they should just be separate functions. There's also some inconsistency: SciHubScrape is a ScrapeRequest subclass and has a lookup code, JSONScrape also has a lookup code, but PDFScrape is not a subclass even though it still has the download method. From the name it looks like it's supposed to be a ScrapeRequest subclass, but it isn't. Proper naming and a high level of consistency are really important, because when you're developing software like this, it's crucial that things are extremely predictable and very easy to understand.

There's also a FileRequest class that does essentially the same thing as ScrapeRequest, and that's why it printed the wrong class name at the beginning of the video: FileRequest also dynamically creates its own subclasses, again with lookup codes. And there are more naming issues: this is a FileRequest class, but if I scroll down, there's a PubIDRequest that actually takes a DataFrame and returns another DataFrame, so it's not a file at all. The same goes for FolderRequest: well, maybe in some operating systems a folder is also considered a file, but it's still conceptually something different. So the class hierarchies in this file don't make a lot of sense to me.

Another thing is that the information in the methods and classes is not very well structured, in my opinion. For example, the PDFScrape class has a download method that stores lots of temporary data, like a list of preprints, or n, and we don't even know what n is: the number of pages, or something else? All of that is stored in instance variables of the class but then not used anywhere else. Generally, don't use instance variables unless you want the values to persist beyond the method you're using them in. If you simply need to store some information temporarily within a method, use a local variable; it's much easier. The PDFScrape class also contains other things, like hard-coded lists of target words, bycatch words, and research words. It's generally not a good idea to define data like this directly in a class; try to separate the data from the class so that the class becomes smaller and easier to understand.

Finally, there's some code duplication in the way information is logged. You see it happening here: we have a logging call that takes a message, we also print that message, and because we do it twice, we need to store the message in a separate string to avoid writing the same text twice. Instead of writing all these lines, why not simply create a log function that does this for you? That makes your code just a little bit shorter.
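Something like this hypothetical helper is all it takes; the name log_msg is made up.

```python
# A tiny sketch of such a helper: log and print in one call,
# so call sites don't have to repeat the message.
import logging


def log_msg(message: str) -> None:
    """Log a message and echo it to the console."""
    logging.info(message)
    print(message)
```

Now every call site is a single line instead of three.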
What I'm going to do in this part of the refactoring is clean up this file, put things into separate files, clean up the way inheritance works, and basically simplify this a lot. I'll create a folder called scrape that's going to contain the main files for our scraping package, the tools you can use to do scraping. Let's add an __init__ file, and then let's start with the PDF scraper. I'll create another file called pdf.py, go back to the main file, select all of this code, and move it into the PDF file. Of course, we also need a bunch of imports, so to keep it simple I'll copy everything over, and then my typing system will tell me which ones I don't need. This one I also don't need; let me remove these and let black clean up my imports for me. Those seem to be the imports we need for PDF scraping.

Now let's take a look at this class. First, I want to rename the download method, because it's not actually downloading anything. Let's call it scrape, because that's what it's supposed to do. You see we also get a typing issue: this returns a generic dictionary, but we'd like to be a bit more precise. If you look at what the scrape function actually does, it returns this data entry, and get_data_entry is another method, all the way down here, that collects a DOI, a word score, a frequency, and a study design. So what this dictionary is supposed to be is pretty clear. I'll create another file called scraper.py, where I'll define what the scrape result is going to be; we can use a dataclass for that. We have a class ScrapeResult with a doi, which is a string, and a wordscore, which is an int. It will have a frequency: if you look at the PDF scraper, this is actually a frequency distribution of all the words, from which it returns the most common words, and I've looked up what the type of that is: it's a list of tuples of strings and ints, basically pairs of words and frequencies. Similarly, there's a study design list, which has the same type. So that's our ScrapeResult.
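Here's roughly what that dataclass looks like; the field names follow the keys that get_data_entry was producing.

```python
from dataclasses import dataclass


@dataclass
class ScrapeResult:
    doi: str
    wordscore: int
    frequency: list[tuple[str, int]]     # (word, count) pairs
    study_design: list[tuple[str, int]]  # (word, count) pairs
```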
Now, in the PDF scraper, instead of returning a generic dictionary we can return a ScrapeResult. When you look at the get_data_entry method, it's a bit superfluous, because the only thing it does is call other functions and construct an object from the results; there's no real need for it, and it also stores the result in an instance variable, which isn't needed at all. Let's make this simpler: I'll copy this over into the main scraping function, and instead of returning self.data_entry, I'll return a ScrapeResult. Let's import that: the ScrapeResult gets the doi, which is this, the wordscore, the frequency, and the study design; this part we don't need. Now the method doesn't return a dictionary but a ScrapeResult, and the correct import was added for us, which is great. These should actually be capital letters; there we go. There are still some typing issues, but we'll look into those in a couple of minutes. The get_data_entry method is no longer used, so I can delete it.

Another problem in this PDF scraper is that these target words are all hard-coded in the class, which is also not very nice. What I'd like to do is move these out of the class. I'll take this target word list, copy it, and put it at the top for now; we're going to clean this up even more in part 2. There we go: those are the target words. Let's reduce the indentation, and of course this is no longer an instance variable, so let's just call it TARGET_WORDS; Tabnine is helpful here in already supplying the name for us. Now in PDFScrape I can take this get_target_words function, delete all of this, and write TARGET_WORDS instead. Let's do the same thing for the other word lists: we have the bycatch words, so let's copy those, reduce the indentation, remove them from the class, and put them at the top as a constant. Finally, there are the research words; let's also make those a constant and use it in the class. Now I'm going back into the main file to make sure the current version of the PDFScrape class still works correctly; we'll change more things later, but let's just check that everything still works. I'll scroll down, find the PDFScrape class in the main file, and delete it. In the main function we simply need to make sure PDFScrape is called, so let's import it and verify that this still works. It doesn't, because we obviously changed the name of the download method, so I'll replace it with scrape. Let's try one more time, and now it's processing the data again, so that's good.

What you also see in the main file is a class FolderRequest that's constructed in a very convoluted way: it has all this initialization stuff and then a method to fetch the terms from the PDF scraper. We can simplify this a lot by removing the whole inheritance relationship, because it's simply not necessary. In the scrape folder I'll create a file called fetch.py, which will contain functions that fetch various things for us, for example data from a PDF file. I'll take this FolderRequest class, copy it into the fetch file, and delete all the stuff in the initializer, because we're not going to use classes for this. Then I'll take this function, decrease the indentation, and call it fetch_terms_from_pdf_files; it doesn't need a self, because it's a simple function. It gets a couple of parameters: there's a paper_folder, which is a string, and that's basically the self.target you see here, so let me replace that with paper_folder. I'm obviously missing a couple of imports again, so let's add those; I'll copy them over from the main file. The old FolderRequest we can now delete. Remove the self here, and now we simply have a bunch of search terms, and the only thing left to define is the PDF scraper, so let's import that too; it should be PDFScrape, and this should obviously be a class instance. Remove the self here, and this is what we end up with.
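Here's roughly the shape of the new fetch function. It assumes a PDFScrape class whose scrape method takes a file path, and it glosses over details like the search terms, so treat the exact signatures as assumptions.

```python
# A sketch of fetch.py under the new package layout. PDFScrape and
# ScrapeResult are the refactored pieces from above; the exact
# constructor and method arguments are assumptions.
from pathlib import Path

from scrape.pdf import PDFScrape
from scrape.scraper import ScrapeResult


def fetch_terms_from_pdf_files(paper_folder: str) -> list[ScrapeResult]:
    """Run the PDF scraper over every PDF in the target folder."""
    scraper = PDFScrape()
    return [scraper.scrape(str(pdf)) for pdf in Path(paper_folder).glob("*.pdf")]
```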
Back in the main function, things are simplified a lot: I can remove pdfplumber, because that's been moved into the PDF scraper, and now I can write results = fetch_terms_from_pdf_files with the target, which is the paper folder, './papers'. This line isn't needed, this one isn't needed, and then we export the results. Let's run the main file and check that this still works. I think there's an import issue here; actually, this is no longer needed, because it's done by fetch_terms_from_pdf_files. Let's run this again and verify that it works: it starts processing the data again. Now, you can do the same thing for FileRequest and DOIRequest, creating fetch functions for them instead of this complicated class hierarchy. I'm not going to do that in this video, but when I publish the code on GitHub, I'll make sure those functions are in there.

Now I want to do a few more things in the PDF scraper. We have this whole scrape function, and it does a lot of things, including all the self references that are not actually needed. Let's also look at the other methods. We have get_bycatch_words and get_target_words, which don't do much more than call a simple method and return the overlap with another list. Having these as separate methods in a class is not very useful, and that holds for quite a few of these methods: there's no reason at all for them to be in the class. For example, get_target_words simply takes lists and checks the overlap with another list. And here you see an overlap function that takes a list and returns another list; apart from this all_words thing, it has no dependencies on instance variables. It's much better to turn these into separate functions, and separate functions are also much easier to test than methods that are part of a class, as in the current example.

What's even better is that the overlap method isn't needed at all: if you use sets instead of lists, a set has a very simple way of determining the overlap between two sets. That's another point: always rely on built-in tools when you can. If a built-in tool solves the problem for you, that's generally much better than coding it all yourself. So, for the three constants I created earlier that contain lists of words, let's say these are not lists but sets; let's replace this with that, do the same here, and for the research words as well. That's the nice thing about the typing system: you get useful feedback that research words is now indeed a set of strings. Now let's look at these methods: there's the overlap method, which I'm no longer going to use. get_target_words currently computes the overlap by calling that function on the all_words list and the target words list. Instead, you can simply write TARGET_WORDS.intersection and turn self.all_words into a set; that does exactly the same thing. The problem is that the type of all_words isn't known correctly yet, but we're going to fix that in a minute.
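To see why the built-in wins, compare a made-up example:

```python
# Made-up word sets, just to illustrate set.intersection replacing the
# hand-rolled overlap helper from the original script.
TARGET_WORDS = {"prospective", "retrospective", "longitudinal"}
all_words = {"a", "retrospective", "longitudinal", "cohort", "study"}

# One built-in call instead of a custom overlap() function and list scans:
target_intersection = TARGET_WORDS.intersection(all_words)
print(target_intersection)  # {'retrospective', 'longitudinal'}
```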
You can do the same thing for the bycatch words and the research words, which are handled in exactly the same way. Even better: there's no reason to put this into a separate method at all, because it's a single expression. Let me copy this line and move up to check where it's used: get_target_words is used in get_word_score, and there you see the word score is computed as the difference between the lengths of the target-word and bycatch-word overlaps. Let's clean this up a bit more: where is the word score used? Here. What I'll do instead is compute it directly here. First we have the target_intersection, which is TARGET_WORDS.intersection of all_words. all_words is an instance variable here, but that's not needed, so I'll remove the self and just write all_words: it's the intersection of TARGET_WORDS and all_words. At the moment all_words is a list, and we need it to be a set so we can call intersection. That's produced by this method here, but there's no reason for it to be a method on the class either, so let's move it outside the class, which also makes it easier to test later on. Let's not call it get_tokens, because that's not very precise, but compute_filtered_tokens. It takes a list of strings, let's call that text, and returns a set of strings. Here we have our stop words; self is no longer needed. The word_tokens variable uses the word_tokenize function, which expects a string, so we can pass the text in here instead, converted into a string; you can also do this slightly differently using the join function, which I think is more explicit about what you expect it to do. That creates the word tokens for us, and then finally, it returns not a list but a set, so let's put this into a set instead. Let me delete this comment, because it's not very useful, remove the self, and there we have our stop words and name words. So this computes the filtered tokens; there are a couple more things you could do here to clean it up, but we'll talk about that later.

Now all_words doesn't call get_tokens but compute_filtered_tokens, and the parameter it needs is the text, which is this postprints variable; let's remove the self here too. Now all_words is a set of strings, which is what we want. We have a target_intersection that computes the intersection between all_words and the target words, and we can do the same for the bycatch: the BYCATCH_WORDS intersection with all_words. Then we compute the word score, which is the difference between the lengths: the target intersection minus the bycatch intersection. Let's also create the research-words intersection, and then this method can go. So now we have our three intersections and the word score.
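Here's a sketch of the extracted function, assuming NLTK as in the original script; NAME_WORDS stands in for the script's own word set, and the NLTK "punkt" and "stopwords" data need to be downloaded once.

```python
# A sketch of compute_filtered_tokens as a standalone, testable function.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOPWORDS: set[str] = set(stopwords.words("english"))
NAME_WORDS: set[str] = set()  # placeholder for the script's name-word set


def compute_filtered_tokens(text: list[str]) -> set[str]:
    """Tokenize the text and filter out stop words and name words."""
    word_tokens = word_tokenize(" ".join(text))
    return set(word_tokens) - STOPWORDS - NAME_WORDS
```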
Then there's the frequency, which is this expression using all_words, and the study design, which uses the research intersection. Now, it's not so nice that this is hard-coded into the computation of the ScrapeResult, so I think it works better to put it into a separate function at the top. Let's call it most_common_words: it takes a word set, computes the n most common words, and returns a list of tuples of strings and ints. Now, in the scrape function, we compute the frequency as the most common words of all_words, and we want five of them; the study design is the most common words of the research intersection, and we want the three most common. We can change that here as well, and here too. Now the overlap method is no longer needed, get_target_words is no longer needed, the bycatch and research word methods are no longer needed, and get_word_score is no longer needed either.

There's one method left: get_doi. When you look at it, the only thing it uses is a file path, and it simply analyzes that file name, so there's no reason for it to be a method on the PDF scraper either. Let's also move this out, up to the top, and call it guess_doi, which I think makes more sense. It gets a single parameter, the path name, which is a string. Let's call this the basename, which is path.basename of the path name. Then the DOI is a slice of the basename string, and we return the DOI built from the part after position 7 plus this. If you want to use f-strings, that's also possible, and this is what the f-string version of the same thing looks like. So that's guess_doi. Let's go back down to the PDF scraper's scrape method and see where we needed it: here, so let's compute the DOI from the search text and pass it in. We can make this a bit cleaner by not using uppercase letters in the ScrapeResult fields, so let's change them to lowercase, and now we can drop the keyword arguments and use regular arguments, because it's clear what they are. get_doi can now be deleted. You can probably clean up the code a bit more: let's remove this self variable and use the search text directly, which is much easier, and preprints is also no longer an instance variable but simply a local variable. There are a couple of other places where we can delete the self too, so let's do that. Looking for the remaining self references, the only self left is now in the scrape method, which indicates that perhaps this shouldn't even be a class; it could simply be a function.
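Here are sketches of those two helpers; the slicing in guess_doi mirrors how the original get_doi cut up the file name, so treat the exact offsets as an assumption.

```python
# Sketches of the two extracted helpers. The filename layout assumed by
# guess_doi (a 7-character prefix and a '.pdf' extension) is an assumption.
from os import path
from typing import Iterable

from nltk import FreqDist


def most_common_words(words: Iterable[str], amount: int) -> list[tuple[str, int]]:
    """Return the most frequent words as (word, count) pairs."""
    return FreqDist(words).most_common(amount)


def guess_doi(path_name: str) -> str:
    """Guess the DOI of a paper from its file name."""
    basename = path.basename(path_name)
    doi = basename[7:-4]  # strip an assumed prefix and the '.pdf' extension
    return f"{doi[:7]}/{doi[7:]}"
```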
Now let's go back into the main file and verify that what we did just now doesn't result in problems. Let's run this code again: it seems to process everything just fine. Overall, if you look at the current state of the code, the PDF scraper has become way, way simpler. We extracted all the data, like the research words and target words, from the class, which is much better (and we'll improve it even more in the next part), and we turned a couple of things into separate functions: guessing the DOI, computing the filtered tokens, and determining the most common words. I also changed things so that computing overlap is now much easier, because we're using the built-in set, which already has that functionality. And because we split these things out into separate functions, they're now also much easier to test.

As an example, I already prepared a few unit tests for the PDF scraper. I'm using the unittest package here, but you could also use pytest or any other testing library. I made a class, TestPDFScraper, that imports the compute_filtered_tokens function and checks that it works for the empty list, and that it filters stop words: given this list of tokens, I compute the filtered tokens and assert that this is the expected output of the function call. If everything were still part of the PDFScrape class and used instance variables, this would be way harder to test: you'd have to create an instance first, and it might depend on other things we don't know about. But the functions we now have, to compute the most common words or the filtered tokens, are really short pure functions: they depend only on what you put into them, and that makes them so much easier to test.

That's hopefully one thing you take away from this: if you write data science scripts like this, whether it's web scraping, doing analysis, or training a machine learning model, keep things simple. You don't need convoluted class structures to get the script to do what you want. Keeping things simple makes the code easier to write, way easier to test, and it's going to save you a lot of trouble in the future. In the next part, I'll deal with the scraper code and also show you a better way of dealing with configuration settings.

Thanks again to the sponsor of this video, Tabnine; check them out via the link in the description below. If you enjoyed this video, give it a like, and if you want to watch more of my content, consider subscribing. If you want to watch a full code refactor of a data science project I did recently, check out this video. Thanks for watching, take care, and see you next time.
Info
Channel: ArjanCodes
Views: 14,864
Keywords: web scraper, web scraping, python programming, data science, pdf scraper, python web scraping, python programming tutorial, python programming language, python web scraping tutorial, python tutorial advanced, data science for beginners, data science python, code roast, code roast python, web scraping tutorial, python tutorial, python refactoring, python refactor code, get data from pdf, refactoring code, refactoring python, refactoring python code, Refactoring guru
Id: MXM6VEtf8SE
Length: 37min 43sec (2263 seconds)
Published: Fri Dec 03 2021