Refactoring A PDF And Web Scraper Part 2 // CODE ROAST

Captions
In this video I'm going to finish the refactoring of a PDF and web scraping script. If you haven't watched part 1 of this refactoring, I've put a link to that video in the description of this one.

Now, I'm not a data scientist myself; I'm a software engineer, or software designer, so I view projects like these through the lens of a software engineer. If you want to learn more about data science itself, Skillshare, the sponsor of this video, has a couple of great classes to get you started. Skillshare is an online learning community with thousands of inspiring classes for creators. Explore new skills, deepen existing passions, and get lost in creativity. Skillshare has many classes on web development, programming in Python, software engineering, and software design. At the moment I'm following Frank Kane's course on data science and machine learning with Python. It's really comprehensive: it contains lessons about statistics, data types, clustering algorithms, decision trees, and popular libraries such as pandas, basically everything you need to know in order to get started. Skillshare is curated for learning: there are no ads, and they're always launching new premium classes, so you can stay focused and follow wherever your creativity takes you. The first 1,000 of my subscribers to click the link in the description will get a one-month free trial of Skillshare, so you can start exploring your creativity today.

The project I'm refactoring is a script that scrapes data from websites and PDF files. In particular, it looks at academic papers and extracts research keywords and word frequencies. The state we ended up with after the last part is this: we have a main file that contains configuration settings (I didn't really change anything there), and we have a couple of classes for handling different scrape requests. As I mentioned last time, the ScrapeRequest class has a very weird mechanism of instantiating its own subclasses, which we need to fix. So there's a generic ScrapeRequest class, and then we have a SciHubScrape class, a JSONScrape class, and a PDF scraper, which I moved to a separate file in the last video and cleaned up completely. That PDFScraper is now a pretty short class: it has a single method called scrape, a couple of helper methods for guessing the DOI, computing filtered tokens based on stop words, and so on, plus a most-common-words helper function, and it uses those functions to produce the scrape result. The result of scrape is a definition we added there: it's a dataclass, and it contains a couple of fields (there's a sketch of it at the end of this section).

Next to these scrapers we have a couple of helper functions, like one for changing a directory, and there are also a couple of different file request classes, which are actually not all file requests, but they follow the same structure as ScrapeRequest, where the superclass itself creates instances of its own subclasses. That's not a good idea, as I mentioned in the previous video. The reason these classes are still here is that we still have a DOIRequest class and a PubIDRequest class. For fetching the data from the PDF files I created a simple function, fetch_terms_from_pdf_files, that solves this, so we no longer need that class. We need to do the same thing for the other two, the DOIRequest and PubIDRequest classes, and after that we can remove the FileRequest class, because it's no longer needed. In this part of the refactoring I'm going to clean up the scraper classes, move a few more things to separate files, and then show you how to properly deal with configuration settings in this project. Let's dive in.
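The video doesn't spell out the fields of the ScrapeResult dataclass mentioned above, so here's a minimal sketch with illustrative field names of my own; only the general shape, a dataclass holding the extracted data, is taken from the description.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ScrapeResult:
    """Container for the data extracted from a single paper.

    The field names below are assumptions for illustration; the real
    project defines its own set of fields.
    """
    doi: str
    title: str = ""
    frequent_words: list[str] = field(default_factory=list)
```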
As you can see, the DOIRequest class creates a scraper depending on a lookup key that you give it. So the way scrapers and DOIRequest instances are related to each other is via these lookup codes, and that's generally not a nice thing to do, because you're basically re-implementing the composition mechanism. There's a much simpler way: just use composition. Instead of creating a class like this, we simply create a function, fetch_terms_from_doi, and pass a scraper instance to that function, so we know which particular scraper it's going to use.

In order to do that, we need a more generic representation of what a scraper is. We have one, the ScrapeRequest class in the main file, but that doesn't really apply here, because it's not a generic representation; it's responsible for creating instances. So instead, I'll go into the scraper file and add a scraper Protocol class that the other parts of the code can rely on. The Scraper protocol class is really simple: we have a class Scraper, which is a Protocol, and it only has one method. In the main file the ScrapeRequest subclasses had a download method, but the name download is not very clear, because it's not always downloading something, so I'm going to call this method scrape instead. Looking back at the main file, we see that it needs a search text, and the scrape method also returns a result, which is a ScrapeResult. That's our entire Scraper protocol class.

Now let's go back into the main file, take the DOIRequest and PubIDRequest classes, and turn them into functions as well. I'll start with the DOIRequest class and put the function in the fetch file, because that's also where we have the version for the PDF files. Let's call this function fetch_terms_from_doi; it gets a target, which is a string, and a scraper, which is of type Scraper (we need to import that). Now remove the self references. You can also see that variables that should actually be local are stored in instance variables, so we don't need that either. The function constructs the search terms and then calls scraper.scrape, because that's the new interface we defined, and then we can remove the rest of the class.

There are a couple of typing errors here. Part of the problem is that tqdm, which is responsible for showing those progress bars, doesn't really deal with typing properly: search_term, for example, is unknown, which feeds back into the scrape method also receiving an unknown argument, which results in another type error, et cetera. I'm not going to fix that in this video, but I hope some of these libraries will catch up with Python's typing system; then this is going to look a lot cleaner. So that's fetch_terms_from_doi. Back in the main file, that means I can delete this class. We can do the same for the PubIDRequest: I'll copy that class to the fetch.py file as well, and its method is going to become fetch_terms_from_pub_id.
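Here's a minimal sketch of the Scraper protocol and the new fetch_terms_from_doi function, reusing the ScrapeResult dataclass sketched earlier; the CSV column name "doi" and the DataFrame return type are my assumptions, not details confirmed in the video.

```python
from dataclasses import asdict
from typing import Protocol

import pandas as pd
from tqdm import tqdm

# ScrapeResult is the dataclass sketched earlier; assume it is importable.


class Scraper(Protocol):
    """Structural interface: any object with a matching scrape() method
    satisfies this protocol, no inheritance required."""

    def scrape(self, search_text: str) -> "ScrapeResult":
        ...


def fetch_terms_from_doi(target: str, scraper: Scraper) -> pd.DataFrame:
    """Read DOIs from a CSV file and feed each one to the injected scraper."""
    data = pd.read_csv(target, usecols=["doi"])  # assumed column name
    search_terms = [term for term in data["doi"] if pd.notna(term)]
    results = [scraper.scrape(term) for term in tqdm(search_terms)]
    return pd.DataFrame(asdict(result) for result in results)
```

The point of the protocol is that main can now compose any scraper with any fetch function, instead of the request classes instantiating their own collaborators via lookup keys.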
If you look at the PubIDRequest class, it actually takes a DataFrame, so here the target is going to be a pandas DataFrame, and this obviously also needs a scraper, just like the other fetch functions. Now let's remove all the self references that are no longer needed, and this should call scrape, obviously. You can see that search_terms here is actually a tuple containing all the search texts, which is then converted into a list, and that leads to problems: in order to use it here, we need to convert it to a list, and there's actually no reason for that, because in the other function it's already a list. We should be consistent, so let's change this into a list as well, which means we no longer need the conversion here, and then there's one more self to delete. That's our fetch_terms_from_pub_id (a sketch follows at the end of this section), and the remaining bits of the class can be removed. Again, there are a couple of typing issues that I'm not going to look into in this video. In the main file I can now remove the PubIDRequest class, because it's a simple function now, and that means the FileRequest class can also be deleted.

This already makes the main file a lot shorter, but there are a couple of things left to clean up. One is that we still have the ScrapeRequest class and the SciHubScrape and JSONScrape classes. We can put those into separate files too and remove the dependency on the ScrapeRequest superclass, because we're not going to need that mechanism anymore. So I'll create two files, a json.py file and a scihub.py file, and that's where these two classes are going to live. Let's start with the JSON scraper: select all the code in the class and move it over. Here too, there are things you can do similar to what I did with the PDF scraper: there are keys you may want to extract into separate constants, there's a get_data_entry method, and everything is stored in instance variables, which is not needed. You could clean this up the same way as the PDF scraper; I won't do that in this video, because it's mostly the same work, but ideally that's what you'd do here as well.

The JSON scraper is no longer a subclass of ScrapeRequest, so I'm going to remove all of that. It gets a scrape method as well, which returns a ScrapeResult, so let's import that, plus a couple of other missing imports I'll copy over. That looks a lot better already. Now a few issues start to appear: for example, here we have a base URL, which is a configuration setting defined in the main file. One way to fix it would be to define it inside the json.py file, but that's not a very good solution, because then your configuration settings become really hard to find. I won't fix this just yet; later in the video, when I start looking at the configuration, I'll show you how to fix it in a neat way. Other than that, I'm going to leave the class as is for the moment.

Let's do the same with the Sci-Hub class: remove it from main, go into the scihub file, and paste it there. Here we have a similar problem, in that we rely on a constant that we don't have access to in this file; I'll look more closely at that when I deal with the configuration settings.
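As promised above, here's a sketch of the PubID variant, reusing the Scraper protocol and the imports from the previous sketch; the "pub_id" column name is an assumption of mine.

```python
def fetch_terms_from_pub_id(target: pd.DataFrame, scraper: Scraper) -> pd.DataFrame:
    """Look up every publication ID in an already-loaded DataFrame.

    Unlike fetch_terms_from_doi, the target here is a pandas DataFrame
    rather than a path to a CSV file.
    """
    # Build a plain list (not a tuple) so it matches the other fetch functions.
    search_terms = [str(pub_id) for pub_id in target["pub_id"] if pd.notna(pub_id)]
    results = [scraper.scrape(term) for term in tqdm(search_terms)]
    return pd.DataFrame(asdict(result) for result in results)
```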
This method is then also called scrape, and it also returns a ScrapeResult. We no longer need this; let's add the missing imports, and we're going to need this import as well. This already looks a lot cleaner. Here too we get a lot of type errors, and I think there's a lot to improve in this class, but since it's very similar to what I did in the PDF scraper, I'm going to skip it for this video.

Now back to the main file. What remains? We have our configuration settings, and we have the ScrapeRequest class, which I'm now going to delete. That feels great; I love deleting this class. Very nice. All right, bye bye, we don't need that anymore. Then let's see what we still have: a utility function for changing a directory, and an export function, which would probably also be better off in a separate file. So let's create two more files: export.js... that's the wrong channel... export.py, which is going to contain the export function defined here. There are a couple of missing imports, obviously, so let's copy them over and remove the ones we don't need. Here too you see that we need parameters; in this case it makes the most sense to pass the export directory as a parameter to this function, so let's insert it here, and that's going to be a string.

The only things still missing are the date and the change_dir function, which we'll import from another file; let's call that dir.py. Go back into the main file, copy the function over, and add the os import it needs; it also needs contextmanager, so let's put that here as well. There's actually a slight issue in this function: inside the try block it retrieves the current working directory, and in the finally block it changes back to that directory. That can potentially lead to a bug, because if this part of the code fails, if it raises an error, the finally part can't be executed properly, since the statement that captures the working directory was never executed. In this case it's better to move that statement above the try block, which also solves the error we saw in the finally part. This is always something to be aware of when you're dealing with exceptions: is all the data available everywhere, and is it valid? Here we can solve it really easily, because the line before it, computing the destination, doesn't rely on the current working directory, and the change just increases the stability of the code. It's also why it's so important to use supporting tools for type checking and linting, to make sure these kinds of small issues don't become bigger problems in the future.

That's the change_dir function; let me close this file, because we're not going to change it anymore. Then let's import it... oh, now it's importing from main, that was not the idea; I want to import it from the dir module. There we go. The only remaining problem, apart from the minor typing issues due to pandas, is that the date is not defined. The simplest fix is to redefine the current date here, so I'll copy those two lines over, and then of course I also need the import: from datetime import datetime. Then I can remove the duplicated datetime mention here.
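Here's a sketch of the fixed change_dir context manager, with os.getcwd() moved out of the try block as just described; the os.makedirs call is my assumption about how the destination directory gets created.

```python
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def change_dir(destination: str) -> Iterator[None]:
    """Temporarily change the working directory, restoring it on exit."""
    # Capture the current directory *before* the try block. If this call
    # lived inside the try and an earlier statement raised, the finally
    # clause would reference a variable that was never assigned.
    cwd = os.getcwd()
    try:
        os.makedirs(destination, exist_ok=True)  # assumed: ensure the target exists
        os.chdir(destination)
        yield
    finally:
        os.chdir(cwd)
```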
So this is the export function, and in the main file I can again delete stuff, which I love doing. Let's delete change_dir, because it's not needed anymore, and let's also delete the export function... oh, I already did that, too bad. Now we need to import export, which sounds really weird, but there it is. Obviously it's now going to need an extra argument, the export directory, so let's add that. Let's see if this still works... there we go: it has created a CSV file with all the information from the analysis, so everything still works. That's great. The main file is now much shorter; it basically just contains the constants and the main function. This also means we can remove a lot of these imports, because they're no longer needed here.

There's one more thing I'd like to fix before I talk about improving the way configuration works, and that's a slight improvement to how we're logging things in the system. At the moment, when we log something, we always have to write the message twice: once via the logging object and once by printing it to the console, and that also means you have to create an extra string, which is a bit inconvenient. So instead, I'm going to create a log function that does this work for us and put it into a separate module. Part of that module should be the logging.basicConfig call, so let's copy it, including the date, because it's going to need that, and create another file called log.py where we put this initialization code. This doesn't belong in main, and neither does this; let's remove the duplicated datetime and just import it here. We also need to import the logging module. Then let's add one very simple function that logs a message: it's not going to return anything, and the only thing it does is call logging.info with the message and then print the message. If we use this log_message function everywhere instead of doing the double logging, it's going to shorten the code, and if you ever want to change the logging mechanism, there's basically only one place in the code where you need to change it.

Let's go back into the main file. We can delete this, because it's no longer needed here, and I'm not sure these dates are still needed either... doesn't seem so, so I'll remove those too. Then, instead of the double call, we can use a very simple function call, log_message (auto-import that), and it's going to log the message for us. Let's remove these two print statements. Actually, the quit() call here is also not needed, because this is the end of the main function, so we don't need to quit; I can delete that as well. And while we're at it, this comment is not really useful either, so let's also delete it. Now that we have the log_message function, we can replace all these double calls to logging with a single function call, which is much cleaner. I see that I accidentally removed the date here; I shouldn't have done that, so let's just put it back. I'll take it from the log module and put it back into the main file for the moment; I'm going to change it later anyway. That should fix it.
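A sketch of what the log.py module described above might contain; the log file name and the exact basicConfig arguments are assumptions on my part.

```python
import logging
from datetime import datetime

# Initialization that used to live in main: configure logging once,
# using today's date in the log file name (file name format assumed).
now = datetime.now()
date = now.strftime("%y%m%d")

logging.basicConfig(
    filename=f"{date}_scraper.log",
    level=logging.INFO,
)


def log_message(message: str) -> None:
    """Log the message to the file and echo it to the console, so callers
    no longer have to write every message twice."""
    logging.info(message)
    print(message)
```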
There's one more issue: this is supposed to be an f-string, otherwise it doesn't work, and then this import is also no longer needed. Let's run the code one more time to verify that everything still works... and it does.

Now let's think about configuration a bit more. At the moment, all configuration settings are defined right here in the main file, which works, but it's often quite useful to define these things somewhere else, so you can run the program with a different configuration without having to change the code. That's useful if you want to share your script with colleagues who don't know Python but still want to use your work in some way. So instead of putting the settings here, you can put them in a separate file, for example a JSON file. There are also a couple of packages that really help you improve the way you deal with configurations, like python-dotenv or Hydra, which I'm going to cover really soon on this channel, but for now let's keep it very simple and just load configuration settings from a JSON file. That's something you can do very easily, and it's going to get you started in creating cleaner projects.

So I'll add another file to my example folder, called config.json, which is going to contain all the configuration settings. I already prepared the basic settings; I basically copied them from the main file. Then, what you can do that's really nice, is use dataclasses to provide typing information when you use these settings in Python. Let's create another file called config.py and create a dataclass (import dataclass as well) called ScrapeConfig, which has the main config settings: the export directory, which is a string; the prime source, also a string; a couple of URLs, all strings as well; and finally the paper folder. That's our configuration. Then let's add a simple function called read_config, which takes a config file name (a string) and returns a ScrapeConfig instance. With open(config_file) as file, we first load the data using json (let's import json), and then we return a ScrapeConfig instance that gets the data, unpacked so the instance variables are properly set. This read_config function helps us read a simple configuration from a JSON file. It's nothing special, it's really simple, but it makes the project a lot easier to use in the future.

Now, in the main file, let's read the configuration data: we're going to have a config object, which is the result of read_config (import that too), and our file is going to be config.json. You could eventually even make the name of the config file a parameter that you pass when you execute the script, but here I'm just putting it into the code directly. That's our config; let's also add a comment to explain what this is doing. And here we're going to fetch data from PDF files and export it.
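Here's a sketch of config.py as described; the field names are illustrative assumptions and must match the keys in config.json exactly, because read_config unpacks the raw dictionary directly.

```python
import json
from dataclasses import dataclass


@dataclass
class ScrapeConfig:
    """Typed view of the settings stored in config.json.

    Field names here are assumptions; they must mirror the JSON keys.
    """
    export_dir: str
    prime_src: str
    dimensions_url: str
    scihub_url: str
    paper_folder: str


def read_config(config_file: str) -> ScrapeConfig:
    """Load the JSON settings file and unpack it into a ScrapeConfig."""
    with open(config_file, encoding="utf-8") as file:
        data = json.load(file)
    return ScrapeConfig(**data)


# Usage in main:
# config = read_config("config.json")
```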
So now, instead of using the constants here, like the export directory, we can use the configuration instead: I can just write config.export_dir here and delete the export dir constant. The URLs are not used at the moment, so I'll simply delete them for now, and this one we can also delete, because we're reading it from the config. Then we have the paper folder, which we can also pass from the config, and then we can remove a few more imports as well. Let's run this and verify that it still works... I think the problem is with this particular folder path: it should have a single dot, not two, and now it's importing properly again.

I want to conclude this code roast by looking at a few things to think about when dealing with configurations. The problem is that here, in main, we have all the configuration settings available and we can pass them to functions, but in many other places we don't have them. The JSON scraper, for example, relies on a particular configuration setting that we don't have access to there. We need a way to pass this data to the various parts of our application, and there are a couple of ways to do it. You could store the config in a global module somewhere and load it from there, but that introduces a lot of coupling: if you don't have access to that module, or you want to use this class in another setting, you can't really do that. Another way is to pass configuration settings via parameters, and because this is a class, there's a pretty simple way to do that: we can add an initializer that accepts things like URLs as configuration parameters. So we add an initializer that gets, in this case, a dimensions URL, which is a string, returns None, and stores the URL in an instance variable, and then we use that instance variable in the scrape method instead. Whenever we want to use the JSON scraper, we create an instance of it, in the main file for example, and pass it the config values it needs (there's a sketch of this below).

In the Sci-Hub class it's going to be exactly the same: it needs this URL, for example, but it also needs the research directory (actually, I can fix this import right away). Here too you could create an initializer that accepts these things as parameters and stores them in the instance, so you can use them in the scrape method. I won't do it now, because it's pretty straightforward and you can figure it out yourself, but I'll make sure to put it into the git repository.
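A sketch of the dependency-injection idea for the JSON scraper; the requests call and the response handling are placeholders of my own, since the video leaves the actual request logic unchanged.

```python
import requests


class JSONScraper:
    """Scrapes a JSON API; the endpoint URL is injected via the
    initializer instead of being read from a constant in main."""

    def __init__(self, dimensions_url: str) -> None:
        self.dimensions_url = dimensions_url

    def scrape(self, search_text: str) -> "ScrapeResult":
        # Placeholder request; the real class builds a more elaborate
        # query and extracts its fields from the JSON response.
        response = requests.get(self.dimensions_url, params={"search": search_text})
        response.raise_for_status()
        data = response.json()  # the real class picks fields out of this
        return ScrapeResult(doi=search_text)


# In main, the configuration value is passed in at construction time:
# scraper = JSONScraper(config.dimensions_url)
```

The coupling now points in one direction: main knows about the config and the scraper, but the scraper knows nothing about where its URL comes from.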
There's one more thing I want to fix in the configuration settings: at the moment we have hard-coded word sets in the PDF scraper, the target words, the bycatch words, and the research words. It would be much better if these were defined neatly in a couple of separate text files instead of directly in the code. I've already created the text files that contain these words, so instead of defining them in the pdf.py file, we can load them from the text files, and now that we have a configuration file, we can decide which files those are by defining them as configuration settings. So let's add three extra instance variables to our ScrapeConfig class: research_words, which is a string; bycatch_words, also a string (these are going to be file names); and target_words, also a string containing a file name. Then let's go to our config.json file; I've added these three settings in the JSON file as well, pointing to text files in the words folder, which makes them really easy to change from now on.

The PDF scraper is going to need to know these things, so, like I did in the JSON scraper, we add an initializer where we load these files. The initializer gets three things: the research words, the bycatch words, and the target words. We open each file and store its contents in an instance variable, and let's immediately turn it into a set, so we can do the intersections easily. Do the same for the other two, and now we can remove the hard-coded constants and simply replace the references. The configuration settings for the PDF scraper are now passed as parameters to the initializer.

But how do we get these values to the PDF scraper; where do we need to do that? In the fetch module, fetch_terms_from_pdf_files creates a PDFScraper, so that's where we need to supply the configuration settings. One thing you could do, instead of creating the PDFScraper inside this function, is inject it as a dependency. That kind of makes sense, because the other functions, fetch_terms_from_doi and fetch_terms_from_pub_id, also get the scraper injected, so we could do the same here. The reason not to is that this function is very specific to PDF file extraction, so perhaps it doesn't make sense to pass it anything other than a PDF scraper, and it might also suggest we should re-evaluate whether this particular separation into functions makes sense; maybe we should do it differently, but that's another design discussion that I won't get into today. For now I'm going to leave the scraper creation in here, because it's specific to PDF files, but know that you could also inject it as a dependency.

I want to show you one more thing you can do with configuration settings. Here we need to pass the different word files to the PDFScraper so it can load them. You could add more arguments to the function, research words, target words, and bycatch words, and pass them through. Another option is that fetch_terms_from_pdf_files doesn't get a bunch of separate arguments but a ScrapeConfig object instead. Let's see what that looks like: we have a config, which is a ScrapeConfig instance (import that), and instead of writing paper_folder we write config.paper_folder, and then we simply pass the parameters to the PDFScraper: config.research_words, config.bycatch_words, and config.target_words. If you wanted to, you could also let the PDFScraper initializer accept a ScrapeConfig instance instead of these separate parameters. It's a design choice (a sketch of both pieces follows below).
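Putting those two steps together, here's a sketch of the PDFScraper initializer that loads the word files into sets, and the fetch function that now accepts a ScrapeConfig; it builds on the earlier sketches (ScrapeResult, ScrapeConfig, pandas, tqdm, asdict), assumes ScrapeConfig has gained the three new string fields, and the _load_word_set helper plus the directory listing are my own illustrative additions.

```python
import os

# Assumes ScrapeConfig has three new str fields: research_words,
# bycatch_words, and target_words, each a path to a word file.


class PDFScraper:
    """Extracts keyword statistics from PDFs; the word lists are loaded
    from plain-text files instead of being hard-coded constants."""

    def __init__(self, research_words: str, bycatch_words: str, target_words: str) -> None:
        self.research_words = self._load_word_set(research_words)
        self.bycatch_words = self._load_word_set(bycatch_words)
        self.target_words = self._load_word_set(target_words)

    @staticmethod
    def _load_word_set(path: str) -> set[str]:
        # One word per line; sets make the later intersections cheap.
        with open(path, encoding="utf-8") as file:
            return {line.strip() for line in file if line.strip()}

    def scrape(self, search_text: str) -> "ScrapeResult":
        # The actual extraction logic from part 1 is unchanged; this
        # placeholder just shows the interface.
        return ScrapeResult(doi=search_text)


def fetch_terms_from_pdf_files(config: "ScrapeConfig") -> "pd.DataFrame":
    """Build a PDFScraper from the config and run it over the paper folder."""
    scraper = PDFScraper(
        config.research_words,
        config.bycatch_words,
        config.target_words,
    )
    files = [f for f in os.listdir(config.paper_folder) if f.endswith(".pdf")]
    results = [scraper.scrape(file) for file in tqdm(files)]
    return pd.DataFrame(asdict(result) for result in results)
```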
On the one hand, if you pass separate parameters like this, the scraper class is highly independent: it doesn't need to know that a ScrapeConfig object exists, but you do need to pass more parameters. On the other hand, in the fetch function we simply pass a config, which makes it really simple to call, and it just passes the right information along to the PDF scraper, but now fetch_terms_from_pdf_files needs to know that something called a ScrapeConfig exists. So depending on the level where you are in your application, you can make a different choice. I'd say that at the lower levels, where you have objects that each do a really small part of the work, it's probably not a good idea to pass them high-level configuration objects of which they may use only one or two fields. It's better to make them as independent as possible and simply provide the arguments they need; then they're also much easier to test, because they're going to be close to pure functions. In this case, fetch_terms_from_pdf_files is a fairly generic function that does a lot of different things with the configuration: it gets the paper folder from it, and it gets the different word files from it and passes them to the PDF scraper. So here I think it's okay to pass it a ScrapeConfig instead of a whole bunch of separate arguments, and that's what a configuration object is for: grouping these things so you can pass them around more easily and have different parts of your application use the configuration.

Overall, the way configuration is now set up is that at the top level, in the main file, we decide what the configuration file is and where we read it from. That gives us a configuration object, and then we have several mechanisms for passing the configuration throughout the application. This line I should actually remove, because it simply requires the configuration. Let's run the code one more time and verify that it still works... and of course it doesn't, because I replaced the papers folder again; in my original example it's structured a bit differently, and I should have handled that differently. Anyway, now it's working again without a problem.

So overall, this is a decent way of dealing with configuration settings: we read them from a separate file, pass them in from the top level, and they trickle down to the lower-level parts of the application. I've read the settings from a JSON file here and used my own read_config function that converts them into a ScrapeConfig instance. There are better, more generic solutions for this; one very nice package with lots of really useful features is called Hydra, and I'll do a video about Hydra on this channel very soon, so stay tuned for that.

Thanks again to John Fallot for supplying the code. I know I was a bit more roasty in this video series, especially about the scraper initializer, but you know, it's tough love: I just want to help people get better at this stuff, because that's what my channel is all about. I've also written a guide to help you get started; it's available for free at arjancodes.com/designguide, and it describes in seven steps how to design a new piece of software, with really actionable points that you can apply directly to your code. So that's arjancodes.com/designguide for your free download. I do hope you enjoyed this video; if you did, give it a like and consider subscribing to my channel. If you enjoyed this refactoring, check out this mini-series where I do a full refactoring of a handwritten digit recognition project.
Thanks for watching, take care, and see you soon.
Info
Channel: ArjanCodes
Views: 5,822
Keywords: web scraper, web scraping, python programming, data science, pdf scraper, python web scraping, python programming tutorial, python programming language, python web scraping tutorial, python tutorial advanced, data science for beginners, data science python, code roast, code roast python, web scraping tutorial, python tutorial, python refactoring, python refactor code, get data from pdf, refactoring code, refactoring python, refactoring python code, Web scraper python
Id: 6ac4Um2Vicg
Length: 33min 40sec (2020 seconds)
Published: Fri Dec 10 2021