Refactoring A Data Science Project Part 1 - Abstraction and Composition

Captions
This video is a refactor of a data science project kindly provided to me by Mark Todisco. Thank you, Mark! It's a basic handwritten digit recognition model based on a well-known dataset called MNIST (I'm not sure how to pronounce it). This is a three-part video. In the first part, I'm going to look at a few design issues in the code, and I'm also going to cover a really nice functional mechanism for dealing with data in a pipeline. In the second and third parts, I'll dive more into how data flows in the application and how we can improve it by applying software design principles. Disclaimer: I'm not a data scientist or a machine learning expert. I'm purely approaching the code from a software design perspective, so it's possible that I'll do things that don't make a whole lot of sense to you. I still hope, though, that you find some use in seeing how I refactor this code. Let me know in the comments below what you think, and whether I should do more of these data science kinds of videos in the future.

Before we start, I have something for you: a free guide to help you make better software design decisions in seven steps. You can get it at arjancodes.com/designguide. I've kept it short and to the point; I've tried to write down my approach, and hopefully that also helps you improve your own design process. I've put the link in the description of the video as well.

Now let's dive into the code. This is the example we're going to use for this refactoring. We have a main file that imports a bunch of things from our source directory (I'll go into that in more detail later on), and it basically runs the whole experiment. Because this is a data science project, the requirements include libraries commonly used in data science, such as pandas, numpy, and torch, and it also uses TensorBoard to track what's happening in the experiment. I'll show you how that works later on as well.

Looking at the code, there are some parameters that control how the experiment is run. We have loaders for the training data and the testing data, a model (a LinearNet, which I'll also show you), an optimizer, a loss function, and containers to collect data while we're running the experiment. There's some setup here to start recording things for the experiment, and then for every epoch we do testing and training, hopefully improving the quality of the result step by step. After each epoch it computes some averages, logs some things, resets the containers, and starts over again.

Let me run this example, and then I'll show you more of the code. If I run it, the training and testing start, and you can see the accuracy increasing at every step. You can follow the results using TensorBoard, which tracks the data stored in the runs directory (that's what's initialized right here) and shows you exactly what's happening. Let me switch to the TensorBoard view to show you what I mean. This is what TensorBoard looks like: you have some charts that show accuracy, which is useful when you're running data science experiments.
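To make that structure concrete, here is a minimal sketch of the experiment loop described above. It's written under heavy assumptions: the helper names (create_train_dataloader, run_test_epoch, and so on) and the hyperparameter values are placeholders for illustration, not the project's actual code.

```python
import torch

# Hypothetical parameters controlling the experiment (values assumed).
EPOCH_COUNT = 20
LR = 5e-5

# The pieces described above; the factory functions and epoch runners
# are placeholders standing in for the project's own helpers.
train_loader = create_train_dataloader()
test_loader = create_test_dataloader()
model = LinearNet()
optimizer = torch.optim.Adam(model.parameters(), lr=LR)
loss_fn = torch.nn.CrossEntropyLoss()
experiment = TensorboardExperiment(log_path="runs")

for epoch in range(EPOCH_COUNT):
    # Each epoch runs a test pass and a training pass, logging metrics
    # to TensorBoard, then flushes so the averages are written out.
    run_test_epoch(test_loader, model, loss_fn, experiment)
    run_train_epoch(train_loader, model, optimizer, loss_fn, experiment)
    experiment.flush()
```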
Now, back to the code. We have a source folder that contains most of the actual work. There's a dataset.py with a class called MNIST, which is the dataset we're using, and some code to initialize it. What you see here is actually an issue: a single variable is used as a mechanism to store intermediate results. That's not such a good idea, because it means that at various stages in your application the variable means different things. I'll talk about a solution to this problem near the end of this video.

Next to the MNIST class, there are a couple of helper functions to get a data loader for the training data and one for the test data; these create a DataLoader object and pass the dataset to it. Then we have functions for loading the data itself. The data in this example comes from the MNIST dataset, which stores images in a custom binary format, and these functions implement reading that format and returning the data so we can use it in our experiment.

Then there's the Metric class, which is aimed at collecting data about the experiment; that's used later on to pass to the tracking. We also have the actual model being trained, a torch module called LinearNet. It contains a couple of operations like Linear, ReLU, and Softmax, standard things you would do in a simple machine learning project. The forward method applies those functions to an x value, a tensor, and here you see the same problem we saw in the dataset: the same variable is used to store different things at different stages in the process (see the sketch below).

Then we have the part that deals with tracking. This provides the link between the code you're seeing here and the TensorBoard tool I just showed you. There's an ExperimentTracker, an abstract base class, which tries to separate the tracking, and the specifics of the tracking, from the actual experiment, which is nice. Then there's TensorboardExperiment, a subclass of ExperimentTracker that contains all the TensorBoard-specific details. Finally, there's utils, which is not a great name for a file because it's way too generic, but it provides some useful functions used by the tracking, such as creating directories when they're missing.

Python, like R, is a very popular language in data science. Most data science studies focus on the theory: statistics and machine learning. Along the way you learn to use Python, plus a collection of tools and libraries such as TensorFlow, pandas, and numpy, and in many cases that's all you need to know. Unfortunately, most studies don't pay much attention to how to set up a more complicated data science project. When you need to write more complex data science code, you end up with the same problems as when you write any complex application. How can you make sure that the code you write for your data science projects makes sense, that it's easy to change and maintain, that you can reuse parts of it in future projects, and that your colleagues understand what you've been doing? That's where software design comes in. I'm going to start by looking at the experiment tracking a little more closely.
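Before we get to the tracking, here is a minimal sketch of the variable-reuse issue in the model's forward method mentioned above. The layer sizes are assumptions (MNIST images are 28 by 28 pixels with ten digit classes); the point is the shape of the code, not the exact layers.

```python
import torch
from torch import nn

class LinearNet(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear = nn.Linear(28 * 28, 10)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The same variable is reassigned at every step, so "x" means
        # something different on each line: raw image, flattened tensor,
        # logits, and finally probabilities.
        x = self.flatten(x)
        x = self.linear(x)
        x = self.softmax(x)
        return x
```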
That's in the tracking.py file, and I'll start right here with the Stage class, which is used to represent the various stages in the experiment: training, testing, and validation. It's a dataclass that is frozen, so you can't change it, and it has these class variables. You can see that pylint is complaining about this, and what this class is really trying to be is an enum. So instead of a frozen dataclass with these non-standard values, we're just going to turn it into an Enum. Let's do that right now: I remove the dataclass decorator, make Stage an Enum (let's also import that), and use the auto method to assign the values. Now Stage is an enum, which is a much more standard way to do this. The only thing is that for the tracking we'd like to print the name of the stage instead of the value, because the value is just some integer.

Let's see where Stage is used in the experiment. The first place is the set_stage method. Here the stage parameter shouldn't be a string; it should actually be a Stage. We'll also need to change the type of the stage instance variable. That's set in the abstract base class, which I would generally avoid (an abstract class should only have abstract methods), so let me remove it there and put it into the TensorboardExperiment class instead, typed as Stage rather than str. To display the actual name of the stage, we add .name wherever the stage is printed, and now our stages are printed properly. Let's run the code to verify that this still works, and it does. I'll just cancel the run, because we don't need the whole experiment to finish.
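For reference, here is what the refactored Stage looks like as an enum. The member names are my guess at the original values; auto() assigns the integer values for us.

```python
from enum import Enum, auto

class Stage(Enum):
    TRAIN = auto()
    TEST = auto()
    VAL = auto()

# For tracking we display the name rather than the auto-assigned integer:
print(Stage.TRAIN.name)  # prints "TRAIN"
```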
Now, there's another problem with the ExperimentTracker. As I said before, it's an abstract base class, and it has a bunch of abstract methods, which is good, but it also has non-abstract methods. You could leave these in; it's not a particular problem, because the only thing they do is call the abstract methods multiple times, so they're convenience methods. On the other hand, these methods are not used anywhere in the code, so to keep things simple (the "you ain't gonna need it" principle) I'm going to remove them. That's nice, because our abstract class is now completely abstract: it only has abstract methods.

There's still an issue, though. Look at the four methods here: add_batch_metric, add_epoch_metric, add_confusion_matrix, and add_hparams, and then look at the main.py file. The problem is that the separation between ExperimentTracker, the abstract class, and its implementation, the TensorboardExperiment subclass, is not clean. I'll show you what I mean. In the main file we create a TensorboardExperiment instance, which is what this variable refers to. Searching for "experiment." shows which methods are called on that instance: mainly add_batch_metric and set_stage, plus a .flush(). The point is that set_stage and flush are part of the TensorboardExperiment implementation and not part of the abstract base class. That means the code in the main script gets no benefit at all from ExperimentTracker being an abstract base class. It doesn't provide a clean separation, because main is still directly dependent on implementation details in the TensorboardExperiment class, and that's a pity: what's the use of the abstract base class then?

The way to solve this is to make sure that this part of the main script can run with any kind of experiment tracker, which means adding set_stage and flush as abstract methods to the abstract base class. To simplify this even further, I'm going to change the abstract base class into a Protocol, because that way we also don't need the abstractmethod decorators, and it becomes a bit simpler and slightly more Pythonic. In general, this is a good idea when you're setting up a data science project: when you have different parts in your code, like tracking an experiment, running the experiment, and loading the data, use protocols or abstract base classes to separate them properly. That makes things much easier to reuse. But you always have to watch out that you're actually separating them, and not mixing up implementations and abstractions like what's happening here.

So, for this Protocol I remove the abstractmethod decorators, because they're not needed in a protocol class, and I add the set_stage and flush methods, since they're used in the main script. These also have no implementation here: set_stage sets the stage of the experiment, and flush flushes the experiment. Now we have a cleaner ExperimentTracker class, and in TensorboardExperiment we can remove the inheritance relationship, because we're now using protocols. To separate things completely, we can even put the TensorboardExperiment class in a separate file. So I create a tensorboard.py file, copy everything over, and remove the Protocol class from it, because that's the only thing that needs to stay in tracking.py (it's there to provide the separation between the two things), and Stage can stay there as well. Some imports in the new file are unused, so I remove those too. Then in tracking.py I remove the TensorboardExperiment class, which means a lot of its imports aren't needed anymore either.

Now tracking.py contains the interface of our experiment tracker, which is the Stage enum and the ExperimentTracker class, and tensorboard.py contains a specific implementation. Obviously TensorboardExperiment relies on Stage, so we import Stage from tracking. Let's run main: the location of TensorboardExperiment changed, so I need to re-import it (I remove the old import, and the auto-import seems to work), and then let's run the code again to make sure everything still works. For reference, tracking.py now looks something like the sketch below.
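This is what tracking.py might contain after the change: the Stage enum and a Protocol with the methods the main script uses. The exact signatures are assumptions based on what's described above, and the confusion matrix and hparams methods (declared the same way) are left out of the sketch for brevity.

```python
from enum import Enum, auto
from typing import Protocol

class Stage(Enum):
    TRAIN = auto()
    TEST = auto()
    VAL = auto()

class ExperimentTracker(Protocol):
    """Interface for experiment tracking; implementations don't inherit from it."""

    def add_batch_metric(self, name: str, value: float, step: int) -> None:
        """Log a batch-level metric."""

    def add_epoch_metric(self, name: str, value: float, step: int) -> None:
        """Log an epoch-level metric."""

    def set_stage(self, stage: Stage) -> None:
        """Set the stage of the experiment."""

    def flush(self) -> None:
        """Flush the experiment data to storage."""
```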
That still seems to work. Great. Another thing happening a lot in this code is that we're mixing up various data types, and that leads to problems. In particular, we're mixing up Real numbers and floating point numbers. For example, if I go here to the update method of test_accuracy, it expects a value that is a Real, but the type of batch_test_accuracy is unknown, because we're not sure what accuracy_score actually returns. Here we're adding a batch metric that expects a Real, and we have the same problem: the type is unknown. And here's another example, where add_epoch_metric expects a Real but test_accuracy.average is actually a float. It works, because Python converts this on the go, but it's not a clean way to do it. It's better to be explicit about the types you're using.

One way to do that is to add type hints everywhere, and that's already happening in quite a few parts of the code, like here in the tracker, which says that the log directory is a string, a stage is a Stage, and so on. But then you should also look at what your typing system tells you, find where the problems are, and solve them. In this case we have to make a choice: either we use Real numbers or we use floating point numbers. I don't know which one would actually be better here, but for simplicity I'm just going to switch all those Reals over to float, so at least we're using the same data type everywhere and we won't get all these typing warnings anymore.

So I search for Real and go through the occurrences. Here's the Metric class, which I'll also look into a bit later; I change this Real to float, and that actually fits better with the other values inside the Metric class, because those are already floats. There's another place here that should obviously also be float, and now we don't need this import anymore. There's an update method that also expects a float, so the same change applies there, and here, and here, and in tracking as well. Now we just have these unused imports left, so let's remove them and save. Everything is float now. Let's run the program once again to verify that this still works. You can see this already removed some of the weird typing errors we had: add_epoch_metric now expects a float and also gets a float, which is much better.
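For reference, here is a sketch of what the Metric class might look like after the switch from Real to float. The exact fields and the update logic are assumptions based on what's described above.

```python
from dataclasses import dataclass, field

@dataclass
class Metric:
    values: list[float] = field(default_factory=list)
    running_total: float = 0.0
    num_updates: float = 0.0
    average: float = 0.0

    def update(self, value: float, batch_size: int) -> None:
        # Accumulate a batch-size-weighted running average.
        self.values.append(value)
        self.running_total += value * batch_size
        self.num_updates += batch_size
        self.average = self.running_total / self.num_updates
```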
There's one final thing I'd like to do in this part of the video, and it has to do with the model. There's a neural net being initialized here, with these functions that belong to torch and that are used in the forward method. I mentioned this before: it's not a good idea to use one variable to store multiple intermediate results, because then the variable has a different meaning at different stages of the program. I kind of understand the reason why, because if you avoid that while following the same process, you need a separate variable for each of the steps, and it's just annoying to rename the variable all the time. A better approach to this kind of sequential processing, which is something that happens quite often in data science projects, is to use something called function composition.

Torch already has built-in support for function composition, because it's used a lot, and that's called a network. Instead of creating these functions separately, you can put them into a torch network, and that takes care of composing everything for you. That's an nn.Sequential, and we provide it a Flatten (that should be the uppercase Flatten class). Now I copy the rest of these functions into the network in the same order. We have our network of functions, we can remove the separate ones, and the forward method becomes really simple: return self.network(x). Sequential composes these functions for us, which is really useful. scikit-learn actually has something similar to torch's Sequential, called pipelines, which does basically the same thing; I'll put a link to that in the description below as well.

Let's run the code one more time, just to make sure this still works properly now that we've changed it to a network, and it seems to run just fine. In the TensorBoard view you can see I've been running these experiments a couple of times and stopped them halfway while refactoring the code, but it still keeps track of the experiment data, which is really nice. The refactored model looks something like the sketch below.
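Here is a sketch of the refactored model using nn.Sequential. The hidden layer size is an assumption; the point is that the composition now lives in one place and forward becomes a single call.

```python
import torch
from torch import nn

class LinearNet(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        # Sequential composes the layers for us, in order.
        self.network = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128),
            nn.ReLU(),
            nn.Linear(128, 10),
            nn.Softmax(dim=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.network(x)
```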
When you're working with data, it often occurs that you need to create some kind of pipeline that the data goes through. You may need to import the data, remove outliers, clean it up, transform it into a different format, do some more processing, export it again, and so on. It's useful to have a generic definition of what a pipeline is and put that into a separate class or module, so you can use the concept in different ways throughout your data science applications. PyTorch supports this by defining a network, and scikit-learn also has generic pipelines you can use. But what if you don't want to use PyTorch or scikit-learn? Is there a simple way to do this in Python? Yes, there is, and that's by using function composition. Let's take a look at how that works.

To finish part one, I'd like to talk a little bit more about function composition. What you see here are two functions, add_three and multiply_by_two, that do very basic computations, and I'm calling them in a sequence in the main function. Here I'm using the approach that stores the intermediate value in the same variable over and over again, which I said before we should try to avoid. When I run this, it takes the number 12, adds 3 twice, multiplies the result by 2 twice, and we get 72.

Now, if you want to avoid storing intermediate results in the same variable over and over, you can of course write a more complicated expression that calls these functions in one line. That looks like this: I delete those lines, print the result, and the expression calls add_three on the value x, calls add_three again on that, then calls multiply_by_two, and multiply_by_two one more time. This works, and you get exactly the same result, but it's actually a bit hard to read, and the more functions you add to the composition, and the more complicated the function calls get, the more parentheses you're going to have, and it won't be so readable anymore.

A very common way to solve this in functional programming is by simply composing the functions: you supply a sequence of functions, similar to the Sequential in torch, and something takes care of calling them in order. There's no built-in way to do this in Python, but it's very easy to build. Let me show you how that works. I import the functools library, and then we need a helper function that lets us compose functions, so I write a function called compose whose arguments are the functions to compose. Let's also define a type to help us: a ComposableFunction, which is a Callable that takes a floating point value as input and also outputs a floating point value. We could probably make this more generic, but let's keep it simple for this example. So compose takes any number of ComposableFunctions and returns another ComposableFunction, which is also something that takes a float and returns a float, and we can call that instead of calling all these functions in sequence.

We use functools.reduce for this. What reduce does is go through the list one by one, keep track of an intermediate result, and return that result at the end. In this case we give reduce a lambda that takes f and g, which are both functions, and returns a new function (because compose needs to return a ComposableFunction) that, given some value x, calls g on the result of f(x). So as reduce goes through the arguments, it takes the first two functions, f and g, wraps the second around the first, and returns that again as a function; in the next iteration it shifts to the next function and calls that one on top, and so on, so you get the whole composition of functions this way. And of course, what reduce needs to work with is the sequence of functions we pass as a parameter. So this is our compose function; here it is in full, with the example functions.
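This is the compose helper as described above, together with the two example functions and the composed my_func:

```python
from functools import reduce
from typing import Callable

ComposableFunction = Callable[[float], float]

def compose(*functions: ComposableFunction) -> ComposableFunction:
    # reduce walks through the functions pairwise: at each step it wraps
    # the accumulated function f in the next function g, so the composed
    # function applies them from left to right.
    return reduce(lambda f, g: lambda x: g(f(x)), functions)

def add_three(x: float) -> float:
    return x + 3

def multiply_by_two(x: float) -> float:
    return x * 2

my_func = compose(add_three, add_three, multiply_by_two, multiply_by_two)
print(my_func(12))  # (12 + 3 + 3) * 2 * 2 = 72
```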
With compose in place, we create my_func by composing add_three twice and multiply_by_two twice, and the result is then very simple: we just call my_func on the value x, and this gives us exactly the same result, namely 72. The nice thing is that we've lost all those extra parentheses. If I want to add another add_three, it doesn't really increase the complexity of the program anymore; it's rather simple to do. You could even have multiple lists of functions that you combine, or whatever else you want to do with them. So function composition is a really nice way to achieve this, and as I said, torch and scikit-learn have built-in tools for it, but you don't actually need them: you can do it in a more generic way with a helper that composes the functions using functools.

Next week I'm going to dive more deeply into the way the data flows in the code and how we can improve that aspect using design principles, so I'm going to have to leave you with a kind of unsatisfying cliffhanger. I do hope you enjoyed this. Give this video a like if you did, consider subscribing if you don't want to miss any of my videos in the future, and I hope to see you next week for part two. Thanks for watching, take care, and see you next time!
Info
Channel: ArjanCodes
Views: 71,714
Keywords: data science, data science project, data science project python, data science for beginners, machine learning python, pytorch, pytorch dataloader, refactoring python, refactoring code, refactoring in software engineering, refactoring python code, data science project for beginners, machine learning project python, data science project walkthrough, mnist, tensorboard pytorch, python data science project example, data science project tutorial, Software refactoring
Id: ka70COItN40
Length: 29min 31sec (1771 seconds)
Published: Fri Oct 08 2021