7 Tips To Structure Your Python Data Science Projects

Video Statistics and Information

Captions
Today I'll cover seven tips to help you better structure your data science projects. If you're doing a project where you need to analyze some data, it often doesn't come with a clear objective at the beginning: as you get more insights from your data, you're going to add or remove features. So you might have a tendency to think of your code as throwaway, and assume that you don't need to spend any time at all designing the software, thinking about how to set up the project properly, or writing clean code. But that would be a mistake, because the whole idea of properly setting up your projects and thinking about software design is that it allows you to make changes more quickly and more easily, and especially with data science projects you need to change your code regularly and be able to do it quickly. So after watching this video, you have no more excuses.

The first tip is to use a common structure for your various projects. You're probably going to be involved in several projects that deal with data, and you probably also need to switch between them regularly, so it's really helpful to follow the same project structure everywhere to minimize your cost of switching context. Especially if you're part of a team, you may want to share that code with a colleague; if you can somehow agree among the team that you're going to use the same kind of structure everywhere, it's just going to make your life so much easier. I found it's really helpful to spend a bit of time with your team members to make sure that you agree on what a standard data science project should look like, and if you can't agree, fight to the death: it's the only reasonable option. A very useful tool for doing this is Cookiecutter. It allows you to start a project following a specific template, so you make sure you always have the exact same starting point. You can use an existing template like cookiecutter-data-science, or you can create your own that follows the standards of your team.
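As a minimal sketch of that starting point (not from the video): this assumes you've installed the cookiecutter package, and the project name is just a placeholder; the exact prompts and context keys depend on the template you pick, so check the template's README.

```python
# Sketch: start a new project from a Cookiecutter template programmatically.
# Assumes `pip install cookiecutter`; the template URL points at the
# cookiecutter-data-science template mentioned above, and "churn-analysis"
# is a made-up project name.
from cookiecutter.main import cookiecutter

cookiecutter(
    "https://github.com/drivendata/cookiecutter-data-science",
    no_input=True,  # skip interactive prompts; use defaults plus extra_context
    extra_context={"project_name": "churn-analysis"},  # placeholder value
)
```

On the command line, the equivalent is simply running cookiecutter with the template URL or a local path to your team's own template.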
In general, it's a good idea (and that's the second tip) to use existing libraries wherever possible. You may be tempted to write your own code to process, clean, and transform your data because it's quicker, but know that the more code you write, the bigger the chance of introducing bugs. It's really interesting to me what happened as I look back over my software development career over the past years: the more practical experience I gained developing software, the more I actually started relying on existing libraries. And that's not just because I'm lazy (I am lazy), but also because the authors of existing libraries and packages have put some thought into how to organize everything so that the package solves as many problems as possible. That's really nice, because it means that in the future, when you extend your project and add new features, you're probably in a better position than when you develop everything yourself from scratch. Another reason to use existing packages as much as possible is that they have probably (hopefully) been tested properly, so you spend less time testing your own code because the existing libraries will have already solved that for you. Another thing I really like about using existing libraries as much as possible is that it allows me to learn a lot about the domain. By using a package like pandas, for example, you learn all about what a data frame is, how processing typically works, and what kind of standards you should follow, because the makers of pandas have created all sorts of things to help you with that. So for me it's also a learning experience to use existing libraries. Of course there are tons of helpful libraries: pandas, but also NumPy, scikit-learn, PyTorch, and SQLAlchemy, lots of libraries that are useful for data science projects. Use them.

One particular type of tool that's very helpful is a pipeline or workflow tool. This allows you to structure your data processing in workflows and makes it way more scalable. This is where, for example, a platform like Taipy comes in (they're also kindly sponsoring this video). Taipy handles both the front end and the back end, it's open source, and you can use it for free. To install it, simply type pip install taipy, or if you're using Poetry, simply type poetry add taipy to add it to your pyproject file. You can use Taipy in VS Code directly via the Taipy Studio extension, and it also works in Jupyter notebooks. The nice thing about using a tool that's dedicated to running pipelines is that it offers a lot of features that you then don't have to build yourself. For example, Taipy has scenarios, which act as a sort of registry for all your pipeline runs. It also doubles as a great comparison tool for what-if analysis, because it allows you to launch pipeline runs with different parameters, and this will help you take a project that you set up as a pilot with just a simple machine learning model and make it available to your users with a much higher quality model very easily. On top of that, Taipy has many other features like parallelism, caching, data scoping, and pipeline versioning. Go to Taipy's GitHub page to check it out; I've also put the link in the description. Now back to the video.

The third tip is to make sure you log your results. As you're analyzing data and iterating, it's really important that you log things, because if something goes wrong, you want to be able to see where that happened and then go back a few steps to fix the problem. If you don't log things, that's actually really hard to do. Tools like Taipy can help you log your pipeline runs, but you might also want to keep track of the various outputs, and there are several ways in which you can do this. You can use log files, which are just files that you store locally on your machine, but they're a pretty rudimentary solution. You also have log services like Papertrail that allow you to send logs over the internet to a cloud service, though you might not want your logs to be stored on a server that you don't own; this is actually one of the major ways in which data leaks occur, so be very careful with that. If you're doing machine learning specifically, there are also tools like Comet ML that allow you to keep track of your experiments and visualize the performance of your models.
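Here's a minimal sketch of what local logging for a pipeline run could look like, using only Python's built-in logging module; the file name, format, and messages are placeholders.

```python
# Minimal logging setup for a pipeline run: write to the console and to a
# local log file so you can trace where a run went wrong. The file name,
# format string, and messages are placeholders; adjust them to your project.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    handlers=[
        logging.StreamHandler(),            # console output
        logging.FileHandler("pipeline.log"),  # local log file
    ],
)
logger = logging.getLogger("pipeline")

logger.info("Loaded raw data: %d rows", 10_000)
logger.warning("Dropped %d rows with missing target values", 42)
```

Log services or experiment trackers can replace or complement the file handler, but the basic pattern of logging each step stays the same.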
The fourth tip is to not be afraid to use intermediate data representations. You don't have to do all data processing in a single step: you can first do some pre-processing, store the result in an intermediate file, or even store some pre-processed data in a database, and then take that to the next step. And not just for data science projects, by the way; this is something that has really helped me a lot in structuring and organizing my code better, because it allows you to focus on a particular part of the job that you need to do. So as a first step, you can do pre-processing and make sure that the data is stored in a format that you can work with at a later stage. By doing that, you force yourself to think about just the pre-processing part, without having to think about the whole processing pipeline that you're building out step by step. The reason this works so well is that different representations are optimized for different things. For example, if you have data in CSV or JSON format, that's really great because it's human-readable, it's lightweight, and you can send it over to other people without having to worry about them not being able to read it. But if you need to query the data, a CSV or JSON file is less convenient, because there's no easy search functionality apart from basic file text search. So in that case, if you need to query data, you might want to store it in a SQL database instead, and that's a better option, but it also has downsides, because then you'll have to deal with the added complexity of managing a database. And then there are data frames: they're great for exploratory data analysis because they have a really extensive API, but you may encounter RAM limitations from keeping everything in memory, they might be slower for specialized operations like database operations, and there's also a learning curve. Though if you're doing a lot of data analysis, knowing about data frames is a really good skill to have. So pick whatever is most suitable for the job. If you need multiple formats in different steps, that's not a problem at all; it's better to just convert the data than to try to work around a format that doesn't suit you. So here's a question: what kinds of data formats do you use, and why? Do you have any tips? Share them in the comment section.
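To make tip four concrete, here's a rough sketch of storing an intermediate representation between steps. It assumes pandas is installed (plus pyarrow for Parquet support); all file, table, and column names are made up for illustration.

```python
# Sketch: persist an intermediate representation between pipeline steps.
# Assumes pandas (and pyarrow for Parquet); paths and column names are
# invented for this example.
import sqlite3
import pandas as pd

# Step 1: pre-process the raw CSV and write a cleaned intermediate file.
raw = pd.read_csv("data/raw/sales.csv")
cleaned = raw.dropna(subset=["amount"]).assign(
    amount=lambda df: df["amount"].astype(float)
)
cleaned.to_parquet("data/interim/sales_clean.parquet")

# Step 2 (possibly a separate script): pick up where step 1 left off.
sales = pd.read_parquet("data/interim/sales_clean.parquet")

# If ad-hoc querying matters more than hand-off, a SQLite database is an option.
with sqlite3.connect("data/interim/sales.db") as conn:
    sales.to_sql("sales", conn, if_exists="replace", index=False)
    top = pd.read_sql("SELECT * FROM sales ORDER BY amount DESC LIMIT 10", conn)
```

Each format serves a different step: Parquet for handing a cleaned frame to the next stage, SQLite when you need to query it.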
Tip number five is to move code that you're planning to reuse into a shared package. Especially if you use Jupyter notebooks, it's really easy to lose the overview once your code starts getting more complex: you might accidentally break things in a notebook, and there's no easy way to reuse code between notebooks. If you take the parts of code that you want to reuse, put them in separate modules, and then import those in your notebook, it's actually way easier to manage. You could even create a package out of that, publish it yourself, and then import the code that way, which saves you a lot of time. Working with Python code that's not in a notebook also has other advantages, like being able to easily write unit tests for it, having auto-formatting and style fixes, et cetera: all those things that you don't have in a Jupyter notebook. Now, I'm not an avid Jupyter notebook user at all, so in that sense my main experience is with simply writing Python code, but I do notice that whenever I use a notebook, I always have the tendency to quickly get out of there and back into regular code, where I feel way more comfortable. But notebooks are pretty nice for exploratory analysis, so it's good to integrate them into your workflow, in a meaningful, careful way.

Tip number six is that you want to move configuration settings into a separate file. You really want to keep configuration settings separate from the code. The worst thing you can do is to have these settings spread out all over your code base, and that's really easy to do, right? You have your different modules, and in each module you just use some constants that you define all over the place. This makes it really hard to make changes later on, or when you need to deploy the code and change the settings because you want to connect to a different database, or you need different paths or folder names. If configuration settings are all over the place in your code, they're really hard to find, and it's going to take you a ton of time. So the best thing you can do is to move all of those settings to at least one single place. Preferably it should be outside of the code, which I think is the best solution, but if you want to define things in code, make sure that happens in a single place. It's the idea that when you organize your code, you make sure there is a single "dirty" place where you do all the patching up and define all the specific things that your code needs, because then, if everything is in one place, it's easy to switch things out: different values, different constants, different ways of setting things up. What I typically do is store everything in environment variables. I might work with a local .env file so that I can easily define those variables whenever I'm working on my local machine, but the advantage of environment variables is that they're also well integrated with cloud tools. So if, for example, you deploy a function or a Docker container to the cloud, it's pretty easy to define a few environment variables so that you can change the settings your application should use. And by the way, pipeline tools like Taipy also support running your code with different configuration settings.
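Here's a small sketch that combines tips five and six: reusable code lives in a shared package that your notebooks import, and settings come from environment variables, loaded from a local .env file with python-dotenv. The module names, variable names, and defaults are illustrative assumptions, not something prescribed in the video.

```python
# Sketch of a small shared package (tips five and six). In a real project
# these would be two files, my_project/config.py and my_project/cleaning.py;
# they are shown together here for brevity. Assumes `pip install python-dotenv`
# and pandas; all names and default values are made up.
import os

import pandas as pd
from dotenv import load_dotenv

# --- my_project/config.py: one single place for settings ---------------------
load_dotenv()  # pick up a local .env file during development, if present

DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///data/local.db")
DATA_DIR = os.getenv("DATA_DIR", "data/raw")


# --- my_project/cleaning.py: reusable code you import into notebooks ---------
def drop_incomplete_rows(df: pd.DataFrame, required: list[str]) -> pd.DataFrame:
    """Return a copy of df without rows missing any of the required columns."""
    return df.dropna(subset=required).reset_index(drop=True)


# In a notebook or script you would then simply write:
#   from my_project.config import DATA_DIR
#   from my_project.cleaning import drop_incomplete_rows
```

In deployment you set the same environment variables on the cloud side, and the code itself doesn't change.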
Now, if you want to get better at reviewing your code and detecting problems faster, so that you're able to make changes like cleaning up your config settings, check out my free Code Diagnosis workshop, where I teach you a three-factor framework for reviewing code efficiently while still identifying the main problems. You can sign up at arjan.codes/diagnosis. It contains a lot of useful advice and practical code examples that you can apply right away to your own projects. That's arjan.codes/diagnosis; the link is also in the description of the video.

Tip number seven, the final tip, is to actually write unit tests. If you think you don't need unit tests because, well, you can just take a look at the charts, think again. The problem with not writing unit tests is that there's a much higher chance you're going to run into problems with your code later on, for example when you need to run your code on a new data set, which happens regularly, right? In that case it's really problematic, because it means that when you run your code with a new set of data, at a point when you're actually trying to focus on something else, that's when you're going to notice that there is a bug in your program. And that will probably also be the time when you're on a deadline, you're in a hurry, and you need to perform that analysis quickly. You've also maybe been out of the code for several weeks or even months, so it's going to take you a ton of time to get back into it and fix the bug. If you write unit tests while you're developing the code they cover, it's actually much easier, because that's the moment when you're focusing on the code and can spend a bit of time writing tests and making sure everything is stable, so that in the future, if you switch out the data set or share your code with a colleague who uses it in a slightly different way, at least you have the tests already in place to catch part of the issues. What you really want to avoid is that things are too dependent on you being involved every step of the way. The whole idea of writing code is that it automates things for you, but if you don't write tests, bugs are going to pop up at the most inconvenient moment possible, and they're going to need you to fix them. So if you can already do part of that work up front, it's going to save you a lot of trouble in the future. Another reason to write tests is that even though you may think you can detect issues by just looking at the charts, some problems might be too small to show up in a chart but still affect the result, and therefore affect the decisions you're going to take based on your analysis. So it's always good to think of your code a bit more broadly than just what shows up on the charts, and make sure that things are robust and stable. Especially if you've moved some of your code into a separate package, that code is a great candidate to write unit tests for (see the small pytest sketch below). You can even go full-on test-driven development and write the tests before you write the code, but for data science projects that's perhaps taking it a few steps too far.

I hope you enjoyed this video. If you did, give it a like, as that helps the YouTube algorithm recommend this content to other viewers as well. I'd like to hear from you: do you have other tips to help your fellow data gunslingers out? Some pandas plotting prowess, sexy SQL statements, or nifty notebook noodles? Now, I did talk about Jupyter notebooks a bit in this video and said that you should move your code outside of your notebook at some point, but there are other issues with Jupyter notebooks that you need to be aware of as well. To find out what those are and how you can address them, watch this video next. Thanks for watching, and see you soon.
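To round off tip seven, here is a minimal pytest sketch for the kind of function you'd move into a shared package. It tests the hypothetical drop_incomplete_rows helper from the earlier sketch; run it with pytest from the project root.

```python
# tests/test_cleaning.py -- a minimal pytest example for a shared helper.
# Assumes `pip install pytest pandas`; drop_incomplete_rows is the
# hypothetical function from the package sketch above.
import pandas as pd

from my_project.cleaning import drop_incomplete_rows


def test_drop_incomplete_rows_removes_missing_amounts():
    df = pd.DataFrame({"amount": [10.0, None, 3.5], "region": ["EU", "US", "EU"]})

    result = drop_incomplete_rows(df, required=["amount"])

    assert len(result) == 2
    assert result["amount"].notna().all()
```

Even a handful of tests like this catches the small, silent data bugs that never show up in a chart.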
Info
Channel: ArjanCodes
Views: 104,447
Keywords: data science python, data science, data science projects, data science project, python data science project, data science project structure, data scientist, machine learning, data visualization, data science tutorial, data science for beginners, data science roadmap, data analyst, data analysis, data analytics, machine learning python, data science with python, data science roadmap 2023, data science programming, big data, data science projects for beginners
Id: xVuqDBCQAYc
Length: 14min 48sec (888 seconds)
Published: Fri Nov 03 2023