The Best Way to Organize Your Data Science Projects

Captions
[Music]

Hey everyone, welcome to the channel. If you're new here, my name is Dave, and my goal is to help you level up as a data scientist. In today's video I want to talk about data science project structures. This is something that I struggled with at the beginning of my career, in the sense that I didn't really have a proper structure for my projects. Now, I'm typically a very structured person and I like to keep things organized, but the problem was that I was following lectures at university, I was following online courses, I was watching YouTube videos, and I would always follow the structure that was given in the course, the lecture, or the video. This resulted in having a lot of projects on my computer, all with a different structure.

Now, if you're already working as a professional in a team with other people, there's a high chance that you already have a project structure that you have agreed on with your team. But I think this video is mainly important for people who are still learning data science on their own and maybe recognize the problem that I just described. So in this video I'm not going to tell you how you should structure your data science projects. I don't think there's a right or wrong way to structure them. Well, there are probably some wrong ways to structure your data science project, but you get the point. The main thing that I want to show you in this video is that consistency is more important than structure. So it doesn't really matter how you structure your projects as long as it works for you, but the important thing is to stay consistent.

Now, there are a few things to consider when you're trying to come up with a data science project structure that will work for you, the most important one being, at least in my opinion, the data science life cycle. If you search for this online you will find a lot of variations, but they all share similarities. I think
the most popular one is this one, the Cross-Industry Standard Process for Data Mining, which is often referred to as CRISP-DM. So when you're trying to come up with a project structure, it's important to keep in mind what this life cycle will look like for you or your team, so the project structure can better fit your specific needs, requirements, deliverables, etc. Microsoft also has some good resources on what they describe as the Team Data Science Process. This is a very detailed description of how they see the data science life cycle, so that's also something you can look into.

Now, the example that I want to show you today is the Cookiecutter Data Science project structure. This is a structure that I recently came across, and I've used it as a starting point to come up with the structure that I currently use. What I will now show you is how to get started with this Cookiecutter Data Science structure and then use that as a starting point to create your own project structure, and that's what I've done.

So they have turned this into a Python package that you can install using pip. If you want to go ahead and start with this, you should first run pip install cookiecutter in the terminal, and once you've done that you can use the following command: cookiecutter and then the link to the GitHub repository. I will now show you how to do that. So let me just copy this, hop into the terminal, paste it here. So I'll do yes. Then you have to give a project name, so new project, then the repository name, then your or your team's name, let me fill that in, then a description, then a license if you want to include one, so I'll do three, no license file. I'll just leave this open. Then Python 2 or Python 3.
So choose one for Python 3, and that's it. So what have we just done? I ran this terminal from my desktop, so if I now go to my desktop folder we see this new repository folder here, new project dave. Now, within this folder you will find the Cookiecutter Data Science project structure, and as you can see there are some folders, there are some files, there are even some subfolders. To get a big-picture overview of how the project is structured you can have a look at this file. This is also available within the README file as a Markdown file, so I can also open this and you can check it out here. For now I will just use the following structure over here.

So what I like about this project structure is that there is a place for anything that you could encounter in a data science project. There is your data, your docs, models, notebooks, references, there's your source code, requirements.txt. So for basically everything there is a place, and it's very well designed in my opinion. I've been working with this structure for the past half year or so and I really like it. I've made some small tweaks that I will show you in a bit, but overall the structure is still the same. I also like that there's a description for every part within this structure, so you don't really have to think about where to put things, and that really helps with consistency.

If you're interested in using this structure, I would recommend checking out this webpage and reading the article in full, because there's a lot of useful information in here, also about the naming conventions that they recommend and how you should handle your data. There's just a lot of really useful information in here.

I will now quickly show you the changes I've made and what this will look like in a VS Code data science project. So if we go back to the folder that we've just downloaded, this is the folder structure that you will get when you run the cookiecutter command in the terminal. So
you will get all the files, all the folders, and what I've done is I've created a copy of this, and that is the folder that you see over here, so oo data science folder template. What I've done is I've just removed some of the files, so the folder structure is still the same. I also didn't change anything about the subfolders. But I do think, at least for the projects that I'm working on, not all of the files are required. I would rather add them later when necessary than have them in by default. So if we go in here, what you can see is that I have a folder name, and then here is the repository name, and then here are the folders. As you can see it's pretty clean. It's just a bunch of folders, a README, and the requirements. That's what I include in all my projects. If we go back here, there are some additional files that I don't include in every project, but of course this is totally up to you. Find something that works for you.

So what I've done is I've created a local template folder on my machine, and I use this folder to initiate new projects, so I basically duplicate this folder. Of course you can also put this into a GitHub repository and then use that to clone into a new project. For now I like this way of working, but if you're working with a team it's probably a better idea to, as I said, put it in a GitHub repository so everyone can use that as a basis.

Now, the final thing I want to show you is what this will look like in VS Code. This will also show you the hidden files that are within this folder, so the .gitignore and the .env file. This is also really convenient. There is a very extensive .gitignore file in here that will exclude most of the common files that you would typically exclude, and this also includes the data folder. So if you're working with data sets, you typically don't want to push those to your version control platform, to GitHub for example,
so that is excluded, and also the checkpoints and other weird files that are sometimes created within projects and that don't have to be in your source control. So that's really convenient. There's also an environment file that you can use to store environment variables. This would be something like passwords or API keys that you don't want to store in your version control. This is what it looks like in VS Code. So you have your data here, your source code, and there are also some template Python files already for make data set, build features, predict model, train model, etc. So the infrastructure is there.

Now, before wrapping up this video, I do want to point out that this Cookiecutter Data Science project structure is only an example of how you can structure your projects. I think it's a very good starting point, but I would highly encourage you to try and find a project structure that works for you. So you can also look online for other project structures. For example, Microsoft, with the Team Data Science Process, also has a project structure that they provide, so here is the GitHub page for that. I will also leave a link to this page in the description.

So the key takeaway from this video would be that if you don't already have a project structure, try and find one for your data science projects and stay consistent with it. You'll really thank yourself later. So that's what I wanted to share in today's video. If this helped you out, I would really appreciate it if you hit the like button and subscribe to the channel. I'll be making more videos related to Python, data science, and machine learning, so if that's something you're interested in and you want to learn more, you should definitely subscribe. See you next time.
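The template-folder workflow described in the captions (keep a stripped-down template folder, then duplicate it to start each new project) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the setup shown in the video: the folder list follows the names mentioned in the captions, "src" as the name of the source code folder is an assumption, and the function names create_template and new_project are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

# Folder names as mentioned in the captions; "src" for the source code
# folder is an assumption, as is "README.md" for the readme file name.
TEMPLATE_FOLDERS = ["data", "docs", "models", "notebooks", "references", "src"]
TEMPLATE_FILES = ["README.md", "requirements.txt"]


def create_template(template_dir: Path) -> Path:
    """Create a bare template: empty folders plus empty starter files."""
    for folder in TEMPLATE_FOLDERS:
        (template_dir / folder).mkdir(parents=True, exist_ok=True)
    for name in TEMPLATE_FILES:
        (template_dir / name).touch()
    return template_dir


def new_project(template_dir: Path, project_dir: Path) -> Path:
    """Start a new project by duplicating the template folder."""
    if project_dir.exists():
        # Refuse to overwrite an existing project folder.
        raise FileExistsError(f"{project_dir} already exists")
    shutil.copytree(template_dir, project_dir)
    return project_dir


if __name__ == "__main__":
    # Demo inside a temporary directory so no real folders are touched.
    with tempfile.TemporaryDirectory() as tmp:
        template = create_template(Path(tmp) / "template")
        project = new_project(template, Path(tmp) / "new_project")
        print(sorted(p.name for p in project.iterdir()))
```

Duplicating a folder this way keeps every project identical at the start, which is the consistency point made in the video. For a team, hosting the template in a shared GitHub repository, as the captions suggest, is probably the better choice.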
Info
Channel: Dave Ebbelaar
Views: 5,098
Keywords: datascience, structure, projects, machine learning
Id: MaIfDPuSlw8
Length: 10min 1sec (601 seconds)
Published: Sun Aug 07 2022