Build automation in R with the {drake} package

Video Statistics and Information

Captions
Hello and welcome. Today's date is May 11th. Luxembourg, like some other European countries, is starting to come out of confinement, so fingers crossed, let's hope there won't be a second wave. Well, I guess there will be, but hopefully not as bad as the first one.

Today I'm talking about drake. drake is quite an interesting package because it's basically a build automation tool. If you are using Linux you might have come across Makefiles; let me actually show you the GNU Make documentation. As you can see, a Makefile consists of rules with the following shape: a target, some prerequisites, and a recipe. So it's basically a tool used to build software. You might wonder why we should be interested in that. Well, because a data science pipeline is software, so we are facing the same problems as software engineers. But data scientists and statisticians have, for the longest time, been doing this manually, whereas software engineers automated it literally decades ago. Waf is another such build automation tool; it's written in Python, but you can use it for anything. Make, I believe, is written in C, but you can also use it for anything really.

By the way, I'm showing you this blog post because it is basically the one that made me want to try drake. drake had been on my radar for some time, but I never really got around to trying it, and this was really the post that convinced me, so thank you, Miles McBain. I really encourage you to read it; it's quite interesting and it explains everything you need to know about drake.

What I'm showing you today is my attempt at building a little example of what a very simple drake project could look like, mixed with package development. The idea I had was to build a package with all the functions I need to clean my data, visualize my data, and run a machine learning pipeline, and then have a drake file, which you're seeing here, that runs all of this in a very simple and automated way. I called my package "cool ml project". This package contains many functions, as you can see down here. The first function is get_data, which is used, you guessed it, to get the data. Actually, my model is running right now, but it's taking a lot of time, so I'll have to go from a sample of 1,000 rows down to a sample of 100.
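As a rough idea, here is a minimal sketch of what a get_data() function like the one described could look like. The exact dataset, URLs and column names are not given in the video, so the UCI "adult" data below is purely an assumption used for illustration, not the code from the actual package.

```r
#' Get the training and testing data
#'
#' Downloads the data from the UCI machine learning repository, keeps a small
#' sample of the training rows, sets the column names and returns both sets.
#' @return A list with the training and the testing data
#' @export
get_data <- function() {
  # Hypothetical dataset and URLs, chosen only to illustrate the idea
  col_names <- c("age", "workclass", "fnlwgt", "education", "education_num",
                 "marital_status", "occupation", "relationship", "race", "sex",
                 "capital_gain", "capital_loss", "hours_per_week",
                 "native_country", "income")

  base_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/"

  training <- read.csv(paste0(base_url, "adult.data"), header = FALSE,
                       col.names = col_names, strip.white = TRUE)
  # Keep only a small sample so the pipeline stays fast
  training <- training[sample(nrow(training), 100), ]

  testing <- read.csv(paste0(base_url, "adult.test"), header = FALSE, skip = 1,
                      col.names = col_names, strip.white = TRUE)

  list(training = training, testing = testing)
}
```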
This function, as you can see, has some documentation because it's inside the package, and this documentation gets compiled when I build the package. The function doesn't take any input, but it does a bunch of things: it downloads the data from the UCI machine learning repository, samples 1,000 rows (which I need to decrease), defines the column names, downloads the testing data, and returns a list with both the training and testing data. That's the first thing.

The second function is preprocess, a very simple function that takes the training set as an argument and builds a simple recipe that just binarizes the predictors. Then I define a grid over which I will tune my model; the grid function is again very simple and takes a model specification. Actually, I see that here I called the argument model_spec and there model, so let me correct that. If you're familiar with the tidymodels framework, you will recognize some of this code. Then I have another function that defines my model, again very simple: it's a parsnip model, which is why I imported the parsnip package. (A rough sketch of what these functions could look like follows below.) So up until now it's a very standard package, nothing very different: a bunch of functions with some documentation and the imports that I need.

Where it is a bit different is in the structure. I have my .drake folder (I will explain what is inside in a bit), my NAMESPACE, my README, and an inst/ folder. Inside the inst/ folder I have my run_analysis.R. This is where I load my packages: I load my cool ml project package, I load {dplyr} and I load {drake}, and then I define a drake plan. The drake plan, as you can see, is just a series of calls to each of my functions: first get the data, then define my training data and my testing data, then define my cross-validation splits, pre-process my training set, and define my model, in my case boosted trees, i.e. extreme gradient boosting. But I could just copy and paste this code, replace boost_tree() with logistic_reg() for example, and I would have a new model specification; same for the grids, same for the workflow. I have a function, which I didn't show you, that builds the tidymodels workflow encapsulating the model as well as the recipe for the pre-processing. This is very nice, because then it's very easy to run the tuning over the grid: you just pass the grid, the workflow and the cross-validation splits to tune_grid() and it trains your model.

Now this is very interesting, because once I have this plan I can simply execute it with make(plan); a sketch of what such a plan could look like is shown below. This is very nice for many reasons. First of all, this structure forces you to work in a very clean way: it really forces you to define very simple functions that each do one thing. This has the added benefit that if you want to try several models, say extreme gradient boosting, random forests, logistic regressions, whatever, you can simply reuse these "builder" functions that build the models, the workflows, and so on.
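To make the description above concrete, here is a hedged sketch of the pre-processing and model-definition functions. The function names, the outcome column and the engine are assumptions; the video only says that the recipe binarizes the predictors and that the model is a parsnip boosted-trees specification.

```r
#' Pre-process the training set
#'
#' Builds a simple recipe that binarizes (dummy-encodes) the predictors.
#' "income" as the outcome column is an assumption for illustration.
#' @export
preprocess <- function(training_set) {
  recipes::recipe(income ~ ., data = training_set) |>
    recipes::step_dummy(recipes::all_nominal_predictors())
}

#' Define the model
#'
#' A parsnip boosted-trees specification with two tuning parameters.
#' @export
define_model <- function() {
  parsnip::boost_tree(trees = tune::tune(), tree_depth = tune::tune()) |>
    parsnip::set_engine("xgboost") |>
    parsnip::set_mode("classification")
}
```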
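And here is a hedged reconstruction of what the run_analysis.R described in the video could look like. The target names, the helper functions make_grid() and make_workflow(), and the package name are assumptions based on the description, not the exact code from the repository.

```r
# inst/run_analysis.R
library(coolmlproject)  # hypothetical name for the package holding the functions
library(dplyr)
library(drake)

plan <- drake_plan(
  raw_data            = get_data(),
  training_set        = raw_data$training,
  testing_set         = raw_data$testing,
  cv_splits           = rsample::vfold_cv(training_set, v = 5),
  preprocessed        = preprocess(training_set),
  model               = define_model(),
  grid                = make_grid(model),                    # hypothetical grid builder
  wflow               = make_workflow(model, preprocessed),  # hypothetical workflow builder
  tuned_boosted_trees = tune::tune_grid(wflow,
                                        resamples = cv_splits,
                                        grid      = grid)
)

# Build every target that is missing or outdated; up-to-date targets are skipped
make(plan)
```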
The second benefit is what you see here in my console: each line is a target, for example tuned_boosted_trees. Before the video I ran the plan, and all the targets were already built; that's what's inside the .drake folder I alluded to just before, which holds the cache of the project. Because these targets were built before I started recording, they are not being built again. The only thing that wasn't built is the last target, the model, and as you can see it's now running on the second fold, so it takes quite some time; this is the only thing being built. What this means is that if you have several models, or several cleaning steps, visualization steps, plots, whatever, only the targets that are outdated will be rebuilt. If you update your data, or if you're working with several data sets and one of them gets updated, only the targets that depend on that data set will be rebuilt. This is a huge time saver, really.

The other benefit is that working in this very structured way also makes you think about the actual sequence of your project. I don't know about you, but when I start a project I look everywhere: I first do some cleaning, then I realize I need to explore a little more, so I explore a bit more, then I redo some cleaning, then I do some visualization, then I run a linear regression, then I redo some cleaning, and after two days of this the project is a mess. If you work in this structured way, you have to think a little: okay, first let's acquire the data, let's think about that; now that I have the data, let's clean it, let's think about that. At least for me, it really forces me to work in a more structured way.

The other nice thing is that, as I said, a target can be anything: a model, a cleaning step, a visualization. I don't have any plots here, but one of the targets could be a ggplot, and your last target could be an R Markdown file; inside the R Markdown file you just grab the elements you need from the cache. These targets become available through readd(), that's "read" with two d's. Let me try, maybe I can stop the run... yes, it worked. For example, reading the pre-process target grabs the recipe from the .drake cache, and reading the training set target grabs the training set. This is quite useful, because inside the R Markdown file you will have calls to this readd() function to fetch the elements that you need; for example, if one of your targets is a ggplot, you just readd() the ggplot out of the cache and it will be displayed. The other advantage is that you can mix this with a very interactive approach: now that I grabbed my training set, I could save it inside a variable (oh, I thought my editor broke, no, it's me) and start working with it interactively again, which means you can test some things and then go back to writing a function. So this is really useful as well.
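The function he spells out as "read with two d's" is drake's readd(); loadd() is its companion, and clean(), which comes up next, invalidates the cache. A small hedged example, reusing the hypothetical target names from the plan sketched above:

```r
library(drake)

# Return a single target from the .drake cache, e.g. inside an R Markdown report
rec   <- readd(preprocessed)   # the recipe target
train <- readd(training_set)   # the training data target

# loadd() assigns one or more targets directly into the current environment
loadd(tuned_boosted_trees)
tuned_boosted_trees

# clean() invalidates the cache so that the next make(plan) rebuilds everything
# clean()
```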
And then of course, if you want to clean the cache to force all the targets to become outdated and rebuild everything from scratch, nice and cleanly, you can do so with drake's clean(). This forces everything to become outdated, and then you can run your project again: it will build all the targets again and relaunch everything from the start. This is quite useful to make sure that everything is working well, because targets only become outdated when drake can see a change. For example, if I now go and modify get_data, the problem is that if I just modify it and rerun my plan, drake won't know that I changed my get_data function, so drake won't know that the data is outdated. That's also why it's running much faster now, and I think I know why: just before the video I recompiled the version of my package with sample equal to 100, and if I recover this file and say yes... you see, it's back. I didn't save between the two takes, and that's why the model is now running much faster.

This was just a very short video. I will write a blog post as well, because I think there are a lot of important details, and if it's written down it will be easier to understand. I will link to Miles's blog post, which is really nice, you should read it. This code is in a GitHub repository, and I'll also link to that: install the package, try it out, see if it works on your machine. It works on mine, but I'm also developing on this machine, so who knows, maybe that's why it works; it would be interesting if you could test this as well.

Look into drake. drake is the kind of tool where, until you've used it, you don't realize how much better it is to use it than not to use it; it's the kind of thing you really have to get into to see the benefits. It's a bit difficult to convince you with this video alone that it's useful, even if, as you've seen, there are many benefits to running your data science pipeline with drake. I will put the different links in the description. Let's take a look at the model now that it's finished; I don't remember what my target is called... really curious to see tuned_boosted_trees, and there it is. You don't see it in my console, but it is a very nice table with the results for each fold. Let me know what you think, and as usual, leave a like, leave a comment if you found this interesting. See you next time, and most importantly, stay safe.
Info
Channel: Bruno Rodrigues
Views: 2,262
Rating: 5 out of 5
Keywords:
Id: yNHwM3N8bAQ
Length: 16min 10sec (970 seconds)
Published: Mon May 11 2020