How to Use Scikit for Machine Learning with Airflow!

Video Statistics and Information

Captions
Hey y'all, the Data Guy and the Data Dog are here today with a very special video on how to get started with ML and Airflow using scikit-learn. Scikit-learn — and it's not "psychic learn," it's "science kit" learn — is basically a public, open source way to create and run machine learning workflows. It's Python based, so it's just a package you download, and it has a bunch of different models that you can use to run predictions on your data. It's a really good way to get started with ML because it doesn't require a ton of work building your own models. They even have clean data sets, which we'll actually be using here, that you can use to just test out machine learning models, so you don't have to worry about the process of cleaning your own data — which, as anyone who's done machine learning can tell you, is probably the hardest part. There's that old metric that it's 80% cleaning the data and 20% actually working with the data.

So what we're going to show you today is: I want to get started with scikit-learn, I want to build a machine learning pipeline — ingest some data, do some light transformations, then create a model, train it, and run some predictions with it. That's exactly what we'll be going through. I'll show you how to quickly ingest some data from scikit-learn (that part can be subbed out for your own data if you want), run some feature engineering, train a model on it, and then use that model to get some predictions — in this case for the median house value in California. That's the overview, so without further ado, let's get started.

Before we actually start writing the code, I just want to quickly outline what this DAG will look like when it's complete. The first thing we're going to do is extract that data I was talking about from scikit-learn, which is simply a scikit-learn method call. Then we're going to save that clean data to S3 so we can use it later on. Then we're going to do some feature engineering on that data and save the resulting DataFrame into a local S3 file system. Then we're going to train the model on that feature-engineered data set, and combine the trained model with the feature DataFrame to actually predict the median house value in California. The final step is to save those predictions in that same S3 file system. This DAG really heavily leverages the ability to set up a local S3 file system using Docker, and the reason that's easy is because I'm using an Astro image, so I'm already running a Docker Compose setup that spins up all my Airflow components, and it will also just spin up an S3 file system. So now that you've got the big picture, let's get into the weeds and start coding.

The first thing we'll do, as we do with all our DAGs, is import all our packages. We're going to import the dag decorator, datetime, and the Astro SDK aql decorator. The Astro SDK basically makes it easier to do things like "hey, I just want to extract some data from an API" — it'll automatically turn the result into a DataFrame — and it also allows you to pass SQL and Python data sets between each other, so I can run pandas on a SQL data set, and Python on a SQL data set, without needing to do the transformations in between the two. It just eliminates a lot of the boilerplate you'd typically have to write. Then we're going to use the Astro File object, which is just a way to create a reference to a file path and load it directly, and the Astro SDK pandas DataFrame, which is an Astro-SDK-specific version of a pandas DataFrame that eliminates the need to read a data set in and transform it into a DataFrame yourself — the aql.dataframe decorator automatically ingests anything brought in as a DataFrame, so it's an easy way to pass DataFrames between tasks, because normally you wouldn't be able to do that. We'll also import os, because we're interacting with the local operating system of our Airflow installation to reach that local S3 file system.
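For reference, here's a minimal sketch of what those imports might look like, assuming the astro-sdk-python package (the exact module path for the Astro SDK DataFrame has varied between SDK versions):

    from datetime import datetime
    import os

    from airflow.decorators import dag
    from astro import sql as aql                     # Astro SDK decorators: aql.dataframe, aql.export_file, ...
    from astro.files import File                     # lightweight reference to a file path (local, S3, GCS, ...)
    from astro.dataframes.pandas import DataFrame    # Astro SDK-aware pandas DataFrame for passing data between tasks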
So now it's time to actually build our DAG. Our DAG parameters are pretty simple: we're creating a DAG called astro_ml, with no schedule interval, whatever start date you want, and catchup set to False — just defining it with the Airflow dag decorator. After we create the DAG, we're going to create a couple of variables that we'll use later on for referencing where we're storing this data, just to save us from writing out the full S3 file path for each of our files: a data bucket, a models bucket, a model ID (which is just the current datetime, so models are timestamped and don't overwrite each other), and the model directory, which joins the models bucket and the model ID — so it points directly at our most recent model.

Once we've done that, we can start creating our first task. Our first task is pretty simple: all we're doing here is extracting the housing data from the sklearn datasets. Like I was talking about before, sklearn provides some sample data sets that you can use for machine learning, and California housing is one of them. Here we're just importing that method and returning its output as a DataFrame via the .frame attribute. This takes the result of fetch_california_housing and turns it into a pandas DataFrame, and because we're returning it from an aql task, the output of this task is a pandas DataFrame of California housing data. That's the start of our journey.
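As a rough sketch — the bucket names and model-ID format here are placeholders rather than values confirmed in the video, and the @dag wrapper that ties everything together appears in the wiring sketch at the end — the path variables and the extract task might look like this:

    # Hypothetical locations on the local S3 file system spun up next to Airflow
    DATA_BUCKET = "s3://astro-ml/data"
    MODELS_BUCKET = "s3://astro-ml/models"
    MODEL_ID = datetime.now().strftime("%Y%m%d%H%M%S")   # timestamp so model artifacts don't overwrite each other
    MODEL_DIR = os.path.join(MODELS_BUCKET, MODEL_ID)


    @aql.dataframe()
    def extract_housing_data() -> DataFrame:
        """Pull scikit-learn's sample California housing data as a pandas DataFrame."""
        from sklearn.datasets import fetch_california_housing

        # .frame holds the features plus the MedHouseVal target in a single DataFrame
        return fetch_california_housing(as_frame=True, download_if_missing=True).frame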
Now let's start building some features, and this next task is a bit of a doozy: build_features, which is where we actually do the feature engineering. Here again we're using that aql decorator, so the task ingests a DataFrame and returns one. It takes in the raw DataFrame as well as the model directory where we'll want to save things after we create them. Inside the task we import StandardScaler from sklearn to do some preprocessing, pandas as pd, dump from joblib, and S3FileSystem. joblib is what we'll use to dump objects, and S3FileSystem lets us interact with the S3 file system that we spun up in parallel with our Airflow cluster — mini-cluster, I guess — in the Docker setup.

Then we create a local instance of that S3FileSystem — kind of like a hook — so we can pull from it directly later, and we set a target value. In this case we're looking for the median house value, so we're going to drop that from the existing DataFrame to create a clean data set and then use it to predict what the median house value is. You can see we're dropping that target column, and also creating a DataFrame that is strictly the target column — so one that is just the median house value, and one without it. Then we use the StandardScaler to fit and transform the feature DataFrame — that's the feature engineering, so we can work with the key features of this data set — and we save that fitted scaler for later monitoring and evaluation. Finally, we add the target back in as a column after the feature engineering is done: you can't do the feature engineering with the target column in there, so you take the target column out and add it back in afterwards, which is what's happening down here. Then we return that feature-engineered DataFrame — it lands in the local S3 file system and gets passed along as a DataFrame for our next task to use.

Now that we've done that, let's create the task that trains our model. Here we're taking that feature_df we just created in the last task and loading in our model directory again, so we know where to save and load our model. Instead of the scaler, we're now importing the RidgeCV model — a linear regression model provided by scikit-learn that we're going to use to figure out the median housing values in California (probably super high). We also import numpy, which we'll use for the alphas, joblib again so we can dump the results back into that S3 file system, and of course S3FileSystem, to create another hook into it. We set the target again within the context of this task, and then we create our model: RidgeCV, passing in candidate alphas built with numpy's logspace — this is basically how it finds the parameters of the model. Then we fit that model to the feature DataFrame while dropping the target, so the model trains on the clean features, with just the target column as the thing it's predicting — that way, down the line, you can check whether the predictions were actually accurate. Then we set the model file URI so we know where to save this file, dump the model into the local S3 file system, and return that URI so our next task can reference it.
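Under the same assumptions as the sketches above — s3fs for the local S3 file system, joblib for persisting artifacts, and 'MedHouseVal' as the target column in the California housing frame — these two tasks might look roughly like this; the artifact file names and the RidgeCV alpha range are assumptions, not values from the video:

    @aql.dataframe()
    def build_features(raw_df: DataFrame, model_dir: str) -> DataFrame:
        """Scale the raw features and stash the fitted scaler next to the model artifacts."""
        import pandas as pd
        from joblib import dump
        from s3fs import S3FileSystem
        from sklearn.preprocessing import StandardScaler

        fs = S3FileSystem()                # assumes the local S3 endpoint/credentials come from the environment
        target = "MedHouseVal"             # median house value -- the column we want to predict

        X = raw_df.drop(columns=[target])  # the target can't go through the scaler with the features
        y = raw_df[target]

        scaler = StandardScaler()
        X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

        # Keep the fitted scaler around for later monitoring and evaluation
        with fs.open(os.path.join(model_dir, "scaler.joblib"), "wb") as f:
            dump(scaler, f)

        # Add the target back so downstream tasks receive one feature-engineered DataFrame
        X_scaled[target] = y.reset_index(drop=True)
        return X_scaled


    @aql.dataframe()
    def train_model(feature_df: DataFrame, model_dir: str) -> str:
        """Fit a RidgeCV regressor on the engineered features and dump it to the local S3 file system."""
        import numpy as np
        from joblib import dump
        from s3fs import S3FileSystem
        from sklearn.linear_model import RidgeCV

        fs = S3FileSystem()
        target = "MedHouseVal"

        # RidgeCV picks the best regularization strength from the candidate alphas (range is an assumption)
        model = RidgeCV(alphas=np.logspace(-3, 1, 30))
        model.fit(feature_df.drop(columns=[target]), feature_df[target])

        model_file_uri = os.path.join(model_dir, "ridgecv.joblib")
        with fs.open(model_file_uri, "wb") as f:
            dump(model, f)

        return model_file_uri              # downstream tasks load the model back from this path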
Now it's finally time to generate our predictions. In this predict_housing task, we again ingest that feature DataFrame, because we need the clean data set to actually run the trained model on, as well as the model file URI so we can reference the model file we just created. We import joblib's load so we can read files from that S3 file system, create another hook instance, and load the model file we saved in the prior task — here's where we actually open it. We set our target value again, and now we run a prediction: loaded_model.predict, pointed at the feature DataFrame with the target dropped, because we can't give the model the answers to its own predictions. Then we save those predictions to the feature_df in a preds column. This is kind of the end — well, not totally: we still need to wire up the task dependencies and save these files — but this is what's actually creating the predictions and saving them to the DataFrame. You can take this DataFrame, open it up, and see whether these values were similar to, better than, or worse than the actual values in the data set — really cool stuff. Once we have that, we'll save it to our local file system. What I'll also do sometimes is add a print of the feature_df, so you can see what the predictions are without having to open the file system manually.

Once we're done with that, it's time to actually define our tasks using the TaskFlow API. The first task we define is extracting the housing data, stored as extract_df. In parallel with the second task, where we're actually building the features, we're also going to save this raw DataFrame to our local S3 instance so it can be referenced down the line — it's just good to have as a reference, so I can compare the extracted clean data against the predicted values. After that, we define our feature_df: this is where we build the features, and you can see we're taking the output of extract_housing_data, stored as extract_df, and passing it into build_features, along with the directory path we defined so it knows where to save and pull the models from. Then train_model gives us the model_file_uri — training the model creates a model file that we then use to run the predictions — and it again ingests the feature_df that we returned from build_features. So once the features are built, we pass them into model training and give it the model directory again. Finally, we have predict_housing, which takes that feature DataFrame and that model file URI and uses them to run the scikit-learn model on our housing data and save the predictions. The last step is actually saving those predictions, and that's aql.export_file — another Astro SDK operator that basically says: choose an input, choose an output, and it'll take care of the rest. You don't need to worry about whether it's S3 to Snowflake or a local S3; you can just use the export_file function — sorry, operator. Here it takes the prediction DataFrame we produced and saves it to that local S3 file system for posterity.
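Continuing the same sketch, the prediction task and the DAG wiring could look roughly like the following. The task ids, output file names, and export parameters are assumptions based on the description, and recent Astro SDK releases may expose aql.export_file as export_to_file:

    @aql.dataframe()
    def predict_housing(feature_df: DataFrame, model_file_uri: str) -> DataFrame:
        """Load the trained model from the local S3 file system and add a prediction column."""
        from joblib import load
        from s3fs import S3FileSystem

        fs = S3FileSystem()
        target = "MedHouseVal"

        with fs.open(model_file_uri, "rb") as f:
            loaded_model = load(f)

        # The model never sees the answer column while predicting
        feature_df["preds"] = loaded_model.predict(feature_df.drop(columns=[target]))
        print(feature_df)                  # handy for eyeballing predictions straight from the task logs
        return feature_df


    @dag(schedule=None, start_date=datetime(2023, 1, 1), catchup=False)
    def astro_ml():
        extract_df = extract_housing_data()

        # Keep a copy of the raw data for later comparison against the predictions
        aql.export_file(
            task_id="save_raw_data",
            input_data=extract_df,
            output_file=File(path=os.path.join(DATA_BUCKET, "raw_housing.csv")),
            if_exists="replace",
        )

        feature_df = build_features(extract_df, model_dir=MODEL_DIR)
        model_file_uri = train_model(feature_df, model_dir=MODEL_DIR)
        pred_df = predict_housing(feature_df, model_file_uri)

        # Persist the predictions to the local S3 file system for posterity
        aql.export_file(
            task_id="save_predictions",
            input_data=pred_df,
            output_file=File(path=os.path.join(DATA_BUCKET, "predictions.csv")),
            if_exists="replace",
        )


    astro_ml()

With the Astro project described earlier, a file like this would sit in the project's dags folder and run against the Airflow components and local S3 file system that Docker Compose spins up.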
Once that's all done, we just add one line of code at the bottom — calling astro_ml() — and boom, you've got a fully functioning Airflow machine learning DAG. If we go back into our Airflow UI, we can watch it quickly run. I just ran it there, and under predict, because I added that print, it will show output — in this case it's just the path of the actual file, so if you want to look at the contents you can of course add your own print of the DataFrame to print it out for you. Here it's just printing the model URI — sorry, the path — and you can go to your local S3 file system, pull that out, and look at your predictions. And that's really all I've got for you today. I really hope you learned something — I know I did in the process of making this — so me and the Data Dog are out. Have a good one, y'all. Happy coding!
Info
Channel: The Data Guy
Views: 649
Id: CYVyp52M4x0
Length: 14min 35sec (875 seconds)
Published: Mon Jul 10 2023