Data Science Workflows using Docker Containers

Video Statistics and Information

Captions
Thank you, everybody. My name is Aly Sivji, and I'm going to be talking to you about data science workflows using Docker containers. I tried my best to put as many buzzwords as I could into the title, and I did a pretty good job, I think. Just a little FYI: I tweeted a link to the slides, and there will also be links at the end, so if you want to follow along you know where to look.

A little about me before we get started. I just started my first Python job two weeks ago; I'm working as a software engineer at a company called Analyte Health, where we're trying to change healthcare delivery models on the laboratory testing side. I have a couple of co-workers here, so thanks for coming out and supporting me, it means a lot. I'm also a grad student at Northwestern studying medical informatics. I like Python, I like data, and I love Star Trek.

So what are we talking about today? I'm going to start by giving an overview of data science for some context, then we'll learn about Docker and containers, and then we'll walk through some Docker-based data science workflows.

What is data science? Data science is about extracting value from data; it's about turning your data into actionable insights. That can mean many different things: we can visually explore data, build predictive models, classify observations into groups based on similar characteristics, or build data-driven applications. First and foremost we have to remember that data science is a science. We have a question we want to answer and a hypothesis about why something is happening, and our output is not just our findings, it's also the steps we took to get to our result. What this really means is that reproducibility is really important: our analysis needs to be repeatable.

Reproducible workflows in data science make it easier to communicate results; we can use our methodology to tell a story when we present our findings. Reproducibility also makes it easier to defend decision-making: if we produce the same answers over and over again, it's less about gut instinct and a lot more about data-driven insights. And finally, if we have reproducible workflows, other people can go in and audit our work to ensure that it's correct and that it complies with regulations. At my last job I worked with healthcare data, and one of the things we needed to make sure of was that we never let patient data leak into any of our models.

Here's a good overview of the data science process. We start by asking an interesting question: what are we trying to do? The next step is to gather data to answer that question; this phase will probably involve transforming data from the format we're storing it in into a format we can actually use for analysis. Next we do some exploratory data analysis (EDA), then we start building a model, fitting it, training it, and tuning it, and finally we communicate our results through a presentation, a blog post, or a paper.

Python has a really great ecosystem of data science tools that make it easy for us to do analysis, and Jupyter is sort of the star of the show. Jupyter is the data science front end we can use to capture our process. Jupyter notebooks are documents that contain live code, equations, visualizations, and explanatory text. Here's a picture I got from the Project Jupyter website: you can see we have a block of text, some formulas, a live code block, interactive sliders, and some visual output. We can use all of these as building blocks when we document our data science methodology, and we can also pass these notebooks around when we want to share results with colleagues.
But not so fast: Jupyter notebooks suffer from one problem, and that's the "it works on my machine" problem. In order for our notebooks to work, we need the data plus all the dependencies that were used to produce the results. This is where Docker comes in; Docker is going to help us solve the "it works on my machine" problem.

Docker is a platform that allows us to package and run applications in loosely isolated environments that we call containers. We can use the shipping container analogy to understand how Docker works. Shipping containers standardized the logistics industry: it doesn't really matter what's inside these containers, we can send them by boat, by train, or by truck, and we have the infrastructure at all these various facilities to handle standardized containers. With Docker we can package our code plus everything we need to run the code in an isolated container, and since these software containers are standardized, we can pass them into different environments without having to worry about whether they're going to run or not.

This might sound really similar to virtual machines, but there are a couple of key differences. The first is that containers run natively on the host machine's OS; that is, they share the same kernel as the host machine. Virtual machines, on the other hand, run on a hypervisor, and the hypervisor provides each VM with virtual access to the host's resources. What this means is that containers don't need a full-scale operating system, so they're a lot more lightweight and have better performance characteristics.

We can use Docker for many different things, such as streamlining our development workflows, doing continuous integration or continuous deployment, building out microservices, and of course doing reproducible data science.

Here's a good overview of the Docker architecture. We have the Docker client, which is where we enter commands to interact with Docker. Those commands go to the Docker host, which can be running either on your local machine or on a remote machine. The host needs to be running the Docker daemon; the daemon accepts commands from the client, manages Docker objects like containers and images, and can also communicate with other Docker daemons. Then we have the Docker registry, which is where we store Docker images. Docker Hub is like GitHub: it's the public registry, and there you'll find a lot of public Docker images for things like Linux distributions and databases, and Python has a bunch of images as well.
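For example, pulling a public Python image from Docker Hub and starting a throwaway container from it looks something like the following sketch (the tag here is just an example):

    # Pull the official Python image from Docker Hub
    docker pull python:3.6

    # Start an interactive, throwaway container from that image and drop into the Python REPL
    docker run -it --rm python:3.6 python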
An image is a frozen snapshot of a container. Each image consists of a set of read-only layers that are stacked on top of each other, and each layer is the set of differences from the layer below it. Containers are the runtime instances: when we create a container, we add a thin read/write layer, called the container layer, to the top of the image layer stack, and any time we add a file, modify an existing file, or delete a file, that change happens in that top read/write container layer. This is really similar to the principles of object-oriented programming, where images are like classes, layers are akin to inheritance, and containers are runtime instances, just like objects are runtime instances of a class.

We can create Docker images two different ways. The first way is by freezing a container using the docker commit command, which takes that top read/write container layer and makes it read-only; we can then take this new image and use it to initialize new containers. The more preferred way is to use a Dockerfile and the docker build command. A Dockerfile is a file that contains the commands used to create an image, and we can automate our build process using docker build.

Here's a list of common Dockerfile commands. FROM sets the base image, i.e. what image we're building off of, and here we can use a repository name from Docker Hub. LABEL sets metadata. COPY copies files and directories into the image. ENV sets environment variables and WORKDIR sets the working directory. RUN executes shell commands in a new layer and puts that layer at the top of the image stack. Just be aware that any Dockerfile command creates a new layer, so if you want to install, say, Jupyter and pandas, and you write RUN pip install jupyter on one line and RUN pip install pandas on the next, that creates two separate layers. Docker best practices say one of the things we want to do is minimize the number of layers in our images, and we can do this by chaining commands together like we would at the command line: a single RUN with pip install jupyter, a double ampersand to chain, a backslash to continue on the next line, and then pip install pandas, and we only get one layer in the image.
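As a rough sketch, here is that difference in a Dockerfile (using the packages from the example above):

    # Two RUN instructions -> two separate image layers
    RUN pip install jupyter
    RUN pip install pandas

    # Chained into a single RUN instruction -> one layer
    RUN pip install jupyter && \
        pip install pandas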
We can use ENTRYPOINT and CMD to define what our container should do at runtime, i.e. what shell command should run when a container launches. There are two ways to write these: the shell form, which is pretty much what we'd type at the terminal, and the exec form, which is the same thing except it's formatted a little differently. If you're just starting off, I highly recommend you just use CMD. CMD and ENTRYPOINT do interact really well together, but it's a bit more of an advanced concept, so try to walk before you start running.

So here's our hello world Dockerfile, and now we're going to go into the terminal and play around with some stuff. Here I have two files: my Dockerfile and a hello_world.py. Let's take a look at hello_world.py first; it just prints "hello world". Now let's check out the Dockerfile. As I mentioned before, every time we enter a command we're creating a new layer. I'm building off the Python 3.6.3 image, which was updated yesterday, thank you; we set our working directory, copy the contents of the current directory, so the Dockerfile and the hello_world.py file, into that working directory, and then once the container starts we just run Python on the hello world script.
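A minimal sketch of a Dockerfile along those lines, assuming the file layout described above:

    # Build off the official Python 3.6.3 image from Docker Hub
    FROM python:3.6.3

    # Set the working directory inside the image
    WORKDIR /app

    # Copy the current directory (Dockerfile, hello_world.py) into the working directory
    COPY . /app

    # Run the script when a container starts (exec form of CMD)
    CMD ["python", "hello_world.py"]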
We build using the docker build command: we tag our image, calling it hello-world, and we specify that our Dockerfile is in the current directory, which is what the dot is for. And there we go, we have our image built. We can look at our Docker images using the docker images command, and you can see here I built hello-world seven seconds ago. Now I can use this image to initialize a new container, so let's go ahead and do that: we do docker run and then the image name, and it goes into the container and runs that script. You can see it ran Python on the hello world script and printed "hello world".

We can take a look at all running containers using the docker ps command, but you're not going to see anything here since this container has stopped. We can look at all stopped containers with the -a flag. You can see we have a container ID, built off the hello-world image; it executes python hello_world.py, and it has an auto-generated name. We can use that name to start the container up again: docker start with -i to make it interactive and -a to attach standard in, standard out, and standard error, then the container name, and there it goes, it prints it again. Or we can use the container ID, which is just that hash you see, with docker start, and there it goes. Pretty simple, right? Here I'm just walking through the steps I used to build the image, create a container, and then restart that container.

You saw me use the docker run command; that's the command you're going to use the most, so let's explore it in a little more detail. We can use the -d flag to run in the background, that is, in detached mode. As I mentioned before, we can attach standard in, standard out, and standard error, we can make it interactive with -i, and we can name our container with the --name flag. If we ever want to get into the shell of our container, we can pass in /bin/sh or /bin/bash after the docker run command to get to that prompt.

Just be aware that any time we delete a container, we delete all the data inside it: everything in that top read/write container layer goes away. So we should think about different ways to manage our data. The easiest way is the docker cp command, which copies files into and out of our containers, but that becomes tedious. The preferred way is to use a data volume: we mount a local folder as a directory inside our container, and any changes we make to that directory inside the container show up in our local folder, since it's only mounted there. When we create a container with docker run, we specify the local directory we want to mount and the container path to mount it to with the -v flag. We can also add a VOLUME command to our Dockerfile to specify where the mount point is, but this doesn't really do anything and isn't necessary; we only do it to be explicit about our workflow.

We can connect to the outside world fairly easily from containers, but we have to set up port forwarding to connect to the inside of containers. Just like before, when we create a container we use the -p flag to specify the host port we want to forward and the container port we wish to expose. And like before, we can also add an EXPOSE command to our Dockerfile to be more explicit about our workflow; it doesn't do anything by itself, but it's best to be explicit.
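Putting those flags together, a typical invocation might look something like this (the image name, ports, paths, and container name are placeholders):

    # Run a container from my-image, forwarding host port 9999 to container port 8888,
    # mounting the current directory at /app, and giving the container a name
    docker run -p 9999:8888 -v "$(pwd)":/app --name my-container my-image

    # Drop into a shell inside the image instead of running its default command
    docker run -it my-image /bin/bash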
Here are some more Dockerfile best practices. I'll stress this one more time: be explicit about your build process. It's a lot easier to figure out what you did two or three weeks later if you have that instruction there. Our containers should also be stateless, we should avoid installing unnecessary packages, each container should only have one concern, and we should minimize the number of layers inside our images. If you look at older Dockerfiles you'll see a MAINTAINER command, but that's been deprecated, so try to use LABEL going forward.

Let's slowly review the Docker build process and the container lifecycle one more time, just to make sure everything sticks. We have a Dockerfile, and we can build an image from it using the docker build command. From that image we can create a container with docker run. I didn't mention kill yet, but if you want to kill a running container you can do docker kill with the container name, and that sends a signal to the process inside the container. We already talked about starting a stopped container, but if we ever want to delete a container we can do docker rm with the container name, and if we want to delete the image we can do docker rmi with the image name.
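In terms of commands, that lifecycle looks roughly like this, reusing the hello-world names from earlier (the container name is whatever docker ps -a shows):

    docker build -t hello-world .          # build an image from the Dockerfile in this directory
    docker run hello-world                 # create and start a container from the image
    docker kill <container-name>           # send a kill signal to a running container
    docker start -i -a <container-name>    # restart a stopped container interactively
    docker rm <container-name>             # delete the container
    docker rmi hello-world                 # delete the image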
Here's a list of Docker commands for containers; I put stars next to the ones I use more than others, and here's the same thing for images. Some more tips and tricks: smaller images are better, so only install the things you need, and maybe look into other Linux distributions like Alpine Linux, which is only 5 megabytes total. We can also mount symbolic links as volumes inside our containers. And if you're ever running a process inside a container, make sure you bind it to the IP address 0.0.0.0; if you use 127.0.0.1, that's the loopback interface and you won't be able to connect to the process from outside the container. I unfortunately had to learn that the hard way.

So now, the reason we're all here: we're going to learn how to do some data science with Docker. Just be aware that these are suggestions; you can do these things a billion different ways, so nothing here is really set in stone. Here's a template for you to use.

For workflow number one, imagine you have a Jupyter notebook you want to share. This could be a thesis, a project deliverable, or an analysis you want to send to a colleague, but Jupyter suffers from that problem we talked about before, the "it works on my machine" problem. So why don't we create a Docker image with the libraries, data, and notebooks required to reproduce the calculations, and push that image up to Docker Hub?

Let's go back to the terminal and play around with this. Here you see we have a Dockerfile, a data folder, and an iris analysis Jupyter notebook. We'll go ahead and take a look at that Dockerfile. I'm building off the Python 3.6.3 slim image, setting my metadata, setting my working directory, and copying all the contents of my current directory, so the Jupyter notebook and the data folder, into that working directory. Then I pip install some libraries: numpy, pandas, seaborn, scikit-learn, and Jupyter. We expose port 8888, to be explicit about it, and when the container launches we run jupyter notebook, setting the IP address and the port; since we're inside a container we don't need a browser, and since we're running as root inside the container we need to turn on the flag that allows that.

I'm not going to build this because it would take some time to download the dependencies, but like a cooking show, I did do this a little earlier, so we have our workflow-number-one image ready. We do docker run, connecting the ports: I'm going to make it 9999 on my local machine and forward to the Jupyter instance running on 8888 inside the container, and then give it our image name. And there you go, it loaded up this process inside the container. Now let's copy that token and go to port 9999. You can see we have our iris analysis notebook and our data folder, so let's open it up just to make sure everything looks good. On my local machine I have Python 3.6.1 (yes, I know, I'm a dinosaur), but when I reload it, it's Python 3.6.3, like we specified inside that container. Now we can go on and load our data, do a little exploratory data analysis, and at the bottom I have some scikit-learn code, so let's make sure that also runs: here I'm fitting an SVC on the iris dataset, and you can see everything works like normal. Everything is inside the container, so you can just pass this around and it shouldn't be a problem. Here I'm just walking through the steps I did.
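Based on that description, the workflow-number-one Dockerfile would look roughly like the following sketch (the label value is a placeholder and the pip line is left unpinned for brevity):

    FROM python:3.6.3-slim

    LABEL maintainer="your-name-here"      # metadata; value is just a placeholder

    WORKDIR /app

    # Copy the notebook and the data folder into the image
    COPY . /app

    RUN pip install numpy pandas seaborn scikit-learn jupyter

    EXPOSE 8888

    # Listen on all interfaces, skip the browser, and allow running as root inside the container
    CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]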
Now that we have our image, let's go ahead and upload it to Docker Hub. This process is really simple: we take our image name, docker push the full image name, and it gets put into our repo. Then we can give our users instructions to docker pull from the repo and use the instructions from the previous slide to initialize the container and restart it as required.

Those of us who work on a team know how hard it is to set up a standardized development environment across everybody on the team, and if you've ever accidentally updated a dependency and had everything break, you know how important it is to keep all your development environments isolated. This is where workflow number two comes in; I call this one the data science project. What we're going to do is create our development environment inside a container and then mount a data volume so we can work with persistent data on our local machine. The benefits are that we can separate out projects, a new employee being onboarded can just spin up a container, and if we ever have to upgrade dependencies, say pandas releases an updated version, we can check that new version in an automated testing pipeline. We're all testing, right? Of course.

Let's go back to the terminal and play around with this. Here I just have a Dockerfile, so let's take a look at it. This time I'm building off the Miniconda image; Miniconda is just Python with the conda package manager, and conda is really popular for data science, so I thought it would be fitting for this example. We add some metadata, set our working directory, install some libraries, make sure we clear out the cache, expose the port, and create a mount point, and then as before, when the container launches we run our Jupyter notebook command like before.

I did do this a little earlier as well, so you can see it right here. We'll do docker run with -p to connect our ports, let's use 9999 again, but this time we're also mounting a directory: I have a folder on my local machine where I store some of my data science work, so let's mount that, and I'm going to mount it to /app. See, that's why it's good to be explicit, just in case you forget. Then we need our Docker image name, and we should be good to go. You see I have this token; let's open that up in a browser, and you can see we have the same directory structure as in that folder on my local machine. So you can go ahead and open up your work, and maybe share it on a network drive so everybody is on the same code. This slide is just walking through the build process, initializing the container, and restarting the container.
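A rough sketch of that workflow-number-two Dockerfile, assuming the continuumio/miniconda3 base image, an /app mount point, and an illustrative package list:

    FROM continuumio/miniconda3

    LABEL maintainer="your-name-here"      # placeholder metadata

    WORKDIR /app

    # Install the analysis libraries and clear the package cache to keep the image small
    RUN conda install -y numpy pandas scikit-learn jupyter && \
        conda clean -a -y

    EXPOSE 8888

    # Document the mount point for the local project folder
    VOLUME /app

    CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]

    # Example run (paths and image name are placeholders):
    #   docker run -p 9999:8888 -v /path/to/your/work:/app <image-name>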
Workflow number three is what I call the data-driven application, and this is going to simplify our deployment process. We've all had to manually deploy apps before: we run our tests in dev, if it works we move to prod, then we run our tests again in prod just because we're not sure. Not really the most efficient way of doing things. But with this, we can pass around our containers and know that they're actually going to run. For my workflow, I have data on my local machine, and I'm going to create and package a dashboard inside my container, so that every time I start the container I can go to a URL and look at my dashboard.

Let's go back to the terminal. Here I have my plot_time_series script; this is just a Dash script, you can take a look at it later, it's all on my GitHub. I have a requirements file, so we can take a look at that too: these are all the requirements used to produce the dashboard. And then we have our Dockerfile. Here I'm building off the Python 3.6.3 Alpine image, setting some metadata, setting my working directory, copying the contents of my current directory into that working directory, pip installing from the requirements file I just put into the image, exposing the port, and creating a mount point at /app/data. This time I'm going to use ENTRYPOINT to say Python is my default executable, and I pass in the plot time series script as the parameter to run that one script. If I had another dashboard, I could pass that script in as the parameter instead and run a separate dashboard if I so choose.

Let's go ahead and get this one going. If you look at docker images, this one has been pre-built. I'll just show you one more thing: I've been generating data since I got here, every two seconds I think it's generating a number between 1 and 4, so we can take a look at that, and you can see it right there. Now let's run this container to make sure it works: docker run, connect the ports (this one will just use the standard ports), mount my directory (this is the local path and this is the container path), and why don't we give this one a name, we'll call it my-dashboard, and I'll just copy in our image name. There we go, we have our dashboard up and running, so let's go to that URL. You can see we have a live dashboard that's updating with live data. Let's actually make sure this is working: I have another script that generates data every half a second, and you can already see the points are a lot closer together. So instead of just a file on your local machine, you could point it at something like a database somewhere, and it will update the data on a live basis, whenever you set your event timer to fire. This slide is walking through the process like before: building, initializing, and restarting.

With workflow number four, which I call the data science API, I'm going to make data scientists into data engineers, because usually when you build a model you're waiting for somebody else to deploy it into production, so why don't you just do it yourself? What we're going to do is build a model and pickle that model, then create an API around it; the API takes the input parameters we send in and outputs a prediction. Let's go back to the terminal. Here I have some Jupyter notebooks, and I'll show you one really quickly: I'm pulling in my dependencies, loading the dataset, splitting it into test and training sets, fitting a k-nearest neighbors (KNN) model, and pickling my model. Then I'm just playing around to show how to load the pickled model back in; I took that part and put it into a script that I pushed into the container. We also have a requirements file, and we can take a look at that: those are all the things we need to produce our API.
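A minimal sketch of what that model-building notebook does, per the description above (the pickle file name and the default KNN settings are just examples):

    import pickle
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # Load the iris dataset and split it into training and test sets
    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

    # Fit a k-nearest neighbors classifier on the training set
    knn = KNeighborsClassifier()
    knn.fit(X_train, y_train)

    # Pickle the trained model so the API container can load it later
    with open("knn_model.pkl", "wb") as f:
        pickle.dump(knn, f)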
Now let's go to the Dockerfile and see what's going on there. Like before, I'm building off the Python 3.6.3 image, setting my working directory, copying the contents of my current directory into that working directory, pip installing my requirements (I'm sure you see a pattern here by now), exposing the port, and doing the same thing with ENTRYPOINT and CMD as I was doing before. So we'll go back to the terminal and run this container so we can get out of here and get some drinks. We do docker run and connect our port, I think this one is 5000, let me just double-check that, yes, and now my process is running. Let's go back to one of the Jupyter notebooks I had: in this one I'm just using requests to send it a sample data point and get a prediction. I just made up a data point; I expected class one, and I got one, so we can see it's working. This slide is just walking through the commands like we did before.

That's pretty much it. Now you're probably wondering what your next steps are. On the Docker website there are a couple of really good resources on how to install Docker, and their getting-started guide is great. If you're into Pluralsight, check out Nigel Poulton's Docker Deep Dive; it's a really good resource. Surprisingly, CenturyLink has a good resource as well. I also want to make everybody aware of an open-source project called Pachyderm: Pachyderm allows you to pipeline and containerize your machine learning pipeline, so you can have your cleaning in one container, your fitting in another container, and your predictions in another container, and pipeline those all together.

That's pretty much it for me. Live long and prosper, and I will take your questions. [Applause]
Info
Channel: Chicago Python Users Group
Views: 38,872
Keywords: python, chicago
Id: oO8n3y23b6M
Length: 32min 10sec (1930 seconds)
Published: Wed Nov 15 2017