Docker for Data Scientists, Strata 2016, Michelangelo D'Agostino

Captions
My name is Michelangelo D'Agostino and I lead the data science team over at Civis Analytics, and today I want to tell you all about Docker. In particular, I'm hoping that by the end of the talk I'll have convinced you to make Docker a regular part of your data science toolkit if it isn't already. But more than just arguing that abstract point, I've really designed this talk to be a tutorial. Maybe you've heard about Docker and containers over the last year or two but you just haven't had time to dig in and figure out what they're all about. I've designed this talk so that you can pretty much walk out of here knowing enough about the Docker toolchain and enough of the Docker syntax to just start using Docker as part of your data science work.

But before we get into that, I want to say a few things about the rapid growth of Docker and this age of rapid tool growth that we're living in. As data scientists we live in a world of constant change in the landscape of tools available to us. It seems like every day there's a new package, a new framework, a new language, and I think that can give us a sense of tool fatigue: is this new thing really worth investing my time and effort in learning, or is it just some fad, and something better is going to come along in the next day or two? I had this fuzzy feeling that that was the world we're living in, but I didn't have any actual data to back it up, and as a data scientist that bothered me. So I spent a little time last week with Google search trends to see if I could quantify it a bit.

Let's start with deep learning. If you've been reading Strata abstracts, if you've been reading the news lately, if you've been anywhere other than under a rock, you'll know about the rapid rise of interest in deep neural networks, or deep learning, and how they're changing everything from image recognition to speech recognition to machine translation, with new applications popping up every day. We really see that borne out in the Google search trends data: exponential growth over the last couple of years in interest around deep learning. Also, if you've been reading Strata abstracts, you'll know about the rise of Spark as a distributed computing framework, and again we see in the search trends data this rapid takeoff of interest in Spark. Now, of course, I'm not standing here at Strata + Spark World giving this talk, at least not yet; this is still Strata + Hadoop World, so I wanted to add the trace for Hadoop on here, and you pretty much see what you'd expect: Hadoop took off earlier, it's reached much broader penetration, much broader interest, and it has maybe leveled off a little bit in the last couple of years. So I thought, all right, what can I add on here that's going to be even bigger than Hadoop, even more hype than that? I figured I would add "big data" on here, and I was a little surprised that it's not much higher than Hadoop. I think some people would argue that big data and data science just are Hadoop; I don't think I'm one of those people, but maybe you look at this graph and you could make that argument.
So given that sense that Spark has really taken off and deep learning has really taken off, I was still a little surprised to see what happened when I plotted the Docker trend line on here. Docker has eclipsed all of these things, both in the sheer volume of interest it has right now and in the rapidity with which that interest has arisen. The purple curve is really just taking off like a rocket ship. And because of that, if you paid attention to Docker somewhere back here and learned a little bit about the ecosystem, and then four months later you look again, things have changed rapidly: new tools have come out, things have changed names, and it's hard to know where you are. Basically, that's what this talk is for: it's all about explaining this purple curve. I want to help you understand what's driving this huge interest in Docker, a lot of which has been in the DevOps and systems community but has recently been breaking over into the data science world, and to help you understand the current state of the toolchain so that you can walk out of here ready to just use Docker.

But before I do that, I want to bring us back down to earth a little, so we don't think Docker is the hottest thing since sliced bread. I ran a few more Google Trends searches, and I did manage to re-prove that cute animals still rule the internet. This yellow curve is for golden retrievers; the red one is for pugs. As a proud pug owner myself, I was super excited to see the recent rapid rise of interest in pugs. Not quite as much as Docker, but still, it's cool. And then, because, as the intro said, Civis has its origins in the political world, I had to run just one more search, which I think really puts all of this in perspective. Depending on your point of view, it's either a fascinating exercise in electoral democracy or it looks really scary for the future of the country, but that's all I'm going to say about that.

So now that we've seen this rapid growth of Docker, what are we going to do for the rest of this talk? I have three goals. First, I want to answer the question of why you would use Docker for data science by telling you what Docker is and what it enables, and to take that fatigued question, "do I really have to learn another tool?", and answer it with what I think is a strong recommendation that yes, Docker is worth your time to learn. I'll partly do that by telling you some of how we use Docker for our work at Civis. The back half of the talk is going to be a tutorial, so the second goal, like I said, is for you to be able to walk out of here ready to use Docker. We'll go through a tour of the Docker ecosystem, all the tools that surround Docker, and then actually look at some Docker code and syntax, and we'll do all of this through a "real life", end-to-end data science example. It's going to use Jupyter notebooks, it's going to use Python, it's going to use R — all the things that we know and love as data scientists, just in the context of Docker. You'll notice I have "real life" in air quotes, and that gets at the third goal of my talk, which is that I've tried to design the most hype-filled talk at Strata, using our Google search trends data as a guide. What we're going to be doing for our tutorial example is deep learning for pug recognition with Docker, with a Donald Trump cameo thrown in there, so watch out for that.
OK, so let's start with what Docker is all about. At its simplest essence, Docker is basically a virtual machine, and you can think of it as providing you with a Linux operating system that's chock-full and pre-installed with all the software you need for running one particular task or application. Think of it like having a laptop that's built for a single purpose: you only use it for running this one program or this one application. So it's like other virtual machines you may have used, where you're running a guest operating system on top of your machine's host operating system, but there are a few big advantages to Docker.

Docker is what's known as operating-system-level virtualization. What that means is that you're not actually running a whole separate operating system on top of your host operating system; there are features of the Linux kernel that fake it for you. They give you isolated bundles of resources — CPU, file system, networking ports, all that sort of stuff — that make it look like you have an isolated virtual machine, and that makes Docker images much more lightweight than traditional virtual machines. This is the picture that's often used to describe it. On the left is a traditional virtual machine setup, where you have your bare-metal server down here running some host operating system, and then you can run many different virtual machines, each of which gets its own full copy of an operating system and the binaries and libraries and everything. With operating-system-level virtualization, things can be shared between containers but still act isolated. In a Docker setup you have your host operating system, and each running container can share things from the underlying Linux kernel; they can share binaries and libraries if they're using the same ones, and so on. That has a big advantage: Docker containers are really lightweight and they're super fast to start up — tens or hundreds of milliseconds, rather than the minute or so it might take to boot an operating system. So it's not like booting between Windows and OS X in Parallels on your Mac, where switching is slow and a huge interruption to your workflow; it's more like having a dozen laptops on your desk, each one on and ready for a particular purpose, and you just move from one to the other. That makes it really easy to work with a lot of different Docker containers.

The second important thing to understand about Docker is that basically everything about an image is specified in a text file called a Dockerfile. It's just a really light wrapper around Linux syntax for installing system packages and code — anything you could do on a Linux command line — except that it's encapsulated in a nice, readable text file that can be stored in GitHub alongside your application.
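To make that concrete, here's roughly what a minimal Dockerfile can look like. This is just an illustrative sketch, not an image from the talk; the base image, packages, and file names are placeholders.

    # start from a base Linux image
    FROM ubuntu:14.04
    # install system packages, exactly as you would on the command line
    RUN apt-get update && apt-get install -y python-pip
    # install the Python packages your analysis needs
    RUN pip install numpy pandas
    # copy your (hypothetical) analysis code into the image
    COPY analysis.py /home/analysis.py
    # what runs when a container is started from this image
    CMD ["python", "/home/analysis.py"]

Because it's just a text file, it can be versioned and code-reviewed like any other piece of the project.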
The next big advantage is that Docker images are cross-platform, portable, and shareable, and a whole system has arisen for sharing them with other people. In particular, there are services like Docker Hub and Amazon's container registry that help you version and share Docker images, basically in total analogy to what GitHub does for source code. This will come up later, because it's part of why I think Docker is really important for data science teams. So that's basically what Docker is at its essence: an isolated virtual machine, built for a single purpose, that boots really quickly.

So why would you want to use Docker as a data scientist? I think there are a handful of reasons. The first big one is reproducibility. By running all of your code inside a Docker container, you can be sure it's going to run exactly the same way every single time, because the container has the same operating system, the same versions of all the system libraries, and all of your Python or R packages — exactly the same stuff whether you're running on your laptop or on some random Amazon server, in development mode or production mode. It's literally exactly the same thing. And if you're a data scientist, you know there's a lot of work in the research community on reproducible research. Jupyter notebooks are an awesome way to do that; R Markdown documents are also an awesome way to do that: embedding your graphs, your code, and your text descriptions all in one document. That's really great, but what about all of the system dependencies? What about the operating system you ran the analysis on? What about all the system libraries? You'd think that stuff shouldn't matter, but six or eight months later, when you come back, you can have your notebook and it just isn't going to run the same way. Docker gives you a way to make it truly reproducible: keep your notebook and the Docker container that goes with it, put those two pieces together, and you can be certain you'll be able to reproduce your work. On a related note, no one can mess up your cron job with Docker. Someone can't come along and change the version of a system package that totally breaks your cron job so that you find out the next morning; if it's running inside a Docker container, that doesn't matter — as long as the container doesn't change, it's going to keep doing the same thing. Also related to reproducibility, there's documentation. I was talking about the Dockerfile, this text description of what's in an image: it clearly documents what version of everything is in your container.

The next reason I think Docker is great for data scientists is that, as data scientists, we like to experiment, and we're polyglot — we use lots of different tools. If you're using Docker containers for your work, you get a level of isolation that keeps you from breaking things. If you're using Python 2 and Python 3, your setups aren't going to mess each other up. If you want to mess around with Anaconda, it's not going to mess with your system Python. You can link things against different versions of linear algebra libraries, your R setup isn't going to contaminate your Python setup, and so on. I'm sure if I asked for a show of hands, at least several of us have managed to brick a laptop by messing around with these kinds of things; with Docker, that's not going to happen. And that gets into the next point, which is that Docker really democratizes DevOps for people who are not infrastructure types.
You can mess up: I can mess up a Dockerfile and mess up my running image, and it doesn't really have any repercussions. I just start over. It frees me to experiment with things in a way that I can't if I'm doing this stuff directly on an Amazon server, where I can break the server and there's a real cost to that. So it's very low risk for people like us who may just be getting started with thinking about DevOps.

The second big advantage, as I was saying earlier, is that Docker images are portable and shareable. If you're a data scientist who is more about analytics and statistics, and less into the DevOps and systems side, someone else can set up a container for you that has all of the exact dependencies you need and push it to Docker Hub, and then you can just work with it. You can worry about your R or Python statistical code and not so much about the system stuff, and it's very easy to check out those images. There's a huge community that has arisen around Docker, and there's composability, which I'll talk about in a second. Basically, if you want to use Jupyter notebooks, the Jupyter folks have provided a set of Docker images that you can just grab from Docker Hub. If you want to run an R Shiny server, you can go get an official Docker container for Shiny. So that's the community aspect. The composability aspect is that containers can be built off of other containers, in a layered fashion. Let's say I want to build an R Shiny app: I can start from the community-supported Shiny image and then just add the few things on top of it that I need for my particular application — the particular R packages I need for that analysis. Composability means you can worry only about the parts that are specific to your own workflow and let the experts take care of getting Jupyter or Shiny running.

I also think Docker really helps teams with communication. If you're on a data science team that has less technical folks on it, like I said earlier, you can set up images and share them, and those folks can just worry about running the code inside those images rather than taking care of servers. And if you interact with engineering teams, you can treat Docker containers as black boxes: you can set them up chock-full of everything you need for your R code or your Python code, and as long as there's a clearly documented interface, you can hand them over to a production engineering team, who can literally treat them as black boxes without having to know how to set up the server.

Now, you might be thinking: OK, this sounds kind of familiar, like some things I get from Amazon Machine Images (AMIs) or from Python virtual environments. I think there are some important differences. Cross-platform and portable: Amazon machine images are obviously not provider-agnostic or cross-platform — you have either a Windows one or a Linux one — and you can't move your virtual environments from one platform to another or share them. Dockerfiles are these transparent text-file descriptions; AMIs sort of have that — usually you log in, do something to the image, and take a snapshot, though there are tools like Packer that help with that — and Python sort of has a text description, the requirements file that documents all the Python packages, but it doesn't tell you anything about the system that you're running on.
AMIs are shareable; virtual environments are not. Composability: yes, you can start with a base AMI and add your stuff on top of it, but there's no composability component to virtual environments — you can't have a base Python 3 virtual environment and then add one thing to it for one application and another thing for another application; that doesn't work. There are community AMIs, but from what I've seen, not at the level of support that's already popped up around Docker, and there are no community virtual environments. Multi-language: you can have a Docker container for literally any kind of application in any kind of language, whereas virtual environments are obviously Python-specific. And you can run a huge number of containers on one system, so I could have a dozen containers on my laptop and move from thing to thing, whereas you're linked to one Amazon machine image (though you could have multiple virtual environments). So taking all this together, I think there are lots of advantages to Docker over some of these tools you may already be familiar with.

Now let me tell you how we manifest some of these advantages in our work at Civis, but first a little bit about who we are as a company, to help you understand that. Civis traces its origins to the 2012 Obama re-election campaign, where a chunk of our staff were involved in the campaign's groundbreaking analytics operation. On the campaign we did things like predictive modeling of voter behavior, we designed and ran large-scale field experiments, and we did all sorts of online fundraising experimentation and optimization. Since 2012 we've been taking a lot of the same techniques and moving them into the nonprofit world, other political campaigns, and companies, and increasingly bundling them together as software-as-a-service data science products that we sell. Because of that unique heritage, we employ an interesting mix of people with different backgrounds. This is some old data, but it basically shows the academic backgrounds of a bunch of our data science staff across different departments — data acquisition, our data science team, our client-facing team. These are all people doing deep technical data science work and statistical analysis, but we have basically as many people from CS, hard math, and hard sciences backgrounds as we do from social science and political science, that whole world. Those people might be writing code, but they have less of the DevOps-type background. As a company we're really trying to inhabit the sweet spot in the middle of the traditional data science Venn diagram, but with our own little sector added on for social science methodology: things like randomized controlled experiments, survey design, and causal inference from observational data. You need people with lots of different backgrounds to do something like that.

So how do we actually use Docker, given some of the features I mentioned? First of all, our data science team maintains a very large set of Docker images, both for different production applications and also for interactive use by less technical analysts at the company. We have a Docker image for our predictive modeling software; we have a Docker image for Bayesian time series modeling and aggregating polls together, kind of Nate Silver style.
We have a Docker image for Bayesian analysis of randomized controlled experiments, for processing surveys, various Shiny apps, and then general-purpose images for when someone just wants to do some R coding — give me an image that's chock-full of all the R packages I need — or the same thing with Python. We maintain those images because we know a bit more about DevOps than some other folks, but with basically just some training in Docker — like how to run Docker on your Mac or your Windows laptop — our less technical, client-facing folks can pull those images down from Docker Hub and then just run all of the code we provide for them in there. They can make pull requests and change the actual statistical analysis code without really having to worry about the DevOps piece of it. And if they are interested in learning a little about the DevOps side, looking at the Dockerfile and making pull requests against it is a really approachable way to start — adding a new package or modifying the version of something. It's not as intimidating as going onto a server and sudo-ing some stuff. And then, like I said earlier, we hand over a bunch of images to our engineering team as black boxes with clear interfaces, so they can hook up a front-end button that does some really complex data science just by knowing the interface of that Docker container, without having to know anything about the R or the Python that goes into it.

Another really interesting thing that we've done is develop a batch cluster computing solution that we use heavily for our internal work and have now started to give to some external clients. It's basically an auto-scaling cluster solution — if it gets busy, it just boots up more Amazon nodes — but from a user's perspective, what you do is specify two things: a Docker image with which to run your code, and then either a command to run in that image or a GitHub repo that has your code in it. It puts those two things together, runs them, and passes you back all of your log output. Our internal folks can run these things with a simple web form or hit it via an API, and it really allows less technical people to scale up their work just by using this community set of Docker containers, without having to worry about the servers they're running things on.

OK, now I want to transition into the back half of this talk, where we'll talk about the Docker ecosystem and what the tools actually look like, so that you can hopefully walk out of here with some really concrete new knowledge. Let's talk about the Docker toolchain. The first basic component of using Docker is the Docker daemon, also sometimes called the Docker engine, together with the Docker client. Docker has a client–server architecture: the Docker daemon has to be running on any machine where you want to do Docker, and the Docker command-line client connects to it and tells it to do things. So you'll be running things like docker run, docker build, et cetera — kind of like the git system, where you have git and then a subcommand.

The second thing that's important to know about is a tool called Docker Machine. Docker Machine helps you get a server up and running with Docker configured on top of it, and it lets you point the client at that running machine.
To use Docker you have to have a Linux machine running, with a Linux kernel, which you don't have on your Mac laptop. But with Docker Machine you can create a local VirtualBox VM that's running Linux and then point your Docker client at it — we'll see how to do that later — and then you just run Docker commands and it's just like being on a Linux machine, except that behind the scenes it's communicating with that VirtualBox virtual machine. Docker Machine will also provision servers for you on Amazon, DigitalOcean, and other providers, and then you can run Docker commands locally on your laptop while the work actually happens on that server in the cloud.

The next important component, which I mentioned earlier, is Docker Hub. I think they've done a really awesome job of building Docker Hub in complete analogy with GitHub: Docker Hub is to Docker as GitHub is to git. It's basically an image registry with a nice front end. You can push your images there, see the versions, add descriptions, and share them; other people can pull those images down. It's really the way to share Docker images and to get access to the community images.

Then there's Kitematic. Kitematic used to be a standalone company that got merged into Docker, and it's a really simple graphical interface for launching and using containers, so it's a great place to start for less technical folks or colleagues who aren't into doing things on the command line.

Then we get into things that are maybe less common for you to use as a data scientist. There's Docker Swarm, which helps you boot up a cluster of machines that are running Docker and then launch and schedule containers on top of them. Here, think about running a web application where you want to boot up twenty servers, launch multiple instances of it, and do some kind of load balancing — that's what Docker Swarm is for. Docker Compose is something you might well run into: it's a tool for defining and running multi-container applications. Inside a simple YAML file, which we'll see an example of later, you specify all the different containers, which images they use, and how to link them together. So maybe you have a front end running in one container, an API running in another, and a database running in a third; Compose lets you separate and isolate all those pieces and then connect them together. Docker Compose used to be called Fig, if you've heard of that — it was also a standalone tool that got merged into Docker.
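As a preview of the Docker Machine workflow described a moment ago, the commands look roughly like this. This is a sketch: the VM names, instance type, and credential variables are placeholders.

    # create a local VirtualBox VM that runs the Docker daemon
    docker-machine create --driver virtualbox my-vm

    # point the docker client at that VM (this sets DOCKER_HOST and friends)
    eval "$(docker-machine env my-vm)"

    # the same idea works for a cloud provider, e.g. an EC2 instance
    docker-machine create --driver amazonec2 \
        --amazonec2-access-key "$AWS_ACCESS_KEY_ID" \
        --amazonec2-secret-key "$AWS_SECRET_ACCESS_KEY" \
        --amazonec2-instance-type m4.xlarge \
        aws-vm

After either of those, ordinary docker commands typed on your laptop are actually executed against that machine.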
I think at some point Docker realized that this was super confusing and there were, like, ten thousand tools, so they launched Docker Toolbox — yet another tool you had to figure out, but it bundles all that stuff into a single installer: you install Docker Toolbox and it gives you all the things I just mentioned. And then, as of last week, hot off the press, they released yet another tool in beta called Docker for Mac and Docker for Windows, which is meant to supersede Docker Toolbox. It's basically an all-in-one installer that gives you a native Mac or native Windows experience for working with Docker, and that should simplify some things and make it nicer to work with.

OK, so now I want to launch into the tutorial example. If you're interested, all of this code is up on GitHub — take a look; it has links to the Dockerfiles and also links to the Docker images on Docker Hub, so you should literally be able to reproduce everything I'm about to show you with the containers and the commands that are up there. That's the whole idea of Docker, after all. If you try it and you can't get it to work, you have a money-back guarantee from me — I haven't cleared that with O'Reilly yet, so I don't know, we'll see.

So what is this application we're going to look at? Here's a screenshot — actually, I'll just start the demo running and talk over it. We're going to make an R Shiny application — Shiny is a really nice front-end framework for R — and it's going to let the user stick a URL in there. Once the URL is in, it grabs the image from the internet, crops it and shrinks it down to the size we need, and then it makes an API call, and that API call goes to a second container that we have running. That second container is serving a model we built in Python: a neural network model that is supposed to recognize pugs versus golden retrievers. At the bottom you can see test images I just grabbed off Google Image Search. It's doing a really great job: it's unhappy that that's a pug, I'm sorry; it's unhappy that that's a golden retriever; it was pretty happy that this is a pug. So it takes the model score back from that container and displays this nice message for you, and all of this is linked together and booted up using docker-compose, which we'll show in a second. I'll just let this loop through to the end: it thinks, with 97 percent certainty, that this photo of Donald Trump is actually a pug. I think if you look at the expression on his face, it really makes sense.
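Just to make the interface between those two containers concrete: the Shiny front end only needs to know the model container's host, port, and route. A hypothetical request might look like the following — the endpoint name, parameter, and response shape are made up for illustration, not taken from the talk's code.

    # hypothetical call from the Shiny container to the Flask model API
    curl "http://flask_api:5000/predict?image_url=http://example.com/dog.jpg"
    # -> {"pug_probability": 0.97}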
All right, so how is something that awesome possible? Let me walk you through all the pieces of Docker that I used to build that application. First, it starts with Docker Machine, which, as I was saying, is the way you boot up servers to run Docker. I started by booting a local VirtualBox machine to work locally on my Mac; then, once I realized I was having memory issues, I used Docker Machine to boot up, in exactly the same way, a server on Amazon, and eventually a GPU instance on Amazon. Docker Machine handles booting those three things and pointing your Docker client at them.

Then I started doing the work in a first container, which is a Python container. I started from the Jupyter community image that has the notebook up and running, installed Keras on top of it — a deep learning package for Python, if you know it — and then worked inside the browser, which is talking to the notebook server running in the container, literally exactly like working with a notebook on your laptop. There's no difference from the interactive standpoint. What we're doing here is a form of transfer learning: I took a pre-trained ImageNet convolutional network, chopped off the last layer, presented it with images of pugs and golden retrievers, and had it learn that last layer for our particular pug recognition task. The regular network actually did a great job of knowing it was a dog, but it didn't go that really crucial last step and tell you that it's a pug.

Once the model was built in the notebook, I took advantage of composability: I started from the image that was already set up to do our Python stuff and just added one more thing to it, which is Flask, so we can serve the model with an API. Then, in a totally separate container, we defined the R Shiny application, and we used docker-compose to link the two of them together and start the application running.

So let's look at some of the syntax for how this actually works. Here's what Docker Machine looks like. You say: docker-machine, create me a machine, and then you tell it what kind. Here I'm creating a local VirtualBox machine and giving it a name, just my-vm, and then the second line tells Docker to point its environment at that machine. So when you're using Docker commands, they're actually running on that VirtualBox machine on my laptop. If you wanted to boot up a server on Amazon, it's essentially the same thing: you say, docker-machine, create me a machine, except you use the driver for Amazon, you pass it your Amazon keys, and you tell it what kind of instance you want. Then again you point your Docker environment at that machine, and anything you do happens on that Amazon instance.

I talked about the Dockerfile, this text description, and here's the actual Dockerfile I start with. We start from the Jupyter community image that has the notebook in it, and then I run a bunch of commands that look like Linux commands: pip install installs packages on top of that image — Keras and some other stuff — and then some apt-get commands install ImageMagick, which is just an image processing library. Then, as I said earlier, once we've defined the model, we use composability to add a couple of things on top of that image. Here we're starting from the base image I just defined and adding a few more things for our API: we add Flask, we copy over the model weights file that we stored, EXPOSE tells the container to open up a port — we're going to serve our API on port 5000 — and this last bit down here says what command to run when the container starts: the command that actually starts the Flask API. So the Dockerfile is just a very lightweight wrapper around Linux commands.
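Here's a sketch of what those two Dockerfiles could look like. The base image, user names, file names, and package list are approximations of what's described, not the actual files from the talk's repo; "yourname" stands in for a Docker Hub username.

    # Dockerfile 1: the notebook image used to train the model
    FROM jupyter/scipy-notebook          # community Jupyter image with the notebook installed
    USER root                            # switch to root so apt-get works in this image
    RUN apt-get update && apt-get install -y imagemagick
    USER jovyan                          # back to the normal notebook user
    RUN pip install keras h5py           # deep learning bits layered on top

    # Dockerfile 2: the model-serving image, composed on top of the first one
    FROM yourname/pug_classifier_notebook
    RUN pip install flask                        # lightweight API framework
    COPY model_weights.h5 /srv/model_weights.h5  # trained weights saved from the notebook
    COPY serve_model.py   /srv/serve_model.py    # hypothetical Flask app that loads and serves them
    EXPOSE 5000                                  # the port the API listens on
    CMD ["python", "/srv/serve_model.py"]        # what runs when the container starts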
Now we'll see what it actually looks like to build this image. Docker has very similar syntax to git, where it's docker plus a subcommand, et cetera, and it's very easy to get help just with the help flag. By building an image, I mean taking that text description in the Dockerfile and turning it into an image that you can run, or push to Docker Hub, or whatever. So you say docker build, and I want to tag it with this — my Docker Hub username and then the name of the image, which is my pug classifier notebook — and I point it at the Dockerfile and the local directory context. When it's done, I can run the docker images command and it shows me all the images I have access to. Here's a screen capture of it running — it goes by kind of quickly — but you can see it's basically going step by step, running each command in the container produced by the previous command. You can see some pip installs scrolling by; right now it's doing apt-get update and installing a package. It went by really quickly at the end, but I ran docker images and it told me I now had an image available called pug classifier notebook.

Now that I've built that locally, I can push it up to Docker Hub. Just like git, I say docker push and then the image name, and it appears under my account on Docker Hub: here's my username, here's the name of the image, and I've put a text description and a link to the Dockerfile in that description field. If you wanted to use it, you could just do docker pull, my username slash pug classifier notebook, and you'd have that image locally to work with.

So what does it look like to actually run the container? Here's an example of running an interactive container. You say docker run; the -it says I want to interact directly with this container on the command line; then I tell it the container to run and the command to run inside it. Here I'm just running bash, which opens up a command prompt inside the container for me to do things with. But if you didn't want to run interactively — you just want to leave the container running in the background, doing its thing, with its ports open — you can say docker run -d, which tells it to run in daemon mode, you can map the ports from the container to your local machine, and then you give it the image name. The Jupyter notebook image is set up to run the notebook as soon as it starts, so what it actually looks like is: I ran that command, it basically returned instantly, the container has started up, and then I go into my browser, type the Jupyter address, and here I am inside the notebook doing whatever I want. I'll let that run one more time so you can see that starting the container is essentially instantaneous, and from that point you're just in Jupyter. You don't know that you're in a container, except that it has frozen your dependencies, and six months from now, if you start that container and run your notebook in it, it's going to have exactly the right stuff.
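In command form, that build/push/run cycle looks roughly like the following. This is a sketch: "yourname" is again a placeholder for a Docker Hub username, and the port mapping assumes the notebook's default port.

    # build an image from the Dockerfile in the current directory and tag it
    docker build -t yourname/pug_classifier_notebook -f Dockerfile .

    # list the images available locally
    docker images

    # share it via Docker Hub, just like git push / git pull
    docker push yourname/pug_classifier_notebook
    docker pull yourname/pug_classifier_notebook

    # run it interactively, with a shell prompt inside the container
    docker run -it yourname/pug_classifier_notebook /bin/bash

    # or run it detached, mapping the notebook's port 8888 to the host
    docker run -d -p 8888:8888 yourname/pug_classifier_notebook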
You can also mount volumes inside a container, which is a great thing to do for debugging. In this syntax here, what this says is: start this container and mount a volume — take a local directory on my machine and make it appear in a certain spot inside the container. So if you're working on some code, you can mount that code into your container and it will run inside the container, but any changes you make are reflected back and forth; you don't have to keep rebuilding the image and putting the file in there, you can just sync it back and forth. You can also inject environment variables into the container with the -e flag: I can run this container and, inside it, these environment variables are available to me. So I can set, say, a database user and a database password without having to store them inside the Docker image, which other people might pull down from Docker Hub, or store them in GitHub somewhere.

OK, so those are the basics of what a Dockerfile looks like, how you build it, and how you run it — kind of a lightning introduction — and now let me show you docker-compose, which I used to actually run that application. I said before there are two pieces to the application: the R Shiny server front end — I didn't show you the Dockerfile for that, but you can just trust me — and the Flask API component. We name the two pieces and tell Compose what image to run in each container: in one I'm going to run this pug classifier Shiny image that's got the front end, and in the other I'm going to run the Flask API. I specify whether either one should open up some ports, and then this links part is really the magic part: here I'm saying, hey, my Shiny server container is going to need to access the Flask API container, so I tell it to link the two containers together, and it just does that, magically. Then you run docker-compose up, and it takes care of starting those containers and linking them together. Here I'm running docker-compose up, and it says it's creating the Shiny container, it's creating the Flask container, and then it starts showing me the logs — these are the logs from the Flask API starting up and getting ready to serve model scores. And that's basically it; here again is what the app looks like when it's running.

The super cool thing about this is that once it's working, I can use — and did use — docker-machine to boot up an Amazon server: say, docker-machine, give me an Amazon server, point Docker at it, like we saw in that first slide, and then run docker-compose up again, and it boots up your containers and links them together on your Amazon instance. So it's totally seamless deployment: the thing that was working locally on your laptop, with just one more command, gets pushed up to an Amazon server and is now live somewhere for people to use.
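Putting those last few pieces into code form: the volume mount and environment-variable flags look roughly like this, and the compose file has the two-service shape described above. Service names, paths, ports, and credentials here are placeholders, not the talk's actual files.

    # mount a local directory into the container and inject credentials at run time
    docker run -it \
        -v /path/on/my/laptop:/home/jovyan/work \
        -e DB_USER=me -e DB_PASSWORD=secret \
        yourname/pug_classifier_notebook

And a docker-compose.yml along these lines wires the two application containers together:

    # two services: the Shiny front end is linked to the Flask model API
    shiny:
      image: yourname/pug_classifier_shiny
      ports:
        - "80:3838"        # Shiny Server's default port, exposed on port 80
      links:
        - flask_api        # lets the front end reach the API by name
    flask_api:
      image: yourname/pug_classifier_api
      ports:
        - "5000:5000"

After that, docker-compose up starts both containers locally, and pointing the client at an EC2 machine with docker-machine and running the same command deploys the same pair of containers there.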
OK, I think I have about five minutes left, so that's really all I wanted to talk through with you. I hope I've given you a sense of what Docker is all about, why people are so excited about it, and why I think it's really useful for us as data scientists, and hopefully enough of a flavor of the syntax that, if you're interested, you can walk out of here and just start using Docker. A great place to start is grabbing the code from my GitHub link and trying to reproduce some of the stuff I just walked you through.

Perfect, I have about five minutes for questions, if there are any. If you have questions you can go up to the microphone. All right, I can repeat it, go ahead. — It's two containers running at the same time: one container is running the Shiny front-end stuff, and a second container is serving the model API. Sorry, if I can just get back to this — it's just these two containers: the Shiny front end, which makes API calls to this one, which is serving the model scores. This third container is just the one this container is built from, so it starts with everything in there and adds one more thing, but it's only these two that are running, if that makes sense.

OK, so the question was about Jupyter notebooks needing a web browser to run. The web browser you use to connect to the notebook is on your local laptop, and it goes out and connects to the thing that's running the Jupyter server. Normally, when you start Jupyter locally on your laptop, you go into a terminal and start it, then you go to your browser, and your browser talks via an API to the server. Here it's exactly the same thing, except the server is running in a Docker container — either locally or, if it's running on Amazon, you just point your local browser at that Amazon IP.

To create containers, do you need any special permissions? Yeah, that's a great question; I have to remember this. Usually you need sudo access to run Docker — if you install Docker yourself, freshly, you have to preface all those Docker commands with sudo docker whatever. I'm pretty sure that with Docker Machine it just works: if you use Docker Machine to make your virtual machine or your instance on Amazon, you don't need to preface everything with sudo; it probably runs as a user that has been added to the docker group.

Yeah, so the question was: I understand that you can attach data to containers — there's a concept of a data container or a data volume. That's not something I've used personally, but you're right: the idea is that if you're running, say, a SQL database, you need to have that data stored somewhere, and the recommended way to do that in Docker land is to store it in its own container. But I don't personally have experience with that.

In the back — yeah, so the question was whether I have any experience with things like Vagrant or Ansible and whether I can comment on the differences. I have not used Ansible; our DevOps team at Civis uses Ansible. I have used Vagrant, and in a previous job I set up basically the same thing, a Vagrant image for serving Shiny applications, and for me personally the experience with Docker is night and day. I think the Dockerfile syntax is just way easier. If you don't know, Vagrant is another system for managing virtual machines, and if I remember right, the syntax is Ruby, which I'm not natively familiar with, and it was just a lot harder to figure out what's going on, whereas the Dockerfile is literally just like running commands on the command line. And also, there was already a Shiny image available for Docker that wasn't there for Vagrant. So personally, I vastly prefer Docker to either of those two.
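A quick footnote on the earlier permissions answer: on a stock Linux install, the usual way to avoid prefixing every command with sudo is to add your user to the docker group, assuming your distribution created that group when Docker was installed:

    # add the current user to the docker group, then log out and back in
    sudo usermod -aG docker $USER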
In the far back — yep, so the question was: once you push your running application up to Amazon, can you point a different URL at it, like isthisapug.com or something? The answer is definitely yes. I'm not a hundred percent sure exactly how you'd do it, but you'd probably go to the Amazon console and do some mapping with Amazon's DNS features. There's definitely a way to do it, but I think that's outside of the Docker world.

Yeah, I'll take one more. All right, so the question was: how much CPU and memory does the running container use, and are there requirements and things like that? You can control that: with some of the docker run command-line options you can tell it not to use more than a certain amount of CPU, or limit it to a certain amount of RAM. And actually, that internal tool I mentioned lets the user specify not to use more than a given set of resources. I don't know what the default is; probably by default it gets access to as much as you have available, but you can place limits on it. OK, great. Thank you, everybody.
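As a footnote to that last answer, the caps being described are ordinary docker run flags; for example (the image name and the limits are placeholders):

    # cap the container's memory and give it a relative CPU share
    docker run -d -m 4g --cpu-shares 512 yourname/pug_classifier_api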
Info
Channel: Michelangelo D'Agostino
Views: 15,473
Rating: 4.9484978 out of 5
Keywords: docker data science
Id: GOW6yQpxOIg
Length: 42min 49sec (2569 seconds)
Published: Mon Apr 25 2016