Kubernetes-Native Workflow Orchestration with Argo | Skillshare

Captions
My name is Kai Ricci. I'm a data engineer at Skillshare, here in New York City, and I'm going to talk about Kubernetes-native workflows with Argo. A quick agenda: I'll give some context about what we're trying to do at Skillshare and the problem we're trying to solve, go a little bit into tool selection and some alternatives we looked at, and then finally talk about Argo itself, some details about it, and our experience using it.

So, Skillshare. We have a pretty small data team: three data scientists and one data engineer, which is me (we are hiring, by the way), and we're part of a larger online learning startup. Skillshare is a community of learners and teachers on creative topics, stuff like drawing, photography, and design. If at the end of these two days you feel like you need a break from data stuff and want to do something creative, I'd encourage you to check it out.

We have pretty small data right now, just in the tens of terabytes, but we have a few different use cases. Like many data teams at small startups, we're doing multiple things. Data analytics is a big one: looking at user behavior, setting up dashboards, analyzing experiments. We're also doing some ML; we have a recommender and many ideas for more models, for things like content evaluation and community features. And finally we have to do some integrations, just moving data into different business tools, like Volusia for email marketing, Zendesk, and so on. This is typically not something you would associate with a data science team, but at small startups like ours we're usually the ones who end up doing it.

I joined Skillshare earlier this year, and at the point I joined we had some standalone infrastructure pieces: a data warehouse that was getting some data loaded into it somewhat infrequently, a lot of scripts being run on laptops, and that one server that you SSH into to run your scripts. What we really needed was a place to run all of these different ETL tasks (extracting data, loading it into different places, transforming it, doing model training, running our integrations between tools), and we also needed something to orchestrate all of those tasks into workflows. Just to give you an example of the kind of stuff we're doing (this is a little contrived, because we probably wouldn't run all of this as a single workflow, but it gives you a flavor of the type of tasks): extracting data from different places, transforming it, maybe training a recommender, rerunning some experiment analyses, and pushing user analytics to our marketing tool.

Now, this is how not to do tool selection. I think the best way to do tool selection is to find a couple of guiding principles and then evaluate tools through that lens. Our first guiding principle was operability, and the question I have here is: can less than one person, or more than one person, operate it? By "less than one person" I mean one person not working on it full-time. I have a theory that if you have one smart person making a heroic effort, they can basically make anything continue to run, but if they're not working on it full-time it may fall apart. And if more than one person needs to work on something and you can't coordinate that, you're also not in a great place. So here's what we were
looking for. Basically, low conceptual complexity: we were okay with some technical complexity, but we wanted to minimize the total number of moving parts, so that people could actually reason about the system without having to look at a giant Confluence doc that says "here are the 18 different parts of our data platform." Sane deployment and development of workflows was also really important to us. I think this is a pretty underrated aspect of data tools: can I actually develop it on my local machine and run it in a similar way to how it's going to run in production? Can I do a deployment without stuff breaking? And then the little things like logs, metrics, and secrets management. These tend to be treated as nice-to-haves or after-the-fact stuff, but we wanted to make sure we had all of that in place from the start.

Our second guiding principle was reliability: does it run workloads the way they're supposed to run? For that you need explicit dependencies between steps; you don't want anything that just assumes something else has run. We also wanted to avoid doing too much polling or relying on webhooks; we wanted a workflow system where one thing directly leads into another. It was important to us to decouple our execution logic from our orchestration logic. Sometimes when people use tools like Airflow or Luigi (depending on how they use them), they couple what's supposed to run with when it's supposed to run, and we wanted those things to be completely separate. For each task, what it's actually doing should be irrelevant to what's triggering it, whether that's this workflow system, some other workflow system, or us running a step manually (which we don't want to do, but still). And finally, we wanted something that could scale through a pretty big increase in data volume. We're starting from a small point, but we're a growth-stage startup: our customer base is growing quickly year over year, and we keep adding instrumentation. Once you start giving people data tools, they want to use them and put more data into the system. More users also means more internal users; once you give people ways to build workflows and pipelines, they're going to start doing that.

There were some things we didn't really care about right now: graphical creation of DAGs, programmatically generated DAGs, and permissioning and security. At the stage we're at, basically everyone on the data team has access to everything, so we didn't need a permission model. And here's what we definitely didn't want: if you're a data scientist, you've probably sent this message to someone, and if you're a data engineer, you've definitely gotten this Slack message from someone.

So one of our first decisions was whether to build or buy. We looked at vendor tools; nothing we're doing is particularly unique, right? Doing ETL, and even doing integrations between tools, is not something specific to us.
But we didn't find any single vendor that did everything we wanted, and we wanted to avoid mixing and matching tools, because that's really hard to do reliably. If you're doing your ETL with one tool and your ML with another tool, and then you have some internal scripts running as well, orchestrating all of those can be really difficult, because a lot of vendor tools don't have fully fledged APIs for things like notifying you when something is done, or sometimes even when something has failed; you just get an email, and that's not really what you want. Development can also be tricky with vendor tools: some good tools will let you spin up a development or sandbox environment alongside the production environment, but not all of them do. And deploying changes across multiple tools at the same time can be really tricky: if you're using one managed tool for ETL and another managed tool for ML, or you're doing ML in-house, you have to make sure that changes to both of those are coordinated. Basically, for the kinds of tasks we were doing, the extraction and the transformation, we felt pretty comfortable using open-source tools or in some cases just writing the code ourselves, and some of the pieces were already written; our recommendation system, for example, already existed.

Since we decided we were going to run this stuff on our own platform, we definitely wanted to use containers. I don't think there's any reason to be running code outside of a container in a production environment in 2019. So we built containers for a small set of initial tasks and needed a place to run them. Our SRE team hooked us up with a Kubernetes cluster: they helped provision it and they're helping us co-maintain it. That's a really big thing; I'm not going to understate the maintenance effort and the difficulty of provisioning Kubernetes in the first place, so having that already done for us was a huge help and something we wanted to take advantage of. And there are lots of benefits to using Kubernetes if you're going to run containers. Scaling is the big one, of course, but even small things like secrets management help, and it's really easy to have parity between environments, because the whole environment is just a piece of code. We all have Kubernetes running locally on our laptops with the same configuration as is used in production, just with different environment variables. It also has good integration with Datadog, which is useful.

All right, so we were going to use containers and Kubernetes; how were we going to orchestrate it? Our first thought was to use Airflow, but we weren't enthusiastic about it for a few reasons. Mainly, Airflow is pretty complex: there are a lot of moving parts (the UI, the scheduler, the executors, you can have message buses), and, at least in my experience, it's full of foot guns, meaning ways that, if you're not using it correctly, you can shoot
yourself in the foot. And we weren't going to be using most of the features anyway: we'd be using a single executor, we weren't going to take advantage of the UI components, we weren't going to take advantage of permissioning, anything like that. We wanted something simpler. That's where Argo comes in.

So what is Argo? It's an open-source Kubernetes workflow engine. It's built by Intuit, of TurboTax fame (or infamy, as the case may be), but it has a lot of contributors outside Intuit and it's pretty active; there are new releases every couple of months. Argo is actually a set of projects, and there are two we were interested in: Argo Workflows, which is the way to define and run workflows, and Argo Events, which is a flexible event system for Kubernetes that we use to trigger workflows. Argo also has components for doing CI/CD, but that wasn't really relevant to our use case.

What does "Kubernetes-native" mean? Workflows and events are defined as custom resource definitions in Kubernetes, meaning they're basically native Kubernetes objects, and they can interface very easily with other resources: Secrets, ConfigMaps, volume mounts. Workflows also take full advantage of Kubernetes scheduling, so you can use things like affinity, taints and tolerations, and set resource limits.

Here's an example of what an Argo workflow looks like. The first thing you'll notice is that it's defined in YAML, just like every other Kubernetes resource. Depending on how you feel, that can be either a positive or a negative; for me it's pretty positive, because it basically means the workflow is declarative. There's no imperative execution of steps in the way you'd have with a DAG written in Python. The workflow steps are just container images, and you can declaratively define a DAG. In this case we have two extractions and then a transformation, and we just declare the dependency: the transformation depends on the extraction steps. When you execute it, it'll run the extraction steps first and then run the transformation, which is pretty nifty.

There are a couple of ways to run a workflow. For development you can use a command-line tool called argo, which is just a wrapper around kubectl. Since these workflows are just Kubernetes resources, you can use kubectl to create them and tear them down if you want, but Argo has a nice CLI tool, so you can submit a workflow and it'll run. This is the very small example from earlier; it gives you some nice output, and as you can see there's a pod name for each step of the workflow. Each of those containers is scheduled into its own pod, so there's complete isolation between them, and you can use any of your existing Kubernetes logging and monitoring to get the output of each step.

There are some advanced features. There's parameter passing: you can pass parameters from the command line, and you can pass parameters between tasks, so the output of one task can be a parameter into another task. There are also artifacts: you can output an artifact from one step, where the path is the place in the container, after it finishes running, that you want to pull the artifact from, and then reference it as an input to another task. Under the hood you hook it up to an artifact store, either Minio or S3, and Argo handles that pretty seamlessly, which is really nice.
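To make that concrete, here is a minimal sketch of a workflow along the lines of the one described: a DAG with two extraction tasks feeding a transformation, using both parameter passing and artifact passing. The image names, parameter names, and paths are illustrative, not Skillshare's actual pipeline.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: etl-example-        # Argo appends a random suffix per run
spec:
  entrypoint: main
  templates:
  - name: main
    dag:
      tasks:
      - name: extract-users
        template: extract
        arguments:
          parameters:
          - name: source
            value: app-db
      - name: extract-events
        template: extract
        arguments:
          parameters:
          - name: source
            value: event-stream
      - name: transform
        # runs only after both extraction tasks have succeeded
        dependencies: [extract-users, extract-events]
        template: transform
        arguments:
          artifacts:
          - name: users
            from: "{{tasks.extract-users.outputs.artifacts.extracted}}"
          - name: events
            from: "{{tasks.extract-events.outputs.artifacts.extracted}}"

  # Each step is just a container image; every step runs in its own pod.
  - name: extract
    inputs:
      parameters:
      - name: source
    container:
      image: example.com/etl-extract:latest       # hypothetical image
      command: [sh, -c]
      args: ["extract --source {{inputs.parameters.source}} --out /tmp/out.parquet"]
    outputs:
      artifacts:
      - name: extracted
        path: /tmp/out.parquet                    # pulled from the container after it finishes

  - name: transform
    inputs:
      artifacts:
      - name: users
        path: /tmp/users.parquet                  # materialized into the container before it starts
      - name: events
        path: /tmp/events.parquet
    container:
      image: example.com/etl-transform:latest     # hypothetical image
      command: [sh, -c]
      args: ["transform --users /tmp/users.parquet --events /tmp/events.parquet"]
```

Submitting it with `argo submit etl-example.yaml --watch` (or plain `kubectl create -f etl-example.yaml`) produces the per-step pod names mentioned above.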
We're not using the artifact feature that much right now, but I think it's a pretty nice one. Argo was originally oriented toward CI/CD (that's the world it came out of), but we find it pretty useful for data workflows as well.

I'm going to go through some miscellaneous features that we really like. There's memoized resubmit, meaning that if your workflow fails you can resubmit it without rerunning the steps that already succeeded, pretty easily. You can suspend and resume workflows, so if you have something that's taking a long time and you want to pause it (especially on your local machine), you can suspend it and resume it later. You can also put suspend steps inside the task definitions, kind of like a "press Enter to continue", and that can be helpful in some cases: for example, if there's incomplete data, or some validation needs to happen before you continue, you can just add a suspend step inside the workflow.

There are many other features. There are sidecars and daemon containers. Sidecars are companion containers for each workflow step, so you don't need just a single container per step (for simplicity I think it's useful to do that, but you don't have to). The daemon container is pretty cool: it means that throughout the entire lifecycle of the workflow you have a container running. You can use this if, say, you want a database that all of your workflow steps access: you spin it up in a daemon container, have your individual tasks write into that database, and then do some aggregation at the end. If you're doing more MapReduce-type stuff, or something that needs dedicated infrastructure to run, you can use a daemon container.

There are all sorts of DAG features: conditional tasks, sub-DAGs, generated DAGs, loops, recursive DAGs. We generally try to stay away from stuff like this, because if you're going for operability and reliability, then as your DAGs get more complicated and start including logic and dynamic generation, that can definitely create problems for you. But if you need that kind of thing, it's there, and you can also do partial DAG runs, similar to what you'd do with Airflow. Post-run hooks are pretty straightforward: they run after a step, or after the whole workflow run, so you can do things like send a Slack message or update something else (there's a sketch of this below). You can also create Kubernetes resources as a task. For the data use case I haven't figured out where this would be useful, but I'm sure some creative people in the audience can think of a use for it: as a task, you can spin up a pod or create a ConfigMap or something.
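As a rough illustration of the post-run hooks mentioned above, here is a minimal sketch of a workflow-level exit handler that posts the final status to Slack. The webhook Secret, the curl-based image, and the message format are assumptions for the sketch, not details from the talk.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: etl-with-notify-
spec:
  entrypoint: main
  onExit: notify                     # runs after the workflow finishes, whether it succeeded or failed
  templates:
  - name: main
    container:
      image: alpine:3.10
      command: [sh, -c]
      args: ["echo 'the actual work happens here'"]
  - name: notify
    container:
      image: curlimages/curl:7.72.0
      command: [sh, -c]
      args:
      - >-
        curl -s -X POST -H 'Content-type: application/json'
        --data '{"text": "workflow finished with status {{workflow.status}}"}'
        "$SLACK_WEBHOOK_URL"
      env:
      - name: SLACK_WEBHOOK_URL
        valueFrom:
          secretKeyRef:
            name: slack-webhook      # hypothetical Secret holding the webhook URL
            key: url
```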
There is a UI. It's pretty primitive; the astute audience members will notice that this is not the same DAG I showed before as an example, it's just a screenshot from the internet. We don't really use the UI that much. Basically all it does is show you what ran, and if you click into any of the boxes it shows you the logs for that step. We find it easier to just use the CLI tool. The Argo project is working on the UI and definitely planning to add features; right now I don't think you can even kick off a workflow from the UI. So if you want something with a UI, I don't think this is the right project for you yet; maybe in the future.

All right, I'm going to talk a little bit about triggering. The examples I showed earlier were running workflows just from the command line, and obviously you don't want to do that in production; you want a programmatic way to trigger your workflows. This is where Argo Events comes in. It's actually not part of Argo Workflows; it's a companion project, so you can use Argo Workflows without any of this event-triggering stuff. You can use the command-line tool, and there's also a web API that I think is built into Argo Workflows, so if you have an existing system that you're using to trigger things, you can hook it up to that. But if you want to use Argo Events, it's pretty flexible. There are lots of different ways to trigger things: polling a git repository, webhooks, schedules (that's what we use right now, just to kick off our DAGs on a schedule), lots of options. You can also use some pretty sophisticated logic: these things are basically sensors that listen to events, and a sensor can listen to multiple events at once, so you could have something that's listening on a schedule and also triggered by some other thing, and you can combine the logic using ANDs and ORs to set up some pretty sophisticated triggering. And these little octopus dudes down here are your actual workflows. I'm not going to show code examples of that right now, because it's a little dense and I don't think it's that interesting, but the capability is there.

One thing I wanted to touch on is packaging and deployment; that was one of the things we cared about in terms of operability, and I think this is really cool. Because workflows and events are Kubernetes resources, we can just use Helm to deploy and upgrade workflows, which is really nice. We package up every workflow we have as an individual package. The source, again, is just YAML files in GitHub, and in that same repository we define it as a Helm package. That means that if we want to deploy something, say to QA, we can just use Helm to upgrade the existing workflows.
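As a sketch of what that packaging might look like (the repository layout and value names here are hypothetical, not Skillshare's actual setup), each workflow gets its own chart with the Argo resource templated, so dev, QA, and prod differ only in their values files:

```yaml
# Hypothetical layout: one chart per workflow, all living in the same git repository.
#
#   workflows/recommender-etl/
#     Chart.yaml
#     values.yaml                 # per-environment overrides (dev / QA / prod)
#     templates/workflow.yaml     # the Argo resource(s) below
#
# templates/workflow.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: recommender-etl-
spec:
  entrypoint: main
  templates:
  - name: main
    container:
      # the image reference comes from values.yaml, so environments only differ in values
      image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
```

A deploy or upgrade is then a single command against the target cluster, something like `helm upgrade --install recommender-etl workflows/recommender-etl -f values-qa.yaml`.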
One nice thing is that if workflows are already in progress when you upgrade, they just continue to run under the old version, so you don't have to worry too much about timing your upgrades. With Airflow, if you're upgrading DAGs, sometimes the new DAG will be running when you think the old DAG is running, or vice versa; Argo does a pretty nice job of not disrupting workflows mid-run (to be clear, that's not a Helm feature, it's just something Argo does). This also makes dev, QA, and prod deployments very seamless. Say we want to test a new version of a workflow in dev: we just do a helm upgrade against our local dev cluster running on a laptop, rerun the workflow, and see how it works, and the same thing in a QA environment. This can also be useful if you're worried about interactions between workflows. In our case our workflows are pretty separate, so we package them up individually, but if you have a big set of workflows you can package them up together, deploy that to a QA environment, and see whether there are any hidden interactions between the new version of one workflow and another workflow in the same package. I think that's a really cool feature; it makes my life very easy, because if I need to deploy something it's literally one line.

I'll talk briefly about our implementation experience. It took about a week of full-time engineering work to set up Argo; it's not a completely out-of-the-box thing. Most of the documentation for Argo Workflows is in the GitHub repo itself; there's not a lot of standalone documentation (Argo Events actually has quite a bit of standalone documentation). So it's not an out-of-the-box solution. Also, as a caveat, we already had Kubernetes set up, and that's a pretty big thing. If you're not running Kubernetes at all, setting up Kubernetes and setting up Argo at the same time is maybe a little ambitious and something you should think about carefully; but if you already have Kubernetes set up, definitely check out Argo. The trade-off, I think, is more time up front, but the operational time has been cut down a lot. We've been running two production DAGs with around 20 tasks, so definitely not a huge volume, but these are things that have run every hour for the past six months, and I think we've had one production outage, where one of our underlying nodes failed and a workflow hung. In theory that's something Kubernetes should be able to handle and self-heal from, but it didn't in this case, and it was fixed with a pretty simple restart. We use Datadog for log aggregation; it's really useful to have the output of all of your workflows in a single place, and you can set up filters, so if you're doing proper logging with log levels and such, you can do what we do: we have an alert so that if there's an error in any of the workflow logs, we get notified.

There are a couple of things Argo can't do right now that we wish it could. The big one is that you can't limit top-level DAG concurrency, meaning that for a given DAG you can't say only one instance of it should run at a time. That can be an issue with time-triggered workflows: if you have something that runs every hour and for some reason it takes two hours to run, you'll get a second copy of it at the top of the next hour. That could be a deal-breaker; it's something we've worked around for now. You can definitely adjust concurrency within the workflow itself, so there's task concurrency, but at a global level there's no way to limit it.
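For reference, here is a sketch of the workflow-level knobs this touches on: `spec.parallelism` for the task concurrency just mentioned and, looking ahead to the resource-cleanup question in the Q&A below, `podGC` for deleting finished step pods. Exact field availability depends on your Argo version, and the values are illustrative.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hourly-etl-
spec:
  entrypoint: main
  parallelism: 2                  # at most two tasks of *this* workflow run at once;
                                  # nothing here stops a second copy of the whole workflow
                                  # from being launched by the schedule
  podGC:
    strategy: OnWorkflowSuccess   # delete step pods once the whole workflow succeeds
  templates:
  - name: main
    container:
      image: alpine:3.10
      command: [sh, -c]
      args: ["echo 'placeholder step'"]
```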
And as I mentioned earlier, Argo Events is a little complex: there are a few different objects involved (the events, the gateways, and the sensors). We kind of toughed it out and set it up, but you can also just trigger the workflows through an API if you want; that may be easier for some people.

Does anyone have any questions? Before that, let me give a quick plug: we're hiring two data scientists, one analytics-focused and one ML-focused, and a data engineer, so if you're interested in working with some of this stuff, please come see me after the talk. Otherwise, that's it.

Audience: If I wanted to do some sort of canary testing between two models, is that possible to implement in Argo as well? Canary or A/B testing of any sort?

Let's see. You could implement it in Argo, but it doesn't have any built-in feature for that. You'd basically have to run the two models in a workflow and then have some post-run step that sends the output somewhere so you can compare them. Really, it's just a way of running containers in a certain order.

Audience: You mentioned earlier being able to spin up separate pods and then spin them down based on the tasks you specify; would that be more useful?

Right, so by default pods spin down when the task ends, but you can also exit pods early from within the task itself if you want to. I'm not sure if that answers your question.

Audience: And is there resource cleanup?

Yes, there's garbage collection of resources. We actually let our pods sit around for a little bit, because we want to have them in case we want to go back and see what happened in more detail, but you can set that stuff to go away as soon as the workflow ends.

Audience: Are there any metrics or alerting besides Datadog?

Argo doesn't have any built-in metrics itself, so you need to bring that yourself. We use Datadog both at the cluster level and within individual tasks.

Audience: First of all, thanks for the talk. I had two questions; I'm an avid Airflow user, so that's a disclaimer. Firstly, can you backfill DAG runs using Argo?

That functionality is definitely not as fleshed out as it is in Airflow. You'd basically have to do it yourself by parameterizing your DAG run: you can pass in a start date and an end date, but it doesn't really have any built-in notion of time the way Airflow does.
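A sketch of that parameterized-backfill workaround; the parameter names and the date handling are hypothetical:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: backfill-etl-
spec:
  entrypoint: main
  arguments:
    parameters:
    - name: start-date
      value: "2019-01-01"            # defaults, overridden per run
    - name: end-date
      value: "2019-01-31"
  templates:
  - name: main
    container:
      image: example.com/etl:latest  # hypothetical image
      command: [sh, -c]
      args: ["etl --start {{workflow.parameters.start-date}} --end {{workflow.parameters.end-date}}"]
```

Each slice is then kicked off by hand or by a script, e.g. `argo submit backfill.yaml -p start-date=2019-02-01 -p end-date=2019-02-28`; unlike Airflow, nothing tracks which intervals have already run.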
Audience: Okay, fair enough. The other question is: how do you test your DAGs? Since they're YAML; Airflow DAGs are Python, so they're easy to test, in a sense.

For testing, we basically use a data-validation approach: we use data tests, so we look at what the output should be and then write tests against that. We don't actually test the DAG logic itself; most of our DAGs are pretty simple and we're more interested in the output than anything else. So it's kind of integration tests versus unit tests.
Info
Channel: Data Council
Views: 5,155
Rating: 4.7435899 out of 5
Keywords: machine learning, computer vision, AI, big data, technology, engineering, software engineering, software development
Id: 99o9S20D5s8
Length: 30min 19sec (1819 seconds)
Published: Thu Nov 21 2019