Apache Spark on Kubernetes - Anirudh Ramanathan & Tim Chen

Video Statistics and Information

Captions
Hey, I'm Anirudh. I work at Google on the Container Engine team, specifically on Kubernetes. This is Tim Chen; he used to be at Mesosphere working on Spark on Mesos, and now he has a startup called HyperPilot. We're here to talk to you about Apache Spark on Kubernetes, some of the work we've been doing for the past four or five months, and where we're at at this point. The agenda: I'll tell you a little bit about Kubernetes in general, containers, Docker and so on, and the motivation for this problem, why we're trying to bring Spark onto Kubernetes; then a bit of a deep dive into the design of how this really works; then Tim will show you a demo of it actually working, followed by a deep dive that explains what happened in the demo; and finally a roadmap of where we're headed next.

So what is Kubernetes? It's an open-source system, you may have heard of it, used for automating deployment, scaling, and management of containerized applications. Let's take a use case. If you have a single application running on a server somewhere and you just have a script to restart it when it goes down, you probably don't need Kubernetes. But scale that out: you have a hundred applications, each with maybe tens of instances, they're load balanced, and you have other requirements around them. At that point you might want to think about cluster management and Kubernetes.

To explain a little more what I mean by containerized here: this must be familiar to most people, and we don't have time to go into the technical details of what containers are, but just to mention some of the advantages. You get repeatable builds and workflows; you can package up your application and its dependencies into one hermetically sealed package, which gives you repeatable builds, and you can store it in registries, version it, and so on. It improves your workflow, it's less DevOps load, it's just goodness all along. Containers are also lighter weight than VMs, so you get improved infrastructure utilization, which is a good thing.

Kubernetes is a pretty active project. It's one of the top 100 projects on GitHub, around 23,000 stars, a lot of commits, and as you can see the velocity is pretty high. There are lots of organizations involved, and these are not just organizations using it in production; they're also contributing code back to Kubernetes itself. A lot of this stemmed from Google's own experience running Borg internally, which is its container management system, and a lot of those learnings have been channeled into Kubernetes.

Here's an overview of what Kubernetes is and does, the 10,000-foot view. I'll start from the right. On the right are the nodes; the nodes run the users' containers, so if you launch something on Kubernetes, it's running on a node. Each node has a daemon called the kubelet, and the kubelet talks back to a special node called the master. The master has a bunch of things running inside of it, but it's just a special node. The users talk to the master to declare what they want to run: you say "hey Kubernetes, run five instances of nginx for me," and the master takes that and schedules the workload onto the nodes.
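As a concrete illustration, declaring that kind of workload from the command line looks roughly like the sketch below. The kubectl run form with --replicas matches kubectl of that era; newer releases use kubectl create deployment plus kubectl scale instead.

    # Ask the master to run five instances of nginx; the scheduler decides
    # which nodes the resulting pods land on.
    kubectl run nginx --image=nginx --replicas=5

    # See the pods and which node each one was scheduled onto.
    kubectl get pods -o wide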
Drilling down into the nodes: I kind of lied to you, they don't run containers directly, they run pods. Pods are a higher-level abstraction on top of containers, and a pod is the lowest unit of scheduling in Kubernetes. A pod can contain one or more containers. The cool thing about pods is that each of them has its own IP address, so if you have a pod on one node trying to talk to a pod on another, you don't have to worry about how those packets are translated through the network; the underlying system handles that for you. You can also have volumes, including network-attached storage. Another cool thing is that there are primitives built on top of pods, controllers that create pods on the user's behalf, such as StatefulSets, DaemonSets, and so on.

So the motivation for this work: why even care about Spark on Kubernetes? A lot of organizations run a lot of workloads on Kubernetes, serving and stateful workloads, databases, and so on, and bringing Spark and other data processing workloads onto Kubernetes gives us convergence. It's good for the IT folks because it's less infrastructure to manage, it's good for the developer because it gives you a single interface to manage all of your workloads, and your spare cycles can be utilized well, so you get improved infrastructure utilization. On top of that, Kubernetes has a whole host of add-ons and services you can use right out of the box, mostly because there's such a huge ecosystem surrounding Kubernetes, Docker, and containers. For example, Istio is a project that launched recently and takes care of a lot of the problems people often run into when running applications, things like tracing and rate limiting, and you can use all of that with Spark on Kubernetes as well.

Getting to the design of Spark on Kubernetes: this diagram must be familiar to most people who have used Spark. You have Spark Core, the layer of abstraction that contains RDDs and all the cool Spark things, and on top of that you have the different libraries that do graph processing, SQL, machine learning, streaming, and so on. A layer below, which is what Spark Core talks to, are the different cluster managers. These are almost like plugins, pretty separate parts, so we added Kubernetes right alongside Mesos, YARN, and Standalone as a cluster manager. This is what it looks like: the dotted line separates Kubernetes from Spark. Spark Core cares about getting new executors, pushing new configuration, deleting executors, and so on. We take whatever Spark Core tells us, and the plugin we wrote, the Kubernetes scheduler backend, translates that into Kubernetes primitives so the cluster understands what you actually mean when you're provisioning executors. In addition, it handles resource requests, authorization and authentication, and encapsulates all of the communication with Kubernetes.

Beyond that, we designed a few other components around the integration. One of them is a resource staging server. Imagine you're running experiments: you typically don't want to create new Docker images every time you change an input file, and you may not want to host the file in HDFS or push it there every time. Instead, you can push your files to this staging server that we wrote. It sits inside the cluster and hosts your files so that your drivers and executors can fetch them from it. It's a standalone component that you can add to your Kubernetes cluster.
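For illustration, using the staging server from spark-submit looked roughly like this. The resourceStagingServer property name is the one used by the apache-spark-on-k8s fork at the time and may differ in later releases; the API server address, service URL, class name, and file names are placeholders.

    # Minimal sketch: upload a local jar and data file to the in-cluster
    # staging server so the driver and executors can fetch them.
    spark-submit \
      --deploy-mode cluster \
      --master k8s://https://<api-server-host>:443 \
      --conf spark.kubernetes.resourceStagingServer.uri=http://staging-server.spark.svc:10000 \
      --files ratings.csv \
      --class com.example.MyApp \
      my-app.jar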
Another component is the Spark shuffle service. This is essential if you're using dynamic allocation. Dynamic allocation is a mode where Spark elastically changes the number of executors as the job progresses, depending on the number of pending tasks in a particular stage. That requires the executors' temporary files, the shuffle files, to be written to a different location, because the executors are not long-lived: you might lose an executor as the Spark job scales down. That's where the shuffle service comes in. It's a component that is part of Spark; we used those primitives and created an implementation for Kubernetes, so you can deploy your own shuffle service and use dynamic allocation in Spark.

Then there are third-party resources, which are going to be called custom resources in the next version of Kubernetes. This is a feature of the Kubernetes API that lets you add application-specific knowledge to Kubernetes itself, so you can visualize it in dashboards, in the console, and so on. You can have your Spark job specifics, like the number of executors and how much progress you've made, be visible to Kubernetes as part of the API.

This slide shows the different ways you can get your files into Kubernetes. One is to bake them into your container along with all of the dependencies, so it's one reproducible unit: every time you run it, you get the same results. You could also stage files and jars using the file staging server that we wrote, or use the traditional ways, Google Cloud Storage, S3, or HDFS; those work equally well.

Getting to the administration point: Kubernetes has a lot in it that helps system administrators manage a cluster really well, such as namespaces, and we have integration for all of this in spark-submit because of the changes we made. You can launch into a particular namespace; you can have those namespaces associated with roles that carry permissions, so a certain user could be allowed to create pods in a namespace while other users are not; and you can have quotas associated with those namespaces. For audit logging there are several third-party solutions, and there's also pluggable authorization and authentication. All of that can be used directly with Spark on Kubernetes as well.
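As a small illustration of that administration story, the Kubernetes side could look like the sketch below; the namespace name and quota limits are made up, and any Spark job submitted into that namespace would then be subject to them.

    # Create a namespace for a team's Spark jobs.
    kubectl create namespace spark-dev

    # Cap the total resources the namespace can consume; pods (including
    # Spark drivers and executors) beyond the quota will not be admitted.
    kubectl create quota spark-dev-quota --namespace=spark-dev \
      --hard=cpu=20,memory=64Gi,pods=50

    # Inspect current usage against the quota.
    kubectl describe quota spark-dev-quota --namespace=spark-dev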
This word cloud shows the different areas we focused on; it's all the options we added to spark-submit as part of this effort. Worth noticing is the focus on authentication, SSL, and so on, which goes to show we paid close attention to SSL end to end, and also to the submission server. So that's what we've done so far; I'm going to hand it off to Tim for the demo.

[Tim] OK, so we'll show a real live demo of this actually working. Before that, I was curious: who here is already running Spark with a cluster manager besides Standalone, so YARN or Mesos? OK, most of you, I'd say 60 or 70 percent. Who here is running Kubernetes already? Wow, that's a lot of people. Any brave souls running Spark on Kubernetes already? There are, there are people, so this is great. We'll show you what Spark-native integration with Kubernetes means; you can probably expect the same thing you see from the YARN or Mesos native integrations.

So let's run an actual real demo here. The job we're running is the Spark MovieLens example using MLlib: given a movie dataset and my personal ratings of some movies, can it recommend another set of movies out of this big set that I'm more likely to want to watch? I have a Kubernetes cluster, so let me break out of this real quick. This is the typical Kubernetes dashboard. We have different namespaces, a dev namespace and a prod namespace. The good thing about Kubernetes is that it has all these primitives, namespaces, pods, replica sets, and they're all available for us to use once you start actually integrating with Kubernetes itself.

Let's try to run a job in the dev namespace. This is what you'd typically expect: you run a spark-submit. This is the typical spark-submit you see; in this case we have deploy mode cluster and the class, MovieLensALS. Most importantly, we're pointing to a Kubernetes master. Where you used to point to a cluster manager, it used to be Mesos, it used to be YARN, here we're pointing to a Kubernetes master API server endpoint, which is what shows up when you run kubectl cluster-info. Because we're integrating directly with Kubernetes, we support namespaces right out of the box, so we can tell Kubernetes to run the job in this particular namespace and expect the integration to set up and use the right namespace. All the other settings are the ones you're familiar with: how many executor cores to use, how much executor memory, what driver memory, what driver cores. These are all the same settings; nothing changes when moving from one integration to the other. That's what's very powerful: you use Spark as is, pointed at Kubernetes.

So let me just run this, the Spark job with the dev namespace. OK, so we submitted it to the cluster; this is the job now... oh, I heard we're the first ones to actually do a live demo today, so this probably needs a token refresh. All right, that's what Ctrl-C is for; I'm sure your experience will be a lot better. Let's try this again... a Kubernetes exception... oh, sorry, it's already submitted, this is different. This might also teach you how to use kubectl at the same time. We actually generate a unique ID for every job pod; I had hard-coded a pod name earlier because it's easier to point a URL at, but in this case the generated one is better. OK, now it's running.

So the demo gods do exist today. We were able to submit this job directly to Kubernetes as pods: the driver runs in a pod, and it also launches the executors as pods. Let's look at the pod names in this namespace: you can see the MovieLens driver pod running, and here are the MovieLens executor pods running. If you look at the logs of the driver pod, you see your typical favorite Spark logs with all the nice information. And because Kubernetes also comes with proxying, you can actually get to the Spark UI through the proxy; let me make sure I'm at the right place. I'm using kubectl proxy on my laptop, so it's proxying to the Kubernetes master.
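The inspection steps Tim walks through here map onto standard kubectl commands, roughly as sketched below; the namespace and pod name are illustrative, not the exact ones from the demo.

    # List the driver and executor pods the job created in the dev namespace.
    kubectl get pods --namespace=dev

    # Follow the driver's logs; the familiar Spark output shows up here.
    # (The pod name is illustrative.)
    kubectl logs -f movielens-driver --namespace=dev

    # Proxy the cluster's API server to localhost, then browse to the
    # Spark UI through the proxy.
    kubectl proxy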
So I can go to localhost and look at the typical Spark UI that I like to see; everything just works as is. Once the job itself is running, you don't even have to think about Kubernetes that much: Spark takes over and runs your tasks as is.

Now let's do a similar thing in the prod namespace, and hopefully I don't have another pod running with the same name this time. OK, we don't, good. So it will be running, and you can imagine that when you do a kubectl get pods on the dev namespace, you won't see the prod namespace's jobs; but when you go to the prod namespace, you should see our prod job running. The integration is simple and super easy, because we're integrating directly with Kubernetes features. If you want to see it again, this is pretty much the same thing in the prod namespace, the same prod UI, and you can see your job running here as well. Eventually you should also see the results showing up in this box, giving you your favorite set of movies. There you go, all the movies you like.

OK, so that's our demo. Really quickly, just to show you: with this integration, similar to what you see on the cluster manager side for YARN or Mesos, you now have the option to point to a Kubernetes cluster, and we can leverage Kubernetes features, namespaces and the other primitives available in Kubernetes, to run Spark in a more powerful way. Everything else still looks and feels familiar, which is great: you don't have to run a Standalone cluster on top of Kubernetes, which is what people probably do right now.

Now a little deep dive; it's pretty simple. When you run spark-submit in cluster mode, it submits the job directly to Kubernetes as a pod that runs your driver. Optionally, this can talk to a resource staging server: if I want to upload my jars somewhere and I don't have HDFS running but still want my driver jars to be available, I can specify the resource staging server and my files; they go to the staging server and the driver automatically downloads them. That's a nice feature if you just want to run on Kubernetes and have your jars available. So spark-submit schedules a driver pod, and the driver pod itself has the Kubernetes integration: it goes back and talks to the Kubernetes scheduler to get executor pods. You can specify how many executor instances you want in your spark.executor.instances configuration, and it asks the scheduler to provision those executor pods.

One thing to mention: because we're integrating with Kubernetes directly, the scheduler features, not all of which are supported yet, will be supported here in the future, things like affinity and anti-affinity that are available on the Kubernetes scheduler side. If I want to say "schedule my executor pods only on these nodes" or "spread them out," or anything else the Kubernetes scheduler supports, we can eventually support that too. That's the powerful thing about integrating with Kubernetes directly.
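Putting that flow together, a cluster-mode submission against this backend looked roughly like the sketch below. The spark.kubernetes property names are the ones used by the apache-spark-on-k8s fork at the time (they changed when the integration later merged upstream), and the API server address, image names, jar path, and data path are placeholders rather than the exact values from the demo.

    spark-submit \
      --deploy-mode cluster \
      --class org.apache.spark.examples.mllib.MovieLensALS \
      --master k8s://https://<api-server-host>:443 \
      --conf spark.kubernetes.namespace=dev \
      --conf spark.executor.instances=4 \
      --conf spark.executor.memory=2g \
      --conf spark.driver.memory=2g \
      --conf spark.kubernetes.driver.docker.image=myrepo/spark-driver:v2.1 \
      --conf spark.kubernetes.executor.docker.image=myrepo/spark-executor:v2.1 \
      local:///opt/spark/examples/jars/spark-examples.jar \
      /data/movielens/ratings.csv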
The executor pods launch with all the dependencies, and all of your Spark configuration is forwarded from spark-submit to the driver and executors as you'd expect; the tasks run as is. Once the job finishes, the resources are marked with the same labels and can be cleaned up by Kubernetes itself, and at the end you also have a completed or failed driver pod that contains the logs, so you can look at what happened, what the results were, or why it failed. If you're using Kubernetes already, all the tooling you have for it, log ingestion, how you operate your Kubernetes clusters, is also available for this.

OK, now on to the roadmap. We really only started this not too long ago, so this is pretty good progress, really thanks to the help of the community and everybody involved. We support cluster mode, and we also support dynamic allocation; I would have put it in the demo today, but since we manage a shuffle service for you, you can just run with dynamic allocation enabled (a config sketch follows below). We also have our own file staging server that the community has written just for Spark on Kubernetes. We've tested this with Java and Scala support; technically GraphX, MLlib, and Structured Streaming should all just work, but we're still in the process of testing them and making sure they're ready. For next steps, we definitely want to support the Python and R bindings, which are currently being actively developed by the community, the Spark shell, and high availability: things like Spark Streaming checkpoints and all the HA features that users expect when running Spark in production, which we'll try to incorporate as well.
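Since dynamic allocation with the managed shuffle service is called out above, here is a rough sketch of enabling it. The spark.dynamicAllocation and spark.shuffle.service properties are standard Spark settings; the spark.kubernetes.shuffle ones are from the fork's shuffle-service integration and may differ in other versions, and the namespace, label, class, and jar values are placeholders.

    # Elastic executor scaling backed by the external shuffle service
    # running in the cluster (deployed as a DaemonSet, one per node).
    spark-submit \
      --deploy-mode cluster \
      --master k8s://https://<api-server-host>:443 \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.shuffle.service.enabled=true \
      --conf spark.kubernetes.shuffle.namespace=default \
      --conf spark.kubernetes.shuffle.labels="app=spark-shuffle-service" \
      --class com.example.MyApp \
      my-app.jar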
Cool, thank you. This is the stuff we're adding to Kubernetes in the coming releases, some of the things we've been working on. Custom resources, like I mentioned, are a way to extend the Kubernetes API itself, and there are a lot of improvements coming in that direction; it used to be called third-party resources, it's now graduating to beta, and we're looking to add more features there. Priorities and preemption for pods will lead to better sharing of a cluster between users based on priorities; that's coming as well and is in the design phase. Then there's batch scheduling and resource sharing, which is more for use cases like Spark, where we want to submit batch jobs and have them fairly share a cluster with other batch users, and also share it with real-time and serving workloads and scale down appropriately; that's also being designed for the coming releases. And then there's cluster federation and multi-cloud, which lets you deploy workloads and have them burst across different clouds, or manage multiple instances of Kubernetes on-prem and in the cloud under one umbrella using federation; that's being worked on and is, I think, available in beta.

Finally, there's the ecosystem part of the equation; there are a lot of other things in the ecosystem we need to work on. We have Helm, a tool for packaging applications, and there are Helm charts for Kafka, Cassandra, and HDFS. Pepperdata has been working a lot on HDFS on Kubernetes, and we actually have a talk on that tomorrow. So the ecosystem is also coming along, and we expect that in the coming releases, by the end of the year, a lot of the parts of this entire solution will be filled in.

So it is definitely a community effort; there are lots of organizations involved. In fact, most of the work around the native integration has been heavily contributed by the community, with Red Hat really actively involved, and we are actively looking to merge this back into Spark. Right now we're working on a separate fork, because we want to make sure the native integration is well tested and to keep our own velocity until it gets to a point where it's stable. We're actively working with the Spark committers; they're all involved in the discussions, and we'll have the docs pointing to our repo very soon. You can definitely follow the GitHub activity and our Kubernetes issues for features supporting Spark, and batch scheduling and big data in general, on Kubernetes.

This is also where I'd encourage all of you to really go and try it out: pull it down. The repo has documentation and a separate site with all the links, there's a ton of activity happening, and there are a lot of people involved. If you want to be involved in discussions, or contribute, or test in any way, this is truly a community effort, so we want everyone; the more people involved, the better.

We also have a SIG Big Data meeting; Kubernetes has a special interest group for big data, and every week on Wednesday at 10:00 a.m. we have a weekly Zoom meeting. If any of you wants to be involved in Kubernetes, or big data, or Spark on Kubernetes, wants to join the discussions, or just follow what's going on, feel free to join us. Most importantly, there's another session that's very related: HDFS on Kubernetes. Most people running Spark use HDFS as well, and there's the question of how you solve data locality, so that Spark knows which executor launched on which node. Pepperdata will be presenting their work on that tomorrow at 11 a.m.
So that's our content, and we're open for questions. Do you have any questions?

[Audience] In the introduction you mentioned that you had been using Spark on Mesos or DC/OS. I'm curious whether you moved to Kubernetes off of DC/OS for any particular reason, and what kind of comparison you'd make between the two cluster management systems.

I think Kubernetes and Mesos are very different systems, with very different communities as well. Kubernetes was designed and built from the learnings of Borg, with those primitives available; Mesos is designed very differently, as a two-level scheduling system where you're able to write your own frameworks, so there are a lot of opportunities for customization there. And the communities are different. With Kubernetes, as you can see, there's a lot of momentum; lots of people are starting to use Kubernetes as a resource manager, so I think this is going to happen no matter what, and people who use a cluster manager will probably prefer not to have a lot of different cluster managers they have to operate and learn to operate. I think Kubernetes is going to be very interesting: it already has a lot of support for long-running services, and it's just starting to learn how to do big data. So it's an exciting time. We're not suggesting that this is production-ready and that everybody should move their production workloads to Spark on Kubernetes right away; this is a very young effort, which is why we want everybody involved. But I'm really excited that Kubernetes, with the primitives it exposes, opens up different opportunities: if you want pods, replica sets, those kinds of things, and you don't want to use a Mesos framework, Kubernetes is a good option. There are definitely reasons you might want to use Mesos, maybe there are existing frameworks there, so I don't think there's any conflict really; they're just different cluster managers altogether.

[Audience] Hi, first of all, thanks for doing this, because I've been running Spark Standalone on Kubernetes for a while, and it's good to see that someone's working on the backend; for too long it's just been Standalone and YARN, so it's good to have something else. My question is one I asked at the Kubernetes and Spark meetup at lunchtime, and someone said you might have looked at it but couldn't answer off the top of their head. Obviously the shuffle is a very network-intensive operation in Spark. Have you looked at the different networking backends, like Flannel, Weave, and Calico, and have you noticed any performance difference between them?

Good one. There was a set of experiments around that, but more in the context of HDFS and Spark together, where we looked at the difference between using an overlay network and not using it, and how that impacts locality. When your Spark job is in the cluster and you already have your dependencies available to it, networking is not the biggest time consumer or the bottleneck; but when you're fetching from HDFS, yes, it does become a bit more of a constraint.
There are detailed experiment results on the Spark repo if you want to look at them.

[Audience] OK, cool, thank you. One question: how does the external shuffle service work with respect to Kubernetes; how are you thinking of implementing it?

The shuffle service, right. Today the shuffle service is implemented as a DaemonSet in Kubernetes; that's a primitive that lets you put one instance of a particular service on each node, so each node has a shuffle service, and it shares disk with the executors. But we designed it in a way that is flexible enough that you could move to a model with two containers within the same pod, one being the shuffle service and one being the executor, so that's also possible in the current model. We left it flexible in that manner, but right now the default implementation uses a DaemonSet with shared hostPath volumes.

[Audience] And the DaemonSet is basically a set of daemons which will be running all the time on each node, is that right?

Yes, exactly. You can also target a subset of the nodes using a node selector in Kubernetes, and it gives you the added flexibility of having multiple shuffle services: if you have different Spark installations which are not compatible with each other, you can have different shuffle services for them.

[Audience] Another question: how long does the startup time take with respect to Kubernetes, for Spark to start up?

I think there are various factors, because we have Docker images involved, so a large part of the time is probably on the Docker image pulling side.

[Audience] Does it pull the Docker images every time?

When running through Kubernetes, you specify the Docker image you want; actually, one part I left out is that you specify which executor Docker image and which driver Docker image you want to use, and Kubernetes will download those images and run them for you. One thing is that with the cache it's probably a lot faster: if you're running the same job over and over again, once a node has pulled the image it caches it, so you don't pay that cost multiple times if you're using the same driver and executor images.

[Audience] OK, good. And I was just wondering...

[Moderator] Excuse me, I think there are more questions, but the session is over, so if you want you can come up and ask the speakers here. Can we thank the speakers? [Applause]
Info
Channel: Databricks
Views: 15,055
Rating: 5 out of 5
Keywords: apache spark, spark summit
Id: 0xRHONrWwvU
Length: 31min 54sec (1914 seconds)
Published: Mon Jun 12 2017