Data Pipeline in Kubernetes

Video Statistics and Information

Captions
I'm about to tell you a story of three engineers: DevOps, application, and data. All three of them have different needs and different views. Why can I tell you their story? Because I had the unique opportunity to play all three of them. My name is Eliran Bivas and I work for Iguazio; I tend to call myself a joker engineer, and as you can see I'm really into Legos, but this is as far as they go in this presentation.

For the past decade or so the big data world has changed dramatically. Since its first release back in April 2006, Hadoop really defined the big data era. Those were the days when the network was slow, memory was expensive, and we needed something to process our ever-growing data. We believed that by bringing the compute closer to the data we would solve all of our problems. We remodeled our applications into MapReduce pipelines, and it actually worked, and it was fairly easy to deploy these scenarios.

But as the big data world grew, our software needed a lot more than just MapReduce, and our users demanded a lot more analytics. Memory became slightly cheaper, so we had the idea of loading everything into memory and computing as much as possible there. We worshipped the RDD that Spark brought us. We even started storing some of our data in non-Hadoop NoSQL data sources like Redis or Cassandra or maybe Elasticsearch, but we needed a lot more.

We placed Kafka and RabbitMQ right next to our APIs. The application flow was reading in a growing amount of data, and we needed a method to process all those incoming events. We built event-processing pipelines with spouts and bolts, we added Flink, and we even turned back to Spark and got our notion of streaming with micro-batching. Overall, all those frameworks were basically a way of bringing more and more data into our big data world.

Now that we were so good at collecting data, we needed to figure out what it means. Presto was added to the mixture with its querying solution. We turned to machine learning, later adopted deep learning, and these days it seems like many companies want to add TensorFlow to the ever-growing technology map. Deploying this infrastructure became really hard work; a lot of companies actually base their offering on how well they can deploy this technology map into your organization. Even where I work, at Iguazio, we provide Hadoop-compliant APIs because we deal with companies that require big data solutions.

The so-called cloud era didn't change much of what we knew. Even the major cloud providers had some alignment with the Hadoop ecosystem: they have Hadoop integrated with their services, and most of them have an implementation of Hadoop-compliant APIs. Take Amazon, the leading cloud provider, for example: S3 works as a Hadoop-compliant file system; EMR, Amazon Elastic MapReduce, of course runs Spark, HBase, and YARN, and later on they added other capabilities; and if you want TensorFlow, Amazon has a pre-configured AMI running for you on AWS. And we started streaming stuff through Kinesis. But as you can see, nothing really changed: the hard work of deploying that tangled technology map was replaced by the hassle of choosing which serverless data services you're going to use and trying to figure out what your invoice at the end of the month is going to be, and no one actually can.
The cloud brought us to the serverless era. Our data was in the cloud; we didn't have to worry where it was. It was simply there, or occasionally it wasn't. And the real major leap was application-less code. This is when Lambda was introduced: we were promised a simple way to process incoming events, we just write code. Well, if you're familiar with Lambda, that was the promise; we actually got something else.

So let's do a short recap. Our data flow has obviously evolved; we added so many big data frameworks to our toolbox. These frameworks were scheduled with different schedulers: Spark, for example, was using YARN, some of them were using Mesos, and applications were sometimes using something completely different. And now, with the introduction of new architectures and new analytics tools, we need to rethink our ecosystem.

Looking back at our three engineers, we need to re-understand their current needs. Application engineers want what every application engineer wants: agile development, releasing as frequently as possible, and getting user feedback as soon as possible. Data engineers are pretty simple: they just want stuff to keep working. They don't care how, but keep it working and keep their data available. And the DevOps engineers also want the tools to keep working, but with less maintenance: they want an easier way to manage all those services, applications, and frameworks, and they will not manage multiple clusters with multiple schedulers. We must consolidate.

Thanks to the container revolution we have an easy way to deploy these scenarios and a new ecosystem to work with. Our combined toolbox should run on a single scheduler, and of course, since we're at KubeCon, I'm suggesting Kubernetes.

So let's first review our data. We look at our data as unstructured (object store), structured store, and streaming. This is basically how we look at data throughout, whether it's in the cloud or on-prem: this is your data. Decoupling the data from the rest of the system requires a shift in the current mindset: our ecosystem should grow from the Hadoop mindset of distributing the data itself to distributing its access. Running in the cloud allows you to access your data in a distributed fashion; think of S3 or DynamoDB or Kinesis: you don't have to worry where the data is, you simply access it from anywhere. And even if you don't run in the cloud, you can access your data through services: object storage services for objects, Cassandra that you can access without much problem, a Kafka cluster that can be accessed from anywhere. This is what people are actually referring to as cloud native: resilient and always-accessible data services.

In between we have the orchestration: Kubernetes will schedule our applications, analytics tools, and frameworks and manage our entire configuration. When we align everything to a single unified orchestration we need to adapt the application layer. We require the uppermost layer to be upgraded; we can't simply run anything on top of our orchestrator. These frameworks and applications must be cloud native, and they have to use every tool Kubernetes has to offer. Besides our applications and big data frameworks, we can now leverage serverless frameworks: we can now have functions as a service, our very own Lambda.
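As a minimal sketch of that decoupled-access idea (not shown in the talk; the service name and hostname below are hypothetical), an ExternalName Service can give every workload in the cluster one stable in-cluster name for a data service that lives outside it:

    apiVersion: v1
    kind: Service
    metadata:
      name: cassandra              # stable name applications resolve in-cluster
      namespace: data
    spec:
      type: ExternalName
      externalName: cassandra.example.internal   # hypothetical external data endpoint

Pods then connect to cassandra.data.svc.cluster.local regardless of where the cluster or the data actually lives, which is the "access it from anywhere" property described above.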
Our big data pipeline must evolve as well, and actually it already has: it's no longer a pipeline, it's a living system. Our systems and services are constantly processing and analyzing data, accessing the data simultaneously, running functions, microservices, and analytics tools. We have data coming in from or being pulled out to IoT devices, external sources, dashboards, and many more. The entire layer around our data runs on our unified orchestration, and the data is being accessed from anywhere at any time.

Another evolution we see when moving to Kubernetes is using serverless frameworks, like the ones I mentioned before: Kubeless, OpenFaaS, Nuclio. This is of course a blessed move: we no longer need to worry about how to build a Docker image, and we no longer need to understand how to deploy it or run it on Kubernetes. We simply write the code and the function framework deploys it for us; it compiles the code and everything simply runs in our infrastructure. Another good benefit of functions is that you're not bound to any specific language: your entire stack can be polyglot, and it already is. But using serverless frameworks this way comes with a great price: usually you get slow performance, a slow development cycle, and you're mostly limited to HTTP endpoints. Not really our very own Lambda.

Looking at all those serverless faults, we at Iguazio decided to build a real-time serverless platform: Nuclio. The Nuclio platform can take any event source, not just HTTP, and sources can of course be combined: you can listen simultaneously to Kafka, to HTTP, to Kinesis with the same function code that you wrote. This gives you a better debugging, testing, and execution cycle. Since it also runs everywhere, not just on Kubernetes, you can deploy and test anywhere. Your sole focus is on writing the code, and the rest is taken care of. We even provide you with built-in metrics and logging. This was open-sourced recently and you should definitely check it out. We are also running a hackathon now, which you should definitely check out; the prize is a very high-end drone.

So now you're probably saying: okay, I listened to your talk and everything is containerized, I'm using functions, I'm using microservices, my Spark is containerized, great. But now I have thousands of containers running on my orchestrator, and each one of them is going to open a connection to my data service, which is going to create a huge load on the system. And you're actually right. When I said we need cloud-native frameworks and applications, it meant that some frameworks need to evolve as well. At Iguazio we were dealing with very large clusters and needed a way to optimize data access, so we built a solution around shared memory which brings the data directly into your application's memory. The solution works closely with Kubernetes to allow shared connections, fast data access, and faster connection initialization. Take Spark, for example: we created a DataFrame implementation that reads from shared memory populated by a v3io daemon running on each of the nodes. This daemon is the sole owner of the outgoing TCP or RDMA connections to the data service.

Now, if you are not running something that we support natively, like Nuclio, Spark, Hadoop, and others, we also have the same solution available as a FUSE mount, which you can consume through a FlexVolume in Kubernetes and read directly from your application. Just like Nuclio, this entire work was open-sourced, so you can check out the solution and work out ways to leverage it with your own data services.
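As a rough sketch of how such a FUSE/FlexVolume mount could be consumed from an ordinary pod (the driver name and options below are hypothetical placeholders, not the actual Iguazio driver):

    apiVersion: v1
    kind: Pod
    metadata:
      name: analytics-app
      namespace: kubecon
    spec:
      containers:
      - name: app
        image: registry.example.com/analytics-app:latest
        volumeMounts:
        - name: data
          mountPath: /data                   # the application reads the data service as files
      volumes:
      - name: data
        flexVolume:
          driver: example.com/shared-data    # hypothetical FlexVolume driver name
          options:
            container: bigdata               # hypothetical driver-specific option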
We do hope that other data services will offer this kind of acceleration in the near future.

We've talked a lot about how we need to look at our data, our applications, our frameworks, and which new frameworks we need to add. Now let's look at our deployments; I'll assume the DevOps hat for a minute. Remember how hard it was to deploy and manage complete clusters? Let's look at a Spark example: every aspect of the system is managed by Kubernetes Deployments, Services, and even the configuration itself. And ConfigMaps are not just for flat configuration: startup scripts are easily managed in ConfigMaps, as I will demonstrate later on. You can leverage a lot of the tooling Kubernetes has to offer to manage your entire deployment. And how much simpler is using Helm versus Chef or Puppet? You don't have to use another language: YAML, which you already have to use because you're using Kubernetes, is what Helm uses to describe your deployments.

The other thing I'll demonstrate is what our current pipeline looks like, and like I said, a current pipeline is a living system: the data is simultaneously being accessed from multiple locations at once. I'm taking a real example from one of our customers, an IoT automobile company. Their cars constantly send information: where the driver is, some metrics of the car, and so on. Everything is processed in real time and ingested into our data services, and simultaneously there are dashboards showing where the driver is and what the alerts for that driver are, and there are also data engineers running analytics on the same data that just came in. Everything is being processed simultaneously.

Okay, so what I have here is a completely new Kubernetes cluster running in one of our data centers; I hope you can see it properly. The first thing I'm going to do is create a new namespace for this new customer, and since I can't see what I'm typing, of course I'm going to make a mistake. Great, so we have a new namespace to facilitate this new customer. Now let's create something that allows the new customer to access its new namespace. I'm going to do something you should not do: I'm going to give it cluster-admin. Usually we would create a fine-grained binding, but just for the sake of the demo we create a new role binding.

And now we'll do a helm install of our v3io daemon. What I'm about to install, into the kubecon namespace, uses the v3io daemon chart, which is available for you to use, pointing at one of our data services. Once I hit enter you immediately see that it's been deployed, and in a few seconds it's operational.

Now I want to show you what I meant when I said ConfigMaps are not just for configuration; they can do a lot more. I'm simply going to describe one of the ConfigMaps that I'm using. Oh, wait a second, I forgot to update my context. Okay, I'll move it so I can see as well. What we have here is raw configuration as a JSON file, saved directly in a ConfigMap. It's not the usual way to use a ConfigMap, because people usually tend to use it as a map of key-value pairs; here I simply place it as a file and then mount it into the container. Another thing that I really love doing is the initialization script, which I also place in a ConfigMap. It lets me better control which parameters I use during initialization, rather than overriding all kinds of YAML files.
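A minimal sketch of that file-style ConfigMap pattern (names and contents are made up for illustration): the raw JSON configuration and the init script live in one ConfigMap and are mounted into the container as files.

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: spark-config
      namespace: kubecon
    data:
      config.json: |
        { "endpoint": "data-service.example:8081", "pool-size": "8" }
      init.sh: |
        #!/bin/sh
        # startup logic kept in the ConfigMap instead of baked into the image
        cp /config/config.json /opt/app/config.json
        exec /opt/app/start.sh

And the fragment of the pod spec that mounts and runs it:

      containers:
      - name: spark
        image: registry.example.com/spark:latest   # placeholder image
        command: ["/bin/sh", "/config/init.sh"]
        volumeMounts:
        - name: config
          mountPath: /config
      volumes:
      - name: config
        configMap:
          name: spark-config
          defaultMode: 0755        # make init.sh executable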
Now let's add Spark, again with Helm, just like we did with the v3io daemon; it will simply be running in a matter of seconds. Very simple. What I'm about to show you next is Nuclio's function service. We have a playground for you to deploy functions; it will be simpler once I find the mouse. I can deploy any preloaded function, or of course provide my own, hit deploy, and it will simply ship to the cloud, in this case our Kubernetes cluster. But this is not how you usually want to do things: you don't want to open another IDE or yet another tool, you simply want to use kubectl or other command-line tools to do the automation for you. Of course we have an automation tool for that, so now I'm going to deploy a function, just like you saw in the UI, using our nuctl command-line tool. It will take the code, build it into a container, run all kinds of tests on it, ship it to a registry that we define, and it will end up running in our Kubernetes cluster.

I also deployed a UI for our demo, and while we are looking at that deploying, there's a simple map. Currently there's no data, because we didn't stream anything in; we just launched the application, so there's a dashboard waiting for data to be streamed in. So let's stream some data. Okay, it's just streaming, but the real point here is that the function we just deployed starts to receive the events and populate all the real-time data on the map. This is of course with a lot of drivers being hammered into the system, but as you can see, everything stays live throughout the presentation.

Now I have two ways of accessing the data: one is the function, the other is the dashboard. I also mentioned that data engineers might want to use, let's say, Zeppelin to run small jobs. So let's use Zeppelin and create a simple notebook; one second, let's call it kubecon. Now we'll run a simple Spark job, just to show you how I can access the data simultaneously as it's coming in from the function. I have a very simple job: run some analytics and show the results. This is of course a cold start of Spark, so it might take a few more seconds, but the data keeps coming in; it's constantly being processed by the function, constantly being read by the dashboard, and now also by the Spark job. This is what I meant by a living system: everything is constantly being accessed. And here we have the output: a single driver that fits the criteria.

Now, during the talk I said it's going to be easier to run on Kubernetes, that we get a much easier deployment, and no one actually stopped me and said: this is not easier, you keep running kubectl, you're using nuctl, you're still editing YAMLs and init scripts and stuff like that. Too bad no one interrupted, because actually, yes, this is not the way to do it. This is not the way to deploy; it's mimicking the old way of doing bad things in your deployments. So what should you be doing? Let's kill the current application. I'll use Helm to remove everything: Spark, our daemon, the function, everything we just installed, and I'm about to show you how to really do it, if it's working. No permissions, right?
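For reference, a command-line deployment along those lines might look roughly like this; the function name, path, and registry are placeholders, and the exact flags should be checked against nuctl's own help:

    # build the function from local source and deploy it into the cluster
    nuctl deploy drivers-ingest \
        --namespace kubecon \
        --path ./drivers-ingest \
        --registry registry.example.com/demo

    # confirm the function pod is running next to Spark and the daemon
    kubectl -n kubecon get pods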
So when I say that we should leverage the tools in our toolbox: Helm is not just for showing how easy it is to install something with Helm, it's for installing a complete application with Helm. As the cluster is shutting down, what I'm about to show is that Nuclio provides you with YAML. YAML is common to Kubernetes; this is YAML native to Kubernetes, meaning you can edit it and redeploy the function over and over again. It's taking some time, but the point is that you can take that YAML and use it in Helm, with all the templating that Helm provides, and then you can launch everything with a single command instead of all the stuff I just typed, which, I have to stress, is the wrong way of using Kubernetes. I hope no one took pictures, because that really is the wrong way of using it. So now let's see; the daemon doesn't want to die, we'll kill it.

Okay, so we have a clean cluster, and now this is the way to do everything I just did, including the role binding, the namespace, everything, in a single command. This is the proper way; it's what you'll hear if you've been around the Helm talks and the talks on how to use Kubernetes. In a single command we're going to deploy our function, our daemon, Spark, Zeppelin, everything I just showed you. What was five minutes of work is now five seconds of work until everything is running, including the new function, Spark, Zeppelin, and our enhanced daemon.

So, a few tips and lessons we at Iguazio learned along the way. The first and foremost rule, and I can't stress how important it is: you should really read the manual. I'm not belittling the effort; you should definitely read the manual. I've seen too many hacks people try to pull off with Kubernetes, and the manual is very comprehensive and easy to follow. It's sometimes hard to get through because of its length, but it's not something you'd call very difficult; a lot of the time you simply copy and paste commands.

Second is the community. Kubernetes has a great community, and it's not limited to GitHub: you can find the Slack channel, you can join the groups, and everyone in the community is helpful. But it comes with a special note: like Kubernetes itself, the community is young, so sometimes you might get help that contradicts the manual. Recheck everything you're told.

Third, know the tools Kubernetes provides. During the development of many scenarios for our customers, you wouldn't believe how many times I was just port-forwarding to check whether a pod is doing what I expect it to do. Do logging; collect everything you can using kubectl. kubectl is a great tool, and you should really understand every option it has available. Also, one thing most people miss in kubectl is when to use the -o output flag, because the YAML output often lets you understand what happened to a Service or a Deployment; note that not all commands accept the -o flag. And like I showed you, always navigate with Helm, or another solution of your choice, but stick to it. Helm has great options and a great understanding of Kubernetes, and as you can see, once you really understand how to use it, our entire application stack gets deployed with a single Helm command, and doing upgrades with Helm is even easier.
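That single-command flow could look something like the following, assuming a hypothetical umbrella chart that bundles the namespace objects, role binding, daemon, Spark, Zeppelin, and the function (chart and release names are placeholders; Helm 3 syntax shown):

    # one release installs the whole stack
    helm install kubecon-demo ./bigdata-pipeline --namespace kubecon --create-namespace

    # inspect what was actually created, per the kubectl tips above
    helm status kubecon-demo
    kubectl -n kubecon get deployments -o yaml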
And, as I think Kelsey mentioned: don't ever do kubectl edit, and don't ever SSH in or kubectl exec into your containers to edit them. Simply use Helm.

Another tip, when you deal with a large cluster and many applications sharing the same cluster: do not overuse NodePort. NodePort is great when you're debugging, it truly is, because you know exactly where to reach the service; but if you stick with static node ports you end up having to manage all of those node ports. Use load balancers; there are great options within Kubernetes.

The sixth item might look trivial, but configuration must go into ConfigMaps. Don't try to force it into other kinds of solutions; one I've seen is loading files from a host path, but then someone has to populate that file. Everything goes into a ConfigMap, which is very easy. A specific case for a ConfigMap is dealing with YAML: YAML is a very tricky syntax, and when you try to override a command you sometimes end up with results you didn't expect. So, like I showed you, place your scripts in ConfigMaps and call them, instead of trying to do wizardry with the YAML.

And the last one is not really specific to Kubernetes; for any deployment of large clusters, you should really collect operational data, and not just collect it. If you just collect it and do nothing with it, it means nothing and you might as well shut it down. You have to collect it and you have to understand it. We are big data engineers, so we need to understand what our big data means, and if the data is meaningless, shut it down. Thank you, and I'll take any questions.
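To make the NodePort tip above concrete, a minimal sketch (hypothetical names and ports) of exposing the demo dashboard through a LoadBalancer Service instead of a pinned NodePort:

    apiVersion: v1
    kind: Service
    metadata:
      name: driver-dashboard
      namespace: kubecon
    spec:
      type: LoadBalancer        # instead of type: NodePort with a fixed nodePort
      selector:
        app: driver-dashboard
      ports:
      - port: 80
        targetPort: 8080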
Info
Channel: Share Learn
Views: 607
Keywords: Big Data Pipelines over Kubernetes, data pipeline in kubernetes, devops, kubernetes, kubernetes data pipeline, running spark on kubernetes, kubernetes running hadoop, running hadoop in kubernetes, running cassandra in kubernetes
Id: xzvl4ZCxFqA
Length: 33min 45sec (2025 seconds)
Published: Sat Nov 28 2020