Data Pipeline in Kubernetes

Video Statistics and Information

Captions
I'm about to tell you a story of three engineers: DevOps, application, and data. All three of them have different needs and different views. Why can I tell you their story? Because I had the unique opportunity to play all three of them. My name is Eliran Bivas and I work for Iguazio; I tend to call myself a joker engineer, and as you can see I'm really into Legos, but this is as far as they go in this presentation.

For the past decade or so the big data world has changed dramatically. Since its first release back in April 2006, Hadoop really defined the big data era. Those were the days when the network was slow, memory was expensive, and we needed something to process our ever-growing data. We believed that by bringing the compute closer to the data we would solve all of our problems. We remodeled our applications into MapReduce pipelines, and it actually worked, and it was fairly easy to deploy these scenarios.

But as the big data world grew, our software needed a lot more than just MapReduce, and our users demanded a lot more analytics. Memory became slightly cheaper, so we had the idea of loading everything into memory and computing as much as possible there. We worshipped the RDD that Spark brought us. We even started storing some of our data in non-Hadoop NoSQL data sources like Redis or Cassandra or maybe Elasticsearch, but we needed a lot more.

We placed Kafka and RabbitMQ right next to our APIs. The application flow was reading in a growing amount of data, and we needed a method to process all those incoming events. We built event-processing pipelines with spouts and bolts, we added Flink, and we even turned back to Spark and got our notion of streaming with micro-batching. Overall, all those frameworks were basically a way of bringing more and more data into our big data world.

Now that we were so good at collecting data, we needed to figure out what it means. Presto was added to the mixture with its querying solution. We turned to machine learning, later adopted deep learning, and these days it seems like many companies want to add TensorFlow to the ever-growing technology map. Deploying this infrastructure became really hard work; a lot of companies actually base their offering on how well they can deploy this technology map into your organization. Even where I work, at Iguazio, we provide Hadoop-compliant APIs because we deal with companies that require big data solutions.

The so-called cloud era didn't change much of what we knew. Even the major cloud providers had some alignment with the Hadoop ecosystem: they have Hadoop integrated with their services, and most of them have an implementation of Hadoop-compliant APIs. Take Amazon, the leading cloud provider, for example: S3 works as a Hadoop-compliant file system; EMR, Amazon Elastic MapReduce, of course runs Spark, HBase, and YARN, and later on they added other capabilities; and if you want TensorFlow, Amazon has a pre-configured AMI running for you on AWS. And we started streaming stuff through Kinesis. But as you can see, nothing really changed: the hard work of deploying that tangled technology map was replaced by the hassle of choosing which serverless data services you're going to use and trying to figure out what your invoice at the end of the month is going to be, and no one actually can.
The cloud brought us to the serverless era. Our data was in the cloud; we didn't have to worry where it was. It was simply there, or occasionally it wasn't. And the real major leap was application-less code. This is when Lambda was introduced: we were promised a simple way to process incoming events, we just write code. Well, if you're familiar with Lambda, that was the promise; we actually got something else.

So let's do a short recap. Our data flow has obviously evolved; we added so many big data frameworks to our toolbox. These frameworks were scheduled with different schedulers: Spark, for example, was using YARN, some of them were using Mesos, and applications were sometimes using something completely different. And now, with the introduction of new architectures and new analytics tools, we need to rethink our ecosystem.

Looking back at our three engineers, we need to re-understand their current needs. Application engineers want what every application engineer wants: agile development, releasing as frequently as possible, and getting user feedback as soon as possible. Data engineers are pretty simple: they just want stuff to keep working. They don't care how, but keep it working and keep their data available. And the DevOps engineers also want the tools to keep working, but with less maintenance: they want an easier way to manage all those services, applications, and frameworks, and they will not manage multiple clusters with multiple schedulers. We must consolidate.

Thanks to the container revolution we have an easy way to deploy these scenarios and a new ecosystem to work with. Our combined toolbox should run on a single scheduler, and of course, since we're at KubeCon, I'm suggesting Kubernetes.

So let's first review our data. We look at our data as unstructured (object store), structured store, and streaming. This is basically how we look at data throughout, whether it's in the cloud or on-prem: this is your data. Decoupling the data from the rest of the system requires a shift in the current mindset: our ecosystem should grow from the Hadoop mindset of distributing the data itself to distributing its access. Running in the cloud allows you to access your data in a distributed fashion; think of S3 or DynamoDB or Kinesis: you don't have to worry where the data is, you simply access it from anywhere. And even if you don't run in the cloud, you can access your data through services: object storage services for objects, Cassandra that you can access without much problem, a Kafka cluster that can be accessed from anywhere. This is what people are actually referring to as cloud native: resilient and always-accessible data services.

In between we have the orchestration: Kubernetes will schedule our applications, analytics tools, and frameworks and manage our entire configuration. When we align everything to a single unified orchestration we need to adapt the application layer. We require the uppermost layer to be upgraded; we can't simply run anything on top of our orchestrator. These frameworks and applications must be cloud native, and they have to use every tool Kubernetes has to offer. Besides our applications and big data frameworks, we can now leverage serverless frameworks: we can now have functions as a service, our very own Lambda.
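As a minimal sketch of that decoupled-access idea (not shown in the talk; the service name and hostname below are hypothetical), an ExternalName Service can give every workload in the cluster one stable in-cluster name for a data service that lives outside it:

    apiVersion: v1
    kind: Service
    metadata:
      name: cassandra              # stable name applications resolve in-cluster
      namespace: data
    spec:
      type: ExternalName
      externalName: cassandra.example.internal   # hypothetical external data endpoint

Pods then connect to cassandra.data.svc.cluster.local regardless of where the cluster or the data actually lives, which is the "access it from anywhere" property described above.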
Our big data pipeline must evolve as well, and actually it already has: it's no longer a pipeline, it's a living system. Our systems and services are constantly processing and analyzing data, accessing the data simultaneously, running functions, microservices, and analytics tools. We have data coming in from or being pulled out to IoT devices, external sources, dashboards, and many more. The entire layer around our data runs on our unified orchestration, and the data is being accessed from anywhere at any time.

Another evolution we see when moving to Kubernetes is using serverless frameworks, like the ones I mentioned before: Kubeless, OpenFaaS, Nuclio. This is of course a blessed move: we no longer need to worry about how to build a Docker image, and we no longer need to understand how to deploy it or run it on Kubernetes. We simply write the code and the function framework deploys it for us; it compiles the code and everything simply runs in our infrastructure. Another good benefit of functions is that you're not bound to any specific language: your entire stack can be polyglot, and it already is. But using serverless frameworks this way comes with a great price: usually you get slow performance, a slow development cycle, and you're mostly limited to HTTP endpoints. Not really our very own Lambda.

Looking at all those serverless faults, we at Iguazio decided to build a real-time serverless platform: Nuclio. The Nuclio platform can take any event source, not just HTTP, and sources can of course be combined: you can listen simultaneously to Kafka, to HTTP, to Kinesis with the same function code that you wrote. This gives you a better debugging, testing, and execution cycle. Since it also runs everywhere, not just on Kubernetes, you can deploy and test anywhere. Your sole focus is on writing the code, and the rest is taken care of. We even provide you with built-in metrics and logging. This was open-sourced recently and you should definitely check it out. We are also running a hackathon now, which you should definitely check out; the prize is a very high-end drone.

So now you're probably saying: okay, I listened to your talk and everything is containerized, I'm using functions, I'm using microservices, my Spark is containerized, great. But now I have thousands of containers running on my orchestrator, and each one of them is going to open a connection to my data service, which is going to create a huge load on the system. And you're actually right. When I said we need cloud-native frameworks and applications, it meant that some frameworks need to evolve as well. At Iguazio we were dealing with very large clusters and needed a way to optimize data access, so we built a solution around shared memory which brings the data directly into your application's memory. The solution works closely with Kubernetes to allow shared connections, fast data access, and faster connection initialization. Take Spark, for example: we created a DataFrame implementation that reads from shared memory populated by a v3io daemon running on each of the nodes. This daemon is the sole owner of the outgoing TCP or RDMA connections to the data service.

Now, if you are not running something that we support natively, like Nuclio, Spark, Hadoop, and others, we also have the same solution available as a FUSE mount, which you can consume through a FlexVolume in Kubernetes and read directly from your application. Just like Nuclio, this entire work was open-sourced, so you can check out the solution and work out ways to leverage it with your own data services.
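As a rough sketch of how such a FUSE/FlexVolume mount could be consumed from an ordinary pod (the driver name and options below are hypothetical placeholders, not the actual Iguazio driver):

    apiVersion: v1
    kind: Pod
    metadata:
      name: analytics-app
      namespace: kubecon
    spec:
      containers:
      - name: app
        image: registry.example.com/analytics-app:latest
        volumeMounts:
        - name: data
          mountPath: /data                   # the application reads the data service as files
      volumes:
      - name: data
        flexVolume:
          driver: example.com/shared-data    # hypothetical FlexVolume driver name
          options:
            container: bigdata               # hypothetical driver-specific option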
We do hope that other data services will offer this kind of acceleration in the near future.

We've talked a lot about how we need to look at our data, our applications, our frameworks, and which new frameworks we need to add. Now let's look at our deployments; I'll assume the DevOps hat for a minute. Remember how hard it was to deploy and manage complete clusters? Let's look at a Spark example: every aspect of the system is managed by Kubernetes Deployments, Services, and even the configuration itself. And ConfigMaps are not just for flat configuration: startup scripts are easily managed in ConfigMaps, as I will demonstrate later on. You can leverage a lot of the tooling Kubernetes has to offer to manage your entire deployment. And how much simpler is using Helm versus Chef or Puppet? You don't have to use another language: YAML, which you already have to use because you're using Kubernetes, is what Helm uses to describe your deployments.

The other thing I'll demonstrate is what our current pipeline looks like, and like I said, a current pipeline is a living system: the data is simultaneously being accessed from multiple locations at once. I'm taking a real example from one of our customers, an IoT automobile company. Their cars constantly send information: where the driver is, some metrics of the car, and so on. Everything is processed in real time and ingested into our data services, and simultaneously there are dashboards showing where the driver is and what the alerts for that driver are, and there are also data engineers running analytics on the same data that just came in. Everything is being processed simultaneously.

Okay, so what I have here is a completely new Kubernetes cluster running in one of our data centers; I hope you can see it properly. The first thing I'm going to do is create a new namespace for this new customer, and since I can't see what I'm typing, of course I'm going to make a mistake. Great, so we have a new namespace to facilitate this new customer. Now let's create something that allows the new customer to access its new namespace. I'm going to do something you should not do: I'm going to give it cluster-admin. Usually we would create a fine-grained binding, but just for the sake of the demo we create a new role binding.

And now we'll do a helm install of our v3io daemon. What I'm about to install, into the kubecon namespace, uses the v3io daemon chart, which is available for you to use, pointing at one of our data services. Once I hit enter you immediately see that it's been deployed, and in a few seconds it's operational.

Now I want to show you what I meant when I said ConfigMaps are not just for configuration; they can do a lot more. I'm simply going to describe one of the ConfigMaps that I'm using. Oh, wait a second, I forgot to update my context. Okay, I'll move it so I can see as well. What we have here is raw configuration as a JSON file, saved directly in a ConfigMap. It's not the usual way to use a ConfigMap, because people usually tend to use it as a map of key-value pairs; here I simply place it as a file and then mount it into the container. Another thing that I really love doing is the initialization script, which I also place in a ConfigMap. It lets me better control which parameters I use during initialization, rather than overriding all kinds of YAML files.
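A minimal sketch of that file-style ConfigMap pattern (names and contents are made up for illustration): the raw JSON configuration and the init script live in one ConfigMap and are mounted into the container as files.

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: spark-config
      namespace: kubecon
    data:
      config.json: |
        { "endpoint": "data-service.example:8081", "pool-size": "8" }
      init.sh: |
        #!/bin/sh
        # startup logic kept in the ConfigMap instead of baked into the image
        cp /config/config.json /opt/app/config.json
        exec /opt/app/start.sh

And the fragment of the pod spec that mounts and runs it:

      containers:
      - name: spark
        image: registry.example.com/spark:latest   # placeholder image
        command: ["/bin/sh", "/config/init.sh"]
        volumeMounts:
        - name: config
          mountPath: /config
      volumes:
      - name: config
        configMap:
          name: spark-config
          defaultMode: 0755        # make init.sh executable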
Now let's add Spark, again with Helm, just like we did with the v3io daemon; it will simply be running in a matter of seconds. Very simple. What I'm about to show you next is Nuclio's function service. We have a playground for you to deploy functions; it will be simpler once I find the mouse. I can deploy any preloaded function, or of course provide my own, hit deploy, and it will simply ship to the cloud, in this case our Kubernetes cluster. But this is not how you usually want to do things: you don't want to open another IDE or yet another tool, you simply want to use kubectl or other command-line tools to do the automation for you. Of course we have an automation tool for that, so now I'm going to deploy a function, just like you saw in the UI, using our nuctl command-line tool. It will take the code, build it into a container, run all kinds of tests on it, ship it to a registry that we define, and it will end up running in our Kubernetes cluster.

I also deployed a UI for our demo, and while we are looking at that deploying, there's a simple map. Currently there's no data, because we didn't stream anything in; we just launched the application, so there's a dashboard waiting for data to be streamed in. So let's stream some data. Okay, it's just streaming, but the real point here is that the function we just deployed starts to receive the events and populate all the real-time data on the map. This is of course with a lot of drivers being hammered into the system, but as you can see, everything stays live throughout the presentation.

Now I have two ways of accessing the data: one is the function, the other is the dashboard. I also mentioned that data engineers might want to use, let's say, Zeppelin to run small jobs. So let's use Zeppelin and create a simple notebook; one second, let's call it kubecon. Now we'll run a simple Spark job, just to show you how I can access the data simultaneously as it's coming in from the function. I have a very simple job: run some analytics and show the results. This is of course a cold start of Spark, so it might take a few more seconds, but the data keeps coming in; it's constantly being processed by the function, constantly being read by the dashboard, and now also by the Spark job. This is what I meant by a living system: everything is constantly being accessed. And here we have the output: a single driver that fits the criteria.

Now, during the talk I said it's going to be easier to run on Kubernetes, that we get a much easier deployment, and no one actually stopped me and said: this is not easier, you keep running kubectl, you're using nuctl, you're still editing YAMLs and init scripts and stuff like that. Too bad no one interrupted, because actually, yes, this is not the way to do it. This is not the way to deploy; it's mimicking the old way of doing bad things in your deployments. So what should you be doing? Let's kill the current application. I'll use Helm to remove everything: Spark, our daemon, the function, everything we just installed, and I'm about to show you how to really do it, if it's working. No permissions, right?
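For reference, a command-line deployment along those lines might look roughly like this; the function name, path, and registry are placeholders, and the exact flags should be checked against nuctl's own help:

    # build the function from local source and deploy it into the cluster
    nuctl deploy drivers-ingest \
        --namespace kubecon \
        --path ./drivers-ingest \
        --registry registry.example.com/demo

    # confirm the function pod is running next to Spark and the daemon
    kubectl -n kubecon get pods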
So when I say that we should leverage the tools in our toolbox: Helm is not just for showing how easy it is to install something with Helm, it's for installing a complete application with Helm. As the cluster is shutting down, what I'm about to show is that Nuclio provides you with YAML. YAML is common to Kubernetes; this is YAML native to Kubernetes, meaning you can edit it and redeploy the function over and over again. It's taking some time, but the point is that you can take that YAML and use it in Helm, with all the templating that Helm provides, and then you can launch everything with a single command instead of all the stuff I just typed, which, I have to stress, is the wrong way of using Kubernetes. I hope no one took pictures, because that really is the wrong way of using it. So now let's see; the daemon doesn't want to die, we'll kill it.

Okay, so we have a clean cluster, and now this is the way to do everything I just did, including the role binding, the namespace, everything, in a single command. This is the proper way; it's what you'll hear if you've been around the Helm talks and the talks on how to use Kubernetes. In a single command we're going to deploy our function, our daemon, Spark, Zeppelin, everything I just showed you. What was five minutes of work is now five seconds of work until everything is running, including the new function, Spark, Zeppelin, and our enhanced daemon.

So, a few tips and lessons we at Iguazio learned along the way. The first and foremost rule, and I can't stress how important it is: you should really read the manual. I'm not belittling the effort; you should definitely read the manual. I've seen too many hacks people try to pull off with Kubernetes, and the manual is very comprehensive and easy to follow. It's sometimes hard to get through because of its length, but it's not something you'd call very difficult; a lot of the time you simply copy and paste commands.

Second is the community. Kubernetes has a great community, and it's not limited to GitHub: you can find the Slack channel, you can join the groups, and everyone in the community is helpful. But it comes with a special note: like Kubernetes itself, the community is young, so sometimes you might get help that contradicts the manual. Recheck everything you're told.

Third, know the tools Kubernetes provides. During the development of many scenarios for our customers, you wouldn't believe how many times I was just port-forwarding to check whether a pod is doing what I expect it to do. Do logging; collect everything you can using kubectl. kubectl is a great tool, and you should really understand every option it has available. Also, one thing most people miss in kubectl is when to use the -o output flag, because the YAML output often lets you understand what happened to a Service or a Deployment; note that not all commands accept the -o flag. And like I showed you, always navigate with Helm, or another solution of your choice, but stick to it. Helm has great options and a great understanding of Kubernetes, and as you can see, once you really understand how to use it, our entire application stack gets deployed with a single Helm command, and doing upgrades with Helm is even easier.
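That single-command flow could look something like the following, assuming a hypothetical umbrella chart that bundles the namespace objects, role binding, daemon, Spark, Zeppelin, and the function (chart and release names are placeholders; Helm 3 syntax shown):

    # one release installs the whole stack
    helm install kubecon-demo ./bigdata-pipeline --namespace kubecon --create-namespace

    # inspect what was actually created, per the kubectl tips above
    helm status kubecon-demo
    kubectl -n kubecon get deployments -o yaml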
And, as I think Kelsey mentioned: don't ever do kubectl edit, and don't ever SSH in or kubectl exec into your containers to edit them. Simply use Helm.

Another tip, when you deal with a large cluster and many applications sharing the same cluster: do not overuse NodePort. NodePort is great when you're debugging, it truly is, because you know exactly where to reach the service; but if you stick with static node ports you end up having to manage all of those node ports. Use load balancers; there are great options within Kubernetes.

The sixth item might look trivial, but configuration must go into ConfigMaps. Don't try to force it into other kinds of solutions; one I've seen is loading files from a host path, but then someone has to populate that file. Everything goes into a ConfigMap, which is very easy. A specific case for a ConfigMap is dealing with YAML: YAML is a very tricky syntax, and when you try to override a command you sometimes end up with results you didn't expect. So, like I showed you, place your scripts in ConfigMaps and call them, instead of trying to do wizardry with the YAML.

And the last one is not really specific to Kubernetes; for any deployment of large clusters, you should really collect operational data, and not just collect it. If you just collect it and do nothing with it, it means nothing and you might as well shut it down. You have to collect it and you have to understand it. We are big data engineers, so we need to understand what our big data means, and if the data is meaningless, shut it down. Thank you, and I'll take any questions.
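To make the NodePort tip above concrete, a minimal sketch (hypothetical names and ports) of exposing the demo dashboard through a LoadBalancer Service instead of a pinned NodePort:

    apiVersion: v1
    kind: Service
    metadata:
      name: driver-dashboard
      namespace: kubecon
    spec:
      type: LoadBalancer        # instead of type: NodePort with a fixed nodePort
      selector:
        app: driver-dashboard
      ports:
      - port: 80
        targetPort: 8080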
Info
Channel: Share Learn
Views: 607
Keywords: Big Data Pipelines over Kubernetes, data pipeline in kubernetes, devops, kubernetes, kubernetes data pipeline, running spark on kubernetes, kubernetes running hadoop, running hadoop in kubernetes, running cassandra in kubernetes
Id: xzvl4ZCxFqA
Length: 33min 45sec (2025 seconds)
Published: Sat Nov 28 2020