Intro + Deep Dive: Kubernetes IoT Edge WG - Steven Wong, Cindy Xing, Dejan Bosanac, & Kilton Hopkins

Captions
Okay, hi everybody, we're going to get started here. This is an extended session with both an intro and a deep dive on running IoT and edge applications utilizing Kubernetes. We've got four speakers here from different regions of the world. I think we're standing up so you can see who they are — Cindy, why don't you introduce yourself briefly.

Okay, my name is Cindy Xing, I'm a software architect at Futurewei. I help Futurewei build its public cloud infrastructure, specializing in IoT edge computing. Currently I'm leading the open source project called KubeEdge, and I'm also co-chairing, with these gentlemen, the Kubernetes IoT Edge working group. Before Futurewei I worked at Microsoft for over 12 years and have a lot of experience building distributed systems.

Hey, so my name is Dejan, I'm a software engineer with Red Hat, working in the messaging and IoT team, primarily within the Eclipse IoT working group, on a couple of interesting projects — one of the most notable is cloud-scale messaging for IoT. Last year we also started the IoT Edge working group within the Kubernetes community, to try to explore this topic a little bit further.

And I work on Kubernetes and a few other open source projects and am a lead of the IoT and Edge working group. In a former job I was on the founding engineering team of Wonderware, which was in the IoT space before they called it IoT.

Hi, I'm Kilton Hopkins, I'm a software engineer and architect and co-founder and CEO of a company called Edgeworx, which is an edge computing company. I'm the original designer of the Eclipse ioFog open-source edge computing layer, and we at Edgeworx are still the core maintainers of that, so that's part of the Eclipse Foundation's IoT and edge division.

This is the agenda — I think you'll learn the agenda as you watch. So let's talk about the use cases. I think most of you have an idea of the benefits of edge computing. If we look at the common use cases, we can probably summarize them as follows. For example, you might have a lot of IoT devices at the edge, and somewhere you want a gateway to manage all the devices; the benefit you gain from it is mostly low networking latency, and you can remotely manage a fleet of devices. Another would be doing artificial intelligence at the edge: for example you're a machine manufacturer, and currently a lot of companies are using OPC UA to do data collection into a time-series DB, and you want the data to be locally analyzed and then upload some of it to the remote side — so in essence you want to remotely manage your data and also have the data uploaded to the cloud, where you do further aggregated analysis. From a requirements perspective we would say: I want to remotely manage my nodes and resources; I want to remotely deploy or orchestrate my applications; and, because I have a fleet of edge nodes, the scale can be huge. And then there's the fact that the network connectivity between the edge node and the cloud might not be reliable. So based on that, let's talk about what kinds of approaches are available. Below are the ones the working group summarized and came up with; there's actually a white paper published in our IoT Edge working group — you can go there and read more details.
In summary, if you look at this architectural chart, the edge node can be categorized into two buckets. One we call the infrastructure edge: for the infrastructure edge, most likely you're going to run a whole cluster at that location, and your IT admin or user can just talk to the infrastructure edge and schedule applications or manage resources there. The other type is called the device edge: basically it can be a worker node that is part of a bigger cluster, where the control plane of your cluster is either at the infrastructure edge or in the cloud. So in general you can have a combination: you can have a control plane in the cloud with your device edge attached, or you can have a hierarchical infrastructure, where your cloud manages the infrastructure edge and the infrastructure edge further manages the device edge.

So in summary, as I mentioned, there are three kinds of architectural approaches, and we listed some reference architectures. For example, you've heard about Rancher's k3s — in our mind that's the approach where you run a whole cluster at the edge; it's lightweight, and as an IT admin you can manage your resources and schedule your applications there. The second reference architecture is KubeEdge, which is the one I'm currently leading. The concept for KubeEdge is that your control plane is in the cloud but your worker node is at the edge, so with it you have a single pane of glass and you can manage all your resources — that's the second one. The third way is hierarchical, which is similar to the federation concept: your control plane is in the cloud, and in the middle you have another layer of control plane, which is the infrastructure edge. From that perspective we see Virtual Kubelet as an open source project that can meet this need.

Next I'd like to do some comparison among those three architectures and see what the pros and cons are and which scenarios are a better fit for each. Before we drill down, let's look at the current Kubernetes cluster in the data center. You have multiple nodes talking to a master node, and the kubelet on each node keeps a long-lived connection to the master, so it assumes the network connection is very reliable. Also, current Kubernetes scalability support is up to 5,000 nodes, which means that's the extent it has been validated to. And the third thing: when the kubelet talks to the master, it watches the pods on the API server, and with the metadata flowing down to that specific node it figures out which pods are bound to this node, and then it calls Docker to start the application. So this is the current data-center Kubernetes solution.

Then let's see, what about k3s? What are the requirements to run a Kubernetes cluster at the edge? Consider that at your edge you might only have a single machine, so you don't want to spend a lot of resources to deploy your control plane — you want a lightweight control plane and a lightweight agent. From that perspective, k3s can be a reference architecture you can use. Basically, for the control plane k3s uses SQLite rather than etcd, and on the agent side k3s wrote their own agent and removed a lot of redundant, unnecessary code — for example the dockershim that's part of the kubelet, which makes the kubelet really heavy.
So let's look at KubeEdge. KubeEdge, as I mentioned, has a cloud part and an agent part. For the cloud part, besides the current Kubernetes components, it adds two components: one is a controller and the other is the CloudHub. On the agent side there are a bunch of components; one thing I want to mention is that for the KubeEdge agent, the runtime memory footprint is only 10 MB. The benefit of KubeEdge is that you can have a cluster with worker nodes in the data center and also worker nodes at the edge, but from the cloud you basically have a single pane of glass to manage the nodes in the cloud as well as the nodes at the edge, without knowing where those worker nodes are located. You also benefit from the lightweight agent. The third thing I want to mention is that KubeEdge addresses the network connectivity issues at the edge. If you're interested, there's a separate deep-dive session talking in more detail about KubeEdge.

The third approach is the hierarchical one. Hierarchical means you have a control plane in the cloud, then you have an infrastructure edge, which is a cluster at the edge, and your infrastructure edge further manages a lot of worker nodes. For that, Virtual Kubelet can be a reference architecture, because behind the Virtual Kubelet you can have a cluster — for example in the data-center case it can be an Azure cluster or Fargate on AWS — but for the hierarchical IoT edge scenario it can be an actual cluster at the edge. So that's another architectural solution.

In summary, I have a table listed here of when and where to use each. If you want a single pane of glass and a really lightweight control plane, you can think of KubeEdge, where your control plane is in the cloud and your worker nodes are at the edge. On the other hand, if you want strong autonomy of resource scheduling, you can consider k3s — but the challenge of running a cluster at the edge is that you don't have central management of all those clusters. That leads to the third option: if you have multiple clusters at the edge, then you need to think about a hierarchical solution. So basically that's what I have — thanks.

So I'm going to move on to some of the unique challenges faced when you attempt to run IoT and edge use cases using Kubernetes. There are three categories of challenges here: how you manage your nodes and clusters out at the edge — the infrastructure; how you manage the control plane; and the data plane — the interconnects between the apps running at the edge. When I say there's a challenge there, for example, this might involve two leaf nodes needing to send things to one another, and it isn't necessarily attractive to hairpin that network traffic all the way back to a central cloud. Some of the issues are very unique to the edge. Understand that the Kubernetes architecture was originally based on Google's Borg and a public cloud scenario, where you had a huge number of resources with different classes of workloads, like batch jobs, which were potentially expendable. Your resources at edge locations are typically constrained by power, physical size, and cost, and they're really not comparable to what you've got in a public cloud or a large on-prem data center. You've also got the network limitations, the challenge of completely unattended operation at these remote locations, as well as
physical security. We're going to get into these things in detail later in the presentation. With regard to the network, some of these challenges don't necessarily have solutions, or certainly not solutions within Kubernetes itself; they might involve external tools, and some of them are more like situations involving triage — if you're going to have guaranteed periodic network outages, you want ways to live with them. We'll get into a deep dive of what you can do about it.

First of all, let's talk about resources. What I mean by a limited number of nodes at the edge: you might have one node, you might have two, you might have three — you're unlikely to have rack after rack. There's no bursting when demand comes along. Even in a large public data center, viewed longer term, you can probably order new servers and have them delivered within a day or a week; at edge locations where you don't have IT personnel, that might not be an option. Your workloads may have a wide range of priorities, and there's a need, whether you appreciate it at first or not, for doing triage on them. Likewise your network capacity is limited, variable, and subject to outage. I'm going to go now into some examples of what to do — or in this case what not to do.

If you're a rookie to Kubernetes, you often go to workshops and see examples that show you how simple it is to deploy a containerized app, a pod. The fact is, when you go into a pod specification, or in this case a deployment, there are a lot of things that are technically optional and can be left out, and in that intro session, to make things quick and easy, they will leave them out. In this example — I hope you can read it in the back — there's a resources section in this pod spec, and things work perfectly well when you leave it out. What goes there, if you choose to fill out the resources section, is a declaration of how many CPU cores you expect to use and how much memory you expect to use. If you leave this out, after your pod gets deployed it could experience anything from resource starvation to one pod negatively affecting another — basically being a noisy neighbor on whatever worker node it got deployed on. And when resources run out — let's suppose you get an out-of-memory condition — if you don't declare any usage or any priorities, you get completely random behavior: who's going to be killed first? If every workload you're running is like this, it's completely unpredictable. Is that likely to be okay? I'm guessing no. I'm guessing that some of the things you're running at the edge are more important than others, and you have distinct preferences for what should happen. The other benefit is that sometimes software isn't bug free, and if you make a declaration of memory usage and something happens to be suffering a memory leak, it will potentially be killed and restarted — and that's probably a better scenario than having the system randomly kill some other, perhaps innocent, container or pod.
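As a rough illustration of the kind of resources stanza being described here — the workload name, image, and numbers below are invented for the example — a deployment that declares requests and limits for every container might look like this:

```yaml
# Hypothetical edge workload with explicit resource declarations.
# Names, image, and values are placeholders for illustration only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sensor-collector
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sensor-collector
  template:
    metadata:
      labels:
        app: sensor-collector
    spec:
      containers:
      - name: collector
        image: example.com/sensor-collector:1.0   # placeholder image
        resources:
          requests:              # what the scheduler reserves on the node
            cpu: "500m"
            memory: "256Mi"
          limits:                # the ceiling enforced at runtime
            cpu: "500m"
            memory: "256Mi"
```

Because every container declares limits equal to its requests, a pod like this would land in the guaranteed quality-of-service class discussed a little later; leave the stanza out entirely and it drops to best effort.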
So what can we do about this? Well, first let's look at where all these pieces are. It's actually fairly complex in terms of moving parts, and this is the simplified view. You start with your pod specification, done by the developer. Kubernetes puts the pods that are ready to be run into a scheduling funnel, and there's an admission control process where, potentially, if you've declared a priority, some jobs in the queue are able to leapfrog ahead of others — but once again, if you don't indicate your preferences, the Kubernetes scheduler has no idea that you have them, and it's just first come, first served. This goes to scheduling, which determines what node the pod runs on; it takes into account the resources on the nodes you've got — it might be one node at the edge, but it could be more — and if you declared what you need, it can do the right thing with regard to placement. Finally, once the thing is running, the kubelet is responsible for actual enforcement: in other words, I declared I need X amount of memory, but I went way over that and we're out of memory, and somebody needs to be clubbed on the head and taken out.

This is a deeper dive — it's really the same thing as that first aerial view, but it goes into greater detail of where some of this takes place: the pod manifest, admission control, moving on to the scheduler, the kubelet; the container runtime gets a bite at this process, and finally some of these enforcement mechanisms happen inside the Linux kernel. In some of these cases, like the out-of-memory situation, there are actually two places that potentially enforce things. And let me go back here — I want to point out that this is a session covering the broad subject of IoT and edge on Kubernetes, but there was a really great presentation a year ago at KubeCon in Copenhagen; there's a link there, and we'll give you a link to this deck afterwards. Go there — it goes into session-length detail on how all of this works, and I think when you run things at the edge you need to get familiar with this kind of detail in order to do the right thing.

So full specs are always going to be better, and you've got two forms of specs. There's something called pod priority and preemption, and this indicates to Kubernetes the importance of a pod relative to other pods. If you try to schedule or run a pod that has been declared as more important than what you've got running now, but the resources are fully occupied, something called preemption occurs, which will evict the lowest-priority pod. This also potentially affects the order of pod scheduling: if a number of pods are queued and one is higher priority than the others, it isn't necessarily first come, first gets to run — the one with higher priority can jump ahead in the line. If you have no specification, there is a mechanism for a global default, and if you don't specify that either, then the pod priority for everything is zero, which means everything is the same, which means it's random — and you really shouldn't do that. You really want to take the effort to give Kubernetes a clue as to what you want to have happen.
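As a hedged sketch of what that looks like in practice — the class name, value, and pod below are invented for illustration — you create a PriorityClass object and then reference it from the pod spec with priorityClassName:

```yaml
# Hypothetical priority class for the most important edge workloads.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: edge-critical            # illustrative name
value: 100000                    # larger value = more important
globalDefault: false             # at most one class may be the cluster-wide default
description: "Control-loop workloads that should be evicted last."
---
# A pod opting into that class.
apiVersion: v1
kind: Pod
metadata:
  name: valve-controller         # illustrative name
spec:
  priorityClassName: edge-critical
  containers:
  - name: controller
    image: example.com/valve-controller:1.0   # placeholder image
    resources:
      requests:
        cpu: "100m"
        memory: "64Mi"
      limits:
        cpu: "100m"
        memory: "64Mi"
```

When the node is full, a pending pod in a higher class can preempt pods from lower classes, which is exactly the ordering the speaker is asking you to make explicit rather than leaving to chance.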
Orthogonal to this, there's something called class of service, and every workload falls into one of three classes of service: guaranteed, burstable, or best effort. The way to get into the guaranteed class is that every single container in your pod has a spec for memory and CPU consumption — both a limit and a request — and, like I say, go to that Michael Gasch presentation to understand what this is; it's worthy of a whole session itself. Specify everything for every container in every pod you've got and you'll fall into the guaranteed class, and guaranteed pods get preferential treatment. If you only specify part of these specs — say you have a pod where I gave specs for one container but two more containers are sitting there with nothing specified — you fall into the burstable category. Leave everything off and you get best effort. These classes determine which pod gets killed first when you're out of resources. Understand that Kubernetes itself monitors this, but on a very slow control loop, typically measured in minutes. The Linux kernel is also there, and if Kubernetes is too slow and the kernel determines that the OS is out of memory, it can independently kill something — and it has no knowledge of these classes of service, so you get random behavior.

Network capacity is similar. The resources I talked about a minute ago covered allocations of CPU and memory, but network is another critical resource, and it can be limited; it can be — you see the speed test here — asymmetrical, where up versus down isn't necessarily uniform. Different workloads are likely to have different priorities and behavior, and that means you should have different policies, just like with the compute resources. It is entirely possible, but probably not smart, to ignore your option of defining network policies and just go out there and get random behavior. With Kubernetes you have the option of creating resources called network policies, and they deal with things like what traffic is allowed. By default, every pod can communicate with every other pod, but that is more than likely a looser security policy than you'd like to have; it can potentially, when things go wrong, produce noisy-neighbor problems and just be a bad day. I will caution you — if you didn't already know this — that the Kubernetes networking layer is implemented as a plugin, so for some of the examples I'm about to give, some plugins may not have all the options, while others are very rich and may even have sidebar options for implementing network security policies and observability. I'm just going to give you one example to point out that it's important to do this. When you define a policy, what happens is based on source and destination IP addresses and ports, and there's an example here defining limits on ingress — you can do egress as well, if you have a plugin that supports that option. Probably, if you're doing edge and IoT, you need to be on one of the network layers that gives you all of these options.

Once you've made these declarations, understand that the actual enforcement of your preferences takes place at multiple levels. It starts with the pod bandwidth annotations — the prior slide covered restricting who gets to talk to whom, but there is also something called a bandwidth plugin that may be available if you choose the right network layer, and in addition to specifying who's allowed to talk to what, you get to allocate bandwidth caps on egress and ingress. This is implemented by the traffic control system in the Linux operating system, and under that the Linux network namespace comes into play. This is an example of what you can and should do in a pod spec, where in this example I've called out a spec for ingress bandwidth and egress bandwidth.
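A rough sketch of both ideas — an ingress-only NetworkPolicy plus the bandwidth annotations the speaker refers to — might look like the following. The labels, ports, and rates are invented for the example, and the annotations only take effect if your CNI network layer ships the bandwidth plugin:

```yaml
# Hypothetical policy: only pods labeled app=gateway may reach the
# sensor-collector pods, and only on TCP port 8883.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: sensor-ingress
spec:
  podSelector:
    matchLabels:
      app: sensor-collector
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: gateway
    ports:
    - protocol: TCP
      port: 8883
---
# Hypothetical pod capping its own traffic via the bandwidth-plugin annotations.
apiVersion: v1
kind: Pod
metadata:
  name: video-uploader
  annotations:
    kubernetes.io/ingress-bandwidth: "10M"   # cap on traffic into the pod
    kubernetes.io/egress-bandwidth: "1M"     # cap on traffic out of the pod
spec:
  containers:
  - name: uploader
    image: example.com/video-uploader:1.0    # placeholder image
```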
At this point — I think Cindy and I talked a little faster than we expected, which is a good thing because it leaves plenty of time for this — if you want to take a break, we can do it now. We're going to ask you to divide up; in other words, get up out of your seat and we'll re-seat by these use cases, and even if you don't need a break right now, you get to meet the other people who are interested in the same use case you have. Can you guys grab those signs and take one to each corner of the room to help this take place? If you'd like to step out into the hall, go, but come back in five minutes; we're going to resume then, so feel free to chat with your peers for the next five minutes and separate yourselves by these use cases.

The use cases are: remote office and retail — an example of this, for people here at KubeCon Seattle, is the Chick-fil-A sandwich chain, which talked about a proof of concept of standing up a three-node Kubernetes cluster in every fast-food outlet; I think things like bank branch offices or retail stores would fall into that use case. The second use case, sensor data collection and analytics, would be reading sensors and perhaps doing some filtering before the data gets published up to a public cloud; I'd also put things like video data collection with a little processing into that use case. Physical device control would be industrial automation, where you've got sensors in tight control loops that need low latency — if you're in that, look for whoever's holding up that sign. There are people doing gaming out at the edge, where for the best experience they try to run workloads closer to the actual participants in the game. And finally, telco edge cloud, where telcos are potentially interested in using Kubernetes to manage things running in cell towers or legacy landline switching offices. If you're in the "other" category, I don't know — come up and sit in the front row. So let's re-seat, with a five-minute break: I've got 2:30, so at 2:35 we'll start talking about the data plane.

So let's continue. At the beginning of the presentation we said that we usually have three common challenges when it comes to edge computing: infrastructure challenges — how do we put our compute resources at the edge sites; then we talked a lot about the control plane challenges — how do we schedule workloads onto these compute resources. What I want to talk about now is the data plane challenges. Once our workloads are successfully deployed onto a very distributed infrastructure like this, how do we communicate between these services? How do these services communicate with the central clouds, and how can they possibly communicate among themselves? At first look it seems like a very complex problem, and the first instinct for everybody is to try to solve it at the infrastructure level — at the L2/L3 level — putting VPNs everywhere and trying to use DNS to resolve services and do discovery. But in a fairly complex network that quickly becomes very, very challenging. So one solution is to go up to the application level and use tools that have been around for a while and can be applied to some of these problems. One of those tools is the AMQP 1.0 protocol, which is basically a messaging protocol that supports a lot of features. The first thing to say is that AMQP is a very loaded term — it has a lot of history and it changed over time.
When people first hear about AMQP, the first association is RabbitMQ and brokers, but AMQP 1.0 is a completely different kind of protocol — I know it probably should have been named differently, but this is where we are today. One of the major features of AMQP 1.0 that makes it a good fit for solving some of these challenges is that it's a pure peer-to-peer protocol, meaning it doesn't assume client–server communication or a broker in between; on top of it you can implement all kinds of different communication patterns, like multicast, fan-out, or request–response.

Just to give you a quick idea of these communication patterns: this is what you usually call direct messaging. We have two endpoints — one is the producer of the message, the other is the consumer. The producer sends the message directly to the client, the client accepts the message, and that delivery is done. Think of it like HTTP, where you have a request and a direct response — that's direct communication. But when you talk to people about messaging, everybody assumes a message broker, a message queue, in between, and that's a completely different communication pattern. You can see from this diagram that here you actually have two contracts in the whole delivery of the message: first the producer sends the message to the broker, the broker owns that message at some point, and the producer is done; then it's the job of the broker to send that message to one or multiple consumers, depending on the semantics of that particular communication.

Because of the symmetric property of AMQP, we were able to develop a new component which we call the message router. If you think about the message router, think of an IP router for IP addresses, but at the AMQP level — at the level of the messages. The difference between the broker's store-and-forward messaging and direct messaging through the router is that the router never owns the message; it just routes the necessary packets to achieve pub/sub or point-to-point communication between the different endpoints connected to the router. All of this is currently implemented in a couple of upstream components: we have Apache ActiveMQ Artemis, which is an AMQP 1.0 message broker — not just AMQP 1.0, but it serves that purpose very well; we have the Apache Qpid Dispatch Router, which we'll see in action a little bit later, which implements this routing component; and we have the Apache Qpid Proton project, which is basically an umbrella project for all kinds of AMQP clients you may wish to connect to this kind of network.

So those are the basic components and the basic theory behind it, but now things become a little more interesting, because if you go back to the edge kind of deployment, the nice thing about the routers is that they can form very complex networks themselves. The routers have the ability to discover other routers, and they can provide a complex network over which we can send messages across a very distributed geographical area, for example. There are a lot of different use cases.
One that I always like to point out is very important if we go back to the origin of this story — how are we going to solve all these edge challenges? If you take a look at this picture, this is how we can, for example, enable edge-to-edge communication, with two things to notice. First, you can see that the edge nodes here don't have any incoming traffic — no incoming TCP connections. The direction of these arrows shows that the routers hosted on the edge sites connect with outbound TCP connections to the router in the cloud; but once these connections are established at the TCP level, the full-blown duplex AMQP flow can run over them. So service A simply needs to communicate with service C using AMQP-level addressing — service C says it wants to consume messages from address C, that information is propagated through the router network, router A knows that a message sent to address C needs to be forwarded to router B, and router B knows how to forward that message further through the network. With this kind of architecture we solve two problems: first, there's no need for a VPN, and second, all the addressing is done purely in the L7 address space, so there's no need to do any kind of DNS resolution in the process.

I'll go quickly because I'm using up a lot of time. With this we can do point-to-point flows through the network completely independently, we can implement very efficient multicast — sending one message to a lot of consumers — and these routers also implement best-path optimization, so they try to find the optimal path for the message to flow through the system; if something goes wrong, an alternate path is available, and when the original connection comes back, everything goes back to normal.
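As a rough sketch of that pattern — an edge-site router that only dials out to the interior router in the cloud — a Qpid Dispatch Router configuration carried in a ConfigMap might look something like the following. The hostnames, ports, and ids are invented, and the exact attribute names can vary between Dispatch Router versions, so treat this as an outline rather than a working config:

```yaml
# Hypothetical ConfigMap holding qdrouterd.conf for a router at one edge site.
apiVersion: v1
kind: ConfigMap
metadata:
  name: edge-router-config
data:
  qdrouterd.conf: |
    router {
        mode: edge                       # this router lives at the edge site
        id: edge-site-a                  # illustrative id
    }
    listener {
        host: 0.0.0.0
        port: 5672                       # local services (e.g. the upper/reverse demo apps) attach here
        role: normal
    }
    connector {
        name: uplink-to-cloud
        host: cloud-router.example.com   # placeholder address of the interior router in the cloud
        port: 45672
        role: edge                       # outbound-only uplink; nothing has to dial into the edge
    }
```

The key property the demo relies on is that the only TCP connection is the outbound connector, so no inbound firewall holes or VPNs are needed at the edge site.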
So right now I'd like to demo this a little bit — let's see how it works out; it's a multi-cloud demo that talks to this laptop as well. What I want to show here is a little demo. At the top are basically four physically different OpenShift clusters: this first one is on a private server, the other one is an AWS instance hosting an OpenShift cluster, and these two represent my two clouds, which are interconnected, as we'll see a little bit later. Then we have two more clusters: this one is behind a corporate firewall, and this one is actually running on this laptop, on the conference Wi-Fi, in Minishift — these two represent my two edges. I must say I used different clusters just to show that this works across completely physically different clusters, but it can be used at the device edge as well — it doesn't need to be a whole cluster to run the router; you can run it on a single node, on different nodes. This is just for showing things.

This is how the Dispatch Router console looks. This little white thingy here is our console, connected to cloud one — one of the nodes in cloud one. The way we set this up is with two routers per cloud, just to have a high-availability feature, and you can see that they are interconnected, so they form a nice high-availability arrangement. And we have these two things here, which are the two edges running on the two different nodes. The first interesting point here is that you can see how the routers discover each other, and even though this console is connected to only a single cloud, we can visualize the topology of the whole network of routers.

Next, these yellow things are the services. I have two very simple types of services here, because we want to demonstrate RPC-style traffic — command-and-control kind of traffic. So we have two really simple services: one is called "upper" and the other is called "reverse" — one receives text and uppercases it, the other reverses it. That's as easy as it gets. We gave colors to each instance of the services running everywhere, just to be able to visualize physically where the responses are coming from. And this little guy here is the client — basically one single page which we'll use to demonstrate these calls. There's the address "all" here, and if I send a message to "all", I should receive a response from all the services running all over the place. So this is interesting: I think this client is actually connected to the AWS instance of OpenShift, and it called the service running in Minishift on my laptop, over the conference Wi-Fi, without any VPNs or firewalls or DNS — all the addressing and all the routing was done at the AMQP level. We can also do point-to-point, where the message is sent to a single instance of a single service, and it will try to reach the service instance closest to the caller. There's a lot more we can do with this, but I don't know how I am on time — two minutes, nice.

So let me clear this. We also have a load-generation service which I can start — I want to show you just one more thing. "Loadgen", as the name suggests, generates some load, and this loadgen service is running at this node, in this particular cloud. If I generate a very low load here, what I want to show you is that all the load will be handled by the service running on that particular node as well — no traffic will leave the single node, so all the requests are handled locally. Let's hope it works. Let me start it, and if we get the stats, we see that all the requests are handled by the red instance of the service, which is exactly what we wanted: the instance running on the same node — the same cluster — where the loadgen service is running. Now I need to stop this — okay, bear with me for a second — stop, and I will instruct the service to start generating much more load. Now, if I take a look at what's going on — sorry if I'm clicking too fast, but it will all become clear — we see that at this point the red service, and this is how the whole network settles, because there is flow control implied in all these routes.
We can see that with this particular configuration the red service only gets part of the load, and the rest of the load is offloaded to the other sites, to the services located on the other clusters. So that's what I wanted to show you today regarding this; maybe now we can go back to the presentation, if I can find the PowerPoint. All of this is done with the operator for the Qpid Dispatch Router, and this demo is still in rough shape — we're working on it — but you can find the source here and you can play around with it. What we will certainly do in the future is try to document it more and present it more, so that more people can play with it, and instead of writing these simple services, try to put it in the proper context of edge computing.

Okay, great — I can't express enough affection for that protocol, AMQP; that is fantastic stuff. And by the way, if anyone wants to continue the group discussions we were having on the break — because they seemed pretty lively — after this session we're all up for being available outside the conference room to take those discussions further.

So, this is a topic that would normally be a talk in itself, and I'm going to do my best to bring it together in the final 20 minutes. There's a joke I like to tell — how many of you have heard this before? The "S" in IoT stands for security. If that's new to you, feel free to reuse it; it's true, though. The gist of security at the edge is that it's a completely separate environment — it can't be added as an afterthought. If security is normally the sort of thing you put your system through as a final check — a penetration test or whatever — and then you say, okay, it works, now let's get the security in place and launch, that approach at the edge is going to kill you, because the edge environment is very different. You want to be designing security into the edge infrastructure from the start, and the best way to do that is — in this Kubernetes IoT Edge working group we have begun to assemble, and are actually nearing completion of, a white paper that exposes the challenges of security at the edge. The reason we start with this is that until we have identified everything we need to overcome at the edge in terms of security, we won't really be able to assemble the right architectural approaches to fixing it. Too often we have piecemeal security solutions that we think have solved the issue, only to find out that in combination with everything else it's just one tiny piece and the whole thing is not yet secured. So expose the problems first, and then we can know what solutions we need to be designing.

There are some key differences about the edge environment that make security a particular challenge. We just saw an example of diverse networks: some things are behind a firewall, some things are over VPN — and VPNs are usually not a good idea at the edge, by the way — some things are on cellular, some things are behind three different NAT layers because they're inside an advanced manufacturing facility, and you have to bring all of those things together but still represent the identities of the things in a way that allows you to do data-flow control and so on. Also, there are no guarantees of continuous power, so out at the edge you might have things running on battery or a
solar-charged battery, and when things go offline, that's the perfect opportunity to hack, because as a system comes back up things look strange anyway, and you might even be fed some initial commands to let you back on a network — that's a great time to be present and trying to get access to a system. You also have intermittent connectivity: if you can guarantee that everything flows all the time over a high-speed LAN, then maybe you can watch the traffic, but what if the traffic comes in bursts, or is offline for two days in a row because of a cell-tower outage, or a dust storm, and so on? You have to take these things into account.

At the edge you have direct physical access to hardware. How many of you believe that if you wanted to attach a keyboard and a mouse to someone's cloud server, you would be able to find where it is, get in, and do it? Anybody feel like you're up for that challenge? It's not going to happen. How many of you feel you could put a ladder against a light pole and access a camera that's part of a smart-city deployment? Anybody figure you could do that if there are no police around? Yeah — it's a completely different environment; we have to take this into account.

The hardware at the edge is heterogeneous. It's really easy to buy hundreds or thousands of servers for your data center, all from the same supplier, all with the same spec, but at the edge you mix and match: you have different machines, you have boards that have been in place for dozens of years that you'd like to access over a serial port with some new IoT gateway, and that IoT gateway you just bought is different from the one bought two years ago, and so on. And we have a lot of non-TCP/IP communication — has anybody ever done any Bluetooth development? Bluetooth is non-TCP/IP and short range. Anybody use LoRaWAN for IoT? That's good stuff — long range, decent bit rate, not TCP/IP. Then you'll also have multiple vendors, very different from racking out your data center, and you'll need to handle security in offline mode. A lot of times we say we have some kind of advanced AI-based security traffic monitoring in the cloud — let's look at what's coming into our data center and sniff out things that don't quite look right — but at the edge you actually need to be able to handle security down at the single- or multiple-node level, off of the cloud connection. And at the edge you have very low latency locally — that's one of the advantages of the edge — but higher latency to the cloud and back, so you have to take into consideration that you might not get answers as fast as you want.

So there's a summary, an overview, of the types of security challenges at the edge — not yet claimed to be complete, but complete enough for us to be talking about it. The gist of it is: you have to trust the edge hardware — the hardware that's running your edge computing; you have to be able to trust the devices connected to that edge hardware; you have to find some way to deal with operating systems at the edge and the fact that they're complex beasts; you have network concerns; and then, say all of those things are in place, edge microservices themselves pose some new challenges. We're going to talk through some of the things we've exposed around edge microservices.

Starting with trusting the edge hardware: I already
mentioned that physical security is not guaranteed at the edge — in fact I would say physical security is actually very difficult at the edge — and there have already been plenty of hack attacks. I've suffered one myself; as an IoT advisor to the city of San Francisco there were some incidents a number of years ago that really opened my eyes. You have people going out with screwdrivers and wrenches and saying, hmm, I think I can get at that box — what could I do with it if I got it?

So hardware root of trust is a starting point for a lot of people. A TPM 2.0 module is kind of the de facto hardware root-of-trust solution: if I've got that, then everything riding on top of that hardware is said to be good. That's not actually true, because hardware root of trust is just a starting point — it basically tells you that the board you're operating on is the board you think it is and appears to be in the same condition you last trusted it in. It's like the basement foundation of a house: what's built on top of it still needs to be good quality, but you do have that foundation.

So what about the condition of the hardware? If I have some orchestration happening with Eclipse ioFog, and it says "within these GPS coordinates I allow the following microservices" — some center point and some radius — a great way to get microservices issued to edge hardware that isn't supposed to be allowed would be to spoof the location. Even if GPS coordinates are normally thought of as fairly safe to transfer insecurely, because they're just a couple of data points, what if they're actually used to make decisions about what software is supposed to run where? That's a really great technique for getting software issued to your boxes: make it look like you're inside some trusted circle when you're really not.

And then attached devices become much more of a problem at the edge. What do you do if your edge hardware node — an IoT gateway — all of a sudden has a USB flash drive attached to it? That's usually a warning sign, and if it were your laptop you'd be very concerned: who's getting at my laptop and plugging in their USB drive? But for an IoT gateway at a remote oil pump, this might be something you're never aware of. So you're going to need some kind of detection of what attached devices are present, what condition they're in, and whether they're on the trusted list, because no one is going to be there to watch it — not like your phone or your laptop, which you won't let out of your hands.

And if you do have an indication of compromise, like an open case on an edge compute node — there are a lot of case-detection systems that can tell somebody has opened the box — what do you do then? Maybe the attacker takes the thing off the wall, takes it home, and gets at the data when they get home. What you really want to do, if at all possible — that's your challenge — is find some way to detect that there's compromise and react to it in a way that saves you: can I wipe the data, can I wipe the microservice images? Those would be good things. Can I make sure everything is encrypted, so that even if they get the hard drive it's not a problem? Okay, great — but then I need monitoring of the whole system.
So what about the authenticity of the hardware? In some cases you have hardware that you trust, and then you find out later that there are security flaws in the hardware itself — possibly even some way that someone is doing remote monitoring of the hardware that you weren't aware of. So know the sources of your hardware, and know that what you're putting out in the field not only has a TPM 2.0 module but that all the setup on the board is what you intended to pay for, with nothing there waiting for a BIOS update through a back door or anything like that. Just be careful of the supply chain.

Trusting connected devices gets really tricky. These inexpensive battery-powered sensors — the kind you'd put ten thousand of in a warehouse — don't tend to come with identities the way the mainboard of a computer does. A rack server has the whole software identity stack; with a little battery-powered device, how do you know which one is which, and how do you trust it? And if you have 10,000 of them, who is going to assign the identities in the deployment? That's a lot of numbers, and where are they going to enter them for you to say, oh, that ID belongs, that ID doesn't? You have to consider the cost and labor involved in getting these systems deployed.

If there's data flowing, you want to protect it, possibly encrypt it as near as possible to the source — but what about commands going back out? Some really interesting commands are "shut the water valve off" or "open the gate to the warehouse", which is a mostly automated facility. If I can issue that command remotely, once I figure out how to get into the system, I can basically line somebody up and say, okay, you're ready to steal in five, four, three, two, one — I issue the command, the gate opens, and we're done. So who's watching what commands flow? That's a funny blind spot: you might not have any monitoring of that traffic, and then somebody just happens to get on the Wi-Fi network, issues the command, and you're done.

Device management is actually a back door. Think about it this way: you have a device cloud, and in that device cloud you're recording, for all your IoT devices — several hundred of them — the battery life remaining, the network address of the device, the firmware version, and you use that information to make decisions such as, oh, we're almost out of battery here, let's move the workload over to where sensors have more data coming in because the battery is still alive. If somebody spoofs that information, they can force decisions on the back end. And a lot of the device-management API or cloud systems are open: all you really need is an API key, and you can get that off any one device. So steal a device, and you've stolen a way to report data in; report false data, decisions start getting made, and nobody's any the wiser.

So, the operating system — this complex thing. Who's heard of secure boot before? Who uses secure boot in any of your deployments? It's tough. Secure boot is a hard thing to get right, because basically it says: I'm on some hardware that hopefully I trust, and I'm going to execute the boot sequence like this, check that I'm only loading the drivers I'm supposed to load, only running the binaries I'm supposed to run, only opening the network configuration I'm supposed to open — and then you're done, you basically report "I securely booted."
But what happens after boot? Some software decides to give itself an update — you trusted that software when you installed it, but now the update it got is doing something funny, and secure boot won't save you. Secure boot is yet another step: hardware root of trust and secure boot are the first two layers. What comes after that? You have to know what's running, and you have to know that the thing that's running is pure — some attestation of what's happening on the edge node — and you need to monitor it continually, because when you're in the manufacturing facility getting everything ready and all your binaries are loaded, it looks fine; it's when you go out to the field that you'll see a change. So it's not enough to say the image looks fine — if you're actually going to get hacked, it's going to change afterward. Constantly monitor, monitor.

This one is a pet peeve of mine, something I like to point out a lot. If you have a private key and you put it into an edge device, and that's meant to be the identity — it can attest "I am this thing" — what happens if someone gets hold of that device and extracts that private key? It's essentially like taking your house keys and putting them on a park bench, and you might as well put an address tag on them too, because you have now given yourself a false sense of trust and handed the keys to your kingdom out into the world. What's the answer? It's actually really challenging — something related to asymmetric key pairs, but maybe issuing them only once you know the device or the edge node is trusted already; something more dynamic. I do know that flashing fixed private keys into edge nodes is very dangerous, that's for sure.

Component firmware vulnerabilities are also a problem. Your OS is running and you're talking to, say, the serial ATA chipset — who made that? Does it have vulnerabilities? Are you riding on a single vendor for your whole edge deployment? Probably not — you probably have 10 or 20 different vendors across the variety of boards you have. How are you going to check for security vulnerabilities in all of those chipset firmware versions? It starts to become a nightmare — so much of a nightmare, in fact, that some organizations in the United States, the NSA for example, require the authority to turn off all the security features baked into the components of the board, so they can override them and say, well, at least we have visibility, instead of trusting stuff that's been put out there that might have flaws in it. Hopefully this is all getting you scared, because it gets me scared.

And what about security updates of the OS? Security updates are great until somebody gets in the middle of them — that's a golden opportunity to cause problems. Also, are you even doing security updates of your edge nodes? In some cases you put a pump controller out in the oil field, and once it's out there you want to conserve bandwidth — maybe the bandwidth you're paying for is satellite, very expensive — and you say, patches recommended for Ubuntu? Hmm, nah, that's a lot of money, we'll do it next week. That's because you're making a cost-benefit analysis about consuming bandwidth. And then, whatever happens on the OS usually produces an audit trail and some logs — that's a great place for people to hack.
It's also something that's typically missing from the edge: backup. If I can't get access to the audit trail or logs, I'll never know what happened out there. Something has to take care of these things and treat them like data, because you're not guaranteed protection — no physical protection of your edge node — so something has to treat log files as first-class citizens now.

On the network side, another pet peeve of mine is open ports at the edge. There have been hack attacks executed against energy-grid systems where somebody just rolls up near a transformer facility — those fenced-off areas outside cities where power transformation happens — starts scanning for available wireless network connections, especially SCADA, and then says, well, based on what I'm seeing there's actually an open socket here I can talk to; I'm going to issue a couple of bytes of commands and see if I can get it to feed me back some data or do something. It's basically sitting open because nobody thought anybody would ever be out there with a laptop and a different wireless adapter trying to do these things — but it's starting to happen.

Fixed VPNs are like fixed private keys. You definitely do not want to put edge hardware out there that contains VPN credentials, where somebody who gets access to the edge hardware gets on your VPN — because that's like saying: I have a warehouse, a data center, and a retail store; hack any one of them and you get the other two for free. We don't want that scenario.

Network access control: identities on the network are pretty easy to establish in a fixed environment, and that's good, but at the edge things roll on and roll off. If a truck pulls up to my warehousing facility and wants to join my network and start doing some edge-compute data sharing, you want to give it access dynamically — but you're not going to have a network engineer on standby at all times saying, let's let this identity in. So you tend to open things up, and that's not the right approach; you have to think about how to check identities dynamically.

And we always talk about verifying, up in the cloud, what's happening at the edge — is that the right device, did it give me genuine data — but what about the edge asking, am I talking to a real control plane? We have this notion that the master node is going to be walled off and protected, but there was a conversation at the Eclipse booth earlier today, which was really interesting, about smart cities: if you get access to the master node, you control the city. That's the future we're heading towards, so think about it — the edge node should be able to say, hmm, something changed on that control plane, I'm not liking it, I'm going into offline mode until I can get verification. That would be a much smarter and better edge.

Attacks on the transport layer are very interesting. For Bluetooth, LoRa, all of these local wireless technologies, if you just blast a bunch of noise you can often interrupt the transmission and prevent any data from coming through. That's a great way to, say, turn off water flow, or cause something to overflow, or stop a fuel-line transmission — all things you can do just by blasting white noise, and this is a problem we don't have a solution for. I can't think of any way around it: you can't radio-shield things so
Attacks on the transport layer are very interesting. For Bluetooth, LoRa, and all of these local wireless things, if you just blast a bunch of noise you can often interrupt the transmission and prevent any data from coming through. That's a great way to turn off water flow, or cause something to overflow, or cause a fuel line transmission to stop, all stuff you can do just by blasting white noise, and this is a problem we don't have a solution for. I can't think of any way around it: you can't radio-shield things so much that they don't get interference, because then they don't get transmissions either. I'm not sure how to solve it.

And this is a term that actually came out of a working group years ago; I think it was Edgeworx, and Cisco, and Samsung, and GE, back when GE Digital was really in this space and more of a thing. Denial-of-thing attacks are kind of like this: you have a device and it's reporting data, maybe a temperature, and it's reporting it over Bluetooth Low Energy. I don't actually have access to the device and I don't think I'm going to get it, because the data it's transmitting is encrypted; these are all really good architectural implementations. But I can ask it to connect so many times that it runs out of battery, so you're basically denying all of the things of the world from participating in the solution. This is something we don't really have a good solution for.

Okay, last section. When you are running microservices at the edge, you have to make sure that the microservice images are the ones you intended. There are a lot of solutions for this in the cloud arena; all we really have to do is make sure you can do these checks at the edge, do them quickly, or maybe do them in an offline mode (a rough sketch of that kind of offline check follows below). You have API keys and things: find a way to securely deliver them to the edge instead of having them baked in. What do you do if you have microservices running at the edge that were unauthorized? Is somebody watching? Is there a whitelisting capability? If that's not built in, then I could basically deploy microservices just by getting onto the edge node, and you might not know. Controlled access to resources: serial ports are available to microservices, but are they available to all of them? Should they be? What about Wi-Fi connectivity? What about the Bluetooth radio? That should be controlled. Guaranteed remote shutdown: if I'm remote and I'm managing an edge node and I say, okay, I want to turn off these microservices and I'd like to wipe the data, because we're decommissioning this IoT gateway, or decommissioning this video camera that's been taking the video feed and doing AI, I need to know that that stuff is wiped off before I can consider it no longer part of my edge compute system. And you want to match the right microservices to the right edge hardware: it's no good to deploy a microservice that checks employee IDs in a factory and says you can or cannot be in this sector if it ends up running in the women's restroom, because then there's no real control, since it's not at the right targeted location. So match the right edge hardware to the right microservices. And the last one: microservices at the edge are great, but they're largely unmonitored, and what happens if you deploy one that just starts sending your data to some back channel? This happens. People write microservices and put them out there for free, and it's, hey, use this, this is an MQTT broker slash analysis system that will give you a great edge IoT system, and you look at it and think, hmm, well, I don't have access to the source code and I'm not really sure where they're sending all my data, so maybe I should think twice about that.
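As a rough illustration of the offline image check mentioned above, here is a minimal sketch: the node refuses to run anything that is not pinned by digest and present in a locally stored allowlist, so the check keeps working while disconnected. The image names and digests are made-up placeholders, not anything from the talk.

```python
# Minimal sketch of an offline image allowlist check on an edge node.
# Every image the node is asked to run must be pinned by digest and present in
# a locally stored allowlist; tag-only references are rejected outright because
# a tag like "latest" can silently change upstream. All values are placeholders.
ALLOWED_DIGESTS = {
    "registry.example.com/telemetry-agent@sha256:0d1c",
    "registry.example.com/opcua-collector@sha256:9f4e",
}

def is_allowed(image_ref: str) -> bool:
    if "@sha256:" not in image_ref:
        return False
    return image_ref in ALLOWED_DIGESTS

if __name__ == "__main__":
    requested = [
        "registry.example.com/telemetry-agent@sha256:0d1c",
        "registry.example.com/unknown-sidecar:latest",
    ]
    for ref in requested:
        print(ref, "->", "run" if is_allowed(ref) else "refuse")
```

In practice the allowlist itself would need to be delivered and updated over a trusted channel, which loops back to the earlier points about verified identities and protected updates.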
Cool. So the gist of it is that in the end it's going to require a multitude of approaches, and everybody's going to have to get involved. What we really need to do now is start thinking security first and question the status quo: just because it works in the cloud and the data center doesn't mean it's right for the edge. And just remember that the edge environment is so different that unless you're absolutely certain you've explored all the challenges, you should assume that things aren't yet safe. And then start to get involved that way, asking the right questions before we start giving the right answers. Cool. Steve?

Okay, thank you, Kilton, that was a great talk, and like he said, I think that talk could potentially be a 90-minute session on its own, done right. Let me wrap up here by pointing out that this session you just attended is being run by the IoT and Edge Working Group within the Kubernetes project. We have two meetings per month; the reason we have two is that one is scheduled for the convenience of Europe, Eastern Europe, Asia, and Japan, and the other one for North America, so get synced onto the right one of these. I'm sorry to say I've observed that some people join at the wrong time, on the wrong cycle, but the times are listed here; take a picture of the slide, and through that shortened link you can get this whole deck to find it later. These are Zoom meetings, interactive; you can come on camera if you like, and we're trying to organize them with presentations like what you saw today. If you're with some open source project or some vendor, we don't really want to give airtime to commercial sales pitches, but if you've got something, particularly open source, that you want to present, approach any one of us and we'll put it on the agenda. Likewise, if you're users, and I think some of you got captivated by the experience of breaking up into groups, and you want to have a meeting focused on your use cases, we'd be willing to do that; just give us a sense of what you'd like.

Once you join this group, and this is the Kubernetes community standard, you don't get to see the full extent of the documentation without joining, but it's easy to join. Join it and you'll get access to the meeting agenda, and you're free to simply write in the agenda document and suggest topics to be discussed yourself. We can't guarantee it will get into that particular meeting if we run short on time, but you'll get yourself in the queue. We've also got some working white papers; once you join the group you can discover all of this stuff.

At this point it looks like we've hit our deadline. If another speaker walks in the door we'll take it out into the hallway out of courtesy, but for now, if anybody's got questions, I think we have an extra mic, I'm not sure where it's at, so raise your hand. If not, since we had to break up some of these groups earlier, let's reassemble in the hallway and divide up into those popular groups so we can get to meet each other. Okay, we've got one.

Q: Hi, I wanted you to be a bit more clear about the roadmap of this working group. What are you planning to achieve?

A: So, the roadmap: for the first six months we tried to cover this intro material, trying to figure out the use cases and things like that, and for the last couple of months we've basically been focusing on what we presented here, so focusing a little bit on security, focusing on the messaging, trying to deep dive into these things. I know that on the recent calls we had a proposal to do some work around verifying the Docker images for the edge nodes and things like that.
So it's basically open for now, on demand. Yeah. Okay, it looks like that's it, but feel free to step out into the hall if somebody wants to chat privately; certainly the four of us really love this, so we'll hang around. Thank you for coming. One quick thing: at 3:55 I have a deep dive session for KubeEdge if you want to learn more. Oh, you know what we forgot, Cindy, we do have some shirts available; we can certainly offer one, medium size, so if you're interested come to me. Thanks for coming. [Applause]
Info
Channel: CNCF [Cloud Native Computing Foundation]
Views: 1,110
Rating: 5 out of 5
Id: 5UgOjvK1IN8
Length: 75min 33sec (4533 seconds)
Published: Fri May 24 2019