Container Networking From Scratch - Kristen Jacobs, Oracle

Video Statistics and Information

Reddit Comments

Thank you for sharing this! This is a fantastic resource. I didn't know about the tunnel stuff, so I have something new to learn. That and BGP.

If folks are interested in a slower-paced walkthrough, I wrote a post on How Do Kubernetes and Docker Create IP Addresses?! It covers what he discussed in his step 1 and step 2, and I think it's enough to understand how step 3 works too.

👍 2 · u/dspeck · Jul 11 2020

Recently switched our bare-metal clusters from flannel to calico. Now trying to understand more about how it all works. This helped :-)

👍 3 · u/soberto · Jul 11 2020
Captions
Hello! So, I work for Oracle, and Oracle has a managed Kubernetes offering. A little while ago I was given the job of looking at replacing the networking layer, which was flannel, with some features of the Oracle cloud. I started digging into it and soon realised I didn't really understand how flannel works, and it seemed wrong to replace one thing with another if you don't understand the thing you're replacing. So I dug in a bit deeper, and after a while it became apparent that I didn't really understand any of this networking stuff at all. Long story short: I went down a big rabbit hole and learnt some stuff, but more importantly I realised I really enjoy this stuff, so I thought I'd write it all up and spread the networking love a little bit.

My name is Kristen, and in the next 30 minutes or so I'm going to try to explain how a container on one machine can connect to a container on another machine, and what the various mechanisms are that allow you to do that. If you already know this stuff, here's your chance to leave.

First of all we need to know what we're aiming at, and we're going to go with the Kubernetes model, because this is KubeCon. For a networking layer in Kubernetes to be compliant it needs to follow three rules, and what it really boils down to is: every pod in the cluster has its own unique IP address; every pod can talk to every other pod using just that IP address, with no address translation in between; and pods need to be able to talk to nodes, and nodes need to be able to talk to pods. That's what we're working towards.

How are we going to get there? In four steps. For each step I'm going to show a rather badly hand-drawn diagram and talk about it, then I'm going to show some code, in bash — and everybody loves bash, so that's good — and then I'm going to run that code, and we can get in, capture some packets, ping some interfaces, and see how it all hangs together. The four steps are as follows. First, the simplest possible thing: a single node with one network namespace, and how that namespace can connect to the node. Next, we stay on the one node but have two network namespaces, and look at how you can pass packets between them. Then we move out to two nodes, and look at how packets go between containers on different nodes, but in this case on the same L2 network. And finally the more general case: two nodes separated across different networks, and we'll see how that works.

So let's get going. This is the simple case. The big rectangle on the outside represents the node; that could be a bare metal machine or a VM, it doesn't really matter. It's got an interface, rather confusingly called enp0s8, but just treat it as eth0; it's only called that because that's what VirtualBox called it. It has the IP address 10.0.0.10. The little box in the middle, labelled "con", represents the container. Maybe we should take a step back: what is a container in here? It's a process, with a bunch of Linux mechanisms to isolate it, including cgroups, namespaces, and various security things. But from the point of view of network connectivity, the only thing that really matters is the network namespace, so when I say container from now on, what I'm really talking about is a network namespace.
So first of all, what is a network namespace? The way I see it, it's a separate instance of the kernel networking stack, and that involves three things: a separate list of interfaces, a separate set of iptables rules, and a separate routing table. If you go into a network namespace and type ifconfig you'll see one list of interfaces; do that on the host, in the default namespace, and you'll see a different list. Those are the things we get to play with to wire all this together.

The first question is: how do we get connectivity between the network namespace and the host? The way we do that is with a veth pair. That's a Linux networking construct which I visualise as a sort of Ethernet cable with an interface card at either end; it's a point-to-point thing, so packets in one end come out the other. You take one of the ends and put it inside the network namespace, and the other end you leave in the default namespace on the host, and that way we get connectivity between the two. We give the interface inside the container an IP address, 172.16.0.2, so it's on a different network from the host.

The final piece of the puzzle is to set up some routing rules. On the host we have a quite simple rule which says: if you want to get to that 172.16.0.0 range, send it straight to that veth interface. Then the routing rule inside the container — well, there's only really one interface in there and nowhere else a packet can go, so we just need one default rule which routes everything back out again. It looks a little confusing, because we're saying the gateway is, well, ourselves in this case, but it works this way.

So that's the setup; now we can have a look at what this looks like in code. I'm running in a new VirtualBox virtual machine here, and here's some code which, when we run it, will set up this model. I say it's bash, but it's really just lots of calls to the same command: the ip command. That's kind of the only one you need to know in order to set up any of this networking stuff; it subsumes the older ifconfig and route and friends. I'm not going to go through every line of this file, but to give you a flavour: we create a namespace, we create the veth pair, we put one end of the pair into the namespace, we bring up the interface in the namespace and the one on the node, we set up the loopback in there (we don't really need to, but I've done it anyway), and then we set up the routes, both on the node and in the namespace.

Let's give it a go and see if it works. First we can list the network namespaces, and there's the one we just created. Now let's look at the interfaces inside the namespace. Here I'm doing an ip netns exec, which you can think of as a bit like docker exec or kubectl exec: we're going into the namespace and running a command in there. The bottom interface listed is the veth end we created. Then let's see if we can ping it from the node. Cool, that seems to work, which means we've got connectivity going both ways.

One question you could ask is: what is actually responding to this? Normally when you think about containers, you think about a process inside the container, and that's the thing you're talking to. In this case I haven't put a process in there; I've just created a namespace. So I'm sending ping requests, ICMP packets, in, and it's the kernel stack inside the namespace that is responding with the ICMP responses. For a more realistic example we might start a process in there and talk to that, but from the point of view of investigating connectivity we don't need to, so I'm just going to do it this way because it's easier.
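[For reference, a minimal sketch of this step. The names and addresses are assumed from the diagram; the slide expresses the namespace's default route with the namespace itself as the gateway, while this sketch uses an equivalent on-link default route.]

    # Step 1: one network namespace, connected to the host via a veth pair.
    sudo ip netns add con                              # the "container"
    sudo ip link add veth0 type veth peer name ceth0   # the veth pair
    sudo ip link set ceth0 netns con                   # one end into the namespace
    sudo ip link set veth0 up                          # bring up the host end

    sudo ip netns exec con ip link set lo up           # loopback (optional)
    sudo ip netns exec con ip link set ceth0 up
    sudo ip netns exec con ip addr add 172.16.0.2/24 dev ceth0

    # Host route: the container range goes straight to the veth interface.
    sudo ip route add 172.16.0.0/24 dev veth0
    # Namespace route: everything back out of the only interface there is.
    sudo ip netns exec con ip route add default dev ceth0

    ping -c1 172.16.0.2                                # host -> namespace
    sudo ip netns exec con ping -c1 10.0.0.10          # namespace -> host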
Next step — oops, wrong way — is the case where we have two namespaces on a single node. It looks similar: we've got the big node on the outside, and two containers, each a separate network namespace, given different IP addresses; one is 172.16.0.2 and one is 172.16.0.3. The veth pairs are exactly the same. The thing that's different is the rectangular box in the middle, which is how we achieve connectivity between them, and that's a Linux bridge. Again, just like veth pairs, it's a Linux networking construct you can create with the ip link command. I've created one here, called br0, and I've given it an IP address. You don't strictly need to give it an IP address; if I didn't, you'd still be able to route packets between the two containers, but there'd be no way off that bridge onto the host, and ultimately out onto other machines. By giving it an IP address it becomes a gateway to that little subnet. This is the way Docker works by default, by the way: if you just install Docker, it creates a docker0 bridge much like this.

The final piece of the puzzle is the two routes we have to set up. On the node we've got a /24 range, and I'm assigning all the IPs in that /24 to the containers hanging off the bridge — so I can have 254 of them, I guess — and anything in that range gets routed to the bridge. Inside the container, the bottom rule says anything in that range gets sent directly out of the veth interface; that's a directly connected route. If it's not in that range, use the default route, which uses the bridge IP address as the gateway. So the packet reaches out to the bridge, and when it gets out to the host, the host looks at its routing tables to forward it on to wherever it's destined.

Let's move on. As before we've got a clean virtual machine, so let's have a quick look at the code to set this up. As you might expect, it's similar to last time, but with two of everything, so again I'm not going to go through each step. The key point is line 26, where we're creating the bridge: as I said, you can use the ip link command to create a link called br0, of type bridge. Then a little lower down, around line 33, we assign an IP address to it and bring the bridge up, and we're good to go.
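[A sketch of the bridge version, again with names and addresses assumed from the diagram:]

    # Step 2: two namespaces on one node, joined by a Linux bridge.
    sudo ip link add br0 type bridge          # the bridge (cf. "line 26" above)
    sudo ip addr add 172.16.0.1/24 dev br0    # with an IP, it becomes the gateway
    sudo ip link set br0 up

    for i in 1 2; do
        sudo ip netns add con$i
        sudo ip link add veth$i type veth peer name ceth$i
        sudo ip link set ceth$i netns con$i
        sudo ip link set veth$i master br0    # plug the host end into the bridge
        sudo ip link set veth$i up

        sudo ip netns exec con$i ip link set lo up
        sudo ip netns exec con$i ip link set ceth$i up
        # .2 and .3; the /24 gives the directly connected route for free
        sudo ip netns exec con$i ip addr add 172.16.0.$((i + 1))/24 dev ceth$i
        sudo ip netns exec con$i ip route add default via 172.16.0.1
    done

    # Container to container: TTL stays at 64, since nothing routed the packet.
    sudo ip netns exec con1 ping -c1 172.16.0.3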
Right, so what can we do? We can look at the interfaces on the node, and we can see the bridge at the bottom, with the IP address 172.16.0.1. Now let's get inside one of the containers and see if we can ping the other one. I'm exec'ing into con1, the network namespace on the left, and pinging the one on the right. This is looking good: we've got connectivity both ways.

One thing worth mentioning here is the value of the TTL, the time-to-live of the packet. It's just a number that gets decremented each time a packet gets routed, and it starts at 64. In this case it hasn't been routed at all, because the packet just came out of one interface onto the bridge and straight back in: there's been no routing going on, just a single Ethernet hop across the bridge. I only mention it because in the next section we'll see this changing, and it'll make more sense then; it also proves I'm not pulling the wool over your eyes. Finally, from inside the container, let's make sure we can talk to the node itself. Again that's good: ICMP packets are going in and out, so we've got connectivity both ways.

Right, let's move on to the third step. Now we're doubling up again. The key point in this case is that both nodes are on the same layer 2 network, just connected by a switch. The node on the left is 10.0.0.10 and the node on the right is 10.0.0.20, so they're in the same subnet. Otherwise each is much the same as before: two containers on each, connected with veth pairs, and a bridge on each.

What I'm aiming to get across here is how you can get packets from a container on one node to a container on a different node, and the trick at this point is really quite simple, there's nothing to it: it's just setting some routing rules on each of the nodes so they know where to route packets destined for the other node. If you look at the routing rules in the bottom left-hand corner, the key one is the second one down. It says: any IP address destined for the containers on the right-hand node gets sent, as a next hop, to that node itself, and the node will know how to route it up into its bridge. Likewise there's a corresponding rule on the other node, so anything for the containers in the 172.16.0.0/24 range gets a next hop of the node on the left, which knows how to route it up into its bridge.

So if you have your Kubernetes cluster on a single L2 network, this becomes quite an easy way of getting connectivity: you don't need overlay networks, you don't need any of that magic, and it's the way some of the Kubernetes plugins do it. Flannel has lots of different backends, and one of them, the host gateway backend, does exactly this: it just sets routes on the nodes. I think Calico can behave in a similar manner as well if you're all on the same L2 network. Of course, you'd have an entry per node, so with many nodes you might end up with big routing tables, and you need some way to manage that: somewhere to store the mapping from the range of IPs on a node to the node itself. That could be etcd, or it could be somewhere else, but we'll get to that a bit later.
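[A sketch of the per-node routes, assuming the talk's addressing: node 1 at 10.0.0.10 with container range 172.16.0.0/24, node 2 at 10.0.0.20 with 172.16.1.0/24, and the VirtualBox interface name enp0s8:]

    # Step 3, on node 1 (mirror on node 2 with the ranges swapped).
    # Same L2 network, so the other node's address is a valid next hop.
    sudo ip route add 172.16.1.0/24 via 10.0.0.20 dev enp0s8

    # Forward packets arriving on enp0s8 for the containers instead of
    # dropping them: the node is now acting as a router.
    sudo sysctl -w net.ipv4.ip_forward=1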
So let's get back into our demo. Now we've got two clean virtual machines, the one on the left corresponding to the left-hand node and the other to the right. First, a quick look at the code. All of this is basically the same as what I went through previously; the key bit is underneath the little comment down at the bottom, where we're setting the routes on each node so it knows how to route to the other node. I'm going to run this setup script on both VMs, so we get the routes going both ways. It's kind of as simple as that.

The final thing you need to do is enable IP forwarding on the node. If I didn't do that, Linux by default wouldn't forward packets on: if it received a packet on its eth0 that wasn't destined for the IP address of that interface, it would just throw it away. That makes a lot of sense if you've just got a laptop and you're not acting as a router, but in our case we are acting as a router: we're going to get packets coming in on eth0 destined for one of the containers, and the kernel needs to know to route them on to the bridge. So we have to enable IP forwarding.

Cool, so I'll run that on both sides. First of all, let's look at the routes. On the left-hand side the key route is the bottom one, saying any of the IPs in the 172.16.1.0/24 range get sent to the other node, and there's a corresponding route on the other side sending things back the other way. That's the trick for getting connectivity when you're on the same L2 network. Now let's see if I can ping a container on one node from the other: I'm exec'ing into con1 on the left-hand node and pinging the container on the right-hand node. Cool, it's working, and we've got connectivity both ways.

Remember what I said earlier about the TTL: in this case the time-to-live has gone down by 2, which is what you'd expect, because it's been routed twice — once by the kernel on the left-hand node as it goes out, and once on the right-hand node as it's routed onto the bridge, hence the decrement of 2. If I do the same but instead ping the right-hand node itself, not a container on it, it works, but now the TTL has only gone down by 1, which again makes sense, because it's only been routed on the left-hand node and not on the right-hand one.

So now we get on to step 4, the more complicated one, and the one I've been building up to, because it represents the thing I didn't understand in the first place: flannel, and overlay networks, and all that. Before we move on, what else could we do? If these two nodes were on separate L2 networks, and that switch in the middle wasn't just one switch but the internet, or a bunch of routers, then this trick wouldn't work any longer, because the next hop wouldn't be on the same network. One thing you could do is add routing rules to all the routers in between; if you control all those routers, maybe that'd be okay, but I suspect that's not something most people can do. Another option depends on where you're running: if you're in a cloud environment, and the cloud provides some sort of route table capability — I think Amazon and Google do this — you can assign IP ranges to nodes and set the routes in the cloud, and it does the forwarding for you. So again, instead of using any kind of more complicated overlay, you could just assign those route ranges and let the cloud route it for you.
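[As a hedged illustration of that cloud option, on AWS it could look something like this with the CLI; the IDs are placeholders, and source/destination checking must be disabled so an instance may carry traffic for addresses that are not its own:]

    # For each node: route its container CIDR to its instance in the VPC route table.
    aws ec2 create-route \
        --route-table-id rtb-0123456789abcdef0 \
        --destination-cidr-block 172.16.1.0/24 \
        --instance-id i-0123456789abcdef0

    aws ec2 modify-instance-attribute \
        --instance-id i-0123456789abcdef0 \
        --no-source-dest-check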
But let's assume we can't do that either. What are we left with? One option is to use an overlay network. In the example here we've got a similar setup: two nodes, same containers, same bridges, same everything, but the key point is there's a router in between, so we can't pull the same trick we did in the last step. The one thing that's different is this tun0 interface, and this is the bit that made me understand how you can set up these kinds of virtual networks.

A tun interface is something you create using the ip tool. If you just create one, it'll show up in ifconfig as an interface, but there's nothing behind it. Normally when you have a network interface there's something behind it, some hardware or some virtual NIC, but in this case there's nothing, so it doesn't seem very useful. What you can do, though, is put a process behind it, and when you send a packet to the tun device, that process receives it — the raw IP packet — and can do whatever it wants with it. It could print it to standard out, or send it to a printer and physically print it if you wanted. But what we do is have that process wrap it in, say, a UDP packet and send it to the other node, and that's exactly what happens in an overlay network. The two nodes don't need to know anything about the separate IP ranges of the containers; they just need to be able to connect to each other. We'll see a bit more detail on the next slide.

The last thing to explain here is the routing rules. On the left-hand side we're saying: everything for the containers on my node, send to the bridge; and everything for the containers on the other node, send to the tun device. And we've got corresponding routes on the other side.

So let's drill in and look at how a packet actually makes it from a container in the top left-hand corner all the way round to a container in the top right-hand corner. The packet comes out of the container onto the bridge; it comes off the bridge and the kernel routes it to the tun device. We've set up a process sitting behind that device, and because it can see the raw IP packet, it knows which node to send it to — like I said, it might look that mapping up in a database like etcd. In this case it wraps the packet in a UDP packet and sends it to the other node on port 9000. So it goes out of eth0, through whatever network is in between, and comes in on eth0 on the right-hand node, where a process is listening on port 9000. That process unwraps it, pulls out the raw IP packet, and sends it back into the tun device on that side. When it comes out of the tun device, the kernel sees the original packet and routes it up into the bridge and, hopefully, to its destination.
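[To make the "nothing behind it" point concrete, a minimal sketch, with assumed names and addresses, of creating a bare tun device with the ip tool; until a process attaches to it, anything routed at it goes nowhere:]

    sudo ip tuntap add dev tun0 mode tun       # a tun device with nothing behind it
    sudo ip addr add 10.200.0.1/24 dev tun0    # arbitrary tunnel-side address
    sudo ip link set tun0 up

    # Route the remote containers at it; with no process attached,
    # these packets are simply dropped.
    sudo ip route add 172.16.1.0/24 dev tun0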
Ah, one more thing. When I did this talk once before, someone asked a really good question. They said: but isn't UDP unreliable? It stunned me for a moment, but it doesn't actually matter in this case, because we're getting our reliability at a higher level: it's the TCP traffic on top, inside the tunnel, that does the retries if something fails. You can think of the UDP connection as being a bit like an Ethernet cable: the wire isn't reliable either, but it doesn't matter, because retries are handled by the layers above. So UDP is okay here, and this is exactly how the flannel UDP backend works, which we'll get to in a minute.

Right, so it's a similar setup to before; let's have a look at the code. Again, most of this is identical to what we had before, and if we pop down we can see the stuff specific to this step. As before we have to enable IP forwarding, but the key thing here is that we're using socat to set up the tunnel between the two nodes. If you haven't come across socat before — I'd only come across it recently, when I was looking at this; I fully expected to have to write a little process to sit behind the tun device and do the UDP stuff, and then I found socat, and it's amazing. It sets up a bidirectional stream between two endpoints, and those endpoints can be TCP, or UDP, or stdin/stdout, or tun devices. If you type man socat it'll blow your mind; it does loads of stuff.

There's a lot going on in this one line, and remember we'll be running it on both sides. It's saying: set up a tun device, give it an IP address, and bring the interface up; behind that tun device, run a UDP process listening on port 9000, so any packets that come in to it get received and sent into the tun device; and send stuff out on port 9000 to the other node, so any packets that come from the tun device get sent out to the other node. Because we run the same thing on both sides, we get connectivity between the two.

Finally, there are a couple of other little things you need to do to get this set up. When you start dealing with overlay networks, you have to be wary of the MTU, the maximum transmission unit. We've got to account for the eight bytes of UDP header, hence I'm setting it here to 1492, bumping it down from the 1500 it was before. If we didn't do this it would probably still work, but a packet just above that size would get fragmented. That's not catastrophic, but in the general scheme of things you don't want it, so it's something to be wary of if you set this stuff up yourself.

Then we've got this stuff about disabling reverse path filtering, which is a little more subtle. By default, if Linux sends a packet out of one interface and receives the response on a different interface, it will just drop it; it considers it suspicious, which makes a lot of sense in general. But in this case, when we send a packet from one node to a container on the other node, the packet goes across the tun device on the way out, up to the container, and on the way back it comes straight in on eth0 to the node. So the packet goes out of one interface and comes back in on a different one, and unless we disable reverse path filtering, this demo won't work. I guess there are other ways around this; maybe you could do something more complicated with source-based routing to ensure the packet goes over the tun device whether it's destined for the node or the container, but in this case I chose to do it this way.
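[A hedged sketch of the tunnel for node 1 (mirror on node 2 with the node addresses swapped). Addresses follow the diagram; the exact socat options are one plausible incantation, not necessarily the talk's own:]

    # One socat process does both jobs: UDP endpoint on port 9000 <-> tun device.
    sudo socat \
        UDP:10.0.0.20:9000,bind=10.0.0.10:9000 \
        TUN:10.200.0.1/24,tun-name=tun0,iff-up &
    sleep 1                                    # give socat a moment to create tun0

    sudo ip link set dev tun0 mtu 1492         # leave room for the encapsulation
    sudo ip route add 172.16.1.0/24 dev tun0   # remote containers go via the tunnel

    # Replies can come back in on a different interface than they went out of,
    # so relax reverse path filtering.
    sudo sysctl -w net.ipv4.conf.all.rp_filter=0
    sudo sysctl -w net.ipv4.conf.enp0s8.rp_filter=0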
So let's have a go at running this stuff. Run it on both... oh no, typical, it didn't work. Let's try one more time before I start resorting to recorded videos or something... looking good. So first of all, let's see if it actually works. As before, I'm going to exec into the container on the left-hand node and ping the container on the right-hand node. Cool, that's working, and as before you can see the TTL is down by 2, which is what you'd expect, since it's been routed on both nodes; and if I ping the node itself it should be down to 63 — there it is. So that proves we've got connectivity.

Let's dig a little deeper now and look at what's going on with the packet as it traverses the various interfaces. On one side I'm going to run a little script which just pings continuously. On the other side I've got a script which, given an interface name, runs tshark to sniff the packets on that interface. tshark, if you haven't come across it, is like the terminal version of Wireshark, a bit like tcpdump, and it's great for this kind of stuff.

First, let's capture the packets coming in through the front door, on enp0s8. As you can see, packets are coming in, but the source and destination IP addresses here are those of the nodes themselves. This is the encapsulation at work: you see nothing about the 172.16 source and destination addresses of the actual containers; that's invisible, held within the data section at the end. You can see it's an Ethernet packet wrapping an IP packet wrapping a UDP packet, and then the data, which is the IP packet of the container itself.

Now if we stop that and drill one level deeper, onto the tun device, you can see the packets have been unwrapped: you can see the source and destination IPs of the actual containers, and just like I said, it's a raw IP packet, with the ICMP packet — the ping — inside it. Then, drilling in one more step, we capture the packets on the bridge on the right-hand node, and you can see it's the same source and destination container IPs, but now re-wrapped in an Ethernet frame and sent onto the bridge, which again is what we'd expect. So that is basically the whole overlay network, patched together in a few lines of bash.
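[The capture side, roughly; interface names as assumed above, with a continuous ping running from a container on the other node:]

    # Front door: UDP between the node addresses, container IPs invisible.
    sudo tshark -i enp0s8 -f "udp port 9000"

    # One level deeper: the unwrapped raw IP packets between the container IPs.
    sudo tshark -i tun0

    # Final hop: the same container-to-container packets, re-framed in Ethernet.
    sudo tshark -i br0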
So, a quick recap of what we've done. We've gone through four steps. The first step was the single network namespace, and the key point there was that if you want to connect a namespace to its node, you can use a veth pair. The second step: if you have more than one network namespace on the same node, you use veth pairs along with a bridge. The third step was multiple nodes on the same L2 network, and that was the easy one, where you can just set up routing rules that hop directly to the destination node. And the fourth step is what we just did: you can use a tun device to create an overlay network.

A couple of key takeaways, at least for me: understanding the different types of routing rules was my aha moment for this stuff, and tun devices are one of the things that allow all this virtual magic to work. In terms of tools, you've got ip for setting it all up, socat for bidirectional streams and testing, and for debugging, tcpdump and tshark are your friends.

Finally, I want to bring all this back to real life and relate it to some existing stuff out there. One of the common network solutions for Kubernetes is flannel, and flannel has a bunch of backends which work in different ways. One is the host gateway backend, which corresponds exactly to step 3: that's for when all the nodes are on the same L2 network. One is the UDP backend, which is basically what step 4 does; it doesn't do it with socat, but in essence it's the same. That's not one you'd typically use in production; it's more of an educational backend as far as I can tell, or maybe for debugging. What you would really use is VXLAN, and VXLAN is an overlay network — a UDP-based thing — implemented in the kernel, so I guess it's more efficient. And then there are the cloud-specific backends, which set routes in the cloud like we talked about earlier: one for Amazon, one for GCE.

The other thing I want to characterise about these network solutions is where they store their node-to-pod-IP-range mappings, because they all do it slightly differently. Flannel just stores them in etcd. Then there's Calico. All of these things are very configurable, but you can set Calico up so that there's no overlay for the L2 parts: it uses step-3-style next-hop routing within an L2 network, and for cross-network traffic it can use another type of overlay, IP-in-IP encapsulation, though I'm sure you can configure it to use other things too. As for its node-to-pod-subnet mappings, I believe that's done via BGP: you run BGP agents on the nodes and they distribute the mappings around. Weave is another one: in terms of connectivity it's similar to flannel, in that it uses VXLAN, the UDP overlay, but the difference is it doesn't use etcd; I believe its pod-subnet-to-node mappings are distributed peer-to-peer somehow.
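[For reference, backend selection in flannel lives in its network config; a hedged example of that JSON, with an assumed pod network, where swapping "host-gw" for "udp" or "vxlan" selects the other backends described above:]

    {
      "Network": "172.16.0.0/16",
      "Backend": {
        "Type": "host-gw"
      }
    }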
That's basically all I've got to say. All of these scripts are on the GitHub page, so go and grab them, fiddle around, and send some comments; that'd be great. Any questions?

[Applause]

Okay, one minute I think. Yep — well, I guess it is simulating an L2 switch: it's using the Ethernet packets just the way a normal physical switch would. Oh, okay, I see. I think... probably, yeah, you probably don't have to; you could do it other ways, I'm sure. It's just the way it works here.
Info
Channel: CNCF [Cloud Native Computing Foundation]
Views: 36,005
Rating: 4.9725399 out of 5
Keywords:
Id: 6v_BDHIgOY8
Channel Id: undefined
Length: 34min 44sec (2084 seconds)
Published: Sat Dec 15 2018