An Introduction to Open vSwitch - Simon Horman

Captions
Thank you very much. I'm imagining most of you in this room don't know what Open vSwitch is, so the purpose of this presentation is to quickly describe what it is and, very briefly, how it works, and then go through some examples of some of the features that make it interesting. So, I've covered that one.

So what is it? It's a multi-layer virtual switch. What does that mean? It means that, like a physical switch, it operates at layer 2, but it can also operate at layer 3 and layer 4, so you can direct packets based not only on the MAC address but also, for instance, on the IP address. It's a software implementation: it's purely software and runs on commodity hardware running Linux. It has a controller running in user space, which is the decision-making engine, and it has the datapath in the kernel, which is what actually passes the packets around. It's quite fast: perhaps not as fast as a hardware switch might be, but as fast as Linux networking is. It's an implementation of a protocol called OpenFlow, but it actually implements a lot of features which aren't in OpenFlow.

Where can I get it? openvswitch.org is the place to go. All of the code is available in git. The datapath is going to be included in Linux 3.3, not 3.2 as it says up there. There are announcement, discussion and user mailing lists. It's a very open community. Nicira is the company behind it; I'm not associated with those people in any way, I am just a member of the community. The licensing is important, I think: the user-space tools are Apache licensed, the kernel portion is GPL because it needs to be compatible with the kernel, and the bits in between, the shared header files, are both.

There are a few concepts you need to understand in order to be able to discuss what Open vSwitch does and how it works. A switch contains ports. If we think of a physical switch, and I'm sure you've all seen one, that would be the holes where
you plug in the Ethernet cables. Actually, that's not true: the holes would be the interfaces. A port may have one or more interfaces. Really the difference between a port and an interface is that you can use some technology like bonding to create a port that uses multiple interfaces. Usually each port has one interface, but you may have more than one.

Packets are forwarded on the basis of flows. If we think of a very simple switch, a flow would be defined by the destination MAC address. Here, flows may be defined by any of these things. I'm not going to read out that list, but you can see it's quite long: it includes source and destination MAC addresses, but it also includes a variety of other things, including the source and destination IP address, which is not information that would normally be used in a layer 2 switch. This is one of the key things that differentiates Open vSwitch from the existing bridging implementation, which has been in the Linux kernel for many, many years: the flexibility that all of these keys allow you.

So when a frame comes in, how does Open vSwitch deal with it? Firstly, and it doesn't say this up there, it hooks into the Linux kernel at the same place that the Linux bridging code hooks in. That place has been extended slightly to allow it to be shared, but essentially it's bypassing all of the Linux routing code. When the very first packet comes in for a flow, the datapath doesn't know what to do with it. The datapath is not a particularly intelligent thing. It will see this packet come in, do a hash table lookup and discover that it doesn't know anything about it. It then sends that packet, which it doesn't know anything about, down to the user-space controller and forgets about it. It doesn't wait for a response; it just sends it and gets on with whatever the next
thing it has to do is. The controller then reads this packet and makes a decision based on whatever rules have been configured. If we're just doing a simple MAC-based learning switch, it will look at the destination MAC address and make a decision based on that, or it may decide to flood it. Whatever it decides, it then tells the datapath what should happen to packets belonging to that flow: it teaches the datapath how to match that flow and what to do with packets that match it. So that's stages one and two. Then lastly it sends the packet back up to the kernel, because the kernel has forgotten about that packet; it just sent it down to user space and that's it. If the controller wanted to drop the packet, it would just do nothing.

Okay, so now this packet has come in and there's been a bit of a delay, so this is a potential performance problem, and then the packet is sent on. That's fine: for the first packet of a TCP connection, essentially the other end will be waiting. Unless this takes an extremely long time there is no retry; the connection will just stall for a few microseconds. And of course the next packet that comes through, the datapath already knows what to do with, because the flow is already in the datapath. So this fourth stage, which is the second packet, just goes straight through: the kernel does a hash table lookup, realizes that it knows what to do, and does whatever the hash table lookup tells it to do. So this forwarding is very fast; this is your fast path. Stages one, two and three are slow, and that's just the way it works.

Okay, so how do we manage this thing? It's important to be able to configure things, otherwise we have no control over what they're doing. It's configured by a JSON database. The fact that it's JSON isn't particularly remarkable, but the rest of it is kind of interesting. It's
persistent across restarts. If we think of something like the Linux routing table, or the Linux bridging code itself, if I reboot my machine I do not expect my routing table to magically reappear unless there are some init scripts to reconfigure it. In this case we have a persistent database, which is in user space, so whatever I configure now, if I restart the daemon or I restart the whole machine, it will be there again.

This is kind of cool too: in order to configure the system, we connect to the database, and then the database connects to another part of the system. My call to the database won't return until that configuration action has either failed or successfully completed. So we can rely on the fact that if I ask the database "please add a new bridge", then by the time it says "yes, I've done it", it's actually done and ready to work. This is interesting because it allows us to, for instance, configure the database remotely without having to worry about or guess how long that communication is taking; we know that when the message comes back, it's done. And that's the last point: usually you configure the database using a Unix socket for efficiency reasons, but it can be remote. It can use TCP, and that TCP connection can optionally use SSL. So I think this is kind of cool.

Okay, so what does it look like? I'm not sure how many of you have used bridging, but let's say we want to add a very simple bridge: we create a bridge and we add a port. This is the simplest thing you could do. Firstly we need to make sure the user-space portion is running. Ordinarily this would be done at init, but this is fundamentally different to the way the bridging code works: the bridging code only exists in the kernel, there is no user-space portion. Then we create a bridge. I like this --may-exist optional argument because it makes the operation idempotent, which means I can create the bridge;
the bridge creation action will succeed even if the bridge already exists. Sometimes you lose track of what exists and what doesn't, and it's easier to use this than to test whether the bridge exists and then create it if it doesn't; you just use --may-exist. And again --may-exist is here, and we add a port. So this bridge, bridge zero, now has one port, which is eth0. There we are. Oops, going the wrong way.

Okay, so afterwards we want to modify it. We decide we don't want eth0 to belong to the bridge anymore: no problem, remove it. We decide we don't want the bridge at all: we can just remove it, which implicitly runs this action, because once the bridge is gone nothing is a member of it anymore. So that's kind of interesting, and apart from the user-space part, this maps very closely to the bridging code. In practice you wouldn't really use this, because if you're just doing what the bridging code can do anyway, why would you bother? But you can.

Okay, so now I'm going to look at some more sophisticated configuration options, which hopefully will give you an idea of some of the more interesting things you can do. I'm going to talk about four things. I noticed while I was preparing this talk that this actually maps very closely to the man page; I apologize for that, it was not intentional.

So, VLANs. Fundamentally, VLANs are a method of partitioning a physical layer 2 network into multiple logical layer 2 networks. It's a pretty old concept; I imagine many of you have some experience with it. If I have a switch and it's configured to use VLANs, each of the ports can be one of two types: an access port or a trunk port. An access port does not use tagging. Whatever you've plugged into it, maybe some sort of desktop equipment or anything, is completely agnostic. Only packets for the configured VLAN will appear on that
wire, and the switch makes sure that that occurs. A trunk port makes use of tags to allow you to have multiple VLANs using the same port. So for instance, if I had two VLANs on this switch and two VLANs on that switch and I want them to communicate with each other, I can just use a single link between the two switches and make use of trunking.

So how does this relate to Open vSwitch? By providing a tag (the tag is the VLAN ID) when we add a port to a bridge, we basically turn it into an access port. Imagine a bridge with only these two ports: there's a physical port, and then there's this access port we've just created. If anything comes across the physical port with VLAN tag 7, it will go to that port; if it has any other tag, or no tag at all, it will not go to this port. So we can basically partition our network, and that's kind of nice and exciting.

This is an example of something that I don't actually believe is possible to do with bridging. I could be wrong, but because of some quirks in the way the Linux network stack works, when you're using bridging you essentially need to associate your IP address with the bridge itself, not with ports attached to it. This is because the Ethernet link-layer addressing is stripped off at inappropriate moments if you're not using the right port. So this is why I believe you can't simply create a bridge and attach a VLAN interface to it; you need the bridge itself to understand the VLANs. Internally, what this does is set up the datapath with some actions to strip off the tag or attach the tag, and match it as necessary.

Okay, so SPAN is another technology that I find quite interesting. The idea of SPAN is that you can mirror traffic from one or more ports to another port. Now why on earth would you want to do that? Let's imagine you have a scenario where you have a router, perhaps you have a small office, and
you want to somehow measure the traffic that's going through this router. For some reason the router itself is not a nice Linux box and doesn't have the facility to do that, so you can approximate what's going on there by just observing all the traffic that goes into the router on its Ethernet port. Well, how do you do that? Back in the good old days, or the not-so-good old days, when we just used hubs, you would simply plug a machine in nearby and sniff the traffic. Bingo. You can't really do that anymore because everyone uses switches. So what do you do? Do you somehow try to get a hub involved in this equation somewhere? Do you plug a Linux box in between your switch and your router? All of those things would work fine, or you can use SPAN to basically mirror all the traffic from the router port to another port and just sniff that. So it's kind of cool. There's also a remote version, where you can potentially sniff on other switches, but I don't believe you can do that with Open vSwitch easily.

So how do we do mirroring, how do we do SPAN, with Open vSwitch? Firstly, we don't have a physical switch, so we can't just plug physical things in, but we can plug in as many non-physical things as we want. So we create this dummy interface. A dummy interface in Linux is just a sink, basically: you can put stuff in there and it will disappear. It's like /dev/null for networking, but you can use tcpdump and tools like this to watch what is going down the sink. So this is a pretty good match for what we want to do, and it's very efficient. So we create this: this just adds the dummy interface to the kernel. Then we activate the link, because you can't tcpdump a link that's down, and then we add that to our bridge, which we created a little bit earlier. So it's all sitting there, not doing anything.

Okay, so now how do we tell the system that we want to use mirroring? I've got
to explain this syntax a little bit, because it's a little bit special. Firstly, the double dashes allow you to string multiple control commands together in a single invocation of the tool, so each line here, other than the first line, corresponds to a separate command. That's thing number one to know. Thing number two to know is this bizarre notation here with the @ sign and the p, and it's over here too, and there's another one, m. These are variables. So this get port command fetches the ID, the internal identifier, of that port and stuffs it in this variable called p, which we can then use down here. So the output of this is used as the input to this, and the output of this is used as the input of this. So now hopefully we can read the syntax.

So what are we actually doing? We find the UUID of the target interface, and then we create a mirror. A mirror is actually just an entry in the database, but it's a conceptual thing: we're going to mirror some traffic, and we need a way to store that configuration. That will be stored in a row of the Mirror table, called mirror0. Okay, so we've created the mirror, and we're going to configure the mirror to do something.

We have the mirror, but it's just sitting there by itself; it's not attached to anything. So now we need to say that this mirror will be used on bridge zero, and that's what the third line does. The bridge has a list of mirrors, which was empty up until now, and now we've added one mirror to it, which is the one created on the previous slide. It's still not really doing anything, but we're building up the configuration. Lastly, we say that any traffic this mirror is configured to mirror will be output to port p, which is the dummy interface we looked up a little bit earlier. So we've created a mirror, we've attached it to the bridge, and we've said its output will be on our dummy interface. Okay, so far
we know that we have the mirror set up and we can put packets out, but we haven't configured it to collect any data, so that's the next thing. We look up our tap interface. This is just going off to a virtual host; it could be any interface we're interested in at all, it's just an arbitrary example. We say that we're interested in any packets that are going into that interface, and also, on the next line, any packets that are coming out of it: so all packets going in and out of that port. Okay, so that's it, it now works, and I will demonstrate that, hopefully, in just a moment.

The other way you can do this last step is to say: I want to mirror everything that's going through this switch, apart from, obviously, the destination port, because otherwise it will get into a loop. I want to select all. That's kind of cool to know, and I guess if you only had one port that would be a shortcut. I don't know; I think this one's more interesting.

So let me see if I can get that working for you. I made a few little scripts to make this a little bit easier. Okay, so what are we doing? This just sets up the basic networking. These lines here just make sure it's in a known state, and then we create the bridge and we get an IP address. Let's see if this works. Okay, so we're now configured. Maybe you can't see that, but the Ethernet interface is up and doesn't really have an address apart from the IPv6 link-local one, and the address that we might ordinarily expect to be on the Ethernet interface is over here on the bridge interface. And if I ping something, like myself, it's working. Okay, that's wonderful.

The next stage is to try and set up some mirroring. So how does this work? We're going to create our dummy interface and we're going to add it to the bridge. This is the configuration I described just
before. Okay, that's also working, there's no error. This just shows us the Mirror table in the database, and we can see it's got one entry, which is called mirror, and most of its data is empty. Oh sure, I will make the font larger as requested. Okay, that's as big as it goes. So that's the database configuration we've created so far. Lastly we try and activate this. The activation is to tell the mirror that everything from eth0 should be mirrored to the dummy interface. Okie-dokie, with a bit of luck. I must say I'm not really very keen on giving demos, because I feel that they always go wrong, but let's see. Up here I'm running tcpdump on the dummy interface, and you'll just have to take my word for the fact that that is all the traffic that's going through eth0, which is being mirrored over to dummy0. It's quite a lot. I could run some filters on that. In this case it's kind of pointless, because I could just sniff the eth0 interface directly, but I hope that illustrates things a little bit. Where did my presentation go? Good. Okay, so that's mirroring.

I have two more topics, but I'm running out of time, so I'm going to skip one of them. I'm just going to do tunneling; this is the last one. Tunneling allows you to take two L2 networks which are separated by some kind of routing and join them together. Why might you want to do this? Your server room might fill up, and you might want to put some machines somewhere else. For some reason there's routing going on between those two rooms, but you want them to use the same L2 network, at least to a limited extent. You could also use it for remote access, if for some reason you felt like you wanted to have an L2-based VPN instead of an L3-based one; you should be able to do it. It supports GRE, which is a very popular tunneling protocol that you can use to
tunnel almost anything, and it also supports VXLAN, which is a relatively new addition. It has been developed because some physical switches aren't smart enough to do GRE in an intelligent manner, so you end up with a lot of flooded traffic. I think it's what I would describe as a hack. Anyway, how do I set up tunneling? We create an interface; we call it gre0. Its type is GRE; this is the most important thing in the whole thing. If I wanted to use VXLAN, I'd just call it vxlan. Its remote IP address, the other end of the tunnel, is 10.0.0.8; the local address is 10.0.0.9; and the key (GRE tunnels can have keys, so you can have multiple of them) is number one. Then we create a port that uses this interface, and then we add the port to the bridge. Bingo. This information here, the local IP and the key, is not strictly necessary: if you're only planning to use one tunnel between the same endpoints you don't need the key, and you don't really need the local address at all. I didn't realize that initially. Obviously you need to configure the other end too; there's not much point otherwise. These addresses have been exchanged; everything else is the same. And again, you don't need these addresses, you just need this, so you just need to know the other endpoint, really.

I'm not going to demonstrate that, but you can send packets between two endpoints, and you can also throw IPsec into the mix here. There was a bit of a security theme in the keynote this morning, so you can communicate between your two endpoints in a more secure fashion. I didn't include how to do that because I haven't actually done it, but I believe it works.

Okay, so I'm not going to go through QoS; feel free to come and ask me all about it later. Suffice to say that you can have per-port QoS policies, if that's what you want to do. Thank you very much.

Thanks for the talk, that was very informative on the details of it. Can you give some
examples of real-world uses, where it's useful?

Sorry, I meant to go through that. Why would you use this? The initial target is virtualized networking. Let's imagine you have a host with various guests on top of it. Usually when you have guests you want them to have some kind of network connectivity; perhaps you want to sell them to people. The bridging code can already do all of that, but in those kinds of multi-tenant environments there are things that you might want to do, like VLANs, which the bridging code is not well suited to. The idea is, rather than continuously extending the bridging code, which is entirely contained in the kernel, to move the logic down to user space to give more flexibility. There are other uses for it: because it's a general framework, people have discovered all sorts of weird and wonderful ways to use it, but the original aim was virtualized network environments. You can basically use this to create a virtual network that is independent of all the underlying hardware. You can use it to basically eliminate routing from what's going on in your Linux box, if that's what you want to do. To be honest, I understand the plumbing much more than I understand how to use it.

Can you, say, rewrite a VLAN tag? The use case is you've got an upstream provider that's providing a wide area network over a VLAN.

Sure. So the question is, can you rewrite VLAN tags? I believe that to be true, and if it's not true it would be very easy to add.

Can you use ebtables with it?

Can I use ebtables with that? I believe the answer is no, but you could do some of the things ebtables does, like dropping packets, using the provided mechanisms.

Just to note that it's our intention to embed Open vSwitch into Oracle VM, so we are actually going to switch from base Linux bonding and bridging across to Open vSwitch in some future version of Oracle VM as well. The nifty thing about Open vSwitch over the base
bridging code is that we can create virtual switches that span hypervisors. So on a cluster of Oracle VM servers you get what is usually really expensive in the Cisco world, which is switches that actually talk to each other to create trunks and stuff like that. That's very useful in the virtualization space.

Thank you, that's very exciting. I didn't know that.

Do you know if anyone's doing any open hardware to make [inaudible] switches that run this stuff?

Okay, so the OpenFlow protocol, which I only mentioned briefly, is designed so you can have the controller part in software and the hardware as a black box. I think that's the closest we've got. I'm not aware of any open hardware projects.

Towards the start you were talking about how, for the first packet that comes through, the controller then tells the kernel what to do with it. Does the kernel expire those rules that were set up by the controller? How does that work in terms of expiry?

Expiry, oh yes. After the controller teaches the datapath what to do, there's just a simple timeout on it. I think the default may be 10 seconds or a minute, something like this, but it only expires if it hasn't seen any packets.

You have a bunch of different matching fields. How is the flow table lookup done?

I don't know what the hash key is. I think it must construct the hash key from all the possible fields of an entry in the hash table. In the controller you can have wildcard entries, so if you don't care about the IP address, that's just "anything". In the kernel, all the fields are fully populated. So in the controller you can have one rule that covers all the traffic you will ever see; in the datapath, every single unique flow will have its own entry, so it's an exact match. So I think the hash key would involve looking up all the fields. Yes, potentially the lookup is more expensive than bridging, because it would suck in more of the
packet. If you're just using basic bridging you might just use that; there are no plans to get rid of the bridging code. I think in practice you would have to be doing rather high packet rates to notice it, but it's a good point. Any more questions? Yes.

How scalable is it, in terms of both performance and management, to visualize all these single-port kinds of configurations that you've created?

Right, so how scalable is this in terms of management? I believe the way that it's set up allows you to create a scalable interface, but I also believe no one has done so, so I think the current answer is "not yet". You were also asking about packet rates and things like that, and I've done some work on packet rates. The key bottleneck is the rate at which you can create new flows, because that has to go across that user-space boundary. I have done significant work improving the performance of that. Essentially you used to have a bottleneck around a few tens of thousands of flows, and you just couldn't do anything more. It's now, not a few tens of millions, but at least a few million, so we've improved that a bit. The rate of flow creation still needs some work; I have some patches outstanding. The scalability issue there is that there are certain things that occur, for instance dumping statistics on all the flows present in the kernel for accounting, and obviously the more flows that are present, the more expensive that becomes. In terms of packet rate (not flow creation rate, packet rate), I once came up with a number of 12 million packets a second, but I was never able to reproduce it, so I don't know: I went to lunch and came back and it didn't work anymore. But I can reliably get four million packets a second without a great deal of tuning. The main limitation on this, again, is the controller, which is single-threaded. But if you're just pumping packets through the system, of course
there are some hotspots in the code, because it hasn't been optimized yet, because no one has really looked at this, but it should be possible to make it rather fast, I think. For the four million, and the maybe twelve million, I was using Intel ten-gigabit cards which were linked together using a crossover cable. The tunneling has significant performance issues, which must be bugs given the way the datapath works. So there's some work to be done, but it should be pretty fast eventually.

Well, thank you very much for your time. I usually say please come up to me afterwards and talk to me, but actually I have to leave the conference after this, so please send me an email if you have any questions. Thank you very much, everyone.

Thank you so much. On behalf of the LCA 2012 team we want to thank you for a good time. Thank you very much.
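
The first-packet handling described in the talk (a datapath miss, a hand-off to the user-space controller, a flow being installed, then fast-path hits) can be sketched in a few lines of Python. This is a toy model, not Open vSwitch code: the class names `Controller` and `Datapath`, the three-field flow key and the `("flood", None)` action encoding are all assumptions made for illustration, and the miss is handled synchronously here, whereas the talk describes the kernel handing the packet off and moving on.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """Hypothetical user-space decision engine: a simple MAC-learning switch."""
    mac_to_port: dict = field(default_factory=dict)

    def decide(self, in_port, src_mac, dst_mac):
        # Learn where the sender lives, then look up the destination.
        self.mac_to_port[src_mac] = in_port
        out_port = self.mac_to_port.get(dst_mac)
        if out_port is None:
            return ("flood", None)      # destination unknown: flood
        return ("output", out_port)     # destination known: one port


@dataclass
class Datapath:
    """Hypothetical kernel side: an exact-match table of flow key -> action."""
    controller: Controller
    flows: dict = field(default_factory=dict)
    misses: int = 0
    hits: int = 0

    def receive(self, in_port, src_mac, dst_mac):
        # Exact match on every field in the key (an assumed, reduced key).
        key = (in_port, src_mac, dst_mac)
        action = self.flows.get(key)
        if action is None:
            # Slow path: ask the controller, then install the flow.
            self.misses += 1
            action = self.controller.decide(in_port, src_mac, dst_mac)
            self.flows[key] = action
        else:
            # Fast path: a single hash table lookup.
            self.hits += 1
        return action
```

Calling `receive` twice with the same arguments takes the slow path once and the fast path once. The model has no flow expiry, so a cached action stays in place even after the controller has learned something new; the timeout mentioned in the questions is what bounds that in practice.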
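
The mirror configuration walked through in the talk (select packets going into and out of a port, or select everything, and copy them to an output port) can be modelled in the same spirit. This is a sketch under assumptions, not the Open vSwitch implementation: the `Mirror` class and `forward` function are hypothetical, the field names only loosely follow the select-source, select-destination, select-all and output-port settings described, and `tap0` and `eth1` are made-up port names.

```python
from dataclasses import dataclass, field


@dataclass
class Mirror:
    """Hypothetical model of one mirror attached to a bridge."""
    output_port: str
    select_src_ports: set = field(default_factory=set)  # packets arriving on these
    select_dst_ports: set = field(default_factory=set)  # packets leaving on these
    select_all: bool = False

    def selects(self, in_port, out_port):
        # Never mirror traffic that already involves the mirror's own
        # output port, otherwise copies would loop.
        if self.output_port in (in_port, out_port):
            return False
        if self.select_all:
            return True
        return (in_port in self.select_src_ports
                or out_port in self.select_dst_ports)


def forward(mirror, in_port, out_port):
    """Return the ports a packet is sent to: its normal output port,
    plus the mirror's output port when the mirror selects the packet."""
    ports = [out_port]
    if mirror.selects(in_port, out_port):
        ports.append(mirror.output_port)
    return ports
```

With `Mirror("dummy0", {"tap0"}, {"tap0"})`, traffic in either direction through `tap0` gets an extra copy on `dummy0`, while traffic between two other ports is left alone.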
Info
Channel: Linux.conf.au 2012 -- Ballarat, Australia
Views: 38,544
Rating: 4.7802196 out of 5
Keywords: lca_2012, SimonHorman
Id: _PCRNUB7oNw
Length: 35min 49sec (2149 seconds)
Published: Fri Jan 20 2012