Contrail vRouter implementation and performance - Raja Sivaramakrishnan, Juniper Networks

Captions
Hi, my name is Raja Sivaramakrishnan. I'm a software engineer at Juniper Networks; previously I was at Contrail Systems, which was acquired by Juniper at the end of last year. I will be talking about the Juniper Contrail solution, then I'll go into details of the Contrail vRouter, which is one of the components of the solution, then talk about the changes we made to optimize performance, and finally I'll talk about service chaining and policy-based forwarding, which are features of the solution.

The Contrail solution is a network virtualization solution. It allows you to create virtual networks that are independent of the physical network, so you can provision new applications or migrate existing applications without having to touch the physical network infrastructure and without having to reconfigure security policies, load balancing, things like that. Any changes in the physical network do not affect the virtual networks, and there is full isolation of tenants: failures or misconfigurations in one tenant do not affect other tenants, addresses are unique or private per tenant, and failures in the virtual domain don't propagate into the physical domain. We are also able to peer directly with a gateway router in a cloud data center, so we don't need a separate gateway node in order to bridge between the virtual world and the physical world. Finally, our code is open source. We just released our product this week, and what we've open-sourced is the production version of our code, so we don't have a separate open-source train and a production train; there's just one train of code, which we have open sourced, and I'll have a pointer to the code at the end of this presentation.

In the next slide we'll look at where the Contrail solution fits in a cloud data center. Typically we have an orchestrator, a system like OpenStack or CloudStack, and the orchestrator has a compute API, a storage API, and a network API. The compute API is responsible for spawning virtual machines on the compute nodes in the data center, the storage API provisions storage for those VMs, and the network API is responsible for setting up the communication between the VMs. The orchestrator talks to the Contrail configuration system (what's marked as Contrail VNS) using a REST API, and the configuration system talks to the vRouter module, which is the kernel module that runs on all the compute nodes. It can also talk directly to the gateway router, which could be a Juniper MX or a router from another vendor, and it can talk to service nodes, which could be virtual, running inside virtual machines, or physical boxes that implement services.

In the next slide we'll look at the components of this solution. We have the orchestrator talking to the configuration nodes (there can be multiple configuration nodes), and the configuration nodes generate config that's consumed by the control nodes. There can be a federation of control nodes talking to each other using iBGP, and the control nodes talk to the data forwarding elements, the kernel module inside the compute nodes, using XMPP. The control node also talks to the service nodes and the gateway node using BGP and Netconf. We also have an analytics engine, which allows for monitoring traffic, debugging and troubleshooting problems, and looking at traffic trends, top talkers, that kind of thing, which we won't be focusing on in this presentation.
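For illustration, the routes that the control nodes exchange over iBGP and push to the vRouters over XMPP essentially carry the information described later in the talk: the VM's address, the hosting server's address as the next hop, and an MPLS label, scoped to a tenant VRF in the L3VPN style. A rough C sketch of that information, with invented names; this is not Contrail's actual message schema:

    #include <stdint.h>
    #include <netinet/in.h>

    /*
     * Illustrative only: roughly what a virtual-network route carries in an
     * L3VPN-style control plane. Field names are made up for this sketch and
     * do not reflect Contrail's actual XMPP/BGP message format.
     */
    struct vn_route {
        struct in_addr vm_prefix;    /* VM address inside the virtual network     */
        uint8_t        prefix_len;   /* prefix length of the advertised route     */
        struct in_addr server_ip;    /* next hop: physical IP of the compute node */
        uint32_t       mpls_label;   /* label the remote vRouter maps to a tap    */
        uint64_t       route_target; /* identifies the tenant VRF (L3VPN-style)   */
    };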
In terms of the physical topology, a typical cloud data center would have leaf and spine switches, with BGP running between them, and gateway routers which talk to the spine switches. Then there are racks of servers, which could be the servers where the virtual machines run, and some of these racks could also be used to run elements of the Contrail solution: the configuration nodes, the control nodes running the control protocol, analytics, and the UI.

This slide illustrates how we integrate with OpenStack, for example to create a virtual machine. Horizon, which is the OpenStack UI, would be used to create a virtual machine. It talks to Nova, which is the compute API in OpenStack, and that invokes the Nova scheduler to decide which compute node the virtual machine should be spawned on. On that compute node there is an OpenStack element which talks to the Contrail agent in order to provision that virtual machine. It also talks to a Quantum plugin, which talks to the Contrail configuration node, which in turn talks to the control node to create an interface for that virtual machine. Everything in blue in that picture is from OpenStack, and the parts in green are part of the Contrail solution. Finally, the control node talks to the agent using XMPP. One thing I wanted to point out is that our solution is entirely based on standard protocols: we use BGP, MPLS, XMPP, and the encapsulation is MPLS over GRE. These are all standards, and we should be able to interoperate with any vendor's implementation of these protocols. [In response to a question] Yes, we'll get to that; there's a slide about that.

Before we get to the kernel module, here is a brief description of the control plane. In this picture we have two servers with a virtual machine being spawned on each one. When a VM is spawned, the agent that is running on that server advertises that VM to the control node, which is running the control protocol, and it advertises it with the address of the server and the label that's associated with that VM. Similarly, the VM spawned on server 2 is advertised to the control node. This information is exchanged between control nodes and then sent to all compute nodes in the same virtual network. So if VM 1 wants to send a packet to VM 2, the packet would be encapsulated using a GRE header and an MPLS header, and the outer header contains the source and destination addresses of the servers themselves.

[Answering audience questions] No, it's not; this slide actually talks about that: it's a kernel module on the compute node. Yes, between the control node and the vRouter there is a protocol, and that is XMPP. And yes, we borrowed heavily from MPLS L3 VPNs, so a lot of the concepts that we use in this solution are borrowed from there; in that sense that is true. We support CentOS, Ubuntu and Fedora, and also Xen.

On the compute node we have a kernel module, which is the vRouter module, and it talks to a user-space agent. The virtual machines have their tap interfaces, which are interfaces that the vRouter module knows about, and inside the module we have the notion of separate VRFs for each tenant, again borrowing from the L3 VPN concept. Each VRF has its own forwarding table, and we also support flows to implement forwarding policies. For receiving packets, in 3.x kernels there is a receive handler that you can register on an interface, and we receive packets using that.
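As a rough sketch of the 3.x receive-handler hook just mentioned: a kernel module can claim packets arriving on a network device with netdev_rx_handler_register(). This is a minimal, hypothetical example, not the actual vRouter code; the device name "eth0" and the pass-everything handler body are placeholders.

    #include <linux/module.h>
    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    static struct net_device *phys_dev;

    /* Called for every packet received on the hooked device. A real vRouter
     * would decapsulate and forward here; this placeholder just lets packets
     * continue up the normal Linux stack. */
    static rx_handler_result_t demo_rx_handler(struct sk_buff **pskb)
    {
        return RX_HANDLER_PASS;
    }

    static int __init demo_init(void)
    {
        int err;

        phys_dev = dev_get_by_name(&init_net, "eth0");  /* assumed NIC name */
        if (!phys_dev)
            return -ENODEV;

        rtnl_lock();
        err = netdev_rx_handler_register(phys_dev, demo_rx_handler, NULL);
        rtnl_unlock();
        if (err)
            dev_put(phys_dev);
        return err;
    }

    static void __exit demo_exit(void)
    {
        rtnl_lock();
        netdev_rx_handler_unregister(phys_dev);
        rtnl_unlock();
        dev_put(phys_dev);
    }

    module_init(demo_init);
    module_exit(demo_exit);
    MODULE_LICENSE("GPL");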
[In response to a question] No, it's just a different API: in older kernel versions there's a bridge hook that we tap into, and that's how packets enter the kernel module. From the tap interfaces, again, there is a handler that you can hook, and that's how those packets enter. Yes, correct: as I mentioned, the vRouter is an alternative to the Linux bridge or the OVS module in the kernel.

[Audience question] An analogous question: I assume iptables isn't going to work as it normally does if you have the vRouter module installed? [Answer] I'm not that familiar with iptables. [Audience] I'm just asking because your third bullet point up there says the vRouter performs networking services like security policies, NAT, mirroring and load balancing, and we already have a lot of those facilities in the kernel: traffic classification, iptables for NAT, that sort of thing. I assume this short-circuits that? [Answer] Right, it doesn't go through all of that.

[Audience question] You said you open-sourced all of this; are you maintaining it upstream or out of tree? [Answer] Currently we have our own GitHub repository.

[Audience question] For the VRFs, are you using the Linux VRF patch or the network namespace feature for the different tenants? [Answer] We have implemented it ourselves inside the kernel module: we have our own forwarding tables, our own routes, next hops and interfaces, all implemented inside the kernel module. [To a follow-up] Sorry, could you repeat that? No, at the moment we have our own APIs, and what the APIs allow you to do is add routes, next hops and interfaces, and we also allow adding flows, but it's not the same as the OVS API. All our code is open source.

[Audience question about why not use flows] Right, and that's an excellent question. We've borrowed a lot from the MPLS L3 VPN concepts, and the underlying primitives there are routes, next hops and MPLS labels. Those are different from using flows, and that's the fundamental reason why we decided to do this on our own. [Audience comment] You don't know me, but I started Ipsilon Networks way back when; we kind of invented flow switching, and we totally had our asses kicked by the MPLS guys, who were not doing per-flow stuff. So pay attention, or you'll be having all these per-flow scaling problems later; this could be a very serious competitor.

In this slide we look at how packet forwarding happens. We have a VM sending a packet to a VM running on another compute node. The first thing the VM does is send an ARP request, which is trapped by the vRouter, and the vRouter responds with its own MAC in the ARP response. Then the VM sends the IP packet, and based on which tap interface the packet arrived on, we look it up in that VRF. The result of the lookup is the IP address of the destination server and the MPLS label to use, so we encapsulate the inner IP packet in MPLS and GRE, with the outer IP destination set to the other server's address. Once the packet reaches that server, the vRouter module decapsulates the GRE header, and based on the MPLS label it knows which VM on that machine needs to receive the packet, so it goes up the corresponding tap interface and finally reaches the VM.
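To make the encapsulation step above concrete: MPLS over GRE nests the tenant's IP packet behind an MPLS label stack entry, carried in a GRE header whose protocol type is 0x8847 (MPLS unicast), inside an outer IP header addressed between the two servers. Below is a small sketch of the standard 32-bit MPLS label stack entry encoding, with the on-wire layering shown in comments; this is the generic encoding, not code taken from the vRouter.

    #include <stdint.h>
    #include <arpa/inet.h>

    /*
     * On-wire layering for MPLS over GRE, outer to inner:
     *
     *   outer Ethernet
     *   outer IPv4  (src = sending server, dst = receiving server, protocol 47 = GRE)
     *   GRE header  (protocol type 0x8847 = MPLS unicast)
     *   MPLS label stack entry (label advertised by the receiving vRouter for the VM)
     *   inner IPv4  (src = sending VM, dst = receiving VM)
     *   ...payload...
     */

    /* Build one MPLS label stack entry: label(20) | TC(3) | S(1) | TTL(8). */
    static uint32_t mpls_lse(uint32_t label, uint8_t tc, int bottom_of_stack, uint8_t ttl)
    {
        return htonl(((label & 0xFFFFFu) << 12) |
                     ((uint32_t)(tc & 0x7u) << 9) |
                     ((bottom_of_stack ? 1u : 0u) << 8) |
                     ttl);
    }

On the receiving server, the vRouter strips the GRE header and uses the label in this entry to pick the destination VM's tap interface, as described in the walkthrough above.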
This slide shows the APIs and interfaces that the vRouter module has. It talks to the physical interface and also to the tap interfaces of the virtual machines, and any packet that should not be consumed by the vRouter is sent to the Linux network stack using another interface. The vRouter also handles some packets itself, like DHCP: when a VM sends a DHCP request, the vRouter module sends it up to user space, and the DHCP request is satisfied from user space. We also have a netlink API, which includes things like adding routes, next hops and interfaces, getting statistics, adding flows, things like that. In the next slides we look at the changes we made to optimize for performance. [Answering an audience question] Yes, we have a user-space agent and it talks to the control plane, so whenever a new VM is instantiated the control plane tells the agent about the new VM, and based on that we program the module using this API. Yes, everything is open sourced.

To measure performance we have a setup with two servers connected by a 10-gigabit link with a 1500-byte MTU. Each server has 2 CPU sockets with 6 cores each; these are Xeons running at 2.5 GHz. We are not depending on any of the segmentation capabilities of the NIC, because we want to be able to handle the lowest common denominator, so we do segmentation in software. With that we get a baseline performance of 3 gigabits per second using MPLS over GRE when we run a TCP streaming test between the VMs.

We then implemented GRO inside the vRouter module, and with that the throughput improved to about 5 gigabits per second. But what we saw was that the CPU receiving the packets was the bottleneck: it was doing all the processing, and the vhost processing to actually send the packet into the VM was also usually happening on the same CPU. So we decided to move that work to a different CPU, and to do that we used RPS. With GRE, most NICs are not able to look inside the inner packet, so all flows between the same pair of hosts end up on the same queue; even if the NIC is multi-queue capable, traffic still goes to the same queue, and this is a scenario which RPS handles pretty well. So we did RPS on the outer header: if the packet arrives on CPU core 0, we do RPS and send it to CPU 1, and all the vRouter processing, decapsulating the packet and doing GRO, happens on CPU 1. With that we were able to get about 7 gigabits per second.

The bottleneck was still CPU 1, because all the processing was happening there, so we did RPS again on the inner header. The packet arrives on CPU 0, where all the physical interface processing happens; it then goes to CPU 1, where all the vRouter processing and GRO happens; and then it goes to CPU 2, where all the vhost processing to actually send the packet into the destination VM happens. With that we are able to get about 9.1 gigabits per second, which is pretty close to line rate, and we're getting bidirectional throughput of about 13.5 gigabits per second. There is some variability in the performance based on how the VM is scheduled: if the VM is scheduled on the same CPUs that are doing the packet processing work, the throughput goes down a little bit, but on average we see something between 8 and 9 gigabits per second on a 10-gigabit link. Because we're doing RPS twice, there is some impact on latency; we measured that with a request-response test and saw less than a 10 percent degradation. CPU consumption on both the sender and the receiver is about 120 percent; this doesn't count the CPU consumed by the guest itself, just the CPU for handling the packets and the vhost thread in the kernel.
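For reference, stock Linux RPS steering is enabled per receive queue by writing a CPU bitmask to sysfs; a minimal sketch is below, where the interface name, queue number and mask are assumptions for illustration. The second steering stage described above, on the inner header after decapsulation, is something the vRouter module does itself rather than through this knob.

    #include <stdio.h>
    #include <stdlib.h>

    /*
     * Enable stock Linux RPS on one receive queue by writing a CPU bitmask
     * to sysfs. "eth0", queue rx-0 and mask 0x2 (steer to CPU 1) are
     * illustrative choices, not values taken from the talk.
     */
    int main(void)
    {
        const char *path = "/sys/class/net/eth0/queues/rx-0/rps_cpus";
        FILE *f = fopen(path, "w");

        if (!f) {
            perror(path);
            return EXIT_FAILURE;
        }
        fprintf(f, "2\n");   /* bitmask 0x2: packets from this queue go to CPU 1 */
        fclose(f);
        return EXIT_SUCCESS;
    }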
If we change the encapsulation to MPLS over UDP, the CPU consumption on the receiver is about 15 percent less. This is because the verification of the checksum can be done by the NIC if it's UDP, but for MPLS over GRE the NICs are not capable of verifying the checksum. [Answering an audience question] No, even a single stream: let's say the packet arrives on core 0; from there it goes to core 1 and then to core 2, so even for a single stream, yes, it does have an impact.

I see that we are almost out of time, so just a couple more things which I won't go into in detail. We support service chaining, so you can have multiple services: a packet can be sent from a VM to a service, which could be a firewall, from there to a load balancer, and then to the destination VM, and this is all orchestrated by the control plane using BGP. We also support policy-based forwarding; I won't go into details of this, but we have a flow table where you can have policies to accept, deny, or NAT packets, things like that.

Most of what I presented today is work done by other people on the Contrail team, so I want to acknowledge that. Our source code is on opencontrail.org. That's all I have, thank you.
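As an illustrative aside on the flow table used for policy-based forwarding above: a flow entry conceptually pairs a per-VRF match key with one of the actions named in the talk. The structures and names below are invented for this sketch and are not the vRouter's actual data structures.

    #include <stdint.h>
    #include <netinet/in.h>

    /* Hypothetical flow key: the usual 5-tuple, scoped to a tenant VRF. */
    struct flow_key {
        uint32_t vrf_id;
        struct in_addr src_ip, dst_ip;
        uint16_t src_port, dst_port;
        uint8_t  proto;
    };

    /* Actions mentioned in the talk: accept, deny, NAT, mirror. */
    enum flow_action {
        FLOW_ACCEPT,
        FLOW_DENY,
        FLOW_NAT,
        FLOW_MIRROR,   /* forward normally and copy to a mirror destination */
    };

    struct flow_entry {
        struct flow_key  key;
        enum flow_action action;
        struct in_addr   mirror_dst;  /* only meaningful for FLOW_MIRROR */
    };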
Info
Channel: Linux Plumbers Conference
Views: 7,645
Rating: 4.7837839 out of 5
Keywords: Juniper Networks (Organization), Contrail (Cloud), Lpc2013, Network Virtualization, Linux Plumbers Conference 2013
Id: xhn7AYvv2Yg
Length: 22min 13sec (1333 seconds)
Published: Tue Oct 08 2013