Deep Dive: Linkerd - Oliver Gould, Buoyant

Captions
All right, KubeCon — looks like it's time; people are still kind of filing in. How many of you have been to a KubeCon before? Cool, that's a lot — some of you have even done two KubeCons. How many of you are using Kubernetes in production? How many of you aren't, but want to be? OK, most of you are using Kubernetes in production — that's amazing. Last year when we did this it was like half the hands in the room, so that's really impressive. How many of you like writing Go? How many of you have written Rust before? All right, that number is going to be a lot higher next year, I promise you.

My name's Oliver, I'm the creator of Linkerd. I do a lot of work on the proxy side, but I work across the project, and I do a deep dive at every KubeCon I go to. Usually that's a technical deep dive where I have lots of hand-drawn slides and I ramble over them. I'm going to try a slightly different one this time, and either it's going to be really quick and we'll have lots of time for questions, or I'll run way over — we'll find out soon.

I want to talk about why: why do I work on Linkerd? Why do I spend all of my time working on it when I could be doing other things — well, besides hanging out with my dogs? Why does Linkerd exist? Any project should have a good answer for that, but this answer is pretty personal for me. To explain why Linkerd exists, I have to talk about why I was in a position to create it and work on it, and why we're going to keep working on it for quite some time. And to do that I have to take you back about a decade. This is not the start of my career, but it's where the story starts.

I took a job at Twitter. It was down all the time — anyone using Twitter in 2010? OK. There was the fail whale, and every time you went to Twitter you saw it. It was a great place for someone who was in ops and wanted to do ops programming and systems programming to go and participate. I pretty quickly got funneled onto this problem: we had Ganglia and Nagios, and we were trying to build out our own data centers and really expand and grow, and that wasn't going to happen with what we had. My manager, the head of ops, sat me down and said: we need to get all of the host data into a time series database. Another group was working on the time series database; we had to make sure all of the host and system data got in there so that we could alert off of it. Then some other people got wind of the project and said, well, it would be really cool if we could add application metrics too — we had all these Ruby processes, Unicorn, whatever was running Twitter at the time, that we also wanted in that system. And we actually needed to provide alerting, because we couldn't use Nagios anymore — it wasn't going to work with the new system — so we needed to build our own huge alerting system, in Java, for some reason. Then we also needed customizable dashboards: every team was going to need a different set of metrics, and Ganglia is just not usable for anyone who's used a modern web page with JavaScript. So we needed to build a new framework for that. Has anyone used the Bootstrap CSS framework? That came out of this project — Jacob Thornton and Mark Otto were the internal tools team, working on a bunch of the things that fed into our viz system, and Bootstrap was then open sourced and has been wildly successful since.
We also had people on our team reading the Dapper paper at lunch and getting really excited about distributed tracing, so we had this hack project called Big Brother Bird. It was really cool because we could do distributed tracing in our application, and so now my team owned that too. That thing became Zipkin and got open sourced; it's no longer really maintained by Twitter, but it's a pretty successful project. And because this service is used when you're debugging incidents, it has to be much more reliable than every other service at the company — when it's down, everyone is going to let you know, really loudly. And the company is growing: we're at Twitter in 2010, we're on the track to IPO, and in the middle of this we say we're going to do microservices, we invent Mesos and Aurora, and we add all of this extra complexity and all these other metrics we have to collect. It was a bit of a ride.

The observability system we built ended up looking something like this — and since I've left the team I'm sure it's been improved quite a bit. We had a collector that could enumerate every host and go get host metadata and host metrics from each one; it could also talk to ZooKeeper server sets, discover things, and collect data from there. Then we needed to write all of that into the time series database, a service called Cuckoo, which was at one point the largest Cassandra cluster in the world — and that's not something you want to brag about; it's something you lose a lot of sleep over, actually. It's all since been replaced, thankfully. And on top of that we wanted a nice queryable interface, so I could run ad-hoc queries when I'm in an incident and need to diagnose things, so that we could build the dashboard system, and so that our alerting system had someplace to plug into and actually get data.

I was on that team for about three years, and as I was leaving — I wanted to go work on other projects at the company that I thought would help the observability system — there were a few big lessons. One: it would have been really nice if there had been open source tools at the time. OpenTSDB was just getting started while we were developing this; it was a little too early to bet on, and I don't necessarily think it would have been a good bet, but now we have great reusable tools like Prometheus and Grafana that we can just drop in, and they do this job great. I don't want to project onto any companies, but I hope you all are not building your own observability systems anymore and are using something off the shelf. The other thing I learned is that configuration is the root of all evil — absolutely, full stop. That collector system was initially a Python thing I wrote in Twisted, and you had to configure all of the targets and sources for it to talk to. Every time a new service came online, they'd file a ticket with me and I would go edit a Python file, and I'd have to load-balance the configs to make sure services were distributed. It was a lot of awful manual work; it was what we needed to get out the door, but certainly not what was going to scale that team up. Related to that: the operational data model is critical. This is a bit of a loaded statement, but if you let people choose their own service names and put those strings in code and in config and in spreadsheets, you will have many different names for a single service.
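As a rough illustration of the collector-to-TSDB pipeline described above, here is a minimal Go sketch. The names (HostLister, TSDB, Point, Scraper) are hypothetical interfaces invented for this example, not Twitter's actual system:

```go
package collector

import (
	"log"
	"time"
)

// Point is a single metric sample destined for the time series database.
type Point struct {
	Host   string
	Metric string
	Value  float64
	Time   time.Time
}

// HostLister enumerates scrape targets (e.g. from a host inventory or
// ZooKeeper server sets). TSDB is the write side of the storage layer.
type HostLister interface {
	Hosts() ([]string, error)
}

type TSDB interface {
	Write(points []Point) error
}

// Scraper fetches host/system metrics from a single target.
type Scraper func(host string) ([]Point, error)

// collect runs one scrape pass over every known host and writes the
// results into the time series database.
func collect(hosts HostLister, scrape Scraper, db TSDB) {
	targets, err := hosts.Hosts()
	if err != nil {
		log.Printf("listing hosts: %v", err)
		return
	}
	for _, h := range targets {
		points, err := scrape(h)
		if err != nil {
			log.Printf("scraping %s: %v", h, err)
			continue // one bad host shouldn't block the whole pass
		}
		if err := db.Write(points); err != nil {
			log.Printf("writing %d points for %s: %v", len(points), h, err)
		}
	}
}
```

The pain point described in the talk is that the equivalent of `HostLister` was hand-edited configuration; the lesson is to make target discovery automatic rather than a file someone maintains.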
And you won't have a taxonomy to talk about what the staging version of a service is versus the prod version versus my version that I'm using for development. So you need a strong taxonomy in the system, which a system like that observability stack will use, for instance, to link tracing data to metrics data. As far as I know it's still not possible at Twitter to go from a Zipkin dashboard and link to an alert — that would be wonderful — but you need common nouns to reference things across systems; URIs, basically. And the really surprising thing is that I was so in over my head building and productionizing the system that you kind of lose sense of what problems you're solving. All of the technical problems, the scaling problems we were working on, are solvable in a pretty short timeframe — a year or two; a skilled team can go solve them and work through those things. Getting your organization aligned takes much more than a skilled team working for a year or two. If you can't get folks to agree on the operational data model, or you can't get hooks into high-leverage places like the deploy system, it's going to be really hard to instrument these things and productionize them. This will come back — these lessons are relevant.

After that, I played ping pong for about a year — if you worked at Twitter you'd know that was true — and then I went to work on this thing called the traffic team, and we were given a pretty broad remit there. As I mentioned before, we had this ZooKeeper service discovery cluster. Anyone here done ZooKeeper service discovery? Anyone been on call for a ZooKeeper cluster? It's tough; I feel you. So we took that on, for some reason, and we were also very close to the Finagle team — Marius Eriksen, the creator of Finagle, was on this team with us, and we sat right next to the core library team building Finagle. Our job was really to deal with service-discovery-related incidents and make sure we were fixing the core infrastructure in Finagle. Finagle is a JVM Scala functional networking library that basically every Twitter service is written in, so if you fix things there, you don't have to worry about getting people to upgrade — they just deploy again and it's fine. And the feature we were really working on was staging, and more generally making request routing flexible in a complex topology.

A very simplified version of Twitter might look something like this: a big front end doing all sorts of composition, data services behind it that own different parts of the domain, and then something like a user service — everyone has a user service somewhere. Say it's three calls down in the system, and I have a new version of the user service I want to stage before it ships. What we did before was probably pick one random host, upgrade it with the new code, and hope people remembered that even happened — hope we could actually see the differences; a kind of canary. Or we had very complex staging infrastructure where you could reserve a whole copy of the stack and replace part of it for your use case — and there can only be a finite number of those, because it's a lot of resources, so people were basically fighting over staging resources to test their code. So what we wanted to do was make that a header: you add a header in your browser that says, instead of talking to the user service, talk to the user-v2 service, and anywhere that request goes, the context goes with it and we can apply that logic.
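A minimal sketch of that idea in Go rather than Finagle/Scala — a routing override travels with the request and is consulted wherever a downstream service name is resolved. The header name, map shape, and function names here are all illustrative, not Twitter's or Linkerd's actual wire format:

```go
package routing

import (
	"context"
	"net/http"
	"strings"
)

// ctxKey is an unexported context key for routing overrides.
type ctxKey struct{}

// Overrides maps a logical service name to a staged replacement,
// e.g. {"users": "users-v2"}.
type Overrides map[string]string

const overrideHeader = "X-Route-Override" // hypothetical header name

// FromRequest pulls an override like "users=users-v2" off the incoming
// request and stores it in the context so it follows the call graph.
// Outbound clients would also copy the header onto downstream requests
// so the override keeps propagating.
func FromRequest(ctx context.Context, r *http.Request) context.Context {
	ov := Overrides{}
	if v := r.Header.Get(overrideHeader); v != "" {
		if kv := strings.SplitN(v, "=", 2); len(kv) == 2 {
			ov[kv[0]] = kv[1]
		}
	}
	return context.WithValue(ctx, ctxKey{}, ov)
}

// Resolve returns the service to actually call: the staged version if
// an override is present, otherwise the normal production name.
func Resolve(ctx context.Context, service string) string {
	if ov, ok := ctx.Value(ctxKey{}).(Overrides); ok {
		if staged, ok := ov[service]; ok {
			return staged
		}
	}
	return service
}
```

The point of doing this in a shared layer (Finagle then, the mesh now) is that no individual service has to implement or even know about the override for it to work end to end.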
And because this was all in Finagle, every place we owned we could make sure those contexts got wired through properly — and as long as we didn't hit any evil non-Finagle services, we'd be fine.

The big lessons from my time on the traffic team: microservices are all about communication. In fact, the name Linkerd comes from a concept we learned on this team — thinking about the system like a linker and a loader. A loader is something that schedules and creates pods; a linker is something that names the targets, names the other "libraries" — except in a microservice world your libraries are running services, not code units, and you link at the network layer. So communication is the fundamental thing we have to solve for. And to do that we need diagnostics out the nose: you can no longer go to logs and try to correlate them across systems, you can no longer attach a debugger to one thing, inspect it, and have any clue what's going on across the topology. We need to build diagnostics into the system, into the traffic layer specifically, not just at the resource level. And going back to the earlier point about organizational problems: the highest-leverage way we found to solve them was by putting things in Finagle, a thing we knew was in every request path of every service at the company. We launched this whole new staging system without having to get any of those services to buy into it or be convinced of it — we could just deploy it, and even if a downstream service didn't know it was being staged against, it all worked. So having that kind of fundamental infrastructure layer of control is really important for rolling out these kinds of changes across a company.

And, as you do when you work somewhere for a certain amount of time, you get tired and want to go somewhere else. I had this friend who had been driving me to work for several years; he had quit Twitter and said, we're going to start a company. I thought he was crazy, and he said, no, no, look at what's happening with Docker. I hadn't even used Docker — I'd been in this Twitter hole for a while — but I knew about Mesos and Aurora, and I knew about what we'd learned to call microservices; they were called SOA, or just services, at Twitter. It was clear to me there was an opportunity: the types of tools I was working on at Twitter, I could go work on as my full-time job, in open source, which has been a huge part of my life since college. So that's what we set out to do: take what we had in Twitter's Finagle — this great thing that we knew had all this operational value, all this power, this uniform layer of control and visibility — and figure out how to make it something that people who don't want to write Scala on the JVM can benefit from. Could we make it a component, a product, that you can just drop in?
So Linkerd 1.x, which we created and started working on in 2015 and released in 2016, was the first version of that. It was super configurable. Coming from the framework we had in Finagle, we had really good abstractions for service discovery, so nothing was ZooKeeper-specific: we could talk about marrying ZooKeeper and Kubernetes and Consul and etcd, and building topologies that incorporate all of those, and most of the places Linkerd 1.x got deployed, it was to satisfy those complicated multi-scheduler flexibility cases. And of course we took all of that routing logic I was just talking about and it dropped out of Finagle into Linkerd 1.x by default — which meant you had to go learn a bunch of complex configuration around service naming and fallbacks and so on to get any of this working. It's a nice system; there are people out there who really love it and have done some very sophisticated things that are honestly right over my head. But it's a lot to get started with, and if this is going to be useful, you can't require a course on service mesh just to get started.

Linkerd 1.x also had a particular deployment model. Kubernetes was floating around in the ether but wasn't 1.0 yet; there was Swarm and Nomad and Mesos and Marathon — a big, messy container orchestration ecosystem. Our model, coming from the Mesos world at Twitter, was: put one of these on every host and it handles connection multiplexing and all the hard things at the host level. It's the JVM, so we could only really get it down to 150 megs or so if we really squeezed and prayed — and 150 megs per host is OK if you're not on a micro host or anything really small. But in a pod model, running one of these per pod becomes wild: if you have a 10-meg Go application, how do you even compare the memory footprints? So over time folks put together really interesting topologies with Linkerd 1.x, but we realized that complexity was not the path forward.

Some lessons from that. Again, configuration is the root of all evil. Has anyone here written a dtab? OK, a few of you. Anyone actually like writing dtabs? OK — Alex isn't here; I know one person who likes writing dtabs — but they're a wild dark art. The JVM is also kind of the root of all evil. I would have been offended by that a couple of years ago — it's a really nice platform for building lots of enterprise applications; at Twitter everything is built on the JVM and it works great — but when you're at this point in the infrastructure, in the data path, it's just not suitable from a resource point of view. With Linkerd we knew this microservice thing was happening, but nobody was really talking about it yet; at first Linkerd was positioned as, oh, it's going to replace an F5 load balancer, and all sorts of other weird things you could use it for that weren't really our intent. Over time we saw that everyone picking up Linkerd seriously was doing so because they had microservice problems. The other thing we learned is that Kubernetes is king. Kubernetes has won — we can all agree, or I can just proclaim it. It was really obvious that the Kubernetes pod model was so much more usable than what was out there otherwise: co-scheduling processes, like a proxy alongside an application, is an obvious fallout of the pod model that just works well.
Doing that in Mesos at the time was quite cumbersome. So focusing on that as the security model — being able to have per-pod security guarantees, privacy, isolation — we knew Kubernetes was where this was going. So sometime in 2016 we started prototyping new proxies. We wrote linkerd-tcp, which was a first version of Linkerd 2 in a way, and then I think it was KubeCon 2017 in Austin where we announced something called Conduit. Conduit was our experimental version of what became Linkerd 2, and it's a Kubernetes-native service mesh. We ditched — well, I don't want to say we ditched support for everything else, but we ditched support for everything else: we don't support dtabs anymore, we don't have a pluggable discovery system, we're betting on Kubernetes primitives through and through.

The first thing you get, the reason to install Linkerd, is out-of-the-box traffic observability. Kubernetes does a great job of showing you the state of resources as they run — which pods are on which nodes, how much memory they're using, how many CPU cycles — and all of that can be great and healthy while your site is down. So we need a way to actually look at the traffic, uniformly across the system, so we really have the tools to build and operate live applications on top of Kubernetes. We also want to provide out-of-the-box mTLS identity. Before we implemented this we got lots of requests for "I want TLS," and no one could really articulate what that TLS was — some people wanted ingress TLS, some wanted egress TLS — and after a bunch of those conversations we realized that pod-to-pod communication is squarely within our wheelhouse and is what we can do automatically, without configuration. Again, I hate configuration. So as we do discovery, we talk to the Kubernetes API, we know if Linkerd is installed on both sides of the connection, and if it is, we opportunistically add TLS. It's all discovered transparently, it's all tied to service accounts, and we just do it. There's a lot of room for improvement there, but it works out of the box and it's awesome. In 2.7, the next release — as soon as I get back from KubeCon I'll get back to working on it — we're going to be adding mTLS to all TCP traffic out of the box.

And it's not just these baseline security and visibility concerns; there's a whole bunch we can do in the reliability space. One of the obvious things people were picking up Linkerd 1.x for was, surprisingly, gRPC load balancing — that kind of surprised me at the time. Most load balancers — kube-proxy's load balancer, for instance — are just connection-level load balancers: you get one connection here, one connection there, and all the requests on a connection are bound to it. With HTTP load balancing we can actually look at the requests: each request can be dispatched to a different host, and we look at response latency to inform endpoint selection. We can substantially improve things that way — I have another whole talk on how we can improve success rate just by doing load balancing. It's a really important tool in the toolkit.
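To make the connection-level versus request-level distinction concrete, here is a hedged Go sketch of per-request endpoint selection using power-of-two-choices over a moving average of observed latency. This is the general technique, not Linkerd's exact algorithm, and all names are illustrative:

```go
package balance

import (
	"math/rand"
	"sync"
	"time"
)

// Endpoint tracks a smoothed view of observed response latency for one backend.
type Endpoint struct {
	Addr string

	mu   sync.Mutex
	ewma float64 // exponentially weighted moving average, in milliseconds
}

// Observe folds a completed request's latency into the moving average.
func (e *Endpoint) Observe(rtt time.Duration) {
	const alpha = 0.2 // smoothing factor; illustrative choice
	e.mu.Lock()
	defer e.mu.Unlock()
	ms := float64(rtt.Milliseconds())
	if e.ewma == 0 {
		e.ewma = ms
	} else {
		e.ewma = alpha*ms + (1-alpha)*e.ewma
	}
}

func (e *Endpoint) load() float64 {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.ewma
}

// Pick chooses an endpoint for a single request (not a connection):
// sample two endpoints at random and take the one with lower observed latency.
func Pick(endpoints []*Endpoint) *Endpoint {
	if len(endpoints) == 0 {
		return nil
	}
	if len(endpoints) == 1 {
		return endpoints[0]
	}
	a := endpoints[rand.Intn(len(endpoints))]
	b := endpoints[rand.Intn(len(endpoints))]
	if a.load() <= b.load() {
		return a
	}
	return b
}
```

The real balancer is more sophisticated, but the key point stands: the choice happens per request, informed by latency, rather than once per long-lived connection — which is exactly what multiplexed protocols like gRPC need.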
We chose not to build this in Scala on the JVM. We decided we wanted to write the control plane in Go, because we thought client-go was awesome — and it is awesome, I guess, but it's quite a bit more difficult than we expected. We had written our own Kubernetes client in Scala, and, well, Monzo has written some great blog posts about their incidents — let me put it that way. The Kubernetes API is not something you want to write a client for; it's a really hard thing to get right, you're dealing with a distributed system and converging state, and it's just very difficult. So we wanted to leverage something like client-go that would hopefully solve a lot of those problems for us — I have some hindsight on that, but I'll share it later. The other decision was a Rust data plane — a native, fast data plane. I'll get into why Rust in a bit, but we knew the JVM wasn't going to work, we knew we needed a native language, so we went with Rust. And finally, I get to use Prometheus and Grafana: we bundle a small Prometheus instance with some default Grafana dashboards so you get basic stats out of the box without any configuration. You can configure another Prometheus to scrape all of this data, and we're working on making the Prometheus part pluggable so you can use your own directly instead of ours. There's a bunch of work there, but our goal is not to build that — we just want to use the tools that are already good.

It all hangs together something like this — this is slightly inaccurate because our topology is always changing — but the main idea is that the control plane is basically a set of microservices: a bunch of Go controllers, operators, admission controllers, and gRPC services for the proxies. Those all run in a dedicated namespace, they can be replicated, and alongside them is Prometheus. Then there are the proxies, which get added to every pod — every pod you enable — via a mutating webhook called the proxy injector, up at the top of the diagram. Every time a pod gets created in your system, Kubernetes asks, "hey Linkerd, what should I do to this pod manifest?" If it looks right, we add the proxy to it; if it looks very wrong, we'll reject it; otherwise we'll probably let it through. The proxy gets added along with some iptables setup, and there's a whole service mesh config CRD story there that I don't want to go into right now.
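A heavily simplified sketch of what a proxy-injecting mutating webhook does — not the real injector, and the container name and image below are placeholders: it receives the AdmissionReview the API server sends for a new pod and responds with a JSON patch appending a sidecar container.

```go
package injector

import (
	"encoding/json"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	corev1 "k8s.io/api/core/v1"
)

// patchOp is a single JSON-patch operation (RFC 6902).
type patchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value"`
}

// ServeInject handles the AdmissionReview sent when a pod is created and
// answers with a patch that appends a sidecar. Error handling, TLS setup,
// and the "reject if it looks very wrong" logic are elided.
func ServeInject(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	sidecar := corev1.Container{Name: "proxy", Image: "example/proxy:latest"}
	patch, _ := json.Marshal([]patchOp{{
		Op:    "add",
		Path:  "/spec/containers/-", // append to the pod's container list
		Value: sidecar,
	}})

	pt := admissionv1.PatchTypeJSONPatch
	review.Response = &admissionv1.AdmissionResponse{
		UID:       review.Request.UID,
		Allowed:   true,
		Patch:     patch,
		PatchType: &pt,
	}
	json.NewEncoder(w).Encode(review)
}
```

The real injector also sets up the init container / iptables rules that transparently redirect the pod's traffic through the proxy; the point here is just the shape of the mechanism: the API server asks, the webhook answers with a patch.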
I don't have a lot of lessons about Linkerd 2 yet, because we're still right in the middle of it, so anything I say now won't be that insightful — but I have a few things I want to harp on.

Kubernetes is the database. This is something Thomas Rampelberg on our team said to me earlier this year, and I've been chewing on it since. Kubernetes is a totally open database: think of it like etcd, but with schemas and CRDs, and a data model — pods, service accounts, labels. That gives us, to some degree, the taxonomy we were missing in other systems: the workload coordinates we used to have to describe by hand we now get for free out of Kubernetes, and there's a label system we can use for selection. I think it's still a little too open-ended for a production system — you want constraints, like "every pod must have these labels" — but it's a great building block. On top of that we now have API extensions: I can add new endpoints to the Kubernetes API, I can add custom resource definitions with schemas that are validated before anything goes in. We have an operational database with controllers running against it that persist the state of the running system. It's awesome — if we had thought about Aurora and Mesos and ZooKeeper in these terms, with that nice an API, I think that project would have been much more successful; figuring out how to use ZooKeeper from first principles, with no validation, was not really workable.

I would not have said this last year, but I've come to the conclusion that the world needs more reference architectures. It's weird, but we get asked all the time: "I really want to do multi-region," or "I really want to do tracing," and when we dig in, there's actually very little in terms of Linkerd features that can or should be built there. What folks really want is a pattern that shows: how do I instrument tracing from the ingress, through my application, and through the mesh? All of that together is the useful thing — any one slice of it doesn't get me to prod. Similarly, for multi-cluster and multi-region, we really have to think about global-scale load balancing and what failover means. There are big concerns there that I hope don't belong in Linkerd, but we'll solve them if we have to. The other lesson I've been learning, slowly, is that infrastructure projects like those in the CNCF only succeed by building trust slowly, over time. There's no one talk I can give that will convince you Linkerd is production-ready, or really convince you to use it at all. What we have to do as maintainers and as a community is keep showing up and be really open and clear in our communication, and in 2020 we're going to focus heavily on that side of things.

OK, I'll take a little sip of water. So: why another proxy? I know I'm probably running out of time. The proxy goals: one, small in terms of memory — if this is a sidecar in every pod, it can't even be 10 megs; I think right now it's around 2 to 6 by default. It needs to be fast, in terms of low latency: we can't have garbage collection pauses adding latency and unpredictability into the system. It needs to be light on CPU: if we're consuming your CPU quota, you're not going to have a good time in your cluster, so it has to be low overhead. It needs to be safe — this should probably be listed first: I'm not putting Heartbleed on every node in your cluster; that's not an incident I want to deal with. And finally it has to be malleable: I have to be able to work in this thing and modify it every day. I can't take some general off-the-shelf configurable thing, because we build features that touch the data plane, and it had better be something I actually want to work on every day.

So that leads us to: no garbage collection — no JVM, no Go — a native language, to get the memory footprint down and the CPU usage right. And, maybe more surprising: I need a strong type system in my life. I am not smart enough to write good, or at least workable, software without a type system and a compiler that's going to help me — that's something we learned from Scala and carried forward, and it's not going away anytime soon. We also want to be able to specialize. The proxy is not a configurable thing — well, there may be a proxy config, but it's an opaque implementation detail and we want to keep it that way. Part of that is transparent protocol detection: you never have to tell us what protocol a given port is talking — well, you might in the future, for very weird reasons — but everything should just work out of the box.
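For flavor, here's a hedged Go sketch of the general idea behind transparent protocol detection — peek at the first bytes of an accepted connection and classify it. This is not Linkerd's actual Rust implementation, and a real version needs timeouts and handling for server-speaks-first protocols:

```go
package detect

import (
	"bufio"
	"bytes"
	"net"
)

type Protocol int

const (
	Opaque Protocol = iota // unknown: just proxy the bytes through as TCP
	TLS
	HTTP1
	HTTP2
)

// http2Preface is the start of the fixed client connection preface from RFC 7540.
var http2Preface = []byte("PRI * HTTP/2.0")

// Detect peeks at the start of a connection without consuming it and guesses
// the protocol. The returned reader holds the peeked bytes so the caller can
// still forward the full stream.
func Detect(conn net.Conn) (Protocol, *bufio.Reader) {
	br := bufio.NewReader(conn)
	peek, err := br.Peek(len(http2Preface))
	if err != nil || len(peek) == 0 {
		return Opaque, br // simplified: treat short/failed peeks as opaque TCP
	}
	switch {
	case peek[0] == 0x16: // TLS handshake record type
		return TLS, br
	case bytes.HasPrefix(peek, http2Preface):
		return HTTP2, br
	case bytes.HasPrefix(peek, []byte("GET ")),
		bytes.HasPrefix(peek, []byte("POST")),
		bytes.HasPrefix(peek, []byte("PUT ")),
		bytes.HasPrefix(peek, []byte("HEAD")):
		return HTTP1, br
	default:
		return Opaque, br
	}
}
```

The operational payoff is the one described in the talk: the user never has to declare per-port protocols in configuration; anything the proxy can't classify is simply forwarded as raw TCP.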
And we want to customize for that. We do automatic, transparent mTLS within the mesh, so there's logic specific to Linkerd for how that gets initiated and terminated. Something most folks don't know is that we do automatic HTTP/2 multiplexing between proxies: if a proxy is talking to another proxy, there should never be more than one connection per pair. So if you have HTTP/1 between two services in your app, we'll take that, shove it all through an HTTP/2 channel to the other Linkerd proxy, and turn it back into HTTP/1 on the other side. That really saves on cross-data-center connection initialization costs, mTLS handshake costs, and so on. We also want really, really good Prometheus integration. When we started, we met with Frederic from the Prometheus project, and he said the community isn't getting this — we need really good Prometheus support in a proxy like this — so he helped us work on it. We basically take all the Kubernetes metadata and shove it into the labels on the traffic, so we have richly labeled stats that can power things like dependency views. We also have something called Linkerd tap, which I don't know of in any other system: it's a way to push queries into the live, running proxies. Rather than logging everything to Splunk or whatever and querying it afterwards, we can connect to proxies at runtime and say, show me requests that look like this — I want to dig in and see live requests matching these criteria. If you go to the Linkerd dashboard and see live requests, that's all powered by Linkerd tap, and there's a whole bunch more we're going to do there. I don't know in what order — my focus right now is very much the security roadmap — but there will be lots more we want to do in the data plane as the project goes on.

All right, I'm going to whip through the rest. A little evangelism for Rust: here's what some proxy code looks like. After about two years of fighting it, we actually have really nice composable layers. This is the outbound endpoint stack: for every endpoint we create a service wrapped in all of these layers — at the bottom a tracing layer that gives us extra context for log messages and errors, then metrics, tap, protocol upgrading, and so on. The point is that we can write lots of orthogonal, separate bits of logic that are testable independently and compose them together into the proxy's logic. That's something we learned from Finagle at Twitter and ported forward into Rust and the Tower ecosystem.
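The actual stack is Rust built on Tower, but the layering idea translates to most languages. Here is a rough Go analogy using http.Handler middleware, with made-up layer names, just to illustrate "small orthogonal pieces composed around an inner service":

```go
package stack

import (
	"log"
	"net/http"
	"time"
)

// Layer wraps a handler with one orthogonal piece of behavior, the way
// Tower layers wrap a Service in the proxy's endpoint stack.
type Layer func(http.Handler) http.Handler

// logging and timing are stand-ins for the tracing/metrics/tap layers
// described above; each is small and testable on its own.
func logging(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		log.Printf("-> %s %s", r.Method, r.URL.Path)
		next.ServeHTTP(w, r)
	})
}

func timing(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		log.Printf("<- %s %s in %s", r.Method, r.URL.Path, time.Since(start))
	})
}

// Build composes the layers around an inner handler, outermost first,
// mirroring how the outbound endpoint stack is assembled.
func Build(inner http.Handler, layers ...Layer) http.Handler {
	h := inner
	for i := len(layers) - 1; i >= 0; i-- {
		h = layers[i](h)
	}
	return h
}

// Usage sketch: Build(endpointHandler, logging, timing) wraps the endpoint
// so logging runs outermost, then timing, then the endpoint itself.
```

The design benefit is the one the talk names: each concern lives in its own layer with its own tests, and the full proxy behavior is just the composition.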
OK — this quote is the nicest thing anyone has ever said about my work, so I had to put it up on a slide at least once. We had a security audit done through the CNCF in July; we worked with Cure53 in Berlin. They found two minor bugs in our web dashboard; as of the edge release two weeks ago they're fixed, and they'll both be fixed in the next stable release. They gave us really glowing reviews about the project and the way we work on it — so don't take my word for it, take theirs.

OK, wrapping up: big bets for 2020. Mandatory TLS by default: it has to be on in the mesh. There can't be "I'll enable TLS opportunistically and maybe it's not there." We need to get to a place where it's just mandatory: if it's not TLS, or a health check, or a liveness probe, we fail the request — that's how it has to work. Once we're there, we can start to talk about inter-cluster identity and policy. That's a big request, but it's not something we're going to do until we've nailed the identity model; cross-cluster identity comes after we get identity nailed. And the craziest bet on the slide, I think, is that I want to reduce Linkerd's lines of code by at least 10%. I want to maintain less code and have a better ecosystem of libraries, so we're definitely going to push on that.

Part of all three of those is the Service Mesh Interface. SMI is a partnership with Microsoft and HashiCorp and some other folks around standardizing certain CRDs and API extensions, so that integrators who want to do what Flagger does — traffic splitting and traffic shifting — don't have to implement against any one service mesh; they just use the interface. And if they want to read metrics about those services, they can use a standard interface rather than writing to any one mesh. Furthermore, I hope we'll be donating or finding common implementations of core components that Linkerd shouldn't be maintaining — things like the CNI plugin, the proxy-init container that configures iptables, and some of the multi-cluster syncing work that's left to do. I think those all belong in SMI, and we're going to have to work with that community on it.

OK, the big flashy slide I'm required to have: Linkerd has been in production for a long time, we're doing well, we had the security audit done, we added distributed tracing recently, and the 2020 work is really, again, advancing the security model and making it extend across clusters. I'll let you take a picture and then I'll show you the last slide. Thank you very much — please come get involved, the more the merrier — and if you have questions I'll try to field them here, or we have a booth somewhere; find me there. Thanks.
Info
Channel: CNCF [Cloud Native Computing Foundation]
Views: 3,121
Rating: 4.8947368 out of 5
Id: NqjRqe0J98U
Length: 34min 52sec (2092 seconds)
Published: Thu Nov 21 2019