Envoy Internals Deep Dive - Matt Klein, Lyft (Advanced Skill Level)

Captions
Great, thanks everyone for coming. I am super excited to give this talk. For the past year and a half I feel like I've been doing the conference circuit singing the praises of service mesh, and now we come here and there are something like 20 talks on service mesh, so I don't feel like I have to give that talk anymore, which is great. What we're going to do today is go into some of the deep-dive system internals of Envoy, so I apologize in advance: if you don't know what Envoy is, this is probably not the talk for you.

A real quick agenda: I'm going to go over the overall goals of Envoy and why we actually started the project, we'll do a high-level architecture overview, and then we're going to dig deep into the threading model. We'll talk about the hot restart capability, which is the ability to restart Envoy, including a full binary reload, without dropping any connections. We're going to talk about stats, which is pretty interesting, and then leave plenty of time for Q&A.

The 30-second version, in case you hopped in here and don't actually know what Envoy is: the Envoy project goal is that the network should be transparent to applications. From a microservice perspective, we're living in a world right now where lots of people are doing microservices, but unfortunately they're still spending too much time not focusing on their actual business logic; they're debugging network issues, they're debugging infrastructure issues, and it's a very confusing situation. We would like to make an abstract network for people so that they can write their applications and focus on their business logic.

When we were originally doing the Envoy project planning, this was back in 2015. I was working at Lyft at the time, and Lyft had started its microservice rollout, and that was not going very well. At the beginning of 2015 Lyft had a monolith plus something like 30 different services written in Python, and because of all of the typical problems in the microservices world the microservice rollout had basically been aborted; people were just having too many problems figuring out what was going on.

From a design-goal perspective, the first decision was that we weren't going to do a library; we were going to do an out-of-process proxy. That's because Lyft, even at that time, already had multiple languages: we had PHP, we had Python, I think there was a single Java service, and we were thinking about doing Go. We knew we didn't want to maintain a library for all of those languages. We also knew, from a long-term perspective, particularly once we open sourced Envoy, that we would be competing in horse-race benchmarks against HAProxy and nginx, so we wanted a solution that was low latency and high performance, and we also wanted to aim for developer productivity. At the beginning of 2015 that led us to choosing C++. If I were choosing today it's not completely clear that we would still use C++, but at the time it was a pretty clear decision.

Envoy at its core is an L3/L4 filter architecture, which means that at its core it's a byte proxy, and that allows it to be used for multiple protocols, which is pretty key. For example, today we use Envoy for Redis, we use it for HTTP, we use it for MongoDB, and we'll probably do Kafka.
So from a low-level perspective we want a core that has a bunch of functionality but that can be extended to handle multiple protocols. Obviously a huge amount of the Internet is HTTP, so we wanted a separate filter stack at that level so that people could write very interesting plug-in functionality there. We knew back in 2015 that HTTP/2 was the future, and at that time essentially no proxies supported full end-to-end HTTP/2 proxying; I think nginx only got that functionality in about the last month, so it has taken a couple of years for most proxies to catch up to this h2-first perspective.

We'll talk about it more, but one of Envoy's core tenets is service and config discovery. We spend a huge amount of time on an API to actually drive Envoy config, and that comes mostly from operational experience: seeing what it's like to deploy a proxy and its configurations and then deal with the operational mess of copying files around, making sure the files get there, files getting corrupted. It's a pretty crappy situation. We also wanted a whole bunch of active and passive health checking, and advanced load balancing was really at the core: load balancing, rate limiting, circuit breaking, timeouts, and so on.

I talk about it in most of my talks, though not this deep-dive one, but I would say the biggest goal of Envoy is actually observability. What you'll find in these polyglot microservice architectures is that understanding and debugging what is going on is an incredibly difficult problem, so having best-in-class stats, logging, and tracing is what really makes it possible to understand the problems that are occurring.

The other thing is that, historically, people would deploy nginx at the edge but deploy HAProxy for their internal service-to-service traffic, and what you'll find is that ninety percent of what these proxies do is the same, maybe even 95 or 99 percent. They're all doing service discovery, they're all doing load balancing; there's very little logic at the edge that is different from what is done internally. So from an operational agility perspective it was very important that we use the same code: it's just a lot easier to operate the same code for your service proxy, your middle proxy, and your edge proxy. And hot restart was a core goal; again, hot restart is the ability to restart Envoy without dropping any connections, and we'll talk about that more.

Here is a very high-level architecture diagram of Envoy. Unfortunately this is a short talk, so I'd love to talk at length about filters but I just don't have time. From a very high-level perspective, what you'll see is that Envoy has a connection processing pipeline. Here I'm only showing normal REST-type traffic, but you can extend this to what it would look like for Redis or MongoDB or something else. Connections come in, and before they're processed we have a couple of different types of filters within Envoy. Filters are extension points that people can write to modify functionality beyond the core. First we have listener filters, which operate before the connection is fully established.
Once the connection is established, we have TCP, or L3/L4, filters; this is where you would do your Redis or MongoDB or HTTP handling, and these create a chain of filters. If you've used libraries like Netty or other libraries that let you compose chains of filters, it's a pretty powerful programming paradigm, because you can do things like have an auth filter followed by a rate-limit filter followed by a protocol-sniffing filter, and you can compose these filters in ways that are quite interesting.

At that layer, from an HTTP processing perspective, we have the HTTP connection manager filter, which is an L3/L4 filter that parses the bytes and produces messages: headers, body, trailers. Once you're up at that layer, there is a separate set of filters that operate on those headers, body, and trailers, and at the HTTP layer those filters can again do things like auth or rate limiting or buffering, but instead of operating at the byte level they operate at the HTTP message level. From a filter writer's perspective that's a lot more natural, and it's a lot easier to plug in functionality there. Once you get through the front end of the proxy and you're about to do your routing, which is what most people use Envoy for, there's a filter called the router filter; that's your service router.

Then there's a whole separate portion of Envoy, the back end, which does all of the upstream, or backend/endpoint, management, and that's what we call the cluster manager. The cluster manager is the thing that knows about all of the sets of backends that Envoy can eventually route to. A cluster is basically a grouping of hosts, so that would be, for example, your location service or your user service or, at Lyft, your pricing service, and a cluster is composed of multiple backends, which are hosts. So it's cluster manager, to clusters, to hosts.

What you'll see from this diagram is that we have this connection processing pipeline, and then there's a bunch of central functionality that is common to the entire pipeline. That's stats; that's admin, which lets people dump rich information from the box and is very useful from a debugging perspective. We have parallel managers: a cluster manager, a listener manager (listeners are what accept incoming traffic and set up all of these filter stacks), and routes as well. And at the core, too, we have the concept of xDS, our discovery service APIs, which I don't have time to cover in depth: the listener discovery service, the cluster discovery service, the endpoint discovery service. That information gets fed into all of these managers, and that's how Envoy learns about all of the backend information.
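Envoy's real filter interfaces are richer than this, but as a rough, illustrative sketch of the L3/L4 chain idea — with made-up, simplified names rather than Envoy's actual API — a chain of byte-level filters might look like the following:

```cpp
#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Simplified stand-ins for Envoy's network filter concepts (names are illustrative).
enum class FilterStatus { Continue, StopIteration };

class ReadFilter {
public:
  virtual ~ReadFilter() = default;
  // Called with raw bytes read from the downstream connection.
  virtual FilterStatus onData(std::string& data) = 0;
};

// Toy auth filter: rejects connections whose first bytes lack a token.
class AuthFilter : public ReadFilter {
public:
  FilterStatus onData(std::string& data) override {
    if (data.rfind("token:", 0) != 0) {
      std::cout << "auth: rejecting connection\n";
      return FilterStatus::StopIteration;  // stop the chain, e.g. close the connection
    }
    return FilterStatus::Continue;
  }
};

// Toy "protocol" filter: in real Envoy this is roughly where the HTTP connection
// manager would parse bytes into headers/body/trailers and run HTTP-level filters.
class EchoProtocolFilter : public ReadFilter {
public:
  FilterStatus onData(std::string& data) override {
    std::cout << "protocol filter saw " << data.size() << " bytes\n";
    return FilterStatus::Continue;
  }
};

// A connection owns an ordered chain of L3/L4 filters.
class FilterChain {
public:
  void add(std::unique_ptr<ReadFilter> f) { filters_.push_back(std::move(f)); }
  void onData(std::string data) {
    for (auto& f : filters_) {
      if (f->onData(data) == FilterStatus::StopIteration) return;
    }
  }
private:
  std::vector<std::unique_ptr<ReadFilter>> filters_;
};

int main() {
  FilterChain chain;
  chain.add(std::make_unique<AuthFilter>());
  chain.add(std::make_unique<EchoProtocolFilter>());
  chain.onData("token:abc GET / HTTP/1.1\r\n");  // passes auth, reaches protocol filter
  chain.onData("GET / HTTP/1.1\r\n");            // stopped by the auth filter
  return 0;
}
```

The HTTP connection manager sits inside a chain like this and, internally, runs its own chain of message-level filters, which is why plugging in HTTP-level behavior feels so natural.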
Let's step back before we dive into what the threading model looks like and talk about historical trends in this area. From a proxy perspective, if you go back 15 or 20 years, the common way to write networking software was to have one operating system thread per connection; that's just how things were written back then. In the last 15 years or so there's been a trend toward what people loosely call C10k. There's a paper from the early 2000s that popularized the term, and it basically means handling 10,000 connections per box, which today sounds like a joke, but at that time those were big numbers.

What's important to understand, and what still holds true today, is that you really can't do connection-per-thread; it just doesn't work. Not to get too deep into operating systems, but threads have stacks and there's context switching, so if you're trying to have ten thousand, twenty thousand, five hundred thousand connections on a box with a connection per thread, you just can't do it: there's not enough memory, and you waste a lot of CPU time context switching. So in the early 2000s the connection-per-thread paradigm fell out of favor. The unfortunate part is that connection-per-thread is very easy to program, because everything is effectively synchronous and blocking; you let the operating system schedule everything for you, and from a reasoning perspective that is quite simple.

In the early 2000s, really popularized by nginx, we moved toward a multiple-connections-per-thread model, and what that uses is an async event loop. Essentially, all of your logic now has to be async, and unlike synchronous programming, which is very easy and most programmers can follow (you block, you get some type of error), asynchronous programming is quite complicated: things happen out of order and you have to handle all of the cases that can arise. But in order to scale individual machines to high connection counts and high throughput, this is basically the only way to do it. There are other methods that people are exploring now, things like coroutines, which are super interesting and I'd love to talk about too, but there's not enough time. Historically, over the last 10 to 15 years, we have been moving toward this highly async model.
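Envoy builds its event loops on libevent rather than on raw epoll, but to make "many connections per thread" concrete, here is a minimal single-threaded epoll echo loop (Linux-specific, error handling omitted, purely illustrative):

```cpp
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
  // One thread, one event loop, many connections: the basic C10k-era pattern.
  int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = htons(9000);
  bind(listen_fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  listen(listen_fd, SOMAXCONN);

  int ep = epoll_create1(0);
  epoll_event ev{};
  ev.events = EPOLLIN;
  ev.data.fd = listen_fd;
  epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);

  epoll_event events[64];
  while (true) {
    int n = epoll_wait(ep, events, 64, -1);  // sleep until any socket is ready
    for (int i = 0; i < n; i++) {
      int fd = events[i].data.fd;
      if (fd == listen_fd) {
        // New connection: make it non-blocking and register it with the loop.
        int conn = accept(listen_fd, nullptr, nullptr);
        if (conn >= 0) {
          fcntl(conn, F_SETFL, O_NONBLOCK);
          epoll_event cev{};
          cev.events = EPOLLIN;
          cev.data.fd = conn;
          epoll_ctl(ep, EPOLL_CTL_ADD, conn, &cev);
        }
      } else {
        // Existing connection: read whatever is available and echo it back.
        char buf[4096];
        ssize_t r = read(fd, buf, sizeof(buf));
        if (r <= 0) {
          epoll_ctl(ep, EPOLL_CTL_DEL, fd, nullptr);
          close(fd);
        } else {
          write(fd, buf, static_cast<size_t>(r));
        }
      }
    }
  }
}
```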
That takes us to Envoy itself and the way the core of Envoy actually works. Unlike nginx, Envoy is a single process: there aren't multiple processes on the box; we run a single process and use different threads within that process. The reason is mainly that coordinating complex tasks between threads is a lot simpler than doing it between processes, and from an operational perspective it's easier to manage one process than to manage an array of processes with a separate process manager.

One of the things we do in Envoy is have a concept of a main thread. This is essentially the boot thread, and the main thread hosts low-throughput but highly important behavior: things like xDS fetches (going out to all of the discovery services), a concept called runtime (loading feature flags from disk), stat flushes, the admin server, and general process management like listening for signals. These things don't take much CPU, so they're not CPU intensive, but we do them all on one thread, and one of the benefits is that none of this requires locking. Even though it's asynchronous, it's a lot simpler to reason about. There's no thread pool or connection pool, and you'll see that as a theme: there are no thread pools in Envoy at all. The reason is that asynchronous programming is hard enough, but when you couple thread pools, and all the locking that comes with them, with asynchronous programming, it gets very, very complicated.

So we have this split where the main thread does a bunch of low-rate but important functionality, and then, based on the concurrency parameter, we boot N worker threads. By default we run one worker thread per hardware thread on your machine, so if you have four hardware cores we will boot four workers. The workers are the actual data plane: they host the listeners that accept connections, and that entire filter pipeline I showed you before happens only on the workers. The reason this is important is that it gives us what we call an embarrassingly parallel architecture: there needs to be no coordination between the workers, which basically means there is no locking. That's a slight lie — there is locking in certain cases, but not in any high-throughput cases. This allows Envoy to scale to a massive number of cores. For a proxy originally written in, say, the late 90s or early 2000s, very high core counts weren't common, whereas today most of the big cloud vendors are running boxes with 90-plus cores, and they run servers like Envoy across all of those cores. At that core count you really have to start thinking about what the locking looks like. Since we were writing this in 2015 and it was pretty clear where architectures were going, we opted for this embarrassingly parallel design.

One thing I will point out, which is interesting but which I don't have much time to go into, is that there are certain things, particularly on Linux, that make this difficult. If you're only going to run one thread per hardware core, that thread can't ever block; if it blocks, you're basically done. There are operations, particularly on Linux, that technically aren't blocking but actually do block, and one of those is file I/O. You'll especially hit this in virtualized environments: when you tell the kernel to do, for example, a cached write — we see this at Lyft all the time — the way the Amazon EBS driver is written, it will end up blocking, and then your thread is blocked, which is obviously bad. So there are cases where, because of this, we have to push functionality off to other threads. For file flushing, what we actually do is put the data onto a file-flush thread so that it never blocks the data-processing threads. But again, this architecture is designed to be massively parallel.
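This is not Envoy's actual access-log implementation, but a minimal sketch of the general pattern — workers enqueue data and return immediately, while a dedicated thread does the (potentially) blocking writes:

```cpp
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// Simplified illustration: workers hand log data to a dedicated flush thread so the
// event loop never performs (potentially) blocking file I/O itself.
class FileFlusher {
public:
  explicit FileFlusher(const std::string& path)
      : out_(path, std::ios::app), thread_([this] { run(); }) {}

  ~FileFlusher() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      done_ = true;
    }
    cv_.notify_one();
    thread_.join();
  }

  // Called from worker threads: just enqueue and return; never touches the disk.
  void write(std::string data) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push(std::move(data));
    }
    cv_.notify_one();
  }

private:
  void run() {
    std::unique_lock<std::mutex> lock(mutex_);
    while (!done_ || !queue_.empty()) {
      cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
      while (!queue_.empty()) {
        std::string data = std::move(queue_.front());
        queue_.pop();
        lock.unlock();
        out_ << data;   // the blocking write happens only on this thread
        out_.flush();
        lock.lock();
      }
    }
  }

  std::ofstream out_;
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::string> queue_;
  bool done_ = false;
  std::thread thread_;
};

int main() {
  FileFlusher flusher("/tmp/access.log");
  // In Envoy the callers would be the worker event loops; here we just simulate them.
  std::thread w1([&] { flusher.write("worker-1 log line\n"); });
  std::thread w2([&] { flusher.write("worker-2 log line\n"); });
  w1.join();
  w2.join();
  return 0;
}
```

The queue lock here is cheap and uncontended compared to a disk write; the point is only that the blocking system call never runs on a worker.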
The next core concept, before I get to how it's used, is a locking paradigm called RCU. This stands for read-copy-update, and it might be familiar to you or it might not; it's a pretty interesting thing that is used heavily within the Linux kernel. It's a synchronization paradigm designed for very high core counts and for read-heavy, write-infrequent workloads, and what it allows you to do is take no locks on the read path. You can spread data across many parallel threads and read that data without requiring any locks at all, which is super interesting.

The way this works, which might not be very intuitive, is that you have some update process, which I show here, and the update process produces new data; from an Envoy perspective that new data might be new cluster information or new route information or something like that. Instead of readers acquiring a lock and then reading, say, what the current hosts are, the updater takes that data, which is reference counted, and posts it over to the event loop running on every worker — remember, every worker is fully async, running its own event loop. The key for RCU is the notion of a quiescent period: there are readers of the data, and the only time you can swap in new data is when those readers are quiescent, that is, not running. From an event-loop perspective, that is any time an event in that loop is not running. So this ref-counted data, which in C++ is basically a shared pointer, is copied to all of the event loops. It might be true that, at the moment the event loop is running, some code is holding a shared pointer to, for example, a route table. But when the event loop comes back around and processes the message carrying the new route table, it copies that new route table in; anyone who held the old route table continues to hold it, and it stays ref counted, but any new reader of the route table gets the new one. That is all done without locking, which is pretty cool; I'd encourage you to go look at the Wikipedia article. In Envoy I'd still call it RCU, but it's a little different from what the kernel does, because the kernel has to deal with interrupts and a whole bunch of other complicated stuff. This locking paradigm is really pervasive in Envoy.
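Envoy distributes updates by posting them to each worker's event loop rather than by swapping a global pointer with atomics, but the RCU-ish property itself — readers take no locks, and old readers keep the old snapshot alive via reference counting until they drop it — can be sketched with shared_ptr. This is illustrative only, not Envoy code:

```cpp
#include <chrono>
#include <iostream>
#include <memory>
#include <string>
#include <thread>

// A "route table" snapshot. Once published it is never mutated, only replaced.
struct RouteTable {
  std::string version;
};

// Writer publishes a new immutable snapshot; readers grab the current one and hold it.
// (C++11 atomic free functions on shared_ptr; note they may use an internal lock on some
// implementations — Envoy instead posts updates to each worker's loop, so readers never
// need even that.)
std::shared_ptr<const RouteTable> g_table = std::make_shared<RouteTable>(RouteTable{"v1"});

std::shared_ptr<const RouteTable> snapshot() {
  return std::atomic_load(&g_table);            // grab the current snapshot, no reader lock
}

void publish(std::shared_ptr<const RouteTable> t) {
  std::atomic_store(&g_table, std::move(t));    // old snapshot stays alive while readers hold it
}

int main() {
  std::thread reader([] {
    for (int i = 0; i < 5; i++) {
      auto t = snapshot();                       // pin the current version for this "event"
      std::cout << "reader sees " << t->version << "\n";
      std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }                                            // old versions are freed when the last holder drops them
  });

  std::this_thread::sleep_for(std::chrono::milliseconds(25));
  publish(std::make_shared<const RouteTable>(RouteTable{"v2"}));  // update without blocking readers
  reader.join();
  return 0;
}
```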
The next important concept is TLS, or thread local storage. As we talked about, we have this massively parallel architecture, and within it we typically have to host thread-local data: things like load balancer context, local host context, or local cluster manager context. There are all kinds of things that, in order to avoid locking, have to be spread, embarrassingly parallel, across all of the workers. The way TLS works in Envoy is a little different from what you might be used to, because it has to be worker aware. In most languages there's some keyword you can use to say "give me a thread-local variable," typically some type of static variable that is thread local. That doesn't really work here, because, as you'll see in a second, we need the ability to use these thread-local slots to accept RCU data and then read it later; a normal operating-system or language-level thread-local variable doesn't give us the context we need.

So what we allow things to do is allocate what we call TLS slots, which is just a vector of pointers. It's an arbitrary vector, and you can imagine different pieces of the system saying "I need a TLS slot to do something"; it's abstract. In this picture we have five different TLS slots that different parts of the system have allocated, and those live on the main thread. On each worker there's a parallel vector of those five slots. The way it works is that we couple TLS and RCU: say a process on the main thread is operating on slot three in this picture, and slot three contains a route table. The updater puts the new, reference-counted route table in slot three and posts to all of the workers, saying "update the thing you have in slot three" — you can see this post in the diagram. Then on the worker, which is running its event loop, when something needs that information it says "give me the current value of slot three," which again might be the route table or cluster information. Coupled together, this gives us a really strong programming building block: we can update data on the main thread, send it over to all the workers where it's eventually consistent and eventually used, and we never have to acquire any locks, which is pretty awesome.

To put that all together, let's look at a concrete case of how Envoy uses TLS and RCU to do cluster updates. Again, to refresh, a cluster is a grouping of backends, like the location service at Lyft. As I was saying before, we have a cluster manager which manages all of the clusters, and the cluster manager is getting input from different things: a health checker doing active health checking, a passive health checker doing outlier detection, and, depending on the service discovery type, DNS or xDS. All of these signals come into the cluster manager, and when the cluster manager detects that a host set has changed — remember, cluster manager, cluster, set of hosts — say a host becomes unhealthy or a host goes away, the cluster manager on the main thread computes the new set of hosts, the new healthy and unhealthy bits. It makes a new array, basically a reference-counted shared pointer to a vector, and then, via its TLS slot, it RCU-posts that to all of the workers. So in the diagram, step one is the cluster manager, step two is the health checker, step three is DNS; then we post this new set over to the event loop on each worker, the worker updates the data in that slot in thread local storage, and when the next event comes along and has to do load balancing or make a decision based on that data, it can grab that information without acquiring any locks. This is embarrassingly parallel and can scale to an almost arbitrary number of threads.
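A heavily simplified model of the slot mechanism (the names here are invented; Envoy's real ThreadLocal interfaces differ): each worker holds a parallel vector of shared_ptr slots, "set" posts the update onto every worker's queue, and reads on the worker are plain loads of the worker's own copy:

```cpp
#include <cstddef>
#include <functional>
#include <iostream>
#include <memory>
#include <queue>
#include <string>
#include <vector>

// Hypothetical, heavily simplified model of Envoy-style TLS slots.
// Each worker owns an event queue ("dispatcher") and a parallel vector of slots.
// All reads happen on the worker itself; updates arrive as posted events.

struct HostSet {                 // e.g. the current healthy hosts of a cluster
  std::vector<std::string> hosts;
};

class Worker {
public:
  void post(std::function<void()> cb) { events_.push(std::move(cb)); }  // thread-safe in real life
  void runPending() {            // stand-in for one turn of the worker's event loop
    while (!events_.empty()) { events_.front()(); events_.pop(); }
  }
  std::vector<std::shared_ptr<const HostSet>> slots;
private:
  std::queue<std::function<void()>> events_;
};

class SlotAllocator {
public:
  explicit SlotAllocator(std::vector<Worker>& workers) : workers_(workers) {}

  size_t allocateSlot() {        // called on the main thread
    for (auto& w : workers_) w.slots.emplace_back();
    return next_index_++;
  }

  // Called on the main thread: publish new data to every worker without any reader locks.
  void set(size_t index, std::shared_ptr<const HostSet> data) {
    for (auto& w : workers_) {
      w.post([&w, index, data] { w.slots[index] = data; });  // applied on the worker's own loop
    }
  }

private:
  std::vector<Worker>& workers_;
  size_t next_index_ = 0;
};

int main() {
  std::vector<Worker> workers(4);
  SlotAllocator tls(workers);

  size_t slot = tls.allocateSlot();
  tls.set(slot, std::make_shared<const HostSet>(HostSet{{"10.0.0.1", "10.0.0.2"}}));

  // Each worker picks up the update the next time its event loop runs,
  // and later reads (e.g. load balancing decisions) are plain pointer loads.
  for (auto& w : workers) {
    w.runPending();
    std::cout << "worker sees " << w.slots[slot]->hosts.size() << " hosts\n";
  }
  return 0;
}
```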
Moving on to hot restart. Again, hot restart is the ability for Envoy to do a full reload, including a binary or configuration reload, without closing any connections. At this conference we spend a lot of time talking about Kubernetes and containers and blue-green deploys, and that's awesome, but most companies still have lots of things that don't run in Kubernetes, and that includes Lyft. In the container or Kubernetes world it's much more common to do what we call a rolling deploy or a blue-green deploy: if you want to run new software, you tell Kubernetes to spin up new containers, you do a gradual traffic shift between the old containers and the new ones, you can roll back or roll forward, and when you're confident in your deploy you kill the old containers. That's wonderful, and I would love to have that, but most companies still don't. Most companies are still stuck in a world where a bunch of software runs in virtual machines or on bare metal, and you can't do a blue-green deploy because you don't have a hundred other computers sitting around to spin up your software, update your load balancers, and do all of those things. So many companies are still in a place where, if you want to update the Envoy binary or configs and you had to drain connections to do it, it would take a long time and be very disruptive. The ability to restart without dropping any connections makes Envoy deploys, and some types of configuration changes, way, way simpler, so this is still a pretty important feature for many people.

The way hot restart works, which is a little different from how most systems do this, is that we allow two different Envoy processes to run at the same time, and there's a shared memory region which mostly contains stats. I'll talk about stats next, but from an operational perspective we have counters, we have gauges, and we have histograms, and particularly for gauges it is incredibly useful for them to be consistent across a hot restart. Here's why: say you have a gauge for the number of active connections. If the gauge is only tracked in the new process, then when you hot restart Envoy with, say, a 15-minute or an hour drain time, your gauge will immediately go to zero even though the box still potentially has a hundred thousand connections, and from a monitoring perspective that's quite confusing. Being able to share gauges, and have consistent counters across the two processes, makes operations much easier. So we have this shared memory region where we allocate a bunch of backing memory for stats, and we have a couple of locks that we use — again, not in the high-throughput data path — for example, if we have to allocate a stat in the shared memory region we have to take a lock that also lives in shared memory, and for a certain type of access logging we have to acquire a shared memory lock; there are a few other things that live in the shared memory region as well.
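Envoy's real shared-memory layout is versioned and considerably more involved, but the basic idea — gauges and counters that both processes can see, so they survive a hot restart — can be sketched with shm_open/mmap and lock-free atomics (Linux/POSIX, illustrative only; a real implementation would initialize and version the region much more carefully):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <atomic>
#include <cstdint>
#include <cstdio>

// Illustrative only: both the old and the new Envoy process map the same named
// shared-memory region, so a gauge like "active connections" stays consistent
// across a hot restart instead of dropping to zero in the new process.
struct SharedStatsBlock {
  std::atomic<uint64_t> downstream_cx_active;  // gauge shared by both processes
  std::atomic<uint64_t> downstream_cx_total;   // counter shared by both processes
};

static_assert(std::atomic<uint64_t>::is_always_lock_free,
              "cross-process atomics must be lock-free");

SharedStatsBlock* attachSharedStats(const char* name) {
  // O_CREAT makes this work for whichever process starts first. The region is
  // zero-filled on creation, which we lean on for initial values here.
  int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
  if (fd < 0) return nullptr;
  ftruncate(fd, sizeof(SharedStatsBlock));
  void* mem = mmap(nullptr, sizeof(SharedStatsBlock), PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
  close(fd);
  return mem == MAP_FAILED ? nullptr : static_cast<SharedStatsBlock*>(mem);
}

int main() {
  SharedStatsBlock* stats = attachSharedStats("/envoy_hot_restart_demo");
  if (!stats) return 1;

  // Either process can bump the shared gauge/counter; the values are visible to both.
  stats->downstream_cx_active.fetch_add(1, std::memory_order_relaxed);
  stats->downstream_cx_total.fetch_add(1, std::memory_order_relaxed);
  std::printf("active connections (shared): %lu\n",
              static_cast<unsigned long>(stats->downstream_cx_active.load()));

  stats->downstream_cx_active.fetch_sub(1, std::memory_order_relaxed);
  munmap(stats, sizeof(SharedStatsBlock));
  return 0;
}
```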
The way it works is that when the new process starts up, we have a very lightweight RPC protocol that runs over Unix domain sockets. The two Envoys talk to each other and ask each other things like "what version are you?" and "are we compatible?", and then they actually do socket passing: we pass listening sockets between the Envoys. We still do that because, although for the past four or five years there's been a way in Linux to open the same socket multiple times (SO_REUSEPORT) so that you don't have to do this socket-passing dance, it turns out that approach still drops connections in certain cases. If you want fully drop-less connection handling, you have to do this dance of passing sockets over this low-level connection. So the new Envoy boots up, allocates its stats, attaches to the existing shared memory region, maybe passes some sockets around, and then tells the old one, "okay, I'm good, I'm accepting traffic, please go start draining." Then, depending on the configuration, there will be some amount of drain time — 15 minutes, an hour, two hours — and eventually the old Envoy shuts down.

This allows the whole exchange to happen without any single-process wrapper, and that's actually key. The way this has historically been done is that a process comes up and builds what we call a little trampoline, and then it keeps forking and exec'ing itself. That does not work in containers, because the way containers work is that you have a process, and when that process dies the container goes away. We wanted hot restart to work in a container world — even in containers there are people who want to use hot restart for various reasons — and because the only communication is over Unix domain sockets and shared memory, if you give your containers the capability to access shared memory and Unix domain sockets, this process works over normal containers, which is pretty powerful. This is the first time I know of that something like this has been built to be container ready.

Okay, let me check my time — great. From a stats perspective we have a couple of different things here, and I think stats are interesting because they're obviously handled at very high throughput. We have what we call a store, which is the holder of the data: counters, gauges, and histograms. We have sinks, which are protocol adapters; Envoy was built from the get-go to support multiple stats backends, so it doesn't only work with Prometheus — it works with statsd, we now have a gRPC metrics service that we can push stats to, and we have an admin server with a stats endpoint that stats can be pulled from. So Envoy can work in both a push mode and a pull mode: if you're using statsd you push, if you're using Prometheus you pull. And then one of the really interesting things we have is the concept of a scope. A scope is a grouping of stats that can be deleted together. You wouldn't think that's important, but it ends up being incredibly important, because when you're using shared memory and you have an architecture where you can live-reload clusters and listeners, if you don't allow stats to be block-deleted you will leak a ton of memory: over time you're going to update clusters, delete clusters, update listeners, delete listeners, so you need groupings of stats that you can delete together.
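A toy model of block-deletable scopes (simplified, invented names, not Envoy's actual stats API): each scope owns the stats created under its prefix, so removing a cluster or listener can drop all of its stats in one operation:

```cpp
#include <cstdint>
#include <iostream>
#include <memory>
#include <string>
#include <unordered_map>

// Heavily simplified model of "scopes": a scope groups the stats created under a
// prefix (say, one cluster's stats) so they can all be dropped in one operation
// when that cluster or listener is removed, instead of leaking forever.

class Counter {
public:
  void inc() { value_++; }
  uint64_t value() const { return value_; }
private:
  uint64_t value_ = 0;
};

class Scope {
public:
  explicit Scope(std::string prefix) : prefix_(std::move(prefix)) {}
  Counter& counter(const std::string& name) {
    return counters_[prefix_ + name];         // created on first use
  }
  size_t size() const { return counters_.size(); }
private:
  std::string prefix_;
  std::unordered_map<std::string, Counter> counters_;
};

class Store {
public:
  std::shared_ptr<Scope> createScope(const std::string& prefix) {
    auto scope = std::make_shared<Scope>(prefix);
    scopes_[prefix] = scope;
    return scope;
  }
  // Deleting the scope releases every stat created under it in one shot.
  void deleteScope(const std::string& prefix) { scopes_.erase(prefix); }
private:
  std::unordered_map<std::string, std::shared_ptr<Scope>> scopes_;
};

int main() {
  Store store;
  auto cluster_scope = store.createScope("cluster.pricing.");
  cluster_scope->counter("upstream_rq_total").inc();
  cluster_scope->counter("upstream_rq_5xx").inc();
  std::cout << "stats in scope: " << cluster_scope->size() << "\n";

  cluster_scope.reset();                  // drop our reference...
  store.deleteScope("cluster.pricing.");  // ...and block-delete everything under the prefix
  return 0;
}
```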
The way stats work, which is very interesting and has been heavily tuned for performance — mainly because Envoy has a lot of stats and alters them quite a bit — again uses some of the concepts we've talked about, RCU and TLS. We have a global store for stats, stores contain scopes, and depending on the type of data we allocate it either from shared memory or from process memory: counters and gauges are allocated from shared memory, histograms from process memory. The main reason is that histograms are a lot more complicated than counters and gauges, and there's not as much operational benefit to having them in shared memory and consistent across restarts; it doesn't matter that much if you lose timing data from the old process.

To keep this very high performance, it's all thread local. If you're a worker and you're looking for a stat, we first look in a thread-local cache for that stat, and if it exists we return it immediately without acquiring any locks, and then the stat can be incremented. If the stat does not exist in the cache, we go back to the central store and look there; if it's there, we add it to the cache and return it. If it's not in the central store either, we go into a slow path: we allocate it from shared memory or from process memory, store it in the central store, and then move it into the TLS cache, and from then on it never has to be accessed via a lock. The one interesting piece is that, because we delete scopes, we need the ability to flush the cache. When a listener or a cluster gets destroyed, the main thread posts a message to all of the workers saying "I don't need this cluster anymore, I don't need this listener anymore, go ahead and delete all of that cached data." That's how we keep things up to date without leaking any memory.

Lastly — and this is something that just got implemented, then got reverted, and is now about to get added again — there are TLS histograms. I'll go through this super quickly because I'm basically out of time. We have a parent histogram, and a TLS histogram that lives on every worker, and each TLS histogram can accept values into one of two backing histograms. So on the worker it's flipping back and forth between histogram A and histogram B, again requiring no locks. During the merge process we post to all of the workers and tell them to flip the current histogram, so that again no locking is needed, and then, if the workers are currently writing into histogram A, we can go and merge all of the backing histogram B's without any locks. That's TLS histograms.
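A simplified sketch of the flip idea (invented names, a single worker, and the worker's event loop modeled as a plain queue of callbacks): samples always go into the worker's active buffer, the flip itself runs as a posted event on the worker, and only then does the main thread merge the inactive buffer:

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <iostream>
#include <queue>
#include <vector>

// Simplified model of per-worker "flip" histograms. The worker only ever records
// into its active buffer; the main thread asks (via a posted event) for a flip,
// and only then merges the now-inactive buffers, so no locking is needed.

struct Histogram {
  std::vector<uint64_t> buckets = std::vector<uint64_t>(10, 0);
  void record(uint64_t value_ms) { buckets[std::min<uint64_t>(value_ms / 10, 9)]++; }
  void mergeInto(Histogram& parent) {
    for (size_t i = 0; i < buckets.size(); i++) parent.buckets[i] += buckets[i];
    buckets.assign(buckets.size(), 0);
  }
};

struct WorkerHistogram {
  Histogram buffers[2];
  int active = 0;                          // only touched from the worker's own loop
  void record(uint64_t v) { buffers[active].record(v); }
  void flip() { active ^= 1; }             // posted to the worker, runs on its loop
  Histogram& inactive() { return buffers[active ^ 1]; }
};

int main() {
  // Stand-in for a worker's event loop: a queue of callbacks drained in order.
  std::queue<std::function<void()>> worker_loop;
  WorkerHistogram wh;
  Histogram parent;

  worker_loop.push([&] { wh.record(12); });   // data-path events record with no locks
  worker_loop.push([&] { wh.record(37); });
  worker_loop.push([&] { wh.flip(); });       // merge step 1: main thread posts a flip
  worker_loop.push([&] { wh.record(55); });   // later samples go into the other buffer

  while (!worker_loop.empty()) { worker_loop.front()(); worker_loop.pop(); }

  // Merge step 2: once the flip has run, the inactive buffer is safe to read and fold in.
  wh.inactive().mergeInto(parent);
  uint64_t total = 0;
  for (auto b : parent.buckets) total += b;
  std::cout << "samples merged into parent: " << total << "\n";  // prints 2
  return 0;
}
```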
In quick summary, from an Envoy perspective: we're biasing for developer productivity; we obviously want high throughput and low latency, but we also want developers to be productive writing code. It's an embarrassingly parallel architecture aimed at scaling to very high hardware thread counts, we use a lot of RCU and TLS, it's designed for a containerized world, and we try to have extensibility at every layer. So thank you. My normal plug for Lyft: we're hiring, so feel free to come grab me if you're interested in jobs; we love growing the Envoy community. I'm over my time, but I'll be standing here and can answer questions. Thank you very much. [Applause]
Info
Channel: CNCF [Cloud Native Computing Foundation]
Views: 25,125
Rating: 4.9156909 out of 5
Id: gQF23Vw0keg
Length: 36min 56sec (2216 seconds)
Published: Fri May 04 2018