"Caching at Netflix: The Hidden Microservice" by Scott Mansfield

Captions
Hi everybody, my name is Scott Mansfield. I'm a senior software engineer on the EVCache team at Netflix, and I'm here to talk to you about the hidden microservice; it's our nickname for the caching layer. To be clear before I start, this has nothing to do with the Open Connect CDN, which some people also refer to as a cache. That's a really cool, interesting, deep topic all on its own, but it's not what we're talking about. This is just the Amazon Web Services cloud caching layer.

So you just signed up for Netflix. Good decision. You get home at the end of the day, you turn on your TV, you open the Netflix app. You sign in with your brand-new account, you choose your profile, and naturally your name is Tester like everybody else. It asks you to pick a few titles that you've liked before, or maybe ones that you've seen, so you pick a few and keep going. It tells you it's personalizing your experience, building a brand-new homepage just for you, and there you are. You have House of Cards and Making a Murderer, which is a fantastic documentary. You could do a search for Tom Hanks and watch Forrest Gump for the millionth time, but you settle on Narcos. The slide says season 2 coming September 2nd; it's already out, in case anybody is excited about that show, so you can go watch season 2. You're watching the first episode, you get to the part where he tells the gardener "me llamo Pablo Emilio Escobar Gaviria" and then "plata o plomo," and you're so impressed by this line that you need to go rate this show five stars, and you imagine showing all of your friends your favorite parts of all of your episodes. And we're keeping track.

So, any guesses on how long we have to capture your attention before you go off and do something else? Guesses? Okay, a little bit short. It's about 90 seconds. So the faster and smoother the experience you have, the more likely we are to keep your attention and keep you watching Netflix, and part of the way we do that is by caching things on the server side.

Let's take a look, in the experience you just saw, at what things caches touch: signing up, logging in, choosing a profile, picking liked videos, the personalization process (which we'll look at in more detail later), loading the homepage itself, scrolling through the homepage, any A/B test allocations that might be on your profile, and even the video image selection, the box art that you're seeing. There's a whole mess of other things beyond that, and if you look, a bunch of these have multiple caches involved in the backend. Put another way, this is what a typical homepage request looks like. This is the output from our request tracing system: the request comes in from the left side and goes all the way down to the right, and if you look at the leaf nodes on the right, there are maybe half a dozen that are not an EVCache node.

So I've talked about it, I've mentioned it, let's talk a little bit about EVCache. It stands for Ephemeral Volatile Cache. It's a key-value store that we run that's optimized for running on Amazon Web Services and tuned for Netflix-specific use cases. It's a distributed, sharded, and replicated store, and that replication is tunable both within a single region and across multiple regions. It's based on memcached. It's highly resilient to failure, because AWS is a highly dynamic environment; one of my colleagues actually called AWS "Chaos Monkey as a service," which is pretty accurate, all credit to him for that, but they actually kill more of our instances than Chaos Monkey does.
It's network topology aware for faster access. It's linearly scalable, so we can scale out the server side as much as we need to handle the traffic, and it has seamless deployments for our internal customers.

So why do we want to optimize for AWS? Like I said, it's highly dynamic, but what does that mean? Instances can up and disappear at any time, zones can fail, entire regions can become unstable for any one of a variety of reasons, the network itself is lossy, and customer requests can actually bounce between multiple regions within the same session. The simple fact of life is that failures happen and we have to be ready for them, and we test all the time so that we are ready.

To give you a sense of scale for EVCache use at Netflix: we hold hundreds of terabytes of data in cache, we do trillions of operations per day, we hold tens of billions of items, we do millions of operations per second, and we do millions of cross-region replications per second between regions (which we'll look at in more detail later). We have thousands of servers (it's more than ten thousand, but not quite enough to call it tens of thousands), hundreds of instances per cluster, up to hundreds of service clients potentially connecting to a single cluster, and tens of distinct clusters for different use cases that are tuned differently or have different setups. We operate across three regions, and we do this with four engineers.

So let's look at just a single app box and a single server box and how they talk to each other, just what the setup looks like. There's an application running on our cloud somewhere. The likely path is that it's consuming somebody else's service client; that library would then consume our EVCache client library. This is a Java client library, a jar that we vend. The server side has memcached (or what looks like memcached) and Prana, which is a sidecar. It's written in Java and is used by people who are running applications that are not based on Java; it hooks into the rest of the Netflix ecosystem so we can report metrics and register with our service discovery system. The client talks directly to memcached; there's a TCP connection between the two. The sidecar registers with our service discovery, and the client can pull that information in order to find the server.

Here's one client with a whole EVCache cluster. If you look, there are three different availability zones, and there's a dotted line around those boxes, and that's a whole copy of the data. So in this picture we have three whole copies of data, one per availability zone, and the client has a connection to all of these separately. If we have many clients, all the clients have a connection to all the boxes, so this forms what's called a complete bipartite graph: all clients are connected to all servers, no clients are connected to each other, and no servers actually talk to each other.

Reading is relatively simple: we try the closest copy first, and there are a couple of backup paths in case that one doesn't work out. If it's a brand-new instance, if it just died, if it's having network problems, or any one of a number of failures that can happen, we can try a different node. Writing is a little more interesting: the client writes to all three, and that's how we end up with multiple copies; all clients write to all the places they need to write data.
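As a rough illustration of that read and write path, here is a minimal Go sketch. It assumes a hypothetical Copy interface standing in for one availability zone's copy of the data; the real client is the EVCache Java library, so all the names here are illustrative rather than the actual API.

```go
package cachesketch

import (
	"errors"
	"fmt"
)

// Copy is a hypothetical handle to one full copy of the data
// (one availability zone's worth of memcached nodes).
type Copy interface {
	Get(key string) ([]byte, error)
	Set(key string, value []byte) error
}

// Client holds copies ordered by proximity: copies[0] is the local zone.
type Client struct {
	copies []Copy
}

// Get tries the closest copy first and falls back to the other zones if
// that read fails (new instance, dead node, network trouble, and so on).
func (c *Client) Get(key string) ([]byte, error) {
	var lastErr error
	for _, cp := range c.copies {
		v, err := cp.Get(key)
		if err == nil {
			return v, nil
		}
		lastErr = err
	}
	return nil, fmt.Errorf("all copies failed: %w", lastErr)
}

// Set writes to every copy so each availability zone keeps a full copy.
func (c *Client) Set(key string, value []byte) error {
	var errs []error
	for _, cp := range c.copies {
		if err := cp.Set(key, value); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}
```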
So we're going to take a look at a few different use cases, going from simple to complex. This one's fairly straightforward; it's a pretty common use case. If you have a relatively slow service and a relatively fast cache, you can try the cache first before you look in the service layer. The application requests some data from the client library; we try the EVCache client first, and if the data isn't there, we go to the service, which goes to its database (typically Cassandra for us), and then the service itself is responsible for writing back into the cache, either before it returns the data or asynchronously as it does.

The second one is the transient data store: things like your playback session. As it's going on, we want to keep track of how you're doing over time, and we have multiple different applications talking to the cache: you might have one start your session, another one update your session, and then one do a session roll-up at the very end. You'll notice there's no database in this picture; it's just the cache.

This one is a lot more interesting to me. This is really our largest footprint of caches: the cache is actually the primary storage mechanism for the data. We have these really large-scale precompute systems that run overnight every day to compute a brand-new homepage for every profile of every user. It's quite a lot of computation and quite a lot of data, and what they do is just write it into an EVCache cluster, and the online services read from it without worrying about what time anything was written. We have a pretty strong culture of fallback, so you don't have to worry so much about the data not quite being there all the time. The whole picture looks like this: we have a set of offline services writing into the cache and online services reading from it, and the cache acts as a buffer between the two, or a gateway, or many other words you could use for it. This is our personalization data.

This one is even more interesting: we have some things that are very high volume and also need to be very highly available. Think UI strings, where if you don't have them the user experience is pretty bad, because the screen has no words on it. These applications have an in-memory cache first; if that doesn't have the data, they can reach out to EVCache, and if it does have it, they can do a background refresh from EVCache. Think about when we're going into our peak during the day: we need to scale up some servers, and they all want to get this data, so they'll be pulling it from EVCache, and there's another process asynchronously computing the latest package and publishing it to the cache. Beyond that, they could even go further and have a service that owns this data, with a database behind it, but most people have actually found that to be optional, because the uptime of the EVCache clusters is so good that they don't worry about it.

So we've been looking at this from maybe 10,000 feet; let's take a step back, go to 30,000, and take a look at what I call the pipeline of personalization. When you compute a new homepage, it's not just one step; there are many different steps involved, with different algorithms doing different things. You would have one, say compute A, publishing into an EVCache cluster; its data sources are something we're not involved with, maybe offline Hive tables or other data sources. Then maybe a compute B; you might have a C that depends on those two, and a D, and maybe an E, and this forms a DAG of data dependencies for our personalization process.
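Going back to the first of those use cases (try the fast cache, fall back to the slow service and its Cassandra database, and write the result back), here is a minimal Go sketch of that look-aside pattern. The Cache and Service interfaces are hypothetical stand-ins, not Netflix's actual APIs.

```go
package lookaside

// Cache and Service are hypothetical stand-ins for a fast EVCache-style
// cache and the slower backing service that reads from Cassandra.
type Cache interface {
	Get(key string) (value []byte, ok bool)
	Set(key string, value []byte)
}

type Service interface {
	Load(key string) ([]byte, error)
}

// Get tries the fast cache first; on a miss it calls the slow service and
// asynchronously writes the result back so the next read is a cache hit.
func Get(c Cache, s Service, key string) ([]byte, error) {
	if v, ok := c.Get(key); ok {
		return v, nil // fast path: served entirely from cache
	}
	v, err := s.Load(key) // slow path: service, then its database
	if err != nil {
		return nil, err
	}
	go c.Set(key, v) // write back asynchronously, as the service does
	return v, nil
}
```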
All of this happens offline, and the online services don't just pull the last output; they might pick and choose from different parts of it in order to provide the experience they need. Online service 1 might pull from A through D, and online service 2 could pull from a different set.

The polyglot story at Netflix is an interesting one, a point of discussion right now even, but our team, as a somewhat central service, has decided to go ahead and solve it in our own way. Our old world, or current world really, is a Java app and our Java client; that's pretty straightforward and is what you've already seen. But other people actually run our Prana sidecar as well, so we have an HTTP REST API that runs on-box that people can use to access the cache, and we've made that remote as well, so we have a remote HTTP API and even an experimental memcached protocol API that we've been working on with a couple of our partner teams internally. That's still an area of active development for us, so I don't really have any more detail on it right now; if you want to know more, you can talk to me at the booth upstairs afterwards.

There are some extra things on top of everything I've shown you so far that make the product work at Netflix scale. The main one is our cross-region global replication. We also do cache warming for faster deployments, we have secondary indexing for point queries for debugging, and we have a consistency checker that provides metrics on how consistent the caches are. All of this is powered by mutation metadata flowing through a Kafka cluster. We'll take a look at replication and cache warming first.

Replication: we have two regions here (we actually operate out of three, so this goes between every pair, but I'll just show you these two). We have an app in region A and an app in region B, and we have a cache in region A that we want to match the cache in region B. In region A, an app can mutate the cache and asynchronously send metadata about that mutation into Kafka. We have an app we call the replication relay that pulls that metadata out of Kafka (and can optionally pull the data itself out of the cache if it needs it) and writes it across the cross-region link to what we call a replication proxy, which performs that operation, on behalf of the app in region A, against the cache in region B. The app in region B can then see that change, and of course this goes in both directions, or every which way between all the regions.

Cache warming: this is our steady state; you've seen this before, without the Kafka part. In steady state we're writing to one cache. Say we want to double it: it's two nodes right now and we want four. We can bring up a second cluster in parallel that's double the size and start writing to it immediately, and we bring up a cache warmer app that pulls the metadata stream from Kafka, reads the data from the old cluster, and writes it into the new one until the new one looks like the old one. Then we can tear down the cache warmer and tear down the old one; well, we send reads to the new one first, then tear down the old one, and we're in our new steady state. Very recent updates to memcached itself have actually made it possible for us to do this without the Kafka box in the picture, but we're still figuring out how to roll that out to production.

And this is all the code that our clients need to use in order to take advantage of all that complexity I've just shown you. All they need to do is make a client and call set, get, delete, touch, etc.: anything you might think to do on a memcached server. That's all our clients need to do.
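To make that concrete, here is a hedged, Go-flavored sketch of the handful of operations an application sees. The real client is the open-source EVCache Java library, so the interface and helper below are purely illustrative; the point is that replication, fallback, and warming all sit behind these simple calls.

```go
package client

import "time"

// KVCache is a hypothetical rendering of the memcached-style verbs a
// client application uses; the real client is the EVCache Java library.
type KVCache interface {
	Set(key string, value []byte, ttl time.Duration) error
	Get(key string) ([]byte, error)
	Touch(key string, ttl time.Duration) error
	Delete(key string) error
}

// storeHomepage is an illustrative caller: it writes a precomputed
// homepage, extends its TTL, and reads it back. Cross-region replication
// and cache warming happen behind these calls, invisible to the caller.
func storeHomepage(c KVCache, profileID string, page []byte) ([]byte, error) {
	key := "homepage:" + profileID
	if err := c.Set(key, page, 24*time.Hour); err != nil {
		return nil, err
	}
	if err := c.Touch(key, 24*time.Hour); err != nil {
		return nil, err
	}
	return c.Get(key)
}
```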
So let's switch gears a little bit. I've been showing you all of this higher-level design work, how the caches operate in the larger environment, so let's take a deeper dive into the server side. This is a new server for us, and we're at the tail end of rolling it out to the places that require it. It's a project we call Moneta, named after the Roman goddess of memory, an epithet of Juno, the protectress of funds. It's an evolution of our server that puts some data on disk, and it's a cost optimization project for us: it's lowering our cost now, but also lowering the rate of increase of our cost in the future as we gain more members. This is pretty important for us, because we actually track the cost per stream start; it's a metric our finance team tracks, and it's something that, if we can bring it down, will be helpful in the future. It also takes advantage of the global request patterns that we have across multiple regions.

Our old server you've already seen: it's just memcached and Prana, which is not very exciting, and all the data is stored in RAM, which is very expensive relatively speaking compared to disk. Late last year Netflix as a company changed its architecture (so we changed our architecture) to be N+1 across three regions, meaning that any one region could go down at any time and we would be okay by shifting traffic to the other two; any member can be served equally well from any region. In addition, at the beginning of this calendar year we launched in 130 new countries at the same time, which just added more data onto the pile.

So how do we optimize? Well, we can target those personalization precompute use cases that you saw before. Our global data means many copies: the picture you saw with three copies is actually nine globally, because it's three in each region times three regions, which is nine copies. But our access patterns are very heavily region-oriented. If you're watching Netflix in California, you are hitting the us-west-2 region; the likelihood that we see your traffic go to eu-west-1 or us-east-1, the other two regions we operate out of, is extremely low. So in any one region our hot data is very hot and our cold data is very cold. We'll keep the hot data in RAM and the cold data on disk, and we can size the RAM for the working set and size the SSD for the whole data set.

What does our new server look like? It adds a couple of new processes: one called Rend and another called Mnemonic, and we're still running memcached. The server is now a dynamic L1/L2 cache, which allows it to adapt to the needs of the application at any time without very strictly segregating data, and all three of these processes speak the same protocol, so we can use the same debugging tools or the same load generation tools to test any one of them. Rend here is the proxy, taking external connections and holding open connections to the internal processes; memcached and Mnemonic are L1 and L2. We'll take a look at Rend first, then at Mnemonic, then see the whole thing together, and then we'll look at some performance numbers, which is probably the most exciting part for me.

So, any Go developers in the audience? Yes? Okay, this is a Go project; you can go get it, it's actually a public GitHub repo under the Netflix org, so have fun with that. It's a high-performance, wire-compatible memcached proxy and server, and it's a transparent change to our clients. It's written in Go, mostly for its concurrency primitives, but also because it's relatively fast for developers to get up and running and to write new code; and it's very fast, and it's running in production. Its purpose is to manage the L1/L2 relationship on a single box, and before the project even started we had caches with tens of thousands of concurrent connections reading and writing, so that was a requirement for this process from the very beginning.
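To sketch what managing that L1/L2 relationship can look like, here is a minimal Go illustration of one plausible read path: check the in-memory L1 (memcached), fall back to the SSD-backed L2, and backfill L1 on an L2 hit. The Layer interface is a hypothetical stand-in, not Rend's actual internals.

```go
package moneta

// Layer is a hypothetical handle to one storage process: memcached for
// the in-RAM L1, the SSD-backed store for L2. Get returns an error on a
// miss or a failure.
type Layer interface {
	Get(key string) ([]byte, error)
	Set(key string, value []byte) error
}

// Server fronts both layers behind a single memcached-speaking port.
type Server struct {
	l1, l2 Layer
}

// Get checks RAM first, then disk, and promotes disk hits back into RAM
// so the hot working set stays in L1.
func (s *Server) Get(key string) ([]byte, error) {
	if v, err := s.l1.Get(key); err == nil {
		return v, nil
	}
	v, err := s.l2.Get(key)
	if err != nil {
		return nil, err // a true miss, or an L2 failure
	}
	_ = s.l1.Set(key, v) // best-effort backfill of the hot copy
	return v, nil
}
```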
Taking a better look at the insides of it: it's a modular software project, and what you see on GitHub is actually what we're running in production; it's not fake. It manages the connections coming in, the request orchestration between L1 and L2, and the communication with the backends. It even includes our own homegrown metrics library, which might make some of you cringe, but nothing really fit our needs in the Go world, because we have to integrate with our Atlas backend at Netflix. It includes multiple orchestrators, the reasons for which will become apparent when we look at the whole thing together, and it includes a feature we call parallel locking: any one server can have many parallel requests happening at the exact same time, but no one key will have two concurrent modifications, which allows us to keep data integrity and keep L1 and L2 consistent.

So, Mnemonic. This is our L2 piece. I say "open source soon" with air quotes because I don't really have a timeline for you, but I want to get it open sourced; we'll see. This is our SSD-backed storage solution. It reuses some of the Rend code to do the heavy lifting of memcached protocol parsing and connection management, but its main purpose is to map those memcached operations into RocksDB operations. At the bottom of the diagram you can see RocksDB; a request comes in at the top, runs through the code you've already seen in Rend, then through a shim into the C++ world, into our Mnemonic core library, which uses the RocksDB library underneath to store the data on disk. We chose RocksDB for a few different reasons, mostly because when you write to RocksDB it writes directly into an in-memory buffer, which is then asynchronously flushed to disk in static files, so it's fast to write and fast to read.

How do we use it specifically? We use FIFO compaction, which isn't really compacting anything at all: it's just a FIFO queue of files on disk, written linearly in time, so more recent files come in at the front of the queue and older files just get deleted off the end. We keep the Bloom filters and indices in memory for quick access (or quick misses, as the case may be), which does trade away some L1 space, but it makes our L2 much faster, which is great for us. And we have many separate RocksDB instances per box so that we can further decrease the latency. This is great for our precompute use cases, but we're still figuring out how to do it for more heterogeneous data: if we have very fast-changing data next to very slow-changing data, the fast-changing data can write enough to disk to push the slow data off the end of the FIFO queue, and then we've lost something, which is not great. So we're still working on this.
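One common way to get the parallel-locking property described above, where many requests are in flight at once but no two concurrent modifications ever touch the same key, is a set of lock stripes indexed by a hash of the key. This Go sketch shows that general idea; it is not Rend's actual implementation.

```go
package locks

import (
	"hash/fnv"
	"sync"
)

// KeyLocks serializes operations per key while letting operations on
// different keys (usually) proceed in parallel.
type KeyLocks struct {
	stripes []sync.Mutex
}

// New creates a lock set with n stripes; more stripes means fewer
// unrelated keys accidentally sharing a lock.
func New(n int) *KeyLocks {
	return &KeyLocks{stripes: make([]sync.Mutex, n)}
}

func (k *KeyLocks) stripe(key string) *sync.Mutex {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &k.stripes[h.Sum32()%uint32(len(k.stripes))]
}

// WithLock runs fn while holding the lock for key, so L1 and L2 cannot
// be mutated concurrently for the same key.
func (k *KeyLocks) WithLock(key string, fn func()) {
	m := k.stripe(key)
	m.Lock()
	defer m.Unlock()
	fn()
}
```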
The whole thing together is a little more complex. We run multiple open ports in production for different use cases per server. These servers are already serving all of that personalization data; remember that pipeline of personalization? All those green EVCache nodes were this server. We have two ports: one for standard access, which is highly dynamic, actively managed data, and then an async batch port (well, it's really a batch port, it's not async), which is for people writing data in that won't necessarily be used any time soon, and it keeps the working set hot so that we don't end up blowing away our L1 while we're computing our personalization data.

Okay, the slide I've been waiting for. This is performance in production for our most heavily loaded cache that is using Moneta right now. I measured both at peak (the three hours of peak for us) and in the three hours of the trough, across all the different latencies and percentiles. Given all the complexity we've added, you might think it's really slow, but looking at the gets, our average latency is around 230 microseconds on the server side, and you can see the percentiles here. The higher ones get a little high because Go is a garbage-collected language, but that's okay with us; the 99th is really great for us, and in the trough it's even faster. The sets we're also very happy with: 367 microseconds on average, and like I said, you can see the rest of the percentiles there, but these are all great for us. If we look at client-side latencies, the network in AWS adds maybe 250 to 500 microseconds, and most people expect an answer out of us in under a millisecond almost all the time.

Okay, I told you this is a cost optimization project. Any guesses on the highest percentage savings we saw in a single cluster? One thousand percent? Kind of impossible. Ten percent? A little low. Okay, it's about 70% cost savings on a single cluster, just because we noticed that we have this small hot set and a really large cold data set. This project has been a really great one for us, and we're just nearing the end of rolling it out, so yeah, a huge success for us.

All these things that you've seen (well, not all of them; sorry, Mnemonic has the air quotes around it): EVCache, the Java client library, is open source, and that REST API I mentioned earlier is in that same repository, and the Rend code itself is also open source on the Netflix GitHub repo, so you can go check those out. And that's all I have. Okay, I'll take any questions if anybody wants. Yes?

Sorry, the question was the decision tree that got us to RocksDB. I wasn't the engineer who made the decision specifically to go with RocksDB, but generally it was the fastest and it worked for our use case. Yeah, I'd have to think about it a little bit more; I'll be at the booth after this for a couple of hours, so please come find me. Yes?

Have we considered using Couchbase? No, not really. We've talked to them before, but we're happy with our homegrown solution; it's working really well for us, and if we need to customize it in any way we can just turn around, modify our code, and get a new candidate jar out the same day. I don't believe that would be as good a process with Couchbase. I mean, I haven't gone into deep talks with them, but we're happy with where we are, and I don't think we would have been able to do something like the Moneta project directly with them. Yes?

Right, so: have we found any limitations with the commands that are available in the memcached protocol? Not really. A lot of our data is organized by user or profile or something like that, and most of the time people just want to get the whole piece of data; there aren't very many use cases where they just want, like, one array element, so it's not really necessary. Any other questions? Yes: the time scale for this project? You mean the Moneta project, right?
The time scale: I started dabbling in Go about a year ago and started hacking on Rend maybe November-ish. I convinced my teammates earlier this year that it was a good idea to do this project in this manner, and we are basically done rolling it out into production for the largest use cases right now, so about a year total, yeah.

Follow-up: how did I convince my teammates to use Go, given that we're not on the paved path anymore? It's not too hard when you make a proof of concept that's already running fast right off the bat. That's what I did: it's fast once it's compiled. I had a relatively dumb, text-protocol-based version that was already running in the hundreds of thousands of RPS fairly easily, and we don't even run that fast on any one server. From there we could work out how to make it function as the server. We already had memcached, which was essentially a binary blob to us, so we were already kind of off the paved path, and it didn't really scare us so much. Plus my teammates have been converted, I guess, to the religion of Go at this point; they seem to like the language. Are there any other questions? Anybody up top? Anything? Yeah, okay.

Did management ask us to do this to save money? Not directly. Engineers at Netflix tend to have a lot of context about how the business is doing, where it's going, and what the needs are, so as a team we looked at what our needs were, saw this opportunity for an optimization, and decided to just go for it. And you know, not only did we have to convince ourselves this was a good idea, we had to go to all these people whose data we're storing and tell them, "Hey, we're going to change this out from under you, I hope you're okay with that," and they generally were. So it was driven mostly by the engineers, but we had backup as we were pitching the idea. It started as a side project and became my main project just because we decided it was a higher priority. Anything else? Yes? I'm sorry, I can't hear you, can you speak up please? Okay.

I probably didn't explain that extremely well; it's the first time I've presented that slide publicly. Our personalization process ends up creating an ad hoc data-dependency DAG, but it's not a formalized DAG that exists anywhere in any form other than in engineers' minds as "this is my dependency." When they're writing the data out, their process writes into an EVCache cluster, and those EVCache clusters are registered in the service discovery system, so anybody can find them at any time. I'm not sure if I answered your question; yeah, so anybody who wants to talk to EVCache would use the Eureka system in order to access them. Does that answer your question? Well, in that diagram there were these single green, database-looking things; each of those was meant to represent a whole cluster, and that's actually a globally replicated, sharded cluster. I saw another; yes: any plans to extend to other cloud providers? It's open source, and we accept patches and pull requests, but not really; our team is focused on solving Netflix problems. We have it open source in the hopes that it helps other people solve their problems; it's not a primary goal for us to support it as an open source thing, but it should be useful to some people. Anything else up top? Nothing? Okay, I think that's it.
Info
Channel: Strange Loop Conference
Views: 63,013
Keywords: netflix, caching, microservices
Id: Rzdxgx3RC0Q
Length: 35min 22sec (2122 seconds)
Published: Sat Sep 17 2016