Facebook and memcached - Tech Talk

Captions
Hey guys, I'm Dave Federman, I'm an engineer here at Facebook, and welcome to the second of the Facebook Tech Talks. I actually recognize some of you from the last talk. The last talk was on Facebook Chat and how we built that here, and that demonstrated a lot of what we do at the application layer and the design layer. This time we're talking about a really crucial piece of the server-side systems Facebook runs on, and that is memcached, so Mark is going to be talking about memcached here in a sec. Just so you know, if you can read the squiggles up there, we've got a Facebook Tech Talk page, so if you want to see the next talks that are coming up, watch videos from old talks, post, and do all that great social stuff, you can search for "Engineering Tech Talks" on Facebook, find that page, and get all the updates. So without further ado, Mark Zuckerberg on memcached.

So today I'm going to talk about memcached. The last Tech Talk that we had, on Chat, seemed like a lot of fun, so I figured, all right, I'm going to do one of these. I haven't coded in quite a while at Facebook, but one of the last things I did was implement the first version of memcached here. It's become a really integral part of the software stack that we use, and one of our biggest contributions back to the community, so I figured it would be a pretty good thing to talk about. On top of that, Facebook is a technology company, and what that means to us is that all throughout the company, at different levels, obviously engineering and operations and those teams, but also management and the folks running the company, we think it's really important that people are technical as well. The fact that a lot of the decisions we've made along the development of the company have resonated with and been in line with the technical strategy and what we're trying to get done in the world makes the whole company a lot more coherent. So today I'm going to talk about memcached, which is just one part of what we're doing. It has some really interesting problems, and if you're interested in systems, or interested in working at Facebook, it's a good example of the type of thing we do and the type of scale we're dealing with.

Just to give a quick intro to memcached and why it's important for us: on Facebook, what we're basically trying to do is help people share more information with the people around them, the people they're connected to. So the data set and the data access patterns we have are different from a lot of other types of applications. If you take something like email, all of the data for a user can be stored in one place. For Facebook, a lot of the different applications we have are pulling data from all of your different friends. If I search for "Dan", I'm going to get completely different results than if Dave searches for "Dan". If I look at News Feed, it's going to pull from a lot of my different friends and it's going to be different from anyone else's News Feed; it's generated on the fly. So the data access pattern is different, and for this to work we've needed a fast cache, and just a good way to get access to the data very quickly, given all the sharing that's going on.
So, the characteristics of memcached that have made it an important part of what we do. First, it's a distributed in-memory hash table service: it runs across a whole cluster of machines, you can have as many machines as you want, and it's really simple to set keys, get keys, and get a lot of keys at the same time, which is what you need when you're pulling data for all of the connections someone has from different machines. Basically what it does is store data that is commonly used, or hot, from the databases, so that you can access it more quickly. Here's an example of some of the syntax: with get you can fetch as many keys as you want, from a lot of different friends or people you're connected to, which is really important for pulling data from different people when the data lives on different servers; set is really simple, and so is delete.

Here's how this fits into the stack. We have tens of thousands of web servers, and, the numbers here aren't exact, but the web servers are doing around 50 million requests a second to memcache throughout the day. What we found pretty early on, and it was one of the motivations for doing this, was that even really fast database queries would still take a few milliseconds, two, three, four milliseconds, and to make this work at a good scale you want memcache queries, which take on average a little less than half a millisecond. We have about a 95% hit rate in the cache, so most queries just hit memcache and never have to touch the databases. It's one of the things that makes the infrastructure work, and it's become a really important part of the stack we're using, and for a lot of other companies as well.
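To make that syntax concrete, here is a rough sketch of the memcached ASCII protocol the talk is describing. The key names and values are made up for illustration, and the parenthetical notes are annotations, not part of the protocol.

```
get user:4:name user:17:name        (multi-get: any number of keys in one request)
VALUE user:4:name 0 4
Mark
VALUE user:17:name 0 5
Scott
END

set user:4:name 0 0 4               (flags = 0, no expiration, 4 bytes of data)
Mark
STORED

delete user:4:name
DELETED
```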
So, how are we using it? We integrated this very early on; our first implementation was in early 2005, and from what we can tell from the different people in the community who are using it, ours is by far the largest deployment. The memory we have behind it is around 20-plus terabytes, almost 30 terabytes at this point, and we've encountered scaling challenges that a lot of people haven't seen yet. As we solve these problems, and we'll go through a number of them today, we try to make the fixes available to the community. Some of them we've been able to merge back into the main open source branch of memcached, and some we're in the process of releasing, or have released, in our own branch, but all the changes we're making, we're releasing back to the community for other people to use if they're interested.

Okay, so now we're going to go through it. I haven't done most of the work we're going to talk about today; I'm going to highlight the work of a lot of really talented engineers at Facebook who have done different optimizations. The first part is what I did, the very first implementation of memcached in 2005. Just to give you a sense of where Facebook was at the time, there were a few people working on it: myself, Dustin, just a couple of early people. We didn't have an office yet; we rented an apartment in Menlo Park and just sat around a kitchen table and coded. One weekend we were talking about how to optimize the databases, make data access faster, and keep generating every page in under 200 milliseconds, which was a goal we'd set. We looked at some of the things that were available, including the early version of memcached that had been built by the folks who started LiveJournal. It was really basic back then, so we just had to implement it ourselves, and me and a couple of friends who were working on Facebook then basically implemented the first version of this in a weekend. That's very much the Facebook style of doing things: go really quickly, move fast, iterate a lot, and solve the problems as they come up, and that was the style we had with this.

What I'm going to talk about isn't that first implementation, which isn't that exciting; it's how it has evolved over time and the different issues that have come up. So, the first set of issues: Scott, and, well, we'll just go with Scott for now, I think there was supposed to be someone else who's apparently not getting credit, and Andrew McCollum, one of the early folks from Facebook. Are these the right slides? All right, cool.

When we first got memcached it was in a really rudimentary state, with all kinds of different issues. For one thing, it only supported 32 bits, so, among a number of other reasons, it just kept crashing all the time. One of the biggest causes was that the servers we were running on had more than four gigs of RAM, and being 32-bit it couldn't address that, so as it started using more of the RAM it just overloaded and crashed. This was the first issue we had; apparently no one who'd used it before had ever run it on a server with more than four gigs of RAM. So this was the first in a long sequence of things that we fixed. The first fix, very much in keeping with the Facebook hacker style, was just to write a script that would restart memcached every 15 minutes, because it would crash, but it was still useful for those 15 minutes. Pretty quickly, though, we made it actually support 64 bits, and that worked.

Then we start getting into some more interesting things. It was pretty buggy. There were all these issues early on where, if a server had crashed, clients would just keep sending it all these requests even as it was coming back up, and the frequency of the requests would overload it and prevent it from ever coming up again. Things like this just hadn't been thought of in the first implementation, and we worked through them one by one. On top of that, the first client was written in PHP, and PHP serialization is pretty slow and not very compact: something like a timestamp, which could be represented in a small amount of memory, ended up taking about twice as much memory once serialized, and the serialization itself tends to be slow. So when we implemented the client ourselves, in C, we were able to get it to go a lot faster.
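As a rough illustration of that serialization point (this is a sketch, not Facebook's client code): PHP's serialize() turns an integer timestamp into a text string like "i:1229000000;", about 13 bytes, while a binary client can ship the same value as a fixed-width integer.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int64_t ts = 1229000000;   /* a Unix timestamp from late 2008 */

    /* Text encoding, similar to PHP serialize() output for an integer: "i:1229000000;" */
    char text[32];
    int text_len = snprintf(text, sizeof(text), "i:%lld;", (long long)ts);

    /* Compact binary encoding: 4 bytes in network byte order. */
    uint32_t packed = htonl((uint32_t)ts);

    printf("text form:   %d bytes (%s)\n", text_len, text);
    printf("binary form: %zu bytes\n", sizeof(packed));
    return 0;
}
```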
And this will be a theme of a lot of what I'm talking about today: at the scale we were operating at back then, some of these changes just made things faster and more stable, but now that we have, I think, almost a thousand machines running memcached, these small tweaks save the company millions of dollars. Things that look like really obscure edge cases, being able to detect them, troubleshoot them, and work through them, have a really meaningful impact, and that's only possible because of the scale at which Facebook is operating today.

There were a ton of different things we did with memcached, but I want to focus for a little while on two of the more interesting architectural things we've done over the past three years of using it. The first was a series of issues that basically made us want to stop using TCP to transmit data between memcached and the web servers, and instead implement it at the application layer over UDP. Mark is the one who did this, and he's here, so he can raise his hand; these folks will be around afterwards to answer specific questions about this stuff, so we can have a good discussion about it.

There were a bunch of issues here. One was just that we had all these different connections. If you picture it, you have a lot of web servers, each with a number of threads, and each of those threads holds a persistent connection to each memcached server, so it ends up being almost 10,000 connections or so on each web server, and on the memcached servers it was even more, hundreds of thousands of connections. Keeping those connections open required a bunch of memory; from some of the data we had it was almost 5 gigabytes, and for a cache whose job is to use memory efficiently, that's kind of a bummer. You want those 5 gigabytes for actually storing things and serving them quickly, not for keeping connections open. So we figured that instead of having all these connections, we could just use UDP to transmit the data. UDP means never having to say ACK. Awesome joke, no? Okay.

To put this in perspective, this type of problem is pretty unique to Facebook and the data access patterns we have here, and probably to social networks in general: on a given page you're getting data from a lot of different friends or people you're connected to, in order to pull together recent events your friends are going to, recent status updates, or things in News Feed. Over time, as people's friend lists grew and there started being users with thousands of friends or a lot of connections, we started to see some really big multi-gets. The issue with TCP here, which was another motivating factor for switching to UDP, was that we'd batch all of these different get requests out to the different memcached servers, send them out to hundreds of different servers, and then the responses would all come back at once and bombard the server that had sent them out. If that server dropped packets, TCP would then need about 250 milliseconds to retransmit and get the data again. By using UDP we were able to get around that limit, which made it possible to really optimize the rate at which the servers could send all the data back in response to the request.
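For context on what memcache over UDP looks like on the wire: memcached's UDP support prepends an 8-byte frame header (request ID, sequence number, total datagram count, and a reserved field) to each datagram so the client can match replies to requests and reassemble them. The sketch below sends a single get over UDP; the server address and key are placeholders, and error handling is omitted.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Send one ASCII "get" to a memcached UDP port and read the first reply datagram. */
int udp_get(const char *server_ip, const char *key, char *reply, size_t reply_len) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(11211);
    inet_pton(AF_INET, server_ip, &addr.sin_addr);

    /* 8-byte UDP frame header: request id, sequence number, datagram count, reserved. */
    uint16_t header[4] = { htons(0x1234), htons(0), htons(1), htons(0) };

    char packet[512];
    memcpy(packet, header, sizeof(header));
    int body_len = snprintf(packet + sizeof(header), sizeof(packet) - sizeof(header),
                            "get %s\r\n", key);

    sendto(fd, packet, sizeof(header) + body_len, 0,
           (struct sockaddr *)&addr, sizeof(addr));

    /* The reply datagram carries the same style of header followed by the ASCII response. */
    ssize_t n = recvfrom(fd, reply, reply_len, 0, NULL, NULL);
    close(fd);
    return (int)n;
}
```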
Using UDP was a pretty big project that took a while for Mark to get running, but it ended up letting us optimize all these different things, memory use among them. There were other projects going on at the same time, because the connection limit was a pretty big scaling issue that we ran into early on. The first way we knew we had this connection-count problem was that Linux has a limit where you physically can't have any more connections. You can go into the kernel and tweak that, although there's a reason the limit is there and you probably shouldn't have to get that far, but that's the kind of thing we had to do to get around it. This was a pretty big problem, and we knew the UDP work would take a while, so at the same time as we were developing that, we also developed another piece: a memcached client that we call memcache proxy. I think this is a pretty interesting part of the memcached ecosystem and of our deployment, because it became interesting not just for connections but also for how we scaled out to multiple data centers, and for replication.

So, that's the stack I showed before: you have the web tier making requests to memcache, and the requests that miss can go to the database. The way this scales to multiple data centers is a little tricky, because the memcache tiers basically need to be coherent with each other, not out of sync or giving bad data, but at the same time you need the memcached servers to be close to the web servers. If you go back to the original reason we implemented memcache in the first place, turning a several-millisecond database call into a half-millisecond memcache call, then having even two milliseconds of latency between a web server and another data center that we have in the Bay Area was just not acceptable; the site wouldn't work with that. So the basic architecture we have set up is that the memcache tiers sit with the web servers, and the small percentage of misses can go to the database, because those are only maybe 5, 10, 15 per page, which isn't as big a deal. Keeping the memcache tiers coherent with each other is the big challenge, and memcache proxy is what we used as a solution to it.

So, what memcache proxy is: instead of having all the different connections, and instead of the UDP solution, which was one answer to this problem of having a lot of connections, an alternate solution is to have all the different threads on a web server go through one single memcache client, which maintains persistent connections to all the different memcached servers. And you can do that across multiple data centers too.
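To see why a single per-box client helps with the connection counts described earlier, here is some back-of-the-envelope arithmetic; the per-box thread and server counts are made-up round numbers, since the talk only gives the rough totals (about 10,000 connections per web server, hundreds of thousands per memcached server, and roughly 5 GB of connection overhead).

```
without a proxy:
  100 threads per web box x 100 memcached servers   ~= 10,000 outbound connections per web box
  thousands of web boxes x 100 threads each         ~= hundreds of thousands of inbound
                                                       connections per memcached server

with one proxy per web box:
  100 threads -> 1 local proxy; 1 proxy x 100 servers ~= 100 remote connections per web box,
  and thousands (not hundreds of thousands) of inbound connections per memcached server
```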
That's basically what we did here. This diagram doesn't show it very well, but what you can imagine is that each web box has a memcache proxy client, which has connections to all of the different memcache tiers, all the different databases, and all the different data centers, and it's then able to sync up and replicate the deletes to make sure the caches are coherent in the different locations. Cool, so that works for San Francisco.

Now, we also have data centers in other places, and the challenge with those isn't latency, sorry, it's a race condition. The reason we couldn't use the same solution there is that you get into a situation where you also need to sync the databases in addition to syncing the memcache tiers. The databases basically replicate by sending the queries over from the West Coast to the East Coast and executing them there, and the memcache tiers would sync up separately, so you'd get a race condition over which delete came first, and that caused issues. The way we actually solved this, which I think is a really creative solution, was, instead of trying to sync it up through a memcache proxy running on the web servers, to have memcache proxy run on the MySQL servers on the East Coast. We've effectively modified SQL at this point to support this, with a "dirty" command where dirty really means delete, which is just an old thing we need to get around to fixing. So as part of the replicated SQL query there's a memcache command; the proxy takes it, executes it on the memcache tier there, and everything stays in sync. By using memcache proxy we were able not only to keep the connection count low, which was one of the original goals of the project, but also, and this is really what made it possible, to keep all the memcache tiers in sync all the way across the country, and we'll keep using this as we continue to roll out more data centers around the world. Kind of cool.
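A hedged sketch of that trick, with an invented table, key name, and keyword, since the talk only says that a memcache command rides along with the replicated SQL: the statement written on the West Coast master carries an extra memcache directive, and when the statement is replayed on the East Coast replica, the memcache proxy running next to MySQL picks the directive up and issues the delete against the local memcache tier, so the cache invalidation arrives in the same order as the data change.

```sql
-- Replicated from the West Coast master and replayed on the East Coast replica:
UPDATE profile SET status = 'at work' WHERE user_id = 4
    MEMCACHE_DIRTY 'user:4:status';   -- hypothetical syntax; "dirty" really means delete
```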
So those are two things that I think are pretty interesting examples of architectural decisions we made with memcached along the way, and there's more if you want to ask the folks who made them, who will come up at the end. The other thing I want to talk about is that, aside from these large changes and large projects that took in some cases over a year to build and roll out, there are also all these optimizations we were able to make that can seem pretty small. Sometimes they're hard to find, sometimes they're really easy to fix, sometimes they're harder to fix, but because of the scale we're operating at they make a really material difference; they're able to save us tens of machines, hundreds of machines, millions of dollars. So I want to run through a few of them.

A lot of the optimizations we've talked about so far have been client optimizations: fixing the client so it works, implementing it in C, then doing memcache proxy to cut down the number of connections and make replication efficient. At this point we basically moved to the server and tried to optimize a number of different things there. Steve Grimm is the one who worked these out, and he's there in the back. The optimizations are both in memory and in CPU.

If you think about how the memory system works for allocation in memcached, the default that was built in was slab allocation, as opposed to conventional memory allocation, because conventional allocation is more subject to fragmentation, and again, if you're building a memory cache you want the memory used as efficiently as possible, so making it fragment as little as possible is really valuable. So you had slab allocation, but the default sizes of the slab classes grew by powers of two: you'd get these chunks at 256K, 512K, up to 1-megabyte slabs. What we found was that for the distribution of our data this was a really bad sizing, because with power-of-two chunks, something that's 1,025 bytes lands in the next chunk size up and wastes 1,023 bytes. It ended up being about sixty percent effective, wasting about 40 percent of the space in the cache, which is pretty lame. Just through experimenting and trying to find a better growth factor, Steve figured out that by using 1.3 as the factor, and computer scientists don't tend to think in powers of 1.3 as much as powers of 2, we were able to get about 90% of the memory used, which is obviously a big change. You can imagine, when you have thousands of servers at the size we're operating at, that's millions of dollars. So just thinking outside the box, not thinking in powers of two, made a pretty big difference.

Same thing on CPU. This is more of a systems programming thing. In the first version we just used the normal write() system call, so we would write the different buffers one at a time, and we found we were making all these separate system calls. Instead, and Steve did this as well, we reworked the code so that we gather up all the buffers and send them over with one writev() system call, the method called scatter/gather I/O, which reduces the number of system calls, and this saved 50% of the system CPU time. So just using one system call that's a little more esoteric than what most people use, but clearly meant for this case, again made it so that we were able to save millions of dollars based on that change, and it took us a long time to figure out that we should do it.
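Here is a minimal sketch of the write() versus writev() difference described above (illustrative code, not memcached's actual server code): instead of one system call per buffer, scatter/gather I/O hands the kernel an array of buffers and sends them all in a single call.

```c
#include <string.h>
#include <sys/uio.h>

/* Send a response made of a header, a body, and a trailer to a client socket. */
void send_response(int fd, const char *hdr, const char *body, const char *end) {
    /* Naive version: three separate system calls.
     *   write(fd, hdr,  strlen(hdr));
     *   write(fd, body, strlen(body));
     *   write(fd, end,  strlen(end));
     */

    /* Scatter/gather version: one writev() call over the same three buffers. */
    struct iovec iov[3] = {
        { (void *)hdr,  strlen(hdr)  },
        { (void *)body, strlen(body) },
        { (void *)end,  strlen(end)  },
    };
    writev(fd, iov, 3);
}
```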
And then the same thing in user space, in user CPU. The protocol is string-based, so we transmit strings, and we were actually getting to a point where the majority of the CPU was being used just to parse the commands, and the majority of that was in strlen(): as new information was coming in, we were calling it a lot. Just by taking the length of the string and storing it in a variable, instead of continuing to call strlen(), we were able to cut this down by a factor of three and save millions of dollars. It's an interesting thing, because it seems like such a simple optimization, but at the scale we're operating at, with the number of servers we have, it all compounds. We recently got up to 100 million users, we're ending the year at almost 150 million, and that's just going to continue growing really quickly, hopefully, so as we keep adding more and more servers these things compound and make a really big difference. It's a really big focus for the company to get this stuff right and make sure it's in good shape.

So those were optimizations in user land, in application land. What we found was that we're actually operating at such a large scale that it even makes sense for us to get into the kernel and modify some of the network drivers in order to optimize things even more. I'm just going to go through a few examples of this and then we can open it up for questions. There's an interesting change that's happened in server architecture since we started working on Facebook: the first set of servers we had had one core, and now I don't even think you can buy a server that has only one core. Sorry, I'm out of order; let me give props to the people who did this work: Mohan, who I don't see here, and Tony and Paul. Tony's not here either; Paul's here. All right, okay. So the issue I was talking about is that in the majority of network driver implementations, the network card will raise interrupts and send the majority of those interrupts to only one CPU, so you can imagine that as we're using
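Going back to the strlen() point above, here is a minimal sketch of the kind of change involved (illustrative code, not memcached's actual parser): compute the length of the incoming command once and store it in a variable, rather than rescanning the buffer every time.

```c
#include <stddef.h>
#include <string.h>

/* Count space-separated tokens in a command line (a stand-in for command parsing). */

/* Before: strlen() rescans the whole buffer on every iteration of the loop. */
size_t count_tokens_slow(const char *cmd) {
    size_t tokens = 1;
    for (size_t i = 0; i < strlen(cmd); i++)   /* strlen() called every time around */
        if (cmd[i] == ' ')
            tokens++;
    return tokens;
}

/* After: take the length once, keep it in a variable, and reuse it. */
size_t count_tokens_fast(const char *cmd) {
    size_t len = strlen(cmd);                  /* computed exactly once */
    size_t tokens = 1;
    for (size_t i = 0; i < len; i++)
        if (cmd[i] == ' ')
            tokens++;
    return tokens;
}
```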
Info
Channel: Facebook Developers
Views: 194,021
Rating: 4.8657866 out of 5
Id: UH7wkvcf0ys
Length: 27min 56sec (1676 seconds)
Published: Wed Nov 27 2013