Scaling GlusterFS @ Facebook

Captions
Alright, first, a quick show of hands: how many of you had heard of Facebook before this conference? How many had heard of Gluster before this conference? Okay, good. And how many currently use it? Not bad. How about Ceph? Okay. Hadoop? Alright, cool. I just want to get a sense of the audience so I know where to spend time and where not to.

My name is Richard, I'm a production engineer at Facebook. I've been at Facebook almost five years and have seen Gluster transition from what my original manager called "the science experiment" to production use. Here's the quick agenda for the talk: for those of you who have no experience with Gluster, there'll be a three-slide crash course on what it is and how it works, which will hopefully give you context for the rest of the presentation. Then I'll go into deployment styles: how we have used Gluster at Facebook, how we currently use it, and how we deal with things like NFS. Then I'll get into the two broad areas where we've had scaling challenges: one is operations, and the second is the actual core code of Gluster, the internals, and some of the changes we've had to make. If we've got time, hopefully we'll get some questions in.

First I just want to acknowledge the great team I work with; these are the five other folks I work with every day. Open source is fundamentally about being on a team, and our team is no different, so I'm really representing a body of work done by these people today.

Now the hook: some numbers for Gluster at Facebook. Data sets range from gigabytes to petabytes. Individual clusters can have on the order of hundreds of millions of files, and billions of files if you go across clusters. In terms of QPS, or FOPs as Gluster calls them, it's tens of billions per day. Namespace sizes, meaning a single volume, can be anywhere from terabytes into the petabyte range. And bricks, the unit of storage in Gluster, we have thousands of them. This is what we have to deal with.

Here's the quick lineage of Gluster at Facebook. We started out on 3.3 (in a prior life I had started with 3.2), in 2013 we moved to 3.4, and last year we moved to 3.6. You'll notice we trail mainline; we do this for stability reasons, and I'll touch on our development process a little later, but we'll typically be a version or two behind. In terms of clients, a single cluster will have tens of thousands of clients, and I'll get into how we actually accomplish that a little later.

A bit of background on Gluster itself, since this is about open source after all. It was created in 2007 by Gluster Inc.; the founder and original author is a guy named AB, and he's been really good to us and to the whole community. Gluster was acquired by Red Hat in 2011, and they've been a really great steward for it. In fact, I don't think I could have asked for a better acquirer of that company to carry it along, and they put a lot of resources into Gluster now, which is really good to see.

Okay, as promised, here's the high-level view of Gluster. There are really three basic ways to get data in and out of Gluster: you have FUSE mounts (FUSE clients), you have GFAPI, and you have NFS.
GFAPI is pretty much what it sounds like: a direct API into Gluster. On this diagram you can also see the two important translators. Translators in Gluster are the modules in the software stack that the functionality is divided up into; if you're familiar with GNU Hurd, it's basically an idea borrowed from that project. Gluster uses two especially important translators. The distribute translator is effectively doing sharding; if you're used to things like memcache, it's the exact same idea. The replication translator provides the high availability and replication of the data. Here we're just showing a simple cluster of three shards, with three replicas inside each.

The replication translator first. Effectively how it works is: you've got a client doing I/O, and it's updating what inside Gluster we call a journal. What that really is is little entries in extended attributes on the individual files; as you write to a file, it's effectively counting how many write or metadata operations you've done to that file. It uses something called the wise/fool algorithm to figure out, when a node fails, where to go to reconstruct the data once that node comes back. It basically builds a little matrix - if it's a three-way replicated set, a three-by-three matrix - and based on that algorithm it figures out which node to actually heal from; in this case it's going to pick brick one.

The distribute translator is basically ring hashing, very similar to what memcache does. In Gluster we implement this at the directory level, so if you've ever wondered why Gluster has directories on all the nodes, this is the fundamental reason: when we create a directory, it creates a hash ring based on how many nodes are in your cluster, and it encodes that, again, in extended attributes across the cluster. These are what Gluster calls sub-volumes, or you can think of them as shards. When we actually do file I/O, we take the file name, hash it, and figure out which sub-volume it should go to. A lot of the code - I think it's something like five hundred thousand lines of code in Gluster - exists because, while this all sounds really simple, the hard work is in what you do when you rename things, or when you're growing or shrinking a cluster; all of that gets really complicated.

So at Facebook, when I started about five years ago, there was a lot of proprietary storage used for POSIX workloads. Truth be told, we still have some of it, because there are still some really hard use cases out there that we haven't managed to handle with Gluster, though we're still trying. One of the fundamental things that drove Gluster adoption was really the speed we could move. Facebook culture is all about moving fast, and when something goes wrong and you have to explain to a customer that it's going to take potentially hours to diagnose because you have to go talk to a vendor, or potentially days to actually come up with a fix, that's a non-starter, especially when you're working with other teams across the stack that are literally fixing things and getting hot patches out within hours. So in terms of the POSIX storage at Facebook, that's the direction we wanted to move.
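As a rough illustration of the distribute translator's placement idea described above, here is a minimal sketch. This is not Gluster's actual DHT code - the real implementation assigns hash ranges per directory via extended attributes and has to cope with renames, rebalancing, and link files - and the hash function here is just a stand-in.

```python
import hashlib

def pick_subvolume(filename, subvolumes):
    """Toy stand-in for DHT placement: hash the file name and map it onto
    one of the sub-volumes (shards). Real Gluster hashes into a 32-bit space
    and looks up the range owner stored in directory extended attributes."""
    h = int(hashlib.md5(filename.encode()).hexdigest(), 16)
    return subvolumes[h % len(subvolumes)]

subvols = ["replica-set-0", "replica-set-1", "replica-set-2"]
print(pick_subvolume("photo_0001.jpg", subvols))
```

The key property is that placement is computed from the file name and the directory's layout rather than looked up in a central metadata service, which is exactly why the hard parts are renames and growing or shrinking the cluster.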
The other thing that really drove adoption is the data center. Although our clusters did not look like this, the picture highlights the issue of cabling. There are two important cabling systems inside data centers: power and networking. On the networking side, most proprietary storage systems have really custom cabling; they're built from the ground up and have dedicated backends like InfiniBand or Fibre Channel. Because these systems are not ordered or turned up frequently, it's really hard for your site-ops folks to remember how to cable this stuff up, so you end up having to bring in the vendor to do it, which takes time and coordination. The other component is power. Our data centers are using 277-volt power now, which is more efficient, and a lot of the vendors are simply not on board with that, because frankly a lot of other folks out there are still using older-style power systems. With OCP we're trying to push more adoption of this, but the storage world is probably at the trailing end of it.

The final thing is money: cost per gig. This is how everyone always thinks about storage. The accountants get out their spreadsheets and figure out how much stuff costs, and proprietary POSIX storage systems are not necessarily the cheapest thing in the world. POSIX in general is actually pretty hard to solve, and to do well at large scale, so you pay for that, and the ability to drive the cost down by using commodity hardware is obviously really appealing.

Our customers - and these include new customers as well as the existing customers we had when we originally started - cover a wide range. Many teams are simply R&D. These might be teams working on AI, or a team that has an idea and doesn't really know what they're going to do with it yet, so they don't want to invest the time in, say, writing against an object store API; they just want to throw some ideas around, write some C++, and figure out whether it's actually going to do what they think it's going to do. Then they might stay with Gluster, or they might move on to something like HDFS. Those that stay go into basically full production workloads and are supported as such. We also have the classic POSIX general-purpose clusters. These kind of look like your average NetApp: a slew of unstructured data from various teams. It could be spreadsheets, it could be media; it runs the whole gamut. You can further refine this into four different groups. You have archival, which is your classic backups. You've also got being the glue between large-scale systems, which is another important use case: some huge data warehouse application is going to distill data down from many petabytes to maybe only a few terabytes, and that's going to be injected into some database. Generally those systems don't talk well to each other, and we use Gluster to be the glue between them.
Finally, there's anything that doesn't fit into any other storage solution. If you're not media, you may not fit into something like Haystack. If you don't look very database-like - or you were on a database, but the DBAs yell at you because "what the hell are you doing storing a gigabyte blob in my database?" - you might get booted out of that and told to go home. Basically, if you don't fit into any of those other boxes, we're usually going to take you on.

Here's another way of looking at it, by I/O size and data set size. Haystack, HDFS, and cold storage are at one end of the spectrum, and you've got MySQL and RocksDB at the other. Those are typically very small I/Os per transaction, and the data set sizes are generally gigabytes to terabytes; using some sharding magic they can obviously also get into the petabyte range, in the case of MySQL. And then you've got us in the middle: our data sets range from gigabytes all the way into the petabyte range, and our I/O sizes can be anywhere from 4K to 2 megabytes, depending on what request sizes clients are using over NFS.

Hardware: probably no surprise here, we're using Open Vault, which is an OCP solution; that's what one of the Gluster trays looks like. Currently we're using 4-terabyte drives, 30 of them in a machine, which is about 100 terabytes usable per host, and we divide those into two RAID 6 groups. We also have some other hardware, hybrid systems that use a mix of flash and a RAID 10 volume; these are for hybrid workloads that have a lot of hot data that needs to be accessed very quickly. The Gluster community is working on another solution to this around cache tiering - I think it's showing up in 3.7 already, and I think it's going to get more hardened in 3.8 - but we like to experiment with both approaches. Right now on our hybrid systems it's really block-level caching, whereas that is going to be more file-level caching.

[Audience questions] These are nearline SAS, yeah - basically the controller is enterprise and the platters are more consumer. Yeah, RAID 6, 15 and 15 drives. For the underlying file system we use primarily XFS, though we're starting to use Btrfs as well; we have maybe 20% of the fleet on Btrfs. We're letting that mature a little bit - we've seen some performance issues with Btrfs - so the majority is XFS. No, no setup without hardware RAID yet: we are experimenting with software RAID right now, and the big questions there are the write hole, and being able to journal things very quickly, where the NVRAM on a hardware RAID card is obviously really nice; it's actually a really tricky problem to get rid of it. In terms of RAID card vendors, we use multiple vendors - I think LSI, and PMC-Sierra, which I think now owns Adaptec. Yeah, we use all the standard tools.
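As a quick sanity check on the "about 100 terabytes usable per host" figure above - this is my arithmetic, not from the talk, and it ignores file-system and RAID formatting overhead:

```python
drives_per_host = 30      # 4 TB drives per host, per the talk
drive_tb = 4
groups = 2                # two RAID 6 groups, "15 and 15"
drives_per_group = 15
parity_per_group = 2      # RAID 6 spends two drives' worth of space on parity

data_drives = groups * (drives_per_group - parity_per_group)
usable_tb = data_drives * drive_tb
print(usable_tb)          # 104, i.e. roughly the "100 terabytes usable" quoted above
```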
So if you want to look at what a Gluster cluster looks like at Facebook, this is the general-purpose layout we use. We usually have sub-volumes that go across racks. Here the positions are drawn as if they're perfect, but in a real data center we don't really care what position a host is in within the rack. With OCP hardware it's going to be 9 servers per rack, so we have 9 sub-volumes per rack, and then we just stamp these things out.

In terms of highly available NFS, a lot of people ask us how we do it. There are a lot of different options for this, and I really just suggest people use what they're comfortable with. We have used CTDB to do this; it's a really small piece of software from the Samba community, and its job is really just to move IPs around when a host fails. Here's a quick rendition of how it works: you have some file going to some node, that node dies, and CTDB is responsible for moving the IP addresses over to the other nodes. This all works because NFSv3 is stateless. If you do not believe me, try this out with Gluster: it will work. You'll get a brief pause, and then I/O resumes. If you really want to drill down the stack into why it works, it comes down to the structure of the file handles in Gluster: they're completely deterministic. You have a volume ID as well as a GFID encoded into the file handle, and because of that, any node can answer a request for any other node. This is one of the beautiful things they've done in the design of Gluster - the NFS daemons all structure the file handle this way.

So this all works, but if you're looking at it with a critical eye, what's the problem with this method of HA NFS? Anyone have any ideas? [Audience suggestions] We don't support that, so it's not a problem for us. In what way? No, not really too much of a problem. No, because again Gluster has internal locking on the back end of the bricks, which will maintain the consistency during these failover events. The real problem here is that this all works great within a rack, but what happens when a whole rack fails? It's a rare event, but for most people designing an application on a stack like this, it's an event they probably want an answer for, and for a while we didn't really have one.

So we started looking for other options, and this is basically what we came up with. It's in line with how a lot of other systems load-balance; this is a very classic setup for the web tier at Facebook, and I'm sure for a lot of web server systems out there: you have a bunch of machines, they all advertise an address over BGP, and there's a load balancer that directs traffic to those machines based on some sort of heartbeat to detect whether they're up. For us, some tweaks were needed in how the load balancing worked for Gluster, because we needed very static assignments for host port and source port, meaning that once a session is established, we want to make sure the traffic keeps going to the same NFS daemon, because although failover is supported, you don't want to be failing over every single packet. Effectively how this works is: a node dies, it stops advertising, another rack of nodes picks up that traffic, and I/O resumes.
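To make the deterministic file-handle point concrete: because the handle is derived purely from the volume ID and the file's GFID, every server constructs the same handle, so a client's handles stay valid after its traffic fails over to another node. Here is a hedged sketch of that idea; the exact byte layout Gluster's NFS server uses is not given in the talk, and the UUID values below are hypothetical.

```python
import uuid

def make_file_handle(volume_id: uuid.UUID, gfid: uuid.UUID) -> bytes:
    """Deterministic NFSv3-style file handle: volume ID plus GFID.
    No per-server state is involved, so any node in the cluster can
    resolve a handle handed out by any other node."""
    return volume_id.bytes + gfid.bytes

vol = uuid.UUID("9a2f3c1e-0000-4000-8000-000000000001")   # hypothetical volume ID
gfid = uuid.UUID("6b1d2e4f-0000-4000-8000-00000000abcd")  # hypothetical file GFID
assert make_file_handle(vol, gfid) == make_file_handle(vol, gfid)  # identical on every node
```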
Okay, so that gives you an idea of the deployment styles and how we do things like HA NFS. Now I'll get into the scaling challenges we've had. In a prior life, this is really what I had to deal with. You can get away with a lot when you're dealing with one rack, maybe two racks, maybe three racks, so my mentality was quite different in terms of what I should be doing with automation and things like that: if a brick died or an NFS daemon died, it was a pretty rare event, and you could maybe put a cron job in there to clean things up every once in a while. Then my boss took me on a data center tour, and this is what I saw, and suddenly you realize that none of that is going to work. After you calm down for a bit, you start thinking about how you really have to change the way you work. As I mentioned, there are two broad challenges: scaling operationally, and handling any deficiencies inside Gluster itself, the internals.

Operationally, the first thing you'll figure out - or should figure out - with Gluster is: am I going to build one giant cluster and shove all my data into it, or am I going to do something else and make smaller clusters? Both have pros and cons. If you make one big monolithic thing - a lot of the Hadoop folks do this; they've designed a stack that can literally do hundreds of petabytes, which is a pretty amazing accomplishment - well, Gluster fundamentally is not built that way, it's not designed that way. So instead of working against it and trying to force it to do things it's not really designed to do, we did what came naturally for Gluster, which is a celled approach. The cons, of course, are that you have more widgets, so you have to be really good at things like configuration management and provisioning; these cells really need to more or less run themselves, because you're probably going to have a lot of them. But there are pros too: you get great isolation in terms of failure domains. If a name node dies in the Hadoop world, it's a complete tragedy - hopefully your failover mechanisms work, but if they don't, you're going to have huge amounts of data unavailable. With a cell design like this, at worst you might have a few petabytes unavailable, but the rest should be A-okay.

In order to manage all these cells, the first thing we did was make the cell - the cluster, if you will - really manage itself, and we built a tool called AntFarm to do this, which is really a cluster manager. If you're familiar with Heketi in the Gluster community, that's really designed to take over this role. AntFarm is something we chose to write in Python, and completely external to Gluster, for some pretty important reasons. Some teams will take the approach of actually forking a project and making a lot of internal modifications - modifying even the logging structures to pump data using a Facebook library instead of the standard C logging libraries, for example. There are some advantages to that, but ultimately: one, you're forking, and two, you're marrying yourself to things that the open-source community has no idea exist, and they're certainly not going to support them.
Being a big open-source fanatic, that was not a direction I wanted to go. So anything specific to Facebook I wanted completely external: the core of Gluster had to stay pure, still fundamentally the open-source product, and AntFarm was really designed to take the Facebook-specific functions and encapsulate them somewhere. Fundamentally, AntFarm does performance metrics, configuration management, and monitoring/alarming. There are two components. You've got a manager node - there's one per cluster, elected using a bully algorithm, super simple - and then everyone who is not a manager is just a worker. There are master tasks that the manager does and slave tasks that the workers do, and who does what is pretty obvious: if it needs to be coordinated, the manager should be doing it; if it's uncoordinated, the workers can do it. Coordinated activities would be things like turning up a cluster, or replacing a node because it's been out too long - maybe a human didn't go figure out what was going on, or the site-ops work took too long - the manager coordinates that. Uncoordinated activities: when a node comes back from imaging, it needs to put itself back in the cluster and announce "hey, I'm back, I'm ready" - that does not need to be orchestrated by a manager, a worker can just go do that - and likewise submitting statistics; host-level statistics can just be sent up by each worker.

One of the other important things AntFarm does is enforce layouts. We support three different types of layouts. We have what we call "off network," which is probably the more common one in the Gluster community itself: no replicas are ever in the same network, or in our case the same rack. The pros are high resiliency and high read rates; the con is less write throughput. But say a customer comes along and says "I really need high write throughput." You can give them the "in network" layout, which is basically putting replica groups always in the same rack. You first tell them they're insane, that they'll probably have availability and durability issues; if they agree, this is what they get: great write throughput, but not as good read throughput. And you may have an engineer come to you and say "hey, I know what I'm doing, let me set up my cluster however I want." For that we have "ordered." I'm not really aware of anyone that uses it anymore, but we support it because, hey, it's America. Alright.
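Here is a minimal sketch of the "off network" rule just described - my illustration, not AntFarm code (AntFarm was not open-sourced): a replica set violates the layout if any two of its bricks share a rack.

```python
from collections import Counter

def violates_off_network(replica_set):
    """replica_set: list of (hostname, rack) pairs for one replica group.
    "Off network" means no two replicas of the same data share a rack,
    so any rack appearing more than once is a violation."""
    racks = Counter(rack for _host, rack in replica_set)
    return any(count > 1 for count in racks.values())

print(violates_off_network([("host1", "rack-A"), ("host2", "rack-B"), ("host3", "rack-C")]))  # False
print(violates_off_network([("host1", "rack-A"), ("host2", "rack-A"), ("host3", "rack-C")]))  # True
```

"In network" would instead require all bricks of a group to share a rack, and "ordered" skips enforcement entirely.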
So we're growing and growing, things are going pretty well, we're getting more and more clusters, more and more cells, and we finally get to a point where it's like: okay, holy cow, this is actually becoming a lot of work to turn these things up and to manage things like provisioning. When imaging fails, a human has to go figure out what happened and try to get it going again. We use things like Kickstart to do imaging, which is pretty automated, but when you have a lot of this going on, it eventually doesn't scale. So JD was created. It's designed to be to clusters what AntFarm is to hosts. It does things like provisioning: it will shepherd machines through the provisioning process, it will create the initial cell configs, and it hands things off to AntFarm to actually go create the clusters. Eventually we're going to have JD monitor metrics too, and the vision is that we actually don't want humans turning up cells at all. We just want humans to feed the monster with machines, and it will turn up cells on its own by monitoring metrics: if a cell gets, say, 70% full, boom, go create a new cell. We don't need to know about it. Our capacity guys may not like this plan, but that's coming.

Okay, there were some code changes for operational reasons as well. The first obvious one for us: we're a big IPv6 shop, and Gluster did not do IPv6, so we added it. This is one of the few patches we have not open-sourced, because frankly probably not all of the world needs IPv6 support the way we did it: we did it in a way that makes all of Gluster IPv6 - we actually removed the IPv4 support.

For AntFarm, we were looking at open-sourcing it, but then Heketi came along, and we feel Heketi is really a better approach for the community. The question for us really becomes: do we move to Heketi, or do we continue on with AntFarm? But we think that was a great approach the community took.

This next one we have given to the community - I'm not sure if you'll see it in 3.8 or 4.0 - and I think Red Hat's performance engineers actually really like this feature. You used to have to run a profiling start command and then stop it, and it would dump out some stats for you. Facebook is crazy data-driven: even engineers who don't own the Gluster clusters want to see metrics when something goes wrong, they want to know why. This is really ingrained in our culture, and for a long time the complaint was "this thing is a black box, I can't tell what's going on," and it really frustrated engineers. So we modified the io-stats translator in Gluster so it could run full time: we got rid of all the locking in that translator, and we got it to dump things out in a JSON format, which is digestible by almost any monitoring system you can think of. From this you get something like 3,000 different metrics every five seconds - more stuff than you probably even know what to do with.

We didn't stop there: we also went for FOP sampling. Again, data-driven people want to know their worst-case service times, and to get that you really need sampling. In a file system you're doing huge numbers of operations - hundreds of thousands per second in the case of Gluster - so you can't record every one; you need to sample them. We built this FOP sampling feature into the io-stats translator as well, and it has been open-sourced; we gave it to Red Hat.
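A minimal sketch of the FOP-sampling idea just described - illustrative only; the real feature lives in Gluster's io-stats translator and its output format and knobs differ: record roughly one out of every N operations' latency and emit it as JSON, so an external monitoring system can track worst-case service times without logging every call.

```python
import json
import random
import time

SAMPLE_RATE = 100   # record roughly 1 of every 100 FOPs (operator-tunable)
_counter = 0

def maybe_sample(fop_name, start_time):
    """Emit a JSON latency sample for about 1/SAMPLE_RATE of operations."""
    global _counter
    _counter += 1
    if _counter % SAMPLE_RATE == 0:
        sample = {
            "fop": fop_name,
            "latency_us": int((time.time() - start_time) * 1e6),
            "ts": int(time.time()),
        }
        print(json.dumps(sample))   # in practice, shipped to a monitoring pipeline

# Toy usage: pretend we just served a bunch of LOOKUPs.
for _ in range(300):
    t0 = time.time()
    time.sleep(random.uniform(0, 0.001))   # stand-in for doing the actual FOP
    maybe_sample("LOOKUP", t0)
```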
Another challenge we had - it's not quite operations, but it kind of is - is this notion people have around NFS. I was kind of naive to this before I came to Silicon Valley; I don't know if it's a Silicon Valley thing or not, but there's this notion that NFS is evil, it sucks, it's horrible, and if you use it disaster will be upon you and your family. It's kind of an odd thing, because when you really get down to it and unwrap what people are actually upset about, NFS is just a set of RPC calls, and it's actually pretty nice. It's stateless, it's actually really clean, it's well documented, and it's really old. Frankly, I'd challenge people: whatever you plan on replacing it with, NFS will probably still be here when no one uses what you built. So why do people hate it? Mounts. Mounts were originally designed for local use, and the standards around mounts are such that people expect local-like behavior, but when things on the network go wrong, or things on the network go away, mounts are not good at communicating to users that something went wrong or what should happen. Hard mounts make that even worse, because they don't even give you an error. Mounts are also bad for other reasons: if there's a kernel bug, you have to upgrade your whole kernel. If your customer has a thousand machines and you go to them and say "yeah, no problem, we've got a fix for that, just upgrade and reboot a thousand of your machines," they're probably not going to do that.

Looking for a solution to this, I looked to the open source community, and sure enough this guy Ronnie Sahlberg had written this thing called libnfs. This, I think, really broke the logjam: it allowed us to prove to people that it's not NFS they really hate. We did that by making CLI utilities that expose NFS as a CLI. If you want to get data in and out, you can cat it or put it; you don't need a mount anymore. It made NFS look and feel like things people really like to use, like the Hadoop CLI, and once we finished writing all this stuff, it also provided demo code: if you want to actually embed libnfs in your app, you can. In short, it gave people an option beyond mounts, and I think just giving people that choice brought down a lot of the tension.

Here's a quick demo of what these utilities look like in action. You've got an NFS "ls" at the top just showing there's nothing there; we echo some data into a file, then we list the file, cat the file, delete the file, and list again to show it's gone. That's basically what it allows you to do: no mounting, it's completely userland, and if there's a bug you can upgrade it in user space, which is really nice. This one we've been meaning to open-source; it's really on me - I have to get it working with autotools and such. What we'll probably do is offer it to the libnfs folks and hopefully it becomes part of libnfs itself, so if you compile libnfs you'll just get these for free.

Okay, back to internals and scaling challenges. We went to the first Gluster developer summit last year, which was really awesome, and one of the things we brought to the developers was "pragmatism over correctness." This is the philosophy we have as production engineers at Facebook. An example I can give you is this code snippet: any ideas why this could be a really horrible thing to do? I'll give you a hint.
What this code is really doing is: when you connect to an NFS daemon, it picks a privileged port, and this is actually in the FUSE code, so originally the FUSE clients on our stack did this too. As our customer count grew and grew, we found that mounting was getting slower and slower, and we were trying to figure out why. We dug and we dug, and we found this beautiful piece of code. In an era when, since the days of the Raspberry Pi, you can get a machine that can bind a port at 1024 and below for ten dollars, I have no idea why people even bother putting this stuff in code. Although it's correct - if you look at the NFS spec, this is what you're supposed to do - it's kind of insane. What I'd ask of developers is: at least put in an option, so you can say "yes, I've satisfied the correctness of the spec, but I'm going to give you a way to turn it off for performance reasons." Another example we found was around DNS lookups: we found that DNS lookups were happening for every inbound connection. Again, this is correct, and there may be security reasons to do it, but doing it inline is not very scalable. We've got little pieces of C code that prove you can do thousands of TCP connections in the blink of an eye, even serially, if you avoid these sorts of issues.

Anyone who has set up Gluster before: have you ever seen this? This is an I/O error, and it's probably one of the first things people see and hate. If you go through the Gluster docs, they will quickly tell you that the way you solve it is to go to the back end, figure out which copy of the file you really want, and pick that one, because this is basically split-brain: you've got two or more copies of the file, and Gluster doesn't know what to do. We saw this pretty soon, before we'd even gotten that many hosts going - we were in the hundreds of bricks when we started to see it - and we knew it was something we needed to solve. As it turns out, when you actually go ask a customer "which one should we pick?", they'll say "I don't know," or "I don't care," or "pick the last one" - pretty much what a human would do, right? They'll pick by size, or pick by time, or pick by majority (two are the same, one is not, so pick the two). We modified AFR to do exactly this; it's called "favorite child." You can see here there's a split-brain on the back end, and we resolve it automatically without any I/O error ever reaching the user. We do log these events, because we want to know when they happen, but the key is that we want availability for the data. Lately, to go to the next level on this, we just finished a patch at the DHT level to handle really exotic cases where shards may disagree with each other about what the hash ring looks like, or what the GFID is on the hash ring, and we can resolve those cases without any loss. [Audience question] Yes, in this case we're picking by size.
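A hedged sketch of the kind of selection policy described above - my illustration of pick-by-majority/size/time, not the actual AFR "favorite child" code: given what is known about each replica, pick a winner automatically instead of returning an I/O error to the user.

```python
from collections import Counter

def pick_favorite_child(replicas, policy="majority"):
    """replicas: list of dicts like {"brick": ..., "size": ..., "mtime": ..., "checksum": ...}.
    Emulates resolving a split-brain by majority content, largest size,
    or newest mtime - the same choices an operator would make by hand."""
    if policy == "majority":
        counts = Counter(r["checksum"] for r in replicas)
        winner_sum, votes = counts.most_common(1)[0]
        if votes > 1:                       # a real majority exists (e.g. 2 of 3 agree)
            return next(r for r in replicas if r["checksum"] == winner_sum)
        policy = "size"                     # every copy differs: fall back
    if policy == "size":
        return max(replicas, key=lambda r: r["size"])
    if policy == "mtime":
        return max(replicas, key=lambda r: r["mtime"])
    raise ValueError("unknown policy: %s" % policy)
```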
Typically we use the majority policy: we pick the majority case and go with that, and since we implemented this we've really never had a customer come to us and say "hey, you bastards, you lost my data." It's been working pretty well.

The next thing we had issues with was access control. If you look at vanilla Gluster - and NFS-Ganesha is now an option, I'll throw that out there - this is what you had to live with: an awful allow/deny system where you give it a list of IPs (you can even use wildcards) and it will allow or deny access to an NFS daemon. When you're faced with tens of thousands of clients, you can clearly see this doesn't work so well. So we hired an intern and said "hey, intern, solve this problem." It ended up being a really nicely compartmentalized problem for a summer, and it's kind of an already-solved problem in industry: there are well-defined ways of doing it, and we chose netgroups. It's something we had used on enterprise systems, so we knew it worked, and we already had a lot of infrastructure for generating netgroup files. Effectively how it works: you define an export file like this, and we have a background job that scans these exports and creates netgroup files against a really huge database; if you name a host scheme - a tier called "my tier," for example - it'll figure out what all the my-tier hosts are and generate a netgroup. That's then sent out to the machines (using Chef, actually), and from there this actually scales: you can control access on thousands and thousands of server machines, which are in turn limiting access from hundreds of thousands of client machines.

Okay, another internal thing. This is probably one of the scariest charts a production engineer at Facebook can be faced with. I wish this were dollars in my bank account on the y-axis, but it is not: it is memory, and this is a memory leak. By last year and the year before, a lot of the low-hanging-fruit problems had been solved, and we were faced with stuff like this. At the beginning we really didn't know why it was happening. You'd find a brick with high CPU and high memory, you might restart it - maybe do a statedump beforehand - and it would drop down, and then it might stay down or it might go up again. This was really bad: machines were running out of memory, and that's even worse than a machine dying, because the machine is kind of hobbling along, and in distributed systems a zombie machine is way worse than a dead machine - you don't know, should I boot this thing out or should I keep it in? - and these are really hard things to automate around. Ideally you want to make code changes so it just can't happen. In this case it eventually got tracked down to locking: there was some misbehaving client, or some piece of Gluster out there, holding a lock on a file or directory and not giving it up. These are also really hard problems to debug, and because they're hard to debug it was really hard to write patches for them: we would see this on a system, come look at it, and do a statedump.
We'd see tens of thousands or hundreds of thousands of locks pending, and then you're trying to piece that together to figure out what series of events actually made it happen. What broke the logjam was creating a feature called "monkey unlocking." It's a developer feature: we will, on purpose, drop one percent of lock/unlock requests, which turns these really rare cases into really common cases. The rule then was: Gluster must handle running with monkey unlocking turned on, and it must be able to keep from blocking while it's operating. Once we created that, it actually became pretty straightforward to make the patches, make sure they worked, and ensure that a graph like the one back there never happens again. What we created is lock revocation, and the idea behind it is: if no one contends for your lock, okay, we won't go after you, everything's good. But if anyone is contending on your lock, we're going to revoke it based on two parameters: one is time, the other is how many people are blocked behind you. You can use one, the other, or both; it's really up to you. It's also POSIX-y: when we revoke a lock, we send you back EAGAIN, and if you choose to ignore EAGAIN in your code, you may crash - that's your problem, not ours. We clearly state in the docs that this can happen, why it can happen, and what the options are. In practice we don't see too many people crashing; we're blessed with pretty good coders, so they handle it pretty well.
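A hedged sketch of the revocation decision just described - illustrative only, not the actual Gluster locks patch, and the two thresholds are hypothetical tunables: a lock is only touched once someone is contending for it, and it gets revoked when it has been contended too long or too many waiters have piled up behind it, with the holder handed EAGAIN.

```python
import errno
import time

# Hypothetical tunables mirroring the two knobs mentioned in the talk.
REVOKE_AFTER_SECONDS = 300    # revoke if contended for longer than this
REVOKE_AFTER_WAITERS = 1000   # ...or if this many requests are queued behind it

def should_revoke(lock):
    """lock: dict with "contended_since" (epoch seconds or None) and "num_waiters"."""
    if lock["num_waiters"] == 0:
        return False              # nobody is contending: leave the holder alone
    too_old = (lock["contended_since"] is not None
               and time.time() - lock["contended_since"] > REVOKE_AFTER_SECONDS)
    too_many = lock["num_waiters"] > REVOKE_AFTER_WAITERS
    return too_old or too_many

def revoke(lock):
    # The holder's next operation on this lock fails with EAGAIN; clients
    # that handle EAGAIN retry cleanly, clients that ignore it may crash.
    lock["holder_errno"] = errno.EAGAIN
    lock["held"] = False
```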
Alright, the final internal change I'll talk about is replication. Gluster has actually gone through two generations of geo-replication; we don't use either of them, because one of the things you find when you're trying to scale to really large numbers is that you need simplicity - the fewer moving parts the better - and this was really a case of that. We created something called halo replication, which is really a collection of patches that enable this form of geo-replication. In the first patch we multi-threaded the heal daemon. We've given this upstream; it could be useful for other people who just want to make things heal faster (depending on the hardware you're using you might actually want to heal slower, not faster, but we have pretty beefy hardware, so faster is better), and it's important when you're geo-replicating, because if you're dealing with high latencies you want to get as many packets in the air at the same time as you can. The other thing is, people come to me and say "okay, Rich, this is really great," and the first question they have is what happens when a partition heals - I'll get to that in a second. Next, non-destructive GFID split-brain resolution: this will mean nothing until probably the next slide, but it's really important. Basically, if you have two copies of the same file that have a different GFID - if you're not familiar, a GFID is basically Gluster's equivalent of an inode number - what do you do? We created a patch that uses techniques similar to the split-brain resolution I discussed earlier to resolve these cases, except here it's non-destructive: we don't delete data, we rename it. And finally, the halo feature itself. At its core it requires only one option: it will figure out for you how many data centers you have, take its best guess, and form replication zones, and within those zones you will have synchronous reads and writes.

If you're a geek like me and you want to tune this, you can tune things like the minimum number of replicas you want before you acknowledge writes, as well as the maximum number of replicas you want to write synchronously - because maybe within a zone you've got six replicas, because you're Netflix and you need a lot of read capacity, so you may want to limit how many replicas you write to synchronously. You can also choose whether you want failover enabled (it's enabled by default), which means that if some failure happens, you can decide that even though there are only two replicas left in your zone, you want to bring in another replica from another zone, at the cost of higher-latency reads and writes, because you want that extra durability. We also have this notion of "min samples": we're constantly looking at pings - that's how we figure out where all the data centers are - and min samples is how many pings have to be seen before the system starts making calls; it stays in a synchronous state until that many samples have been received.

This is the way you can think of halo geo-replication. We've got three data centers here: maybe these two are the east coast and they're 12 milliseconds apart, and it's 65 or 70 milliseconds to the west coast. Halo forms replication zones based on the halo setting - in this case about 10 milliseconds - so it says: anything within 10 milliseconds is in my zone, and this is measured to the brick nodes themselves. So if you're an NFS daemon trying to figure out who to talk to synchronously, you do it based on the halo value: "okay, I can see maybe 20 bricks in my zone, they're within 10 milliseconds of me, I'm going to talk to them synchronously, and everything else I'm just going to let the heal daemons handle asynchronously." But maybe you're a weird customer and that's not good enough. We do have some folks like this: "one data center is not enough, I need two data centers - a comet hits, takes one out, I need my data safe - but I also want a third copy over there for some other reason." You can actually do this using a FUSE mount: you just say "I want 20 or 30 milliseconds for my halo," and you'll get two data centers synchronously. So fundamentally it's extremely flexible. And then of course we have the heal daemons; these are the guys actually pushing data between data centers, and they just use an infinite halo, so they see everything, talk to everything, and shuttle the data around.
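To make the zone-formation rule above concrete, here is a minimal sketch assuming a simple ping-based latency estimate. It is illustrative only - the real halo options, sampling, and failover logic are more involved - and the sample-count value is a made-up placeholder, with the 10 millisecond threshold taken from the example in the talk.

```python
from statistics import median

HALO_LATENCY_MS = 10   # talk synchronously to bricks within this many milliseconds
MIN_SAMPLES = 5        # hypothetical "min samples" before trusting an estimate

def synchronous_bricks(ping_samples):
    """ping_samples: dict of brick -> list of round-trip times in ms.
    Returns the bricks this daemon should write to synchronously; everything
    else is left to the heal daemons, whose halo is effectively infinite."""
    chosen = []
    for brick, samples in ping_samples.items():
        if len(samples) < MIN_SAMPLES:
            chosen.append(brick)          # not enough pings yet: stay synchronous
        elif median(samples) <= HALO_LATENCY_MS:
            chosen.append(brick)
    return chosen

pings = {"brick-east-1": [2, 3, 2, 2, 3],
         "brick-east-2": [12, 11, 12, 13, 12],
         "brick-west-1": [68, 70, 67, 69, 71]}
print(synchronous_bricks(pings))   # only the bricks inside the 10 ms halo
```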
Some of the reasons we like halo, and I think our customers do too: it's super easy to use, and all the standard tools work - you can use libnfs, GFAPI, the NFS CLIs, and the FUSE mounts. It's got some cool behaviors: it's partition tolerant. If two regions are up but disconnected, they can both receive writes, and we do that using the GFID split-brain logic I mentioned before: we'll allow you to write to two regions simultaneously, and we handle figuring out who wins. When we do figure out who wins, we're not going to blow away the loser; we just rename it out of the way, and the winner is decided pretty much the way a local file system would decide it: last writer wins. It's also pretty performant: with six hosts we can get up to a gigabyte a second, or 50,000 files an hour if files per hour is more your metric, and this scales essentially linearly - add 12, 18, 24 hosts, whatever, and you'll get it.

Future work and current challenges. Hardware RAID is something we're looking at removing from our stack; we don't really know how we're going to do it yet, and it's actually a really hard problem to get rid of the NVRAM. My first inkling is that this may be one of those things we can solve by getting erasure coding into production and really hardening it. Datalab in Spain did some awesome work creating the disperse translator; what we're going to do is try that in production on our systems and really harden it so it works at our scale, and we're hoping that may be the key to JBOD as well - we don't know yet. The other thing is multi-tenancy. We're basically making all of our clusters look the same: if you become one of our customers, you just get a standardized cell, and you may or may not live with other customers. That's another pretty hard problem, because we have to do QoS and provide things like self-service. We've got the QoS down pretty well; we actually have a patch, which we'll probably be upstreaming this quarter, that does throttling at the directory level, and that's been the key for us to get multi-tenancy to work. That's it.

[Audience question] Yes, so that's cross-country, and they're both writing the same file? If they're both writing the same file and you're doing reads asynchronously, they'll both be reading the data that they see within their own region. We allow that, but it depends on the customer: some people have really, really high consistency requirements, and for those we basically have to tell them you can't have it all. If you want one hundred percent consistency, we can give that to you, but you're going to be in sync mode: you'll still have the geo-replication, but we're really going to have to consult all regions in order to give you that perfect answer, in terms of consistency.

[Audience question] So, Gluster does granular locking as it replicates the file, and it depends on the case. The very common case - say it's a brand-new file - is way easier. A file might be being written on the west coast and read on the east coast. The first thing Gluster will do is fallocate a fairly significantly sized file on the other end - whatever size it sees at that moment - and begin back-filling the data. Gluster then has granular locks that it enforces on readers in that region; they will not be able to read past the lock, so they should actually get pretty consistent reads in that case. The more exotic case is the random-write case, and that's something we just tell our customers not to do - we don't really support it. The granular locking in theory should protect you, meaning that while the heal daemons are replicating data into that region's copy of the file it's going to be locked and you won't be able to read it, but this is not something you'd want to run, say, a database on. We'd tell them: use a database; database systems have their own replication mechanisms, because their requirements are very, very specific.
This is more for things like photos or videos - things you basically open, write, and want replicated. Go ahead.

[Audience question] Not open-source yet. We had it pretty down pat on 3.4; on 3.6 we're almost there, I think, and 3.6 is really what most of the community is going to want to run it on. There are a few final things we're touching up. We don't like throwing patches over the fence that aren't hardened, because since we're the ones writing the patches, we can be a little faster and looser than an end user, so we don't want to put patches out there that aren't baked. It'll probably be around the Gluster developer summit that we get it out there.

[Audience question] Yes, internal customers. So WhatsApp might be one of them, Instagram might be one of them; these are all customers to us. [Audience question] Yeah, so photo and video, to be clear, is Everstore/Haystack; that's their bread and butter. We may store bits and pieces of that, but usually not for front-end access; it's usually people doing experiments, trying out different video codecs, those kinds of things. If someone comes in and says "I want to store photos and videos," boom, we shoot them over to the team best suited for that. Go ahead.

[Audience question] It varies by cluster. The number I do have off the top of my head is that probably 80% of the FOPs on a Gluster cluster in our systems are actually metadata, and only 20% are actually reads and writes; in some clusters it gets as high as 90%. In terms of reads versus writes, the last time I broke that stat down it was almost 50/50, which actually kind of surprised me, because usually on NFS-style systems there are a lot more reads than writes, but our users are pretty write-heavy; I think last time it was about 50/50. [Audience question] Yeah, Gluster at Facebook is just those six folks from the front of the presentation.

[Audience question] So currently we depend on our hardware RAID cards to do that at the block level; the background scan is all done by the hardware RAID cards at the block level. Gluster actually has bit-rot detection coming; we don't use that yet. What we'll probably do is roll that into our JBOD project, because without hardware RAID, that's something you have to own and do yourself. Right now we pretty much delegate it to the hardware RAID and hope it does its job. We do run about 20% of the fleet on Btrfs, and it's been a good gut check to see whether the hardware RAID is lying to us, and the answer is: not really. It's actually fairly rare. We have run into a few cases where Btrfs pointed out corruptions that the hardware RAID stack did not catch, but they're rare enough that we weren't concerned, and I've never seen it happen on three replicas. In those cases you usually just drop the bad data on that replica and reconstruct it.

[Audience question] No, not yet. We actually did our own snapshot work using Btrfs. We've done some experiments with LVM; talking to the Gluster folks at the last summit, they seem pretty confident in LVM snapshots, and it's file-system agnostic, so it has some nice properties.
But in our early experiments with it we were still not super pumped about the performance. Our customers really don't like blocking; if anything blocks for any reason, we hear about it.

[Audience question] Yes, that's back to monkey unlocking: it can be one percent or it can be ten percent, it's really up to the operator of the cluster. Actually, I think one of the cool things Gluster designed in pretty early on is multiple queues for different operations: you've got a high-priority queue, a normal-priority queue, a low-priority queue, and a least-priority queue, and you can run the heal daemons in the least-priority queue, which keeps them from starving out your production workload. Typically what we'll do is assign threads to queues based on how many cores our systems have, and then we'd typically put two or three threads on the least-priority queue. Not all of our clusters are running that today, because we only really got comfortable with the notion of least-priority queuing in 3.6; in 3.4 I think it either worked not so great or not at all, I can't quite remember. But least-priority queuing would be your best bet.

[Audience question] If I had to pick, I would probably keep hardware RAID, because I actually feel it's a hard problem that's been well solved, and I have the utmost respect for the folks who built that technology; I think they know what they're doing and are really good at it. I think a lot of this is driven by people wanting more metrics about what the individual drives are doing, so the message to the hardware RAID vendors out there is: expose a lot more metrics, because there are power users who want to know things like latencies to individual drives, what each drive is doing, all that kind of stuff. Some of that is really what's driving it from an engineering standpoint.

[Audience question] So the throttling feature we're putting out is designed to handle exactly those situations. This is brand new, even to our team. Typically, in the past, if we saw a really heavy-hitting customer, we'd build them their own cell and move them there. In the new world, to really help us scale, we need all the cells to look identical; if a customer is truly big enough - a multi-petabyte use case - they may get their own cell, but that's by convention, and if you look at our config layouts, theirs will look like any other config. With throttling, what we'll do is go to a customer and say "hey, how many FOPs do you need?" Odds are they'll say "what the hell is a FOP? I have no idea how many I need." So what we may do is say: okay, run with your workload; we'll then track in their namespace how many FOPs we see, and then we'll say "great, this is the cap we're going to place you at before we shunt you to that least-priority queue I just talked about." In some cases they may just need to change their workload, because a lot of the time you see people doing, say, a 4K read, and you ask "why are you doing a 4K read?" and they say "I don't know," and you say "well, go look at your code," because sometimes they're so far abstracted away, using various libraries, that they have no idea what that read actually looks like by the time it gets down to the syscall.
Alright, I think I'm out of time, but you can pull me aside and I'll be happy to answer any questions. Thanks.
Info
Channel: Southern California Linux Expo
Views: 12,299
Rating: 5 out of 5
Id: jWM3HTwsNE8
Length: 61min 0sec (3660 seconds)
Published: Sat Jan 23 2016