Rubrik Atlas File System with Adam Gee and Rolland Miller

Video Statistics and Information

Captions
I'm Stephen Foskett, organizer of the Tech Field Day events, and what you're about to watch is a presentation where Rubrik presents to a panel of delegates from around the world. These folks specialize in enterprise IT technology, and they are here to ask questions, discuss, and learn about the technology. If you are interested in learning more about this event, you can find out by going to TechFieldDay.com, and if you enjoyed this video you can find a lot more on the Tech Field Day YouTube channel.

...the distributed file system. I came from Google — I did nine years there, working primarily on a system called Colossus, which is Google's file system, so almost all data that lands on a disk at Google goes through Colossus and is managed by it. I was really excited to come to Rubrik and build the same sort of web-scale infrastructure for enterprise.

I guess it's my second time here — Rolland Miller. I was actually with Bipul the first time we announced Rubrik at Tech Field Day. My background is basically 17 years in storage and backup systems; I cut my teeth on NetWorker back in the late 90s when I worked at SGI, so unfortunately I'm sort of the industry veteran — I've seen all the progressions of all these products along the line. What I'll talk about in our session is really the potential I saw in the platform that excited me, and a big part of it is what Adam's going to go into: the file system is really kind of a reset, built from the ground up. Okay, cool. Oh, and that's Charlie — she's my dog; we just thought that picture worked.

So we're really excited to present this to you guys. Let's see — a quick agenda slide. First, what is Atlas? It's our distributed file system, and we'll introduce some key principles. Scalability: it's really important that it can scale, so how do we do that, and what are the architectural decisions we've made? Fault tolerance: what are the failure-tolerance properties of the system? Application awareness: from the get-go the file system was built with our application in mind, so we'll go into some of the use cases and things we've done to make our application really effective. Another slide on performance — just some quick things about how we make things fast. And finally deployment: how do we deploy Atlas itself?

The core fundamental of our platform is its architecture. Our session today is about Atlas, but just to see the overall architecture of how the platform is designed and built: you have the foundation, or core, of the system, and the file system is really the centerpiece of that, along with the applications that support it. Callisto is our distributed metadata store — if you think about traditional backup and recovery platforms, think of this as your catalog, your metadata store — so all of the metadata lives here, and it's tightly coupled with our file system. Then cluster management: how do I talk to all of the different components and manage all the resources within the cluster? And then the distributed task framework: think about job scheduling — in backup and recovery you'd have a single job scheduler, and if that server went down, well, no jobs could get done. For us this is all about building every piece of the software stack into a single component and deploying it in a scale-out fashion, so that if a single node is down, your jobs continue — the other nodes continue to work, operate, and execute their code.
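To make that scale-out job-scheduling idea concrete, here is a minimal sketch (not Rubrik's actual code) of tasks living in a shared, replicated store that any surviving node can claim, so a dead node never strands the queue; the TaskStore and Node names are invented for illustration.

import itertools

class TaskStore:
    """Stands in for a replicated store of pending jobs (the role a metadata store like Callisto plays)."""
    def __init__(self, tasks):
        self.pending = list(tasks)

    def claim(self):
        # Any node may claim the next job; nothing is pinned to a single scheduler.
        return self.pending.pop(0) if self.pending else None

class Node:
    def __init__(self, name, store):
        self.name, self.store, self.alive = name, store, True

    def run_one(self):
        task = self.store.claim() if self.alive else None
        if task:
            print(f"{self.name} executing {task}")
        return task

store = TaskStore([f"backup-job-{i}" for i in range(6)])
nodes = [Node(f"node-{i}", store) for i in range(4)]
nodes[2].alive = False                       # one node goes down...
for node in itertools.cycle(nodes):
    if not store.pending:
        break
    node.run_one()                           # ...and the remaining nodes drain the queue anyway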
That was the foundation of the platform for scale-out, and then we built the applications on top of it — recovery, search, deduplication, the higher-order applications — and that became the internals of the system. Then there are all the external interfaces for connecting out to the real world. The user interface: the REST APIs make it an easily consumable platform, and our own user interface — our HTML5 web interface — uses them. The cloud connectors: how do I export data, take it out of the system and bring it into the public cloud or the private cloud, and be able to take data that I have in my system and get it back out as well? And then all of the ecosystem integration. The first part of that, when we built the platform, was VMware, because that was the largest market segment and the best place to start, but now we're continuing to expand, and for us it's really just about adding modules to what we call the infinity layer, so that as new applications and new types of data services come online — whether it's file systems or applications — we just have to build a module or component to talk to that application type, and then we can ingest that data, run our management processes, store it, and also bring it back and instantiate it into your user space.

Okay, now we're going to talk in depth about Atlas; we'll focus on the file system now. The slides are set at a fairly high level, so if people have more detailed questions please feel free to interject and we'll have an answer. ("You really did build it, by the way.") Cool. So what is Atlas? It's a distributed file system. You start with nodes — they have compute and disks attached to them; the little square-looking one is supposed to be flash. My PowerPoint skills aren't great, though I did learn how to use animations, so I'm a little trigger-happy in this presentation. Atlas runs on top of these nodes, and what it does is assemble all these storage resources — disks and flash — and make them accessible to our application. Our application just sees a normal file system; it has no idea it's distributed or fault tolerant, and it doesn't need to know how it works — it's just writing files, which makes its life really easy. And with this distributed, fault-tolerant, scalable file system, we can build Rubrik to scale out as well.

Oh right — I almost forgot this slide. Yeah, it's assimilating and becoming more powerful, like the Borg, right? But no, it's actually the nice kind that helps us move our stuff. Okay, cool. So what are the properties? It's a distributed file system and it's homegrown — we wrote it — and the important thing about that is that we could design it with our data management application in mind.
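As a rough illustration of "the application just sees a normal file system," here is a tiny hypothetical client whose caller only ever writes and reads paths while replica placement stays hidden inside; AtlasClient and its hash-based placement are assumptions for the sketch, not the real interface.

import hashlib

class AtlasClient:
    def __init__(self, nodes):
        self.nodes = nodes                   # e.g. ["node-0", "node-1", ...]
        self.files = {}                      # path -> (bytes, replica nodes)

    def _pick_replicas(self, path, copies=3):
        # Deterministic toy placement; the real system is topology-aware.
        start = int(hashlib.md5(path.encode()).hexdigest(), 16) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)] for i in range(copies)]

    def write(self, path, data: bytes):
        self.files[path] = (data, self._pick_replicas(path))

    def read(self, path) -> bytes:
        return self.files[path][0]           # the caller never sees where the replicas live

fs = AtlasClient([f"node-{i}" for i in range(4)])
fs.write("/snapshots/vm-42/chunk-0", b"...")
assert fs.read("/snapshots/vm-42/chunk-0") == b"..."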
We'll talk more on later slides about how we do data placement to make the application's life easier. It's a global namespace: all the storage resources are distributed, but the application just sees a file system locally. The data is stored on disk or flash — we'll talk about how that's controlled later. The metadata is stored in our distributed metadata store; Rolland mentioned it earlier, we call it Callisto. It's where we also store our catalog metadata for snapshot management, and the file system leverages that same system to store its own metadata — the file system's metadata itself. The nodes communicate through RPC (remote procedure call), and we rely on the cluster management service, called Forge, to tell each Atlas instance what the cluster topology is.

Okay, cool — another picture with animation. You start with a node, with its disks and flash. Atlas starts running on it, connects to those disks and flash, knows about them, and will serve them to other nodes. Each node looks pretty much the same: there are disks, flash, and Atlas running there. Our application is running on some node and just talks to the local Atlas. Each one of these really becomes its own separate instance, and they're all fully independent of each other. You'll hear some concepts around "masterless" and how all this ties together, but basically each instance of Atlas owns its local disk and flash resources, and from there they can all communicate with each other and basically build the hive mind. The application writes to Atlas; Atlas might write to itself locally, and it might write remotely as well. And finally, our metadata is stored in the system — in Callisto. Any questions so far?

"The metadata is actually on each one of those nodes?" Yes — well, the file system itself treats the metadata service as a service, so it doesn't really know or care; it just assumes its metadata is distributed and fault tolerant. The truth is the metadata service does replicate — not to all nodes, but across nodes — so if some node dies, you don't lose that metadata. The way Callisto operates is that it also tiers the metadata. As we get into things like global search, a lot of that data is stored in flash, but then we can take additional, deeper indexes — like the file indexes on the file system — and tier those down into the disk layer, which is distributed by the Atlas file system. That way we can get very high-performance lookup without exploding the flash footprint for the metadata, so we can be really efficient about how much flash is actually required to service all the nodes in the platform.

"And an exact copy of the metadata is also in the cloud, if you archive out?" That is correct — when we archive out and push to the cloud, we send the metadata with the data as well. Actually, "metadata" is a bit overloaded in this context. For the purpose of this talk, metadata will refer to just file system metadata. What Rolland and you were discussing is the metadata for the snapshots themselves — we call that Cerebro, the data management layer — and there's a bunch of metadata associated with that, like which VMs we protect and which snapshots we have. All of that is also metadata, technically, but it's at a level above the file system — though it is actually still stored in Callisto. This is the meta-metadata.
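The metadata tiering described here — hot entries served from flash, deeper or colder index entries pushed down to disk that Atlas itself distributes — can be sketched as a two-tier lookup; the capacities, eviction policy, and class names below are invented for illustration.

from collections import OrderedDict

class TieredIndex:
    def __init__(self, flash_capacity=2):
        self.flash = OrderedDict()               # small, fast tier (kept in LRU order)
        self.disk = {}                           # large, slower tier
        self.flash_capacity = flash_capacity

    def put(self, key, value):
        self.disk.pop(key, None)
        self.flash[key] = value
        self.flash.move_to_end(key)
        while len(self.flash) > self.flash_capacity:
            cold_key, cold_value = self.flash.popitem(last=False)
            self.disk[cold_key] = cold_value     # demote the coldest entry to disk

    def get(self, key):
        if key in self.flash:                    # fast path: flash hit
            self.flash.move_to_end(key)
            return self.flash[key]
        value = self.disk[key]                   # slow path: deeper index on disk...
        self.put(key, value)                     # ...then promote it back to flash
        return value

index = TieredIndex()
for i in range(5):
    index.put(f"/vm-42/file-{i}", {"size_kb": i})
print(index.get("/vm-42/file-0"))                # served from disk, then promoted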
Okay, cool. So scalability. Scalability is a big thing if you want to build a big cluster, and one of the key decisions we made was to make it masterless: all nodes are essentially peers, so there's no single choke point and no single point of failure. This is actually a property of Colossus at Google, which is the second-generation file system there — the first one actually wasn't masterless, and Colossus was in part born to address that problem. So we designed Atlas the same way: no single point of failure, scale-out, each node the same; nodes can die, another node can pick up the work on its behalf, and things go on, business as usual. Does that concept make sense for everybody — any questions around that? Well, okay, there are a lot more pictures that will hopefully spark questions.

Cool. So cluster management just publishes a list of the nodes participating in the cluster, and Atlas uses that list to discover who it can write to and who it can read from. It's actually topology-aware, so we know which nodes are in which brick, and it's extensible — you can imagine cluster management telling us the topology of the cluster and us trying to spread replicas as wide as possible to increase fault tolerance, as I mentioned on the last slide. The metadata is also stored in a distributed, fault-tolerant system — it wouldn't make sense to be distributed and fault tolerant at the file system layer if the metadata for that file system didn't have the same properties. And the maintenance is sharded: it's a self-healing system, which means there's work to be done to manage the data stored within it, and that work is spread amongst all the nodes participating in the cluster and divvied up equally amongst them.

Cool, one more picture. This is the story when you have one brick with four nodes: the four of them form a cluster with this metadata service running. Then you add another brick. What happens? Those nodes just appear to Atlas from the cluster management system, and it starts making use of them. This is your scale-out picture: keep adding more bricks, Atlas has more nodes, and it's still one cluster — same global namespace — you've just added the capacity of the new nodes. That goes back to the first slide where we had the command line and showed the capacity — you do an ls to see how much capacity is in the file system: you start at, say, 50 terabytes, you add the next one and now it's 100 terabytes, then the next one and now it's 150. It just continues to scale out, basically until you run out of rack space and power.

"I'm assuming if you add nodes it's going to rebalance, but is that part of the Atlas file system?" It actually is — Atlas does manage the data placement, so we will rebalance, though it's a background activity. It's not like you drop a brick in and all of a sudden your cluster goes crazy; there's throttling and smarts around how to rebalance and when to rebalance, based on what the cluster topology looks like and what's going on in the cluster. Okay, good question.
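Here is a toy sketch of that throttled background rebalancing: when a new brick appears, only a few chunks move per tick, so the cluster is never overwhelmed; the chunk counts and the per-tick budget are made-up numbers, not the real throttling policy.

def rebalance_step(brick_chunks, max_moves_per_tick=2):
    """Move at most a few chunks from the fullest brick to the emptiest one."""
    moves = 0
    while moves < max_moves_per_tick:
        fullest = max(brick_chunks, key=brick_chunks.get)
        emptiest = min(brick_chunks, key=brick_chunks.get)
        if brick_chunks[fullest] - brick_chunks[emptiest] <= 1:
            break                                # close enough to balanced
        brick_chunks[fullest] -= 1
        brick_chunks[emptiest] += 1
        moves += 1
    return moves

load = {"brick-1": 120, "brick-2": 118, "brick-3": 0}   # brick-3 was just added
ticks = 0
while rebalance_step(load):                      # each tick does a bounded amount of work
    ticks += 1
print(ticks, load)                               # converges gradually, not in one burst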
Okay, so fault tolerance. Of course we're built on commodity hardware — things fail — and we're going to make sure we don't lose data. So what do we do? We replicate. Our fault tolerance properties are two disks or one node: any two random disks in the system can go away and we should not have lost any data, and the same thing for a single node, which will knock out three disks — but we take care not to place two of the replicas on any one node.

How do we do that? You use a replication strategy. Pre-3.0 we used something called mirroring — there's a follow-on slide to demonstrate that — and post-3.0 we've made erasure coding the default; there's another slide to go into how erasure coding works. The key benefit of switching from triple mirroring to erasure coding is that we wanted to provide the same level of high availability that we got by triple-replicating the data, but the other side of that is the capacity penalty. The erasure coding design we went with allows us to provide the same level of high availability but double the effective capacity of the platform, all through software.

"How come for the first version you used mirroring at all — was it just for simplicity's sake, to get it up and running?" Essentially. Mirroring is pretty easy to reason about: any piece of data I've replicated, I can read from the other copy as if it's an exact copy, because it is. The story gets a little more complicated when you have erasure coding — we can dive into that, maybe on the erasure coding slide.

"Is that upgrade non-disruptive?" It is not disruptive — it's just the default policy for new data creation, so as new snapshots are taken, all that data is written in Reed-Solomon format. Furthermore, the system has a background process to churn through and modify things, and we've discussed whether we want to make it the default to just transcode everything. Right now the system will write new data in erasure-coded format as it's running, but we don't make a wholesale pass as soon as you upgrade to convert everything — typically just because of the nature of the platform: the SLAs, the way snapshots or backups are taken and then eventually aged off. Most customers have fairly short policies — 30, maybe 60 or 90 days at most — so over that period of time the older copies get aged out and new snapshots replace them. All the new snapshots come in as erasure-coded copies, and the old ones simply expire and get garbage collected. "So customers are basically upgrading and getting more capacity eventually?" Yeah, that's right. And this is all built within the file system, so even our data management application — the thing that's dealing with snapshots, snapshot chains, and all this — has no idea that it's happened, which is kind of nice. It doesn't have to think about it; it's just using the file system as it normally was, but all of a sudden things are taking half as much space.
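A quick back-of-the-envelope check of the "double the effective capacity" claim, assuming triple mirroring (one usable copy out of three stored) versus a 4+2 erasure code (four usable chunks out of six stored); the 120 TB figure is just an example.

def usable_fraction_mirroring(copies=3):
    return 1 / copies

def usable_fraction_erasure(data_chunks=4, code_chunks=2):
    return data_chunks / (data_chunks + code_chunks)

raw_tb = 120                                          # raw capacity, made-up number
mirrored = raw_tb * usable_fraction_mirroring()       # 40 TB usable
erasure = raw_tb * usable_fraction_erasure()          # 80 TB usable
print(f"3x mirroring : {mirrored:.0f} TB usable")
print(f"4+2 erasure  : {erasure:.0f} TB usable  ({erasure / mirrored:.1f}x)")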
So the system is self-healing: anything written into the system is managed by the system. We rely on cluster management to publish node health information and disk health information, and we'll react to that — so if we see a node fail, we'll re-replicate the data that was stored on that node. I've got a great animation for you guys on that. Again, the placement is topology-aware, so the file system is choosing where it wants to replicate the data to, and it knows a little bit about the cluster — like which nodes are in which bricks — and you can extend that to which bricks are in which racks, which racks are in which data center. We don't have that ability yet, but the way we built it, it's a generic framework, so it could be extended that way. The other thing I'll comment on is that as the system grows and you add more appliances, more bricks, that topology awareness really allows us to increase availability as the system gets bigger. And finally, one note: we use CRCs for data integrity — we have CRCs all across the stack — and this protects against things like cosmic bit flips. If one of my replicas randomly goes bad from some stray beam in the universe, we'll discover that, throw it away, and re-replicate that data.

Okay, cool. So this is what the picture looks like with mirroring. You have your file — let's call this orange thing a chunk — and we actually store three copies of this chunk, spread across the system on different disks. Let's say a node on this third brick dies: the file system will notice it and just re-replicate to another disk somewhere. Now, in this picture you can see there are actually three bricks, and because the file system is topology-aware, it would actually place these replicas on different bricks. The same thing holds when an entire brick dies: you haven't lost all three replicas, you still have two left, and you do the same operation to move one wherever you have free space. And obviously, as the system grows, you have more and more choices of where you can put stuff, so you can increase your failure tolerance, as Rolland was saying. Does that make sense?

"If you lost a whole brick there, would you still copy one of those existing chunks to one of the two remaining bricks?" Yes. For example, if I lost this whole brick, then the chunk has lost one of its three replicas, so now it's going to have to force that copy. Another copy will already exist on one of the other bricks; however, it'll still put the new one on a different disk, in a different node, within the same brick. The cluster is adaptive: we value failure tolerance above spread across the topology, so we always try to get you fully replicated, because we don't want to be sitting with only two replicas left and our failure-tolerance guarantees unsatisfied. So we'll always try to create a new replica — the one exception is if you have exactly one node left, we won't put all three replicas on that one node.

"Okay, and if I later replace that brick, will it then in the background move those blocks back over?" Yes. As I said, the system is self-healing, and it also runs background maintenance; one of those operations is trying to increase the failure tolerance of existing — we call them stripes. So if I have a two-brick setup and I add a third brick, all of a sudden I can tolerate a brick failure if I'm able to spread my data across all three bricks, and the system will try to do that in the background.
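A simplified sketch of that self-healing loop — verify each replica's CRC, drop replicas that are corrupt or on dead nodes, and pick a replacement location that prefers a brick with no existing copy; the topology, liveness list, and helper names are all illustrative assumptions.

import zlib

topology = {                                     # brick -> nodes (toy example)
    "brick-1": ["n1", "n2", "n3", "n4"],
    "brick-2": ["n5", "n6", "n7", "n8"],
    "brick-3": ["n9", "n10", "n11", "n12"],
}

def brick_of(node):
    return next(b for b, members in topology.items() if node in members)

def healthy_replicas(expected, replicas, live_nodes):
    want = zlib.crc32(expected)
    return [(n, d) for n, d in replicas if n in live_nodes and zlib.crc32(d) == want]

def pick_new_node(existing, live_nodes):
    used_bricks = {brick_of(n) for n in existing}
    for node in live_nodes:                      # prefer a brick that holds no copy yet
        if node not in existing and brick_of(node) not in used_bricks:
            return node
    return next(n for n in live_nodes if n not in existing)

chunk = b"snapshot-block-7"
replicas = [("n1", chunk), ("n5", chunk), ("n9", b"bit-flipped!")]   # n9's copy is corrupt
live = ["n1", "n2", "n5", "n6", "n10", "n11"]                        # and n9 is down anyway
good = healthy_replicas(chunk, replicas, live)
while len(good) < 3:                             # restore the replication factor
    target = pick_new_node([n for n, _ in good], live)
    good.append((target, chunk))
print([n for n, _ in good])                      # ['n1', 'n5', 'n10'] — back on three bricks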
"Will you see the current progress of that rebuild going on when that happens?" So, certainly in the UI you would see that nodes are down, but you do not see, at the level of the file system, what sort of replication or maintenance operations are going on. We certainly have that information for our internal stats and reporting, so potentially you could get to it, but right now it's not exposed in the UI.

"With the addition of that third brick, does it rebalance right away, or is it something that's got to wait for midnight for a job?" No, there is no set schedule, but it is rate limited — based on what is going on in the cluster we'll hold things back; we don't just start dumping everything we can, as fast as we can, to the new brick. That's configurable from the file system's point of view, but again it's not a knob you can turn in the UI or anything, though certainly we could think about doing some of that. And sometimes a failure like this isn't really a failure of the hardware — it could just be that a rack switch went down and these nodes went offline but came back online a minute later, before we evacuated all the data. Hey, guess what, the data is still here, and we'll recognize that and act as if the failure never happened.

"Is it possible to put an entire brick into maintenance mode — like if you're going to take it offline and move it — so that it doesn't try to rebuild on other nodes?" The answer to that question is yes. We do have the notion of maintenance for a node, and when you put a node into maintenance we will not recover the data that you've made inaccessible, because we expect it to come back — so it's treated differently than being simply missing from the cluster. However, the file system has to green-light maintenance operations, and what that means is we won't give you the go-ahead signal for pulling a node or pulling a brick until, for any given piece of data, there's at least one replica somewhere else — so you won't make any data inaccessible by performing your maintenance.

"Is there a certain time period after which, if it's in maintenance mode and hasn't come back, it will trigger a rebuild?" I don't believe that's true, because we expect maintenance to be done with an operator — a human sort of babysitting it. It's certainly something we could think about doing.

"You said you have rebalancing rate limited — is there any way to lift that rate limit? Say I want to put something in maintenance because I'm going to be doing maintenance on it and have to take it down for whatever reason, and your processes haven't finished running, so now you won't let me take it out because it would violate some data guarantee. I don't care about that — I just want to move this data as quickly as possible so I can take this thing offline." So, certainly there are knobs, and we've thought about exposing a way to control how maintenance is scheduled. We do have what we call the maintenance manager, which is responsible for scheduling maintenance — when do I decide to recover, when do I decide to rebalance, how do I balance that against garbage collection, these sorts of things — so certainly we have the capability of tweaking those knobs; we just don't expose them right now.
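The "green-light" rule for maintenance described above — don't let a node drain until every piece of data has at least one reachable replica elsewhere — reduces to a simple check; the chunk map and function name below are hypothetical.

chunk_replicas = {
    "chunk-a": {"n1", "n5", "n9"},
    "chunk-b": {"n1", "n2", "n5"},
    "chunk-c": {"n2", "n6", "n9"},
}

def can_enter_maintenance(node, chunk_replicas, unavailable=frozenset()):
    """Approve maintenance only if no chunk would lose its last reachable replica."""
    for chunk, holders in chunk_replicas.items():
        remaining = holders - {node} - set(unavailable)
        if not remaining:
            return False, chunk                  # this chunk would go dark
    return True, None

print(can_enter_maintenance("n1", chunk_replicas))
# (True, None): every chunk still has a copy somewhere else
print(can_enter_maintenance("n1", chunk_replicas, unavailable={"n2", "n5"}))
# (False, 'chunk-b'): n1 would hold the last reachable copy of chunk-b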
This kind of gets into the area of design — user interface, simplicity, and ease of use. There are tons of knobs under the covers: how much of that do we want to expose, and how much do you do through the APIs? Things like these maintenance items aren't an everyday task that you're going to be managing, but it's definitely something you can work on with our support if you really need to get in there and do something because you have an exception — our support engineers can help with that. From a user-experience standpoint, we don't want to suddenly clutter the user interface with buttons that you hopefully don't actually use very often.

"To that comment, that's one of the things I wish it had, so that I could force-eliminate any type of throttling that's going on — because I have maintenance I have to do right now to fit into my maintenance window: flush all the data you can off of those nodes so that I can start pulling them out." So, two comments on that. The first is that for maintenance to be blocked, the last replica of a piece of data has to be on that node — so as soon as we've gotten one replica off of the thing you're trying to do maintenance on, you're good to go; it's not like you need to drain that node completely. The second thing is that we have a cluster ops team working intimately with the file system team to get feedback from the field on what the common operations are — whether it's moving a node, plugging it into a different rack, swapping a disk out, or whatever — so we're reacting to feedback we've seen from the field on what problems operators actually hit.

"Then can you manually kick off a rebuild? Let's say a node's in maintenance, I pull it out of the rack, drop it on the floor, light it on fire — it's not coming back online. Is there a manual process, or is that a support call?" Yeah, that sounds like a support call or something — Rubrik does not advocate setting them on fire; that's more of an anger-management thing. Okay, cool — any more questions on this? We can talk more, of course. Great questions. Okay.

So, a quick intro to Reed-Solomon, or erasure coding. This is what mirroring looks like: we have this orange thing — I call it a chunk — and we create three replicas of it and stick them on disks somewhere. Erasure coding looks a little bit different: you take the same piece of data, and this time you chop it into four, and we call those the data chunks. From those data chunks you create two code chunks, and with this configuration — four plus two — you can still tolerate losing any two of the chunks and be able to reconstruct the original four. I guess folks are familiar with erasure coding — it's not a new thing, we didn't invent it.
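To picture the 4+2 layout, here is a sketch that splits a chunk into four data chunks and reserves two code-chunk slots; the XOR parity is only a stand-in for one of them (real Reed-Solomon coding computes two independent code chunks over a Galois field so that any two of the six pieces are recoverable), so only the layout and the roughly 1.5x overhead are the point here.

def split_into_data_chunks(chunk: bytes, k=4):
    size = -(-len(chunk) // k)                   # ceil(len / k); last piece zero-padded
    return [chunk[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(k)]

def xor_parity(chunks):
    parity = bytearray(len(chunks[0]))
    for piece in chunks:
        for i, byte in enumerate(piece):
            parity[i] ^= byte
    return bytes(parity)

chunk = b"the immutable snapshot block we want to protect"
data_chunks = split_into_data_chunks(chunk)      # 4 data chunks
code_chunks = [xor_parity(data_chunks),          # stand-in for the first code chunk
               b"\0" * len(data_chunks[0])]      # placeholder for the second (real RS math omitted)
stripe = data_chunks + code_chunks               # 6 pieces, placed on 6 different disks

stored = sum(len(piece) for piece in stripe)
print(f"stripe stores ~{stored / len(chunk):.1f}x the data, versus 3.0x for mirroring")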
Okay, so application awareness. The file system was designed with our application in mind, and what our application does is take backups from sources and store those backups using what we call snapshot chains. A snapshot chain is built from immutable point-in-time snapshots, and what the file system allows you to do is instantly materialize any one of those points in time — and because we know they're related, we can actually co-locate their data to make things fast. It's important to point out that "immutable points in time" is really about the type of data we're taking copies of. It has to be immutable, because you can't come back later and say, "oh, I was able to modify that data through the system" — at least in the backup and recovery space, that data may be used in court somewhere. How do I know that this data was never modified, never changed? That was actually a design point for the platform, to make sure people understand: I can make modifications, I can make copies, but the original copy that I made — the hashes, the CRCs for it — is immutable. Once I've taken that copy, the data stays as a snapshot chunk within the system, completely unchanged.

Okay. So we have these immutable snapshot chains, but we still want you to be able to view a mutable version of a snapshot, so we use something called redirect-on-write, and we'll have a slide to show how that works. We also give our application control: we know what our application is, we listen to the things it needs, and we give it control to modify various properties of the files it's working on. For example, it can control the replication level of a file — if I'm creating some scratch file that I know I'm going to use for ten minutes and throw away, and I don't care if it dies in the middle because I can just rerun my job, you probably don't want to replicate it and spend the time spreading the data everywhere, CRCs and whatnot. Similarly, it controls the media type — do I want this file stored on flash or on disk — and there are also hints about placement: the application can tell us "I want this file replicated on these nodes or near these nodes," and we'll do our best to accommodate that. And because we wrote our own file system, we can provide the application second-order features that you couldn't get with some off-the-shelf thing. Just a few examples: TTL files — an application that wants to create a file doesn't have to worry about cleaning it up; it can create these special TTL files, which are cleaned up by the system on its own. Similarly, cancellation is just a way to make a file not writable anymore, which can cancel processes trying to write to it. Smaller things which just make our application's life easier.

The only thing I was going to add there is that those second-order features are what the tight coupling between the higher level — Cerebro, the data management applications — and the underlying file system really allows. By having that communication between the file system and the application, we can do a lot of really unique things and leverage the resources and resource pools below us: to increase performance, or to be able to set the time-to-live and say, "hey, I'm going to create this data, it's scratch data, but I don't want to have to send another command to go clean it up — I'm going to use it for about an hour, and after an hour let it phase out and be deleted as part of the platform's garbage collection." It's an interesting use case — maybe a little too deep — but it's something I found really tightly couples the higher-level application, data management and data protection, with the underlying file system.
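A hypothetical sketch of those per-file controls — replication level, media hint, and a TTL so scratch files simply age out under garbage collection; the attribute names and the create/gc_sweep helpers are invented for illustration, not the actual interface.

import time

class FileAttrs:
    def __init__(self, replication=3, media="disk", ttl_seconds=None):
        self.replication = replication
        self.media = media                   # "disk" or "flash"
        self.expires_at = time.time() + ttl_seconds if ttl_seconds else None

files = {}

def create(path, **attrs):
    files[path] = FileAttrs(**attrs)

def gc_sweep(now=None):
    now = now or time.time()
    for path in [p for p, a in files.items() if a.expires_at and a.expires_at <= now]:
        del files[path]                      # TTL file cleaned up by the system on its own

create("/ingest/vm-42/patch-file", replication=3, media="flash")
create("/tmp/scratch-merge", replication=1, media="disk", ttl_seconds=3600)
gc_sweep(now=time.time() + 7200)             # a couple of hours later, the scratch file is gone
print(sorted(files))                         # ['/ingest/vm-42/patch-file']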
Okay, cool — another animation, just to demonstrate how these snapshot chains work. This is probably a familiar picture for most folks. At time zero you take a full backup — you get the whole VMDK, the data — and from then on you're just taking incremental backups. So here these yellow blocks are just the delta blocks at time one since time zero, and this goes on through time: you keep taking more snapshots, just pulling in the delta blocks. Now say I actually want to use the snapshot at time three to do an export or restore. What I do is materialize a virtual file that represents the content of the entire VM or VMDK at time three — and the same goes for filesets and SQL databases as well; I'm using the VM use case here. So I've materialized a snapshot at time three, and that operation is just a metadata operation — there's no data movement, it's just pointers back to when each block was last mutated. Any questions on this picture?

Cool. Okay, so let's say now I actually want to live mount this VM and do some mutations on it. Same thing: we materialize this immutable snapshot chain, but any mutations now go to a log file — we just use this journaling system to soak up the mutations, and we're not actually modifying any of the content from the initial snapshots. And to extend on that: that guarantees that any one of these points in time is fully materialized and immutable, so I'm not making any changes to the original source data. But as I make new changes and write out new snapshots — for additional use cases, things like copy data management — if I want to spin up a clone of this, I can simply protect it and it'll create its own immutable snapshot chain, a new chain that starts off of that base point. So I can take a snapshot that I've already mounted and back it up again, and now it has its own data protection policies.
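The snapshot-chain mechanics just described — a full at time zero, deltas afterward, point-in-time materialization as a pure pointer lookup, and a redirect-on-write journal soaking up live-mount writes — can be sketched in a few lines; the block values and function names are toy stand-ins.

chain = [
    {0: "A0", 1: "B0", 2: "C0", 3: "D0"},    # t0: full backup of 4 blocks
    {1: "B1"},                               # t1: only block 1 changed
    {3: "D2"},                               # t2: only block 3 changed
    {0: "A3", 1: "B3"},                      # t3: blocks 0 and 1 changed
]

def materialize(chain, t):
    """Metadata-only view of time t: each block points at its latest write up to t."""
    view = {}
    for block in chain[0]:
        for delta in reversed(chain[:t + 1]):
            if block in delta:
                view[block] = delta[block]
                break
    return view

base = materialize(chain, 2)                 # e.g. an export or restore at t2
journal = {}                                 # redirect-on-write log for a live mount

def write(block, value):
    journal[block] = value                   # mutations never touch the immutable chain

def read(block):
    return journal.get(block, base[block])   # journal first, immutable chain second

write(2, "C-live")
print(materialize(chain, 2))                 # {0: 'A0', 1: 'B1', 2: 'C0', 3: 'D2'}
print(read(2), read(3))                      # C-live D2  (the chain itself is unchanged)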
Cool. So, performance — just a few things here. We have flash, and flash is like faster disk, so the application is smart enough to tell us when a file is performance-critical and we'll actually put it on flash on its behalf. Any unused capacity on the flash devices we'll soak up and use as a cache in which we put hot blocks, so that we're getting good utilization of our flash IO. Furthermore, not all IO streams are created equal in our world: we'd much rather prioritize an ingest of a VM that's holding up a VMware snapshot than some background maintenance job consolidating stuff, or that sort of operation. And finally, there's some awareness of data locality, so we do our best to place related data close to other related data, so that mounts can be fast.

Okay, final slide, I think — deployment. It's a single binary, which makes it really easy to deploy and upgrade, etc. There are no different server components; it's just a single binary which is the client and the metadata management and the maintenance. And I think this gets back to the slide at the beginning, where we talked about it being a single software stack across all the nodes in the cluster. Oftentimes different software services have to be installed on different application servers, and this creates the complexity — the erector set of backup and recovery, where you have this application server that you install this piece of software on, but this other server runs these other background services, so you have to install these other components of software. For us, the single binary allows us to deploy the same software equally across all the nodes in the cluster. And we leverage our core services: the file system uses the same metadata store the catalog is stored in — the data management application stores its metadata there, and there's no reason not to use that; it's distributed, fault tolerant, etc. And cluster management: the file system didn't have to build its own node-participation or cluster-participation mapping; we just rely on the cluster management service to see who's there, publish their status, and, when something dies, let us know so we can take action accordingly. Cool. Yeah, I just want to give a quick shout-out to the team — great job, guys, awesome stuff you've built.
Info
Channel: Tech Field Day
Views: 3,764
Rating: 4.6 out of 5
Keywords: Tech Field Day, TFD, Tech Field Day 12, TFD12, Rubrik, Adam Gee, Rolland Miller, Atlas, File systems, backups, recovery, architecture
Id: _bi4r-faCyQ
Length: 35min 36sec (2136 seconds)
Published: Fri Nov 18 2016