Rubrik Atlas File System: Designed to be Masterless, Self-healing, and Cloud-scale

Captions
Chris: Hi, I'm Chris Wahl, chief technologist at Rubrik, and welcome to another engineering deep dive. Today I'm joined by Adam. Adam, you were a tech lead at Google for the search infrastructure team, and you were also working on the file system Colossus, so obviously some pretty big-name things out there. How did that experience prepare you to build Atlas for Rubrik?

Adam: Yeah, I think at Google one of the most important things I learned was how to build systems at scale. When you're building a system that's running on thousands of nodes, hardware failures are a matter of course, so you have to design the system to anticipate those failures and deal with them when they happen. I think a lot of the same principles apply to Atlas, which is also a cloud-scale file system. We expect the same types of environments and the same types of failure profiles, so we built the system with similar principles in mind. Working on these distributed systems problems is really fun for me, and it's been a great journey so far.

Chris: All right, I've got to admit I have no idea how to build a file system. So my question to you is, what was that experience like? But more importantly, why build your own from scratch versus picking something up off the shelf for Atlas?

Adam: Yeah, so Atlas was written from scratch, and we had a couple of different goals in mind. The first, of course, is that it be scale-out. So it's masterless: there's no single point of failure or single choke point for performance. We want the system to scale linearly with the number of nodes that you add. We also wanted it to be fault tolerant. Of course, in these large environments we expect hardware failures, so the system will notice when something goes wrong and up-replicate whatever data was lost.

Chris: Okay, so you took the time and built it from scratch. Those sound like great things. But I also recall that we're aware of what application we're backing up; there's some awareness there. I was also hoping you could dive a little deeper on data integrity.

Adam: Right, of course. Another one of the benefits of writing it from scratch is that we were able to optimize for our application, and our application is the data management application. So Atlas, the file system, understands that it's storing snapshot chains, which is what the blob engine uses to back its snapshots. This allows us to do things like know that these snapshots are immutable and protect them with data integrity checks. We have CRCs up and down the stack. We're validating in the background that nothing has changed, and we're also validating on the read path. If we ever see an anomaly, we can just throw it away and let the system up-replicate. This allows your application to be confident that once data is written, it has never changed.

Chris: And so I assume that has some kind of effect on ransomware, because immutability tends to be the key criterion for making sure the data can't be encrypted and harmed, and that kind of jazz?

Adam: Right. Our application can identify where the ransomware took place and then roll back to the prior snapshot. And of course they're confident that the snapshot they've written into Atlas has not been changed and therefore does not have the ransomware.

Chris: All right, so here's a little twist for you. Everyone else that I'm working with in the data protection space is pretty much narrowly focused on ingest: "we can eat this much data per second," that kind of jazz. What did you do with Atlas to focus on the restore? Because I know there's a lot of instant recovery, that kind of jazz, out there. What's fueling that from a technology perspective?

Adam: One of the benefits of writing it from scratch is that we can customize it for our application, so Atlas has features built in to support the blob engine and the snapshot chains. One of these benefits is the ability to do a zero-copy recovery, which allows us to be essentially instantaneous.

Chris: That makes sense. When I talk to people in the field and our customers, they're really jazzed up about the instant recovery feature, the fact that they can build a workload, like you said, in a matter of seconds, sometimes less than one second. So how does that work exactly? Because I know with a lot of other vendors it usually takes minutes, hours, that kind of jazz, to put it together. Here we're pretty much consistently talking about a second or less to build a workload from a backup. What's the technology behind that?

Adam: Right. Because Atlas understands snapshot chains, an instant recovery is simply a metadata operation, and this can happen very fast. No data actually has to move, and then the reads are served on the fly by merging the snapshot chain.

Chris: So does it matter if it's the most recent backup, or can you do this for weeks-old, months-old, whatever old backup data we have within the appliance?

Adam: Again, it doesn't matter which snapshot we're talking about. It's always a snapshot chain to Atlas, and the instant recovery operation is exactly the same.

Chris: Okay, let's bring it all home. We've talked about instant recovery, something that people love and use every day. We've talked about a scale-out file system that's resilient and self-healing, things like that. For the enterprise, how does that align with their messaging and what they're trying to do within enterprise IT?

Adam: Ultimately, we want the customer never to have to know or worry about Atlas. It should just scale, it should be fault tolerant, and it should be performant.

Chris: I like that answer. So that's been another episode of the engineering deep dive. Adam, thank you very much for joining me today.

Adam: Thank you.

[Music]
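Editor's illustration: the interview describes instant recovery as a metadata-only operation over a snapshot chain, with reads served by merging the chain and CRCs validated on the read path. The Python sketch below is one way to picture that idea, not Rubrik's actual implementation. Every name in it (Block, Snapshot, RecoveredView, instant_recover) is hypothetical, and it assumes a simplified model in which the first snapshot is a full copy and each later snapshot holds only the blocks that changed.

```python
import zlib
from dataclasses import dataclass


class CorruptBlockError(Exception):
    """Raised when a block's contents no longer match its stored CRC."""


@dataclass(frozen=True)
class Block:
    """Immutable block of data stored alongside its CRC32 (hypothetical model)."""
    data: bytes
    crc: int

    @classmethod
    def create(cls, data: bytes) -> "Block":
        # Compute the checksum once, at write time.
        return cls(data, zlib.crc32(data))

    def read(self) -> bytes:
        # Read-path validation: recompute the CRC and compare before returning.
        if zlib.crc32(self.data) != self.crc:
            raise CorruptBlockError("CRC mismatch on read")
        return self.data


@dataclass(frozen=True)
class Snapshot:
    """One link in a chain: maps block index -> Block for changed blocks only."""
    blocks: dict


@dataclass(frozen=True)
class RecoveredView:
    """A recovered image is just references to existing snapshots."""
    chain: tuple  # ordered oldest -> newest

    def read_block(self, index: int) -> bytes:
        # Merge on the fly: the newest snapshot containing the block wins.
        for snapshot in reversed(self.chain):
            if index in snapshot.blocks:
                return snapshot.blocks[index].read()
        raise KeyError(index)


def instant_recover(chain, upto: int) -> RecoveredView:
    """Build a view of the chain up to and including snapshot `upto`.

    Only references are collected; no block data is copied, so the cost
    does not depend on how much data the snapshots hold.
    """
    return RecoveredView(tuple(chain[: upto + 1]))


# Usage: a full snapshot followed by one incremental snapshot.
full = Snapshot({0: Block.create(b"aaaa"), 1: Block.create(b"bbbb")})
incremental = Snapshot({1: Block.create(b"BBBB")})
chain = [full, incremental]

older_view = instant_recover(chain, 0)   # recover the first backup
latest_view = instant_recover(chain, 1)  # recover the most recent backup
```

In this model, recovering an old snapshot and recovering the newest one are the same operation, which matches the point made in the interview that it does not matter which snapshot is chosen. A real system would also need to handle the re-replication step after a CRC failure, which this sketch leaves out.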
Info
Channel: Rubrik
Views: 4,405
Rating: 5 out of 5
Keywords: rubrik, rubrik data management, backup, backup and recovery, data protection, ransomware, ransomware recovery, instant recovery, rubrik ransomware, atlas file system, rubrik file system, rubrik atlas
Id: eCCKzV39cSs
Length: 4min 51sec (291 seconds)
Published: Wed Nov 22 2017