Cohesity Under the Covers: SpanFS

Captions
My name is Apurv — I hope you can hear me. Welcome. Some of you may have seen me earlier at Tech Field Day 16; I think Stephen was there, and maybe some of you too. I'm going to try to split the time roughly 50/50 between SpanFS and the application stuff, although the demo at the end will tie it together. How many of you have not seen this before? Not many. Okay, so let's start.

People ask Cohesity: you do backups, and now you can do apps, and now you do machine learning — what is the core DNA of the company? The core DNA is distributed systems, and the core strength starts with building a distributed file system, because you cannot go about solving data management or other issues until you have such a system. SpanFS is our file system. Cohesity is built on enterprise-class, off-the-shelf servers with direct-attached storage that we tie together to build this platform.

When you have a lot of nodes it's very hard to agree on something, so when you start building distributed systems the first thing you need is consensus. That's done by the layer marked here as the distributed lock manager. The distributed lock manager can achieve consensus among nodes, so for most decisions you can, say, elect a master who then makes decisions on behalf of everyone, and it can handle partitions and nodes going down.

The next thing we have is a distributed key-value store. There are lots of key-value stores out there, but when you're building a file system you need some specific semantics. For example, you want this store to be fully consistent: it should never happen that you write something and then, when you read it back, you don't get what you wrote. It's also replicated, because remember, everything we are building has to be highly available, fault tolerant, and resilient; it has to move the data should something fail, and this property is required everywhere. And of course we need cluster management software, so that when disks go bad we respond to it and new nodes can be added. This forms the basis. If you look inside Google you will probably find similarities: they have something called Chubby, which is their distributed lock manager, and they have various key-value stores. These are just the basics that we need.

Now let's look at some of the unique things that Cohesity does here. There's SnapTree. SnapTree is our distributed B+ tree, and it's our patented technology. This is what allows us to do very efficient cloning — you might have heard of it, and you will see in various places that we can keep making clones of clones, a long chain of clones, without impacting performance. That is enabled by this distributed B+ tree.

Then there is our transactional metadata store. If you are building a file system for the enterprise, it's not good enough to say it's eventually consistent. Just think of something as simple as a file rename. A rename is transactional, because it has to move the inode from one directory to another — remove it from one place and add it in another. Those are two different rows, and plain key-value stores don't support that atomically, so you have to build transactions on top of them. That's what we do in the transactional metadata store, and this is why file systems that are eventually consistent run into problems when things fail — we don't have those problems, because of this layer.
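To make the rename example concrete, here is a minimal Python sketch of the property a transactional metadata store provides: the two directory-entry rows must change together or not at all. The KVStore class and its commit() API are invented purely for illustration; they are not Cohesity's actual interfaces.

```python
# Minimal sketch of why a file rename needs transactions on top of a plain
# key-value store. KVStore and its API are hypothetical, for illustration only.
import threading

class KVStore:
    """Toy in-memory key-value store with an all-or-nothing batch commit."""
    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def get(self, key):
        return self._rows.get(key)

    def commit(self, puts, deletes):
        # Apply every mutation under one lock so readers never observe a state
        # where the inode exists in both directories or in neither.
        with self._lock:
            for key in deletes:
                self._rows.pop(key, None)
            self._rows.update(puts)

def rename(store, src_dir, dst_dir, name, inode_id):
    """Move a directory entry: two rows must change as one transaction."""
    src_key = f"dirent/{src_dir}/{name}"
    dst_key = f"dirent/{dst_dir}/{name}"
    store.commit(puts={dst_key: inode_id}, deletes=[src_key])

store = KVStore()
store.commit(puts={"dirent/dir-a/report.pdf": "inode-42"}, deletes=[])
rename(store, "dir-a", "dir-b", "report.pdf", "inode-42")
print(store.get("dirent/dir-b/report.pdf"))   # inode-42
print(store.get("dirent/dir-a/report.pdf"))   # None
```

Without the single atomic commit, a crash between the delete and the put would leave the file visible in neither directory — the failure mode the transcript attributes to eventually consistent designs.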
Then of course there's the data repository, and there are some unique things about it. One thing to remember is that it was built for the cloud from the ground up, which means the file system natively understands cloud — it was not bolted on later with a gateway in front. The file system itself can recognize what is cold data, and it follows a waterfall model: it moves that data down the tiers. So you have SSD, you have hard disks, and then the cloud tier, which we can treat as cheap capacity, so really cold data can move to that tier automatically. That's built into the file system; it's not something you have to manage yourself. And of course there is archival and other policy-based capability, which we'll touch on later today.

You will see more details on archival, but what we actually do is take the SnapTree and move a copy of it to the cloud — that's how the SnapTree and the data get archived. When you want to recover, say, one file that you have archived, you don't need to bring back the whole thing. We create what is called a stub volume; the stub volume has pointers to data in the cloud, and as you access it, only those bits are brought back (a sketch of this idea appears below). That makes it highly efficient, because if you look at cloud pricing, at least on AWS, ingress to the cloud is free — a smart move — and it's when you start to pull data out that you start paying. This minimizes that.

Then of course we have our data journal. When you're doing all this, moving data across tiers and to the cloud, you get a lot of benefits, but the one thing you probably don't get is very fast random-write performance. When we were designing this, we knew we would have to run applications, and many of those applications do random writes, so how do we speed that up? Even when we do virtual machine recovery, the way we do it is the VM is mounted on Cohesity as a datastore and then we do a vMotion. During that time the virtual machine is going to do I/O, and if you can't serve those random writes fast enough, performance will be terrible.

Can I ask a question on the design? How stretchable, how scalable is it, in the sense of how far can I stretch a cluster of the file system over distance? Can I geo-span it, or do I have to keep it tightly coupled within a single data center?

There are two aspects to scale-out. One is how many nodes it works across. The largest cluster a customer has deployed is on the order of three petabytes; the largest we have tested in the cloud is 256 nodes, which would make it, I would say, about six petabytes. Now, when you stretch it across geographies — and I will touch on this with the applications — there are different challenges. If the nodes are close by, it will work; to get acceptable performance within some distance you can use it as it is. But if you're going to stretch it across continents, which could mean 150 to 200 milliseconds of latency, it starts to depend on the kind of guarantees that you want.

So if you were designing for a large enterprise, you'd probably be putting clusters in many locations in order to achieve that?
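A rough sketch of the stub-volume idea described above, assuming a fixed chunk size and a placeholder fetch_from_cloud() standing in for a real ranged object-store GET: only the chunks a read actually touches are brought back from the cloud.

```python
# Hypothetical sketch of a stub volume: the local stub keeps only pointers
# (object name + offset) per fixed-size chunk, and a read pulls just the
# chunks it touches. fetch_from_cloud() is a stand-in for a real ranged GET.
CHUNK = 4 * 1024 * 1024  # 4 MiB chunks, an assumed granularity

def fetch_from_cloud(object_name, offset, length):
    # Placeholder: in reality this would be a ranged GET against object storage.
    return b"\0" * length

class StubFile:
    def __init__(self, chunk_map):
        # chunk_map: chunk index -> (cloud object, offset within that object)
        self.chunk_map = chunk_map
        self.cache = {}  # chunks already brought back locally

    def read(self, offset, length):
        data = bytearray()
        end = offset + length
        while offset < end:
            idx = offset // CHUNK
            if idx not in self.cache:  # only touched chunks incur egress cost
                obj, obj_off = self.chunk_map[idx]
                self.cache[idx] = fetch_from_cloud(obj, obj_off, CHUNK)
            within = offset % CHUNK
            take = min(CHUNK - within, end - offset)
            data += self.cache[idx][within:within + take]
            offset += take
        return bytes(data)

# Reading 8 KiB out of a large archive touches a single chunk, not the whole file.
stub = StubFile({i: (f"archive-part-{i}", 0) for i in range(1000)})
print(len(stub.read(10 * CHUNK + 100, 8192)))
```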
That's what the term "replication" here refers to — it is showing different clusters. We actually have customers — I don't think I can name them, but one of our customers has about 30 locations worldwide in different geographies, and they have different replication topologies set up.

But back to the question: if you have a stretched cluster across multiple data centers that are close together, how do you manage the split-brain problem? Do you need at least three data centers? How does that work?

As of now we need at least three data centers to make it work. Some customers have asked, and the way to make it work with two — which is the common case, by the way, a stretched cluster across two sites — would be to have something like a cloud witness, and I don't think we have done that work yet. But when we look at our target customers, they are happy with slightly asynchronous replication, and when people are happy with asynchronous replication there are different techniques we can use. With our SnapTree we know how to diff, and in our data repository we know how to transfer the data efficiently, WAN-optimized, so we can replicate data from A to B and onward from there (see the sketch below). Those kinds of workflows have been in use for, I don't know, the last three years.

When doing that replication, is there still centralized management across multiple sites?

Yes, that is our Helios layer, and we have a Helios demo — Helios will show you how you manage clusters that are distinct. Sometimes it is better for locality reasons to have separate clusters, but there's a bunch of management, like upgrading them together, that you can do across them.

Sorry, just to get an idea: you mentioned the biggest customer has a three-petabyte cluster — is that a single cluster? — Yes, the single cluster is three petabytes. — So on average, how big are these clusters? I suppose a good part of them also use something like S3 for offloading some of their data, but what is the average size?

The average size would be, let me calculate, about 300 terabytes. So 300 terabytes, with some data going to the cloud, is the average. But understand that the distribution is bimodal: big enterprises have very, very large deployments with very large clusters, and a bunch of customers may buy just six to eight nodes. If I look at the median, it's about eight nodes, and eight nodes for us would be about two hundred terabytes. The average is higher because there are lots of big enterprise customers with large deployments.

And do these customers usually expand the cluster when they need to, or do they take advantage of S3 for that?

It depends on the workload. Say they are using it as a backup target; then they need more performance as well — it's not just about storage. If they want more headroom they just add more nodes, and I don't know if we have demoed it to you, but adding a node is as simple as rack and stack: you click one button and it's assimilated. In the background we will rebalance some data, but it's operational for you right away.
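A hedged sketch of the asynchronous, diff-based replication idea mentioned above: if each snapshot is represented as a map from block number to content fingerprint, replicating the newer snapshot only requires shipping the blocks whose fingerprints changed. The dictionary representation and the send_block callback are illustrative stand-ins, not the real SnapTree diff machinery.

```python
# Snapshot-diff replication sketch: only changed or new blocks cross the WAN.
def diff(prev_snapshot, curr_snapshot):
    """Return block numbers that are new or changed since the previous snapshot."""
    return [blk for blk, fp in curr_snapshot.items() if prev_snapshot.get(blk) != fp]

def replicate(prev_snapshot, curr_snapshot, send_block):
    for blk in diff(prev_snapshot, curr_snapshot):
        send_block(blk)  # ship only the delta to the remote cluster

snap1 = {0: "a1", 1: "b2", 2: "c3"}
snap2 = {0: "a1", 1: "b9", 2: "c3", 3: "d4"}   # one block rewritten, one appended
sent = []
replicate(snap1, snap2, sent.append)
print(sent)  # [1, 3] -- two blocks cross the wire, not the whole volume
```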
Do you support nodes of different sizes and different generations?

We have to, by design, because hardware gets deprecated — especially the motherboards, every three years or so; Intel has a very fast cycle — and our customers are not going to throw away old hardware if it still works. So we do allow it: we switched from the C2500 to the C2600 series, and you can put them together.

What's the throughput of a four-node system, or an eight-node system?

It depends on the workload. One thing about the file system is that it is highly configurable: you can configure it to do inline deduplication, to not do inline deduplication, or to do no dedup at all — that's what we call NDD. When you create a storage domain and set it to NDD, I think a four-node cluster can do in excess of 500 megabytes per second, but it really depends. If you are doing 4K random I/O, then — I don't know if I'm supposed to tell you this, and we can get you exact numbers afterwards — we like to think the data journal can do about 45,000 4K random writes per second. That is without any specialized hardware; we have no NVRAM, and remember that every write is committed to two nodes, so it at least takes that hop over the network.

Tying all these things together — the B+ trees, the transactional metadata store, the data repository, the data journal — that is our SpanFS. That's the file system. But people don't consume a file system directly; they need to consume it in standardized ways, and that's where our portals layer comes in. The same underlying file system supports SMB and NFS and S3, and I don't know of other systems that honestly support SMB and NFS simultaneously.

So this is what we had, plus a bunch of other pieces: the healer is a layer that keeps healing the system and doing background work like rebalancing. One thing worth noting is that we built QoS into the system fairly early on, so that we can give different priorities to different traffic: when an I/O enters, we tag it with a QoS class and carry that tag all the way down to the disk, giving it its fair share of CPU and disk (a toy illustration follows below). I'll come back to why this was important. This is what we had — I think these are the slides from my Tech Field Day 16 talk.

And you've been improving it? — Yes, file systems are tough, you have to keep adding to them, but now we've added something else on top, and you can see it here: applications.

So here's the problem. We have this distributed file system — what does it solve? It solves mass data fragmentation: data is not sitting in different silos, and you have one place where you can see everything that's there and what is taking how much space; you can do more efficient copy data management. In fact you solve some security problems, because if you let data proliferation happen you don't know who has access or how access has been recursively granted, whereas here you have it in one place. That's a big part of the value of solving mass data fragmentation. We have done the consolidation, but now comes the question of applications: what do you do with this data? You are consuming it, but there are certain use cases for which there is a lot of value in bringing the compute to the data.
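A toy illustration of the QoS tagging described above: each I/O is tagged with a class when it enters the system, and a weighted scheduler drains the per-class queues so background work gets bandwidth without starving latency-sensitive traffic. The class names and weights here are made up for the example.

```python
# Weighted fair scheduling of tagged I/Os -- illustrative only.
from collections import deque
import itertools

WEIGHTS = {"datastore": 4, "backup": 2, "background": 1}  # assumed priorities

class QosScheduler:
    def __init__(self, weights):
        self.weights = weights
        self.queues = {cls: deque() for cls in weights}

    def submit(self, io, qos_class):
        self.queues[qos_class].append(io)   # the tag travels with the I/O

    def drain(self):
        """Yield I/Os in weighted round-robin order."""
        while any(self.queues.values()):
            for cls, weight in self.weights.items():
                for _ in range(weight):
                    if self.queues[cls]:
                        yield cls, self.queues[cls].popleft()

sched = QosScheduler(WEIGHTS)
for i in range(3):
    sched.submit(f"vm-write-{i}", "datastore")
    sched.submit(f"backup-write-{i}", "backup")
    sched.submit(f"rebalance-{i}", "background")
# The latency-sensitive class is served first and most often.
print([io for _, io in itertools.islice(sched.drain(), 6)])
```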
I'll pick two examples from the industry. If you are familiar with databases, you know stored procedures, and they can be much more efficient because they avoid a lot of back-and-forth data movement: you take your logic, put it on the database, and run it there. And Google did MapReduce: you specify the functions you want to apply to the data and push them to the node where the data is, so you don't have to do massive data movement. So there's value in bringing compute to the data. There are cases when you have to move data to compute, but we'll skip that for now.

So we have our application layer, and the application layer again needed to be distributed and fault tolerant, because if a node goes down or something happens, users should not be affected — the applications have to move. Fortunately the world has evolved to where there are good open-source tools here, so our applications today run in Docker containers orchestrated by Kubernetes.

Let's look at some of the third-party applications we have on the platform today. We have Splunk: you have data, you are pulling a bunch of logs from application servers, they are sitting here, and you want to gain insight — you can run this Splunk application. Siddhartha is going to demo it, but it's as simple as click, click, click and it's up and running on the data set that you want; I don't think it takes more than six or seven clicks.

There's a marketplace for distributing these — can we put our own homegrown applications into that Kubernetes environment, as containers?

As of now you have to be certified by the Cohesity marketplace before you can go there, but you never know what the future holds.

Sorry — Cohesity supports both file system access and block access? Do the apps get block?

You never know what the future might hold. I'm trying to find the right words — I can't talk too much about forward-looking things, and I can't confirm or deny. As of now it uses the file system.

I have two questions. One is: the Analytics Workbench platform that you showed at the beginning — is it going to disappear from this scheme?

So that was the prototype — you were there at Tech Field Day 16 — we tried a different computation model earlier; we tried to bring in the MapReduce model —

No, no, I just want to point out we actually have an entire section on the app stuff at the end, so just to keep things rolling, we have a whole block we can dedicate to that. I know you want to ask questions.

Very quickly, then: we tried a different computation model, the MapReduce model, and MapReduce was good for sifting through large amounts of data, but it did not allow you to build persistent applications. If you wanted to index the data and then search it multiple times, it did not fit that model. This is a better model and we are moving towards it, so the MapReduce model still coexists today, but it may go away.

And then we will stop with the questions, I promise: the marketplace that you mentioned — is it also a system to get standard packaging and be sure that everything is certified?

It is certified and it is signed: you download a signed package, we verify it, and only then do we run it. I have another question, but I'll leave that for Siddhartha to answer at the end, about the apps.
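A simplified, hypothetical sketch of the "verify before run" flow for marketplace packages: the download is checked against a trusted digest and refused otherwise. A real marketplace would verify a public-key signature rather than a pinned hash; this only illustrates the gatekeeping step, and the app name and digest table are invented.

```python
# Verify a downloaded app package before launching it -- illustrative only.
import hashlib
import hmac

TRUSTED_DIGESTS = {
    # app name -> sha256 of the certified package.
    # The placeholder value is sha256(b"test") so the example below runs.
    "splunk-app": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
}

def verify_package(app_name, package_bytes):
    expected = TRUSTED_DIGESTS.get(app_name)
    if expected is None:
        return False  # not a certified app: do not run it
    actual = hashlib.sha256(package_bytes).hexdigest()
    return hmac.compare_digest(actual, expected)  # constant-time comparison

def launch(app_name, package_bytes):
    if not verify_package(app_name, package_bytes):
        raise RuntimeError(f"{app_name}: package failed verification, refusing to run")
    print(f"{app_name}: verified, handing off to the container runtime")

launch("splunk-app", b"test")
```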
So we have other third-party apps, and we also have — I think more than two, we have four — in-house apps, and at least one of them is interesting to talk about: Cohesity Insight. This is the app that can search your data, index it, and then keep the index. We actually had customers — I can't name them, but let's say they handle postal services — and very rarely a letter would get misplaced; they do a fantastic job, but when it happened, that was the only time they came to the logs and said, "I want to find where this went," and they were using the Analytics Workbench for that. A lot of other customers want to do more: it's not that they have one query to sift through the data; they want the index so they can keep doing more, and now with the Insight app they can do that. And of course our DataProtect product is again an application running on top. That's how the architecture looks today, and it is evolving; hopefully next year I can share even more.

I find it interesting that the first third-party app mentioned in your store is Splunk — it seems so appropriate from that perspective.

Yes, that's what our customers are asking for, and that's what I'm recommending.
Info
Channel: Tech Field Day
Views: 2,844
Rating: 5 out of 5
Id: FpLjHRCp-Jc
Length: 22min 49sec (1369 seconds)
Published: Fri Mar 01 2019