Cohesity Data Platform Deep Dive

Video Statistics and Information

Captions
Okay, so my name is Johnny Chan, I'm the head of engineering here at Cohesity. I joined the company last November, and prior to that I was at Google for 11 years. I worked on the ads backend and ads fraud detection, and prior to coming here I was in Google Research working on natural language processing.

So Mohit talked about secondary storage and the issues with secondary storage: fragmentation, silos, the copy data management problem. He also talked about the requirements, what it takes to consolidate that. For the remainder of this talk I'll talk about how to translate those requirements into an implementation. This is the agenda of the technical talks: first I will talk about our scale-out distributed architecture, the file system, how we handle mixed workloads, and our adaptive self-healer. Then I'll hand it off, and the next part will cover the architecture and SnapTree, how we do integrated backups and DevOps workflows, and finally we'll talk about analytics.

So I have a question, just from the first moment I see this 2U machine: you have eight CPUs and just 24 disks on the front end, and you said just 12 disks on the front end, the 3.5-inch ones? Okay, well, it doesn't look very dense, and secondary storage actually needs to be very dense. So let's talk about this, because this is something I don't understand. Okay, so this is a 2U chassis, four nodes per chassis. This is one node, with dual CPUs, memory, SSD, 10-gig Ethernet, and we have twelve 8 TB hard drives. Yeah, but the CPU-to-capacity ratio is not... I think, first of all, the appearances are misleading: it's 96 terabytes of hard drive storage and some 6 terabytes of SSD inside. Second, I think the CPUs are a little bit more because, remember, this is supposed to do mixed workloads, not just backups. We are also doing analytics on this, we're also doing DevOps on this, so there is compute needed on this platform; that's why we have a little bit more compute than storage. And remember also that this can scale infinitely.

So, for example, if I have a Hadoop workload of some kind, can I run the Hadoop job on the cluster? We can potentially run a VM inside, which is on our roadmap going forward, and then run Hadoop in that. But today we have in-place analytics, which we're going to talk about, and there we run other kinds of analytics; let's defer that to the later part of this talk.

To reinforce Enrico's point: there are primary storage systems designed to deliver very high performance with less CPU per gigabyte, or less CPU per spindle, than we're talking about here. You guys are providing a lot of compute for the amount of storage you have. I also want to say that I think you're going to confuse the market by using the term secondary storage. For those of us who are practitioners, primary storage is where data is created and secondary storage is copies, and you're really saying tier-two applications, which is a different thing, and if we start mixing those metaphors it's just going to lead to confusion. Is it really for both, though, Howard? When you look at what their use cases are: when you talk about backup and recovery, that's copies of data; when you talk about the DevOps types of stuff, that's copies of data; but when you start looking at some other things they had on the list... Yeah, I was leaving those for when they got to them.
So let me take that, Howard, and thank you for that question. We actually love the fact that people get confused by this, because we actually want to redefine secondary storage, so that people don't think of it as just backups; people should think that this can actually run some compute also. That's one. So you're looking forward to the heavy lifting; more power to you. Second, I think what matters to customers is not how much compute is in there; what matters to them is the actual value we can deliver. And I can tell you that a dumb old secondary storage device that just does backups, combined with some other vendor from whom you buy backup software, compared to this device, which is far more functional: we are actually significantly cheaper. So it's the value that matters, not what's in there; that's what customers care about. Okay, thank you.

Okay, so I'm going to talk about OASIS, which is our Open Architecture for Scalable Intelligent Storage. First, on our hardware platform, as we talked about, we have SSDs and hard drives, and on top of that is OASIS, the distributed file system. As mentioned, building a distributed file system is not easy; if it were easy, everybody would have done it. The reason it's hard is that these nodes are in a shared-nothing architecture, they can come in and out of the cluster, they can go up and down, and so the software has to be extremely intelligent to coordinate all these nodes and make them act as one coherent and consistent file system.

So let's look at what's inside, the key elements in our file system. The first is the distributed lock manager; this is used to do coordination and leader election. Second is a distributed NoSQL store; this is our distributed key-value store, it's a strongly consistent key-value store, and we built it in-house. A layer below that is the metadata manager; this is the file system layer that manages directory structures and inodes. The blob store manages file-level data. The data journal is an SSD-backed journal to absorb random I/O. The disk manager manages writing to the physical drives. And the self-healer is a background process that does things like garbage collection.
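To make the leader-election role of the lock manager concrete, here is a minimal Python sketch of lease-style election. The in-process Lock, node ids, and timings are illustrative stand-ins, not Cohesity's implementation; in the real system the lock would be held through the cluster-wide distributed lock manager described above.

```python
import threading, time

# Stand-in for the cluster-wide distributed lock manager: whoever holds this
# lock acts as the leader and coordinates work for the cluster.
leader_lock = threading.Lock()
current_leader = None

def campaign(node_id: str, lease_seconds: float = 2.0) -> None:
    """Every node runs this loop; the node holding the lock acts as leader."""
    global current_leader
    while True:
        if leader_lock.acquire(timeout=0.5):       # try to take the leadership lock
            try:
                current_leader = node_id           # this node now coordinates cluster work
                time.sleep(lease_seconds)          # hold the lease while doing leader duties
            finally:
                current_leader = None
                leader_lock.release()              # step down so another node can lead
        else:
            time.sleep(0.5)                        # someone else is leader; retry later

# Example: threading.Thread(target=campaign, args=("node-1",), daemon=True).start()
```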
On top of that we have the applications that Mohit talked about. Data protection, which is backup and restore and retention policies. DevOps: we expose NFS and SMB interfaces, so DevOps can use us for their purposes and as a filer, with versions of both NFSv3 and SMB 3. And also analytics, which is what we'll talk about later to eliminate dark data.

Okay, so a little bit about our file system. First, it is a shared-nothing architecture, and it's a strongly consistent file system. This is not like Amazon Dynamo, which is an eventually consistent system; the reason it has to be strongly consistent is that enterprises need that kind of guarantee, and you can't run a relational database on Dynamo. And there are no bottlenecks anywhere in the system; this is why we can do true scale-out, we can scale infinitely. We'll show you a 32-node cluster; we could go bigger, but we didn't want to buy that many machines just for this demo. And of course, Mohit mentioned incremental scale, so you pay as you grow. How big is that 32-node system, 96 terabytes times eight, so maybe a petabyte? That's all you've tested to? You're the one with 70 million dollars.

As Mohit mentioned, this has to be highly available: if a node goes down, the system has to keep running. He also mentioned non-disruptive upgrades: you can do kernel upgrades and firmware upgrades and the file system will keep working without skipping a beat. And do you use erasure coding or mirrored data? We have a replication factor of two.

Sorry, can you just go back to the point about being infinitely scalable: how do you make sure the network itself doesn't become the bottleneck for the platform from a traffic perspective? We talked about 32 nodes, and it's easy to get to that on a single switch, but once you start to scale out to 64 or 128 nodes, what happens then? Basically, you buy high-performing switches to make sure there are no network bottlenecks. We do dual 10 GbE per node, and it's switched Ethernet, so it's not like all the traffic is going through some bottleneck in the network: when two nodes are talking to each other, that's independent of two other nodes talking to each other. As long as they're on the same switch you're not saturating an ISL, so I guess your uplinks to your spine, or whatever your network design is, are where your potential bottlenecks might be in the east-west traffic. Do you have the ability to, for example, try to keep traffic between multiple nodes on the same switches, and therefore reduce the traffic that goes via a distribution or spine layer? Today we don't do anything special there; in the future we probably will do something special for the network as well, but today you can imagine just plugging it into your 10-gig switch and making sure there's plenty of networking. We do peer-to-peer, and that's what we have today. Is it switches you sell with the product, or is it the customer's network? If customers ask, we can sell either Arista or Cumulus switches along with the product, but some customers already have 10-gig environments, so they don't need that; most of our customers actually have 10-gig environments. Okay. And because of the architecture you can scale linearly in capacity and performance? Yes, as you add more nodes you add capacity and you also add performance.

Okay, so that was the scale-out distributed file system; now let me talk about how we support mixed workloads. Before you go on: do all the nodes have to be the same, or can I intermingle whatever comes next in hardware? You can; this is the heterogeneous hardware requirement that we set for ourselves. As mentioned, we want to have converged secondary storage, and this means we need to handle multiple, concurrent workloads: backup running at the same time some test and dev is being spun up, at the same time it's being used as a filer, at the same time we're doing some analytics. To handle those mixed workloads, every workload puts different requirements on a storage system, and if you think through what's required it boils down to three basic things. One, we have to do metadata operations really well: handling small files, dealing with directory structures, creating files, deleting files, and so forth. Two, we have to do data operations really well: for sequential I/O, high sustained throughput; for random I/O, high IOPS. And three, we must have high-performance isolation, so that one huge backup job doesn't completely starve the other workloads. Any questions on this?
Okay, so let me move on to file metadata operations. So, having data protection and other secondary storage in the same appliance: normally you want data protection on a device that's different from where the data is; that's why you have primary storage and secondary storage. But if I'm doing data analytics and file services and other things on this device, and I'm also doing data protection on this device, don't you think there's an additional risk there? No, and a few things on that. One is, when we run these additional workloads we actually clone the data, so you never modify the existing file. I'm okay with you isolating the data on the system; the problem is, if there's a problem with the system, you've lost the data. We have remote replication coming, so we can remote-replicate. But if a bug in your file system decides to delete all the data, and all of your data is on 27 systems in 14 locations with your file system, that bug affects all of them; it's a genetic-diversity problem, it's the classic... So, first of all, we do remote replication, and yes, it's done to another Cohesity cluster, but we also do cloud archival, which is done on technologies like Google Nearline that don't run our software. So if, yes, we introduce a bug that erases data on lots and lots of data centers, we can always download it from the archive; that's one. Second, I think maybe your question was more around whether, since this is non-mission-critical data, you really need an active copy of it. That's where we feel the primary data runs on primary storage, protected by us, and whatever runs here is non-mission-critical, so it's okay to either remotely replicate it or put it in the cloud.

Okay, so when you guys are talking about file services, you're talking about file services so that I can spin up copies of VMs and use them for dev and test, not that all of my users' home directories are going to be on the system, right? It depends on how you define mission-critical. We have customers who want to put their home directories on us; they feel they can live with some downtime, and remember, with remote replication and cloud archival that risk kind of gets minimized. So it depends on how our customers define that, but yes, if home directories are paramount and you need them up all the time, then you would probably put them on the primary system. Well, it's not like there aren't customers who do snap-and-replicate; they do that today. Whether they should or shouldn't is irrelevant, they do it, and there's a risk in doing that, but if the company is 10 or 15 years old and has 10,000 machines in the field, the risk is somewhat minimal; if the company is four years old and has, I don't know, a thousand machines in the field, the risk is somewhat higher. I think you could say the same thing for primary storage companies. That's why, the fact that we don't trust anything is why you back up to something completely different, from a different vendor than what your source was. That's the cloud for us. For us storage guys, companies go out of business; that's why I think we all trust Google a little bit, and that's why you use Google. I'm a steely-eyed storage guy, I trust no one. Questions?

Okay, so let's talk about metadata operations. In this example an app wants to create a file, and its client says, hey, go create a file on my Cohesity appliance. So the client asks the DNS server, give me the IP of the appliance. Before going through the details, let me talk a bit about how we do load balancing and fault tolerance using VIPs. When we set up a cluster, say a ten-node cluster, each node gets an IP and also a virtual IP, and we register those virtual IPs with the DNS for the cluster. When the client asks the DNS for an IP for this host, the DNS does round robin and hands out a virtual IP that belongs to one of these nodes; this is how we do load balancing across nodes. Now, when a node fails, say this client's target node fails because somebody pulled the plug on it, we have a heartbeat mechanism that detects that the node has failed, and the virtual IP then migrates to another node. This is how we do fault tolerance and load balancing. Question on that: is the NFS client your own design? No, this is just a normal, standard client. And the virtual IP, is that provided by you, or do you rely on something else? It's a Linux-provided virtual interface, so you use ifconfig to set it up.
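Here is a minimal sketch of the round-robin VIP handout and heartbeat-driven failover just described. The node names, addresses, timeout, and data structures are illustrative assumptions, not the actual implementation.

```python
import itertools, time

# Hypothetical cluster state: each node has a physical IP plus one virtual IP (VIP);
# DNS hands the VIPs out round robin, and heartbeats decide which nodes are alive.
node_ips = {"node1": "10.0.0.1", "node2": "10.0.0.2", "node3": "10.0.0.3"}
vip_owner = {"10.0.1.1": "node1", "10.0.1.2": "node2", "10.0.1.3": "node3"}
last_heartbeat = {n: time.time() for n in node_ips}

_rr = itertools.cycle(sorted(vip_owner))

def dns_lookup(_hostname: str) -> str:
    """Round-robin DNS: successive lookups spread clients across the VIPs."""
    return next(_rr)

def failover(timeout: float = 5.0) -> None:
    """Heartbeat check: reassign any VIP whose owner has gone silent."""
    now = time.time()
    alive = [n for n in node_ips if now - last_heartbeat[n] < timeout]
    for vip, owner in vip_owner.items():
        if owner not in alive and alive:
            vip_owner[vip] = alive[0]   # a surviving node brings up this VIP too
```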
Okay, now let's look inside the box to see how we actually do the metadata operation. A create request comes in and gets serviced by an NFS server, which forwards the request to the metadata manager; this manages the file system and the inodes. To create a file you have to do two operations: create the file, and then put the file in a directory. The metadata for that is stored in the distributed NoSQL store that I mentioned earlier, and because these two operations have to happen all or nothing, you need a transactional property, and we get that with a two-phase commit protocol. Once the two-phase commit completes, it is acknowledged back to the NFS server, and that goes back to the client saying the file has been created. Notice that because we have two-phase commit, if there is a failure in any of the steps, the operations roll back. What's interesting here is that this is purely distributed: there's no single central database, no single node that controls all the metadata operations. It's purely distributed, and this is how we can get true scale-out.
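As a sketch of the all-or-nothing create just described (the new inode plus the directory entry committed together, or not at all), here is a minimal two-phase commit in Python. The Shard class, in-memory staging, and the create_file signature are illustrative assumptions, not the actual NoSQL store.

```python
class Shard:
    """Illustrative stand-in for one shard of the distributed NoSQL store."""
    def __init__(self):
        self.staged, self.data = {}, {}
    def prepare(self, key, value):
        if key in self.data:                    # e.g. the name already exists: vote no
            return False
        self.staged[key] = value                # phase 1: stage the write, vote yes
        return True
    def commit(self, key):
        self.data[key] = self.staged.pop(key)   # phase 2: make the staged write durable
    def rollback(self, key):
        self.staged.pop(key, None)              # abort: discard anything staged

def create_file(inode_shard, dir_shard, inode_id, inode, parent_dir, name):
    """Create the inode and the directory entry together, or not at all."""
    ops = [(inode_shard, inode_id, inode),
           (dir_shard, (parent_dir, name), inode_id)]
    if all(shard.prepare(key, value) for shard, key, value in ops):
        for shard, key, _ in ops:
            shard.commit(key)                   # both participants voted yes
        return True
    for shard, key, _ in ops:
        shard.rollback(key)                     # any failure rolls both back
    return False

# Example: create_file(Shard(), Shard(), "inode-42", {"size": 0}, "/vms", "a.vmdk")
```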
Okay, so now that you have the file created and you want to write data to it, let's go to the data path. Incoming data comes into the system, and the first thing we do is determine whether it is random I/O or sequential I/O. If it is random I/O, we give it to a distributed journal, which is backed by the SSDs: we write to the journal and return success. This is our fast path for random I/O. If we determine it is sequential I/O, we give it to our blob store, which manages the blocks on disk. One of the things it does is chunk up the incoming data into variable-size chunks; these chunks are our units of deduplication, and we use Rabin fingerprinting to determine the boundaries of the chunks so that we get maximum deduplication. For each of these chunks we ask the global dedup map: does this chunk exist on any of our nodes, anywhere in the cluster? If it returns true, the chunk already exists and we don't write it. If not, we give the chunk to the disk manager, which writes it to either the SSD or the HDD, and then we update the global map to say, hey, we have a new chunk and it's located here. And once in a while we flush the journal onto the blob store. So that's the write path.
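To make the chunk-and-dedup step concrete, here is a small Python sketch: content-defined chunking with a toy boundary hash standing in for Rabin fingerprinting, sized to the 8 KB to 16 KB granularity mentioned at the end of the talk, and a dictionary standing in for the global dedup map keyed by SHA-1 fingerprints. All of it is illustrative, not the production data path.

```python
import hashlib

# 8 KB to 16 KB matches the dedup granularity mentioned later in the talk;
# the boundary test below is a toy hash, not real Rabin fingerprinting.
MIN_CHUNK, MAX_CHUNK = 8 * 1024, 16 * 1024
BOUNDARY_MASK = (1 << 12) - 1                    # boundary roughly every 4 KB past MIN

def chunk(data: bytes) -> list[bytes]:
    """Cut data into variable-size, content-defined chunks."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h * 31) + byte) & 0xFFFFFFFF       # toy content-dependent hash
        length = i - start + 1
        if length >= MAX_CHUNK or (length >= MIN_CHUNK and (h & BOUNDARY_MASK) == 0):
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

dedup_map: dict[str, str] = {}                   # stand-in for the global dedup map

def write_sequential(data: bytes, node: str) -> int:
    """Return how many chunks were actually new and had to be written."""
    new = 0
    for c in chunk(data):
        fp = hashlib.sha1(c).hexdigest()         # 20-byte SHA-1 fingerprint per chunk
        if fp in dedup_map:
            continue                             # chunk already exists somewhere: skip it
        dedup_map[fp] = node                     # record where the new chunk lives
        new += 1                                 # (real path: hand chunk to disk manager)
    return new
```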
On the read path, if you want to read a block, we first check the journal; if it's in the journal we return it, and if not we look at the hard drive or the SSD and return the data. And one thing we do that is very unique is up-tiering: if we detect that a block is being used heavily, either reads or writes, we move it to the SSD, so now it is serviced from the faster SSD. So when the disk manager writes, if the data is hot and it's on SSD, it writes to SSD, and if not, it writes to the hard drive. That's not really unique, that's what every hybrid vendor does. So what's unique about it, good question, is our patent-pending tier-optimized write scheme. When we write to the HDD, the hard drive, we write out of place, so if you have lots of random writes we sequentialize them and write out of place; but if the data is hot and on SSD, we actually write in place. What this does is give us really good performance: writing out of place in a sequential fashion on disk is obviously very efficient, and if the data is hot you always want to write it on SSD, because that's just much faster. So it's a log-structured layout on the HDD and in-place block placement on the SSD? Yeah, it's log-structured, sure. And the other benefit is that you get much less fragmentation, or garbage: imagine you're writing a block all the time, and if you've got a hard drive you keep writing new blocks and what you would have overwritten becomes garbage. So if you don't use write-in-place on SSD, you just create a lot of garbage as you write those blocks on the hard drive. But by the time you get past the SSD controller you're creating garbage anyway: even if you're writing the same LBA on the SSD you're creating garbage, because you can't overwrite in flash. Right, that's the internals of flash, but from our metadata point of view we don't have garbage; from our file system's perspective there is no garbage, you just have flash-level garbage, and the flash controller takes care of that. Well, some of them better than others. True. Okay, any questions on this?

Jumping back to the previous slide: what hash are you using for dedup, and are you just doing one single hash? We use the SHA-1 hash, and I think we use 20-byte fingerprints. And you assume that a duplicate hash means duplicate data? Mathematically, I think we're more likely to get killed by an asteroid in this meeting than to see a collision. SHA-1 is collidable now, actually. Yeah, but intentionally constructed collisions and deduplication collisions are completely unrelated problems, so the probability is very, very low. Okay.

All right, the last thing I'll talk about with regard to mixed workloads is performance isolation. If you have mixed workloads coming into the system, say you have a huge backup job and now you want to spin up a test and dev environment, you don't want that test and dev to suffer because of the backup job. So what we do is we have user-defined priorities for user-defined workloads: you can say for this workload the QoS is high priority, and for this workload the QoS is low priority, and we map that to a proportional resource allocation. The way to think about it is, if you had ten things to do, you'd say, okay, for the high-priority workload I will do seven of your tasks, and three of the lower-priority tasks. And we have granular QoS throughout the system.

To show you how this works: you saw this picture before, these are the key components of the system. Just for illustrative purposes, let's say we have three queues here going into the disk manager: a data protection queue, a DevOps queue, and an analytics queue. In this scenario a huge backup job comes in; it's dumping 20 terabytes and just saturating the system, so all these queues fill up here, and the queue starts filling up here. Now somebody says, hey, I want a test and dev environment, I want to debug some problem, so you spin up the VM and start running I/O to it. If you didn't have QoS, those writes coming in would get queued behind all the other write requests from the data protection job. But you have QoS, and you set this one to high priority, so you essentially let it jump to the front of the queue, you let it cut in line, and now you can have responsive DevOps at the same time as the data protection runs as fast as it can. So what are the knobs I get to turn for QoS? I'll show you this in the UI; we want to make it really simple, so basically we say, is it high or low, and is it random or sequential. The architecture actually supports doing proportional-share scheduling, so we can actually assign weights; our UI right now is very simple, high or low, that sort of stuff, and we'll wait for feedback from our customers if more is needed. The underlying system is actually more complex than what I just showed you. One should only hope. Yes, but on the UI we try to make it as simple as possible, and then we'll see what the customer feedback is.
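As a sketch of the proportional-share idea (roughly the "seven high-priority tasks for every three low-priority tasks" example above), here is a minimal weighted picker over two illustrative queues in front of the disk manager. The queue names and the 7:3 weights come from the example; everything else is an assumption, not the real scheduler.

```python
import random
from collections import deque

# Two illustrative priority queues feeding the disk manager; 7:3 mirrors the
# "seven high-priority tasks for every three low-priority tasks" example.
queues = {"high": deque(), "low": deque()}
weights = {"high": 7, "low": 3}

def submit(request, priority: str = "low") -> None:
    queues[priority].append(request)

def next_request():
    """Weighted pick: high priority gets most turns, low priority is never starved."""
    ready = [p for p in queues if queues[p]]
    if not ready:
        return None
    chosen = random.choices(ready, weights=[weights[p] for p in ready])[0]
    return queues[chosen].popleft()

# Example: a 20 TB backup floods the "low" queue while one test/dev write sits in
# "high"; the high request wins the next dispatch with probability 7/10.
```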
What's the greatest number of files you've put into one of your systems, how many billions? I don't know the answer. We have loaded more than a few hundred terabytes, but I don't know how many files; probably close to several hundred million, but not at the billion level. I think the bulk of our tests have been more on VMware backups, so that's pulling more data rather than pulling lots of files. But we scale; we're not limited to a few thousand files or a few tens of thousands of files. Again, I'm talking billions of files; in small to mid-size shops, in things like bioimaging, that data exists, and it would be secondary storage. There is nothing in our architecture that prevents us from creating a billion files; testing it is another thing, so we haven't tested that, and that's a good point, it's something we will do, but we have loaded over several hundred terabytes into our system.

Okay, so the last thing is the self-healer. In traditional, and actually modern, file systems a lot of operations are done offline. Even in the UNIX file system, when we delete a file the operation returns quickly, but the data blocks are reclaimed at a later time; it's called garbage collection. And in a distributed system you have garbage collection and a lot more things you have to do in terms of healing your system. So our garbage collection, or self-healer, is a distributed process based on the MapReduce framework. It does things like garbage collection, and it also does things like disk rebalancing: for example, if I add some new nodes to the cluster, I want to distribute the data onto those new nodes, which gives me higher read and write performance, or if a node goes down, I need to re-replicate the data that was on that node onto the other nodes. The other thing we do with the self-healer is post-process actions: for example, if you turn on encryption or compression on a file system, we don't need to do it immediately, we can do it in the background. And of course the system has to be fault tolerant: if a node goes down we can't just say, hey, sorry, I can't heal the system; it has to continue, so even if a node goes down we will heal the other nodes that are still up. And just as important, it's continuous and it runs at a low QoS, so the healing is always running in the background at low priority: if there are user jobs running, it takes lower precedence, and if the system is idle, it ramps up.

This is actually very important. We have a client that has a lot of data and a very popular target storage vendor, and he tells us that they run backups six days a week; on Wednesdays it's garbage collection day, so they actually stop all their jobs for garbage collection on Wednesday and then continue from there. For us that's not acceptable: as a converged solution we can't tell the business guy you don't get analytics on Wednesdays, or tell the DevOps person you can't do any development on Wednesdays. For us it has to operate all the time. Okay, so this concludes my section. Any questions?

On your deduplication, am I right in saying you can do inline or post-process? What would be the different use cases, and if you're using inline, is there a point where, say under high load, it would actually fail over to becoming post-process anyway? So you would want to do post-process, or sometimes no dedup at all, if you're writing to a test and dev environment; there's no need to dedup that, because you're going to overwrite it and throw it all away. And you're doing I/O to spinning disks, and I/O to deduped spinning disks is no fun at all because it's all random. Well, if it's random it hits our journal, which absorbs it. No, but if I'm trying to read the deduped data sequentially, the disk drives still see random I/O. Yeah, the difference is that if the working set fits in your SSD, we will have moved that stuff into the SSD because of our up-tiering. But like Johnny said, we do both inline as well as post-process dedup. Very likely for test and dev it makes more sense to do post-process, because you're probably also overwriting the same thing again and again, and you don't want to deduplicate it, rewrite it, deduplicate it, and rewrite it again; so you probably want to use post-process. But when it comes to backups it's your choice; you can use inline or post-process, and it's just as efficient. And what's the dedup granularity? It's 8 K to 16 K, that's the range of the variable chunk size, and the post-process dedup is done by the self-healing process that I just spoke about.
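To illustrate the MapReduce-style garbage collection the self-healer was described as doing, here is a minimal mark-and-sweep sketch in Python: a map step emits the chunks each file references, a reduce step unions them into a live set, and the sweep drops unreferenced chunks. The tables, chunk ids, and single-process execution are illustrative assumptions; the real self-healer runs as a distributed process and at low QoS priority.

```python
from functools import reduce

# Hypothetical metadata: which chunks each file references, and the chunk store
# itself (chunk "c9" is orphaned and should be reclaimed).
file_table = {"vm1.vmdk": ["c1", "c2"], "db.bak": ["c2", "c3"]}
chunk_store = {"c1": b"...", "c2": b"...", "c3": b"...", "c9": b"..."}

def map_refs(item):
    _name, chunk_ids = item
    return set(chunk_ids)                        # map: chunks referenced by one file

def union(a, b):
    return a | b                                 # reduce: merge reference sets

def collect_garbage():
    live = reduce(union, map(map_refs, file_table.items()), set())
    for cid in list(chunk_store):
        if cid not in live:
            del chunk_store[cid]                 # sweep: reclaim unreferenced chunks

collect_garbage()                                # afterwards, "c9" is gone
```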
Info
Channel: Tech Field Day
Views: 10,817
Rating: 4.8974357 out of 5
Keywords: Tech Field Day, Storage Field Day, Storage Field Day 8, SFD8, Cohesity
Id: rowWqLOYplQ
Length: 32min 18sec (1938 seconds)
Published: Thu Oct 22 2015