Dell EMC PowerMax and Machine Learning with Vince Westin

Captions
This is Storage Field Day 16. I'm Vince Westin, Technical Evangelist for PowerMax. We're glad to have you here. All right, so let's talk about machine learning, and whatever we want to call AI, in the context of storage.

We see a lot of things at the server level where people are doing machine learning and artificial intelligence applications. They can use low latency, they can use lots of bandwidth — low latency and high bandwidth is exactly what they're looking for. We want to be able to provide that; we think PowerMax is a great source for that kind of data. That's all we're really going to say about the servers. Our focus here is going to be on the array side. On the array side we've got analytics and optimization that we want to do in the little brain here inside PowerMax, to make our system smart, to make it work better. Again, our goal is to work smart, not hard — we want to get the most we can out of our hardware. And then we also have CloudIQ for extended analytics, where we have the ability to pass information up to the cloud and let our cloud repository look at things. You have Susan Sharp coming in after me to talk in depth about CloudIQ, so I'll just make a few references to it, but I'll let Susan really explain all the details of what we're doing with CloudIQ.

So how are we doing this, what are we doing, why are we doing it? What are we doing: we're implementing some machine learning within PowerMax to serve our I/O requests with the lowest possible latency. Pretty simple — it's all about latency; how do we drive that down? What are we doing to do that: we're looking at fine-grained workload profiles to understand the nature of the workloads being run on the system, to be able to optimize for the likely future workload that we're going to see. It's all about predicting the future. It's easy to predict the past; it's very difficult to predict the future. I'm pretty good at predicting the past. And then why are we doing this: again, it's all about optimizing around the configurations — how do we help you get the most value out of the hardware you're buying?

So what are we doing? We're looking at workload characteristics: read, write, copy, whatever's going on; block size — 8, 16, 64K, or whatever; where are you on the LUN or the logical volume — where are you in the block address space within that; locality of reference, spatially in the LUN — what tends to be happening together; and looking at patterns across days and across regions of the I/O profiles, to see what's going on within the system. So we have both locality of reference in terms of addresses, and temporal locality in terms of time — how frequent is it?

And if you look at the heat map version, this kind of says to you: there's a whole bunch of this data that's not doing much, these things have little peaks, and this has some really big peaks. And if you look at this, there's a certain pattern here — what do you guess the really hot parts on this might be? Maybe backups? Yeah, right — it happens two days out of every seven. Backups on the weekends. Amazing to think that so many of our systems still get slammed harder by backups than by anything we do in production, but that's the way a lot of things run in the real world.

So what's the opportunity here? We want to look for workload characteristics — what's the workload, what's the size, how are the locations spaced, what's the frequency of the repeat — and then we want to do optimization for that.
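To make the profiling idea concrete, here is a minimal Python sketch of building a per-extent feature vector and spotting a "two hot days out of seven" backup pattern. The field names, sample shapes, and thresholds are invented for illustration; the real PowerMax telemetry and detectors are far richer.

```python
from collections import Counter
import statistics

def feature_vector(samples):
    """Summarize I/O samples for one extent into a coarse profile.

    Each sample is a dict like {"op": "read", "size_kb": 8, "lba": 1024}.
    Hypothetical field names, not the actual telemetry schema.
    """
    ops = Counter(s["op"] for s in samples)
    sizes = [s["size_kb"] for s in samples]
    lbas = sorted(s["lba"] for s in samples)
    # Spatial locality: median gap between successive block addresses.
    gaps = [b - a for a, b in zip(lbas, lbas[1:])] or [0]
    return {
        "read_pct": ops["read"] / max(len(samples), 1),
        "median_size_kb": statistics.median(sizes) if sizes else 0,
        "median_lba_gap": statistics.median(gaps),
        "iops": len(samples),
    }

def weekly_peaks(daily_iops):
    """Flag a 'two hot days out of seven' pattern (e.g. weekend backups).

    daily_iops: per-day I/O counts, oldest first (~30 days retained).
    """
    if len(daily_iops) < 14:
        return False
    mean = statistics.mean(daily_iops)
    hot = [i % 7 for i, v in enumerate(daily_iops) if v > 2 * mean]
    # Hot days landing on the same one or two weekdays suggests a weekly cycle.
    return 0 < len(set(hot)) <= 2 and len(hot) >= 2
```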
What do we reduce, what do we not reduce? What would benefit from being on storage class memory — that's SCM; the acronym flipped me for a loop. What would benefit from being on NAND? What would benefit from being tiered out to the cloud at some point, if it's really been ignored for months? We're looking at some of those kinds of things; we're actually building some things to tier to the cloud. The only challenge — do you remember when we rolled out the SATA drives for tiering within VMAX? We did that many, many moons ago. The challenge with that is, when you tier things down, it does a great job of freeing up space in the hot tier. The problem is if you ever go to use it, because the response times are awful — but at least the bandwidth is low. I mean, it's just really bad all the way around. So we've got to figure out how to make sure we don't do that with anything that's anywhere near active. Optimization of this stuff is critical.

So what are we doing? We're capturing stats across the controllers, across the volumes, across the system. We're storing lots of data — for a typical 200-terabyte array it's 40 million samples per array per day. And we have to manage and groom the statistics and the history: how do you track that over 30 days, how do you map it all without keeping every bit of the performance data you've got, and how do we analyze all that? What we want to do is build an infrastructure that helps us use it. We started with service levels for, you know, the old mixed drives; then we said, okay, we can optimize based on space efficiency for VMAX All Flash; and now we're saying, well, we're going to add service levels, so we can, you know, start slowing things down on bronze and silver — how do we manage things best to the service levels? And then how do we really apply that as we mix in the storage class memory and such?

So it gets kind of complicated, right? You collect all this data from all the LUNs — what am I doing for I/O, where is it, what's it hitting — and then I go back and do multiple variations of: what are the patterns we'll find? And I try to really describe the features of what I'm seeing at a more advanced level, to understand the types of workloads I've got. Feed that into my forecasting engine, which then decides how to rank the different pieces and where I'm going to store them. Pass some of that off to CloudIQ, so CloudIQ knows what's going on and what we're doing, where our priorities are and what we're seeing for patterns. Then run that down, go activate it and make it work, push some of the results back into the forecast, take the results from there and mix them in — rinse and repeat. This is all on-array, right — a little brain sitting here on the array, chunka chunka chunka.

So are you looking at NVIDIA GPUs to do your ML? We've actually figured out how to do this without needing a whole lot of extra stuff. We're not trying to do full-blown AI and go figure out how to solve the world. This is more of an expert-system kind of thing. An expert system's goal is to say: I look at A, B, and C, I match it with a pattern, I go do D. So you don't need the same kind of computation as if you're really doing artificial intelligence and learning new things — that needs more intense stuff. We're not doing that. That's why we're specifically trying to call this machine learning.
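A toy version of the expert-system matching just described — look at A, B, and C, match it with a pattern, go do D. The rules and thresholds are made up for illustration (they build on the hypothetical feature vector sketched earlier); they are not PowerMax's actual rules.

```python
# Minimal expert-system-style matcher: first rule whose predicate
# matches the profile wins. Rule bodies are invented for illustration.

RULES = [
    (lambda f: f["read_pct"] > 0.9 and f["median_size_kb"] <= 16,
     "promote: small-block read-hot, keep uncompressed on fast media"),
    (lambda f: f["iops"] < 10,
     "demote: cold, compress (cloud-tier candidate if idle for months)"),
    (lambda f: f["median_lba_gap"] <= 1,
     "prefetch: sequential stream detected"),
]

def decide(profile):
    """Return the first matching action, else a default."""
    for predicate, action in RULES:
        if predicate(profile):
            return action
    return "leave as-is"
```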
We feel that's fair for what we're doing. We could try to call it AI; we think that would be a stretch of what we're doing. Well, it's a known problem domain. Yeah — we know all of the inputs. We don't know what real workloads are, but we know how they look, and we know enough about it that it fits into a fairly defined class. Now, there are things that are outliers that we aren't going to get, and if we added real AI, could we figure some of those out? Probably. Because we're not looking at trends that go over 24 months or 12 months, we don't really know what it's like when you have annual cycles that come around — your end-of-year burst, we don't see that. We're keeping things 30 days or so, and that's it, so quarterly and annual cycles don't really mean anything to us. But we don't think we'd get that much value out of it right now, so we're trying to work on the small stuff first.

And then you can add other intelligence outside that. The other option is, since CloudIQ is already in the cloud and has access to more stuff, you could say: gee, I've got some arrays that are really busy; I could take CloudIQ and have it go look at some of the data more intensely and say, I think you have this other kind of pattern laid on top of this. And it could at some point start feeding data back in — by the way, plan for, or think about, this possible addition to your future description. But we believe that's going to be better served up in CloudIQ. Makes sense, right? Because it's got more processing power, it's more dispersed, and it can also see more workloads. When you have Susan come in, she'll talk some about how CloudIQ sees lots of things — not only the arrays in your data center and across your data centers, but across our entire installed environment. So we can look for patterns and say, hey, there's something really interesting going on everywhere, or on this set of 30 frames or whatever — they may belong to different customers — and we can go figure things out.

Hey Vince, real quick on the forecasting: if you're only going back, you said, what, 30 days — how can you actually properly forecast if you're not seeing the past performance for longer? Say I'm a tax shop and I just happen to install the new system in June. You're not going to see my performance, or the impact on my business that comes around annually; you're not going to know what I'm going to do in April. Absolutely. And you know what — when you come around to the next April, we still won't know what you're going to do in April, because we forgot about it 11 months ago. The trick is that the time to run through this cycle is measured in minutes to hours, not days. It's an all-flash system, so in the bad case, if we've done it wrong, you're compressed on NVMe all-flash — it's not a far drop. It's not like you're sitting on 8-terabyte SATA tubs in the corner and suddenly I'm trying to do 10,000 reads from a drive that does 100 a second. It's just not the same class of problem. So yeah, it's not perfect — but again, it's not trying to be perfect; it's trying to be reasonably intelligent for most things.

Are you using this to determine what tier of storage to put a particular LUN on, or something like that? Yeah — right now we're deciding between compressed and uncompressed.
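One common way to forecast from a bounded rolling window like the ~30 days described above is to keep exponentially weighted short- and long-horizon averages per extent, which need constant memory no matter how much history flows past. A minimal sketch, with illustrative smoothing constants that are not PowerMax's:

```python
class TrendTracker:
    """Short- and long-horizon load averages with O(1) memory per extent."""

    def __init__(self, short_alpha=0.3, long_alpha=0.01):
        self.short_alpha = short_alpha  # reacts within hours
        self.long_alpha = long_alpha    # reacts over weeks
        self.short = 0.0
        self.long = 0.0

    def update(self, iops_sample):
        self.short += self.short_alpha * (iops_sample - self.short)
        self.long += self.long_alpha * (iops_sample - self.long)

    def heating_up(self):
        # Short-term average well above long-term suggests a burst in progress.
        return self.short > 1.5 * self.long
```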
And the next thing will be: if I'm uncompressed, do I go on SCM, or do I just sit uncompressed on NAND? So you're not making real-time cache decisions based on this; you're more or less tiering the data — where does it get stored on the back end? In the old days you would have had some sort of map of hot data versus cold data, somehow distinguishing it almost on a track-by-track basis, or whatever your granularity was. You're not doing that anymore? This is all track-level granularity. In the old days, before machine learning and AI, you would have had some sort of table that you would try to maintain — hotness versus coldness on a track-by-track basis. You're not doing that anymore, because you're using this? Well, we're doing that — we're doing both. This helps you decide. You have the combination of: how hot am I now, how hot was I four hours ago, how hot was I two weeks or four weeks ago — what are my long-term and short-term trend averages — and which feature am I a part of? Because I may not be that hot today or yesterday, but if I've seen lots of bursts in this workload, I may decide it's worth tiering up anyway. So the goal here in the ranking is more than just the temperature of the data; it's the overall feature depiction. And sometimes you promote things just because the tracks around them have been somewhat hot in the recent past, and I'd rather keep them together, because it'll get me more out of the system where the drives are.

We're also managing the front-end queues. As I/O is coming into the box, if we see that a particular workload is missing its service level — it's supposed to be sub-millisecond and it's tripped over a millisecond — what we're going to do is say: okay, you're missing where your performance target is supposed to be, so I'm going to shift around the queues on the front end so I can get more of your I/O serviced faster, and get you back into compliance with the performance service level. That's just one example — it's not completely isolated to the back end; there are lots of things happening within the front end, the cache, the back end, etcetera.

We also talked about service levels. If you have a bronze service level, the odds of you not being compressed and deduped at every stage are almost zero, because you don't have any priority — we'd rather push you to the back end and minimize you, because you're always going to be the slow guy. We don't care; you never get any priority in here at all. Whereas if you're diamond, even if you're the same speed as somebody who's gold in terms of I/O repetitiveness — the same frequency of access — you will wind up getting a higher priority, a higher ranking in the system, because of your service level. So it's a little more complicated than just look up in a table and go — not a lot more complicated, but making the table is a little more complicated. And then you just rinse and repeat.

Quick question: this feature here — is this all custom-developed by Dell EMC? This is long-term custom development by our PowerMax engineering. So you're not leaning on a third-party provider for the ML functionality at all? Well, again, this isn't dramatic AI. We're looking at trending, and we've got guys who've been doing this for longer than I've been here — and I've been here 23 years. A lot of this is IP that's been patented over the last several years.
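A toy scoring function in the spirit of the ranking just described, mixing current heat, heat four hours and weeks ago, burst history, neighbor heat, and a service-level weight. All weights and level multipliers are invented for illustration:

```python
# Hypothetical service-level multipliers; a diamond extent outranks a gold
# one at equal access frequency, and bronze is actively deprioritized.
SERVICE_WEIGHT = {"diamond": 2.0, "platinum": 1.6, "gold": 1.3,
                  "silver": 1.0, "bronze": 0.5}

def promotion_score(now, hours_ago_4, weeks_ago_2, bursts_seen,
                    neighbor_heat, service_level):
    base = 0.5 * now + 0.3 * hours_ago_4 + 0.2 * weeks_ago_2
    base += 0.1 * bursts_seen    # bursty extents earn promotion credit
    base += 0.1 * neighbor_heat  # keep hot neighborhoods together
    return base * SERVICE_WEIGHT.get(service_level, 1.0)
```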
Some of it is from the last five years, and some of it from the last 30. A lot of this is stuff that's very unique to this implementation, and what we do is very different from what almost anything else in the world does, so using some outside code probably wouldn't help us much. And we also need it all very optimized to run in the confines of what we're doing without overloading — again, you're looking at millions and millions of records every hour, appended to the preceding ones.

And feeding CloudIQ — that's through the call-home interface, right? So this is going to go through the SRS gateway? Yep. This is currently in beta, or early test we can call it — or early access is the proper wording, and I think that's on the next slide. And this is part of the current support agreements that people have? Yes, the SRS gateway. We'd been pulling data back into a system we call SYR that allowed our internal guys to go look at things, and it didn't tell the customer anything. So we had some interesting information about your system, and the customer could come ask their sales team, but that didn't really help the customer. Then the Unity guys developed CloudIQ, and they said, hey, we've got this great way to go analyze data on the arrays, and the VMAX guys said, ooh, that helps us — we want to join you. And so Dell has kind of spread this out, so now CloudIQ — again, you're going to get a whole talk on this, you're going to have 45 minutes to an hour on what we're doing with CloudIQ and how we're expanding it to cover all the storage platforms, because we've decided it's just a great idea to make it easy for you all to get this kind of access and intelligence across all your platforms. And just to be clear, there is no license charge for this; it's included.

And how does this complement, for customers who are running SRM? SRM is designed for your local view of where things are being used; this is more of a system-level view of health and all. Again, we're going to get into all the CloudIQ pieces in the next talk. Good — I think that'll make it clear.

All right, so this is proven machine learning — active in thousands of systems today, two exabytes under management already, 325 billion sub-LUN segments being forecast in real time. I mean, this is not something that's just getting started. We've been doing this for tiering across drives; it's something we've been working on for years. We have new ways of describing it now, using the terminology around machine learning and all, but it's what we've been doing — we're now properly describing it. And we think it's going to be really useful for helping us optimize storage class memory, because storage class memory has limited capacity in the near term, and a very high expense in the near term, so we want to extract absolutely every bit of value we can out of the storage class memory devices.

You say the cycle for that is, like, on the order of minutes? In general we're doing a cycle of the analysis every 10 minutes. We're doing real-time data collection, so we may adjust things — if we find something suddenly getting very hot, it may decide to adjust faster than that — but we adjust the templates of what we're doing every 10 minutes in the frame.
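A sketch of that cadence — continuous collection, an immediate fast path when something suddenly gets hot, and a full re-plan every 10 minutes. The callables and the spike threshold are hypothetical:

```python
import time

ANALYSIS_INTERVAL_S = 600  # ~10-minute planning cycle, per the talk
HOT_SPIKE_RATIO = 3.0      # illustrative threshold for the fast path

def run_optimizer(collect_stats, replan, fast_promote):
    """collect_stats() -> {extent: iops}; replan and fast_promote are
    hypothetical hooks into the ranking/forecast machinery."""
    last_plan = 0.0
    baseline = {}
    while True:
        stats = collect_stats()            # real-time sample, per extent
        for extent, iops in stats.items():
            prev = baseline.get(extent, iops)
            if prev > 0 and iops > HOT_SPIKE_RATIO * prev:
                fast_promote(extent)       # don't wait for the next cycle
            baseline[extent] = iops
        if time.monotonic() - last_plan >= ANALYSIS_INTERVAL_S:
            replan(baseline)               # full ranking/forecast pass
            last_plan = time.monotonic()
        time.sleep(1)
```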
Vince, a different question — I don't know if it falls under the machine learning aspect, but do you have auto-provisioning for apps within PowerMax? Sure — what kind of provisioning? Who has access? Via the APIs, if we need a certain I/O throughput for an app, can I define that automatically and have it provisioned for that application? You can set host I/O limits — how much throughput you need. It doesn't matter which hypervisor you're on; you can go build that easily. I don't know if the hypervisors can describe the LUN that way, or if they do that, but we've got the VSI plug-ins that work with ESX — I'm assuming most people are on 6 or 6.5, but it goes all the way back to 3.5 or 4. We've been doing integration with VMware forever — gee, we've got kind of a partnership going on there; don't know how that would happen. But we also do it with Hyper-V. We've got work going on around, you know, what's going on with Docker and all those things, and we've got the ability to use Chef and Puppet. And we'll talk about the whole management thing coming up, so let's talk about it there. Thank you. Yep.

All right, so now we're going to jump into a few minutes on security. When you look around your house, you don't have one way to defend things in your house — you have multiple layers. Not everything belongs in the safe; if you locked everything in the safe, you couldn't use it very well. So the very valuable things that you use least often, you keep in the safe. Similarly, we do security in layers in storage. We have to protect everything, so it all gets locked up with D@RE in general, but then around that we start doing things like secure snaps, user management through role-based access controls, remote support, and then FIPS compliance and all those kinds of things for certifications, to make sure it's all getting done the way you think it should be.

So security builds in layers, and we have a whole bunch of pieces around security. You've got certified data erasure — you come to us and say, hey, I want to pull this frame out of my data center; I can wipe the whole thing, give you the certificate, take the box away. Data-at-rest encryption, of course, because you've got to have that. Secure snaps — something we're talking about on the next slide — being able to protect your snaps from being deleted. Role-based access controls — being able to decide who gets access to what and why, so you can pass down control, so a DBA can make snaps of their database and restore them, but they can't restore on top of somebody else's database. Some of those things are kind of important. Service credentials for our customer service guys — so when somebody walks into your data center and says, yeah, I'm here, I'm from Dell, I'm going to help, and they go to log in, they have to use their two-factor authentication to get into the system. And if they've been terminated, or their access isn't right for that system, or they haven't said they're going to work on it, they don't have the credentials to get past the two-factor authentication — it doesn't work. So, lots of security there. Secure remote support on the remote support gateway. Audit logs — we'll take all the logs and stream them, and you can archive the logs and go do all the log surfing you want, to make sure you know what's going on with all the security pieces. And then the Common Criteria certification as well.

So, simple things: data-at-rest encryption. Every bit flows through encryption hardware and comes out the other end. One key per drive.
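A toy model of the one-key-per-drive lifecycle: a key is generated when a drive enters the box and shredded when it leaves, so key and drive are never together outside the array. The real array does this in hardware; the class and method names here are invented:

```python
import secrets

class DriveKeyStore:
    """Toy state machine for one-key-per-drive encryption."""

    def __init__(self):
        self._keys = {}  # drive serial -> AES-256 key material

    def install_drive(self, serial):
        # Key is born when the drive enters the box...
        self._keys[serial] = secrets.token_bytes(32)

    def remove_drive(self, serial):
        # ...and shredded when the drive leaves, so key and drive are
        # never together outside the array and the data is unreadable.
        del self._keys[serial]

    def key_for(self, serial):
        return self._keys[serial]
```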
You pull the drive out of the array, send the drive off — we shred the key. Key and drive are never together: no problem, the data is secure. Install a new drive in the box, we generate a new key for it. The key lasts as long as the drive's in the box, done. Fairly simple stuff, all done in hardware. If you want to, you can do remote keys, so you can have an external key manager managing the keys in the array. I far and away prefer to leave the keys in the array, because these keys are not actively manageable — if you reach in and change all the keys for your array while you're managing the keys, you've just wiped out all your data, and that's generally a bad thing. So rather than doing an external key manager, if you just leave it in the box, nobody can muck with it, nobody can change your keys, nobody can get to them; they can't be exported by a user and taken out of the building or something. They're secure, and done.

Compression and deduplication and encryption — how is that layered? Is it at the drive level? Is this done in hardware or software — is it self-encrypting drives? No, it's a chip running on the NVMe card that's sending the data down to the drives. So we're doing it in hardware before it ever leaves the director; it's encrypted on the way to the drive, in hardware. It's like having self-encrypting drives, except we only have one chip for the drives — one chip per controller. There's a SLIC that pushes the data out to the drives, and it encrypts it on the way. So the link card — is that where it's done? One step above that, in the director itself. The back-end SLIC takes the data, encrypts it, and pushes it down, or reads it up, decrypts it, and passes it back up. So it's after compression, after deduplication — this has no impact on any of the data services at the higher levels of the array; it's just the hardware piece as you go to the drives. Now, the self-encrypting-drive thing is getting more standard for high-bandwidth drives, but for example, next year when we do storage class memory, there won't be any SCM drives that are self-encrypting, so we wouldn't be able to do that as quickly if we were waiting for the drives to do it. Going forward, we expect all the drive vendors are putting that into the low-level specs on their firmware, so we may start using self-encrypting SCM.

And we have this idea of being able to isolate data. You're able to decide who gets access to what; you're able to securely use things with role-based access controls, and what we call SymACLs, the access controls — the role-based piece is a kind of higher-level, easier-to-manage version of that, with better resolution at the user level and storage-group level and such, rather than being for whole arrays. We also have tenant-to-tenant data protection. If you have a LUN in a box, some of that data may not only tier around between storage, but you may delete LUNs and leave a whole bunch of data sitting around. All of those tracks in the box are marked as invalid, so when you start allocating new tracks, they're all zeroed out — you can't ever read any of the old data. Everything is protected; there's no leakage of data between LUNs. I mean, with the wide striping and all that stuff, you're actually mixing up any volume's data on any of these drives, right? Right — every drive essentially has data from every LUN. And as far as the multi-tenant side, a LUN is obviously assigned to a particular tenant? Right.
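A toy model of the invalidate-then-zero-on-allocate behavior described here: deleting a LUN only invalidates its tracks, and any new allocation hands back zeros, so one tenant can never read another's leftover bits. Sizes and names are illustrative:

```python
TRACK_SIZE = 128 * 1024  # illustrative track size, not the real geometry

class TrackPool:
    """Toy model of zero-on-allocate tenant isolation."""

    def __init__(self, n_tracks):
        self.data = [bytes(TRACK_SIZE)] * n_tracks
        self.owner = [None] * n_tracks   # which LUN owns each track
        self.valid = [False] * n_tracks

    def free(self, track):
        # Deleting a LUN just invalidates its tracks; bits may linger on media.
        self.owner[track] = None
        self.valid[track] = False

    def allocate(self, track, lun):
        # New owner always sees zeros, never the previous tenant's bits.
        self.data[track] = bytes(TRACK_SIZE)
        self.owner[track] = lun
        self.valid[track] = True

    def read(self, track, lun):
        # A track is only readable by the LUN it belongs to.
        if self.owner[track] != lun or not self.valid[track]:
            raise PermissionError("track not accessible by this LUN")
        return self.data[track]
```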
So the keys don't go with the LUNs — this isn't based on data-at-rest encryption. The security is based on access to the tracks, because a track is only accessible by the LUN it's a part of. And when it's invalid, it's invalid — somebody else can say, hey, I want to use that track; great, it's all filled with zeros. There's no reading of the old data; it's all wiped automatically. That's one of the things the security guys are rather insistent on, and they do testing on it to make sure there's no way to leak data back and forth between LUNs — kind of like they do with VMware, right? Can I leak data between the VMs? They do the same kind of thing with us, between LUNs, just to make sure there's no way to have data accidentally feed back and forth. Now, obviously, if you take a snap of one LUN and restore it on another, that's not accidental — you caused the data to move between LUNs; that's a whole different discussion of your security and how you're going to manage things.

With the new role-based access controls, you have the ability to assign different roles. We've got things like local replication, remote replication, managing the devices, allocating storage, I'm-the-wizard-of-all-things-in-the-array — whatever level of access you want to give somebody, to individual storage groups or to the entire frame. The ability to do this kind of thing helps us with multi-tenant systems, but also in large enterprises, because again, if I want to pass the ability to do snaps down to an end user, the last thing I want is an end user who makes a mistake and restores on top of somebody else's database. So I want to be able to control all those kinds of things. They also shouldn't be able to make a read copy of somebody else's data and go put that somewhere else, or make it accessible to themselves.

I've seen a lot of paranoid people with different things on the arrays, but the most paranoid I've ever seen is when somebody had juvenile court records on a system. Evidently the laws around that make the things the NSA deals with look pretty nice, because nobody wants juvenile records leaked, ever — and wow, they're paranoid. But again, we can do the access controls and make sure that nobody other than the people authorized to use that system can make any copies or do anything with that data.

Do you still use gatekeeper devices? You're welcome to have gatekeepers if you need them for anything, but they're no longer necessary.
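A sketch of the role-plus-scope checks described above — a grant pairs a role (local replication, storage admin, and so on) with a scope (one storage group or the whole frame), and an operation is allowed only if some grant covers it. Role and operation names are illustrative, loosely echoing the talk, not the actual Unisphere role set:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grant:
    role: str    # e.g. "local_replication", "storage_admin"
    scope: str   # a storage group name, or "*" for the entire frame

# Which operations each role permits; names are illustrative.
ROLE_OPS = {
    "local_replication": {"create_snap", "restore_snap"},
    "remote_replication": {"create_srdf_pair"},
    "storage_admin": {"create_snap", "restore_snap", "allocate", "delete"},
}

def allowed(grants, op, storage_group):
    """True if any grant covers this operation on this storage group."""
    return any(
        op in ROLE_OPS.get(g.role, set())
        and g.scope in ("*", storage_group)
        for g in grants
    )

# A DBA scoped to their own group can snap/restore it, but nothing else:
dba = [Grant("local_replication", "hr_db_sg")]
assert allowed(dba, "restore_snap", "hr_db_sg")
assert not allowed(dba, "restore_snap", "finance_db_sg")
```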
Info
Channel: Tech Field Day
Views: 1,564
Rating: 5 out of 5
Keywords: Tech Field Day, TFD, Storage Field Day, Storage Field Day 16, SFD, SFD16, Dell EMC, Vince Westin, PowerMax, Machine Learning
Id: U4ToiMzCo30
Length: 25min 19sec (1519 seconds)
Published: Thu Jun 28 2018