Architecting and performance-tuning efficient Gluster storage pools

Captions
Welcome, everybody, to Red Hat Summit. Hope you enjoyed the keynotes this morning and had a good first session that woke you up a little; hopefully everybody stays lively for this one. This is "Gluster Can Do That," and we're here to talk about Gluster architectures. It's a fairly technical, fairly fast-paced session, and we have quite a bit of content to throw at you, but we keep it flexible: if you want to interrupt in the middle with a question, feel free; you can also wait to the end or grab us afterwards. You'll find us hanging out around the booths down in the partner pavilion, so we won't be too hard to find.

I'm Dustin Black, a senior architect with our storage business unit. I help build reference architectures and performance and sizing guides and work with our partners, so a lot of the content I build is meant to help make sure the work you do is right. And my partner today: "I'm Ben Turner, a principal software quality engineer. I work on both Gluster and Ceph, I do a lot of performance testing, I wrote a tool called gbench, and I'm really interested in any sort of automated performance regression testing. Also, forgive the bandage; I lost a battle with a razor this morning. It's quite a flesh wound, don't let it distract you."

We have quite a bit we want to cover, so we don't really have time for a full-blown "what is Gluster." Quick survey of the audience: Gluster users in production today? Most of the hands stayed up, good deal. How about just curious prospective users? Very cool. So we're not going to give you the full 101, but the good thing is, if you did miss it, feel free to snap a shot of that slide. We have a set of 101 slides hosted on our people page (people.redhat.com/dblack, and you'll find the link there), so you can go through them on your own time.

So what are we going to do? Gluster can do that, if you build it right. That's the big caveat, and it's what today is all about: making sure you make the right decisions up front. The first thing I want to cover is some testing we did across a number of different scenarios. Our resources for this test were confined to a particular architecture: a six-node cluster that we had cherry-picked as a good set of cluster nodes, with 72 total drives across the cluster. I can configure those six nodes a number of different ways, for redundancy and for different kinds of Gluster volumes, and based on the way I configure them I'm going to get massively different results. So keep in mind, when I show you this data, it was all tested on the same systems, just configured different ways for different workloads.
One of the things we want to show you here is representative of what we consider a very small-file workload: 32-kilobyte files, like JPEGs. Say you just throw a default configuration out there and don't think much about it; maybe you get about 1,700 JPEGs processed per second. Not bad. But say you spend a little time architecting to that workload up front (you'll see why I shouldn't really say "tuning"; we'll get into that later) and make the right decisions: you suddenly go from 1,700 JPEGs to almost a tenfold increase, around 12,000 JPEGs per second. Same hardware. We didn't change our hardware investment or our software investment; we just changed the way we used it. And if you want to go to the extreme (you'll see I have some SSD examples as well, since we did do SSD testing), if what you really need is a performance-oriented configuration, you can do it: this is an all-NVMe configuration we tested, and you can see we got about 23,000. What may be interesting there is that we only roughly doubled our optimized hard-drive configuration; that's a limitation of small files that we'll get into a little later.

Same nodes, configured a slightly different way, and a different workload this time: a 4-gigabyte, DVD-like file. What can we push through the system? Again, without putting much effort into it, we get about one DVD per second, which is pretty decent. But take some time and configure for that workload appropriately, and it's two DVDs per second, again on the same hardware. Or go the SSD route and we may be able to push this up to four DVDs per second; the first two are the same hardware configured different ways, while the last one is a different hardware configuration.

Our third example is a real-world workload we studied: CCTV. With no real configuration effort, we can get about 200 concurrent CCTV streams on that same set of servers; or we can architect appropriately to that workload and more than double the capacity of the system.

This is the key point we're going to keep hitting today with all of these performance numbers. You're all familiar with the acronym KISS, right? (Rock and roll? No, the other KISS.) Keep it simple, stupid. We were thinking about how to apply the same sort of mnemonic to what we have going on, because we want you to concentrate on the right thing, so we came up with our own acronym. It didn't quite work out; we should have bought a vowel. But it makes a good point: start with the workload, dummy. This is really important. If you don't understand your workload and you go architecting your storage environment, it's never going to work right for that workload. Know your workload first, and talk to your architects about what that workload is, so we can help make the right decisions up front. We'll reinforce this later, but you will never tune your way out of a performance problem if you architected improperly in the first place.
So let's take a look at how this affects us as engineers and support people on a daily basis. I apologize to some of my colleagues for somewhat publicly shaming them (at least they're not called out by name), but these are real emails we get and real problems we run into.

Like this one: "I need a new system to perform similarly to an existing system. I did a test with dd and I got good write throughput." Well, 500 megabytes per second: is that good? I don't know; there's not enough information here. It's a replica-2 configuration over a 10-gig connection. Replica 2 over how many nodes? What record size did you write? How did you run dd, and what was your input? Is this an NFS mount or a FUSE mount? It doesn't say. And "dd strangely yields slower throughput for reads," which is odd, because as you'll see from all the numbers in our charts, reads on Gluster almost always outperform writes, unless you go to something like a pure distribute volume; and they say here it involves replication. These are the kinds of questions we get, and there's not enough information about the workload for us to even answer.

Similar thing here: somebody wants to add physical nodes to increase "performance" (I did not add those quotes). They're experiencing a problem, and they do give us some information: an 80 x 2 distributed-replicated volume on six nodes, and they want to add six nodes, which would not still be 80 x 2; if the layout stays the same it would be 160 x 2, so there's a bit of missing information there. They're asking how they need to add these nodes to get better performance. I don't know. I think we're going to start with the workload. We're missing, once again, enough information about what performance means to them. What file sizes are they using? Is it random, is it sequential, is it a mix of small and large files? Is it latency sensitive or throughput sensitive? How many concurrent clients? Is it the FUSE client or the NFS client? There's so much we need to know to make the right decision, and it's not provided. And something we'll cover a bit more later: there are workloads, like mkdir-heavy ones, that are among the few things in Gluster that don't scale positively; with mkdir, the more bricks you have, the more it tends to slow down. So if they're doing a really mkdir-heavy workload, adding nodes might actually hurt their performance a little. It's an oddity, because we do talk about linear scalability and improving performance by adding nodes, but if you don't understand what the workload is and how you build against it, you could add nodes and slow everything down. So be careful.

Last example: somebody saying they need a certain number of IOPS per drive or per RAID volume. What's an IOP? It depends. You all know what an IOP is, but how big is that I/O? How many clients are performing it? When they say per disk or per RAID volume, how does that really translate? Again, there's not enough information here.
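As an illustration of the level of detail that would make a dd report like the one above actionable, here is a hedged sketch; the server name, volume name, and mount point are placeholders, not anything from the actual email:

```bash
# Mount type matters (FUSE shown here), as do block size, data set size, and
# whether writes are actually committed before the clock stops.
mount -t glusterfs server1:/myvol /mnt/glusterfs
dd if=/dev/zero of=/mnt/glusterfs/ddtest bs=1M count=4096 conv=fsync
# For the read side, drop caches first so the result isn't served from RAM:
echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/glusterfs/ddtest of=/dev/null bs=1M
```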
I will go on a soapbox rant any time somebody asks me to design a clustered file system and they want a certain number of IOPS, especially if all they say is that they want IOPS; I'll just tell them to go away, because there's not enough information there to design a file system to meet that requirement. It's pretty tricky whenever you're working with IOPS. Something I personally like to see instead is files per second; I think that's a better metric, and it's something we cover a bit more later in the presentation. When we show you our small-file workload numbers, we report them in files per second. The tool we use, written by a fellow Red Hatter, Ben England, is called smallfile, and it reports small-file transactions in files per second, which is a nice abstraction over throughput and latency. That's really what matters when you're using a file system: if you know your file size, what's important to you is how many of those files you can push through the system at a time, not exactly what the throughput is or exactly what the latency is.

Alright, so let's take a deeper dive into the workloads I introduced earlier in the slides. First, the small-file JPEG workload; again, this is a 32-kilobyte workload, and I want to explain a few things about what my team was doing with these tests, because if you're used to studying performance purely, some of these numbers may look a little funny. That's because what we're interested in, in the document we're publishing (a quick plug: we finally have this reference architecture for Gluster that we've been working on, covering performance and sizing; it's linked, and it goes into all of the detail we're talking about today and more), is throughput efficiency. We want you, as the person implementing this, to understand, if performance is important to you, what your performance per investment is. So we don't report a pure performance number; we divide performance by the number of disks in the system, regardless of the replication or data protection mechanism in use, because we want you to know what your investment in the system returns in performance.

With that said, take a look at the small-file workload. All of the configurations you see down that y-axis are the same hardware, just configured in different ways; we haven't changed anything about the hardware. Among those configurations, the best-performing option, at the top, uses the Gluster native client with 12 disks in a RAID 6 configuration per node and a 3 x 2 distributed-replicated volume across those RAID 6 bricks, and it dramatically outperforms the rest. I mean, look at the worst configuration we have. Again, this is a synthetic benchmark and a synthetic workload: small files, sequential, with committed writes. So some of this looks a little funny if you're thinking you get the advantages of caching with small files; and yes, you do, but we deliberately ruled that out in these tests, because we wanted to understand what the file system itself could handle at the lowest level.
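As a rough sketch of how such a small-file run can be driven with the smallfile tool mentioned above (the exact flags can vary by version, and the thread count, file count, and mount point here are purely illustrative):

```bash
# 32 KB create workload, reported in files/sec; --file-size is in KB.
python smallfile_cli.py --operation create --threads 8 \
       --file-size 32 --files 10000 --top /mnt/glusterfs/smf
# Re-run with --operation read (or ls-l, stat, delete) for the other phases.
```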
So the NFS configuration there, on a dispersed volume (an erasure-coded volume built on top of RAID 6), gets horrible performance. This is really important, guys: these are decisions you can't tune away later if you didn't make the right architectural choice up front. If you instead said, looking at the third line there, "I want to do a JBOD-based distributed-dispersed volume on my nodes using the native client," okay, but if you're running a small-file workload, good luck; your read throughput is terrible. And I don't know if you've ever populated your data and then had to go back and redo your brick configuration, but it's hours and hours of loading data again, and you lose a lot of time that way. Data has inertia; it doesn't like to move unless you push it really hard. So it's definitely important to get these decisions made up front, so you don't have to go back later and fix them, because, as we mentioned, tuning around them is not really an option when the difference between configurations is that large.

We take this a step further for you with that whole efficiency idea: what are you actually getting in throughput per dollar invested? The comparison here is important. You'll notice that along the bottom are the two primary configurations, a distributed-replicated volume and a distributed-dispersed volume (dispersed being erasure coding in Gluster terminology), and what we compare is what we consider a standard-density versus a higher-density system: the first of each pair of bars is a twelve-disk system and the second is a 24-disk system in an analogous configuration. What you find is that, again, if performance matters to you, your dollar goes a lot further if you invest in the less dense systems; the less dense systems get you more performance per dollar. You'll also notice a dramatic difference in your dollar investment between the distributed-replicated volume and the dispersed volume. Other than the 12-disk versus 24-disk comparison, these are the same hardware in different configurations, and you're getting massively different results.

Another thing about our study: we're trying to figure out what the system can handle, so we're interested in workloads that push the storage system itself to its limit. You'll see in some of the work that Ben did that he looks more at how things affect an individual worker, so we have a nice comparison of the two here. One thing I want to point out is that you will never get the full capability of the system without pushing your client concurrency up until you actually start maximizing throughput.

Question from the audience: in our RAID 6 volumes, do we use hardware RAID with write-back cache? That is the recommended configuration, yes. In our lab testing we did not have write-back cache enabled, because we could not test with a battery backup in that particular lab; but in production, yes, Red Hat recommends a battery-backed or non-volatile write-back cache.
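For reference, the up-front layout choices being compared in this section boil down to volume-create decisions roughly like the following; this is a hedged sketch with placeholder volume names, hostnames, and brick paths, not the exact commands used in the study:

```bash
# Replica 2 distributed-replicate across one RAID 6 brick per node (a 3 x 2 layout):
gluster volume create smallfile-vol replica 2 \
    server{1..6}:/bricks/raid6/brick
# Distributed-dispersed (erasure-coded 4+2) across JBOD bricks:
gluster volume create largefile-vol disperse 6 redundancy 2 \
    server{1..6}:/bricks/jbod0/brick
```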
As for running Gluster on VMs (another audience question): I don't have data to show for that. We do have some information about running Gluster as a backing store for VMs, and there's been some testing done, but Gluster inside of VMs, I'm afraid I don't have good performance data for yet. There are some tunables that get set whenever we do gdeploy runs via the sample config file, and it may be covered in there; that's something I can look at and get back to you on after the presentation. Also, something else on write-back cache: it's really important to set it at the controller level and make sure you have a battery. I usually see 10 to 20 percent gains from it. Sometimes, when a controller battery dies while I'm running automated performance regression tests, I see a sudden 20 percent performance hit: oh no, what was it? The battery died and it turned off write-back cache. So you can really see the performance difference, and it's something you'll want to keep an eye on. In one of the test scenarios we'll cover in a little bit, we actually did compare write-back cache on versus off, and you can see it made a difference there.

Yes, as for SSDs: we don't put them in a RAID configuration. We're pretty much invariably going to deploy those as JBOD and let the data protection happen at the Gluster layer. I have tested alternatives, and I didn't see a huge performance improvement, but I didn't do a lot of testing. Normally with SSDs we let Gluster do the protection; what I did was a single-disk RAID 0 so I could get a little more performance out of my brick, and I did set write-back. I didn't have a huge amount of data, and I didn't see a huge gain, but I did see a gain.

The next thing we want to dig into: when we look at this concurrency and start seeing diminishing returns as it goes up, particularly on the reads, we know we're hitting some kind of bottleneck in the system, and we want to know what that bottleneck is. You'll also notice that, the way we ran these tests, the writes never hit a bottleneck right away, so we're curious what's going on underneath to cause that. So we have some subsystem metrics to look at. The first is network utilization, and you can see we're barely using the network at all; that red line up at the top is a 10 percent threshold, and for this 32-kilobyte workload, even at high levels of concurrency, we're just barely getting close to it. To understand these charts as you go down: this is a write, this is a read, write, read, and the pattern persists. So what's going on with CPU? Again, we're barely even reaching 25 percent consumption, and this is an aggregate across all of our servers. So we're not network bound and we're not CPU bound. What's going on with Gluster, then? Must be the memory. It's not the memory; it looks like the chart is wrong, but it isn't. The green line is how much memory Gluster is actually consuming, how much system memory is really taken up, which as you can see is basically nothing. The yellow line is how much disk cache is being used, and for 32-kilobyte files, even with high levels of concurrency, we have loads of memory on these systems and we're not even caching all that much. So what's really going on? We must be disk bound. Let's look at the disks.
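The subsystem charts described here can be approximated with standard utilities like the following; this is a hedged sketch and not necessarily the tooling used for the study (the intervals and options are illustrative):

```bash
sar -n DEV 5        # per-NIC throughput, to compare against line rate
mpstat -P ALL 5     # aggregate and per-core CPU utilization
free -m             # used memory versus page cache
iostat -xm 5        # per-disk utilization and service times
```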
And indeed, on the read operations we are actually getting disk bound; that's where we start to see the curve flattening off, because we've hit a limitation on the disks. That's a good thing to note. So what's going on with writes? Ben, can you explain what's happening there? We looked at everything and didn't find a bottleneck. So what writes hit is overhead. The first thing you need to do whenever you go to create a file is answer "does the file exist?", so you go out and issue a lookup on all the bricks in the cluster. Once those lookups come back, you do the next operation, and the next, and the next. The amount of overhead we see with small files is actually what's taking up a lot of the time in that graph. With reads you don't have as much overhead: you just issue the read, and by default the first brick to respond is going to service that read (you can tune this behavior). So reads have a lot less overhead and creates a lot more. That's why we max out the disks on reads and never see writes hit a system resource limitation: Gluster's own overhead is really what's getting in the way of the write throughput.

That starts to beg the question: when files get that small, are they really acting like files? If a file is very, very small, is it still a file? It's an interesting thought, because with small files there tends to be a lot of overhead: round trips between client and server, lookups making sure files don't already exist, and so on. What I've found is that anywhere from about 16K (depending on your system it can be a little smaller, say between 16K and 4K) down to 1 byte, you see the same level of performance, at least on creates, and that has to do with the amount of overhead you get.

If we move on to the next slide, Dustin: I like to think of Gluster like my friend Ted Stevens here, as a series of tubes. Gluster is a stack of translators, and I/O flows from the top translator down. Look at FUSE at the top: Gluster takes a system call and passes it through each layer in the translator stack. You can see FUSE, the performance translators, then distribute. Going back to the tubes analogy, once we get to distribute it splits into three tubes; once we get into replicate, those split into three more tubes; and then the file goes to the actual brick itself. I joke with the Ted Stevens analogy, but it's a good way to visualize it, even once you get into the translator code. In every translator definition you'll find a structure, the xlator fops, which lists the system calls that translator acts on. So if an open comes in (this example is from the read-ahead translator), it goes ahead and executes read-ahead's open, does its thing as far as read-ahead is concerned, and then passes the call down to the next layer in the translator stack. It's that system of tubes: the system call is dumped in at the top, and if a translator can act on it, it will, based on that list; if the system call is not in that list, the translator just treats it like a no-op and passes it directly to the next translator.
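To see that translator stack on a real volume, the generated client volfile spells it out; a hedged sketch, with the volume name a placeholder, the excerpt heavily trimmed, and the file name and location subject to change between versions:

```bash
cat /var/lib/glusterd/vols/myvol/trusted-myvol.tcp-fuse.vol
# volume myvol-replicate-0
#     type cluster/replicate
#     subvolumes myvol-client-0 myvol-client-1
# end-volume
# volume myvol-dht
#     type cluster/distribute
#     subvolumes myvol-replicate-0 myvol-replicate-1 myvol-replicate-2
# end-volume
# volume myvol-read-ahead
#     type performance/read-ahead
#     subvolumes myvol-dht
# end-volume
```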
One last thing on that previous slide: another thing you'll see defined for different translators is their volume options, which is just the list of options that can be set for that translator. So if you're interested and you want to look at the code and see what system calls a translator acts on and what options it has, that's a good way to do it.

So with small files there are some real challenges, and what we're trying to do with the community to get over those challenges is, first, to improve the efficiency of individual calls. A good example is lookup-optimize. Before lookup-optimize came out, you would issue more than one lookup per brick whenever you were doing a file create; lookup-optimize takes you down to the minimum number of lookups you can do, which is one lookup per brick, to make sure the file isn't anywhere. The reason we don't just look at where the file hashes to is that, if a brick was down, you might have a link file elsewhere, so we stick to one lookup per brick. Another thing we're doing to help reduce that overhead is storing metadata in a client-side cache; anything you can do to reduce trips over the wire from client to server is going to be a nice performance improvement. Latency is the hobgoblin of distributed systems, so anything we can do to destroy or lower latency is better. The next thing is prefetching metadata: with the client-side cache, if we know that whenever we access a file we're going to need certain chunks of metadata, we send all of those metadata pieces back with the previous call, and they're stored in cache for future calls. We're also compounding file operations: if I'm going to do a series of system calls that I know need to happen in a specific order, say a lookup and then an open, two calls that are definitely going to happen one after the other no matter what, I can compound those operations and send them in the same frame, so we're reducing round trips. Something that's coming in an upcoming Gluster release is negative lookup caching, a way for us to cache lookups. A good example is Samba: Samba is used to running on top of a local file system, and it issues lots of the same calls over and over, so it's very lookup heavy. With negative lookups, we do the lookup the first time, cache it, and subsequent lookups are serviced from that cache. And parallel readdirp will speed up the readdirp operation the more bricks you have, which deals with some of the problems we talked about earlier.

I know we're probably running short, so here's the 20-second version of the small-file recommendations: RAID 10 or RAID 6 are recommended for bricks; set the tuned profile (the RHS throughput-performance one); and look at event threads. This is a theme that's been common through Gluster: it started out with a lot of single-threaded operations, and we've been taking those and making them multi-threaded, doing more things in parallel, so the event-threads options are a good thing to look at. Lookup-optimize we touched on earlier; cache invalidation has to do with md-cache and is just how often to invalidate the cache; and stat-prefetch we also covered.
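The small-file tunables just listed map to volume options roughly like these; a hedged sketch, since option names and sensible values vary by Gluster/RHGS release, and "myvol" is a placeholder:

```bash
gluster volume set myvol cluster.lookup-optimize on
gluster volume set myvol server.event-threads 4
gluster volume set myvol client.event-threads 4
gluster volume set myvol performance.stat-prefetch on
gluster volume set myvol features.cache-invalidation on
gluster volume set myvol performance.md-cache-timeout 600
# Host-level tuned profile (RHGS ships its own profiles; names differ by release):
tuned-adm profile throughput-performance
```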
Cool, so here's a graph of what we're seeing with tuning. In the graph on the left we see small-file creates, 32K files, comparing untuned, tuned with a cold cache, and tuned with a hot cache. On creates, the tuning has a really great impact, about a 44 percent improvement from the tunables, but with the hot cache we don't see much more improvement, because create workloads don't really have a lot to do with cache. On reads the story is a little different: cache is very effective for reads, so we don't see much improvement just from setting the tunables, but when the cache is hot we do. And note the cache is hot on both the server and the client: on the server the small files are sitting in page cache, so the server doesn't have to go and hit the disk and can just send the read back, and on the client side the metadata is cached as well.

Another thing to talk about is multi-threaded ls -lR versus single-threaded ls -lR. You can see that with our tunables we were able to get from about 300 seconds down to about 75 seconds; and yes, this is a smaller-is-better graph, showing the time for a single client to run an ls -lR. Then, using Ben England's smallfile tool with four clients and eight threads per client, running the same ls workload but in parallel, we went from 300 seconds down to 12 seconds. So, just like the general theme: doing things in parallel, with lots of threads, lots of workers, lots of files, is going to help Gluster performance.
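The parallel-listing idea can be approximated even without the benchmark harness; a hedged sketch that fans a recursive listing out over top-level directories, with the mount point and worker count purely illustrative:

```bash
find /mnt/glusterfs -mindepth 1 -maxdepth 1 -type d -print0 \
  | xargs -0 -n1 -P8 ls -lR > /dev/null
```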
Let's move on to large files. That second scenario I talked about early on was the same set of servers configured a different way for a different workload; let me show you why you need to configure them a different way. Again, qualifying that what we're looking at here is throughput efficiency: what we find, interestingly, is that for this large-file 4-gigabyte workload, with enough concurrency to saturate the system, our dispersed volumes actually give us the best overall performance. That's interesting because we think of erasure coding as something that adds overhead, but from a performance-efficiency standpoint, considering a saturated system, you can see, particularly on the right side of the chart, that the erasure-coded volume is actually the most efficient configuration. And again we did this as throughput per dollar, so you can really see what it means for your capital investment. What you'll find with the less dense versus more dense servers is that for the replicated volume there's not a whole lot of difference; you could go 12 disks or 24 disks and it's not going to make a massive difference if you need to go with replication for some reason. But if you are going to go with the dispersed volume, we again see better overall efficiency from the less dense servers.

I'm actually going to go back to that slide for a second, because I want to point out that first NFS line, the fourth line down: the performance of the NFS client on a distributed-replicated volume is actually pretty good; the writes and reads are pretty decent. So you may find that if you need to run NFS, you'll want to run NFS on a replicated volume, not on a dispersed volume, whereas if you're running the FUSE client you may want to run the dispersed volume instead.

Again, client concurrency is important, so we really have to push this thing until we find the edges of the system. But where are those edges? These patterns look a little different than they do for small files. When we look at the network, we're mostly well below the line, but you will find that we do appear to start bottlenecking on the network once the client concurrency gets high enough. We're not quite hitting line rate, but this is a network and there's overhead involved, so we have enough data to say that we believe we are constraining the network on our write operations, while reads still have a little room to grow. So where are reads getting bottlenecked? Take a look at the CPU: the patterns look different than they did with small files, but we're still only around that 25 percent mark, so we're really not doing too badly on CPU. The memory patterns look a little different too, but notice that the yellow line is again disk cache; the used memory is the green line, which you can barely see because the system isn't really using any memory, but the disk cache is being used pretty heavily. With these large files we're pushing enough data into the system to really fill that cache. The drops you can see are just part of the testing process: we do a write phase, drop caches, do a read phase, drop caches, so each of those dips is us purposely dropping caches across the system. Now take a look at the disks: on our read operations, the disks are the bottleneck. So, interestingly, at about the same point we're hitting very close to the network bandwidth limit for writes and the disk bandwidth limit for reads; the disks and the network in this configuration are actually very closely matched in throughput capability.

Audience question about all-SSD with InfiniBand: we've done some testing with all-SSD, not with InfiniBand; we were doing 40-gig Ethernet in that testing, so I don't have InfiniBand data from that work. I do have InfiniBand and 40-gig data elsewhere: I've done RoCE, IP over IB, and InfiniBand testing, so talk to me afterwards and I have some data I can show you.

So let's talk about what's going on with large-file workloads. Gluster was actually designed for large-file sequential workloads, so it has always been pretty good at that. For our bricks, just like Dustin mentioned, we can use erasure coding on JBOD or RAID 6. If we're massively parallel, with lots of threads, workers, and files, we tend to see EC perform really well, whereas with fewer workers, threads, and files, or even single-threaded workloads, RAID 6 tends to do a little better. The tuned profile we set is the RHS high-throughput one: it sets read-ahead on the bricks, which is really important for sequential reads; the deadline scheduler, which helps out both reads and writes; and it sets the VM dirty ratio to reclaim dirty pages more aggressively.
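Applied by hand, those host-level settings look roughly like the following; a hedged sketch, with the device name a placeholder and the values illustrative rather than the profile's exact defaults:

```bash
blockdev --setra 4096 /dev/sdb                    # larger read-ahead on the brick device
echo deadline > /sys/block/sdb/queue/scheduler    # deadline I/O scheduler
sysctl -w vm.dirty_ratio=5                        # flush dirty pages more aggressively
sysctl -w vm.dirty_background_ratio=2
```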
Keep me moving along: jumbo frames, also important, use them if you can. On writes you can see the improvement is mostly due to jumbo frames; on reads it's mostly due to read-ahead on the brick devices.

Here's a quick formula I use to guesstimate performance and how many nodes I'm going to need. Since we're using replication, and replication is done client-side on writes, your Gluster cluster is only as fast as its slowest piece. So we take the slowest piece, usually network or disk, divide it by the number of replicas, and multiply by 0.7, which accounts for roughly 30 percent overhead on writes. If we run that formula it comes back to 420; look at the graph above, multiply 420 by 4, and it comes out to about 1,600, which lines up with the numbers Dustin was showing. For reads, since reads come from every brick, we don't have to worry about replication; we just multiply by 0.6. The overhead is a little higher there because, as Dustin showed, you see a lot more CPU usage on reads, so we have a little more overhead.

So, takeaways. Dustin, do you want to talk about the top line? Yeah, there are some nuances in here. In my testing, because we're doing this saturation-type testing and really pushing up the client concurrency, what we find is that erasure-coded on JBOD outperforms replication on RAID 6 when you have that high worker concurrency. Yep, and the same thing goes the other way: replica 2 on RAID 6, whenever you have less concurrency, outperforms EC on JBOD. That also applies to the NFS client, as we found: because of the way NFS is architected, you go from the client to the mount, and the mount then goes to the back end to get the file, which is one of the reasons we see NFS performing a little worse on EC volumes. Also, read-ahead and jumbo frames are really important with large-file sequential workloads; jumbo frames on both the servers and the clients if you can get them set, and sometimes you need to talk to your network admin. And again: start with the workload when designing your storage cluster. The proper brick architecture from the start yielded far better performance than any of the tunables mentioned. Design in a way that avoids problems; don't try to tune your way out of them. Think of it like a doctor: what do they say, an ounce of prevention is better than a pound of cure. You will never tune your way out of a performance problem if you architected incorrectly.
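As a back-of-the-envelope sketch of that sizing formula (the throughput number and replica count below are hypothetical, and the 0.7 and 0.6 factors are the rough overheads quoted in the talk, not guarantees):

```bash
slowest_mb_s=1200   # slowest per-node component, disk or network, in MB/s (~10GbE here)
replicas=2
echo "write estimate: $(( slowest_mb_s * 7 / 10 / replicas )) MB/s"   # ~420
echo "read estimate:  $(( slowest_mb_s * 6 / 10 )) MB/s"
# The talk then scales the write figure up (420 x 4 in their example) to compare
# against the aggregate numbers on the graph.
```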
So all of that was synthetic workload testing. Let me give you one example we did with that CCTV workload, so you can see how this applies to something real-world, or as close to real as we can get while running benchmarks. The CCTV workload runs concurrent streams of simulated high-definition cameras, and what we're really looking for is concurrency: how many simultaneous camera streams can the system support? We want that number to be as high as possible so the investment is efficient. And because it simulates a real-world condition, it also runs an interference workload doing smaller random reads, so it ends up being latency sensitive. What we see is that the first line, which peaks out really early, hits the latency curve quite early and stops us at a low number of concurrent cameras, is actually a replica-on-RAID-6 configuration. Well, that's interesting; that didn't turn out to be our best configuration for this use case. Going to the orange line, which represents our initial attempt at a dispersed volume on JBOD, you can see we got much better results, and everything else is incremental experimentation. One of the things we tried here was write-back cache at the RAID layer: we actually used single-disk RAID 0, but that let us leverage the RAID controller's write-back cache because we configured it that way, and we got about a 10 percent performance improvement from it. Then we also took some NVMe drives we had in the systems and configured a block-level NVMe write-back cache on each node, which got us about another 15 percent improvement. That became the target configuration we were looking at for the best overall Gluster setup for this streaming-video use case. There are actually some interesting things Ben discovered about why that's happening, but we'll probably reserve that for another time; we're short on time.

Okay, so the next thing I want to talk about, as far as use cases go, is hyperconverged Gluster and RHV, Red Hat Virtualization. This is wildly different from any of the use cases we've talked about: virtual machines tend to be a more random workload than the sequential workloads we covered with large files, and they don't have the metadata overhead you see with small files. Here's a nice diagram of the infrastructure. Hosted engine is what we use to deploy RHV, and you can see the hosted engine and that RHV manager VM are together. Way back when we bought Qumranet, the RHV manager used to have to run on a Windows system (and it may or may not have had to be bare metal), but over time we ported that code to Java, and now it actually runs as a VM inside the cluster, so you don't need separate hardware for your manager. It also runs on Linux now, and it's really easy to deploy using that hosted-engine solution. One quick note: there are three nodes there. We've been deploying nodes in sets of three; right now we support three nodes, but we're continuing to enhance support, and hopefully we'll have six, nine, and maybe more in the future. And note that all of those nodes are in the Gluster volume, so, just like we said, storage and compute are on the same systems. There's a cost advantage, and it's really nice to manage: everybody here, I'm assuming, has worked on Linux, and you're using the same Linux tools to manage your storage, your virtual machines, and your systems, so you don't have to have a separate storage admin and a virtualization admin.

And this is so cool, I'm really excited about this: I was actually able to get a fully set up, ready-to-go hyperconverged RHV cluster deployed using two commands. Just gdeploy, which you point at a sample configuration file that we give you (it has some examples, and you go through and fill it out for your environment), and then hosted-engine --deploy with a config-append pointing at an answers file that we also give you, which you go through, edit, and set up for your environment.
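The two-command deployment looks roughly like this; the config file names are placeholders standing in for the sample files shipped with gdeploy and RHV, and both must be edited for your environment before running:

```bash
gdeploy -c gluster-hci.conf
hosted-engine --deploy --config-append=he-answers.conf
```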
So, two commands, and it takes about 45 minutes; some of that 45 minutes is downloading the ISO image for the RHV manager. Here's some of the performance data. The first bar is sort of our control: that's me running on just a regular Gluster mount on one of the systems, outside of any VM. Then one VM, two VMs, three VMs, all on the same hypervisor; I just scaled up on the same hypervisor. A couple of interesting things to note: you see slightly better write performance outside the VM than inside the VM, but something I found strange, and still don't quite understand yet and am going to be digging into in the near future, is why reads were faster inside a VM than on the plain Gluster mount. We can see that reads and writes both scale linearly, and writes are kind of slow on the sharded VM store. We'd expect that blue line, a VM running a single load, to be about the same as the control, and it was actually better, so that's interesting. These are thin-provisioned; the bricks are thin-provisioned LVs, although I create the whole file and have it already there, but that's definitely something to note. Also, we have a bug open against sharded volumes for VMs: it's only about 70 megabytes per second inside the VM versus about 80 outside the VM, and the performance should be better. So if you're going to be using this hyperconverged configuration, I would suggest testing both with sharding and without, and as this configuration, this setup, this architecture grows, we're definitely going to fix some of these issues.

Shameless self-promotion: this is the tool I used to collect all my data, called gbench. If you want to contribute to gbench or use it for your testing, I'd love to see it. It basically wraps iozone, smallfile, and fio; it gathers data for multiple iterations, averages it, and does some other statistical work. A lot of the effort you'd normally put into running multiple iterations, or running different tools the same way, it takes out for you.

We're really at the end, so thank you, guys. Let me close out with just a couple of little things as you're leaving. First of all, I highly recommend you install the Summit mobile app; that's the QR code to take you to it. Please review our session, and please review all the sessions; we're always trying to make Red Hat Summit better for you, so definitely put in your reviews and get the tool installed to do that. There's also this cool thing going around; pick one up and check out what we're doing storage-wise. And we have some special swag for everybody who attended this session: if you're interested, come grab one of my business cards, go to the Gluster booth, and tell them Dustin sent you, and there's a special piece of swag. They may not have it yet, so you may have to check with them a little later today or tomorrow, but grab a business card and go visit them. I'll leave the cards here so you can grab them easily. Thank you, everybody, for joining. Thanks, guys; thanks, everyone, for coming.
Info
Channel: Red Hat Summit
Views: 8,608
Rating: 4.8202248 out of 5
Keywords: Red Hat Summit, Gluster, storage, software-defined, breakout session, Red Hat Summit 2017
Id: 61HDVwttNYI
Length: 45min 54sec (2754 seconds)
Published: Fri May 05 2017