Storage Server Update: Hardware, Optane, ZFS, and More!

Captions
Ah, it's time for a storage server video update. Storage server videos... well, on account of a refresh. You might remember when we first started Level1 we built our own storage server out of some repurposed storage. That thing is still humming along great, so great in fact that I think it's time we build another one, but not for us. This is a little bit of an experiment, a little bit of a how-to, a little bit of lessons learned. Let's take a look. [Music]

So this time around we're doing something a little different with the hardware. We've got the same LSI disk shelf, but we've swapped the IOM3 modules for IOM6 modules, because the IOM6 modules support larger disks a little better, and we're also testing which disks are compatible with it. At the time of this video, these 8 TB Toshiba drives, this specific model, are compatible, and the WD Red NAS drives are also fully compatible with the IOM6 disk shelf. The reason is that even though these disks are 4K-sector drives, they'll still optionally report 512-byte sectors. That's kind of important because we're using older Serial Attached SCSI (SAS) hardware, and some of the older SAS hardware has trouble with 4K sectors on the disk.

What this shelf gives us is two controllers, and each of those controllers basically has an in and an out, so you can stack these disk shelves, and you can have an active-active configuration where you've got two physical controllers connected to every drive and you can split the load among all the drives.

Mechanical drives, at the end of the day, are really not that fast. You can expect a single mechanical drive to do around 100 megabytes per second, maybe 150 or 200 megabytes per second at the very top end, but only for the very first part of the disk. As the read/write head moves toward the inside of the disk, it passes over less surface area per rotation, so there's less data passing under the head and the drive has to slow down. So we're looking at about 100 megabytes per second per spindle.

Setting this up, we want to build an inexpensive storage array from just one shelf that uses as few disks as possible but that will still let us basically saturate 10 gigabit Ethernet.

So what are we using for the host PC? Let's take a look. This is our Fractal Meshify C. It's a great case and it breathes really well, and the componentry in here is going to run super hot: the SAS controller, the 10 gigabit Ethernet controller (or two SAS controllers if we end up needing them), the Optane, all of these drives; all of these components run super, super hot, so we want tons of airflow. We've got an ASRock Taichi with a Ryzen 1800X CPU, and 32 gigabytes (two 16 GB sticks) of error-correcting memory. That's going to be kind of important for ZFS, but it's not the end of the world if you run the system without ECC memory.

We've also got our Optane. This is the 2.5-inch version of the Intel Optane, and it comes with an M.2 adapter. It's sort of a ghetto-fabulous M.2-to-U.2 adapter, because the drive itself is U.2, but the cable doesn't really go from U.2 on the drive to a proper U.2 connector; it just goes straight to M.2. It's a little unusual that Intel saved the extra penny or two by not using an actual U.2 connector; most of the other bundled kits I've seen are a standard U.2 connector on an M.2 board, and then they give you a U.2 cable.
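If you want to double-check that a drive exposes 512-byte logical sectors over its 4K physical sectors before putting it behind an older SAS shelf, a quick sanity check looks roughly like this. This is just a sketch, not from the video; the device names are placeholders, and it assumes a Linux host with smartmontools installed (the last command is the FreeBSD/FreeNAS equivalent):

    # Logical vs. physical sector size for every block device (Linux)
    lsblk -d -o NAME,MODEL,LOG-SEC,PHY-SEC

    # Per-drive detail via smartmontools; a 512e drive reports something like
    # "Sector Sizes: 512 bytes logical, 4096 bytes physical"
    smartctl -i /dev/sda | grep -i 'sector size'

    # FreeBSD / FreeNAS equivalent (sectorsize = logical, stripesize = physical)
    diskinfo -v /dev/da0 | grep -E 'sectorsize|stripesize'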
But hey, who am I to question a $500 280 GB storage drive, right? Intel knows best. The reason we're using Optane is its ridiculous endurance and extremely low latency. Unlike NAND flash, this drive can be rewritten on the order of tens of petabytes, and I'm not really sure I trust NAND flash, which is what most solid-state drives are made of, to last that long. The access time and I/O latency on Optane are also genuinely better.

There are two other Optane devices you could use other than the 900P: the 800P, and the original 16 and 32 GB Optane modules. Those little ones are garbage for this; don't even bother. They're too slow, they're not really built for it, and they're only PCI Express x2. Even if you were to take something like the ASRock quad M.2 card that we have and put four M.2 drives on it, that might work with the 800Ps, but with the smaller ones you can forget it. It's just not worth the headache; you pretty much have to go with the 900P.

Now, we are splitting the 900P into partitions: one partition for our separate log, the separate ZFS intent log (SLOG), and one for the L2ARC. I'm not sure 32 gigabytes of memory is going to be enough for this system. The old rule is one gigabyte of memory per terabyte of disk space, and depending on whether we add 12 or 24 of the 8 TB Toshiba disks, we may be looking at 40 to 80 terabytes of usable storage online, because we're going with a mirrored configuration. We're not going to use RAID-Z, because we want maximum performance, maximum IOPS, and maximum redundancy, and mirrors are about the only way you get all of that out of spinning rust.

Now, in terms of our disk layout and performance: is it a best practice to have your L2ARC and your SLOG on the same device? No, that's not a best practice. In an ideal world you might actually have two of these Optane drives in a mirrored configuration, but we'd probably run into trouble on the Ryzen 1800X; there aren't really enough PCI Express lanes for us to do all of that. We're pushing it as is, because we might have two SAS controllers plus our 10 gigabit Ethernet adapter coming off the chipset, and that's going to be about 2 gigabytes per second, which is getting to where we're pushing the limits of what the chipset is capable of. You'd probably have to step up to something like the Threadripper 1900X to get the extra PCI Express lanes and connectivity.

Strictly speaking we don't need the cores. In fact a lot of stuff, especially primitive Windows file sharing, is not really multi-threaded as it turns out, so you really want high clock speed; even an Intel 8700K would be a really good choice. A super high clock speed means you're not going to have to fool with jumbo Ethernet frames or any of the other old tricks that used to be necessary to get high throughput on 10 gigabit Ethernet networks without an insanely fast CPU.

Now, switching gears for a second: the storage server we use here at Level1 runs ZFS, and we're using a bunch of RAID-Z2 pools, so we're not even using a mirrored configuration there. We can lose two disks from each vdev; I think we ended up with something like four disks of redundancy per shelf. And with that many disk shelves, saturating 10 gigabit Ethernet is absolutely no problem for our server.
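To make the mirrors-versus-RAID-Z distinction concrete, here's roughly what the two pool layouts look like at the command line. This is a hedged sketch, not the exact commands from the video; the pool name and the da* device names are made up:

    # Pool of mirrored pairs: best IOPS and rebuild behavior, 50% space efficiency
    zpool create tank \
      mirror da0 da1 \
      mirror da2 da3 \
      mirror da4 da5

    # RAID-Z2 vdev like the Level1 office server uses: any two disks in the
    # vdev can fail, better space efficiency, lower random I/O performance
    zpool create tank raidz2 da0 da1 da2 da3 da4 da5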
In terms of what that server does for us, it does a lot more than storage. We can automatically transcode our video into proxies; we use ffmpeg for that, along with Kdenlive. We're using Fedora Linux on the host even though we're using ZFS: it started out as FreeNAS, then we switched to FreeBSD, and then we switched to Fedora so we could get a little better multimedia functionality out of Kdenlive and some of the other transcoding work we're doing. We're also running Docker on the host. Docker is sort of an automation platform, a quasi containerization/virtualization platform, and we're using it to automate things like our own Steam cache. If we were going to have a LAN party, we could wheel this thing over and have a cache of all the games we've downloaded from Steam. It's really nice to be able to download old favorites like GTA and Divinity: Original Sin 2 at basically wire speed. For the machines in the office that have a 10 gigabit connection, we can download from our Steam cache at 10 gigabits even though our internet connection is nowhere near that fast, and everybody else in the office on gigabit downloads at full gigabit speed with absolutely no problem. Let me tell you, it is great when you're setting up a new machine: you click a few buttons, install some stuff, and you're good to go. You don't even have to bother with importing a Steam library off the network; you just let Steam do its thing and it transparently pulls from the cache. We've also got that set up for Blizzard and Origin games, although it can be a little temperamental; the devil's in the details, and you're going to need some DNS hacks and some other stuff. We could probably do that as a separate video, but I think it might be better suited to a guide on the Level1 forum, so look for that at forum.level1techs.com, and maybe if there's enough interest we'll do a video. I really want to dot my i's and cross my t's, though, because we've had a lot of newbies follow tutorials like that and end up completely lost, so we really have to go step by step. Not a bad thing, just something that takes a lot of time.

Now, for this storage server I don't think it's really necessary to set up Fedora or anything like that; in fact we're just using FreeNAS for testing. I started the testing by creating a 12-drive mirrored configuration, so we've got basically twelve mirrors added to our zpool. Right now I'm using 2 TB disks because our 8 TB disks aren't in yet, though I do have one Toshiba 8 TB disk and one WD Red 8 TB disk that I can use just to be sure those work. But in terms of whether this setup actually saturates the link, even in this sort of worst-case scenario: yes, it does. Even before we add the Optane, even before we add the ZIL on a separate device, we're already right up against the bandwidth limit of 10 gig Ethernet; on an uncompressed workload to our ZFS dataset we're doing 800 to 1100 megabytes per second.

So why would we add the SLOG, and why would we add the L2ARC? Well, we may not need to. It sort of goes against conventional wisdom, but with modern ZFS, in some situations you can get away with less memory than one gigabyte per terabyte, especially when we're talking about storing big video files, which is the type of workload we have on our ZFS storage server.
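For the 800 to 1100 megabytes-per-second figure above, a rough way to smoke-test sequential throughput against the pool looks something like the sketch below. It assumes fio is installed; the dataset path and sizes are just examples, and fio's default random buffers keep ZFS compression from inflating the numbers:

    # Create a scratch dataset and write ~16 GB sequentially in 1 MB blocks
    zfs create tank/bench
    fio --name=seqwrite --directory=/mnt/tank/bench \
        --rw=write --bs=1M --size=16G --ioengine=posixaio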
You can get into other scenarios where it doesn't quite work as well by default. There's also a thing called ashift, which has to do with the relationship between the physical sector size, which in this case is 4K, and what's reported by the drives, which is 512 bytes. I'm happy to report that FreeBSD/FreeNAS figures out ashift automatically now, and it figured it out correctly for this pool. There are a lot of other ZFS and FreeBSD tunables you can apply to the system that can speed it up, but unless you know what you're doing you can actually make things a lot worse.

Now, in terms of adding the separate log device, we're going to go ahead and do that. We're going to create a 60 gigabyte partition, which is probably too large, and add that here. Why 60 gigabytes? Because we're mostly going to be writing large video files to this thing. By default, ZFS will keep about five seconds' worth of data in memory before it blocks new data and says, okay, I need to get this written out to disk before I accept more. For our workload, adjusting that to 30 seconds has had no ill effects, and it helps when we're dumping memory cards and things like that; it keeps things running a little smoother. So we're going to adjust the limiters on the ZFS pool so that we can write to it for about 30 seconds. With the Optane, we can write all of that to the Optane in those 30 seconds, and then if the power goes out or something bad happens, that 30 seconds' worth of data can be read back from the Optane device and written to the mechanical disks. So is 60 gigabytes enough to hold 30 seconds of information? At 1 gigabyte per second that would be about 30 gigabytes, or with two 10 gig connections, about 60 gigabytes. A little back-of-the-envelope math, but that's how I came up with 60 gigabytes.

We're also doing something called over-provisioning. We've created some partitions on the Optane, but we're not using the whole drive, and the good news is that the Optane firmware will distribute writes across the whole disk automatically, so we'll get even wear. For the other partition we're using an L2ARC, a level 2 adaptive replacement cache, which is basically a read cache for your ZFS pool. We're going to create that one at 100 gigabytes, which leaves roughly 100 gigabytes of unused space on our Optane; the firmware is smart enough to know which areas of the disk you're using and which you're not, so it spreads writes across all of the space on the device, and in doing that we actually increase the endurance of our disk. Some people have been using NAND flash SSDs for their ZIL or their L2ARC with the same over-provisioning strategy. We did that ourselves on the Level1 storage server with a 450 gig Intel 750. The Intel 750 SSD is NAND flash, but that thing is built like a tank; we're using about 100 gigabytes of it, and monitoring it with utilities we can see that it's doing the wear-leveling thing and showing no signs of wearing out or having any problems. But again, that's because of our specific workload and because we understand how the technology works; your mileage may vary.
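Here's roughly how the layout described above comes together on the command line. This is a sketch under assumptions, not the exact procedure from the video: the pool name, the nvd0 Optane device, and the sizes are illustrative, and FreeNAS would normally do the partitioning and attaching from its GUI:

    # Confirm the pool was created with ashift=12 (4K-aligned writes)
    zdb -C tank | grep ashift

    # Carve the Optane 900P: ~60 GB for the SLOG, ~100 GB for L2ARC, and leave
    # the rest unpartitioned so the firmware can spread wear (over-provisioning)
    gpart create -s gpt nvd0
    gpart add -t freebsd-zfs -s 60G  -l slog0  nvd0
    gpart add -t freebsd-zfs -s 100G -l l2arc0 nvd0

    # Attach the partitions as a separate log device and a read cache
    zpool add tank log   gpt/slog0
    zpool add tank cache gpt/l2arc0

    # Let ZFS buffer roughly 30 seconds of dirty data instead of ~5 (FreeBSD
    # sysctls; values are illustrative, vfs.zfs.dirty_data_max_max may also
    # need raising, and careless tuning can make things worse)
    sysctl vfs.zfs.txg.timeout=30
    sysctl vfs.zfs.dirty_data_max=34359738368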
Now, something I mentioned really quickly: you may not even really need an Optane cache. The only thing that the write cache, the separate intent log, the SLOG, you know, the ZFS intent log on a separate device, really matters for is synchronous writes, that is, when something asks for data to be written immediately. You see this with virtual machine type workloads; you don't really see it on most normal NAS-type workloads where you're just copying a bunch of files over. So you may not really need that cache at all. Now, because we've only got 32 gigabytes of memory in this system, and I may need to go to 64 depending on the pool size, using Optane for the L2ARC maybe makes sense, because 280 gigabytes of Optane is going to cost less than DDR4, shockingly, although even 280 gigabytes is a little bit overkill. The Optane 800P really would be a good choice here if it weren't so damn slow on writes. The reason is the way ZFS works: this write cache is hopefully written to but never read. The only time it would ever be read is when recovering from a bad situation, something that happened unexpectedly like a power failure or some other event that shut the whole thing down; then ZFS reads from the intent log and reconstructs what was supposed to happen on disk. If you're used to battery-backed RAID controllers and their write caches, this is kind of sort of what the battery-backed cache does for you, except this will store its data pretty much indefinitely. Actually, funny story: if you have an Optane drive that's been off for a long time, as soon as you turn it on it'll immediately rewrite itself. It's a quirk of the hardware, because it's worried it would lose data integrity after being unplugged for a long time. But it's fine, we'll just pretend that doesn't happen.

So that's pretty much the long and short of it; that's what goes into picking out these parts. We've got a little bit of old-world technology and a little bit of new-world technology. Really the bulk of our expense here is the 8 terabyte disks for the disk shelf; everything else is just a few hundred dollars' worth of equipment. Five or ten years ago, buying this kind of setup in the enterprise would have been north of a hundred thousand dollars, probably more like two hundred thousand, give or take. Is the enterprise solution more bulletproof and robust? Yes, absolutely. But the magic here really is ZFS. I know there are other file systems like Btrfs, and other replication technologies and things like that, but ZFS is so far and away so advanced and so reliable that I would rather be running ZFS on hardware like this than something that's not ZFS on proprietary enterprise gear that may or may not die in a way that I understand. That's just because I understand how ZFS works internally, and ZFS has always done everything I've asked of it. ZFS is a really incredible file system. It does come with a lot of overhead; I mean, we're running an eight-core system here with 32 gigs of ECC memory, and that's going to be better than a lot of the systems you guys are running at home as your primary computers, but for the type of data this is storing, business data, this is sort of the minimum you need.

We also need to set up some other stuff on this storage server, like having it send an email if you lose a disk so that you can get a replacement.
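On the email-alert point: FreeNAS handles this from its built-in alert settings, but on a plain ZFS-on-Linux box (like our Fedora server) the usual route is the ZFS Event Daemon. A minimal sketch, with the email address and spare device name as placeholders:

    # /etc/zfs/zed.d/zed.rc -- have zed mail you on vdev faults and similar events
    #   ZED_EMAIL_ADDR="admin@example.com"
    systemctl enable --now zfs-zed

    # If a slot is free, a hot spare can take over while the failed disk
    # stays plugged in for diagnosis
    zpool add tank spare da24
    zpool status tank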
It also might not be a bad idea to add some spare disks, because if one of these disks dies, it's nice to add the replacement disk while keeping the disk that died still plugged into the system, and we won't be able to do that if all 24 slots are occupied. So there are some trade-offs. If you're thinking about building a storage server like this, or you've already built one and you want to show off, come add some pictures to the forum at forum.level1techs.com. This is Wendell signing off, and I'll see you there. [Applause] [Music]
Info
Channel: Level1Techs
Views: 103,672
Keywords: technology, science, design, ux, computers, hardware, software, programming, level1, l1, level one
Id: eeHVVx-D2q4
Length: 17min 36sec (1056 seconds)
Published: Thu Mar 29 2018