Record Breaker: Toward 20 million i/ops on the desktop with Threadripper Pro

Video Statistics and Information

Reddit Comments

15 million IOPS, performed on a kind of jerry-rigged, mixed-drive arrangement in a single machine, with mixed Gen 4 and Gen 3 drives as well.

Would be interesting to see what would happen if all of them were precisely identical, just to prove that the 15 million IOPS is the absolute limit of the Threadripper Pro platform.

At which point, what would Threadripper Pro in its Zen 3 variation provide, as well as the inevitable PCIe Gen 5.0 and the Infinity Fabric improvements to come with DDR5?

๐Ÿ‘๏ธŽ︎ 8 ๐Ÿ‘ค๏ธŽ︎ u/DHJudas ๐Ÿ“…๏ธŽ︎ May 26 2021 ๐Ÿ—ซ︎ replies

I've watched that LTT NVMe server video talking about bandwidth limits, and this seems to be a more independent follow-up. 23 mismatched NVMe drives is a lot, but I'm interested to see how much more those rumored Genoa Zen 4 EPYCs with 12 channels of DDR5 can handle.

๐Ÿ‘๏ธŽ︎ 2 ๐Ÿ‘ค๏ธŽ︎ u/Powerman293 ๐Ÿ“…๏ธŽ︎ May 27 2021 ๐Ÿ—ซ︎ replies
Captions
How'd you like to get that EVGA keyboard and mouse for just about half off? That's the EVGA Elite program, and you can sign up and it doesn't cost anything. You can also get in the queue for GPUs, things like the RTX 3090, so that eventually, when they do restock, there's at least the option to buy. You also get discounts, coupons, and special access to things. EVGA is building their Elite member army. Level One Techs has a code, so if you use our associate code LEVELONETECHS at checkout, we get a little bit of a benefit from that. It is an affiliate code, but EVGA sponsored this video, so thanks, EVGA. You also get 24 hours of early access when they come out with new hardware, so if they're coming out with a new model GPU, hint hint, you can put your name in the ring to reserve one 24 hours before everybody else. It doesn't cost anything to join; really, all you're giving up is your email address. This is Level One. Thanks, EVGA, for sponsoring this video. Be sure to check out the link below, sign up for the Elite program, and tell them we sent you.

Confession time: a faster SSD doesn't necessarily mean a faster SSD. Remember last year, when we all got really excited about PCI Express 4.0 and the very first generation of Phison-based two- and four-terabyte SSDs was coming out, and it was "oh my gosh, we can get five gigabytes per second out of this tiny little M.2 gum stick"? There's more to the story than just transfer rate. There's also I/O latency, and how well a drive handles multiple simultaneous requests. It comes down to IOPS: the number of I/O operations you can complete in a given time is generally what determines how responsive and snappy a system feels.

But there's a special case, and that special case is queue depth one. Queue depth one is when one request depends on another request, which depends on another request, and so on. You run into this with a single-user operating system; you don't really run into it with servers, or with workloads that run in parallel. A lot of the time, QD1 translates into "why is it taking forever to load this program?" Something's not super optimized, so it's basically: get me this block on disk; whatever's in the contents of that block tells me the next block I need from disk. How quickly you can get that block, then issue the next request and get that block, and so on, determines how fast it actually is. It's not the bandwidth; you've got plenty of bandwidth, but not a lot of latency headroom. Think about walking along a hallway while somebody's coming the opposite direction: is the hallway wide enough that you can pass each other without anything bad happening, or do you have to stop for a second so that everything works exactly the way it should? Packet flows, packets of data, are not really a lot different than that at the end of the day.
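To put rough numbers on that, here is a back-of-the-envelope sketch of my own (not figures from the video): at queue depth one, throughput is capped by per-request latency no matter how wide the link is. The latencies below are illustrative assumptions.

```python
# QD1 sketch: one outstanding 4K read at a time, so IOPS = 1 / latency.
# The latency figures are assumptions for illustration, not measurements.
BLOCK = 4096  # bytes per 4K random read

for name, latency_us in [("Optane-class, ~10 us", 10),
                         ("NAND flash, ~80 us", 80)]:
    iops = 1_000_000 / latency_us      # requests completed per second
    mb_s = iops * BLOCK / 1e6          # effective bandwidth at QD1
    print(f"{name}: ~{iops:,.0f} IOPS, ~{mb_s:.0f} MB/s")
```

Even a drive that can saturate an 8 GB/s link at high queue depths crawls along at a few hundred megabytes per second or less on a chain of dependent reads, which is exactly the "why is this program loading slowly" case.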
I've got Threadripper Pro, and I want to build a system that's going to break the world record for IOPS. Now, there are different world records for different things. One of them is for hyperconverged systems, and that's actually a cluster of Threadripper machines, not just one. The world record for hyperconverged IOPS, I think, is around 14 million IOPS, give or take, but that's for a cluster of machines servicing a bunch of requests. When you look under the hood at what's happening in that cluster, every machine pretty much has local storage with a local copy of the data set. Whenever there's a write, the write is synchronized across all machines in the cluster, but the reads are all technically local, so it's really fast.

20 million IOPS? I'm not sure we can do it with the hardware that I have. Oh, Threadripper Pro, don't get me wrong: there's not a better workstation you can get right now. 512 gigabytes of memory, all the other goodies, and 128 PCIe lanes on a single socket; you can't get that anywhere else. 10 million is pretty easy: you just drop fast SSDs in and you're good to go. The very fastest NVMe you can get, like the P5800X, if you get six of those you're at 10 million IOPS, not a problem. But 20? That's going to take some elbow grease.

Is 20 million even reasonable? Let's do some napkin math, back-of-the-envelope math, whatever you want to call it. Main memory bandwidth: we've got eight channels, so a theoretical absolute maximum of about 200 gigabytes per second; real world, when we run something like the AIDA64 memory test, it's on the order of 150 gigabytes per second. Now, 20 million IOPS at 4K each (the SSDs have to read it into somewhere, and it's got to go to main memory; that's how we're doing this) works out to about 82 gigabytes per second if I move all the zeros correctly, so roughly half of main memory bandwidth. That's probably about right; you're sort of dealing in orders of magnitude here. If it's "oh, we're within five percent," I've got some news for you: there's this little thing called overhead. It's not going to work, it's not going to fit, it's going to be a little more problematic. Another thing to consider: you could run four or six channels of memory with Threadripper Pro, but your main memory bandwidth suffers, and when you're building something like this, that performance is going to suffer too. Hey, maybe a future video, if you want to run the tests; there's that long forum thread that has all kinds of stuff.

But let's keep doing the napkin math for a second. What about the PCIe side? Well, PCI Express 4.0 is double the bandwidth of PCI Express 3.0, so a x16 slot can do 32 gigabytes per second in each direction. Some documentation says 64 gigabytes per second, but that's reads and writes simultaneously, and for what we're testing that doesn't really apply. So: 32 gigabytes per second per slot, 64 gigabytes per second if I use two x16 slots, and I've got two of the ASUS Hyper M.2 add-in cards. With a few add-in cards, the PCI Express bandwidth for 120 gigabytes per second is there, so that napkin math checks out, because it's got a margin of basically double as well, and double is usually reasonable for overhead-included calculations. So our napkin math checks out, but we're going to have to go a lot deeper than that.
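Here's that napkin math redone as a quick script. The constants are the video's round numbers; the comments on headroom are my own gloss.

```python
# Napkin math for the 20M IOPS target, using the video's round figures.
TARGET_IOPS = 20_000_000
BLOCK = 4096                       # bytes per 4K random read

need_gbs = TARGET_IOPS * BLOCK / 1e9
print(f"data rate required: {need_gbs:.1f} GB/s")             # ~81.9 GB/s

MEM_REAL = 150                     # GB/s, real-world 8-channel figure
print(f"share of memory bandwidth: {need_gbs/MEM_REAL:.0%}")  # ~55%

PCIE4_X16 = 32                     # GB/s per x16 slot, one direction
SLOTS = 7                          # total x16 slots, per the video
print(f"aggregate slot bandwidth: {PCIE4_X16*SLOTS} GB/s")    # 224 GB/s
```

Both budgets leave close to 2x headroom over the raw requirement, which is about the margin you want once protocol and software overhead are counted.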
My first PCI Express device: the Intel P5800X. I bought this, and I paid a lot of money for it, too much money in my opinion, but it's 800 gigabytes and over 2 million IOPS in a single device. It can saturate its PCI Express 4.0 x4 interface; that's right at 8 gigabytes per second. My goodness, that's fast. Unfortunately, because of the price, I've only got the one of those, but I do have some older Optane; we'll talk about those in a second.

Enter Kioxia. They're sort of indulging my mad science a little bit here, and they loaned me a bunch of CD6 SSDs. These are pretty respectable SSDs: in their advertising material they talk about drives on the order of 770,000 IOPS for 4K random reads, which is pretty class-leading for NAND-based flash devices. Optane is something else entirely, it's a different kind of storage, but it's also only 800 gigabytes, whereas you can get up to 16 terabytes of good old NAND flash in the U.2 form factor. I've got ten of these, so we can shove ten of them in this system and get something approaching 7.5 million IOPS, assuming it scales linearly, which is an assumption I'm making for now but something we're going to investigate. So let's keep digging. I also happen to have one Kioxia CM6. This is the more enterprise, more performance-oriented drive, and of course a little more expensive: between 1.3 and 1.4 million IOPS depending on which model you get, at least according to Kioxia's advertising material. But like the P5800X, I've only got the one. I do have a rather large assortment of older CM5s and some M.2 Kioxia (Toshiba, before it was Kioxia) storage devices that are between 500,000 and 750,000 IOPS each. We could really add a lot to this system if we want to go nuts, and we will, don't worry. I've also got a bunch of older Optane: P4800X, and also the M.2 form factor, the 280 to 375 gigabyte Optane devices. Those are older, but they'll usually clear 500,000 to 750,000 IOPS. And finally, I've got Samsung 980 Pros; those advertise 1 million IOPS.

So check it out. For our first configuration, we've got four Intel NVMe, the P5800X and three 375-gigabyte M.2 Optane drives, and ten of the CD6s from Kioxia. That's 7.5 million IOPS just for the Kioxia drives, plus roughly three million, again napkin math, from the Intel Optane drives. That's pretty smoking fast. We're using fio to do all of our testing. fio is the industry-standard tool; it's got a bunch of different I/O engines you can plug into it, and you can do a lot of tuning and real-world simulation with it. fio is a really, really good tool for this kind of testing. Mostly what we're interested in right now is random reads; I'll get into random writes, database workloads, concurrency, and all that other kind of stuff, but probably in a future video, so stay tuned for that. Get subscribed; if you want to support mad science, support Level One.
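As a rough illustration, a run of this kind might be launched like the sketch below. This is my own approximation, not the exact job from the video; the flags are standard fio options, and the device paths are hypothetical placeholders.

```python
# Sketch of a 4K random-read fio run across several NVMe devices.
# Requires fio installed and root access to the raw block devices.
import subprocess

devices = ["/dev/nvme0n1", "/dev/nvme1n1"]   # hypothetical test devices

cmd = ["fio",
       "--name=randread",
       "--ioengine=io_uring",    # modern async engine; libaio also works
       "--rw=randread",
       "--bs=4k",                # 4K blocks, matching the napkin math
       "--direct=1",             # bypass the page cache
       "--iodepth=64",           # deep queues, the opposite of QD1
       "--numjobs=4",            # parallel submitters
       "--runtime=60", "--time_based",
       "--group_reporting",
       "--filename=" + ":".join(devices)]    # fio's ':' list syntax
subprocess.run(cmd, check=True)
```

Random reads against a raw block device are non-destructive, but any write test run this way will destroy the data on those drives, so double-check the filename list before running.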
So we run this. What do we get? 6.8 million IOPS. I spent a lot of time poking at this, looking at it, and thinking it through, and to understand it you have to think about how Threadripper is built. It's a bunch of chiplets plus an I/O die; all of our memory goes through that I/O die, and all of our I/O goes through it too. But some chiplets are closer than other chiplets, and there's definitely a path you have to take from A to B. We can use a tool like lstopo, which lists the system topology, showing the layout of which things are connected where, and we can see that I happened to choose a physical arrangement of PCIe devices that is suboptimal. They're all sort of on one side, so some of the cores have local access to all of the NVMe, but other cores have to go all the way across the I/O die to get to it. See, it's eight memory channels, but it's really four groups of two memory channels. So what I need to do is move half of my NVMe to physically different slots. That's not a problem: there are seven slots total in our ASUS Pro WS WRX80E-SAGE SE motherboard, so moving to another slot is not a problem.

With that in mind, we're not venturing off into uncharted territory here; AMD has actually got you covered for this scenario. There are two options in the BIOS I would call your attention to. One of them is called Preferred I/O. If you happen to be running a really wicked-fast PCI Express 4.0 SSD, something like the Liqid Honey Badger, that's one device: you can't split it up, you can't share it across CPUs. Preferred I/O lets you specify which PCIe bus should get priority across the Infinity Fabric, and that solves this problem. So if I had a single storage device hanging off a single x16 PCI Express card, or even x32 if two x16 slots were on a single node as we have here, I can prioritize that bus by putting the bus number into the BIOS, and the performance will improve, because all the knobs and tunables at a really low level in the system will prioritize that I/O. The other option is NPS4. You see, even though we have a unified I/O die, it's still possible to pass hints to the operating system; alternatively, if you're on Windows, and Windows struggles with understanding NUMA nodes, you can use NPS4. Most of the time you don't need it, but for some of these edge cases where you need a little more visibility into what's connected where, you can go into the memory controller options and set NPS4, and then you actually get four NUMA nodes that show up in the system topology. The output of lstopo will change, and it gives you a little more direct insight into how everything in the system is laid out.
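With NPS4 set, you can also check programmatically which NUMA node each drive hangs off of. A minimal sketch, assuming the standard Linux sysfs layout; the exact path to numa_node can vary by kernel, and the device names are whatever your system enumerates.

```python
# Sketch: map each NVMe namespace to its NUMA node so benchmark jobs can
# be pinned near their drives. Assumes standard Linux sysfs paths.
from pathlib import Path

for blk in sorted(Path("/sys/block").glob("nvme*")):
    node_file = blk / "device" / "device" / "numa_node"   # via PCI device
    if not node_file.exists():          # layout differs on some kernels
        node_file = blk / "device" / "numa_node"
    node = node_file.read_text().strip() if node_file.exists() else "?"
    print(f"{blk.name}: NUMA node {node}")
```

fio jobs can then be kept local to each drive with --numa_cpu_nodes (requires fio built with libnuma) or --cpus_allowed, so submissions run on cores near the device they're hammering.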
So yeah, we get it up and running: 11 million IOPS, 11 million and change. That's pretty good, and that's sort of what I put on Twitter the other day. Everything was actually working really well, but we're already starting to see a little bit of falloff. What I mean by that is: if you test each device individually and add up the results, that sum is a much bigger number than what you get when you test all of them simultaneously. That's the whole people-in-the-hallway thing again. When everybody's in the hallway, when all of the Infinity Fabric bandwidth and all the PCIe bandwidth across the entire system is being used by these mad, crazy benchmarks, things have to shuffle around a little bit in order to make everything fit, and because of that you pay a small bandwidth penalty. A theoretical 12 million and change IOPS versus 11-point-something measured: that overhead is very low, perfectly reasonable, and completely tolerable.

But what if we get unreasonable? Now we've got 28 NVMe in Threadripper Pro. Don't ever do this; this is crazy. 28 NVMe in Threadripper Pro, are you a madman? Yes. Yes, I absolutely am, one hundred percent, a madman. 28 NVMe has no business here, but we can do it. So, 28 devices, including nine million more IOPS on paper with those Samsung 980 Pros, eight of them, added to the mix. What do we get? That should be 20 million IOPS, right? The napkin math says... oh. It's 15.2. 15.2 million IOPS. What's going on? Okay: we've basically hit the bandwidth limit. These devices are spread out all over the system, and I've got a couple of them actually running at PCI Express 3.0 speeds, because some of the sketchy adapters I got from Alibaba won't run correctly at PCI Express 4.0. But hey, 15.2 million IOPS is better than 14.8. That's a world record, right? Kinda. Yeah, I think about 15.2, 15.3 million IOPS is all I'm going to be able to get this generation, this iteration, for this video. In a future video I'll probably revisit this; it goes without saying that I've got to try to get 20 million IOPS on EPYC Milan. Certainly a two-socket EPYC Milan system, with double the memory channels and double the PCIe root complexes to spread the bandwidth over, makes 20 million seem a lot more reasonable, and if we can do 15 on this, we can probably push 25 on a two-socket system. Probably. I'm getting ahead of myself; the napkin math is dangerous, and you don't want to paint yourself into a corner with it, but it's definitely something to check out in a future video. I also want to try to get my hands on a Liqid Honey Badger, because those are fast.

I spent a lot of time working on this, and there's actually a lot going on under the hood. There's the scheduling algorithm: how does the Linux kernel schedule I/O across such a huge number of devices? Out of the box, the Linux kernel now uses something called hybrid polling. This is something that's popped up on this channel kind of a lot over the last year, but it's the default out of the box for high-performance NVMe now, which is great: if you're running a modern desktop distro, all the software performance knobs and tunables are mostly, pretty much, already there. There are things like read-ahead; turning read-ahead off or on can make a difference, though in my specific scenario it didn't really make much of a difference, and if anything it sort of hurt performance. fio also has those different I/O engines I mentioned before; sometimes you can find an I/O engine that works a little better. You can also switch to 100 percent straight polling instead of hybrid polling, which uses a combination of interrupts and just waiting for the drive to complete. Straight polling just polls the drive continuously. That uses a lot more CPU overhead, but for getting the world record on IOPS it could be a good thing, although you wouldn't really want to do that in most real-world scenarios. And if you think through the napkin math on how much CPU overhead we have here, in terms of the number of CPU cycles you have versus the number of I/O requests I'm generating: the number of CPU instructions that occur for one block transfer is on the order of ten thousand. Ten thousand instructions on a 4.2-ish gigahertz processor is a tiny, tiny drop in the bucket, but when you're chasing 15 or 20 million IOPS, there are a lot of drops in that bucket, and before you know it, the bucket is full. Full of lots and lots of yummy, delicious PCI Express 4.0 data.
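Running those per-request numbers out as a sketch, using the video's ~10,000-instructions figure; the instructions-per-cycle value is my own assumption, and real numbers vary widely with kernel, driver, and polling mode.

```python
# CPU cost of fully random, interrupt-driven I/O, per the video's rough
# figure of ~10,000 instructions per block transfer. IPC is an assumption.
INSTR_PER_IO = 10_000
CLOCK_HZ = 4.2e9        # ~4.2 GHz Threadripper Pro clock
IPC = 1.0               # conservative instructions-per-cycle guess

def cores_needed(iops: float) -> float:
    """Rough count of CPU cores consumed driving this many random IOPS."""
    return iops * INSTR_PER_IO / (CLOCK_HZ * IPC)

print(f"980 Pro,  1.0M IOPS: ~{cores_needed(1.0e6):.1f} cores")   # ~2.4
print(f"P5800X,   2.5M IOPS: ~{cores_needed(2.5e6):.1f} cores")   # ~6.0
print(f"array at 15.2M IOPS: ~{cores_needed(15.2e6):.1f} cores")  # ~36
```

Those estimates land in the same ballpark the video describes next, a few cores per million random IOPS, which is why pure polling, which burns even more CPU, only makes sense when you're chasing a record.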
In the real world, the CPU utilization wouldn't really be this high, because the I/O wouldn't really be this random. You see, when there's a relatively small amount of I/O, it tends to be random in real-world usage patterns; data processing, which is one of the things we're working on for another video, tends to be more sequential. When you ask the NVMe to transfer a megabyte of information, or transfer this list of blocks, or do some long, complicated sequence of things, the NVMe is able to do more work on its own without bothering the CPU. So to achieve 60 or 70 or 80 gigabytes per second, which this array can do no problem (again, we're hitting that main memory bandwidth limitation), if it's copying large, continuous blocks of information from the NVMe to memory, and not a million little random blocks, there's virtually no CPU overhead in doing that. The CPU overhead comes from the sheer number of requests, which have all been randomized, so it's not exactly 100 percent real-world representative either. The fact of the matter is that we've got single NVMe storage devices approaching 1 million input/output operations per second, and the reality is that to saturate that many IOPS per device with truly random I/O, you're looking at more than one CPU core per device just to keep it busy, which is sort of crazy. That's the world we're in: one million IOPS on the Samsung 980 Pro will keep about three of these Threadripper Pro cores busy servicing that many I/O operations, and that Optane P5800X, if you want to get two and a half million IOPS out of it, which is about where it tops out, you're looking at about five threads on the Threadripper Pro for truly random I/O. Now remember, in the real world you're probably not dealing with that much random I/O in such a short amount of time, but it is nice for a benchmark. So yeah, admittedly, 28 NVMe is not real-world, and the testing itself is not exactly real-world, but it is useful for telling us what the hardware is capable of at a base level. In a way, this kind of testing really is just more napkin math, but it's the real-world napkin math, as opposed to the armchair-conjecture napkin math, if that makes sense.

There are plenty of knobs and tunables I can still try, so there's plenty of material here for a future video, and in fact, if you have ideas for stuff I should try or look at, you should post in the Level One forums. I've created a thread there with a rundown of some of the fio performance parameters for these different devices, and some of the stuff I encountered for random reads, of course. Check that out, reply there, and let me know what your ideas are; there's probably some pretty interesting stuff going on here. I'm Wendell. This is Level One. I want to hear from you in the Level One forums: if you found this interesting, or you have any questions, or something I can include in a follow-up, come to the forums and post to let me know. I'm Wendell, I'm signing out, and I'll see you there.
Info
Channel: Level1Techs
Views: 25,470
Rating: 4.9685922 out of 5
Keywords: technology, science, design, ux, computers, hardware, software, programming, level1, l1, level one
Id: RfrMnVpPuVw
Length: 20min 57sec (1257 seconds)
Published: Wed May 26 2021