128 Cores & 3D V-Cache EPYC - Launching Today!

Captions
Two very important things are happening. First, cloud computing providers are offering general availability of the 128-core, Zen 4c-based AMD EPYC CPUs. 128 cores in this socket is not an incremental improvement; it's an engineering change and something of a milestone for AMD, a pretty big engineering win. We'll talk more about that. The second thing is 3D V-Cache Genoa: 1.1 gigabytes of cache in a socket. I've been hands-on with both of these systems, and they're the fastest CPUs in the world by an absurdly monstrous margin. At this point AMD is just killing it, literally and figuratively. Let's take a closer look. [Music]

First up is 4th Gen EPYC with 3D V-Cache. AMD already spilled the beans on its capabilities and engineering at the San Francisco AI and data center event last month. These CPUs have 13 chiplets totaling up to 96 cores, and each of the 12 compute chiplets can also carry an extra stack of Level 3 cache. That's what's launching today: 1.1 gigabytes of L3 cache right on the CPU. In a short time, 3D V-Cache went from an R&D experiment with TSMC to dominating the technical-computing segment of the industry. It's pretty cool when you can stand up an entire vertical on what started as "let's do some R&D experiments," but 3D V-Cache is here to stay. If you do any type of computing that requires software from Altair, Ansys, Cadence, Synopsys and others, you're really missing out if you aren't looking at these CPUs with their monster L3 cache.

There are three products launching in the 3D V-Cache stack, from 16 to 96 cores. The 16-core part clocks up to 4.2 GHz and is designed primarily for EDA and design work, or anywhere you're constrained by a per-core licensing fee and want those cores to be as fast as possible, no matter what. In my own benchmarks I was seeing some impressive single-thread gains over their Milan-based 3D V-Cache counterparts, and those CPUs only launched a little over a year ago. It's madness in the marketplace: these parts are in such high demand, and core counts are now so high, that a lot of people are opting for single-socket servers. 96 cores and 12 memory channels in a single socket? That sounds good to me.

One other note: my local test system here has 1.5 terabytes of memory. That's so much memory I feel guilty it isn't doing something for an open-source project every millisecond I'm not actually using it. These SP5-generation CPUs can technically support six terabytes of RAM per socket. If you need more than a terabyte or two of RAM, the cost delta between a 3D V-Cache CPU and a non-3D V-Cache CPU is small enough that you should just go ahead and get the 3D V-Cache part. The more RAM you have, especially past that one-terabyte boundary, the more negligible the cost is when factored into the overall cost of the system, and more cache can cache more memory more effectively.

That also leads into machine learning, which is kind of hot right now. A huge cache can make a difference there, but it really depends: a lot of the algorithms have been optimized for CPUs without a lot of cache. More cores is "more better" for AI, to be sure, but more cache can help, as we're starting to see in some machine-learning benchmarks. Zen 4 cores are really big and powerful cores, and some machine-learning jobs actually make sense to keep on the CPU for that reason. Don't believe me? Check out Neural Magic and their SparseZoo: those models have been "sparsified," to use their word, to run sparsely on EPYC CPUs, and it's no coincidence there are a lot of quotes from AMD folks there.

96 cores is really cool for that kind of workload, but what about 128 cores and cloud-monster density? That's Zen 4c. Zen 4c is going to do more cores, more
better. From 96 to 128 cores you just add more cores, right? No, it's not that simple. Well, okay, it actually is that simple for AMD, but it's their own brilliance that made it that simple, and it has flown under a lot of people's radar. Pay attention.

This is a 96-core CPU. This is Genoa, not Bergamo: 96 cores, so each compute chiplet has eight cores, and this thing in the middle is the I/O die. That's where your PCIe and memory interconnects live. In the socket, on our Supermicro motherboard, all 12 memory channels go to the I/O die, and all of our PCIe 5 connectivity goes to the I/O die as well. The chiplets sit behind the I/O die: the chiplets talk directly to the I/O die, and the I/O die talks to the rest of the system. Supermicro, AMD and other vendors have to work together to qualify the entire platform.

What AMD has done differently with Bergamo is use fewer chiplets. There are only eight chiplets instead of 12, because each chiplet has 16 cores, but each core is still a Zen 4 core. This is different from the Rome-to-Milan generational change in a sense: it's still just Zen 4, shrunk a little, so we get better power utilization, and it means AMD and their partners don't have to qualify an entire new platform. The new silicon really just drops in behind the I/O die. By treating the compute chiplets almost like a peripheral, you dramatically shave the engineering time needed to bring up new products. It's a new lithography process, a new shrink if you want to think of it that way, but you haven't had to completely re-engineer the platform. Reusing the I/O die is not anything new, but normally you'd still have to make changes to the platform, minimally something like the AGESA version, as we saw from Rome to Milan. Because this is still exactly a Zen 4 implementation, you don't have to do that.

To get the shrink, to fit 16 cores on a chiplet that's basically the same size as the existing one, AMD also had to eliminate the vias, so there won't be a Bergamo 128-core part with stacks of extra V-Cache or anything like that. I think that's going to be fine for AMD's edge customers; they're able to customize the platform in a way that makes sense. Think about the possibilities this opens up: if AMD wanted to qualify, say, an Arm chiplet, all of the engineering for the DDR5 PHY and the PCIe is already done. AMD could do that in secret and no one would ever know, because all of the other physical interconnectivity is already here. They'd just have to prototype a chip that plugs in behind the I/O die, and literally the entire rest of the platform is good to go. This is a serious competitive advantage that has flown under the radar, but now that cloud providers are taking up this 128-core Bergamo as fast as they possibly can, I have a feeling more people are going to notice. At the end of the day it's a 128-core CPU with four fewer chiplets in basically the same physical design. What's not to love?

Even though there are 16 cores in each chiplet, from an I/O-die perspective it's still as if it's talking to two eight-core chiplets. Doing it this way meant the I/O die is 100% already validated. Those Zen 4c cores are exactly, precisely the same Zen 4 cores from an electrical and logic standpoint, so AMD doesn't have to work with their board partners to completely re-qualify all their existing solutions. That cuts the engineering time in half, and that's why I say this is an engineering win that's flown under everybody's radar. CPU engineering projects are years-long projects; "turning on a dime" doesn't really apply to CPU engineering. And yet, by doing this, AMD can
just swap out the chiplets. The onus is on them to do the qualification, because everything else has already been qualified: the memory controller, the memory interface, the physical board stuff. If it works with the Genoa Zen 4 chiplets, it should also work with Zen 4c chiplets, or whatever other chiplets AMD is cooking up in the background. This is huge from an engineering standpoint, and it does mean that AMD can turn on a dime, as much as any company can for this kind of engineering.

The only real downside is that the boost clocks are maybe not quite as high when we're talking about 128 cores, because these 128-core CPUs are in the same power envelope: instead of 12 compute dice there are only eight, each with 16 cores. Also interesting: there is a version of Zen 4c with SMT disabled for some cloud partners. I asked AMD about this, and they said some cloud providers simply don't want to sell symmetric multithreading capabilities, presumably for security reasons; AMD didn't really ask why, because those customers just said they don't want SMT. For something like an Amazon EC2 micro instance, giving someone one core with two threads is great; there's no reason not to do that, from a security perspective or otherwise. But if someone's running, say, an open-source database server on a slice of a 128-core, or 256-core two-socket, system, then disabling SMT maybe makes sense, because you don't want to leak information.

Now for the overall performance breakdown of these 128-core parts. Even though the boost clocks aren't quite as high, it's still insanely impressive. Multimedia: you've got a lot of cores, or you've got a lot of cache, so what does multimedia look like? The breakdown here is pretty interesting. Starting with SVT-AV1, HEVC and Embree, the results are pretty much as you'd expect: the 9684X is on top, dominating in things like Embree and path tracing, while in HEVC the 9754 holds its own. Just remember that even though we're dealing with 16 cores per chiplet, we're still dealing with half the cache per core cluster: 16 megabytes of L3 versus 32. Some sacrifices were made to get everything to fit on chip, but it doesn't really hurt very much in most workloads at all.

Compilation was also really interesting. I worked on Greg Kroah-Hartman's compilation system, where more cores was always "more better," but in this case the good old dual-socket 9654, with its two 96-core CPUs, is pretty much the ruler across the board. The 9684X was better in some scenarios: if you're using the Ninja build system and building something larger like LLVM, you can shave a couple of seconds off your build. But generally the regular Zen 4 cores do better than their Zen 4c counterparts.

oneDNN was also of particular interest. With 128 cores, maybe it does a little better? No: it seems the loss of cache and the physical characteristics of those Zen 4c cores don't really do AMD any favors. Of course, the software has also improved since the last time I benchmarked this; the 96-core CPUs we looked at six months ago have improved dramatically just because the software is catching up. In oneDNN Intel will still dominate, because they've got hardware acceleration built in, but the gap has closed significantly with the 9684X processors. This also seems like a workload that is memory-bandwidth-bound, or somehow linked to memory bandwidth, because the 9374F CPUs (remember, those have dual GMI links, so a lot more ability to get data in and out of those compute chiplets) really shine in most of the oneDNN benchmarks. Conversely, with 128 cores fighting over the same 12 memory channels,
it does quite a bit worse for oneDNN-type workloads.

I also took a look at R, RNNoise and Redis. Nothing was super surprising here; the 9374F still dominates. Keep in mind that some of these benchmarks are single-thread-oriented, and in a real-world scenario you're probably not going to use Redis exactly the way it's configured in the benchmark. But if you've benchmarked a bunch of systems this way, you can get a baseline for how things are going to go. These are very impressive results, don't get me wrong.

Neural Magic DeepSparse was another thing I wanted to look at. If you don't believe AI can be done on CPUs, you should definitely take a look at Neural Magic and what they're doing with the sparsification of neural nets; I've covered that a few times on this channel before, and it's pretty darn good. Here I think I'd rather have more cache than more cores: the 9684X is dominating pretty handily, but again, sometimes the 9374F pulls ahead. It depends on what you're doing, and it's probably down to memory bandwidth on a per-core basis; each core physically has more memory bandwidth available to it. That's not true for image classification in DeepSparse, though: doing ResNet/ImageNet image classification, the 9654 dominates pretty much everything, with the 9684X turning in a very nearly identical score, the extra cache not really helping. I have a feeling that could be fixed with some tuning of the model and what Neural Magic is doing here; it's probably set up for 32 megabytes of L3 cache and not using much beyond that, but maybe that can be improved in software.

RocksDB is another interesting one; RocksDB is definitely a favorite of Intel's, and there's a really interesting thing going on here. Depending on whether you're doing reads while updating, updates while writing or random writes, you get different performance characteristics. In some scenarios the 128-core part pulls ahead, in some the large-cache part pulls ahead, and in some the 9374F with its dual GMI links pulls ahead. It just depends on what you're doing with RocksDB. Pretty interesting stuff.

PyBench is one of those benchmarks that's single-thread-ish (it's really a benchmark of a single thread), and the 9374F takes the crown because of its insanely high single-thread clock speed. But the 9684X scales better, because it's got more cores and can run more in parallel, and is pretty much the champ on this chart. Running tests with nginx and Apache, looking at sheer connections per second, the 9754 dominates. You've got to be a little careful with this benchmark, though, because sometimes the 9684X or the 9654 will come out on top depending on the geometry you set up for nginx, like how many processes run independently; there's a lot to factor into how you actually set up an nginx test. But doing 4,000 parallel connections, I can clear 230,000 connections per second on a well-tuned 9754 system. 128 cores outrunning your 96-core counterpart? Yeah, it's possible.

I also tested NeatBench and Geekbench just to give us an idea of what we're looking at across the different systems. Geekbench doesn't really take advantage of the sheer number of cores here, so a lot of the time the 9374F will dominate, just because Geekbench and some of these other benchmarks are not fully loading the system. Nevertheless, it's interesting to compare to PHPBench, which is basically a single-thread benchmark and shows there isn't much difference for these web-server-type workloads between all of the different CPUs.
The 9374F comes out on top because this is basically a single-thread benchmark, but there really isn't much of a difference between the 9754 and the 9654, between 96 and 128 cores, for a lot of PHP-type workloads. The extra L3 cache you give up by moving to 128 cores doesn't make much difference here, and we can see why someone who needs 128 cores, or 256 cores, in a single system will benefit greatly from having that many more cores in a single box.

I also ran SPECjbb for good measure. This is one of the few comparison points I have for Arm; I don't have an Arm system on hand, but there was definitely a lot of fun, interesting stuff I learned at Computex about Ampere and their platform, and I can't wait to get my hands on one of those Arm systems to play with. I'm pretty sure this SPECjbb run just blew what we were looking at from the Arm system out of the water: the 9754 at 150,000 estimated composite max-jOPS (that's nuts), with composite critical-jOPS estimated at 92,518. That is, I think, the fastest I have ever seen from any system. Truly breathtaking.

Just for fun, I also ran some comparison benchmarks against what we were benchmarking last year at the launch of the 7773X, the Milan-based generation, and it's anywhere from 25% faster to more than 200% faster, depending on what you're doing. Xcompact3D benchmarking down to four seconds versus 17.7 for the 7773X: that's not just core IPC improvements, that's platform improvements, memory bandwidth, DDR5 and so on. Molecular dynamics simulation: 47 frames per second versus 31 from the previous best, the 7773X. The NWChem buckyball simulation: about 1,700 versus 2,200 previously. Okay, that's about the uplift you'd expect, and this benchmark is a bit of an outlier in that the performance improvement here is expected rather than dramatic. But from 7,400 to 3,700 in the WRF 4.2.2 benchmark? That's a doubling. We've more than doubled in a generation, in just a year, depending on what you were doing. If you had a cluster of these, it might have made sense to just wait a year for things to double in speed before buying the cluster to solve your problem. That's an impressive generation-on-generation improvement, and to be sure, it's not just IPC; it's the whole platform benefit. Even something like the Blender BMW classroom scene: software improvements, IPC uplift and platform improvements move it from 15 seconds down to nine. There are all kinds of interesting tidbits in these results, so be sure to check out the benchmark results linked below.

The results speak for themselves, and it all still fits in a 400-watt-per-socket power envelope. AMD already had a commanding lead in general-purpose server compute, and now they've specialized for even higher density (20% more density in the same power envelope for the large cloud providers) or even more cache, capturing the whole EDA, computational fluid dynamics and specialized-compute field. Pretty much anywhere that can benefit from a large CPU cache, AMD looks like they've got that engineering process locked up nicely. AMD's got a part for all of this: 128 cores, 12 memory channels, reasonable single-thread performance. It's sure looking like AMD is racking up a lot of wins, even in mass production, with impressive results across the whole Genoa family. You might be able to replace your dual-socket servers with a single socket and still get more than a 20% performance bump, even if you bought your servers just two or three years ago. That kind of gen-on-gen uplift, even versus AMD's own products, I can't recall seeing before in the server market. AMD is the relentless execution machine, and they're showing no signs of slowing down. It really is genuinely impressive.

If you have a project or something you'd like to run on one of these systems, reach out and let's connect. I would love to look at other real-world workloads and run them, beyond my own experimentation: running web servers, web servers with acceleration, looking at the Pensando accelerator and how much compute you can save by doing work on the Pensando PCIe card instead of on the CPU, and then 128 cores, the density, and wireline 25- and 50-gigabit Ethernet. It really makes me not want to work on systems that aren't DDR5 and PCIe 5, because things have come so far. It's a watershed moment in that the floor of performance here is so high that pretty much everything else is fairly obsolete. Don't get me wrong, the performance of older parts is still really good, but this is another head-and-shoulders leap, generation on generation, and that's very, very impressive.

We've got some other coverage coming on specific system setups, like our single-socket system here based around a Supermicro H13SSL motherboard; it's amazing what you can do in an ATX form factor, even given how monstrous the footprint of the CPU has become. This has been a quick preview of the large-cache and Bergamo CPUs. Full links to our Phoronix benchmarks are below, and we've got some other benchmarks coming up, so stay tuned. This is Wendell, this is Level1, signing out. You can find me in the Level1 forum. [Music]
Info
Channel: Level1Techs
Views: 80,901
Keywords: technology, science, design, ux, computers, hardware, software, programming, level1, l1, level one
Id: 2XcRZIMgqKU
Length: 21min 50sec (1310 seconds)
Published: Wed Jul 19 2023