What is NUMA?

Reddit Comments
👍 14 · u/Red1ck · Sep 06 2018 · replies

he mentioned people getting better results from gaming by turning core0 off for the game in windows, talking core affinity. any links to this?

i found this https://www.reddit.com/r/GlobalOffensive/comments/6yukg1/thesis_disable_core_0_for_better_fps_an/

seems interesting

👍 7 · u/loggedn2say · Sep 06 2018 · replies

Nu ma balls lmao😂😂😂

👍 5 · u/framed1234 · Sep 07 2018 · replies
Captions
Okay, the shortened version of the title I could come up with for this video was "making your games go faster on Threadripper," but not necessarily. At least I want to explain it, because there isn't necessarily a silver-bullet solution, but if you understand what's happening under the hood you can probably make some better choices about how to organize your system. This is also a really good video from a computer science standpoint about NUMA, non-uniform memory access, and the consequences of NUMA versus UMA. We're going to use Threadripper processors as an example, but I've also got a nice dual-Xeon system in front of me, or at least it was a nice system at some point in the distant past, about ten years ago: one of those Dell R710s, dual Xeon, tons of memory slots. This is the model we can use for learning. So I guess it's exciting, maybe, NUMA. And if anybody is going to make non-uniform memory access exciting, I'm your guy. [Music]

This is the quietest Threadripper system ever; there's a 1950X in here, and I've got my lab mic on. We don't do a lot of audio treatment because we don't know how, so it's a pretty good, pretty quiet system, which is really super exciting. And this is our Dell R710, which I'm just going to use for explaining stuff, because this is really about CPU topology.

If you're not familiar with computer science, or you haven't made it that far in your computer science classes: CPU topology. What is the topology of a CPU? Just like a topographic map, where you've got mountains and rivers and valleys, it's a map of sorts for how everything is laid out in the system. If we take a look at our R710, we can see that we've got some expansion slots in the back, two CPUs, and two banks of memory. If you look at the inside cover, it'll say these slots are associated with this CPU and those slots are associated with that CPU, so this CPU handles this memory and this I/O, and the other CPU handles that memory and that I/O.

But when you use a system like this, it doesn't show up as two different computers. Did Intel just glue two computers together? Well, kinda, yeah. There's an interconnect between these two CPUs called QuickPath, and it's an insanely fast interface for CPUs of that era. On Threadripper it's called Infinity Fabric, and instead of being two physical CPUs it's all in one package. AMD has taken their Zen dies, like the one you would have in a 2700X for example, and put two of them in the Threadripper 2950X, or four of them in the 2990WX, although it's probably closer to a 2700 and not the X because of power considerations for Threadripper. And then Epyc CPUs, which are AMD's server CPUs, are the same deal, except you can actually run two of those in a system as well, with an interface that goes through the socket and all of that.

So we've got two banks of memory, and each bank of memory is serviced by a CPU. But if CPU one needs something from CPU two's memory, it can request it through QuickPath, and the other CPU will service the first CPU's requests for memory or I/O or whatever. So then you might be asking: what is the topology of Threadripper? That's a great question. Threadripper is, to my knowledge, the first mainstream CPU with a configurable CPU topology.
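As an aside, here's a rough PowerShell sketch for getting a similar "which package, how many cores, which memory slots" picture on a Windows box. It's just an illustration using the standard Win32_Processor and Win32_PhysicalMemory WMI classes, and the exact output will vary by system:

    # Rough topology overview: one row per CPU package, then the installed
    # memory modules and which slots they sit in.
    Get-CimInstance -ClassName Win32_Processor |
        Select-Object DeviceID, Name, NumberOfCores, NumberOfLogicalProcessors

    Get-CimInstance -ClassName Win32_PhysicalMemory |
        Select-Object BankLabel, DeviceLocator,
                      @{Name = 'CapacityGB'; Expression = { $_.Capacity / 1GB }}

This won't show NUMA nodes or PCIe attachment the way lstopo does, but it's enough to see how many packages and memory banks you're working with.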
You can use the Ryzen Master software to determine whether the 1950X or the 2950X presents itself as a NUMA or a UMA CPU, non-uniform or uniform memory access. The reason it lets you pick is that even though the part is technically NUMA, there's so little penalty in the way the CPU is put together that you can get away with running UMA, uniform memory access, mode on most operating systems, and that actually helps you instead of hurting you, except in some scenarios. That's really what this video is about, so we're going to talk about that.

This also has implications for the recent Meltdown and Spectre vulnerabilities, so we're talking about speculative execution and what speculative execution means. Intel is a little bit more affected by these issues than AMD, because AMD's is a fundamentally newer microarchitecture that was designed at a later point in time, when some of these things were taken into account, so it's a different critter between Intel and AMD. But I'm getting a little ahead of myself. It really is super exciting because of the consequences for the computer science and the security vulnerabilities and all of that. And you might be thinking, wait, wait, wait, what does this have to do with uniform or non-uniform memory access? Let's take a look at some CPU topologies.

Okay, let's take a look at NUMA mode on the 1950X. This is a diagram that shows you the layout of everything in the system, and it's pretty easy to understand. You can see that we've got our eight cores and two CCX complexes, each with its own L3; those are our two four-core CCXes on the same piece of silicon. Then we can see another piece of silicon set up the same way over on the other side. We can also see PCIe devices identified by vendor:device IDs, things like 1002:67c4 and 1022:43b6, and, connected to those, devices like 8086:1533 and 8086:24fd. These represent PCI Express devices, and on Linux you can run lspci to see what your PCI Express devices are. On Windows it isn't really important for you to know which PCI Express IDs go where, but I can give you a preview of why GTA 5 gets faster, and you can tell just by looking at the picture.

If you were to run GTA 5, which doesn't really need more than about four cores, you can tell GTA 5: hey, don't ever run on anything other than cores 4, 5, 6, and 7, which is really logical CPUs 8 through 15 because of SMT, simultaneous multi-threading, AMD's version of Hyper-Threading. So you can see we've got our cores and their SMT threads. When we do that, if our graphics card happens to be connected over here, then all of the I/O that Grand Theft Auto needs to do to communicate with the memory, the graphics card, and the CPUs it is pinned onto all happens here. But if we pin the game to these CPUs and the graphics card actually ends up being attached down here, then this NUMA node has to communicate with that NUMA node. Even though we don't have two sockets in our Threadripper system, Threadripper is basically multiple CPUs on one carrier, so there is a penalty, an overhead, for communicating between those two pieces of silicon; there are basically two 2700Xs inside a 1950X. So if one die is doing the processing and the other one is handling the graphics card, that will actually hurt things in terms of low-latency performance, which is where a lot of the really high frame rate benchmark differences come in.
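To make "pin GTA 5 to cores 4 through 7" concrete, here's a minimal PowerShell sketch. It assumes the game's process is called GTA5 (check Task Manager for the real name on your system) and that logical CPUs 8 through 15 are physical cores 4 through 7 plus their SMT siblings, as in the layout above:

    # Build an affinity mask covering logical CPUs 8..15
    # (physical cores 4-7 with SMT in this example layout).
    $mask = 0L
    8..15 | ForEach-Object { $mask = $mask -bor (1L -shl $_) }   # 0xFF00

    # Pin the running game to those CPUs. "GTA5" is an assumed process name.
    $game = Get-Process -Name "GTA5"
    $game.ProcessorAffinity = [IntPtr]$mask

The setting only lives as long as that process, which is part of what tools like Process Lasso (mentioned later) automate for you.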
So that's why I've sort of been laughing at people wigging out over super-high-frame-rate results, because it's really not about the system being slower in terms of computational horsepower; it's about the system being a little worse in terms of latency. And the reason it's worse is that the scheduler and the operating system, a lot of the time, don't take into account how things are organized in the best possible way. Some operating systems do take into account how things are arranged, depending on which program you're running, and that can actually hurt performance. Say you're running 7-Zip or something like that: Windows doesn't handle that particularly well either, because it will say, oh, we're running these processes across NUMA nodes, that's not going to be good for performance, let's not do that. And that's because it's optimized for a situation like this server, where there is much more of a penalty to go from one CPU to another CPU in a different socket, because of the memory and everything else. The penalty on Threadripper is relatively low for those same kinds of operations.

If you've been paying attention, this system has 64 gigabytes of memory: 32 gigabytes is associated with this NUMA node and 32 gigabytes is associated with that NUMA node. Now, with the 2950X and the 1950X, if you turn on UMA mode, it will lie to the operating system and say, okay, everything is connected to one place, all the buses are connected together, and all of a sudden the scheduler can't really take advantage of the true topology of Threadripper. So if you're in that situation (that's why there's the Game Mode thing), most of the time, if you just take a little bit of care, you can run a PowerShell command or use some other configuration option and say, hey, for this game I want to run on these particular cores, and you run it on the cores that you know are associated with your PCI Express device. Even if you don't want to use Linux and run lstopo to see what your topology is, there are tools for Linux that will let you do that, tools for Windows that will let you do that, tools for most operating systems. And by doing that, you minimize how much of the rest of the system that information has to travel across.

Now, if you're running GTA 5 at 4K, because you need a lot more computational horsepower and because the graphics card has trouble keeping up, so the frame rates are not as high, these kinds of issues matter much less. The reason, again, is not bandwidth but latency: the graphics card needs more time to process a 4K frame, which means it is more tolerant of little blips of latency while the data is taken from memory, delivered through the system bus, and handed to the graphics card. When you're talking about 150 or 160 frames per second, that tiny extra hop on the Threadripper CPU to go from one die to the other could cost you five or ten frames per second; at 4K it's going to make a difference of maybe one or two frames, because the bottleneck is no longer the Infinity Fabric, no longer the two CPU dies on one package talking to each other. It's the graphics card or the system bus or somewhere else.
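For the Windows side of "tools that let you see this," you can also read a process's current affinity back and decode it, to confirm where the scheduler is allowed to put it. Again a sketch, and again assuming a process named GTA5:

    # Read back and decode a process's affinity mask to see which logical
    # CPUs Windows may schedule it on. "GTA5" is an assumed process name.
    $game = Get-Process -Name "GTA5"
    $mask = $game.ProcessorAffinity.ToInt64()

    $allowed = 0..63 | Where-Object { ($mask -band (1L -shl $_)) -ne 0 }
    "{0} may run on logical CPUs: {1}" -f $game.ProcessName, ($allowed -join ', ')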
So then you might be thinking: what about the 2990WX? What kind of monster is that? Well, that's basically a server part for the desktop. Now, on the 2990WX, unlike the 2950X and the 1950X, AMD does not give you the choice of running in UMA, uniform memory access, mode, and the reason for that should be apparent from looking at the topology. If we look at the topology, we can see that we have four NUMA nodes. Two of the NUMA nodes have all of the memory and all of the peripherals connected to them; the other two NUMA nodes don't have any memory or peripherals. That means that to load information into those processors so it can be processed, they have to rely on the other nodes.

That's one of the reasons why performance can regress. If you look at the benchmarks from AnandTech, or the Phoronix benchmarks, all these Windows-versus-Linux benchmarks, performance can regress when you're running certain workloads on a 2990WX versus a 2950X. You would think a 2990WX would always perform at least very close to a 2950X, because worst case you just don't use the cores that are on these hidden NUMA nodes. That's what the term means, that the node is hidden: it doesn't really have direct access to the system. Intel was actually the first company to do that; some of the high-core-count Xeons basically have cores hidden from the system. They might have direct memory access, but they don't have direct access to the bus, and that was also a source of bottlenecks in certain workloads on the Intel side of things.

But if your workload is such that you aren't bottlenecked on memory or on your bus, then you can keep those extra CPU cores fed. That's why certain types of tasks, like rendering, or compiling the Linux kernel, and things like that, can still really benefit in that situation, even though those dies don't have direct access to the rest of the system. They still have 16 megabytes of cache each, they still have their own cores, so it really does add a lot to the system to have them, just not as much as it would on the server counterparts. And this is also sort of the difference between the Threadripper part and the Epyc part, because the Epyc part, AMD's server CPU, does have the extra memory and I/O connections for those dies, so your peripherals and your memory can be split among the dies. Just like this server has several CPUs with the peripherals routed to them in different ways, you can do the same thing on the server version of the CPU. So it's really sort of exciting.

Now, I mentioned mitigations and things like Spectre and Meltdown. One of the vulnerabilities on the Intel side is that when you switch from a system process to a user process, the level 1 cache is not flushed properly. If you look at the diagram, we've got instruction and data caches at level 1. If I run a system program and it loads data and executes, and then I switch to a user program, the user program would be able to read, at least on Intel CPUs, the contents of that cache memory, which may contain passwords or encryption keys or other sensitive data. So the fix has been to simply flush the level 1 cache when you switch context or switch processes.
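Going back to the hidden nodes for a moment: if you want to check which of your NUMA nodes actually have local memory, Windows exposes that through a couple of kernel32 NUMA functions. This is a rough P/Invoke sketch added for illustration (marshalling kept minimal), not something shown in the video:

    # Ask Windows how many NUMA nodes it sees and how much memory each has.
    $numaSrc = '
    using System;
    using System.Runtime.InteropServices;
    public static class NumaInfo {
        [DllImport("kernel32.dll", SetLastError = true)]
        public static extern bool GetNumaHighestNodeNumber(out uint highestNode);
        [DllImport("kernel32.dll", SetLastError = true)]
        public static extern bool GetNumaAvailableMemoryNodeEx(ushort node, out ulong availableBytes);
    }
    '
    Add-Type -TypeDefinition $numaSrc

    $highest = [uint32]0
    [void][NumaInfo]::GetNumaHighestNodeNumber([ref]$highest)

    foreach ($node in 0..$highest) {
        $bytes = [uint64]0
        [void][NumaInfo]::GetNumaAvailableMemoryNodeEx([uint16]$node, [ref]$bytes)
        # On a 2990WX-style layout, two nodes report memory and two report zero.
        "Node {0}: {1:N0} MB available" -f $node, ($bytes / 1MB)
    }

On a 1950X in NUMA mode you'd expect two nodes, each reporting roughly half the installed memory.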
If you use Windows, Windows really seems to prefer core 0 on your CPU. So whether you've got a 6-core or an 8-core or a 16-core part, a 2700X or a 2600 or the 6-core Coffee Lake or whatever CPU you have, if you can tell your games and other stuff not to use core 0, you will generally get a little bit better performance. Not always, and it's not true of every game, but there are definitely reports on the internet of people going into Task Manager and changing the CPU affinity. That's what it's called: when you run a process and say this process should only use this CPU core and that CPU core and that CPU core, that's the CPU affinity. When you configure your process to only use specific CPUs, those penalties from switching from a system process to a user process basically go away. There are always some penalties when switching process contexts like that; it's just that the recent fixes tell a lot of things to just flush the level 1 cache, and that's really bad. You should never flush anything at level 1; you just shouldn't do that. They're going to have to fix it in hardware.

But I think with the high-core-count CPUs we're going to see a lot of innovation in the scheduler from Microsoft and the Linux people and really everybody working on these mitigations. You could probably run the system processes on one group of cores and the user processes on another group of cores, and only when the system really gets overloaded do you start sharing user and system processes between cores. Because hey, with the 2990WX you've got 32 cores to work with, and really I'd want all my user processes running on the 16 that are directly connected to something, and my background processes running on the cores that don't have direct access to anything. The scheduler knows best; the scheduler is the thing that can see where everything is, and the user generally should not second-guess that. But because this is new hardware, it's probably going to take a year or two before everything really catches up and runs as optimally as possible.

So, using the info from the Linux lstopo command, we were able to set Grand Theft Auto to run only on the four cores, eight threads, that are on a different CCX from the one the system prefers, but that is physically connected to the graphics card. At least when running the benchmark at 1080p, that makes about a seven to eight percent difference in the maximum frame rates, which is pretty encouraging. Now, you don't have to use Process Lasso; you can totally use the PowerShell commands and set the processor affinity that way, just Get-Process by name and set the property. That's totally fine. But it's surprising that this makes as much of a difference as it does.

The really interesting thing is that even when the processor is in UMA mode, it makes a little bit of a difference, although not as much, and I think the reason is probably that Windows doesn't know where to allocate the memory. It's possible that we're still getting some of GTA 5's memory allocated on the opposite side of the chip, on the opposite memory controller, and maybe that's why there's a slowdown. But at least we know the I/O is pinned to the same cores, and that's true whether you've got NUMA or UMA enabled, and we know those cores are associated with the graphics card. That's why our GTA performance is still improved a little bit even in UMA mode. Just running this test is pretty interesting, I think.
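Since the core 0 trick keeps coming up, here's what "keep this game off core 0" looks like as a plain PowerShell sketch, assuming a 16-thread CPU where logical CPUs 0 and 1 are core 0's two SMT threads, and again assuming a process named GTA5:

    # Build a mask of all logical CPUs except 0 and 1 (core 0 plus its SMT
    # sibling) on a 16-thread CPU, then apply it to the game.
    $logicalCpus = 16
    $allCpus     = (1L -shl $logicalCpus) - 1      # 0xFFFF for 16 threads
    $noCore0     = $allCpus -band (-bnot 3L)       # clear bits 0 and 1 -> 0xFFFC

    (Get-Process -Name "GTA5").ProcessorAffinity = [IntPtr]$noCore0

Whether this helps depends on the game; as noted above, it's worth an experiment, not a guarantee.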
That UMA-mode result is something that maybe deserves a little more experimentation. GTA 5 is an old game, but I understand the engine, how it works, and how it breaks, so it's sort of useful for testing, I think. I bet you never thought there was a relationship between security mitigations and level 1 cache flushing, cache affinity and processor affinity, getting GTA 5 to run faster, and the topology of your system: which PCI Express peripheral is connected to which group of CPUs and which memory.

Oh, and by the way, in case you're wondering, I didn't tell you how to make sure that your process is running with memory local to the CPU it's executing on. Don't worry, that gets taken care of automatically. Windows doesn't do it in UMA mode, but in NUMA mode, and on Linux and so on, it definitely takes care of it. And in case you're wondering about numbers, the latencies we're talking about: the memory latency going from a local Zen core to DDR4 is about 60 to 70 nanoseconds, and the latency going from a far CPU core to memory through another node, like if I wanted to get from one of these cores down here to NUMA node P#2's 16 gigabytes of memory, is going to be more like 135 nanoseconds instead of 70. So we're still on the scale of nanoseconds, but there's a big difference between 70 and 160 nanoseconds. At the same time, the bandwidth is really high, so I can get 32 or 64 megs of data into that core really quickly and then the core can process it. As long as that core is not processing stuff faster than I can get data in and out of it, I'll benefit from the CPU performance; if the CPU is processing stuff more quickly than I can get data in and out of it, it'll bottleneck and we'll see those performance regressions.

So maybe with this information, those benchmarks you see around the internet will make more sense, and if you've got a system like this, you should experiment with telling your games not to run on core 0, just to see what happens. Maybe it'll help, maybe not; I'm not really sure. There's also a program for Windows called Process Lasso. I've been talking to the Process Lasso guys, and they're going to add support for making Process Lasso more NUMA-aware. Process Lasso automates a lot of stuff that you can otherwise do at the command line with the PowerShell commands; if you go to the Level1 forums, there are some example PowerShell commands that go with this video that you can use for setting your process affinity, so you don't have to buy Process Lasso. Process Lasso makes it a little easier to save and restore profiles, and it also prevents Windows from doing some obviously dumb stuff, so it's a pretty useful program. You can check it out; there's a trial for it. This is not sponsored or endorsed by Process Lasso; I just think Process Lasso on Windows, especially if you're working with a 2990WX, could be useful. And maybe AMD could build some of those features into the next version of Ryzen Master. Unfortunately, I don't think we're going to see a UMA mode for the 2990s, but for the 2950s, handling certain processes that are known to not perform as well in UMA mode versus NUMA mode, maybe that would be useful. I don't know.

If this is still clear as mud, come to the forum, because I've got to learn to make better videos. I'm Wendell, I'm signing out, and I'll see you in the Level1 forums. [Music]
Info
Channel: Level1Techs
Views: 64,964
Rating: 4.9735308 out of 5
Keywords: technology, science, design, ux, computers, hardware, software, programming, level1, l1, level one
Id: M-Q02b5uvfY
Length: 21min 37sec (1297 seconds)
Published: Thu Sep 06 2018