The FUTURE of Computing Performance

Captions
Often referred to as the father of supercomputing, Seymour Cray founded Cray Research in 1972 with the purpose of building the world's fastest supercomputers, a goal he would achieve in 1976 with the launch of the Cray-1. Seymour was one of the first computer designers to understand that for fast computation you need more than just a fast processor, and the Cray-1 reflected this by having fast I/O that could keep the processor fed, among other innovations that affected the whole system and not just the processor. On release, the Cray-1 beat every other supercomputer on the market by a wide margin. While there were doubts about what Cray Research would even do as a business when it was founded, the Cray-1 ended up selling 80 units worldwide at close to nine million dollars each, becoming the first commercially successful supercomputer.

Resistant to parallel computing for most of his career, Cray eventually conceded in the early 90s, with pressure from an increasing number of parallel systems coming to market in that period, and started work on a massively parallel supercomputer in 1996. He would never see this project finished, however: after his car was struck on a motorway, Seymour Cray tragically died in October of the same year. His legacy in supercomputing carries to this day, with many elements in our personal computers being derived from Cray's innovations. Last May, AMD announced that it is working with Cray to develop the world's fastest exascale supercomputer, Frontier, expected to be completed in 2021. If we want to understand what personal computers will look like in the next 10 years, there's no better place to look than at tomorrow's supercomputers.

This video is sponsored by PCBWay. If you are looking to make your own electronics product, PCBWay offers a custom prototyping service with high-quality advanced boards and a quick turnaround in PCB design and assembly. Get 10 free boards with your first order over at PCBWay.com; link in the description.

Innovations in computer design like refrigeration, pipelining, massive parallelism and many others, as well as new algorithms and software paradigms to take advantage of them, have originated in supercomputers. The belief is that the technology that you put into the fastest supercomputers will, over time, trickle down into the commercial world. Interestingly, innovation in this field has been progressing at a much faster pace than in any other computing field, but for this trend to continue, supercomputers will have to adapt to the slowing of Moore's law and also overcome some of the performance bottlenecks that are emerging.

If I asked you what the biggest bottleneck currently is when it comes to getting more computing performance out of these supercomputers, what would your answer be? Is it the clock frequency wall? Is it Amdahl's law holding us back from taking full advantage of multi-core? Or perhaps memory bandwidth is the greatest deterrent to achieving more performance? While all of these are indeed obstacles, they can each be addressed to some extent as long as you're willing to compromise in other areas. The biggest bottleneck, the one that will change computers dramatically in the coming years, is related to data locality and, fundamentally, energy. You see, with the end of Dennard scaling and the frequency ceiling that came with it, adding more cores has been one of the last low-hanging fruits when it comes to increasing performance, but that too will reach a limit, both a physical one and with Amdahl's law preventing us from making effective use of cores past a certain number of them.
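To make the Amdahl's law point concrete, here is a minimal sketch (my own illustration, not something from the video) of how the achievable speedup saturates as core counts grow, assuming a fixed serial fraction of the workload:

```python
# Minimal illustration of Amdahl's law (not from the video):
# speedup(N) = 1 / (serial_fraction + (1 - serial_fraction) / N)

def amdahl_speedup(n_cores: int, serial_fraction: float) -> float:
    """Upper bound on speedup for n_cores given the serial fraction of a workload."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

if __name__ == "__main__":
    # Even with only 5% serial work, the speedup saturates near 20x
    # no matter how many cores you throw at the problem.
    for n in (4, 16, 64, 256, 1024):
        print(f"{n:>5} cores -> {amdahl_speedup(n, 0.05):6.1f}x speedup")
```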
So the situation today is that scaling on all fronts is reaching a limit, and at the same time most current software is targeting the same machine abstractions that were introduced 40 years ago. If hardware scaling is no longer possible with traditional means and software doesn't reflect the emerging changes in extreme-scale systems, how can we continue getting more performance, especially at the expectation levels that we've gotten used to? What we need is a paradigm shift, both in hardware but also in software, and software developers will have to think 15 to 20 years ahead when designing for new abstractions, because it's way too expensive to modify code every time we change the hardware environment, and that will be happening a lot in the next five to ten years, as you are about to see.

Now, before we look at how the hardware and software will change, it's probably a good idea to distinguish between hyperscale and HPC, as these are the two segments that we will extrapolate from to understand how our desktop PCs will also be changing. Hyperscalers are companies like Google, Amazon, Facebook and Alibaba who run warehouse-sized heterogeneous computing systems, so that's servers with GPUs and FPGAs and other accelerators. HPC, or high-performance computing, is where we find supercomputers, and these are used in places like the Department of Energy or the Department of Defense, or in specialized industries that rely on scientific breakthroughs in fundamental physics. In HPC you will also find heterogeneous hardware environments, typically with CPUs and GPUs and increasingly with other accelerators as well.

The main difference is in precision. When Google is serving you emails or when Facebook is doing image recognition, the level of mathematical precision doesn't have to be that high; what matters is the speed at which they deliver data to you. But in HPC, precision is fundamental to achieving accurate scientific results. When the people at the Department of Energy, who commissioned AMD to build Frontier, go to Washington with climate change models, they need these to be as precise as possible, as they influence the policies that are put in place. At least in theory, anyway; we all know how little they actually influence policy, but that's beside the point. So when you hear about the compute capabilities of GPUs in spec sheets, that's what things like FP16 and FP32 are referring to: the mathematical precision at which that product can process data.
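As a rough illustration of that precision difference (my example, using NumPy, not anything from the video), here is the same accumulation done in half precision and in double precision; the low-precision result drifts far from the true value:

```python
# Rough illustration of why numerical precision matters (my example, not the video's).
# Summing many small values in FP16 loses accuracy quickly because FP16 keeps
# only about 3 decimal digits of precision; FP64 keeps the result essentially exact.
import numpy as np

values = np.full(100_000, 0.0001)

fp16_sum = np.float16(0.0)
for v in values.astype(np.float16):
    fp16_sum = np.float16(fp16_sum + v)     # accumulate in half precision

fp64_sum = values.astype(np.float64).sum()  # accumulate in double precision

print(f"FP16 accumulation: {float(fp16_sum):.4f}")   # drifts badly, far below 10
print(f"FP64 accumulation: {fp64_sum:.4f}")          # ~10.0000
```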
When we talk about an exascale supercomputer, we are talking about achieving one exaflop, that is, a billion billion floating-point operations per second. Another way of seeing it is a thousand times faster than the petascale supercomputers of 2008; exascale is also around the estimated peak processing level of the human brain. Under this definition, Intel's Aurora will also be an exaflop system, also co-developed with Cray and to be delivered around the same time as AMD's Frontier, although it's not expected to be as fast. China will likely be the first to have an exaflop-capable machine, probably this year, with Japan following shortly after with their ARM-based Fujitsu exascale machine in 2020. Now, this might look like a dick-measuring contest run on a national level by nerds, but the real reason why we need ever more powerful supercomputers is that science is not something that you can solve; rather, scientific models can always be refined, and scientific advancements that might one day save your life depend on having this ever-increasing computational power.

Another very important difference between HPC and hyperscalers is that HPC cannot function in the cloud: it needs to be a self-contained unit, usually taking up a warehouse or a whole building, while workloads that run on hyperscalers can be distributed across the cloud. It just so happens that even though the workloads and the environments for HPC and hyperscalers are different, as we've just seen, both have been asking hardware companies for the same types of advancements. This means that there's a concerted effort from all of these parties to advance technology in the same key areas, and many of these advancements will affect us PC users as well, as we will discuss in a second. It's kind of a perfect storm that makes the economic constraints much easier to overcome, because a lot more parties are pushing towards the same objectives. And remember, the Department of Energy doesn't just play around with a supercomputer for funsies: the idea behind these programs is to create algorithms that can lead to advanced models which will then be shared with United States industry, with companies like Boeing, ATK and Procter & Gamble, which ultimately helps competitiveness in America and creates more jobs.

Now, going back to that bottleneck related to energy. If you've been watching my videos for a while, you'll know that moving data around uses magnitudes more energy than actually computing that data. So when we talk about energy being a major bottleneck to improving performance, data movement is where this really comes into play, and fixing this bottleneck will involve both physics and architectural changes in processors in the next few years. Going forward you'll hear companies like Intel and AMD saying they are co-designing their hardware with application developers, but how are they going to overcome this energy barrier exactly? When you look at power density growth over time, you can clearly see that power wall being reached around 2005.

The best place to look for an answer to this bottleneck is a company that has been operating under heavy energy constraints for over a decade now: ARM. Because of the size of mobile devices and their passively cooled nature, ARM came up with a solution that I believe will make it into desktops in the somewhat near future, and that will probably be introduced in the custom silicon that AMD is making for the Frontier supercomputer. Here's the brilliance of ARM's low-power design principles. Let's say you have a chip that is 389 mm², running at 120 watts and at a frequency of 1900 MHz, and then we have another chip that is 143 mm², running at 15 watts and at a frequency of 1000 MHz. Because of the relation between voltage and frequency, the smaller chip can do four times more floating-point operations per watt than the massive one, even though the larger, more complex chip has more raw computational power. If we look at an even smaller chip, say 24 mm² running at 0.6 watts and 800 MHz, we get eighty times more flops per watt. And by the way, these are based on real products. If we go down to the size of a core inside a GPU, running at 0.09 watts and 600 MHz, we now get 400 times better performance per watt. You can probably see where this is going: as we simplify these chips we get a modest decrease in raw performance, yes, but at a massive reduction in area and an increase in performance per watt.
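The reason the smaller, lower-clocked chip wins on flops per watt comes down to the classic dynamic-power relation, P roughly proportional to C·V²·f, combined with the fact that supply voltage has to rise roughly with frequency. Here is a back-of-the-envelope sketch (my own simplification, with the scaling assumption stated in the comments, not figures from the video) of how performance per watt improves as the clock drops:

```python
# Back-of-the-envelope sketch of why low-clocked cores win on FLOPS/watt.
# Dynamic power is roughly P ~ C * V^2 * f, and V has to scale roughly with f,
# so power grows ~f^3 while throughput grows only ~f.  Perf/watt therefore
# improves roughly as 1/f^2 when you slow a core down.
# This is an illustrative simplification, not measured data.

def relative_perf_per_watt(freq_mhz: float, ref_freq_mhz: float = 1900.0) -> float:
    """Perf/watt relative to a reference clock, assuming V scales linearly with f."""
    return (ref_freq_mhz / freq_mhz) ** 2

for f in (1900, 1000, 800, 600):
    print(f"{f:>5} MHz -> ~{relative_perf_per_watt(f):4.1f}x the FLOPS/watt of the 1900 MHz chip")
```

Frequency alone accounts for roughly the 4x figure; the 80x and 400x numbers quoted above also come from stripping away heavyweight-core complexity such as deep out-of-order machinery and large caches, not just from running slower.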
Makes sense? So even if a simple small core is a quarter as computationally efficient as a complex large core, if you fit hundreds of these small cores on a single chip at the same price point, you end up a hundred times more power efficient at peak. ARM realized this and created the concept of the lightweight core. DARPA, back in 2009, came out with a report on what they called ubiquitous high performance computing, which I'll link to in the description, with a bunch of recommendations for continuing computer performance scaling. One of these was related to energy goals: they hoped to achieve 50 gigaflops per watt, which works out to only 20 picojoules per floating-point operation (one joule divided by 20 pJ is 5×10^10 operations, hence 50 gigaflops for every watt). If we chart this magical ideal number that DARPA came up with on a graph, you can clearly see why ARM's lightweight-core solution makes perfect sense going forward, for more than just mobile phones. The only solutions that have even come close to this ambitious recommendation by DARPA are GPU cores and lightweight cores. Heavyweight cores, like the ones in Zen 2 or the Skylake parts that Intel has been putting out for years, have stagnated and will barely change in the next five years, with only modest performance-per-watt improvements. You can see here that this top red line is what AMD would refer to as a new processor having 20% more performance at the same power, while this other red line is when they say a new processor has lower power at the same performance as the last generation. So the desktop chips we buy will barely change in the next five years as far as performance per watt is concerned, unless some drastic changes are made.

The first solution for our energy bottleneck, then, is that not only do we need more cores, we need to introduce lightweight cores into the mix. Going forward we need the lightweight and heavyweight cores combined into the same chips, in a similar fashion to what ARM has been doing on mobile phones with their big.LITTLE architecture. But then you might ask: if this is the case, why not just make every core small? Well, because not all workloads can be parallelized. For workloads with a lot of parallelism, a solution with lots of tiny cores makes a lot of sense, which is why GPUs have been so popular in the last few years for those workloads that require a lot of throughput. But if the workload doesn't have a high degree of parallelism, then latency-optimized cores are going to be more energy efficient; those are your heavyweight cores. And because workloads are increasingly varied, meaning some are highly parallel and others not so much, I believe HPC processors, and eventually desktop processors, will become hybrid, with both lots of small cores and a few heavyweight cores, again very similar to what ARM has been doing for mobile phones for years. And when we look at Intel's presentation slides for their 10-nanometer Lakefield chip, we see precisely this solution, with four small CPUs and one large, more traditional heavyweight CPU.

This would solve the problem of energy in computation, but not when it comes to data movement. Even though transistors become more energy efficient as they scale down (thanks to lower transistor capacitance), the copper wires that connect them don't get more efficient as feature sizes shrink. In other words, even when we go down to five nanometers or three nanometers, copper wires will have roughly the same efficiency as they do now.
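To put rough numbers on the claim that moving data costs far more than computing on it, here is a toy energy-budget estimate. The per-operation and per-byte energies are order-of-magnitude assumptions of the kind quoted in architecture talks, not figures from the video, and the function name is mine:

```python
# Toy energy budget for a kernel that reads data from DRAM, computes on it,
# and writes results back.  The energy constants are order-of-magnitude
# assumptions (picojoules), NOT measurements from the video.
PJ_PER_FLOP_FP64   = 20.0      # assumed cost of one double-precision operation
PJ_PER_BYTE_ONCHIP = 1.0       # assumed cost of moving a byte across the chip
PJ_PER_BYTE_DRAM   = 80.0      # assumed cost of fetching a byte from off-chip DRAM

def kernel_energy_nj(n_elements: int, flops_per_element: int) -> dict:
    """Rough energy split (in nanojoules) for streaming FP64 data through a kernel."""
    bytes_moved = n_elements * 8                       # 8 bytes per FP64 value
    compute  = n_elements * flops_per_element * PJ_PER_FLOP_FP64
    movement = bytes_moved * (PJ_PER_BYTE_DRAM + PJ_PER_BYTE_ONCHIP)
    return {"compute_nJ": compute / 1e3, "movement_nJ": movement / 1e3}

# A low-arithmetic-intensity kernel (2 flops per value) spends most of its
# energy just moving data, which is exactly the bottleneck described above.
print(kernel_energy_nj(n_elements=1_000_000, flops_per_element=2))
```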
So if moving data around costs so much energy, and wires are not keeping up with transistor efficiency, how do we solve this problem? Well, how about we don't move the data around so much? Sounds simple, right? To achieve this, I believe that the next generation of chips will feature a system of data locality management. In other words, data storage will need to have compute capability next to it, or even integrated into it, and in addition to that we will see another hardware processing unit whose job is to process metadata that will be included in operations. So when an operation is performed, this unit will check where the data is stored in the caches and will perform the operation on the right core, be it a large core or a small one. Sounds confusing? How about we put this future chip together and see how it would actually work.

On top of the interposer we have our traditional heavyweight cores, similar to what we have today on Zen 2 for instance, and these are latency optimized. Then we have our lightweight cores, and these will work similarly to how compute engines work on GPUs. We then have accelerators, which could be GPUs or FPGAs or fixed-function blocks for a specific workload, with a high-speed bus to access CPU memory. We have our I/O in the center, and then, taking a look at the side view, we would have 3D-stacked high-bandwidth memory, like HBM3 or a future type of similar memory. And because we also need high-capacity memory, not just high-bandwidth memory, we will have off-chip hybrid memory with both DRAM and non-volatile RAM.

Now, you might immediately spot a few problems. While on the heavyweight cores the cache can be handled much as it is today, the lightweight cores present a data locality problem. You see, the way software works today, you run a loop and it gets distributed across cores automatically, with the loop iterations divided evenly across the processors. This is already inefficient on traditional cores, but it becomes a much worse problem if you have a multitude of lightweight cores, because this core over here might get assigned computation on data that is stored over there, which means a ton of energy needs to be used to move that data around. To solve this problem we introduce a new model, a data-centric model. At the runtime level, a unit like the one I was describing earlier will keep track of metadata for each operation and assign instructions to cores where the data they need is located. So when an instruction requires data that is stored in this core's cache, this new scheduler unit assigns that core, or the nearest available core, to perform the operation, instead of distributing the operation blindly and then fetching the data from far away. This means we probably also need some sort of buffer to keep this location metadata in, so that a loop only runs where a metadata variable is recognized as being local. This obviously means that parallel loops will need to change in code to accommodate data locality, and that's where this metadata gets introduced; application developers will need to code with it in mind. Yes, it does require work on the part of developers, but once you do it for one loop you can apply the same logic to all of your loops, and even if systems change in the next 15 years, the programming models will have this inherent data locality built in.
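Here is a minimal sketch of that data-centric scheduling idea: a runtime keeps metadata about which core's local memory holds each data block and assigns every loop iteration to that core, instead of splitting iterations evenly and blindly. All of the names and the topology are hypothetical; this illustrates the concept rather than any vendor's actual scheduler:

```python
# Minimal sketch of a data-centric (locality-aware) loop scheduler.
# The runtime keeps metadata mapping each data block to the core whose local
# memory holds it, and dispatches each loop iteration to that core instead of
# round-robining iterations blindly.  Everything here is hypothetical.
from collections import defaultdict

class LocalityScheduler:
    def __init__(self, block_to_core: dict[int, int]):
        # metadata buffer: data block id -> core whose local memory holds it
        self.block_to_core = block_to_core

    def schedule(self, iterations: list[int], block_of) -> dict[int, list[int]]:
        """Assign each loop iteration to the core that already holds its data."""
        plan = defaultdict(list)
        for i in iterations:
            plan[self.block_to_core[block_of(i)]].append(i)
        return dict(plan)

# Example: 16 iterations over data striped in blocks of 4 across 4 lightweight cores.
metadata = {block: block % 4 for block in range(4)}          # block -> core
sched = LocalityScheduler(metadata)
plan = sched.schedule(list(range(16)), block_of=lambda i: i // 4)
for core, its in sorted(plan.items()):
    print(f"core {core}: iterations {its}")   # each core only touches local data
```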
With this approach, not only are we able to get a hundred times more floating-point operations per watt using lightweight cores, we also reduce data movement massively, which lets us break through the performance ceilings that we are reaching. In practical terms, we are looking at a performance speed-up of 80% over the current models and a reduction in energy usage to less than half of what is currently used; not to mention that by eliminating these energy-related bottlenecks we get headroom to continue hardware scaling for many more years.

Now, why go through all this trouble? Because scientific breakthroughs depend on it, and because for things like photorealistic-looking games to become a reality for you and me to enjoy, we need this paradigm shift in both hardware and software. And there's almost a self-fulfilling prophecy here, because these systems will let us create material-simulation algorithms that can help us find alternatives to silicon, so that in 15 years' time we can have materials ready to replace it and continue scaling performance even more.

Speaking of photorealistic graphics, at this point you might be wondering how discrete GPUs are going to change in the coming years. Will they still be relevant once we add lightweight cores to these chips? Because we live in an age where the application environment changes almost every month, in hyperscale especially, it's virtually impossible to predict which hardware solutions will win out in the end. Jensen Huang will tell you that it's GPUs, and he might be right. If you've been watching my videos for a while, you know that I'm skeptical of this; I hold the view that GPUs are a transitionary tool until this model with lightweight cores materializes and until the software environment enters a stasis. But I could very well be wrong, and maybe GPUs are indeed the future. There's one company that, surprisingly, might give us an answer to this GPU question. While Apple's hardware has become an absolute joke, for the most part the silicon is some of the best in the industry at what it does. When we look at the Apple A4 from 2010, it had fewer than 10 accelerators; the A8 in 2014 had almost 30 accelerators; and the A12, launched last year, has over 40 specialized accelerators on chip. It seems to me that there will continue to be a disaggregation of hardware blocks, but they will all be on the same chip, which means discrete GPUs will likely become a thing of the past. Again, no one can know for sure at this point, but that's where things seem to be headed. So it seems that, for the near future at least, it's possible that we see both types of specialization happening: a broader, more generalized specialization using GPUs, and a narrower specialization with accelerators that target just one application. But going forward, these specialized blocks will increasingly be integrated into chips. So there's this interesting trend where CPUs are becoming more specialized and GPUs are becoming more general-purpose.

In a recent talk on deep learning, NVIDIA's chief scientist Bill Dally showed a really interesting prototype GPU that will be coming to the market in the next few months. Remember earlier when I said that to better manage energy we would need to move to lots of small cores rather than big monolithic chips? Well, the same seems to be happening with GPUs, at least for deep learning applications. Previously, accelerators fabricated on a single monolithic die were limited to specific network sizes; with this multi-chip approach, NVIDIA can scale the hardware to meet the specific demands of the deep neural network. Built on 16 nanometers, these dies use only 0.11 picojoules per operation and can be scaled up to 36 chips. Inside each of these chiplets are 16 processing units, and interestingly, a RISC-V IP block is also used here; my guess is that this RISC-V block is used to evaluate weights and distribute them efficiently across the cores. Remember when I said earlier that the software paradigm needs to change to match new hardware abstractions? That's exactly what we're seeing here: the activations sent to the cores (and presumably this scales across all of the chips as well) stay local to where the weights are. So, just like in the model I was suggesting earlier, with this MCM GPU we're seeing metadata being used to distribute the work in such a way that data doesn't have to move around as much, and therefore less energy is used.
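To illustrate the "weights stay put, activations travel" dataflow being described, here is a toy weight-stationary matrix-vector multiply: the weight matrix is partitioned across chiplets once, and per inference only the much smaller activation vector and partial results move. This is my own illustration of the general idea, not NVIDIA's actual implementation:

```python
# Toy weight-stationary dataflow across chiplets (my illustration, not NVIDIA's design).
# Each chiplet permanently holds a slice of the weight matrix; per inference,
# only the small activation vector is broadcast and partial results are gathered,
# so the large weight data never moves.
import numpy as np

class Chiplet:
    def __init__(self, weight_slice: np.ndarray):
        self.weights = weight_slice              # stays resident on this chiplet

    def compute(self, activations: np.ndarray) -> np.ndarray:
        return self.weights @ activations        # local matrix-vector product

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))                # full weight matrix
x = rng.standard_normal(32)                      # activation vector

# Partition the weight rows across 4 chiplets once, ahead of time.
chiplets = [Chiplet(slice_) for slice_ in np.array_split(W, 4, axis=0)]

# Per inference: broadcast x (32 floats), gather 4 partial outputs (16 floats each).
y = np.concatenate([c.compute(x) for c in chiplets])
assert np.allclose(y, W @ x)                     # same result as the monolithic product
```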
If you've ever wondered why it seems like NVIDIA is a step ahead of everyone when it comes to performance per watt, this is why: they are already implementing new hardware abstractions that are data-centric, and they are very aggressive in energy management, because the smart people at NVIDIA have long realized that energy is indeed the bottleneck to performance increases. The AMD Radeon group should really pay attention to this MCM GPU, because it won't be long until NVIDIA uses similar principles in their gaming GPUs. Now, as you can probably imagine, this MCM layout itself is a ways away from coming to the desktop as a gaming GPU, not so much because of hardware constraints, but because the software environment would have to change significantly to make use of a system like this, unless Intel, NVIDIA or AMD figure out how to distribute loops in a similar way to what I suggested earlier, using a processing unit to schedule parallel loops at runtime across multiple chips. Can this be done? Well, yes, but probably not cheaply enough for the consumer market. I guess one possibility would be to have one traditional GPU die and then several small chiplets alongside it to do parallel computation. Again, I'm not sure the economics here make sense for gaming GPUs, but for other workloads besides gaming this is exactly where things are headed, and you can expect Intel's upcoming GPUs to look something like this as well, at least the ones that are going to be sold to hyperscalers and HPC.

If you look even further into the future, we are probably going to see accelerators for memory and hybrid memory systems, new types of transistors, and some pretty insane changes to some of the other things that are holding us back. I suspect that even the file system will eventually go away and be replaced by a neuro-inspired model where data would be managed in a similar fashion to how our brains work, but these are all topics for another day.

As a parting note, I was talking recently with someone from ARM, and it seems that they have a team of engineers optimizing popular game engines for some of their upcoming hardware, and just last week a job posting popped up online looking for a game engine tech lead to work on optimizing Unity and Unreal Engine for ARM hardware. Considering ARM's ambitious plans for the laptop and desktop market that I've been hearing about, it looks like there will be a discrete GPU coming to the gaming market next year, which means we'll go from two gaming GPU makers to four next year. Sounds good to me. I'll be taking a look at ARM in more detail in an upcoming video, so be sure to subscribe so you don't miss that.

This video is made possible by my awesome patrons. Join them for just $1 per month and get exclusive access to the Cortex Discord server, where you can talk to me directly and discuss these topics with like-minded enthusiasts. If you can't contribute financially at this time, please share this video with friends and on social media; that really helps. Thanks for watching, and until the next one.
Info
Channel: Coreteks
Views: 197,475
Keywords: cray, amd, supercomputers, arm, nvidia, intel, apple, frontier, fujitsu
Id: 3PjNgRWmv90
Length: 26min 59sec (1619 seconds)
Published: Mon Aug 05 2019