AI’s Hardware Problem

Captions
Since the deep learning explosion started in 2012, the industry's biggest models have grown hundreds of thousands of times. Today, OpenAI's DALL-E 2 has 3.5 billion parameters, Google's Imagen has 4.6 billion, and GPT-3 has 175 billion parameters. Increasingly larger models are ahead: Google recently pre-trained a model with 1 trillion parameters. These increasingly large models strain the ability of our hardware to accommodate them, and many of these limitations tie back to memory and how we use it. In this video, we're going to look at deep learning's memory wall problem and some of the memory-centric paradigms researchers are looking at to solve it.

But first, the Asianometry Patreon. Early access members get to watch new videos and select the references for them before their release to the public. It helps support the videos, and I appreciate every pledge. Thanks, and on with the show.

Virtually every modern computer runs what is called a von Neumann architecture, meaning that it stores both its instructions and its data in the same memory bank. At its heart are your processing units, a CPU or a GPU. Those processing units access memory to execute their instructions and process their data. The von Neumann architecture has been really good for us; it helped make software as powerful as it is today. But it works nothing like a real human brain. The brain's compute ability is relatively low precision, but it tightly integrates that compute with memory and input-output communication. Computers, on the other hand, run on high-precision 32- or 64-bit floating point arithmetic, for instance, but separate that compute from memory and communication. This separation has consequences, especially for memory.

The AI hardware industry is scaling up memory and processing unit performance as fast as it can. Nvidia's V100 GPU, released in 2017, had a 32 gigabyte offering. Today, the top-of-the-line Nvidia data center GPUs, the A100 and H100, sport 80 gigabytes of memory. Despite this bulking up, hardware performance is not keeping up with how fast these models are growing, especially when it comes to memory. Memory allocations for leading-edge models can easily exceed hundreds of gigabytes. Even with the latest parallelization techniques, a trillion-parameter model is estimated to require 320 A100 GPUs, each with 80 gigabytes of memory. These differences in processing and capacity mean that a processing unit wastes multiple processing cycles waiting not only for all the data to travel in and out of the memory, but also for the memory to perform its read-write operation. This limitation is known as the memory wall, or a memory capacity bottleneck.

Alright then, the obvious solution would be to add more memory to our GPUs, right? What is stopping us from doing that? There are practical and technology limits to how much extra memory you can add on, not to mention the issues of connections and wiring. Just think about how widening a highway does not do much to help with traffic. Additionally, there are very significant energy limitations associated with shuttling data between the chip and the memory. These electrical connections have losses, which cost energy. I mentioned this in previous videos: accessing memory off-chip uses 200 times more energy than a floating point operation. Eighty percent of the Google TPU's energy usage is from its electrical connections rather than its logic computational units. In some recent GPU and CPU systems, DRAM by itself accounts for 40 percent of the total system power. Energy makes up 40 percent of a data center's operating costs, so for this reason storage and memory have come to be a significant factor in the data center's ongoing profitability.
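To put those capacity and energy figures in perspective, here is a rough back-of-envelope sketch in Python. The bytes-per-parameter and per-operation energy values are assumptions chosen to line up with the figures quoted above (320 eighty-gigabyte A100s for a trillion parameters, and off-chip accesses costing roughly 200 times more energy than a floating point operation); they are illustrative, not measurements.

```python
# Rough back-of-envelope sketch of the memory wall numbers quoted above.
# The constants are illustrative assumptions, not measured values.

PARAMS = 1_000_000_000_000   # a trillion-parameter model
GPU_MEMORY_GB = 80           # A100 / H100 top-end memory capacity

# Training needs far more than the raw weights: gradients, optimizer
# state, and activations all live in GPU memory too. Roughly 25.6 bytes
# per parameter reproduces the "320 A100s" estimate mentioned above.
BYTES_PER_PARAM = 25.6       # assumed, not a measured figure

total_gb = PARAMS * BYTES_PER_PARAM / 1e9
gpus_needed = total_gb / GPU_MEMORY_GB
print(f"~{total_gb / 1e3:.0f} TB of memory, ~{gpus_needed:.0f} x 80 GB GPUs")

# Energy side of the wall: an off-chip memory access costs on the order
# of 200 times more energy than a floating point operation.
ENERGY_PER_FLOP_PJ = 1.0      # hypothetical ~1 picojoule per FLOP
ENERGY_PER_ACCESS_PJ = 200.0  # ~200x more for an off-chip access
flops_to_break_even = ENERGY_PER_ACCESS_PJ / ENERGY_PER_FLOP_PJ
print(f"each off-chip access should be amortized over ~{flops_to_break_even:.0f} FLOPs")
```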
In addition to the significant operating costs of the energy are the upfront capital costs of purchasing the AI hardware itself. As I mentioned earlier, a possible trillion-parameter model would need 320 A100 GPUs, each with 80 gigabytes of memory. A100s cost thirty-two thousand dollars each at MSRP, so that's a clean 10 million dollars. A hundred-trillion-parameter model might require over 6,000 such GPUs. That is just the cost of purchasing the hardware, and it does not even count the aforementioned energy costs of using these things to run inference on them, which is where ninety percent of a model's total costs are. It risks restricting the benefits of advanced AI only to the uber-rich tech giants or governments.

Much of the shortcomings are tied to historical and technological limits. In the 1960s and 70s, the industry adopted dynamic random access memory, DRAM, to form the basis of our computers. This adoption was largely made for technological and economic reasons: DRAM memories had relatively low latency and were cheap to manufacture in bulk. This worked fine for a while. As late as 1995, the memory industry was valued at 37 billion dollars, with microprocessors at 20 billion dollars. But after 1980, compute scaling far outpaced memory scaling. This is because, generally speaking, the CPU or GPU industries have had just one metric to optimize towards: transistor density. The memory industry, on the other hand, not only has to scale DRAM capacity but also bandwidth and latency at the same time. Something has to give, and usually that has been latency. Over the past 20 years, memory capacity has improved 128 times and bandwidth 20 times. Latency, however, has improved by just 30 percent.

Secondly, the memory industry realized that shrinking DRAM cells beyond a certain size gave you worse performance, less reliability, weaker security, worse latency, worse energy efficiency, and so on. Here's why. A DRAM cell stores one bit of data in the form of a charge within a capacitor, a capacitor being a device that stores electrical energy within a field. That bit is accessed using an access transistor. As you scale the cell down to nanoscale sizes, that capacitor and its access transistor get leakier and more vulnerable to outside electrical noise. It also opens up new security vulnerabilities.

These technical limitations and problems are fundamental to how the hardware works, which makes them extremely difficult to engineer our way around. The industry is going to grind out solutions, but those will be small. So the problem also opens the door to brand new, radical ideas that might give us a possible 10x improvement over the current paradigm. In a previous video, I talked about the silicon photonic AI accelerator, where we try to use light's properties to make data transfer more energy efficient. Alongside that, we have another idea: let's alleviate or even possibly eliminate the von Neumann bottleneck and memory wall problems by making the memory do the computations itself.

Compute-in-memory refers to a random access memory with processing elements integrated together. The two might be very near each other, or even be integrated onto the same die. I've seen the idea called other things throughout the years: processing-in-memory, computational RAM, near-data computing, memory-centric computing, in-memory computation, and so on. I'm aware that there are differences between these usages, but those differences are very subtle. I am generally going to stick to saying compute-in-memory. The name has also been used to refer to concepts expanding on the SRAM idea; SRAM is often used for the memory cache that sits on-chip with the CPU. But what we are referring to here is bringing processing and compute ability to the computer's main memory itself.

The idea is well suited for deep learning. If you recall, running a neural network model is about calculating these massive matrices. The Google TPU had lots of circuits for running multiply-and-accumulate, or MAC, operations. The actual arithmetic is relatively simple; the problem is that there is so much of it that needs to be done. So in an ideal case, a compute-in-memory chip can execute MAC operations right inside the memory chip. This is especially helpful for running inference on the edge, outside the data center. Those use cases have energy, size, or heat restrictions. Being able to cut up to 80 percent of a neural network's energy usage is a game changer.
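To show what those MAC operations actually look like, here is a minimal sketch of the arithmetic a single dense layer boils down to: a matrix-vector multiply built from multiply-and-accumulate steps. This is purely illustrative and not any particular chip's implementation; the point is that the math is trivial, while every weight has to come out of memory.

```python
# A neural-network layer is, at its core, a pile of multiply-and-accumulate
# (MAC) operations: out[i] = sum_j weights[i][j] * inputs[j].
# The arithmetic is simple; the cost is streaming every weight out of memory.

def dense_layer(weights, inputs):
    """Matrix-vector multiply written as explicit MAC steps."""
    outputs = []
    for row in weights:              # one output value per weight row
        acc = 0.0
        for w, x in zip(row, inputs):
            acc += w * x             # a single MAC operation
        outputs.append(acc)
    return outputs

# Toy example: 3 outputs, 4 inputs -> 12 MACs, and 12 weights fetched.
weights = [[0.1, 0.2, 0.3, 0.4],
           [0.5, 0.6, 0.7, 0.8],
           [0.9, 1.0, 1.1, 1.2]]
inputs = [1.0, 2.0, 3.0, 4.0]
print(dense_layer(weights, inputs))  # [3.0, 7.0, 11.0]

# A compute-in-memory chip would perform these MACs where the weight rows
# are stored, instead of moving every weight across the memory bus.
```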
The idea is decades old, dating back to the 1960s. Professor Harold Stone of Stanford University first notably explored the idea of logic-in-memory. Stone noted that the number of transistors in a microprocessor was growing very fast, but the processor's ability to communicate with its memory was limited by the number of pins. So he presented the idea of moving part of the computation into memory caches. The 1990s saw further explorations of the idea. In 1995, Terasys produced what we would probably call the first processor-in-memory chip. It was a standard 4-bit memory with an integrated single-bit logic unit. This arithmetic logic unit can bring in data, apply some simple logic to it as dictated by a program, and then write it back to memory. And then in 1997, various professors at UC Berkeley, including Professor David Patterson, the inventor of RISC, created the IRAM project, with the goal of putting a microprocessor and DRAM on the same chip. Several other such proposals followed throughout the 1990s, but none of these ever caught on, for a number of practical reasons.

First, memory and logic are hard to manufacture together. Their fabrication processes have sort of opposing goals. Again, logic transistors are all about speed and performance, but memory transistors have to be about high density, low cost, and low leakage all at once. It is hard to build logic circuits with a DRAM process, and vice versa. DRAM designs are very regular, with lots of parallel wires. Logic designs, on the other hand, have much more complexity. Circuit elements in a chip have connections referred to as metal layers. Having more metal layers allows for more complexity, but at the cost of current leakage and worse reliability. Contemporary DRAM processes use three to four metal layers; contemporary logic processes use anywhere from 7 to 12 and even more metal layers. It is estimated that if you tried to make logic circuits with a DRAM process, the logic circuits would be 80 percent bigger and perform 22 percent worse. And if you were to try to make DRAM cells with a logic process, then you are essentially making embedded DRAM, or eDRAM. The cells use significantly more power, take up to 10 times more space, and are less reliable.

The industry has since considered these manufacturing shortcomings and has come up with a variety of workarounds. Many compute-in-memory proposals operate at one of three levels: the device, the circuit, and the system. The device level leans on new types of memory hardware other than the conventional DRAM and SRAM memories. Notable examples include resistive random access memory, or ReRAM, and spin-transfer torque magnetoresistive random access memory, or STT-MRAM. ReRAM is one of the more promising emerging memory technologies. Like I mentioned earlier, conventional DRAM stores information using a charge held within a capacitor. ReRAM instead stores information by changing the electrical resistance of a certain material, switching it between a high and a low resistance state. This structure allows ReRAM to compute logical functions directly within the memory cells. Please don't ask me to try to explain it any further than that. ReRAM is probably the emerging memory technology that is closest to commercialization, due to it being compatible with silicon CMOS. However, there remain substantial hurdles to overcome before we see products arrive on shelves.

The circuit level is where we modify peripheral circuits to do the calculations right inside the SRAM or DRAM memory arrays themselves. The phrase I see a lot is in situ computing, in situ meaning locally or on site. These approaches are particularly clever, but they also require an intimate knowledge of how memory works and can still be difficult to understand. One prominent example of this is Ambit, an in-memory accelerator proposed by people from Microsoft, Nvidia, Intel, ETH Zurich, and Carnegie Mellon. A DRAM memory contains sub-arrays with many rows of DRAM cells. In normal use, the memory activates one row at a time. This system activates three rows at a time in order to implement an AND/OR logic function: two rows for the inputs and one for the output. The concept is logically attractive. You can utilize the memory's internal bandwidth to do all the calculations. However, there are significant concerns. Ambit can perform bitwise AND, OR, and NOT logic operations, but it takes multiple cycles. Furthermore, more complex logic implementations like XNOR remain challenging to implement.
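As a toy illustration of how that triple-row trick yields AND and OR, here is a small software model. It assumes the behaviour described in the Ambit paper: activating three rows at once drives each bitline to the majority value of the three cells, so presetting the third row to all zeros gives the AND of the two input rows, and presetting it to all ones gives the OR. This only simulates the logical behaviour with made-up row contents; the real hardware also needs row-copy and NOT support, which is part of why operations take multiple cycles.

```python
# Toy model of Ambit-style triple-row activation in DRAM.
# Activating three rows simultaneously settles each bitline (and the cells)
# at the majority value of the three bits. Fixing the third "control" row
# to all 0s yields AND of the two input rows; all 1s yields OR.

def triple_row_activate(row_a, row_b, row_c):
    """Bitwise majority of three rows, modelling simultaneous activation."""
    return [1 if a + b + c >= 2 else 0 for a, b, c in zip(row_a, row_b, row_c)]

row_a = [1, 0, 1, 1, 0, 0, 1, 0]   # made-up input row
row_b = [1, 1, 0, 1, 0, 1, 1, 0]   # made-up input row

and_result = triple_row_activate(row_a, row_b, [0] * 8)  # control row = 0s
or_result  = triple_row_activate(row_a, row_b, [1] * 8)  # control row = 1s

print("AND:", and_result)  # [1, 0, 0, 1, 0, 0, 1, 0]
print("OR: ", or_result)   # [1, 1, 1, 1, 0, 1, 1, 0]
```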
So far, the big downside with these two compute-in-memory approaches is that their performance still falls short of what can be achieved with current von Neumann GPU- and ASIC-centric systems. In other words, they suffer the same drawbacks people saw in the 1990s: putting memory and logic together still makes for a jack-of-all-trades, master-of-none situation. So the middle ground that the industry seems to be moving towards is implementing compute-in-memory at the system level. This is where we integrate together discrete processing units and memory at a very close level. It is enabled thanks to new packaging technologies like 2.5D or 3D memory die stacking, where we stack a bunch of DRAM memory dies on top of a CPU die. The memories are then connected to the CPU using thousands of channels called through-silicon vias, or TSVs. This gives you immense amounts of internal bandwidth. AMD is working on something kind of like this with what they call 3D V-Cache, which is based on a TSMC 3D stacking packaging technology. They used it to add more memory cache to their processor chips. There is a future where we can use similarly advanced packaging technologies to add hundreds of gigabytes of memory to an AI ASIC. This lets us integrate world-class memory and logic dies closer together than ever before, without needing to place them on the same die.
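For a sense of scale on why die stacking helps, here is a small arithmetic sketch. Every number in it (channel count, bus width, transfer rate) is a hypothetical placeholder, since no specific figures were given above; the point is simply that thousands of short TSV connections multiply out to far more aggregate bandwidth than one narrow off-package bus.

```python
# Hypothetical illustration of why stacking memory on logic with TSVs
# raises bandwidth. All numbers below are placeholders, not specifications.

def aggregate_bandwidth_gbs(channels, bits_per_channel, transfer_rate_gtps):
    """Aggregate bandwidth in GB/s = channels * width * rate / 8 bits."""
    return channels * bits_per_channel * transfer_rate_gtps / 8

# Conventional off-package DRAM: one narrow but fast channel.
off_package = aggregate_bandwidth_gbs(channels=1, bits_per_channel=64,
                                      transfer_rate_gtps=6.4)

# Stacked memory: thousands of slower TSV connections working in parallel.
stacked = aggregate_bandwidth_gbs(channels=4096, bits_per_channel=1,
                                  transfer_rate_gtps=2.0)

print(f"off-package bus: ~{off_package:.0f} GB/s")   # ~51 GB/s
print(f"TSV-stacked:     ~{stacked:.0f} GB/s")       # ~1024 GB/s
```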
Ideas in the laboratory are cheap. What is harder is to execute on those ideas in a way that performs well enough to replace what is already out there on the market, and the fact is that the current Nvidia A100 and H100 AI GPUs are very formidable competitors. But with leading-edge semiconductor technology slowing down the way it is, we need new ways to leapfrog towards more powerful and robust hardware for running AI.

Generally speaking, bigger models perform better. Today's best-performing natural language processing and computer vision models are great, but they still have a ways to go, which means they might have to get bigger in order to get better. But unless we develop new systems and hardware that can overcome these aforementioned limits, it seems that deep learning might fall short of fulfilling its great expectations.

Alright everyone, that's it for tonight. Thanks for watching. Subscribe to the channel, sign up for the newsletter, and I'll see you guys next time.
Info
Channel: Asianometry
Views: 494,877
Id: 5tmGKTNW8DQ
Length: 16min 46sec (1006 seconds)
Published: Mon Dec 05 2022