Deep-dive into the AI Hardware of ChatGPT

Video Statistics and Information

Captions
Everyone is busy talking about how amazing ChatGPT is, but all I want to do is take a look behind the curtain and figure out what kind of hardware it's running on. During my research for this video I came across some really interesting answers, and without spoiling anything, you will be surprised how old some of the hardware is that made ChatGPT possible in the first place. So without further ado, let's take a closer look at the hardware behind ChatGPT.

To start off this video, it's important to know that there are two very different phases during the development of a machine learning model like ChatGPT, and those two phases also have very different hardware requirements. First you have to train the neural network. The training phase is where the neural network is fed with huge amounts of data, which is then processed by billions of parameters. It's at this stage where the combination of hardware and software is forming the neural network. It's basically the birth of the AI. The hardware requirements during the training stage are massive, because the system has to handle insanely large amounts of data being run against billions of different parameters, and repeat that process over and over again.

What we are currently experiencing as ChatGPT is the so-called inference phase, where an already fully trained and working neural network applies its learned behavior to new data, like your inputs and questions. Running inference is, in general, less resource intensive, at least when it comes to raw compute power. Low latency and high throughput become much more important, because the AI is responding to many simultaneous requests, not unlike other web-based services. But even if running a single instance of inference is a lot less demanding than training the neural network in the first place, the sheer scale of providing inference to potentially millions of users at the same time can dramatically increase the hardware requirements. In a nutshell: training a neural network requires a huge amount of focused compute power, but once training is finished, that part is done. Using the neural network, which is called inference, has a much lower base hardware requirement, but deploying the AI to many users at the same time can greatly increase those requirements.

Now that you know the basic difference between AI training and inference, let's figure out what specific hardware was used to train the neural network of ChatGPT, before we then take a look at the systems running the inference process. Of course, Microsoft and OpenAI are trying to keep the exact hardware configuration a secret. There are only very general and open-ended answers, like that ChatGPT was trained on Microsoft Azure infrastructure and that many AI models these days are trained on Nvidia GPUs. But of course that won't deter us from finding out more! In May of 2020, almost three years ago, Microsoft announced a new supercomputer built exclusively for OpenAI to train GPT-3, which is a predecessor of the machine learning model used for ChatGPT. Microsoft wasn't very specific, but revealed that the supercomputer was using more than 285,000 CPU cores and over 10,000 GPUs. Microsoft also claimed that this supercomputer would place within the top 5 of the TOP500 supercomputer list, which at the time would have meant over 100 petaflops of peak performance.
And even though Microsoft tried to hide the specific hardware used in that supercomputer, they weren't very successful, at least when it comes to the GPUs. A scientific paper about large language models, published by OpenAI in July of 2020, reveals, and I quote: "All models were trained on V100 GPUs on part of a high bandwidth cluster provided by Microsoft." This information is all we need, because while 285,000 CPU cores are nothing to scoff at, when it comes to running specialized AI calculations they pale in comparison to 10,000 GPUs. The software used by OpenAI utilizes Nvidia's CUDA deep neural network library, and as such the training is only really happening on the GPUs; the CPUs are more of a supporting actor. Now that we have figured out that GPT-3, a precursor to ChatGPT, was trained on Nvidia V100 GPUs, let's see how fast they are, why OpenAI and Microsoft selected this specific hardware, and what it means for the training of ChatGPT.

If you're not an AI, you'll probably store your passwords in your brain. That's where today's sponsor NordPass comes in. I've always thought of myself as being good with passwords, but to be honest, with the ever increasing number of accounts and logins I'm confronted with, I started to use repetitive and low-quality passwords. Maybe you too have caught yourself using the same or slight variations of the same password before. I even had to request new passwords because I forgot what specific password I used for that one website. It got to the point where I was actually thinking about using those "login with Google or Facebook" shortcuts, even though I would never want to share my logins with companies who already know too much about me anyway. Using a unique and secure password is super important, especially today, where your personal data is at risk in so many different ways. All it takes is a single database breach or a website that didn't properly store and hash your password. NordPass provides a robust and secure solution. You know I like to dig a little bit deeper, and with NordPass all your passwords are encrypted locally using XChaCha20 encryption, which is a faster and more secure option compared to AES. Security isn't the only advantage of NordPass. Convenience isn't my motivation for using a password manager, but NordPass does it really well. Of course you get desktop clients for all operating systems, from Windows and macOS to Linux, and apps for Android and iOS. It's so well integrated, you almost forget you are now using 100% unique and secure passwords for every single one of your logins. NordPass is already very affordable, and with the code highyieldnordpass you get an exclusive two-year offer with an additional month for free on top. Plus there are options for businesses using a company domain. If you still approach your password security by hoping for the best, now is the right time to do something about it! For more information go to nordpass.com/highyieldnordpass or click the link below the video and use code highyieldnordpass.

With that, let's get back to Nvidia's V100 GPUs and why they were chosen by Microsoft and OpenAI. Nvidia's Volta architecture, that's what the V in V100 stands for, is a really interesting design for two distinct reasons: first, it introduced a major architectural change over all previous Nvidia GPUs, and second, from today's point of view it's actually really old hardware.
Nvidia's V100 GPUs are based on GV100, an 815 mm² silicon chip with 21.1 billion transistors, produced by TSMC on a 12 nanometer process. Compared to the previous Pascal generation flagship GP100, it doesn't look that impressive. Sure, it's faster, but the increase in FP32 performance isn't that large and comes at a huge cost in the form of a massively increased die size. But this comparison tells only half the story, because for the first time ever GV100 introduced Nvidia's brand new tensor cores, something that didn't exist on GP100. If you're a gamer, I'm sure you've heard of tensor cores before. They have been a part of Nvidia's GeForce GPUs since the release of Turing and are used to accelerate DLSS. On the surface, tensor cores are not that different from traditional GPU cores: they are specialized hardware that excels at matrix processing. To put it simply, they can run a lot of computations in parallel, but are limited to the basic multiply-accumulate calculations used in machine learning. Lots and lots of very simple calculations at the same time.

Volta was Nvidia's first GPU architecture specifically designed to accelerate AI workloads like training and inference. In the Volta architecture whitepaper, Nvidia claims up to six times faster AI inference and twelve times faster AI training. With just 640 new tensor cores, GV100 is able to output a massive 125 teraflops! I'm sure you now understand why training a large-scale model like GPT-3 wouldn't have been feasible before the introduction of Volta. This is the key to the whole story of ChatGPT. The version of Volta used in Microsoft's AI supercomputer back in 2020 was most likely part of Nvidia's Tesla product family, with up to 32 gigabytes of fast HBM2 memory. And with 10,000 GPUs at 125 FP16 tensor core teraflops each, the whole system would be capable of 1.25 million tensor teraflops, which is 1.25 exaflops, a figure we'll double-check in a moment. We are talking about literal exascale performance here, at least in terms of FP16 tensor core throughput.

But here's the kicker: Volta was released back in May of 2017, almost six years ago from today and already three years old in 2020. The reason Microsoft and OpenAI had to use Nvidia's Volta generation is pretty simple: the new Ampere generation launched around the same time OpenAI started to train GPT-3, it was just a little bit too late. Planning and building such a powerful supercomputer takes time, and waiting would have delayed the whole project. At the time Microsoft and OpenAI began planning the supercomputer, Volta was the only option, and not only that, I'm sure that without Volta this supercomputer would not have been built in the first place. And without it, no GPT-3 and probably no ChatGPT. You can only envision and plan such a complex neural network if you have the hardware to support your ideas.

Now that we know when and how GPT-3 was trained, in 2020 on up to 10,000 Nvidia V100 GPUs, we can use that knowledge to try and figure out what hardware was used to train ChatGPT. The relation of GPT-3 to ChatGPT is rather difficult to describe, because not all information is public. GPT-3 is a larger and more general model with a wider variety of use cases. GPT-3 can generate text from prompts, it can translate, summarize and classify text, and much more.
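As a quick aside: below is a minimal back-of-the-envelope sketch of that 1.25 exaflops figure, assuming Nvidia's quoted 125 TFLOPS peak FP16 tensor rate per V100 and the 10,000 GPU count from Microsoft's announcement. Sustained training throughput would sit well below this theoretical peak.

```python
# Theoretical peak FP16 tensor throughput of the 2020 Azure supercomputer.
# Assumptions: 125 TFLOPS peak per V100 (Nvidia's quoted spec) and 10,000 GPUs
# (from Microsoft's announcement); sustained training throughput is much lower.
v100_tensor_tflops = 125
gpu_count = 10_000

total_tflops = v100_tensor_tflops * gpu_count
print(f"{total_tflops:,} TFLOPS")           # 1,250,000 TFLOPS
print(f"{total_tflops / 1e3:,.0f} PFLOPS")  # 1,250 PFLOPS
print(f"{total_tflops / 1e6:.2f} EFLOPS")   # 1.25 EFLOPS
```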
Some of GPT-3's features are part of ChatGPT too, but GPT-3 in its original 2020 form wouldn't be a very good chatbot, for two reasons: first, it wasn't trained to give human-like answers in the form of a text-based chat conversation, and second, GPT-3 is a really large model and it would require a huge amount of compute performance even during inference. That's why ChatGPT was born. It's basically a much more specialized machine learning model, focused on natural text-based chat conversations and lower compute requirements. It's an evolution of GPT-3, but not really better or more capable in general, just more focused and streamlined. If you have already used ChatGPT, you know that it has been trained on data with a cutoff before the end of 2021; that's the reason it doesn't have up-to-date information on current events. Regarding the time frame of its training, OpenAI states that ChatGPT is fine-tuned from a model in the GPT-3.5 series which finished training in early 2022, and that ChatGPT and GPT-3.5 were trained on Azure AI supercomputing infrastructure. With this information we now have a time frame, early 2022, and the confirmation that it was again trained on hardware provided by Microsoft.

The question is: was it the same supercomputer which trained GPT-3? The answer is probably no, because as we discussed before, Nvidia's V100 GPUs were already quite old when they were used for GPT-3 in 2020. On June 1st, 2021, Microsoft officially announced the availability of Nvidia A100 GPU clusters to its Azure customers, which of course includes OpenAI, and since we know that ChatGPT was trained on Microsoft Azure infrastructure after the introduction of A100 GPUs, we can deduce that it was trained on this type of GPU. The only question left is: how many A100 GPUs were used? To try and answer that question, let's take a look at the specs of the A100 first. Nvidia's Ampere generation A100 is based on the GA100 chip, which packs 54.2 billion transistors into an 826 mm² die produced by TSMC on a 7 nanometer node. As before, the increase in traditional FP32 performance isn't large at all, but tensor performance gets another huge bump to over 300 teraflops, 2.5 times the performance of a single V100 GPU! And this speed-up is achieved with even fewer tensor cores, thanks to the introduction of redesigned 3rd gen tensor cores. By the way, if you're wondering, the second gen was on Turing, between Volta and Ampere, but Nvidia didn't release any comparable products on that architecture.

Now, of course, Microsoft could have replaced all 10,000 Volta GPUs in their supercomputer with 10,000 Ampere GPUs, but since ChatGPT is a more streamlined machine learning model, that wouldn't have been a very economic decision, especially with Ampere being so much faster. Luckily, we have more information about the combined efforts of Nvidia and Microsoft to create new AI supercomputer infrastructure. In October of 2021, before the training of ChatGPT started, Nvidia and Microsoft announced an AI supercomputer they used to train a new and extremely large neural network called Megatron-Turing NLG. And no, I'm not making this name up. Megatron-Turing NLG uses 530 billion parameters, a lot more than the 175 billion of GPT-3, and it was trained on 560 Nvidia DGX A100 servers, each containing 8 Ampere A100 GPUs. You don't need AI to figure out that's a total of 4,480 A100 GPUs.
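The same back-of-the-envelope check works for the Ampere cluster. The sketch below assumes the 560 DGX A100 servers with 8 GPUs each from the Megatron-Turing NLG announcement and a peak FP16 tensor rate of roughly 312 TFLOPS per A100 (the "over 300 teraflops" mentioned above); again, these are theoretical peaks.

```python
# GPU count and theoretical peak tensor throughput of the Megatron-Turing NLG
# cluster: 560 DGX A100 servers with 8 GPUs each (Nvidia/Microsoft announcement),
# and roughly 312 TFLOPS peak FP16 tensor throughput per A100.
servers = 560
gpus_per_server = 8
a100_tensor_tflops = 312

total_gpus = servers * gpus_per_server
total_pflops = total_gpus * a100_tensor_tflops / 1e3

print(total_gpus)                     # 4480 A100 GPUs
print(f"{total_pflops:,.0f} PFLOPS")  # ~1,398 PFLOPS, i.e. close to 1.4 exaflops
```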
The supercomputer was specifically built to train a 530 billion parameter neural network like Megatron-Turing NLG, so it's more than capable of handling the smaller ChatGPT model. Maybe it was even this exact same system that trained ChatGPT, and if not, I am confident the hardware used is very similar to the system used for Megatron-Turing NLG: multiple clusters of Nvidia DGX or HGX A100 servers. Before we talk about inference hardware, let's have a look at our findings so far: GPT-3 was trained on a Microsoft supercomputer with 10,000 Nvidia V100 GPUs and over 285,000 CPU cores. We don't know the exact number of CPU cores or the CPU model, but in 2020 it was most likely Intel Xeon. These specifications have been confirmed by Microsoft and OpenAI. For ChatGPT the results are a bit more of an educated guess, since official specs are not available. We know that Nvidia A100 GPUs were used, and at the time only AMD EPYC offered the new PCI-Express 4.0 standard, which is why Nvidia Ampere GPUs were paired with AMD CPUs. A single Nvidia DGX A100 server combines 8 A100 GPUs with two AMD EPYC 7742 server CPUs. If my deductions are correct and ChatGPT was trained on a similar system to Megatron-Turing NLG, that gives us the following hardware specs for the training of ChatGPT: 1,120 AMD EPYC CPUs with over 70,000 CPU cores and 4,480 Nvidia A100 GPUs. That's close to 1.4 exaflops of FP16 tensor core performance! As always with my videos, and especially parts where educated guessing is involved, I invite you to cross-check my assumptions. If you think I might be wrong or missed something, leave a comment down below.

Now let's talk about inference and the question of which hardware is currently powering ChatGPT. Based on statements by Microsoft and OpenAI, ChatGPT inference is again running on Microsoft Azure servers, and a single Nvidia DGX or HGX A100 instance should be plenty to run inference for ChatGPT, but of course not for all of the 100 million active users at the same time. There are clever ways to try and calculate how many systems are currently used for ChatGPT, created by people much smarter than me. SemiAnalysis came up with a very interesting model, and the result is that at the current scale it would require well over 3,500 Nvidia A100 servers with close to 30,000 A100 GPUs just to provide inference for ChatGPT. That's a massive amount, and a lot more than what was used for training, which is supposed to be the more demanding phase. But as we talked about at the beginning of the video, inference might be easy to run for a single instance, but if deployed at scale, the hardware requirements grow dramatically with the user base. One thing is clear: at the current level of demand for ChatGPT, just keeping the service running requires a massive amount of hardware and costs between 500,000 and 1 million dollars per day. It's not cheap. Right now the publicity definitely makes it worth it for Microsoft and OpenAI, but in the long run such a system most likely won't be able to stick to a free-to-use model, unless better and more efficient hardware reduces the cost of running inference at scale. And there's a lot of new hardware on the way! So far we've talked about Volta and Ampere, but Nvidia's Hopper generation has been shipping for a while now, providing another level up in AI performance. Compared to Hopper, Ampere and Volta almost seem tiny!
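Since the numbers above involve some educated guessing, here is a small sketch that reproduces them before we dig into Hopper. It assumes the published DGX A100 configuration of two 64-core EPYC 7742 CPUs and eight A100 GPUs per server, 560 servers for training, and the roughly 3,500 inference servers from the SemiAnalysis estimate; none of these are official OpenAI figures.

```python
# Cross-check of the training and inference estimates above.
# Assumptions: each DGX A100 pairs 2x 64-core EPYC 7742 CPUs with 8x A100 GPUs
# (Nvidia's published configuration); 560 servers for training (Megatron-Turing
# NLG scale); roughly 3,500 servers for ChatGPT inference (SemiAnalysis estimate).
cores_per_epyc = 64
cpus_per_server = 2
gpus_per_server = 8

training_servers = 560
print(training_servers * cpus_per_server)                   # 1,120 EPYC CPUs
print(training_servers * cpus_per_server * cores_per_epyc)  # 71,680 cores ("over 70,000")
print(training_servers * gpus_per_server)                   # 4,480 A100 GPUs

inference_servers = 3_500
print(inference_servers * gpus_per_server)                  # 28,000 A100 GPUs ("close to 30,000")
```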
GH100 has a massive 80 billion transistors on an 814 mm² die produced in TSMC's 4 nanometer node. This time even the speed-up in traditional FP32 performance is quite large, with over 3 times the flops of GA100. But the tensor core speed-up takes the cake, delivering 1,000 teraflops of tensor performance in a single Nvidia H100 GPU. And it doesn't stop there: Nvidia also introduced a new INT8 mode with 2,000 TOPS, specifically tailored for AI demands. The entire hardware industry is starting to shift its focus to architectures specifically designed to accelerate AI workloads. In 2017, Nvidia was entering uncharted territory with their Volta GPUs, while its largest competitor, AMD, was focused on releasing its first generation Ryzen CPUs and avoiding bankruptcy. In 2023, AMD has been completely transformed as a company, and the upcoming CDNA3-based MI300 GPUs will provide strong competition for Nvidia, especially when it comes to AI workloads. I will look at CDNA3 and MI300 in an upcoming video, so make sure you're subscribed if you're interested. And GPUs are not the only products focused on AI: more and more so-called neural processing units and AI engines are being developed. These are chips with only one focus: run AI training and inference as fast and as efficiently as possible. Jim Keller, a legendary semiconductor engineer who worked on the original AMD Athlon 64 and AMD's comeback Zen architecture, now works at Tenstorrent, a company focused on designing processors purely for AI and machine learning. It seems like hardware currently has only one single focus: AI, AI and even more AI!

While working on the script for this video, I watched a video from Tom Scott where he voiced his thoughts and opinions about AI. He was trying to compare it to the rise of the internet, but wasn't sure how far along the curve we are with machine learning and AI. Is ChatGPT only the beginning, his so-called Napster moment, or are we already close to the top of what's possible? And while I can't answer that question, looking at AI from a hardware point of view paints a pretty clear picture. Current high-end machine learning models like GPT-3 were trained on what I would call first gen AI hardware, and even the highly praised and disruptive ChatGPT only used Nvidia's second gen Ampere GPUs, a product released almost three years ago. What we are currently experiencing are AI models planned and trained on last generation hardware. Everyone talks about the future of AI, but I can't help but point out that we don't even have to wait for the future. Hopper released last year and has been shipping to customers in volume for a while now. Just imagine the kind of AI models you could train on a supercomputer with 10,000 Nvidia H100 GPUs, compared to the 10,000 V100 GPUs that trained GPT-3. It will take a while for the public to get access to these new AI models, but they are already possible today. With more money flowing into AI, the hardware to accelerate it will advance even more rapidly. We have seen how fierce competition between Intel and AMD affected the CPU market, and I'm sure Nvidia and AMD will fight just as hard over AI. AI progress is hardware bound, and the hardware is just getting started. In only a few years' time, training a model like ChatGPT will be part of your average machine learning course in college, and running inference will be done on a dedicated AI engine inside your smartphone.
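To put that generational comparison into numbers, here is one last sketch: the same 10,000 GPU thought experiment, using the peak tensor rates quoted in this video for V100, A100 and H100. These are theoretical peaks, not sustained training throughput.

```python
# The "10,000 GPUs" thought experiment across three generations, using the peak
# tensor rates quoted in this video (theoretical peaks, not sustained throughput).
peak_tensor_tflops = {
    "V100 (Volta, 2017)":  125,
    "A100 (Ampere, 2020)": 312,
    "H100 (Hopper, 2022)": 1_000,
}

for gpu, tflops in peak_tensor_tflops.items():
    print(f"10,000x {gpu}: {10_000 * tflops / 1e6:.2f} EFLOPS")
# V100: 1.25 EFLOPS, A100: 3.12 EFLOPS, H100: 10.00 EFLOPS
```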
There are basically no limits, which is amazing and scary at the same time. In my opinion, ChatGPT isn't a "Napster moment", it's more like pets.com. People use it and it's popular, but the real AI "Napster moment" will be something where we don't have to ask ourselves the question, it will be strikingly obvious for everyone to see. Thanks again to NordPass for sponsoring this video! I know, passwords are something we don't like to talk about, but using unique and secure passwords to protect your personal data has never been more important. Click the link below the video and use code highyieldnordpass. I hope you found this video interesting, and if you did, you know what to do. See you in the next one.
Info
Channel: High Yield
Views: 266,825
Keywords: openai, open ai, chatgpt, chat gpt, artificial intelligence, nvidia, v100, a100, h100, volta, ampere, hopper, tensor, nvidia tensor, nvidia ai, chatgpt hardware, future of ai, next chatgpt, gpt-3, megatron-turing nlg, microsoft ai, AI, A.I., Nvidia GPU
Id: 4q9-yf1eU8c
Length: 20min 15sec (1215 seconds)
Published: Mon Feb 20 2023