GTC 2017: NVIDIA Announces Tesla V100 (NVIDIA keynote part 6)

Captions
I would like to introduce you to something that has taken several thousand engineers several years to create. It is a masterpiece on many levels. It is the most complex project that has ever been undertaken, arguably the most expensive computer project the world has ever done. Ladies and gentlemen, the Tesla Volta V100.

This is made out of TSMC's 12-nanometer FinFET, and, I'm just getting a little exercise up here, 12-nanometer FinFET. The part that is really shocking is that this is at the reticle limit. Reticle limit basically means it is at the limit of photolithography, meaning you can't make a chip any bigger than this, because the transistors would fall on the ground. Every single transistor that is possible to make by today's physics was crammed into this processor: 21 billion transistors, and almost a hundred billion of these little connectors, a hundred billion vias, to make one chip work. Per 12-inch wafer, I would characterize the yield as unlikely, and so the fact that this is manufacturable at all is just an incredible feat. 800 square millimeters: if you have an Apple Watch on your wrist, the die size is approximately like that, so just take a look at your Apple Watch and it gives you a feeling for it. 5,000 processor cores in here, seven and a half teraflops of 64-bit floating point, 15 teraflops of 32-bit floating point, and a brand new type of processor, a brand new type of processor called the Tensor Core, which results in a hundred and twenty teraflops of tensor operations. A hundred and twenty teraflops, unbelievable. The R&D budget is approximately three billion dollars, and this is the first one, so if anyone would like to buy this, it's three billion dollars. I just keep it in my pocket.

The memory system in our architecture is quite unique. If you take a look at the way most processors are organized, the register files are very small, the caches are very big, and the delay is quite large. In our case, the register file is huge: twenty megabytes of register file, so that the memory is very, very close to the processors, and that's one of the reasons why the throughput is so high. Sixteen megabytes of cache, and we're utilizing the state of the art, the fastest memories that the world can make today. It's made by Samsung; our partnership with them is terrific. The two engineering teams have been working so closely together, pushing the limits, pushing the limits of how fast we can drive memories, and we've been able to achieve 900 gigabytes per second. It is just so fast. And then lastly, the second-generation NVLink gives us 300 gigabytes per second, basically approximately ten times the fastest PCI Express in the world today. Ladies and gentlemen, the Tesla V100.

Volta has a new instruction inside. It's called the Tensor Core. It's a new CUDA tensor operation that is both an instruction as well as a data format. It's a four-by-four matrix processing array, and it basically does one of the most important primitives of deep learning: A times B plus C on a matrix. A times B plus C on a matrix. So the input is a four-by-four matrix A, 16-bit floating point, times B, 16-bit floating point, plus C, and we're trying to do that as fast as possible. This is the way Pascal did it, and it did it incredibly fast at the time: every single row is multiplied by every single column, and then when you're done, it accumulates, adds it all the way vertically, into the green output results. And it does it incredibly fast because Pascal has thousands of processors; Pascal is doing this thousands of times at the same time, and that's the reason why Pascal was so fast.
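To make the Pascal-style approach described above concrete, here is a minimal CUDA sketch of a conventional half-precision matrix multiply-add: one thread computes one output element as a row-by-column dot product, with thousands of threads running at once and no tensor cores involved. The kernel name, dimensions, and layout conventions are illustrative assumptions, not from the keynote.

```cuda
#include <cuda_fp16.h>

// Illustrative only: one thread per output element of D = A*B + C.
// A (MxK) and B (KxN) are half precision; C and D (MxN) are float.
// This is the "every row times every column" pattern from the transcript,
// parallelized across thousands of threads.
__global__ void naive_gemm_fp16(const __half* A, const __half* B,
                                const float* C, float* D,
                                int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k) {
            // Multiply one row of A by one column of B, accumulating in FP32.
            acc += __half2float(A[row * K + k]) * __half2float(B[k * N + col]);
        }
        D[row * N + col] = acc + C[row * N + col];  // the "+ C" step
    }
}
```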
However, we felt that just wasn't fast enough. What we should do is do it in parallel, and in parallel, and this is what the Volta Tensor Core does: it literally does the four-by-four multiply plus C all at the same time, and it dumps it into the result. Twenty times increased throughput, really crazy stuff (a sketch of how this is reached from CUDA follows the transcript below).

The net result is that although Pascal, the P100, is the most advanced processor the world has ever built, one year later, one year later, Volta is one and a half times the floating-point performance in general-purpose computing, twelve times the tensor operations compared to Pascal for deep learning training, and six times for inferencing. I'm going to come back to inferencing in a little bit. For all of you who are not familiar with inferencing: training the network is the first step, very computationally intensive. The second step, also computationally intensive, not as intensive but computationally intensive, is inferencing: the production, the application of the network. Well, that's Volta; that's V100.
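For comparison with the Pascal-style kernel above, Volta exposes its tensor cores to CUDA through the warp-level WMMA API in `mma.h` (the public interface operates on 16x16x16 tiles, which the hardware decomposes into the four-by-four fused multiply-add operations the transcript describes). The following is a minimal sketch under stated assumptions; the kernel name, tile layouts, and launch conventions are illustrative, not from the keynote.

```cuda
#include <mma.h>
using namespace nvcuda;

// Illustrative only: one warp computes one 16x16 tile of D = A*B + C
// using tensor-core operations (wmma::mma_sync). Assumes M, N, K are
// multiples of 16, blockDim.x is a multiple of 32, the grid covers all
// (M/16) x (N/16) tiles, A and C/D are row-major, and B is col-major.
// Requires sm_70 or later (Volta).
__global__ void wmma_gemm_fp16(const half* A, const half* B,
                               const float* C, float* D,
                               int M, int N, int K) {
    // Each warp owns one 16x16 output tile.
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN = (blockIdx.y * blockDim.y + threadIdx.y);

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    // Start the accumulator from C, then fuse multiply-accumulates on top:
    // this is the A*B + C primitive described in the transcript.
    wmma::load_matrix_sync(acc, C + warpM * 16 * N + warpN * 16, N,
                           wmma::mem_row_major);

    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(aFrag, A + warpM * 16 * K + k, K);
        wmma::load_matrix_sync(bFrag, B + warpN * 16 * K + k, K);
        // The whole 16x16x16 tile multiply-accumulate happens at once,
        // on the tensor cores, instead of element by element.
        wmma::mma_sync(acc, aFrag, bFrag, acc);
    }

    wmma::store_matrix_sync(D + warpM * 16 * N + warpN * 16, acc, N,
                            wmma::mem_row_major);
}
```

The design contrast with the naive kernel is the unit of work: there, a thread produces one scalar via a serial dot product; here, a warp cooperatively produces an entire output tile per fused operation, which is where the quoted throughput multiple comes from.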
Info
Channel: NVIDIA
Views: 54,405
Rating: 4.882503 out of 5
Keywords: NVIDIA, GTC 2017, GPU Technology Conference, Jensen Huang, AI computing, artificial intelligence, keynote, Tesla V100, Volta GPU architecture, AI, HPC, high performance computing, Tensor Cores
Id: 3aAEKRDhrj8
Length: 6min 29sec (389 seconds)
Published: Thu May 11 2017