Elon Musk Just RELEASED Its AI Powerhouse: Tesla DOJO Supercomputer

Video Statistics and Information

Captions
Project Dojo will get more than one billion dollars in funding from Elon Musk's Tesla by the end of 2024 to create software for self-driving cars. The Dojo supercomputer will have the capacity to analyze enormous volumes of data, including videos from Tesla vehicles. In its most recent earnings call, Tesla added that it has begun producing the Dojo training computer. But how powerful is Tesla's Dojo supercomputer? Is it comparable to Google's supercomputers, and will it ever rank among the top five fastest supercomputers in the world? Let's find out in today's episode.

Hello everyone, welcome back to Elon Musk Evolution, where we bring you the most recent news about Elon Musk and his multi-billion-dollar companies, space news, and the latest science and technology. But before we begin, make sure you subscribe to our channel and click the bell icon so you don't miss any of our amazing videos.

During a conference call with analysts, Elon Musk announced the spending of over one billion dollars. Because customers use the camera-based driver-assistance software dubbed Autopilot and a related function known as the Full Self-Driving beta, which together have built up more than 300 million miles of driving, Tesla has access to a staggering amount of video, according to Musk. The company reported in its most recent earnings announcement that it had started producing its Dojo training computer. The announcement regarding the cost of Project Dojo alarmed the market and caused the price of Tesla shares to decline by four percent after the market closed. Zachary Kirkhorn, Chief Financial Officer at Tesla, later emphasized that the investment will be divided between R&D and capital expenses, adding that it is a component of the previously disclosed three-year spending plan. The Verge reported that Tesla already has access to one of the largest and most powerful supercomputers in the world, built on Nvidia GPU technology; the new Dojo supercomputer, by contrast, is being constructed with Tesla-designed chips. Elon Musk has also stated that if borrowing rates keep rising, Tesla will probably keep lowering the prices of its electric cars. In the second quarter, Tesla outperformed forecasts for profit and revenue because of its strategy of cutting prices to increase sales.

What is Dojo? Dojo is a supercomputer that Tesla designed to develop its machine-learning and deep-learning AI systems: it exists to train Tesla's neural-network algorithms and machine-learning models with greater efficiency and speed. Elon Musk asserts that Dojo will be the fastest training supercomputer for artificial intelligence ever created. Why? Let's find out.

Why was it made? We must first understand how AI functions in order to understand why Dojo was developed. In its simplest form, AI is a collection of statistical models and algorithms that have been trained on data to make decisions on their own. Data can be in any format, including video, pictures, text corpora, and plain old numbers. A lot of data must be used to train AI for it to be effective; in practice the relationship is close to linear, meaning the more data an AI system is trained on, the more capable it becomes. Training AI on data also takes a lot of processing power, and the more data you have, the more processing power you require. For this reason, AI developers employ devices known as GPUs, which are similar to regular CPUs but much faster at this kind of work. However, when your data grows to the point where training on a single machine's GPUs is no longer effective, it becomes big data, and to train AI on big data you use distributed computing, which basically involves splitting the processing load among several computers.

Tesla's self-driving AI uses neural networks, and they are trained on millions of videos to get better. Keep in mind that artificial intelligence improves with more data, and those millions of video files are big data by any definition, so GPUs on a single machine would be inefficient and distributed computing is the natural fit. However, you would need a very large number of computers to train AI on millions of video files that are updated every day, since distributed computing is essentially just sharing processing work among many machines. Tesla concluded that it would be more cost-effective and technologically efficient to build its own supercomputer, and Dojo was created as a result.
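To make the distributed-computing idea above concrete, here is a minimal data-parallel training sketch: one batch is split across several simulated workers, each computes a gradient on its own shard, and the gradients are averaged before the model is updated. The model (plain linear regression), the worker count, and all the numbers are illustrative stand-ins, not Tesla's actual pipeline.

```python
# Minimal data-parallel sketch: shard one batch across "workers",
# compute per-shard gradients, average them, update shared weights.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 8))       # one large batch of inputs
y = X @ rng.normal(size=8)           # synthetic regression targets
w = np.zeros(8)                      # shared model weights
num_workers = 4

for step in range(200):
    grads = []
    # Each worker sees only its shard of the batch.
    for X_shard, y_shard in zip(np.array_split(X, num_workers),
                                np.array_split(y, num_workers)):
        err = X_shard @ w - y_shard
        grads.append(2 * X_shard.T @ err / len(y_shard))
    # "All-reduce": average the per-worker gradients, then step.
    w -= 0.05 * np.mean(grads, axis=0)

print("final loss:", np.mean((X @ w - y) ** 2))
```

In a real cluster, the averaging step is a network all-reduce rather than a Python loop, and the cost of moving gradients between machines is exactly why interconnect bandwidth dominates the design of training machines like Dojo.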
How does it work? To understand precisely how the Dojo supercomputer will function, you would need to grasp Tesla's data pipeline, which is regrettably a trade secret. Based on the limited publicly accessible material, however, it appears to work something like this. When the self-driving AI makes a mistake and doesn't behave as it should, that mistake is noted and learned from: the car observes its surroundings through its integrated cameras and sends video back to Tesla's servers so the data can be used for further training. So the more people who use the self-driving AI, the better it gets. Say you own a Tesla and are driving it, but you're tired of holding the wheel, so you engage the self-driving feature and the drive goes smoothly. The remarkable part is that the data gathered while you use the self-driving feature is also used to train the AI of new Tesla vehicles now being manufactured; in other words, the AI keeps getting better, so when you buy a new Tesla its self-driving will be better than the last one's. Back to Dojo: in theory, when the video data from the car is delivered to Tesla's servers, the Dojo supercomputer obtains it and uses it to train the neural network behind the Tesla self-driving AI. Another notable detail is that the data is delivered in real time, meaning that while a Tesla's self-driving capability is in use, data is being transmitted and the Dojo supercomputer is working non-stop.

Now let's dive into the Dojo architecture. According to Talpes, the Dojo core contains an integer unit that incorporates some RISC-V architectural instructions, in addition to a large number of extra instructions that Tesla developed itself; the vector math unit was mostly implemented from the ground up. Talpes did not explain exactly what that entailed, but he did add that this custom instruction set was made to run machine-learning kernels, so we infer it wouldn't be particularly effective at running Crysis. The Dojo instruction set supports 64-bit scalar and 64-bit SIMD instructions, as well as primitives for dealing with data transfers from local to remote memories. It also supports the semaphore and barrier constraints required to bring memory operations into line with instructions running not only within a D1 core but also across collections of D1 cores. The core also performs stochastic rounding and can perform implicit 2D padding, which is frequently achieved by adding zeros to both sides of a piece of data to modify a tensor.
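Stochastic rounding deserves a quick illustration, since it is one of the few ML-specific operations named here. Rounding down or up with probability proportional to the distance keeps the rounding error unbiased on average, which matters when millions of tiny low-precision updates are accumulated. The toy below rounds to integers for simplicity; real hardware applies the same idea when converting to low-precision float formats.

```python
# Toy stochastic rounding: round down, or up with probability equal
# to the fractional part, so the expected value equals the input.
import numpy as np

def stochastic_round(x, rng):
    lo = np.floor(x)
    return lo + (rng.random(x.shape) < (x - lo))  # bool adds as 0/1

rng = np.random.default_rng(0)
x = np.full(100_000, 0.3)
print(stochastic_round(x, rng).mean())  # ~0.3: unbiased on average
print(np.round(x).mean())               # 0.0: nearest-rounding bias
```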
Those are the only ML-specific instructions currently etched into transistors, along with a set of shuffle, transpose, and convert instructions for operations that are commonly performed in software. Talpes did make it clear that the D1 processor, the first in what we presume will be a line of Dojo chips and systems, is a high-throughput, general-purpose CPU and not, in fact, an accelerator; or, to put it more precisely, Dojo is designed to accelerate itself rather than rely on an external device to do it. Each Dojo node contains a single core and functions as a complete computer, with a CPU, memory, and I/O ports. This is a significant distinction, since each core can function independently and is not reliant on shared caches, register files, or anything else. As a superscalar core, the D1 supports instruction-level parallelism, as do the majority of modern chips, and it even features a multi-threaded design to push additional instructions through the core. However, the Dojo software stack and applications manage how simultaneous multithreading distributes the chip's resources, because Dojo's SMT is more about doing more work per clock than about isolated threads that can each run a separate instance of Linux as a virtual machine. As a result, the Dojo implementation of SMT does not include virtual memory, and it has few protection mechanisms.

The 64-bit D1 core can process up to eight instructions in its 32B fetch window and features an 8-wide decoder that can handle two threads per cycle. This front end feeds into a 4-wide scalar scheduler with four-way SMT, two integer units, two address units, and a register file per thread. A 2-wide vector scheduler with four-way SMT is also available; it can issue to four 8x8x4 matrix-multiplication units or to a 64B-wide SIMD unit. 1.25 megabytes of SRAM serves as the primary memory for each D1 core; it is not a cache, and the DDR4 memory attached to the broader Dojo network is treated more like bulk storage than anything else. The chip provides explicit instructions to transfer data to or from the SRAM of other cores in the Dojo machine, at 400 gigabytes per second for loading and 270 gigabytes per second for storing. The SRAM has a list-parser engine that feeds the two decoders and a gather engine that feeds the vector register file, which together can exchange information with other nodes without the pile of extra operations that is typical of other CPU architectures. This list-parsing mechanism is one of the primary distinguishing characteristics of the Dojo chip design: in essence, it is a method for organizing various pieces of data so that they can be efficiently transported among the D1 cores in a system.

Numerous data types are supported by the D1 core. The vector unit and its related matrix units accept a broad range of data formats, with a mix of precisions and numerical ranges, and quite a few of them are dynamically composable by the Dojo compiler. The scalar unit supports integers of 8, 16, 32, or 64 bits. According to Talpes, the FP16 format recommended by the IEEE does not have enough range to cover all layers of processing in a neural network, while the bfloat16 format developed by the Google Brain team has a wider range but less precision. To support a wider mix of ranges and precisions, Tesla has developed a set of configurable 8-bit and 16-bit formats, in addition to a standard 8-bit FP8 format, for lower-precision, higher-throughput vector processing. These configurable formats allow the Dojo compiler to adjust the precision of the mantissas and exponents.
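The range-versus-precision trade-off behind those configurable formats is easy to demonstrate. The sketch below quantizes a value into a toy float format with an adjustable split between exponent bits (range) and mantissa bits (precision); it is purely illustrative and is not Tesla's actual CFP8/CFP16 encoding.

```python
# Toy configurable float: 1 sign bit, e_bits of exponent, m_bits of
# mantissa. More exponent bits buy range; more mantissa bits buy
# precision. Illustrative only -- not the real Dojo CFP encoding.
import math

def quantize(x, e_bits, m_bits):
    if x == 0.0:
        return 0.0
    bias = 2 ** (e_bits - 1) - 1
    sign = -1.0 if x < 0 else 1.0
    mant, exp = math.frexp(abs(x))     # abs(x) == mant * 2**exp, mant in [0.5, 1)
    mant, exp = mant * 2.0, exp - 1    # renormalize mant into [1, 2)
    exp = max(min(exp, bias), -bias)   # clamp to representable range
    mant = 1.0 + round((mant - 1.0) * 2 ** m_bits) / 2 ** m_bits
    return sign * mant * 2.0 ** exp

print(quantize(math.pi, e_bits=4, m_bits=3))   # FP8-like: 3.25 (coarse)
print(quantize(math.pi, e_bits=5, m_bits=10))  # FP16-like: 3.140625
print(quantize(1e6, e_bits=4, m_bits=3))       # out of range: clamps to 240.0
```

A compiler that can pick e_bits and m_bits per tensor, as described here, can spend its eight or sixteen bits on range for some layers and on precision for others.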
Up to 16 different vector formats can be utilized at once, although only one type of 64B packet can be used at a time. A network-on-chip router connects the various cores into a 2D mesh. Every clock cycle, the NoC can handle eight packets across the node boundary, 64B in each direction, which corresponds to one packet going in and one packet going out in each of the four directions to each core's nearest mesh node. The router can also perform one 64B read from, and one 64B write to, the local SRAM each cycle, which is how data moves between cores.

With all of the etching completed on the D1 core, which is produced by foundry partner Taiwan Semiconductor Manufacturing Company using a seven-nanometer process (as if there were another choice), Tesla pulls out the cookie cutter and simply begins duplicating D1 cores and connecting them together on the mesh. The D1 cores are placed into local blocks of 12, and a 2D array of 18 cores by 20 cores is generated, although for some reason only 354 of the D1 cores are usable. With 440 megabytes of SRAM distributed among those cores and a clock speed of 2 gigahertz, the D1 chip produces 376 teraflops at BF16 or CFP8 and 22 teraflops at FP32. The vector units do not support FP64, so the D1 device won't support many HPC workloads, and some AI applications that require 64-bit vector math won't run either. Tesla doesn't need to be concerned; it only needs to run its own AI applications. And after all of this work has been completed, adding FP64 capability in a D2 or D3 chip, to run HPC simulation and modeling workloads for Musk's enterprises as they create spaceships and automobiles, would be comparatively straightforward.

To connect to other D1 dies, the D1 die has 576 bidirectional SerDes channels wrapped around it, with eight terabytes per second of bandwidth available across all four of its edges. The dies, which have a surface area of 645 square millimeters, are designed to link via those SerDes, without any noticeable gaps, into what Tesla calls a Dojo training tile. The training tile gathers 25 known-good D1 dies and arranges them in a 5x5 interconnected arrangement. Half of the bisection bandwidth of the 2D meshes across the D1 chips inside the training tile, or 36 terabytes per second, is implemented on 40 I/O chips along the tile's outside edge. Eleven gigabytes of SRAM is distributed across the cores, and the tile has an on-tile bisection bandwidth of 10 terabytes per second. Each tile delivers nine petaflops of BF16/CFP8 performance. These Dojo training tiles consume 15 kilowatts apiece, are evidently going to be water-cooled, and are made to be connected to other training tiles. It is unclear exactly how this occurs, but it is obvious that you need a long row of interconnected tiles, oriented either horizontally or vertically, rather than separate racks with trays of a few devices, because those would require an enormous tangle of optical or electrical cables to carry the data between the tiles.
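A quick back-of-the-envelope check shows how these per-die figures roll up into the system totals quoted later, using only the numbers stated above (354 usable cores and roughly 376 BF16/CFP8 teraflops per die, 25 dies per tile) plus the six-tile V1 matrix and 120-tile ExaPOD described below.

```python
# Roll-up of the per-die figures quoted in the transcript.
cores_per_die = 354
tflops_per_die = 376e12          # BF16/CFP8
dies_per_tile = 25

cores_per_tile = cores_per_die * dies_per_tile
print(f"cores per tile:      {cores_per_tile:,}")                            # 8,850
print(f"petaflops per tile:  {tflops_per_die * dies_per_tile / 1e15:.1f}")   # ~9.4, quoted as 9
print(f"V1 cores (6 tiles):  {cores_per_tile * 6:,}")                        # 53,100
print(f"ExaPOD (120 tiles):  {cores_per_tile * 120:,}")                      # 1,062,000
```

The 53,100-core and 1,062,000-core totals quoted below fall straight out of this multiplication.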
On the borders of the D1 mesh you'll notice what are known as Dojo Interface Processors, or DIPs. These are connected to the D1 mesh as well as to host systems that provide the DIPs with power and carry out various system-administration tasks. Although there is a total of 11 gigabytes of private SRAM main memory in each training tile, the system requires a larger memory situated sufficiently close to the mesh. In this instance, Tesla has chosen to develop the DIP, a memory and I/O coprocessor that provides more direct hopping between tiles and cores than is possible through the enormous mesh, along with 32 gigabytes of shared HBM memory (we don't know what kind yet, but it is either HBM2e or HBM3) and Ethernet interfaces to the outside world. Ten of these DIPs are put on a pair of hosts, giving each set of three Dojo training tiles 320 gigabytes of HBM memory altogether; however, the text on the chart states that 160 gigabytes is allotted to each tile, which would translate to one host per tile rather than two hosts per three tiles. The DIP card offers 32 gigabytes of HBM memory with 800 gigabytes per second of bandwidth, from two I/O processors with two banks of HBM memory each; to us, that looks like somewhat underclocked HBM2e memory. To transmit the full DRAM bandwidth into the Dojo training tile, the card uses the Tesla Transport Protocol (TTP), a proprietary interface that sounds to us like CXL or OpenCAPI and is implemented over PCI-Express. The card's other end connects, via a 50-gigabyte-per-second TTP link running atop an Ethernet NIC, to either a single 400-gigabit-per-second port or a pair of 200-gigabit-per-second ports on stock Ethernet switches. The PCI-Express 4.0 x16 slot where the DIP is inserted offers each card 32 gigabytes per second of bandwidth; with five cards per tile edge, the host servers can receive 160 gigabytes per second of bandwidth while the cards feed 4.5 terabytes per second from the HBM2e memory into the tile. As we've already mentioned, the DIPs don't simply provide DRAM as local storage; they also offer a second networking dimension that can be used to avoid the 2D mesh when a route would otherwise require many hops. Chang estimates that it may take 30 hops to cross a 2D mesh network end to end, while only four hops are required when using the TTP-over-Ethernet protocol and a fat-tree Ethernet switch network. The bandwidth is obviously much smaller, but the latency is much lower in that third dimension, hence the Z-plane of networking.
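Chang's 30-hops-versus-4 comparison follows from simple geometry: worst-case hops on a 2D mesh grow with its side lengths, while a switched fat tree stays flat. Here is a rough sketch, with an illustrative mesh size (the precise end-to-end topology of Dojo isn't public):

```python
# Worst-case hop counts: 2D mesh (corner to corner) vs. fat tree.
def mesh_hops(width, height):
    # Manhattan distance between opposite corners of the mesh.
    return (width - 1) + (height - 1)

def fat_tree_hops(levels):
    # Up through each switch level, then back down.
    return 2 * levels

print(mesh_hops(16, 16))   # 30 hops across an illustrative 16x16 mesh
print(fat_tree_hops(2))    # 4 hops via a two-level switch fabric
```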
The foundational system Tesla is constructing is the Dojo V1 training matrix, which contains six training tiles, 20 DIPs spread among four host servers, and a number of adjunct servers connected to the Ethernet switch fabric. 53,100 D1 cores, one exaflops in BF16 and CFP8 formats, 1.3 terabytes of SRAM on the tiles, and 13 terabytes of HBM2e memory on the DIPs make up the basic Dojo V1 system. The complete Dojo ExaPOD system will feature 120 tiles in total, 1,062,000 usable D1 cores, and 20 exaflops of processing power. Who knows how much money was spent on the machine's development, manufacturing, and software stack. The Dojo chips are currently being tested in Tesla's lab after coming back from the foundry, so it is likely that Tesla will soon be able to construct the machine; it would not have brought the subject up otherwise. No time frame was provided for when the entire system will be operational. The brand-new Tesla Dojo supercomputer is impressive: it contributes to more effective AI training and, in turn, to the advancement of AI applications, which is excellent news for humanity.

And that ends today's episode. What do you think of the Dojo supercomputer? Let us know your thoughts in the comments section below. Please subscribe, and don't forget to like today's video. We'll see you in the next video. Thanks for watching!

[Music]
Info
Channel: Elon Musk Evolution
Views: 40,441
Keywords: Elon Musk, SpaceX, Tesla, SpaceX starship, Starship, Starship launch, SpaceX news today, Tesla news today, Tesla news, Elon Musk interview, SpaceX test, Tesla and SpaceX, Starship SpaceX, SpaceX today, Elon Musk news, Elon Musk today, elon musk news today, elon, musk, elon musk spacex news, elon musk tesla news, Elon Musk Just RELEASED Its AI Powerhouse: Tesla DOJO Supercomputer, tesla's supercomputer, dojo, dojo supercomputer, tesla's dojo supercomputer, tesla dojo, updates, 2023
Id: kMr1sSyfeNM
Length: 19min 2sec (1142 seconds)
Published: Tue Jul 25 2023