Networking for GenAI Training and Inference Clusters | Jongsoo Park & Petr Lapukhov

Captions
Hey, I'm Jongsoo from Meta, and Petr and I are going to talk about networking for GenAI training and inference clusters. Generative AI is one of the hottest topics these days, and it is about creating and generating new, realistic content. Before generative models became popular, AI models were mostly used to understand existing information, for example image classification and segmentation. Generative AI is about generating new content versus understanding existing content, and that is the main difference. It opens up huge new opportunities and applications, for example image and video generation and also text generation. Generative AI goes back to 2015, when Geoff Hinton's lab at the University of Toronto showed generating an image of a bowl of bananas on a table, and you can notice how low resolution it is. In the next few years we saw a lot of breakthroughs, for example DALL-E and Stable Diffusion for image generation and GPT for text generation. One of the important enabling technologies from 2015 until now is the huge amount of compute capability available, and the network technologies that connect many accelerators have played a very important role. Meta has contributed to this field significantly. For example, in work from this year, given the prompt of a small cactus wearing sunglasses in the Sahara desert, we get a very convincing, photorealistic image compared to the images shown on the previous slide. And of course there are large language models from Meta, like Llama, and we can build a chatbot on top of these models for LLM-based knowledge discovery. LLMs are usually the ones pushing the limits of our infrastructure, so in this talk we are going to focus more on LLMs, and specifically on how large language models affect system design, especially the network subsystem.

Recommendation models have been the primary AI workload in Meta's data centers, but large language models have very different characteristics. First, large language model training and inference require much more compute. Because of this, especially for training, we need a huge number of accelerators to finish in a reasonable amount of time, and this creates very interesting problems for the network subsystem. Interestingly, even LLM inference by itself has very diverse characteristics: it consists of two stages called prefill and decode, and decode has a very tight latency requirement. Let's go into the details.

Starting with the compute demand: this table compares how much compute we need for LLMs versus recommendation models, and LLMs require multiple orders of magnitude more. For LLM training, each sentence takes roughly a petaflop of compute, and we need to train on hundreds of billions of sentences. The size of the models and the amount of data we feed them keep increasing, and this is why we need tens of thousands of GPUs for large language model training. LLM inference also requires a huge amount of compute to provide a reasonable user experience: within our latency targets we need a few petaflops of compute, and that cannot be satisfied by just eight GPUs in one host. This is why we need distributed inference, so clusters of GPUs are no longer needed only for training; we also need them for inference, and that is another interesting problem for the network subsystem.
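To make these compute figures concrete, here is a rough back-of-envelope sketch using the common approximation of about 6 FLOPs per parameter per token for dense transformer training; the model size, token count, sentence length, and sustained-throughput numbers are illustrative assumptions rather than exact production figures.

```python
# Rough back-of-envelope for LLM training compute.
# Common approximation for dense transformers: ~6 FLOPs per parameter per token.
# All concrete numbers below are illustrative assumptions, not exact production figures.

params = 70e9        # assumed 70B-parameter model
tokens = 2e12        # assumed 2T training tokens
train_flops = 6 * params * tokens                 # ~8.4e23 FLOPs (~840 zettaFLOPs)

# Per-sentence cost, assuming ~200 tokens per sentence.
per_sentence = 6 * params * 200                   # a sizable fraction of a petaFLOP per sentence

# GPU-hours at an assumed ~150 TFLOP/s sustained per A100-class GPU.
gpu_hours = train_flops / 150e12 / 3600           # ~1.6M GPU-hours
days_on_2k = gpu_hours / 2_000 / 24               # ~1 month on 2,000 GPUs

# At ~30 exaFLOP/s of effective cluster throughput (the vision mentioned in the talk):
hours_at_30EF = train_flops / 30e18 / 3600        # well under one day

print(f"total:      {train_flops:.2e} FLOPs")
print(f"per sent.:  {per_sentence/1e15:.2f} PFLOPs")
print(f"GPU-hours:  {gpu_hours/1e6:.2f}M")
print(f"2,000 GPUs: ~{days_on_2k:.0f} days")
print(f"30 EF/s:    ~{hours_at_30EF:.0f} hours")
```

Under these assumptions the totals land in the same ballpark as the figures quoted in the talk: on the order of a million A100-hours, roughly a month on 2,000 GPUs, and under a day at around 30 exaFLOPs of sustained cluster throughput.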
For more concrete examples, these are recent large language models trained at Meta. The latest, Llama 2 with 70 billion parameters, was trained on 2 trillion tokens, which takes roughly 800 zettaFLOPs of compute to finish. That translates into about 1.7 million GPU-hours on NVIDIA A100 GPUs, or more than one month even if we use 2,000 A100 GPUs; it is a huge amount of compute. These foundational LLM trainings were done on the Research SuperCluster, but I'd like to highlight that one of the latest models, the 34-billion-parameter Llama 2, was trained in a production cluster using a RoCEv2 network fabric, and we were able to achieve similar speed and scalability compared to InfiniBand. To the best of our knowledge this is one of the largest production use cases of RoCEv2, and we hope it helps democratize LLM training on more commodity network hardware. Adi will present more details about this, so if you are interested you can watch his talk at this same event.

The model complexity and the amount of data we feed into these models have been increasing exponentially, and we don't expect that trend to stop anytime soon. This is why we need a lot of GPUs. We are using about 2,000 GPUs these days, but we don't think that will be enough going forward, so we are planning for 32,000 GPUs and even beyond. Our vision is achieving more than 30 exaFLOPs, which corresponds to about one third of the theoretical peak compute capability provided by 32,000 GPUs. That would let us train a Llama model in less than one day instead of roughly one month, enabling much faster iteration and much more complex models trained on more data.

One of the challenges in training these large models on a huge number of accelerators is that the simple parallelization scheme is running out of steam. The most common way of parallelizing today is data parallelism, which parallelizes across the inputs, but by itself that is not enough anymore, so we also need other schemes like model parallelism and pipeline parallelism. Basically we need to slice the model along multiple dimensions, and combining multiple forms of parallelism generates diverse communication patterns, which is another very interesting problem for the network.
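As a minimal sketch of what slicing the model along multiple dimensions can look like, the snippet below lays out a hypothetical GPU count as a 3D grid of tensor, pipeline, and data parallelism and notes the communication pattern each dimension generates; the group sizes are made-up examples, not an actual job configuration.

```python
# Hypothetical 3D parallelism layout: every GPU gets a (tensor, pipeline, data) coordinate.
# Each dimension forms its own communication groups with its own collective pattern.

TOTAL_GPUS = 32_768            # assumed cluster size
TP = 8                         # tensor (model) parallel width, typically inside the scale-up domain
PP = 16                        # pipeline parallel depth, point-to-point activation transfers
DP = TOTAL_GPUS // (TP * PP)   # data parallel replicas, gradient all-reduce / reduce-scatter

assert TP * PP * DP == TOTAL_GPUS

print(f"tensor-parallel groups:   {TOTAL_GPUS // TP:>5} groups of {TP:>4} "
      f"(all-reduce / all-gather per layer, latency- and bandwidth-critical)")
print(f"pipeline-parallel groups: {TOTAL_GPUS // PP:>5} groups of {PP:>4} "
      f"(point-to-point activation send/recv between stages)")
print(f"data-parallel groups:     {TOTAL_GPUS // DP:>5} groups of {DP:>4} "
      f"(gradient all-reduce / reduce-scatter, usually overlapped with compute)")
```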
LLM inference is also a very interesting problem for system design. For a good user experience we typically care about two latency metrics. The first is time to first token: we don't want users to wait too long before they start seeing the first response, and typically we want that to be under one second. The second is time per incremental token: once we start generating tokens, we don't want them to come too slowly, and we typically want that to be under 50 milliseconds, so users see a new token roughly every 50 milliseconds.

Looking at more detail, LLM inference consists of two stages, prefill and decode. Prefill determines the time to first token, decode determines the time per incremental token, and interestingly they have distinctly different system demands. Prefill is about understanding the user prompt: we can work on all the tokens of the prompt in parallel, which is why it is very compute intensive. The decode stage, on the other hand, has to read a huge amount of data while generating output tokens one by one, which is why it is very memory intensive. So one stage is compute bound and the other is memory bound, and the inference system needs to provide both very high compute throughput and very high memory bandwidth. That is why it is hard to contain LLM inference within one host, typically with eight GPUs, and going forward we expect to need distributed inference for LLMs; in other words, we need a small cluster for inference.
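To see why decode is memory bound and spills beyond a single GPU, here is a rough sketch: every generated token has to stream the full set of weights (plus KV cache, ignored here) from memory, so dividing the weight bytes by the per-token latency budget gives a lower bound on aggregate memory bandwidth. The model size, precision, and per-GPU hardware figures are assumptions for illustration.

```python
import math

# Why decode is memory-bound: every generated token re-reads all model weights.
# Illustrative assumptions: 70B parameters, 16-bit weights, 50 ms/token budget,
# 80 GB of HBM and ~3 TB/s of memory bandwidth per GPU.

params = 70e9
bytes_per_param = 2
weight_bytes = params * bytes_per_param            # ~140 GB of weights

token_budget_s = 0.050                             # time-per-incremental-token target
required_bw = weight_bytes / token_budget_s        # ~2.8 TB/s just for weights (KV cache extra)

hbm_per_gpu = 80e9
bw_per_gpu = 3e12

min_gpus_capacity = math.ceil(weight_bytes / hbm_per_gpu)   # fit the weights at all
min_gpus_bandwidth = math.ceil(required_bw / bw_per_gpu)    # stream them fast enough

print(f"weights:            {weight_bytes/1e9:.0f} GB")
print(f"bandwidth needed:   {required_bw/1e12:.1f} TB/s for a {token_budget_s*1000:.0f} ms token")
print(f"GPUs for capacity:  >= {min_gpus_capacity}")
print(f"GPUs for bandwidth: >= {min_gpus_bandwidth}")
```

With larger models, long contexts, and batching, both the capacity and bandwidth bounds grow quickly, which is what pushes inference onto eight, sixteen, or more GPUs.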
To recap the first part of the talk: LLMs require orders of magnitude more compute than recommendation models, and training in particular requires tens of thousands of accelerators to finish in a reasonable amount of time. Because of that we need to use multiple types of parallelization, which generates diverse communication patterns, and that is a very interesting problem for network design. Inference also needs a small cluster, so inference also becomes a network problem. Now Petr is going to go more in depth on the system design for LLM training and inference.

Thanks, Jongsoo, for the excellent presentation. My name is Petr, I'm a network engineer, and in my section we are going to dive deeply into the effect that large language models, and GenAI in general, have on networking topologies and the other parameters of our fabrics. As we covered briefly in the previous section, the biggest change in going from ranking models to LLMs was the increase in compute capacity requirements, which means we now need to build much larger clusters to support training these models. A big cluster naturally separates into two domains, scale-out and scale-up; let's review them briefly. The scale-out domain is what connects the compute pods together: think of racks of servers forming small pods. Scale-out is where we use technologies like InfiniBand or RoCE to implement connectivity for tens of thousands of nodes, so scalability matters most here, not so much raw speed; still, you get connectivity at a rate of around 50 gigabytes per second, and note that is gigabytes, not gigabits. The scale-up domain, in contrast, is usually contained within one server; this is your NVLink or xGMI technology, to give a couple of examples. Compared with scale-out it covers a short distance but at very high bandwidth: in contemporary systems the delta between scale-out and scale-up bandwidth is about 9x, meaning we go from 50 gigabytes per second to about 450 gigabytes per second. As mentioned previously, when you train a model in parallel fashion you generate, at a very high level, two types of parallelism: data parallel and model parallel. The scale-out part of the topology naturally maps to the data-parallel traffic, and the scale-up domain naturally encapsulates the model-parallel traffic.

Now let's look at how this looks topologically. Jongsoo spoke about the goal: we need to build topologies that currently contain up to 32K GPUs or accelerators, and even though that is a large number it is not the limit. Here we are looking at the fabric that instantiates such a topology. For network engineers this does not look too surprising; in fact, it is a well-known Clos topology with multiple tiers of connectivity. At the very bottom you have your racks; in our case each rack has 16 GPUs in two servers, so effectively every rack has two scale-up domains. Above the racks you have your scale-out fabric, which is where InfiniBand or RoCE comes in; in this example it is a RoCE fabric. As mentioned before, we deploy both RoCE and InfiniBand fabrics, but RoCE is the more unusual one, since you see far more public examples of InfiniBand than of RoCE, so this slide shows the RoCE instance. Notice that in each layer above the racks we have 18 cluster switches. This is important because it gives you additional capacity to protect against failures; as you will see, failures and reliability are of utmost importance for these clusters and these designs. There are a lot of details in implementing RoCE, which Adi will cover separately in his presentation on our RoCE implementation, but I want to stress that this is pushing RoCE, or really any fabric, to its limits in very large clusters of thousands of GPUs.

This slide captures what happens inside these fabrics. To recap once again, when you train models you generate two types of traffic patterns, one stemming from data parallelism and the other from model parallelism. The most challenging part is model parallelism, but before we get there let's look at the data-parallel patterns. There you generate collectives like all-reduce, all-gather, and reduce-scatter; these are well known and have been familiar for many years to practitioners training models. The message size here is usually substantial, but it grows smaller and smaller as you increase the size of the scale-out domain, and this is where some of the challenges become more evident: as we'll see later, latency becomes more visible, and latency here means propagation latency. Notably, however, data-parallel patterns can typically be overlapped efficiently with compute. It is not universal; in some cases you don't get it for free and have to optimize the model to achieve efficient overlap, but very often the scale-out, data-parallel part can be well overlapped with compute, which makes it less challenging, so to speak, for the network. More problematic is model parallelism, which results from slicing the network into pieces and passing activations between those components. There you have the familiar all-reduce or all-to-all patterns, which come from, say, tensor parallelism or pipeline parallelism, and here the bandwidth demand is much, much higher. This is where you really need the scale-up bandwidth to be efficient, because the messages are still pretty large and the demand for bandwidth is 10x or more to realize this parallelism. Most importantly and critically, it is much harder to overlap model-parallel execution with the compute part, so this is where latency and bandwidth matter much more than for data parallelism.
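A simple alpha-beta cost model, sketched below, illustrates why the data-parallel collectives become latency sensitive as the scale-out domain grows: a ring all-reduce moves roughly 2(p-1)/p of the buffer per GPU regardless of ring size, but it takes 2(p-1) latency-bound steps, so the per-step chunk shrinks while fixed per-hop latency accumulates. The per-step latency, link bandwidth, and bucket size are assumed values for illustration.

```python
# Alpha-beta cost model for a ring all-reduce over p GPUs (illustrative numbers).
# time ~ 2*(p-1)*alpha  +  2*(p-1)/p * (S / B)
#   alpha: per-step latency (switch hops, fiber, transceivers, software),
#   S: buffer bytes, B: per-GPU scale-out link bandwidth

def ring_allreduce_time(p, buf_bytes, alpha=5e-6, bw=50e9):
    latency_term = 2 * (p - 1) * alpha                     # grows linearly with ring size
    bandwidth_term = 2 * (p - 1) / p * buf_bytes / bw      # ~2*S/B, nearly flat in p
    return latency_term, bandwidth_term

buf = 100e6   # assumed 100 MB gradient bucket
for p in (8, 64, 512, 4096):
    lat, bwt = ring_allreduce_time(p, buf)
    print(f"ring of {p:>4}: latency part {lat*1e3:6.2f} ms, "
          f"bandwidth part {bwt*1e3:6.2f} ms, per-step chunk {buf/p/1e6:6.2f} MB")
```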
This diagram shows how all these collectives, so to speak, map onto the network topology. At the top you have the scale-out collectives, all-reduce, reduce-scatter, and all-gather, which map onto the cluster switches above the racks. This is where you see the rings that often result from reduce-scatter, spanning multiple switches and going across all the racks in the topology; for instance, if your training job size is 16K GPUs, you can see rings on the order of a thousand GPUs spanning across all the switches. This is where latency starts to add up, but it is also where you can still overlap the collectives with computation. At the bottom of the tree you see the model-parallel collectives, all-reduce and others, which map to the scale-up domains, for example the NVLink interconnects in the NVIDIA case. However, once you cross a single server or single board, you have to run this traffic across the scale-out fabric, and this is where you see the impact of the much lower bandwidth; as you will see, this bottleneck dictates the need to grow the scale-up domains beyond one server.

Now let's recap what changes with large clusters. You could say the scale is bigger but it looks like the same traffic and the same problems all over again, and that is pretty much so. It is important to reiterate, however, that latency starts to become important. What is funny is that in AI training we observed that network latency was not as critical as it typically is in HPC applications, mostly because you can overlap the collectives with computation. With LLM training on very large clusters, however, you now have machines that span whole buildings, so latency from switches, from the fiber, and even from the transceivers keeps adding up, and as it adds up it becomes visible for smaller messages. As mentioned before, as you increase the data-parallel domain size, the message size decreases, and this is where you start to see exposed latency and really have to pay attention and manage it much better. The second part is reliability. Naturally, as you grow the network there are more components and more elements, and they fail more frequently. To be fair, most failures happen in software land rather than in hardware, but hardware at this scale also exhibits issues, so when you bring up a system for the first time you have to go through a burn-in process to identify the bad components, eliminate them, replace them, and so on, and this takes time. The second problem is that in a large system fault isolation takes much longer: you have to track an issue across many more components, which often takes much more time than it does in a smaller setup. All of that adds to training time: more time spent debugging means less time running the actual computation.

Finally, the point Jongsoo mentioned: inference for LLMs is now becoming a networking problem. Previously we could contain inference in a single GPU; you often hear that inference runs on just one GPU, or even a single PCIe card. In the LLM case you have two challenges. First, these models grow so large that you cannot contain them in a single GPU's memory, or even a single host's memory, so you have to go across hosts just to keep the weights and associated state together. Second, you need more compute to hit the target latency goals: for example, during the prefill stage you need much more compute to achieve, say, one second to the first token for large models and long sequence lengths, and if you want to go to sequence lengths of 32K, 64K, or beyond, you have to go with distributed inference. That means you now have a mini cluster implementing the forward pass in distributed fashion: you have to run tensor slicing, model parallelism, across multiple systems, and if you are bottlenecked by the scale-out, that is now your problem to solve, because you are much, much slower.
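To illustrate that last point, the sketch below compares the bandwidth component of a per-layer tensor-parallel all-reduce when the group sits inside one scale-up domain (roughly 450 GB/s per GPU) versus being forced across the scale-out fabric (roughly 50 GB/s per GPU); the batch size, hidden size, layer count, and tensor-parallel width are made-up values.

```python
# Tensor-parallel all-reduce of layer activations: scale-up vs scale-out (illustrative).
# During decode, each transformer layer all-reduces an activation tensor of
# roughly batch * hidden * 2 bytes across the tensor-parallel group.

batch, hidden, layers = 64, 8192, 80      # assumed decode batch, hidden size, layer count
tp = 8                                    # assumed tensor-parallel width
msg_bytes = batch * hidden * 2            # fp16 activations per layer per token

def allreduce_bw_term(p, n_bytes, bw):
    return 2 * (p - 1) / p * n_bytes / bw  # bandwidth term of a ring all-reduce

for name, bw in (("scale-up  (~450 GB/s)", 450e9),
                 ("scale-out (~50 GB/s) ", 50e9)):
    per_layer = allreduce_bw_term(tp, msg_bytes, bw)
    per_token = per_layer * layers        # on the critical path, hard to overlap with compute
    print(f"{name}: {per_token*1e3:.2f} ms of all-reduce per generated token")
```

Against a 50 millisecond per-token budget, and before adding any latency terms, the roughly 9x bandwidth gap translates almost directly into per-token cost, which is why the scale-up domain is being pushed beyond a single server.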
As a result of this trend, we foresee the mini inference clusters growing to 16, 32, or 64 GPUs even in the current generation, and you can expect this trajectory to continue, though I don't think it will go far beyond 64 in the next couple of years. Now, to recap what we covered in this section: once again, the biggest shift with LLMs was the tremendous increase in computational demand, and that dictates everything. As you have seen, higher computation requires larger clusters, and it requires larger inference fabrics as well. Large clusters bring reliability issues, visibility problems, latency issues, and topological structures that require optimization. The biggest trend we are seeing is that scale-up connectivity is now expected to go beyond the rack, or beyond the node; this is probably the biggest change we have seen in topologies in the last three or four years, that explosion of bandwidth needed to realize model parallelism. And once again, inference is now also a network problem: you have to run inference across multiple systems, and it becomes like a mini cluster, similar to training but doing only the forward pass, not the backward pass. That's it for my part. Thank you so much for listening, and Jongsoo and I can take your questions now.
Info
Channel: @Scale
Views: 2,561
Id: 192S3xNbcEs
Length: 23min 0sec (1380 seconds)
Published: Thu Sep 07 2023