BloombergGPT: Build Your Own - But can you train it? [Tutorial]

Video Statistics and Information

Captions
Hello everyone, and welcome back to Lucidate. Today we're diving into a pivotal aspect of machine learning: the design and optimization of transformer models, particularly large language models, or LLMs. We'll be discussing how to determine the ideal size of these models in terms of their parameters, attention heads and more. While this task can seem overwhelming, there are heuristic methods at our disposal to arrive at an optimum answer. But first, let's take a moment to recap what large language models and transformers are, and why they're so significant in today's technological landscape.

Over the past few years there has been an obsessive trend toward the creation of increasingly larger language models. However, bigger isn't always better. Larger models demand more computational resources, and there's compelling evidence suggesting that many such models are oversized given their respective computational budgets. Put simply, we've been building models that are larger than necessary, leading to inefficient resource utilization.

So how do we determine the optimal size for a transformer model? This is the question that many researchers have been grappling with. They're examining two crucial factors: given a fixed volume of computation used for training, measured in FLOPs, what is the trade-off between the number of parameters in the model and the number of training tokens used? By understanding the relationship between these two factors, they're attempting to optimize the size and shape of their models. Essentially, the key question is: given a fixed FLOPs budget, how should one trade off model size against the number of training tokens? One human analogy is that you can have a massive brain, but if you don't receive the right training or education you will not be able to use it optimally. Should you choose more brain power or more education?

Let's illustrate these principles with a real-world example by comparing the performance of two models developed by Alphabet's DeepMind: Gopher and Chinchilla. Gopher is the larger model, while Chinchilla was trained on more tokens. The result might surprise you: despite being smaller, Chinchilla managed to outperform Gopher using the same computational resources. This suggests that investing in the volume of training data can yield better performance than simply scaling up the model size. Here we can see two charts covering a range of relevant benchmarks. The orange bars indicate by how much Chinchilla beat Gopher in each test, while the blue bars show Gopher's wins. As you can see, this isn't even close to a tie: Chinchilla wins at nearly every benchmark. Chinchilla chose to spend more time at school studying more subjects rather than adding more brain power, and the results speak for themselves.
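To see why this trade-off bites, a useful rule of thumb (an assumption on my part, not something stated in the video) is that training a dense transformer costs roughly C ≈ 6 × N × D FLOPs for a model with N parameters trained on D tokens, so a fixed budget C forces a straight swap between model size and training data. Here is a minimal Python sketch of that swap, using Gopher-like and Chinchilla-like model sizes purely as illustrative inputs:

    # Illustrative sketch only: assumes the common estimate C ~= 6 * N * D for
    # dense-transformer training FLOPs (an assumption, not quoted in the video).
    def tokens_for_budget(compute_flops: float, n_params: float) -> float:
        """Training tokens D that a fixed FLOPs budget C affords a model with N parameters."""
        return compute_flops / (6 * n_params)

    budget = 5.8e23  # example training budget in FLOPs, roughly Chinchilla-scale
    for n_params in (280e9, 70e9):  # a Gopher-sized model vs a Chinchilla-sized model
        tokens = tokens_for_budget(budget, n_params)
        print(f"{n_params / 1e9:.0f}B parameters -> ~{tokens / 1e12:.2f}T training tokens")
    # Prints roughly: 280B parameters -> ~0.35T tokens; 70B parameters -> ~1.38T tokens.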
Let's go beneath the surface and explore the technical aspects of model scaling. There are two primary strategies for estimating the ideal size of transformer models. The first is a heuristic that relates model size, dataset size and computational resources; the second is empirical, involving a study of over 400 different-sized LLMs.

The first approach, known as the compute-data-model-size heuristic, was conceptualized by firms like OpenAI. They found that as they increased the number of parameters, the performance of their models increased. Adding more parameters always led to greater performance, and this showed no sign of diminishing with scale, so there was frankly little incentive to challenge the orthodoxy of simply building a bigger brain. Under this approach, for a 10x increase in compute budget, the number of parameters would increase by 5.5 times and the amount of training data by 1.8 times.

The second approach, the Chinchilla scaling laws, found that parameter counts and token counts should be increased in the same proportion: for a 10x increase in compute budget you would increase the parameters by 3.3 times and the tokens also by 3.3 times. This was based on empirical observations from training over 400 models, with parameters ranging from 70 million to 16 billion, on 5 billion to 500 billion tokens. You can see this in the chart on the screen, in the top left: for comparable compute budgets, the Chinchilla model opted for a much smaller model size than other models, but chose to be trained on between four and eight times as many tokens. It chose to spend its finite budget of training FLOPs on exposure to more data rather than on more parameters, again opting for a smaller brain but more time spent at college studying more subjects.

The Chinchilla results found that a smaller model trained with more tokens could achieve comparable or even superior performance for the same compute budget. This observation challenges the prior trend of obsessively creating larger and larger models, emphasizing the importance of a balanced approach to model scaling. Based on their analysis of over 400 models of varying size, the Chinchilla team were able to devise a couple of parametric equations, fitted to model performance, that specify the optimal number of parameters and tokens given the compute budget measured in FLOPs; these equations are shown on your screen.
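Those equations are not reproduced in the captions. For reference, the compute-optimal scaling relations published in the Chinchilla paper (Hoffmann et al., 2022) take roughly the following form; the constants below are quoted from that paper from memory and may differ slightly from what appears on screen:

    L(N, D) = E + A / N^α + B / D^β,   with E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28
    N_opt(C) ∝ C^a,   D_opt(C) ∝ C^b,   with a ≈ b ≈ 0.5

Here N is the number of parameters, D the number of training tokens, C the compute budget in FLOPs and L(N, D) the fitted training loss; for these constants the compute-optimal allocation works out to roughly 20 training tokens per model parameter.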
So, in theory these principles are great, but seeing them in practice is what truly demonstrates their value. Let's look at a case study where a development team successfully applied the Chinchilla scaling laws, and see how they used them to design the size and shape of their large language model. We'll focus on an industry-specific LLM designed for financial markets: BloombergGPT. The Bloomberg team set out to build a decoder-only causal language model leveraging a transformer architecture, with the Chinchilla scaling laws guiding the model's size and shape. In the paper describing the architecture, the team state that their total compute budget was 1.3 million GPU hours on 40 GB A100 NVIDIA GPUs. If we look at the NVIDIA spec sheet for the A100 we can see the benchmarked teraflops per second for this specific hardware. This is likely to be a theoretical maximum that assumes 100% efficiency; memory bandwidth limitations, latency and pipeline stalls mean it is tough to achieve in practice, and we'll address that a little later. For now, let's note this theoretical maximum and convert it to scientific notation: 1.56 × 10^14 FLOPs per second.

The 1.3 million GPU hours can be converted into seconds, which for consistency we'll represent as 4.68 × 10^9 seconds. Multiplying the two together, the budget equates to 7.3 × 10^23 floating-point operations. With reference to the paper again, the adoption of checkpointing acts as a kind of tax on the FLOPs utilized, so we'll need to apply a scaling factor of 75% to this number of FLOPs. And, as mentioned earlier, we're unlikely to achieve the maximum rated capability of the hardware, so we'll assume we get 66% of the specified compute power, meaning our total budget is 3.61 × 10^23 FLOPs. If we plug this into the Chinchilla scaling laws for approach one, we get pretty close to the number of parameters that Bloomberg state in their paper; the difference is likely an overestimation on my part of the efficiency of the A100.

Once we have the number of parameters, it's a matter of applying some further empirical formulae and best practices to get the other model dimensions. The approximate formula for the number of parameters is shown on your screen: roughly twelve times the number of layers times the square of the hidden dimension, plus the vocabulary size times the hidden dimension. We know that the number of parameters we want is approximately 50 billion, and the vocabulary size for BloombergGPT is 131,072 tokens. The Bloomberg paper also acknowledges some constraints and best practices: work by Levine et al. establishes an optimal relationship between the number of layers and the hidden dimension size; it's best practice to have the hidden dimension evenly divisible by the number of attention heads; and for optimal tensor performance on the GPU it's preferable for the hidden dimension to be a multiple of eight. This leaves us with an optimal model, as shown in the table.

So you see, the Chinchilla scaling laws, while a theoretical concept, have practical applications in guiding the design of large-scale transformer models. This example should inspire confidence that these laws, rules, heuristics and best practices can effectively guide the design and scaling of your own models. If you wanted to build this model, it's a relatively simple thing to code with the Hugging Face Transformers library; here's the code (an illustrative sketch of what it might look like appears at the end of these captions). However, unless you have access to a vast computing resource with loads of RAM on your CPU and GPU, this is unlikely to be a successful endeavor: as you'd expect, these models are too large to be hosted on a home PC, unlike the smaller models in some prior Lucidate videos.

To sum up, our exploration today suggests that the future of language model development may necessitate a shift in our scaling strategies. Instead of focusing only on creating larger models, we should also strive to maximize the utility of our training data. By striking a balance between model size and training data volume, we can create more efficient and more powerful language models. Thank you for joining us today on this deep dive into optimizing transformer model sizes. If you found this video insightful, do give it a like, and don't forget to subscribe for more explorations into the world of machine learning. For exclusive content, consider joining as a Lucidate member. Until next time.
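As referenced above, here is a minimal sketch of the kind of Hugging Face Transformers code the video describes. The on-screen code is not part of this transcript, so this sketch simply assumes the shape published in the BloombergGPT paper (70 layers, 40 attention heads, hidden dimension 7,680, vocabulary of 131,072 tokens) and uses BLOOM's configuration class as a stand-in, since BloombergGPT's architecture follows BLOOM:

    # Minimal sketch, not the code shown in the video. It defines the architecture
    # with randomly initialized weights; actually instantiating a ~50B-parameter
    # model needs far more memory than a typical home PC provides.
    from transformers import BloomConfig, BloomForCausalLM

    config = BloomConfig(
        vocab_size=131_072,  # 2**17 tokens, as quoted in the video
        hidden_size=7_680,   # a multiple of 8, evenly divisible by the head count
        n_layer=70,          # depth chosen against width following Levine et al.
        n_head=40,           # 7680 / 40 = 192-dimensional attention heads
    )

    # model = BloomForCausalLM(config)  # uncomment only on hardware with ample memory
    # print(f"{model.num_parameters() / 1e9:.1f}B parameters")

With these dimensions the approximate parameter formula above gives about 12 × 70 × 7,680² + 131,072 × 7,680 ≈ 50.6 billion parameters, in line with the roughly 50 billion figure quoted for the model.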
Info
Channel: Lucidate
Views: 13,565
Keywords: machine learning, deep learning, natural language processing, what is machine learning, machine learning tutorial for beginners, machine learning tutorial, Transformer Models, Large Language Models, Model Optimization, Deep Learning, Natural Language Processing, Chinchilla Scaling Laws, Gopher vs Chinchilla, Training Data, Model Parameters, AI Research, BloombergGPT, Deepmind, OpenAI, Lucidate AI Insights, build a transformer from scratch
Id: TSqkKH4zuZY
Length: 12min 31sec (751 seconds)
Published: Tue May 16 2023