FCRC Plenary: Computing in the Foundation Model Era

Video Statistics and Information

Captions
Good morning, everyone. I take great pleasure in welcoming you all to the Federated Computing Research Conference this year, the ninth such event in the history of this conference series. Yesterday was Father's Day — congratulations and Happy Father's Day — and today is Juneteenth, a federal holiday here in the United States. I especially thank those of you who celebrated these special commemorative days here with us, choosing to come to FCRC 2023. Thank you.

I am Timothy Pinkston, a professor at the University of Southern California in Los Angeles, and it is my honor and distinct privilege to serve as the FCRC chair this year. My preferred pronouns are he/him/his.

FCRC started 30 years ago, back in 1993, initially convening every three years until it transitioned to convening every four years starting in 2003 — that's 20 years ago. As you know, FCRC brings together many affiliated research conferences and associated workshops during this week-long co-located event. This year we are happy to host 14 different conferences and symposia and two Computing Research Association (CRA) independent workshops. This past weekend we welcomed the start of LCTES, ISMM, and SPAA; today we welcome the start of the ISCA and PLDI conferences. I'm happy to report that nearly 2,600 total attendees have registered for FCRC this year.

Beyond the exhilarating technical conference and workshop talks, the serendipitous networking opportunities, and the interesting cross-interactions that attendees can have with people from various technical areas and backgrounds, FCRC hosts plenary sessions on topics of broad appeal given by leaders in our computing community. We hope you'll enjoy this year's FCRC and indulge fully in all that it offers. In so doing, please be mindful that FCRC promotes openness and inclusiveness. FCRC is a safe place for rich intellectual discourse, for free exchange of research results and ideas, and for open scholarly interactions. We support a welcoming environment for all to thrive, one that lives up to ACM's core values, which promote diversity, equity, and inclusion. We are grateful to have a CARES presence here at FCRC — the Committee to Aid Reporting on Discrimination and Harassment Policy Violations. CARES members are available to anyone who may experience any unacceptable behavior, which we firmly stand against. Thank you, CARES, and thank you to other organizations and committees such as CARES. You care, and I trust that we all care.

I also thank all of the conference leaders for their dedication and hard work in helping to make FCRC highly successful this year, and thereby continuing the rich history of this federated conference event. I thank all of our corporate and academic FCRC sponsors for their generous support. This year our FCRC sponsors include Futurewei Technologies and the USC Viterbi School of Engineering at the platinum level; Google at the gold level; Meta and Samsung at the silver level; and Alibaba Cloud, Amazon, Pathway, Springer, VMware, and the University of Virginia School of Engineering and Applied Science at the bronze level. Let's give a hand to all of our sponsors. Of course, I also thank all of the sponsoring ACM SIGs for their support and cooperation: SIGENERGY, SIGMETRICS, SIGCOMM, SIGPLAN, and SIGSIM. I also thank the FCRC steering committee for the wise counsel and guidance they've been giving me; Vivek Sarkar is the chair of this steering committee.
Thank you, Vivek, for all your guidance and advice. I thank the ACM, especially Donna Cappo and the executive events conference management team led by Jillian, for helping to ensure that this conference is a success, unfolding in a very organized and smooth way. Thank you. I'd also like to acknowledge the presence of the ACM CEO, Vicki Hanson, and the Chief Operating Officer, Pat Ryan — please stand — thank you for your support. Finally, I wholeheartedly thank Dilma Da Silva, our FCRC plenary speakers chair, for putting together such an outstanding set of plenary sessions, including, for the first time in FCRC's history, a planned plenary panel, which will occur tomorrow afternoon and which we hope you all will enjoy. Note that all plenaries, including this one, will be live-streamed through our FCRC website. With those opening remarks said, I'm happy to turn the podium over to Dilma.

I'm chairing the plenary speakers part of FCRC. I'm from Texas A&M University, and I'm currently serving at the National Science Foundation. My pronouns are she/her/hers, and it is my honor to introduce our first speaker — and sorry if I'm speaking too loudly; it's my teaching voice. Professor Kunle Olukotun is the Cadence Design Systems Professor of Electrical Engineering and Computer Science at Stanford University. He is a pioneer in multi-core processor design. He founded Afara Websystems to design high-throughput, low-power multi-core processors, and those processors went on to power Oracle's line of SPARC-based servers. In 2017 he founded SambaNova Systems, a machine learning and artificial intelligence company that he is still leading. He has received many awards, so I'll just mention a few of them: he is a member of the National Academy of Engineering, he received the Harry H. Goode Memorial Award, and he recently received the 2023 IEEE Computer Society Eckert-Mauchly Award. I have just one more comment before we get to the talk: there are microphone stands only on the aisles at the sides, so you'll be able to queue for questions in the Q&A session of the talk. Thank you so much, and join me in welcoming the professor. [Applause]

Thank you. Good morning, everybody. Thank you to Dilma for that wonderful introduction and to the organizing committee of the FCRC for inviting me to give this plenary talk. The title of my talk is "Computing in the Foundation Model Era," so let me begin by talking about foundation models. Everybody in the room is familiar with foundation models because everybody has played with ChatGPT. The models behind ChatGPT are huge models with billions of parameters, like GPT-3 and GPT-4, and there are other models that generate images, like Midjourney and Stable Diffusion. They're trained on huge amounts of data, both text and images — and some of them are trained on speech — and they learn some very interesting representations that give them amazing capabilities. One such capability is the idea of in-context learning. This is a property that blows up the whole idea of task-specific models that existed before foundation models came along: you can use the same representation, with minor customization using English-language text — no programming is required — and get amazing accuracy on all sorts of tasks that you never specifically trained the model for. So what happens now is that you can develop new applications in days with
undergrads — what might have taken weeks or months, or even years with PhD students, in the past. These foundation models are going to have wide-reaching implications for both technology and society, and the Center for Research on Foundation Models, led by Percy Liang at Stanford, is investigating what impact they'll have.

So what are some of the impressive things you can do with foundation models? Of course, you can generate text — all sorts of text-generation possibilities. You can write code in fairly obscure languages like Verilog. I'm a hardware designer, so I sometimes write Verilog — or maybe my students do — and you can actually get these models to generate the code. Often it's correct; sometimes it has bugs, but if you give the model the bugs and ask it to correct them, it will tell you how to fix them and why the bug occurred. You can generate art with these models using, again, English-language prompts, and you can get pretty good pictures — here's one from Stable Diffusion of New York, and it looks pretty good, much better than I could do. And of course you can also use them for scientific purposes, like trying to predict how proteins and RNA fold. AlphaFold is being used to design drugs — I was recently at a conference where somebody said that the equivalent of ChatGPT for biopharma was AlphaFold. So it's having a tremendous impact, and it's going to transform what we can do.

To attain these capabilities, the models have been growing in size over the years. They started out with BERT-Large — which by today's standards should be called BERT-Teeny-Tiny — and they have grown to models of 500 billion parameters, like Megatron. That's roughly a thousand-fold increase in size in three years: ten-fold, an order of magnitude, per year. As you train these models with more data and more parameters, they increase in quality, but they also acquire substantial new capabilities. For example, they can explain jokes. Here's one: "I tried 10,000 random restarts of my neural network, but I was accused of overfitting. I guess no good seed goes unpunished." Give this joke to a 1.3-billion-parameter model and it basically repeats the joke without really explaining it. Give it to a 175-billion-parameter model and it tells you that this is a play on the phrase "no good deed goes unpunished," and why it is funny. So you get these step functions in capability with increasing parameter size.

Of course, as you increase parameter size, you also dramatically increase the computational requirements for training and inference on these models — roughly 200 teraflops per parameter once you get into the 200-billion-parameter regime. So for a systems designer, the question is: given the increasing importance of foundation models, how do we design systems that both enable these models to scale and scale efficiently? We're in a regime where performance is limited by power, so everything has to become fundamentally more energy efficient, and the models themselves have to be
efficient. If we can scale these models by another two or three orders of magnitude, then we can potentially unleash new capabilities, but maybe more importantly we can allow more people to have access to these models, because the models we see today, which are already very capable, could then be attainable by people with far fewer resources. This would be important in democratizing access to foundation models and enabling more organizations and more researchers to do work in this area.

Not only are these models scaling, but they are not fixed: machine learning researchers continue to innovate in the algorithms they are developing for these models, and so we need to provide a way of supporting that while also making them more efficient. One way you could potentially make them more efficient would be to take an algorithm and cast it directly in silicon as an ASIC device, but then it would no longer be flexible. So the question is: can we attain both high efficiency — the efficiency you might get from a very customized ASIC — and the flexibility you might get from a general-purpose computing device? We think the answer to providing scale, efficiency, and flexibility is a vertically integrated co-design of innovations in ML algorithms; innovations in dataflow compilation environments and compilers, which are at the core of the execution model of these machine learning algorithms; and new hardware — reconfigurable dataflow architectures, which are specialized for executing dataflow models. For the rest of my talk I'm going to go through each of these components and show you how they work together to provide dramatic improvements in support for these foundation models.

Let's start with ML algorithms. The key component of most of these large foundation models is a Transformer-style architecture. Basically, the ML world has decided that this is the architecture they are going to focus on, because it provides high accuracy on a wide range of tasks. As many of you know, this is a stacked encoder-decoder architecture whose heart is an algorithm called attention. Attention allows you to draw connections between different parts of a sequence of tokens — for instance, it knows that "it" in the first sentence refers to the dog and in the second sentence refers to the street. It can learn these connections, and the key to doing that is the QKV (query, key, value) input: the attention mechanism learns the correspondence, the mapping, between a query and a key-value pair, and this is computed as a weighted sum. The key thing to know is that typical sequence lengths are in the 1K to 8K token range, and what one would like to do is increase the sequence-length capability of these attention algorithms by an order of magnitude or maybe much more.

Why is this important? The sequence length determines the context you can give to the model. If you have only 4K to 8K tokens, then maybe you can only handle ten or so pages of text, but if you have 100K tokens, then maybe you can handle books or plays and much longer legal documents — much greater capabilities are provided by longer sequence lengths. The other thing that becomes possible is that you can
analyze images at full resolution. The images you take with your mobile phones have resolutions much higher than most of the models that are actually in use, and if you could get higher-resolution models, then of course you would have more robust insight into whatever you're trying to do with these image models. And then you can open up new areas: lots of important things, such as time series, video, medical imaging data, and genomics data, can be modeled as a sequence of tokens, and what you need in order to analyze that data is the ability to deal with much longer sequence lengths.

So let's look at attention, understand what some of the problems are, and see what we might do to improve the performance of the attention algorithm. The key component of attention is the scaled dot-product calculation, where you do a matrix multiply, masking, and then an operation called softmax; it's represented in the equation as the softmax of the mask of the matrix multiply. Attention is slow because essentially what you're doing is a calculation that takes quadratic space and time in the sequence length. Pictorially, first you create the attention matrix, then you might mask, followed by softmax, dropout, and finally the multiplication by the value matrix. Attention is quadratic because you are creating these matrices, and the size of the matrix, given the sequence length, will not fit in on-chip memory, so you're forced to constantly move the matrix data back and forth between the compute and the memory. That constant movement of data becomes the bottleneck to the performance of attention, and fundamentally the size of the matrix that you can hold in your off-chip memory limits your sequence length.

So the question is: could you do better? If you gave this problem to a smart student who took my parallel processing course, they would come up with two key optimizations — and indeed Tri Dao, a PhD student who works with Chris Ré, looked at this algorithm. Many people had come to the conclusion that it was memory limited, but they were the first to come up with a comprehensive way of fusing the different components of the attention algorithm together. Even after fusing, you still have the problem that the matrix is too big for on-chip memory, and so the next optimization, which you would also get from a parallel computing course, is tiling: you tile the matrix into pieces that fit better into on-chip memory, and thereby get a much more efficient algorithm. It turns out there's one little wrinkle: you can't tile this computation arbitrarily, because the softmax computation requires values from multiple tiles in order to be computed correctly, so you have to know how to change the algorithm.
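To make the quadratic-memory point concrete, here is a minimal sketch of the standard scaled dot-product attention in PyTorch. This is my own illustration, not the speaker's code; the tensor names and sizes are assumptions. The full seq_len x seq_len score matrix is materialized explicitly, which is exactly the intermediate that FlashAttention avoids by fusing the matmul, mask, softmax, and value multiplication and streaming the sequence through in tiles.

```python
# Minimal sketch of standard scaled dot-product attention (illustrative only).
# The (seq_len x seq_len) score matrix is materialized explicitly -- the
# quadratic-memory intermediate that FlashAttention's fusion + tiling avoids.
import math
import torch

def naive_attention(q, k, v, causal=True):
    # q, k, v: (batch, heads, seq_len, head_dim)
    seq_len, head_dim = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)  # quadratic in seq_len
    if causal:
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                     device=q.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))     # masking step
    probs = torch.softmax(scores, dim=-1)                    # softmax couples a whole row
    return probs @ v                                          # weighted sum over values

# Toy usage with assumed sizes: batch=1, heads=8, seq_len=1024, head_dim=64.
q = torch.randn(1, 8, 1024, 64)
k = torch.randn(1, 8, 1024, 64)
v = torch.randn(1, 8, 1024, 64)
print(naive_attention(q, k, v).shape)  # torch.Size([1, 8, 1024, 64])
```

Because each softmax row is normalized over the entire sequence, a tiled version has to carry running row maxima and running sums across tiles and rescale partial outputs as it goes — this is the algorithmic change the talk alludes to.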
So you can't just apply the principles of parallel computing directly to this algorithm and have it work correctly; you also have to understand the algorithm and deal with this problem of softmax. But if you do that, you can get a two to four times improvement in performance on graphics processing units (GPUs) and a 10 to 20 times reduction in the amount of memory required. This is dramatic, and lots of people saw the results of this FlashAttention algorithm and started using it. If you use ChatGPT, then under the hood you're using FlashAttention. If you look at OpenAI, you see that dramatically longer sequence lengths are available, and a lot of that is due to the use of FlashAttention. The open-source folding algorithms are also using FlashAttention. So this is already having a dramatic impact on what people can do with these foundation models.

One of the ideas for scaling the size of the model without increasing the amount of memory or compute required is to use sparsity. This is an idea that goes way back in scientific computing, and so the notion was: why can't we use sparsity to train machine learning models? Lots of people have tried. The challenge is getting the accuracy that you got when you trained your model with dense matrices while actually speeding the computation up — actually taking less time. It turns out that there have been lots of ways of trying to find these sparsity patterns while maintaining the same level of accuracy, but people have not been able to do it. They've tried all kinds of ideas — lottery tickets, hashing schemes, dynamic sparsity masks — but the net result is that they either slow down training by a large factor, five times or more in some cases, or they lose accuracy. Unstructured sparsity, where you look at individual elements of the matrix being zero or non-zero, is the scheme that has been able to maintain the highest accuracy, but unstructured sparsity is not hardware efficient: hardware likes to work in blocks of computation, because dense blocks are what you can make work well for matrix computations, and they also make much better use of the bandwidth that crosses the pins of a chip.

So two students at Stanford, Beidi Chen and Tri Dao, came up with a scheme called Pixelated Butterfly, and the key idea behind Pixelated Butterfly is Monarch matrices. Monarch matrices are a way of doing sparsity in a structured way: instead of arbitrary locations in the matrix being zero or non-zero, the idea is to have blocks. You get a block-diagonal representation, and the Monarch matrix is basically a product of two block-diagonal matrices and two permutation matrices. It turns out that Monarch matrices are expressive — they can express any structured matrix, like the Fourier transform or the sine/cosine transform — and they are hardware efficient because they use dense blocks for both the matrix computation and the memory access. So the question is: can you actually train ML models using these Monarch matrices?
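Here is a small sketch of a Monarch-style structured matrix, following the description in the talk — a product of block-diagonal factors interleaved with permutations. This is my illustration; the exact parameterization in the Monarch paper differs in details, and the sizes and the particular permutation are assumptions.

```python
# Illustrative sketch of a Monarch-style structured matrix: block-diagonal
# factors interleaved with permutations, as described in the talk. Not the
# actual Monarch parameterization; sizes and permutation are assumptions.
import torch

def block_diagonal(blocks):
    # blocks: (num_blocks, blk, blk) -> dense (n, n) block-diagonal matrix
    return torch.block_diag(*blocks)

def monarch_like(n=16, num_blocks=4, seed=0):
    torch.manual_seed(seed)
    blk = n // num_blocks
    b1 = torch.randn(num_blocks, blk, blk)
    b2 = torch.randn(num_blocks, blk, blk)
    # A fixed stride ("interleave") permutation between the two factors.
    perm = torch.arange(n).reshape(num_blocks, blk).t().reshape(-1)
    P = torch.eye(n)[perm]
    return block_diagonal(b1) @ P @ block_diagonal(b2) @ P.t()

M = monarch_like()
x = torch.randn(16)
y = M @ x  # dense equivalent shown for clarity; hardware would apply the
           # block-diagonal factors directly, touching only dense blocks
print(M.shape, y.shape)
```

The point of the structure is that multiplying by such a matrix reduces to batched small dense matmuls plus permutations, which keeps the hardware in dense blocks even though the overall operator is structured rather than fully dense.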
Sparse training has been demonstrated with these Monarch matrices for image models, specifically ViT-Base. What we see is that we can use Monarch and get the same accuracy on ImageNet classification, but now with a 1.8x speedup — the first time you actually see a speed improvement from sparse training while maintaining the same accuracy. Another way of doing sparse training is to train sparse for a large fraction of the time and then switch to dense at the end. In this case, running 80% sparse and 20% dense on GPT-2 — a language model — you get a speedup of 1.7x. These results are pretty promising. They haven't been tried on the very largest models yet, but we are certainly into the billion-parameter model size, and they are showing very encouraging results.

The question then is: could you get rid of the quadratic attention algorithm altogether? As I said, attention is important for finding these correspondences between elements of a sequence; can you do that without attention? Again using Monarch matrices, there's the idea of Monarch Mixer — a new paper by Dan Fu — showing that you can use these matrices to get the kind of mixing you need to make these models work, with a speedup of 5.4x on BERT and on image models; on GPT you don't get a speedup, but you at least match the accuracy — it turns out that GPT is a more difficult model for this kind of approach. Lastly, another way of getting rid of attention is to use what are called state-space or signal-processing methods, and there's a set of papers based on these types of models, one of which is called Hyena — you'd have to ask the graduate student why he named them Hyena. The key idea is that they're attention-free and therefore sub-quadratic, and they've been shown to scale to hundreds of thousands of tokens; million-token experiments are in progress. Because they're signal-processing-style algorithms, the core computational component in them is the FFT, the fast Fourier transform, and FFTs in fact don't work very well on current ML accelerator hardware. So the question is: is the FFT going to become a more important component of hardware going forward?
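To illustrate the FFT-centric computation at the heart of these attention-free models, here is a minimal sketch of an FFT-based causal long convolution in PyTorch. This is my illustration, not the Hyena implementation; the batch size, sequence length, and filter are assumptions.

```python
# Minimal sketch of an FFT-based long convolution, the kind of kernel at the
# core of state-space / Hyena-style attention-free models (illustrative only).
# Cost is O(n log n) in sequence length instead of attention's O(n^2).
import torch

def fft_causal_conv(x, kernel):
    # x: (batch, seq_len), kernel: (seq_len,) -- a long, learned filter
    n = x.shape[-1]
    # Zero-pad to 2n so the circular convolution from the FFT acts like a linear one.
    X = torch.fft.rfft(x, n=2 * n)
    K = torch.fft.rfft(kernel, n=2 * n)
    y = torch.fft.irfft(X * K, n=2 * n)[..., :n]  # keep the causal prefix
    return y

x = torch.randn(2, 8192)            # assumed batch of two 8K-token sequences
kernel = torch.randn(8192) * 0.01   # assumed learned long filter
print(fft_causal_conv(x, kernel).shape)  # torch.Size([2, 8192])
```

The irregular, butterfly-shaped data movement inside the FFT is what makes this kernel awkward on today's ML accelerators, which is the hardware question the talk raises.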
The point being, as I said, that accelerator hardware for ML needs to be flexible, because the algorithms are changing.

The last point I'd like to make about algorithms is that ML models are dataflow. The models are developed using frameworks — PyTorch and TensorFlow — and what you get out is a dataflow graph of ML kernels — matrix multiply, convolution, pooling — with the connections between them being, of course, the tensor data. Dataflow, then, is a core component of these algorithms.

So let's think about how a compiler might optimize the dataflow in these algorithms. You start with a representation of your model in one of the frameworks, PyTorch or TensorFlow, and what you get out is this graph of operators; the question is how to execute it efficiently, or how to translate it into something that can run efficiently on an accelerator. We've spent a lot of time thinking about how to develop compilers for various architectures and accelerators, and one of the conclusions we came to is that a core component of the compiler's intermediate representation should be what we call parallel patterns. Parallel patterns are operations that tell you about both the computation and the data-access patterns. They're ones you are very familiar with — map, zip, reduce, flatMap, and groupBy — and they operate on collections; there's no parallelism without operating on data types that have multiple elements. You can convert these dataflow graphs from the machine-learning abstraction level into hierarchical dataflow graphs of parallel patterns, and then optimize at this level to improve parallelism: you can do tiling, you can do fusion, you can do a technique called meta-pipelining, which is a form of hierarchical pipelining that I'll say more about in a moment — and, as one other point, you can also represent SQL using parallel patterns. You then generate an intermediate representation that is close to the accelerator you want to execute your machine learning algorithm on. We came up with one based on parallel patterns, expressed as a hierarchy of tiled pipelines with an explicit memory hierarchy, and it is expressed in a language we call Spatial.

Some more recent work we've been involved in is a sparse abstract machine for tensor algebra and streaming dataflow. This is work by Olivia Hsu and Max Strange, and the idea is to represent linear algebra kernels expressed as Einsums and then, with a combination of the data formats and the schedules, create an intermediate representation that consists of composable primitives and a format for the tensor data that streams between those primitives. You can compose these primitives, and you can use this intermediate representation to explore the design space or to map to lower-level accelerators, such as a reconfigurable dataflow accelerator, which I'll say more about; the compiler for doing that is called Stardust. If you want to incorporate compilers like Stardust or other optimized libraries, some very recent work presented at FCRC this morning is Mosaic, an interoperable compiler for tensor algebra. The lead author is Manya Bansal, who is an undergrad at Stanford — so she has an award-winning paper and she hasn't even graduated yet; maybe she graduated on Sunday. This paper basically lets you take these Einsum representations, take components of them, and map them to existing libraries. It can do this automatically, it allows you to explore the design space, and it can fill in the bits of the expression that are not handled by the library — so you can map them to MKL or BLAS, or you can use a compiler like Stardust.
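Before moving to the hardware, here is a schematic Python sketch of the parallel-patterns idea described above, expressing a small kernel as map/zip/reduce over collections. This is my own illustration, not the Spatial or SambaFlow API; a dataflow compiler would fuse and tile these patterns rather than execute them eagerly like this.

```python
# Schematic illustration of parallel patterns (map, zip, reduce) as a compiler
# IR, written as plain Python for clarity -- not the Spatial / SambaFlow API.
from functools import reduce as _reduce

def pmap(f, xs):           return [f(x) for x in xs]    # map over a collection
def pzip(xs, ys):          return list(zip(xs, ys))     # zip two collections
def preduce(f, xs, init):  return _reduce(f, xs, init)  # reduce with an initial value

# Matrix-vector product written as nested parallel patterns:
#   for each row (map), pair elements with the vector (zip),
#   multiply, and sum (reduce).
def matvec(A, x):
    return pmap(
        lambda row: preduce(lambda acc, ab: acc + ab[0] * ab[1],
                            pzip(row, x), 0.0),
        A)

A = [[1.0, 2.0], [3.0, 4.0]]
x = [5.0, 6.0]
print(matvec(A, x))  # [17.0, 39.0]
```

Because each pattern exposes both its computation and its data-access pattern, a compiler can tile the reduce, fuse the map/zip chain, and pipeline the outer map across compute units without having to do loop-level dependence analysis.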
A while ago, when I was working on architectures of various sorts, I remember a quote by Jim Smith, who used to be a professor at Wisconsin. He said: if you have a vector problem, then you should build a vector computer. Well, we have a dataflow problem, so we should build a dataflow computer, and that is what we set out to do. We call it a reconfigurable dataflow architecture, and it is designed to execute parallel patterns expressed in Spatial — so you see the co-design between the compiler, the intermediate representation, and the architecture that is going to execute it. The key components of this architecture are a tiled array of pattern compute units (PCUs), which do the compute, and pattern memory units (PMUs), which provide the memory access. The two students who worked on this were Raghu Prabhakar and Yaqi Zhang, and it was presented at ISCA in 2017.

Let me tell you a little more about the pattern compute unit. The PCU is a reconfigurable SIMD pipeline: you have input buffers, which take data from the on-chip network that connects all the units together, some number of pipeline stages in depth to exploit pipelining, and some number of lanes in width to exploit SIMD parallelism. You can map a basic block to a PCU — in this example, each operation maps to a different pipeline stage — and so the parallelism you can exploit is the product of the width, in terms of SIMD, and the depth, in terms of pipeline stages. You configure the PCU, so you don't have to do instruction fetch every cycle — you remove that overhead — and then you stream data into the PCU through the input buffers, and the output FIFOs become the input FIFOs of the next unit.

The pattern memory unit is meant to supply data to the PCUs. Its key components are a highly banked memory unit, with the ability to supply the bandwidth needed for the load and store requests, and the ability to generate fairly complex addressing schemes to match the sparsity you might have in your algorithm — sparsity requires irregular data access, and you support that capability with dedicated address units. Data again streams into the input buffers and out of the output buffers.

The idea, then, is that you can use an architecture like this to improve FlashAttention: you take FlashAttention, with its fusion and tiling, and lay it out in space so that you can exploit pipelining between the different tiles, with dataflow execution controlled by tokens. This reconfigurable dataflow architecture lets you exploit parallelism at multiple levels: within the PCU, at the vector and pipeline-stage level, and across multiple PCUs and PMUs using the meta-pipelining approach. As you see in the picture, you have multiple tiles in flight because you've mapped the different components of the FlashAttention algorithm in space.
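A toy model — my own back-of-the-envelope sketch, not from the talk's slides — of why keeping several tiles in flight helps: with S stages laid out in space, steady-state throughput approaches one tile result per stage time, so the speedup over running the stages back to back approaches S.

```python
# Back-of-the-envelope model of meta-pipelining: S stages laid out in space,
# T tiles streamed through them. Illustrative only; real stage times differ.
def serial_time(stages, tiles, stage_time=1.0):
    return stages * tiles * stage_time        # one stage at a time, one tile at a time

def pipelined_time(stages, tiles, stage_time=1.0):
    return (stages + tiles - 1) * stage_time  # fill the pipeline once, then overlap

for tiles in (1, 4, 16, 64):
    s, p = serial_time(4, tiles), pipelined_time(4, tiles)
    print(f"tiles={tiles:3d}  serial={s:6.1f}  pipelined={p:6.1f}  speedup={s/p:4.2f}x")
# With 4 stages, the speedup tends toward 4x as more tiles are kept in flight,
# matching the "four tiles in flight" picture described in the talk.
```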
The point here is that by mapping your dataflow graphs in space, you get the natural fusion of the data communicating on chip, and you also get the pipelining effect shown in the meta-pipeline diagram: in this example, at tile four you have four tiles in flight at the same time, so potentially you get a speedup of a factor of four.

The idea of using Plasticine to exploit parallelism and accelerate machine learning algorithms worked so well that we decided to form a company. The company was founded in 2017, and my co-founders were Rodrigo Liang and Chris Ré — Chris Ré is, of course, a colleague at Stanford, and Rodrigo Liang I have known for many years. The first commercial example of a reconfigurable dataflow architecture was our RDU, called the Cardinal SN10 — those of you who know Stanford know that all the Stanford athletic teams are called the Cardinal. This was a seven-nanometer chip with 40 billion transistors, so it was a really substantial chip: 640 PCUs, with a peak of 312 BF16 (brain float 16) teraflops and support for other data types, and 320 megabytes on chip, with a lot of bandwidth to that on-chip memory because the memory is distributed.

How did the SN10 differ from the research ideas we developed in Plasticine? One of the things we did was support matrix multiply efficiently: we added a mode in which the PCUs can run systolically, so they can do matrix multiply efficiently. We added some more data formats, and we added support in the PMU for data transformations. This is pretty important — those of you who do machine learning know you often need to change the shapes of your tensors — and the data-alignment unit supports these kinds of transformations, such as transpose. We also augmented the interconnection network so that it isn't just statically scheduled; it can be dynamically scheduled, which made it much easier to develop different algorithms. In terms of the compiler, we made use of kernels, but the whole idea of using parallel patterns and dataflow optimization was basically the same as what we had developed in the Plasticine compiler. Of course, it had to be made much more robust and be drivable from PyTorch ML algorithms, so a huge amount of effort went into developing the SambaFlow dataflow compilation environment, which lets you take PyTorch ML algorithms and generate configurations that optimally use the resources of an SN10 and follow-on RDUs.

Let's look at how you can execute ML algorithms potentially more efficiently than GPU architectures can. Consider the architecture of the NVIDIA A100, a GPU with 108 streaming multiprocessors, each with 192 kilobytes of SRAM, and 1.5 terabytes per second of bandwidth to the off-chip high-bandwidth memory (HBM), which has a size of between 40 and 80 gigabytes. If you look at the memory hierarchy, as shown on the right, you have about 20 megabytes of SRAM with a total of 19 terabytes per second of bandwidth, then off-chip HBM at 1.5 terabytes per second and 40 to 80 gigabytes in size, and the whole chip has a peak performance of around 300 teraflops.
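Using the rough memory-hierarchy numbers quoted above, a simple roofline estimate — my sketch, with illustrative arithmetic intensities that are not from the talk — shows why arithmetic intensity decides whether a kernel is memory-bound or compute-bound.

```python
# Simple roofline estimate using the rough A100 numbers quoted in the talk:
# ~300 TFLOPS peak, ~1.5 TB/s to HBM, ~19 TB/s aggregate to on-chip SRAM.
# Attainable perf = min(peak, bandwidth * arithmetic_intensity). Illustrative only.
PEAK_TFLOPS = 300.0
HBM_TBPS = 1.5
SRAM_TBPS = 19.0

def attainable_tflops(arith_intensity_flops_per_byte, bandwidth_tbps):
    return min(PEAK_TFLOPS, bandwidth_tbps * arith_intensity_flops_per_byte)

for name, ai in [("low-intensity kernel (e.g. unfused attention)", 10.0),
                 ("high-intensity kernel (e.g. fused/tiled)", 300.0)]:
    print(f"{name:44s} from HBM: {attainable_tflops(ai, HBM_TBPS):6.1f} TFLOPS   "
          f"from SRAM: {attainable_tflops(ai, SRAM_TBPS):6.1f} TFLOPS")
# Low-intensity kernels are capped by memory bandwidth; once fusion raises the
# intensity, the kernel hits the compute roof instead -- the FlashAttention effect.
```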
If you look at how kernels execute on the GPU, you execute them in time, one at a time, and you move data back and forth. Compare that to what you do with SambaNova: you execute them in space, and you make use of the much larger SRAM on chip and of the off-chip DDR. The DDR bandwidth is much less, only 150 gigabytes per second, but if you lay things out in space, then you get to use the on-chip bandwidth and you get to use meta-pipelining. What you see is an improvement in performance on sparse matrix computations, and the performance improves — especially from the meta-pipelining — as the sparsity increases. The key reason is that with higher sparsity the impact of the GEMMs goes down, and because you can do the permutation on chip using the data-transformation capabilities, you get much higher performance.

If you run attention on the A100, it turns out that FlashAttention increases the arithmetic intensity enough that it becomes compute bound, but then you run into some of the limitations of running with threads: the warp limitations limit you to about a hundred teraflops on FlashAttention. The SN10 is also compute limited, but because we have many more resources we don't have the limits of threads and warps, and we have a lot more bandwidth to the on-chip SRAM — and a lot more SRAM — so you get higher performance.

Of course, one chip won't do it; you need to be able to scale to multiple chips. We do that in a cluster, and we can also connect multiple clusters together, making use of the large attached DRAM to make it simpler to scale, and we can scale out in a fairly linear fashion with very large models — models of over a hundred billion parameters. One example is a foundation model called BloomChat, a 176-billion-parameter model. This model is open source; we've shown that we can run it on the RDU, we've released it, and we show that it actually does better on fine-tuned multilingual tasks. So this is something that is becoming quite important in the open-source movement.

Finally, we want to support both language models and image models — multimodal models — and again the large memory is really important here. In the RDU, the large memory allows you to support 3D image models, and here we're showing kidney cancer segmentation done much better using a 3D model that uses the memory of the RDU to get much better segmentation accuracy than you can with the limited memory you have on conventional accelerator architectures.
So, in conclusion: computing in the foundation model era is this co-design of ML algorithms, programming languages, and accelerator architectures. We've talked about sub-quadratic attention algorithms and block sparsity on the ML-algorithms side; the importance of new domain-specific languages, dataflow compilers, and exploiting sparsity in the PL arena; and, on the accelerator side, key ideas like reconfigurable dataflow and large memory to support the large-parameter models that will increasingly be in use — and potentially new types of compute primitives, like the FFT, could become important. Thank you for your attention, and at this point I'd be glad to take questions.

There are microphones in the middle — please walk there, I know it's far — and we'll start with the one on your right.

Thanks for the great talk. I'm Rajat from Cornell. On the languages side, Spatial of course innovated a lot to make it much more scalable to design these accelerators. I'm wondering, now that the PL community is so much more interested in building DSLs and applying type-systems research and semantic models to languages that generate hardware, what do you see as the key challenges in building new programming models that support these kinds of hardware-design techniques?

I'm having trouble hearing the key substance of the question.

Okay — what are some of the key challenges on the language-design side, for programming models for accelerators and for designing accelerators?

Key challenges on the programming-languages side: I think one of the issues we're exploring with spatial architectures is that you need to do things both in time and in space, and you'd like to be able to decide smoothly when you execute things in time and when you execute things in space. We don't really see any languages today that give you that kind of expressive capability, so that's one of the challenges we think is important on the PL side. And then there's the question of how you explore the design space efficiently, because the problem with these spatial architectures is that you've blown up the number of things you can potentially do at the same time, and figuring out how to chop up your model so that you both fit within the constraints of whatever resources you have and also get the best performance is another big challenge. Thank you.

Now from your left. Hi, Kunle — Ben Zorn from Microsoft Research. Thanks for a great talk — really interesting, great advances in many dimensions. My question is about your thoughts on leveraging the newest language models, like GPT-4, to actually help you do design and implementation of hardware, algorithms, and languages.

Yeah, I think this is a really interesting area. We recently hired a new faculty member at Stanford, Azalia Mirhoseini, and she and I have started to look at this area. I don't have any hard results, but I think the notion of using these language models for hardware design — and for designing all sorts of things — is really going to be interesting. There's this whole idea of using ML to improve systems. We've started to look, with Muhammad Shahbaz at Purdue, at how you do it in the networking arena, where you can think about replacing a lot of the heuristics for security, load balancing, and other things with data-driven models. I think this approach is going to be useful across all sorts of systems. The key thing is how you find the right sorts
of models and how you train them, because figuring out all the different things you might want to do in terms of designing something, exploring the design space, and using that as training data is probably going to be a big challenge. In some of the other work we've done, that seems to be the limitation: you don't have as much training data as you might like. Great, thank you.

From Yale University: the talk was excellent, and I understood a lot about how we can help scale up these models. Understanding exactly how these large models work and do the things they do is an important task that a lot of people are working on. Can you comment on how, as system designers and hardware designers, we can help provide features to aid this work?

I'd reiterate what I said in the talk: you want to create more scalable, efficient, and flexible systems, because if you provide only certain primitives that go fast, like dense matrix multiply, then you're limiting the ML researchers to those primitives — if they use anything else, they simply won't be able to train large models. So you want flexibility, but you also want to be able to scale, and you want to provide mechanisms and the ability for more researchers to play with these large language models, which ultimately means you need to bring down the cost of training, which means more efficiency. Thank you.

Back to our left. Hello, Marcelo from Princeton University. You mentioned flexibility and programmability in these dataflow architectures. I've seen two trends: a sea of functional units, like the CGRA you have in Plasticine and SambaNova, and then another trend of teeny-tiny ISA-programmable cores. What is the reason for going with these functional units, which require upfront configuration, rather than ISA-programmable cores? Is it the last mile of performance that you're trying to achieve?

Well, there are two issues. One is — so you're saying, instead of having these dedicated functional units, why not just have a bunch of tiny cores? The problem is that if you make the cores too small, they again become inefficient, because the workhorse is going to be some sort of matrix computation unit, like matrix multiply. You still want to provide that sort of functional unit, and the question then is how you chain them together — you want to do so with the smallest amount of overhead possible, and you want to allow streaming and meta-pipelining. So the question is: if you put a core in there, with its overhead and its programming overhead, is that the best way to achieve what you want? Given that we've had a lot of experience trying to make cores run efficiently, it turns out to be pretty challenging. Thank you.

I'm Priyanka from UMass Amherst. My question is: in the long term, should we focus on efficient general-purpose computing or on specialized computing, as the models keep evolving — considering the impacts on the environment, like the embodied carbon footprint?

Well, I think you want to do both. We want to eat our cake and have it: we want efficiency and we want flexibility, and
that's why it's a challenging problem — because otherwise you'll just be wasteful in your use of resources. So it's not either/or; it's both, and that's what we're trying to achieve with these reconfigurable dataflow architectures.

I have one more question: should you first try to invent the best algorithm and then go down to improve the hardware, or should you first have good hardware and then try to improve your algorithm to work best on that hardware? What's the best approach?

Well, it's a co-design — it goes back and forth. You saw that with the FlashAttention algorithm, and you see it with meta-pipelining: you first understand what the bottlenecks are with current hardware, you adjust the algorithm, and then as you move forward you might decide that you really do need a different algorithm. Okay, thank you.

Good afternoon, I'm Siddharth from Carnegie Mellon. You mentioned a few times in your talk that FFTs are a challenge for ML accelerators, but could you shed some light on why it's difficult for ML accelerators to do FFTs, especially given that there is a lot of dedicated hardware that does FFTs?

The FFT is n log n, and it requires data movement in what's called a butterfly pattern, and it turns out that this kind of data-movement pattern is not well supported on current accelerator hardware. So that's the limitation — it's basically a data-movement problem rather than a raw-flops problem.

Okay, last question. Hi, this is Henry from Princeton — I'm the lucky one. I have a question about the memory system of SambaNova. If you have very large language models, with hundreds of gigabytes of parameters, then to perform one inference you need to at least read all the weights from — in SambaNova's case — your DRAM, which has very limited bandwidth compared to a GPU's HBM. So how do you optimize your memory system to still achieve very high performance when you have that many weights to read from DRAM?

That's a really good question. The chip and the system I showed you were optimized for training — large-batch training. If you want to optimize for inference, then you're right, you do need a lot of bandwidth. Now, you can again optimize the algorithm: you may know that there's multi-headed attention, but there's also multi-query attention, and multi-query attention needs a smaller number of KV-cache parameters, so it's more efficient. You can change the algorithm and actually do pretty well. But ultimately you want to introduce HBM, so stay tuned — we want to add a layer to the memory hierarchy, especially for inference, which is the case you pointed out. Good observation. Thank you.

Kunle will be around for a little bit, so you can still ask questions. We want to thank Kunle — and also, Tim, if there's something you need to say, better do it right now — and then let's thank the speaker one more time. Thank you. It's now the lunch break; please exit as quickly and as orderly as you can. There will be an awards luncheon in Cypress 1, and for the rest of us, lunch is in Cypress 3. Okay, thank you — the PLDI awards luncheon is in Cypress 1.
Info
Channel: Association for Computing Machinery (ACM)
Views: 2,590
Id: gADw3NtGDVE
Length: 69min 51sec (4191 seconds)
Published: Mon Jun 19 2023