2023 LLVM Dev Mtg - Mojo 🔥: A system programming language for heterogenous computing

Captions
Thank you, Alex. Wonderful, we're very happy to be here and talk a little bit about what we've been up to. We'll start with what Mojo is. At a glance, the top-level point is that Mojo is a Pythonic systems programming language: we're here to do really cool things with systems and compilers, and it happens to look like Python, but please forget everything you know about Python. The language is about one year old, so it's still early and still in development, but it's already doing some interesting things, and we have a vibrant community: over 150,000 users, a big Discord, and a lot of excitement around this. Today we'll dive into why we did this in the first place, which is something we're often asked, how we approached designing a new language from scratch, internal implementation details including some of the horrible things we did to LLVM, what this means for accelerators and compute, and then we'll wrap up.

First, why? Many of you work on AI, and if you work on AI, the question I'd ask is: if AI is so important to the world, why is all the software so bad? This is a huge question and a huge problem, and many of us who have worked in this industry for a while have been struggling to solve it in different ways. For me, the challenge is really fragmentation and complexity: all these systems that do not work very well together, built by well-meaning people in different groups and areas, and for a user that's a huge pain point. Why is that? I'll speak for myself: if you're enabling a chip, you're focused on the chip. Many of us are paid to solve one specific problem, not an industry-scale problem, and you can't afford to solve the bigger one: you don't have the time, the schedule, or the headcount, and often the organization you're in makes it very difficult to solve some of these problems. Our approach at Modular is that we need fewer things that work better, and that's what led us to building Modular in the first place: an organization that can span many of these problems and invest for the long term, hopefully lifting the industry over time.

How do we do this specifically? We're building what we call the AI Engine. If you look at the modern ML stack, a lot of folks are trying to throw layers of Python on top of all the AI tech that has been built up; we're tackling it at the hardware/software boundary, reinvesting, no surprise, in compilers. We want to unify and integrate all these low-level technology systems so that innovation can happen on top, with programming models, frameworks, and all that kind of stuff. Our approach is to meet people where they are: people use PyTorch, JAX, and TensorFlow, and that's awesome; these all have pros and cons, and there's other stuff as well, and very few people actually want to rewrite all their code. So it's very important for us to be drop-in compatible, meet people where they are, and work with their existing systems. The other thing is that this is not a research project. There are a lot of really interesting and cool things that have been built
over the last eight or so years of AI infrastructure, and it often gets fragmented out into all these different systems. We've learned from many of them, and what we're doing is pulling that back together and doing hardcore engineering, not research, to build a production-quality system that we hope can scale for the world.

I'll go through this quickly: what is an AI engine? It's really a couple of things. One is the operator graph, and the interesting case is heterogeneous. People often focus on, for example, a GPU and how to make matrix multiplications go fast, and that's a super important problem, but folks often forget that AI today is a distributed problem: it involves the host, the accelerator, pre-processing, data loading, the whole thing. You can't really solve the AI problem for a user unless you tackle that whole problem. Furthermore, it really is heterogeneous: there are all kinds of different accelerators and all kinds of different hardware, and when you have a cluster of many machines, the microarchitectures don't always match. There's a lot of complexity in this space.

Many of us have been working on this for a long time, and we've seen the rise of kernel libraries; that's how many of these systems were first built. One of the challenges, which I won't go into in depth because many of you probably already agree, is that kernel libraries don't scale. So many of us, for multiple years now, have been building AI compilers, and there are lots of them with lots of different approaches: online kernel fusion, lots of cool algorithms get invented and used. We can talk about all the different pros and cons and trade-offs, but the thing I want to claim is that neither of these approaches scales. Kernels don't scale, which hopefully many people understand, but neither do ML compilers, and to a compiler audience that is maybe more controversial than it is to a kernel audience. So I thought I'd dive a little into why that is and the challenges we see, which led us to our approach with Mojo and this system.

The first is generality. Empirically, ML compilers today are not very general. Generality includes not just matrix multiplication but data loading, pre-processing, all of this stuff, and also dynamic shapes and sparsity. There are better and worse systems out there, and there's definitely progress in this area, but if you're coming at it from a user's perspective, they want things to just work, and if they don't just work, they'll move on and spend their time on something else. Generality is also hard because, if you're coming from a hardware-enablement perspective, you don't really have time to invest in all the other parts of the problem, so it makes sense that many of us working on bringing up a chip don't focus on the big-picture parts of the problem.

Another one is community. You are all wonderful compiler nerds, I love you all, obviously, and I'm a pretty big compiler nerd myself, but the problem is that nobody can hire compiler engineers; this is pretty well known. With AI compilers this becomes even worse: how do you hire somebody who knows compilers, who knows AI modeling and all the exotic new models of the day, who knows all the numerics and the data types, and who knows all the specialized hardware? How do you find that unicorn who knows all of these things together? It's very, very difficult, and if you need a
compiler engineer to be in the loop of novel research, there are very few companies in the world that can afford and attract the people to do that. So I believe you cannot have a compiler-first approach to this problem, simply because there isn't enough talent out there; I love you all, and you're all very valuable, but this is very difficult, particularly at the scale of what AI research is today. Second, if you're a compiler engineer, it seems really weird that we're re-encoding all of compute into IR builders and stamping all this stuff out, so you feel like there must be a problem here at some point. Finally, there's the fragmentation problem. If you want to build a heterogeneous compute system, you have to face the reality that the AI developers, the researchers, are in Python; the frameworks and the host-side compute are all in C++; and the device side is in CUDA and SYCL and other things. If you want to build a system that can scale across all these different levels of abstraction, there's a huge fragmentation problem, and we need to be able to unify it, otherwise we can't have one system that can reason about the whole thing. So if you want to solve this problem, you have to come back and look at the big picture of what's going on, and the nature of compute has changed. That's what led us to Mojo.

Now, how did we approach building Mojo? You know the outcome, and we'll talk a lot more about how it works, but how did we even get here? When we started Modular, we started with a thesis, a hypothesis: we believed we could get state-of-the-art performance against a lot of vendor systems, and do so with a single source of truth in our code for numerics. This hasn't really been done before; there are definitely systems in the space, but this thesis, if true, can enable and unlock a huge amount of innovation in the industry. So we said, okay, let's go invest in some very fancy compiler technology: generalized fusion, caching, integrated distributed compilation, lots of cool stuff. Let's figure out what we want to do and then go validate it. But for validation we didn't actually care about syntax. So what did we do? We went and built the thing: we built a compiler and completely ignored syntax. Why? Well, MLIR is great; you can write MLIR by hand; you don't need a frontend. So we could go build major kernel libraries and things like that, validate architecturally that we could deliver the performance we wanted, show that the compiler worked, and iterate rapidly on the compiler without having to change a dependent frontend. What we found, fortunately, is that it works: the technology we built actually is good, it worked, it was proven out. And then we immediately figured out that writing large amounts of MLIR by hand is maddening, it doesn't scale, and there's no way a normal user could actually do it. But this validation of the algorithms, of the compiler tech, of the low-level system, which is quite novel and which Jeff will talk about later, was really important to building our system, and doing it without being anchored on syntax was very good both for focus and for the ability to iterate.

Once you get there, you get to the point of asking: what about syntax? Syntax actually does matter. The three major approaches we looked at were: do we take an existing language like C++ or Swift or something like that, do we do an
EDSL, or do we do a new language? When we were talking about this, we came back to our core principles, our values, our goals, which are that we want to meet people where they are, and whether you like it or not, AI developers, and really most software engineers, are all in Python. Python is arguably the most popular programming language in the world, and if you're coming from a Python viewpoint, arguing with people to get them to switch to a different thing, trust me, I've been there, is a huge amount of work and doesn't really go anywhere. So we realized, and believed, that we had to go with Python, and that meant a bunch of existing systems were suddenly off the table: C++ is not Python, Swift is not Python, and that really focused our frame.

What about EDSLs? EDSLs are super common and super popular, and they exist for lots of good reasons: they're relatively easy to implement; we've had several talks at this conference about how you can extract and build IR from Python ASTs and things like that; you don't have to build tooling; you don't have to retrain people; you can get to market fast. The problem is that they provide a really bad developer experience: you don't get a debugger, they don't really fit into existing systems, and if you care about host performance and generality, Python isn't there, at least not at the level of performance we care about. What we really want is a system that allows us to innovate at all layers of the stack.

Okay, how about a new language? You know where we're going with this. A new language has the advantage that you get the best quality of result; you can control everything; you can invest in things; you can target CPUs with high performance, which is quite important to us. But you need a strong vision for what you're trying to do, you need a long-term commitment, because the demo is easy but a production-quality thing is hard, you need to be able to pay for it, you need to be able to attract people, and you need a big enough target audience of developers to make it worth doing in the first place. Building a new programming language is well known to be ridiculously expensive; it's not a simple thing you should reach for as your first option. But, as you know, we wanted baby little Mojo to be built, and we decided to actually do this. Why? Because it's the only way to achieve our goals: to achieve the best quality of result for AI developers, and many other developers worldwide, and to be able to lift the industry. There are many point solutions that demonstrate many different capabilities, but we really want to go beyond that and integrate and unify the world. Coming back to what we need to do, we think we have all the constituent ingredients: a good vision, we think we know what we're doing, and we also know how hard this is. I've personally built several major programming languages that are used in production, have seen the entire journey, made many mistakes, and learned from them. So with full knowledge we step into this and say, okay, let's do it.

I'll give you the high-level design points of Mojo. As you know, it's a member of the Python family; over time it will grow into being a full superset, because we don't want to do a Python 2-to-3 thing to Python programmers again. As we said before, it's focused on
systems programming and high performance, working backwards from the capability, the speed of light, of the hardware, definitely not working forwards from what Python can do today. It also targets lots of hardware; anything with a program counter should apply. And, coming back to something we'll talk about a little more, it's about unlocking the Modular compiler stack. So instead of talking about the high-level fluffy stuff, I'll introduce Jeff, and he can tell you a little more about how it actually works.

Thanks, Chris, for the introduction. We started off by de-risking the core hypothesis: we have an MLIR-based compiler that is a little different from the systems that predated it, and we've proven that we can beat state-of-the-art. The problem is that we ended up with something like 50,000 lines of handwritten MLIR, and handwritten MLIR is write-once, read-never. It's verbose, you have to write the types every time you use an SSA value; it's pretty hard to actually write incorrect code, but it's not readable, it's unmaintainable, and the new people being brought into the company were asking, what is this? So we needed syntax; we needed a programming language for MLIR.

Why all MLIR? It turns out that modern computers are getting really complicated, and modern types are getting really complicated. Look at just floating point: most languages, give or take, have a float and a double, but MLIR has things like 8-bit float8 E4M3 types. I'm sure it's useful; there's probably a piece of hardware somewhere that uses this data type and is very fast at it, and that means we need access to it. That's just the tip of the iceberg: MLIR is a vast ecosystem with many different kinds of hardware targets, domain-specific dialects, and so on, and we would like Mojo to be able to take advantage of all of that. So we need syntax sugar for MLIR in general.

How do we approach something like that? We start with the types. In a programming language, types tend to be the most load-bearing element; you need types to do computations on, after all. So let's start by focusing on a library-based language: we write all the parts of the language in the library. The good news is that anybody can write libraries, so this scales the engineering effort to everyone in the world who can write Mojo, not just the couple of people who work on the language. That's really important, because we don't want built-in types to be special, or more performant than what you can build in a library; that would bottleneck performance and the scalability of the system on the people who work on the language. So we need to give the people who use the language, the library authors, the same power as the language engineers.

It turns out Python has a really extensible type system. You could argue that user-defined types in Python are actually much more powerful than the built-in types like int and float, and the reason is that Python lets you encapsulate type semantics behind dunder methods, which are really syntactic wrappers. So let's just use that in Mojo: we use a struct, which is like a class but densely packed and performant, to wrap an MLIR type, and then we use dunder methods, as well as class methods, to wrap MLIR operations. What you get is that any MLIR type will work and any MLIR operation will work, and `1 + 2` desugars to the MLIR op `index.add`.
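To make that concrete, here is a minimal sketch of the pattern, loosely in the spirit of Mojo's own Int: a library struct wrapping the MLIR `index` type, with `__add__` lowering straight to `index.add`. The decorator spellings and MLIR-interop syntax shown are assumptions based on 2023-era Mojo and may differ from the real standard library.

```mojo
# Hypothetical sketch of a library-defined integer over the MLIR `index` type
# (2023-era Mojo syntax assumed; not the actual stdlib Int implementation).
@register_passable("trivial")
struct MyInt:
    var value: __mlir_type.index

    @always_inline("nodebug")
    fn __init__(value: __mlir_type.index) -> Self:
        return Self {value: value}

    @always_inline("nodebug")
    fn __add__(self, rhs: MyInt) -> MyInt:
        # `a + b` on MyInt desugars to exactly one MLIR operation.
        return MyInt(__mlir_op.`index.add`(self.value, rhs.value))
```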
The other important aspect is making sure these user-defined abstractions feel native, that they're zero-cost. How does Mojo do that? It has a couple of bells and whistles to tell the compiler to treat a type in a specific way, effectively giving a built-in-like experience. One of these is `@always_inline("nodebug")`, which always inlines the function, no questions asked, and, for a better debugging experience, strips out all the debug info, so you don't step into the `+` of an integer. If we put this all together, just with these basic types, and write a simple while loop in Mojo, the parser will spit out a bunch of source-level IR, but Mojo has guaranteed optimizations that run all the time, such as the always-inliner, and this gets desugared down to IR that is pretty close to what we would have written by hand. That's important because it gives the programmer a predictable IR-generation model from the get-go, and it gives us an off-ramp from all the handwritten MLIR. So it turns out we've discovered what MLIR really stands for: Mojo 🔥 Language Intermediate Representation. The best part is that your dialect works too; this is zero-cost abstraction around any MLIR. Say you have a shape dialect with a shape type, and it implements `+` as concat and subscript as get-dim: now you can write shape functions in Mojo, it spits out IR that has been desugared, and you can ingest that IR and do cool compiler stuff like shape inference. And all of the language tooling just works: you get code completion, doc generation, syntax highlighting, and even debugging if that's relevant.

But MLIR just forms the bottom level of the language; it's how we talk to the hardware and to the various dialects. Building on top of that requires higher-level abstractions, and the way you do that in Mojo is metaprogramming. Mojo needs to provide hardware generality, and the way we do that is with metaprogramming: you can write a kernel without caring what the vector length is and then, in this example, ask the compiler to pick one for you. It turns out metaprogramming is also pretty cool in its own right: generics are nice, code reuse is great, and it allows scalable development.

So where can we look for a metaprogramming system? I actually like C++, I don't know about you, and C++ has templates. Duck typing in C++ is really powerful and lets you write some pretty crazy generic code. The problem is that the usability is poor; I think template error messages get better every year, but there's still some room to go. And it turns out that for the kind of metaprogramming that high-performance programming needs, C++ templates just aren't good enough. Imagine you have a tensor type: it has a static or dynamic rank, a static or dynamic dtype, a partially dynamic shape, partially dynamic strides; it gets ugly pretty quickly. So it's not good enough; let's see if we can build something better.

It turns out, once again, that Python actually has really powerful metaprogramming: decorators can arbitrarily modify objects and, say, return a function where there was a type, and that, together with full reflection in Python, is what enables all these crazy libraries, the ML frameworks like PyTorch, JAX, and TensorFlow, as well as things like Numba. The problem with Python metaprogramming is that it happens at runtime, which means it's slow, it's not going to run on an accelerator, and it gives us zero control over the generated code.
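For a concrete picture of the compile-time flavor this is building toward, here is a hedged sketch of the vector-length example mentioned above: the kernel is parameterized on the SIMD width, and the width for the current target is chosen at compile time rather than hard-coded. The standard-library names (`simdwidthof`, `SIMD`, `DType`) are assumed from the 2023-era library and may have moved since.

```mojo
# Sketch: a width-agnostic kernel; the target's preferred width is a compile-time
# value, so no vector length is hard-coded (stdlib names assumed, see above).
from sys.info import simdwidthof

alias width = simdwidthof[DType.float32]()   # chosen per target at compile time

fn scale[w: Int](v: SIMD[DType.float32, w], s: Float32) -> SIMD[DType.float32, w]:
    return v * SIMD[DType.float32, w](s)

fn main():
    let v = SIMD[DType.float32, width](1.0)
    print(scale[width](v, 2.5))
```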
So the challenge for us is to do it at compile time, and that brings us to Mojo parameters. Mojo parameters are compile-time values that form the backbone of the metaprogramming system: structs can have parameters, which are compile-time values; functions can have input parameters; and you can declare named parameter values with `alias` declarations. You can think of them as being like C++ templates, but they're a little different. For example, in C++ you have `using` declarations for type aliases and `constexpr` declarations for compile-time values, but in Mojo types are just compile-time values, so a type alias and, say, a compile-time float or a compile-time int are all the same thing. The most important thing this gives you is that the meta-language is the same as the actual language. Zig really blazed the trail here by having no distinction between the meta-program and the actual program, and in Mojo we strive to ensure that almost any user-defined type and function can be used and called in a parameter expression at compile time. The way we do that is with an MLIR interpreter that has a full memory model. To really drive the point home, we have an example here: fill a vector with a bunch of integers. Not too bad; this function can be called at either compile time or runtime, and if it's called at compile time you can even return a type instance, and this vector has heap allocation that is computed at compile time and then used at runtime.

So when does this happen, when do we do instantiation of parameter values, function specialization, and interpretation of code? It doesn't happen in the parser like in C++. In Mojo we do parameter instantiation in a process called elaboration, and it happens later in the compiler pipeline. What that means is that Mojo needs an IR representation for parametric code. In this example we have a piece of IR with a parameter in the IR called `value`, and importantly this parametric IR is target-agnostic; it's portable. That means something like `sizeof` lives directly in the IR and is resolved by the elaborator, and this enables things like split compilation, like CUDA, and perhaps one day separate compilation of generics, like Swift. The elaboration pass is an MLIR pass that performs function instantiation as an IR transformation: in this piece of IR we've got two calls to the function `print_int` with two different parameters; they get stamped out into two new functions, and the callers are updated appropriately. One consequence of doing elaboration as a pass is that the language is late-bound by design, which poses a couple of language design challenges, but it means you can do cool stuff like autotuning, where any parameter value can be autotuned: the elaborator says, okay, `width` can be 2, 4, 8, 16, or 32, let me go make five instantiations of this function and use some benchmarking to pick the best one for you. This is how we get the very bottom layer of hardware abstraction: the programmer writes an algorithm, and we let the programming language pick the best parameter.

This also allows us to avoid some of the performance problems with C++ templates. Say you have a generic function `add`, and for generality you pass the arguments by const reference. Passing by const reference is fine for a large struct-like thing that doesn't fit nicely in registers, like a string, but for something like an integer this ends up becoming a const reference to an int, which for a trivial type like int is not very performant, and if the function doesn't end up getting inlined, the ints get pinned to the stack, which is bad for performance.
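Circling back to the claim that almost any ordinary function can be called in a parameter expression: here is a small self-contained sketch of that idea, simpler than the talk's vector example, with 2023-era spellings like `alias` and `fn main` assumed.

```mojo
# Sketch: the same ordinary function runs at compile time inside a parameter
# expression (via the MLIR interpreter during elaboration) and again at runtime.
fn sum_of_squares(n: Int) -> Int:
    var total = 0
    for i in range(n):
        total += i * i
    return total

alias PRECOMPUTED = sum_of_squares(100)   # evaluated at compile time

fn main():
    print(PRECOMPUTED)          # result baked into the program
    print(sum_of_squares(100))  # identical code, executed at runtime
```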
With late elaboration in Mojo, we can have late ABI lowering, which basically means the source code is not the same as the ABI. That makes language interop slightly more involved, but it's not a big deal, and what it means is that for a generic function like `add` in Mojo, when the elaborator instantiates the generic types, it can then change the calling conventions to respect the guarantees it has: a heavy type like a string stays in memory and gets passed around as a pointer, which is nice and efficient, while an integer gets passed around in registers, in SSA registers, and returned as a function result.

That's just an introduction to how Mojo metaprogramming works; let's talk now about how the compiler architecture works and some of its more unique details. One of them is that the entire Mojo compiler stack is driven by the ORC JIT, from bottom to top. This gives us lazy, on-demand compilation, so you don't compile things you don't have to; it enables responsive tooling; it turns out a JIT is important for things like autotuning and search; and we get compiler caching at each stage of the pipeline, meaning you don't need something like ccache to get compilation caching. We also use ORC JIT not actually as a JIT: we use it to generate static code, like static archives and executables, and inside ORC JIT we've built a really dumb but fast linker that just takes a bunch of object files, pulls out the symbols, and slams them together into a static archive. For a real linker we call into the system linker.

As we mentioned before, we have a pre-elaboration, portable IR, and that means we can serialize it into MLIR bytecode, which makes Mojo packages architecturally portable. A Mojo package contains this parser-level, source-level IR as well as the pre-elaboration IR, and optionally the post-elaboration and pre-compiled code for various targets. What this means is that you can ship Mojo packages without source code, with just the bytecode; the parser is able to read out this source-level IR and reconstruct metadata like function signatures, type members, and so on. And with optimized and pre-compiled code in the packages, Mojo packages become like portable build caches: if you're on a common system like an M1 Mac and you pull a Mojo package, it will probably already have the pre-built code for you.

So what does compilation with a package look like? If you start by importing a function from a package, the parser goes and reads the declarations out of the package. It will then lower into the full pre-elaboration IR, and the reason you need the full parametric IR is so that you can instantiate the function again and so that the elaborator can call the interpreter. For pre-compiled code, during elaboration we don't re-optimize and re-instantiate all the functions; we just drop them, as post-elaboration IR, into the MLIR module. That gives us LTO in MLIR, and while MLIR is pretty far away from link time, it's a similar idea, but we actually drop these pre-compiled functions out of the IR before we go to LLVM, and that has some interesting implications.

So Mojo is a bit of an unusual, probably slightly controversial, user of LLVM. LLVM is fantastic, we love LLVM, we love everyone here,
but it has a couple of issues. The most standout of these is that it's single-threaded, and what that means on a modern system, like a 192-core AWS machine, is that you get an arbitrary slowdown in compilation speed because you only use one core. The other problem is that LLVM has a couple of passes that don't tend to be strong enough for our use cases and that are difficult to control and predict; a lot of the stuff in LLVM was built for something like Clang, but in Mojo we'd really love to be able to autotune an unroll factor. The good news is that MLIR is a thing, so let's focus on the excellent strengths of LLVM: it's great at scalar optimizations from instcombine and other function-level optimizations like loop strength reduction. We ended up disabling passes like the vectorizer, the loop unroller, and even the inliner, as well as a couple of the other IPO passes. The solution is to replace them in MLIR, where we get parallelism in the pass infrastructure, and to push many of these optimizations out into the library, which is something Abdul will talk about in a bit. What happens when you get rid of all the IPO passes? You get to use LLVM as a per-function code generator. This gives you full codegen parallelism at a function level across the entire stack, and what that means is that pretty much the entire Mojo compiler pipeline is fully parallelized, except for the linker and the parser, and the parser could be parallelized one day. That's really just the tip of the iceberg of what we could fit into one presentation; there's so much more to Mojo, and there will probably be more talks in the future, but for now I'll pass it over to Abdul to show you how to write some fast code in Mojo.

Going back to what Chris said at the very beginning, we had a hypothesis: we want to write fast code; that's why Mojo was written in the first place. We wrote things in MLIR and proved out a lot of the tech, so let's write those things in Mojo and show the performance. But first, let's step back: how are existing performance libraries built today? The short answer is: whatever it takes to get performance. There's no style guide or anything like that being maintained, and that also means there's a lot of suffering, because there's a lack of tooling and so on. So some people write things in assembly; great, but please don't, it's not a super productive programming language. Others build compilers as C++ templates, and God forbid one of the sevens becomes a six, because you'll get some nasty error message. Others build C++ DSLs that generate asm; others write Python programs that generate assembly; others write Python templates that generate C++ templates that then get fed into Clang. And these are not research projects; these are production libraries that are used today, you've probably already used one, built by the big companies. As a result you lose a lot: you lose maintainability, debugging, and tooling, and it becomes hard to develop and iterate on these performance libraries. That's why they call the people who write them performance ninjas: you lock them in a room, give them some coffee, and they give you a speedup. We don't want to do that; we want to reduce suffering.

The other thing is that these performance libraries are pre-built and shipped as black-box binaries, and because they were built ahead of time, all the hardware semantics, tile factors, and so on are encoded in the library. You've made it into a black box, so other
higher-level things in the stack, like a graph compiler, cannot reason about what the library is doing. You've also encoded specialized patterns, popular things like a ResNet block or a Transformer block, into your library, and what happens if there's a Transformer version two or a ResNet 53? You're kind of stuck in that domain. There are other issues too: there's no consistent API, there's BLAS, there's BLIS, there's oneDNN, and so on, and the distribution story is even worse: there's oneDNN and then there's ZenDNN, but if you're on Arm you have to use something else again. We want to solve all of these things, and that's the reason we built Mojo: we built it to solve our own problem of writing high-performance libraries.

The first thing we want to make sure of is that the developer is happy and has all the tools they need to be productive. As Chris mentioned, a lot of developers are not compiler engineers: they can write libraries, but they probably cannot go write a compiler pass and so on. So let's put optimizations in the library, and I'll have some examples later on. Let's also leverage what computers are good at. When I was in grad school, a lot of grad students were essentially grid searchers: they would enumerate everything, try fifty things, you lock them in a room for a month, and they come back and say the best tile factors are six and four. Let's not do that; let's use computers. Computers are great at this sort of thing, they can scan, they can do smart searches, so let's use autotuning and algorithmic selection, and let's build that into the language. And let's make sure we have tooling to make these people productive, debuggers and so on: how do you even debug the Python template that generates the C++ template that does something else? It's hard enough to debug C++ templates to begin with.

Let's also build a language that's aware of the 21st century. SIMD is a thing, so let's be SIMD-first: let scalars be the degenerate form of SIMD, SIMD of length one, and make the SIMD type parametric; and let the standard library we ship have first-class support for SIMD types. Multi-core is a thing, so let's build parallelism and asynchrony into the language as well. And finally, we can have all these nice things, but sometimes people say, I want my assembly back, or I want to use that LLVM intrinsic. Well, all of this is built on top of MLIR and LLVM, so you can get at any of the intrinsics you want, you can reach into them, you can also write inline assembly, which is kind of interesting given that you're in a Python-syntax language, and you can target any LLVM backend. We're standing on the shoulders of giants here, leveraging all of the LLVM and MLIR backend infrastructure.
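To ground the SIMD-first and reach-for-the-intrinsic points, a hedged sketch follows. The names and import path (`SIMD`, `DType`, `sys.intrinsics.llvm_intrinsic`) are assumed from the 2023-era standard library and may have changed or moved since.

```mojo
# Sketch: scalars are just SIMD of width 1, and library code can call an LLVM
# intrinsic directly when it wants to (names and import path assumed, see above).
from sys.intrinsics import llvm_intrinsic

alias Float32x4 = SIMD[DType.float32, 4]

fn sqrt4(x: Float32x4) -> Float32x4:
    # Reach straight down to the LLVM intrinsic from ordinary library code.
    return llvm_intrinsic["llvm.sqrt.v4f32", Float32x4](x)

fn main():
    let s: Float32 = 3.0                      # Float32 is SIMD[DType.float32, 1]
    print(s * s)
    print(sqrt4(Float32x4(1.0, 4.0, 9.0, 16.0)))
```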
Let's also not build a DSL. Even though some of our use cases are AI, the programming language should be general: I should be able to do some operations in Mojo but then do the plotting through our Python integration, and that requires a general-purpose programming language. One of the decisions we made is to keep the compiler lean and move a lot of the optimizations and infrastructure into functions in the Mojo library. So we use a very limited number of dialects in the Mojo core, and I know this might be controversial: we're not using the vector, arith, linalg, or NVVM dialects; we're only using the LLVM and index dialects. There are a bunch of reasons for that: sometimes they're not general enough, sometimes they don't fit our use case, they bring in a lot of code we don't care about, and, for lack of a better term, there are sometimes cyclic dependencies and so on. And having a lot of the functionality in Mojo code means you can iterate much more quickly.

So let's implement something like a vector dialect in Mojo. We have the SIMD type and a function called reduce_max: if the width of the SIMD vector is one, we just return the scalar directly; if we're on x86, LLVM has an instruction for horizontal addition or horizontal max, but that's not great on Intel, so we can do a tree reduction instead; and if it's floating point we use a different algorithm and call directly into an LLVM intrinsic. Compare this to how the vector dialect lowers: you're writing essentially the same stuff, minus the special case for x86, in C++, to lower it directly to the LLVM dialect. We can also do similar things with transforms. As Jeff mentioned, we disable the LLVM vectorizer and instead have folks opt in to vectorization, and we've implemented a vectorizer in about five lines of code: we parameterize the function on the SIMD width, call it at the specific SIMD width for the main chunk, and for the leftovers we call the same function with a width of one.

What does this mean for developers? When you're trying to do an optimization, add a new feature, or target new hardware, the first thing is not "I'm going to need to write a dialect" or "I'm going to reach into TableGen"; the first thing is to reach into Mojo and run experiments. You can invent new optimizations, weird ones, incorrect ones, or even point optimizations that only work in this one function, in this domain, in this context; that's all fine.

But I care about performance; I'm also a compiler engineer, but ultimately I care about performance, so let's look at the performance of Mojo. One thing people anchor on is the Mandelbrot set. We have a blog post that was recently published, and at the end of it you end up with about ten lines of code; run those ten lines and you're 68,000 times faster than Python. You can look at the post after this presentation and see the progression from 90x faster all the way to 68,000x, but at the end of the day, this is the code you end up with.
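For flavor, here is a kernel in the spirit of that blog post's final version rather than a copy of it: the escape-time loop for a single point, written as ordinary Mojo. `ComplexFloat64` and its `squared_norm` method are assumed from the 2023-era standard library; the real post layers vectorization and parallelization on top of a loop like this.

```mojo
# Sketch of a Mandelbrot escape-time kernel (stdlib ComplexFloat64 assumed).
from complex import ComplexFloat64

alias MAX_ITERS = 200

fn mandelbrot_kernel(c: ComplexFloat64) -> Int:
    var z = c
    for i in range(MAX_ITERS):
        z = z * z + c                 # z_{n+1} = z_n^2 + c
        if z.squared_norm() > 4:
            return i                  # escaped: return the iteration count
    return MAX_ITERS
```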
But nobody really cares about Mandelbrot; there are ways to cheat on Mandelbrot, and we're not cheating here, but still, nobody cares about Mandelbrot, so let's solve a hard problem: matrix multiplication. Matrix multiplication has been studied since before a lot of us were born, and there were still plenty more papers published about it this year. It's also difficult: the problem depends on cache sizes and the microarchitecture. And it's a core part of LAPACK and of the ML stack, which means hardware companies have to optimize matmul to get into the Top500 supercomputers or to be at the top of MLPerf. So a lot of effort goes into optimizing matmul, and these libraries have been developed for decades, since before some of us were born as well. But we also don't want to write the Python template that generates the C++ template that maybe goes back to Python again, so let's be principled. Let's set a few core requirements for our matmul: a single source of truth, not multiple files, one implementation; as fast as, or competitive with, state-of-the-art; even though we can read assembly and write C++, let's not do that, let's write everything in Mojo; make it fusible and able to do fancy stuff; support dynamic shapes; work on multiple architectures; and so on. That was our core hypothesis from the very beginning, and here's where we ended up. This is from a blog post from a few months ago, and we're actually faster than this now, but we can compare against the best in class on their own hardware: we're 1.4x faster than Intel's implementation on Skylake systems, and this is fully dynamic; we're not specializing on shapes, we're not doing pre-packing, and I almost wish we were doing tricks, because it's easy to get results like these with tricks, but we're not, and there's no inline assembly. Run the same code on AMD: 1.6x faster. Do the same thing on Arm: 1.2x faster. In fact, our implementation is about 2,000 lines of code. It's a toy implementation, but it puts everything together, and the interesting thing about this toy implementation is that the Llama Mojo project, a public GitHub repo, is using it, and with it they're beating the public llama.cpp implementation. So with that we've validated our hypothesis: you can build portable performance libraries with less suffering. And with that, I'll hand it back to Chris.

Awesome. So, to wrap things up: Mojo is still early in development, as we talked about, and there's still a lot more to be done. One of the things we're doing that I think is pretty cool is that we're developing all of this in public, so there's a roadmap, you can go see what we're doing, and new releases come out very frequently. One of the questions we get asked all the time is: does Modular open-source anything? The answer comes in two parts. One is yes, we upstream stuff all the time, including tons of core improvements to MLIR; apparently the interpreter Jeff was talking about on Tuesday is very popular, so we can keep working on that, and we're good open-source citizens in that respect. Mojo itself will take a little bit longer: we want to start the open-source process later this year, and we'll start working on that, but I expect it to take some time, because we want to make sure we get the core design really right, and not everything is best designed by committee, but we really want to see this thing scale and have a big impact on the world.

Coming back all the way to the beginning, we talked about AI and the AI Engine and that kind of stuff. We don't have time to go into it today, but the cool thing about what Mojo means for the AI Engine is that you can actually tackle these heterogeneous compute problems, because you can finally scale
across lots of different hardware, and this is really cool. We don't have time to talk about it today, but if you're interested, we have a keynote at the NeurIPS conference later this year where we'll talk about this in a lot more detail. So I think that's the end of our talk, and we're very happy to take any questions. If you'd like to check out Mojo, you can go to the web page, read about it, download it, and use it today. Thank you. [Applause]

Thank you, Chris, Abdul, and Jeff. Are there any questions? There are mics in the aisle.

Thanks for the great talk. My question is: I haven't seen anything about GPU offloading in your slides. Is that planned, or what do you intend to do with it?

There is actually one bullet point on that. Mojo does support GPU offloading and split compilation, like CUDA, but it's something we did not talk about in this presentation and would sure love to talk about in the future.

Thank you. Hi, you mentioned that you don't need to use ccache; can you elaborate a little on how you deal with caching?

It turns out MLIR has a nice serializable format called bytecode, and bytecode provides predictable hashing, so we can use MLIR bytecode as the form we hash and cache compiler transformations on, across the stack. We also didn't have time to talk about it, but there's a whole distributed cache backing this thing, and a whole bunch of fancy stuff went into it.

Hi, how are you doing the autotuning: is it offline or dynamically online, and how do you define the objective function for the search?

You have a choice; you can do it offline or online. If you compile to a file, you've done it offline. The objective function right now is something the user provides, because it's data-size dependent, hardware dependent, and so on, so it's up to you to define it. We do provide a benchmark module that makes benchmarking a lot simpler, and that lets you do that.

If you're doing it online, how do you control for variation in the data, or do you rely on something else?

The benchmark library we provide handles that: it runs a good number of iterations and so on until you get stability. And autotuning is used today, but beyond the core capabilities there's also future stuff: one of the things it's designed for, but we haven't actually done, is sending the IR to an FPGA, or a simulator, doing the evaluation remotely, and pulling the results back.

Great talk. There was a point in the slides about providing optimizations in the library as opposed to the compiler. Maybe I misunderstood this, but from my understanding it's possible to run into performance pitfalls: C++ has built-in likely and unlikely, and it's really easy to misuse those and end up in a situation where your code is slower than without those annotations. So my question is: what happens if a user-provided annotation conflicts with something the compiler would also have done?

Well, from a compiler design perspective, one of the things Jeff was talking about is that we've removed, not all, but a lot of the super-
unpredictable things in the LLVM optimizer. Our goal is to give full control and predictability to the programmer, which is very different from the "make SPEC go fast" approach to compiler design, and that gives you the ability to go design library features that do things like, well, Abdul can talk about some of the crazy stuff he's done.

What's also important is that we have these abilities to say please vectorize this loop, please unroll this loop, and so on, but not everyone writing, say, application code is going to think about vectorizing every single loop and autotuning every other loop. So what matters is that we provide control to the users who care, but also provide a default experience that is good, where the compiler does its best; the important thing is that what the user says always takes precedence, and that's how you get control.

Sometimes a compiler does things and you end up with code marked "compile this section with -Oz" and that sort of thing, where you kind of want to opt out of compiler optimizations because they're interfering with how you laid out your code. Are there any plans... I have a follow-up question.

Sure, come find us after. Last question, please.

Hi, so you mentioned that you only use two dialects in Mojo, the LLVM and index dialects.

Two upstream dialects, yes.

Okay, so you don't use other things like affine and so on, which means that if you want hardware-specialized libraries, the programmer has to do different tiling for Ampere versus Hopper versus Volta and so on. Isn't that just pushing the burden out of the compiler and the high-level stack onto the programmer? You end up with very hardware-specialized performance libraries, and the people who write them have to understand the architecture really, really well.

I think the thing is that they're more likely to understand the architecture really well than the compiler engineer is. The compiler engineer has to know two things, writing C++ on CPUs to target GPUs, versus: I'm a CUDA programmer, I'm laser-focused.

I see the trade-off, yeah. So that means the people writing high-performance libraries for very specialized accelerators need to be experts in those accelerators, right?

Right, but they need to be experts in one area, not two. The goal is to give a kernel programmer superpowers. That's our approach to it, and as Jeff talked about, Mojo can talk to any dialect: if you want to, you can use affine in Mojo, and we can plug in and extend the system with dialects as well, so that's always an option.

Okay, so the conscious decision you're making is that you're going to get experts to write the performance library, and it will just work?

That's really the conscious design decision. Kernel libraries don't scale because of the magnitude of the problem, the cross-product of all the different integrations and all the stuff that kernel libraries struggle with, but there are far more kernel programmers and performance engineers than there are compiler engineers, by far. So it's really about enabling the talent that actually knows how to do this kind of thing, rather than having a compiler engineer in the loop who becomes a bottleneck.

Okay, thanks.

And we'll be around throughout the conference, so feel free to grab any of us.
Thank you, Chris, Abdul, and Jeff; let's thank the speakers again.
Info
Channel: LLVM
Views: 49,153
Id: SEwTjZvy8vw
Length: 49min 48sec (2988 seconds)
Published: Wed Nov 22 2023