Intro to the Architecture of LLVM

Reddit Comments

"Nice, thanks for sharing" (u/timvinc, Aug 11 2020)

"can you share slides, please :)" (u/rufftuffnutt, Nov 02 2020)
Captions
I'm giving a higher-level but somewhat specific overview of the architecture of LLVM. Previous talks have discussed research papers; this is more of a high-level design: if you were to build a compilation pipeline, what does one of the production systems we use quite extensively throughout Google look like? Take everything I show you today with a grain of salt. I'm not an expert on this; I've only been working on it for almost two years, and this hasn't been vetted by people way more senior than me, so it could be wrong. But generally, this is my understanding of the world.

There's an excellent book series, The Architecture of Open Source Applications; there are three of these books, and this diagram is shamelessly stolen from the LLVM chapter. Each chapter is written by a different author about a different open source project; there's a chapter on Ninja, for example, all about the architecture of Ninja. Some of the chapters are free online, so I recommend reading it. It's pretty old at this point and not super accurate, but the general idea holds: when building a compiler, it would be nice if we could centralize all the classical compiler optimizations that aren't really language dependent, and aren't really dependent on the target-specific back-end part, and have that be a core set of reusable libraries. That's the core idea behind LLVM's optimizer, and we're seeing similar things with MLIR, though I'm not really going to get into that today.

The idea is basically that if you have some kind of intermediate representation, which I'll call IR from now on, that's more general, then multiple different language front ends can emit it, and you can reuse these optimizations on this meta, high-level language or IR. From there you can feed it to a different back end and generate code. This makes it really easy to create a new language without worrying about classical compiler optimizations or machine-specific code generation: you just have your front end emit LLVM IR. A lot of newer languages like Zig, Rust, and Swift all emit LLVM IR, and LLVM takes over from there. Clang is a front end that handles C, C++, Objective-C, Objective-C++, OpenMP, and I think CUDA, translating them all to LLVM IR. There's actually more to these equations, so let me go more in depth on the architecture; let me flip tabs to a better view.

This is the front end, Clang; let's pinch and zoom in on the architecture and see what's going on. To me, the whole thing is one big pipeline, though it gets a little more complicated than that, as we'll see in a bit. As input to this whole process we have whatever source language the human wrote, which the machine doesn't understand, unless someone has an interpreter for it.
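To make the multiple-front-ends idea concrete, here are a couple of standard invocations for emitting textual LLVM IR; these exact commands aren't shown in the talk, but the flags are the documented ones:

    # Clang: C to human-readable LLVM IR (.ll)
    clang -S -emit-llvm foo.c -o foo.ll

    # Rust: the same idea from a different front end
    rustc --emit=llvm-ir foo.rs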
At the end of the day we want to generate the machine-specific code, but a front end is more concerned with outputting LLVM IR. You'll have to forgive me: I have a lot of source files listed here, these .cpp paths. That's more for beginners interested in understanding where in the source tree each step is done; a common theme when getting started in a new code base is "I know the general overall idea, but where is it implemented?" So I have the paths into the source tree: if you clone llvm-project off GitHub, you can go take a look and read through these implementations, and I find the interfaces for these objects really helpful in understanding the core responsibility of each of these classes.

The first thing Clang does is preprocess, the job of the C preprocessor: expanding macros, including other source files, and so on. The lexer breaks all the individual characters up into tokens. From there we do parsing: we want to convert those tokens into an abstract syntax tree, and the act of parsing and the act of constructing an AST are two separate ideas. Some type inference actually occurs at this level; most uses of C++'s auto keyword are actually instantiated at parse time. From there, within Clang, semantic analysis, template instantiation, and more type inference are done. Semantic analysis is really the brunt of trying to find mistakes the programmer has made: either things that may lead to bugs that we might want to warn about, or things that are ambiguous or not valid code, where we won't be able to generate valid code, and those are compiler errors. Clang emits what we call diagnostics, which can be either warnings or errors, and you can flip flags around to turn one into the other, though some things are always hard errors. Template instantiation ends up cloning parts of the AST, specialized for specific types, and there's also type inference to be done for templates: they have different type deduction rules than the auto keyword does in C++, and as for decltype, I think it's handled in this later phase, though I could be wrong.

The final part is emitting the LLVM IR. Clang's idea of "CodeGen" is not targeting machine code; it targets LLVM IR, and it basically just does an AST walk. Clang has the ability to construct a control flow graph, and it does so for some semantic analysis, but it's considered heavyweight and it's not used today as part of CodeGen. Sometimes we construct delayed diagnostics: if we do any optimizations in Clang and we delete the AST node, the corresponding diagnostic doesn't get emitted. One of the things that's different between GCC and Clang is that Clang does all the semantic analysis up front, while GCC will do early inlining and dead code elimination, so you can have errors in dead code in GCC and it will not warn you; Clang will. It's something I run into.
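A small illustration of the point about deduction rules; this example isn't from the talk, just a sketch of how auto and decltype differ, all of which Clang resolves during parsing and semantic analysis:

    #include <type_traits>

    int main() {
      auto x = 42;        // deduced as int as soon as Sema sees the initializer
      const int &r = x;
      auto y = r;         // auto drops the reference and const: y is int
      decltype(r) z = x;  // decltype preserves them: z is const int&
      static_assert(std::is_same<decltype(y), int>::value, "auto strips cv/ref");
      static_assert(std::is_same<decltype(z), const int &>::value, "decltype keeps them");
      return 0;
    }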
So from here, let's take a look at what the IR looks like in LLVM, from a class perspective. The idea at the IR level is that there are four real classes that encapsulate one another like matryoshka stacking dolls. There's the class llvm::Module: you can think of it like a translation unit, but it's a little more generic; when you do LTO, everything is in one Module. It's the world as the compiler understands it. A Module contains one-to-many Functions, just as you would have in C; C++ classes actually get boiled down here too, since they're essentially syntactic sugar and scoping over a bunch of functions. Functions contain BasicBlocks: straight-line sequences of instructions that get terminated on control flow, like branches, if statements, and for loops; each edge in the control-flow graph links these basic blocks together. So a Function can have one-to-many BasicBlocks, and finally a BasicBlock can have one-to-many Instructions. I'll show you in a little bit what the textual LLVM IR looks like, but this is the very high-level shape, and even the later IRs in LLVM keep a very similar structure.

Now let's zoom in on the middle end. Within the source tree there's a top-level directory called llvm; it actually contains both the middle end, the IR optimizations that are generic and not specific to a machine, and the back end itself, which is machine specific. They're in the same directory, but they're two cohesive units. opt is a collection of passes that take in LLVM IR and emit LLVM IR, and very classical compiler optimizations live here. A lot of the passes aren't really dependent on one another and can be run in any order, and in most cases the IR can be fed from one pass into another. The passes here are broken up into analyses and transforms, and sometimes utilities. A lot of transforms depend on certain analyses, but the transforms themselves are destructive: if you compute some expensive analysis, like the cost of inlining one function into another, and then you actually go ahead and inline something, the analysis you previously did may now be invalidated. So there's this thing in LLVM called the new pass manager: passes can be chained into pipelines, you can repeat pipelines, have multiple pipelines, or just sequences of passes, and there's a way for a pass to say "my transform is destructive, it invalidates these analyses" or "it requires these analyses to have been run." For instance, the inliner is broken up into three really important pieces (there are actually a couple more): the inline cost analysis, which answers "could I, and should I, inline this?"; the inliner transform, which actually makes the decision based on those distilled signals; and InlineFunction in Utils, which has a bunch of helper functions that do the actual code transformation and cleanup. The utility doesn't know better: if you tell it to inline, it's going to inline, even if it's not profitable. The transform is what drives the decisions, based on what the analyses tell it.
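The new-pass-manager contract looks roughly like this; the pass itself is hypothetical, invented for illustration, but the run() signature and the PreservedAnalyses return value are the real interface, and the nested loops show the Module/Function/BasicBlock/Instruction nesting described above:

    #include "llvm/IR/Function.h"
    #include "llvm/IR/Instructions.h"
    #include "llvm/IR/PassManager.h"
    #include "llvm/Support/raw_ostream.h"

    using namespace llvm;

    // Hypothetical example pass: count the stores in a function.
    struct CountStoresPass : PassInfoMixin<CountStoresPass> {
      PreservedAnalyses run(Function &F, FunctionAnalysisManager &FAM) {
        unsigned NumStores = 0;
        for (BasicBlock &BB : F)      // a Function contains BasicBlocks...
          for (Instruction &I : BB)   // ...which contain Instructions
            if (isa<StoreInst>(I))
              ++NumStores;
        errs() << F.getName() << ": " << NumStores << " stores\n";
        // We mutated nothing, so every cached analysis remains valid;
        // a destructive transform would return a narrower set here.
        return PreservedAnalyses::all();
      }
    };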
There are a whole bunch of passes in here; I won't go through them all, but we'll see a list of them in a little bit.

Once we leave the middle end and enter the back end, we're focused on code generation. There's a collection of passes that are machine generic, which live in one part of the tree, and machine specific ones, and they work together. It's very common in LLVM (it's a C++ code base) for the generic class to be a base class with machine-specific derived classes. The derived class will typically override the base class and either do something very specific to that target architecture, or call into the base class if it doesn't need to do anything special; doing both is very common as well. The back end takes LLVM IR, and as we'll see, we go through two other actual IRs within LLVM, and things get machine- and pass-specific. It's not as nice as opt, where you can mix and match passes however you please; the back-end passes are ordering dependent, which makes them a little harder to work with.

There are multiple phases, and the only two I'm going to cover are instruction selection and register allocation, the two largest and, in my opinion, most important ones. They're not actually required at all when writing a back end for LLVM. Technically the only classes required for a back end are the TargetMachine, which describes roughly what you're targeting, such as maybe what the CPU is, and the DataLayout, which is really your data model: what's the size of a pointer, are you ILP32 or LP64, do you support long double and what on earth does that look like, things like that. LLVM used to have a C back end that could generate C code, and it only required those two classes; the C back end has been removed, though, so I don't know if that's still true.

Instruction selection is the big one, and LLVM has three instruction selectors in the tree. SelectionDAG is the workhorse: it actually converts the IR out of SSA form into a directed acyclic graph and tries to do pattern matching on it. It's very old, but it's the most well tested; if you want full optimization, SelectionDAG is the one to go to, though I suspect a lot of the slowdown in LLVM comes from it. Since then, people have written two other instruction selectors, and moving to them is a lot of work. FastISel is for the case where you care more about compilation speed: you don't really care about performing every aggressive optimization during instruction selection, and it can generate code very fast. GlobalISel is the newer instruction selector; I think our AArch64 back end uses it in place of FastISel. It tries to keep the same SSA form and not waste time converting from SSA to a DAG and back to SSA, but it's still a work in progress, moving the various targets over to it, and has been for a very long time. So there are three different instruction selectors in LLVM, where SelectionDAG is the base one: even if FastISel or GlobalISel don't know what to do in a particular case, they can bail out and fall back to SelectionDAG.
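For experimenting, llc exposes the selector choice directly. These flag spellings are llc's, not from the talk, and which selector actually runs by default depends on the target and optimization level:

    llc -fast-isel foo.ll -o foo.s      # prefer FastISel where possible
    llc -global-isel foo.ll -o foo.s    # opt in to GlobalISel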
Once we're done with instruction selection, its output is a whole new IR; it's a different class in LLVM. Instead of llvm::Instruction it's llvm::MachineInstr, and people sometimes refer to this as MIR, the Machine IR. MIR begins to look a little like assembly: instead of operations from LLVM IR land, they're more machine specific at this point, though some are still very generic, high-level, or pseudo operations that will eventually be expanded into multiple instructions. The big thing with the MachineInstr class is that because we haven't done register allocation yet, its operands are a mix of virtual and physical registers. Before register allocation, we pretend we're targeting an abstract machine with infinite virtual registers; eventually, when we get to register allocation, we say "a concrete machine doesn't have infinite registers," so we need to either allocate them to physical registers or emit spills to the stack and reloads. After instruction selection there will actually sometimes already be concrete physical registers. The cases where there are: when the calling convention of the ABI you're targeting requires it (your rdi, rsi, rdx, your C calling convention, for instance), when you have register variables allocated to specific physical registers, or when you have extended inline assembly with output constraints that say a value must go in a specific physical register. At that point, post-ISel and pre-register-allocation, we have this mix of virtual and physical registers. And there are many passes before ISel and many passes after it; for instance, we do instruction scheduling before register allocation, and we actually do another form of instruction scheduling after register allocation as well.

When it comes to register allocation, LLVM has four different register allocators, and which one runs can depend on the target architecture and the optimization level. For instance, when I'm targeting x86 and compiling with optimizations, maybe -O2, the greedy register allocator is what's run. There's a fast register allocator if you care more about compile times, and there's a basic one, which I don't know is really used; greedy is actually derived from the basic register allocator. The RegAllocBase header there is interesting: it defines the interface against which all the register allocators define themselves. Essentially, after register allocation we should not have any more virtual registers, only concrete physical registers or spills to and reloads from the stack. We're still at the MIR level after regalloc, and a bunch more passes run. Finally we get to a class that, it's been claimed, is misnamed: AsmPrinter. At this point we convert to the final IR in LLVM, a class called MCInst; some people call this the MC layer. Here you have a one-to-one mapping between an object in memory (a C++ object) and an actual machine instruction or assembler mnemonic that will be emitted.
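The allocator is similarly selectable from llc. The talk names fast, basic, and greedy; the fourth in-tree allocator is PBQP, which isn't named in the captions:

    llc -regalloc=fast   foo.ll -o -    # compile-time oriented
    llc -regalloc=basic  foo.ll -o -    # the reference implementation
    llc -regalloc=greedy foo.ll -o -    # the default at higher -O levels
    llc -regalloc=pbqp   foo.ll -o -    # solver-based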
So AsmPrinter's job is to take this abstract, high-level, assembly-like language and really straighten it out, getting it ready for code emission. From here, a major component of the MC layer is the MCStreamer class, a base class with two really important kinds of subclasses. Most of the time when we're compiling, we want to generate machine-specific code: object files wrapping our encoded instructions. The object streamer is almost like an assembler: for an instruction, generate the bit pattern and encoding that the machine wants. There are child classes of it, MCELFStreamer, MCMachOStreamer, MCWinCOFFStreamer, MCWasmStreamer, and these output object files wrapped in whatever object file format your target machine or ABI calls for. Conversely, if you want to generate assembly instead, MCAsmStreamer is used, and it basically just calls printf on a buffered stream; that's literally its design. Inline assembly is generally treated as a black box all the way through LLVM, down to MCAsmStreamer; if you have format specifiers, it's literally just printf tricks.

Some interesting classes to take a look at in LLVM are TargetRegisterInfo and TargetInstrInfo. They describe, for a specific machine, what the registers look like and what instructions exist, and they're actually written in a high-level meta-language in LLVM called TableGen. It's like C++ templates on crack: a descriptive rather than imperative way of describing things you may have a lot of and want to generate definitions for. For instance, I don't want to hand-write encodings for every possible instruction in x86; instead, what can I parameterize, and what can I generate definitions for? TableGen takes in these .td files and generates .inc files, which are valid C++ that gets #included by the C preprocessor, typically into headers, during the build of LLVM itself. Those headers then get included into various C++ objects, which can extend them further or do things with them. So TargetRegisterInfo describes what the registers of my machine are (they can differ between extensions to ISAs, which typically add registers, for instance), and TargetInstrInfo defines all the available instructions in my ISA or extensions of the ISA.

The idea behind LLVM more generally: in the GNU world, and in C in general, it's very common to have lots of little shared-object libraries that you link together. LLVM says, why don't we make C++ classes for all of these, and then reuse them to build a compiler, but also a disassembler, objcopy, and everything in binutils? It makes reuse very, very easy. And that's the high-level architecture of LLVM.
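As a flavor of TableGen, here's a sketch of what a register description might look like for a made-up target. "MyTarget", R0/R1, and GPR are invented for illustration; the Register and RegisterClass classes come from llvm/Target/Target.td:

    // Hypothetical MyTargetRegisterInfo.td
    include "llvm/Target/Target.td"

    def R0 : Register<"r0"> { let Namespace = "MyTarget"; }
    def R1 : Register<"r1"> { let Namespace = "MyTarget"; }

    // A class of 32-bit general-purpose registers built from the defs above.
    def GPR : RegisterClass<"MyTarget", [i32], 32, (add R0, R1)>;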
Now I want to give you a quick overview of the introspection facilities, because one of the things I appreciate about LLVM is how easy I find it to debug: when I suspect there's a compiler bug, where on earth is it going wrong? This is just a pipeline, and it's very easy to introspect. This is the second half of the presentation, and I'm good on time, so let me share a different tab.

Let's start with a simple C program, with some syntax highlighting. Does anyone want to unmute themselves and talk through this C code? As written by the human it's a little inefficient; does anyone spot an optimization or two, or more, that would be nice to apply? "We could iterate backwards, save ourselves a register." I like that; what else can we do? "Just fill the entire array with x plus y, whatever that is." How are we going to do that; what optimizations will help? "Move the x plus y up to the top and then just replace it." Yeah. Anything else we can think of? "Ain't unrolling cool?" Perfect; those were the two I was thinking of. Let's go through and see how LLVM optimizes this function. "I'm guessing LLVM will probably just do the x plus y and then call memset; that's my guess." Let's take a look.

The first thing I'm going to do is emit LLVM IR, and this is kind of a big command line, so let's walk through it. I'm telling Clang that I'd like it to produce LLVM IR. By default, LLVM's IR can be serialized to disk in either a human-readable format or a space-efficient binary encoding; -S (capital S), mixed with emitting LLVM IR, says emit the human-readable format rather than the machine-efficient encoding. From there, and this is kind of messy, there's the incantation I need that says: emit IR that has not been optimized by the middle end. I have to mix these flags together because if I don't specify an optimization level, or if I say -O0, Clang is going to mark all the functions in the IR as "please don't ever optimize these," which makes it a little hard to then manually run the passes myself. I also say I don't really care about debug info for the purposes of this demo, which cleans up the IR a little; not super important. And there's an LLVM-specific flag that says keep the identifiers I used in my source code, because it makes it a little easier to mentally match up which variables are which; otherwise, local variables whose names aren't visible symbols just get lowered to numbers, which isn't super helpful.

So let's run this. It produced a file called foo.ll; let's take a look at what it contains. This is unoptimized LLVM IR, and right away you see the data layout, which I talked about earlier, and the target triple, the machine I'm targeting.
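For reference: the captions never show the source file or the exact flags, but based on the discussion (a hundred-element store loop, a hoistable x + y, full unrolling later), foo.c was presumably something like:

    void foo(int *a, int x, int y) {
      for (int i = 0; i < 100; i++)
        a[i] = x + y;
    }

and the "big command line" was presumably along these lines; each flag here is a real Clang/LLVM flag, but the exact combination is reconstructed:

    clang -O2 -S -emit-llvm -Xclang -disable-llvm-passes \
          -g0 -fno-discard-value-names foo.c -o foo.ll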
The target triple actually has four items in it, but it's still called a triple. It defines an ABI, or at least part of one; the data layout is actually way more specific about some things. You can see f80 telling you something about long doubles, and it talks about endianness; I can't decode the whole string by eye. Then you can see we have a function here called foo. It's allocating a bunch of things on the stack, storing some stuff; there's a bitcast, lifetime information about some variables, a store, then a branch to the beginning of the for loop. In the loop we do some loads and a comparison, then branch either into the body of the for loop or to a cleanup block. The cleanup block, I don't know what it's doing here: it created a bitcast and then marks the end of a variable's lifetime, which seems like a waste; but this isn't meant to be efficient IR, this is an AST walk. It then jumps to the end of the for loop. In the body of the for loop, we load the two things we had spilled to the stack previously, and here's the add we're doing: load x, load y, add them together. Then we do some calculation of the address where we want to store, we store the result of the add into that address, and we branch to the increment of the for loop. This is the canonicalized for-loop representation in LLVM: we load our iteration variable, add one to it, store it back, and branch back to the condition, where we do another comparison and a jump. At some point the condition decides to branch to the cleanup, which eventually jumps to the end. So it's not super efficient, and there are all kinds of attributes and such: type-based alias information, information about the machine, a lot of function-level attributes get added here.

Time check: we're doing great, way ahead of schedule; I'm happy with that.

So we've generated this by cutting the pipeline off at the knees, before the middle end has run and optimized it. Now, Alex, you mentioned loop-invariant code motion; let's say I want to run just the loop-invariant code motion (LICM) pass on this file. We can do that with opt, writing the result out to a new file. If we take a look at that file, it's still very inefficient: the entry is doing all kinds of loads and stores and conditions and cleanup. But if you remember what the for-loop body looked like before, you'll notice it's now significantly shorter. Looking back at the original foo.ll, in the loop body there were multiple instructions: load x, load y, add the results, and the rest is calculating the index into the array we're going to write; and back on line 38 there was the store into the array.
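The LICM run itself isn't legible in the captions, but with opt and the new pass manager syntax it would be something like:

    opt -S -passes=licm foo.ll -o foo.licm.ll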
If we look at the for body after loop-invariant code motion, we can see there is no addition being performed in the body of the loop anymore. Instead, if we go back up to the entry of the function and look for the add: boom, here's the add being done up front, and part of the address calculation got hoisted up as well. Now all the loop body does is figure out where it should be writing into the array and store the result of the addition to that address. So this is much, much more efficient, but only loop-invariant code motion has been done; there's more we can do here.

Now let's run basically every pass on this file and take a look at what it looks like; we'll write this out to another file. (Oh, I didn't give it an input, of course; and we want -S so it's emitted as human-readable rather than the binary format.) So, forgetting the loop-invariant code motion run for a second, and taking the output from Clang: what does the ultimately optimized IR look like? Can anyone see what's going on here? It's a little bit tricky, and I'll walk you through it. "It unrolled the loop fully; it unrolled it 25 times, it looks like." "It also vectorized it." Yes: you can see, once, up front, it performs the add, and then it broadcasts that into a SIMD vector, and then it's basically doing a bunch of SIMD stores. The guess about a library call like memset or memcpy was a good one; it depends on the size of the machine we're targeting, and in scheduling we may decide to lower this, maybe in the back end, to a library call versus SIMD stores. There are trade-offs: are the operands aligned or not, and a lot of other things go into this. But essentially LLVM unrolled the loop 25 times and is using a four-wide vector of 32-bit integers to do these stores. If we actually passed along the size of the array and checked that it was safe to write all of these, that might be better; but because we didn't, and it's undefined behavior to write past the end of the array, LLVM assumes it's fine and just fully unrolls the loop and goes for it.

That's pretty good, but let's manually invoke the back end now and take a look at what assembly actually gets produced; the middle-end optimizer doesn't really know anything about the ultimate target. "So I guess what I'm wondering is why it chose to use four-wide vectors, when I think Intel can do at least eight, and maybe even sixteen at this point." I don't know, and there may be a pass in the back end that widens it further. There's a generic loop vectorizer, and I assume that with target information the loop vectorizer may use target-specific details to decide how best to represent a vectorized loop, but that's something I'm not super familiar with.
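The vectorized IR isn't reproduced in the captions, but the shape being described, one add, a broadcast (splat), then wide stores, looks roughly like this sketch; value names are invented, and the full-pipeline run would be something like opt -S -O2 foo.ll -o foo.opt.ll:

    %add = add i32 %x, %y
    ; splat the scalar sum into all four lanes
    %v0  = insertelement <4 x i32> undef, i32 %add, i32 0
    %v   = shufflevector <4 x i32> %v0, <4 x i32> undef, <4 x i32> zeroinitializer
    ; one of 25 four-wide stores covering a[0..99]
    store <4 x i32> %v, <4 x i32>* %p, align 4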
One thing I wanted to show is what all the passes are that get run as part of the middle end. The little incantation here: there's -print-after-all, and there's -print-after= with the name of a specific pass, like licm. It's going to dump the full LLVM IR after every pass, but I'm just grepping for the banner that tells me which pass is being run. There are a lot of acronyms here; can anyone call out some they recognize? "CSE is common subexpression elimination, I think." Yep. Anyone know what SROA is? "Scalar replacement of aggregates." Good. Let's see: here we try to infer function attributes; there are things you can't express in C, but if we have visibility into the definition of a function, we can actually deduce more information about it to generate better code. Constant propagation. mem2reg: maybe we don't need to be constantly storing and reloading things; maybe we can keep values live across basic blocks. Removing redundant instructions. Jump threading. Loop rotation will try to canonicalize loops: instead of every pass having to look for every possible way a program can loop, a canonical loop form makes it a lot simpler, so that all passes only need to operate on a specific formation of loops. LLVM will try to transform for loops and while loops; like Chris talked about, maybe iterating backwards is nice when possible, and that's something loop rotation might do. Loop unswitching might try to reorder, if you have nested loops, what order the nesting occurs in.

"I was wondering what the canonical loop form for LLVM is." Good question; I think it's documented, and I recommend checking for it. The for loop I showed you earlier is a hint at it, but there may be a more specific definition you're looking for. "I just remember from the compiler course I took in undergrad that everything turns into do-while loops, and I was wondering if that's true in LLVM." I don't know; at the end of the day everything gets lowered to jumps, which are basically gotos, and whether that's in a structured form equivalent to a do-while, and at what point, is hard to say. I don't know enough about the loop-specific optimizations to have an informed answer for that, sorry.

We can see here: delete loops we can determine are statically dead; unroll loops; global value numbering, which is a way of simplifying common subexpressions, I believe. LCSSA, that's just a verifier, for loop-closed SSA form. Rather than every pass doing all its own cleanup, you defer to DCE (dead code elimination) and just run it afterwards, so DCE gets run a lot. It's a trade-off: if you run this pass and then that pass, or flip the ordering, they're not commutative, so you don't necessarily get the same results. Theoretically you could run DCE after every single pass and maybe it cleans things up, but maybe that's not the most efficient use of compile time. Lower float to int: maybe there are cases where we can determine you don't actually need the increased precision. There are parts of LLVM now looking at optimizations that figure out which individual bits are actually required for certain calculations; John Regehr has blog posts and talks about this you can take a look at.
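The grep incantation was presumably something like the following; -print-after-all dumps IR to stderr with an "IR Dump After ..." banner per pass:

    opt -O2 foo.ll -o /dev/null -print-after-all 2>&1 | grep 'IR Dump After'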
Loop rotation, I think we talked about, and a bunch of these are probably getting run again. There are verification passes as well: the IR has various invariants, and we want to make sure that no pass violates them, or to catch violated invariants as soon as possible, because those typically lead to codegen bugs.

"And this presumably is parameterized to some extent by the source-language rules as well, right?" Yes. The tools I'm running for the middle end and the back end are opt and llc, generic command-line utilities from LLVM used heavily throughout the test suite, for manually invoking specific passes. Actual language front ends will specify a pipeline: which machine-independent pipelines they want, and which back-end-specific pipelines. "But something like expression reassociation, if I remember correctly, the rules vary from language to language, so some of these are presumably parameterized, right?" Right: the front end can choose not to invoke a pass, or the pass itself may look at some information and say "I'm not allowed to run here," and just return early. So there are passes in tree that never get invoked by Clang, but maybe they do by Julia, or by Rust, or by Swift. The idea is that if you have all of these in one place, you have compiler Legos that you can play with to build your compiler.

"Another question: say loop-invariant code motion adjusts one function but not another, and that invalidates some existing analysis. Will LLVM rerun the analysis only on the function that was changed by the previous pass, or on all of the existing code in the module?" Nothing gets rerun eagerly. It's basically a way of memoizing the analyses: the analyses are heavyweight and we don't want to recompute them if we don't need to, so you create a pipeline and the pass manager invalidates the cached results. If the cache is invalidated and another pass then asks for that data, the analysis is rerun on demand; that's my understanding, yes.

Cool, let's take a look at the back end. I'm not going to generate machine code or any objects; instead I'll generate some assembly code and we'll take a look. ("You're at about 45 minutes, Nick, just a time check." Cool, thanks.) So, looking at the assembly in foo.s: here we can see the assembly of our function foo. We have a symbol with global visibility called foo; it does an add based on x and y from our calling convention; it moves the result into an xmm register, one of the SIMD registers; the shuffle here, after we've moved the sum into the first lane, broadcasts it to all lanes; and then we basically write this to memory 25 times and return. So that was our function.
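The llc invocation and the resulting assembly aren't legible in the captions; the command would be roughly `llc foo.opt.ll -o foo.s`, and the shape being described (SysV calling convention, broadcast, 25 stores) would look something like this sketch, not the exact compiler output:

    foo:                          # void foo(int *a, int x, int y)
        addl    %edx, %esi        # x + y (a in %rdi, x in %esi, y in %edx)
        movd    %esi, %xmm0       # move the sum into the low 32-bit lane
        pshufd  $0, %xmm0, %xmm0  # broadcast it to all four lanes
        movups  %xmm0, (%rdi)     # first of 25 16-byte stores
        movups  %xmm0, 16(%rdi)
        # ...23 more stores at increasing offsets...
        retq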
We determined, for some reason, that four-wide SIMD was the best way to do this. Then, same idea as before, we can print after all passes and pipe it through grep, and here's the back-end stuff; speak up if you recognize any of these. From the beginning, a bunch of these passes are still operating on LLVM IR, and then there's "post finalize isel": at the finalize-ISel stage, instruction selection, itself a very big and complex beast, is done, and at this point we're in MIR land. Tail duplication, if-conversion, combining instructions together: classical, well, I guess at this point more machine-specific, compiler optimizations. Eliminate obviously dead instructions. This is where you start to see more back-end-specific things: LEA is this god instruction on x86, where you can do multiplication with it and a bunch of other things, so there are passes converting things to LEAs, and things that deal with instructions that implicitly depend on EFLAGS.

From here there's a grouping of passes that is actually the register allocator. These are kind of interesting. The two-address instruction pass: in MIR we'd rather work with three-address instructions than two-address ones, so there's some tricky rewriting done there. Pre-RA instruction scheduling: these work together with register allocation, and stack slot coloring does too. Some group of these figures out, for the greedy register allocator, the liveness of variables, their live ranges, colors the stack slots, and allocates values to physical registers. Tail duplication actually copies code into both sides of an if-then when that's deemed to be a speedup; there's a branch folding pass later that depends on it. Eventually there are things like fentry calls and XRay instrumentation; if we're not using them, the pass runs, sees it has nothing to do, and returns early. And, where was it, at some point we inserted the function prologue and epilogue; grepping for it, the pass is called prologue/epilogue insertion. That's where, for example, you realign your stack to 16 bytes on x86 so you can use aligned operands for SIMD.

So that's an example, and this is kind of where I wanted to stop for questions. That's everything I have. Does anyone have any questions about LLVM, the architecture, or various aspects of it? And if there are other people on the call who know these things about LLVM, feel free to chime in as well.

"Just a comment: it seemed to me that the reason you were getting four-wide vector instructions probably had to do with the default architecture target, right? Maximize compatibility with old implementations." Sure; let me play around with -march=native, because the base ISA, with no extensions, has SSE2, which is four-wide SIMD. So we go back to Clang, add -march=native, rerun opt, and boom: we get eight-wide SIMD. You can see here that we're using eight 32-bit lanes.
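The rerun was presumably the same incantation with the host ISA enabled, for example:

    clang -O2 -march=native -S -emit-llvm -Xclang -disable-llvm-passes foo.c -o foo.ll
    opt -S -O2 foo.ll -o foo.opt.ll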
And that's because LLVM's vector types are parameterized: they're not a few fixed widths, they're arbitrary, though each type has to be a fixed width; MLIR goes further and makes it so arbitrary dimensions can be unknown. "Right, I meant that they're polymorphic or parametric types; they're not just vector-of-twelve or vector-of-four, they're a vector of some width." Yes: LLVM IR has support for arbitrary vector sizes, and then whether or not that's legal for your target is a separate question. So here you can see we're using a ymm register, which is, I think, is it SSE4? I don't know if it's actually an AVX register; let me look up what ymm is. OK, AVX: ymm registers are part of the AVX extensions, and they hold eight 32-bit single-precision floating-point numbers or four 64-bit double-precision numbers. The host I'm targeting should also have support for AVX2 and AVX-512; maybe there was a trade-off there where LLVM decided not to use those, I don't know.

"Do you think it looks for a common divisor between the number of loop unrolls and the vector size, so you don't end up having to do a tail?" Go ahead, Alex. "If it did that, it would need to use four; eight doesn't divide a hundred." You're right. We can look at the output: it might do something like software pipelining, where it sets up the first few or last few iterations of the loop to handle that, and it could use a smaller register for the tail: use eights, and then a four to get the last four. And you can see here, that's exactly what it does for the tail: it just lowers to four scalar moves, because eight only goes into 100 twelve times, which is 96, and then these four moves handle the rest. "It could also have picked eight for alignment reasons: anything larger might have required stricter alignment, and your incoming array was just an integer array."

Typically, LLVM will also have debug flags. For instance, I ran LICM earlier; there are additional flags I can pass that print out the debug statements people have previously inserted into that pass. Right now I'm looking into a bug in LLVM's register allocator, and the way I'm debugging it is literally invoking llc, because register allocation is part of the back end, and passing a whole bunch of flags that turn on debug print statements for the greedy register allocator: where we calculate and insert inline spills, debug information about how we calculate live ranges, things like that. It's very nice; LLVM is very easy to work with, and it's very easy to introspect what's going on, where, and how decisions were made.

"One of the things that I found really useful when working on Clang was the AST printer that would just dump the AST to the terminal. Now, I was only working on LLVM for a little while, so you definitely have more experience: is that something you still find useful? For one, I didn't know the names of all the C++ AST nodes, so being able to just feed it a file and see the AST nodes was really helpful." Yes; don't quote me wrong, but I have the invocation in the slides this talk is pulled from.
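The invocation being looked up is the standard one; the driver forwards the cc1-only -ast-dump flag via -Xclang:

    clang -fsyntax-only -Xclang -ast-dump foo.c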
Is it -dump-ast? Oh, I think I need -Xclang, and it's -ast-dump; sorry, let me pull up my slides, I'm mixing this up. The talk these slides are pulled from has the generic invocations, so let me punch this back into my terminal and share it again.

OK, fun colors: this is what the AST nodes look like after parsing and AST construction. There are actually a bunch of implicit things that get declared, like what a __int128 is, and weird things that probably shouldn't be there, like an NSConstantString tag, which sounds very Objective-C-specific to me, and __va_list_tag; I don't know. From there, there are typedefs: in my example I #included stddef.h, which actually doesn't declare a lot, just three or four things: ptrdiff_t as a typedef to long, size_t, wchar_t, and I guess max_align_t is defined there as well. So those came from our header. Then, from my source file, here's the AST of my code: a FunctionDecl with function name foo and parameters a, x, and y; a CompoundStmt, the block statement; and a ForStmt. The ForStmt has a bunch of children; this <<<NULL>>> looks like a bug to me, I don't know what it is. There's part of our conditional check, the less-than; here's the increment; a BinaryOperator that's our assignment, and another BinaryOperator that's our addition of x and y; and a is the left-hand side of the assignment.

So there are general compiler debug flags like -E, -S, and -c, there are Clang-specific ones like -Xclang -ast-dump, there are LLVM ones, and so on. "The null isn't necessarily wrong; it's probably just an optional part of a for statement. If you look at the nodes hanging off the ForStmt, there are basically slots for the things in each of the semicolon-separated parts of the for statement, plus the loop body: the pre-operations, the conditions you're checking, and the post-operations. So the missing, null part is probably just some other thing that can be present in a loop, and they leave the slot there just in case." Yeah, it could be something like that, or something from Objective-C.

I've highlighted an interesting node here, something I didn't write in my source code. Can anyone tell me, or guess, based on the name, what's going on here? "The bane of my existence: oh my god, the number of implicit casts that are everywhere in the compiler." So this is part of the language: people like to give JavaScript a lot of grief for having implicit conversions, but what a lot of people may not appreciate is that C and C++ are actually full of the same kinds of implicit conversions and promotions, with very tricky rules that are error-prone and genuinely hard to remember.
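A tiny example of what that highlighted node means; this snippet is invented for illustration, not from the talk:

    /* The comparison has mixed signedness, so Clang inserts an implicit
       conversion the programmer never wrote; in the -ast-dump output it
       shows up as an ImplicitCastExpr <IntegralCast> wrapping 'i'. */
    int below(unsigned len, int i) {
      return i < len;   /* 'i' is implicitly converted to unsigned here */
    }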
It's something I don't love; they're trying to help with ergonomics. In Clang and LLVM, the implicit casts are actually inserted into the AST at parse time, and that's how they're handled. "This is actually a great way to learn: when you have something and you're trying to understand how it's actually getting cast, or becoming the proper type, or making sure it's the proper type you thought it was going to be, particularly if you have something with auto, you can dump the AST to get that information." "The thing that I used this for was when I was writing my AST walker: you have to say on which nodes you do what, and I didn't know some of those nodes were going to be in there, so being able to actually feed the code in and then look at the names was helpful. It's also nice that all the line number and column information is in there, so it's really easy to trace this stuff back." Yep, that's all attached to the tokens; the hard part is just not destroying it in your transforms, so that you can still emit proper debug info, which is a whole other trick.

So, I'm sorry, I've got to run. If you have other questions, I'd love for you to ping me on chat and we can follow up, and if you have feedback, let me know. Thanks everyone for tuning in and for your questions; I appreciate it. "Thank you. Yeah, thank you, Nick. See ya. I'll put this on a G Drive." Appreciate it; thanks, Alex. Bye.
Info
Channel: Nick Desaulniers
Views: 1,175
Rating: 5 out of 5
Id: bUTXhcf_aNc
Length: 58min 37sec (3517 seconds)
Published: Mon Aug 10 2020