2018 LLVM Developers’ Meeting: L. Hames & B. Loggins “Updating ORC JIT for Concurrency”

Captions
So hi, everybody. My name is Lang Hames, and I'm here to talk to you about updating ORC for concurrency. Helping me out is Breckin Loggins. Over the last year or so, while I've been working on this project, Breckin has been helping me out with whiteboarding design ideas, improving class naming, and things like that. He's also been helping out by building use-case demos so that we could test out API design ideas along the way as we made them, and he's brought along one of his use cases today to help visualize what's going on in these JIT APIs.

A quick overview of the project for people who aren't familiar with ORC. ORC stands for On-Request Compilation. It is an API for building LLVM-based JITs, and it aims to be a modular, extensible replacement for MCJIT. It supports cross-target compilation, same as MCJIT did: you can compile code for the current process, but you can also compile code for a different process on the same machine, or even for a process on a different machine running a different architecture, and send the code over the network to be executed. Unlike MCJIT, it supports lazy compilation, so you can defer compilation of functions until the first time they're called. It also supports custom compilers: if you're going to be lazy, you don't have to compile your code down to LLVM IR and then stop being lazy from that level down. It lets you wrap up your own compiler and have it invoked lazily when code is needed from you. The design of this whole system is based on MCJIT's static-compiler-plus-JIT-linker architecture, and I'll talk about that in a little bit, because it provided a lot of inspiration for this design.

About this concurrency project in particular, I'm not going to bury the lede or anything: ORC now supports concurrent compilation. You can now run multiple compilers in parallel in the JIT, on different threads, and have code fed into your executing JIT process from those compilers. Concurrency can be mixed safely with laziness and remoteness, so you can mix lazy compilation and concurrent compilation with this API. The aim is to provide performance improvements: obviously, if you can distribute work across multiple cores or multiple machines, then you can execute faster. The other thing that concurrent compilation provides your JIT is that you can start speculatively compiling code. If you're JITing and you only have one core, you never want to compile anything except the bare minimum to execute the next bit of the JITed code, because any time you're compiling, you're not executing. If you can compile on background cores, you can start compiling ahead of where the JITed code is executing, so that hopefully, by the time the JITed code calls a function, you've already compiled the definition of that function and you don't pay any additional latency to call it.

The other motivation for working on this project, apart from the performance improvements, is that we wanted to do this as soon as possible because, as you'll see, the changes required for concurrency are quite disruptive to the API, so we wanted to get those changes out of the way. The longer we left it, the more code would get built on top of this API, and the more code would get broken when we did finally add concurrency. So this offers us a bit of future-proofing. On that note, for existing ORC users: if you're looking at the code examples that come up in this talk and thinking "that looks a little bit different",
the API is similar to, but different from, what I'll henceforth refer to as ORC v1. Some of the concepts from the original ORC APIs have been removed completely, like symbol resolvers and the lazy-emitting layer; they are totally gone, since there was no way to make them work with a concurrent compiler. Some of them have been replaced, so you now use a new mechanism for symbol lookup and for lazy compilation, and I'll talk about those a little later on. There are some totally new concepts and types to support lazy compilation in a concurrent environment. And since, to do this project, we had to touch basically every API in ORC anyway, I've taken the time to remove all of the top-level templates from ORC, so that we can provide a fully featured C API. We haven't done that yet, but those improvements are coming. For most of the changes between ORC v1 and ORC v2, the update path should be pretty straightforward once you've seen this talk. If you have a use case and you can't see how you would update it, come and talk to me and I'll be very happy to help you out.

So, the agenda for the tutorial today. I'm going to show you LLJIT, which is a concurrent JIT API for LLVM IR; it is very similar to MCJIT, except that it can compile on multiple cores at once. After I've shown you what that API looks like to use, I'm going to show you how it's built, and this is also going to be quite simple, because it's built out of off-the-shelf ORC components. Finally, I'm going to describe the APIs that you need to integrate your own compiler into an ORC JIT session.

Let's start by describing the MCJIT model I mentioned earlier, the static-compiler-plus-JIT-linker model, because it provided so much inspiration for the new design. MCJIT takes in LLVM IR and runs LLVM CodeGen to produce a relocatable object file in a buffer in memory, and then runs a component called RuntimeDyld over that relocatable object file to produce ready-to-run memory for your target process. The first part of this pipeline is just the LLVM static compiler; this is basically the llc command-line utility, it just doesn't dump the object file it produces out to disk. The only thing that makes this a JIT is the RuntimeDyld component, which is a JIT linker that takes the object file and applies all the relocations in memory to make the code runnable. So that's what MCJIT is: from outside, it looks like a black box that makes IR runnable; under the hood, it is just the static compiler again. Reusing the static compiler this way minimizes bugs and maintenance, so this was a really clever design by Jim Grosbach and Daniel Dunbar. Since this was introduced, every improvement made to the static compiler is automatically available in the JIT, and every bug fix made to the static compiler is automatically available in the JIT. This was a really great idea.

When I came to do the concurrency project, I realized that ORC was going to need structured symbol tables to enable this concurrency, and at the same time, MCJIT had provided this insight: LLVM IR is, at some level, made for linking. Every global value in LLVM has a linkage type and a symbol visibility. MCJIT had already sprouted support for some linker constructs, like common symbols, and it did that because existing front ends produce code that is meant to be linked, and we don't want to turn around to front-end designers and say, "Could you please add a JIT mode to your front end before we can JIT this code?"
So the idea was: if we're building structured symbol tables anyway, and if LLVM IR is meant for linking, then if we build symbol tables that follow the static and dynamic linkers' symbol resolution rules, we can JIT code from existing front ends and for existing projects. This would be a superset of MCJIT's features, so existing use cases are all still supported. And suddenly, questions about constructs that look odd in a traditional JIT, like "what happens if you add two definitions of a weak symbol to the JIT?", can be answered in terms of how the linker would have treated this. Obviously, with a linker, if you add two definitions of a weak symbol to the same dylib, you just discard one of them; the JIT now behaves the same way. So it's more intuitive for those of us who are familiar with the static compilation model.

In the ORC v2 program model, a JIT session now contains one or more JITDylibs, each of which contains some symbols. Let's imagine that we were JITing a command-line assembler tool (because I am a compiler engineer and I lack imagination for use cases; I'm sure there are things you can JIT that are not compiler tools, but why would you?). The command-line assembler you're JITing would have an "as"-style program that we represent as a JITDylib. This would have functions like main and parseOptions and printHelp. You would hope to defer the bulk of your assembler work to libraries, so there would be a libAsm with functions like assemble and disassemble and verify, and there would be some sort of support library, libSupport, with things like print and scan and reportError. Now, these are JITDylibs, so the symbols in your dylib don't initially have addresses associated with them. Instead, they have materializers associated with them, and a materializer is just something that can generate a definition for that symbol on demand. Normally this will be a compiler, but you can add objects to the symbol table, in which case your materializer is just a linker, or you can add stubs to a JITDylib, in which case your materializer is just something that splats out stub definitions in memory. The other thing JITDylibs have is dependencies on each other for symbol resolution: you can say that the "as" JITDylib links against libAsm and links against libSupport. That means that if code added to the "as" JITDylib has external references, we will try to find them in libAsm and then in libSupport. So the fundamental idea in the new API is that you describe your JITed program as if the compiler inputs were being linked. In a regular build you link the compiler outputs; in the JIT you link the compiler inputs, and then we defer compilation of each of those inputs until it's actually needed.

So let's take a look at what this looks like when we use the LLJIT API. For these code examples I'm going to use some shorthand to fit things on the slide, and I'm going to pretend that errors don't exist, so that we don't have any error-handling boilerplate; otherwise, this is what the code looks like. You start by creating an LLJIT instance: you say auto J = LLJIT::Create(...), and the first argument to this Create function is a JITTargetMachineBuilder. One of the first things we ran into when designing these new APIs is that when you create an MCJIT instance, you just give it a TargetMachine, but if you have multiple threads, one TargetMachine isn't enough, because TargetMachines aren't thread-safe.
You need to be able to create a new TargetMachine for each thread you want to compile on, and that's what JITTargetMachineBuilder does: it's just a factory for building TargetMachines. You can construct it with a target triple, or you can call the static detectHost method and that will build you a target machine builder for your host. Once you've got your target machine builder, you can tweak the target options and the subtarget features if you would like, and under the hood, the JIT just calls createTargetMachine to get a new TargetMachine whenever it needs one. The second argument to the LLJIT Create function is the number of threads that you want to compile on. If you give us zero threads, we'll interleave compilation and execution on the main thread; if you give us one or more dedicated compile threads, we will compile on those threads and block the execution thread whenever it needs to wait for something to be compiled.

Once you've done that, you want to add some code, and if you're going to add code, you need a JITDylib to put it in. You can get a reference to the main JITDylib, which is always constructed for you by default, by saying J.getMainJITDylib(), and then you can add code by saying J.addIRModule, parsing an LLVM IR file at some path. Again, we're pretending here that there are no errors to deal with; in reality, you would have a little error-handling boilerplate. Once you've done that, you can call J.lookup and give it the string name of an LLVM IR symbol. That will return you a JIT symbol for the symbol you looked up; you can then call getAddress on that symbol, cast the result to a function pointer, and call it. That's all it takes to compile IR in this system and make it part of your process.

If we were to do a slightly more complicated example with a couple of JITDylibs, say libMain plus a libAsm, our example from earlier, we'd create that linkage relationship by saying LibMain.addToSearchOrder(LibAsm). You read that as, basically, "libMain links against libAsm". Now we can say J.addIRModule to libAsm with "asm.ll", continue as before, and main.ll can access any code that's present in asm.ll. This example is actually even slightly more complicated than it needs to be: libMain is special, and every JITDylib you create is automatically added to the search order for main, so in practice you don't even need to add this links-against relationship for libMain.
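Putting the slide shorthand together, here is a minimal sketch of that flow, assuming the 2018-era ORC v2 API (signatures such as LLJIT::Create have shifted in later LLVM releases). loadModule is a hypothetical helper that parses the file at the given path into a ThreadSafeModule:

```cpp
#include "llvm/ExecutionEngine/Orc/JITTargetMachineBuilder.h"
#include "llvm/ExecutionEngine/Orc/LLJIT.h"

using namespace llvm;
using namespace llvm::orc;

int runDemo(int ArgC, char **ArgV) {
  // A TargetMachine factory for the host, then an LLJIT instance with two
  // dedicated compile threads (zero would interleave compile and execute).
  auto JTMB = cantFail(JITTargetMachineBuilder::detectHost());
  auto DL = cantFail(JTMB.getDefaultDataLayoutForTarget());
  auto J = cantFail(LLJIT::Create(std::move(JTMB), DL, /*NumCompileThreads=*/2));

  // Add IR to the main JITDylib. loadModule() is a hypothetical helper that
  // parses the file at the given path into a ThreadSafeModule.
  auto &LibMain = J->getMainJITDylib();
  cantFail(J->addIRModule(loadModule("main.ll")));

  // A second JITDylib that libMain links against.
  auto &LibAsm = J->createJITDylib("libAsm");
  LibMain.addToSearchOrder(LibAsm);
  cantFail(J->addIRModule(LibAsm, loadModule("asm.ll")));

  // Look up main, cast its address to a function pointer, and call it.
  auto MainSym = cantFail(J->lookup("main"));
  auto *MainFn = (int (*)(int, char **))MainSym.getAddress();
  return MainFn(ArgC, ArgV);
}
```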
So I'm going to call Breckin up to demonstrate that this works and can compile on more than one core. I actually gave a demonstration back in 2016 where we compiled GCC lazily, one function at a time, and it had a little visualizer that went along with it where each function had a little square that lit up while the function was being compiled. The feedback from the 2016 talk was that the API was good, but the squares were really good. I mentioned this to Breckin, and he took the idea and ran with it, so this is the demo you're about to see.

[Breckin] All right, thank you, Lang. Okay: squares! Hi, everyone. Lang talked about how to build one of these JITs, and one thing I want to emphasize while running this is that this is effectively that code sample; we've just hooked up some callbacks, and we're going to try to draw some pretty things. I'm going to launch this, put it in full screen, and open some stuff up. Before I do that, I want to go over the interface a little bit, because when we run this later, things are going to happen pretty quickly, and I want to make sure everybody knows what's going on.

The new JIT architecture is module-based, in the sense that laziness occurs on module boundaries, not symbol boundaries, so the squares you'll be seeing here are modules, not individual functions. When an IR module is added to the JIT, it shows up as a square in this "added" section up top. That means that if it's LLVM textual IR, it's been parsed; if it's bitcode, it's been deserialized; and it now sits inside the JIT's internal data structures. Once someone looks up the address of a symbol in the module, it moves to the "enqueued" state. Once it's there, if there is a free slot available in one of the compiling threads, it will move there and actually compile; in this case, that means taking the IR and turning it into machine code. Once that's done, all the symbols in the module have an address, so the square, which is the module, moves down to the "resolved" state. But that doesn't mean it's ready yet. It doesn't mean you can jump to its code, because the module may refer to other external symbols that also have to be compiled. Once a module's transitive closure of dependencies has also been resolved, it finally moves down to the "ready" state, and from there you can jump to its code and run it, knowing that all the code it depends on is also ready.

What I want to do is show this happening in slow motion. I'm going to slow the JIT down, and I'm going to open our familiar 403.gcc test case, the basic one, which splits the whole test case into two modules. If you're not familiar with it, it's a subset of the GCC compiler that we've compiled to LLVM IR, because it's fun to compile a compiler while it's compiling a test case. So we'll set the threads to two and say open. Because we're slowing the JIT down, it takes a bit for things to pop up, but we'll see two modules. They immediately move to enqueued, because main has been looked up, and then they immediately move to the threads; you can see we're occupying both threads. One of the modules is kind of big, so it takes a little longer, but now they're both resolved, they move down to ready, and everything has executed. Let's do it again, a little bit faster. You can see it's much faster, but there's still this long compile time for 403.gcc part one: it's a huge module with a lot of code in it. We'd like to speed that up, which we'll do in a moment, but before that I'd like to bring Lang back up to talk more about the internals and how to build fancier things.

[Lang] All right, thank you, Breckin. I also wanted to say a special thank-you, because I think that's the only compiler demo I've seen outside of GPU land that has a frames-per-second meter in the top right-hand corner, and he did take the time to optimize it so we got a steady 60 frames per second on our compiler demo. That's awesome. So, okay: we can JIT compile, and we can compile on multiple cores. The code you just saw was live, and this stuff runs ASan- and TSan-clean. It's still a work in progress, but it's already getting robust enough to handle non-trivial stuff.
Now, before I move on to how you implement LLJIT, I wanted to mention a few of the utilities that are available to you while you're using it. A lot of these will look familiar if you've worked with linkers before. For instance, if you want to add absolute symbols to your JIT session, you can just say JITDylib.define(absoluteSymbols(...)), give it a map of symbol names to addresses, and those symbols will now be visible to your JITed code. You can define symbol aliases, saying "this symbol in my JIT symbol table is just an alias of this other symbol", and you can define re-exports, which are basically aliases that are allowed to cross JITDylib boundaries.

Another really useful utility is that you can attach generators to JITDylibs to generate definitions on demand. This is particularly useful for things like exposing precompiled code to the JIT. For instance, if you want to make your own process's symbols available to JITed code, you can do that by saying JITDylib.setGenerator(DynamicLibrarySearchGenerator::GetForCurrentProcess(...)). If you want to make a particular dynamic library available to the JIT, you can say setGenerator with a DynamicLibrarySearchGenerator constructed from either an llvm::sys::DynamicLibrary or just a path to a dynamic library to load.

The other interesting utility is the one for lazy compilation in the new scheme. I mentioned that compile callbacks are gone. Compile callbacks used to attach an arbitrary closure to a stub to enable lazy compilation. The problem is that it's very difficult to synchronize when you have no idea what's going to run inside that stub. So, to take advantage of the synchronization we already built for symbol lookup, we turn lazy compilation into a lazy lookup problem, and we do that with a utility called lazy re-exports. A regular re-export is just an alias. A lazy re-export is not an alias, but a stub that acts like the aliased symbol when you call it. When you first call a lazy re-export stub, it performs a lookup for its target symbol and then jumps to it; on subsequent calls, it bypasses the lookup and goes straight to the compiled body. Obviously this only makes sense for function symbols (you can't lazily re-export data symbols), but for functions, this is how you make them lazy. A note of caution: you need to take care to preserve, or deliberately change, symbol names. If function-pointer equality matters for your language, you need to make sure that everybody agrees the stub is the canonical definition of the function.

What this looks like in use: take the example from earlier, where libAsm has an assemble and a disassemble function in it, and we want to compile those lazily. The way I would usually do this is to create a libAsmImpl JITDylib, put the bodies of the functions in that dylib, and rename them to assemble.body and disassemble.body. Now, in libAsm, you can define lazy re-exports from libAsmImpl: you say there is an assemble symbol that re-exports assemble.body from libAsmImpl, and a disassemble stub that re-exports disassemble.body. Now, if anybody tries to link against assemble or disassemble, the materializer for them just needs to generate a stub, which is really quick, and it's only if you call those stubs at runtime that you trigger a lookup for assemble.body or disassemble.body. That's how you do lazy compilation in the new model.
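In rough code, these utilities might be used like this. This is a sketch against 2018-era names (define/absoluteSymbols, setGenerator, lazyReexports; the GetForCurrentProcess argument spelling has varied across releases). ES, DL, TT (the target triple), JD, LibAsm, and LibAsmImpl are assumed from the surrounding session, Mangle is a MangleAndInterner, and reportError is a hypothetical host function:

```cpp
// Absolute symbols: make a host function directly visible to JITed code.
MangleAndInterner Mangle(ES, DL);
cantFail(JD.define(absoluteSymbols(
    {{Mangle("reportError"),
      JITEvaluatedSymbol((JITTargetAddress)(uintptr_t)&reportError,
                         JITSymbolFlags::Exported)}})));

// Generator: fall back to the host process's own symbols on lookup misses.
JD.setGenerator(cantFail(
    DynamicLibrarySearchGenerator::GetForCurrentProcess(DL.getGlobalPrefix())));

// Lazy re-exports: callable stubs in LibAsm whose first call triggers a
// lookup (and hence compilation) of the .body symbol in LibAsmImpl.
auto LCTM = cantFail(
    createLocalLazyCallThroughManager(TT, ES, /*ErrorHandlerAddr=*/0));
auto ISM = createLocalIndirectStubsManagerBuilder(TT)();
cantFail(LibAsm.define(lazyReexports(
    *LCTM, *ISM, LibAsmImpl,
    {{Mangle("assemble"),
      {Mangle("assemble.body"),
       JITSymbolFlags::Exported | JITSymbolFlags::Callable}},
     {Mangle("disassemble"),
      {Mangle("disassemble.body"),
       JITSymbolFlags::Exported | JITSymbolFlags::Callable}}})));
```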
So let's turn to how you actually implement something like LLJIT using the off-the-shelf components, and it's actually really easy. You start out with an ExecutionSession. This holds the string pool, the session mutex, and error-reporting utilities, so you always have exactly one instance of this class per JIT. Next, you need a linking layer so that you can add object files to your JIT. That takes a reference to the ExecutionSession, plus a memory manager builder so that it can build a new memory manager for each object file you add. But we don't want to add object files to our JIT, we want to add LLVM IR, so we'll need an LLVM IR compiler, and for that we can use a utility called ConcurrentIRCompiler. You construct it with one of the JITTargetMachineBuilders, and ConcurrentIRCompiler just compiles LLVM IR; as the name implies, it's safe to call from multiple threads. A regular pass pipeline you definitely can't call from multiple threads, even with different modules; ConcurrentIRCompiler you can. We take that compiler and use it to create a compile layer: give it a reference to the ExecutionSession, the link layer where it should send its output, and a reference to the IR compiler, and that's it. We can now add code to our newly reconstructed LLJIT. LLJIT's addIRModule method was just wrapping CompileLayer.add, and likewise LLJIT's lookup method just wraps ExecutionSession lookup. That's how you do lookup now, through the ExecutionSession: you say ExecutionSession lookup, give it a symbol name, get the address for that symbol, cast it to a function pointer, and call it.

So we have almost reconstructed a basic LLJIT, except that LLJIT was able to run on multiple cores, and we haven't talked about how you do that yet. There's just one change you need to make for that: set the materialization dispatcher. When you do a lookup, we do a sweep and collect all the materializers for the symbols you looked up, and then we need to dispatch them; we call the ExecutionSession's dispatchMaterialization method to do that. The signature of this method just takes the JITDylib you're materializing into and the materializer that's going to do the work. The default implementation calls the materializer's doMaterialize on the JITDylib. That default means the thread that does the compile is the one you did the lookup on: you did a lookup, you got the materializers, you run them on the same thread, and eventually you unwind and give back the results of the lookup. But you're also free to dispatch that materialization on any threads you want, and we will still block and make sure the results are returned to you when they're ready. In this example, I've just spawned a new thread for every single materializer. In practice you wouldn't want to do that, because you might end up with hundreds of threads compiling at once, but you can substitute a thread pool and add work to the pool instead. And that is all it takes to take your IR compiler from a single-core one to a multicore one.
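Here's how that assembly of parts might look in code, a sketch using the 2018-era component names (setDispatchMaterialization and MaterializationUnit::doMaterialize were reworked in later LLVM releases), spawning a detached thread per materializer purely for illustration:

```cpp
#include "llvm/ExecutionEngine/Orc/CompileUtils.h"
#include "llvm/ExecutionEngine/Orc/Core.h"
#include "llvm/ExecutionEngine/Orc/IRCompileLayer.h"
#include "llvm/ExecutionEngine/Orc/JITTargetMachineBuilder.h"
#include "llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h"
#include "llvm/ExecutionEngine/SectionMemoryManager.h"
#include <memory>
#include <thread>

using namespace llvm;
using namespace llvm::orc;

void buildJIT() {
  // One ExecutionSession per JIT: string pool, session mutex, error reporting.
  ExecutionSession ES;

  // Linking layer: builds a fresh memory manager for each object file added.
  RTDyldObjectLinkingLayer ObjLayer(
      ES, []() { return std::make_unique<SectionMemoryManager>(); });

  // ConcurrentIRCompiler builds a new TargetMachine for each compile, so it
  // is safe to call from multiple threads, unlike a regular pass pipeline.
  auto JTMB = cantFail(JITTargetMachineBuilder::detectHost());
  ConcurrentIRCompiler Compiler(std::move(JTMB));

  // Compile layer: sends its output down to the linking layer.
  IRCompileLayer CompileLayer(ES, ObjLayer, Compiler);

  // The one change needed for concurrency: dispatch each materializer to its
  // own thread. (In practice you would push work onto a fixed-size pool.)
  ES.setDispatchMaterialization(
      [](JITDylib &JD, std::unique_ptr<MaterializationUnit> MU) {
        auto SharedMU = std::shared_ptr<MaterializationUnit>(std::move(MU));
        std::thread([SharedMU, &JD]() { SharedMU->doMaterialize(JD); })
            .detach();
      });
}
```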
One extra note: there's an additional in-tree utility called LLLazyJIT that takes this slightly further. LLJIT doesn't do full lazy compilation; LLLazyJIT adds that functionality. If anybody's familiar with it, this is basically the functionality from the original LLVM legacy JIT API (which is long dead now): you can add IR and have the IR lazily compiled for you, one function at a time. To get that in LLJIT, you add a CompileOnDemandLayer, giving it a reference to the ExecutionSession, the compile layer to send the IR to, and two off-the-shelf utilities: a lazy call-through manager and a stubs manager. Once you've done that, if you call CompileOnDemandLayer.add and add your IR to the JIT that way, your IR will be compiled lazily. The way the CompileOnDemandLayer does this is that it scans your IR and builds lazy re-exports for each of the functions in it; that's all it has to do. Whenever it gets a call for a particular function, it pulls that function out into its own module and sends that module down to the compile layer.
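As a sketch continuing the previous one (again with 2018-era names; this class spent some time in-tree as CompileOnDemandLayer2, and current signatures differ), adding the layer on top of those components might look like:

```cpp
// TT is the target triple, captured from the JITTargetMachineBuilder before
// it was moved into the compiler.
auto LCTM = cantFail(
    createLocalLazyCallThroughManager(TT, ES, /*ErrorHandlerAddr=*/0));
CompileOnDemandLayer CODLayer(ES, CompileLayer, *LCTM,
                              createLocalIndirectStubsManagerBuilder(TT));

// IR added through this layer is compiled one function at a time: each
// function gets a lazy re-export stub, and its body is only compiled when
// that stub is first called.
cantFail(CODLayer.add(JD, std::move(TSM))); // TSM: a ThreadSafeModule
```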
So that's LLJIT. It is a basic concurrent JIT API for LLVM IR; it's in-tree now, and you can take a look at it. It can be used as an MCJIT alternative, and if your MCJIT use case is simple, I would definitely encourage you to try it out and see if it works for you. It's also really useful as a starting point for building your own JIT, which I'll talk about in a moment. Of course, I glossed over some details here because we couldn't fit them on the slide. The in-tree class does some name mangling so that you can look up symbols in terms of their IR names, but under the hood it always uses the platform-specific linker mangling: on Linux this is a no-op, on macOS it's usually adding an underscore, and on Windows variants it differs from ABI to ABI, but the in-tree class matches the native mangling for the system, so you can always look up symbols in terms of their IR names. The in-tree version has support for running static constructors and destructors, which we haven't shown on these slides, and it also supports adding object files, which we also didn't show. But having seen these slides, and knowing about the few things that are missing, you can probably go and look at the in-tree code now and recognize pretty much all of it; there's not much more to it than that.

So, moving on to wrapping your own compiler, which I think makes this whole system a lot more interesting. To do that, there are five big APIs you need to understand, four of which we've already met. There's the ExecutionSession, which provides the global state. There are the JITDylibs, which provide the symbol tables. There are layers, which wrap up compilers. There are materializers, implemented in particular by the MaterializationUnit type, which wrap up program representations. And finally there is the MaterializationResponsibility class, which tracks responsibility for compiles through the system; I'll talk more about that in a moment.

First, though: layers and materialization units are there to wrap up your compiler and your module representation. ORC doesn't know anything about your compiler API, and it doesn't know anything about your program representation. Layers wrap up your compiler API so that ORC can invoke it; ORC v1 layers served the same purpose, if you're familiar with them. Materialization units wrap up your program representations so that the JITDylib can scan them to see what symbols each module provides, but also hold on to them so that we can compile them on request. ORC v1 didn't have any concept of a materializer, because lookup was implemented through the layers themselves, and those were already yours to define, so you could work natively in terms of your module representation. Now that we have first-class symbol tables, we need to wrap your representation up.

Because these are just wrappers, they tend to be very simple. A layer might look something like this: you have a reference to an instance of your compiler and a reference to the base layer to send your compiler's output to, and you just have to define two functions, one to add code to the JIT and one for the JIT to call you back on when it needs code from you. The add method usually just takes a module representation from you, in whatever your representation type is, wraps it up in a materialization unit, and calls JITDylib define to store the module in the symbol table. In the emit callback you get from the JIT, you are handed back an unwrapped instance of your module representation; you pass that to your compiler instance, then send your compiler's output, along with the responsibility object, down to the base layer.

Materialization units are similarly simple. They inherit from the MaterializationUnit base class, and you'd have a reference to your layer and an instance of your module representation. You inherit a getSymbols method from the MaterializationUnit class, so you have to tell the materialization unit what symbols your module is going to provide, and that information can be retrieved by calling getSymbols. You also have two methods to implement. One is the materialize method: the JIT calls you on this when it needs to materialize the module, and there you just call through to your layer's emit function, passing along the unwrapped module and the responsibility object. The other method is discard. Remember, we're acting like a linker, and you can add two modules that each have a definition of a weak symbol to the same JITDylib. If you do that, the JIT will call you back straight away for the second module and say, "I'm never going to choose this definition; I already have a definition of that symbol." That's your opportunity either to mark the symbol as available-externally, if you want to keep it around for optimization, or to delete the second definition entirely. This can save you a little bit of memory if you have a lot of ODR definitions or something like that that you're adding to the JIT.
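Here is a sketch of that shape, with hypothetical stand-ins for the front end: MyCompiler, MyModule, scanSymbols(), markAvailableExternally(), and compileToObject() are all invented for illustration, and the ORC base-class signatures follow the 2018-era API (the MaterializationUnit constructor and MaterializationResponsibility have both changed since):

```cpp
class MyModuleMU; // defined below

class MyCompilerLayer {
public:
  MyCompilerLayer(ObjectLayer &BaseLayer, MyCompiler &C)
      : BaseLayer(BaseLayer), C(C) {}

  // Wrap the module in a materialization unit and store it, uncompiled, in
  // the JITDylib's symbol table.
  Error add(JITDylib &JD, std::unique_ptr<MyModule> M);

  // Called back (via the MU) when the JIT actually needs code from M:
  // compile it, and pass the object plus the responsibility down the stack.
  void emit(MaterializationResponsibility R, std::unique_ptr<MyModule> M) {
    BaseLayer.emit(std::move(R), C.compileToObject(*M));
  }

private:
  ObjectLayer &BaseLayer;
  MyCompiler &C;
};

class MyModuleMU : public MaterializationUnit {
public:
  MyModuleMU(MyCompilerLayer &L, std::unique_ptr<MyModule> M)
      : MaterializationUnit(scanSymbols(*M), VModuleKey()), // symbols M provides
        L(L), M(std::move(M)) {}

  StringRef getName() const override { return "MyModuleMU"; }

  // The JIT calls this when someone needs a symbol from M.
  void materialize(MaterializationResponsibility R) override {
    L.emit(std::move(R), std::move(M));
  }

  // Called when a weak definition in M loses to one in another module.
  void discard(const JITDylib &JD, const SymbolStringPtr &Name) override {
    M->markAvailableExternally(Name); // or just delete the dead definition
  }

private:
  MyCompilerLayer &L;
  std::unique_ptr<MyModule> M;
};

Error MyCompilerLayer::add(JITDylib &JD, std::unique_ptr<MyModule> M) {
  return JD.define(std::make_unique<MyModuleMU>(*this, std::move(M)));
}
```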
Then the most interesting class in this set, I think, is MaterializationResponsibility. This was motivated by a problem we hit in the JIT: what happens if a module fails to compile? In single-threaded compilation, if something goes wrong, it's really obvious what happens: somebody generates an error, and that error tears down the compile stack. That's all very natural and obvious. But if you're doing concurrent compilation, where a compile on one thread might have been triggered by a request on a different thread, that's no longer true. The error tears down the compile thread's stack, but unless you take some concrete action, the clients that were waiting for that compile will never hear about the failure, and the symptom will be starvation: your threads will just sit there waiting for a response from a compiler that has gone away and will never give them an answer. To solve this problem, we introduced the MaterializationResponsibility class. It tracks responsibility through the JIT, and it guarantees that waiters will be notified, provided you use the class correctly. The contract on the MaterializationResponsibility type is that either the responsibility object reaches the JIT linker and succeeds in linking, which means the linker calls resolve and emit for you for each symbol, or the failMaterialization function has to be called. We've enforced this contract with an assertion in the destructor, so if you're using this API and you fail to follow the contract, your program, at least in debug mode, will fail with an assert. Enforcing a contract with an assertion in the destructor is my new favorite idiom, because it worked fantastically for LLVM's Error type. Responsibility is owned; it can be transferred; you can break it up so you can hand off work to different compilers; but it's never duplicated and never dropped. Instances of the responsibility object are created for you by the JIT, so you don't have to worry about creating them; you just need to follow the contract.

The API for the responsibility object looks like this. You can call getSymbols to find out what symbols you are responsible for compiling. This is usually totally redundant, because if you've got a responsibility object, you're also holding a module that has definitions in it, and those definitions should match up exactly with what you're responsible for compiling, but you can use it for sanity checking and verification. A more interesting API is the getRequestedSymbols method. This lets you ask, of the symbols you're currently responsible for, which ones people have actually asked for, and you can use this to break up modules based on dynamic information. We are still a module-at-a-time compiler, so if you've added a module with a hundred functions in it and somebody asks for one of those functions, we have to turn around and ask you to compile the whole module. But this gives you a chance to look at what was actually requested, maybe break the module up into a smaller piece, and put all the parts of the module that weren't requested back in the symbol table so that you are no longer responsible for them. To do that, you call the responsibility object's replace method: you give it a materialization unit covering some of the symbols you're responsible for; it puts that materialization unit back in the symbol table, and you are no longer responsible for those symbols. You can also delegate responsibility: if you want to break up a compile job and run it on more threads, you can call delegate to split up your MaterializationResponsibility object. There's a resolve method, usually called for you by the JIT linker, that's used to assign addresses to symbols, and there's an emit method, again used by the JIT linker, to notify the JIT once a symbol has had all of its bits written out to memory and its memory locked down, so that the symbol, at least in isolation, is ready to execute. And finally, there's the failMaterialization method I alluded to, which notifies anybody waiting for symbols that you're responsible for that they're not going to get an answer.

You've already seen basic uses of this API, where we just propagate the responsibility object along: if all you're doing is compiling a whole module and passing it along, the set of symbols you're responsible for doesn't change. A more interesting use case would be to use the getRequestedSymbols method to find the set of symbols that have actually been requested and split them out into a sub-module. You take the remainder of the module, the part that hasn't been requested and isn't needed right now, and replace that in the symbol table, and then you call the base layer and compile the sub-module containing the definitions people are interested in right now.
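For instance, a hypothetical variant of the emit method from the earlier layer sketch might use getRequestedSymbols to compile only what was asked for (splitByNames is an invented helper that partitions MyModule by symbol name, and in current LLVM replace and friends have slightly different spellings):

```cpp
// A more selective emit(): compile only the requested symbols now, and put
// the rest back in the symbol table.
void MyCompilerLayer::emit(MaterializationResponsibility R,
                           std::unique_ptr<MyModule> M) {
  SymbolNameSet Requested = R.getRequestedSymbols();

  std::unique_ptr<MyModule> Hot, Cold;
  std::tie(Hot, Cold) = M->splitByNames(Requested); // hypothetical helper

  // Hand the unrequested half back: we are no longer responsible for it.
  R.replace(std::make_unique<MyModuleMU>(*this, std::move(Cold)));

  // Compile just the requested half and send it down as before.
  BaseLayer.emit(std::move(R), C.compileToObject(*Hot));
}
```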
A quick summary of the API: MaterializationResponsibility's contract enforces safe communication between compiling and linking tasks in a concurrent JIT, and it does this even in the failure case. We didn't want to give you an API that's really great when it succeeds but blows up in arbitrary ways when it fails: if a concurrent compile fails, you have a guaranteed failure mode, and all threads that depend on the failing compile will find out about it. Layers and materialization units wrap up your compiler and module representation to make them usable by ORC, and they're very easy to write. JITDylibs provide the symbol tables, and the ExecutionSession provides the context. That's it for the API details; hopefully it's not particularly complicated.

I want to mention a couple of the implementation details, because this was the really interesting part of getting all this to work. First is the locking scheme. Compilers aren't the hard part of this, right? Compilers are already embarrassingly parallel, so we have no work to do to get compilers running in parallel. But coordination is required for several tasks within the JIT, in particular lodging queries, dispatching compilers, and integrating the output from the linker into your running process. To make this work, the symbol table has cheap critical sections that determine what needs to be done, and then the actual expensive work gets done outside the critical sections. For instance, for symbol lookup the algorithm looks like this: you lock, you lodge the query against all of the symbols you want an address for, and you find the materializers you need to run. That part is really cheap; it's just a map iteration. Then you unlock and actually run the compilers, which is really expensive, but it happens outside the critical section, so you can parallelize it.

Another interesting implementation detail is the use of callbacks in the primitive lookup. At the base layer, lookup is asynchronous and uses callbacks. I often get asked why we didn't use promises and futures for this. We do for the blocking version of lookup, but the native version uses callbacks, and the problem is that promises and futures block threads. The linker uses lookup to apply relocations. If lookup blocks a thread, then you need at least n threads to compile a complete module dependence graph of size n, and that would preclude running on a fixed-size thread pool: if you had ten modules that all depend on one another and only nine threads, those nine threads all block, and the last module you're waiting on doesn't have a thread to get linked on. The solution was to use callbacks and rewrite the JIT linker to use continuation passing. The old JIT linker algorithm, at a very high level, looked like this: record the relocations, look up all the external symbols (blocking), and then apply the relocations. The new one looks like this: record the relocations, look up the external symbols, and apply the relocations in the callback. What this means is that if your first link depends on others and there are no other threads, it hands its linker work to the next module on the thread, which in turn hands both pieces of linker work to the next module on the thread. If you have a hundred interdependent dylibs, by the time you get to the end of that chain, the last module has a hundred links' worth of work to run, but it will run them and it will make progress. So the new algorithm runs on a fixed-size thread pool; even a single thread is enough for this system.

There are actually two callbacks that you can register. The first is onResolved, which fires when all of the symbols you've queried have an address. They're not necessarily ready to use yet, but they've been assigned an address; this is typically used by the linker when applying relocations. The second is the onReady callback, which fires when all of the queried symbols are ready to access: if they're functions, they're ready to run; if they're data symbols, they're ready to read and write. This is typically used by clients, and it's what the blocking lookup blocks on. The interesting thing about it is that it requires the transitive closure of all of your dependencies to also be ready. It's not enough that your function has been written out to memory, or even that the functions you directly depend on have been written out; every bit of code you could reach has to be ready in memory before you can start executing through it. To make this work, the ExecutionSession manages a full dependence graph of all the symbols that are currently being compiled. The graph vanishes once the compiles are done, but while compiles are going on, we track those dependencies for you. I was worried this would be a bit slow, but it actually scales reasonably well, even in the naive implementation we have so far.
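The primitive lookup, roughly as it stood in 2018 (today a single callback plus a required SymbolState replaces the onResolved/onReady pair), might be invoked like this:

```cpp
// Asynchronous primitive lookup: neither callback blocks a thread.
ES.lookup(
    /*SearchOrder=*/{{&JD, /*MatchNonExportedSymbols=*/false}},
    /*Symbols=*/{Mangle("assemble")},
    /*OnResolved=*/
    [](Expected<SymbolMap> Result) {
      // Fired once every queried symbol has been assigned an address.
      // This is what the JIT linker uses to apply relocations.
      if (!Result)
        logAllUnhandledErrors(Result.takeError(), errs(), "lookup failed: ");
    },
    /*OnReady=*/
    [](Error Err) {
      // Fired once the transitive closure of dependencies is also ready;
      // the blocking convenience lookup waits on this.
      cantFail(std::move(Err));
    },
    NoDependenciesToRegister);
```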
So I'm going to get Breckin up to demonstrate that.

[Breckin] Okay, more squares! All right, let's bring up the JIT demo again and go to full screen. This time I'm going to keep it on full speed. It's going to go by pretty quick, but since there's some slowdown from the visualization itself, it'll be slow enough to see, and then we'll run through it again and slow it down a bit. The first thing we're going to do is go up here and pick the 403.gcc test case that has one LLVM IR module for each C file, which is a bit more of a traditional way of partitioning things. Second: most of us are fortunate to have machines that can run quite a few threads at a time, quite a few cores. This machine has four cores, hyper-threaded, so I think I can get away with eight threads; some of you have fancier machines, and you can feel free to imagine what this would look like on yours. So let's go ahead and run this and see what happens. All right. So, as you can see, we've got some modules that were never needed, so they weren't compiled, and all eight threads were used. But if you paid close attention, you may have noticed that they weren't fully occupied all of the time, and to show that off a little bit, I'm going to run it again, slow it down, and talk a little about our naive speculator. Let's run it again, open it, and now let's slow this down a bit. As you can see, the thread slots are not busy all the time. Now, why is that?
That's because (you know we couldn't resist trying this out) Lang made a naive speculator which, when it encounters a module whose symbols need to be compiled and materialized, looks at all of the externally referenced symbols and says, "Okay, let me go try to find those and run lookups on them as well." This visualizer really helped us out when we were doing that, because when we first tried it, we got a little bit of parallelism, but you'd get to a point where only maybe two threads were being used, and it turned out that all of the fast materializers were getting blocked behind some of the slower ones. Lang fixed that up a bit, but as you can see, we still have work to do. In particular, if you went to Stefan's ThinLTO talk, you might be thinking: you've got the CFG and module summaries; you could trace through those. That's something we'd like to look at. Also profile-guided optimization: you run the program, you beat a path, and maybe the next time you run it you'll go down the same path or some statistical variant of it, so maybe we ought to put those modules ahead in the queue. Those are some exciting things we think we may be able to do, and Lang is going to talk a little more about that when he comes back up. So let me speed this up and finish it, and there we go.

[Lang] All right, thank you, Breckin. Okay, a quick summary of where we're at with this project. ORC now supports concurrent compilation in the JIT. This enables performance improvements, and in particular it enables speculative compilation, which we're very interested in, though, as Breckin mentioned, we've only done a very naive version for this demo. The concurrency we've added is safe to mix with lazy compilation and with remote execution, so if you're using any of the existing features and wondering how this interacts: you can just turn it on; it is orthogonal to everything else. It's really easy to wrap your compiler up for use with the new API, and one of the big differences from ORC v1 is that you now configure your JIT session by declaring linkage relationships between program representations, between the compiler inputs, rather than between compiler outputs as in a static build. The implementation of all of this is still at a fairly early stage, but we do already see speedups. When we add cores, ideally we would see a linear speedup, because compilation is embarrassingly parallel; we don't see anything like that yet, because we still have problems with lock contention and some inefficient algorithms. We're very bad at dependence tracking right now: the RuntimeDyld linker can't tell us exactly which symbols depend on which others, so we have to assume every symbol in a module depends on every external reference from that module. That makes our dependence graphs very big, which makes dependence tracking very expensive, and dependence tracking happens under the lock at the moment. So we have work to do, and we know it, but it's progressing nicely.

Next on the roadmap: one of the things I really want to get in is the new C API. We removed all the templates so that we could enable this, and we're hoping to get it out soon. We want to remove ORC v1: as soon as current ORC API clients have tested the new APIs and are happy with them, we can kill off the legacy ORC APIs that are in-tree at the moment and save ourselves some maintenance. After that, I would like to get LLJIT to feature parity with MCJIT, which means adding
support for JIT event listeners and for static archives. Once we've done that, we will encourage MCJIT clients to try out LLJIT, because if we can get everybody moved over to LLJIT, that would allow us to kill off the whole legacy ExecutionEngine API, which would be a big win. We've got lots of performance improvements to work on, obviously, and lots of new documentation to write, but we are hard at work on it. We've reached the stage where we're happy to get up here, tell you about this, and encourage you to try the API out, and I hope you will do so. Thank you very much. [Applause]

Questions?

[Q] Hey, you mentioned getting rid of a bunch of the templates. Did that add runtime indirection, or what was your solution there? Weren't there performance issues?
[A] Basically, everywhere we removed a template, we were at a point where we were about to run a compiler anyway, so we said: okay, now we have to call a virtual function or do some other thing that, in context, is so inexpensive that it won't matter. There will be performance implications, but not meaningful ones.

[Q] What's the plan, what's the timescale for switching?
[A] Are you currently on MCJIT or on ORC v1?
[Q] We're on...
[A] Okay. We would like to switch soon. We've renamed the in-tree classes, so if you're living on top-of-tree: the ORC v1 classes are all still there at the moment, but they've been renamed with a "Legacy" prefix. We would encourage you to try the new APIs out; we would love to kill the old ones off, ideally in the next release. Basically, as soon as possible.

[Q] This is really exciting. Is it ready to try out now?
[A] It's in-tree, it's ready to go. All of the stuff that you saw was running on in-tree code.

[Q] So we can also hook up our own front ends?
[A] Yes, that's exactly the point of the layer wrappers: so that you can wrap up your front end. Again, the idea is to link compiler inputs, so as long as you can tell us what symbols a module provides, you can store that module in whatever representation you want. I'm in the process of upgrading the Kaleidoscope tutorials at the moment, and they talk about exactly this. There's a Kaleidoscope chapter where we add ASTs directly to the JIT and compile lazily from the AST. In that case, you would have an AST materialization unit, and it would just have to scan your AST and find out which symbols that bit of AST defines. As soon as you tell us that, we can add JITDylib symbol table entries, and then when somebody looks them up, we turn around and hand your AST back to you and say, "Could you please compile this AST?"

[Q, partially inaudible] For doing ... compilation, you might potentially want to...?
[A] Yes. There are a bunch of other pieces you would need to make that work, but it is potentially useful for that.

[Host] All right, let's thank our speakers again. Great demo. [Applause]
Info
Channel: LLVM
Views: 1,775
Rating: 4.891892 out of 5
Keywords: LLVM Developers' Meeting, LLVM
Id: MOQG5vkh9J8
Length: 49min 53sec (2993 seconds)
Published: Sat Dec 08 2018