But Mummy I don't want to use CUDA - Open source GPU compute

Reddit Comments

I saw the whole talk, pretty good! There are so many compute stacks and OpenCL implementations out there, it's crazy: every driver implements things in a different way, has its own unique set of bugs, and so on.

His idea is good, and it would be great if vendors actually worked on such a common implementation, at least for OpenCL. I doubt it'll happen, though; most likely it's up to community contributors and the few paid employees like him to do this work.

πŸ‘οΈŽ︎ 27 πŸ‘€οΈŽ︎ u/aperture_synce πŸ“…οΈŽ︎ Jan 25 2019 πŸ—«︎ replies

I was wondering why Intel's Beignet wasn't mentioned at all. It already had good (I think) OpenCL 2.0 support several years ago, worked well on my Ivy Bridge laptop, and was completely open source. Of course, it had its own llvm/clang fork...

TIL it was deprecated a year ago:

Starting in Q1'2018, Beignet has been deprecated in favor of the NEO OpenCL driver (https://01.org/compute-runtime).

We encourage the existing Beignet community members to explore the new driver stack and provide feedback.

Beignet remains our recommended solution for legacy HW platforms (e.g. Ivybridge, Sandybridge, Haswell).

πŸ‘οΈŽ︎ 9 πŸ‘€οΈŽ︎ u/haagch πŸ“…οΈŽ︎ Jan 25 2019 πŸ—«︎ replies

With Intel releasing a discrete GPU, I wonder what they'll be doing in this space.

πŸ‘οΈŽ︎ 7 πŸ‘€οΈŽ︎ u/cp5184 πŸ“…οΈŽ︎ Jan 25 2019 πŸ—«︎ replies

As a GPGPU user (HPC developer), I have the utmost respect for the work Airlie is doing, and I wish him success in this unification of the open-source implementations. There are, however, a couple of things I would like to highlight:

  1. single source is a bait and switch; it's very good for prototyping, but when it comes to hand-tuned code, most major projects find themselves moving away from it sooner or later because of the loss of flexibility: there's a reason even NVIDIA has been experimenting with online compilation (NVRTC) since CUDA 7 (2015) and has officially supported it since CUDA 8 (a minimal sketch of the NVRTC flow follows this list); it would be better to prioritize a robust, complete implementation of OpenCL (possibly 2.x) over SYCL (which builds on top of it anyway);

  2. tooling is essential; one of the biggest advantages CUDA has over the competition is its profilers and debugger (which used to support OpenCL on their hardware, but no longer do); to get anywhere close to being as appealing an alternative, Mesa should provide hooks that allow similar tools to be built: a way to enumerate and collect all the performance counters available on each supported device, and their evolution across kernel execution, as well as the ability (on supporting hardware) to preempt execution and step through functions; if I'm not mistaken, similar things have been done (relatively recently, and largely thanks to Valve's involvement) for OpenGL (and possibly Vulkan?); exposing them for OpenCL (and thus ultimately SYCL) too would be a massive boon;

  3. finally, the ecosystem; most developers don't bother with any of these compute APIs directly; they rely on higher-level libraries (like the mentioned cuDNN, cuBLAS, or Thrust) that let them leverage GPU computational power without any knowledge of the hairy details; the hardest part of breaking NVIDIA's stranglehold on HPC will be getting FLOSS-friendly companies to cooperate on such libraries.
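To make the online-compilation point in item 1 concrete, here is a minimal sketch of the NVRTC flow; the kernel and compile options are illustrative and error checking is omitted. The generated PTX would then be loaded through the CUDA driver API (e.g. cuModuleLoadData):

```c++
#include <nvrtc.h>
#include <cstdio>
#include <vector>

int main() {
    // Device code held as a plain string and compiled at run time,
    // rather than baked into the executable at build time.
    const char *src =
        "extern \"C\" __global__ void saxpy(float a, float *x, float *y, int n) {\n"
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        "    if (i < n) y[i] = a * x[i] + y[i];\n"
        "}\n";

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "saxpy.cu", 0, nullptr, nullptr);

    // The target architecture can be chosen at run time, per device.
    const char *opts[] = {"--gpu-architecture=compute_50"};
    nvrtcCompileProgram(prog, 1, opts);

    size_t ptxSize;
    nvrtcGetPTXSize(prog, &ptxSize);
    std::vector<char> ptx(ptxSize);
    nvrtcGetPTX(prog, ptx.data());  // hand this PTX to the driver API to launch
    nvrtcDestroyProgram(&prog);

    printf("generated %zu bytes of PTX\n", ptxSize);
    return 0;
}
```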

πŸ‘οΈŽ︎ 6 πŸ‘€οΈŽ︎ u/bilog78 πŸ“…οΈŽ︎ Jan 26 2019 πŸ—«︎ replies

The talk is a newer version of this: https://www.youtube.com/watch?v=d94N2Lu4x9s

πŸ‘οΈŽ︎ 2 πŸ‘€οΈŽ︎ u/est31 πŸ“…οΈŽ︎ Jan 25 2019 πŸ—«︎ replies

A common implementation has less value than it seems, since you most likely need different algorithms for different GPUs anyway (see the sketch below for how that shows up even at the API level).

The whole point of GPUs is acceleration, so performance should always come before portability.
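That tuning pressure is visible even at the API level: portable code usually queries each device's limits and picks parameters, or entire code paths, per device. A minimal OpenCL sketch, assuming a single GPU is present; the 256-wide work-group policy is just an illustrative choice:

```c++
#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    // Per-device limits that portable code keys its tuning decisions off.
    size_t maxWg;
    cl_uint computeUnits;
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(maxWg), &maxWg, nullptr);
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(computeUnits), &computeUnits, nullptr);

    // Hypothetical policy: clamp the preferred work-group size to the device.
    size_t wg = maxWg >= 256 ? 256 : maxWg;
    printf("%u compute units, using work-group size %zu\n", computeUnits, wg);
    return 0;
}
```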

πŸ‘οΈŽ︎ 2 πŸ‘€οΈŽ︎ u/BibianaAudris πŸ“…οΈŽ︎ Jan 26 2019 πŸ—«︎ replies

Nvidia's CUDA is unique because it offers more computational functionality than standard OpenCL, letting CUDA keep some CPU-style workloads on the GPU rather than passing data to the CPU and back in some applications. In the world before 2017, when CPU power was expensive, CUDA was the good choice.

But with current CPU development, where AMD is pushing raw CPU power at an efficient cost, implementing both CPU and GPU together has become more advantageous, and so OpenCL is the better fit for combined CPU + GPU use. A 32-core CPU plus a 30+ TFLOPS GPU with large, fast memory is now possible at consumer prices. PCIe gen 4 will soon arrive in consumer-grade products; for most consumers its advantages are still unusable, but for OpenCL it will reduce many of the CPU-to-GPU latency issues in OpenCL applications.

If Nvidia doesn't see that coming and keeps pushing their "fancy" card prices up, CUDA will fall behind OpenCL. The advantages of using CUDA are becoming irrelevant, and that will soon affect CUDA's future development and adoption. Their ray-tracing marketing is one effort to keep their GPUs relevant, but that is only in visualization workloads, the one place where Nvidia still has a clear advantage.

Thanks to AMD, an OpenCL golden era is starting for the open-source community too, where CPU + GPU is the better choice for future development. "If you can't break through their strength, break through their weakness." The Radeon VII is a solid professional card at a consumer price, perfect for OpenCL development. It crushes any Nvidia offering in price, memory bandwidth, and memory capacity.

πŸ‘οΈŽ︎ 2 πŸ‘€οΈŽ︎ u/meme_dika πŸ“…οΈŽ︎ Jan 26 2019 πŸ—«︎ replies

We use compute shaders precisely because of the suffering OpenCL can bring down on one's head: it was a pain to configure and make work across different vendors.

Does anybody else use OpenGL compute shaders? They seem pretty unpopular and unused anywhere but within game engines. Are they so much worse than CUDA?
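For anyone who hasn't met them: an OpenGL compute shader is just a GLSL stage you compile and dispatch yourself, with no separate compute toolchain. A minimal sketch, assuming a GL 4.3 context and loader are already set up and with error checks omitted:

```c++
#include <GL/glew.h>  // any GL loader works; GLEW assumed here

// A trivial GLSL compute kernel that doubles a buffer of floats.
static const char *src = R"(#version 430
layout(local_size_x = 64) in;
layout(std430, binding = 0) buffer Data { float v[]; };
void main() { v[gl_GlobalInvocationID.x] *= 2.0; }
)";

GLuint buildComputeProgram() {
    GLuint shader = glCreateShader(GL_COMPUTE_SHADER);
    glShaderSource(shader, 1, &src, nullptr);
    glCompileShader(shader);

    GLuint prog = glCreateProgram();
    glAttachShader(prog, shader);
    glLinkProgram(prog);
    return prog;
}

void runOnBuffer(GLuint prog, GLuint ssbo, GLuint numElems) {
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo);
    glUseProgram(prog);
    glDispatchCompute(numElems / 64, 1, 1);          // launch the grid
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);  // make writes visible
}
```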

πŸ‘οΈŽ︎ 1 πŸ‘€οΈŽ︎ u/Freyr90 πŸ“…οΈŽ︎ Jan 25 2019 πŸ—«︎ replies
Captions
...to talk about not using CUDA. Thank you. Hi, I'm Dave Airlie. I work at Red Hat in Brisbane, mostly on graphics drivers for the last few years. You might have seen me talk before, but I've recently started getting more involved in compute and the CUDA area of the world; not directly in CUDA, because CUDA, as everyone in here knows, is a binary-only closed-source stack, which has no interest to me and which is very hard for me to recommend that anyone use. But what else can they use? So my talk today is "But Mummy I don't want to use CUDA". There was a reference to something my kids watched a lot when I wrote that title; remind me and I'll tell you at the end.

To give you an overview of what I want to talk about today: I'm going to start with some use cases, what compute is being used for in the world at the moment; then move on to what APIs exist to solve those use cases and what their status is; then what a compute stack is, what the pieces of a GPU compute stack are, how the stacks and APIs fit together and how they're packaged; and then a more speculative "this is what I'd like to see happen" future.

So I'll start with use cases. The main ones I hear about and see in industry circle around these areas. There's a lot of interest in AI/ML, particularly around the whole TensorFlow system; TensorFlow is built on top of a couple of other things like Eigen and BLAS, so you get a lot of BLAS and linear algebra software, and that's one area with a lot of interest at the moment. HPC is high-performance computing: usually massive supercomputers with very specific work sets, where the code is written on top of CUDA but is a very specific set of code just for that task. And then there's scientific computing that isn't massive clusters and isn't AI/ML: people who have data sets and their own processing algorithms they want to run. These are the main areas you generally see CUDA being used; it turns up in other places, embedded and so on, but these are the main ones I'm seeing at the moment.

So what is CUDA? It's an API defined by NVIDIA. There is no CUDA standardization body, there is no CUDA standard; there is just whatever NVIDIA decides the next graphics chip is going to have, and when they have new features they just make a new version of CUDA. It's all closed source; there are open-source implementations of some parts of the stack, and you can get an unaltered compiler, but generally it's all closed source and all controlled by NVIDIA. It's C++-based single source. I'll go into what that actually means a little later, but the idea is that you write software and the control flow is single source: the code that will run on the graphics card and the code that will run on the CPU aren't separated, they're all in one source file, you can read through it and track where the control flow goes, and when it's compiled, the compiler decides where each piece ends up.
Then there's a bunch of support libraries NVIDIA produce to support CUDA: cuDNN, which is used a lot with TensorFlow, cuBLAS implementations, and a number of other things. So CUDA is definitely the leader; what's out there that's up-and-coming or challenging it?

AMD have defined HIP. They may have set up a standards body for it, but I don't know who else is actually putting anything into it. HIP stands for, and this just rolls off the tongue, Heterogeneous-compute Interface for Portability; I had to write it down to remember it. The source code does exist, released on GitHub. HIP itself is again a C++-based single-source language, and they have a bunch of support libraries in the same mould, so in a lot of ways it's "I can't believe it's not CUDA", but they've actually released the source code to it. That's something I'll get into a little further on.

Another thing you often hear is that OpenCL is the competitor to CUDA: that's what we should be using, it's open, so why aren't we using CL for all of these fancy things? OpenCL is a Khronos standard, which means it's defined by the Khronos Group, which also does OpenGL and Vulkan; all of the big graphics card companies are members, Red Hat is a member, and there are implementations from the different graphics vendors. OpenCL's history is a bit weird: they released the 1.2 standard and everyone started implementing it, then they brought out 2.0, 2.1, 2.2, and NVIDIA essentially said "I'm going to stick with 1.2", and uptake flatlined. Even as NVIDIA start bringing up 2.x, and they are, everyone else was thinking "we'd be competing against the NVIDIA stack running on the same graphics card", so it's been hobbled a little.

One big difference between OpenCL and the previous two, which actually answers the question of why you can't just do these things with OpenCL, is that it's not single source. When you write an OpenCL program, you write a chunk of C (or C++) that's going to run on the graphics card, and separately you write the C or C++ that runs on your host, with kernel-launching API calls to launch those kernels on the graphics card. You can't easily follow the control flow or the data flow, because it has to hop across between the two sides; it's just a different concept. And a lot of the pieces here, especially the TensorFlow and Eigen stacks, are very single-source C++ and very template-heavy, and originally OpenCL didn't even have C++ support; C++ kernel support only started happening recently, and I don't think it's picked up much yet.

The other big difference is that OpenCL has both online and offline compilation. With CUDA you generally do all the building up front and get a binary at the end; when you run it, it might get re-optimized, but that binary is what gets executed. With OpenCL, the OpenCL C source code ships inside your binary, and when you run it, the driver parses all that C code and runs it through a full compiler at that point, when you're executing it for the first time (it may cache the result).
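To make the separate-source model he's describing concrete, here is roughly what minimal OpenCL host code looks like; the `scale` kernel is illustrative and error handling is omitted. Note that the full device compile happens inside `clBuildProgram`, at run time:

```c++
#include <CL/cl.h>

// The device code lives in a separate string, parsed and compiled
// at run time by the driver's built-in compiler (the "online" path).
static const char *kernelSrc =
    "__kernel void scale(__global float *v, float a) {\n"
    "    size_t i = get_global_id(0);\n"
    "    v[i] *= a;\n"
    "}\n";

void runScale(cl_context ctx, cl_device_id dev, cl_command_queue q,
              cl_mem buf, float a, size_t n) {
    cl_program prog =
        clCreateProgramWithSource(ctx, 1, &kernelSrc, nullptr, nullptr);
    clBuildProgram(prog, 1, &dev, "", nullptr, nullptr);  // full compile here

    cl_kernel k = clCreateKernel(prog, "scale", nullptr);
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(k, 1, sizeof(float), &a);

    // Explicit launch: host and device code never share control flow.
    clEnqueueNDRangeKernel(q, k, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
    clFinish(q);

    clReleaseKernel(k);
    clReleaseProgram(prog);
}
```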
With later versions you can also go offline: you run the compile in advance, get a binary back, and hand that binary over later. But it's messier for distribution; you lose the single executable with everything in it. A newer feature that's been added to OpenCL is SPIR-V (I'll do another segue later on what an IR is). They realized there was a problem with shipping either a big chunk of source or a big chunk of vendor-specific binary: you don't really want to be leaving your fully annotated C source inside the binary you give to people. So they decided to have an intermediate representation: the C code gets compiled down into this binary that's not quite the finished work, and then the driver "finalizes" it, which is just compiling it, but they call it finalization to distinguish it from the other stage of compilation.

So that's roughly where things were until recently. Recently a newer Khronos standard has sprung up, and it is recent: the final version of the standard isn't that long out. It's called SYCL. It's a Khronos standard again, but it's C++ single source, so it's solving the "I want to write one program, in one place, with control flow that's easy to understand, that will run where applicable" problem. Generally, if there's no graphics chip it will launch on the CPU side using something like OpenMP, and if you have a graphics chip it will use the OpenCL interface to actually launch the kernels; the things it launches will be either that SPIR-V or binaries.

Implementations so far: most of this was driven by a company called Codeplay. They've got a closed implementation, and they're pretty big in the standards committee. I've been working with Xilinx and Red Hat on a thing called triSYCL, an open-source runtime, and trying to get the compiler pieces ready; it's been more of an educational exercise in figuring out how all of this fits together. I gave this talk, almost verbatim except this bit changed, about two months ago in Vancouver at the Linux Plumbers Conference, and between then and now, just last week, Intel came out on the mailing list and said they really want to upstream their Clang port of SYCL and their SYCL runtime, and asked how we could help. We're currently waiting on it just getting onto GitHub first, but this looks like a pretty big chunk of the puzzle actually arriving in the next few weeks. It will probably take a bit of time to upstream, but they seem committed, they've asked me about helping out, and the folks at Xilinx are going to talk to them as well. So this was a big change, and it may bring things along quite quickly.
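For contrast with the OpenCL host sketch above, a minimal single-source SYCL version of the same idea (written in the SYCL 1.2.1 style current when this talk was given): host and device code share one C++ file, and the compiler splits the lambda out as the kernel:

```c++
#include <CL/sycl.hpp>
#include <vector>

int main() {
    std::vector<float> data(1024, 1.0f);
    {
        cl::sycl::queue q;  // picks a default device (a GPU if available)
        cl::sycl::buffer<float, 1> buf(data.data(),
                                       cl::sycl::range<1>(data.size()));

        q.submit([&](cl::sycl::handler &cgh) {
            auto v = buf.get_access<cl::sycl::access::mode::read_write>(cgh);
            // This lambda is the device code; the compiler carves it out of
            // the same translation unit as the host control flow around it.
            cgh.parallel_for<class scale>(
                cl::sycl::range<1>(data.size()),
                [=](cl::sycl::id<1> i) { v[i] *= 2.0f; });
        });
    }  // buffer destructor copies results back into data
    return 0;
}
```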
Other things in the area: there's the C++ AMP standard, a Microsoft standard. I haven't looked too deeply into it; it's been around a while, and I think AMD have had stacks for pieces of it. There's OpenMP, the CPU parallelization thing; it's getting better for GPUs, but people keep saying "you should just use OpenMP", and they're generally people who work on CPUs and don't understand that it doesn't really work on GPUs yet, because the sort of threading you need isn't the sort of thing OpenMP really gives you. The OpenMP people are working on this too, but it's an open question what will actually happen there. There's OpenACC, in a similar vein to OpenMP; NVIDIA work on that a bit and there's GCC support for it, but I haven't looked into it lately.

There's also a kind of outlier: the Vulkan API that I mentioned earlier, which someone has talked about this week. Vulkan has the ability to launch compute shaders, and it's a very low-level API. The problem is that the compute shaders it launches are not the same thing as the kernels OpenCL launches; they're quite different beasts. They can both be expressed as SPIR-V, but when you get down to what they actually do, they run quite differently. At first Vulkan didn't have pointers, which is a feature people wanted, but Vulkan just got pointers in an extension, and that made a whole pile of problems much easier. The Vulkan approach to standardization is also quite different: they want good solid features, and they want them with implementations; they don't want people saying "this is a great feature, we should put it in the standard", and then two years later nothing has implemented it.

So where is this all going? Really, the future is C++ 20-something; I've left two question marks there, it could be 23, who knows, hopefully 20. There's a lot of push from all of these groups to get this into a C++ standard: this should just be part of the language, so why are we all implementing layers on top? There is standards body work going on; the SYCL people, the OpenACC people, the OpenMP people, everyone is feeding their work in. Someone from Red Hat works on it, meetings happen quite often, and they discuss it. One thing they've realized, at least at this point with the C++ standards, is that there's no point just writing a standard without validating that it's useful, and that's what a lot of these implementations are for: "this is the way we think it should work, here's our implementation of it, could we get that in the standard", taking bits from other places and trying to converge all the existing implementations into one single C++ standard. And even when you have that, it's just your programming language: you still need something to run it on, you still need a runtime and all these other things, and we'd have to build those parts of the stack anyway, so there's no point sitting around waiting for the C++ standard to finish and only then starting; we may as well start moving it forward now.
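For reference on the OpenMP point above: GPU offload in OpenMP works through `target` directives, loop annotations plus explicit data mapping, rather than the explicit kernel model he's discussing. A minimal sketch, assuming an OpenMP 4.5+ compiler built with offload support:

```c++
#include <cstdio>

// SAXPY via OpenMP target offload: the same loop-annotation style used
// for CPU threading, extended with data-mapping clauses for the device.
void saxpy(float a, const float *x, float *y, int n) {
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1024;
    float x[n], y[n];
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy(2.0f, x, y, n);
    printf("y[0] = %f\n", y[0]);  // expect 4.0
    return 0;
}
```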
So that's the overview of where things are at the moment. Next: what are the pieces, and what does the stack look like? Here's a generic picture of a compute stack, mostly a single-source compute stack. At the top you've got your application source. You feed it into your compiler; the compiler looks at it and decides this chunk of code goes on the host and this chunk goes on the device, and compiles both. Generally it parses all the C++ into an AST and then figures out which bits go where. It converts the host code into native object code and the device code into IR or a binary, the intermediate representation; I'll get into why the IR is what you want and binaries are a bit messy. Then it spits out an ELF object file, and that's your executable. What's inside your object file is two pieces: a chunk of CPU code, and a chunk of GPU-executable code that you want to hand to the GPU at runtime. Then there's the other stage: I've got this, how do I run it? You launch it on top of some sort of runtime stack, and the runtime stack generally has a few support libraries, a bit of runtime, the kernel driver for your graphics card, and then the hardware. So it's not hugely complicated, though I am making it look a little easier than it is.

OpenCL is a bit different because of that online/offline split; it allows both. When you actually go to run an OpenCL application, you may need a full OpenCL compiler in there to do all the compilation from OpenCL C down to the stuff you give to the runtime, and you may have another IR compile to do after that before it reaches the kernel side. The offline path is much the same as before: pretty much a support library, the OpenCL runtime, and the kernel driver. But if you go online, you're carrying around a full C or C++ compiler; that's not small, and it's generally not fast.

A little diversion, because I keep saying IR without really telling you what an IR is, and I think you now know enough that I can tell you. Intermediate representations are a compiler-world thing: you have a high-level source language and you have low-level binary, and anywhere in between those you're going to have representations of your code. These representations are called intermediate representations because they sit between the two, and you're always lowering from one towards the final one. Some compilers have two or three of these; in the Mesa stack, I think there are paths where you could go through four of them. They all have different places in the stack: some are better for transporting across the wire, some are better for compiling, some are better for compiling with little memory, and so on.

A few examples from this compute space. You generally hear about NVIDIA PTX. NVIDIA have a low-level binary format they call SASS, but they don't show you that; they don't document it, they don't
tell you about it. There are people who have reverse engineered it, but NVIDIA aren't interested in revealing it. So what their compiler spits out when you build with CUDA is generally a binary with a big blob of PTX and a few SASS binaries beside it. If it lands on a graphics card they hadn't seen before, a new one, they can optimize the PTX and the program will still execute; it may not be optimal and you should go back and rebuild, but at least it will run. PTX is like a low-level assembly that they can translate into the actual hardware assembly.

AMD have their GCN binaries: there's no real IR in their HIP stack, they always give you the binary, and really they give you two or three different binaries, because different graphics cards need different binaries.

Khronos defined SPIR-V. SPIR-V is an IR that's defined pretty well: it's a binary format, it's SSA, and as these things go it's a pretty high-level IR. And it's got two variants. People say "it's SPIR-V, anything should be able to run it", but no: OpenCL SPIR-V and Vulkan/GL SPIR-V are different, and while the Vulkan and GL flavors are mostly similar, the OpenCL and Vulkan ones are very different beasts. You can use the same tools to look at them, but in terms of executing them they need a lot of different resources and different systems. So just saying "SPIR-V" isn't meaningful enough; you need to say SPIR-V with OpenCL kernels, or SPIR-V for Vulkan shaders. Those are the IRs you see at the edges of the compute stack, but you also have IRs hidden inside it: in Mesa's case we've got a thing called NIR, an internal intermediate representation, and you'll see LLVM IR used inside LLVM and between tools. Those internal ones are artifacts of the compiler stack, though sometimes we do want to look at them; the first examples are the ones you actually see at the boundaries.

So, on to OpenCL stacks. We have a lot of OpenCL stacks; some vendors have several of their own and don't need our help. The main ones: AMD have two, one based on their ROCm stack and one on their PAL stack, and they're converging on the ROCm one; either way it's vendor-specific, and there's no way you're getting that code ported somewhere else. NVIDIA have their own OpenCL stack, which is all binary. And Intel have recently announced their OpenCL stack, which is Neo. This is what the vendor OpenCL stacks look like: they all fork LLVM at some point, they all contain a big chunk of LLVM and probably a big chunk of Clang, and everyone has their own. It's complicated, and that's pretty much the message to get across here.

So what does the generic stack picture I showed you earlier look like for CUDA? It's pretty much: CUDA application source code, a compiler with a device compiler inside it, NVIDIA PTX as the IR coming out of the device side, native object code from the host side, all stuck into an ELF object file; a straightforward instance of the generic picture. And in execution, it's the CUDA libraries, the CUDA runtime, the CUDA driver, and NVIDIA
hardware, all of it generally binary, though you may get some of the libraries open source.

AMD have their ROCm stack. The idea is that you can write CUDA code and run it through this thing called HIP, which translates it: sorry, you run it through a translator to get HIP code, then you compile the HIP code, it goes into the device compiler, and it produces AMD-specific binary code that goes into the ELF fat binary. The big extra step is that if you're compiling CUDA here, you pass it through the HIP translator first, and it goes in and replaces all the "cuda"s with "hip"s and a few hundred things like that; there's a lot of sed, or Perl or something, in there. Then when you execute it, you've got the language runtime, the ROCm runtime and stack (it's a bit unclear whether they can run it on different kernel drivers), the kernel driver, and then the AMD hardware. This stack is pretty much completely open source, or at least the sources are published and the licenses they're under are open source; I'll give you my distinction on that in a minute.
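The textual translation he describes applies to host API calls as well as kernels; a minimal host-side sketch of what hipified CUDA code ends up looking like (illustrative buffer size, error checks omitted):

```c++
#include <hip/hip_runtime.h>
#include <vector>

// Host-side allocation/copy calls after the translator has textually
// rewritten the cuda* entry points into their hip* equivalents.
int main() {
    const size_t n = 1024;
    float *d = nullptr;

    hipMalloc((void **)&d, n * sizeof(float));   // was: cudaMalloc
    std::vector<float> h(n, 1.0f);
    hipMemcpy(d, h.data(), n * sizeof(float),
              hipMemcpyHostToDevice);            // was: cudaMemcpy(..., cudaMemcpyHostToDevice)
    hipFree(d);                                  // was: cudaFree
    return 0;
}
```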
Then we've got the Intel stack. The Intel stack is just OpenCL: they don't have a CUDA implementation sitting on top of it, they don't have a HIP implementation, and they do support SPIR-V for execution. So it's not quite in the same category as the other two, which are pretty much focused on getting CUDA code running, with OpenCL as an extra; Intel have come out with OpenCL and I think are still figuring out what to do about the CUDA-type situation.

And then there's the OpenCL stack that sits in the Mesa tree, which got half-written by AMD before they changed direction at one point, and then other people threw a few bits into it, but no one has really sat down and driven the thing; people have mostly been doing educational work, playing around adding things to it. So it only does OpenCL 1.1; I think the main thing missing for 1.2 is printf, and when you think about printf on a GPU, it's kind of hard, so it's not trivial, though we could port one from somewhere else, no one has bothered yet. It's based on the Gallium drivers that we use for all of our OpenGL work, so we already have drivers for this hardware that can execute pieces of code, and we use those for all the accelerated graphics work; why not use them here too? There's support for some AMD hardware, Nouveau is getting support for more features of it, and Freedreno, the Qualcomm Adreno hardware. We also had the realization that we're going to see a lot more ARM accelerator chips that all want OpenCL as well, and every vendor running its own stack for all of these is going to get sad pretty quickly, very messy. So that gives you a view of where we're at: Mesa needs work, but a lot of the pieces are there.

Now, what do these vendors mean when they say they've open sourced something? NVIDIA are easy: NVIDIA never open source; they don't even like to use the word. They've got their binary, so you know where you live; you know you're not getting great support, because you can't figure out what anything does. AMD come out and say "we've got our ROCm stack, it's open source", and Intel come out and say "our Neo OpenCL stack is open source", and I can't argue: under the licenses, they are open source. But in terms of development model, it's very hard to do anything with them. They're treating open source as a release model, not a development model, and we all know the benefits of open source come from the development model, not the release model: it's how you develop things that produces the better results, not the fact that everyone can just get the source. Maybe we didn't always know that, but we all know it now.

They also tend to exert control. It seems the first question you ask someone when they produce a new stack is: will you support us porting this to your competitors' hardware? "Yeah, no." Will you take our patches back upstream to support that? "Yeah, no." You go around the vendors and ask the same question of each one; it's the first question in the hour-long meeting, and then you don't need to listen to the rest of the hour-long meeting, because they've answered the only question that matters. You have no way to involve yourself in the development process. Yes, you could probably submit bug-fix patches, but in terms of feature additions, or getting insight into roadmaps, or getting insight into where it's all going, it's just not there.

It makes it hard, as a distro vendor, which I ostensibly am, to figure out how to put these into your distro. How do I know it's not going to regress every time I get a pile of code thrown over the wall at me every three months? How do I validate that they're not just redesigning it internally without telling me, so that six months later it's all going to be a new stack? You're entrusting yourself to this. It's like: "oh, we saw how Linux worked, and we'll just ignore that lesson and do it all ourselves". All of these companies do contribute to Linux, and in some places they develop user-space stacks collaboratively, but other parts of the same companies still haven't learned those lessons. All they heard was "open source", and to them that just means putting the source on GitHub; they didn't hear the "no, you have to work with your competitors, you need to build on a single shared stack" part. A single stack is a much easier thing for a distro to package, a much easier thing to get in front of your customers and support, a much easier thing to build tools around. How do I build a debugger for this? Well, there's a CUDA debugger; oh, but I want to use ROCm, and here's a whole new stack with a whole new debugger and a whole new interface. They're still not on the right road.

And from a distro point of view, even if I were going to accept this, which I'm not (I'm not actually in charge of that decision, but I get input into it), these are very large bodies of code, there's very little common code, and everybody forks LLVM and Clang on a different day and carries their own stack of patches on top. One of my tasks previously at Red Hat was maintaining our LLVM packaging for Fedora and RHEL just so we could use it in the graphics drivers, and that was quite a sink of time for maintaining just one fork of LLVM and Clang. I can't imagine, even if I could find the time, having the sanity left to do this three or four times a year: collating these rapidly diverging forks of LLVM and Clang into one coherent place, or even just managing four of them, and trying to figure out how the shared libraries work. Because if you accidentally load the wrong
LLVM, if one application opens two of these stacks and they each bring their own chunks of LLVM and Clang without symbol versioning around them, that's not going to end well. And these aren't problems the vendors see, because they only care about their own stack, and they would never contemplate that anyone might want to run software across different graphics cards. "Why would you ever want a choice? Just buy our graphics card." It doesn't come into their design decisions.

So I decided to look at what's out there now and what I could propose as a stack. What's my vision of what this should look like? First of all, we need at least a reference implementation. It doesn't have to be the fastest thing; it doesn't have to wipe NVIDIA CUDA off the map. It just has to be a solid reference implementation that's vendor-neutral: the same code, as much of it as possible, shared across all the devices, so you could build something on your computer, take the graphics card out, put a different graphics card in, and it still runs. Scary, I know. That might not matter if you're in HPC, because you're unlikely to swap out all the graphics cards in a supercomputer overnight, but to get this sort of technology out in front of people and make it useful for actual hackers, you need something where you can hand binaries to other people, or hand them the source, and it should just work.

So it needs to be based on a shared codebase; maybe not one overarching project, but if there's a Clang contribution, it should be in Clang, not a fork of it; if there's an LLVM chunk, it should be in LLVM; and runtime libraries should have a home somewhere, either in LLVM or in the freedesktop area. It should be standards-based: CUDA, as great as some people tell you it is, is only great as long as it does what you want, because you can't get any input into it. Standards-based things are generally slower to happen, which is why SYCL is only happening now, but they're a better thing to base your future on, because they're less likely to rapidly go away or diverge into something you can't get support for. I'd like a common API at the top for running applications, a runtime API that's common as much as possible, and a common IR inside the ELF, so that you could include other things in there, but at least the base IR is something that can be executed across all the vendors we have. And then I'd like to enable common tooling. Maybe GDB isn't the greatest debugger for this sort of task, but if someone wants to make GDB work with this, it should work by plugging into the stack across all three vendors; it shouldn't be "port it to all these different APIs" (they'd still have to figure out how debugging works at all in the first place), but when they're finished, it should be one GDB that everyone knows how to use. That all seems fairly reasonable to me.

So this is where I ended up: a proposed stack. It's the same picture from earlier, but with all of the pieces that are other people's replaced by pieces that are ours. The source code would be C++ with SYCL, because SYCL is an actual standard; even joining Khronos
isn't simple, but it's at least a standards body with people who care. We'd have a single Clang SYCL front end; the LLVM-based device compiler would come out of that and produce the SPIR-V IR, and we'd put the SPIR-V into the ELF object file along with the native object code. Then for the runtime piece, we'd have the application, some SYCL libraries, and a SYCL runtime based on the Mesa stack that takes the SPIR-V code and finalizes it into whichever vendor's GPU binary format you're running on, which then gets executed by the GPU. The idea is that the SPIR-V code would run under OpenCL, the same across all the vendors, and the driver would be a Mesa-based system with the Gallium drivers.

There is an option to do this on top of Vulkan. Currently it won't work, but I see Vulkan being where we actually want to go; OpenCL just seems like a stepping stone towards it, and Vulkan is a lot better for this sort of thing. Its SPIR-V implementation just doesn't have all the features yet, but since it just got pointers, which was the big missing piece, things are starting to converge, and I suspect we'll do the OpenCL thing first and then worry about adding Vulkan as a nice feature. The runtime would be based on Clover, the Mesa stack's OpenCL runtime. We'd minimize the GPU-specific code, abstract it out as much as possible, use the low-level Gallium drivers we already have, use the same SPIR-V passes we have right now, and use the NIR intermediate representation down to the hardware. Some driver backends might want to use an LLVM finalizer; the AMD code currently always goes NIR to binary via the LLVM compiler, but that's more of an implementation detail inside, not something you'd worry about in the bigger picture.

One thing that just happened, as I said, a couple of weeks ago, and will hopefully actually get open-sourced soon, is pretty much this slide: Intel's plan to actually work with the community covers most of it. When I gave this talk two or three months ago, I was facing into doing this myself; now I'm facing into a different task, helping someone else's code get upstream, which I'm actually better at than doing it from scratch, so it works out well for everyone. On the runtime side, we have most of the pieces in various places. For a while we had Gallium drivers for just AMD and just NVIDIA; now Intel have started working on a Gallium driver and are going to produce one, so we'll have the low-level execution driver for all three vendors in one place. A few other people in Red Hat have been working on getting SPIR-V executable on top of Nouveau, so that the HMM feature in the kernel gets an actual open-source user and doesn't get removed from the kernel; that's pushing ahead quite well, with basic examples working and running at the moment, so the idea is that you can execute a SYCL or SPIR-V kernel on an NVIDIA machine. And when I started playing with this about six months ago, the first task I did was get a SPIR-V kernel executing on my AMD card. So the pieces have all started slowly
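The "hand the driver SPIR-V and let it finalize" step he describes maps onto an existing OpenCL entry point today; a minimal sketch, assuming an OpenCL 2.1+ driver with SPIR-V support and a precompiled `kernel.spv` (a hypothetical file name):

```c++
#define CL_TARGET_OPENCL_VERSION 210
#include <CL/cl.h>
#include <fstream>
#include <vector>

// Load a precompiled SPIR-V module and let the driver "finalize"
// (compile) it for whatever GPU happens to be present.
cl_program loadSpirv(cl_context ctx, cl_device_id dev, const char *path) {
    std::ifstream f(path, std::ios::binary);
    std::vector<char> il((std::istreambuf_iterator<char>(f)),
                         std::istreambuf_iterator<char>());

    cl_int err;
    cl_program prog = clCreateProgramWithIL(ctx, il.data(), il.size(), &err);
    clBuildProgram(prog, 1, &dev, "", nullptr, nullptr);  // finalization step
    return prog;
}
```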
coming together on the Mesa side, plus the pieces from Intel; we just have to get the dots joined and solidified, and the rest will follow afterwards. In terms of where this is actually happening, it's a lot further ahead than when I first gave this talk two or three months ago. It's probably not far enough ahead for anyone to actually use yet, which is unfortunate, because the whole point of the talk is "can I not use CUDA". If you want another binary stack and want to use SYCL today, you can just use the Codeplay stack; you're just picking a different hole, but it's at least a hole we have a better path out of, eventually, via SYCL. So I'd encourage people who are into CUDA, or who have people using CUDA in their office, to spread the SYCL idea: get them to look at it and tell us what's wrong with it or where it needs changes. I'm slowly learning C++ to actually understand a lot of this; that's been a bigger chunk of my time than anything else. Any questions?

Q: Great talk. Are you aware of any MoltenVK-like initiatives for SYCL or OpenCL, to run this stuff on Apple?
A: Just to clarify, MoltenVK is the layer that's been worked on to run Vulkan apps on top of Metal, the Apple thing. There has been some talk of doing SPIR-V execution on top of it, but again, Vulkan doesn't have all the missing pieces yet, and I haven't done an analysis of what Metal has; I'm not sure Metal compute shaders are actually capable of all the features we need. I suspect someone is looking at it, but I don't know how far it's gotten.

Q: On the FPGA side, there are also OpenCL-to-HDL compilers; can those somehow fit into this Mesa stuff, or is it completely orthogonal?
A: I've been working with Xilinx on this on and off for the last three or four months. Xilinx have a couple of engineers very interested in SYCL, and they're looking to run SYCL applications. They're unsure whether SPIR-V can actually do what they need, but I'm trying to encourage them to get SPIR-V enhancements done so they can stop doing what they currently do, which is passing raw LLVM IR, which changes with every compiler version and isn't stable, as the interface to their last-stage compiler. The idea would be that we could share the upper stack; the execution environment is trickier, because you have to do the HDL part, which those people tend not to tell us much about. But there was a talk here from Tim about open-source FPGA work, so maybe there's a space to do something independent of the vendors themselves.

Q: You mentioned Intel, AMD, and NVIDIA from a GPU perspective; what about things like Freedreno, would that still be able to leverage the whole SYCL stack?
A: Yes, the idea covers the ARM drivers as well. One of the big contributors to some of the OpenCL IR code in Mesa is Rob Clark, who did the Freedreno driver; he's working on it because he needs a CL solution for Freedreno, so we are planning on Freedreno. I think there's also a possibility for the Panfrost open-source Mali folks and the others once they've got a baseline up and running, because they're building a Gallium driver, and once you build a
Gallium driver, you just have to validate that it can run this. Any more questions?

Q: Intel are also going to be building FPGAs integrated onto the processor; is that team also talking to the GPU side?
A: Intel are firmly in the "we don't want to tell you how to build FPGA bitstreams" camp, unfortunately, which means for any stack they produce, they could share the top part, but the bottom part would be proprietary-only. They have an execution environment, but all that does is put the FPGA bitstream onto the FPGA; it doesn't actually help you build it. So I think FPGAs are going to take a bit more work to bring into the fold.

Q: Where does OpenCL ES fit in?
A: Not really anywhere; there isn't really an embedded variant of OpenCL, it's pretty much just OpenCL. Beyond that, the OpenCL people have realized some of their mistakes and are doing a next-generation OpenCL effort to try to fix the API. One of the big problems with OpenCL, which I have said on camera before, is that they took a bit of a kitchen-sink approach: anyone who wanted any feature got it thrown into the standard, and when the hardware came out two or three years later, the vendors said "we never implemented all that stuff you put in the standard", and the answer was "oh, we just thought hard about it". So now we're hitting those features and realizing there are things in the standard nobody can implement, or that won't get implemented properly. I think there's going to be a push on that, but from my point of view, I'd rather things concentrated on the higher-level SYCL language. SYCL has OpenCL interop, but I'd rather nobody went near that; I don't think it's the best plan.

Q: Can things like Eigen and TensorFlow be ported?
A: TensorFlow has already been ported to one of the preliminary SYCL releases, but current TensorFlow master is not compatible with the current release of the standard, and I've had a patch to TensorFlow, three lines of include headers, waiting for about six months. One of my next focal points is to try to gather everyone doing their own TensorFlow ports, pull them into one place, and start getting upstream to care about it, because so far upstream just doesn't have the care factor.

Q: One more: when you're doing this preliminary work, are you working on AMD or Intel? You're clearly not working on Nouveau.
A: No, I've been mostly doing it on one of my AMD machines, but I have done some on Intel. I actually took a binary I built on the AMD machine, pulled it onto my laptop around October or November last year, and executed it, so it definitely works; but that's mostly because the Intel people had all the bits in the right place, and I merged them together and, look, I can execute things now.

Q: If people wanted to join in, which card would you recommend working on?
A: I'd probably be targeting AMD at this stage, but Intel's not far away; you can build an Intel stack that can do this, and Intel engineers are actually working on pieces of it, so it's definitely possible.

[Applause]
Info
Channel: linux.conf.au
Views: 81,667
Keywords: lca, lca2019, #linux.conf.au#linux#foss#opensource, DaveAirlie
Id: ZTq8wKnVUZ8
Length: 43min 11sec (2591 seconds)
Published: Fri Jan 25 2019