The Magic of RISC-V Vector Processing

Video Statistics and Information

Captions
Today we're taking a glimpse at an underdog of an instruction set, but this underdog might reshape the landscape of the computing industry. This is no ordinary RISC-V board. The Kendryte K230 is the first widely available consumer RISC-V processor with the ratified 1.0 vector extensions. Let's get wired in.

"...vector. The next instruction, vadd, is going to..." Whoa, whoa, whoa, whoa, hold on. Before we get into the raw assembly instructions, let's talk a little bit about what RISC-V is, what the heck vector instructions are, and why the 1.0 spec is so important.

RISC-V is an open standard instruction set architecture, or ISA. ISAs are like blueprints for a CPU, defining how instructions are used and executed. Over the years, many proprietary ISAs have been created, like x86 or Arm. These ISAs are not only costly to license, but it's almost impossible to get new instructions added to the spec without very powerful connections. What makes RISC-V special is that the standard is completely open and free for anyone to use. Researchers, developers, and scientists can build and modify their own RISC-V CPUs without any licensing fees. It's also extremely modular. The base RISC-V ISA is quite simple but allows for optional extensions. Silicon makers customize a RISC-V CPU with just the features they need, increasing efficiency. RISC-V is used everywhere, from tiny IoT devices to full machine learning accelerators, like Google with their Tensor Processing Units. One of the key extensions that highlights RISC-V's flexibility is its support for vector instructions. But what are vector instructions, and why should we care?

[Music]

Imagine you're in a bakery. You need to bake some cookies, and you have two options: you can either bake them one at a time, also known as scalar processing, or you can use a big tray to bake dozens of cookies at once, which is vector processing. Which do you think is more efficient? The large tray, right? This is essentially what vector instructions do. At a basic level, they allow the CPU to perform the same
operation on multiple pieces of data at the same time in a single instruction. This type of computing is incredibly useful for things like image processing, scientific simulations, machine learning, and more. Let's jump back into the computer and see a real-life example.

The Mandelbrot set is a complex fractal that requires intensive calculations to visualize. I've written some code to compare two approaches: a traditional method and a vectorized method. In the non-vectorized implementation, each point in the fractal is computed individually, which is inefficient. However, we can use vector instructions to process multiple points simultaneously. Each one of these fractals went through 1,000 iterations, and as you can see, the vectorized version was computed significantly faster, in this case three times faster. On this x86 CPU, NumPy takes advantage of assembly instructions like vmulps and vaddps to perform parallel multiplications and additions. The RISC-V spec has similar speedups, with a few additional tricks up its sleeve.

The RISC-V vector extension is a bit special, using something called vector length agnosticism, or VLA. Sounds pretty complicated, but let me explain. Going back to our baking analogy, vector length agnosticism is like having a magical baking tray that can dynamically adjust its size to fit different ovens. Traditional CPUs like x86 use fixed-length vector registers, which is like having fixed-size baking trays. We started with small ones like MMX (64-bit) in 1997, then added bigger ones over time. Eventually x86 got things like SSE (128-bit) in 1999, AVX (256-bit) in 2008, and AVX-512 (512 bits) in 2013. But there's a problem: each time a new tray size was introduced, the bakery had to keep using all the old trays too for backwards compatibility, which makes things much more complicated. The RISC-V vector instructions are different. Instead of using a fixed tray size, RISC-V has a tray that can magically adjust to any size. This means we can use any oven, big or small, without needing to keep
multiple tray sizes around.

So why does this vector length agnosticism matter? It means that the same RISC-V vector code can run on different size CPUs without modifications. Whether it's a single-board computer like this guy or a full supercomputer, the compiled binary can remain the same. That means that as hardware evolves, VLA ensures that existing code can take advantage of new advancements in vector processing without recompilation. It's effectively future-proof.

That's not to say that RISC-V hasn't had its fair share of challenges, particularly when it comes to the vector implementation. You might be asking yourself, "Laurie, what is so special about this particular board?" After all, SBCs like the Milk-V Duo, BeagleV, and StarFive's VisionFive 2 have been out for a while. This little guy, the K230, is one of the few boards as of today with the ratified 1.0 vector extension specification. All those other boards I mentioned earlier either don't have vector extensions at all or use the 0.7 draft spec.

Here's what's tricky about the earlier 0.7 draft spec: it's a preliminary set of vector instructions, more of a proposal to get feedback from developers and guide the development of the final spec. The 0.7 spec isn't really designed for your everyday computing projects. It was a way to beta test what was yet to come. The 1.0 spec is a fully ratified standard with detailed documentation and code examples, and it will serve as the baseline going forward for future backwards compatibility. Toolchains, compilers, and packages can start to target the 1.0 spec now without fear of becoming obsolete anytime soon. Unfortunately, most programs compiled for the 0.7 draft spec probably won't execute correctly on 1.0 hardware due to some changes in instruction formats and encodings. Many edge cases have been smoothed out, especially for instructions like vslideup and others, ensuring more consistent behavior across data types. In any case, now that the 1.0 spec is here, there's no good reason to continue tinkering around
with obsolete hardware.

Enough chitchat. I think it's time for some assembly. To very briefly explain what we're working with here: this is a CanMV-K230 version 1.1 single-board computer, which contains a Kendryte K230 system on a chip. If we take a closer look at this SoC, you'll notice that the CPU cores are actually C908 cores manufactured by XuanTie. One has the vector extensions and is a bit faster, and the other core is smaller, slower, and doesn't have any vector speedups. Today, for simplicity, I'm just running Debian on the 5.10.4 kernel with just the larger 1.6 GHz core with the vector instructions.

Here I am connected to the board right now, and I already have an application that I pre-compiled to demonstrate using the vector instructions to add two vectors together. Here's our binary. If we want to verify the instruction set architecture, we can use the file command. Let's just do file vector, and indeed we have a 64-bit ELF binary that targets the RISC-V instruction set architecture.

Let's take a look at the code and walk through some of the vector-specific instructions that we want to take note of. Here I have the source code for that vector application that I mentioned earlier, so let's walk through the instructions that are specifically dealing with vectors. Right here, this vsetvli is going to be a really important instruction, but let's move on to some of the others and come back to this. First of all, we have two vectors that we're defining down here. Each of them has eight values inside of it. Each of these is going to be one word, which is 32 bits wide, and then we're allocating some space where we're going to be storing the result of the addition of these two vectors. First of all, we have some load address commands that are just our standard RISC-V instructions, which are loading the address of this data into the a2 and a3 registers. The next instruction is going to be very important. This stands for vector load elements, and
what this is doing is loading the elements pointed to here into our actual vector register v0. So we have specific registers that are designated for our vector operations, and interestingly enough, this size is going to change depending on the size of the elements that we're loading. For example, since we have the word values that we're using inside of our vectors, these are going to be 32 bits wide, hence we have the vector load elements 32.

The next instruction, vadd, is going to be the meat of our vector addition. This obviously is performing addition between two different vectors. The vv stands for vector-vector, the two operands that we're going to be specifying for this. Now, since we previously loaded the vectors into our v0 and v8 vector registers, we can use this instruction to sum these two vectors and then store the result inside of our v0 vector register. But now all of the result is inside of a vector-specific register, so if we want to store this back into the space in memory that we've allocated for this result, we're going to have to use another special vector instruction, and this is going to be vector store element 32. Again, the 32 depends on the size of the elements within the vector that you're trying to store. We can take this v0, which contains the result vector from our addition, and store this at the address in a4, which has the pointer to this location in memory, and we have successfully performed our addition.

Now let's move on to our last important vector instruction. This vsetvli is going to be really important: it sets up the type and length for the vector that we're about to be dealing with. One important thing to note is that you can request a larger length for the vector than what is actually available inside of the hardware.

Now let's move on and debug our code so we can watch this vector addition happening in real time. I'm going to go to my SSH connection for my board, and we have our previously
compiled binary that we're going to run. Let's use GDB so that we can demonstrate the vector instructions and the addition happening in real time. I'm going to do gdb ./vector, and let's start our application. Let's do a breakpoint on the initial starting point of this application. Now we have a first breakpoint set, but let's also set one towards the end so that we can see our values before and after we've performed the actual vector addition. So I'm going to do disassemble _start, and let's do another breakpoint, maybe towards the end of this application, before the exit system call. Let's do b at _start plus, let's say, 52. Now we have two breakpoints set.

I'm going to run my application, and we have successfully hit the initial breakpoint, which means we can inspect the initial values of the two vectors before we perform our vector addition. So what were those variable names? We had vec1 and vec2, so let's inspect the current data stored inside of there. Let's do x/8, decimal, words, since we stored eight values inside of our vector, and let's do our vector, &vec1. Here we go: we have our 1, 2, 3, 4, 5, 6, 7, 8, and we should see something similar for vec2 as well. And there we go. So these are the two vectors that we're going to be performing addition on using this vadd vector-vector instruction. If we checked the values stored inside of our result memory location right now, it would probably just be kind of junk data, because we haven't actually performed the addition yet and stored the sum inside of that memory address.

So let's let our program run so we can get past that portion and then check the sum of these two vectors. I'm going to do c for continue, and now we've hit our second breakpoint towards the end of the program. Now we've effectively performed our addition, so we can inspect the memory location inside of our result variable. Let's just reuse our command, and we'll do result, and then sure enough: 2, 4, 6, 8, and so on. That is the sum of these two vector values. So we have
successfully performed our operations.

If you'd like to look at a real-world implementation of the RISC-V vector instructions, you can actually take a look at the FFmpeg source code. Now, this particular commit contains a ton of inline assembly that's performing a lot of different operations on a few different vectors. Let's scroll down and take a quick look at what the code looks like. Here we have our inline assembly that is getting added down here. Now let's look for our vector-specific operations. We should recognize this vsetvli setting the type and length for the vectors. It looks really similar to the instruction that we were using previously. Let's see where our vector instructions are. We have a couple. This vector multiplication is actually multiplying a vector and a scalar value together and storing the result inside of our v8 vector register. Moving on, we have a lot of different vector-specific operations that are multiplying different vector and scalar values together and performing a little bit of vector addition, and all of this is placed inside of a loop. Here we have our loop up here. These kinds of optimizations are really important for encoding frameworks, since they require a lot of high performance for their operations. For example, if we were taking a look at our code and we wanted to add these two vectors together one element at a time, that would repeat these instructions over and over again for every single value inside of these vectors, one by one. But we can optimize this process so much more by using vector operations, which makes our code so much more efficient.

All right, before you go out and buy one of these, let's talk about how tough this is. As cool as RISC-V is, we're still in the very early stages, and software support is rough. It's going to get better with time, but I wouldn't throw your Raspberry Pi in the closet just yet. The number of pre-compiled packages is weak, to put it lightly, and even then
few are taking advantage of the vector instructions. Needless to say, things can get pretty slow if you aren't compiling your own packages. Peripheral support is weird, and I couldn't get HDMI or Wi-Fi to work, and the Ethernet NIC is funky, with a randomly changing MAC address that I can't seem to get rid of.

But I guess I just can't get over how exciting this time in history is. Just think: you're watching an instruction set being developed before your very eyes, and not just any instruction set, an open source one. If you're clever enough, yes, even you could propose a new CPU instruction. Join a special interest group, pass the community review, and get stakeholder consensus. Boom, your instruction is now part of the ISA. How cool is that? Admittedly, it might be a little bit ambitious, but nevertheless possible. And there are certainly other ways you can get involved. Get one of these guys to play around with some assembly. Don't just let the compiler do everything for you. I'm a big believer in understanding the assembly instructions themselves, and this is a rare time in history where you can watch an instruction set being created before your eyes. Join some mailing lists, hang out in the working groups, and spread the word. It might just be the future of computing.

As always, thanks for watching, and I hope you enjoyed this video. If you like the style of content, don't forget to subscribe, and until next time, LaurieWired out.

I literally hold my controller... I'm going to blame the controller, not the person holding the controller. Oh shoot. [Music]
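
The vector addition walked through in the captions (vsetvli, vector load elements 32, vadd vector-vector, vector store element 32) can be sketched as a toy model in plain Python. This is only an illustration under stated assumptions, not the assembly program shown in the video: the function names, the `vlen_bits` parameter, the LMUL=1 setting, the simplified "grant the minimum" rule for vsetvli, and the contents of `vec2` are all assumptions. The point it tries to show is vector length agnosticism: the same loop gives the same answer whatever register width the hypothetical hardware offers.

```python
def vsetvli(avl, sew_bits, vlen_bits):
    """Simplified model of vsetvli (assumes LMUL=1).

    avl is the number of elements the program asks for. The hardware
    grants at most VLMAX = VLEN / SEW elements per pass.
    """
    vlmax = vlen_bits // sew_bits
    return min(avl, vlmax)


def vector_add(a, b, vlen_bits):
    """Add two lists of 32-bit words in chunks sized by the 'hardware'."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    result = [0] * len(a)
    i = 0
    remaining = len(a)
    while remaining > 0:
        vl = vsetvli(remaining, 32, vlen_bits)  # request everything, get what fits
        v0 = a[i:i + vl]                        # models: vector load elements 32 into v0
        v8 = b[i:i + vl]                        # models: vector load elements 32 into v8
        v0 = [(x + y) & 0xFFFFFFFF for x, y in zip(v0, v8)]  # models: vadd vector-vector
        result[i:i + vl] = v0                   # models: vector store element 32
        i += vl
        remaining -= vl
    return result


vec1 = [1, 2, 3, 4, 5, 6, 7, 8]
vec2 = [1, 2, 3, 4, 5, 6, 7, 8]  # assumed values; the captions only say "similar"

# Hypothetical register widths: a small core, a mid-size core, a large core.
for vlen in (64, 128, 512):
    print(vlen, vector_add(vec1, vec2, vlen))
```

With a 64-bit register the loop takes four passes, and with a 512-bit register it takes one, yet the calling code is unchanged. That is the "magical baking tray" idea in miniature.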
Info
Channel: LaurieWired
Views: 238,879
Id: Ozj_xU0rSyY
Length: 16min 56sec (1016 seconds)
Published: Thu May 30 2024