Real-Time Linux on Embedded Multicore Processors - Andreas Ehmanns, Technical Advisor

Captions
OK, so it's ten o'clock, I think we should start with this talk. Welcome everybody, and thank you for joining this talk. My name is Andreas Ehmanns. I'm working as a technical advisor for embedded software systems at MBDA Germany, and today I'm giving you a talk about real-time Linux on embedded multi-core processors.

Some of you may ask: real-time Linux, that's nothing new; multi-core, that's also nothing new for Linux; so what's the content of this talk? The talk is about the combination of both. Around one year ago I was in the situation that I had real-time Linux running on a single-core processor. This hardware becomes more and more obsolete; every new processor nowadays is a multi-core processor. So the question was: how can I come to the right side, running my existing software on new multi-core hardware? Is this possible, and if so, how could this be done?

A few words about the agenda. I already started with the motivation. At the beginning I have to say a few words about Linux and real time and about latency measurements, but then we come to the second part of the presentation, the interesting one: the migration from single core to multi-core. You will see a lot of information and histograms there. You will find that there are a lot of effects on multi-core processors which you should be aware of, and that it's mandatory to know a little bit about the processor hardware architecture. I'm sorry, my laser pointer is only working on one side of the presentation here, so maybe I'll switch sides during the talk. And of course at the end there will be a short summary.

The combination of the vanilla Linux kernel and the applied RT-Preempt patch is nothing new; it's well established in the embedded area. But on the other side the semiconductor industry is driving the evolution to multi-core, so nowadays it's a big problem to get a single-core CPU for your system. You could say: take a multi-core processor, use one core and switch off all the other cores. This is a solution, but you waste all the other cores you have.
So the question is: can we use the other cores? In this presentation I will outline one possible way to migrate to multi-core. Of course this is not the one and only way, the golden way; it's not an answer to every problem. Your system may look different, your hardware and software may look different.

OK, so when I started one year ago I had a single-core processor system. It was a 6U [unclear] board with a G4 PowerPC [unclear], so it's a well-known processor in the embedded area. The hardware becomes more and more obsolete, and the question was: can we replace one or several boards with a new multi-core board? If this is possible, you of course have a lot of advantages: you have more computing power on each board, you have less power consumption if you compare the same CPU power, and you have less heat dissipation. These three points are always interesting in the embedded area: you need more power, and you want to have less heat dissipation and power consumption. If you think about a smartphone, for example, these are typical things there.

Of course there are some disadvantages compared to multiprocessor systems. In a multiprocessor system each processor has its own resources: caches, memory, I/O. On multi-core processors these resources are shared, so you can expect that there is interference.

So that was the hardware part where I started. The software running on this board is just a vanilla Linux kernel 4.4.3 from kernel.org with the according RT-Preempt patch. Of course we did a lot of kernel configuration: it's a fully preemptible system, it's configured as a tickless system using high-resolution timers, and we switched off everything we didn't need or which might have a negative impact on real-time behavior, such as hot-plugging, power management, dynamic frequency scaling and the typical things.

Additionally we use the tool cyclictest. cyclictest is part of the rt-tests package and can be used for latency measurements. One general note: I have to give you a little bit of information about
this tool, but the tool is not the content of this presentation.

So, cyclictest: who knows the tool or has had experience with cyclictest? Three, four, there are some. OK, so maybe it's necessary to say a few words about this tool. I call this "cyclictest simplified" because it's a big tool with a lot of parameters; I just want to give you the idea of what the tool does.

In general it measures latency, or response to a stimulus. How is this done? The tool sleeps for a defined time, it measures the actual time when it's woken up, and it calculates the difference between the expected and the actual time. In a perfect world you would expect the difference to be zero, but of course from an interrupt, from a timing event, until the application is scheduled there's always some time, so you have at least an offset, and sometimes it takes a little bit longer.

This is the main loop taken from the source code, a little bit simplified. You see what I told you before: it sleeps for a while, it measures the time when it's woken up, it calculates the difference, and then it iterates again and again. At the end cyclictest generates an output; it produces a kind of histogram in a format which you can directly feed into gnuplot. For more information there are a lot of things on the web, just google around. For example, this is a very comprehensive presentation about cyclictest, its parameters and so on. But for this talk it's just necessary that you understand the idea behind cyclictest.

So now we start cyclictest on the old single-core processor system, and that's what you see here. This is the histogram generated with gnuplot: on the x-axis the time in microseconds, on the y-axis the number of samples. Please note that this is a logarithmic scale, it's not linear. If you use a linear scale you have a big peak here at the beginning and then there's nothing to see, so to see the curve it's necessary to use the logarithmic scale. We put the system under high load; that's
necessary if you want to do latency measurements, because who wants to deploy an idle system to a customer? And of course it's necessary to do long-term measurements. It doesn't make sense to run for a few minutes; we are talking about days, weeks or better months. That's necessary because sometimes there are infrequent outliers, and these are what you want to see.

So you see here the red curve, that is the kernel with the RT-Preempt patch, and the blue one is the kernel without the RT-Preempt patch. The red curve means you have a maximum latency of around 25 microseconds; that's not bad. Without the RT-Preempt patch we have maybe 70 microseconds, and you could say: that's sufficient for my system, what's the problem? The problem is that this blue curve doesn't end here; it goes on in this direction. There are outliers up to 5 milliseconds. If I made a plot up to 5 milliseconds you could not see the curve, so I just cut it here at [unclear] microseconds; that's the interesting part.

And of course this curve is totally hardware dependent. If you have a different processor, for example I did some tests with an ARM A9, then you have a little bit more of a plateau here, going up to 200 or 250 microseconds.

OK, that's the old hardware. The next step is to migrate to new hardware. Since we had a PowerPC in the past and the legacy software uses some features of the PowerPC, especially the AltiVec vector unit, the idea was to buy a new PowerPC. The new PowerPCs come from Freescale; now it's NXP, in a few months it will be Qualcomm. It's not called PowerPC anymore, it's called QorIQ, so that's the PowerPC family. Freescale [unclear, "around one and a half years ago"] provides boards which the QorIQ processors are placed on and where a lot of interfaces are available. This is called a reference design board; that's what the RDB stands for. The T here denotes that it is the newest QorIQ family. There is a 1000, a 2000 and a 4000 family for the low-, mid-
and high-end processing area. The next two digits denote how many cores there are, so this is an 8-core system and this is the 24-core system. The fourth digit is just for some special variants: more or fewer interfaces, different interfaces and so on.

So we bought this hardware. Just to give you a short overview: this is the hardware, this is the processor on the hardware with the e6500 PowerPC core. You see 8 or 24 cores, it's running at 1.8 gigahertz and you have 4 or 12 gigabytes of memory. If you have a look at the PowerPC family history, it's a step of more than three PowerPC generations compared to the old G4.

From the software point of view, NXP delivers a software development kit. It's based on Yocto, and you can create your boot image, kernel, initial RAM disk, your DTB file and so on. The kernel is a little bit older, it's a 4.1.8, and the according RT-Preempt patch is applied.

So the first attempt was to just move the kernel to the new hardware. We let the kernel handle all cores using an SMP configuration. We just changed a few settings in the kernel configuration, since it's a new CPU core and a different architecture, but in the general setup we didn't change anything, just to see how it's running on the multi-core hardware. cyclictest in this case is started with an additional parameter which tells cyclictest to start one thread on each core and to pin each thread to its core using affinity. And then let's see what cyclictest will record.

So that's the histogram. The interesting thing here: this is an idle system, we put no load on it. It's not necessary that you see every curve here; I know the yellow color is a little bit hard to see. The interesting thing is that you have the peak here at the beginning as before, but you also have a lot of things here; it's up to four cores which generate entries here. So you could say: are we done? The maximum value is around 80 microseconds, maybe this is sufficient for your system. Of course we are not done; this is an idle system. So, OK, the next test is
under load, and now you see that it's different. We have around four cores [generating entries]. The system is under load: we have a CPU load on each core and we generate heavy traffic on the Ethernet and on the [unclear]. And if you start a new run you will see that the colors change here. This means that in the new run different cores are generating these peaks. The previous histogram ended around one hundred microseconds; the highest entry was 80 microseconds. Now we have something above 200 microseconds, but that's obvious because the system is under load now.

Again you can say: 200 microseconds, that's OK for my system. But especially if you are an engineer you want to understand what's going on there, and maybe your system has other real-time requirements.

Of course we did the same measurements on the 24-core system, but you can imagine it's really hard to put 24 cores into one histogram, so I just show you the idle case here. For the further discussion the next slides use only the T2080 because it's easier to see. The effects are the same. We also did a lot of investigation on this hardware, and if there are differences to the T2080 I will mention them in the presentation.

OK, as you have seen on one of the first slides, we configured the system to let the Linux kernel handle all cores, and that's what Linux is doing. The scheduler decides on which core a task has to run, and the tasks can be migrated dynamically. So the idea is: we bind all tasks to one core, for example to core zero. This can be done with a simple bash script: just loop over the entries in /proc and then use the taskset command to assign the processes to core zero.

If you do so you get this picture. You see that there are now two cores, instead of four cores, which have entries in this region, but there are still two cores. And again the results vary from run to run. You will always see the red one, this is core zero. In this case we have core two, the blue one, and in the next one we
have core four, five, six or seven; it changes from time to time.

OK, next thing: have a look at the interrupts. Just look at /proc/interrupts. You get a long list, and I've picked out the serial interface here. You see this is interrupt 36, and here you see the eight cores and how often each core handled the interrupt. So you can see that the interrupts are handled by different cores. So the idea is to do the same as we did before for applications: we migrate all IRQ handling to one core, to core zero. This is what this bash script does for us, and we set the default affinity to core zero for new IRQs.

If you do so you get this one. It looks similar to the previous one, but there are fewer entries in this area, and in the previous histogram there was also a curve like this one here. So something changed, but we are not completely happy with this: there are still two cores, and it changes from run to run which cores are involved.

So instead of going on this way you can say: OK, let's try a different approach, we are going to isolate cores from the kernel scheduler. Instead of booting the system and then migrating threads and IRQ handling back to one core, we tell the kernel at the very beginning to take cores out of the scheduler. There's a kernel boot parameter called isolcpus, where you can add a list of CPU or core numbers. The man page says: remove the specified CPUs or cores from the general kernel SMP balancing and scheduler algorithms. If you do so and you want to move your application, your real-time application, your task, to the other cores, you can again use taskset to deploy your application to a dedicated core. So the idea now is: let core 0 handle all the kernel and the OS
tasks, and reserve cores 1 to 7 of the T2080 for user applications. So you use this kernel parameter. Oh, there's a typo [on the slide] [unclear]. And let's have a look at what happens then.

Then you get this histogram. You see that core zero has a lot of things to do, up to 245 microseconds [unclear], and all the other cores are hidden somewhere here in the first peak. At the end of each run cyclictest generates a summary; this is a part of the summary. For each core, one column per core, you get the minimum latency, the maximum latency and the average latency. You see core zero, that's the entry here, and the maximum value for the other cores is 13 microseconds.

So we talked a lot about latency measurements. This is necessary, it's important to do this to understand your system, but of course you don't want to deliver a system where cyclictest is running; you have a real application. If you are in the situation that you have an existing system, or existing software on single core which you can migrate to the multi-core system, then you are a happy guy, because you have software and you can just run your existing real application on the new hardware.

But what happens if you start a new project? Normally you don't have the software at the beginning. In this case I recommend trying to do a reference implementation, at least of the critical code or code sections, so that you have something to test on your multi-core hardware and see how it behaves. But this is not so easy. First of all, you should have an application which can run in parallel; if it is completely sequential, it makes no sense to use multi-core hardware. Then of course you need time measurement, and time measurement is a topic for a complete dedicated talk; take care of that, especially in combination with caching effects. Make sure that your application behaves on the test system as it
would do on the final target. Simulate or implement messaging if necessary. Normally an application gets some data in, has to do something with it and sends data out. This is typically much slower than just the computing algorithm, so it's necessary that you implement this. The same holds if your application does I/O: hardware I/O is typically much slower than memory, so it makes a difference whether you write your result into memory or into real hardware. And of course you should do long-term measurements to find infrequent outliers.

Last but not least, you should check that your application does what it should do. It makes no sense to optimize your algorithm and do things like this if you are not sure that your application works correctly. As long as you do not know this, every time measurement is worthless.

OK, we were in the happy situation that we had existing software. So for the new hardware I took out some code which was real-time critical, and it was available in two versions: we had a pure C/C++ implementation, and we had a highly optimized version using the AltiVec single-instruction-multiple-data vector unit of the PowerPC. This is a very powerful unit, and depending on your application you can gain a factor of four, ten or sometimes up to fifteen.

The algorithm can run nearly one hundred percent in parallel. It uses big look-up tables; big in our case means 5 to 10 megabytes, so it's big enough that it does not fit into the level 2 cache, and we can see caching effects here. Unfortunately we have to simulate the storage of data in hardware, because the original interfaces are not available on the new hardware.

So now we run the algorithm on one core, two cores, three cores and so on up to eight cores. You see the number of cores here on the x-axis, and the number of microseconds one iteration of the algorithm needs to execute. As expected, the curves go down here. You have around a factor of
10 between one and eight cores. This is just the influence of a lot of pipelining things and additional parts in the processor architecture which speed things up a little bit, so it's more than a factor of eight. And if you compare four and eight cores, for example, there is around a factor of 2 between them. So this is what you might expect if you say: yes, I have an 8-core system.

Now the AltiVec version. We have a good performance improvement up to 5 cores, but then we have [unclear]: it's nearly the same speed for 6 and 7 cores, and it's reduced performance for eight cores. So between 5 and 8 cores we have a factor of around 1. So what's the reason for that? We have an 8-core system, but there is no speed-up anymore. What could be the reason? Can you hear me? No? OK, sorry.

What I showed you before was an average of the algorithm: it did a lot of iterations, and that's just the average value. So here I plotted the number of iterations and the time each iteration took. The red line here is the algorithm using one core. You have a [unclear] line here; it's always the same time the algorithm needs to execute. This is the same for two cores, three cores and four cores [unclear]. If you look at the same thing for five cores, it's different. This is five cores, and you see a lot of jitter here. And the jitter becomes more and more if you move to six cores, seven cores or eight cores. So we have less than 2% jitter up to 4 cores, but beyond that we record outliers of around fifty to one hundred percent of the execution time. From this picture I would say we can use four cores, even if you get a speed-up with five cores, but not more.

But again, we have an eight-core system; why can we use only four cores? If you have a look at /proc/cpuinfo or look at top, the system tells you we have eight cores. If you have a look at the manual from NXP, it tells you the T2080 has 4 physical cores and eight virtual cores. What does this mean? They call these dual-threaded cores.
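
[Editor's sketch, not from the talk.] One way to check which logical CPUs are hardware threads of the same physical core on a Linux system might look like the following. It assumes the kernel exposes the standard sysfs topology files (/sys/devices/system/cpu/cpuN/topology/thread_siblings_list); the helper names are hypothetical, and the sibling strings in the usage note are illustrative values, not readings from the T2080.

```python
import glob
import os
import re


def parse_cpu_list(text):
    """Parse a kernel CPU list such as "0-1" or "0,4-5" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def group_by_physical_core(sibling_lists):
    """Turn {logical_cpu: sibling list string} into a sorted list of sibling groups."""
    groups = {tuple(parse_cpu_list(text)) for text in sibling_lists.values()}
    return sorted(groups)


def read_sibling_lists(base="/sys/devices/system/cpu"):
    """Read thread_siblings_list for every cpuN directory that provides it."""
    result = {}
    for path in glob.glob(os.path.join(base, "cpu[0-9]*")):
        match = re.fullmatch(r"cpu(\d+)", os.path.basename(path))
        siblings = os.path.join(path, "topology", "thread_siblings_list")
        if match and os.path.exists(siblings):
            with open(siblings) as handle:
                result[int(match.group(1))] = handle.read()
    return result


def one_cpu_per_core(groups):
    """Pick the first logical CPU of each physical core."""
    return [group[0] for group in groups]


if __name__ == "__main__":
    groups = group_by_physical_core(read_sibling_lists())
    print("sibling groups:", groups)
    print("one logical CPU per physical core:", one_cpu_per_core(groups))
```

With sibling lists like {0: "0-1", 1: "0-1", 2: "2-3", 3: "2-3"}, the grouping yields (0, 1) and (2, 3), and picking one logical CPU per physical core gives 0 and 2. Whether a given board numbers its threads this way has to be checked on the target itself.
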
To understand what dual-threaded cores means you should have a look at the core block diagram. It's not necessary that you can read everything here. These are the typical elements of a processor [unclear], and so on. The interesting thing here is the coloring: this is thread zero, and the yellow one is thread one. You see that a lot of hardware elements are available twice. This is what NXP calls a dual-threaded core. But on this level here you have the gray shaded boxes, with the vector unit for example. This means our system has one AltiVec unit per physical core, so in total the T2080 has four AltiVec units, and this explains what we have seen before.

So if you think about deploying your application to cores, it's necessary to understand what the hardware looks like. It makes no sense to run eight threads with AltiVec instructions on the T2080. Furthermore you should check your OS numbering scheme. [unclear] Core zero and core one are one physical core, two and three are one physical core, and so on. But check whether your system uses this numbering or a different one. If so, it makes sense to run one AltiVec thread per physical core, for example on zero, two, four and six, or, if you say "I use core zero for the OS", [unclear].

But there are more things you should think about. This is the T2080's block diagram. What you saw before was the core block diagram; it was just this thing here. You see you have four physical cores. You have the level 1 cache here, and there is one level 1 cache for each physical core, not for each virtual core. So if you have two applications running on core one and core zero, they share one level 1 cache. And the level 2 cache is here: it's one level 2 cache for all cores. So you might get interference when your applications are doing a lot of memory accesses. Furthermore you have a lot
of things here [unclear], a lot of I/O. You have a lot of high-speed I/O: SerDes, Serial RapidIO, PCI Express, one or ten gigabit Ethernet and so on. You have the multi-channel DMA here, and in the middle the switching fabric, the CoreNet fabric, which interconnects everything. The details of the fabric are not public; even if you have an NDA with NXP you don't get the information. But this is necessary to understand whether you have interference or not. So in this case the only thing you can do is run your application, do a lot of tests, and try combinations of interfaces, I/O and so on.

Additionally you have a queue manager, a buffer manager and a frame manager. These are elements which should reduce interference effects and enhance the throughput. That's what is stated in the datasheet, but I don't know how this is working.

And now, if you go to the T4240, it looks bigger than before. You have the switching fabric here, you have a number of I/Os on top, and here you have again the four-core blocks as you have seen for the T2080, but it has three of them. So the T4240 is just a combination of three T2080s on the core side. Another interesting thing is that you now have three level 2 caches, one level 2 cache for each cluster block, so [unclear] interference. Take the algorithm we had before: if you run it on core zero and core one [unclear], you get this execution time. If you deploy it to core zero and to a core in the cluster behind, then you have a speed-up of around 50 percent. So that is the effect of the shared level 2 cache.

Caching is a really big and complicated thing, and I don't want to dive into the details of caching, just give you an idea with some basic cases. Here we have a [unclear] test, testing the speed of your memory. [unclear] You have kilobytes on the x-axis; the lines here are the level 1 cache size and the level 2 cache size, and you have megabytes on the y-axis, which is again a
logarithmic scale here. If you start with one core, that's this line, the blue curve: [unclear] level 1 cache, it goes down for the level 2 cache, and it goes down again for main memory. That's what you would expect. If you use two cores, and of course we only take physical cores, we get the [unclear] line here for level 1 and level 2, and it goes down a little bit earlier compared to the first one. This is because the level 2 cache is shared between all cores. And again you can go to four cores: it's the same, it goes down a little earlier again. If you do the same with eight threads you cannot deploy them to physical cores anymore, you have to use the virtual cores, and then you see for the first time that you already have interference in the level 1 cache. That's because two virtual cores share one level 1 cache.

Of course you can do a lot more tests for caching; you can also do the test on the T4240 [unclear]. This is just to give you an idea of what could happen.

And of course there are more things to consider. It becomes more complicated for the T4240. You do not know most of the internals of the CoreNet architecture and the buffer manager, and among all these interfaces you have here, you also have the DMA channels. So it's necessary that you know how your application works, what communication it does, which I/Os are needed, how much access to memory, and so on. Try to do a reference implementation and measure.

[unclear; the speaker appears to describe an investigation that tested every combination of interfaces to see how they interfere with each other, a kind of matrix of all the interfaces.] And there are interesting things: sometimes [unclear], but you can't understand it because you do not know the CoreNet internal architecture. So of course this is interesting, and it is a lot of work, but it does not help you: you have to measure it. You will not get the information you need to understand the internals. Or maybe you are developers
at NXP; maybe you have the information, but for me it's unfortunately not available. And of course there are more interfering factors [unclear], DMA channels and so on.

So my recommendation if you want to go to multi-core hardware: first of all you should know the functional and the non-functional requirements of your system, and you should understand your interface requirements. Then take all the data sheets, everything you have, and try to understand the architecture of your processor hardware. I used this hardware here because it's a little bit more complicated than other ones, but even if you have [unclear], the internals of the processor as such are not that easy. If you don't have software, try to do a reference implementation for certain parts; depending on the application this could be algorithms, communication, [unclear]. You should know what your application does.

Then think about how you deploy your application, your tasks, your threads, to dedicated cores. You have seen before that it's important to know the architecture and to say: OK, I deploy to this physical core and also to the other one. This can have a great influence on your performance. Then test the application: test that it's working as expected against the functional requirements, and of course do some timing measurements. And if the results are unsatisfying and you still think that this is the hardware you need, then do an iteration: review your requirements again, whether you understood everything correctly, and look again at the processor architecture for this situation.

So now, at the end of my presentation, a short summary. We've seen that, depending on the system requirements, the vanilla Linux kernel with the applied RT-Preempt patch can be used on multi-core systems out of the box, even if you have real-time requirements. But there are a lot of configuration parameters provided by Linux to adapt your system to your needs. Of course the system and the software designers need a good knowledge of the hardware architecture and the processor, and depending on
your processor architecture you should really think about the deployment: which threads and tasks should run on which core, enabling or not enabling cores, and so on. Some applications may do just a little bit of messaging, others perform algorithms with heavy memory access [unclear]; that must fit your system, and you should know what your system needs.

So I hope you have seen that real-time Linux on a multi-core system is not something magic; it's not necessary that you are a guru to do that. And of course what I showed here is an example for one dedicated combination of hardware and software, and it's not an exhaustive analysis. Your system may look different or completely different; you may have different real-time requirements. I hope the presentation encourages you to try this out; it's fun. Thank you for your attention, and now we have some time for questions.

Yes, please. [Audience question, unclear.]

OK, so the question was whether we have tried to run real-time tasks alongside [unclear] on the multi-core hardware on different cores. Just to say: we have only real-time requirements for our applications, and we tried to find out how to configure the system so that it can run here. Of course there are a lot of combinations and also a lot of different tools; this is just the one we used here, and you can use something else. So this is a lot of work to do, but actually we did only what I showed here.

[Audience comment and question, unclear.]

OK, so the question was, regarding the slide where I showed how big the jitter is for [unclear] cores, whether we used tracing tools [name unclear], for example, to understand what's going on. Yes, we used [unclear] for some measurements. I didn't show it here because
the time is limited, but the idea I wanted to give you here is this: we just saw the effects, and then we had a look at the processor hardware architecture to understand things. I wanted to give you the idea that it's necessary to understand what your hardware looks like. Of course you can try to do tracing and see what's happening, what's behind it and what is generating which peak in the latency measurement, for example. We invested a lot of work there; it's really just limited by the time here.

[Audience question, unclear; it appears to ask whether the measurements distinguish between delays caused by the scheduler, by communication with the kernel and by the hardware.]

OK, so we did not distinguish between hardware and software effects. The idea was: we take vanilla Linux and the RT-Preempt patch out of the box, we don't want to modify the scheduler or add additional software, and we just see on a higher level how our application behaves and what effects could arise. Of course you can dive into the details and understand which delay is caused by software, which delay is caused by hardware and so on; that's interesting.

[Audience comment, largely unclear; it mentions RT users.]

OK, good. So, any more questions? Yes.

[Audience:] Thank you very much for your presentation, it was really useful. My question is more general. We know that chip manufacturers like to pump up numbers, so you have a lot of cores because they want to market the product. For your specific application, what was the real solution? Are you really just exploiting four physical cores [unclear] and not eight? At the end of the day, if you want to use vector processing, what is the final judgment that you want to give to designers?

[Speaker:] So, we had a setup of a lot of single-core
processors communicating a lot, so the complete system is much more complex than this algorithm [unclear]. There are a number of applications: some of them are just monitoring a little bit, some of them are communicating [unclear], and I picked out this algorithm because it is time critical compared to the other software, and it was important to see whether we can meet the real-time requirements we have or not.

So this is an ongoing investigation. At the moment we don't have a final solution where we say which CPU we take, the T2080 or the T4240, how many CPUs can be put on one board, how many slots we need, or how we can feed in all the interfaces; there is not much space on the front [unclear]. So in the end it's still ongoing research. I wanted to give you the information where we are up to now: that it's not that easy, but you can deal with multi-core processors. Not in every case; maybe your system is different and you have requirements of less than 10 microseconds, then you will have problems going this way. Does this answer your question? OK.

[unclear] If not, thank you very much for your attention. [Applause]
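
[Editor's sketch, not from the talk.] The measurement idea described in the talk (sleep until an expected wake-up time, take a timestamp, record the difference in a histogram) can be illustrated roughly as follows. This is not the cyclictest source: the function names are hypothetical, no real-time scheduling priority is set, and Python adds its own overhead, so the numbers it produces are not comparable with cyclictest results.

```python
import time


def measure_wakeup_latency(interval_us=1000, loops=1000):
    """Sleep until successive deadlines and record how late each wake-up is.

    Returns a dict mapping latency in whole microseconds to the number of samples.
    """
    interval_ns = interval_us * 1000
    histogram = {}
    next_ns = time.monotonic_ns() + interval_ns  # first expected wake-up time
    for _ in range(loops):
        remaining_ns = next_ns - time.monotonic_ns()
        if remaining_ns > 0:
            time.sleep(remaining_ns / 1e9)
        now_ns = time.monotonic_ns()
        # Difference between actual and expected wake-up time, clamped at zero.
        latency_us = max(0, now_ns - next_ns) // 1000
        histogram[latency_us] = histogram.get(latency_us, 0) + 1
        # Deadlines advance by a fixed interval, regardless of when we woke up.
        next_ns += interval_ns
    return histogram


def summarize(histogram):
    """Return (min, average, max) latency in microseconds."""
    total = sum(histogram.values())
    weighted = sum(latency * count for latency, count in histogram.items())
    return min(histogram), weighted / total, max(histogram)


if __name__ == "__main__":
    hist = measure_wakeup_latency(interval_us=1000, loops=2000)
    for latency_us in sorted(hist):
        # Two columns per line: latency in microseconds, number of samples.
        print(latency_us, hist[latency_us])
    print("# min/avg/max [us]:", summarize(hist))
```

The two-column output is meant to be easy to plot with a logarithmic y-axis, in the spirit of the histograms shown in the talk. For real measurements the talk's advice still applies: put the system under load and measure over a long period.
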
Info
Channel: The Linux Foundation
Views: 10,498
Rating: 4.9384613 out of 5
Keywords: internet of things, linux, linux foundation, openiot summit, embedded linux conference, embedded linux
Id: Q8vCi3ns0bs
Length: 43min 50sec (2630 seconds)
Published: Tue Feb 28 2017