Creating Custom AXI Master Interfaces Part 1 (Lesson 7)

Captions
Hello, how are you doing? I'm Mohammad Sadri, a member of the Microelectronic Systems Design Research Group of TU Kaiserslautern, and this is video number seven in the series of videos that we are creating for Zynq training. We have called this session "Creating Custom AXI Master Peripherals: The Easy and High-Performance Way". The purpose here is to talk about how we can create modules that contain AXI master blocks. We want to do it in as easy a way as possible, and at the same time we want to use a method that gives us an acceptable level of performance. So I will briefly go over what the problem is and how we are going to solve it.

Here is the table of contents of this session. We begin with the motivation for adding AXI master blocks to our hardware accelerators. Then we talk briefly about the methods by which you can create modules containing AXI master blocks. Then we propose a very simple example design, and through a practical session we design that system and see how the design should happen in practice.

Okay, as the motivating part, suppose that I have developed a hardware accelerator, and this hardware accelerator is a kind of simple pixel processing engine. It receives a stream of pixels, performs some kind of processing on the received stream of pixels, and at the end reports the results to the CPU. For processing the pixels, however, my hardware accelerator needs big portions of memory. This is because what the hardware accelerator is processing is actually frames of images, and as you know, each frame may take two megabytes, four megabytes, or even more. So the hardware accelerator needs to receive this stream of pixels, perform, for example, the first stage of processing, put the result or part of the result in the DRAM memory, then read it back, perform another stage of processing, and again put the result
on the memory, and when the algorithm is completely finished, it will report the final result to the CPU. The important point, and the thing that is new here, is that your hardware accelerator should be able to access the DRAM memory directly.

Okay, we know a lot about the Zynq architecture. We know that on this boundary there are some high-performance ports; they are slaves for the logic that we have in the PL. So I need an AXI master here, and the AXI master should get connected to one of these high-performance ports. The important point is that this hardware accelerator will initiate read and write transactions through these high-performance ports to the DRAM memory space.

But before going to that point, let's have a look at other possible solutions. Suppose that I don't know how I can add AXI master blocks to my module. What are the possible solutions to the problem? The main possibility is to use an AXI Central DMA module. So your hardware accelerator, instead of initiating the write and read requests to the DRAM memory, hands this duty to the DMA engine. What happens is that, for example through the GP0 port, the ARM host programs your DMA engine every time. Since the hardware accelerator and the ARM host are working in collaboration, the ARM host knows what types of memory transactions this hardware accelerator needs to perform to do its processing tasks. So the ARM host programs the Central DMA engine every time, so that the Central DMA, through its own AXI master port, reads the pixels from the hardware accelerator and puts them in the DRAM memory, or vice versa: it reads the pixels from the DRAM memory and puts them back in the hardware accelerator. So in this scenario our hardware accelerator doesn't have its own AXI master port. What it has is an AXI slave port, and this AXI slave port is practically connected to a DMA engine, and the DMA engine will be programmed by the CPU so that the
required data transfers between the hardware accelerator and the DRAM memory happen. One small note on this block diagram is that I am already adding a second AXI slave port. Suppose for now that this AXI slave port is only used for configuration and for reporting the status to the CPU, so it is not directly involved in the data processing flow; it's only for configuration.

Now, this solution obviously has problems. The biggest problem is that whenever any processing needs to be done by the hardware accelerator, the CPU gets involved. So the CPU is not able to freely perform its own responsibilities; it is always involved in programming the DMA engine at a suitable time and in a suitable manner. And then what happens if, for example, the type of processing algorithm we are implementing here requires a very large number of accesses to the memory in small sizes? For example, every time a read or write happens, the size of the data being written or read back is maybe 16 bytes, and the algorithm needs a very large number of read and write transactions like this. The performance will seriously drop in this scenario, because programming the CDMA itself takes time. So the CPU will always be involved and the performance will not be acceptable.

Okay, we can improve this architecture in a way that allows the CPU to be released from the operation of the hardware accelerator. What we can do is shown in this block diagram. Here I use the AXI Central DMA engine, but I configure it in scatter-gather mode. Now I can put the operations that the AXI CDMA engine needs to do in this block memory here. These operations, or better to say these descriptors, will be placed here by the hardware accelerator. The interface between the hardware accelerator and the dual-port memory that holds the descriptors is a very simple block-RAM-based interface, so it is very simple to implement. Then the descriptors here will
be read by the AXI CDMA and executed one after another. So whenever my hardware accelerator needs to perform some kind of read or write operation to the DRAM memory, it just puts suitable and correct descriptors in this memory, and then the DMA reads the descriptors and performs the required transactions. In this solution the CPU is practically not needed anymore. Maybe it is needed only for initializing the CDMA core the first time, and maybe for reinitializing it after some specific intervals or at the beginning of each set of frames. But overall the CPU can freely perform its own tasks, thanks to the scatter-gather capability that we have in the Central DMA engine.

This solution works, but personally I don't have a good feeling about it. As the designer of this hardware accelerator, I feel I don't have complete control over the timing of the transactions that should be performed to the memory. I don't see why I should put a completely separate module in my system just to be able to initiate transactions to the memory. It would be perfect for me if I could have my own AXI master block here.

So we come to the next solution. The next solution is that our hardware accelerator has its own dedicated AXI master block. Practically, what happens is that the RTL code that I'm writing here, the design that we are creating here, will talk directly to this AXI master block, and with very low latency, whenever we need, we can initiate read and write transactions to the memory. This AXI master block is completely under the control of the logic that I have here, so everything is in my hands, under my control, under my observation. This really helps me to design better, to create a higher-performance module, and later to develop this design further and to maintain and upgrade it in less time. And this is the topic of this session: how do I add an AXI master block to
my own module, to my own RTL? There are methods to do that. Here I am listing the methods through which we can add an AXI master block to our own design.

Obviously the first one is that you are a professional designer and you have time, so you sit and write the entire RTL from scratch, based on the specification given to you by ARM for AXI. And yes, this solution can sometimes be the best one, especially when you have very tight requirements on the latency of accesses to different locations in the system and on the performance. But I would say this happens very rarely. In most situations it is not really needed to sit and create an AXI master block completely from scratch. So there are other solutions that can be followed, and they are most of the time useful and sufficient.

As we talked about IPIF blocks in the previous session, there exist IPIF (IP interface) blocks that can be used to have AXI master blocks inside our own module. There exist practically two IPIFs for AXI master interfaces: one gives you an AXI Master Lite interface, the other gives your module an AXI Master Burst interface. So practically there are two cores coming from Xilinx as RTL. You can take these cores, put them in your design, instantiate the module inside your design, and connect its ports to the rest of the ports and the logic that you have in your design, and it works for you like an AXI master. It contains the AXI4 ports and gets connected to the rest of the AXI world.

Now, another solution, which we followed in the previous session for having AXI slave interfaces, is the auto-generated code produced by Vivado when you create custom peripherals using the Vivado environment. If you follow the create custom peripherals wizard of the Vivado environment, as I showed you in the previous session, you can have AXI master and
AXI slave blocks inside your custom peripheral, and as you generate that peripheral using Vivado, it also inserts running, easy-to-understand example code for the block inside your code. So you can follow this and have an AXI Master Lite or even an AXI Master Burst automatically generated by the Vivado environment. In the previous session, for AXI slaves, I used this method, but today I want to use this other method, the IPIF, because I believe the AXI master logic gets much more complicated than the AXI slave logic, and it is more reasonable to have the AXI master block as a completely separate module with defined ports: you send your read and write requests to it, and on your behalf it generates the required AXI transactions.

The final way to have an AXI master is to use Vivado HLS. When you design your module with HLS, with a few lines of code your block can have AXI master interfaces which initiate the required read and write transactions for you. I think we will cover this in the future, and for now I will use this method: for our experiment we want to use an AXI Master Burst IPIF. We want to insert it in our design and see how we can use this module to initiate read and write transactions to wherever we want in our architecture.

The same as before, the set of signals involved in the AXI interface is divided into these channels: clock and reset, write address channel, write data channel, write response channel, read address channel, and read data (response) channel. The signals obviously don't change in comparison with AXI slave interfaces, so the same set of channels that we had for AXI slave interfaces we also have for AXI master interfaces. In the previous session we were developing an AXI slave interface, and now in this session we are going to produce an AXI master interface, which
is practically the other end of the link.

Okay, let's talk a little about a practical example design. Here we want to create a simple image rotator. Our image rotator is capable of rotating incoming images, or better to say images stored in the memory, by 90 degrees, clockwise or counterclockwise. Here is how our hardware accelerator operates. For our experiments we assume that the image is already in the DRAM memory, and the responsibility of the hardware accelerator is to read the image from the DRAM memory, perform the required rotation task, and write the rotated image back to the DRAM memory. Our hardware accelerator has an AXI slave port, and through this AXI slave port the CPU informs our hardware accelerator about the task it should do: either rotate the image 90 degrees clockwise or counterclockwise, or perform a mirroring operation, or whatever simple rotation operation it can do. Our hardware accelerator will contain an AXI master port, and through this AXI master port it will read the incoming image and then write the final image. As for the addresses of the incoming image and the final image, which are practically physical addresses in the DRAM memory, the CPU will inform the hardware accelerator about these physical addresses through this AXI slave port. Finally, there is an interrupt from our hardware accelerator to the CPU, and whenever a rotation task is completely finished, this interrupt gets enabled to inform the CPU that the task is done.

So this is the hardware accelerator that we want to design. Its architecture looks very simple. It will contain two AXI blocks. One is an AXI Slave Lite block, and for this one I will use exactly the same method that we used in the previous session, session six, for AXI slave interfaces: I will use the code automatically generated by Vivado for the AXI slave block. And for this one, as I told you, I will use an AXI Master Burst
IPIF. Then we will develop the required logic, which will receive the physical addresses, the dimensions of the image, and the command from this block here, and then it will manage the AXI Master Burst to read the pixels one by one from the incoming image address locations and write them back to suitable address locations in the destination image.

Okay, this is the end of the theoretical part of session seven. Now we go to the practical part and see how we can design this hardware accelerator. Thanks for watching.
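The address calculation behind the rotator described in the session can be sketched in software. The following C model is only an illustration and is not the design from the video: the function names are hypothetical, and it assumes 8-bit pixels stored row by row with no padding between rows. It shows where each source pixel, read one by one, would be written in the destination image for a 90-degree rotation. In the hardware, offsets like these would be added to the physical base addresses that the CPU supplies through the AXI slave port.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model; assumes 8-bit pixels, row-major layout,
   and no padding between rows. The source image is height x width and
   the rotated image is width x height, so each destination row holds
   `height` pixels. */

/* Destination offset of source pixel (row, col), 90 degrees clockwise. */
static size_t rot90_cw_offset(size_t row, size_t col, size_t height)
{
    return col * height + (height - 1 - row);
}

/* Destination offset of source pixel (row, col), 90 degrees counterclockwise. */
static size_t rot90_ccw_offset(size_t row, size_t col,
                               size_t width, size_t height)
{
    return (width - 1 - col) * height + row;
}

/* Reads the source pixels one by one and writes each to its rotated
   position. In-place rotation is not supported by this sketch. */
static void rotate90(const uint8_t *src, uint8_t *dst,
                     size_t width, size_t height, int clockwise)
{
    assert(src != dst);
    for (size_t row = 0; row < height; row++) {
        for (size_t col = 0; col < width; col++) {
            size_t s = row * width + col;
            size_t d = clockwise
                ? rot90_cw_offset(row, col, height)
                : rot90_ccw_offset(row, col, width, height);
            dst[d] = src[s];
        }
    }
}
```

Note that consecutive source pixels land `height` bytes apart in the destination, so a pixel-by-pixel copy produces many small, scattered writes. This is the kind of access pattern for which the session argues that a dedicated AXI master is a better fit than a CPU-programmed DMA.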
Info
Channel: Microelectronic Systems Design Research Group
Views: 25,528
Rating: 4.9306359 out of 5
Keywords: EMS, University of Kaiserslautern, Technische Universität Kaiserslautern, TU Kaiserslautern, TU KL, Microelectornic Systems Design Research Group, Norbert Wehn, Matthias Jung, Mohammad Sadegh Sadri, Mohammadsadeh Sadri, Mohammad Sadri, Xilinx, Zynq, High Level Synthesis, HLS, FPGA, Lecture, Tutorial, Advanced Microcontroller Bus Architecture, AXI, AHB, AMBA, Lesson, Teacher, Learn, Lessons, Student, Education, Students, Teachers
Id: cDc9B2zAPz4
Length: 23min 9sec (1389 seconds)
Published: Fri Feb 06 2015