HC33-S1: Server Processors

Captions
Welcome to Hot Chips 33, Session 1: CPUs.

All right, welcome to Hot Chips 33. My name is Ian Bratt, and together with Cliff Young I've had the opportunity to chair the organizing committee for this year's Hot Chips. Before we get started, I want to take this time to acknowledge our sponsors; without the sponsors we wouldn't have a conference, so we really appreciate their help and support. I would also like to thank the team at Intel, our rhodium sponsor, and they encourage you to keep an eye out for clues to something unexpected.

Okay, now down to some details. We really encourage everyone to use Slack; it's a great way to interact with the speakers and to ask for help. The info for setting up Slack was sent with your credentials. Follow that info to log in, and then use the different channels there to interact with speakers, ask for help, and introduce yourself to people. There are many different channels, and they follow a naming scheme: the key is the first letter, where C is for conference talks, K is for keynotes, P is for posters, S is for sponsors (if you want to go talk to our sponsors), and T is for the tutorials. Last year we had some issues where too many attendees were all on the same VPN, so if you're having issues with image quality, you might want to check whether you're on a VPN. So we really encourage everyone to sign up for Slack; it's also a good way to get help if you're having logistical issues, just use the help channel on Slack.

During the talks, the session chairs will moderate and will take questions from the Slack channels. So during a talk, log into the Slack channel for that talk. If you've got a question, look to see if it has been asked before; if it has, upvote that question with the thumbs-up emoji. That's how the session chair knows there's interest in a particular question. At the end there will be a thread that gives you a chance to appreciate and thank the speaker with the applause emoji; you can use other emojis as well, but applause is kind of the default.

Okay, a little bit of behind-the-scenes on Hot Chips: this is a volunteer-run conference, composed of three different committees. There's the steering committee, which is responsible for the overall mission of the conference; the operating committee, which runs the operations of the conference; and the program committee, which selects and creates the program. Everything here is volunteer-run, and the reason Hot Chips is such a special conference is that all of these volunteers are from the chip community: CPU architects and chip architects from industry and academia. We work together to put on the conference that we know the community will appreciate and attend. It's as simple as that: it's such a great conference because it comes from the volunteers, and the volunteers are such great representatives of the chip community. This is the list of the operating committee; I won't go into the details of everyone here, but I'd really like to thank all of the volunteers from all the different committees. There's a picture here from our studio command center, where we're operating the conference; that was Saturday night as things were getting set up. So again, it's really the volunteers who make this such a special conference. With that, I would like to introduce the program committee co-chair, Lisa.
Thank you, Ian. Along with Guri Sohi, who is a professor at the University of Wisconsin, we are the program committee co-chairs for Hot Chips 33, and we'd like to welcome all of you to our program. I hope that you're all in a safe and comfortable place to listen to a lot of excellent talks today, and I hope that you're all staying healthy.

Let's go ahead and continue. As Ian said, everybody who works on this conference is a volunteer, and this year's Hot Chips 33 program committee is also made up of volunteers. These are people who are very dedicated to selecting the best possible talks, talks that all of you will have a lot of interest in and will enjoy. They come from across industry and from universities; you can see all of their affiliations here, and you can see that it's a very broad-based group of people. I want to thank them here, while I have a moment, for all of their hard work. To put this program together, they identify keynote speakers and encourage them to give presentations here, they go out and solicit submissions for talks, and they read every single one of the submissions, and there are many, very carefully. Then we select the submissions for the talks that are going to be given and the posters that are going to be presented. After we select the talks, the session chairs spend countless hours working with the speakers to get their talks to really high quality, with a lot of engineering content, so they're not just marketing talks; these are the kinds of deep details that all of you are actually interested in seeing about the types of chips being presented today.

This is a bit of a retrospective: we did tutorials yesterday, but since this entire conference is being recorded, I thought I'd mention them in case you didn't have a chance to view them; you can go back and view the tutorials now. In the first session we had "ML Performance and Challenges of Machine Learning," with excellent speakers from NVIDIA, Google, Intel, Graphcore, Facebook, Amazon, and Microsoft. As you can see, that's an incredible list of people who work on machine learning all the time, so they are very knowledgeable. They gave excellent tutorial presentations covering the breadth of machine learning, including how challenging it is to produce different types of networks, how to tune them for performance, how you run benchmarks, the details behind MLPerf, and the challenges of managing today's very large networks and data sizes. So this was a very interesting tutorial for people who might not know the background and details of what goes on in machine learning. Following the machine learning tutorial, we had the advanced packaging tutorial. This one is dear to my heart. We had speakers from Intel and AMD talking about both the packaging technology and how they actually use it in some of their very high-end products, we had speakers from TSMC talking about their technology for advanced packaging, and then we had a follow-up at the end from an expert in the field from TechSearch International Inc.; Jan put it all together to give people an understanding of the various packages they had just been hearing about and how you might choose what would work well for your product. So this was also a great tutorial; it's all on video, all there, so if you want to go back and follow up sometime later, you can.

We have three keynotes for this year's conference. A little later today we're going to have Aart de Geus, the CEO of Synopsys.
He's going to talk about artificial intelligence and how it interacts with the tools for designing chips; his keynote is titled "Does Artificial Intelligence Require Artificial Architects?" and that will be at 12:30 Pacific time today. Later this afternoon, Pacific time, we're going to have Abraham Bachrach, who is the CTO at Skydio, and he's going to be talking about the Skydio autonomy engine. I am hoping for demos in this particular keynote, because it's a pretty cool autonomous flight vehicle that they have developed. Then tomorrow morning, at 10 a.m. Pacific time, we're going to have Dimitri Kusnezov, who is the Deputy Under Secretary for AI and Technology in the Department of Energy. For those of you who do high-end compute, you know that there is a lot of interest from the Department of Energy in very high-end performance chips and artificial intelligence, and he's going to talk a little bit about the architectural challenges in that space. That will be tomorrow, as I said, at 10 a.m. Pacific time.

Letting you know: we had 85 abstract submissions, and we accepted 27 talks. Thank you, keep going; I don't know why I have a graphic here. Okay, later today we're going to have CPUs, and then we're going to see a couple of interesting chips that came out of academia, from the University of Wisconsin and the University of Michigan. We have infrastructure and data processors, and rounding out the program this afternoon into the evening will be enabling chips for automotive, 5G, and high-bandwidth-memory products; so a bit of a spectrum of papers today. Tomorrow we're going to swing back to focus a little bit more on machine learning: we're going to look both at machine learning inference in the cloud and at the larger, heavier-duty chips for training and computation platforms. We're also going to talk about graphics and video processors, because I know there are a lot of people who have interest in those, and we're rounding out tomorrow's session with some very new technologies: we have a couple of sensors, we have quantum computing, and we have AR contact lenses. I'm looking forward to that talk as well.

Okay, Ian talked about this already, but we basically have a Q&A process through Slack, so there are channels; please post your questions in the appropriate channel. There are a couple of graphics there, keep going, there we go. And last but not least, we selected 18 outstanding posters. Some of these posters probably could have been excellent talks; we just didn't have space in the schedule, so I encourage you to go check them out. The posters will be staffed at times determined by the presenters themselves, because these posters come from across the world: we have posters from Europe, from the Far East and Asia, and from the Americas. You can see them at the bottom of the program tab: if you go to the conference website and scroll down below the program, you will see the poster links down there. There are Slack channels where you can interact with the presenters, and there are links to all the poster PDFs. So please, during breaks, or even later this evening (some of these presenters are in completely different time zones, so they may be available to chat with you then), avail yourself of this resource. And last but not least, I just want to encourage you to sit back, get comfortable, and enjoy our conference. Thank you for joining us. Now we're going to head into our first session, which is CPUs.
CPUs were pretty much the focus of the original Hot Chips; all the different microprocessors being developed at the time were one of the reasons we got together and started this conference. We're pretty excited about the CPUs we've got today; they're going to be very, very interesting, and we've got a good, broad spectrum. This morning's chair for that session is Nam Sung Kim. Nam Sung is an IEEE Fellow and an ACM Fellow, he has been a corporate senior vice president at Samsung, and he is currently a professor at the University of Illinois, so he is an excellent person to have shepherded these papers into the talks that you are going to see today. With that, I'm going to let this travel on into the first session.

Good morning, good afternoon, and good evening, everyone. I'm Nam Sung Kim, and I'm very excited to be the session chair for the first session of Hot Chips 2021. This session will give you four presentations on state-of-the-art CPUs: two from Intel, on the Alder Lake and Sapphire Rapids CPUs; one from AMD, on the Zen 3 CPUs; and one from IBM, on the Telum processor. The first presentation is from Intel: Efi will present the recently announced Alder Lake architecture, comprising heterogeneous cores and other interesting features for a wide range of power- and energy-efficient client processors. Let me introduce our speaker. Dr. Efraim (Efi) Rotem is an Intel Fellow and the lead power and performance architect for Intel's mobile client processors. He joined Intel in 1995, and since 2000 he has been leading notebook system and processor power/performance architecture. He holds about 150 patents and is a three-time winner of the Intel Achievement Award. Now Efi will start his presentation.

Hello everyone. I'm pleased to introduce Alder Lake today. Alder Lake brings an innovative hybrid architecture; this is the biggest architecture change we have driven in recent years, and today I will share with you the why and the how. We all understand the importance of data and compute to our modern society, and we are driving for more and more compute. Today, most applications are single- or lightly-threaded. The importance of multithreading is rising with data decomposition and machine learning algorithms. Furthermore, the recent reality reminds us how real users are using our computers: it is never a single application. It is always some productivity workload running while we collaborate with our colleagues, with a browser open in the background; we have antivirus, data security, and so on. For many years we have been driving single-thread performance by adding new and smarter architectural structures, new instructions, and accelerators, and these were driving size and power. The way we addressed the multithreading vector was taking this growing core and duplicating it many times to create the multi-core processor, basically moving in lock step on the diagonal of this chart. This served us well for many years, but the recent reality of Moore's law and Dennard scaling now drives a change. The innovative approach we have taken on Alder Lake was to break this diagonal and introduce two new cores: the performance core and the efficient core. To address single-thread performance, we unleashed the P-core and allowed it to grow and deliver even more performance. To address multithreading, we created an efficient core intended to deliver the highest computational density possible within the constraints.
Those two cores are architecturally equivalent, with different microarchitectures and different design points. We call this innovative architecture a performance hybrid.

Now let us have a look at those two cores, first the performance core. The performance core is built to push the limits of low-latency, high single-thread performance by building a machine that is wider, deeper, and smarter. It is built to excel on large-footprint code and data, and its design point is high speed. The efficient core is designed to be a throughput machine and deliver the most efficient computational density. You can see here that this is not a tiny little core intended for power scalability; this is a very capable machine at core-level performance. It has a deep front end and a wide back-end out-of-order machine, optimized for density and efficient throughput. It comes as four cores on a module with a shared multi-level cache, and it is fully ISA- and architecturally compatible with the performance core.

In this illustration I show the benefits of the P-core and the E-core in different scenarios. On the left side we see single-thread performance in a power-unconstrained scenario: on scalable applications, the combination of higher performance and higher frequency allows the P-core to provide up to 50 percent more performance than the E-core. On the right side I took ultra-mobile as an example. Without the hybrid architecture, in order to fit the tight constraints, the size, thermal, and power envelope of ultra-mobile, we can build a four-P-core processor. At the same power and thermal constraints, using the hybrid architecture, we can instead take two P-cores and eight E-cores and build an equivalent processor. This hybrid machine will outperform that P-core-only machine by more than 50 percent.

Now that we have introduced the cores, let's have a look at the Alder Lake system on chip. Using the mix of P-cores and E-cores, we were able to build the most scalable family we have ever introduced. It spans all the way from the small BGA package type for ultra-mobile to the LGA socket for high-end desktop. Furthermore, the Alder Lake family is leading the industry to new I/O technologies: we have introduced 16 lanes of PCIe Gen 5, which is 2x faster than PCIe Gen 4, and we are leading the transition to the new DDR5 while supporting all the existing memory technologies: DDR5, DDR4, LPDDR5, and LPDDR4. The memory controller has voltage and frequency scaling, which allows us to optimize power, performance, and speed. How do we achieve all that? We do it using a modular design. I've introduced the P-core and the four-E-core module with the MLC; they are equivalent and interchangeable, and we can mix and match any number of them together on a single system on chip. We have a bigger integrated graphics and a smaller integrated graphics, we have a set of IPs, I/O IPs, and accelerators, and a high-speed interconnect and fabric, and using all of these we mix and match and build the entire Alder Lake family.

When we architected this part, we set ourselves a goal to make it a seamless transition for the market, meaning any software that already exists should work as-is, out of the box, without the need to spend on special enabling for the hybrid architecture. The operating system also does not need any hard coding: it does not need to have any notion of P-core and E-core and what their capabilities are, and it should not need to be aware of the topology, the number of cores, or their arrangement.
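A rough back-of-the-envelope check of the multi-threaded claim above, assuming, per the single-thread comparison, that one E-core sustains roughly 1/1.5 ≈ 0.67 of a P-core's throughput at the same constrained operating point (the real uplift also depends on frequency and power sharing, so this is only an illustration, not a figure from the talk):

    4 P-cores:              4 x 1.00              = 4.0 P-core equivalents
    2 P-cores + 8 E-cores:  2 x 1.00 + 8 x 0.67   ≈ 7.3 P-core equivalents
    ratio:                  7.3 / 4.0             ≈ 1.8, comfortably "more than 50 percent"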
All the different core types are architecturally exposed to the operating system as logical processors, along with their properties, and the intelligence of how to use them is built into the hardware. How we do that brings us to the most exciting part of the hybrid architecture, which we call the Thread Director.

By now you probably understand that extracting the most benefit from a hybrid architecture is all about placement: we need to put the right workload on the right core at the right time. In order to do that, we have built this capability directly into the core hardware. We monitor the workloads as they run, nanosecond by nanosecond, and collect information about their properties. We provide feedback to the operating system and the scheduler on how to use the cores, and we dynamically update this information as conditions change, whether that is the load of the work or the physical power and thermal conditions. We also made the power and energy management of the cores hybrid-aware, with deep and intimate knowledge of the voltage-frequency scaling, the sizes of the cores, and the individual topologies.

Let's now dive one step deeper and see how the Thread Director works. This chart shows a distribution of many, many applications and the IPC-to-IPC ratio between the P-core and the E-core: one means they are equal in IPC, and numbers higher than one mean the P-core outperforms the E-core. We have built a machine-learning-based predictor that tracks the software as it runs in the microarchitecture, nanosecond by nanosecond, and generates a prediction: it runs on one core and predicts the ratio to the other core. We then take this value and bucket it into four classes. Class 0 is the mainstream of applications; you can see that this is the biggest bucket. This is where most applications, the legacy applications, sit: numeric manipulation, data movement, and so on. You can also see here the claim I made previously, that the E-core is a very capable processor: the ratio between the P-core and the E-core on mainstream applications is about one generation of processor. Where the P-core shines is on the emerging workloads: on the wide vectors, on the machine learning accelerators, and so on; there the IPC-to-IPC ratio can reach much higher numbers. Class 3 is the type of workload, like busy loops and memory-bound applications, that does not depend on the processor; it doesn't matter where they run, they will take the same time. We take this class value and store it in the thread context, meaning the operating system swaps it in and out along with the thread. The operating system also has observability, for each thread, of which class it belongs to, and it uses this class in the scheduling process.

Another piece of feedback that we create for the operating system is a table that is called, in the Intel SDM, the HFI (hardware feedback interface) table. This is a table we build based on intimate knowledge of the processor, the voltage-frequency curves, the properties of the E-core and the P-core, and so on. The table contains two columns for each class, one for performance and one for energy efficiency. The semantics are that the value in the performance column is the performance capability of the core, and the value in the efficiency column is the energy efficiency of the core. There is one row for each logical processor. You can see here that there is no such thing as E-core and P-core as far as the operating system or the architecture is concerned: there can be any number of different types of cores, each core can be different from all the others, and each one of them is enumerated with a value that says what its performance is and what its energy efficiency is.
It does not have to be the case that the E-core is always more energy efficient and the P-core always more performant. For example, on class 2 the performance of the P-core is much higher than that of the E-core. Energy is power multiplied by time, so if the time saved by running on the P-core outweighs the power saved by running on the E-core, then on class 2 the P-core may also be the more energy-efficient core. The values here may also change as conditions change: for instance, if we are highly power- and thermally constrained and need to lower the frequency, we may get to a point where the E-core becomes more performant than the P-core. So this is runtime feedback to the operating system, fully enumerating the capabilities of the different cores. A zero value in a column means "do not schedule." Obviously, if there is an affinity, the operating system will schedule the thread there anyway, but sometimes it is more efficient to consolidate software threads onto a smaller number of cores, and sometimes it is more efficient to consolidate all of them onto one type of core, only the E-cores or only the P-cores, and so on. All of this intelligence is built into the hardware and firmware of the Thread Director. Note that this table is updated much more slowly than the classification of threads: the classification is updated every scheduling window, while the table looks at a wider observation window, does some statistics over time, and generates this feedback.

When the operating system comes to schedule a new thread and needs to decide where to place it, it reads the HFI table. The operating system has the notion of the priority of a thread: it knows whether a certain thread is high priority, background, user-interactive, or low priority. If the operating system is scheduling a high-priority thread, it will look at the performance column, sort by the performance values, and schedule on the highest-performance core that is available for scheduling. If it is a background or low-priority thread, it will look at the energy-efficiency column and choose the most energy-efficient core that is available for scheduling. So all the directives about which core is more efficient or more performant are communicated through the HFI table, and no hard coding is needed in the operating system. The topology is also built into this table, so any number or any combination of core counts is supported.

Let's have an example and see how this works in real life. The underlying drawing here shows the physical cores of Alder Lake, the E-cores and the P-cores; the top layer is the operating system, and in between we have the Thread Director with the EHFI table built into it. At any given time the operating system may have many applications in the ready queue, ready to be scheduled. Each of them has a class, as mentioned, as part of its context, and the operating system knows which is foreground and which is background. When the operating system is about to schedule a background task, it will look at the energy-efficiency column of the EHFI table and direct it to the most energy-efficient core; in this example it would be an E-core. When it schedules a high-priority thread of any class and there are cores available, it will schedule it on a performant core based on the performance column, usually a P-core. Things start to get interesting, and need the support of the hardware, when there is contention, when there are more threads competing for, let's say, the P-cores than there are P-cores actually available. In this case the operating system will look at the classes and prioritize them. Classes 1 and 2 have priority over class 0, so if all the performant cores are taken by class 0 and a class 1 or class 2 thread arrives and needs to be scheduled, the operating system will migrate a class 0 thread to a less performant core, in this example an E-core, and schedule the class 2 thread on a P-core.

Here is a measurement that we took in the lab; the numbers are not official and are not to scale, they are just to illustrate where the Thread Director brings value. Obviously, if all threads are symmetric, it doesn't matter where we schedule them, and any naive scheduling, like first come, first served, works equally well. When there is asymmetry between the threads, that is where the Thread Director brings value. The green bar shows Excel with Office AI: Excel is a class 0 application, number calculations, data manipulation, and pointer chasing, and the AI is class 2, machine learning. We see that when we compare random, first-come-first-served scheduling to Thread-Director-based scheduling, putting the AI on the P-core and the class 0 number calculations on the E-core, we get a significant performance improvement. The other bars show other combinations: the blue and the orange show class 0 with class 1 or 2, where class 1 and 2 get priority over class 0 on the P-cores, and the yellow one is class 1 with class 3, where we direct the class 3 thread to the E-core, making room for class 1 and freeing up power, thermal, and energy resources for the other threads.

On Alder Lake we re-architected the power and energy management in order to make it performance-hybrid-aware; unlike a power-scalability type of hybrid, there is more to it. The power management algorithm has intimate knowledge of the topology, the core types, the voltage-frequency curves, and the performance of each of the different cores, and it is made aware of the type of workload running at any given time on the cores. This capability has been built into Intel Speed Shift technology since the Skylake family: when the operating system schedules a software thread on a core, it schedules it together with a value called EPP, the energy performance preference, which tells the power management algorithm the priority of that thread. We saw in the previous example that on a loaded system the E-cores may be running a mix of high-priority and low-priority threads at the same time, and the P-cores may also be running high- and low-priority threads. When we come to balance the power budget between cores or between threads, we need to be aware of that priority: low-priority threads will run at a lower voltage and frequency, and higher-priority threads at a higher voltage and frequency, regardless of whether they are on a P-core or an E-core. When we balance the power budget between P-cores and E-cores, the optimal voltage-frequency point is a function of the performance and the physical properties of the cores, and obviously, when we are running low-priority or class 3 types of workloads, we can run the cores at a low voltage-frequency point and conserve energy.
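The core-selection idea described above can be summarized in a short sketch. This is only an illustration of the mechanism as presented, not Intel's or any operating system's actual scheduler code; the struct layout, the 0-255 style encoding (higher value = more capable or more efficient), and the example numbers are assumptions invented for the example.

    /* Sketch of HFI-table-driven core selection (illustrative only). */
    #include <stdio.h>
    #include <stddef.h>

    #define NUM_CLASSES 4                 /* Thread Director classes 0..3 */

    struct hfi_row {                      /* one row per logical processor */
        unsigned char perf[NUM_CLASSES];  /* performance capability; 0 = do not schedule */
        unsigned char eff[NUM_CLASSES];   /* energy efficiency;      0 = do not schedule */
    };

    /* High-priority threads chase the best performance value for their class;
     * background / low-priority threads chase the best efficiency value.     */
    static int pick_core(const struct hfi_row *t, size_t ncores,
                         int cls, int high_priority, const int *idle)
    {
        int best = -1, best_val = 0;
        for (size_t i = 0; i < ncores; i++) {
            if (!idle[i]) continue;
            int val = high_priority ? t[i].perf[cls] : t[i].eff[cls];
            if (val > best_val) { best_val = val; best = (int)i; }
        }
        return best;  /* -1: nothing idle; a real scheduler would then consider
                         migrating a class-0 thread off a P-core, as described above */
    }

    int main(void)
    {
        /* Hypothetical table: two "P-core-like" rows and two "E-core-like" rows. */
        struct hfi_row table[4] = {
            { {200, 230, 250, 100}, {120, 110, 130,  90} },
            { {200, 230, 250, 100}, {120, 110, 130,  90} },
            { {150, 160, 140, 100}, {180, 170, 120, 200} },
            { {150, 160, 140, 100}, {180, 170, 120, 200} },
        };
        int idle[4] = {1, 1, 1, 1};
        printf("class-2, high prio  -> core %d\n", pick_core(table, 4, 2, 1, idle));
        printf("class-0, background -> core %d\n", pick_core(table, 4, 0, 0, idle));
        return 0;
    }

In this made-up table the class-2 thread lands on a "P-core" row and the background class-0 thread on an "E-core" row, mirroring the behavior the talk describes; note also that the example encodes the class-2 efficiency of the P-core rows as slightly higher, echoing the energy-is-power-times-time point above.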
All right, we have time for questions, and based on the voting we have a few interesting ones. The first question was from Elliot from AMD, who asked what the security implications of Thread Director as a side channel are. Could you answer that, Efi? [Efi's answer is largely inaudible in the captions.] Okay, thank you for the answer. The second popular question was about the die photo and PCIe: someone was asking how many PCIe Gen 3, 4, and 5 lanes there are for the different product types. [The answer is inaudible in the captions.] Okay, that's great. And the last question: Charlie from SemiAccurate was asking whether Thread Director support will be available for Linux and, if so, in which kernel and when. [The answer is inaudible in the captions.] All right, thank you, and let me move on to the second presentation.

The second presentation is from AMD, and the speaker will present the next-generation Zen 3 core architecture, designed to provide scale-up performance for servers, data centers, and supercomputers. Let me introduce our speaker, Dr. Mark Evers. Mark is a senior fellow at AMD, where he has contributed to many processor designs over the last two decades, holding a variety of roles in physical design, RTL, and microarchitecture. Most recently he was the lead architect for the Zen 3 core, and he holds a Ph.D. in computer science and engineering from the University of Michigan. Now Mark will start his presentation.

When you use your computer, you want smooth frame rates for gaming and fast processing that makes the computer feel like an extension of you. With the next-generation Zen 3 core we're delivering another large increase in CPU performance for gamers, content creators, high-performance computing, and more. I'm Mark Evers, the lead core architect for Zen 3, and I feel incredibly privileged that I get to tell you about this core on behalf of a large and very talented team.

We started the Zen line of cores with the release of Zen in early 2017, with a ground-up redesign, an over 50 percent increase in IPC, and a new SoC architecture based on a four-core complex; this marked a new era in the market for AMD. The Zen 2 design built on this with higher frequencies, more IPC, a larger L3 cache, a larger floating-point unit, and a new 7 nm process. Then, late last year and continuing into this year, we launched the 7 nm Zen 3 core products. Zen 3 is based around a new eight-core complex. It delivers a big performance uplift in both frequency and IPC, has doubled INT8 throughput, and has an innovative new set of L3 cache solutions, including support for stacked AMD 3D V-Cache. With large IPC improvements in every generation, we are well exceeding past industry trends.

When we started the Zen 3 design, we were ready to aggressively redesign the architecture: to deliver another landmark increase in single-thread performance through IPC and frequency, and to unify the cores and the cache in a contiguous eight-core complex to improve effective latency and provide scale-out performance for servers, data centers, and supercomputers. We also wanted to introduce new ISA extensions, expanded security features, and support for AMD 3D V-Cache integration, and we wanted to do all of this while enabling platform-level scaling and energy efficiency and maintaining socket compatibility with past products, to simplify the upgrade cycle for our partners and customers. Zen 3 was a ground-up design that included a thorough reimagining of many of the pipelines and functional units. The diagram on the right shows the Zen 3 microarchitecture, and we'll go into more detail on that in the later slides.
First, some of the high-level characteristics. Zen 3 supports simultaneous multithreading to get extra performance in an energy-efficient manner when additional threads of work are available. The flow of instructions through the pipeline starts at the top right, with the state-of-the-art branch predictor feeding a sequence of addresses to the front end of the core. Instructions are then fetched and decoded, four instructions per cycle from the 32 KB instruction cache, or eight ops per cycle from the op cache, which can hold 4K ops. The resulting ops are placed into the op queue and then dispatched, up to six ops per cycle, to the integer or floating-point schedulers. To execute the ops there are four integer units plus dedicated branch and store-data units, we support three address generations per cycle, and there are additional floating-point resources, including the capability for two vector floating-point multiply-accumulates per cycle. The load/store unit has a 32 KB data cache supporting three memory ops per cycle, backed by a half-megabyte L2. The instruction and data level-1 TLBs hold 64 entries each, backed by a 512-entry level-2 structure for instructions and a 2K-entry one for data.

This all adds up to a 19 percent IPC uplift for Zen 3. That's the third consecutive generation with a double-digit IPC gain, and the largest gain since the original Zen core, and it's in addition to the frequency uplift. The size of the bars on the right shows the approximate contribution of each part of the design to the overall improvement, and you can see that to get an improvement this large we had to significantly redesign logic throughout the core, with performance contributions from just about every corner of the chip.

One of the largest chunks of that extra performance came from the front end of the core. We reduced the minimum mispredict latency, one of the most important latencies in the core, by up to three cycles; even a very good branch predictor will be wrong sometimes, and this helps us get back on track faster when that happens. We also reduced the prediction latency for taken branches, so there's no lost bubble cycle in most cases. The TAGE predictor configuration was optimized, and we moved more of the BTB storage in closer to the first level for more consistent performance with larger-footprint code. The target predictor for indirect branches was doubled in size, and the 32 KB L1I cache has improved prefetching, leading to better utilization. Also, the relationship between the I-cache and the op cache was reworked to process op-cache fetches faster and better handle the cases that switch back and forth between the op cache and the I-cache. Overall, this makes for faster fetch, especially for branchy and large-footprint code. We introduced a new distributed integer scheduler organization to support better scheduler efficiency and wider execution issue, some instruction latencies were reduced, and we support a larger out-of-order window through increased register file, scheduler, and reorder buffer sizes. Finally, the peak issue bandwidth, what some may call pick bandwidth, was increased from 7 to 10.
All in all, this gives us more execution bandwidth and the ability to extract more instruction-level parallelism to feed it. Now, I just mentioned that we increased the number of integer ops picked per cycle from seven to ten, but if you do this in the straightforward way of just adding more general-purpose ALUs, it can be really costly. To accomplish this as efficiently as possible, we kept four ALUs and three AGUs, but we added new branch and store-data units at a smaller cost, offloading the more expensive ALUs so they could focus on the operations that need their full capabilities. Since the new units do not produce register results, this was done without any increase in register-file write ports or growth in the register bypass network. The new distributed scheduler organization also allows for more uniform use of the capacity across a variety of workloads.

On the floating-point side, we increased the dispatch bandwidth to six ops per cycle into a larger, 64-entry scheduler. We have two multiply units and two add units, but we also added separate float-to-int and store-data movement units to get more out of the main functional units, and we grew the scheduler to help extract parallelism to feed them. The main units are all 256 bits wide. We shortened the latency of the important floating-point multiply-accumulate instruction to four cycles to speed up execution, and we doubled the number of units that support INT8 ops to speed up workloads that use them.

Staying with the theme of larger structures, on the load and store side we also grew the store queue to 64 entries. We improved our prefetchers, especially focusing on better prefetching across page crossings and better coordination between the L1 and L2 cache prefetchers, and we also provided a little better configurability for the prefetchers. The load/store unit supports three memory-op accesses per cycle to the 32 KB data cache: all can be loads, but at most two can be stores, with some additional restrictions for floating-point loads and stores. The 2K-entry L2 DTLB has six page table walkers for those cases when you're still missing. That may seem like a lot, but a few workloads that randomly access data from a very large data set can generate a lot of concurrent TLB misses, so this helps those workloads. And to round it out, we sped up execution of short string copies and improved our handling of store-to-load dependencies.
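As a hedged illustration of the concurrent-TLB-miss point above, an access pattern like the sketch below touches essentially random 4 KiB pages across a footprint far larger than any TLB can map, so many independent loads can miss the TLB at once. The sizes, the xorshift index generator, and the suggestion that extra page-table walkers help here are assumptions for illustration, not AMD-published data.

    /* Illustrative random-access loop that stresses the DTLB (not a benchmark).
     * Assumes a 64-bit system with several GiB of free memory.                */
    #include <stdint.h>
    #include <stdlib.h>

    #define FOOTPRINT (4ULL << 30)   /* 4 GiB: far more 4 KiB pages than a 2K-entry TLB */
    #define ACCESSES  (1ULL << 26)

    static uint64_t random_sum(const uint8_t *buf)
    {
        uint64_t sum = 0, idx = 88172645463325252ULL;   /* xorshift64 state */
        for (uint64_t i = 0; i < ACCESSES; i++) {
            idx ^= idx << 13; idx ^= idx >> 7; idx ^= idx << 17;
            /* Each load lands on an essentially random page, so consecutive,
               independent loads can all miss the DTLB; hardware able to walk
               several page tables concurrently keeps those misses from
               serializing behind one another.                               */
            sum += buf[idx % FOOTPRINT];
        }
        return sum;
    }

    int main(void)
    {
        uint8_t *buf = malloc(FOOTPRINT);   /* contents don't matter for the sketch */
        if (!buf) return 1;
        return (int)(random_sum(buf) & 1);  /* keep the loop from being optimized away */
    }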
After that detailed walkthrough, I'd like to summarize Zen 3 in a different way, by looking at what changed the most from Zen 2. This really doesn't capture everything, since many of the units were redesigned in a much more comprehensive way, but it still gives a good feel for the differences. On the front end, we doubled the size of the L1 BTB, and we improved the bandwidth of the branch predictor, in part by removing the bubble cycle that often occurs with taken branch predictions. We sped up the recovery from mispredicts, so that in those cases when the very accurate branch predictor did mispredict, we could still get back on track faster, and we sped up the sequencing of op-cache fetches so that you can fetch faster from the op cache. We also made for quicker switching between the op-cache and I-cache pipes, so that when you have moderately large-footprint code that doesn't fit in the op cache, those transitions are handled better. On the execution side, we added the dedicated branch and store-data pickers and built a larger window to extract instruction-level parallelism from; we also reduced the latency of some select ops, which likewise helps extract more parallelism out of the code. On the floating-point side we also went wider, to six-wide dispatch and issue, with faster latency on the FMAC instruction. For load/store, we have higher load bandwidth, higher store bandwidth, and more flexibility in how you can mix those load and store ops in the pipelines; we improved the memory dependence detection for those cases where a load is close together with the store that creates the data it needs; and we increased the number of page table walkers to support TLB misses.

But the real reason for the microarchitectural changes was to deliver more performance, and we're very excited about the 19 percent IPC improvement over Zen 2. That 19 percent is a geometric mean across 25 workloads, but as you can see on the right side of the chart, some top gaming titles got more than a 30 percent IPC improvement.

Now let's talk about the support for new software features in Zen 3. At AMD, security remains foundational in our designs, and Zen 3's Infinity Guard security features add strong additional layers to our security offering. In the first-generation EPYC, Secure Encrypted Virtualization (SEV) enabled encrypted memory for each virtual machine to protect confidential information. In the second generation we added SEV-ES to protect the virtual machine control registers on the CPU from being compromised. And now, with Zen 3 and the third-generation EPYC, we add Secure Nested Paging to protect data in use at the virtual-machine level. Secure Nested Paging is a new layer of protection in a multi-tenant cloud: it eliminates additional attack vectors through the page tables, meaning that the cloud provider's hypervisor can't see into the virtual machines or make changes to the tenant. This protects data in use even from cloud administrators. Moreover, all of these confidential-computing capabilities can be seamlessly implemented as a lift-and-shift, with no application modification needed. We'll continue our priority focus on security and plan to add more hardening of AMD security offerings going forward.

Zen 3 also adds support for several new instruction-set features. All Zen 3 products get support for new 256-bit encryption and decryption extensions, doubling the data size for those operations; for memory protection keys, to provide additional user-level control over access protections; and for a shadow stack, for protection against return-oriented-programming-based security attacks.
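For readers unfamiliar with memory protection keys, the Linux pkey interface gives a rough sense of the user-level control being described. This is a generic illustration of the OS-level API rather than anything AMD-specific from the talk, and it assumes an x86 system with a kernel and glibc (2.27 or later) that expose protection-key support.

    /* Sketch: restricting write access to a buffer with a memory protection key. */
    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stddef.h>
    #include <stdio.h>

    int main(void)
    {
        size_t len = 4096;
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) return 1;

        int pkey = pkey_alloc(0, 0);         /* allocate a protection key        */
        if (pkey < 0) return 1;              /* kernel or CPU without pkey support */

        pkey_mprotect(buf, len, PROT_READ | PROT_WRITE, pkey);  /* tag the pages */
        pkey_set(pkey, PKEY_DISABLE_WRITE);  /* drop write rights from user space */

        /* ((char *)buf)[0] = 1;   would now fault, with no syscall needed to
                                   change the rights back                        */
        pkey_set(pkey, 0);                   /* restore write rights, again with
                                                no syscall                       */
        ((char *)buf)[0] = 1;
        printf("%d\n", ((char *)buf)[0]);
        pkey_free(pkey);
        return 0;
    }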
For servers, I already covered the new extensions to AMD Infinity Guard protection with Secure Nested Paging, and we also added a few more SEV-ES enhancements. In addition, servers get support for broadcast page-table invalidations through the new INVLPGB instruction, and support for process context IDs to reduce TLB flush requirements.

SoCs using Zen 3 cores also have many innovations in the L3 cache architecture that help take performance to the next level. This picture shows the difference in the core-complex architecture between the Zen 2 and Zen 3 designs. The Zen 3 core complex was re-architected to be based on eight cores sharing an L3 instead of four. This doubles the amount of L3 cache that is directly accessible from a given core and accelerates core-to-core and cache-to-cache communication for gaming and other workloads. This organization results in a reduction in effective memory latency by reducing the number of L3 misses, especially for workloads with a fair amount of data sharing. As part of this transformation we incorporated a new bidirectional ring bus. This is a more scalable design, and the bidirectional nature helps keep latencies low by giving more flexibility in how to route commands and data between the CPU cores and the L3. With two 32-byte data channels going in opposite directions, we deliver high-bandwidth access to the L3 banks in a power-efficient manner while keeping latency low.

Looking a bit deeper at the Zen 3 cache hierarchy, the picture on the right shows a different way to represent it, showing again how eight cores connect to a shared L3 cache. I already covered the L1 and L2 cache capabilities in the earlier slides, but an important characteristic of the L3 cache is that it is filled from L2 victims rather than on all L1 and L2 fills; this helps improve utilization over an inclusive design. The L2 tags are duplicated in the L3 cache to facilitate faster cache-to-cache transfers and so that we don't need to send probe requests to the L2 cache for misses. We support up to 64 outstanding L2 misses per core and 192 misses from the L3 to memory.
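To make the "filled from L2 victims" point concrete, here is a hedged, toy sketch of a victim-fill (mostly exclusive) policy. It is a conceptual illustration only, not AMD's cache-controller logic; the single-way structures and function names are invented for the example.

    /* Toy sketch of a victim-filled (mostly exclusive) L3:
     * the L3 is written only when the L2 evicts a line, not on every fill. */
    #include <stdio.h>

    struct line { unsigned long long tag; int valid; };

    static struct line l2_way, l3_way;      /* one way of each, for the toy model */

    static void fill_from_memory(unsigned long long tag)
    {
        if (l2_way.valid)                   /* the L2 victim is the only thing    */
            l3_way = l2_way;                /* that ever fills the L3             */
        l2_way.tag = tag; l2_way.valid = 1; /* the new line goes into L1/L2 only  */
        /* note: no L3 fill here, so the L3 does not duplicate the L2's contents */
    }

    int main(void)
    {
        fill_from_memory(0x100);            /* miss: L2 gets 0x100, L3 stays empty */
        fill_from_memory(0x200);            /* 0x100 is evicted into the L3        */
        printf("L2 holds %#llx, L3 holds %#llx\n",
               l2_way.tag, l3_way.valid ? l3_way.tag : 0ULL);
        return 0;
    }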
But perhaps the most exciting part of the L3 cache is the built-in support for AMD 3D V-Cache. We have demonstrated 3D V-Cache on a prototype 12-core processor based on Zen 3. In addition to the 32 MB of L3 on the base die, an additional stacked 64 MB of L3 brings the total up to 96 MB per CCD, or a staggering 192 MB in total. This is enabled using through-silicon vias that are present in the Zen 3 CCD and a direct copper-to-copper bond with the stacked memory. Again, we don't do technology just for the sake of technology: this delivers serious performance. On the prototype we demonstrated an average 15 percent improvement across many PC games, five of which are shown here; that is the kind of IPC uplift you would expect to see from a full core generation.

Those were some exciting performance numbers, but I want to share a little more about which products you can find Zen 3 in and what kind of performance and power efficiency they deliver. The first products launched with Zen 3 cores were the AMD Ryzen 5000 series mobile processors, with unprecedented performance and battery life, and the PRO versions of those mobile processors, delivering multi-layered security features to help provide protection at every level, from silicon to OS. Then came the 5000 series desktop processors, delivering up to a 26 percent generational uplift in gaming performance, and the third-generation AMD EPYC server processors, delivering world-record performance and advanced security features with AMD Infinity Guard. And speaking of those world records: when we launched the second-generation EPYC processors, we were able to show over 80 world records across a wide range of applications and workloads. That's pretty good, but now, after the launch of the third-generation EPYC with Zen 3, EPYC has over 200 world records across nearly every type of application, workload, and customer, including databases, cloud, enterprise, and high-performance computing. That's a testament to the performance that the third-generation EPYC family brings.

Zen 3 also improved performance per watt over the prior generation. We remained in the same 7 nm technology as Zen 2, so this improvement was all due to better architectural efficiency and physical design optimizations. This chart shows system performance per watt running the Cinebench R20 multi-threaded test for competing gaming processors using similarly configured systems: the Zen 3 based systems have around a 10 to 25 percent performance-per-watt advantage over the Zen 2 systems, and a much larger advantage over systems using a competing processor. But this is where I get really excited: gaming performance was one of the main targets for the Zen 3 design. It's all about delivering performance that matters to the user, and that's the end result of the enhancements in and around Zen 3.
Since we stayed in the same 7 nm technology as the prior Zen 2 generation, these improvements are all down to the new architecture and physical design optimizations. The 19 percent IPC uplift, access to a larger portion of the L3 cache per core, higher frequencies across the stack, and a unified eight-core complex all add up to great gaming performance. As the chart shows, it adds up to an approximately 26 percent average gaming improvement, and as high as 50 percent on some games, and this is without 3D V-Cache.

So, to summarize: we got a 19 percent IPC uplift in client workloads, higher frequencies across the desktop family, lower effective latency with the eight-core complex, better power efficiency, an industry-first prototype demo of copper-to-copper die-stacked cache, and a slew of new server world records, along with the next-generation security features. We feel we solidly met our main design goals for the Zen 3 core, with leadership gaming and server performance. And it doesn't stop there: as excited as we are about how Zen 3 turned out, we are focused on delivering another solid improvement with the 5 nm Zen 4, which is on track. To build a product like this takes an awesome team; I want to extend a big thank-you to the AMD core team and all the other AMD teams that made Zen 3 possible. This was a really gigantic team effort, and Zen 3 came to fruition through their hard work. Thank you for listening, and I believe we have time for some questions.

Mark, thanks for the great talk. Now it's the Q&A session, and we have a lot of questions here; because of the time limit I don't think we can get to all of them. One popular question was about which workloads benefit from the larger V-Cache. [The question is only partially captured in the captions.]

Yeah, so there are a lot of different workloads that can really benefit from the larger V-Cache; it's obviously a little different from program to program. We haven't actually announced the specific products that will have AMD 3D V-Cache yet, so I can't tell you exactly which products will be out there, but we do see workloads across many different segments that benefit greatly from that larger cache.

[The next question, apparently about the six page table walkers, is not captured in the captions.]

Yeah, that's a good question, and actually one that I heard internally a couple of times too. There are a few workloads that have a really large, random-access type of memory footprint, where you actually do see quite a fair bit of outstanding TLB misses at the same time, and where you can get that extra benefit. Obviously, a lot of the workloads you run will not need more than a couple of page table walkers, but we did see that benefit in a few places, and we figured out a clever way, I'd say, of getting that many without paying a huge cost for it, so we decided: why not give that extra opportunity to the places that do benefit from it?

[The next question, only partially captured in the captions, asked about latency.]

Yeah, so I think there are a couple of different places that question could be going, so let me answer it based on my understanding of it. When it comes to the 3D V-Cache, the extra latency to the L3 cache is not particularly large; I don't have a specific number I'm going to quote here. But also, when it comes to the chiplet technology of having different CCDs and I/O dies to cover different functionality, in some ways it can actually give you more flexibility than a monolithic design, so it doesn't have to increase latency significantly, or power for that matter. In general we find that we can build basically the best products we can build with chiplets, and that's covering all of the different trade-offs.
[Answering a further question, apparently about area growth:] Yeah, and I don't recall if we published anything specifically on that, so I'm not going to give you an exact number there. I'd say it was a fairly modest increase in size; it's definitely larger, but it's not a gigantic growth. Most of the extra benefits that we unlocked, like the 19 percent IPC and the extra frequency, did not come at the cost of some gigantic growth in area.

Okay. [The session chair introduces the next speaker, crediting his contributions to several IBM processors over the last five generations; the introduction is only partially captured in the captions.]

Let me ask you a question: how often have you used the mainframe today? Did you use your credit card to pay for your morning coffee, or order something online? Did you go grocery shopping, or get money from an ATM? Then you've probably used an IBM Z system today. Hi, my name is Christian Jacobi, and I'm the chief architect for Z processor design. Today I'm introducing the IBM Telum chip. Telum is the next-generation processor for IBM Z and LinuxONE systems. It's a very exciting moment for me to be able to talk about this chip publicly for the first time. I've seen this chip grow up from rough ideas in the concept phase, through high-level design and the ups and downs of implementing the chip, and now that it's working well on the test floor and we can talk about it here at Hot Chips, it's just a major milestone for the project and for me personally.

The Telum design is focused on enterprise-class workloads, and it provides the necessary performance, availability, and security, but it also has a new feature: a built-in AI accelerator geared towards enabling our clients to gain real-time insights from their data as it is being processed. I'll talk you through all of those details, but before I do that, let me give you a little bit of background around IBM Z. As you can probably tell from my initial questions, IBM Z systems are a central part of large enterprises' IT infrastructure, in industries like banking, retail, and logistics. But it's not only large enterprises in those kinds of industries that use IBM Z; the same capabilities are being exploited by startup companies in new areas like digital asset custody. Enterprise workloads are an ever-evolving mix of established technologies and new technologies. Take languages as an example: it's not uncommon for an enterprise workload to be composed of programs written in COBOL, Java, Python, and Node.js. Enterprise workloads combine traditional on-prem data hosting with OpenShift-based hybrid cloud, and they combine traditional transaction and batch processing with artificial intelligence. That latter point is particularly interesting: increasingly, enterprises are using the data they own and process to gain insights with AI models, insights they can then use to optimize their businesses. The IBM Telum chip is designed for such mission-critical workloads, enabling enhancements in both the traditional aspects of enterprise computing and AI capabilities.

Let me talk through some of the details on the attributes that are traditionally associated with enterprise workloads. First, there's performance and scale. Enterprise workloads are very sensitive to per-thread performance, meaning the ability to finish every single task very quickly, and they are also very sensitive to scalability, so that they can scale up to the sheer number of tasks thrown at these systems every second. The Telum chip has an optimized core pipeline, a brand-new cache hierarchy, and a new multi-chip fabric that I'm going to describe in detail.
Enterprise workloads are also very heterogeneous: banking workloads are very different from, say, logistics workloads, and even within those workloads there are very heterogeneous kinds of programs. Still, there are some common types of operations that happen across a wide range of applications, for example data sorting, compression, and cryptography. IBM Z has a long history of implementing hardware accelerators for such tasks, in cooperation with the firmware and software teams, to enable the best possible end-to-end value from those accelerators. As I already mentioned, we are now implementing a new AI accelerator, and we have re-optimized all of the existing accelerators to work in harmony with the new cache hierarchy and fabric design.

Of course, enterprise workloads are also very sensitive to security, and the IBM Telum chip implements a number of innovations in that regard as well. We now implement encrypted memory, and we have a performance-improved trusted execution environment. The trusted execution environment enables clients to run containerized workloads in such a way that the hardware ensures that the system administrators and the hypervisor administrators cannot get to the data in those containers; that obviously aligns very well with a hybrid-cloud operational model. Last but not least, enterprise and mission-critical workloads need the best possible reliability and availability. The IBM z15 predecessor chip already provided seven nines of availability, and with the Telum chip we're driving the ball forward through a number of enhancements, for example a new error correction and sparing mechanism that can recover data even when an entire L2 cache SRAM array has a wipe-out error: we can transparently correct the data and then bring in a spare array without the software even noticing.

I'll now take you on a small journey through the chip, to talk a little bit more about how the Telum processor achieves its performance and scalability enhancements, before I come back to the AI capabilities. Let's start with the eight cores and L2 caches per chip. We have optimized the core for the best possible performance, and we are investing a lot of silicon real estate into pure core performance, for example through a very deep, high-frequency out-of-order pipeline and very large structures like the branch prediction tables and the caches. The out-of-order pipeline runs at a base frequency of more than 5 GHz and implements SMT2. A number of enhancements went into the core pipeline; one of the bigger ones is the redesigned branch prediction. We now have an integrated first- and second-level branch prediction pipeline, which allows us to access the second-level BTB with lower latency when branches are not found in the first-level branch predictor. We also implement a new mechanism called dynamic branch prediction entry reconfiguration, which allows us to vary the number of branches we can store in each table entry based on how many branches are in any given instruction cache line and whether those branches are going far away or staying nearby. Depending on that, we need fewer or more bits to store the branch target address, and based on that we can put more or fewer branches into the branch prediction tables. With that design, we can keep more than 270,000 branch targets in every single core's branch prediction tables; that sheer size is a testament to the scale of these enterprise workloads.
On z15 we implemented shared physical level-3 caches on the processor chip, and we had a separate cache chip that implemented a large level-4 cache. On the Telum chip we implement all of that logic in a single chip, and we opted to quadruple the L2 cache to 32 MB. Of course, the L2 access latency is very important for the performance of enterprise workloads, so we spent a lot of engineering effort to get that latency as low as we could, and we achieved a 19-cycle load-use latency. That's roughly 3.8 nanoseconds, and it already includes the access to the 7,000-entry TLB. We have four pipelines in the L2 that allow overlapping traffic, so that the performance of the L2 does not bog down under load.

Now, I mentioned the shared level-3 and level-4 caches that we had in the z15 generation. In the Telum generation we don't have those as physical caches anymore; instead, we build virtual level-3 and level-4 caches from the private L2 caches, and overall we can provide 1.5x the cache per core, at improved latencies, compared to z15. From a software perspective, and a software performance perspective, it still feels like a traditional cache hierarchy, even though everything is built from the L2 caches; that's an important aspect of driving a consistent workload performance gain across a wide range of workloads with the Telum chip. Let me describe in a little more detail how we achieve these virtual level-3 and level-4 caches. First, we interconnect all the L2 caches on the chip with a ring infrastructure that supports more than 320 GB/s of ring bandwidth. Then, based on that infrastructure, we implement what we call on-chip horizontal cache persistence. What that means is that when one L2 evicts a cache line, it can look around on the chip for a less busy L2 and push the cache line into that other L2, so that it stays close by on the chip; should the workload come back to that data, it is accessible very quickly, with on-chip latencies. That way we achieve a 256 MB distributed cache on the chip with an average latency of only 12 nanoseconds, which is faster than the physical L3 we had on z15. We then apply the same mechanism across multiple chips: we can group up to eight Telum chips and form a virtual 2 GB level-4 cache across those eight chips.
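As a quick sanity check on the quoted numbers, 19 cycles at a bit over 5 GHz is about 19 / 5.0 GHz ≈ 3.8 ns, matching the figure above. The horizontal-cache-persistence idea can also be sketched in code; this is a hedged, conceptual illustration only. The "busyness" heuristic, the data structures, and the selection policy are invented for the example and are not IBM's actual implementation.

    /* Conceptual sketch of on-chip horizontal cache persistence (not IBM's design). */
    #include <stdio.h>

    #define NUM_L2 8                        /* eight private L2 caches per chip */

    /* Toy model: each L2 has a "busyness" score and a count of spare ways. */
    static int busyness[NUM_L2]   = {5, 1, 3, 7, 2, 6, 4, 0};
    static int spare_ways[NUM_L2] = {0, 2, 1, 0, 3, 1, 0, 2};

    /* When L2 'src' evicts a line, try to park it in a less busy peer L2 on the
     * ring instead of dropping it, so a later access can still hit on-chip.    */
    static int choose_home_for_victim(int src)
    {
        int best = -1, best_busy = busyness[src];
        for (int i = 0; i < NUM_L2; i++) {
            if (i == src || spare_ways[i] == 0) continue;
            if (busyness[i] < best_busy) { best_busy = busyness[i]; best = i; }
        }
        return best;                        /* -1 means: write back toward memory */
    }

    int main(void)
    {
        int home = choose_home_for_victim(3);   /* L2 #3 evicts a line */
        if (home >= 0)
            printf("victim line persists in peer L2 #%d (later access hits the virtual L3)\n", home);
        else
            printf("no suitable peer; victim is written back toward memory\n");
        return 0;
    }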
i already mentioned that on z15 we had a dedicated cache chip and that cache chip also was a hub whenever two processor chips needed to communicate with each other which led to a little bit of added latency through that hub chip now having everything combined into the telum chip we can implement a completely flat topology within the drawer meaning every one of the eight processor chips in a drawer has a direct connection to every other chip in the drawer that further reduces the latency of that large virtual level 4 cache taking all of these enhancements together the improved fabric controllers and on-drawer interfaces the core design and the cache hierarchy we can achieve over 40 percent per-socket performance growth that's the kind of performance growth our clients need to keep up with the increase in their workloads i spent a lot of time describing the details of how we achieve the performance and scale i now want to switch gears and talk more about the embedded accelerators and specifically the accelerator for ai but before i go into the details of the ai accelerator i want to spend a little bit of time to explain the use cases that we're going after and that are shaping some of the design decisions when i look at enterprise workloads and ai use cases they roughly fall into two categories the first category is what i would label business insights where clients can use ai on their business data to derive insights they then use to improve their businesses examples include fraud detection on credit cards customer behavior prediction or supply chain optimization the second category i would label as intelligent infrastructure where ai algorithms are used to make the machine more efficient examples include intelligent workload placement in an operating system database query plan optimization or anomaly detection for security let's take credit card processing as an example and specifically credit card fraud detection we know from our conversations with clients that when they try to do that with an off platform inference engine they cannot achieve the low latency and the consistency of low latency by sending data from ibm z to a separate platform also when sending data to a separate platform it creates all sorts of security concerns that data after all is sensitive and often personal and so the data needs to be encrypted the security standards need to be audited and those things create additional complexity in an enterprise environment so based on that we know from our clients that they would much rather have the ability to run ai directly embedded into the transaction workload directly on ibm z that way they can score every transaction a hundred percent of the time with the best available model that they want to use for that task for that reason we chose to implement a centralized on-chip accelerator directly shared by all the cores let me talk through some of the attributes that this design point provides us and compare that to those basic use cases first of all i mentioned we need very low and just as important very consistent inference latency by having the accelerator accessible by every single core when the core switches back and forth between non-ai work and ai work it has the ability to use the entire compute capacity of the ai accelerator for when it does perform ai work that's different from most other server processors that are implementing some ai capabilities directly in their vector execution units in that design point when
workload switches back and forth between ai and non-ai work the ai work can only get the portion of the total capacity that belongs to that core in our design point the entire centralized accelerator's capability is available to every core when it needs it second we had to optimize the ai accelerator's compute capacity to match the total transaction capacity of the telum chip we want our clients to be enabled to perform ai inference as part of every transaction so we needed to implement sufficient compute capacity for that the centralized ai accelerator provided us with some amount of flexibility on floor planning and where we place the accelerator on the chip and also how much area we can devote to the accelerator between those two considerations we implemented the ai accelerator with more than six teraflops of compute capacity we also know from our clients that they are using a wide range of different types of ai models ranging from traditional machine learning models like decision trees to various types of neural networks we designed the accelerator to provide acceleration to the operation types that occur in those different types of ai models i already mentioned the importance of security and how we are avoiding sending data off platform with a built-in accelerator but of course it's also important to follow the strong memory virtualization and protection mechanisms that ibm z on its cores implements i'll describe how we map that from the core directly onto the accelerator and then last but not least ai is a fairly new and quickly evolving field so we designed the accelerator with extensibility in mind there's a lot of firmware involved in how the accelerator works and so that enables us to provide updates and new features and functions with new firmware releases in the future the hardware design of the accelerator also naturally lends itself to enhancements in future generations of silicon so let me go a little bit more into the details of how the accelerator works we defined a new instruction called the neural network processing assist instruction that instruction is a memory to memory cisc instruction meaning the operands the tensor data are directly sitting in user space in a program's memory so for example a program could have two matrices sitting in memory and a destination matrix and call the instruction and the instruction would perform the matrix multiplication of the two source operands and put the result into the destination operand the instruction can perform many types of operations like matrix multiplication pooling or activation functions there is firmware running on both the processor core and the ai accelerator the processor firmware performs all the address translation and translates the program virtual addresses to physical addresses and it also performs access checking as it performs that translation that way we inherit all the natural virtualization and protection capabilities of the core for the ai accelerator the core also performs pre-fetching of the tensor data into the l2 cache so that data is readily available when the accelerator needs it the firmware sitting on the core and the accelerator are then building a data pipeline that stages the data from the l2 cache into the accelerator and distributes it within the accelerator to gain the maximum efficiency of the compute performance there speaking about the compute performance like i said we deliver more than six teraflops per chip which provides us with over 200 teraflops in a 32-chip system
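The memory-to-memory flavor of the new instruction described above can be pictured with a small Python model. The function name, the op strings and the numpy backing are purely illustrative assumptions; the real neural network processing assist instruction operates on architected operands with firmware on the core and the accelerator handling translation, protection and data staging.

```python
# Conceptual model of the memory-to-memory semantics described for the neural
# network processing assist instruction: source and destination tensors live
# in the program's own memory, and a single operation such as matmul, pooling
# or an activation function is performed on them in place. Illustrative only.
import numpy as np

def nnpa_like_op(op: str, dst: np.ndarray, *src: np.ndarray) -> None:
    """Perform `op` on in-memory source tensors, writing into `dst` in place."""
    if op == "matmul":
        np.matmul(src[0], src[1], out=dst)
    elif op == "relu":                     # one example of an activation op
        np.maximum(src[0], 0.0, out=dst)
    else:
        raise NotImplementedError(op)

a = np.random.rand(4, 8).astype(np.float16)   # fp16, as used by the accelerator
b = np.random.rand(8, 2).astype(np.float16)
c = np.empty((4, 2), dtype=np.float16)        # destination sitting in memory
nnpa_like_op("matmul", c, a, b)               # result lands directly in c
```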
the compute capacity comes from two compute arrays the upper array is the matrix array it consists of 128 processor tiles each implementing an eight-way simd engine with 16-bit floating point format the array is designed as a high density multiply and accumulate array focused on matrix multiplication and convolution the second array is the activation array it consists of 32 processor tiles with an eight-way simd for floating point 16 that can also perform fp32 operations that array is optimized for activation functions and other complex operations like lstm in order to make the maximum efficient use of the array we invested a lot into the data flow surrounding the array we have the intelligent prefetch engine which is firmware controlled and receives the translated physical addresses from the core and then fetches the data from the l2 cache through the ring into the accelerator and then results go back the same way it can perform operations on the ring with about 100 gigabytes per second the data gets loaded into the scratch pad from where it gets distributed into the input and output stages of the compute grid array itself along that data path we have data formatters that ensure that the data arrives at the compute engines in exactly the format and layout that the compute engines need we can distribute the data with more than 600 gigabytes per second and through all the firmware coordination between the core all these data movers and the compute array itself we maximize the compute efficiency of that compute grid let me step back out from the accelerator and talk about the software ecosystem that enables the exploitation of this accelerator there's a broad and open software ecosystem that enables our clients to build and train models anywhere meaning they can build their models on ibm z they can build their models on ibm power systems or they can build their models on any other system they can use the tools that their data scientists are already familiar with you see a lot of familiar logos on this page and then the trained models can be exported into the open neural network exchange format and then the ibm deep learning compiler can take the onnx model and compile and optimize the model for direct execution on the ai accelerator on the right side you see a typical enterprise stack consisting of the operating system and container platform databases app servers and applications as i already mentioned there are use cases for ai at every layer in that stack the operating system can benefit from ai for intelligent workload placement databases can optimize their query plans and then of course at the application layer clients can embed ai into their transactions for things like credit card fraud detection or supply chain optimization i did talk a lot about the goal of achieving low latency so that ai can be embedded real time and at scale without slowing down transactions so we built a number of models in cooperation with our clients proxy models that reflect real-world applications of ai inferencing on this chart i'm showing one example this is a recurrent neural network that we co-developed with a global bank to reflect their credit card fraud scoring models and we ran that model on a single telum chip and we can run more than a hundred thousand inferences every second with a latency of only one point one millisecond and then as we scale that system up from one chip to two up to eight and 32 chips we can perform inference on more than 3.5 million tasks and the latency still stays very low at only 1.2 milliseconds
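A quick back-of-the-envelope check of those scaling numbers, under the assumption (suggested by the single-chip figure) that the 3.5 million number is also a per-second rate:

```python
# Scaling sanity check for the fraud scoring proxy model figures in the talk:
# >100k inferences/s at 1.1 ms on one chip, >3.5M (assumed per second here)
# at 1.2 ms on 32 chips.
single_chip_rate = 100_000      # "more than" 100k inferences/s on one chip
system_rate = 3_500_000         # "more than" 3.5M on 32 chips (assumed /s)
chips = 32

per_chip_at_scale = system_rate / chips    # ~109k inferences/s per chip
print(f"{per_chip_at_scale:,.0f} inferences/s per chip at {chips} chips "
      f"(single chip: >{single_chip_rate:,})")
# the per-chip rate does not degrade as the system grows, and latency only
# moves from 1.1 ms to 1.2 ms, which is the near-linear scaling the talk shows
```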
now this is only running the ai inference tasks we are not actually running the credit card transaction workload but it does show that telum's ai accelerator has the capacity to provide low latency real-time inference at massive scale so that it can be directly embedded into the transactions at very high bandwidth let me summarize i introduced the telum processor chip for the next generation of ibm z and linuxone systems i explained in some detail how the telum chip achieves the performance and scale enhancements and for lack of time i only gave a few examples on how the telum chip also improves the security and availability characteristics and then i described in some detail how the embedded ai accelerator will enable our clients to embed ai directly into their enterprise workloads of course this chip is the work of a very large team spanning the globe and spanning multiple groups inside ibm from the ibm system chip development team to the ibm research division our technology development partner on this project is samsung we are manufacturing the telum chip in samsung's seven nanometer euv technology the entire team is very excited to see this chip come to life and we can't wait to see how our clients will benefit from all the capabilities we put into the telum chip thank you very much for your attention and we now have a few minutes for questions and answers [Music] okay so thank you for the great talk christian we also have many questions here and since we have some time let me go through one by one so the first question is from perry does the data movement use the ring or dedicated buses so the ai accelerator accesses the data that it pulls from the caches through the ring infrastructure that way we achieve the low latency data access for when the core is already processing data and the data sits naturally in the caches we can pull that data directly into the ai accelerator the firmware then manages putting that data in the right place in the scratch pad and then distributing it within the accelerator through dedicated buses into the input and output fifos that are surrounding the compute arrays so the answer is a little bit of both the firmware manages the transportation of the data within the accelerator and then we use the ring for bringing data in and out of the accelerator all right thank you and one question from tao zhang at alibaba what's the packaging technology used to connect your dies in a module it's a fairly standard technology there's no bridges or something like that we just really put these two chips very close together they have less than half a millimeter of spacing which introduces some interesting complexities from a thermal and mechanical perspective but otherwise the signaling goes through standard packaging technologies it's really in the signaling technology where we have some really cool innovation the two chips run on a completely synchronous clock grid and we can essentially go from one latch out of one chip travel through the packaging and receive into a latch on the receiving chip with only a two to one latency and so that signaling technology essentially allows us to build the dual chip module as if it was microarchitecturally only one big chip that we then cut in the middle for manufacturing so what is the advantage of the in-socket and the in-drawer links yeah so with the dual chip module we have 320 gigabytes per second of bandwidth between the two chips again this is you
know essentially designed as if it was one large chip and that dual chip module bus microarchitecturally feels a little bit like a staging latch on the chip and then across all the chips in the drawer each link in each direction has about 45 gigabytes per second in bandwidth okay and one question from john at red hat how is memory ordering preserved between cores and accelerators well that's where the magic in the cache fabric design is right we keep track of which data is where when one cache has a cache miss it broadcasts and looks around on the chip who else has this and we have certain memory state bits that we track in the directories that tell us whether we need to broadcast further out across the drawer and then we have bits that we're tracking on whether we need to even go across the whole system scope and then of course when data arrives we have to make sure that we can't actually use that data on a core before we know that all other copies that need to be invalidated for memory ordering have actually been invalidated so there's a lot of complexity in tracking all the information of where you need to go and then there's a lot of complexity in ensuring the handshakes and of course you need to do those handshakes without introducing too much latency so there's a bit of cycle counting and things like that involved to make sure that we maintain the fully coherent strong ordering that the z architecture has for memory access thank you and then let me move on to the next question so this one is from cliff young at google he was asking silent data corruption has been showing up at hyperscale in cloud scale deployments lately ibm has a storied history of doing checking not just with parity and ecc on memory but also to protect logic does the product include that kind of checking and correction capability can you tell us about what's done please yeah i can say a few things right it's absolutely true that ibm z has a long history of designing the chips for the best possible availability and of course that means more than just ecc on the memory in fact we are using a technology we call redundant array of independent memory where we spread cache lines across eight dimms and then when a dimm fails like for example if a power regulator would fail on a dimm we have the error correction code spread across all eight dimms and we can transparently to the software recover the data the workload just keeps running even though you've just completely lost the dimm so it goes far beyond just the ecc technology we are applying the same technology on the large level two caches so that even if we have catastrophic failures for example on an array decoder we have the ecc codes spread out across different array instances so that we can recover the data
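The recovery idea behind spreading data across eight DIMMs can be illustrated with a toy Python sketch. Real RAIM uses a proper error-correcting code rather than the single XOR parity slice used here; this only demonstrates how losing an entire DIMM can remain survivable.

```python
# Toy illustration of spreading a cache line plus redundancy across 8 DIMMs so
# that losing any single DIMM is survivable. Real RAIM uses a stronger ECC, not
# plain XOR parity; this only shows the recovery principle.
from functools import reduce

NUM_DIMMS = 8  # 7 data slices + 1 parity slice in this toy layout

def stripe(line: bytes) -> list:
    """Split a line into 7 data slices and append an XOR parity slice."""
    assert len(line) % (NUM_DIMMS - 1) == 0
    size = len(line) // (NUM_DIMMS - 1)
    slices = [line[i * size:(i + 1) * size] for i in range(NUM_DIMMS - 1)]
    parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), slices)
    return slices + [parity]

def recover(slices: list) -> bytes:
    """Rebuild the line even if exactly one slice (one DIMM) is missing."""
    missing = [i for i, s in enumerate(slices) if s is None]
    if missing:
        others = [s for s in slices if s is not None]
        slices[missing[0]] = reduce(
            lambda a, b: bytes(x ^ y for x, y in zip(a, b)), others)
    return b"".join(slices[:-1])            # drop the parity slice

line = bytes(range(7 * 8))                  # a 56-byte toy "cache line"
dimms = stripe(line)
dimms[3] = None                             # simulate a completely failed DIMM
assert recover(dimms) == line               # data recovered transparently
```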
and then as it comes to the logic we have probably hundreds of thousands of error checkers across the chip that check consistency of state machines that check illegal states they check various handshakes and whenever these error checkers trip we have transparent recovery mechanisms in many cases for example for the entire core we have what we call the checkpoint registers whenever we complete an instruction we update the checkpoint register and when an error checker pops in the core be that on a data bus or on one of those control checkers we can basically flush the core pipeline reset the caches reset the branch prediction basically the core goes through a mini boot sequence and then it can re-initialize itself from the checkpoint register and it can transparently keep running the software the operating system doesn't even notice that the core went through such a recovery action so absolutely telum includes a lot of that logic across the entire chip and a lot of investment in terms of engineering goes into those things to achieve the availability that our clients need from these systems thank you and one question from the audience how does telum maintain its linear scaling across 32 chips so we've got to differentiate there's a lot of scaling work that goes into just generic kinds of workloads where obviously we invest a lot in the fabric design i talked a little bit about this in the presentation about how we optimized the latency between the chips and across the drawers to ensure good scalability across the entire spectrum of the system build-up for that standard workload scaling obviously nothing is ever perfectly linear but with the investment in latency and bandwidth etc we get pretty good scaling the ai chart that i showed shows almost perfectly linear scaling because really those are independent tasks each chip when it performs these ai computations can do so essentially independent of every other chip the model is contained in the chip local l3 cache or l2 cache the data is local there the program runs local there so it doesn't really interact a lot in this benchmark but it still shows how the ai accelerator has the capabilities and the bandwidth to scale up to millions of transactions per second to enable in-transaction inference so that our clients can embed these ai capabilities into each and every one of their transactions all right thank you for answering so many questions in detail then let me move on to the last presentation the last presentation is from intel and the speaker will present the intel sapphire rapids next generation xeon scalable processor for data center servers let me introduce our speaker mr arijit biswas he earned a master's degree in electrical and computer engineering from carnegie mellon university he joined intel in 1997 as a circuit designer and then a micro architect on the first cantil processor during his 20-plus year career at intel he has done everything from validation and silicon debugging to reliability modeling and research he has led technologies for reliability and ucd growth since 2005 developing the foundational science behind the architectural vulnerability factor and analytical error modeling which have been published cited and taught in many universities his small team has been responsible for developing and deploying substantial technologies such as turbo boost max 3.0 and several other features in sapphire rapids hi i'm arijit biswas lead architect for sapphire rapids thanks for having me here today i am excited to introduce sapphire rapids our next generation xeon scalable processor launching in the first half of 2022.
we have designed sapphire rapids to establish a new standard in data center architecture it's architected to deliver great out of the box performance with enhanced capabilities for the breadth of workloads and deployment models in the data center sapphire rapids delivers a step function in performance across a broad set of scalar and parallel workloads most importantly it is fundamentally architected for breakthrough performance in elastic computing models these include containerized microservices and the rapidly expanding use of ai in all forms of data-centric compute sapphire rapids also advances the state of the art in memory and io technologies our overall architectural philosophy for xeon is to deliver the best infrastructure for the data center as such xeon spans a wide range from monolithic server node deployments to data center scale elastic solutions delivering consistent performance across compute storage and network usages xeon architecture is optimized to deliver great performance and big improvements at both the node and data center levels the new performance core in sapphire rapids brings significant scalar performance improvements additionally multiple integrated acceleration engines and increased core counts provide for a massive increase in data parallel performance furthermore these performance cores are paired with the right levels of cache and industry-leading system capabilities of ddr5 and pcie gen 5 to provide optimal balance across compute memory and i o finally all of these are integrated through a modular soc architecture that provides consistent and efficient performance scaling across the socket node and data center at data center scale it is critical to deliver great performance and utilization under multi-tenant usages low jitter performance to meet the tight service level agreements as well as elasticity across the entire infrastructure in contrast industry standard benchmarks tend to focus on node level compute throughput and don't necessarily reflect the reality of data center scale usages we have drawn on deep insights from multiple generations of xeon products deployed at cloud scale to inform the sapphire rapids architecture as a result we deliver big advances in each of these areas for example we offer several virtualization and telemetry capabilities to improve multi-tenant usages we expand quality of service capabilities and architecture enhancements to reduce jitter for performance consistency under high utilization in addition we're introducing several microarchitectural and architectural capabilities to improve performance across a broad set of workloads to deliver better data center elasticity data center deployment models exhibit significant overheads sapphire rapids fundamentally changes the paradigm of handling these overheads through acceleration engines these accelerators not only speed up the overhead processing multi-fold but also significantly offload the cores enabling them to deliver more application workload performance ladies and gentlemen this is sapphire rapids with that let's dive into the details at the heart of sapphire rapids is a new modular tiled architecture that allows us to scale the xeon architecture beyond physical reticle limitations here is that same sapphire rapids without the lid so you can see the silicon underneath and the four tiles sapphire rapids is the first xeon product built using our latest emib silicon bridge technology at a 55 micron bump pitch this innovative new technology enables independent
tiles to be integrated in a single package realizing a single logical processor the resulting performance power density and software paradigm is comparable to equivalent monolithic silicon we are now able to increase core counts caches memory and io free from the physical constraints that would otherwise have been imposed on the architecture leading to difficult compromises this base soc architecture is critical for providing balanced scaling and consistent performance across all workloads this is key for data center scale elasticity and achieving optimal data center utilization with this architecture we are now able to provide software with a single balanced unified memory access with every thread having full access to all resources on all tiles including caches memory and io the result is consistent low latency and high cross-sectional bandwidth achieving over a terabyte per second of aggregate bandwidth across the entire soc this is one of the critical ways we achieve low jitter in sapphire rapids while sapphire rapids delivers out of the box scalability for existing software and ecosystems users can also enable clustering at sub-numa and sub-uma levels for additional performance and latency improvements sapphire rapids sets a new standard for data center architecture with the seamless integration of cores and acceleration engines providing a heterogeneous compute infrastructure sapphire rapids' foundation is built on three main pillars pillar one is compute sapphire rapids delivers the highest levels of compute performance through a combination of high performance cores increased core counts increased ai performance and the industry's broadest range of data center relevant accelerators pillar 2 is i o it delivers leadership io capabilities through cxl 1.1 pcie gen 5 and upi 2.0 technologies pillar 3 is memory all these are provisioned with intel's highest bandwidth and low latency memory solutions through industry-leading ddr5 optane and hbm memory technologies now let's look at the details of these three pillars starting with the data center performance core as mentioned earlier optimizing exclusively for standard benchmarks would have been the easy path but doesn't reflect the full picture of real data center usages we use the insights from generations of large-scale deployments to inform our microarchitecture choices for the performance core just to provide a flavor of this data center workloads exhibit large code footprints and are bottlenecked by front-end performance we've fundamentally redesigned the front end to address these bottlenecks in the performance core consistent performance under multi-tenant usages is also critical the core delivers several improvements like vm denial of service protections enhanced caches including an increase to 2 megabytes of private l2 cache per core and new tlb qos capabilities for multi-tenant usages we also introduce autonomous and fine-grained power management to improve core performance without jitter we also added several new architecture enhancements in the core including new instructions and capabilities relevant for data center usages i want to provide a few examples of the new isa capabilities here we integrated amx capabilities to accelerate tensor operations for ai workloads we're introducing the accelerator interfacing architecture instruction set or aia which supports efficient dispatch synchronization and signaling to accelerators and devices from user mode as opposed to high overhead kernel mode to address the growing demand for signal
processing we've introduced half precision floating point to avx and another example is the cldemote instruction that helps with optimal movement of data across the cache hierarchy to improve shared data usage models another major area of focus for sapphire rapids' compute capability was to explore breakthrough improvements for the high levels of common mode tasks causing overhead that we see in data center scale deployment models instead of traditional approaches we embarked on a new direction using optimized acceleration engines we didn't arrive at this decision all at once rather we've been moving in this direction for some time we found these engines to vastly improve processing of these overhead tasks and enable greater utilization of the performance cores for higher user workload performance all for a significantly reduced power and area cost but we realized that simply attaching accelerators was insufficient for truly integrating those functions the challenge of using dedicated acceleration engines lies in the difficulties around software models usability shareability memory management and so forth so we decided the best way to truly address those key challenges was by creating a full set of novel technologies to support seamless integration of dedicated acceleration engines with the general purpose performance cores technologies such as aia and advanced virtualization capabilities enable us to avoid kernel mode overheads and complex memory management typically associated with such schemes these provide the necessary base functions to simplify enumeration software development and deployment for acceleration engines sapphire rapids supports critical acceleration engines for processing the most common overheads i'm excited to introduce a couple of them today data center usage models involve significant data movement overhead as part of workload processing examples include packet processing data reductions and fast checkpointing for virtual machine migration sapphire rapids introduces the data streaming accelerator engine to offload the most common data movement tasks dsa can move data between cpu caches and ddr memory as well as i o attached devices in this graph we show an open virtual switch use case in which with up to four instances of dsa we see nearly a 40 percent reduction in cpu utilization and a 2.5x improvement in data movement performance this results in nearly doubling the effective core performance for this workload intel quickassist technology is not new to intel products sapphire rapids provides seamless integration of the next generation qat engine greatly increasing its performance and usability it is of increasing importance that all data in the data center is secured while processed transmitted or stored furthermore the ever expanding data footprint is progressively maintained in a compressed format for both cost and efficiency our next generation qat acceleration engine supports the most popular ciphers hash public key and compression decompression algorithms and can chain these together for single pass operations performing these functions using qat is significantly faster than the software implementations on the performance core and greatly reduces the number of cores needed for these widely supported services sapphire rapids qat can achieve up to 400 gigabits per second of cryptographic ciphers and verified compression and decompression in this example with the zlib l9 compression implementation we see a 50x drop in cpu utilization while also speeding up the compression by 22 times without qat this level of performance would require more than 1000 performance cores to achieve
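For context on the zlib L9 example, here is a small Python sketch of the software-only baseline that QAT is compared against: level-9 deflate compression running entirely on a general-purpose core. The buffer contents and sizes are arbitrary illustrations; the 22x and 50x figures in the talk come from Intel's own measurement setup, not from this snippet.

```python
# Software-only zlib level-9 compression, i.e. the CPU-bound baseline that the
# talk compares the QAT acceleration engine against. Everything here runs on a
# general-purpose core; with QAT the same deflate work would be offloaded.
import time
import zlib

payload = b"some moderately compressible log line 1234567890\n" * 20_000

start = time.perf_counter()
compressed = zlib.compress(payload, level=9)     # zlib "L9", as on the slide
elapsed = time.perf_counter() - start

ratio = len(payload) / len(compressed)
throughput_mbs = len(payload) / elapsed / 1e6
print(f"compressed {len(payload)/1e6:.1f} MB at {throughput_mbs:.0f} MB/s, "
      f"ratio {ratio:.1f}x, purely on the CPU")
```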
the intel dynamic load balancer offloads the task of offload management itself capable of making up to 400 million load balancing decisions per second it offloads queue management tasks and provides dynamic load balancing and rebalancing based on awareness of workflows power and work priorities this enables efficient load balancing across cpu cores which is extremely important for usages such as packet processing and certain types of microservices and can be used when offloading to either integrated accelerator engines or to discrete devices with growing compute capabilities a balanced architecture must deliver commensurate improvement in i o sapphire rapids delivers breakthrough advancements with its io interfaces which is our second pillar for sapphire rapids sapphire rapids introduces the industry standard compute express link technology or cxl 1.1 for memory expansion and accelerator usages in the data center to cater to the growing i o speeds and feeds we introduce support for pcie gen 5 while also enhancing the qos and ddio capabilities to go with it sapphire rapids delivers optimal multi-socket performance and scaling through advancements to our ultra path interconnect or upi technology that bring more links wider widths and higher speeds compared to prior generations including a new eight-socket four-link glueless topology that makes use of the new upi capabilities to deliver higher bandwidths than previous generation eight socket configurations improving multi-tenant usages as well as reducing memory management overheads have both been mentioned previously as key capabilities advanced by sapphire rapids two of the virtualization technologies that enable these are shared virtual memory and scalable i o virtualization svm is a key technology that enables cores integrated accelerators and discrete i o devices to significantly reduce memory management overheads by providing a consistent coherent view of memory on which all computation can occur regardless of the compute engine actually processing the computations scalable iov greatly improves and simplifies scalability sharing and enumeration of device accelerators integrated or discrete versus our prior generation single root iov technology for a data center processor to deliver across all workloads the compute and i o capabilities mentioned need to be augmented with the right balance of cache and memory architecture to deliver sustained bandwidth at low latencies the third pillar of sapphire rapids sapphire rapids supports a large shared cache that allows dynamic sharing across the entire socket we're nearly doubling the shared cache capacity over the prior generation and enhancing the critical qos capabilities to further improve effectiveness with industry-leading ddr5 memory technology we are delivering the next big step function in bandwidth while simultaneously improving power efficiency in addition sapphire rapids delivers multi-fold performance improvements and qos capabilities with our next generation intel optane memory and we're not done with memory just yet in addition to the support for ddr5 and optane memory technologies sapphire rapids also offers a product version that integrates high bandwidth memory technology or hbm in package this delivers higher performance in the dense parallel computing workloads that are prevalent with high performance computing ai machine learning and
in-memory data analytics typically cpus are optimized for capacity while accelerators and gpus are optimized for bandwidth however with the exponentially growing model sizes we see constant demand for both capacity and bandwidth without trade-offs sapphire rapids does just that by supporting both natively we further enhance this with support for memory tiering that includes software visible hbm plus ddr and software transparent caching that uses hbm as a ddr-backed cache ai usage is becoming ubiquitous in the data center due to its success relative to traditional methods in order to deliver data center scale elasticity great ai performance is required across all tiers of compute this is one of the major focus areas for sapphire rapids as i've mentioned previously we introduced amx capabilities using a combination of tiled register files and tiled matrix operation units in the new performance core that provide massive speedup to the tensor processing at the heart of deep learning algorithms we can perform 2k int8 operations and 1k bfloat16 operations per cycle this represents a tremendous increase in computing capabilities that are seamlessly accessible through industry standard frameworks and runtimes we augment this with strong general purpose capabilities large caches high memory bandwidth and capacity to deliver breakthrough performance improvements for cpu-based training and inference we're also seeing that the vast majority of new scalable services are being built using elastic compute models like containerized microservices this trajectory was clear when we started architecting sapphire rapids so to address this we chose capabilities and features to improve the computing model for throughput under sla with low infrastructure overheads we made architecture enhancements across the product spanning the core accelerators and soc capabilities to inform these improvements for example the aia capabilities that reduce microservice startup time advanced telemetry improvements for optimal microservice load balancing and orchestration and a number of capabilities in qat dsa dlb and beyond to reduce the networking stack overhead with a microservices service mesh we've been using multiple proxy workloads to develop these capabilities and optimize the open source software stack to benefit from them this chart shows the speedups we're seeing in our architectural models and from some early measurements on deathstarbench and other example proxies normalized at the per-core level in summary sapphire rapids delivers a massive leap in performance and capabilities to establish a new standard in data center architecture at the root of sapphire rapids is a modular tiled soc architecture thanks to emib technology that enables significant scalability yet maintains a monolithic view it delivers substantial performance across scalar usages and massive improvements in emerging parallel workloads like ai it delivers great improvements for monolithic workload deployment models while explicitly optimizing for elastic compute models like microservices it brings industry-leading memory and io technologies to feed the massive compute capabilities in a balanced manner as one would expect with all of this sapphire rapids is a complex undertaking and i would like to thank the many teams across all of intel that are bringing sapphire rapids to market we can't wait to get sapphire rapids into your hands thank you very much
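Since the first audience question below is about the HBM caching mode, here is a small conceptual Python sketch of the software-transparent flavor described above, with HBM acting as a cache in front of DDR. The direct-mapped organization, line size and tag handling are illustrative assumptions chosen to show the mechanism, not a description of how Sapphire Rapids actually implements it.

```python
# Conceptual model of "HBM as a DDR-backed cache": reads hit in HBM when the
# line is resident, otherwise the line is filled from DDR (evicting whatever
# the set held). The direct-mapped layout and in-HBM tags are assumptions for
# illustration only; the real tag organization is not described in the talk.
LINE = 64                                   # bytes per cache line (assumed)
HBM_LINES = 1_000                           # tiny toy capacity

class HbmDdrCache:
    def __init__(self, ddr):
        self.ddr = ddr                      # backing "DDR" store: line -> data
        self.tags = [None] * HBM_LINES      # one tag per direct-mapped set
        self.data = [None] * HBM_LINES
        self.hits = self.misses = 0

    def read(self, addr):
        line_addr = addr // LINE
        index, tag = line_addr % HBM_LINES, line_addr // HBM_LINES
        if self.tags[index] == tag:         # resident in HBM: fast path
            self.hits += 1
        else:                               # miss: fill from DDR, evict old line
            self.misses += 1
            self.tags[index] = tag
            self.data[index] = self.ddr.get(line_addr, bytes(LINE))
        return self.data[index]

ddr = {0: b"\x01" * LINE}
cache = HbmDdrCache(ddr)
cache.read(0); cache.read(0)                # second access hits in HBM
print(cache.hits, cache.misses)             # -> 1 1
```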
[Music] thank you that was a very great talk and again we have a lot of questions and many of them are related to hbm so a question from hiroyuki is whether the hbm cache mode is direct mapped and another question was where do you keep the tag if the hbm is configured as a cache okay and moving on to other questions barney with no affiliation was asking how is the ai performance of amx compared to an nvidia a100 so i don't have that right now but basically we're seeing an impressive improvement over um we showed a demo um okay and you have already answered many of the questions and i do have two questions so usually intel cpus support ddio and when hbm is used as a cache i wonder where the data goes first when ddio is enabled is it going to the llc or is it going to hbm okay so it's not going to go to hbm right okay and my second question is about cxl so cxl is a very interesting interconnect providing cache coherency and also a memory expansion interface and that kind of interface is not the first time ibm did capi in the past can you compare cxl against capi and what has been improved over capi um um um all right thank you for the answer and raj with unknown affiliation is asking whether you can comment on the inter-die crossing latencies um all right i don't see any new questions that have not been answered so thank you again and it was such a great talk thank you um all right so since we have a little bit more time let me ask you another question from alibaba can you comment more on sapphire rapids qos capabilities does spr support last level cache partitioning memory bandwidth partitioning and what has improved compared to the previous generation um all right thank you so much for the detailed answers and let me conclude this session thank you everyone
Info
Channel: hotchipsvideos
Views: 72
Id: ID8ZS74yqYQ
Length: 132min 24sec (7944 seconds)
Published: Mon Dec 13 2021