Intel Fights Back | Arc Battlemage, Xe2 GPUs, & Changing Hyper-Threading

Video Statistics and Information

Captions
Intel has official Battlemage information for its next GPU architecture. It is 4:00 a.m. in Taiwan; we've wandered into this park because the story is important enough to film right now. This is one of the first motherboards with Xe2 on it, which is what's going into Battlemage in discrete GPUs for desktop later. This board was stealthily handed to us at the show with the Lunar Lake integration on a future laptop motherboard. Even though it's mobile silicon, it still features the architecture that'll build the next generation of desktop GPUs for Intel. Intel unveiled the basics of its Battlemage architecture. It has a keynote technically going up around the same time as this video, but we're going to skip all of the AI-infested keynote discussion from the CEOs and instead just focus on the engineering discussion of the architecture. On the CPU core architecture side, Intel also has some news: the biggest item is that it is ditching Hyper-Threading, or SMT. We're going to talk about that later in the video, but we're starting with the Xe2 architecture and what we can expect from the upcoming Battlemage.

Before that, this video is brought to you by the Fractal Torrent high-airflow case. The Fractal Torrent is one of the best air cooling performers we've tested, largely thanks to its included 180 x 38 mm fans. The Torrent is spacious and easy to work in, with what Fractal emphasizes is a heavy focus on function. The case uses a unique front panel to add some flair without a major impact to performance. You can learn more at the link in the description below.

We'll start off with one of Intel's first-party charts. Intel showed some charts with clock-normalized uplift in specific functions of the GPU, like a 12.5x relative increase in draw execute indirect (XI), a 2.1x normalized increase in pixel blend rate, which comes from some changes we'll cover today as well, a 4.1x increase in mesh shader dispatch, and then some lesser areas of uplift like tessellation. Intel also showed its new Xe2 render slice
officially for the first time, and we're finally able to start getting a picture of the changes. Intel's Tom Petersen presented some details to the press, focusing on performance increases from mostly utilization changes and a reduction in wasted processing in the pipelines. This is where Intel is focusing its efforts for Battlemage, and he did the presentation in typical Tom Petersen fashion: "All right, so the XMX is the heart of our AI accelerator, and I'm showing you a little animation. This is showing you 16x4x4x2, which is kind of saying I've got 16 lanes, I can do four chunks of 8-bit integer, and I can do four at a time because we're four deep, and there's two operations per MAC. So that's telling you, hey, I've got a lot of goddamn MAC operations, okay?" And we checked, and that is an official unit of measurement now.

Intel's A750 and A770: we've talked about how they have more silicon than they can really leverage. I think the comment we made previously was that they should just be wrapping the cards in money before they ship them out the door, because they use relatively massive dies at a cost that wouldn't make sense for Nvidia, for example, to sell at that die size, that die area. The problem Intel has run into is a mixture of driver optimization, plus some architectural choices made early on with Alchemist that have limited some of their ability to scale. That's what Intel's trying to fix. This is a render slice from the original Intel Alchemist architecture. Each render slice is pictured as containing four Xe-core blocks, and each of those itself contains the vector engines; there's also XMX hardware, load and store, and cache hardware. The render slice also contained four ray tracing units, samplers, Hi-Z, geometry, rasterization hardware, and a pixel back end. This is the new Battlemage render slice. There are some color changes, but you can ignore those; they're not important. One of the larger changes that Intel made is re-architecting the GPU to SIMD16
instead of SIMD8. SIMD means single instruction, multiple data. Moving to native SIMD16 for the engine was promoted as a key architectural change, and Intel described it like this, quote: "Basically, there's eight lanes of computation for every instruction. We've moved that architecture to be 16 wide, which has a lot of efficiency benefits but also compatibility benefits. More and more often you'll see games running right out of the box," end quote. Now, this change also means that Intel's XVE, or Xe Vector Engine, count has been reduced per Xe core and render slice, but this was an intentional change, of course. The new Xe core for Gen 2 is halved in vector engine count per Xe core; however, Intel tells us that this is countered by the increase in width to SIMD16 from SIMD8. We'll be curious to see if this leads to a performance reduction anywhere else in the pipeline in rare scenarios, but overall, the biggest improvement should be compatibility. The claimed upside of better day-one support for games is really what Intel needs right now for Arc. In our videos, like our most recent Intel Arc one-year-later revisit looking at the drivers, the biggest thing we focused on was that while, yes, it has improved, no, we still can't recommend it for people who need day-one support for the most recent games, because it's not always there. So moving to SIMD16 natively, Intel says, should help with some of this, and that would be a big deal.

The next big change Intel highlighted was implementing ExecuteIndirect support natively for indirect draws and dispatches, rather than emulating it as on the original Xe. Intel told us that the 12.5x increase in draw execute indirect performance shown in the earlier chart comes from the native ExecuteIndirect implementation. Intel stated this, quote: "This is used by engines like Unreal 5 extensively. Having support for this in hardware is a need for the next generation of games." Intel also noted improvements to several other
areas in rapid succession: those areas include the geometry block, mesh shading performance, out-of-order sampling and execution, fetch, Hi-Z and triangle culling, and also render target prefetch on L2 cache; even the clear behavior has changed. They've also made improvements to blending and to ray tracing. Intel says that these changes will result in better utilization; our interpretation of that statement is that it's waste reduction instead. So Intel is now focusing on, to take one example, clears: instead of writing zeros to everything, they're marking a bit to say, get rid of all this, we're going to start from scratch on a new frame. That's just one of the examples where, previously, from what Intel was saying, they were wasting resources and burning time on something that could be done in a much more efficient way, so that those resources could be used elsewhere, or the performance could just potentially increase.

The change to Hi-Z and culling behavior is also specifically a waste improvement. Culling isn't a new concept. Intel stated this, quote: "We've redone our Hi-Z. If you think about fixed function, it generally starts with reading geometry, then you're translating geometry using matrix multiply. In world space, you're moving these triangles around. Eventually you have to get to a point where you show a pixel: you're going to convert a triangle to a fragment, which you're going to shade to draw a pixel. That fragment processing is very important for the order of those triangles in the final scene. Hi-Z remembers which triangle is closest, and if it ever sees a triangle that's further away, it will instantly cull it from the pipeline. It's a trick to reduce pressure on your later shading." So the way they're handling this is a little different, and we'll get into that, but the very basics of culling, which basically everyone does at this point, would be something like this: right now, I am obstructing geometry, if this were 3D space behind me. Maybe it is; we'll find out one
day. If I move a little bit, that geometry is now revealed; when I step back in front, it's hidden. There's no point rendering it if I'm blocking it. So typically, the way culling works in games and rendering is looking for objects that are obstructing other objects and determining: is there any blending that needs to happen, is there a transparency layer, is there some kind of attribute of the foremost object that would affect but still show the rearmost object? If not, get rid of the rear one, since it's not going to be seen anyway, and save that geometry processing for something else. So that part's not new, but more aggressively culling triangles or fragments that won't be seen means, as Tom Petersen said, that later processes can focus resources elsewhere. The improvement should be lower resource utilization on wasted shading, to instead allow higher performance or more appropriately spent resources on the visible parts of the scene.

The pixel back end has also improved. When there are transparencies in a scene, the GPU has to do math to blend the multiple layers of transparency, and Intel has increased bandwidth for shading to improve performance in this part of the pipeline. The company also stated a 33% increase in pixel color cache, reducing bandwidth demand overall; that reduction in bandwidth is because more data can be resident in cache, reducing transactions out to memory. Intel also noted these improvements to its geometry pipeline performance, quote: "On the geometry block, we've improved fetch. We've redone how work is distributed to the geometry pipeline to improve utilization. At the same time, mesh shading performance has improved dramatically with vertex reuse. A lot of times you have a vertex that's used across multiple triangles, and now we avoid a second fetch. We now support what's called a strip: if you think about DirectX, a lot of the time you have a vertex that's used across multiple triangles. We now take advantage of that in hardware and avoid the second fetch." So this is another waste
reduction: avoiding a fetch keeps that resource free for other transactions. Intel also said that the out-of-order sampling improvement is necessary to efficiently fetch compressed textures that reside in cache. Intel's approach to sampling is to pull large, sometimes multi-gigabyte, textures from cache, and Intel says that it can fetch these more efficiently now. The company also noted programmable offset support, saying, quote: "We're talking about an offset relative to the center of the sample, which allows us to implement higher-level filters directly in hardware."

So the theme here is that there are improvements in hardware, there are some key architectural changes like moving to native SIMD16, but Intel is also trying to integrate this with the software improvements it's made. On the driver side, it has been working on fully rolling out its DirectX 11 re-architecting of the driver stack, something we covered previously (it already went through DX9). All this re-architecting of the way the drivers work means that now, with a more mature driver state, Intel can start combining that with hardware improvements and fixed-function changes to, we'll see, become more competitive. They probably will have to increase the price at some point, though, because it's too expensive for them to ship those cards right now.

The next big improvement was to the compression algorithm: Intel said it supports a new 8:N compression algorithm on the pixel back end. This is when it spoke about clear performance as well, noting that the move to mark a bit instead of writing zeros, quote, "is a no-brainer but dramatically improves performance," end quote. On ray tracing, Intel re-explained how bounding volume hierarchies work and then stated that Xe2 is increasing to three traversal pipelines, two triangle intersections per RT unit, and 18 box intersections per unit. Now, Intel was actually already doing pretty competitively in ray tracing; in some scenarios, it was almost
embarrassing AMD. It's still not up at Nvidia levels of performance, but it was sitting between the two when it worked well. So if Intel continues to work on RT performance and increases its day-one support for games, then it may be AMD that has the most to worry about here. There was a lot more discussion of Xe2 as it relates specifically to Lunar Lake; we'll flash through some of the slides here, but all the stuff we've talked about so far applies to all Xe2 implementations, which means Battlemage and Lunar Lake both would see these changes. We'll leave out the Lunar Lake-specific GPU discussion for today; processing and condensing all this information abroad is already intensive on the schedule.

So let's move on to some of the announced core architecture changes for Lunar Lake, because some of these also apply to Intel's future Arrow Lake solution, which is going to be the next desktop part, and we're hearing rumors that it's probably fall of this year. Intel gave us a deep dive, along with other media, on its architectural changes for the cores (P-cores and E-cores) for Lunar Lake, but this also scales to Arrow Lake in a lot of situations. These microarchitectural changes, Intel says, are supposed to set them up for better scaling going forward, and their focus was on efficiency and on IPC; but they also, again, talked about dropping Hyper-Threading, or SMT. Lion Cove P-cores will show up first in Lunar Lake mobile CPUs, and those are the ones that overhaul the fundamental design to allow for that scaling going forward. Intel refactored its internal CPU design process and core architecture IP, which it says will continue to give benefits across multiple generations and markets, saying, quote: "Our new development environment allows us to insert knobs into our design to quickly productize SoC derivatives out of our baseline P-core IP." And indeed, the Lion Cove version that goes into Lunar Lake is different in several aspects from the one that will go into Arrow Lake later this year. It's not
completely shared, but there are some things that we're seeing across both. Getting back to the removal of Hyper-Threading: Intel says that, in general, Hyper-Threading gains roughly 30% IPC in the same core area footprint in the silicon, which is why it's been used for about 20 years now. It's not free, though. Intel says that Hyper-Threading necessitates duplication of many architectural elements: filling the pipeline with multiple threads requires hardware to track the execution in the pipeline and the state of those processes. That takes area, and it takes power. Intel gave some numbers to better understand this, creating a scenario with a hypothetical architecture where Hyper-Threading is on for one design and off for the other, but everything else is controlled. The new efficiency-optimized P-cores without Hyper-Threading would give 15% better performance per watt and a 10% performance-per-area improvement as compared to a P-core built with Hyper-Threading but with it turned off. Then, with Hyper-Threading on, the new optimized core built without Hyper-Threading still wins by, they say, 5% in performance per watt, but falls behind by 15% in performance per area. That loss is where the new E-cores would come in, which we'll get to shortly, and platforms that purely consist of P-cores would still benefit from Hyper-Threading in this situation.

Now, buzzword warning here for AI: Intel has updated thermal and power controls. Thermally, they stated that the predetermined static guidelines are gone, and instead Intel is moving towards, quote, "AI self-tuning controllers," which would actively monitor and adapt to the specific workloads being run. Intel is also adding finer-grained clock control at 16.67 MHz clock intervals. Intel gave the example of a workload where the max theoretical clock was 3.08 GHz: Intel's previous 100 MHz steps would have the core cap out at 3.0 GHz, but the new 16.67 MHz steps would allow for a roughly 2% higher clock of about 3.067 GHz. On the flip side, this also
gives better control in power-constrained situations, like a laptop. The end result should be higher sustained clock speeds for given power budgets and thermal constraints anywhere on the curve. Other improvements include a much larger instruction prediction block, a larger op cache, and deeper op queuing. Vector and integer operations, Intel says, are now in separate pipelines with separate schedulers for out-of-order execution. Updates to the memory subsystem were also detailed by Intel, including a reduction in latency of the nearest L0 cache and the addition of an intermediate L1 cache to create a three-level hierarchy; total per-core cache capacity is slightly up overall. Some really quick notes here as well: Intel threw a couple of numbers out there. For one, they claim an average of 14% more performance at the same frequency for one P-core in Lunar Lake versus Meteor Lake. Remember, some of this will scale to Arrow Lake, but not all of it will; ultimately, mobile is a different thing. Performance per watt ranges between 18% better at the low end of the power range and 10% better at the higher end versus Meteor Lake. Once these updated P-cores make their way to the desktop platform, we'll obviously benchmark them thoroughly and look at how they perform in Arrow Lake.

For now, some quick notes on E-cores as well. These are called Skymont. Are we back to... please, please don't bring us back to Skylake, Intel; we got away from it. Skymont is the biggest change from the previous generation, upping the core count from two to four cores per cluster. With the removal of Hyper-Threading on the P-cores, Intel is betting that its new E-cores are good enough to pick up the slack on multi-threaded workloads, quote: "Our E-cores are getting so good that we can deliver better-than-SMT, or better-than-Hyper-Threading, performance without Hyper-Threading. That's the key motivation for making the E-core wider and extracting more performance out of every megahertz." The updates to the individual cores themselves are
similar to the changes in the P-cores: instruction caches are larger, queues are deeper, and the core is much wider in general, with vector capability roughly doubled. Intel says vector is generally the type of compute that AI leverages as well, so it wanted to focus on that. The E-core memory subsystem was also updated, and there's more cache available overall, with 4 MB of shared L2 per four-core cluster and double the bandwidth compared to the previous generation. For performance, Intel gave some single-threaded numbers comparing the low-power island of the previous-generation Meteor Lake against Lunar Lake mobile CPUs: integer functions were up by an average of 38%, and floating point/vector is up by a claimed 68% on average, though Intel's graph has one outlier. Overall, in a single-threaded scenario, Intel says the new E-core uses one-third the power at the same performance, delivers 1.7x more performance at the same power, and delivers two times the performance at the top end, with the new core using more power. Moving to multi-threaded: Intel says total performance uplift on the low-power island is heavily increased due to two things: the cores are faster, and the increase from two to four E-cores per cluster also contributes. Integer performance is again one-third the power at the same performance, 2.9x more performance at the same power, and 4x more performance at the top end, using much more power. Intel also gave an interesting performance comparison for desktop, where it put the new Skymont E-core against the outgoing Raptor Cove P-core. This is inherently an uneven comparison, since P-cores scale higher with higher power and frequency, but the bottom half of the curve is interesting. There are a number of other improvements as well, like bottom-up scheduling and Thread Director changes, but we'll talk more about those once desktop starts shipping.

So that's going to be all we have for you on this for now. It's a huge amount of work to not only read and study and write this type of content remotely, but
also do all the editing for it, so we put together as in-depth a detailing as we can using Intel's architecture day information. They had a ton more information out there; there's probably another four or five hours of content that we need to work our way through and see what we want to detail for you all in future architecture pieces, but for now, that gets us started. For us, at least, the Xe2 future desktop Battlemage implementation is going to be the most interesting out of all of this, and it sounds like the biggest changes are possibly going to be to early support for games, alongside the overall improvements in reducing wasted processing time, or, as they call it, utilization improvements. So that's it for this one. Thanks for watching; as always, subscribe for more. We're going to skip the keynote; maybe we'll come back and cover it in news or something if it's interesting enough, but I've had enough AI buzzwords this week. I think between Nvidia and AMD we had around or over 200 instances of the letters "AI," so we'll just skip the Intel one for now. You've got the information that really matters for the core audience here, though. Thanks for watching; subscribe for more. You can support us directly by going to store.gamersnexus.net and grabbing a shirt like this one, or just go watch some of our other videos from our Computex coverage, because we have a lot online already, including a ton of cases and coolers. We'll see you all next time.
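As an aside, two of the figures quoted in the transcript above reduce to simple arithmetic: the XMX "16x4x4x2" MAC description, and the finer-grained clock steps. This is just a sketch checking those numbers as quoted in the presentation; the step-quantization function is our own illustration of the idea, not an Intel mechanism:

```python
# Sketch of two bits of arithmetic from Intel's presentation
# (figures as quoted on stage, not an official spec sheet).

# 1) XMX throughput: "16x4x4x2" -- 16 lanes, 4 INT8 chunks,
#    4 deep, 2 operations per MAC (multiply + accumulate).
lanes, chunks, depth, ops_per_mac = 16, 4, 4, 2
int8_ops_per_clock = lanes * chunks * depth * ops_per_mac
print(int8_ops_per_clock)  # 512

# 2) Clock granularity: a workload with a 3.08 GHz theoretical
#    ceiling, quantized down to the largest step that fits.
def highest_clock(ceiling_mhz, step_mhz):
    """Largest multiple of step_mhz not exceeding ceiling_mhz."""
    return int(ceiling_mhz // step_mhz) * step_mhz

old = highest_clock(3080, 100)    # old 100 MHz steps -> 3000 MHz
new = highest_clock(3080, 16.67)  # new 16.67 MHz steps -> ~3067 MHz
print(old, round(new))            # 3000 3067
print(f"{(new - old) / old:.1%}") # ~2.2% higher sustained clock
```

The new 16.67 MHz step is one-sixth of the old 100 MHz step, which is why the quantized clock lands within about half a percent of the theoretical ceiling instead of roughly 2.6% below it, matching the "2% higher clock" claim.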
Info
Channel: Gamers Nexus
Views: 398,125
Keywords: gamersnexus, gamers nexus, computer hardware, intel arc, intel arc 2024, intel arc gpus, intel battlemage, intel alchemist, intel arc battlemage, intel lunar lake, intel xe2, intel pcore, intel ecore, intel hyperthreading, skymont, lion cove, intel bmg
Id: MGD41i5QCyk
Length: 20min 8sec (1208 seconds)
Published: Tue Jun 04 2024