YouTube's Existence is Insane: How Video Compression, Encode, & Decode Work (Basics)

Video Statistics and Information

Captions
This is the third and final installment in our recent technical discussion series with this guy, Tom Petersen, an engineer who's worked at a couple of the big silicon makers in the industry, most recently Intel. In this video we'll be distilling the incredibly complicated topic of video playback, compression, and media encode and decode into its most basic parts. "Video has to be encoded because you'll blow up the planet, and because we don't want you to blow up the planet, we have to encode it, which means compress it." There are times in this video where I have absolutely no idea what's going on: "We're going to do a 2D discrete cosine transform, but it is left as an exercise for the reader." "Yeah, exactly. I'm going to punt on this." Although some of it is easier to understand, like the fact that YouTube has absolutely no business working with the insane amount of video it handles on a daily basis: "If you just look at YouTube, YouTube would generate 155 petabits per second, which is 130 times the total global internet bandwidth." What makes it all possible for websites like YouTube is these technologies, and that's what we're talking about today; they apply basically agnostically across all of the hardware vendors. We'll do our best to facilitate the sharing of the knowledge that Tom brought with him to the show, following our recent discussion of how drivers work and what optimization means for games, alongside the topic of simulation time error in gaming; those are our previous two videos. This video will help give a beginner-to-intermediate understanding of how digital media actually works. Let's get started.

Before that, this video is brought to you by Squarespace. Visiting squarespace.com/gamersnexus will give you 10% off your first purchase with them. We've built a number of our own websites with Squarespace, where we list catastrophic PC hardware failures to inform subscribers of those failures. We also built our store website with Squarespace using its built-in e-commerce tools, and of course we built a website for our CEO, Snowflake, because she demanded our audience who really runs the show. Get to the core of your idea and spend less time on web design by signing up at squarespace.com/gamersnexus, or click the link below.

Hey everyone, we're back with Tom Petersen from Intel. We did a video recently about drivers, what they are, and what optimization is, and all the busy work that we do. "Busy work? It's not busy work." All that great stuff they do; a little bit of an undersell, yeah. And now we're going to talk about some topics I'm not very familiar with, so this will go over some media encode/decode features, but also start with a block diagram of Alchemist. When I asked Tom what this one is about, he said, "Everybody knows about this part, but we've got these guys over here." Exactly. So this is the Alchemist 512 die, which is used on the A770 and A750, and these are our render slices, which do all the graphics. We just got done talking about the vector and matrix hardware and how it does graphics, but all the while these little media engines have been sitting over here doing their thing. Effectively this is a visual of what is within the silicon. Is it just a containerization of terminology, or is there an actual physical hardware separation? Yeah, so there is replication: if you're designing the layout of this chip, you'll see there's a block that looks very much like an Xe-core array, there's a block that looks like the vector/matrix unit combo, and then you take that, put down four of them with some other stuff, and that becomes a block; then you put down eight of those and you've got more of the array. So there's a containerization for how you talk about things logically, but there is a physical construction component as well.
We have a full die, which would be a perfect die; we call it a 512, and that's the A770. But if one of these blocks has a defect in it, you can turn it off entirely with no impact on the rest of the chip: there's no power leakage, the block can be completely broken, and we can isolate it entirely. The remaining die represents a down-bin of that SKU, which allows you to save the silicon. If you think about it, these dies are big, and silicon errors happen, so we do this redundancy work at both the block level and the sub-block level. Inside a lot of these blocks there are extra components built in, so that if one gets fried you can switch some circuits and swap in the replacement piece; that's actually much more common. It's like extra screws when you take something apart. It's like the old Nvidia boards, yeah, exactly. So this is effectively built with redundancy in mind, and it has down-binning and repair built in.

Okay, so let's get over to the media stuff. Full disclosure for the audience: I am not very well versed in encode/decode, AV1, H.265, so I'm coming into this with very little knowledge, and I will play the role of facilitating conversation. "Thank you, and I will play the role of a recently born media-aware person." Before I even begin, I want to say thanks to James Holland for all his help, because he's our media architect and he basically helped me get to the point where I can talk about this coherently. So anything I say wrong is actually not on me, it's on James. There we go.

All right, so "what is media?" is the way I start. Media is not just one picture; it's a whole collection of pictures, and this is how animation happens. You go to the movies and see picture after picture after picture, each one a little different, and that's motion. Media is that series of pictures plus audio right next to it, and these are both put into one file together called a container, which might be an MP4 file or a MOV file. So think of it as a bunch of pictures and audio, all together.

Now, each picture is made up of pixels, and pixels are effectively little dots: you get a lot of them, you get a grid, and each dot is made up of three colors, red, green, and blue. You can get millions of colors by combining red, green, and blue, and that's one pixel. Each color can be 8 bits, 10 bits, or 12 bits, which is kind of saying how many possible colors do I want: it might be 16 million if you're 8-bit, all the way up to 68 billion if you're 12-bit. So this is again something people often hear of, which is 10-bit color. I was going to say, you see this appear in camera marketing. Exactly, and it's talking about the resolution of the color space of a particular channel. So you can see that if you're doing 8-bit color you need three bytes per pixel, and if you're doing 12-bit color you need six bytes per pixel. Now, that feeds into what we call the media format formula, which I made up yesterday. The way it works is: width times height times data per pixel (that three-to-six-byte figure) times frames per second times duration equals data size.

So what we're going to do is fill this in with a few examples. This is why, for example, when we're rendering a video you can normally approximate the size pretty well. If you just take 1080p as an example: 1920 by 1080, 3 bytes per pixel, 30 frames a second, times 60 seconds gives you 11.2 gigabytes per minute. That's a lot of data coming from generating video. And that means that if you just look at YouTube, YouTube would generate 155 petabits per second, which is 130 times the total global internet bandwidth. So, sorry YouTube, YouTube is sucking down the bits; it would be 155 petabits, so we need compression. That's really what we're talking about, because compression takes all of that data you would have had and squeezes it way down. Otherwise this is, I guess as you're saying, physically impossible. Physically impossible; you would not have a YouTube business. It would be impossible.
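To make that back-of-the-envelope formula concrete, here is a minimal Python sketch of the same arithmetic. The numbers (1080p, 3 bytes per pixel, 30 fps, one minute) are the ones used in the video; the function name is ours for illustration, not any Intel or codec API.

```python
def raw_video_size_bytes(width, height, bytes_per_pixel, fps, seconds):
    """Uncompressed size = width x height x data-per-pixel x frame rate x duration."""
    return width * height * bytes_per_pixel * fps * seconds

# The 1080p example from the video: 8-bit RGB (3 bytes per pixel) at 30 fps for one minute.
size = raw_video_size_bytes(1920, 1080, 3, 30, 60)
print(f"{size / 1e9:.1f} GB per minute")  # ~11.2 GB, matching the number on the slide
```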
All right, so how does this happen? Encoding has been basically similar over the years, and it is pretty much five phases: it starts off with color space conversion, then something called spatial and temporal redundancy removal, then generating the decoding error, then quantization, and then symbol coding. How is this process defined? Is it defined by the industry? Yeah; the way to think about it is that this discipline has evolved over decades, and there are committees that form standards around a series of different types of conversions, and those standards become AV1, HEVC, AVC. This is like an abstraction of a specific codec; this is just video, and video has to be encoded because you'll blow up the planet, and because we don't want you to blow up the planet we have to encode it, which means compress it. It's got the five phases that I'll talk about in a little bit of detail.

The first one I want to dive into is color space conversion. It's a really cool topic; it has to do with how human eyes see. To begin with, pixels are made up of three different colors, red, green, and blue in this case, but the truth is it doesn't have to be that way. For example, your eyes have sensors in them called rods and cones, and it turns out rods see luminance and cones see red, green, and blue. So cones see color and rods see black and white, and because of the way your cones work, your eyes are more sensitive to certain frequencies, which really means that brightness is the thing you're most sensitive to, and color is less so. That's all because of the physiology of your eye. We're going to take advantage of that in the first step and change the way we code: instead of red, green, and blue, we're going to use something called luma and chrominance, which is YUV. This is the same information; it's just coded as black and white plus two color-difference channels, roughly magenta-violet and green. It's pretty cool, and it means we can take advantage of the fact that you can't really see those color channels very well, so we can compress them. That's what's going to happen here: we're going to start by zooming in on our toucan.
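As a rough illustration of that first phase, here is a minimal RGB-to-Y'CbCr conversion in Python using BT.601-style coefficients. Real pipelines pick coefficients and ranges per the relevant standard (BT.601, BT.709, and so on), so treat this sketch only as the idea of splitting into one luma channel and two chroma channels; the function name is ours.

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Convert an (H, W, 3) uint8 RGB image to Y'CbCr with BT.601-style weights.

    Y carries brightness (what the eye is most sensitive to); Cb/Cr carry the
    color-difference signals that the next step will subsample.
    """
    rgb = rgb.astype(np.float32)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  = 0.299 * r + 0.587 * g + 0.114 * b   # luma
    cb = 128 + 0.564 * (b - y)               # blue-difference chroma
    cr = 128 + 0.713 * (r - y)               # red-difference chroma
    return np.stack([y, cb, cr], axis=-1)

# Example: a single orange-ish pixel
print(rgb_to_ycbcr(np.array([[[255, 128, 0]]], dtype=np.uint8)))
```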
You can see this little grid here: these would be the colors for that block in red, green, and blue, but if I coded it up in luma and chrominance it would look like this. You can kind of see there's a hard edge, mostly gray, and then a little bit of color down here. Now, as we compress, we have something called 4:4:4, which means we use all of the components of all the chroma and the luminance, and you get exactly the same thing as the RGB. In this case we're going to do what's called 4:2:2, which means we use less of the U and V channels. Is this a number of bits? Yeah, think of it as how much you're storing for chroma. So in other words it's four, four, four, and then 4:2:2 is four for luminance and two each for U and V, and then on the far right you have four for luminance and only two for chroma. Got it. And if you think about it, the difference between this picture, this picture, and this picture is actually very slight, but just doing this chroma conversion saves up to 2x compression, and you'll see this show up a lot in media formats.

So I guess, and this doesn't happen as much anymore, but especially in the older days of, say, YouTube... well, compression is still not great on YouTube. "That's not your fault, though." I know, we're getting there, we're getting there. "There are a lot of things that are my fault now." But especially in the earlier days, you'd see that sort of flicker behavior when the compression is, I'll just call it, bad. Is that an issue of not having enough data to reconstruct something? Yes, and that will make more sense when we talk about the temporal stuff. Right now, with chroma, you'd see the chroma problem, where something looks purple when it shouldn't look purple, but that's less visible than the thing that really looks broken, where you see all that blocking. That blocking is from the temporal stuff.
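Here is a minimal sketch of the chroma subsampling idea in Python with NumPy: keep luma at full resolution and average each 2x2 block of chroma, roughly 4:2:0. Real encoders use proper resampling filters and sample-siting rules defined by the standards; this is only the arithmetic behind the "up to 2x" saving, and it assumes even frame dimensions.

```python
import numpy as np

def subsample_420(ycbcr):
    """Keep full-resolution luma, average each 2x2 block of chroma (4:2:0-style).

    Three H*W planes become one H*W luma plane plus two (H/2)*(W/2) chroma
    planes, i.e. 1.5 planes' worth of samples. Assumes H and W are even.
    """
    y, cb, cr = ycbcr[..., 0], ycbcr[..., 1], ycbcr[..., 2]
    h, w = y.shape
    cb_small = cb.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    cr_small = cr.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return y, cb_small, cr_small

# Samples before vs. after, for a 1920x1080 frame at 8 bits per sample:
full = 1920 * 1080 * 3
sub  = 1920 * 1080 + 2 * (960 * 540)
print(full / sub)  # 2.0 -> the "up to 2x" chroma saving
```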
So that brings us to spatial and temporal redundancy search. After chroma downsampling we've already reduced the data by maybe up to 2x, and now we're going to do something really cool: take advantage of the fact that this is a movie and there's a lot of coherence from frame to frame to frame. Maybe we save a full frame initially, but for subsequent frames we only save the things that change. Think of it as motion vectors: they're calculated by comparing the second frame to the first frame and saying, oh, you know what, the sun went down a little bit, so I'll save a motion vector, almost like an instruction. To generate frame two, remember to move these pixels from point A to point B, and we just save that; same for frame three, and so on. So does that allow avoiding recalculating everything else? It basically says I don't need to store any of these pixels, because I'm going to take the original frame and generate the new frame by applying a sort of algorithmic manipulation. So we can throw those millions of pixels out and just remember motion vector, motion vector, motion vector.

Now, the problem with this is that it's not perfect, so we have more work to do, which is called correction terms. The idea here is that we've now got our instruction set, which has allowed us to compress by maybe up to 20 times, but we know we have errors. So to generate the correction term, we generate the image we would get using just our instruction set, compare it to the original, and take the difference. This conversation reminds me of the first time I encountered any kind of compression education; it may have been you, but it was from Nvidia, talking about delta color compression. "It was not me, probably Jonah." Maybe it was Jonah, but some of this is starting to sound familiar, where you're trying to focus on just the thing that changes, just the differences. Yeah. So in this case we've got our instruction set, we've got the prior frame, and now we generate the new frame using just the instruction set, compare it to the real frame, and we have what's called a residual. The residual is just a collection of bits that weren't quite correctly moved, because you're using an instruction set and it's not going to be pixel-perfect. So "residual" in kind of the literal sense? Yeah, it's the residual error, and if you apply it by adding it back to the generated image, you get back exactly the original image.
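Here is a toy version of the "instruction set plus residual" idea: a brute-force block matcher that finds one motion vector per 16x16 block and computes the residual left over after applying them. It assumes frame dimensions divisible by the block size and is far simpler and slower than anything a real encoder does; the point is only to show what gets stored (vectors plus residual) instead of raw pixels. All names here are illustrative.

```python
import numpy as np

def motion_estimate(prev, curr, block=16, search=8):
    """Toy full-search block matching on grayscale frames.

    For each block of `curr`, find the offset into `prev` (within +/-`search`
    pixels) that minimizes absolute error. Returns the motion vectors (the
    "instructions") and the residual left after applying them.
    Assumes frame height/width are multiples of `block`.
    """
    h, w = curr.shape
    vectors = {}
    predicted = np.zeros_like(curr)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            target = curr[by:by + block, bx:bx + block]
            best, best_err = (0, 0), np.inf
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y0, x0 = by + dy, bx + dx
                    if y0 < 0 or x0 < 0 or y0 + block > h or x0 + block > w:
                        continue  # candidate block would fall outside the frame
                    cand = prev[y0:y0 + block, x0:x0 + block]
                    err = np.abs(cand.astype(int) - target.astype(int)).sum()
                    if err < best_err:
                        best, best_err = (dy, dx), err
            dy, dx = best
            vectors[(by, bx)] = best
            predicted[by:by + block, bx:bx + block] = prev[by + dy:by + dy + block,
                                                           bx + dx:bx + dx + block]
    residual = curr.astype(int) - predicted.astype(int)
    return vectors, residual  # decoder side: curr == predicted + residual
```

A decoder rebuilds the frame by copying the referenced blocks from the previous frame and adding the residual back, which is the same feedback loop Tom describes later when the hardware pipeline runs its own decode during encoding.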
Okay, so after we've generated the residual, we have more work to do, and this is where I start getting a little fuzzy, because no matter how much James has talked to me about this topic, I cannot fully explain it yet. It's called frequency quantization. This is now working just on the residual, because that's the thing we're going to compress; the only other thing up here is the instruction set, which we just remember, and now we're compressing the residual to save it along with the instruction set. To quantize, think of it like this: we convert the residual into frequency space and then remove the high-frequency components. I'll show you a picture and say there's some math here, and there are a couple of pictures. "We're going to do a 2D discrete cosine transform, but it is left as an exercise for the reader." Yeah, exactly; I'm going to punt on this, but basically think of it as: some fancy stuff happens, and there's some quantization happening. Can you give me the basics? What the hell is that? Okay, here are the basics. First of all, you don't really do this on the full picture, you do it on the residual, but that's very confusing to show, so imagine this is the residual; it's just a few values, to make it a little easier to understand. What this is representing is the components of frequency space that have a non-blank value. If there's a spot down here with a strong value, it means there's a large flat area in the picture. So is this a location-for-location map? No, it's more like these represent different frequencies. It's XY, but in frequencies: the image is XY in pixels, this is frequency, with higher frequencies going in these directions. So the values out here represent real high-frequency noise, and the values down here represent flat, low-frequency areas. What we do is take the high-frequency components up here and either throw them out or use them less, and that's how you get this picture, which says: focus our energy down here, reduce the values out here, and we can reconstruct effectively the same residual without storing as much information. That's called quantization, and we actually end up storing this, not that. So we don't really store the residual; we store a compressed, quantized frequency-space view of it.

If your hardware can't keep up with a compressed video, back in the day you would see hitchy video playback, or you'd see blocky playback. On the blocky playback, you can think about what's happening: I can't get through this entire five-phase process in time to update my frame within that 16-millisecond window, and because I can't get through it, something gets dropped. Most of the time the thing that got dropped is the application of the correction term back into the decode, so you can sometimes tell a motion vector didn't get moved, or the instruction set generating that derived image didn't really get played.
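For the "fancy stuff" that gets punted on here, below is a minimal sketch of frequency quantization on one 8x8 residual block, using SciPy's 2D DCT. Real codecs use integer transforms and standardized quantization matrices; the step size that grows with frequency below is made up purely to show how high-frequency coefficients collapse to zero and become cheap to store.

```python
import numpy as np
from scipy.fft import dctn, idctn

def quantize_block(block, strength=4.0):
    """Toy frequency quantization of one 8x8 residual block.

    Transform to frequency space (2D DCT), then divide by a step size that
    grows with frequency and round, so fine detail quantizes to zero.
    """
    coeffs = dctn(block, norm='ortho')
    u, v = np.meshgrid(np.arange(8), np.arange(8), indexing='ij')
    step = 1.0 + strength * (u + v)       # coarser steps at higher frequencies
    quantized = np.round(coeffs / step)   # this is what actually gets stored
    return quantized, step

def dequantize_block(quantized, step):
    """Decoder side: multiply back and inverse-transform (lossy reconstruction)."""
    return idctn(quantized * step, norm='ortho')

rng = np.random.default_rng(0)
residual = rng.normal(0, 4, (8, 8))
q, step = quantize_block(residual)
print("nonzero coefficients kept:", np.count_nonzero(q), "of 64")
print("max reconstruction error:", np.abs(dequantize_block(q, step) - residual).max())
```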
All right, so we're down to our last step, which is called symbol coding. Again, I don't want to get too deep into this, but there's something called Huffman coding, which is really cool. It basically says: for the collection of bits that represents the residual in frequency space, find the bit patterns that are most common and give them a smaller code. You give the smallest codes to the patterns that are most common, and by doing that across your entire image you can compress it. That's Huffman encoding. This is the one where I felt like, at a very rudimentary level, I was starting to understand parts of it: we were talking about using a shorthand binary as a reference for, call it, a long-form binary. Yes. So is this lookup table, or reference, fixed? It's calculated by the standards committee. Basically you say: we've looked at a lot of images and a lot of movies, these are the residuals that occur, so the frequency distribution of the patterns in the residuals is X, which means this is the Huffman tree we want to use. So is the purpose just to avoid pushing through significantly longer binary? Yeah, because you get about 2x compression just from this process, and that's really significant. What's really cool about it is that it's lossless: every encoding has a precise decode, so you're not losing anything by doing this, you're just changing the symbolic storage.

And now we're done; that's how encoding works. It takes an input, which is usually large and high-bandwidth, runs it through this process, and now you're down from gigabytes to megabytes. Decoding is just the opposite: you take the data stream and expand it back up into the image. You start by undoing the symbol coding, then you do your spatial inverse, and it all follows exactly backwards. That's the story. How do you feel? I feel like I have a lot to learn.
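Symbol coding is the one step that maps cleanly onto a textbook algorithm, so here is a small Huffman coder in plain Python built on heapq. The input data is a made-up run of quantized coefficients, mostly zeros, to show the most common symbol getting the shortest code; real codecs use more elaborate entropy coders and committee-defined tables, as noted above.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code: the most frequent symbols get the shortest bit strings.

    `symbols` is any iterable (here, quantized coefficient values). The table is
    derived from observed frequencies; standards bodies instead bake in tables
    derived from large corpora of real video.
    """
    heap = [[count, i, {sym: ""}] for i, (sym, count) in enumerate(Counter(symbols).items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        # Prepend a bit to every code in each branch, then merge the branches.
        merged = {s: "0" + c for s, c in lo[2].items()}
        merged.update({s: "1" + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], lo[1], merged])
    return heap[0][2]

# Mostly zeros (a flat residual) with a few larger values: zeros get the shortest code.
data = [0] * 50 + [1] * 8 + [-1] * 6 + [3] * 2
table = huffman_code(data)
print(table)
print("bits after coding:", sum(len(table[s]) for s in data), "vs raw at 8 bits/symbol:", 8 * len(data))
```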
So what I thought I'd do is take this and apply it to us, because everything I've done so far has nothing... I mean, it has a lot to do with us, since we contribute to those standards generally, but it's the same for everybody. Same for everybody, okay, so what's the story here? I thought we'd walk through how you take that function you need to do, which is encode and decode, and build hardware to do it well. Intel has had a long history, with Quick Sync and others, of being incredibly good at media processing, and that same legacy is now part of Arc. Quick Sync is actually the reason we use Intel gaming parts rather than any kind of HEDT for the video production machines: Premiere is so accelerated by the iGPU with Quick Sync that if I have, say, a 4080 and a 4090 in two machines and one has the iGPU disabled, that one is going to do worse no matter which card it is. It's pretty amazing. And it's the same guts I'm about to show you, the same both in our integrated and discrete parts. We're showing the discrete one here: that block that was on the right, the one that was ignored for so long, is now having its moment.

The media engine has two MFX engines and a bunch of fixed-function stuff. The MFX has the encoder and decoder, which is the pipeline we just talked about, and if you look at the names they actually match the phases of that pipeline pretty closely. This is all fairly fixed-function, which means we're doing very low-power, very high-performance encoding and decoding directly in hardware. So we're now looking at actual silicon real estate representing fixed-function hardware? Exactly, and some of these functions are pretty heavyweight. The names are almost exactly the same: inverse quant is quantization, correction terms are correction terms, and what's happening here is the feedback I talked about, where as you're doing encoding you generate your instruction set and then you need to run that image back through the decoder to see what the image would look like with just the instruction set, and then you use that to compute your error term. That's why this thing feeds back and forth. Does this process get its own driver, or part of a driver? There is a media driver, and I'll show you the media stack in a little bit. I love the way this picture matches the actual functions of decoding and encoding. If you dig in a little, you'll see it's attached to memory, obviously, and for encode we take uncompressed in and put compressed out. If you want to decode, it's exactly the opposite: compressed in, uncompressed out. We also do transcode, which is the same idea: you decode, get an uncompressed image, and run it back through encode, and this is supported in a bunch of different apps. I'd also point out that we have two MFXs, which means we can do dual decode and encode. As an example, there's a thing called a group-of-pictures or tile split, where you say: I want to encode a video, split it up by scene or in some other way, send one part to one MFX and the other to the other, and you can basically double your speed. For the same video? For the same video.

Lastly, this is the software stack. It's a little bit bigger than the 3D software stack, because the applications up here, like HandBrake or Adobe or you name it, and there are tons of them, have multiple choices of frameworks to work with. There's obviously FFmpeg, which is open source and widely available; there's GStreamer; HMFT is Microsoft's version; VPL is from Intel. So there are a lot of choices people make today, but at the end of the day they all go to either our VPL runtime or Microsoft's D3D runtime, with Vulkan off on the side. The way to think about it is that all of these runtimes call into, just like on DirectX, a UMD, a KMD, and firmware on the GPU. This slide is getting a little closer to things I'm familiar with, if for no other reason than the names, because this is what we interact with on the video side for our work. Compression is all invisible to us; compression is magic. "I swear, I never thought about it until like five weeks ago, and it's amazing." Then this section we talked about a little in our other video on how drivers work and what they do: user-mode driver, kernel-mode driver, and graphics firmware, those are blocks we covered there. And that's applicable to running games as well; it's a different UMD here, the media driver, but think of it as handling how to program the hardware to do those functions, depending on the APIs that were called from the runtime. It would basically be programming the MFX blocks to do the right thing.

Do you have anything else related to media you want to go over? Not at this time. This is a good description of media, and obviously the reason I'm here to talk about it is because we do a really good job on it, so I'd say any time you want to run some performance benchmarks, feel free; we've got a lot of good hardware in there that does some good things. We're pretty excited about it. So that's the media overview for compression, decompression, and the human eye. Thank you, Tom, and thanks for the lesson as well. You're welcome. We've got the driver video, and the last one we're doing is PresentMon 2, which is exciting for us because it relates to benchmarking, so check back for those. Thanks for watching, and we'll see you all next time.
Info
Channel: Gamers Nexus
Views: 188,361
Keywords: gamersnexus, gamers nexus, computer hardware, how compression works, how encode works, how decode works, intel arc, intel av1, intel arc encode, intel arc premiere, intel arc decode, yuv, rgb vs yuv, quantization, transcoding explained, video encoding explained, video rendering explained
Id: swoIPD7EvEw
Length: 26min 0sec (1560 seconds)
Published: Tue Apr 02 2024