Garbage Collection (Mark & Sweep) - Computerphile

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
I thought today we could look at garbage collection which is a form of automatic memory management so the the problem here is when you're programming you never know how much memory your program will use when it's running so if you're running an image processing thing you don't know how big the photo you'll be editing is if you're a web browser you don't know how big the web page will be so you dynamically have to ask that the operating system give me some memory the physical part of the Chip And you get access to it and later you give that back and you go through this sequence it turns out to be very difficult to do that manually and so there are techniques including garbage collection that do that automatically so you can't just say right give me absolutely loads because I want to be able to use whatever I want then well you can and it's like in anything in life you can ask for all the resources and they might even be given to you but then no one else has access to them so you're trying to balance out being a good citizen uh with also asking for what you need typically if you go back to older programming languages and all you see is an example if you wanted memory you had to ask for it and release it yourself so um there are two Primitives here one's called malloc and malloc says give me some quantity of memory please and the operating system will and I'll simplify a bit it will then reserve a physical part of your RAM chip for your program to use and then free says I'm done with that block of memory however big it was you can have it back you can free up that physical part of the ram chip so if we have a little look at some actual code we can see how this works and then what the challenges are so here is a very simple C program I just filled out the skeleton because I always forget the bits these days and I can say uh I would like to allocate uh let's see 32 kilobytes of memory so malloc says that give me 32 kilobytes of memory and then I can do things like I can set so we call that store value so I can say the first byte I'll just store the letter H oops and then I can read values back so I can say printer character and then I've got a load and then when I'm done with that I can say I would like to free that memory so that's a very valid little program if I've got it right let's just quickly compile it and see how many things I've got wrong uh so it prints out H that's good so that's things working correctly so I've mallocked memory and I've read it so I've been a good citizen I gave the memory back uh when I was done with it so several ways this can go wrong the first one is I forget to free the memory so there's now what we call a memory leak if I don't use that bit of memory for the rest of the program then the the operating system will think I've still got claim to it and it won't take it back even though I may know I will never need it again and if you have too many of these memory leaks you run out of memory the resources are no longer there so that's something we've probably all experienced your program just exploding or if you're machine going into swap it's often because of this so that's one mode of getting things wrong another one is I can free memory and then try and read and write from it so when I've moved the free statement here I'm then reading and writing and then that program will explode in various random ways the operating system will think oh you're done with that memory I'll give it to someone else or use it in some other way and you're trying to read and write from it and sometimes it works for a while and then stops working or you get random data all sorts of problems can happen and another one which I'm particularly fond of is where you try freeing the same memory twice now the operating system might notice this but what sometimes happens you say I'm done with this memory someone else asks for this memory they get the same pointer like ID should we say and then you free it and you free the memory they're using so they they are perfectly good citizens of Gotham City going about their business with no worries and you've then said they no longer need the memory and then they go Splat so these are all I mean there are some other ones but these are the classic ways that you can go wrong in traditional memory management and it turns out as humans we're incredibly bad at doing this particularly for big programs shouldn't the operating system manage some of this for us is or is that where we're going with this sort of so it turns out it's really difficult for the operating system to do this but in within our programming languages we can um have an automatic memory management system that then gives things back the resources back to the operating system more swiftly without us as humans having to say malloc and free people have realized this is a problem for a long time and in fact perhaps surprisingly even in the late 1950s the very first automatic memory management systems were being created and the basic idea is that you say I would like to use some memory and the automatic memory management system will work out when you are finished using it and then it will give it back to the operating system on your behalf so if you were to put this in terms of C you would have malloc but you would no longer worry about calling free and because you no longer have to call free we can get rid of all of these sources of errors and make things a lot simpler okay so there's a book it feels like it's a book coming it feels like it there is a bot of sorts there's there's a couple of techniques and one of them is the obvious one and and has some trade-offs let's say so the obvious way of doing this is what's called reference counting so you say when I allocate some memory I'm going to put a little counter saying how many parts of my program are still using this so when it's allocated the count will probably be one and if I give it to someone else used the count goes up to two and when they're done using it they put the count back to one and when the count goes to zero no one's using the memory it can be free even go back to things so it's not fully automatic but it's semi-automatic and it's definitely less likely to have problems but there are some consequences one I've got to remember to add and decrement these things accurately which is surprisingly difficult to do correctly um the other one is that means every block of memory now has a little integer associated with it which takes up extra memory and surprisingly the literal adding and incrementing and decrementing these counts can become surprisingly um punishing in terms of performance if you keep handing them out and getting them back over time and then just finally reference counting cannot deal with what are called Cycles so we'll see a maybe a little example of that later um you can end up finding remember me that you have finished using that where the counts are always above zero and cannot be automatically returned so is that like a memory leak it is it is a memory leak yes it's absolutely a memory leak exactly the same consequences your program just grows but almost worse it looks to use a program like I finished using with this memory it looks like I've done the right thing and yet somehow it's not being freed um so that's reference counting has its place it's still used in in some languages but what's more commonly used in most languages now you know your JavaScript your Javas um is what's called garbage collection now sometimes this is used as an umbrella term but more commonly it's used to mean a specific type of automatic memory management where it's more accurately able to determine when you have finished using some memory and it works really really well so maybe we can see a a little example of that what I do first of all I just run a very stupid little small Python program and we can just see that there must be some sorted memory management going on so we'll just make a little class and we'll make some very long Loop [Music] this right so what this program is going to do is just in a long Loop keep allocating some memory now naively if this was C the program the memory usage would just go up and up over time because I'm not saying free so when I say c angle bracket allocate some memory effectively and we can just force that to make clear that we're storing some extra things somebody must be doing something and then uh let's run this and we'll run this in another window as well so it will see if I picked a big enough number no I really need a bigger number to make it run longer it's going to make it run for very long so the first one of these python processes you can see what under let's take under size uh or actually under resources which doesn't matter which one the program keeps running and it's never using any more memory in this case it actually turns out using reference counting at the moment but as soon as I finish using the memory it's deallocated that's automatic memory management of some form working but what happens if we do something a little bit more sophisticated so let's kill that so let's make another new font and what we're going to do now is just look how one of these garbage collection algorithms work now they turn you can go into huge depth making these things super fast but I just I will show you the basic algorithm which is called a mark and sweep algorithm where we work out which memory is still being used which can be free everything else all the complexity is just an optimization to make those stages go faster but they don't affect the fundamentals so let's save this one so again I'll just make an empty class and what I'm going to do first of all I'm going to make an object so in Python this is a blank object what I'm going to show you is the problem of working at which memory is live at first so there's a concept of what are called Roots so when we allocate memory we have to say which objects am I still using we call those the roots and I'll simplify and say any object reference from a variable is a root so I'll try drawing and remembering that I have the artistic skill of someone with no artistic skills in fact my artistic skills should be legendary for their lack of existence so we'll say I've got a variable X and it's allocated a chunk of memory over here it doesn't matter what's in that memory particularly and I have a reference to it so now we can say one of the roots is X so it's pointing to that memory in the intuitively that block of memory here is alive because I'm still pointing at it and I can make a second variable called y and that will also have another block of memory over here I'm getting some very variable with arrows here doesn't really matter intuitively now both of those boxes of memory are still being pointed out they have Roots so they're alive now the first thing I can say is I'm going to put the null value or none value in Python's variable y so now I've effectively removed that pointer there now so intuitively this second object the the one with the thinner lines is no longer being pointed to so what's going to happen in Mark and sweep algorithm is the garbage collector every sort of can say what memory is still accessible so it's going to start at the roots and it will say Okay variable X see what you point to at you point to something over here so I'll mark that as tickets being used variable y That's pointing to nothing nothing to do there and then it says right I finished the mark phase of the algorithm look at all the objects in memory there are two one of them I ticked I could reach the other one no take so I can now deallocate that so Mark work out what's live sweep get rid of the stuff that isn't so that's the very simple version and then you have to think what happens if there are more complex graphs than just two variables and this algorithm is so simple it scales really nicely so I will put back the object into why and I will say that the X object now has a reference to the Y object so I have this little line going over here okay so now I've got if I was to do run the mark sweep stage both of those objects are clearly accessible but now imagine if again I say why is none I get rid of this and I run the garbage collector first object I can reach from X yes tick that and then it will then do what's called a reachability algorithm effective what can be reached from this object well if I go from this object I've just marked with a tick and via the Y reference oh yeah look I can take that one as well it's in use so now I don't want to free that second one even though it's not being directly pointed to from a variable so in this case it won't have to get rid of any of those objects in memory and that in a nutshell is garbage collection there are some other details perhaps we want to talk about a few but really that's the core of the algorithm it's very nice and simple very intuitive and the nice thing is it will free all of the stuff I can't reach so it Suites where anything that's not marked exactly so a sweep is just a Boolean flight just it is this reachable yes or no and if at the end of this the mark phase it's not marked as reachable outagos and this might be completely irrelevant compared to this but just thinking about the kind of idea of allocating or grabbing some memory why don't programs just get memory when they need it why do they have to kind of grab some allocate it and ah okay good question we have this idea that in in or sometimes we have this idea that programs such in a sense like know and Advance what they're going to need to do but you know imagine you're writing the software for like a a web uh like an e-commerce basket how many items are people going to put in the basket some people are going to buy one the shopaholics are going to buy a hundred so you never know in advance how much you'll need to do and so this is really all about allowing you to dynamically allocate memory because you just don't know how much you'll need then I'll be free so over time your process will its memory usage will go up and down depending on what you need and your process might run for days weeks months going down to tiny amounts going up to huge amounts also bearing in mind that even with a very simple program like the one I'm writing you can easily allocate hundreds of megabytes if not gigabytes per second so if you don't free memory pretty quickly you won't have resources really quickly are there any problems with this then there is one and it goes back a little bit to our reference counting example earlier which is um sometimes people find they're using say Java which has amazing quality garbage collection algorithms and yet their program seems to leak memory and keep getting bigger and bigger and bigger and it's because they've got a route left in other words they have a pointer to what looks like one little object and and which they've forgotten that they're keeping alive that object points to directly indirectly to millions of other objects you know gigabytes of memory and as long as there's that one little pointer left the rest of the stuff won't be freed so occasionally you have this challenge of my memory is huge and you have to find often there's this one little reference you need to clear in order the garbage collector come with us all of the other stuff isn't reachable and there are tools to help you with that but it's something that all of us get wrong at some point and my last question doesn't marking the memory use memory yes it does and so I alluded to you earlier that the real algorithm so say if you look in the Java virtual machine that has several different garbage collectors they go do some amazing tricks uh to squash down the moments so the marking thing is one single bit in memory it's stored alongside some other information that you have to have around and so more or less not only does it not take up any room but implicitly you remember when I was doing the little example after each space I went and cleared the little tick marks the marks and the proper garbage collectors say hey even going over all the objects and and setting them to false the mark bit that is just too slow so what we'll do we'll remember last time was it did the mark bit set to True mean free or does the market set the false mean free so they don't have to go around and unset things all sorts of little optimizations to make this as fast as they possibly can and when you have got a memory leak is that why we turn things off and on again yes and I am a a habitual Turner offer and back on her again because yes it really does fix memory leaks and many other patented and unpatented problems I make that number a lot bigger so I make it an order of magnitude bigger this for Loop it now takes a bit longer and if I make it longer again we'll see depending on which style of core it's going on I think it's within three cycles and then if you go any longer then it's going to take you longer you know
Info
Channel: Computerphile
Views: 61,431
Rating: undefined out of 5
Keywords: computers, computerphile, computer, science
Id: c32zXYAK7CI
Channel Id: undefined
Length: 16min 21sec (981 seconds)
Published: Fri Jan 20 2023
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.