GopherCon UK 2018: Andre Carvalho - Understanding Go's Memory Allocator

Hello everyone, I'm Andre. I'm here today to talk to you about memory allocation. We are going to cover a few concepts about memory, such as virtual memory and how a process is laid out in memory while it's running, but our main goal is to take a deep dive into the Go runtime and understand how its memory allocator is implemented.

First, just a bit about myself. I'm a developer at Globo, which is a media company from Brazil. I work on tsuru, an open source platform as a service that was built inside Globo as a way to let our developers deploy their applications faster; you can think of it as a Heroku that you can run on your own infrastructure. I also blog about various things on my personal blog, and I'm planning to do a big series of posts about this subject, so if you're interested you can check it out later. I also like reading books, and this is my favorite technical book, The Linux Programming Interface. It's a really big book; I've been reading it for over a year now and I'm still yet to finish it, but it is really interesting because it explains exactly how many of the syscalls that are available in Linux work, so it's great if you want to know how Linux really works. One of its chapters is precisely about memory allocation: it explains which syscalls are available to allocate and deallocate memory and how you can use them, and by the end of the chapter the reader is invited to implement a memory allocator. This is how I got into trying to understand how a memory allocator works and how I could implement one, and as a Go developer I got curious and wanted to learn how the Go memory allocator works.

First, just a few concepts. The first important one is virtual memory: processes do not read directly from physical
memory. This happens for very real reasons. First, from a security point of view, one could implement a process that reads the contents of the whole physical memory, and it would be pretty easy to read what other processes are writing, so that would be pretty bad. Also, we would need to handle all the coordination between multiple processes: we'd need to make sure we aren't writing to a memory area that other processes are writing to, and so on, and that would also be pretty hard to do. So virtual memory is the concept that abstracts this away from the process: the process has a virtual view of the memory that's available. There are several ways to implement virtual memory; the most important ones are segmentation and page tables, page tables being the most common, and that's the one we are going to see next.

Here on this slide we can see how virtual memory works. On the left we have a process that's running, and its memory space is divided into pages. Each page is usually 4 KiB, but that can change. Each process has a corresponding page table, which handles the mapping from virtual pages to physical memory frames. Here we can see that page 0 is mapped to frame 3 of physical memory. We can also see that some of the frames in physical memory are associated with other processes, so they are never going to be mapped in our page table. And some of the mapped pages, like pages 6 and 7, are missing a corresponding entry in the page table. This simply means that the operating system swapped those memory pages out to disk because it was under memory pressure: once the OS feels there's not enough memory for the processes that are running, some pages get swapped out to disk. If the process tries to access something that's on those pages, the operating system
is going to copy those pages from disk back into physical memory and update the page table with the corresponding entries.

This is pretty easy to observe with a really simple Go example. Here we generate a random number from 0 up to 100, then we print the generated number and the memory address of that variable, and then we just loop forever, to make sure the program doesn't exit. If we run two instances of that program at the same time, we may get an unexpected result, at least if we're not familiar with how virtual memory works: here the two instances got two different values, but the very same address is reported as the address of the variable in both. This feels weird, but that address is just inside each process's virtual address space; it's not actually the same physical memory location, so it's no big deal.

Another important concept is how a process is laid out in memory while it's running. A process's memory layout is divided into multiple segments. The first is the text segment, which contains the machine code that was loaded from the binary. The second is the data segment, which contains initialized static variables. Then comes the BSS segment, which holds uninitialized static variables. And then the two most important ones. The heap is where dynamically allocated variables are stored; it grows upwards, so it's not completely mapped when the program starts, it keeps growing as we allocate more memory, and the program break marks the end of the current heap. The other important segment is the stack. The stack holds function stack frames, so local variables, parameters and things like that: every time we call a function we add its stack frame to the stack, and when we return from that function we remove it, so it grows downwards.
We can think of stack allocation as simply appending something to the end of an array. To allocate a given size we just increment the stack pointer and return a pointer to the beginning of the newly reserved area; the application uses this space for the allocation. To deallocate, we simply decrement the stack pointer by the size that was allocated, and that space is now unused and can be reused for the next allocation. This is very simple and really fast, these operations are not expensive, but it's pretty limiting, because I can't deallocate something that was allocated in the middle of the stack; I need to deallocate in the opposite order that I allocated in.

To overcome that, we use heap allocation. Heap allocations are used for objects whose size is only known at runtime. In C we usually use malloc and free to handle heap allocations, and C++ has new and delete, but in Go things are a bit different: there is no malloc function in Go. Go uses a technique called escape analysis to figure out if a variable needs to be allocated on the heap or on the stack, and it also has garbage collection to know when values can be deallocated.

Before going through the details of how Go actually implements memory allocation, let's see how one could build a really simple heap allocator. To implement a minimal allocator we need to implement two functions. The first is malloc, which receives the size of the memory area we want to allocate and must return a pointer to that area. The second is free, which receives a pointer to an area that was previously allocated and must free the resources associated with it. We can think of an allocator as something that sits between our application and the operating system.
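The stack-style allocation just described can be sketched as a toy bump allocator (illustrative only, not how any real runtime implements its stack):

```go
package main

import "fmt"

// A toy "stack allocator": allocation just bumps a pointer, and
// deallocation must happen in reverse (LIFO) order.
type stack struct {
	buf []byte
	sp  int // stack pointer: index of the next free byte
}

// alloc reserves size bytes by bumping the stack pointer.
func (s *stack) alloc(size int) []byte {
	p := s.buf[s.sp : s.sp+size]
	s.sp += size
	return p
}

// free releases the most recent allocation by decrementing the
// stack pointer; anything in the middle cannot be freed.
func (s *stack) free(size int) {
	s.sp -= size
}

func main() {
	s := &stack{buf: make([]byte, 1024)}
	a := s.alloc(16)
	fmt.Println(len(a), s.sp) // 16 16
	s.free(16)
	fmt.Println(s.sp) // 0
}
```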
While the application talks to the allocator using the malloc and free functions, the allocator uses system calls to talk to the operating system to both allocate and deallocate memory. There are several syscalls that can be used for that: the most commonly used to allocate memory is the mmap syscall, and to deallocate we can use either munmap or the madvise syscall. The Go memory allocator actually uses mmap to allocate on Linux and the madvise syscall to deallocate memory.

Our minimal allocator is going to keep a simple linked list of the free objects that are available to fulfill allocations. Each object is divided into two parts. The first is the header segment, which contains the size of the object and a pointer to the next object on the list. Just after the header segment we have the data segment, which contains that many bytes and is the part we actually return to the application as an area it can use to store its objects.

Let's do an example. Suppose the application calls malloc passing 10, which means it wants to allocate 10 bytes, and the free list of our allocator is empty: it has no objects at all. It traverses the list, finds no object, and so it allocates a new one from the operating system, and to do that it uses the mmap syscall. mmap has a bunch of different parameters and flags, but these are the most important ones. The first parameter is the start address of the memory mapping that you are creating. The second is the size of the memory map you want to create; this size needs to be in units of pages, so 4 KiB, 8 KiB and so on. Next we can set some protection bits on the memory map; here we are using both read and write permissions, which is normally what you want when doing memory allocation.
Then there are a bunch of flags that can be set on the mmap; some are really obscure, but for memory allocation we usually want the private flag, which means that only this process is going to have access to this memory map, and the anonymous flag, which means we are not expecting to load the contents of this memory from a file on disk, so nothing is loaded when we create the mapping.

So we do the mmap and we get a 4 KiB memory map, and we add it to our linked list: the first 12 bytes are used to hold the header, and the rest of the bytes are available for the application to allocate memory from. But the application asked for 10 bytes, and returning all those bytes would be wasteful, so what we do is split this object into two objects, making one that's precisely the size the user asked for: 12 bytes for the header and then 10 bytes for the data segment that's going to hold the allocation. We use a total of 22 bytes, and the rest stays on the free list. Then we can simply return a pointer to the start of the data segment, which is going to hold whatever the application wants to store there, and that's it. If the application does another allocation we can keep splitting this larger object, or do new calls to mmap to get more memory.

And what about deallocation? When the application is done using p, it calls free passing p as a parameter, which is the start of the data segment of that object. If we subtract the size of the header from p, we are able to fetch the object's header, which contains all the metadata about that object, and we can simply add it back to the linked list: just adjust the pointers on the linked list to make sure that it's now back on the free list.
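The arithmetic of that split can be sketched as follows (toy numbers matching the example: a 12-byte header, a 10-byte request, and a fresh 4 KiB mapping):

```go
package main

import "fmt"

// Toy free-list bookkeeping: each object is a 12-byte header
// (size + next pointer, as in the talk) followed by its data.
const headerSize = 12

type object struct {
	size int     // size of the data segment in bytes
	next *object // next free object on the list
}

// split carves an allocation of want bytes off a larger free
// object: it consumes headerSize+want bytes for the new object
// and reports how many data bytes remain on the free list.
func split(free *object, want int) (used, remaining int) {
	used = headerSize + want
	remaining = free.size - used
	return used, remaining
}

func main() {
	// A fresh 4096-byte mapping: 12-byte header + 4084 data bytes.
	free := &object{size: 4096 - headerSize}
	used, remaining := split(free, 10)
	fmt.Println(used, remaining) // 22 4062
}
```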
A new allocation for 10 bytes can then use this object directly.

This minimal allocator can be implemented in a few hundred lines of code; this is actually the one I wrote for the exercise in the book that I mentioned. But it has a few issues that we did not try to solve, and a real memory allocator has to solve these issues, or most of them. The first one is fragmentation: we keep splitting larger objects into smaller ones, but if the application then starts to allocate larger blocks, we need a way to also merge smaller objects back into larger ones, so we can fulfill those allocations without having to call mmap again. We also did not handle heap corruption: if the application tries to free the same pointer twice, it's going to corrupt our linked list, so we need to make sure we are not freeing the same pointer twice, and maybe throw an error or something like that. We are also never releasing memory back to the operating system, so we'd need to figure out when to do that, like setting a threshold on the amount of unused memory we are willing to hold, and also how to do it, that is, which syscalls we are going to use to release that memory to the OS. There are actually many memory allocators that don't solve this one, which may sound surprising. And we didn't handle anything related to multi-threading: if multiple threads are using our allocator, we are probably going to need some kind of lock to protect it from concurrent access.

So let's start to dig into how the Go runtime allocator actually works. First we are going to see how a real memory allocator, not this toy one that we built, actually works, and that one is TCMalloc. TCMalloc, or thread-caching malloc, is a memory allocator that was originally implemented for the C language by Google. It serves as the basis for the Go
runtime memory allocator, and its goal is to reduce lock contention for multi-threaded programs. In TCMalloc each thread has a local cache, and there are two types of allocations: small allocations, which have size less than 32 KiB, and large allocations. In TCMalloc, memory is managed in units called spans; each span is a contiguous run of memory pages. Also, instead of keeping the header segment close to the data segment, as we did in our toy allocator, the actual spans are allocated in a separate memory area and just point to the real area that the application is going to use.

First, large allocations. Large allocations are served by the central heap. The central heap is a global structure that's shared by all threads, and the requested size is rounded up to a whole number of pages, so if the application calls malloc with 34 KiB, it's going to be rounded up to 36 KiB, which is nine pages. The central heap maintains multiple linked lists of free spans; each linked list contains spans of a given size, so we have a one-page list, which contains spans that are one page in size, a two-page list, and so on, and the last linked list is for spans that are over 255 pages: every span larger than that goes on a single list. To do a large allocation, the central heap starts at the corresponding linked list: if you are trying to allocate nine pages, it starts looking at the nine-page linked list for a span. If there is a span there, it simply returns the memory of that span to the application and we are done. If there isn't, it looks at the next linked list, and so on, until it finds one. If we end up with a span that's larger than requested, we actually split that span in two, return the desired amount to the user, and keep the rest on another linked list.
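The page rounding from the 34 KiB example can be sketched as follows (assuming 4 KiB pages):

```go
package main

import "fmt"

const pageSize = 4096 // 4 KiB pages, as in the talk

// roundToPages rounds an allocation size up to a whole number of
// pages, as TCMalloc does for large allocations.
func roundToPages(size int) (pages, rounded int) {
	pages = (size + pageSize - 1) / pageSize
	return pages, pages * pageSize
}

func main() {
	pages, rounded := roundToPages(34 * 1024)
	fmt.Println(pages, rounded/1024) // 9 36
}
```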
If the central heap is not able to find a span that's large enough for the allocation, it uses the mmap syscall to allocate a new one from the OS and adds it to the linked lists before returning it to the application. So it's really simple: the application talks directly to the central heap through the allocator; if the central heap has a span, it returns it to the application, else the central heap asks for a new one from the OS, and this requires acquiring a lock on the central heap.

Small allocations are served by the local thread cache, and the requested size is rounded up to one of the size classes. There are a large number of size classes. Imagine the application calls malloc with 4 bytes or 6 bytes: this is going to be rounded up to 8 bytes, and 8 bytes is the first size class. This is all done to reduce fragmentation: you keep the number of different block sizes small, instead of having arbitrary sizes for each kind of allocation, which would make reusing objects harder. The local thread cache also maintains multiple free lists, one list for each size class, so class zero has two items on its linked list, and so on. These are not spans; these are objects, which are smaller than spans. For instance, if the application asks for an object from class zero, we can simply return one of those, and that's it. It's important to note that if the allocation is fulfilled from the local thread cache, no lock is required, because the cache is local to the thread, so there aren't multiple threads trying to talk to the same local thread cache. If the allocation is for, say, class one, and we are lacking any objects on that free list, we need to go to a more central structure, which is the central free list. There is a central free list for each of those classes, and those are shared between all threads, so it needs to acquire a lock.
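The size-class rounding can be sketched like this; the class list below is illustrative only, not the exact table used by TCMalloc or the Go runtime:

```go
package main

import "fmt"

// A few small TCMalloc-style size classes (illustrative values).
var sizeClasses = []int{8, 16, 32, 48, 64, 80, 96, 112, 128}

// roundToClass rounds a small request up to the nearest size class,
// so e.g. 4-byte and 6-byte requests both land in the 8-byte class.
func roundToClass(size int) int {
	for _, c := range sizeClasses {
		if size <= c {
			return c
		}
	}
	return -1 // would fall into a larger class, or be a large allocation
}

func main() {
	fmt.Println(roundToClass(4), roundToClass(6), roundToClass(9)) // 8 8 16
}
```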
The local thread cache simply asks for a batch of objects from the central free list; those are added to its own free list, and one of them is used to fulfill the allocation. But what happens if the central free list has no spans available? The central free list then asks for a new span from the central heap, and this requires another lock, which is global for the entire process. The new span is acquired from the central heap and added to the central free list, some of its objects are given to the local thread cache, and one of them is used by the application.

So this is the path of a small allocation. The application first asks the local thread cache: if the local thread cache has an object for that given class, the allocation is fulfilled right there and no locks are required; this is the fast path of the allocation. If there isn't an object available, the local thread cache goes to the central free list for that class; if there aren't objects available there, it goes to the central heap, and so on. Each layer below is more costly, so we try to avoid going down as much as possible, and each layer also tries to acquire more objects from the layer below than are strictly required, so it goes down there the minimum number of times possible.

Now, deallocation in TCMalloc. TCMalloc keeps an array that maps memory pages to spans. This means that pages 1 and 2 are managed by span A, pages 3, 4 and 5 are managed by span B, and page 6 is managed by span C. When the application calls free on a given object, from the object's address we are able to compute its page, and from the page, by using that array, we are able to get to the span that's actually managing that page. The span has a bunch of metadata, which includes the size of the objects that are allocated from it. So if the span holds small objects, we simply append that small object to the
local thread cache free list for its size class, and the deallocation of the object is done. But if the span was used for a large allocation, we are going to check the pages adjacent to our span's pages to see if they are also free. For instance, if we are deallocating span A, we are going to check page 3 to see if span B is also free, and if that's the case we are going to merge both spans into a single larger span; then we can simply add the new span to the corresponding linked list of the central heap. And as I said before, some allocators never release memory back to the OS, and this is the case for the original TCMalloc implementation: all that deallocation does is add the object or span back into a free list, and it keeps reusing those objects as much as possible.

So we've been talking about malloc and free, and we all know that Go doesn't have malloc and doesn't have free. So how is the Go memory allocator actually invoked? Let's see that before we look at how it differs from TCMalloc. This is a really simple Go program. We have a main function that calls another function f. The f function is marked with //go:noinline just to make the example simpler; here we are preventing the compiler from doing some optimizations that would make this much harder to show. All the f function does is allocate a variable i, initialize it with the value 10, and return the address of that variable. Before we do anything: if the i variable were allocated on the stack and we returned from f, then when the main function tried to use that variable, this would result in an invalid memory access, because we would be accessing a variable that was allocated on the stack frame of another function. So we know for a fact that the i variable has to be allocated on the heap, because main is going to access it. We can use the -gcflags flag, which are not garbage
collector flags, they are Go compiler flags, with the go build command, and by passing -m -m in the gcflags we are telling the compiler to tell us about the optimizations that it's doing. Here, marked with those red arrows, we see that the compiler figured out that the address of the i variable escapes to the heap, and it decided to move i to the heap. What the compiler is doing here is called escape analysis. There are multiple resources about this if you want to look it up; I've got one in the references. The compiler tries as much as possible to keep variables on the stack, because that's much cheaper, but if it's not able to prove that the stack is enough, it's going to place the variable on the heap, and that is the case here.

But how does this translate to a call to the runtime memory allocator? Here we are also using the Go compiler, this time with the -S flag, which means we want the Go assembly output from the compiler. We don't need to be assembly experts, and I don't understand most of what's in this output, but here it's pretty obvious that there is a call to a function called runtime.newobject, and by inspecting the runtime code we see that this function calls another function whose name starts with malloc, which means we are on the right track. The mallocgc function is the function that actually implements the memory allocator inside the runtime: it receives the size of the object that we are allocating, a pointer to the type, and a boolean that says whether we want this memory to be zeroed out before it's returned.

OK, so let's see now how the Go allocator is actually implemented and how it differs from TCMalloc. The Go allocator is based on TCMalloc, and the biggest difference between TCMalloc and the Go allocator is related to garbage collection.
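The example from the slides can be reconstructed like this; compiling it with go build -gcflags "-m -m" should report that i escapes to the heap:

```go
package main

import "fmt"

//go:noinline
func f() *int {
	i := 10 // &i outlives f, so escape analysis moves i to the heap
	return &i
}

func main() {
	p := f()
	fmt.Println(*p) // prints 10
}
```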
The allocator is pretty tightly coupled to the garbage collector: it needs to cooperate with the garbage collector to make sure that everything that's unused gets freed. This also makes it really hard, or even impossible, to replace with another implementation. In C and C++ it's fairly common to have multiple memory allocator implementations that you can just link into your binary, and that's it. It's really hard, I think actually impossible, to do that in Go, because the allocator is spread around multiple parts of the runtime code, which is good for the garbage collector but bad if you want to try another memory allocator implementation. And there are actually three types of allocations instead of just two: there is a different type called the tiny allocation, used for objects with size less than 16 bytes that contain no pointers. We are going to cover this one in a minute.

Before diving into the details, here is a big oversimplification of how the garbage collector works. The garbage collector in Go is said to be a concurrent mark-and-sweep garbage collector. Basically, it scans all the objects that were allocated by the application and marks the objects that are being used; these are said to be live objects. For each marked object that has a pointer to another object, those objects are marked too, and by the end of the mark phase all the objects that are unmarked are unused, so they can be freed, and this is called sweeping: the objects that are not live are going to be swept in the next phase. This actually happens in two different ways, and that's part of why it's called concurrent mark-sweep: sweeping happens in the background, by a separate goroutine, but it also happens in response to allocations, so before doing some of the allocations the allocator is going to try to sweep some objects to make sure
that it's not requesting memory unnecessarily.

The mheap is the central heap of the Go memory allocator. Instead of keeping just the free spans, the mheap also keeps linked lists of spans that are busy; busy spans are spans that are being used by one of those caches or were used for a large allocation. Before doing a large allocation, the mheap is going to try to sweep the same amount of memory that was requested for the allocation: it traverses those lists looking for spans that have objects that are not being used anymore, that were not marked in the last GC phase, and it sweeps them; if a span is completely unused, it can be moved to one of the free lists. The free spans are kept in a structure fairly similar to TCMalloc's. The biggest difference is for spans larger than 255 pages. This is actually a recent change in the runtime, from about a year ago I think: those large free spans are kept in a structure called a treap, which is a randomized binary tree. This was done because some users were having performance issues with heaps larger than 100 gigabytes: when they were doing large allocations, finding a span of a given size was becoming really costly, because the allocator had to traverse the whole structure, while finding a span of a given size performs much better on a treap. So yes, people have heaps larger than 100 gigabytes, which I found impressive, and the commit message has more information about this change. So the mheap traverses those lists looking for a span that's big enough to accommodate the allocation, and returns it to the application. If there isn't an available span, the Go runtime allocator uses the mmap syscall to allocate a new one, and after doing this allocation, depending on the amount of memory that's live in the process, the goroutine may perform some additional work for the garbage collector, and this may
cause a stop-the-world pause in your program. This means that if you are doing lots of large allocations, some of them may end up with a larger latency, so it's something we should take care of, or at least be aware of.

Small allocations look pretty much like they do in TCMalloc, except that instead of talking about threads, in Go each logical processor, which is a P in the runtime, has an mcache, which is the local cache. This local cache keeps a span for each of the size classes, pretty similar to TCMalloc. There are 66 size classes, and these are some of them: for instance, class 4 is used for objects of 48 bytes; each span has 8 KiB, so each such span is able to hold 170 objects. When doing a small allocation, the mcache looks at the corresponding span for that class and checks if there is an available slot to hold the allocation. If there is, that address is returned to the application, and the application uses it. But if the span is completely full, the mcache requests a new span from the mcentral, which is the central free list for that class; there is one mcentral for each class. Supposing we are doing an allocation for class zero: the mcentral has two different lists, one for spans that are nonempty, meaning spans with free slots, and one for empty spans, with no free slots. The mcentral looks for a span on the nonempty list, and if there is one, this span is returned to the mcache, and the mcache uses it to fulfill the allocation. But the mcentral may have no spans on the nonempty list, so before asking for a new span from the mheap, the mcentral tries to sweep the existing spans on the empty list. If by sweeping it's able to free the amount of objects that it requires, the swept span is moved from the empty list to the nonempty list and is returned to the mcache.
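The arithmetic behind that size-class example (48-byte objects packed into an 8 KiB span) works out like this:

```go
package main

import "fmt"

const spanSize = 8192 // each small-object span is 8 KiB

// objectsPerSpan computes how many objects of a given size class
// fit in one span.
func objectsPerSpan(classSize int) int {
	return spanSize / classSize
}

func main() {
	fmt.Println(objectsPerSpan(48)) // 170
}
```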
But if that's not enough, as a last resort, the mcentral requests a new span from the mheap. This span is added to the mcentral's list, and then it's also given to the local mcache, which uses it to fulfill the allocation.

The last type of allocation is the tiny allocation. Tiny allocations are allocations for objects with no pointers and size less than 16 bytes. The main target, and this is from an actual comment in the allocator code, the main target of tiny allocations are small strings and standalone escaping variables, and there are a few benchmarks showing that it's really good for performance; it's just an optimization. For tiny allocations, each P, the logical processor, keeps a 16-byte block that was allocated from a span, so this is an object, not a span, and each tiny allocation simply appends the sub-object that's being allocated to the end of this larger object. It works like the stack allocation that I mentioned at the beginning: you just keep appending, like appending into an array. But if this larger object gets completely used and there's not enough room for the new allocation, the P simply requests a new object from the mcache, and this works just like a small allocation. The old block is just kept there, and eventually the GC will deallocate it, as soon as all the objects that were allocated inside it get freed; then it's going to be swept.

I said that the original TCMalloc did not release memory back to the OS. The Go memory allocator actually does. The runtime periodically releases some memory back to the OS: it looks for spans that were swept more than five minutes ago, and if it finds a span that was swept more than five minutes ago, it's going to release the memory pages associated with that span.
memory the memory page that are associated with that spam and to do that in the UNIX it used a map advise Cisco this Cisco is a fairly general Cisco for memory allocation you can set a bunch of different flags related to the memory so in the in this case the goal memory allocator use they don't need flag which simply means that the operation system is can simply release the physical resources that are associated with that virtual memory mapping so it's free to just reuse the memory frames and here is the page table entries and so on and this is part of the reason that sometimes you look at the output of top or H top or free and you see a go process that should not be using so much memory but this is - using it's simply because it didn't release memory yet so yeah and just to wrap it up so the runtime has this really cool function called the read main set it populates a fairly large structure that has a bunch of stats is about memory allocation all of those fields got pretty nice back comments so you can just read and try to understand what each one of them mean but I tried and I failed and but after looking at the runtime code and seeing how things work I I was able to understand most of them so I hope the this talk also helps you doing that this is probably the most important slide of the talk so if you want to take a picture now it's the time these are other reference that are used I'm also going to publish on the notes I made during my research and things I also would like to thank Damian Green ski I'm probably getting his surname wrong and Vito tomorrow for living reviewing this is flight deck Thanks [Applause] questions yeah once the spam is free they are actually the thing that happens with TC mark where you merge sperms together that also happens on the goal allocated so on the free list is checks for the adjacent spams to see if they are also free and just merge them together yeah even that they can be allocated yeah but they need to be contiguous in memory 
[Q] So, yeah — it's hard-coded, I don't think... sorry, the question was if there is a particular reason for it being five minutes before releasing memory. I didn't try to look it up in the code history to see if there is, but there isn't a comment explaining it — it's just hard-coded there. Maybe by looking at the git history we'd be able to find out.

[Q] Sorry — the question was, once the memory is released back to the OS, how is it guaranteed that another process doesn't have access to it? I'm pretty sure that's the job of the OS. When you do the madvise syscall, everything in that virtual mapping is not being used anymore and the physical memory is actually released, so I guess it's part of the OS's responsibility — but I'm not completely sure, I can look it up later.

[Q] So the question was: each goroutine has a stack, so how is the allocation of the goroutine's stack handled? That's correct — that's actually a different part of the code base that I didn't really go into, but the biggest difference I know of is that that memory is not handled by the garbage collector: it's manually freed once the goroutine ends. So it's a completely different data path. But you're right — stacks in Go, goroutine stacks, are actually allocated on the heap. I guess the biggest difference, and this is also really important for performance reasons, is that the garbage collector does not have to handle them.

[Q] Yeah, so that's the subject of an entire talk, but I know that the memory allocator keeps a structure called a bitmap, and this bitmap maintains bits for each word that is allocated in a program — it basically sets bits on the objects that are currently being used. This bitmap is kept by both the allocator and the garbage collector: they mark those bits on the words that are still being used, and the allocator can use that bitmap to know whether it's possible to sweep spans or not.

[Q] No, it's not reference counting. For the marking you use different colors: you have white objects, which are objects that are not being used anymore, but you can also mark objects as grey or black. As soon as you allocate a new object, that object is marked black, so it can't be swept in this garbage collection phase; if it's a grey object, it may or may not be. Objects that were marked in this phase are not deallocated in this one, but in the next phase everything is white again and gets marked again, so it uses that as a way to figure it out. It's not reference counting. There is a really nice talk about the garbage collector — I don't have it at hand, but I can show you it. Thank you. [Applause]
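The white/grey/black marking described in that last answer can be sketched as a toy mark phase. This is an illustrative model only — the `obj` type, the `mark` function, and the map of colors are made up for the example and are nothing like the runtime's real bitmap-based implementation:

```go
package main

import "fmt"

// obj is a toy heap object holding references to other objects.
type obj struct {
	refs []*obj
}

type color int

const (
	white color = iota // not yet visited: candidate for sweeping
	grey               // reached, but its references are not scanned yet
	black              // reached and fully scanned: survives this cycle
)

// mark runs a tricolor mark phase from the given roots and returns
// the final color of every visited object; unvisited objects stay
// white (the map's zero value).
func mark(roots []*obj) map[*obj]color {
	colors := map[*obj]color{}
	var work []*obj // the grey worklist
	for _, r := range roots {
		colors[r] = grey
		work = append(work, r)
	}
	for len(work) > 0 {
		o := work[len(work)-1]
		work = work[:len(work)-1]
		for _, ref := range o.refs {
			if colors[ref] == white { // shade newly reached objects grey
				colors[ref] = grey
				work = append(work, ref)
			}
		}
		colors[o] = black // all references scanned: blacken
	}
	return colors
}

func main() {
	live := &obj{}
	root := &obj{refs: []*obj{live}}
	garbage := &obj{} // unreachable: stays white, so the sweeper could free it
	colors := mark([]*obj{root})
	fmt.Println(colors[root] == black, colors[live] == black, colors[garbage] == white)
	// → true true true
}
```

At the end of the phase, anything still white is unreachable and can be swept — which is the invariant the speaker describes, just without the concurrency machinery (write barriers, allocate-black) the real collector needs.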
Info
Channel: GopherCon UK
Views: 8,326
Keywords: computer science, technology
Id: 3CR4UNMK_Is
Length: 47min 15sec (2835 seconds)
Published: Thu Aug 30 2018