Nina Zakharenko - Memory Management in Python - The Basics - PyCon 2016

Captions
(moderator) Hello everybody. Our next session is about memory management in Python. Please welcome Nina Zakharenko. [applause] (Nina Zakharenko) Hi everyone, thank you so much for coming to this talk. My name is Nina and I've been programming professionally for about 10 years now in a variety of different programming languages. When I'm not writing code you'll probably find me snowboarding or mountain biking. But let's go ahead and dive into the basics of memory management in Python. So, first off, why should you care? Well, knowing about memory management is going to help you write more efficient code, and that knowledge is going to help you solve problems with slow programs. It's going to help you troubleshoot, and it's really going to help you debug. Sometimes you really want to answer this question: why is my program taking up so much memory? And why is that memory usage increasing over time? I'm going to give you the tools to help you answer that question, and as a bonus, writing more efficient programs makes you look pretty darn smart. So what are you going to get today? You're going to get a vocabulary. You want to know more about the topic, but it can be easy to get overwhelmed. Sometimes you're just faced with too much information and too much new terminology, so we're going to cover the basic concepts. My goal is to make this topic more understandable, even if you're just starting out, and I'm going to do that by giving you a foundation. What are you not going to get? Well, you're not going to be an expert at the end of this talk. That's the downside of the thousand-foot overview, and that's okay. For the experts in this room, though, a quick disclaimer: my talk applies to CPython only. So in order to really understand memory management in Python, we need to get a little bit philosophical and ask ourselves, what is a variable? A variable is a container that can store different values.
So in this example we have a C-style variable. It's a very simplified example. Here, we need to declare the type of that variable before its assignment. And that's because these variables live in memory like this: they live in a fixed-size bucket, and that bucket can only hold a same-sized value, or an overflow can occur. When we change our C-style variable, so later on we say a = 6, what's happening is that the data in that memory location gets overwritten. So now it's 6 and our old value is blown away. Now here's a big idea: Python has names, not variables. How are Python objects stored in memory? Well, we have names, we have references, and we have objects. A name is just a label for an object, and each object can have lots of names. Those names reference that object. We have two different types of objects. We have simple objects like numbers and strings. Those simple objects are just that; they store their own value. Each simple object is generally stored in memory once. We also have container objects like dictionaries, lists, and user-defined classes. Those container objects store references to either simple objects or other containers. So what's a reference? A reference is a name or a container object that points at another object. What's a reference count? It's the number of references to an object. How do we increase the reference count? In this example, we're setting the name x to be 300. That increases the number of references by 1. Later on, we say y = 300. Instead of creating a new slot in memory for that 300, we're just adding another reference to it. So now our reference count is 2. In this example we have z, a list, a container object that holds 2 more references to 300, so now our reference count is 4. How do we decrease the reference count? One way is by using the del statement. So, in this example, we've removed the reference from x to 300. Now we have the one from y. What exactly does the del statement do? Well, it does not delete objects.
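The names-and-references behavior described here can be sketched in a few lines of CPython. This is an illustrative snippet, not code from the talk's slides; note that sys.getrefcount always reports one extra reference, because its own argument temporarily refers to the object:

```python
import sys

x = 300          # one reference to the int object 300
y = x            # a second name bound to the very same object
print(x is y)    # True: both names point at one object in memory

# getrefcount includes its own argument, so the exact number varies;
# what matters is that binding another name raised the count.
print(sys.getrefcount(x) >= 2)   # True

del x            # removes the name x; the object survives via y
print(y)         # 300
```

Using `y = x` rather than `y = 300` sidesteps compile-time constant folding, so the shared-object behavior is guaranteed rather than implementation-dependent.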
It, instead, will remove that name as a reference to that object. And what does that do? It reduces the reference count by 1. Another way that we can decrease the reference count is by changing the reference. So in this example, if I set y to None, I'm going to remove that reference. Now the reference count of 300 is potentially 0. Once there are no more references, we don't really care if that object still exists. It can be safely removed from memory later on. So put it in the trash. One more way of decreasing reference counts is going out of scope. So if I have this really simple function and I run it, as this function is running, inside of that function the reference count of the word 'seven' increases by one. After that function has exited, that word 'seven' is now out of scope, so the reference count goes down by one. The final way that a reference count can go down is when the program exits: we have no more references. So, local vs. global namespace. If that reference count is going to decrease when an object goes out of scope, what can happen with objects in a global namespace? Well, they might never go out of scope and their reference count might never be zero. So the trick here is to avoid putting large, complex objects in that global namespace. And if you do, make sure you clean them up, either by changing the reference or by calling del on them. Internally, every Python object holds three things: its type, its value, and its reference count. So in our previous example, we had two names: x and y. That gives us two references to one object. So we have an object here: it knows its type, integer; it knows that its refcount is 2; and it knows that its value is 300. Let's take a little bit of a peek under the hood. And as a warning, don't try this example in the interactive environment or the REPL.
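The going-out-of-scope case can be observed directly. This is a small sketch with made-up names (not from the slides), again keeping in mind that sys.getrefcount's own argument adds one to every reading:

```python
import sys

big = [0] * 1000
count_before = sys.getrefcount(big)   # global name + getrefcount's argument

def work(data):
    # while the call is running, the parameter name 'data'
    # holds an extra reference to the same list object
    return sys.getrefcount(data)

inside = work(big)
print(inside > count_before)                   # True: extra reference in scope
print(sys.getrefcount(big) == count_before)    # True: dropped after the return
```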
So if I have my x and y equal to 300, we can take a peek and see what memory location the object lives at by using the identity function, id(). If we compare the memory locations of x and y, we'll see that they're the same. I can also ask Python, "Does this object live in the same memory location?" by using the is keyword. So if I ask Python, "Is x y?" I'll see that it's true. So now we can talk about garbage collection. What is garbage collection? You can think of it as a way for your program to automatically release memory when the object that's taking up that memory is no longer in use. Back in the day, programmers had to allocate and deallocate their memory manually, and, spoiler alert, it really kind of sucked. If you forgot to free your memory, it could cause memory leaks. If you accidentally overwrote your memory, your program could totally crash. Garbage collection came to the rescue. I like to think of garbage collection as memory recycling. There are two main types of garbage collection, the first being reference counting, the second being tracing. In a way, Python uses both. So how does reference counting garbage collection work? Well, we add and remove references. The refcount is increased with every assignment. The refcount is decreased when a reference is removed. When the refcount reaches zero, we can go ahead and just immediately delete that object. We know that we have no use for it. And that causes an interesting cascading effect, because deleting an object decreases the refcount of any objects that the deleted object was pointing to; if their refcount reaches zero, you can delete them as well. So one refcount reaching zero can mean lots of objects being deleted from memory. So the good thing about reference counting garbage collection is that it's pretty easy to implement, and when that refcount is zero, those objects are immediately deleted. But it has some downsides. There's a lot of space overhead involved, right?
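The cascading deletion can be made visible with a weak reference, which watches an object without keeping it alive. This is an illustrative sketch (the Box class is invented for the example) and relies on CPython's immediate, refcount-driven cleanup:

```python
import weakref

class Box:
    pass

inner = Box()
outer = Box()
outer.child = inner            # the container holds a reference to inner

probe = weakref.ref(inner)     # a weak reference does not raise the refcount
del inner                      # one reference left, via outer.child
print(probe() is not None)     # True: inner is still alive

del outer                      # outer's refcount hits zero -> freed at once;
                               # that drops inner's count to zero too (the cascade)
print(probe() is None)         # True: inner was swept away with it
```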
So when your reference count is stored for every object, that data has to live somewhere, and there's execution overhead too, because that reference count is changed on every assignment. And unfortunately, there's a pretty ugly side to reference counting as well. It's not generally thread safe, and reference counting garbage collection does not detect cyclical references. So what is a cyclical reference? Let's talk about it by example. Here we have a Node class, and it contains a value and a reference to the next node. We declare a root node, a left node, and a right node. Then we point our root node at left, we point our left node at right, and we point our right node back at left. What does that look like? The reference count of root is 1; it's referred to by the name root. The reference count of left is 3; it's referred to by its name, by the root node, and by the right node. And the reference count of right is 2; it's referred to by its name and by the left node. So a cycle occurs when we have two objects that point or refer to each other. What happens when we remove the names as references to these nodes? Well, we removed the names, but the internal references, that next, are still there. So the reference count of root is 0, but the reference count of left and right remains 1, because those two nodes refer to each other. Reference counting alone does not garbage collect objects with cyclical references. An early Python implementation was only refcount-based, and that caused lots of problems, because many types of objects can have cyclical references, like graphs or doubly-linked lists. And early on we realized that we really needed something else. So there are two main types of garbage collection. We talked about reference counting. The second type is a strategy called tracing. Tracing generally uses an algorithm called mark and sweep. The first step is marking any objects that are reachable.
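The Node example from the talk can be reconstructed roughly like this (the exact slide code isn't in the captions, so this is a faithful sketch). Disabling the cycle collector lets us watch the leak that pure reference counting would cause:

```python
import gc
import weakref

class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

gc.disable()                   # switch off the cycle collector to observe the leak

root = Node("root")
left = Node("left")
right = Node("right")
root.next = left               # root -> left
left.next = right              # left -> right
right.next = left              # right -> left: a cycle between left and right

probe = weakref.ref(left)
del root, left, right          # all the names are gone...
print(probe() is not None)     # True: the cycle keeps left and right alive

gc.enable()
gc.collect()                   # the tracing collector finds and frees the cycle
print(probe() is None)         # True: finally reclaimed
```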
So we start at a root node, and then we follow all the references and we mark the live ones. This algorithm gets run when the number of objects in memory is greater than a certain threshold. The next step is sweep. So when marking is done, the sweep phase will remove the dead objects. Cyclical references get deleted with this algorithm too. Which one does Python use? We know it uses reference counting, but it also uses a strategy called generational. Generational is a type of tracing garbage collection. And, remember, we need another strategy because reference counting doesn't clean up those dangling cyclical references. Generational garbage collection is based on the theory that most objects die young, and really, how tragic. Frequently we create objects to store temporary values, or they're used in one function call and never really get used again. So how does generational garbage collection work? Well, Python will maintain a list of every object created as a program is run. Actually, it goes ahead and creates three lists. They're called generation 0, 1, and 2. Newly created objects are stored in generation 0. Each object is only stored in one generation, and we optimize by collecting young objects in generation 0 more frequently than the old ones. Now remember that only container objects with a reference count greater than zero are stored in these generational lists. So it's similar to mark and sweep, but instead of keeping a list of all the objects, we only keep the ones with those active references. That means that fewer objects are tracked and scanned. When the number of objects in a generation reaches a particular threshold, Python will run a generational garbage collection algorithm on that generation and any generations younger than it. So when we run on generation 2, we're actually collecting on generations 0, 1, and 2. During that garbage collection cycle, Python makes a list of objects to discard.
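The three generations and their thresholds are inspectable through the standard gc module. A minimal sketch (the exact threshold values vary across CPython versions, so they are printed rather than assumed):

```python
import gc

# CPython keeps three generations; get_threshold() returns the per-generation
# collection thresholds and get_count() the current tracked-object counts.
print(gc.get_threshold())      # e.g. (700, 10, 10) on many CPython builds
print(gc.get_count())          # (gen0, gen1, gen2) counts right now

# Force a full collection over generations 0, 1, and 2:
unreachable = gc.collect()
print("unreachable objects found:", unreachable)
```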
It runs an algorithm, which is out of scope for this talk, to detect those cyclical references. Then, if an object has no outside references, it's put on a discard list. And, when that cycle is done, it's going to free up the objects on that discard list. After our garbage collection cycle, objects that survived will be promoted to the next generation. So objects that are in the last generation, generation 2, are going to stay there as the program executes. A big idea here is that, when the refcount reaches zero, we get an immediate cleanup. But if we have objects with cycles, we need garbage collection to run, and in order to do that, we have to wait. So if you have those cyclical references, they could potentially slow your program down. They could really be a culprit. So, some reference counting gotchas. Reference counting is not generally thread safe. And we're going to see why that's a big deal later on, because what happens if two threads try to increase and decrease the reference count of one object at the same time? We might end up with a pretty big problem. If we remember our reference cycle from before, we had a node left and a node right. They retain their reference counts of 1. Those cyclical references are going to get cleaned up by generational garbage collection. Except in Python 2, if your class has a dunder del method, objects in that cycle won't be collected -- gotcha! Thankfully this is fixed as of Python 3.4. Now what is the dunder del magic method? Sometimes it's called a destructor. It is not the del statement. So del x does not call the magic method dunder del. The del statement decrements the reference count of your object by 1. The dunder del magic method is only called when the reference count of that object reaches zero. So it runs right before an object is removed from memory. If you're using Python 2, this is a pretty big gotcha. You can end up with lots of crud in memory. So keep a watchful eye out.
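The difference between the del statement and the dunder del method can be demonstrated with a tiny invented class (this is an illustrative sketch, and the immediate-cleanup timing is CPython-specific):

```python
collected = []

class Reporter:
    def __del__(self):
        # runs only when the object's refcount reaches zero,
        # just before it is removed from memory
        collected.append(True)

r = Reporter()
alias = r
del r                  # the del statement only removes one name...
print(bool(collected)) # False: __del__ has not run yet

alias = None           # last reference gone -> refcount is zero
print(bool(collected)) # True: __del__ ran as the object was reclaimed
```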
One strategy that we can use to improve our memory usage is called slots. So what are slots? We know that every Python instance contains a dictionary of its names and values. So if I print the dunder dict of my Dog class here, we'll see that it has a key, name, and a value, Buddy. Slots turn the internals of an object from a dictionary into a tuple. What's important about tuples? Tuples are immutable, meaning that they can't be changed. So slots prevent us from setting names on instances willy-nilly. And all of us have probably seen this exception before. What happens if we try to set a name on the string 'Pug'? What exception are we going to get? It's going to be an AttributeError. Python is going to complain at us and say, "Well, this string object doesn't have an attribute name." Makes sense, right? We can use slots to emulate this behavior in our own classes. So, I have this class Point. I declare the slots to be x and y. What is the type of slots? A tuple. The only attributes we can set on this class are those that we define in slots. So I can set my x and y on an instance of Point, but what happens if I try to assign a name to my point? I'm going to see one of those AttributeErrors, right? My point doesn't have an internal dictionary. Now, why is this important? Well, let's take a look at the size in memory of a dictionary versus a tuple. We can use the getsizeof function to return the size in bytes for built-in types. So the size of a dictionary in bytes is, in my implementation of Python, 288 bytes. But the size of a tuple is 48 bytes. Now that's not a huge difference for one instance, right? But it really adds up when there are lots of instances. So when would we want to use slots? Well, if we're going to be creating lots of instances of a class, and if we know in advance what properties that class should have, they might be worthwhile. Now let's talk about the elephant in the room, or in this case, the shark.
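The Point example above can be sketched like this (class names here are illustrative; exact byte sizes vary by Python version, so none are asserted):

```python
class PlainPoint:
    def __init__(self, x, y):
        self.x, self.y = x, y            # stored in a per-instance __dict__

class SlottedPoint:
    __slots__ = ("x", "y")               # the only attributes instances may have
    def __init__(self, x, y):
        self.x, self.y = x, y

p = SlottedPoint(1, 2)
try:
    p.name = "origin"                    # not listed in __slots__
except AttributeError as exc:
    print(exc)                           # same AttributeError as with a str

# Slotted instances carry no per-instance dictionary, which is where
# the memory saving comes from when many instances exist:
print(hasattr(p, "__dict__"))                  # False
print(hasattr(PlainPoint(1, 2), "__dict__"))   # True
```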
What's a GIL? A GIL is a Global Interpreter Lock. A global interpreter lock prevents multiple Python threads from executing Python code at the same time. There's one GIL for each interpreter. In other words, two Python programs running on one machine don't share a GIL. So only one thread can run in an interpreter at a time. Why does Python need this interpreter lock? We need to prevent reference counts from being changed concurrently. What happens if two threads try to increase and decrease the reference count of an object at the same time? Those operations might not happen in order, and that's a big problem. The advantages and disadvantages of the GIL? Well, the upside is that we get a fast and simple garbage collection algorithm, but the downside is that in a Python program, no matter how many threads exist, only one thread will be executed at a time. So if we want to take advantage of multiple CPUs, we want to use multiprocessing instead of multithreading. Each process will have its own GIL, and it's on the developer to figure out the best way to share information between those processes. So if the GIL limits us in this way, can't we just remove it? Well, the GIL is a bag of snakes. It's caused a lot of infighting in the Python community, as you may or may not be aware. In an early version of Python, about 1.5, a patch was submitted that removed the GIL. And, spoiler alert, it didn't really go well. It sped up multithreaded applications, and that's a good thing, right? Well, it also slowed down single-threaded applications by about 50%. So, for better or for worse, the GIL is here to stay. In the meantime, the Python code base has become significantly more complex since that patch was introduced. Just be aware that not all Python implementations have a GIL -- for example, Jython and IronPython. Sorry guys, deal with it. So what did we learn today? Garbage collection is pretty good, right? It's way better than doing it manually in C.
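The multiprocessing-over-multithreading advice can be sketched with the standard library. This is a minimal, illustrative example (the function and workload are invented), showing CPU-bound work spread across processes, each with its own interpreter and GIL:

```python
from concurrent.futures import ProcessPoolExecutor

def cpu_bound(n):
    # a pure-Python loop: a single thread would hold the GIL for its whole run
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # each worker process gets its own interpreter, and so its own GIL,
    # letting CPU-bound work actually use multiple cores
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(cpu_bound, [100_000] * 4))
    print(len(results))   # 4
```

The trade-off the talk mentions applies here too: the developer must move data between processes explicitly (here via pool.map's pickled arguments and results) rather than sharing memory as threads would.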
But it's not a magic wand. It adds overhead in terms of execution time and space, and it still doesn't prevent us from being inefficient. So now you know how memory is managed in Python. You're familiar with the concepts. Please keep learning and exploring. It'll help you write better, faster, more efficient code. There's a lot that I didn't cover, so I added a bonus section in my slides with links to additional resources. I'll give you a link to the slides at the end of my talk. Consider Python 3. It's better, and it's in active development. There is a better GIL implementation that's more efficient. And remember that problem that we had with destructors? Well, Python 3 can call those destructors on objects with cyclical references. And these are just some of the many reasons to make the switch. For scientific applications, consider NumPy and Pandas. These packages are optimized to work with large numeric data structures and don't suffer from some of the same problems that we saw here. Thank you guys so much for your time. [applause] (moderator) Thank you, Nina. We have time for one question only. Please raise your hand, be brave. Ah, it's great to see everybody's totally an expert on memory management now. Thank you, Nina. [applause]
Info
Channel: PyCon 2016
Views: 42,779
Rating: 4.9026456 out of 5
Id: F6u5rhUQ6dU
Length: 27min 0sec (1620 seconds)
Published: Tue May 31 2016