[applause] Howdy, everybody. Today -- I'm Matt Bachmann,
and today I'm gonna be talking about property-based
testing in Python. So I want to start with something
that I think we already know: testing is important. It leads to better refactoring in that
you have the safety to make changes, knowing that you didn't
break everything. It gives you better design,
in the sense that if you have code that's well tested
and that's easy to test, you tend to have better code overall. It gives you regression protection
in that you can encode a bug in the form of a test and verify
you don't get that exact bug again, which is just delightful,
and I think we have an argument that all this leads to faster
development overall through safety. But I think we all know
that testing is hard. First of all, it's code.
I think a lot of people forget this. Test code is code.
It needs to be maintained as such. You need to deal with duplication.
You need to maintain it and make sure it still runs fast. Isolation is tricky
in that trying to write tests that properly isolate
parts of your system can lead to
this interesting balance of, "Have I mocked too much?
Have I not mocked enough?" And so it's a very
challenging problem that I don't think we have
an easy answer to. Fixtures are something
that has to be managed. Pytest makes this
very easy, but still, it's a large amount of test data
that needs to be dealt with. And finally, the value's indirect. Not very often are you pitching
that the code is well tested as, like, the top line feature. And when you're playing with
new technology, you're not first -- well, most people aren't --
just reaching for the test, going, "OK, let me write some tests, and then
let me try this new, exciting thing." You usually set the tests aside and say, "I'll do that later." That's because of
the indirect value. But I think the key
is that when writing tests, we want to capture
the important cases and minimize the coding overhead. And this is a bit of an easy
thing to say, right? We want to minimize
how much work it is to test it, but we want to maximize
the amount of coverage. And I have a tool that I think --
well, I don't have a tool -- there is a tool that can help with that,
and that's property-based testing. But, to go over this,
I'm going to start with an example: sorting a list of integers.
Now, I realize we all write sorting algorithms every day, but just bear with me -- it's an example that keeps things simple. So I have a sorting algorithm, and up on the screen here you see a reasonable test for it. I throw an empty list at it,
I throw an unsorted list, and I throw a sorted list. We can quibble about other tests
that need to be run, because obviously this doesn't
cover all the possible cases. You can also point out
that I have some duplication, and that this could be
a parameterized test, but generally reasonable.
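Roughly, the test being described looks like this (my_sort is a hypothetical name for the sorting function under test):

```python
# Example-based tests: each case hard-codes an input and its expected output.
def test_sort_empty_list():
    assert my_sort([]) == []

def test_sort_unsorted_list():
    assert my_sort([3, 1, 2]) == [1, 2, 3]

def test_sort_already_sorted_list():
    assert my_sort([1, 2, 3]) == [1, 2, 3]
```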
But I think we can do better, and the way we can do better
is with property-based testing. If you get nothing else out of this
talk, this slide really captures it. You describe the arguments
that go into your test function, rather than specifying them. You describe the result
of your code, rather than saying, "If I get x, I expect y,"
you do it more generally, trying to avoid
that hard coding of data. And when you do this,
it enables the computer to be the one
to try to prove you wrong. You let the computer
do the hard work of figuring out what cases break your code, and you just say
what you want your code to do. So in a property-based test for our
sorting example, we have two parts. The arguments are fairly easy: just a list of any integers. As for the properties of the result,
it better be a list. If I throw a list at a sorting
algorithm and get a dictionary, I got a problem. All the elements better be there
in the same amounts. I don't want any new elements
added to my list. I don't want any elements taken out. And finally, the results
are in ascending order. In other words, the list better be
sorted after my sorting algorithm runs. So what does this
look like in Python? And in order to do that,
I need to do a quick plug. The library that I'm gonna be
using throughout this talk -- and really, it's a library
you're gonna reach for in Python -- is called hypothesis.
It's written by David R. MacIver. It's got fantastic docs.
It's inspired by Haskell's quickcheck. And if anything I do
impresses you today, he offers training
for your organization. Look them up. Go to hypothesis.works.
Plug over. Let's go on. So this is our property-based test
for testing our little sorting algorithm. It looks pretty straightforward.
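A rough sketch of the test on the slide, again with my_sort as the hypothetical function under test:

```python
from collections import Counter

from hypothesis import given
from hypothesis import strategies as st

@given(st.lists(st.integers()))
def test_sort(xs):
    result = my_sort(xs)                                     # my_sort: hypothetical name
    assert isinstance(result, list)                          # it better be a list
    assert Counter(result) == Counter(xs)                    # same elements, same amounts
    assert all(a <= b for a, b in zip(result, result[1:]))   # results in ascending order
```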
It takes in a list and it sorts it. It verifies that it
got back a list. Awesome. It's using the counter dictionary
to basically say all the elements are there
in the right amounts by counting both the resulting list
and the input list. And then, finally,
this little bit of functional code -- just trust me if you don't
want to figure that out -- saying it's in ascending order. So, looking at our test,
it looks like any standard test. It will -- this test
will run in Pytest, it will run in Nose,
it will run in unittest. But syntactically,
we have one big difference: you have the given decorator. And what this decorator is doing
is generally generating data to throw at your test
using the description provided, in this case, a list of integers. And what it's doing under the hood
is it's calling your test again and again and again
with data it comes up with: an empty list,
a list of one element, a list of two elements
with big numbers, so on, so forth. These are just examples
I came up with. You can see some crazy examples
hypothesis comes up with. But the cool thing is that in your
test suite, it looks like one test. Pretty nice. But the thing
that's really, really important, that not all property-based
testing frameworks do, but I think is kind of critical for
them -- quickcheck definitely does it -- is: say I've got
a failing example up here. We can all guess at what I broke
in my sorting algorithm to cause this test to fail,
but what hypothesis will do is it will take that list
and it will try to simplify it. It will say, "Hey, that failed,
but so did this," And then it will do that again,
and it'll finally say, "I don't think you wrote
a sorting algorithm at all. "I was able to simplify -- "you can't handle
the most basic sorting elements." But I think, looking at this,
we can say the bottom is far easier to deal with
than the top. So even though you're
dealing with random data, a good property-based
testing framework will try to reduce that data down
to a case that's easy to reason about whenever possible. Before we go on,
I want to talk a little bit about the given decorator. And this is just the way
hypothesis lets you write strategies to throw
at your test cases. Essentially you have
your types that you define in the form of these strategies:
you have Booleans, floats, strings, complex numbers,
ints, so on, so on, so on. You have your collection types,
your dictionaries, your tuples, your lists, your sets. Builds lets you call a function
with arguments that get generated. And then you can combine these
into more complicated things.
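For reference, a small sample of those building blocks (the full catalog is in the hypothesis docs; the builds example here uses complex just as a stand-in callable):

```python
from hypothesis import strategies as st

st.booleans()
st.floats()
st.text()
st.complex_numbers()
st.integers()
st.lists(st.integers())
st.sets(st.integers())
st.tuples(st.integers(), st.text())
st.dictionaries(keys=st.text(), values=st.integers())
st.builds(complex, st.floats(), st.floats())   # calls complex(...) with generated arguments
```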
I can give an entire talk on just generating data in hypothesis, but that's not really
what I'm here to do. But I'm gonna go over a quick example,
just so everyone walks out of here with a sense of what it's like
to work with these strategies. So I have a dog test. It tests dogs. It uses the builds strategy
in order to construct a dog, and I describe
the arguments to that dog. That way the dog gets put
into my test and it runs. But very quickly, you'll realize
that I have defined the strategy way too broadly,
and what you tend to do is you tend to kind of massage
the data a little bit. You say, "Well, I don't want
breed to be ANY string. "I want it to be
one of my known breeds." So you swap that out
for a list of known breeds. You pull them out of that. Name -- chances are
you don't want to deal with the empty string,
so you put in some limits on that. You say, "I want at least
five characters." That's fine. And then for height and weight,
I put in floats, and if anyone works with floats,
they know they're more complicated than just decimal numbers.
I obviously don't want infinity, I don't want NaN,
and I got a min-max value. And then finally I want to make sure
it works with the boss's dog, so I put in the example,
and what this is doing is saying, "I don't care what data
you throw at my test, "just make sure one of the things
you throw at my test "is this specified one." And so the idea is you sort of
massage these strategies in order to generate the data that
really fits your domain for your test. Fancy.
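A sketch of where that massaging ends up; the Dog model, breed list, and numeric limits here are hypothetical stand-ins for the real domain code:

```python
from typing import NamedTuple

from hypothesis import example, given
from hypothesis import strategies as st

class Dog(NamedTuple):                            # hypothetical domain model
    breed: str
    name: str
    height: float
    weight: float

KNOWN_BREEDS = ["corgi", "beagle", "husky"]       # hypothetical list of known breeds

dogs = st.builds(
    Dog,
    breed=st.sampled_from(KNOWN_BREEDS),          # one of my known breeds, not ANY string
    name=st.text(min_size=5),                     # at least five characters, no empty names
    height=st.floats(min_value=0.1, max_value=2.0,
                     allow_nan=False, allow_infinity=False),
    weight=st.floats(min_value=0.5, max_value=120.0,
                     allow_nan=False, allow_infinity=False),
)

@given(dogs)
@example(Dog(breed="corgi", name="Barkley", height=0.3, weight=12.0))  # the boss's dog
def test_dog(dog):
    assert dog.breed in KNOWN_BREEDS              # whatever the real dog test asserts goes here
```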
But hopefully you're getting a sense
you have potentially infinite cases with no real data
hard-coded, right? We're not saying,
"If I get x, I expect y." You're saying, "If I get
something that looks like x, "I expect behavior y." And that is very little code.
Pretty powerful idea. But if you walk out of here
with just this, you'll go back to your computer
and you'll start playing with it, and you'll start to realize that you've not
really written many tests like this. It's actually really hard
to think about how it should fit into your code until someone
teaches you patterns. Welcome to my talk. [laughter] First thing I want to bring up:
I'm not talking about a magic bullet. This is not going to replace
your entire test suite. This is not going to be just
"revolutionize everything ever," but it is going to be
a very interesting tool that I think more people
should be reaching for, and hopefully by the end of today,
you'll give it a shot. First pattern:
the code should not explode. I challenge anyone in this room
not to find a way to use this pattern in their code,
because essentially, when I run your code,
it shouldn't explode. I think this is
a pretty basic requirement. So, say I got an API. I do some
contracting work for Batman. He has an API for his batputer.
It'll pull out criminal aliases. It's got some parameters,
an ID, a sorting, and a max. And the biggest thing about this API
is it's a JSON response, and it's got some status codes:
a 200, 401, 400, 404, and anything outside
this range is weird. A potential explosion, as it were. So think about
how we would test this API. And keep in mind, I'm not
trying to say the API works. I'm not promising that. All I'm
promising is that it didn't explode. That's the big thing to take away. And here's what that test looks like. I give it my parameters -- an integer, some text, some other integer -- and feed them into my test.
I simply make the API call. I verify, no matter what happened,
that the response came back as JSON and that the status code is in
one of my expected status codes.
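A sketch of that test, assuming a requests-style HTTP client; the endpoint, parameter names, and exact status codes below are stand-ins:

```python
import requests                                      # hypothetical choice of HTTP client

from hypothesis import given
from hypothesis import strategies as st

BASE_URL = "http://batcomputer.example/aliases"      # hypothetical endpoint
EXPECTED_STATUS_CODES = {200, 400, 401, 404}

@given(criminal_id=st.integers(), sort=st.text(), max_results=st.integers())
def test_aliases_api_does_not_explode(criminal_id, sort, max_results):
    response = requests.get(
        BASE_URL,
        params={"id": criminal_id, "sort": sort, "max": max_results},
    )
    response.json()                                   # blows up if the body isn't JSON
    assert response.status_code in EXPECTED_STATUS_CODES
```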
I guarantee you if you put this pattern into your code base -- and it doesn't have to be an API
example; you can do, like, raising -- making sure your code didn't raise
any unexpected exceptions, so on, so forth --
you'll make your code more stable, and you'll potentially
make it more secure. I've seen this pattern
catch security bugs, but mostly what it catches is
a lot of validation errors, because your users are going to
throw data you don't expect, so you're probably not going to think
to encode everything into a test. And letting the computer
try to find those problems is a lot cheaper than
letting your users find them. Boop, boop, boop. Nobody saw that. Pattern 2: reversible operations. So the idea here is if you have
something that can be undone, you have a property
which you can test. Some examples of when you use
this pattern are encoding operations, undo operations, and serialization. I'm going to focus on
serialization for my example. I also like to call this pattern
the "It works, don't fi-- don't -- "It works, don't break it."
"It works, don't fix it"? Whatever. Historical object.
So I have an object. Sorry, my brain got messed up. [laughter] This is the reversible
operations pattern. So I've created a historical event. It has an ID, a description, and a time, and I want to communicate
with a web front end. So I write a JSON encoder and then I write a decoder, so I can send objects to my web front end and it can send objects back to me. And the key property
I'm trying to test is that I can go
from Python to JSON back to Python
without losing anything. In other words,
can I encode? Can I decode? And here's what that test looks like. Take in my types in order
to construct my object, build my event,
dump it out into JSON, load it into a different object
from JSON to my object, and verify I got
the same thing back.
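A self-contained sketch of that round-trip test; the event class and the encoder/decoder here are simplified stand-ins for the real serialization code being described:

```python
import json
from dataclasses import dataclass
from datetime import datetime

from hypothesis import given
from hypothesis import strategies as st

@dataclass
class HistoricalEvent:                    # hypothetical stand-in for the real object
    event_id: int
    description: str
    when: datetime

def to_json(event):
    return json.dumps({"id": event.event_id,
                       "description": event.description,
                       "when": event.when.isoformat()})

def from_json(blob):
    data = json.loads(blob)
    return HistoricalEvent(data["id"], data["description"],
                           datetime.fromisoformat(data["when"]))

@given(event_id=st.integers(), description=st.text(), when=st.datetimes())
def test_json_round_trip(event_id, description, when):
    original = HistoricalEvent(event_id, description, when)
    assert from_json(to_json(original)) == original   # Python -> JSON -> Python, nothing lost
```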
"Well, this is very easy code. "I could just write one or two
tests and I'd be fine." And chances are,
most of you would be fine. In fact, I was worried
when I wrote this example that I wasn't gonna be able
to reasonably construct a bug, but it turns out, I found a bug. It was in dateutil,
and the problem with dateutil -- and this is a very obscure bug.
Dateutil is a fine library. But the problem was,
if you had a historical event that was between 0 and 100 A.D. and you were parsing it
as an ISO-formatted string, dateutil would mess up. And this is exactly
what I'm talking about. It's an obscure bug, but I'm glad I found it when I was writing the test, mainly because I was writing the talk. But if this were a real app, I'd be glad I found it in the test and not in production. And that bug has since been fixed.
Dateutil continues to be great. Pattern 3: testing oracle. Earlier when my brain broke,
I was talking about this example. So the idea behind
the testing oracle is you have a system
that you know to be correct, and you have a system
you don't know to be correct, and you use the known system to test the unknown system. I like to call the pattern
the "Leave it alone" pattern. [laughter] You have an ugly system.
Leave it alone. Use it for your test.
Other cases where this is useful: it's useful for when you're
emulating something. If you can hook up that original
thing to your test suite, you can use that as your test.
Another thing it's good for
is optimization, and I'll get into a little bit
about that later. So the property being tested here
is that my clean, fancy, beautiful new system better do
the same thing as my broken system, assuming the assumption of
"the broken system generally works "but I just can't touch it
because it's scary" is correct, which, I admit, is a big exception,
but let's go on. This is what that test looks like.
Very simple. Generate your arguments,
throw it to your new hotness, run it against your legacy system,
and verify you get the same result.
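A sketch of the shape of that test; new_system and legacy_system below are trivial placeholders for the two real implementations being compared:

```python
from hypothesis import given
from hypothesis import strategies as st

def legacy_system(records):
    # The scary-but-trusted implementation; a trivial placeholder here.
    return sorted(set(records))

def new_system(records):
    # The shiny new implementation; also a placeholder.
    return sorted(set(records))

@given(st.lists(st.integers()))
def test_new_system_matches_legacy(records):
    assert new_system(records) == legacy_system(records)
```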
Now I guarantee you, if you try this pattern out, you'll find that your
legacy system isn't as great as everyone was telling you,
but that's fine. This is where massaging the strategy comes in: throwing out examples that you know don't work. There's a lot you can go into here, but this isn't specifically a hypothesis talk, so I'm not
gonna go into those details. Another example of this pattern
is comparing against brute force. I think we've all been
in the situation of we wanted to write a fancy algorithm,
and we knew the easy way to do it where you check every case,
but you know in production that's just not going to fly
because it just hurts, performance-wise. But you can use that
easy-to-implement version as a test for your fancy, super-cool version, and that test looks identical
to the one I showed you earlier. You generate your arguments, you throw them to the
easy-but-inefficient solution, and make sure you get the same result
in your optimized, fancy solution. And then just, a billion
test cases, almost no code. That's effectively a three-liner
that I've broken up because I like to break up lines.
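A concrete little instance of that idea, with heapq.nlargest standing in for the fancy optimized code and a sort-and-slice as the brute-force oracle:

```python
import heapq

from hypothesis import given
from hypothesis import strategies as st

def fast_top_k(values, k):
    # Stand-in for whatever clever algorithm you actually wrote.
    return heapq.nlargest(k, values)

@given(st.lists(st.integers()), st.integers(min_value=0, max_value=10))
def test_fast_top_k_matches_brute_force(values, k):
    # Brute force: sort everything and slice. Too slow for production,
    # perfectly fine as a test oracle.
    assert fast_top_k(values, k) == sorted(values, reverse=True)[:k]
```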
Pattern 4: stateful testing. This is sort of the pro mode of property-based testing. The idea behind stateful testing
more interesting systems. Everything we've talked about
up to now has been: input goes into my system,
I get an output. No side effects, very functional. And a lot of people, when they hear
about property-based testing, think it only works for this case,
and in the real world, our systems get more complicated,
and they get more complicated, and so on and so forth. So, with stateful testing,
you define a state, you define what operations
can happen in what conditions, you define how those operations
affect the state, and then you put in your assertions:
what must be true at any state. And I'm going to walk you through
how hypothesis models this idea. But when you do this,
what you're effectively building is a vessel for hypothesis to go out, search the search space,
and bring you back bugs. "Find me bugs, test suite.
Bring them back." And here's what that looks like.
So I got an example here. This is a max-heap.
A little bit of CS primer. The root node of this tree
is the max of the entire tree, and for every subelement
of that tree, that property holds. So, if I go down to the 19,
it's the max of that subtree. If I go to 36, it's the max
of that subtree, so on, so forth. That's the main property
being tested here. We have a few operations
we can run on our heap: the creation of the heap,
pushing elements onto the heap, popping elements off the heap
(in other words, popping off that root and making sure the tree stays
balanced), and then merging two trees. In other words, I got two heaps. Put them together
to create a new heap, keeping that property. So in order to test this, what we do
is we create a data store of heaps. And what we're going to do is
we're going to throw heaps in there that have been generated
to be used for other conditions. And we have a cloud of integers,
which is basically that given property from before; in other words,
pull from here and get an integer. So, in __init__, we construct a heap
and put it into the data structure. Pretty basic. For push, we grab a heap
that we generated earlier, we grab an integer out of the cloud,
add that to the heap, therefore modifying the heap
in that structure. Merge takes two heaps
out of our data structure, puts them together
to create a new tree, and then puts that
into the heap structure -- in other words, creating another thing
that we can pull from later for tests. And then finally, pop:
grab any heap that we've generated and put in that structure,
pop up the main element, scan the tree looking for the max,
and then make sure when we call pop,
we get that same max. In other words, this is the property
we want to be true no matter what. And what we've done here is
we've created a system that allows hypothesis
or any framework you're using to scan the search space by trying
these different operations out. And here's what this looks like in code,
in case my diagrams didn't clarify. We create a machine,
we define our data structure -- that's the green little database-
looking thing I created earlier. We have a rule to create a new heap,
which is: it creates a new heap, throws it into the target,
which is our heap data structure. We do heap push, which grabs
a heap out of our heaps, grabs a value out of the
integer cloud, modifies the heap, therefore modifying it
in our structure. Then we have the merge operations
which grabs the two heaps, puts them together
to make a new heap, and then puts that result
back in our structure. And finally we have pop:
grab a heap that's not empty, scan it looking for the max value, call pop on it,
and verify the result's the same.
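A runnable approximation of the machine being described. The talk's example is a max-heap; to keep this sketch self-contained it uses Python's heapq, which is a min-heap, so "pop returns the max" becomes "pop returns the min" -- same idea:

```python
import heapq

from hypothesis import assume, strategies as st
from hypothesis.stateful import Bundle, RuleBasedStateMachine, rule

class HeapMachine(RuleBasedStateMachine):
    heaps = Bundle("heaps")            # the "data store" of heaps generated so far

    @rule(target=heaps)
    def new_heap(self):
        return []                      # create a heap and add it to the bundle

    @rule(heap=heaps, value=st.integers())
    def push(self, heap, value):
        heapq.heappush(heap, value)    # modifies the heap in place

    @rule(target=heaps, first=heaps, second=heaps)
    def merge(self, first, second):
        merged = first + second
        heapq.heapify(merged)
        return merged                  # a brand-new heap, added back to the bundle

    @rule(heap=heaps)
    def pop_returns_extreme(self, heap):
        assume(heap)                   # only pop from non-empty heaps
        expected = min(heap)           # scan for the extreme value ourselves
        assert heapq.heappop(heap) == expected

# Picked up by pytest / unittest as a normal test case.
TestHeaps = HeapMachine.TestCase
```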
Even if none of that made sense to you, that's fine. Just, like, take this point away:
when you run hypothesis, what you'll find is it will go out
and it will try to find you a bug. But not only will it come back
with a bug, it will not say, "Sir, madam, I have only --
I have found a bug." It's like, well, I can't do anything
with an assertion error, right? What it'll bring you back
is far more interesting than an assertion error:
it'll bring you back a program. Because you've defined all these steps, it will spit out what steps it ran
in order to create the bug. It said, "After running
all these steps, "then that last pop,
the assertion failed. "I did not get the max element
like I expected." And I'm not lying to you, I swear! This is me running in a terminal
with a different bug. See on the bottom there?
You know how nice that is to see when you're dealing with
a complex system? Steps! Reproducibility! [laughter] [applause] Sometimes I have trouble getting that
out of bug reports, you know? [laughter] So, just to summarize:
property-based testing: describe the arguments,
describe the result, have the computer
try to prove you wrong. Now, the call to action,
because now it's time for you to do something for me. Download the library.
I recommend hypothesis. I'm not going to say
it's the only library out there, but it seems to be
the best in the Python world. Use it. I want you to all use it, and then share how you used it
and find more of these patterns. Find ways to use it,
especially if you can get me more examples
of the stateful testing. I know mercurial uses it, and I know
that hypothesis uses it internally, and I know PyPy also has used it. So it's got some real world usage,
but I want more on this because, quite frankly,
I'm still learning, and I just want more people to do it. And that's all I got for you today. Here are some resources up there, and you can get the slides later. I got other things you can look at
to learn more about it. And that's all I got. [applause] (host)
OK, we have about 10 minutes for questions. Let me run to you with the mic so that everybody
and the recordings can hear it. (audience member)
Thanks, Matt. Great talk. You had an example of the --
your function decorator example. Does that go inside the given
when you're doing that or is that -- you put that on top of your test? (Matt Bachmann)
You put the decorator on top of your test function, yeah. (audience member)
If I wanted unit tests to be deterministic for a given run, is this something I can ensure
with hypothesis? (Matt Bachmann)
This is an advantage of hypothesis. It stores its generated examples
in a test database, so when it finds a failing example,
as long as you haven't wiped out that database
for some reason, when you run it again,
that same failure will pop up. (audience member)
Every time you run it, how many samples does it do? (Matt Bachmann)
I forget the default. I think it's roughly 100,
but it's configurable. So you can configure it by the test,
you can configure it globally. I actually recommend
if you're using it for development to keep that number fairly low. Like, I use sometimes as low as 20,
but then when you run it in CI, you can bump that number really high
if you want to spend the extra time trying to find
more obscure examples.
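A sketch of what that looks like; the profile names and example counts below are arbitrary choices, e.g. in a conftest.py:

```python
import os

from hypothesis import settings

settings.register_profile("dev", max_examples=20)     # fast feedback locally
settings.register_profile("ci", max_examples=1000)    # spend more time in CI
settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "dev"))

# Or per test:
# @settings(max_examples=50)
# @given(st.integers())
# def test_something(x): ...
```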
(audience member)
So I see you're using Pytest. How does that mix otherwise with
fixtures and so forth, noticing that -- (Matt Bachmann)
In my experience, it works fine. The arguments described by given get filled in, and whatever arguments don't are assumed to be fixtures. It works great in Pytest, and it works
in other frameworks just fine. (audience member)
Hi. Have you tried testing it in... Like, most of our code is generally
dependent on other pieces of functions. Like, you'll call a function A,
which will call B with some arguments, which will call C
with some arguments. So generally when you're testing,
say, B, you kind of mock out what A calls it with, but then what is it passing to C? Have you had cases where you had to
test this kind of middle-level function which is interrelated
as part of the call tree? (Matt Bachmann)
So this is where I'm getting into the "I'm still learning
how to use this thing." Those kinds of examples
get trickier and trickier. But it's a challenging
way to think about it, right? If you can describe
what the final state will be, that tends to be how you'll work.
If you want to test, like, a middle function, you try to
find tests and write it directly. But once you start talking about mocking,
I think it's much more complicated to, say, mock in a general
way that doesn't depend on data. (audience member)
Have you ever combined this with any coverage tools
to kind of close the loop, and do you have
any thoughts on that? (Matt Bachmann)
So, my experience is that this does work with coverage.py,
but I will say at times it can be slightly unpredictable
because the testing is fairly random. If you bump the examples high enough,
you'll probably get as much coverage as you're gonna get,
but it is not necessarily predictable, so you might see some weird results
in your coverage reports. (audience member)
So I'm curious how well this works with asserting error conditions. So, if I input these three things,
I expect this exception to happen or I expect this
failure case to occur. (Matt Bachmann)
Yep, so that's -- that's basically what your
test framework handles for you. So, in your test you can do things
like "I expect an error here." If you're testing for error
conditions, I would try to -- I would try to separate those tests. If you want to write a test
that just is expecting errors in certain conditions, you can do
checks on the data that was generated. Like, "If I get an error
and the data looks like this, "don't fail. It's cool."
That kind of thing. But, you know, then the test
starts getting complicated.
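For the simple case, a sketch of pairing hypothesis with pytest.raises; parse_age here is a made-up function that rejects negative input:

```python
import pytest

from hypothesis import given
from hypothesis import strategies as st

def parse_age(value):
    # Hypothetical function under test: rejects negative ages.
    if value < 0:
        raise ValueError("age must be non-negative")
    return value

@given(st.integers(max_value=-1))
def test_negative_ages_are_rejected(age):
    with pytest.raises(ValueError):
        parse_age(age)
```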
(audience member)
In the example that you showed with the dog class, I can imagine having
that sort of pattern all over your tests
would get kind of, like, long. Is there a way to define
those sorts of fixtures separately so they can just be reused
really easily? (Matt Bachmann)
Short answer: yes. You can basically create custom
strategies that can be pulled from.
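A sketch of one way to do that, using st.composite so the strategy can live in a shared module and be imported by any test; the dog shape here is simplified to a plain dict:

```python
from hypothesis import given
from hypothesis import strategies as st

@st.composite
def dogs(draw):
    # A reusable custom strategy: call dogs() anywhere you need a dog.
    breed = draw(st.sampled_from(["corgi", "beagle", "husky"]))
    name = draw(st.text(min_size=5))
    return {"breed": breed, "name": name}

@given(dogs())
def test_every_dog_has_a_name(dog):
    assert len(dog["name"]) >= 5
```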
(audience member)
So, I imagine you end up
unit tests in addition. Do you kind of commingle those,
or do you regard this as a completely separate process? (Matt Bachmann)
That's really up to you to decide. I think a separate process
is probably ideal, simply because these tests
tend to run a little longer because they're running
so many examples. In fact, hypothesis will punish you
for writing slow tests, because it's running them hundreds or potentially thousands of times. So if you're doing a CI system,
I totally recommend breaking them up, doing your fast unit test,
then doing your property-based test and moving on from there. (audience member)
Forgive me if this was asked and answered, because I heard talk about mocks. How well -- or is it possible
for this to handle responses from subsystems which would
traditionally be mocked? (Matt Bachmann)
So, once again this comes into, "I'm still learning
how to deal with this," and this is kind of why
I'm giving the talk: in order to encourage people
to explore these ideas. Mocking gets weird, just because usually what you do with a mock is say, "spit out this result." Mocking in a general way
is a challenging problem, and I'm not necessarily sure
how to fit that into this system. And this comes down to:
it's not a magic bullet. It's a tool in the toolbox. Sometimes you can reach for it,
sometimes it's hard to reach for it. But if you find a way
to work with it with mocks, please blog about it,
share about it, tell people. Nice. (audience member)
If you're testing an API, I guess it doesn't matter if you use
quickcheck or, like, this tool, because you're really --
the language doesn't even matter. Is -- am I thinking --
is that correct thinking? (Matt Bachmann)
You are completely correct. If you're comfortable with quickcheck,
you can use it to test your API. I just know that, you know,
people like staying with the language they wrote the system in,
but if you're doing something like an API test
that's actually going over the wire, no problem,
but you can also imagine -- this is one case where you could,
in theory, generally mock, in the sense of
don't literally make an API call, but just make the mock API call
to your system which would be easier to do
if you stayed in Python. (audience member)
Hey, thanks so much. This is so cool. So I work in education and I feel --
I've been trying to get more testing into the university where I work.
It's a process. So I feel like this has a lot of
potential for helping students really think about, like,
the correctness of their code. And I'm really surprised
I didn't know about this. Can you talk a little bit
about the history of this? Like, how widely used is this?
How new is this? (Matt Bachmann)
So I'm gonna step a little bit out of my knowledge, since I'm
fairly new to this stuff as well. Just a reminder:
I did not write hypothesis. But if you go to hypothesis.works,
he talks a little bit about this. But my understanding is
it started originally in the Ha-- in the functional world
with Haskell's quickcheck. And that was a situation where
that shrinking is very easy to do when you have a powerful,
strong type system. One of the very interesting
things about hypothesis is that it does
the shrinking in Python. But it started in the functional world.
It's been ported over a lot of places. A lot of the success stories
you'll see will be in those worlds -- you'll find examples mostly in Haskell and Erlang. But as far as, like,
other languages, like, I have not seen
a strong Java version of this. Although, if you're a Java shop that's
interested in property-based testing, go to his website; he wants someone
to pay him to write the Java version. He's got a prototype.
Plug. Sorry. But yeah, that's all I really know. It started in the functional world
and it sort of went from there. (host)
Let's see, are there any more questions? I think -- I think that's everybody. (Matt Bachmann)
All right. I survived! (host)
That was great. I really liked that. [applause]