[applause] Howdy, everybody. Today -- I'm Matt Bachmann,
and today I'm gonna be talking about property-based
testing in Python. So I want to start with something
that I think we already know: testing is important. It leads to better refactoring in that
you have the safety to make changes, knowing that you didn't
break everything. It gives you better design,
in the sense that if you have code that's well tested
and that's easy to test, you tend to have better code overall. It gives you regression protection
in that you can encode a bug in the form of a test and verify
you don't get that exact bug again, which is just delightful,
and I think we have an argument that all this leads to faster
development overall through safety. But I think we all know
that testing is hard. First of all, it's code.
I think a lot of people forget this. Test code is code.
It needs to be maintained as such. You need to deal with duplication.
You need to maintain it and make sure it still runs fast. Isolation is tricky
in that trying to write tests that properly isolate
parts of your system can lead to
this interesting balance of, "Have I mocked too much?
Have I not mocked enough?" And so it's a very
challenging problem that I don't think we have
an easy answer to. Fixtures are something
that has to be managed. Pytest makes this
very easy, but still, it's a large amount of test data
that needs to be dealt with. And finally, the value's indirect. Not very often are you pitching
that the code is well tested as, like, the top line feature. And when you're playing with
new technology, you're not first -- well, most people aren't --
just reaching for the test, going, "OK, let me write some tests, and then
let me try this new, exciting thing." You usually set the tests aside and say, "I'll do that later." That's because of
the indirect value. But I think the key
is that when writing tests, we want to capture
the important cases and minimize the coding overhead. And this is a bit of an easy
thing to say, right? We want to minimize
how much work it is to test it, but we want to maximize
the amount of coverage. And I have a tool that I think --
well, I don't have a tool -- there is a tool that can help with that,
and that's property-based testing. But, to go over this,
I'm going to start with an example: sorting a list of integers.
Now, I realize we all write sorting algorithms every day, but just bear with me -- it's an example that keeps things simple. So I have a sorting algorithm, and up on the screen here you see a reasonable test for it. I throw an empty list at it,
I throw an unsorted list, and I throw a sorted list. We can quibble about other tests
that need to be run, because obviously this doesn't
cover all the possible cases. You can also point out
that I have some duplication, and that this could be
a parameterized test, but generally reasonable.
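Roughly, the test being described looks like this (my_sort is a hypothetical name for the sorting function under test):

```python
# Example-based tests: each case hard-codes an input and its expected output.
def test_sort_empty_list():
    assert my_sort([]) == []

def test_sort_unsorted_list():
    assert my_sort([3, 1, 2]) == [1, 2, 3]

def test_sort_already_sorted_list():
    assert my_sort([1, 2, 3]) == [1, 2, 3]
```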
But I think we can do better, and the way we can do better
is with property-based testing. If you get nothing else out of this
talk, this slide really captures it. You describe the arguments
that go into your test function, rather than specifying them. You describe the result
of your code, rather than saying, "If I get x, I expect y,"
you do it more generally, trying to avoid
that hard coding of data. And when you do this,
it enables the computer to be the one
to try to prove you wrong. You let the computer
do the hard work of figuring out what cases break your code, and you just say
what you want your code to do. So in a property-based test for our
sorting example, we have two parts. The arguments are fairly easy: just a list of any integers. As for the properties of the result,
it better be a list. If I throw a list at a sorting
algorithm and get a dictionary, I got a problem. All the elements better be there
in the same amounts. I don't want any new elements
added to my list. I don't want any elements taken out. And finally, the results
are in ascending order. In other words, the list better be
sorted after my sorting algorithm runs. So what does this
look like in Python? And in order to do that,
I need to do a quick plug. The library that I'm gonna be
using throughout this talk -- and really, it's a library
you're gonna reach for in Python -- is called hypothesis.
It's written by David R. MacIver. It's got fantastic docs.
It's inspired by Haskell's quickcheck. And if anything I do
impresses you today, he offers training
for your organization. Look them up. Go to hypothesis.works.
Plug over. Let's go on. So this is our property-based test
for testing our little sorting algorithm. It looks pretty straightforward.
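A rough sketch of the test on the slide, again with my_sort as the hypothetical function under test:

```python
from collections import Counter

from hypothesis import given
from hypothesis import strategies as st

@given(st.lists(st.integers()))
def test_sort(xs):
    result = my_sort(xs)                                     # my_sort: hypothetical name
    assert isinstance(result, list)                          # it better be a list
    assert Counter(result) == Counter(xs)                    # same elements, same amounts
    assert all(a <= b for a, b in zip(result, result[1:]))   # results in ascending order
```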
It takes in a list and it sorts it. It verifies that it
got back a list. Awesome. It's using the counter dictionary
to basically say all the elements are there
in the right amounts by counting both the resulting list
and the input list. And then, finally,
this little bit of functional code -- just trust me if you don't
want to figure that out -- saying it's in ascending order. So, looking at our test,
it looks like any standard test. It will -- this test
will run in Pytest, it will run in Nose,
it will run in unittest. But syntactically,
we have one big difference: you have the given decorator. And what this decorator is doing
is generally generating data to throw at your test
using the description provided, in this case, a list of integers. And what it's doing under the hood
is it's calling your test again and again and again
with data it comes up with: an empty list,
a list of one element, a list of two elements
with big numbers, so on, so forth. These are just examples
I came up with. You can see some crazy examples
hypothesis comes up with. But the cool thing is that in your
test suite, it looks like one test. Pretty nice. But the thing
that's really, really important, that not all property-based
testing frameworks do, but I think is kind of critical for
them -- quickcheck definitely does it -- is: say I've got
a failing example up here. We can all guess at what I broke
in my sorting algorithm to cause this test to fail,
but what hypothesis will do is it will take that list
and it will try to simplify it. It will say, "Hey, that failed,
but so did this," And then it will do that again,
and it'll finally say, "I don't think you wrote
a sorting algorithm at all. "I was able to simplify -- "you can't handle
the most basic sorting elements." But I think, looking at this,
we can say the bottom is far easier to deal with
than the top. So even though you're
dealing with random data, a good property-based
testing framework will try to reduce that data down
to a case that's easy to reason about whenever possible. Before we go on,
I want to talk a little bit about the given decorator. And this is just the way
hypothesis lets you write strategies to throw
at your test cases. Essentially you have
your types that you define in the form of these strategies:
you have Booleans, floats, strings, complex numbers,
ints, so on, so on, so on. You have your collection types,
your dictionaries, your tuples, your lists, your sets. Builds lets you call a function
with arguments that get generated. And then you can combine these
into more complicated things.
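For reference, a small sample of those building blocks (the full catalog is in the hypothesis docs; the builds example here uses complex just as a stand-in callable):

```python
from hypothesis import strategies as st

st.booleans()
st.floats()
st.text()
st.complex_numbers()
st.integers()
st.lists(st.integers())
st.sets(st.integers())
st.tuples(st.integers(), st.text())
st.dictionaries(keys=st.text(), values=st.integers())
st.builds(complex, st.floats(), st.floats())   # calls complex(...) with generated arguments
```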
I can give an entire talk on just generating data in hypothesis, but that's not really
what I'm here to do. But I'm gonna go over a quick example,
just so everyone walks out of here with a sense of what it's like
to work with these strategies. So I have a dog test. It tests dogs. It uses the builds strategy
in order to construct a dog, and I describe
the arguments to that dog. That way the dog gets put
into my test and it runs. But very quickly, you'll realize
that I have defined the strategy way too broadly,
and what you tend to do is you tend to kind of massage
the data a little bit. You say, "Well, I don't want
breed to be ANY string. "I want it to be
one of my known breeds." So you swap that out
for a list of known breeds. You pull them out of that. Name -- chances are
you don't want to deal with the empty string,
so you put in some limits on that. You say, "I want at least
five characters." That's fine. And then for height and weight,
I put in floats, and if anyone works with floats,
they know they're more complicated than just decimal numbers.
I obviously don't want infinity, I don't want NaN,
and I got a min-max value. And then finally I want to make sure
it works with the boss's dog, so I put in the example,
and what this is doing is saying, "I don't care what data
you throw at my test, "just make sure one of the things
you throw at my test "is this specified one." And so the idea is you sort of
massage these strategies in order to generate the data that
really fits your domain for your test. Fancy.
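A sketch of where that massaging ends up; the Dog model, breed list, and numeric limits here are hypothetical stand-ins for the real domain code:

```python
from typing import NamedTuple

from hypothesis import example, given
from hypothesis import strategies as st

class Dog(NamedTuple):                            # hypothetical domain model
    breed: str
    name: str
    height: float
    weight: float

KNOWN_BREEDS = ["corgi", "beagle", "husky"]       # hypothetical list of known breeds

dogs = st.builds(
    Dog,
    breed=st.sampled_from(KNOWN_BREEDS),          # one of my known breeds, not ANY string
    name=st.text(min_size=5),                     # at least five characters, no empty names
    height=st.floats(min_value=0.1, max_value=2.0,
                     allow_nan=False, allow_infinity=False),
    weight=st.floats(min_value=0.5, max_value=120.0,
                     allow_nan=False, allow_infinity=False),
)

@given(dogs)
@example(Dog(breed="corgi", name="Barkley", height=0.3, weight=12.0))  # the boss's dog
def test_dog(dog):
    assert dog.breed in KNOWN_BREEDS              # whatever the real dog test asserts goes here
```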
But hopefully you're getting a sense
you have potentially infinite cases with no real data
hard-coded, right? We're not saying,
"If I get x, I expect y." You're saying, "If I get
something that looks like x, "I expect behavior y." And that is very little code.
Pretty powerful idea. But if you walk out of here
with just this, you'll go back to your computer
and you'll start playing with it, and you'll start to realize that you've not
really written many tests like this. It's actually really hard
to think about how it should fit into your code until someone
teaches you patterns. Welcome to my talk. [laughter] First thing I want to bring up:
I'm not talking about a magic bullet. This is not going to replace
your entire test suite. This is not going to be just
"revolutionize everything ever," but it is going to be
a very interesting tool that I think more people
should be reaching for, and hopefully by the end of today,
you'll give it a shot. First pattern:
the code should not explode. I challenge anyone in this room
not to find a way to use this pattern in their code,
because essentially, when I run your code,
it shouldn't explode. I think this is
a pretty basic requirement. So, say I got an API. I do some
contracting work for Batman. He has an API for his batputer.
It'll pull out criminal aliases. It's got some parameters,
an ID, a sorting, and a max. And the biggest thing about this API
is it's a JSON response, and it's got some status codes:
a 200, 401, 400, 404, and anything outside
this range is weird. A potential explosion, as it were. So think about
how we would test this API. And keep in mind, I'm not
trying to say the API works. I'm not promising that. All I'm
promising is that it didn't explode. That's the big thing to take away. And here's what that test looks like. I give it my parameters -- an integer, some text, some other integer -- and feed them into my test.
I simply make the API call. I verify, no matter what happened,
that the response came back as JSON and that the status code is in
one of my expected status codes.
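A sketch of that test, assuming a requests-style HTTP client; the endpoint, parameter names, and exact status codes below are stand-ins:

```python
import requests                                      # hypothetical choice of HTTP client

from hypothesis import given
from hypothesis import strategies as st

BASE_URL = "http://batcomputer.example/aliases"      # hypothetical endpoint
EXPECTED_STATUS_CODES = {200, 400, 401, 404}

@given(criminal_id=st.integers(), sort=st.text(), max_results=st.integers())
def test_aliases_api_does_not_explode(criminal_id, sort, max_results):
    response = requests.get(
        BASE_URL,
        params={"id": criminal_id, "sort": sort, "max": max_results},
    )
    response.json()                                   # blows up if the body isn't JSON
    assert response.status_code in EXPECTED_STATUS_CODES
```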
I guarantee you if you put this pattern into your code base -- and it doesn't have to be an API
example; you can do, like, raising -- making sure your code didn't raise
any unexpected exceptions, so on, so forth --
you'll make your code more stable, and you'll potentially
make it more secure. I've seen this pattern
catch security bugs, but mostly what it catches is
a lot of validation errors, because your users are going to
throw data you don't expect, so you're probably not going to think
to encode everything into a test. And letting the computer
try to find those problems is a lot cheaper than
letting your users find them. Boop, boop, boop. Nobody saw that. Pattern 2: reversible operations. So the idea here is if you have
something that can be undone, you have a property
which you can test. Some examples of when you use
this pattern are encoding operations, undo operations, and serialization. I'm going to focus on
serialization for my example. I also like to call this pattern
the "It works, don't fi-- don't -- "It works, don't break it."
"It works, don't fix it"? Whatever. Historical object.
So I have an object. Sorry, my brain got messed up. [laughter] This is the reversible
operations pattern. So I've created a historical event. It has an ID, a description, and a time, and I want to communicate
with a web front end. So I write a JSON encoder and then I write a decoder, so I can send objects to my web front end and it can send objects back to me. And the key property
I'm trying to test is that I can go
from Python to JSON back to Python
without losing anything. In other words,
can I encode? Can I decode? And here's what that test looks like. Take in my types in order
to construct my object, build my event,
dump it out into JSON, load it into a different object
from JSON to my object, and verify I got
the same thing back.
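A self-contained sketch of that round-trip test; the event class and the encoder/decoder here are simplified stand-ins for the real serialization code being described:

```python
import json
from dataclasses import dataclass
from datetime import datetime

from hypothesis import given
from hypothesis import strategies as st

@dataclass
class HistoricalEvent:                    # hypothetical stand-in for the real object
    event_id: int
    description: str
    when: datetime

def to_json(event):
    return json.dumps({"id": event.event_id,
                       "description": event.description,
                       "when": event.when.isoformat()})

def from_json(blob):
    data = json.loads(blob)
    return HistoricalEvent(data["id"], data["description"],
                           datetime.fromisoformat(data["when"]))

@given(event_id=st.integers(), description=st.text(), when=st.datetimes())
def test_json_round_trip(event_id, description, when):
    original = HistoricalEvent(event_id, description, when)
    assert from_json(to_json(original)) == original   # Python -> JSON -> Python, nothing lost
```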
"Well, this is very easy code. "I could just write one or two
tests and I'd be fine." And chances are,
most of you would be fine. In fact, I was worried
when I wrote this example that I wasn't gonna be able
to reasonably construct a bug, but it turns out, I found a bug. It was in dateutil,
and the problem with dateutil -- and this is a very obscure bug.
Dateutil is a fine library. But the problem was,
if you had a historical event that was between 0 and 100 A.D. and you were parsing it
as an ISO-formatted string, dateutil would mess up. And this is exactly
what I'm talking about. It's an obscure bug, but I'm glad I found it when I was writing the test, mainly because I was writing the talk. But if this were a real app, I'd be glad I found it in the test and not in production. And that bug has since been fixed.
Dateutil continues to be great. Pattern 3: testing oracle. Earlier when my brain broke,
I was talking about this example. So the idea behind
the testing oracle is you have a system
that you know to be correct, and you have a system
you don't know to be correct, and you use the known system to test the unknown system. I like to call the pattern
the "Leave it alone" pattern. [laughter] You have an ugly system.
Leave it alone. Use it for your test.
Other cases where this is useful: it's useful for when you're
emulating something. If you can hook up that original
thing to your test suite, you can use that as your test.
Another thing it's good for
is optimization, and I'll get into a little bit
about that later. So the property being tested here
is that my clean, fancy, beautiful new system better do
the same thing as my broken system, assuming the assumption of
"the broken system generally works "but I just can't touch it
because it's scary" is correct, which, I admit, is a big exception,
but let's go on. This is what that test looks like.
Very simple. Generate your arguments,
throw it to your new hotness, run it against your legacy system,
and verify you get the same result.
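A sketch of the shape of that test; new_system and legacy_system below are trivial placeholders for the two real implementations being compared:

```python
from hypothesis import given
from hypothesis import strategies as st

def legacy_system(records):
    # The scary-but-trusted implementation; a trivial placeholder here.
    return sorted(set(records))

def new_system(records):
    # The shiny new implementation; also a placeholder.
    return sorted(set(records))

@given(st.lists(st.integers()))
def test_new_system_matches_legacy(records):
    assert new_system(records) == legacy_system(records)
```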
Now I guarantee you, if you try this pattern out, you'll find that your
legacy system isn't as great as everyone was telling you,
but that's fine. This is where massaging the strategy comes in: throwing out examples that you know don't work. There's a lot you can go into here, but this isn't specifically a hypothesis talk, so I'm not
gonna go into those details. Another example of this pattern
is comparing against brute force. I think we've all been
in the situation of we wanted to write a fancy algorithm,
and we knew the easy way to do it where you check every case,
but you know in production that's just not going to fly
because it just hurts, performance-wise. But you can use that
easy-to-implement version as a test for your fancy, super-cool version, and that test looks identical
to the one I showed you earlier. You generate your arguments, you throw them to the
easy-but-inefficient solution, and make sure you get the same result
in your optimized, fancy solution. And then just, a billion
test cases, almost no code. That's effectively a three-liner
that I've broken up because I like to break up lines.
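A concrete little instance of that idea, with heapq.nlargest standing in for the fancy optimized code and a sort-and-slice as the brute-force oracle:

```python
import heapq

from hypothesis import given
from hypothesis import strategies as st

def fast_top_k(values, k):
    # Stand-in for whatever clever algorithm you actually wrote.
    return heapq.nlargest(k, values)

@given(st.lists(st.integers()), st.integers(min_value=0, max_value=10))
def test_fast_top_k_matches_brute_force(values, k):
    # Brute force: sort everything and slice. Too slow for production,
    # perfectly fine as a test oracle.
    assert fast_top_k(values, k) == sorted(values, reverse=True)[:k]
```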
Pattern 4: stateful testing. This is sort of the pro mode of property-based testing. The idea behind stateful testing
more interesting systems. Everything we've talked about
up to now has been: input goes into my system,
I get an output. No side effects, very functional. And a lot of people, when they hear
about property-based testing, think it only works for this case,
and in the real world, our systems get more complicated,
and they get more complicated, and so on and so forth. So, with stateful testing,
you define a state, you define what operations
can happen in what conditions, you define how those operations
affect the state, and then you put in your assertions:
what must be true at any state. And I'm going to walk you through
how hypothesis models this idea. But when you do this,
what you're effectively building is a vessel for hypothesis to go out, search the search space,
and bring you back bugs. "Find me bugs, test suite.
Bring them back." And here's what that looks like.
So I got an example here. This is a max-heap.
A little bit of CS primer. The root node of this tree
is the max of the entire tree, and for every subelement
of that tree, that property holds. So, if I go down to the 19,
it's the max of that subtree. If I go to 36, it's the max
of that subtree, so on, so forth. That's the main property
being tested here. We have a few operations
we can run on our heap: the creation of the heap,
pushing elements onto the heap, popping elements off the heap
(in other words, popping off that root and making sure the tree stays
balanced), and then merging two trees. In other words, I got two heaps. Put them together
to create a new heap, keeping that property. So in order to test this, what we do
is we create a data store of heaps. And what we're going to do is
we're going to throw heaps in there that have been generated
to be used for other conditions. And we have a cloud of integers,
which is basically that given property from before; in other words,
pull from here and get an integer. So, in __init__, we construct a heap
and put it into the data structure. Pretty basic. For push, we grab a heap
that we generated earlier, we grab an integer out of the cloud,
add that to the heap, therefore modifying the heap
in that structure. Merge takes two heaps
out of our data structure, puts them together
to create a new tree, and then puts that
into the heap structure -- in other words, creating another thing
that we can pull from later for tests. And then finally, pop:
grab any heap that we've generated and put in that structure,
pop up the main element, scan the tree looking for the max,
and then make sure when we call pop,
we get that same max. In other words, this is the property
we want to be true no matter what. And what we've done here is
we've created a system that allows hypothesis
or any framework you're using to scan the search space by trying
these different operations out. And here's what this looks like in code,
in case my diagrams didn't clarify. We create a machine,
we define our data structure -- that's the green little database-
looking thing I created earlier. We have a rule to create a new heap,
which is: it creates a new heap, throws it into the target,
which is our heap data structure. We do heap push, which grabs
a heap out of our heaps, grabs a value out of the
integer cloud, modifies the heap, therefore modifying it
in our structure. Then we have the merge operations
which grabs the two heaps, puts them together
to make a new heap, and then puts that result
back in our structure. And finally we have pop:
grab a heap that's not empty, scan it looking for the max value, call pop on it,
and verify the result's the same.
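A runnable approximation of the machine being described. The talk's example is a max-heap; to keep this sketch self-contained it uses Python's heapq, which is a min-heap, so "pop returns the max" becomes "pop returns the min" -- same idea:

```python
import heapq

from hypothesis import assume, strategies as st
from hypothesis.stateful import Bundle, RuleBasedStateMachine, rule

class HeapMachine(RuleBasedStateMachine):
    heaps = Bundle("heaps")            # the "data store" of heaps generated so far

    @rule(target=heaps)
    def new_heap(self):
        return []                      # create a heap and add it to the bundle

    @rule(heap=heaps, value=st.integers())
    def push(self, heap, value):
        heapq.heappush(heap, value)    # modifies the heap in place

    @rule(target=heaps, first=heaps, second=heaps)
    def merge(self, first, second):
        merged = first + second
        heapq.heapify(merged)
        return merged                  # a brand-new heap, added back to the bundle

    @rule(heap=heaps)
    def pop_returns_extreme(self, heap):
        assume(heap)                   # only pop from non-empty heaps
        expected = min(heap)           # scan for the extreme value ourselves
        assert heapq.heappop(heap) == expected

# Picked up by pytest / unittest as a normal test case.
TestHeaps = HeapMachine.TestCase
```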
Even if none of that made sense to you, that's fine. Just, like, take this point away:
when you run hypothesis, what you'll find is it will go out
and it will try to find you a bug. But not only will it come back
with a bug, it will not say, "Sir, madam, I have only --
I have found a bug." It's like, well, I can't do anything
with an assertion error, right? What it'll bring you back
is far more interesting than an assertion error:
it'll bring you back a program. Because you've defined all these steps, it will spit out what steps it ran
in order to create the bug. It said, "After running
all these steps, "then that last pop,
the assertion failed. "I did not get the max element
like I expected." And I'm not lying to you, I swear! This is me running in a terminal
with a different bug. See on the bottom there?
You know how nice that is to see when you're dealing with
a complex system? Steps! Reproducibility! [laughter] [applause] Sometimes I have trouble getting that
out of bug reports, you know? [laughter] So, just to summarize:
property-based testing: describe the arguments,
describe the result, have the computer
try to prove you wrong. Now, the call to action,
because now it's time for you to do something for me. Download the library.
I recommend hypothesis. I'm not going to say
it's the only library out there, but it seems to be
the best in the Python world. Use it. I want you to all use it, and then share how you used it
and find more of these patterns. Find ways to use it,
especially if you can get me more examples
of the stateful testing. I know mercurial uses it, and I know
that hypothesis uses it internally, and I know PyPy also has used it. So it's got some real world usage,
but I want more on this because, quite frankly,
I'm still learning, and I just want more people to do it. And that's all I got for you today. Here are some resources up there, and you can get the slides later. I got other things you can look at
to learn more about it. And that's all I got. [applause] (host)
OK, we have about 10 minutes for questions. Let me run to you with the mic so that everybody
and the recordings can hear it. (audience member)
Thanks, Matt. Great talk. You had an example of the --
your function decorator example. Does that go inside the given
when you're doing that or is that -- you put that on top of your test? (Matt Bachmann)
You put the decorator on top of your test function, yeah. (audience member)
If I wanted unit tests to be deterministic for a given run, is this something I can ensure
with hypothesis? (Matt Bachmann)
This is an advantage of hypothesis. It stores its generated examples
in a test database, so when it finds a failing example,
as long as you haven't wiped out that database
for some reason, when you run it again,
that same failure will pop up. (audience member)
Every time you run it, how many samples does it do? (Matt Bachmann)
I forget the default. I think it's roughly 100,
but it's configurable. So you can configure it by the test,
you can configure it globally. I actually recommend
if you're using it for development to keep that number fairly low. Like, I use sometimes as low as 20,
but then when you run it in CI, you can bump that number really high
if you want to spend the extra time trying to find
more obscure examples.
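A sketch of what that looks like; the profile names and example counts below are arbitrary choices, e.g. in a conftest.py:

```python
import os

from hypothesis import settings

settings.register_profile("dev", max_examples=20)     # fast feedback locally
settings.register_profile("ci", max_examples=1000)    # spend more time in CI
settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "dev"))

# Or per test:
# @settings(max_examples=50)
# @given(st.integers())
# def test_something(x): ...
```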
(audience member)
So I see you're using Pytest. How does that mix otherwise with
fixtures and so forth, noticing that -- (Matt Bachmann)
In my experience, it works fine. The arguments described by given get filled in, and whatever arguments don't are assumed to be fixtures. It works great in Pytest, and it works
in other frameworks just fine. (audience member)
Hi. Have you tried testing it in... Like, most of our code is generally
dependent on other pieces of functions. Like, you'll call a function A,
which will call B with some arguments, which will call C
with some arguments. So generally when you're testing,
say, B, you kind of mock out what A calls it with, but then what is it passing to C? Have you had cases where you had to
test this kind of middle-level function which is interrelated
as part of the call tree? (Matt Bachmann)
So this is where I'm getting into the "I'm still learning
how to use this thing." Those kinds of examples
get trickier and trickier. But it's a challenging
way to think about it, right? If you can describe
what the final state will be, that tends to be how you'll work.
If you want to test, like, a middle function, you try to
find tests and write it directly. But once you start talking about mocking,
I think it's much more complicated to, say, mock in a general
way that doesn't depend on data. (audience member)
Have you ever combined this with any coverage tools
to kind of close the loop, and do you have
any thoughts on that? (Matt Bachmann)
So, my experience is that this does work with coverage.py,
but I will say at times it can be slightly unpredictable
because the testing is fairly random. If you bump the examples high enough,
you'll probably get as much coverage as you're gonna get,
but it is not necessarily predictable, so you might see some weird results
in your coverage reports. (audience member)
So I'm curious how well this works with asserting error conditions. So, if I input these three things,
I expect this exception to happen or I expect this
failure case to occur. (Matt Bachmann)
Yep, so that's -- that's basically what your
test framework handles for you. So, in your test you can do things
like "I expect an error here." If you're testing for error
conditions, I would try to -- I would try to separate those tests. If you want to write a test
that just is expecting errors in certain conditions, you can do
checks on the data that was generated. Like, "If I get an error
and the data looks like this, "don't fail. It's cool."
That kind of thing. But, you know, then the test
starts getting complicated.
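For the simple case, a sketch of pairing hypothesis with pytest.raises; parse_age here is a made-up function that rejects negative input:

```python
import pytest

from hypothesis import given
from hypothesis import strategies as st

def parse_age(value):
    # Hypothetical function under test: rejects negative ages.
    if value < 0:
        raise ValueError("age must be non-negative")
    return value

@given(st.integers(max_value=-1))
def test_negative_ages_are_rejected(age):
    with pytest.raises(ValueError):
        parse_age(age)
```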
(audience member)
In the example that you showed with the dog class, I can imagine having
that sort of pattern all over your tests
would get kind of, like, long. Is there a way to define
those sorts of fixtures separately so they can just be reused
really easily? (Matt Bachmann)
Short answer: yes. You can basically create custom
strategies that can be pulled from.
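A sketch of one way to do that, using st.composite so the strategy can live in a shared module and be imported by any test; the dog shape here is simplified to a plain dict:

```python
from hypothesis import given
from hypothesis import strategies as st

@st.composite
def dogs(draw):
    # A reusable custom strategy: call dogs() anywhere you need a dog.
    breed = draw(st.sampled_from(["corgi", "beagle", "husky"]))
    name = draw(st.text(min_size=5))
    return {"breed": breed, "name": name}

@given(dogs())
def test_every_dog_has_a_name(dog):
    assert len(dog["name"]) >= 5
```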
(audience member)
So, I imagine you end up
unit tests in addition. Do you kind of commingle those,
or do you regard this as a completely separate process? (Matt Bachmann)
That's really up to you to decide. I think a separate process
is probably ideal, simply because these tests
tend to run a little longer because they're running
so many examples. In fact, hypothesis will punish you
for writing slow tests, because it's running them hundreds or potentially thousands of times. So if you're doing a CI system,
I totally recommend breaking them up, doing your fast unit test,
then doing your property-based test and moving on from there. (audience member)
Forgive me if this was asked and answered, because I heard talk about mocks. How well -- or is it possible
for this to handle responses from subsystems which would
traditionally be mocked? (Matt Bachmann)
So, once again this comes into, "I'm still learning
how to deal with this," and this is kind of why
I'm giving the talk: in order to encourage people
to explore these ideas. Mocking gets weird, just because usually what you do with a mock is say, "spit out this result." Mocking in a general way
is a challenging problem, and I'm not necessarily sure
how to fit that into this system. And this comes down to:
it's not a magic bullet. It's a tool in the toolbox. Sometimes you can reach for it,
sometimes it's hard to reach for it. But if you find a way
to work with it with mocks, please blog about it,
share about it, tell people. Nice. (audience member)
If you're testing an API, I guess it doesn't matter if you use
quickcheck or, like, this tool, because you're really --
the language doesn't even matter. Is -- am I thinking --
is that correct thinking? (Matt Bachmann)
You are completely correct. If you're comfortable with quickcheck,
you can use it to test your API. I just know that, you know,
people like staying with the language they wrote the system in,
but if you're doing something like an API test
that's actually going over the wire, no problem,
but you can also imagine -- this is one case where you could,
in theory, generally mock, in the sense of
don't literally make an API call, but just make the mock API call
to your system which would be easier to do
if you stayed in Python. (audience member)
Hey, thanks so much. This is so cool. So I work in education and I feel --
I've been trying to get more testing into the university where I work.
It's a process. So I feel like this has a lot of
potential for helping students really think about, like,
the correctness of their code. And I'm really surprised
I didn't know about this. Can you talk a little bit
about the history of this? Like, how widely used is this?
How new is this? (Matt Bachmann)
So I'm gonna step a little bit out of my knowledge, since I'm
fairly new to this stuff as well. Just a reminder:
I did not write hypothesis. But if you go to hypothesis.works,
he talks a little bit about this. But my understanding is
it started originally in the Ha-- in the functional world
with Haskell's quickcheck. And that was a situation where
that shrinking is very easy to do when you have a powerful,
strong type system. One of the very interesting
things about hypothesis is that it does
the shrinking in Python. But it started in the functional world.
It's been ported over a lot of places. A lot of the success stories
you'll see will be in those worlds -- you'll find examples mostly in Haskell and Erlang. But as far as, like,
other languages, like, I have not seen
a strong Java version of this. Although, if you're a Java shop that's
interested in property-based testing, go to his website; he wants someone
to pay him to write the Java version. He's got a prototype.
Plug. Sorry. But yeah, that's all I really know. It started in the functional world
and it sort of went from there. (host)
Let's see, are there any more questions? I think -- I think that's everybody. (Matt Bachmann)
All right. I survived! (host)
That was great. I really liked that. [applause]