James Saryerwinnie Next Level Testing PyCon 2017

Captions
All right, so for our next talk we have a talk on next-level testing, and here's James.

Hi everyone, thank you all for being here today. My name is James Saryerwinnie. I'm a software development engineer at AWS and I work on the SDKs and Tools team, so I'm primarily involved with the Python tooling: things like the AWS CLI, boto3 (the AWS SDK for Python), and a couple of other open source projects that I'll be using as examples today.

Today I'm going to be talking about testing, and in particular testing that goes above unit testing and integration testing, to help us write higher quality software in as efficient a manner as possible. I wanted to share a couple of test types that I've used in projects and found really helpful, and for each of these I'm going to talk about not only what they are and why you would use them, but also try to give some real-world examples and some tips on how you'd integrate them into your project. One of the things I always found is that whenever I hear about these new test types they make a lot of sense and I understand them, but it was a little bit confusing exactly how I could add them to my project or exactly what types of issues they would find, so I'll try to give you some examples from projects I work on. With that being said, there's a lot that I want to go over, so let's jump right into the first topic, which is property-based testing.

With property-based testing the idea is pretty straightforward. Here's a traditional test for something like absolute value: normally with unit tests we write specific cases, so we test positive numbers, we test negative numbers, we test zero. Property-based testing is all about making general statements about whatever your function does. So instead of a specific instance like the integer 3, you might make a statement that says: for any integer you can throw at this function, it should return a value that's greater than or equal to zero. This is what we mean by properties, and in almost any case there's always some sort of property you can assert about your code. If we wanted to do this ourselves without a framework, a very simplified version would be to pick a bunch of random numbers, call the function with each number we've generated, and double check that the value is greater than or equal to zero. That's essentially the essence of property-based testing: you write assertions about the properties of your functions, you usually use a framework to help generate the sample test input, and one of the great things about the framework we'll look at is that once it finds a failing case it can simplify it down and tell you the minimal input needed to reproduce the issue.
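A minimal hand-rolled version of that random-input check might look like the sketch below (my own illustration of the idea, not the exact code from the slide):

    import random

    def test_abs_is_non_negative():
        # Hand-rolled property check: for any integer we can come up with,
        # abs() should return a value greater than or equal to zero.
        for _ in range(1000):
            x = random.randint(-10**9, 10**9)
            assert abs(x) >= 0

This only finds bugs as reliably as the random sampling happens to hit them, which is exactly the gap a framework fills with smarter data generation and shrinking.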
The library that I've used, that I really like, and that I think is probably the clear front-runner in this space for Python, is a library called Hypothesis. The two things that I really like about it are that it has really powerful test data generation (we'll see an example of that in the real-world example in a minute) and that it's really good about minimizing errors: whenever it finds an error it can shrink the input down to usually just a few characters, depending on what your input is, which is really nice. You just pip install hypothesis and use your existing test runners; you don't have to use any new executables, you just add it to whatever tests you have.

Taking our first example, what we had before was a for loop; the equivalent here is just using this given decorator. This goes in our normal test suite, right alongside our unit tests. What we're doing here is importing two things, given and the strategies module. given is the thing that wraps our test function and gives us the looping, and it's actually much more involved than a simple for loop: it will generate a bunch of sample test data, pass it to our test function, and then we just assert properties about whatever we've been given.

So far I've been using the simple integers strategy, which just gives a bunch of random integers, and the nice thing is that it's very easy to try it out in the REPL and see what happens. Here I'm importing integers, I just call .example() on whatever my strategy is, and you can see it gives me all kinds of random data: large numbers, small numbers, positive and negative numbers. But the thing I really like about this framework is that it's much more powerful than that. Here's another example: we're saying we want a list type, the list can have elements that are either strings, integers, or booleans, and we only want lists with a max size of 3. If we do that same thing and call .example() on it, we get an empty list, a list of booleans with negative numbers, long Unicode strings; we get all kinds of test data. It's really composable like that, and you can also plug in your own custom strategies if you want to generate really advanced test data structures.

That was the intro; now I want to give a real-world example. One of the projects that I started, I think several years ago now, is this thing called JMESPath, which is a query language for JSON. It's used in a couple of products: if you're familiar with the AWS CLI, if you've used --query to grab instance IDs or a state name, JMESPath is the language that backs that, and it was also recently added to Ansible via the json_query filter. What's nice about it is that it has a pretty standard architecture for a compiler: there's a lexer, a parser, and an interpreter. It's okay if you're not familiar with that; we're just going to look at the first part, which is the lexer. If you're not familiar with how a lexer works, it takes a string, in this case foo.bar, and breaks it up into tokens. You can see the return value here: specifically, it's a list of dictionaries, and for each token it gives me a type (here, unquoted identifier) and the value; the next one is a token of type dot, and so on, and in JMESPath the last token is always an EOF token to indicate we're done.

Right off the bat, if you think about just this, a lexer that takes a string and converts it into various tokens, there are several properties we can state. The first is something that I think applies no matter what code you have: you call whatever method or thing you're testing, and it either has to return some type that you expect or it should raise some specific exceptions; if it raises anything else, we're going to say that's a bug and something we need to look at. I think this is a pretty straightforward property that really applies to any kind of code you would write. But for a lexer there are additional things we can assert. If you think about how a lexer works, it tells you where the start of every token is, and it wouldn't make any sense to have two tokens with the same starting location, so we can assert that. Additionally, the way a tokenizer or lexer works is that it linearly scans through your input, so the starting locations should always increase; it would be really weird if you had something at index 0, then at 5, then at 2, then at 1, and that would usually confuse your parser. So we can also assert that the starting locations are always increasing.
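Put together with Hypothesis, those lexer properties can look roughly like the sketch below. This is my own reconstruction rather than the code from the slides, and the jmespath interface (lexer.Lexer().tokenize() and exceptions.LexerError, with 'eof' as the final token type) is written from memory, so treat the exact names as assumptions:

    from hypothesis import given
    from hypothesis import strategies as st

    from jmespath import exceptions, lexer


    @given(st.text())
    def test_lexer_properties(expression):
        try:
            tokens = list(lexer.Lexer().tokenize(expression))
        except exceptions.LexerError:
            # A well-defined lexer error is an acceptable outcome for bad
            # input; any other exception type counts as a bug.
            return
        # Property 1: otherwise we must get back a list of token dicts,
        # ending with an EOF token.
        assert isinstance(tokens, list)
        assert tokens[-1]['type'] == 'eof'
        # Properties 2 and 3: no two tokens share a starting position, and
        # the starting positions always increase (left-to-right scan).
        starts = [token['start'] for token in tokens]
        assert len(starts) == len(set(starts))
        assert starts == sorted(starts)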
With this all set up, we run it, and one of the bugs it found was this KeyError, which, if you're a user of the library and all of a sudden you see KeyError: unknown, doesn't make any sense, right? That's why one of the properties is to make sure we catch any of those errors and raise a proper exception type that has a lot of context about the expression you're parsing and why it failed. I won't go too much into the details of what's going on, but essentially what happened was that there is this part of the tokenizer that, if it can't figure out what the token is, sets the token type to the string "unknown". The problem is that that's not a valid token type. The proper way to fix this is to use an exception: if you don't recognize the token you should raise a lexer error rather than return a token of type "unknown", because the parser doesn't understand that.

I think what was more interesting here was not that it caught the bug (that's now fixed), but that it was also a chance to get design feedback. What it was saying was: you're using these hard-coded strings in the lexer and hoping they stay in sync with things in the parser that are indexing into some dictionary. It would be a lot easier if you just had a tokens module with a proper attribute; not only would you get autocomplete, you would get proper errors when just running a unit test (because you'd get an AttributeError), and you'd also get a lot of benefit from tools like pylint via static analysis, which would tell you that this module doesn't have a token called "unknown".

There are a couple of other things you can use Hypothesis for, and again it's really about making yourself more efficient with the code that you're writing. One is using it for refactoring: even if you don't keep it as a final test, one of the things I've done is, if you're doing a pure refactoring of a function, take your old function and your new function and write a test that says anything I give to the old function should give me the same value if I pass it to the new function; if not, then it's not a true refactoring and I may have introduced a bug. It's also useful for C extensions: if you have a pure Python module and a C extension, you can say that for any valid input, if I feed it to the pure Python version and to the C extension version, I should get the same result, and if not I probably have a bug, most likely in the C extension. So it's really useful for a number of things besides the traditional types of tests you might think of.
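That refactoring check can be as simple as the sketch below; old_parse and new_parse (and the mylib module) are hypothetical names standing in for whatever pair of implementations, old versus refactored, or pure Python versus C extension, you want to compare:

    import pytest
    from hypothesis import given
    from hypothesis import strategies as st

    # Hypothetical pair of implementations being compared.
    from mylib import new_parse, old_parse


    @given(st.text())
    def test_refactoring_is_equivalent(data):
        # For a pure refactoring, every input must produce the same result,
        # including raising the same exception type.
        try:
            expected = old_parse(data)
        except Exception as exc:
            with pytest.raises(type(exc)):
                new_parse(data)
            return
        assert new_parse(data) == expected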
As far as integration with CI, what I've found is that I run this as part of every commit, as part of the Travis or Jenkins build. There are two things I've run into in practice. Since I primarily work in the library space, we support a number of Python versions, which includes 2.6, and Hypothesis is not available for 2.6 (there's a separate package that supports 2.6 that I haven't used), so I have to exclude it when running on 2.6. The other thing I've run into is that I think by default Hypothesis tries 200 examples in that loop, and I found that sometimes I get errors that only occur on certain runs of Travis. To help fix that, I bump up the number of examples, in this case to 10,000, and that seemed to work; I think that made the test run about 30 seconds total, whereas the default is maybe a second or so. That's something I've found really useful for making sure you have reliable runs on Travis CI, and you can just pass that via the settings decorator on each test function you have. I should also mention the suppress health check option, which was something I ran into in practice: I believe if setup doesn't happen fast enough it will time out the test, so I just have that check disabled, and that's worked well for me. So that was property-based testing; hopefully that gives you an idea of some of the things you can use it for. I've found it really useful.

Now I want to move on to fuzz testing. Fuzz testing is kind of similar to property-based testing, except it usually just gives you a byte stream. You have some random input data, there's a loop, and you call your code under test, whatever that might be; if you hit any kind of unexpected failure (in the fuzzing world it's normally a segfault or something, but in Python we're looking at uncaught exceptions that we weren't expecting), then you have a fuzzing failure.

To illustrate this, we'll start again with a simple example before looking at the real-world one. Consider this function here: it's buggy, and specifically, if you pass it a string of length five that has the characters b, u, g, g, and y, it will throw a RuntimeError. I'm doing this with a nested if/else to show how fuzzers can really help you out here. Let's say we didn't use any kind of fuzzing framework and just wanted to do this with brute force. A simple way would be to start with strings of length 1 and go all the way up to 100, generating every possible sequence of characters, and in fact only using string.printable, which is letters, numbers, and certain symbols. If you do this and run it on the machine that I tried it on, it took about eight and a half minutes to get to strings of length five and the specific sequence "buggy". Eight and a half minutes to find a simple bug like this; what fuzzers can do is find it a lot quicker.
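For reference, here is a sketch of that deliberately buggy function and the brute-force search just described (an illustration of the idea on the slide, not the slide's exact code):

    import itertools
    import string


    def buggy(data):
        # The bug is buried behind nested checks: only the exact
        # five-character string "buggy" triggers it.
        if len(data) == 5:
            if data[0] == 'b':
                if data[1] == 'u':
                    if data[2] == 'g':
                        if data[3] == 'g':
                            if data[4] == 'y':
                                raise RuntimeError("unexpected failure")


    def brute_force():
        # Naive search: every printable string of length 1..100, in order.
        # On the machine in the talk this took roughly eight and a half
        # minutes just to reach "buggy".
        for length in range(1, 101):
            for chars in itertools.product(string.printable, repeat=length):
                candidate = ''.join(chars)
                try:
                    buggy(candidate)
                except RuntimeError:
                    print('crashing input:', repr(candidate))
                    return


    if __name__ == '__main__':
        brute_force()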
The fuzzer that I've used and really like is AFL, which stands for American fuzzy lop, and it does coverage-guided fuzzing. What that means is that it's able to look at the input it's generating, see what parts of your code get executed, and use that to guess how it can change its input to further explore your code. Normally AFL is used for C programs, where it hooks into the compiler to instrument the binary, but there's also python-afl, which is the name of the package you would install, and it integrates with Python coverage data to do the same thing.

The way fuzzers generally work, and this is the case for most of the fuzzers I've seen in addition to AFL, is that you have some sort of script, usually a shim, that takes input on standard in and then executes your code, and you also have a set of sample input files. If you were fuzzing a language, these would be valid programs; if you're fuzzing a file format parser, these would be valid files in that format. They're really used as a starting point for the fuzzer to start mutating. Once you have that, this is all you would need for that same example in Python: you import afl, and at the end you have this while afl.loop() where you call your main function, and again our main function is reading from sys.stdin; every time through the loop, sys.stdin will have new input.

Now I wanted to show you an example of how this works; hopefully you can see this video here. It's the same function, on the same EC2 instance that ran the brute-force version, and as you can see it's checking for length 5, checking for "b", "u", and then "g", "g", "y". I'll try to tell you what's happening. There's this input corpus, and what happens is it runs py-afl-fuzz; it's a little hard to see, but I just wanted to give you the sample invocation. We're saying that the results go to the directory after -o and the input comes from this corpus. If you run this (it's kind of dark here, but it'll switch screens real quick), in probably about a few seconds it caught the bug, and at the top there it says it took about three seconds to run.

For any of these fuzzers we specified a results directory, and in that results directory there's a crashes directory that holds the inputs that crashed, and if you look at it, that should be this "buggy" down at the bottom. Switching back to the slides, to reiterate what happened: when we ran the fuzzer it created this directory, and there were a couple of files in it. Any time it finds a crash it puts it in the crashes directory; the filename has metadata about where it was in the input, but the contents of the file are what caused the crash. In that example it was saying that this string sequence broke this code and raised a RuntimeError. What I found really interesting is that if you look at the queue directory, it has a number of files, and those are additional inputs that got the fuzzer further along in the code. For example, if we look at the contents of these files, we can see that it figures out that strings of length 5 get it further along in this function than strings that aren't length 5, and then it finds strings that start with "b", then "bu", then "bug", and eventually all the way until it finds that "buggy" takes it all the way to crashing with a RuntimeError. That's one of the nice things about these fuzzers: they're really smart about trying to explore all the various parts of your code, so you're able to test much more in depth than you could with brute force, or at least a lot quicker.
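Putting that together, the complete shim from the demo would look roughly like this. It's a sketch following the while afl.loop() pattern described above: buggy_example is a hypothetical module holding the buggy() function sketched earlier, and the py-afl-fuzz invocation in the comment follows the usual AFL-style flags mentioned in the talk:

    # fuzz_buggy.py
    import sys

    import afl

    from buggy_example import buggy  # hypothetical module with buggy() from above


    def main():
        data = sys.stdin.read()
        # Any exception we did not expect propagates out of main() and is
        # recorded by AFL as a crash, along with the input that caused it.
        buggy(data)


    while afl.loop():
        main()

    # Run roughly as:
    #   py-afl-fuzz -i corpus/ -o results/ -- python fuzz_buggy.py
    # where corpus/ contains a few sample inputs to seed the mutation, and
    # results/ will contain the crashes/ and queue/ directories discussed above.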
As an example of actually using this, one of the other things I work on is the AWS CLI, and in the AWS CLI we have this shorthand syntax. It's kind of similar to JSON: what we found is that if you specify JSON on the command line it's a little bit awkward with the quoting, depending on what shell you're using, so shorthand is something that's a little less noisy. We can use that same framework we saw before: we have something that reads from standard in, tries to parse it, and again it has to raise a shorthand parse error or else it's a failure. To seed this, we took the examples of valid shorthand syntax from our documentation and used them as the starting corpus. Fortunately it didn't find anything too crazy. The one thing it did point out, which I guess is obvious in hindsight, is this max recursion error. The parser that's used in JMESPath is a recursive descent parser, and of course with recursive descent you can hit the maximum recursion depth. The behavior itself is expected in the sense that yes, you will eventually hit a limit, but the bug is that this is the error a user would see, just "RuntimeError: maximum recursion depth exceeded", with no context about what happened. Ideally we want to catch that and say: we were trying to parse something, this is what we were parsing, this is what we ran into, and make it a little bit easier to troubleshoot. So the fuzzer helpfully pointed that out.

A couple of tips from practice when I run this. I don't run this as part of the Jenkins or Travis build; it usually takes a really long time to run, and there are some heuristics, which we won't have time to look at, about when you should run it, when you should stop running it, and when it has found enough. The two things I've found really helpful are, first, to use the multi-core support. In the video before, we were just using a single core, a single instance, but you can denote one process with -M and then all the additional ones with -S, which are more randomized. What's nice is that they all use the same results directory, so when we saw that queue directory with all the interesting inputs, all the other child processes can use those as starting points to really explore the space, and it gives a nice increase in the amount of fuzzing you can do per unit of time. The other thing that's really nice is persistent mode. Normally, if you're not using persistent mode, it executes a process and then the process exits; with persistent mode, instead of exiting, it will just continue to reuse the same process. Now you have to be careful: in the examples I've been using, that's fine because there's no global state, we're not mutating anything at the module level, and we're just instantiating a new class each time. If that's not the case, you'd have to use the version that exits the process each time. I've found that persistent mode usually gives about a three to four times speed-up, depending on what you're doing, and that was pretty consistent for me. So those are the two tips I'd recommend if you're going to go down this route.

Okay, next one: stress testing. So far the two types of tests we've looked at are really about using the same code but randomized input to try to explore and find issues with your code. This one's a little bit different, and I'm kind of overloading the term here, but with stress testing the idea is that you take the same input but you have different execution; primarily this comes up when you talk about threading.
Ideally, yes, I should put up the standard disclaimer: if you can use multiprocessing or asyncio or something like that, those are definitely options you should explore, but a lot of times, especially for I/O-bound code, threading is a really good solution that works well, and we use it a lot. The example here (I'm just going to jump right into an example) is streaming downloads. If you've ever used the CLI's "aws s3 cp" where you're downloading to "-", which means write to standard out, or if you've used download_fileobj with a non-seekable stream, one of the constraints of downloading to a stream is that you have to write sequentially: you can download in whatever order you want, but when you're writing to the final stream you have to write in order, versus a file where you can seek around wherever you want and just write out the bits as they come.

One of the things we have in our code base is this idea of a sliding-window semaphore. I'm going to go through this really quickly, mostly just to give the context of how these tests can help, but this is one spot where we really want to make sure we don't mess up any of the synchronization, because you could get into all sorts of trouble: potentially deadlocked code, or invalid data. The way the semaphore works is that you check out a chunk of a file. So imagine this is a file here; as each thread comes along, it can release a part of the file. In this case one thread might be done with this part of the file, so it releases it; another might be done with that part; but notice the right-hand side isn't moving until the left-hand side is done. We can continue to release parts of our file until we finally get to this last part, because all the while there's a thread on the other end that's waiting to be able to write to this file, and it's blocked until the left end opens up. Once the left end opens up, this whole window becomes available and slides over, and now the threads that have been waiting to download the later parts of the file can start downloading them. They call acquire, which gives them an integer representing which part of the file they can work on, and everything continues as expected.

The way we test this is we have this sliding semaphore class and we specify a couple of things: we're going to try it with 10 threads and 50 iterations, and each of those threads is just going to acquire a value and then release it, I think 0.1 milliseconds later, and it does that for however many iterations we want. We take all of that, spin up our threads, let them go for a while, and wait for them to finish. With stress testing we're not really interested in asserting on specifics during the run; we're more interested in what happens after. After all this is done, there are two things we can guarantee. One is that the number of slots available should be the original number: if we've really acquired things and then released them, we should be back where we started, so if we started with five we should be at five. The other is that if we were to acquire another part of the file, it should be wherever we left off; in this case that's the number of threads, which is ten, times the number of iterations that each one acquires and releases, so the very next block of the file should be that product. That's just a good check.
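A rough sketch of that stress test is below. I'm writing it against the SlidingWindowSemaphore from s3transfer.utils; the acquire(tag)/release(tag, n)/current_count() interface shown here is my best recollection of that class, so treat the exact signatures as assumptions rather than the project's actual test code:

    import threading
    import time

    from s3transfer.utils import SlidingWindowSemaphore  # interface assumed

    NUM_THREADS = 10
    NUM_ITERATIONS = 50
    WINDOW_SIZE = 5


    def worker(sem):
        # Same input, different execution: each thread repeatedly checks out
        # the next part of the "file" and releases it shortly afterwards.
        for _ in range(NUM_ITERATIONS):
            part = sem.acquire('tag')
            time.sleep(0.0001)
            sem.release('tag', part)


    def test_sliding_window_semaphore_under_stress():
        sem = SlidingWindowSemaphore(WINDOW_SIZE)
        threads = [threading.Thread(target=worker, args=(sem,))
                   for _ in range(NUM_THREADS)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        # Assertions happen after the threads finish, not during the run:
        # 1) every acquire was matched by a release, so we are back to the
        #    original window size;
        # 2) the next part handed out is exactly where we left off.
        assert sem.current_count() == WINDOW_SIZE
        assert sem.acquire('tag') == NUM_THREADS * NUM_ITERATIONS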
If we were missing synchronization somewhere, if we didn't lock something properly, we would likely be off in those numbers, and that's just a great way to test this type of code. There are a number of other examples, but I'm going to leave it at that for stress testing; that pattern generally applies to any kind of threaded testing. There are a couple of other spots where we use it, but we run this as part of every CI build, so every Travis build has this, and it usually runs pretty quickly. For local development you can run it in a loop for a little bit longer if you want a little more assurance that nothing is wrong, but in general we just run it as part of the CI suite.

Okay, last one: mutation testing. The motivation here is that we've been looking at types of tests that help us write reliable, higher-quality code, but what actually tests the tests? As an example, here's a function and two tests for it, and if we run this we feel pretty good, right? 100% line coverage, 100% branch coverage. But if you look at that last test, it's actually not testing what the value is: it asserts that the key is there, but it's not properly testing the value. The reason this matters is, let's say we introduce a bug. We're working on the code base, we think everything's great, and we introduce a subtle typo that accidentally gets through code review, but effectively it's just saying x plus 1, right? x minus negative 1 is just x plus 1. If we run our tests, they'll pass, so we might think everything's good; we have 100% line coverage and 100% branch coverage, and we're never alerted to the fact that our tests are missing things they should be testing.

This is the motivation for mutation testing. The idea is that you take your tests and your program, you modify the program in some small way, usually just one small change at a time, and you rerun your test suite. If all the tests pass, that means you were able to break your code and no test told you that you did. What you want is that if you change your code in some breaking way, you should get a failing test; it doesn't matter which test, but as long as something fails it lets you know you're protected against that change.

The library that I use for this is called Cosmic Ray. It's great, it's really simple: you pip install cosmic-ray and then there are really just three steps. You start a session, you give it some name (my-session), you tell it where your code is and what tests you have, and then you run cosmic-ray exec with the name of your session. Now, this will take a long time; it really depends on your test suite, anywhere from a couple of hours to days. Once it's all done, though, you have this cosmic-ray report command, and it will tell you what it was able to mutate. For the JMESPath library, here are a couple of things it found. We're going to use the lexer as an example again, and it gives the result to you in a nice diff format too, which is cool: it says here's your original code and here's the mutation. We changed a start + 2 to start - 2 and everything passed, and that doesn't seem right: you shouldn't be able to end a token before it started. Potentially, if we'd had a property test for that, it could have caught this, but the tests still pass.
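To make the motivating example concrete, here's an illustrative reconstruction (not the exact slide) of a weak test that gives full coverage yet lets a mutant survive; this is exactly the kind of gap a tool like Cosmic Ray surfaces:

    def clamp_add(x, y):
        # Add x and y, clamping negative totals to zero.
        total = x + y
        if total < 0:
            total = 0
        return {'result': total}


    def test_positive():
        assert clamp_add(2, 3) == {'result': 5}


    def test_negative():
        # Weak test: 100% line and branch coverage, but it only checks that
        # the key exists, never that the negative total was really clamped.
        assert 'result' in clamp_add(2, -10)

A mutation that changes total = 0 to total = 1, or removes the clamping branch entirely, still passes both tests: the mutant survives, which is the signal that the negative-path assertion needs to be strengthened.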
So we run it again, and it really hammers home the point that you're missing a test, because it says we can also divide by two, and we can also square the number, and everything still passes. I think the point is: yes, it's clearly missing a test. Here are some of the other things the mutation testing framework can do: it messes with binary operators, changing == into !=, or == into <=, and it messes with constants, so you'll see self.index = 0 become self.index = 1, and it just reruns your test suite to see what happens.

A couple of tips on running this. It's Python 3 only, so you have to make sure you handle that appropriately, and it has a long execution time, so this is not something I run as part of the normal suite; it's usually just used as an audit. Every now and then we'll run it, see what happens, and if we have missing tests or gaps we'll go back and add tests or update our tests. It's really helpful if you have a fast test suite. It is possible to do distributed test execution with Celery, but I haven't used that, so I'm not exactly sure what's involved in setting it up; but if you have a fast test suite you'll generally get your results faster.

Okay, so to wrap up: we looked at property-based testing, fuzz testing, stress testing, and mutation testing. Hopefully this gives you some interest in exploring these additional test types; consider adding them to your project. They're very straightforward to add, and I found they really helped with my own projects. Here are some additional links to the projects that I use, and a link to JMESPath, and that's my Twitter handle if you want to follow me. Once again, thank you everyone.

We do have time for some questions. As a reminder, please make sure that your question is actually a question, and anyone who is leaving, if you wouldn't mind being as quiet as you can; people in the back sometimes have trouble hearing.

Say, for instance, when you're doing the AFL fuzzing and your function takes in some kind of custom object: would you refactor your function so that the object is taken apart and passed in as parameters, or would you put a bunch of items in the corpus? How do you handle creating a custom object that gets passed into your function?

So I don't have any experience with that; that's usually where I'd use Hypothesis, but then you don't get the benefit of the coverage-guided fuzzing. In my experience it's mostly been with parsers and binary formats, that kind of thing, so you might be able to take a stream and try to use it to unmarshal into an object, but I've never used it for anything like that, so I can't speak to it. Thanks.

You mentioned continuous integration, and also adjusting tests or tweaking things for performance and test run time. What are your thoughts on how these impact run time, and how should that affect the way you integrate this into a bigger testing process?

I think in general, if you're using it in a CI system, it really makes sense to bump things up, especially for the randomized ones like the Hypothesis tests and the stress testing; the more randomized data you get, the better coverage you have. One thing that I didn't mention is that usually when we do find issues with this, we convert them to proper unit tests.
That way you lose the non-determinism part; you want to make sure you have deterministic tests. But in general my suggestion is to go as high as you can comfortably get away with.

Kind of on that same theme: all these tests you described sound very interesting and I'm totally on board, I want to use them, but I'm not super comfortable with having non-deterministic tests as part of my CI, because I don't want someone to make a change and then have it break because of something that just happened to be a weird edge case, and then new contributors don't know whether it's something they did. So do you have any thoughts on how you could use these in a parallel track? Especially since some of these are very long-running, I've never found a good infrastructure for running fuzz testing automatically on something like Travis CI, but not so build-focused, more just for test discovery.

Yeah, and in fact for fuzzing that's actually usually what I do: I don't have it as part of the CI suite, primarily because it takes a long time and you have to look at the data manually. But even for the Hypothesis testing, which runs pretty quickly, I also get uncomfortable about having potentially random failures that may not make any sense. The one thing that I have looked at, and I don't think our projects are set up this way, is that if you at least have it as a separate environment for something like Travis or Jenkins, you can say: here we're running the unit tests, here we're running the integration tests, here we're running the property-based tests, and make it clear that there might be some element of randomness to it. In general, with our workflow, the tests get run as part of a pull request or as part of the feature work itself before it's merged into master. There's still a chance, of course, that you'll get some failures that you didn't see in your normal PR runs, but I think that's just a risk, a trade-off, that you'd have to consider.

All right, let's all thank the speaker one last time. [Applause]
Info
Channel: PyCon 2017
Views: 5,245
Id: jmsk1QZQEvQ
Length: 32min 5sec (1925 seconds)
Published: Sat May 20 2017