John Hughes - Don't Write Tests

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
so just a reminder if you haven't been in this room yet today we're taking questions at the center Mike rather than running Mike's just due to the size of the room so if you have questions I'm sure John will be happy to answer them for you if you just line up for the end of the talk okay and so take us away okay thank you so I'd like to start my talks by asking the audience a question so here's the question who really really loves testing okay maybe maybe half a dozen little bit more that's not bad so you're pretty typical as an audience testing it turns out is not the most popular part of software development so I just want to start off by thinking about why that is why is testing hard well imagine that you're writing a test suite for some code that has n features what are you going to do you're going to write maybe three or four test cases per feature that's perfectly okay that's a linear amount of work but we all know you won't find all the bugs if you do that because some bugs only appear when two features interact so what are you going to do are you're going to write tests for all pairs of features well that's a quadratic amount of work that doesn't sound so good and even if you do that then not all bugs ago to appear if you test for features in pairs some bugs only appear when you use three features together and that's a cubic amount of work and this is now starting to sound really pretty bad and of course it goes on from there but because it erased conditions as well so that errors don't even appear every time you run a test it just gets harder and harder and in the end it's hair tearing ly difficult you can never test enough so what's the solution simple don't write tests [Applause] generate them because we still want to test our code right so I'm going to talk about quick check test case generation tool I've been working with for quite a long time now which originates from paper by classes and myself we're up on the top right corner there and the way that we normally use it is that we start off with some API and a test and then we generate random sequences of calls and we keep doing that until the test fails and now when a test fails usually this the sequence of calls may be quite long and it's very rare that all the calls in that sequence contribute to the failure so the next thing we do is try and identify the calls that are responsible and shrink the test case down to a minimal failing example and that's what we then report to the user for diagnosis that's a bit abstract so if you haven't seen it I'm going to do a demo quick check now so I'm going to demo a circularbuffer if I can get the code to appear here that's not valid curses okay so this is a circularbuffer in see most of you probably right see better than I do but 10 in mind so here's a struct to represent the buffer we've got a pointer to where the date is going to be stored I'm going to keep track of the next index put something in the next index to take something out besides the buffer and so on here's the code for creating a new buffer we have to allocate memory for the data initialize the fields allocate memory for the struct assign that return a pointer and so on here's how we put something in we write the argument to the data area at the at the input pointer and then we increment the input pointer modulo the size very simple obviously no bugs but it's a nice example to test and here's how we get something out we read from the data area at the output pointer and then we increment the output pointer modulo size and I've got one more operation here that just tells us how many things were in the buffer just take the difference between those indices modulo the size so that's my code and I'm going to test it from air lag so let me just compile that code it's in the file Q dot C okay and now I can call that C code from the airline Shell's let me just demo it I'll create a queue with space for five elements there we are let's put one into it let's put two into it okay let's ask how many elements are in the queue at the moment - that's good let's take something out there's the one there's the two there's on initialize memory and some more and some more oh look there's the one again there's a - it's a circularbuffer so it's C code add it and it behaves exactly as C code would expect so now I can test this code from Ed Lang and it's the airline version of quick check I've got a demo so what I'm going to do is generate some tests for the C code which I will do by calling quick check oh wait let me just compile my test code and now I will call quick check and give it my specifications an argument oh no a test failed but you saw it really working ok so what's happening here what's happening is that we generated a random test case and what you see up here this was what happened when I ran the random test and I think we can agree you probably wouldn't want to debug that and then quick check has searched for the core of a test case the mineral example and down here this is the part that I want you to read okay this is this shows us what happened in the test that actually fake so what happened I created a queue with space for one element I put a zero into it and I asked how many values are in the queue and size return zero but I just put something in okay so I want you to I want you to help me now here's the source code for size what's wrong okay so think of the example I created a queue a size one I put one thing in what happened that I did that hmm oh yes yes the size is one so the size function returns the difference between two pointers modulo 1 but that's always zero in fact but I put something in I incremented the input pointer modulo 1 so that's also zero so oh dear putting something into the cube didn't change it at all and if you think about it actually whenever I make a queue size N and I put n things in the input points go to wraparound it's going to look like an empty queue so I can't distinguish full from empty queues that's the problem here and so my representation is inadequate any suggestions for a fix what well I am storing sighs I could yeah I could for example store a boolean to keep track of whether the queue is full or empty but then I get lots of special cases in my code that we horrible so there's a lovely hack which I'm going to show you so the problem only arises when the queue is full right so here's what I'm going to do well I'm asked to create a queue of size n I am actually going to create one of size M plus 1 what does that mean that means for something to observe the bug they must put n plus 1 things into a queue of size N and that means it's their fault and I have successfully avoided blame which is the goal of all good software engineering so let me recompile my C code and then I will just run the last test again I can do that if I just move quick Cheers so this will run the last failed example now it passes and now I'll run random tests and let's see what happens oh it still didn't work there must be another bug okay so look what happened here I created a queue of size one that's really - I put something into it that was okay I removed the thing by calling get so now now it's empty again I put something else into it that's okay how many things are in the queue now I put two things in I took one thing out one and so size returns -1 oh no what happened here well if we go back to the code for size right what happened was I put two things in the queue is really of size 2 so the input pointer has been incremented twice modulo 2 it's wrapped right back down to zero the output pointer of taking one thing out it's been incremented once so here I'm computing 0 minus 1 that's minus 1 modulo 2 what is minus 1 modulo 2 I did I've studied mathematics I know what it is is plus 1 right let's just check with their length minus 1 modulo 2 oh no air long doesn't implement modular arithmetic properly it returns -1 and so to see so that's the problem right I thought modular would always return something positive but if this our expression is negative I get a negative result from size and that's clearly wrong so I've got to make sure this expression is never negative how can I do that yeah thank you absolute value okay let's just make sure that works I'll recompile the see rerun the last test it's fine so where are my now I found a problem on my code I created a test case that provoked it I've fixed my code I've rerun my tests or my tester green I'm done right would you like me to run some more random ones just till we feel comfortable all right let's see what happens oh no it still doesn't work what's happened here okay now I made a queue size two that's progress that means cues of size one work now we know that because it's minimum okay so what does this test gauge do it puts three things into the queue it's really size three and I take one thing out so how many things should be in the queue to what a size return one why is that because the input pointer has wrapped around again so this is zero I've taken one thing out this is minus 1 modulo 3 that should be to mathematically but if I take the absolute value I get 1 so what I need to do is not take absolute value that's a easy mistake to make how to make sure this is positive in some other way and I can do it by adding queue size here and that will ensure that that is never negative and everything should work let's see a recompile the see rerun the last test that passes should I run random ones just be on the safe side maybe yes 100 test passed 200 300 400 500 seems to be working now okay well thank you very much for helping me with my debugging and let me go back to slides here so what we've seen here is that with the random tests and the properties that haven't shown you the test code but my test code was able to find a variety of different bugs with the same test code that's really nice isn't it and what you've seen also is that when bug was found then it was shrunk down to a minimal failing test case and I hope you'll agree with me but those minimal tests are quite easy to debug it's great to know that everything that appears in the test case is relevant to the failure that's an enormous help for debugging if you look at the first randomly generated tests they contain a lot of random rubbish think of it as noise why it's their purpose random rubbish is very good for provoking failures but it's very bad for debugging them so this shrinking process that simplifies the test case as much as possible we think of as like extracting the signal from the noise and it's I think it's key technique for making this approach useful okay so I've shown you testing a very tiny piece of C code I didn't show you my test code it's actually as large as the C and if that were true for larger pieces of software then the whole approach wouldn't be very useful so you might be wondering does this really scale so let me tell you about the largest project that we've used quick check to test and that is part of the outer SAR basic software there's a diagram of it here I'm not going to play this in detail what is it though this is the structure of software that runs on every processor in a car that follows a desire so you can see here there are lockers each colored box is a module and then the boxes that contain colored boxes are subsystems so you can see there's an Ethernet stack on the right there the second thing is a can bus stack the can bus is a network that's very commonly used in cars most of the communication goes over canvas and then the other stuff does routing there are some other protocols and so on the point is just that there's quite a lot of software in the outer star basic stuff and there is a standard specifies how this should behave why because car manufacturers like Volvo cars want to be able to put subsystem together containing code from different providers now you might think that the right thing to do would be to have one implementation of an Ethernet stack for example but you that everybody just shares in the car industry one open source solution would ensure that all the different components work together but that's not what they've done what they've done instead is they've standardized how these components should behave so that there are many different suppliers of canvas stacks and so you can be running can a canvas stack from one supplier you know in your gearbox and from a different supplier in your engine controller and they have to talk to each other and thanks to the standard then everything should just work well guess what it doesn't and in fact when you first put a car together you can't even boot it let alone drive it and a part of the problem can be that the suppliers of this basic software didn't follow the standard and so their components don't work together so Volvo cars paid us to develop quick check models of all the outside basic components and to test the software from their suppliers see if they we're following the standard or not and I can show you one of the books that we found here it is this is in the canvas stack so the canvas can carry one message at a time and all the messages on the canvas have an identifier which also serves as their priority so this test case first of all sends a message with priority one and what happens where we can observe that it says message one so now the bus is busy then it sends two more messages with priority two and three okay so now you can imagine that the client software has called the canvas stack saying send these messages but of course it can't send them yet because the bus is busy so they get queued up then the test case calls transmit confirm so sending a message is normally called from above the protocol stack transmission confirmation is called from below by the hardware bus driver and in this case it's confirming that the message with identifier 1 has been set so now the stack has two messages queued up and it must choose between them and the trick about these priorities is that the smaller the number the higher the priority so which message should be sent next the one with priority 2 and which one was sent the one with priority 3 so that was kind of interesting and in this case we actually had the source code of the can stack often we don't have source code of software we test but in this case we did so we could look at it we could find out what the problem was cycle it slated to you in the canvas the message identifier serves also as the message type and in the first version of the canvas protocol there were 11 bits allocated for storing the message type but if you think about that that's like saying you can have 2,000 different types of message in the entire car it's like saying your java program can only have 2,000 different classes it's just not enough anymore so as a result the the can standard has been extended so you can now use an extended cam ID of 29 bits and that allows what is it well enough different message types a even for today's software but you can still use the old format so when you send a message you have to remember what kind of identifier have I got should I use the small format or the large one and if the numerical value that determines the priority in this particular canister then this identify was stored at an unsigned 32-bit integer and look the extended field is only 29 bits long so you've got three spare bits and the top bit was used to record whether the message should be sent with an extended ID or a standard ID so guess what what I didn't show you before is that the message with priority two was encoded using an extended can ID and the C code that was deciding which message to send okay the developer forgot to mask off that top bit before comparing the priorities so it was treating the message with priority two as priority two to the 31 plus two and that's why it was sent second and not first so I really like this example and partly because it's a very low-level bug a C programmer forgot to mask off a bit in this protocol stack but even so you can provoke the bug with a short sequence of API calls and you can find that short sequence by generating longer random ones and shrinking them as quick Chek does so it's very nice illustration for technique working for much larger scale also it's an important bug the canvas has priorities for a reason everything in a car pretty much talks on the canvas the stereo the brakes so here's a tip next time your emergency braking don't fiddle with the volume on the stereo so in this project we read 3,000 pages of PDFs containing the specifications of all of these outer cell components we turned that into 20,000 lines of quick check code which i think is quite good ratio we only needed 6 lines of code to test what one page of PDF said we used that to test in total about a million lines of C code from 6 different suppliers and even though when we got the code it was supposed to be working already we found more than 200 problems of which more than 100 we're in the standard itself I found one of those myself where first of all there was a requirement in the standard very clearly stated and then afterwards there was a paragraph explaining what it meant that contradicted what the requirement said so you know what are people to do the other thing we found was that in those cases where we compared our test code to existing test code I was nine times shorter and more over tested more so does it does this technique scale yes it scales I also want to show you I can't resist show you my favorite ever bug this is a bug in the database that comes along with air land and to provoke it here's what you have to do first of all you have to open a table and then you have to close it and open it again no really after that you do three things in parallel one lookup and two insertions and about every hundredth time you execute that you crash the whole database so the database concern it was it was crashing in production at one of the biggest users in Sweden about once a month and I knew about this bug I thought it would be really fun to hunt it down so that's why I went when looking for it but by the time I found this then the guy who was maintaining the code had spent more than six weeks working at the customer trying to figure out what nerve the problem was and at the end of that time their best guess was it seems to happen when the file is about one gigabyte maybe it's something to do with rehashing now we know you need a database with at most one record and only five or six function calls to reproduce the problem and once we once I'd found this very small test case that I sent it to the maintainer and next day he sent me a text version and this this bug really shows the enormous value of shrinking and getting a small test and what a dramatic difference it makes and how easy it is to debug these problems the other thing I really like about this bug is that there's no way anybody would write this test in a test suite right I mean there's there's quite a few calls in the API why would you test this particular one and yet this is the smallest example that can provoke this bug so it's a kind of bug that you can really only find by searching for them ok I just want to go back now to one of the counter example I showed you remember this one so this is of length 5 ok it's 5 steps and so you might think if you're searching for a small example like this maybe you should be running tests of length 5 but that isn't what we do I measured the lengths of tests I was running and here's the graph that shows what I was doing when I was calling that devil for you so you can see that the tests test cases I ran varied widely in length about 8% of the world length zero which might seem a bit excessive but it's not so serious those tests run very fast but then there were much longer tests the longest ones were more than 200 steps so why do we do this why do we run much longer tests than the counter examples we're looking for in fact the average here is is over 15 steps so 3 over over 3 times the length of the counter example that I actually wanted to find so the key thing here is to ask ourselves well how quickly can we actually find faults by running these random examples how does the length of the test affect the probability of hitting a thought so I measured that too so look at this this is the probability of a test page failing on the y-axis and on the x-axis it's a number of steps and test cakes and because there aren't so many test cases of greater lens it gets a bit noisy towards the right but nevertheless when you look at it you can see well actually all the tests the past so I didn't manage to hit this bad case in a random test of length 5 but then as the test got longer and longer so they got more likely to fail well now you might say of course a longer test case is more likely to fail it's doing more stuff so there's more chance that we'll do something wrong and of course that's true so this graph isn't really the one that you want to see what you want to see is as I do more work running a test how does the probability of of failure per unit of work change so what you want to see is this graph divided by the length of the test right because each call is you know on the assumption that cost about same and this is what that graph looks like ok so this is the probability of one command failing in a test case of varying lengths and you can see that for the shortest tests there's simply no point running them they all pass because the shortest failing was five steps long but after that tests are only a little more than five steps or though most likely will pass but once we get up to the 20 to 40 commands range that's where we're maximizing the probability of each command revealing a failure and this is something that we have observed again and again and again in many different contexts but generating longer random tests and shrinking them back down is by far the fastest way of finding short failing examples so why is that well I can give you some intuition I'm waving my hands here I'd better but there's some truth in what I say anyway so this represents a long test case suppose it fails well as I showed you when the test case fails it's usually because of some sub sequence of commands maybe these five commands were responsible for failure so what about all of the rest of them well it turns out that quite often the other commands of the test case don't matter in other words as long as these five commands appear somewhere in the test case is quite likely to fail with the bug that we're looking for so now ask yourself in a long test case of length n how many ways are there of embedding 5 steps into that sequence it's exponentially many right you're getting exponential benefit for a linear amount of work is that good or what so that that explains why reason really this kind of testing is unreasonably effective and it's quite quite amazing to see okay well if you want to know more haskell quick check was the original version of quick check there's a paper from 2000 that describes that cubic quick check is the one that I demoed today and we have a company that supplies that along with services but there are also many many other tools that have been inspired by quick check like test stop check and closure scarlet check FS check and have sharp and so on so I encourage you to go and look at some of these tools and try them out if you haven't done so yet then you will be amazed don't write tests generate we have some time for questions I could just use the center mic there Bryan Newton for MIDI Anna hi John all right so aside for him to take away that I should not get in a car one of the things I was wondering about your talk is about shrinking test cases in the context of non-deterministic failures like the database one that you demo because at that point each shrinks step may or may not yield there's no true that a question of whether on the strength succeeded or not do you simply set a threshold for how many times you try it each time you shrink right so suppose you are trying to debug a non-deterministic failure and you've got a complex case it fails and you're doing it manually what you do you'll try to simplify it but you won't run that test just once to see if the failure is still there you have to run it repeatedly that's what we do we just automated and how many times you run a test before you decide that this one doesn't fail that depends on how rare the race condition is that you're looking for and that's something typically that the quick jet user will adjust so for the database examples to get reliable shrinking I ran each test a thousand times okay thanks it can be a little slow but it's a whole lot faster than trying to figure out what's going on without doing that Thanks so what happens when you use a quick check within a first or second year programming course I have used it on the very first programming course and I told the students method for writing functions where they wrote down and the type they wrote down the names of the parameters and I got them to write down a property before writing the code and I was you know it was really the students did what I told them they wrote a lot of properties they got a lot of benefit but I think that was perhaps the wrong place to do it because at that time it was not easier for the students to write a property that it was to write the function so I think it makes more sense to introduce quick check testing once students are writing code that's a little bit more complex so that their properties can be much simpler than the kilted they're testing hi thank you very much for creating a quick Jack I am a little curious if you have any recommendations about how to use it to explore like very large state spaces and in particular I notice in the equipping quick check it seemed that perhaps you start with really small ones and iteratively move to bigger cases basically I found it difficult to generate good examples when I'm dealing with large like very high dimensional data structures yes so I think random search is very good at exploring very large spaces because what other approach can you possibly take to space is very large you can only sample from it and random sampling is quite effective but what I don't have time to get into in this kind of talk is to discuss the measurement of test cases so one of the downsides of using a tool like quick check is you don't see the tests and as a result when you generating complex tests it might be that almost all were rather trivial so you might get a false sense of security so it's very important that you collect statistics about what the tests are actually doing look for example how well am i covering my source code if you're missing part of it entirely you need to adjust the distribution if your we measure all kinds of things so for example I often use a test or the process registry that maintains a table of process identifier and the most simple way of generating those tests generates a table that is almost always empty right I know I measured the size but of the table and I adjust the distribution so that I get a nice spread of sizes you using it too like this doesn't free you from needing to think about what you want to test but it raises the level of which you think about that you need to think about what should I measure to have confidence in my tests and then once you've measured it how should I adjust the distribution so that I get the measurements that I want so there's there's a lot of skill and experience in doing that which I don't have time to discuss in a talk this length Thanks hi um so I was wondering in particular it sounds like at a high level the goal of a tool like quick check in some sense is to find counter examples to some theorem so you state some property of the program and you're looking for a witness that that theorem is false so I was wondering in particular if you had considered I don't have a great sense of sort of the overlap between this and the formal methods literature but if there are techniques from say s and T solving that could be useful here or vice-versa yes so that's right so what we do is we formulate a property of our code correctness condition and then we look for a counter example that shows that it's not correct quite often what we discover is that the property is not correct right so in general what happens is that funnily enough customers write their code before asking us to help them test it and that means that in practice you not usually test some code that has already been tested to a certain extent and that works to a certain extent which means your test code is a new code so your formal specification is the new code of course that's where the first bugs are and so one of the nice things about using a tool like quick check is that there's very little manual effort involved in discovering a discrepancy between the specification and the implementation and then you can quickly see okay here's a counter example there you have to ask yourself is it a real counter example is it a bug in the code or is it a misunderstanding my spec and usually go through a process of refining the speck until you actually have captured the intention and from that point on the books that you find are in the implementation so if you if you try to start off with your spec and assume that it's right and then try and prove that the implementation meets it then you're going to be doing a lot of proof work but that may end up being wasted because the specification was wrong in the first place so I see using a tool like quick check as an appropriate step you would take before putting in the effort of trying to do a formal proof because I think you get a lot of the silly mistakes in the speck out of the way and many of the silly mistakes in the code out of the way so you can start trying to prove that your implementation is correct in a good position where it most likely is correct hey John um last talk so I guess one of the interesting things from my perspective you know you're working with Haskell in particular and you know it doesn't provide a first-class support for expressing the kind of properties that you want to look for and but there are programming languages which I do provide support for expressing you know pre and post conditions and other properties that you want to check and I'm wondering you know what you think about languages like that as they would help tools like quick check because those things would be written into the code maybe from the beginning hopefully okay so the devil that I ran it was actually with the add line version sure sure sorry yes yeah and then I didn't show you the test code but what consists of is a state machine model of this you know the circular buffer including preconditions and postconditions and state transition functions and so on so that we've built a framework for expressing specifications in that form because when we started the company we discovered our customers didn't have all that much purely functional code and you know the customer's always right so so we worked on the state machine modeling formalism so I think you're perhaps referring to languages like Eiffel that have contracts built in I thought would be a good example yeah yeah and so that can of course be helpful if an assertion fails in code under test then that tells you the test has failed but in quick check we use the state machine models not only for deciding whether a test has passed or failed but also for generating tests and so that means that we need to use these preconditions in particular in a slightly different way if they were just if they were hidden promise by being written in the code of the system under test it would be harder to use them for test case generation but you could extract the preconditions right having then that's what I'm imagining yeah possibly what are the one of the things that has worked well for us is that quick check the version of picture that we use is entirely black box and that means that we can test code in Java or C++ or whatever the very first customer we had we managed to crash a base station which we knew was programmed in C++ generated from ROS RT and the most serious crash we found we looked in the log and it was full of Java exceptions what whether they come from so it turned out that we had corrupted the state of the base station and violates the fundamental invariant but the base station also had and operators GUI that was in Java and that GUI is the first thing to fall over when the state the base station was corrupt and so that's how come they appeared there and we could do that because because the testing is blackbox all we have to do is invoke the system on the test using the same test interfaces that anybody else will use for testing it and that limits what we can do we can't process the source code but at the same time it gives us a great deal of freedom in the kinds of systems that we can test ok let's thank the speaker again [Applause] okay and so now there's a coffee break so we'll see you in a little bit
Info
Channel: Curry On!
Views: 12,248
Rating: 4.9352226 out of 5
Keywords: CURRYON, curry on, functional programming, functional, fp, generative tests, quickcheck, property-based testing, testing, tests, software testing, erlang, haskell, programming, programming languages, generate tests
Id: hXnS_Xjwk2Y
Channel Id: undefined
Length: 41min 15sec (2475 seconds)
Published: Tue Jun 20 2017
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.