Raoul-Gabriel Urma, Kevin Lemagnen: Adv. Software Testing for Data Scientists | PyData London 2019

  • Original Title: Raoul-Gabriel Urma, Kevin Lemagnen: Adv. Software Testing for Data Scientists | PyData London 2019
  • Author: PyData
  • Description: The journey to deploy a model to production starts with testing it rigorously, including its code implementation. In this tutorial, you will learn about state of the art ...
  • Youtube URL: https://www.youtube.com/watch?v=WTj6T0QdHHM
Captions
Welcome everybody, how are we feeling? Are you pumped? Yeah, he's pumped, that's what I like to hear. So this tutorial is about software testing for data scientists. The reason I put "advanced" in the title: advanced is always going to be subjective, but we're going to go beyond a simple assertEqual and see what comes after that. That's why it's going to be a little bit more advanced.

Before we kick off, a quick word about the team today. We have David, a senior Teaching Fellow at Cambridge Spark, and we have Kevin, the CTO of Cambridge Spark. We all look completely different in those pictures than we do today, because those photos were taken a long time ago; especially you, Kevin, you look really young.

So what are we going to do in this tutorial? We're going to answer different questions. The first one is: is my code correct? That's a good start, and that's what unit tests are for. The next question is: how do we write good tests? Then, since it's actually quite boring to write tests, you have to write all those assertions and set things up, it would be nice if tests could be generated for us, so we're going to look at that, and you'll see it's actually not that easy to do and can be inconvenient. Finally, we'll look at real-world situations where in practice we depend on databases and REST APIs that we have to plug into, and how that impacts our testing. Those are the topics covered in this tutorial.

There is a bit of setup to do, and I'll walk you through the outline. We'll motivate the need for testing for data science; after all, testing was created for software development, so why should we care as data scientists? Then a quick review of the testing tools available, to set the scene. Then we'll deep dive into unit testing using pytest, one of the tools available, and talk about why it's good for data science in particular. We'll talk about writing good tests and the patterns available, then look at something that sounds really great for dinner parties, property-based testing, and finally test doubles and mocking. It's quite a meaty agenda today.

This is the setup for today; if you've just arrived, please take a photo of this slide. I'm going to leave it up for 30 seconds and then move on. It's about 80 to 100 megabytes, so with the number of people in this room we'll see how the network goes, but it contains essentially all the files we'll be using in this tutorial: the demos, a series of exercises, a notebook, and some datasets. That's why it takes a bit of space.

All right, let's go through the background. We're going to look at a dataset made of Kickstarter campaigns and build a model around it. A Kickstarter campaign is essentially a way to crowdfund a project, and a project has different features, like the amount to raise, the category of the project, and the region the project started in. We'd like to build a model that predicts whether a Kickstarter campaign will be successful or not. So we're going to write a bunch of code to do the modelling, and that raises the question: how do I test this whole thing?

Cool, so I'm going to jump into the notebook. Once you've downloaded the git repository you should see something similar to this, and we have a notebook which I'm just going to kick off and zoom in.
What you'll see is that we've got a zip that contains a bunch of data. I'm going to get pandas to load it up so we can get a feel for what the data is. In this data there are different columns: there's the photo for the campaign, the name of the project, the goal (that's the amount to raise), a bunch of interesting features like the country and the currency, then the deadline, a timestamp for when the campaign closes, when the campaign was created, when it was actually launched, and then a bunch of JSON-type data that represents further information like the category, a URL, an image, and so on. So this is the real deal; this is not a made-up CSV file with numbers. This is a dataframe that contains text and JSON, and we're going to have to build a model, preprocess the data, and test it. That sets the scene.

On the command line you're provided with a script called run.py, and by passing the argument train, it trains the model on the training set, using logistic regression, and then serializes the model into a joblib file.

To give you a flavour of how things are structured: I've got a model, and this is the implementation of my model; it's a class. Who came to Kevin's tutorial earlier? Great; Kevin did a fantastic job of explaining how to use scikit-learn to build a modular pipeline out of a series of transformers which you can reuse. Those transformers might do some feature extraction, they might take some JSON data and extract for example the category, or they might translate a country into its region. That's how the code is set up: there's my preprocessing step, then training a logistic regression model, and then I can predict. That's the code for the model, available in the model.py file.

As you can see, all of this code makes heavy use of different preprocessing steps. There's a categories extractor, a country transformer; what is all of this stuff? How do we know it's correct? Is it doing the right thing? In this tutorial we're going to look at how to unit test it all. If we look at one of those, the categories extractor, you'll see that we have a module called transformers which encapsulates all the transformers my model is going to use. The benefit of doing this is that if we have a team of data scientists, they can reuse transformers they've already created while experimenting with different models, so we get the benefit of code reusability. That's why all those transformers are packaged up into a module. And you'll see the code is not exactly trivial: there's some business logic, some parsing, we're creating dataframes, we're extracting different indices. So it raises the question: is this correct? That's the scene, the background, and the structure of the files; now we're going to look at what we can do with all of that.
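For reference, here is a minimal sketch of the kind of reusable scikit-learn transformer the talk describes. The class name, the region mapping, and the column names are illustrative assumptions, not the repository's actual code:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class CountryTransformer(BaseEstimator, TransformerMixin):
    """Illustrative: translate a country code into a coarser region feature."""

    REGIONS = {"BE": "Europe", "FR": "Europe", "US": "North America"}

    def fit(self, X, y=None):
        return self  # stateless transformer: nothing to learn from the data

    def transform(self, X):
        out = X.copy()
        out["region"] = out["country"].map(self.REGIONS).fillna("other")
        return out

# A transformer like this slots into a Pipeline ahead of the estimator.
df = pd.DataFrame({"country": ["BE", "US", "??"]})
print(CountryTransformer().transform(df)["region"].tolist())
# ['Europe', 'North America', 'other']
```

Packaging such classes into a shared transformers module is what gives the team the code reuse the speaker mentions.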
Cool, so: testing for data science. First, when we talk about testing, what is it we want to test? There are so many things we could test. On one side you've got the difference between functional and non-functional requirements. Functional requirements are simply: is it doing the right thing? Given an input, is it producing the output I'm expecting? For example, is my model outputting predictions in the expected range? Is it two categories, is it ten categories? If we have a finite set of outputs, is it actually producing a result within that set? That's a purely functional requirement. Similarly, we've written all this code that does preprocessing: it takes a JSON object, extracts different attributes, maybe does one-hot encoding. That's complicated business logic that we want to test. So that's the functional part.

We also need to worry about non-functional requirements. We no longer live in the days where we have a Python notebook, call .predict, and something happens. The model needs to be packaged; it's going to be served by some sort of endpoint. So when my model is deployed, can it handle requests? Can it cope with a thousand requests per minute? Is there some sort of SLA, so that when I make a request I get a response within a certain amount of time? The classic scenario is fraud detection, where delays matter: we want to detect the fraud right now, not in ten minutes. Those are non-functional requirements, and because data science has become complex and more and more specialized, different specialized skill sets handle those different requirements: non-functional requirements typically sit with the data engineering team rather than the data science team.

On top of that, you might have other concerns: not just whether it's doing the right thing, but how well it's performing. When you git push and test your model, do you have a performance improvement compared to the previous iteration of your model? Can we detect those changes? What test dataset did we test on, and will that have an impact on the measured performance? Those are the questions we need to ask ourselves now that we're dealing with code that will be deployed to production. In this tutorial I'm going to focus on functional requirements and functional testing.

Does this look familiar to anybody? Why are we testing? There are different reasons, but one of them is that we want to avoid catastrophic failure in production. Who works in financial services here? If suddenly the transaction processing fails, there are a lot of unhappy people. We want to avoid that. Testing is about gaining confidence that we've got a correct implementation, and building that confidence means we hope to avoid the scenarios where something fails.

Unfortunately, things are really complicated. In the past, data science was, well, it's still hot, but it was more academic: get a Python notebook, it's great, pandas, boom, predict, then what? Today it looks like this. I didn't make this diagram; it was produced by a team at Microsoft for a leading conference on software engineering, so a software engineering conference is talking about data science, which is quite cool. The data science workflow has become increasingly complicated, with multiple steps involved. Take the data: that's just one step. What data do we have? Where is it stored? What's the process to collect it? It's in different formats: we've got JSON, CSV, pickle files, joblib files, Parquet. So there's a dedicated team of specialists handling just data collection, and there are multiple database technologies that let you store different data formats with different trade-offs.
Now, what about the cleaning step? Data is messy: there are unknown values, the schema might be completely messy, we might not have everything we want. So who is responsible for the cleaning part? Then, if we're doing any sort of machine learning, there's labeling of the data, and there's feature engineering: what are the features of interest in your business domain that will have an impact on the model training? That's Python code we're going to have to write, and presumably we have to test it, because it's complex business logic. Only then does the training come in: apply some fancy model, evaluate it, learn from that, then go back to the feature engineering step, maybe get the stakeholders involved to collect more requirements, and so on. Once you've done that, well, it's on your computer; how do we deploy it as an application for users? Then we need to worry about cloud services, about how to wrap this up into HTTP endpoints, where it's going to live, how we test that, and who owns it. Then we need to monitor the requests: what sort of requests are users making, what's the output of the model once it's deployed, and how do we evaluate that and learn from it? So the data science workflow is really complex, and each of these modules is a body of software and code that we need to test against functional requirements, to build confidence that we have the right implementation.

On top of everything I've said, we have to deal with dependencies: libraries that we package into the application, which have different versions, different behaviours, and backward compatibility issues. We have to deal with web APIs, live systems whose behaviour we have no control over and which might change, plus databases and cloud services. So there's a ton of complexity between a dataset and a model deployed in production, and that makes our life as data scientists a lot harder when it comes to proving that our work is functionally correct.

So, the challenges. As I said, there are many different data formats. Software developers are used to collections, lists, numbers, classes, and objects; those are simple to unit test, it's a well-known problem. In data science we suddenly have to deal with pandas dataframes and JSON objects which may have missing attributes, and that makes unit testing challenging. We have to work with databases and web APIs; those are dependencies in our code, and those dependencies might produce different outputs depending on which day we run the code, and that has an impact. We're also dealing more with black-box models: something like a neural net may look like it's performing well, it's got lots of parameters, but I have no idea about its internal implementation. When it comes to unit testing, that knowledge is removed from us, so we really have to do black-box testing. The best we can do with those models is to test ranges, because given an input, the output might be non-deterministic or fall somewhere within a range, and that changes how we have to test.
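Here is a minimal sketch of what "testing ranges" for a black-box model can look like. The tiny logistic regression is a runnable stand-in; in the tutorial this would be the Kickstarter model instead:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def test_black_box_outputs_stay_in_legal_ranges():
    # Stand-in model trained on toy data, in place of the real pipeline.
    X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
    y = np.array([0, 0, 1, 1])
    model = LogisticRegression().fit(X, y)

    # We can't assert exact predictions, but probabilities must be in [0, 1]...
    proba = model.predict_proba(X)[:, 1]
    assert np.all((proba >= 0.0) & (proba <= 1.0))

    # ...and predicted labels must come from the finite output set.
    labels = model.predict(X)
    assert set(labels) <= {0, 1}
```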
Now, because the data science workflow is so complex, there needs to be enhanced collaboration between different specialists: data scientists and data engineers have to work together. That means we have to worry about code maintenance, code modularity, code reuse, and APIs, to facilitate productivity. Those are new challenges. It's no longer a notebook; it's now deployed in production for people to use. No worries, this is fine, everything is burning, it's okay. We're going to look at a series of tools that will hopefully be helpful on the path towards functional correctness.

So why software testing? To summarize: maybe the least interesting reason is finding bugs. Of course that's something we want: there may be situations we haven't thought of, and writing a test case for them can highlight an issue in the code which we can then correct. That's bugs in the implementation. More importantly, it's about removing the fear of change, i.e. regression testing. Say you have a fully implemented, working model and preprocessing pipeline: if you make a change to it, how do you know the behaviour is preserved and you haven't introduced a new bug? Testing removes the fear of introducing changes. It also enhances the debugging process: instead of a big fat exception in your console, you get meaningful assertions telling you in which scenarios things work or fail, which helps you find the root of the problem. And generally, it's about increasing confidence before we deploy to production. That's why software testing is important for the data science workflow.

Let's review the tools available in Python before we focus on pytest and Hypothesis. First, there is something called doctest. Has anyone heard of doctest? It's really cute, it sounds cute, but I think it's pretty useless, and here's why. We all care about documentation; we've always been told documentation is really important. I look up a function, and it tells me what it's meant to do, its parameters, what it returns; that's important, and pylint will flag it if you don't have your docstring. On top of that, you can include the thing I've bolded: using, how do you call this symbol, the greater-than sign, three greater-than signs, then basically Python code, and just below it the output we expect when that code is executed. A doctest essentially lets you test your documentation: you have the unit test embedded within the docstring, and it serves as a helpful example of how the function is intended to be used. So it is useful in one sense: you look at it and see an example of how to use the API, and on top of that it acts as a test. But really its purpose is only to validate the documentation. It's not about validating correctness, and it's not about being comprehensive over all the scenarios we want to test; it's just an example. So doctest is not a tool for functional requirements testing, it's just documentation, and there's no natural fit with a continuous integration or continuous deployment pipeline; it's code mixed into documentation, and it doesn't cover all the test cases. So that's not really a tool we should lean on as data scientists.
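A minimal sketch of the doctest style described above; the function is illustrative:

```python
def to_region(country_code):
    """Map an ISO country code to a coarse region.

    >>> to_region("BE")
    'Europe'
    >>> to_region("??")
    'other'
    """
    return {"BE": "Europe", "FR": "Europe"}.get(country_code, "other")

if __name__ == "__main__":
    import doctest
    doctest.testmod()  # runs the >>> examples and compares their output
```

The `>>>` lines are executed and the printed result is compared against the line below, which is exactly why it validates documentation rather than providing comprehensive functional coverage.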
However, there is another package, unittest, which is part of the standard library. Raise your hand if you're using unittest. Quite a few of you, perfect. It lets you do unit testing in Python, and it's really popular with developers, but the whole philosophy and API of unittest weren't designed for Python; they were designed for object-oriented languages like Java. Java has a library called JUnit, which is class-based with methods, decorators, and so on, and that's where the API comes from. So unittest has an object-oriented API, which means more boilerplate: you have to define a test case, subclass TestCase, introduce methods, and even the way you write an assertion is a method call. That's not as convenient for data scientists coming from a Python background, where you'd use a series of functions. That's why unittest is less popular in the data science ecosystem, and perhaps more popular in the software development world, where developers are used to languages like Java, Scala, and C++ with object-oriented APIs. So: more boilerplate, and limited diagnostics in terms of the assertions available in the unittest package.

Now, pytest is the cool kid in Python, and it's designed for Python. What that means is simple functions: you don't have classes, you don't have methods, you don't have twenty different assertion methods; it's just a plain assert. It has low boilerplate, which is what we like: we don't want to spend a lot of time writing scaffolding; it's Python, it's got to be simple. Then there are a couple of really interesting features that are easy to use. They're also available in unittest, but with more boilerplate. One of them is parametrized tests: that's the situation where you test one input against one expected output, but the template of the test stays the same, so in practice you'd copy-paste the same test and change only the inputs and expected outputs. Parametrized testing, which we'll cover later in this tutorial, lets you abstract what's being tested and inject the inputs and expected outputs, which reduces code verbosity and improves maintainability. Finally there are fixtures: the idea of decoupling data, or an environment, from your test. Sometimes you have to plug into a database, so you want some setup that is available to all your tests, and then you want to tear it down and close the connection. pytest makes that quite convenient; it's quite an advanced feature, and parametrized testing uses fixtures under the hood, as we'll see.
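To make the contrast concrete, here is the same check written in both styles; the `to_region` helper is an illustrative stand-in for the repository's country transformer:

```python
import unittest

def to_region(country_code):  # stand-in for the real transformer
    return {"BE": "Europe"}.get(country_code, "other")

# unittest: class, method, assertion-as-method.
class TestCountryTransformer(unittest.TestCase):
    def test_belgium_maps_to_europe(self):
        self.assertEqual(to_region("BE"), "Europe")

# pytest: a plain function and a plain assert, no class required.
def test_belgium_maps_to_europe():
    assert to_region("BE") == "Europe"
```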
Let's do a quick demo to make sure we're all on the same page before we look at more advanced things. In the files you've been provided, there's a series of exercises that we're going to work through, and then there are different tests. I'm going to look at unittest first. This is how you specify a unit test using the unittest library; everybody's familiar with that. And before you tell me, yes, you should use the specialized assertions for pandas, I know, we'll cover that. Basically you specify a class, subclass TestCase, define a method, and then write assertions using assertEqual, a method available on the TestCase class. When I do that, I can run this unit test. If you want to run all the unit tests in the project, you can pass the unittest command, and it runs all my tests, including this test case.

Now I'm going to introduce a failing test, just to show you what an assertion failure looks like. I'm going to unit test my country transformer, which takes a pandas dataframe, and assert on the output we expect. In this case I'll claim the expected output is "SA", which is not correct, but let's run it. What you'll see is an example of the failure that's produced: "Europe" is not equal to "SA". The region for Belgium is Europe and we expected "SA"; obviously this is just for demo purposes, so my implementation is correct and the unit test is wrong, but you can see what a test failure looks like.

How does that compare to pytest? Here's the same example using pytest. Immediately you can see we've removed the boilerplate: no classes, no methods, just the plain assert statement. I can introduce a failing test, really simple to write, and when we run it on the command line we also see a test failure: in this case I'm expecting the result to be "other", but None is produced. So let's dig through: what we want is to fill NA values with "other". I'll rectify the implementation, run the tests again, and now all my unit tests pass. So with unittest and pytest I highlighted a scenario where my implementation was incorrect, fixed the implementation, ran the tests, and now I know they pass. The main point here: pytest has low code boilerplate, and that's why it's really attractive for data scientists.

Awesome, let's do some more interesting stuff. We're going to dig down into pytest, and I'm going to get you to write your own tests. Here's an example of a directory structure. My preferred approach, my background being more in Java and software development, is to separate the source code from the tests: when you come to packaging and deployment, all your source files are available in one directory, which makes the packaging process really simple. Another strategy people use is to have the test module sit next to the module in the same directory: I have my transformers.py module, and in the same directory I have the test_transformers.py module. That makes things simple from an import point of view, because when you run your tests the module is readily available in the same directory, so no magic is required for importing; but it makes packaging the application more difficult, because you have to work out what's source code and what isn't before deploying. So I prefer to separate and decouple them.

A quick exercise, to make sure everyone's in the same mindset. This exercise is called test_exercise_1.py, and what we want to do is unit test a transformer. The transformer is the categories extractor, and you'll see it's a decent piece of code: it takes a JSON string as an argument, parses it, extracts the attribute called "slug", does some splitting, does some validation, and then returns the extracted categories. What I want you to do is write a pytest test for this method. To recap: with pytest you just write a function and specify your assertion using assert. If you want to run it on the command line, you can use the argument exercises, which runs all the unit tests in the exercises folder. If I run it, you'll see that some things fail and some pass, because they're not yet implemented.
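A sketch of the kind of test the exercise asks for. The `extract_categories` function here is a hypothetical stand-in written from the talk's description; the repository's actual transformer API will differ:

```python
import json

def extract_categories(raw_json):  # hypothetical stand-in for the extractor
    slug = json.loads(raw_json).get("slug", "")
    return slug.split("/") if slug else []

def test_extract_categories_splits_the_slug():
    # Mirrors the Kickstarter schema: "slug" holds "main-category/sub-category".
    raw = json.dumps({"slug": "games/playing cards"})
    assert extract_categories(raw) == ["games", "playing cards"]
```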
So I'll give you about five minutes to write one unit test using pytest for this exercise, and then we'll work through it together. Actually, instead of five minutes I'm going to do three; I've been told by my team that I should hurry up. In the interest of time, and I'm happy to publish the solutions, I'll git push them after the talk, let me just walk you through a sample solution. I've got a JSON string here, which is what gets passed to my transformer. It expects an attribute "slug" containing the categories, and the schema for the categories is the main category, a slash, and then the subcategory. I call the method, and I expect the result to be a list containing "games" and "playing cards"; that's what the transformer does. If you want to run the unit test and you use an IDE like PyCharm, like myself, you can run it straight from the IDE; otherwise run it on the command line with the exercises argument. Who has used pytest before? Maybe half, and everybody's used unittest, okay, cool. So that was a simple example of how to use pytest.

Now I'm going to carry on, because I really want to get to the more interesting topics. The first one is pandas. So far my extract-categories function was returning a list. That's easy; it's supported by pytest, and the assert statement can do an equality comparison on a list. With pandas, things are slightly different. Let's look at an example where we unit test the time transformer. The time transformer takes a pandas dataframe, extracts different columns, multiplies by some adjustments, and does some business logic to find out how many days there are between the deadline timestamp and the launch timestamp. We want to extract the number of days, and that number of days is going to be a feature in my model: how many days was the campaign open for? I want to unit test that this transformer is correct. So I've created a pandas dataframe here with some sample data as the input, I know what the expected output is, and I just want to assert that what we expect is the same as the result. That's what we might naively write in pytest. So far so good; let's run it, and boom: the output is not particularly useful. What we get is a fat exception saying "the truth value of a DataFrame is ambiguous": we're using a pandas dataframe in a boolean expression context, and as a result the diagnostic produced in the console is meaningless. I have no idea what's wrong or what's going on.

To show that, let me make this test fail again: run it, and you see it blows up; I have no idea what the bug is, no idea what the difference between expected and actual is. Now, pandas ships a testing package, with assert_frame_equal, assert_series_equal, and assert_index_equal, which let you do proper equality tests on a dataframe, series, or index. If I use assert_frame_equal and run my unit test again, we get something a little more useful: a nice trace that says the values are different, the left side has 29, the right side 129. Now I actually get an idea of what the problem might be, or what the source of the issue is.
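Here is the pattern in runnable form: comparing dataframes with pandas' specialized assertion instead of a bare `assert expected == result`. The transformer is an illustrative stand-in; the column names follow the talk's example but may not match the repo:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def add_days_open(df):  # stand-in for the repo's time transformer
    out = df.copy()
    out["days_open"] = (out["deadline"] - out["launched_at"]) // 86400
    return out

def test_days_between_launch_and_deadline():
    sample = pd.DataFrame({"launched_at": [0], "deadline": [86400 * 29]})
    expected = sample.assign(days_open=[29])
    # On failure this prints which values differ, e.g. "29 != 129",
    # instead of "truth value of a DataFrame is ambiguous".
    assert_frame_equal(add_days_open(sample), expected)
```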
So let's fix this test, and now if I run it again everything passes. The story here: as a data scientist you use pandas, pandas is our favourite library, and there are specialized assertions that give you meaningful diagnostics. If you didn't know about them, now you do, and I encourage you to use them to check that your dataframes are correct.

Cool, the next topic is parametrized tests. It can be quite repetitive to try different inputs and check the corresponding outputs. For example, here I've checked one input; if I were to check four different cases, maybe the deadline is zero, the launch timestamp is zero, well, if I'm paid by lines of code this is beautiful: I can just copy-paste, put zeros here and zeros there, and test what the transformer does for a zero timestamp; actually I've no idea what that's going to be. Cool, YOLO. Well, something's right in the code, good stuff. But you see, if I now have to come up with yet another scenario to test, I introduce lots of code duplication. Parametrized testing is the solution to that. Clearly there is a template here: there's a setup with the inputs, some basic transformation, a step where I invoke different functions, and then my assertions. The whole thing can be parametrized with the inputs and outputs I want to check, so rather than duplicating code, we create a template and pass in the inputs and outputs. That's what parametrized testing is; let me show you what it looks like in pytest. Isn't that sexy? We're going to make less money because it's fewer lines of code, but it's more maintainable, so everyone's going to be happier.

Let me explain a little of what's happening. The first thing you'll notice is that I've created a list with the different scenarios that I want to test. This list contains tuples; you recognize the tuples by the nice little parentheses. A tuple has the input on the left-hand side and the expected output on the right-hand side, so we can iterate through each element of the list and get an input and its expected output. That's the setup part, and that's nice, because you may collect those requirements from the business: the business tells you, given this input, that's the output I expect. Great: you can translate those requirements directly into simple Python code.

Then comes the dark magic: a decorator, a language feature in Python. The first thing, and this is the awkward part, is that you have to provide a string, and this string is the list of parameter names that will be mapped to the input and output; that mapping is applied to each tuple in the list. So we've got a sample dataframe and an expected dataframe, and that's the list we're passing. Now pytest does some clever stuff: it injects each sample dataframe and each expected dataframe into your test, for all your test cases. sample_df is mapped to this tuple in the first run, to that one in the second run, to that one in the third, and the same for the expected output.
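A minimal runnable sketch of this shape, reduced to scalars for brevity (the talk's version parametrizes whole dataframes; the values here are illustrative):

```python
import pytest

@pytest.mark.parametrize(
    "launched_at, deadline, expected_days",  # the awkward string of names
    [
        (0, 86400 * 29, 29),
        (0, 0, 0),            # the zero-timestamp edge case from the demo
        (86400, 86400 * 3, 2),
    ],
)
def test_days_open(launched_at, deadline, expected_days):
    # One template; pytest runs it once per tuple and reports each separately.
    assert (deadline - launched_at) // 86400 == expected_days
```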
So let's run that in the IDE and see what result we get. You can see three tests now pass, and the reason it's three tests is that there are three scenarios being tested. Parametrized tests allow you to be more productive: focus on the business requirements and translate them into code. [Answering a question:] Yes, correct: you have your list of variable names in the string, and that's mapped onto your function definition, but your tuple sizes obviously need to match. So: parametrized tests, I encourage you to use them, it's a great tool for writing maintainable tests. The slides have exactly the same code snippet I've just demoed.

There's another exercise; let's look at it together. In this exercise we want to write some parametrized tests for the time transformer, which is exactly what I was demoing. Let me show you the sample solution. Again, I'm setting up my test cases: there's a list of tuples, each tuple contains a pandas dataframe with the different inputs that I want to test, and you'll recognize the zero-zero case I was testing earlier. Setting up the data is the part where you actually need to think: what are the different scenarios I want to check correctness for? That's the painful part, because you have to consciously think of all the possibilities where things could go wrong; in a bit I'm going to talk about generative tests as a way around this issue. The template part is the easy part: once you know what the template is, I've got a time transformer, I apply the processing, and then I assert that the two pandas dataframes are equal. Let's run that: two tests pass, because I've got two samples I'm testing against. So you can combine parametrized tests and the pandas assertions; all those things work together.

Awesome. So we've learned the pytest basics, parametrized tests, and the pandas assertions, which are absolutely vital when it comes to data science. Now, how do you write good tests? We could go crazy and write tests for everything, but what makes a good test? The first thing is a simple pattern that lets you structure your tests and makes them more readable and recognizable, called the given/when/then pattern. It's not a design pattern in the sense of structuring software classes; it's just a skeleton for a test's behaviour. There's the "given" step, your preconditions, the inputs you're setting up; the "when" step, the behaviour you trigger, for example calling the transformer; and the "then" step, what it is you want to assert on. Here's a test that follows the given/when/then pattern: I've got my setup, with some nice spacing so it's obvious; I've got the part where we apply the behaviour, which I can immediately recognize just by looking at it; and then what it is we want to verify. What this means is that when your colleagues look at your tests, they'll like you, because the tests are really simple to understand; they know what's going on. So that's a good pattern to follow for readability and code maintenance.
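The skeleton, spelled out as a pytest function; transformer and values are illustrative stand-ins:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def test_time_transformer_computes_days_open():
    # Given: a campaign launched at t=0 with a deadline 29 days later
    sample = pd.DataFrame({"launched_at": [0], "deadline": [86400 * 29]})

    # When: the transformation under test is applied (inlined stand-in here)
    result = sample.assign(
        days_open=(sample["deadline"] - sample["launched_at"]) // 86400
    )

    # Then: the computed feature matches the expectation
    assert_frame_equal(result, sample.assign(days_open=[29]))
```

The blank lines between the three blocks are doing real work: a reader can find the setup, the action, and the verification at a glance.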
Now, there are a ton of libraries that let you write more declarative tests, essentially attempting to bridge the gap between business language and code. An assertion like `assert "b" in a and len(a) == 2` is very Pythonic; you need to be a programmer to read it. Someone with more of a business background, who might be involved in the requirements phase and the testing, might not be able to understand what's going on and will get bogged down in syntax. Assertion libraries, and PyTruth is one such library, developed by Google, attempt to bridge that gap. Another way to write this would be to say: assert that `a` contains exactly "b" and "c". And keep in mind we've not specified the order here: "c" and "b" in any order; we just care that they are contained in the list. Here's a bunch of examples of assertions available in PyTruth; there are at least a hundred assertions available, so I'm not going to cover them all, but I encourage you to look at the project, especially if business users are involved in the requirements and testing, because it can help close the vocabulary gap. For example, you can say something like "I'm expecting a string to start with the character a", which reads almost like English, or that it has a specific length, or that something is non-zero.

Let's look at a quick demo of that. In the notebook I've got PyTruth available; everything is loaded. I'm going to take the first row, this is a pandas dataframe, and set up some assertions on it. To give you an idea, I want to assert that the deadline is non-zero: great, that's passing, and when it fails, the diagnostic reads along the lines of "Not true that this number is non-zero". Those enhanced diagnostics are what's really helpful: when things fail there's a meaningful error message that tells you what's going on, which is especially useful if you've adopted continuous integration; you git push, you get a report from your server, and you have nice error messages you can quickly parse. Now a third example: asserting that the blurb contains the word "songs", and that's passing. If I assert it contains "blow" instead, let's see: nope, it's failing, and the generated diagnostic shows the original text ("original songs and music videos to jumpstart the...") and reports that the expected word is not contained in it, so we get an assertion error. PyTruth is maintained by Google, so that's a project worth looking at.
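A small sketch using the assertions the talk demonstrates (install with `pip install pytruth`); the values are illustrative, and the method names are taken from the talk and the project's README, so check the project for the full list:

```python
from truth.truth import AssertThat

def test_declarative_assertions():
    categories = ["c", "b"]
    AssertThat(categories).ContainsExactly("b", "c")  # order-insensitive

    deadline = 1571760000
    AssertThat(deadline).IsNonZero()

    blurb = "original songs and music videos"
    AssertThat(blurb).Contains("songs")
```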
So, on top of that: the given/when/then pattern lets you structure tests, and PyTruth lets you write more declarative tests that bridge the gap between the business vocabulary and the code. But how do you write maintainable tests on top of all of that? Tests are just code, it's just Python at the end of the day, so we should treat it like the Python code it is. Here's an example of a bad test. Who likes this test? Okay, no one, good stuff. What's bad about it? Yes: "test_one" means nothing, so if the test fails, the console says "test one: error", and what's that supposed to mean? Magic numbers, magic variables: what's 5, what's 1, what's 0? I have no idea what we're trying to achieve here. So yes, it sucks, and there's more to its sucking, which I'll explain in a bit.

Here's an example of a good test. We're not optimizing for number of characters, guys; we're optimizing for reading: the test you wrote once might be read a hundred times, so let's optimize for the read. A nice meaningful test name, just like the gentleman with the nice beard said: it states that the model predicts correctly with some standard input. In the construction of the predictive model, let's use named arguments to say what it is we're trying to do, or maybe even extract 5 into a named constant. I've got input features with nice variable names, and we're following the given/when/then pattern. More subtly, there's this assertion: self.assertTrue only tells you whether the boolean expression given as an argument is true or false, but I want to know what's going on. What number did we get? What are we comparing against? That diagnostic is going to be meaningless, just true or false; what we want is "you expected 12 but got 11", so there's an off-by-one error somewhere, and I can go back into my code and find the issue. Thankfully, with pytest's plain assert you get that out of the box, which is another reason to ditch unittest for pytest: with one keyword you get all those enhanced diagnostics.

So what are the best practices? Verbose naming is better. Test the behaviour, not the implementation: in my example I was actually accessing an internal field to check the state of my model, so if that changes over time with some refactoring, my tests have to be refactored too. Make sure you test the API, so the internal implementation can freely change while your tests keep working. And don't repeat yourself: avoid code duplication; parametrized tests are the solution for that.

Now, another tool the software development ecosystem has been using is test coverage. I want to be a little careful about encouraging 100% coverage. I definitely encourage getting the green bar: you want your tests to be passing, you don't want failing tests. What test coverage tells you is which lines of your code were not executed when you ran your tests. That's potentially useful for detecting things you're not testing, but it doesn't mean you've actually tested all the scenarios and behaviours; it just means those lines of code were or were not executed. It's just an indication, and it's typically integrated into a continuous integration pipeline: zero coverage really sucks, but you don't necessarily need 100%; you want an indication of what's potentially tested or untested. There's a project called pytest-cov, a plugin that integrates with pytest and gives you that indication. Let me show you what it looks like on the command line. If I run my coverage, you see this kind of output: in my model.py, lines 1 to 62, nothing is tested, and that's because I didn't write any unit tests for my model. I could have written a test, for example, that the predict method's output is in the expected range; I didn't do it. For my transformers, you can see there are different parts I'm not testing: clearly between lines 63 and 74 something is going on. Let's see what I did test: that's the time transformer, the code I was testing in the demo earlier, so that's been picked up, but everything else has not been tested. That's a useful report you can take away and then write more unit tests to increase the coverage.
But again, it's not an indication that your code is fully correct: all your lines of code might be executed while certain inputs still trigger different behaviour.

The next exercise is about refactoring a unit test. Say I've got a test written like this: how would you go about refactoring it? Clearly we can use the pandas assertions. Let's jump to a solution: this is the same test, better structured, separating out the preconditions, the code we want to apply and test, and the assertions. On top of this, we could have used assert_frame_equal rather than manually extracting the different bits and bobs and doing equality checks on them.

Now let's go on to something really fancy, which I have my doubts about in terms of how practical and useful it can be. It's kind of controversial, but I did find a bug in my code by running Hypothesis, so there is credit here. The reason I'm giving a big caveat is that I don't want everybody to jump on the buzzword bandwagon: "property-based testing, let's all adopt it". It's not that trivial. So far, everything I've shown you is what's called manual testing: I have to come up with the scenarios I want to test against, testing specific inputs against specific outputs. That's easy to write: pytest, boom, given/when/then, assert, it's all great. It's deterministic: given an input, I know exactly what the output is. And it runs fast, it's a simple function to execute; unit tests are meant to execute fast, because when I make changes to my code I want an immediate regression report. I want to know what's gone wrong; I don't want to wait ten minutes for the whole thing to run, have a cup of tea, and come back. I've got momentum, I want to fix it now.

But there's a massive downside. Think about the integer space: it's practically an infinite domain. Okay, a 32-bit int is finite, but there are so many numbers available and we're testing one of them, so it's not really representative of the available state space. When I write a specific test there might be things I'm missing, and typically it's things like negative numbers; we always think about positive numbers. What if the data I'm reading from an API produces a negative number, which might be an error code? How does that impact my implementation? We don't really think about that kind of stuff, because it's unnatural. So manual testing: lots of pros, easy to set up, but it might not check the full spectrum of the space.

Generative testing is the idea that unit tests are created for you: the inputs are generated, so you don't have to worry about them. That sounds cool. The pro is that it's potentially far more comprehensive, because we can navigate and search a much wider state space; that's the big benefit of generative tests. The big con is that it's harder to write: if I give you a bunch of generated inputs, you don't know what the outputs are, so how can we write an assertion? We can't say "given x, we expect y", because you're generating the inputs and you don't know the outputs for those inputs. Conceptually, it's really difficult. It's also non-deterministic: a generative test might generate different numbers on different runs. You might ask for a hundred generated tests, but what if
your state space is a billion? You're only going to get one subset of it this run and another subset the next run, so you have this non-deterministic behaviour, which is really annoying when you think about continuous integration: if you git push the same code, you don't want different test reports, or you're going to go crazy not knowing what's going on. We want deterministic behaviour when we test. The other downside is that it's a lot slower: if you generate tests, there's computational activity by default, so it's always going to be slower. I'm not saying it's super slow, I don't have an exact equation for it, but it's definitely slower than running one test, because you're generating stuff. So those are the caveats.

Now let's look at generative testing a little. The intuition is that instead of providing specific inputs and outputs, you specify a range of inputs: I want dataframes with two columns that contain integers. That's the range of inputs; any integer can be generated for us. And you need to specify a rule: what is it we can check? I want to check that the output is still a pandas dataframe with two columns. But because I don't know what the correct results are going to be, I specify a property: I'm expecting all the numbers to be positive, that's a property; or I'm expecting the length of my dataframe to be two, that's another property. It's an invariant we expect, which we can specify. It's not asserting that a specific input gives a specific result; it's saying there is some property that I expect to always hold. If you have that rule together with generated inputs, then test cases can be generated on the fly, and that's the premise of property-based testing.

Let's look at an example. I'm using a project called Hypothesis, which is open source; there's quite a lot of work going on there at the moment, it's moving rapidly. Let me explain this code before the demo. The decorator, given, is provided by Hypothesis, and everything in bold is what we call a strategy. You need a strategy to generate inputs. My strategy here is: I want a pandas dataframe; there's a pandas extension for Hypothesis, so that's good, as data scientists we can use that. It says I want a dataframe with two columns, one called "goal" and one called "static_usd_rate", and in those two columns I'm expecting floating-point numbers. So I've defined the range of the state space: definitely dataframes, always two columns, and the values are floating-point. Using pytest's beautiful injection features, this is passed as the argument to the function, as a sample dataframe. So what Hypothesis does is generate random dataframes that have two columns of floating-point numbers, and each one gets passed to my unit test. That's how we generate tests. The problem I have is that I don't know what the result is going to be, because I'm trying to test the implementation. So the best I can do is set up a property, an invariant: I expect the length of the sample dataframe, the generated input, to always equal the length of the result. If somehow the result ends up with more or fewer rows, then something's going wrong in the code somewhere. That's an example of a property.
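A property-based test in the shape just described, using Hypothesis and its pandas extension. The column names follow the talk; the transformer is an illustrative stand-in for the repo's goal-adjustment step:

```python
from hypothesis import given
from hypothesis.extra.pandas import column, data_frames

def adjust_goal(df):  # stand-in: convert the goal to USD
    out = df.copy()
    out["adjusted_goal"] = out["goal"] * out["static_usd_rate"]
    return out

@given(data_frames([
    column("goal", dtype=float),
    column("static_usd_rate", dtype=float),
]))
def test_row_count_is_preserved(sample_df):
    result = adjust_goal(sample_df)
    # Invariant, not a specific expected output: no rows appear or vanish.
    assert len(result) == len(sample_df)
```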
Let me do a demo and show you how I actually found a bug using this stuff: test_transformer_hypothesis. First I'm going to run the decorated test I was just showing you. Let's run it in the console: using run.py you can say hypothesis, and look at the output before I go back to the code: 100 passing examples, zero failing examples. I configured Hypothesis to generate 100 examples; you can configure that to a thousand or ten thousand if you want, but obviously that slows your tests down. What I was checking there was the column structure. Now I might want to do something slightly different and set up another property: I expect the adjusted goal to be greater than zero. When the pandas dataframes generated by Hypothesis get pushed through the transformer, I get a result, and I want to check that all its elements are greater than zero. Now if I run Hypothesis, we see some inputs that had not been tested: in this case it finds an assertion error when the input is 0, or infinity, we've got some issues. Let's run this again, and it immediately finds the negative numbers for me. Notice it keeps producing the same failing test case: when a test fails and you haven't fixed it, Hypothesis checks the previously failing example first on the next run, which is a smart way to check that you've corrected the implementation.

Now let's jump to a better example. To do that, I'm going to show you the sorts of strategies you can create, because at the end of the day this is only going to be useful if we can generate the different kinds of inputs we care about: pandas dataframes, JSON, text, integers. Using Hypothesis I can import strategies. In this case I'm going to import two: one called text, which generates arbitrary text, and one called lists, which generates arbitrary lists with some number of elements. This code here says: I want a list of text; please generate a bunch of examples. If I run it, you'll see it generates some empty lists, and then it does weird things with different Unicode characters, because that's always going to create issues down the line if you're doing some decoding; empty lists, lists with an empty string, and so on. So that's a strategy for generating lists of text.

Perhaps more interestingly, when it comes to data science we deal with dataframes, and we've learned about assertions on dataframes, so how do I generate them? This is the data_frames strategy with two columns, generating floats. If I run it, you'll see it generates a dataframe with two columns and one row with some floating-point numbers, then another one, and if I run it again it checks the NaN situation, which is potentially going to break things somewhere. Those are the test cases you want to make sure your code is robust against.

Now, my favourite, and the reason it's my favourite is that in practice you have to deal with JSON data, and JSON data is typically a string that represents nested dictionaries. So how can we generate nested dictionaries? That's the real deal. There is a strategy called fixed_dictionaries, which I'm going to import, and what it takes as an argument is the structure of the dictionary you expect.
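A quick sketch of exploring strategies interactively, as in the demo. `.example()` draws one sample and is meant for exploration at the REPL, not for use inside tests:

```python
from hypothesis import strategies as st

print(st.lists(st.text()).example())                  # e.g. [] or ['\x0b', 'ø']
print(st.fixed_dictionaries({"slug": st.text()}).example())
# e.g. {'slug': 'Ⅷ\x85'} — the weird Unicode you never think to test
```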
This dictionary is simple: it has one key called "slug", and the value is text. If you run it, you'll see it generates some random stuff every time, which is great, because this is the hard stuff that we never think about. So now I've got a tool that generates all those inputs, and when you deploy to production, there will be some smartass who tries to call your predict endpoint with some weird characters, so you want to make sure you test against that. I'll use this example in a bit for the demo.

Another one I particularly like is the regex strategy: you can generate text from a regular expression. For example, here I want a string made of the characters a, b, and c, two characters long, and what it returned was "ab"; ABBA, my favourite music band, just kidding. You can see it generates text that fits the regular expression, which is quite handy, especially if you have some understanding of the structure of the data you're expecting.

And you can combine the whole thing. For example, I can combine the regex strategy inside the dictionary strategy, which is itself part of a list strategy, so I can generate a list of dictionaries whose values follow regexes. All this stuff is composable, and that's really the power of Hypothesis. For example, here I'm going to generate a dictionary, and this is why I import json: it generates a dictionary that follows this recipe. The recipe says you need a key called "data", and the value is generated from this regex strategy. And map and filter are available in Hypothesis, which allow you to transform a strategy or filter it further. So what I'm saying here is: given the strategy that generates dictionaries, I want to generate a JSON string from it, so I call map, passing json.dumps, and generate an example. Each example is a string containing a JSON object. So this one line of code generates inputs that are JSON objects, represented as strings, following a specific schema. That's really powerful.
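Putting that composition together in one runnable sketch; the schema (a "data" key, a two-character [abc] pattern) is illustrative:

```python
import json
from hypothesis import strategies as st

# A fixed-shape dictionary whose value comes from a regex,
# mapped through json.dumps to yield JSON strings.
json_strings = st.fixed_dictionaries(
    {"data": st.from_regex(r"[abc]{2}", fullmatch=True)}
).map(json.dumps)

print(json_strings.example())  # e.g. '{"data": "ab"}'
```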
So how do we use all of this stuff? I've shown a couple of examples of how to generate input: you can do it through values, things like integers, text and booleans; there are strategies for collections like dictionaries and lists; map and filter are available, so if you're generating numbers and want to restrict their range, or there's some property you want to filter on, that can be useful; and you can define your own strategy by implementing an interface in hypothesis. So the problem of generating inputs is solved. The huge problem that's left is: how do we specify properties? Because conceptually, that's difficult. Here's what the community has come up with. You can specify invariants: if you know the result is going to have a specific shape, a certain number of columns for example, that's an invariant, something you know is not going to change. You're not testing the output, you're testing a property. The size of a collection: you expect a list to always be of size 3 when it comes out of a pre-processing function; you can't assert on the exact contents, but you can test the property. Or you expect all the generated values to be positive, or negative, or within some threshold; you can test that too. So that's invariants; those are conceptually fairly simple, and you can specify them right away. Where things get a bit more difficult is when we want to do more useful stuff. There's another class of properties called round trips; that's typically when you have two methods that work together. If you insert an element into a list, or insert something into a database, there's probably another method available that checks for containment, so you can specify a property that says: when I insert something, I expect it to be retrievable. Those two methods work together, and in the JSON example, json.loads and json.dumps work together: if I encode something and decode it, I expect the original input back; that's a property. If I reverse something twice, I expect the same input. That's what we can specify, but you see, conceptually it's not really business logic; it's thinking about properties and contracts, which is why it's difficult to adopt a tool like this. Now, one that is simple is "does not crash": you call a function and expect no exceptions to be thrown. A classic example: if you call max on an empty list, what do you think happens? A ValueError, because it's not possible to find the maximum of an empty list; you need at least one element. Those are the things we don't think about: hypothesis will generate empty lists, and if your property says no exception is thrown when you call this code, the empty list will break it. That's a different kind of property. And one that can be useful, if you know what you're doing, is called the oracle test. Let's say you're testing your implementation and there's an alternative implementation that you know is correct, which acts as your oracle: given an input, you pass it to your function and to this magic oracle that's always correct, and compare the two results. That's the closest to doing classical assertions, and the example usually brought up in the community is sorting: there's a built-in sorting function in Python, and we hope it's correct, so if you've written a new sort you can test its output against this oracle that we know is correctly implemented. Having a fully working reference implementation is not always possible, but it works well when you have an interface and you're coming up with a more optimised version of the same algorithm; then you can use the less optimised one as the oracle. So that's where you can use property-based testing, and you will find problems in the implementation.
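A minimal sketch of what these property styles look like with hypothesis; my_sort and safe_max are hypothetical stand-ins for the code under test:

```python
import json
from hypothesis import given, strategies as st

def my_sort(xs):   # hypothetical function under test
    return sorted(xs)

def safe_max(xs):  # hypothetical wrapper expected to tolerate empty input
    return max(xs) if xs else None

# Round trip: encoding then decoding should give back the original value
@given(st.dictionaries(st.text(), st.integers()))
def test_json_round_trip(d):
    assert json.loads(json.dumps(d)) == d

# Does not crash: the property is simply that no exception is raised;
# hypothesis will generate the empty list that breaks a naive max()
@given(st.lists(st.integers()))
def test_safe_max_never_crashes(xs):
    safe_max(xs)

# Oracle: compare the custom sort against the built-in sorted()
@given(st.lists(st.integers()))
def test_my_sort_matches_builtin(xs):
    assert my_sort(xs) == sorted(xs)
```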
Now let me show you the bug that I found in my own code, one I'd never thought about; it was quite an enlightening experience for me. I want to test my categories extractor, so let's go to the categories extractor and I'll show you the problem. Here's the categories extractor: it takes a JSON string, and the JSON string looks like this: it has a key called slug, and the value is something like "web/python". That's the input, and what I want as a result is web and python as separate categories. Sounds simple, right? So I've written some code for that: it parses the string as JSON, gets the slug, falling back to a default of a single slash if it's missing, and then splits on the slash; so with the default we get two empty categories. First, let's write a test for it using hypothesis, spoiler alert. This is my test: I've created a strategy that generates a fixed dictionary where slug comes from a regex, a string with a slash in the middle, and mapped it through json.dumps, so I get a string. So far so good. And the property I'm expecting is that the list should always contain two elements, because there's the main category and the sub-category. Let's run that... nightmare. What's gone wrong? Let me copy it into the console so we get something a little nicer. What happens when we get this kind of input? See, I never thought of that: what if we ever get a double slash in the value of the slug? My code returns a list with three elements in that case. That's a situation you never think about, and the reason, if I go back to the implementation, is that we're splitting on the slash, there are two of them, so we get three elements. I was not expecting that: the regex guarantees one slash in the middle, but nothing stops the generated text around it from containing another one. Now let me fix the code: what I need to do is pass a maxsplit of one. With that overload of split, I'm essentially saying: split the first time you see a slash, and keep the tail intact; we no longer split the tail, where there might be another slash that would otherwise produce three elements instead of two. So I refactor the implementation to split only on the first slash, run the test again, and what we'll see is that it's passing. (Another test is failing somewhere else, but let's run it from the solution.) So that's a really subtle scenario where hypothesis was genuinely helpful in detecting a bug.
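A minimal reconstruction of the fixed extractor and the hypothesis test, assuming the details from the demo (a slug key holding "main/sub" categories); the regex is an illustration that deliberately lets the surrounding text contain slashes, which is what exposed the bug:

```python
import json
from hypothesis import given, strategies as st

def extract_categories(payload: str) -> list:
    """Parse '{"slug": "main/sub"}' into ["main", "sub"]."""
    slug = json.loads(payload).get("slug", "/")
    # maxsplit=1: split only on the first slash, so a stray second slash
    # stays inside the sub-category instead of producing a third element
    return slug.split("/", 1)

# ".+/.+" requires one slash, but ".+" can itself contain slashes,
# so hypothesis will eventually generate values like "a//b"
slugs = st.from_regex(r"\A.+/.+\Z")

@given(st.fixed_dictionaries({"slug": slugs}).map(json.dumps))
def test_extractor_returns_two_categories(payload):
    assert len(extract_categories(payload)) == 2
```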
How much time have I got left? Oh, perfect, ten more minutes. A related topic to property-based testing, in the bigger picture, is static type checking: that's also a kind of contract we want to check, because we say we expect a value to be an integer, or a string, or a list. If you don't know this project, it's a good one to check out: it's called mypy, and it adds static type checking to Python. When you declare that a function takes a string and returns a string, mypy will try to detect violations before you run your code. So that's another form of testing: not on behaviour or implementation, but on types and values. It's a cool project to look at. Unfortunately it's pretty limited right now to things like collections and plain values; there isn't support yet for pandas, so you can't specify a type signature for a pandas DataFrame with two float columns and get static type checking on it. But there's progress for numpy: there's an experimental project adding support for type checking things like numpy arrays, so you could say this array has a dtype of integer and have mismatches detected before you execute the code. There's ongoing work, so that's looking good for us; hopefully over the next, I don't know, six months or years, there'll be more and more type-checking tools available in the Python ecosystem.
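A minimal sketch of what that looks like; the function is made up, but running mypy over a file like this flags the bad call without executing anything:

```python
# sketch.py -- check with:  mypy sketch.py
def adjusted_goal(goal: float, exchange_rate: float) -> float:
    """Convert a campaign goal into a common currency (hypothetical example)."""
    return goal * exchange_rate

adjusted_goal(1000.0, 1.2)   # fine
adjusted_goal("1000", 1.2)   # flagged by mypy as an incompatible argument type
```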
All right, final topic of today: it's called mocking. Who's this guy? I actually don't know his name; who knows what his name is? Which one is he, the one on the left? They look quite similar. Well, this is the concept of a double: when you're in a dangerous situation that might impact your health, what do you do? You call in a stunt double who takes the risk for you. A double is someone who replaces you in a specific situation. Can we have doubles for testing? Yes we can. Why would we want them? Here's the situation: you've followed good coding practices using scikit-learn transformers and pipelines, you've extracted your processing steps, it looks good, awesome. However, as I explained at the beginning of the presentation, the data science workflow is really complicated: we have so many dependencies, databases, APIs, cloud services, so your code is very likely calling a live system somewhere that returns a result your pre-processing depends on. In this case, say I'm loading something from AWS: I've got a dependency on the AWS client in my code, and now I try to test process_data. How do I test it, when every time I call this code it hits an actual AWS bucket? We're in a situation where, in order to test the code, I somehow need to kill this dependency, or replace it with something I can control, because a live system, depending on which way the wind is blowing, might give a different output on a different day. How do we get predictability when we depend on live systems? That's the problem mocking and test doubles solve. There are other issues too: if I call a REST API, besides varying output, it might be down, the server might be out of service, and that means my test fails, your CI system fails, there's fire everywhere. And clearly, connecting to an endpoint or a service is slow, because it's a call over the network, so we pay a latency cost, which goes against a principle of unit testing: we want failures fast. So we run into a whole series of issues with unit testing when we have to deal with dependencies, which is the real world of the data science workflow. What's the solution? A test double. It's essentially an object that has exactly the same interface as your dependency, but you control it: you can say what you want it to return, or how you want it to behave. It's a way to replace the dependency with something you have control over, and the benefits are that it's fast and you get predictability: the outcome is always deterministic. Before I show the demo, since it's a big topic: several popular libraries provide testing suites that implement doubles for you. moto is a testing library for boto, the way to interface with AWS; for the different services, like connecting to S3, there's a double that doesn't actually connect to AWS but returns pre-canned results through the same interface. PostgreSQL tooling provides an in-memory database which you can set up and tear down for tests, so you don't have to connect to a live database. And Flask, if you implement endpoints for your models, provides a test client that doesn't actually make requests to a server; it's all encapsulated. So here's an example of how to set up a mock in Python, and then I'll show you a demo. Let's say I've got this preprocessor object, and this preprocessor has a dependency on a DAO, a data access object, something that interfaces with a database. I don't know what that database is, but I've got an interface for it; that's the benefit of the DAO. The problem is that the DAO connects to a live system, so I want to kill the dependency and control it. You create a MagicMock (I know, it sounds fancy, but that's what it's called), and what you can now say is: whenever some code somewhere calls this method or function, in this case load_data, I want to control the return value. That's something I set, so I get predictability, and here the return value is going to be data. So whenever load_data is called, it returns pre-canned information; we no longer need to connect to the live system. And what you can also do with mocks is specify assertions, for example that this function was called exactly once; that allows you to verify how your code interacts with the dependency. There are many assertions available: you can check how many times a function was called, whether once or multiple times, whether it was called with specific arguments, or whether it was not called at all. That's what a mocking framework lets you specify. So that's verifying interactions; on the other side, you can also control the return value, or have a specific exception thrown. Those are the two sets of features available. Let's look through an example. What you'll see is that I've got a transformer here, a country transformer, which actually connects to a REST API: it passes the country code as an argument, parses the result from JSON, and returns the region. For example, for "CA" it produces "Canada"; for "BE" it returns "Europe", the region the country is associated with. I previously had an implementation that was hard-coded, which unfortunately doesn't scale to every country that exists in the world, so I've decided to have my transformer use an API. The problem I've got now is that when I run a unit test, I depend on the API, which might be rate-limited, so my tests are going to be flaky. Using a mock, what we can do is say: here's my country transformer, and whenever the function get_region_from_code is called, here's a mock, and I'm going to specify a side_effect. A side_effect here lets me provide an iterable, which means every time the function is called we advance through those values. If you only need one value, you use return_value, which is what I had on the slide; if the function is called multiple times and you want a different result each time, you give side_effect an iterable. So the first time it's invoked it returns "Canada", the second time it returns "Europe". I create my pandas DataFrame, which has a country column with two elements, transform it, and now what I can check is that this function, get_region_from_code, was actually called, somewhere inside the transformer while the processing is happening. So what I can do now, thanks to this mock, is verify that the correct interaction is happening under the hood with what used to be a live system. Let's run this, and you'll see the unit test passes; the implementation is correct. And if I dig down into my transformer, what's happening is actually a simple implementation: I use map, and map iterates over all the elements in the DataFrame, and to do that it needs to call get_region_from_code; that's where my assertion comes in, to say it's been called twice with different arguments.
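A minimal sketch of both patterns with unittest.mock; the names mirror the demo, but the exact signatures and data are assumptions:

```python
from unittest.mock import MagicMock

import pandas as pd

# --- return_value: one pre-canned answer, plus an interaction assertion ---
dao = MagicMock()
dao.load_data.return_value = pd.DataFrame({"goal": [1000.0]})  # hypothetical schema

result = dao.load_data()            # in real code, the preprocessor calls this
dao.load_data.assert_called_once()  # verify the interaction happened exactly once

# --- side_effect: a different answer on each successive call ---
get_region_from_code = MagicMock(side_effect=["Canada", "Europe"])

df = pd.DataFrame({"country": ["CA", "BE"]})
regions = df["country"].map(lambda code: get_region_from_code(code))

assert list(regions) == ["Canada", "Europe"]
assert get_region_from_code.call_count == 2
```

In the actual demo the mock would be patched into the transformer rather than called directly; the sketch just shows the two control knobs, return_value and side_effect, and the interaction assertions.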
So that's mocking in a nutshell: a really useful technique for substituting a dependency on a live system with something you can control, so you get predictability. All right, thanks for coming to my tutorial. We're going to have a booth tomorrow and on Saturday and Sunday, so come say hi, and thanks for coming. [Applause] We'll push the solutions in one of the breaks so you can download them and do the exercises in your own time as well.
Info
Channel: PyData
Views: 3,154
Rating: 5 out of 5
Keywords: Python, Tutorial, Education, NumFOCUS, PyData, Opensource, download, learn, syntax, software, python 3
Id: WTj6T0QdHHM
Length: 88min 40sec (5320 seconds)
Published: Thu Jul 18 2019