Christopher Fonnesbeck - Introduction to Statistical Modeling with Python - PyCon 2017

Captions
So let's officially start — okay, the camera's rolling. All right, welcome to Intro to Statistical Modeling, the afternoon session of PyCon 2017, and welcome to Portland. I was sitting this morning in a hipster coffee shop, like the ones I frequent where I live, drinking a pour-over with Bon Iver playing in the background, and I just felt like I was already here. So I'm in Portland and I'm feeling great. My name is Chris Fonnesbeck. I'm an assistant professor of biostatistics at the Vanderbilt University Medical Center — not an employee of Vanderbilt University; we split, we're a different institution now. I work primarily on epidemiology and infectious disease, as well as Bayesian statistics, computational statistics, and evidence-based medicine. I'm also a Python developer, and this is my first PyCon, or at least the first one I've spoken at. I've been to lots of SciPys, so you may have seen me there, but I haven't been to many of these.

First, some logistics. The first thing is sticky notes. How many of you have taken Software Carpentry before? Three or four — okay. This is a little system I've picked up for teaching sessions like this. Each of you will get two sticky notes — I found star-shaped ones, which is really cool — one of each color. When you have an issue, you stick up your yellow sticky; I use yellow for problems because it's nature's warning color — bees and poisonous fish all have yellow on them — so we'll use that for trouble. And occasionally, when you're doing an exercise, I'll ask you to put up your blue sticky when you're done, so I'll know once we're about 70% of the way through. So each of you will take a blue and a yellow; just pass these around, half on one side, half on the other. Did I mention they're star-shaped? Press them down, take one of each color, and pass them along — this tends to work really well. And if you've got a question of any kind during the session, technical or otherwise, you can put your yellow sticky up, so you don't have to sit there with your arm up.

So that's the sticky notes. This session is supposed to run through to about 4:40, but we mercifully get a break in the middle when coffee and pastries arrive, around three. Apparently the exact timing is left to my discretion, so I've got it at about 3:10 — so all the good pastries won't be gone — and we'll play that by ear. I've already told you about the GitHub repository, so I think that's all the logistical housekeeping. Now, in terms of the course: I wrote the paragraph I used to apply for this tutorial — it seems like years ago now — and it's a little bit out there, but it's mostly still true; it's probably going to be better than that. The motivation behind it is a couple of things. How many actual statisticians are in this room? Good — because you'd be very bored if you were already a statistician. How many of you are what you'd call data scientists, however you define that word? How many are developers? Mostly developers. So part of the motivation — and this is about what I figured it would be — is to instill a little bit of statistical literacy.
Because we all play with data; we're all worried about things like uncertainty, perhaps making predictions, perhaps estimating things, but we may not have had that background — or if we did, it was in an undergraduate class three years, or maybe thirty years, ago, and chances are that class wasn't very good. Mine wasn't. So this is designed to give you some pragmatic, modern tools for day-to-day data analysis: you've got two things and you want to compare them — how do I do that? — without having to go back to your notes from undergraduate statistics. And we'll do it in a computational way rather than an analytical way: we're not deriving equations. I might show you some equations for interest, but we don't need to know all of that. Also, how many of you do machine learning as part of your job or as a hobby? A lot of you have computer science backgrounds rather than statistical ones, and there's both a statistical and a computational side to machine learning, so this will give you a statistical perspective rather than a computational one; hopefully that will be beneficial to some of you as well.

Okay, I've burned about ten minutes. How is everyone's installation going? Let's do this: blue stickies if you're ready to go; no stickies means it's still in progress and everything is going fine, or you're just not following along — you don't have to do all of this hands-on, though it's better if you do, and if you just want to watch you're welcome to — and yellow if you've got issues that need attention. I'll walk through here and try to get things going. If it all hopelessly fails for you, you can always buddy up and pair-program. Nothing worse than a connectivity problem. Who's got another yellow sticky? This may be the solution to our problem, at least for some of you: MyBinder is a site that will host repositories, build the conda environment, and install everything for you, and then you can run the notebooks on somebody else's hardware without having to worry about any of this. These are conda packages, which are binary — the last one is PyMC3 — and it's building and installing, so it takes a while, even on my computer. One more over here — okay. Some of you are having issues on the Windows front with conda installing things, so I've added a file to the repository — or I'm trying to; even git push is choking here — and hopefully that will help those of you who are still struggling. I'm going to let that run and think for now; we should push on. If you're having trouble installing, pair up with somebody — that's the best use of our time at this point. That build is going to take a little time to run; it should have finished by now, but because of the connection issues it's probably pulling things down really slowly. I thought the Wi-Fi would be a little faster, but that was a bad assumption on my part. For those of you who do have things up and running — and for those not as familiar with IPython, or Jupyter, notebooks — what you do is go into the directory with all of the tutorial materials and type `jupyter notebook`, and it should pop up in your browser, and then we want to get into the notebooks folder.
My plan is to cover four things — we'll see how optimistic that is; it's fine if we don't get to everything. The first part is a quick tutorial on pandas. How many of you already know how to use pandas relatively adeptly? Most of you — okay, I may slide through that fairly quickly for the benefit of those who don't, but we won't spend as much time on it, so we can get to the statistical material. The reason I included it is that data cleaning and preparation is probably 80% of the time you spend on data analysis, irrespective of the domain you work in, so it's really important, and pandas is the tool that feeds into all of this. It won't be a comprehensive Wes McKinney-style tutorial, but it will cover a little bit about how to import and manipulate data. The next piece is an intro to statistical inference, primarily from a Bayesian perspective. That may sound a little intimidating to some, but it's going to be very user-friendly and not math heavy — very pragmatic: how to compare two groups to one another for different types of data. Then we'll get into an important topic called regression modeling, where you have two variables and you want to see how one variable predicts another. And finally, if we have time, a short section on missing data, because missing data is a very important statistical topic that often gets overlooked by machine learners. So that's the plan — and again, if you're having installation issues, do pair up. Let me see how MyBinder is looking — still running. What it will do is let you run everything on a remote server with all of the packages installed, but it's still cranking, so we'll move on.

All right, let's open the data preparation notebook first. Let me blow up my screen — can everybody read that? Okay. Pandas is a high-level tool for data manipulation and transformation that is primarily concerned with tabular data — rows and columns. It's extremely flexible: it can deal with a variety of input formats and a variety of variable types. It's also extremely fast: the original designer of pandas, Wes McKinney, was obsessed with performance, and as a result pandas is a very efficient and capable tool for handling and transforming data. Its origins were in financial data analysis, so it's extremely adept at time series, but really any arbitrary matrix-like data with row and column labels can be accommodated. It has a ton of useful features, including built-in handling of missing data — a topic I'll revisit later. One of the most important things it does is allow data tables to be aligned: it has indexes on rows and columns, so it's very smart about how tables are joined together — a database-style approach — and it's really good at things like slicing, indexing, reshaping, and pivoting. There are whole books and tutorials written about this. Its two main features are a couple of data structures, and the first is called a Series. A Series is essentially just an array, much like a numpy array, but with one added attribute: an index. By default it gives you an index of ordered integers, but you can give it other labels as well.
A Series has two main attributes: the underlying values, which are just a numpy array, and the index, which is its own object type. If we're not interested in having integers as labels, we can supply our own. Say this is some biomedical data — bacterial counts from a microbiome study — so we've got four count values, and we give them index labels corresponding to the genus of each bacterium. You can then use those labels to pull out the values you need; you can do fancy indexing, say pulling out only the ones whose names end in "bacteria", leaving out the others; you can check for membership; and underlying all of this there is still an integer index, so you can use positional indexing if you want to. You can also decorate your Series with names, both for the data and for the index itself — I said genus, but these are actually phyla of bacteria, and I'll call the values "counts", because that's what they are. One of the nice things is that pandas is tightly integrated with numpy: numpy functions not only work on pandas data structures, they return a pandas data structure back. You might expect them to convert it to a plain numpy array, but they don't, so you retain all the high-level machinery. We can filter by value — say we only want counts over a hundred, or a thousand. One way of thinking of a Series is as an ordered key-value store, almost like an OrderedDict, in the sense that it has labels you can index by, but those labels have an order based on the underlying index. We can pass any index we want: for example, if we give it an index that includes a phylum name that doesn't exist in the original data, the result may be a little surprising to those unfamiliar with pandas — it excludes the label we didn't specify in the new index and adds the one that's in the index but has no data. So the index is super important; it determines what is done with the Series. There are also lots of useful built-in methods: we can check which values are null — in this case we have that missing value — so, as I told you, missing values are automatically accounted for. NaN, "not a number", is the placeholder for missing values. If you come from the R world, R has missing values built into the language, because it was designed for statistical analysis; in Python it comes from third-party packages like pandas and numpy. And here is the alignment I was telling you about: if we go back to that original bacteria Series and add it to this one, what happens? We end up with an index of five elements rather than four — we added two Series of length four and got one of length five — because pandas respects the indices of both Series: it creates the union of the indices, and where data was missing from one or both it puts NaN. You might wonder why both are NaN when we had data for one of them: the reason is that NaN propagates — you add missing to five and the answer is missing, not five; you can't assume missing is zero.
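Here is a minimal sketch of the Series behavior just described. The phylum names and counts are illustrative — the actual values in the tutorial notebook may differ.

```python
import numpy as np
import pandas as pd

# A Series with phylum labels as the index (values made up for illustration)
bacteria = pd.Series([632, 1638, 569, 115],
                     index=['Firmicutes', 'Proteobacteria',
                            'Actinobacteria', 'Bacteroidetes'])

bacteria['Actinobacteria']                 # label-based indexing
bacteria[[name.endswith('bacteria') for name in bacteria.index]]  # fancy boolean indexing
bacteria[0]                                # positional indexing still works underneath
bacteria[bacteria > 1000]                  # filter by value

np.log(bacteria)                           # numpy functions return a Series, not a bare array

# Reindexing: a label with no data becomes NaN, an omitted label is dropped
bacteria2 = pd.Series(bacteria,
                      index=['Cyanobacteria', 'Firmicutes',
                             'Proteobacteria', 'Actinobacteria'])
bacteria2.isnull()

# Alignment: addition happens by label, and NaN propagates
bacteria + bacteria2
```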
Generalizing the Series is the DataFrame. You can think of a Series as one column of data; in practice, when we use this for data analysis, we're usually dealing with more than one column, with multiple data types, and we need them all in one place, because each column represents a variable and each row typically represents an observation. In this case each row represents not just a patient but a patient observation — this is actual microbiome data from work I did a few years ago: these were kids with necrotizing enterocolitis who had either stool samples or gut samples taken, and those samples were assayed using PCR for bacteria to see how they compared. Each value is the count of a particular phylum of bacteria in a particular patient, and each row represents a (usually independent) observation. That's where the DataFrame comes in. By default the columns are sorted by name, but you can realign them simply by indexing: if we want phylum first, we just pass the columns in that order, and it returns the DataFrame ordered as we like. An important thing to notice with pandas is that almost nothing is done in place: it returns a new object — a new DataFrame — leaving the original as it was. That seems a little inefficient sometimes, but it's really useful when you're doing iterative data analysis and exploration, because you don't want to accidentally change something along the way; going back to the original data would mean re-importing and re-manipulating it, whereas here you can always go back to the original DataFrame in its original form. The important thing about the DataFrame is that it has a second index corresponding to the columns: we saw the index on the rows with the Series, and now we have the same thing for the columns, which presents a bit of a conundrum when it comes to indexing — what gets indexed when you pass a value? If we want to access columns, we use dictionary-style indexing, and that returns a Series: a single column gets extracted as a Series. If you want to retain the DataFrame structure, you use double-bracket notation, which says "I want a DataFrame, but that DataFrame has just one column." What about rows? If we're indexing rows, or rows and columns together, we use the `.loc` accessor. This will take the observation with index label 3 — which of course in Python is the fourth one — and it returns a Series. Notice that it's of dtype object, because it's taking a slice across variables which may be of different types; pandas doesn't cast all of them to the same dtype, it only keeps the most general type, which in this case is object. I'm going to skip the exercises for the pandas section, since so many of you are already familiar with it, but if you're new to pandas do go ahead and try them. There are lots of ways of creating DataFrames, because we get data in a lot of forms: you can pass a numpy array if you want, but then you have to add column labels and index labels yourself if you need them.
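A sketch of the indexing patterns just described, using a hypothetical version of the microbiome table; the column names here (`patient`, `phylum`, `value`) are my assumptions, not necessarily the notebook's.

```python
import pandas as pd

# Hypothetical microbiome table: one row per patient observation
data = pd.DataFrame({'patient': [1, 1, 1, 1, 2, 2, 2, 2],
                     'phylum': ['Firmicutes', 'Proteobacteria',
                                'Actinobacteria', 'Bacteroidetes'] * 2,
                     'value': [632, 1638, 569, 115, 433, 1130, 754, 555]})

data[['phylum', 'value', 'patient']]   # reorder columns (returns a new DataFrame)
data['value']                          # dict-style indexing -> Series
data[['value']]                        # double brackets -> one-column DataFrame
data.loc[3]                            # row with index label 3 -> Series of dtype object
data.loc[3, 'value']                   # row and column together
```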
Here's an example using a list of dictionaries, where each item in the list is a dictionary representing one datum — because sometimes you get data iteratively over time, row by row, rather than all at once — and that creates a nicely arranged DataFrame as well. A really important thing to remember about pandas data structures is that when you index a DataFrame or a Series, you're getting a view into that subset of the data rather than a copy, and that requires a little caution. For example, say we want to fix our data: index 5 didn't actually have the count we thought it did — something went wrong and there should be a zero there — so we assign zero to it. You get a warning here, because this is so easy to misuse: we extracted those values from the original DataFrame to get a Series, but if we go back and look at the original DataFrame, the zero has propagated over, because we were always manipulating that DataFrame — we were just looking at a view of a piece of it. The best way to avoid that is to make a copy. So now we copy, say "oh no, actually there were a thousand in that one", and when we go back and look at the bacteria data it didn't change to a thousand, because we actually made a new physical copy. (I'm just going to put back the value we accidentally changed.) It's really easy to add rows or columns. Say all this data was from 2013: I can add a column called year, and all I have to do is pass it the scalar — I don't have to resize it; it will broadcast as necessary. However, you can't use attribute assignment to create a new column, and I've seen this happen over and over with new users. All of these columns exist as attributes as well as via indexing: `bacteria_data.phylum` returns the view on that data. But it doesn't work both ways. Say these were all treatment patients, and I assign a treatment attribute of 1, then look at the data — where's my treatment column? At the end of the day this DataFrame is still just a Python object, and Python objects have attributes, so pandas has no way to know whether you wanted a new column or just an attribute, and it defaults to an attribute. Now we have an attribute called treatment with value 1, which doesn't mean anything — it's not data at all. DataFrames, like Python objects in general, have plenty of other attributes — methods for joining, applying functions, and so on — and we've just added one more, with value 1, and who knows what it's for. They're very flexible, but that's not how you add data to a DataFrame.
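A sketch of the two pitfalls just described — views versus copies, and attribute assignment versus column assignment. It continues the hypothetical `data` frame from the sketch above, and the warning behavior may vary slightly by pandas version.

```python
import pandas as pd

# Continuing the hypothetical table from above
data = pd.DataFrame({'phylum': ['Firmicutes', 'Proteobacteria',
                                'Actinobacteria', 'Bacteroidetes'] * 2,
                     'value': [632, 1638, 569, 115, 433, 1130, 754, 555],
                     'patient': [1, 1, 1, 1, 2, 2, 2, 2]})

vals = data['value']           # a view into `data`, not an independent copy
vals[5] = 0                    # pandas may emit SettingWithCopyWarning; either way the
                               # zero shows up in data['value'] too, via the shared view

vals = data['value'].copy()    # an explicit copy can be modified safely
vals[5] = 1000                 # `data` is unchanged this time

data['year'] = 2013            # a scalar broadcasts down the whole new column
data.treatment = 1             # pitfall: this only sets a Python attribute; no column appears
data['treatment'] = 1          # this is how you actually add the column
```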
Now, auto-alignment: when you add a column that isn't just a simple constant like year, you have to be a little careful. What I've done here is create a Series that represents a treatment/control indicator, and notice I've only made it of length 6 — four zeros and two ones. If I add this as "treatment" the way I'm supposed to — passing it using dictionary-style indexing — it actually works even though the lengths are different, because of the index: indexing is great, and it puts NaNs where there's no value, which is a reasonable thing to do. It doesn't work with other data structure types, though: if I have a plain list called month, of length four, and try to do the same thing, it breaks, because a numpy array or a list doesn't have an index, so pandas throws up its hands. If it were the right length it would have worked — so this, for example, works: a list of the appropriate length with just "January" in it. You can remove columns by deleting them or by using the drop method — that's not too surprising. And you can always extract the underlying data: if at some point you don't need all of pandas' sugar, you can strip it away and look at the raw values. One thing to note is that indexes are immutable: if we don't like the fact that our first index label is 0 when it's supposed to be 15 for some reason, you can't simply replace that one element — you have to create a whole new index and assign it as the index attribute. The reason is that indexes can be shared among DataFrames: you can take the index of one DataFrame and assign it to another, and you don't want the rug pulled out from under you by having elements changed. If I were allowed to mutate an element, it would change the value in both DataFrames, because they refer to the same object, whereas if I swap out the entire index of one of them, it doesn't affect the other. How are we doing on time? 2:07 — not too bad. One of the great advantages of pandas, if nothing else, is that it makes importing data easy, in whichever form it comes to you. I work with clinicians, and I've had data come to me in Word documents, PDFs, things like that. It would be nice if everybody stored their data in relational databases like they're supposed to, but the real world isn't like that. Numpy is great too, but its data import mechanisms are quite a bit more limiting, particularly when you're dealing with heterogeneous tabular data: the loadtxt function, for example — what you would have used before pandas — requires you to specify a whole bunch of things, with very cryptic signatures for variable types and so on. It was great at the time, but it became a little tiresome, and then pandas came along and made everything nicer. For example, here's some data we'll use later: Olympic medal counts scraped from the Olympic website. To read that into Python I can use pandas' read_table function, which is very flexible: you can specify the separator, whether or not to use one of the columns as the index, whether or not there's a header — in this case there wasn't one, so I'm assigning names — and in one function call you get a properly formatted, imported table. The separator argument is great: it takes regular expressions, so occasionally you'll get a table with an arbitrary amount of spacing between fields, and rather than saying "three spaces" you can use a pattern that means one or more whitespace characters.
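A sketch of the read_table call just described; the file path and column names are my assumptions rather than the notebook's.

```python
import pandas as pd

# Hypothetical whitespace-delimited file of Olympic medal counts, with no header row
medals = pd.read_table('data/olympics.txt', sep=r'\s+', header=None,
                       names=['country', 'medals'], index_col=0)
medals.head()
```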
We can also scrape tables from the web — that actually broke for me the other day because of an underlying library, html5lib I think. It was working again, although apparently it's not now, so we'll avoid that — and now it's just hanging on me. I think I'm going to punt on the rest of the pandas material; I've shown you most of what you'll need. Okay, it failed — it failed nicely. Just briefly: read_csv is just a special case of read_table where the separator is pre-specified as commas. If there are rows with bad data, or rows you want to exclude for some reason, you can skip them. And if you've got a very large dataset and you want to process it chunk by chunk because your resources are limited, you can do that. Say I just want the first four rows — maybe there are 30 million rows of data, which is not uncommon for genetic data — you can pull in just a little portion of it. In fact, there's an argument called chunksize that returns this odd object called a TextFileReader, which is a Python generator: it passes you the data one chunk at a time — here's the first 15 rows, here's the next 15, and so on. In this case it just so happens there were 15 rows for every phylum — you can see the taxon changes each time — so I could calculate taxon-specific means here without having to import potentially millions of rows into memory. Remember, all of this is being done in memory, so these are actually pretty useful things to know.

Hierarchical indices — what the heck, you might ask; well, because they're so cool. This is more of the microbiome data, this time the properly formatted version. (The thing about these Macs is that they got rid of the Escape key, while the Caps Lock key sits there just waiting to be removed from my keyboard.) If we look at the index of our bacteria data, there's an attribute that tells you whether or not the index is unique. In any case, the data, remember, has one row not for every patient but for every patient observation, and we sometimes want an index that represents that unique combination. What we're really dealing with here — remember I said before that these are samples taken from tissue (stomach) and from stool — is that every patient-and-bacterium combination has a count associated with it. So rather than specifying a single column as the index (if I used either of those two columns alone it wouldn't be unique, because there are multiple patients and multiple taxa, but patient by taxon will be unique), when I import the data I can specify two columns rather than just one — I showed you index_col=0 before — which seems a little odd. What happens is that it creates what's called a MultiIndex, where every row is indexed not by a single integer or label but by a tuple of labels. What this really does is let you represent multi-dimensional data while retaining the tabular format, without having to go to three- and four-dimensional data structures.
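A sketch of the chunked reading and the hierarchical index just described. The file path and column names (Taxon, Patient, Tissue) are assumptions about the notebook's dataset, and the chunk calculation assumes, as in the talk, that each 15-row chunk covers a single taxon.

```python
import pandas as pd

# Read a big file in chunks: chunksize returns a TextFileReader, a generator of DataFrames
chunks = pd.read_csv('data/microbiome.csv', chunksize=15)
mean_tissue_count = {chunk.Taxon.iloc[0]: chunk.Tissue.mean() for chunk in chunks}

# Hierarchical (multi) index: two columns together form a unique row label
mb = pd.read_csv('data/microbiome.csv', index_col=['Taxon', 'Patient'])
mb.loc[('Firmicutes', 2)]      # index a single row with a tuple of labels
mb.loc['Proteobacteria']       # partial indexing on the outer level: all Proteobacteria rows
```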
So now I can index values out — oh, this is the old API; this should be `.loc` rather than `.ix`, since they've deprecated `.ix` — so I can index out, it looks like, the third row here by passing it that tuple. But I can still pass a single label for one component of the hierarchical, or multi, index, and then I get all of the Proteobacteria; you can see there's now a single index rather than a MultiIndex, because I've taken that subset out. That's really handy. You can similarly use the cross-section (xs) method for indexing, although I don't usually find it as useful, and we can reorder the levels as well if we want one on the inside versus the other; the nice rendering in Jupyter takes repeated cells and unifies them, so they're a little easier to read. I was going to talk about relational databases, I think — but what time is it? 2:16 — we'd better push on, so I'll leave this one with you. It's a nice example: actual data compiled from the Ebola outbreak in West Africa by Caitlin Rivers in a GitHub repo. If you're looking for really big, ugly, messy, but wonderful data, that's a good one, and it's a really nice real-world example of what you have to do to clean data to make it analyzable. The example shows how to do that with pandas, because each table has different labels associated with it, data in different formats, missing data, and all sorts of other things. For example, here's one from Sierra Leone: it's got date, variable, and a bunch of provinces, while this other one has "description" rather than "variable", with inconsistent descriptions — this one uses underscores. So if we wanted to extract, say, all of the new Ebola cases — if we were doing surveillance on how the outbreak was progressing — we'd have to somehow clean all of this up, and this is a nice example of how you might do that. Perhaps you can work through it on your own or during the break, but I'm going to push on, given our earlier technical difficulties, and get into the statistical analysis.

Okay, let's open notebook number two, and I'll try to get through this before the break. This is a fairly gentle introduction to how you statistically compare data from two groups: first looking at outcomes that are continuous — values measured on the real line, like height, weight, and blood pressure — and then binary outcomes, which are also common: dead or alive, male or female, true or false. In general, what we mean by statistical inference is that we're trying to learn from data that is either incomplete or contaminated by error. By error we don't necessarily mean mistakes; we just mean variability. If there were no variability or error, (a) we wouldn't need statisticians, and (b) we wouldn't have to observe much, and the reason we wouldn't need statisticians is that we'd only have to measure one of everything: you'd need just one measurement and you'd know everything there is to know, because it wouldn't change. And the approach we usually take, if we're interested in whether two groups are different from one another — is one bigger than the other, did my experiment work, does the treatment improve the probability of surviving cancer, does this fancy new drug raise your IQ — is a statistical hypothesis testing framework.
This is what we all had in undergraduate statistics, right? This is what you learn: t-tests, chi-squared tests, nonparametric tests, ANOVA — all of that is statistical hypothesis testing in action. And the goal of statistical hypothesis testing is actually not to evaluate the hypothesis you're interested in; it's to evaluate a null hypothesis, which much of the time isn't really related to what you're interested in. If the question is "does my drug increase survival of cancer patients", the null hypothesis is "no, it doesn't": zero effect, survival is the same with or without the drug. So we carry out a test that has one of two outcomes: we either reject the null hypothesis or we fail to reject it. It says nothing about your research hypothesis, only the null. Rejection occurs when some test statistic — a t statistic, a z statistic, a chi-squared statistic — is higher than some pre-specified threshold value. You probably remember looking up tables like this in a textbook, and then you had to choose the level at which you considered the result significant. Notice that none of this says anything about what you're actually interested in, your research hypothesis; what we say is that we reject the null hypothesis, and that's pretty much all you can say. Depending on how you chose your null hypothesis, it may not even be meaningful: zero may not be the proper comparison for what you're trying to do — no drug has zero effect; give somebody sugar pills and you get nonzero effects on things. So it can be very hard to interpret the outcomes of statistical tests, and there are a lot of subjective choices that go into setting them up: which statistical test to use, which null hypothesis to test, which significance level. What's the default significance level, the magic number? 0.05. And it's arbitrary — it traces back to Ronald Fisher, who wrote a paragraph saying something like one in twenty was a reasonable level to choose. Ronald said so, and that's why we do it. Sometimes you see 0.01 in medical work, but it's entirely arbitrary: at 0.04 you can publish your paper, at 0.06 you can't — those sorts of problems. So the choices are often based on arbitrary criteria and statistical tradition, the evidence is very indirect — again, it tests not your hypothesis but something that may or may not be related to it — and it generally overstates the evidence against the null hypothesis. So instead of testing, a better, more informative approach is an estimation-based approach — and this has nothing to do with whether you're Bayesian or frequentist. Rather than testing whether two groups are different, yes or no, estimate how different they are — it might be zero, it might not be — and also give some measure of uncertainty around that estimate. That's what we're going to use here. How many of you have taken Bayesian statistics, either informally or formally? Not as many. You usually don't get it as an undergrad — again, my education is a little bit outdated; I was an undergraduate in the '90s — but I certainly didn't have it in any of my courses.
You might have gotten a little bit of it when you covered probability, because Bayes' formula can be derived from conditional probability, and that might have been the only mention: "here's Bayes' formula, as a result of conditional probability," and that's it — most people never really used it. So I'm going to give you the quick rundown of what it is and why it's useful. Andrew Gelman, who is a big Bayesian, puts it this way: what are Bayesian methods? They're practical methods for making inferences from data, using probability models for quantities we observe and for quantities about which we wish to learn. The key phrase is probability models — this is probability modeling writ large. The nice thing about Bayes is that no matter what model you're fitting or what kind of data you're analyzing, all of the inferences are expressed as probability statements. Here is my notation throughout: the things we don't know — parameters in a regression model, the unknown means of a treatment or control group — are theta, some Greek letter, and our observations, our data, are y. The expression P(theta | y) — the bar means conditioning; we're conditioning on y — says: what's the probability of my hypothesis being true, or of my effect size being greater than 6, or whatever, given the data I've observed? Right away that should strike you as easier to interpret: it's a direct statement about what we're interested in — what's the probability that the difference between these two things is different from zero, given what I observed in my experiment? I've gone out and observed things; what does that evidence say about what I don't know? A couple of benefits: ease of interpretation, because these are probabilities — we'll be dealing with probability distributions, normal distributions, binomial distributions, and so on — which give a nice summarization of uncertainty; all estimates come with measures of variance associated with them; you can incorporate prior information into parameters — we sometimes know things about these parameters before we fit our model, and we can build that in; and it's very easy to calculate summary statistics and other quantities we're interested in after we've fit our models. The big difference between Bayesian and frequentist (or classical) statistics is essentially this. If you're a frequentist — that's Ronald Fisher right there; he looks like Santa Claus, but it's Ronald Fisher — you view data as random: data are the outcomes of random processes — flipping coins, running scientific experiments, doing A/B testing for websites — so data are considered random, and model parameters, the things we don't know, are considered fixed: the quality of your website, the effectiveness of this new drug. So frequentists condition on the parameters, and their notation is the reverse of what we wrote before: we had P(theta | y); here we have P(y | theta). We condition on the unknown things and ask: what's the chance of seeing what we saw, if this thing were true? You can see it's a little more indirect: we assume what we don't know is true, and then see how likely what we observed is.
It's not wrong — I'm not saying there's anything wrong with hundreds of years of statistics — it's just more indirect. What frequentist statisticians do is derive estimators for things — estimators of the variance, of the mean — and they choose them based on various so-called optimality criteria: we might want them to be unbiased, to have the minimum variance (as little variability as possible), or to be most efficient in their use of the data. One example is the estimator for a binomial probability — coin flips and so on. An estimate of the probability of heads (every statistician has to use a coin example at some point; here's mine) is the number of heads over the number of times you flip the coin. If it's five out of ten, for example — 50 percent — that suggests an unbiased coin, and that estimate, 0.5 or whatever the ratio is, is your frequentist estimator of the probability of heads. We choose it because it's known to be unbiased and minimum variance. Bayesians, as I suggested, are the other way around: Bayesians view their data as fixed. The data used to be random before you observed them, but once you record a value — once you write it in your lab notebook — it doesn't change; when you go back the next day, the values aren't different. They were once random; they aren't anymore. So we condition on them, hold them fixed, and our model parameters are instead considered "random" — and I put random in scare quotes: we don't necessarily mean that, say, the survival rate under drug A changes all the time, although it might; we mean that we're going to use probability to describe our uncertainty about that parameter, and the more we know, the narrower that distribution gets. We're using probability as a state of knowledge rather than as some stochastic definition of randomness. If you're a Bayesian, you don't need to derive any estimators — which is great, because I'm terrible at that. The only estimator you need as a Bayesian is Bayes' formula, which is obviously what this paradigm of statistics is named after, and it can be easily derived from the definition of conditional probability, as I suggested before. Let me give you a quick tour of Bayes' formula. I already showed you the left-hand side: P(theta | y) is what Bayes is all about — the probability of the unknown things, given what I've observed. It's known as the posterior probability, because it's what you know about theta after observing y. The prior probability is what we knew about theta before we ran our experiment or observed our data. And in between we have the likelihood — a statistical likelihood that describes the variability in the data; frequentists are fully on board with this part, it's common to all varieties of statistics — it's the prior that classical statisticians don't recognize. So we take our initial state of knowledge, update it with the information in our data, and get the posterior probability.
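For reference, the formula being walked through here, in the notation above (theta for the unknowns, y for the data), is:

```latex
P(\theta \mid y) = \frac{P(y \mid \theta)\, P(\theta)}{P(y)},
\qquad\text{where}\qquad
P(y) = \int P(y \mid \theta)\, P(\theta)\, d\theta .
```

P(theta) is the prior, P(y | theta) the likelihood, P(theta | y) the posterior, and P(y) the normalizing constant discussed next.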
The thing in the denominator — P(y), the probability of the data — is essentially just a normalizing constant, which is kind of a weird object; ostensibly all it does is normalize. Remember, these are all probabilities (the P means probability), and the rules of probability say they have to sum or integrate to one, be between 0 and 1, and so on; the denominator ensures the result is a proper probability rather than something unnormalized. It turns out, though, that this normalizing constant is a huge pain in the butt, because it is essentially the numerator integrated over all of theta: you take those thetas and integrate them out. If you're trying to calculate the posterior distributions we're really interested in, that becomes effectively impossible: I can do univariate integration — I still remember how — and I'd probably need textbooks to help me with two-variable integration, but anything higher than that, forget about it. Even if you're an expert mathematician, with hundreds or thousands of parameters you can't do it analytically. So for most problems of interest, directly calculating posterior distributions is impossible, and that's one of the reasons Bayesian approaches have been on the sidelines relative to classical statistics: they were simply hard to compute. But now we have computers that let us implement the algorithms we're going to see today. Priors have been somewhat controversial in Bayesian statistics. Remember, the prior characterizes what is known about the unknown quantity before you observe the data, and it's viewed by some as subjective — it depends on who you are and what you know — and the old trope was that you can make your analysis say anything you want by choosing priors that make it so. That's absolutely true, but you have to write your prior down, you have to tell us what it is, and if your model is sensitive to your prior choice, you've got a problem. Secondly, all of statistics is subjective: I already showed you the business with p-values — choosing alpha = 0.05 was a choice, perhaps statistical tradition, but a choice, and a somewhat arbitrary one — as were which test statistic to use and which data to collect in the first place. There's no such thing as objective statistics, end of story. All you have to do is be honest about what your information state is beforehand. You can make priors that are uninformative if you really want the data to speak for themselves, but typically in science we want to build upon previous work: take what's known and improve upon it, and priors let us do that. Likelihood functions — I won't say much more other than that, again, the likelihood is the evidence in your data: it contains all the information used to update your information state, and there's a nice property called the likelihood principle, which says that the likelihood function contains all the experimental information from your data that's relevant to your model. That's just really nice. So how do we do Bayes? Three easy steps. Step one: specify a probability model — a completely specified model: priors for the unknown parameters, the likelihood for your data, the data in general, any covariates, any missing data. Write it all down and then code it up, and we'll do that in a little while. This is the tricky part for some: what prior should I use for different things?
That's something I'm going to show you a little further along, and the body of knowledge that's most important for being a Bayesian is probability: being familiar with probability distributions and how probability works. This just takes a little time and a little intuition: this is count data, so I'm going to use a Poisson; this is a prior for a variance parameter, which has to be positive, so I'm going to choose a gamma — those sorts of things. There are always references for everything and lots of examples — there's some statistical tradition involved here too — but it just takes a bit of familiarity. A quick overview: in general we have two kinds of random variables, discrete and continuous. Discrete ones are defined at particular points: trues and falses are discrete values — there's no halfway between true and false — and so are counts of things; we can only count in integers, with no decimal numbers. Their probability functions look like this: here's a Poisson distribution, often used for counts of things, and that's its functional form. The data can be 0, 1, 2, and so on, because they're counts; it has one parameter, lambda, which has to be greater than zero, and its expectation is the same as its variance. There's a whole suite of discrete distributions you can use. So I'm going to draw some samples — oops, I forgot to run all of the prior cells. Oh, you know what it is: I made the mortal error — I didn't source activate the tutorial environment. All right. So I'm going to specify a Poisson distribution, give it a mean of one, and draw a thousand — or ten thousand — samples so I can calculate the mean, and that's what its distribution looks like. Throughout this tutorial I'm going to use a package for Bayesian statistics called PyMC3, which I'm a developer for. However, this tutorial is not going to focus on that software's details and technical capabilities; we're going to treat it as a kind of black box — we could do this statistical analysis with a whole bunch of different packages, so I don't want to focus on PyMC, but we will use it to generate things like this. As a side note: continuous random variables, on the other hand, are defined over an interval — say between 0 and 1 if we're talking about a probability, or potentially the whole real line from negative to positive infinity. The most famous is the normal distribution — the most important distribution in statistics. It has a mean and a variance; the mean can be any number on the real line, and the variance has to be positive, because you can't have negative variability, and its functional form looks like that. Again, you don't necessarily need to know all the mechanics, just how to parameterize it. So here's a normal distribution with a mean of negative 2 and a standard deviation of 4 — sure enough, the sample mean is around negative 2, the standard deviation is around 4, and it looks nice and bell-shaped, more or less. I should have used more bins on my histogram; about every two months you'll see an angry tweet from me about histograms and their bins.
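A sketch of the two sampling demonstrations just described. The notebook itself draws these with PyMC3's distribution objects, which I won't reproduce exactly here; numpy gives the same picture with less machinery.

```python
import numpy as np
import matplotlib.pyplot as plt

# Discrete example: 10,000 draws from a Poisson with rate lambda = 1
poisson_draws = np.random.poisson(lam=1, size=10000)
print(poisson_draws.mean(), poisson_draws.var())   # both should be close to 1

# Continuous example: 10,000 draws from a Normal with mean -2 and sd 4
normal_draws = np.random.normal(loc=-2, scale=4, size=10000)
print(normal_draws.mean(), normal_draws.std())

plt.hist(normal_draws, bins=50)   # plenty of bins, to avoid the angry tweet
plt.show()
```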
Okay, step two: calculate a posterior distribution. We've specified a model — I'll show you how to do that step by step in a minute — and now we calculate the posterior distribution, and remember, this can be the hard bit. The first step is the conceptually hard part: specifying your model, just writing it down, is usually what people have trouble with. The technically hard part is getting the posterior distribution: we can't do it analytically, so we use all sorts of approximations. All the software — all the uses of PyMC you'll see today — is just approximating a posterior for us, and sometimes those approximations are really good and sometimes they're lousy. Which is why step three is: always check your model. This should be the case no matter what kind of statistics you're doing, but it's particularly important for Bayes, because this is model-based inference and the assumptions you build into your model are by and large unverifiable. So you have to go in and ask: does the model actually fit the data — not even out-of-sample data, but the data used to fit the model — does that data look like something that could have come from this model? Do the conclusions make sense? Are they reasonable? Are the outputs sensitive to changes in model structure? That whole prior controversy I mentioned: go and see whether choosing somebody else's prior, or a slightly different prior, completely changes your inference. If it does, you've got some explaining to do, or maybe you need to go back to the drawing board; if it doesn't, you can be reasonably happy. So: model checking. All right, let's get into it. We're going to do the simplest possible statistical analysis — I said we'd compare two groups, but we'll start by looking at just one: statistical inference for a single group. We'll use a popular dataset here, a radon dataset — let me get rid of my toolbars so you can see the pretty pictures I stole from the internet. Radon is a radioactive gas that causes cancer — it's the biggest cause of lung cancer among non-smokers — and it usually just seeps into your house from the ground. This picture shows it coming up from the groundwater; if you've got a basement you're sometimes more susceptible, and it can come in through your shower and so on. My house in Nashville had high radon readings, so we had to get people to remediate: they put in a big plastic membrane and a little pump that blows it out into my neighbor's yard. It's a chronic exposure problem: if you're in somebody's house with radon for two or three days you shouldn't be worried, but if you live there every day for your entire life you could end up with lung cancer, which is bad. So this is an EPA dataset of radon readings in households in Minnesota, and to keep it simple we're going to focus on a single county, Hennepin County. Anybody here from Hennepin County, Minnesota? I don't even know where it is — is it Minneapolis? Okay, I picked one of the bigger ones, so that makes sense. The EPA says that if the reading in your house is over 4 picocuries per liter, you should seek remediation — another arbitrary threshold; if mine were 3.9 I'd probably still remediate — but let's go with that. So let's say we're interested in whether the mean log radon value is greater than that threshold.
And — perhaps even more important — what's the probability that any randomly chosen household has a value that high? Those are two slightly different questions. So, pulling out my Hennepin data — here's pandas in action; query is a really nice method in pandas: instead of all the dot notation you just write a little string query and it pulls the rows out — and I'm going to take the log radon values. This is what they look like: symmetric, kind of bell-shaped — think of some distributions here — and we've got 105 observations. Okay, step one of Bayes: write down your full probability model. Typically this requires two, maybe three, things: your likelihood function and your priors, for sure. Let's start with the likelihood. Bell-shaped — weirdly bell-shaped — so: normal distribution. We'll start with something easy; it looks kind of normal-ish, who knows, and we're going to check the model later, so if we're wrong it's no big deal. Here's what we're saying: y, the log radon, is distributed normally with some mean and standard deviation. That implies there are two unknowns here, mu and sigma, so we need two priors. How do we choose distributions to use as priors? That's a big question, and there are a few considerations. Discrete versus continuous: if a parameter can take fractional values, you shouldn't use a discrete distribution, because you'll never get there. The support of the variable — by which I mean the values over which it is defined: some distributions are only defined for positive numbers, for example, or only on the zero-to-one interval, so you have to choose appropriately. And then whatever prior information is available: you might have a whole bunch of previous studies; you could pull those in, plot them, and say "oh, this looks gamma-distributed, I'll use that." For the mean: the mean could be positive or negative — this is radon on the log scale, so it can be negative; if it weren't on the log scale I wouldn't be saying this — so I'm going to arbitrarily center it at zero and set a variance of 100, a standard deviation of 10. It's a normal distribution, but the variation is so wide that it's almost flat — what we sometimes call a non-informative prior. There probably is lots of prior information about radon levels, but I don't know it, so we'll just let the data pick appropriate values. For the standard deviation: we know it has to be positive — you can't have a negative standard deviation — and it's not going to be in the thousands or millions; these are log radon measurements. So I'll just pick a uniform distribution between zero and ten; I know it can't even be close to ten, so uniform on zero to ten is safe. Again, if we're wrong about either of these choices, we can go back and check the model later.
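A sketch of the data pull just described. The file path, the column names (`county`, `log_radon`), and the exact county label are assumptions about the notebook's radon file, not something I'm asserting.

```python
import pandas as pd

# Hypothetical layout of the Minnesota radon file: one row per household measurement
radon = pd.read_csv('data/radon.csv')

# query() takes a string expression instead of boolean-mask dot notation
hennepin_radon = radon.query('county == "HENNEPIN"').log_radon
hennepin_radon.hist()     # roughly symmetric and bell-shaped; about 105 observations
```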
what we do: we create a model, which we'll call radon_model, using a with statement; that's a context manager, for those of you familiar with the term. Inside the model we define our parameters. I set my mean (see, it almost looks like writing it down on a whiteboard, except you have to give it a name), and then sigma is uniform between 0 and 10. So that part is done. Again, there are lots of software choices out there, Stan, JAGS, and others; even commercial software is getting in on the action now, because they know it's good stuff. But we'll use PyMC3 here. The likelihood, then, is normal, as we said. Notice that the mean and standard deviation here aren't fixed numbers, because they're unknowns: they're the priors I just defined. (This is why I called it y_dist here; really I should just call it y, to be consistent.) And if you have data associated with a particular variable, you pass it as an observed argument, so we're saying: here's my data. That's what defines it as a likelihood. So, I hope, a relatively easy model specification. All right, now here's where we calculate our posterior distribution, and I'm going to completely black-box this part. There's a whole other tutorial I could give about MCMC and variational inference and approximations, and it's all super interesting, but that's not what we're here for; we're here to learn a bit about how to build these models. So we'll black-box it, and you'll just have to trust me that it's generating good posterior distributions. Out of interest, we're using a numerical approach called variational inference, which approximates the unknown true posterior with a known distribution that it optimizes until it's as close as possible to the truth. It goes off and it's actually done relatively quickly, and it's nice and generic: just fit, fit the model, do what you've got to do, and then I pull a thousand samples from it to get an approximate posterior distribution. Now I can plot some of this. Let's look at mu. This is on the log scale, so I'm also plotting a reference value of log 4, because remember the threshold was 4 picocuries per liter. And this shows some pretty interesting stuff. This is the posterior distribution of mu, the expected value of log radon, and here's the critical value: admittedly arbitrary, but sometimes our arbitrary values are policy-related and we can't do anything about them, and this is one of those cases. Here's where the ease of interpretation kicks in. Once we have this posterior distribution, we get a lot of stuff for free. From Bayes you don't just get a point estimate, you don't just get a mean; you get an entire distribution of values, and from that distribution we can extract the mean, and we can extract what we call a credible interval, a probability interval that quantifies our uncertainty. You get all of that for free once you've calculated the posterior. What this is saying is that there's a 95% chance the true value is somewhere between 1.2 and 1.5 on the log scale, and that is interpretable as a direct probability statement. You may in the past have calculated something called a confidence interval; that cannot be interpreted this way. Confidence intervals are statements about the repeatability of the experiment; they don't really say much about the precision of the parameter itself.
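For reference, here is a minimal sketch of the model and fitting steps just described. The DataFrame and column names (radon, county, log_radon) are assumptions about the tutorial data, not taken verbatim from the talk.

```python
# A minimal sketch, assuming a DataFrame `radon` with columns `county` and
# `log_radon` (hypothetical names for the tutorial data).
import numpy as np
import pymc3 as pm

hennepin_radon = radon.query('county == "HENNEPIN"').log_radon.values

with pm.Model() as radon_model:
    # Priors: a very wide (nearly flat) normal for the mean,
    # a uniform for the standard deviation
    mu = pm.Normal('mu', mu=0, sd=10)
    sigma = pm.Uniform('sigma', lower=0, upper=10)

    # Likelihood: observed log-radon values, normal with unknown mu and sigma
    y = pm.Normal('y', mu=mu, sd=sigma, observed=hennepin_radon)

    # Approximate the posterior (variational inference), then draw samples
    approx = pm.fit(n=20000)
    samples = approx.sample(1000)

# Posterior of the mean, with the EPA action level (4 pCi/L) on the log scale
pm.plot_posterior(samples, varnames=['mu'], ref_val=np.log(4))
```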
With the credible interval you can actually write down: there's a 95% probability that the true value is between this point and that point, given my data, full stop. And here's the cool thing up at the top: about 84 percent of the posterior values are less than the critical threshold of 4 picocuries, which on the log scale is about 1.4, and about 16 percent are greater. So we've already answered the first question. Is the mean log radon value greater than 4? There's a 16% chance that it is, so we have a nice measure of uncertainty around the estimate. The next thing we want is: what's the probability that a randomly, arbitrarily chosen household in Hennepin County is greater than 4? Now we have to do prediction. We just did estimation, we've got an estimate of something, and now we want to predict; put on your Carnac hat. This is slightly different, because a prediction is partly a function of what we know about the parameters, but it's also a function of a stochastic process, a random process: some houses are high and some houses aren't, irrespective of how much we know about the parameters. So we use something called the posterior predictive distribution. It's another integral, but that's okay, because it's another thing we get essentially for free after we've done our sampling. We'll call Z our new unknown thing, and what we're interested in is the probability that Z, in our case some new house, is greater than four given Y, what we've already observed about a bunch of other houses. That's an integral of two pieces you'll recognize: our posterior, and the random process that generates a new house, and you integrate over theta. Integrating over theta means we're accounting for the uncertainty in the things we don't know, so it's a philosophically satisfying way of dealing with prediction. And we can estimate it directly from the posterior samples we already took; we don't have to go back and do any more sampling. Take our mus (the samples I drew from the posterior distribution) and our sigmas, and draw from the sampling distribution, which is the normal we chose, since Hennepin County's log radon looked roughly normal. Now I've got a whole bunch of simulated radon values; see which ones are greater than log 4 and take the mean: about 47 or 48 percent. So there's roughly a 48% chance that an arbitrarily chosen house in Hennepin County has dangerously high radon levels. That's usable, actionable, very direct information, and it's a good example of why we like this approach.
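A minimal sketch of that hand-rolled posterior predictive calculation, reusing the posterior samples drawn in the earlier sketch (variable names assumed from that sketch):

```python
# Reuse the posterior samples of mu and sigma: one simulated house per draw,
# then count the fraction above log(4).
import numpy as np

mus = samples['mu']
sigmas = samples['sigma']

simulated_houses = np.random.normal(mus, sigmas)   # one new house per posterior draw
prob_above_4 = (simulated_houses > np.log(4)).mean()
print(prob_above_4)   # roughly 0.47-0.48 in the talk
```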
Any questions about that so far? Yes: that earlier distribution was just an estimate of the unknown parameter. If we measured every house in Hennepin County there would be no uncertainty about the mean, we would know exactly what it was, but there would still be a stochastic sampling process going on for individual houses. So it may seem a little weird that there's only a 16% chance the mean is above the threshold while 47% of the houses are; that's because the mean is just a mean, and there's also a variance around it. And would we get exactly the same numbers if we ran it again? Well, there are two sources of uncertainty. One is that I did a random fit, although if you use the same random seed you should get the same answer. And remember, these are approximate posterior distributions: I've done sampling to estimate the parameters and more sampling to estimate the posterior predictive, so the numbers will be in the ballpark but won't agree to five decimal places. How are we doing on time? Okay, we should get pretty close to finishing this by the break. So here's step three, the one people always forget or ignore. Everyone gets excited, and because the software has gotten so easy to use, it's super easy to get an answer that looks great but is totally meaningless, so it's always important to go back and check. How do we know this model is any good? First, it's important to look at the results and see if they're reasonable: if you're an environmental scientist, does any of this make sense, does it pass the sniff test? But more technically, you want to compare the model to the data that was used to fit it, and the technical term for this is posterior predictive checks. We generate simulated data sets from our model and compare the simulated data to the actual data; the actual data should look like a random draw from those simulated data sets. If we simulate a bunch of data over here and our observed data sits way over there, the claim that the observed data could have come from our model isn't very convincing; it's not even in the same neighborhood. You can choose all sorts of statistics to compare; I'm just going to compare them graphically here. The green is the observed data and the purple curves are the simulated data sets. There are bumps and things that differ slightly, but they're reasonably close to one another. I'll give you a better example of this when we do regression in the second half of the course.
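One possible way to draw that graphical comparison of simulated and observed data. This is a sketch, not the talk's exact plotting code; it assumes the samples and hennepin_radon variables from the earlier sketches.

```python
# Overlay a handful of simulated data sets (purple) on the observed data (green).
import matplotlib.pyplot as plt
import numpy as np

n_obs = len(hennepin_radon)

for i in range(20):
    # One replicate data set of the same size, from one posterior draw
    sim = np.random.normal(samples['mu'][i], samples['sigma'][i], size=n_obs)
    plt.hist(sim, histtype='step', color='purple', alpha=0.3)

plt.hist(hennepin_radon, histtype='step', color='green', lw=2)
plt.xlabel('log radon')
plt.show()
```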
A question about cross-validation: yes, you can do that, but we tend to discourage it. For one thing, you don't really need cross-validation in Bayes. For another, it's not efficient, because you're holding data aside; you want to use all the data available to estimate your parameters rather than setting some of it aside to validate with. There are model comparison methods I'll show you later, information criteria like AIC that can be shown to be asymptotically equivalent to doing leave-one-out cross-validation, so there are other metrics that give you similar outcomes. All we're trying to do here is check goodness of fit; we're not trying to optimize a parameter. We're saying: here's my model, it's the best model I could come up with, and it looks like it represents the data well. Ideally, the best way to validate it would be to go collect more data from Hennepin County, compare, and see whether the model still fits. But you always want to use all the data you have in hand to estimate your parameters, because data are valuable, even in the age of big data, and you want to use them as efficiently as possible. There's another component of model checking that I won't get into, which has to do with model convergence. I ran a variational inference algorithm here; there's another algorithm you've probably heard of called MCMC, and for these there are mechanistic checks you have to do. All of these algorithms have theoretical guarantees that they will work if you run them long enough, but that doesn't always hold up in practice, so you sometimes have to do convergence checking. That's more of a technical matter, and it doesn't have much to do with the philosophy of building Bayesian models, so we'll set it aside for the time being. Some sort of model checking is important, though, and here's another type: prior sensitivity. Remember, I picked those priors basically by eyeballing, saying that looks good, using my intuition as a statistician, and they may not be any good. So it's always worth either changing the hyperparameters by an order of magnitude or so, or, as I'll do here, going even further and specifying completely different priors. I'm going to use what's called a flat prior for the mean: truly uninformative, a flat line from negative to positive infinity, which actually means it's improper, it can't integrate to one. But that's okay; priors don't even have to be normalized. And for the standard deviation, a half-Cauchy distribution, which is kind of like a normal with fat tails, keeping only the positive half. Everything else stays the same, and I'll call this model prior_sensitivity. Here's its posterior, and here's the original posterior, and it looks pretty good: the mean is identical and the credible interval is only slightly different, so the inference isn't hugely sensitive to the prior choice, at least based on this quick comparison. That should give you some comfort that you haven't done anything completely weird. And the nice thing is that you can bring in somebody else, a radon skeptic or whatever, and they can put in their own, maybe more pessimistic, priors and see whether the data should still convince them. Very handy.
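A sketch of that prior-sensitivity model, assuming the same hennepin_radon data as before; the Flat and HalfCauchy priors stand in for the "completely different priors" just described, and the half-Cauchy scale of 5 is an assumption, not the talk's value.

```python
import numpy as np
import pymc3 as pm

with pm.Model() as prior_sensitivity:
    mu = pm.Flat('mu')                      # improper flat prior on the mean
    sigma = pm.HalfCauchy('sigma', beta=5)  # fat-tailed, positive-only prior

    y = pm.Normal('y', mu=mu, sd=sigma, observed=hennepin_radon)

    sensitivity_samples = pm.fit(n=20000).sample(1000)

# Compare against the posterior obtained with the original priors
pm.plot_posterior(sensitivity_samples, varnames=['mu'], ref_val=np.log(4))
```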
All right, that was one group. Now, two groups: two groups with continuous outcomes, which is often the case, because we want to compare two things. This is actually made-up data from John Kruschke's book, which is really great. It's an evaluation of the efficacy of a quote-unquote smart drug that's supposed to increase intelligence, so these are fake IQ measurements from a drug group and a sugar-pill group, and the question is whether the drug group is bigger than the placebo group. Who knows; let's tackle it. This time we don't have a fixed, arbitrary point to compare against; we've got another data set, which is perhaps the more common situation. For the likelihood, just for some variety, we'll use something slightly different. The thing about normal distributions is that they have really skinny tails; in other words, they're very sensitive to extreme values and don't tolerate them well, since roughly ninety-five percent of the values have to fall within two standard deviations of the mean. The t-distribution is a slight variant of the normal with fatter tails, so it tolerates, even expects, outliers. And in this data you can see we've got a really low IQ value down here and maybe a couple of really high ones up here. Most of the data is in the middle, but occasionally we get extreme values, and we don't want them to bias the mean, so I'm going to use the t. That's what the Student's t density looks like: much uglier, but we don't care, we'll just use PyMC3's implementation. The t has, in this case, a mean and a standard deviation, but it's got another parameter called nu, which is essentially a degrees-of-freedom parameter and defines how close to normal the distribution is. If nu is very large, the t looks very normal and the tails shrink; if it's small, close to zero, the tails are thicker. So it has three parameters rather than two. Our likelihood looks like this: there are two of them, one for the drug group and one for the placebo group, and I'm going to say the nus are the same. That's a model assumption I'm making, which we could test if we wanted to. The only things we're really comparing here are the means, but we could generalize this model if we wanted. Okay, just to give you a little hands-on time before the break: if you've got things working by now (maybe Anaconda is still downloading, who knows), go into PyMC3 and draw 10,000 samples from a Student-t distribution, which is just called StudentT in PyMC3 lingo, with parameter nu equal to 3, and compare those to a similar number of draws from a normal distribution with mean 0 and standard deviation 1 (a standard deviation of 0 wouldn't work very well, would it). Let's make this a simultaneous break: work on the example, which shouldn't take very long, go get some coffee, because I know I need some, and we'll reconvene in ten minutes; I'll answer questions in the meantime. All right, we're back, so let's reconvene. This exercise was just to build a little more familiarity with the interface: a StudentT distribution with nu equal to 3, where zero and one are the defaults for the other parameters. In Jupyter, if you hit shift-tab-tab on StudentT it tells you the arguments it expects. You parameterize it with whichever parameters you want, then call the random method with size equals 10,000, and similarly for the normal. You should end up with something like this, which shows exactly what I was trying to describe with my hands: the fatter tails. The blue is the Student-t, and you can see the fatter tails down here; conversely, the normal is more peaked in the middle. That comes in handy, some practical Bayesian stuff there.
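A sketch of the exercise solution, using the .dist().random() pattern just described:

```python
import pymc3 as pm
import matplotlib.pyplot as plt

# Draw directly from the distributions, outside of any model
t_draws = pm.StudentT.dist(nu=3).random(size=10000)
normal_draws = pm.Normal.dist(mu=0, sd=1).random(size=10000)

plt.hist(t_draws, bins=100, histtype='step', label='Student-t (nu=3)')
plt.hist(normal_draws, bins=100, histtype='step', label='Normal(0, 1)')
plt.legend()
plt.show()
```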
All right, that's the likelihood, so let's turn our attention to the priors. We need nu, the two mus, and sigma; though it looks like I actually used two different sigmas, okay, never mind. Same as before, normals for the means, this time centered around 100, just because it's IQ and I know something about IQs, with a standard deviation of 10. The sigmas are uniform between 0 and 20. And for nu I'm picking a weird-looking thing: an exponential with parameter 1 over 29, which was sort of hand-tuned to describe the range of t's between being very heavy-tailed and very normal-like. That's the degree of variability we're trying to describe. This is what that prior looks like: values between roughly 0 and 100 have enough prior probability that a lot of data can run them over. The thing you have to be careful about with priors is never to assign zero probability to values that could possibly be the true value, because you're multiplying your prior by your likelihood, and when you multiply anything by zero it stays zero, no matter how much data is trying to influence you. But even with fairly thin tails there's still some probability out there, and if you've got data to overwhelm the prior, it will. So this is reasonably good. And here's the likelihood: the drug likelihood and the placebo likelihood, both StudentT. There's nu, and there's lam (it's called lam in StudentT rather than sigma), and the observed values are the drug IQs and the placebo IQs, which come straight from pandas. So the model is fully specified and we could just run it now, but because we're testing particular hypotheses, there are other quantities we're interested in that are simple transformations of the unknown parameters in the model. What PyMC3 lets you do (and any other useful probabilistic programming language lets you do this too) is define what are known as deterministic quantities, and that's what I'm going to do here. I'm interested in two things: a difference of means, drug versus placebo, and a standardized effect size, where we take that difference and divide it by the square root of essentially the pooled variance. I could just write mu one minus mu naught; wrapping it in this Deterministic object lets the fitting mechanism save its values to the trace so we can use them later, otherwise it would just throw them away. I'll call one of them diff of means and the other effect size. These are just deterministic nodes, meaning they're completely determined by the values of their parents; that's the technical definition. You still get a distribution for them, but that distribution is purely a function of, in this case, mu one and mu naught: at every iteration it subtracts one from the other, and you end up with a whole distribution of differences. It doesn't add any stochasticity to the model at all. So we add those, fit the model, and we're done. The raw parameter values are a little less interesting, so let's go straight to the quantities of interest. The difference of means: the mean difference is basically one, so your expected increase in IQ from taking this awesome drug is about one point, and there's only about a 1% probability that the difference is zero or less. The standardized effect size, which is often the one of interest, gives the same conclusion: the treatment group is greater than the control group. It's a little confusing here because both numbers happen to be about 0.99: an effect size of roughly 0.99, with 99% posterior probability of being greater than zero. Okay.
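A minimal sketch of the two-group model just described, assuming arrays drug_iq and placebo_iq (hypothetical names). The talk's notebook parameterizes StudentT with lam; the equivalent sd argument is used here for readability, and the exact prior bounds and structure are assumptions.

```python
import pymc3 as pm

with pm.Model() as drug_model:
    # Priors: means centered at 100 (it's IQ), separate scales, hand-tuned nu prior
    mu_drug = pm.Normal('mu_drug', mu=100, sd=10)
    mu_placebo = pm.Normal('mu_placebo', mu=100, sd=10)
    sigma_drug = pm.Uniform('sigma_drug', lower=0, upper=20)
    sigma_placebo = pm.Uniform('sigma_placebo', lower=0, upper=20)
    nu = pm.Exponential('nu', 1/29.)

    # Student-t likelihoods: tolerant of the outlying IQ values
    drug_like = pm.StudentT('drug_like', nu=nu, mu=mu_drug, sd=sigma_drug,
                            observed=drug_iq)
    placebo_like = pm.StudentT('placebo_like', nu=nu, mu=mu_placebo,
                               sd=sigma_placebo, observed=placebo_iq)

    # Deterministic quantities saved to the trace
    diff_of_means = pm.Deterministic('difference of means',
                                     mu_drug - mu_placebo)
    effect_size = pm.Deterministic(
        'effect size',
        diff_of_means / pm.math.sqrt((sigma_drug**2 + sigma_placebo**2) / 2))

    samples = pm.fit(n=20000).sample(1000)

pm.plot_posterior(samples,
                  varnames=['difference of means', 'effect size'],
                  ref_val=0)
```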
All right, let's do an exercise. There's a data set I use in my class from time to time: Nashville's historical precipitation, going back to 1871, and I want to know two things. What's the probability that the expected rainfall in January is larger than in July, historically? And what's the probability that January rainfall exceeds July rainfall in a given year, say next year? Give that a try; I've done the data import statement for you, so you just have to specify the model. Actually, I'm going to help you out a little, so let's do this first part together. Pull the July rainfall out of the precipitation DataFrame as jul_rain, do the same for January, and have a look at it. Rainfall data looks like this, and this is the tricky first part. You may have panicked when I said you'd do this by hand on your own, wondering how you get from this data to specifying a model. Again, start at the beginning: what does my likelihood look like, and then what do my priors look like? Looking at rainfall data: it can't be negative, not even in Arizona, and it usually has a long tail out to the right. So which distributions do we look at? Wikipedia actually has a pretty handy reference of distributions, although it doesn't do a good job of highlighting which ones get used all the time; I've never heard of a Champernowne distribution and you'll probably never use one in your life. In reality there's a handful of main ones. PyMC3 has maybe 30 or 40 of them, so you can go look at the PyMC3 documentation, or do from pymc3 import distributions and inspect the list, but that can be a little intimidating too. This case is pretty straightforward, though: rainfall is continuous but positive. One easy option I'd recommend is to take the logarithm of the rainfall and use a normal; if we look at that, it's now skewed the other way, but it's a reasonable starting point. You could also use a gamma distribution; gammas are nice. In real life, presented with this problem, you'd probably go look at rainfall models and see what other people have used. But gamma distributions are useful; they can look bell-shaped, but they can also take this kind of skewed shape that looks rainfall-ish. Yeah, let's do that one, and let's do it together. with model as rainfall_model, or call it whatever you want, and start with the likelihoods. We'll call the first one jan; I've already got jan_rain as the name of the data. We need to import Gamma: from pymc3 import Gamma. The gamma can be specified two different ways. The canonical way has a shape and a scale parameter, alpha and beta, which determine that slightly weird shape, but those are hard to interpret, so alternatively you can parameterize it with mu and sd, a mean and a standard deviation, kind of like a normal. We'll do that here, just like before: mu equals mu_jan, and for sd we'll use a single sigma, assuming, at our peril, that the two months have the same standard deviation; and observed equals jan_rain. And of course we copy this and do July as well.
Cool, the likelihoods are done; now the priors. We need one for sigma. For those of you not familiar: if you're on Python 3 in Jupyter and you want a Greek character, you just type backslash sigma, or whatever the Greek letter is, hit tab, and it converts it for you. So what did we use for the standard deviation before? I did a uniform once, and I did that weird half-Cauchy thing; pick one. Let's make sigma uniform from zero to a thousand. Then mu_jan: this is just a mean, but it does want to be positive in this case, so let's go uniform again, from zero up to, I don't know, 25; rainfall never gets that high. We do the same thing for July. Then I'll add one of those differences again. I'll just call it d, a Deterministic that is mu_jan minus mu_jul; and it breaks, because jul doesn't exist, because I called it july_rain. There we go, model specified. Any questions about the model specification? I'll give you a second or two to catch up while I drink some coffee. Then it's exactly what I did before, just with a slightly different problem: with rainfall_model, samples equals fit, and something on the order of 20,000 iterations usually works, then draw a thousand samples. Done. Now we do the plot_posterior thing and give it the set of samples, and so that it doesn't show all of the variables you can specify particular ones; all I'm really interested in is d (and it's samples with an s). If you want, it's sometimes useful to set ref_val equal to zero, which marks the no-difference point. So, our question was: what's the probability that expected rainfall is greater in January than in July in Nashville? About a twenty-nine percent chance of the expected value being larger, based on historical data. How about for one particular year, say next year, 2018? What's the probability of January being larger then? I could have specified something in the model, or I can work with the posterior values afterwards. It was easier before, when I only had one unknown parameter per group, so let's do it this way: I'll go back and run the model again, because that might be more intuitive. I'll add a variable called january_2018, copying the likelihood line but without the observed part, because this isn't something I've observed, and the same for July. One way to do this by hand would have been to go back into the trace, pull out the individual mu and sigma values, sample from gammas myself, and count how many Januaries came out larger. This is a semi-automated way of doing the same thing: this draws my 2018 values, and then I compare January against July for each pair in the zipped 2018 samples. I'm sure there are more efficient ways of doing this, but it works: about a 33% chance. So it's a little bit larger, not as extreme as the expected-value answer, because there's a little extra stochasticity here. There are two sources of uncertainty: the parametric uncertainty (we have a pretty good idea of the two monthly means because we have a lot of data, but we don't know them exactly) and then the stochastic part of drawing a single year from a gamma distribution. So you get two slightly different answers.
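Putting the pieces of the rainfall exercise together, a sketch assuming a DataFrame nashville_precip with JAN and JUL columns (hypothetical names); the prior bounds mirror the ones mentioned above.

```python
import pymc3 as pm

jan_rain = nashville_precip.JAN.values
jul_rain = nashville_precip.JUL.values

with pm.Model() as rainfall_model:
    # Priors: positive means for each month, one shared standard deviation
    mu_jan = pm.Uniform('mu_jan', lower=0, upper=25)
    mu_jul = pm.Uniform('mu_jul', lower=0, upper=25)
    sigma = pm.Uniform('sigma', lower=0, upper=1000)

    # Gamma likelihoods, using the mean/sd parameterization
    jan = pm.Gamma('jan', mu=mu_jan, sd=sigma, observed=jan_rain)
    jul = pm.Gamma('jul', mu=mu_jul, sd=sigma, observed=jul_rain)

    # Question 1: difference in expected rainfall
    d = pm.Deterministic('d', mu_jan - mu_jul)

    # Question 2: unobserved gammas stand in for a single future year
    jan_next = pm.Gamma('jan_next', mu=mu_jan, sd=sigma)
    jul_next = pm.Gamma('jul_next', mu=mu_jul, sd=sigma)

    samples = pm.fit(n=20000).sample(1000)

pm.plot_posterior(samples, varnames=['d'], ref_val=0)        # ~0.29 in the talk
print((samples['jan_next'] > samples['jul_next']).mean())    # ~0.33 in the talk
```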
Do I need to slow down? Anybody else need me to slow down a little bit? Maybe a couple of people. Is it that I'm talking too fast, or going too fast through the examples? I'll try to slow down in general; please do ask questions, since that's the best way to slow me down, and if you have a question put up your yellow sticky and I will see it. Yours is up; have you got a question? Sure. And if you want to see something I scrolled past: I will post all of these notebooks afterwards, rendered, with the examples and everything filled in, for sure. Is it PyMC3-related errors you're getting? If you do from pymc3 import Gamma, does that error for you? One step that some people miss: if you started the notebook server before you did source activate stat-pycon, then you wouldn't have gotten all of the goodness in the environment. Anaconda installs a lot of this stuff for you, but PyMC3 certainly won't be installed automatically, so you have to install the virtual environment, activate it, and then, when you're all done today, source deactivate to go back to your original settings. Let me have a quick look at yours. Hopefully this gives some folks time to catch up. Okay, let's push on a little bit, and I'll try to slow down; please do interrupt me with questions if I lose you at any point. We're not going to get through all of this material, and that's fine; I'd rather you understand what we do get to than have me sprint to the end. So, yes: the three steps, specification, calculation, and model checking. I'm actually skipping the model-checking step here, because we've checked a model like this before, and that's okay. Are you okay, conceptually, with how I specified the model, how I chose things for it? The first step was to have a peek at your data. Looking at this data, I said: that's sort of gamma distributed. I could have used a log-normal distribution, or a couple of other things; I'm just trying to use something that's positive and continuous. Everybody good with me there? So I went in and instantiated a model, called it rainfall_model, and started with the gamma likelihoods; I could have done it the other way around and started with the priors. I created two gammas, one for January's rainfall, whose data is jan_rain, with mean mu_jan and standard deviation sigma, and then another exactly the same for July. At least one person cut and pasted and forgot to change Jan to Jul, and it looked like a mess, which we're happy about, in a way: the error is a good sign that you messed up and need to go back and debug. So we've got January and July, two gammas, two different distributions that overlap somewhat. Then we specify the priors, the things we don't know. There are two different mus, and I used one standard deviation. I could also have said, well, maybe the two months are differently dispersed, and used two sigmas, sigma_jan and sigma_jul; that's just a more general, more complicated model, and I can do that if I want. Uniform... oh, did I do it again? That wasn't even cut and paste, just me being careless. Okay, that works too. Then I wanted to look at the thing we're comparing. That part is extra, a convenience; I could have just
stopped here, gone back afterwards, and pulled out the parameters I need. The Deterministic is a convenience because it does the subtraction for me and gives me samples of the difference directly. It's just a deterministic node: it takes this and subtracts it from that, and there are no extra unknowns involved. Then these are just draws from the posterior predictive: I'm taking the model and drawing a single extra value, which I called 2018, but it could be anything; it could be a year in 1720 that I didn't observe. It's just some year I didn't observe, and the question is, if it comes from this model, what do I think its value is going to be? So now I'm done and I run the model. Someone asked me about the 20,000 iterations: I'm trying to keep that part a black box; it's just a long enough run for us to get a good estimate, which is all we really need right now. Additional questions about this? The deterministic isn't used in the fitting itself; I use it afterwards, because I'm interested in that value and I want its distribution. Here's where I use it, at the end: what's the distribution of that difference, and how different from zero is it? There's a 91% chance that it's greater than zero, and it's on average 0.28 larger; 0.28, sorry, not twenty-eight percent. Does that help? Okay, please do stop me; I don't want to lose people in this, because that would be failure. So now we're going to switch to another type of data: binary outcomes. As I said, these are pretty common, yes-and-no things, and obviously not continuously distributed, so the first thing that should come to mind is that we're probably not going to use normal distributions anymore, at least not for the data likelihoods. We get binary outcomes a lot, but you should never, ever create them: never take a continuous variable and make it binary, because you're throwing information away if you do that. You may be interested in some threshold value, and you can deal with that after your analysis, but any time you discretize something continuous, or dichotomize it, all that money you spent on collecting data, you're essentially taking some of it and throwing it in the trash. Always build a model for the data as it was originally collected; afterwards, yes, you can do threshold-type summaries. This is something I run into a lot with doctors: they like to diagnose, sick or not sick, if your BMI is greater than 30 you're obese, otherwise you're not. Well, come on. Some things really are step functions, where you cross a line and bad things happen, but generally that kind of thing is a continuum, so you can fit the continuous model, use all the information, and then ask what happens at BMI 30, or at some other arbitrary threshold. You can still do that decision-making, but do it afterwards, and then you're not throwing information away. That's the difference. So, for data that are dichotomous, the distribution we're interested in is the simplest distribution in statistics. It's called the Bernoulli distribution, and it describes the outcome of a single coin flip, essentially. It has one parameter, p, the probability of the event occurring, where the event is one of the two outcomes, true, or heads, or
something like that. And x is either 0 or 1; there are only two possible values. You can see what happens in the formula: if x is 1, the second factor's exponent becomes 0 and that factor goes away, leaving p; if x is 0, the first factor goes away, because anything to the power of 0 is 1, and you're left with 1 minus p. So it's either p or 1 minus p: the probability of the event occurring, or the probability of it not occurring, which is just the complement. You often see sums of Bernoulli events, which is known as binomial data. Sorry for always using clinical examples; let's use a baseball example. If you're a sabermetrician, somebody hitting the ball 30 times out of 100 at-bats is essentially 100 coin flips, 30 of which came up heads. Same idea, and that data would be binomial. A binomial distribution is just the sum of n Bernoullis, so if you had 10 coin flips and you wanted to assess the probability of heads, you'd have n equal to 10 and p the probability of heads, and that would be the model you use. And you can see they're very similar: if you put 1 in place of n, the binomial actually reduces to the Bernoulli, so the Bernoulli is just a special case. Okay, so this is the distribution we're interested in here. There's a whole raft of distributions, like I showed you, but in reality there are maybe 15 that you use every day, and the binomial is definitely one of them, probably the second most important distribution in statistics after the normal. All right, my data for this one is biomedical: a set of 671 infants from the Duke University Medical Center, all very low birth weight infants. If you're born prematurely there are all sorts of negative health consequences associated with it, and the outcome we're going to study here is IVH, intraventricular hemorrhage. A ventricle is a cavity in your brain, essentially, that holds your spinal fluid, and you can get bleeding in there; hemorrhage means it bleeds, and it's obviously not something you want to happen. So it's a negative outcome, and we're interested in things that are potentially associated with it. This VLBW data set is actual real data, and it has a whole bunch of variables: how long they stayed in the hospital, a bunch of labs, race. The one we're going to be interested in here is birth weight, and there's another binary variable, the presence of a pneumothorax, which is essentially a collapsed lung; that can also happen to these infants. We're going to see whether any of these things can predict whether or not you get intraventricular hemorrhage in low birth weight infants. So what we're going to do is see whether there is a difference between infants with and without this collapsed lung, and we can do our pandas magic, pandas crosstab. In most patients there's no pneumothorax and no IVH, so they're both kind of rare events, even if you're born prematurely; some of them have both, and then there's every other combination. The IVH variable has categories like definite and possible, and things become a little more complicated when you have more than two categories, because it's a different distribution, so we're going to combine definite and possible into a single positive outcome. That's what this step is doing: I create two variables, one called ivh, yes or no, and the other for the pneumothorax, which I'm just calling x.
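A sketch of that data-preparation step; the file path, column names, and category labels are assumptions about the VLBW data set, not taken verbatim from the talk.

```python
import pandas as pd

# Hypothetical file name; the talk loads the Duke very-low-birth-weight data set
vlbw = pd.read_csv('vlbw.csv')

# Cross-tabulate collapsed lung (pneumothorax) against the IVH outcome
print(pd.crosstab(vlbw.pneumo, vlbw.ivh))

# Collapse 'definite' and 'possible' IVH into a single positive outcome,
# and code the pneumothorax indicator as 0/1
ivh = vlbw.ivh.isin(['definite', 'possible']).astype(int).values
x = vlbw.pneumo.astype(int).values
```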
We already know our likelihood: it's going to be Bernoulli, because every individual baby in the data set is a coin flip, essentially. So we're interested in a prior distribution for the probability of this unfortunate event, the IVH. What do we use? Think about p: it has to be between 0 and 1, so it's more heavily constrained than just being positive; you can't have probabilities greater than one or less than zero. One of the cool distributions in statistics is the beta distribution, which is defined on the unit interval, as they call it, and it has two parameters, alpha and beta, similar to the canonical gamma parameterization. You can make it do lots of things: the more parameters a distribution has, the more weird stuff it can do, the more flexible it is. It can be sloped to one side, it can be almost normal-looking when it's in the middle, it can be bathtub-shaped. The one we're most interested in here is the flat one: alpha 1, beta 1 is actually a uniform distribution, saying equal probability for all values. So we're going to use the beta as the prior distribution for the Bernoulli. That essentially specifies our model: two Bernoullis, each of which has a different p, a different probability of getting IVH, the unknown value is that probability, and we put beta priors on them. Everybody following? Let's dive right in. For the priors I'm changing my notation slightly, just to give you an idea of what you can do: instead of having p_pneumothorax and p_no_pneumothorax, I'm just going to say p with shape equal to 2, so I get two of them and index out the one I need; it's a little more economical. And down here is the Bernoulli likelihood. I have to import it first, as I did with Beta; I must have imported Beta somewhere earlier. I'm going to use the nice numpy-style indexing for this: x is 0 or 1, no pneumothorax or pneumothorax, so indexing p by x gives p[0] for the no-pneumothorax group and p[1] for the pneumothorax group, and I don't have to specify two separate likelihoods. Is everybody clear about what that's doing? The p here is a vector of size two, and the indexing works just like numpy. As a dummy example, let foo be a numpy array of negative five and seven, and let foo_index be 0, 0, 1, 1, 1, 0, 1, 0, 1; then foo indexed by foo_index pulls out the appropriate value each time, so in the model it's pulling the appropriate p for the appropriate patient. Just plain Python and numpy indexing, nothing too magical. And then observed is the ivh variable. Question: sure, can Beta(1, 1) really be the prior for p? Remember what a Beta(1, 1) looks like: flat. We're saying that a priori we have no idea what this probability is. Obviously we do know something (it's a rare event, so maybe I should have chosen a prior sloped toward zero), but it probably won't matter in this case. What's happening internally is that the prior just defines an information state: this is what we know about p before we did our experiment, namely that it could be anything. The likelihood is this Bernoulli, and remember, Bayes' formula is our estimator: combine that prior with the information in the data and, as Bayes' rule tells us, we get what we now know about p given the data.
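A minimal sketch of the model just described, assuming the ivh and x arrays from the preparation sketch above:

```python
import pymc3 as pm

with pm.Model() as ivh_model:
    # One probability per group (index 0: no pneumothorax, 1: pneumothorax),
    # each with a flat Beta(1, 1) prior
    p = pm.Beta('p', alpha=1, beta=1, shape=2)

    # x is 0 or 1 for each infant, so p[x] picks the right group's probability
    obs = pm.Bernoulli('obs', p=p[x], observed=ivh)

    samples = pm.fit(n=20000).sample(1000)

pm.plot_posterior(samples, varnames=['p'])
# In the talk: roughly 0.12 without a pneumothorax, 0.32 with one
```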
So don't try to think too hard about what this is sampling from. We do sampling to approximate our posterior, but really we're just multiplying probability distributions together and integrating them; that's all we're doing. A question about setting the beta parameters equal: they're equal here and they're equal here, just different values, 5 and 5 versus 0.5 and 0.5, and those give different shapes; they're just different parameterizations. Alpha and beta are hard to interpret, since they're both kind of shape parameters, not as easy to get your head around as a mean and variance. All you're really looking for is a shape that describes your uncertainty. I do this all the time with folks who aren't statisticians: does this describe what you think the value is before you go ahead and run the trial, or is it that one? Nobody cares whether it's a Beta(2, 3) or whatever; the question is whether it represents your knowledge, and these just happen to be the parameter values that give you that shape. So don't overthink the particular values. The 1-and-1 case is an important one, though: that's uniform, which is a special case. If you really want an interpretation, alpha is roughly the prior number of events and beta the prior number of non-events, so 5 and 5 is fifty-fifty with five of each, and if it were alpha 5,000 and beta 5,000 it would be a little spike over 0.5. This one is one success and three failures, so you see it's tilted toward zero, but not dramatically; that's the intuition if you want one. The flat one, on the other hand, claims we know nothing; that last one over there is ignorance. Any other questions? Yes, priors take a while to get your head around, but what you're trying to do is describe a state of knowledge. For something like a clinical trial you'd almost always want flat priors, because the FDA doesn't care what you think your drug is doing; the FDA wants to know whether your data and your experiment reflect it. Whereas if I'm an epidemiologist and I want to know something about stopping an outbreak that's going on right now, I want to use all the prior information I have to help make a decision about whether to, say, vaccinate all the children under five years old. That's when you want prior information. The goal at the end of the day is to get the right answer most of the time, and that involves bringing all available information to bear on the problem: if that's prior information, great; if it's data from an experiment you're doing, that's good too. And yes, the FDA has absolutely embraced Bayesian methods. Yep, and that question is really about the amount of information: what if I gave this a different alpha? Hey, let's not just describe it, let's try it. Hold the questions for a second; first I'm going to run the model as it is. Here are the p's: p for pneumothorax is 0.32 and p for no pneumothorax is 0.12, so we're saying a 12% chance of IVH with no pneumothorax and a 32% chance with one, and the difference is pretty definite; there is an effect of the pneumothorax. So let's go back. The claim here, since I've been crappy about repeating the
questions (sorry about that, for the people watching this on video), was that even if we give the model a misspecified prior, if you've got enough data it's not a big deal, and that's generally pretty true. So if I go in and give the p's, say, a Beta(5, 1) prior: remember, the means down here were 0.12 and 0.32, and the new results are pretty close to that, so turning the prior from flat to sloped basically didn't do very much. Now let's specify a Beta(50,000, 50,000) prior, which is basically a little spike at 0.5. That, you absolutely can use to influence the result: 50,000 and 50,000 is essentially a prior sample size of a hundred thousand, and my data set here is, what, a hundred and thirty or something like that, so if you give the model far more prior information than experimental information, the prior will overwhelm the data. That's cheating, but you can't hide it from anybody: you have to say somewhere in your paper that you used a Beta(50,000, 50,000) prior, and somebody will say, excuse me? A question: do the two p's have to add up to one? No, they don't know about each other. One is the p for the kids with the pneumothorax and the other is the p for the kids without; they're not associated with one another. It's not the probability of getting IVH versus the probability of not getting it; it's like the batting average of one team versus the batting average of another team. Correct, the question was whether the p's had to add up to one, and no, they're two different probabilities for two different populations of kids. I'd better change this prior back before I commit it. Any other questions about that? Okay, we've reached the end of that piece. Yes, that's a big advantage of Bayes: you can keep turning the crank on the Bayesian machine, taking the posterior from one analysis and then using it as the prior the next time you do an analysis of the same phenomenon. There are sometimes technical challenges to doing that. The nice thing about this beta-Bernoulli setup is that it's conjugate: with a beta prior and a Bernoulli or binomial likelihood, the posterior will also be beta distributed, so you can take that beta and immediately insert it as the prior next time. That's not always the case. Earlier, with a normal likelihood and a half-Cauchy or a uniform prior, the posterior doesn't have any particular closed form, so you'd have to somehow summarize the posterior and then use that summarized form, losing a little bit of information in practice. In theory, though, there's nothing preventing you from plugging that distribution in as your next prior.
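A small numerical sketch of the conjugacy and prior-sample-size points above, with hypothetical counts k and n chosen to be roughly the scale of the IVH example:

```python
# Beta(alpha, beta) prior + k events in n trials  ->  Beta(alpha + k, beta + n - k)
k, n = 16, 130   # hypothetical counts, roughly the scale of the IVH example

for alpha, beta in [(1, 1), (5, 1), (50000, 50000)]:
    post_mean = (alpha + k) / float(alpha + beta + n)
    print('Beta({}, {}) prior -> posterior mean {:.3f}'.format(alpha, beta, post_mean))

# The first two priors barely move the answer; Beta(50000, 50000) acts like a
# prior sample of 100,000 observations and drags the posterior toward 0.5
# no matter what the 130 data points say.
```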
So, we have about 20 minutes left, which isn't much; let me see if we can do anything with regression in that time. Yeah, I think we can. Regression models essentially generalize what we saw for comparing two groups. When we're comparing two groups, the input is basically just two values: pneumothorax or not, placebo or treatment. Sometimes we have a continuous variable, or several continuous or binary variables, as inputs: for example, how two different medical interventions influence the incidence of disease, or how a baseball player's performance changes as a function of age, which is continuous. And here's a fun example I found: how test scores are correlated with tissue LSD concentration. It's a nice little data set; note the year, 1968: correlation of performance test scores with tissue concentrations of lysergic acid diethylamide in human subjects. So they gave people LSD and made them take tests, and, surprise surprise, when you pull the data in, scores go down as you get more drug. Although it's funny: I saw an article in The Guardian just the other day, when I was putting this together, titled something like Cary Grant: how a hundred acid trips in Tinseltown changed my life. So it depends on what you're being tested on, I suppose; if it were songwriting or creative writing or something like that, maybe it would go the other way, but this probably wasn't that. So let's fit a model to this. We want to draw a line, kind of like this, and what's the criterion for drawing that line? Why am I putting my arm here and not over there? What am I trying to minimize? Right: the difference between the line and the points; put the line as close to all the points as you can. We've all probably taken regression courses, so I'm just checking. We want to characterize the relationship between x and y, and generally speaking what we're doing is generating conditional expectations. We had conditional expectations before, but they were just two points, treatment and control. Now we want the expected value of y given x, where x could be any age, or a concentration of LSD, and we get an expected value of the outcome given x. If it's a Bernoulli, a binary outcome, it's the probability of, say, a pneumothorax as a function of birth weight; that's one of the examples later on. In general this function of x can be anything; it could be a weird nonparametric nonlinear thing. But regression models tend to use linear combinations: we multiply a coefficient by each of the predictor values, so you can easily generalize to more than one variable, take the sum of those, and add a baseline value, an intercept. We've probably all seen this in a stats class before. There are two important linear regression assumptions. One is that the errors, the differences between the line and the points, are normally distributed. The other is what's called homoscedasticity, which means the variability doesn't change along the line; if the data are more variable at one end than the other, that's not true, regular linear regression is not for you, and you've got to do something slightly more general. It actually doesn't take much to fit one of these regression models, without any fancy PyMC3 magic; you can just use straight-up optimization. What you need is a criterion for choosing the values of those beta parameters, and our criterion was: as close as possible to the data. That's more or less what this expression is: the data minus the predictions from the model, squared. I don't have to square it, but I have to do something, because positive and negative differences would cancel each other out, so you can't just add them up. Squaring does two things: it prevents the cancellation I just talked about, and it also
more strongly penalizes large deviations, because squaring makes the bigger differences bigger still. So this is essentially a value judgment about our line: we don't like big deviations, only little ones. And we're going to minimize this; we want it to be as small as possible. It's just optimization, so we can do it more or less by hand. Here's the function as a lambda expression in Python; it's just that formula. (If I scroll down we'll see somebody else's value judgment; these are just criteria.) I can plug in arbitrary values, beta naught equal to 0 and beta one equal to 1, and that's the sum of squares; and if I change it slightly, make it 0.7, I get a slightly worse value, so I should go in the other direction. That kind of thing. So we can go into scipy and pull out fmin, which is a Nelder-Mead optimizer; there are many ways of optimizing, and we'll just use this one as a black box. Give it x and y, the values from the LSD data set, run fmin, and all it's doing is hill climbing (or valley descending, the opposite of hill climbing), minimizing that value, and lo and behold, that's our line. And it's these vertical distances it's minimizing; sometimes people think it's the perpendicular distance, but it's not, it's these ones. So you don't need to be Bayesian or do sampling or anything to do this; you can just use optimization if you want to. Alternative loss functions: what if we don't want to punish big deviations so hard? We can use absolute values. We still want to get rid of the cancellation effect, but we might not want to whack the points that are way out of line. In this case there weren't any big outliers, so I think the line looks pretty much the same. We don't have time for the exercise I was going to do here, which is to add one more extreme observation, out here, and then fit it the two different ways. What ends up happening is that the sum-of-squared-errors line gets drawn out and starts tilting toward the outlier a little bit (it's only one point, so it's not super influential), whereas the absolute-value line pretty much stays where it is, because the influence is allocated proportionally. That's sometimes known as robust regression. Although, if you never want to throw out outliers, you can model the outliers instead, which is a much better way of doing it; this is just for when we don't want an extreme value to influence things too much. And we're not restricted to straight-line models: we can expand this into what's called polynomial regression. All I've done here is add an additional term, x squared, which allows things to be nonlinear, to be quadratic. If I run this one, there's a very gentle curve; this isn't a great example for it, because the relationship essentially is linear, nothing very subtle is happening with the LSD effect on your test scores, but there is a very slight curve. The nice thing is that it's still a linear model: even though the curve is not a straight line, it's still a linear combination of the parameters. You take a vector of betas and a vector of x and x-squared values and multiply them together, so it's still a linear model; don't confuse those two things.
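A sketch of the optimization-based fits described above, assuming arrays dose and score for the LSD concentrations and test scores (hypothetical names):

```python
import numpy as np
from scipy.optimize import fmin

# Squared-error loss: penalizes big deviations heavily
sum_of_squares = lambda theta, x, y: np.sum((y - theta[0] - theta[1] * x) ** 2)

# Absolute-error loss: the "robust" alternative that tolerates outliers
sum_of_abs = lambda theta, x, y: np.sum(np.abs(y - theta[0] - theta[1] * x))

b0, b1 = fmin(sum_of_squares, [0, 1], args=(dose, score))
b0_r, b1_r = fmin(sum_of_abs, [0, 1], args=(dose, score))

# Polynomial (quadratic) regression: still linear in the betas
sum_squares_quad = lambda theta, x, y: np.sum(
    (y - theta[0] - theta[1] * x - theta[2] * x ** 2) ** 2)
b0q, b1q, b2q = fmin(sum_squares_quad, [0, 1, 0], args=(dose, score))
```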
Here's a data set that's a bit more curved, though, and that would probably be a better place to use the polynomial. Okay, in the ten minutes left I'm just going to show you how to do this the Bayesian way, and then we'll wrap up. So, the same thing here, a Bayesian approach to regression, and the nice thing is that the line isn't the only thing we get: we also get the uncertainty in that line, just like the uncertainty in our predictions about the probability of IVH. So we need priors. What do we need priors on here? Those betas, beta naught and beta one, or however many betas we have, and we also need a prior on the variance, which describes the spread of the differences between the points and the line. So: regression parameters and a variance. For the betas I'll just use normals with mean 0 and standard deviation 100, really wide, almost flat; I could have used uniforms again. And for the standard deviation, that half-Cauchy again (maybe I could also use a uniform here). This is what's known as the process uncertainty: again, the scatter of the points around the line. So those are our priors. Oh, that's not what a half-Cauchy looks like... let me run it again... there, that's what a half-Cauchy looks like. Cool. So here are my priors, and I'm going to give them user-friendly names: intercept, slope. Sigma isn't exactly user-friendly, but never mind. Then the likelihood: this is linear regression, so the errors are normal. That was one of the assumptions, and in fact you don't even have to make a choice here: if you're not choosing normal, you're not doing ordinary linear regression. So the test scores are normal with mean mu and standard deviation sigma, where mu is the sum of the intercept and the slope times x, the dose of LSD. Notice that I didn't wrap mu in a Deterministic this time. As a technical note, that's because I wasn't planning on looking at mu: if I don't wrap it, the model won't keep track of mu at every iteration of the fitting, it just gets used as an intermediate value, and I'm mostly interested in the beta parameters. So there it is; we're done. And this time, just for variety, I'm not going to use variational inference: instead of fit I'm going to use sample. Sample runs Markov chain Monte Carlo, a Monte Carlo-based approach that draws dependent samples and does the same job; it gives us samples from an approximate posterior. Okay, it's done. I could just as easily have used fit; I'm just showing some variety. The slope here is about negative nine, which says that for every unit increase in the dose of LSD you can expect to lose about nine points off your test score. Show that to your kids: this is your brain on drugs. We've got a few minutes, so let's do the same thing with checking the model. Here's a good example of checking model fit, maybe even clearer than the one I showed you before. I had about seven data points there, and what I'm going to do is, for each of those seven actual observations, simulate a whole bunch of values from my model and see where the simulated data lie relative to the observed data, and how close they come.
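A minimal sketch of the Bayesian regression just described, plus the sample_ppc posterior predictive check discussed next; dose and score are the assumed data arrays from the earlier sketch, and the half-Cauchy scale is an assumption.

```python
import pymc3 as pm

with pm.Model() as lsd_model:
    # Priors: nearly flat normals on the coefficients, half-Cauchy on the
    # residual standard deviation
    intercept = pm.Normal('intercept', mu=0, sd=100)
    slope = pm.Normal('slope', mu=0, sd=100)
    sigma = pm.HalfCauchy('sigma', beta=5)

    # Ordinary linear regression: normal errors around the line
    mu = intercept + slope * dose
    scores = pm.Normal('scores', mu=mu, sd=sigma, observed=score)

    # MCMC this time, rather than variational inference
    trace = pm.sample(1000)

    # Posterior predictive check: 1000 simulated replicates of each observation
    ppc = pm.sample_ppc(trace, samples=1000)

print(trace['slope'].mean())   # about -9 in the talk
print(ppc['scores'].shape)     # (1000, number of observations)
```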
Here's a good example of checking model fit; it might be even clearer than the one I showed you before. What I'm going to do is this: I had what looked like about seven data points there, and for each of those seven actual data points I'm going to simulate a whole bunch of values from my model and see where the simulated data lies relative to the observed data, to see how close they get. I could do this by hand, but there's a nice little utility function in PyMC3 called sample_ppc, where PPC stands for posterior predictive check. This is exactly the same equation I showed you earlier: recall that it's a function of both your posterior distribution and the sampling distribution of your data. So it has two components, and it's usually more variable, because it includes the uncertainty about the parameters plus the stochastic random-sampling part. So I'm going to run sample_ppc for a thousand draws; it's going to draw a thousand simulated values for each data point that I used to fit this model. There were seven data points, so there are a thousand draws for every one, which means we get a distribution for every one, and I'm going to plot each of them. That's what we get, and this is model goodness of fit: it is a good fit, or, to be technical, there's no evidence of lack of fit. That doesn't prove anything, it's only seven points, but we're happy that these red lines, which are the true data, are believable draws from these distributions. So it's saying that yes, my model could have generated this data; yes, the relationship between LSD dose and test score is approximately linear. We'd be worried if the observed value was way out here.
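A minimal sketch of the posterior predictive check described above, assuming the lsd_model, trace, and score objects from the previous sketch; sample_ppc is the PyMC3 utility named in the talk, and the plotting is just one way to compare each observed score with its simulated distribution.

```python
import matplotlib.pyplot as plt
import pymc3 as pm

# Draw 1000 simulated datasets from the posterior predictive distribution;
# each draw contains one simulated score per observed data point.
with lsd_model:
    ppc = pm.sample_ppc(trace, samples=1000)

sim_scores = ppc['scores']  # shape (1000, 7): a distribution for every observation

# One panel per observation: histogram of simulated scores,
# with the observed value marked in red
fig, axes = plt.subplots(1, len(score), figsize=(14, 2), sharex=True, sharey=True)
for i, ax in enumerate(axes):
    ax.hist(sim_scores[:, i], bins=20)
    ax.axvline(score[i], color='red')
plt.tight_layout()
plt.show()
```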
It's 4:37, so obviously we didn't get to a lot of this other material, like missing data, and that's fine; I didn't want to blow through it any faster than I did, and I left a few minutes for questions. So maybe let's take five minutes if you've got big-picture questions about Bayes or applied Bayes. Yeah? Right, that's a good one. At the bottom of each of these notebooks I do have what I'd call specific references on specific topics. For regression, this book, Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman and Jennifer Hill, is a really great applied regression book if you do a lot of that kind of work, and it's Bayesian. A really good site for building intuition is Bayesian Methods for Hackers. That one is kind of cool: it's essentially an open-source GitHub repository that is a textbook, a community-developed textbook, and it gives you a really intuitive taste of Bayesian inference, also using Python and also using PyMC3. I'd recommend it; it's very readable, not very technical in the mathematical sense but technical in the computational sense, so it's probably good for this audience. There are a bunch of good intro-to-Bayes books, but I would start with those, and any of Andrew Gelman's books are very good. He's a social scientist, he's in the stats department at Columbia, but he talks in English. Any other questions? Yeah? I started PyMC when I was a postdoc, back around 2003, and it was pure Python. A lot of these methods, like MCMC, are really, really slow in pure Python, because they have to evaluate the complete model likelihood at every iteration, so we used Fortran extensions, via f2py, and some C as well. The more recent version, PyMC3, which you're using now, is based on Theano, which is an expression-evaluation library, similar to TensorFlow, that's used for deep learning and so on. It builds a big computation graph and evaluates it, and all the modern methods for fitting Bayesian models, like variational inference and the newer kinds of MCMC, use gradient information to accelerate the approximation. Theano does automatic differentiation, and that's why we switched over to it. So the answer to your question is yes, it's basically all Theano underneath, and we try to hide as much of the Theano from you as possible, because it's not easy to use Theano straight; it's like drinking firewater, you want to cut it with something. Any other questions? Outperforms? What do you mean by outperform? What are some examples? Oh, what will happen to them? I don't think they'll go anywhere. You know, I never took any Bayes when I was in college; I came to it very pragmatically. It had more to do with the fact that I had big models to build and really had no choice: I couldn't fit them using maximum likelihood and frequentist methods, so I had to adopt Bayes. All the philosophical niceties I was talking about had really nothing to do with my own journey toward using Bayes; it was more pragmatic. The other feature of Bayes that I didn't really talk about is that it looks very simple, but it's actually a bunch of simple building blocks that can be used to build very complicated models, so big hierarchical, multi-parameter models are much easier to fit with Bayes. You can do benchmarks and compare things on performance characteristics, but there are some models that you simply can't fit using non-Bayesian methods, and that's really the bigger difference. You can make Bayesian models and frequentist models look the same if you tweak them the right way, and show some sort of equivalence, but they're not equivalent, because they're using probability in two different ways. So it's hard. Yeah? Well, I gave you the criteria at the beginning: they're easier to interpret, they answer more direct questions about the uncertainties of interest, the questions of interest, and they scale better to bigger models. I think the biggest one, particularly for a non-statistical audience, is that they're easier to interpret. People think they know how to interpret a p-value, but mostly they don't; in fact most people don't even interpret confidence intervals the right way. So I usually sell them on that, and then on the types of models you can fit with Bayes that you just can't even dream of otherwise. I'm a big fan of convincing them by showing them, as opposed to trying to explain it or sell them: you just show them this big analysis, and oh, by the way, it's done with Bayes; that's almost secondary. And the results can be entirely different, again because you're involving priors, and the priors tend to shrink things. Have you ever done lasso regression or ridge regression? That's the equivalent of putting priors on the regression parameters, and it shrinks them toward zero; in fact the penalized estimates correspond exactly to particular priors. So yes, the priors will influence the results. But again, the probabilities you get from a frequentist analysis are probabilities about what you would observe in the data if you repeated your experiment a million times, or an infinite number of times; the probabilities you get here are statements about the state of information about the parameters. So it's really apples and oranges comparing them. Like I said, you can rig analyses to get frequentist confidence intervals with the same widths as Bayesian credible intervals, but they're addressing two different things: one is addressing the probable position of the mean, and the other is addressing the endpoints, and how often intervals constructed that way will include the true mean.
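To make the ridge/lasso analogy from that answer concrete, here is a minimal PyMC3 sketch under stated assumptions: the data X and y are synthetic and purely hypothetical, as are the prior scales. A Laplace prior centred at zero plays the role of the lasso penalty, and a Normal prior plays the role of ridge; the posterior mode under these priors corresponds to the penalized estimates.

```python
import numpy as np
import pymc3 as pm

# Hypothetical synthetic data: n observations, p predictors, a few true zero effects
rng = np.random.RandomState(42)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.0, 0.0, -1.5, 0.0]) + rng.normal(scale=1.0, size=n)

with pm.Model() as shrinkage_model:
    # Lasso-like shrinkage: Laplace prior pulls small coefficients toward zero
    beta = pm.Laplace('beta', mu=0, b=1, shape=p)
    # Ridge-like alternative: beta = pm.Normal('beta', mu=0, sd=1, shape=p)
    sigma = pm.HalfCauchy('sigma', beta=5)

    mu = pm.math.dot(X, beta)
    y_obs = pm.Normal('y_obs', mu=mu, sd=sigma, observed=y)

    trace = pm.sample(1000)

pm.summary(trace, varnames=['beta'])
```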
I should probably end it there. I'm going to do two things: A, thank you for coming and staying, and B, sorry for the technical difficulties, I should have anticipated those better, and sorry if I didn't go slowly enough. I hope you got something out of this. I'm not getting rid of the repository; I'm going to leave it up, so feel free to go through the rest of it, and email me with questions if you like. Good luck. [Applause]