Mike K Smith | Using rmarkdown and parameterised reports | RStudio

Captions
[MUSIC PLAYING] So my name's Mike Smith. I work at Pfizer in the UK, and I'm very pleased to be here to talk to you today. Because I realize that people are lazy and easily distracted, I'm going to give you the summary of the whole talk right up front. I use parametrized rmarkdown reports with child documents to write up an exploratory analysis, which I shared with some colleagues-- some of whom were quantitative, some of whom were not quantitative. The quantitative guys were the statistician in the team and clinical pharmacologists from my department, including my manager. There were other nonquantitative guys and, you know, it's maybe unfair to call a clinician nonquantitative, but I'm going to. So this report was intended to serve all of those purposes. Now, the analysis and the stuff I'm going to show today doesn't show that data because of confidentiality reasons, but it has very similar properties.

But what I really want to talk to you about today is cutlery drawers and what they say about you. Now, these cutlery drawers are from people that you might bump into at this conference. And I think it's an interesting exercise in how people arrange things, structure things, whether they do it by size, whether they do it by most frequently used, the differences between, perhaps, US and European cutlery drawers. But here's mine. [LAUGHTER] If you visit my house, you'll see my cutlery drawer. My wife will also be very surprised that you're visiting the house and quite alarmed that you just want to see the cutlery drawer. [LAUGHTER]

Now, I did the tidyverse Train-the-Trainer course, and thanks to Greg and Garrett for teaching on that. And I am now taking every opportunity to have a learning experience for the tidyverse. So if you take a cutlery drawer, you might want to group by type. You want to gather those things together and arrange. And if you could do that for my cutlery drawer, I would be really happy. New hashtag for the conference-- untidyverse.
[LAUGHTER] And I'm really pleased that XKCD views the need for tidiness in the house in very similar ways. Basically, if you've got Wi-Fi and a laptop, you can put all your other possessions in a big bucket marked miscellaneous.

OK, here's the premise for my talk. My brain is lazy, shallow, and very easily distracted. I hesitate to say that your brain is the same, but then we had Joe Cheng up here telling us all that our intuition sucks. So I think some people in this audience must have a brain that's lazy, shallow, and easily distracted.

Now, we're all familiar with this plot. It's from the R for Data Science book, and it gives us a framework or a structure for understanding any analysis. And on the Not So Standard Deviations podcast recently, there was a good discussion about this, and they came to the conclusion that really this is just a framework. It's a mental model. Real data analyses don't necessarily have to follow all of these steps, or follow them in sequence.

So let me explain to you something about my data analysis and how it works in practice. And I'm sure that these experiences I'm going to relate are just for me alone, so it's not you guys. You have your stuff well together. This is just me.

OK, here we go. So I go, and I get an email from my colleague who has the link to the data source. I download that data source, I read it into R, I wrangle the data, and I plot it. And then there's some time to step back, to reflect on my plot, to think about what it looks like so that after lunch, I can come back and make better plots. And I might fit a preliminary model to the data. Hey, it's the end of the day, it's time to go home. So the next day, I come in, and I find an email saying that the team has found an error in the data, so here's the new version of the data. And if your workflow is not reproducible, you're in a world of hurt here. So because I'm reproducible, I go to my markdown.
I change the input data, I recompile my report, I see what differences there are. I check against the previous version. I'm good to go on. I discuss the findings with my boss. I circulate the report, and my job is done.

Wait a minute. Did anyone get kind of distracted by the text in light gray? Perhaps it's not just me then. OK. The other thing is, did anyone notice that the transform and visualize bits from the R for Data Science diagram were back to front and swapped over? Cognitive load theory, score one.

So six months pass. This is not an exaggeration. I got an email yesterday from my manager saying, you know that team and the work that you did for them? They're interested in your results. Can you dig them out and share them? Now, if you're anything like me, what happens is you pop open your R script or something like that, and you think, what the heck did I do?

All right. But to the rescue, for me, comes R Markdown and Notebooks. These guys are now saving my life. I love Notebooks. Thank you, Yihui and team who are working on Notebooks.

And whenever you're writing, you need to think about who is the audience that you're writing for. Well, for me, when I write a Notebook, it's for distracted me. It's the one that is hopping between activities, not focusing 24/7 on my nice little analysis, but with a billion things to do. It's also the future me-- the six months later me that pops open the Markdown document and goes, all right, I'm good, because I've got my code, I've got the explanation, I can see the outputs. I've even got text that explains what I was thinking. Hurrah.

But here's another audience for your reports-- it could be a quantitative colleague who wants to see code, who wants to see data, who wants to see your assumptions, who wants to dig into your residual plots from your model. I can guarantee you the nonquantitative people won't want to see that.
So it's good if you can try to balance things up and have techniques that would allow you to hide the bits that the nonquantitative people don't want to see, but also allow the quantitative people to get what they want from that report.

So back to Notebooks just for a second. I realized-- [INAUDIBLE] thank you-- that there is a debate here. But for analysis, if you're writing analysis, I think Notebooks are fantastic. And back to Garrett's stuff on Markdown, if you're writing more code-- sorry, writing more comments than code, then you should be using R Markdown in Notebooks. If you're writing more code than comments, write more comments and use R Markdown. But if you're not writing for analysis, then I really recommend that you go and read this blog post because it is very interesting.

Also, because I'm lazy, I knew my manager would come back and ask for the same analysis across the three endpoints that are in my data. OK? So I'm trying to set up my report to answer for quantitative, for nonquantitative, and for three different endpoints.

[AUDIO OUT] --paste your code more than three times, what should you do? Write a function. If you have to perform an analysis across more than three endpoints, what do you do? Do you write multiple R Markdown reports? Parameterized reports, thanks very much. Otherwise, why would you be here? I mean, honestly. I'm not that entertaining.

Anyway, back to XKCD. XKCD talks about automation, and the top panel talks about the theory of automation. It's a bit like the theory of data science. It's the mental model for what we all hope will happen. We automate things, we can get on, and we can do completely different work. The reality is that when you automate things and try to be clever, often you wind up troubleshooting the thing that's gone wrong.

But the thing that makes all of this work is the YAML header parameters. Now, if you're familiar with R Markdown, then you'll know about the YAML header.
If you're not familiar with R Markdown, this may be a bit of a stretch, but bear with me. So the YAML header says something about the document-- it says the title, the author, the date it was done, and it says something about the formatting for the output. So here, it's an HTML document that has some certain attributes. The bit I've highlighted in red is the bit we need to focus on. These are parameters that you can pass into your document, and each one comes into your document like an R object that you can then use, you can compute on, you can do all kinds of clever things with it.

So here in my document, in my header, I've got a parameter which is called endpoint. It has a default value of HAMDTL17. But it also has three distinct choices, so it's not just free text. You have to choose one of those three options. Also, I've got a boolean parameter, which is called quantitative audience, and if that is true, then I'm going to include a bunch of other stuff. But if it's false, then this is a stripped back report just for the nonquantitative audience. OK? Now, the specification of those is very close to what you might do with Shiny inputs if you're familiar with them.

So then, when you're ready, you can choose to knit that document. And if you just hit the Knit button, you'll get the default values. But if you choose to knit with parameters, then a little pop-up box comes up, a bit like Shiny. You choose your endpoints, you determine whether you're in the quantitative audience, and you can knit your document.

Now, parameters are really cool because you can do things with them. So here, I've embedded the parameter endpoint, or it's params$endpoint, and that value can get pasted into text. So then, you can talk about the endpoint for the report that you're talking about, you can pass the parameter in and use it within the markdown text. You can use it in the header or the axis labels for the ggplot. And so it can be used in a variety of ways, sky's the limit. OK.
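A minimal sketch of the kind of header described here, not the speaker's actual file. It writes a small R Markdown document to a temporary file and, if the rmarkdown package is installed, reads the defaults back. Only the parameter name endpoint, its default HAMDTL17, and the idea of a boolean audience flag come from the talk; the title, the labels, the file name, the spelling quantitative_audience, and the placeholder choices ENDPOINT_B and ENDPOINT_C are assumptions.

```r
# Sketch of a parameterised header, written to a temporary .Rmd file.
# ENDPOINT_B and ENDPOINT_C are placeholders: the talk names only the default.
rmd_lines <- c(
  '---',
  'title: "Exploratory analysis"',
  'output: html_document',
  'params:',
  '  endpoint:',
  '    label: "Endpoint"',
  '    value: HAMDTL17',
  '    input: select',
  '    choices: [HAMDTL17, ENDPOINT_B, ENDPOINT_C]',
  '  quantitative_audience:',
  '    label: "Quantitative audience?"',
  '    value: false',
  '---',
  '',
  # Inline R in the body text picks up the chosen parameter value.
  'This report summarises the `r params$endpoint` endpoint.'
)
rmd_path <- file.path(tempdir(), "params_demo.Rmd")
writeLines(rmd_lines, rmd_path)

# Read the header back to see the defaults a plain knit would use.
if (requireNamespace("rmarkdown", quietly = TRUE)) {
  defaults <- rmarkdown::yaml_front_matter(rmd_path)$params
  print(defaults$endpoint$value)
  print(defaults$quantitative_audience$value)
}
```

Knitting such a file with the Knit button should use the defaults, while knitting with parameters should offer a drop-down for endpoint and a checkbox for the audience flag.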
But the other thing that's kind of cool is that you can use the parameters in determination of whether the code chunk runs or not. So here I am saying if it's not the quantitative audience-- in other words, if it's the nonquantitative audience-- then hide all the code. The other thing that makes this work is in the top box, and that's because what I did was to rename the endpoint column in my data, the one with the name given by the parameter endpoint, so that, then, downstream, I can fit a linear model on something called outcome and not on something called params$endpoint. It just cleans up the code, and it makes things easier to see further down. You can also choose whether to run code chunks at all, based on parameters.

The other thing that I've used is a child document because, if you remember, when I'm presenting this to the quantitative audience, I want to pull in some extra information. The code chunks will run, or not, according to the settings here, but you might want some additional bit of text that comes in and says, OK, for the quantitative guys, here's what I'm doing. And that's where the child document comes in. They're just plain-text R Markdown files, and those guys get pulled in if needed. So for the quantitative audience, what you see here is that we're showing the code. I've run the chunk, you can see the data. And the bit at the bottom that says Data Manipulations is that child document text.

Now, if you're-- Tareef kind of stole some of my thunder in his opening talk, but that's OK. He's the president, that's fine. What you can do in RStudio Connect is that you can go across to the left-hand side where it says Input. You can pop that open, and it will have the same ability to select parameters and define what the report is going to run. But the other nice thing is that if you do that, and you save those things as named objects, then in RStudio Connect, you've just got this little drop-down menu of precompiled reports.
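The three ideas above (hiding code by audience, renaming the endpoint column to outcome, and pulling in a child document only when needed) might be sketched as follows. This is an illustration under assumptions, not the code from the slides: the params list stands in for what knitting would supply, the data frame is made-up toy data, and the column names dose and ENDPOINT_B and the file name data_manipulations.Rmd are hypothetical.

```r
# Stand-in for the `params` list that knitting would supply (assumption).
params <- list(endpoint = "HAMDTL17", quantitative_audience = FALSE)

# Made-up toy data, purely for illustration; not the data from the talk.
dat <- data.frame(
  dose       = c(0, 0, 10, 10, 20, 20),
  HAMDTL17   = c(21, 19, 17, 16, 13, 12),
  ENDPOINT_B = c(5, 6, 4, 5, 3, 2)
)

# Rename the selected endpoint column to a fixed name, so that later code
# refers to `outcome` instead of params$endpoint.
names(dat)[names(dat) == params$endpoint] <- "outcome"
fit <- lm(outcome ~ dose, data = dat)

# Show code only to the quantitative audience. Inside a report this call
# would normally sit in a setup chunk.
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(echo = params$quantitative_audience)
}

# Pull in the child document only when needed; NULL means "no child".
# In the .Rmd this value would be used as a chunk option, for example:
#   {r, child = if (params$quantitative_audience) "data_manipulations.Rmd"}
child_file <- if (params$quantitative_audience) "data_manipulations.Rmd" else NULL
```

Because the model formula only ever mentions outcome, the same chunk should work unchanged for whichever endpoint is selected.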
So what you might want to do is to set up commonly used-- here, it was all right because there's only six reports in total that could be done. But it's that ability that if someone has already run this report, you can just quickly grab it, and it doesn't have to recompile.

So, more about parametrisation. So what we saw was when we render the report, you go to the render with parameters or knit with parameters. But from the command line, you can pass in your parameters just through a list like this, and it's really straightforward. The second thing is you might want to change your analysis, depending on which endpoint you're looking at. If you're looking at a categorical outcome, you're not going to want to fit a linear model, or at least you shouldn't. So in that case, you may want to tailor your analysis, depending on the parameter endpoint. But again, it's just another thing that you can compute on within a chunk. Also, if something goes wrong with your analysis, then if you're handling the error appropriately, if you're using tryCatch or something like that, you can compute on that and then have a child document that says something helpful like, you know, something's gone wrong, contact your friendly data scientist. You know, here's his details.

So the other last thing, back to XKCD, is how long should you spend parametrising your report and setting things up? Now, this is kind of alarming. I know that JD Long has also talked to this graph today. But the thing I want to impress on you is, basically, this is over five years. So if you spend a day or two days or five days sorting this out and only two people are going to use it once every six months, maybe I wasted my time. But on the other hand, I got a conference talk out of it, so that's good. [LAUGHTER] Anyway, thank you very much for your attention, feel free to ask questions. [APPLAUSE]

Thank you very much, Mike. We have time for questions before the break for lunch. Hands, please. OK, can I ask one myself?
So you're building all of this wonderful machinery, and then you move on and somebody else has to maintain it, how many of the people that you work with understand and value the machinery that you've just shown us?

Oh, that's a tricky question. Not many at present.

And how are you tackling that?

Well, I've just done the Train-the-Trainer on the tidyverse, and I'm here. So I'm going to go back and evangelize and, you know. But yes, it's a problem that we need to try to roll this out and get more and more people familiar with how it works. [MUSIC PLAYING]
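The command-line rendering mentioned in the talk, where parameters are passed in as a list, might look like the sketch below. It is an assumption-laden illustration rather than the speaker's script: the tiny report written to a temporary file is a hypothetical stand-in, ENDPOINT_B and ENDPOINT_C are placeholder endpoint names, and the output file naming is invented. The loop covers three endpoints times two audiences, which matches the six reports mentioned in the talk.

```r
# Hypothetical stand-in for the real report, which the talk does not show.
report <- file.path(tempdir(), "report.Rmd")
writeLines(c(
  '---',
  'title: "Endpoint report"',
  'output: html_document',
  'params:',
  '  endpoint: HAMDTL17',
  '  quantitative_audience: false',
  '---',
  '',
  'Results for endpoint `r params$endpoint`.'
), report)

# Every endpoint/audience combination: 3 x 2 = 6 reports.
# ENDPOINT_B and ENDPOINT_C are placeholder names, not from the talk.
grid <- expand.grid(
  endpoint = c("HAMDTL17", "ENDPOINT_B", "ENDPOINT_C"),
  quantitative_audience = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Rendering needs the rmarkdown package and pandoc; skip quietly if absent.
can_render <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()

for (i in seq_len(nrow(grid))) {
  out <- sprintf(
    "report_%s_%s.html",
    grid$endpoint[i],
    if (grid$quantitative_audience[i]) "quant" else "nonquant"
  )
  if (can_render) {
    rmarkdown::render(
      report,
      # Parameters are passed as a plain named list.
      params = list(
        endpoint = grid$endpoint[i],
        quantitative_audience = grid$quantitative_audience[i]
      ),
      output_file = out,
      envir = new.env(),  # fresh environment for each report
      quiet = TRUE
    )
  }
}
```

Rendering each report in its own environment is a design choice made here so that objects from one run cannot leak into the next.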
Info
Channel: RStudio
Views: 899
Rating: 5 out of 5
Keywords: rstudio, data science, machine learning, python, stats, tidyverse, data visualization, data viz, ggplot, technology, coding, connect, server pro, shiny, rmarkdown, package manager, CRAN, interoperability, serious data science, dplyr, ggplot2, tibble, readr, stringr, tidyr, purrr, github, data wrangling, tidy data, odbc, rayshader, plumber, blogdown, gt, lazy evaluation, tidymodels, statistics, debugging, programming education, forcats, rstats, open source, OSS, reticulate, Mike Smith
Id: p55q2szc3I8
Length: 15min 34sec (934 seconds)
Published: Wed Feb 17 2021