[MUSIC PLAYING] So my name's Mike Smith. I work at Pfizer
in the UK, and I'm very pleased to be here
to talk to you today. Because I realize that people
are lazy and easily distracted, I'm going to give you the
summary of the whole talk right up front. I use parameterised R Markdown reports with child documents to write up an exploratory analysis, which I shared with some colleagues-- some of whom were quantitative, some of whom were not. The quantitative ones included the statistician on the team and clinical pharmacologists from my department, including my manager. There were other nonquantitative folks and, you know, it's maybe unfair to call a clinician nonquantitative, but I'm going to. So this report was intended to serve all of those purposes. Now, the analysis I'm going to show today doesn't use that data, for confidentiality reasons, but it has very similar properties. But what I really want to talk to you about today is cutlery drawers and what they say about you. Now, these cutlery
drawers are from people that you might bump
into at this conference. And I think it's an interesting
exercise in how people arrange things, structure things,
whether they do it by size, whether they do
it by most frequently used, the differences
between, perhaps, US and European cutlery drawers. But here's mine. [LAUGHTER] If you visit my house,
you'll see my cutlery drawer. My wife will also
be very surprised that you're visiting the house
and quite alarmed that you just want to see the cutlery drawer. [LAUGHTER] Now, I did the tidyverse
Train-the-Trainer course, and thanks to Greg and
Garrett for teaching on that. And I am now taking
every opportunity to have a learning
experience for the tidyverse. So if you take a cutlery drawer, you might want to group by type. You want to gather those things together and arrange them. And if you could do that to my cutlery drawer, I would be really happy. New hashtag for the conference-- #untidyverse. [LAUGHTER]
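(If the drawer were a data frame, the tidy-up might look something like the sketch below. The cutlery data and its column names are entirely invented.)

```r
library(dplyr)

# A hypothetical drawer: one row per item of cutlery
drawer <- tibble::tribble(
  ~item,          ~type,   ~length_cm,
  "teaspoon",     "spoon", 13,
  "dinner fork",  "fork",  19,
  "steak knife",  "knife", 22,
  "soup spoon",   "spoon", 17
)

# Group the items by type, then arrange by size within type
drawer %>%
  group_by(type) %>%
  arrange(type, length_cm)
```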
And I'm really pleased that XKCD views the need for tidiness in the house in a very similar way. Basically, if you've got Wi-Fi and a laptop, you can put all your other possessions in a big bucket marked miscellaneous. OK, here's the
premise for my talk. My brain is lazy, shallow,
and very easily distracted. I hesitate to say that
your brain is the same, but then we had Joe Cheng up here telling us all that our intuition sucks. So I think some people
in this audience must have a brain that's lazy,
shallow, and easily distracted. Now, we're all familiar
with this plot. It's from the R for
Data Science book, and it gives us a
framework or a structure for understanding any analysis. And on the Not So Standard
Deviations podcast recently, there was a good
discussion about this, and they came to the
conclusion that really this is just a framework. It's a mental model. Real data analyses don't necessarily follow all of these steps, or follow them in that sequence. So let me explain to you
something about my data analysis and how it
works in practice. And I'm sure that these
experiences I'm going to relate are just for me alone,
so it's not you guys. You have your stuff
well together. This is just me. OK, here we go. So I get an email from my colleague with a link to the data source. I download that data source, I read it into R, I wrangle the data, and I plot it. And then there's some time to step back, to reflect on my plot, to think about what it looks like, so that after lunch I can come back and make better plots. And I might fit a preliminary model to the data.
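(In script form, that first day might look something like this. A minimal sketch: the file name, the column names, and the model are all invented for illustration.)

```r
library(readr)
library(dplyr)
library(ggplot2)

# Read the data source my colleague linked to (hypothetical file)
study <- read_csv("data/study_data_v1.csv")

# Wrangle: drop missing responses, make week an integer
study_clean <- study %>%
  filter(!is.na(response)) %>%
  mutate(week = as.integer(week))

# A first-pass plot, to reflect on over lunch
ggplot(study_clean, aes(x = week, y = response)) +
  geom_line(aes(group = subject), alpha = 0.3) +
  geom_smooth()

# A preliminary model
fit <- lm(response ~ week, data = study_clean)
summary(fit)
```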
Hey, it's the end of the day; time to go home. So the next day, I come in, and I find an email saying that the team has found an error in the data, so here's the new version of the data. And if your workflow is not reproducible, you're in a world of hurt here. But because mine is reproducible, I go to my R Markdown document, I change the input data, I recompile my report, and I see what differences there are. I check against the previous version, and I'm good to go on. I discuss the
findings with my boss. I circulate the report,
and my job is done. Wait a minute. Did anyone get
kind of distracted by the text in light gray? Perhaps it's not just me then. OK. The other thing is,
did anyone notice that the transform
and visualize bits from the R for Data Science
diagram were back to front and swapped over? Cognitive load
theory, score one. So six months pass. This is not an exaggeration. I got an email yesterday
from my manager saying, you know that team and the
work that you did for them? They're interested
in your results. Can you dig them
out and share them? Now, if you're anything
like me, what happens is you pop open your R script
or something like that, and you think, what
the heck did I do? All right. But to the rescue, for me, come R Markdown and Notebooks. These guys are now saving my life. I love Notebooks. Thank you, Yihui and the team who are working on Notebooks. And whenever you're
writing, you need to think about who is
the audience that you're writing for. Well, for me, when
I write a Notebook, it's for distracted me. It's the me that is hopping between activities, not focusing 24/7 on my
nice little analysis, but with a billion things to do. It's also the future
me-- the six months later me that pops open the
Markdown document and goes, all right, I'm good,
because I've got my code, I've got the explanation,
I can see the outputs. I've even got text that
explains what I was thinking. Hurrah. But here's another
audience for your reports-- it could be a
quantitative colleague who wants to see code,
who wants to see data, who wants to see your
assumptions, who wants to dig into your
residual plots from your model. I can guarantee you the
nonquantitative people won't want to see that. So it's good if you can strike a balance and have techniques that allow you to hide the bits that the nonquantitative people don't want to see, but also allow the quantitative people to get what they want from that report. So back to Notebooks
just for a second. I realized--
[INAUDIBLE] thank you-- that there is a debate here. But for analysis, if
you're writing analysis, I think Notebooks are fantastic. And back to Garrett's talk about R Markdown: if you're writing more code-- sorry, writing more comments than code, then you should be using R Markdown and Notebooks. If you're writing more code than comments, write more comments and use R Markdown. But if you're not
writing for analysis, then I really recommend
that you go and read this blog post because
it is very interesting. Also, because I'm lazy, I knew my manager would come back and ask for the same analysis across the three endpoints that are in my data. OK? So I'm trying to set up my report to serve quantitative and nonquantitative audiences, across three different endpoints. [AUDIO OUT] If you have to copy and paste your code more than three times, what should you do? Write a function. If you have to perform an
analysis across more than three endpoints, what do you do? Do you write multiple
R Markdown reports? Parameterised reports, thanks very much. Otherwise, why
would you be here? I mean, honestly. I'm not that entertaining. Anyway, back to XKCD. XKCD talks about automation, and the top panel shows the theory of automation. It's a bit like the theory of data science: it's the mental model for what we all hope will happen. We automate things so that we can get on with completely different work. The reality is that when you automate things and try to be clever, you often wind up troubleshooting the thing that's gone wrong. But here's the thing that makes all of this work: YAML header parameters. Now, if you're familiar
with R Markdown, then you'll know
about the YAML header. If you're not familiar
with R Markdown, this may be a bit of a
stretch, but bear with me. So the YAML header
says something about the document-- it says the
title, the author, the date it was done, and it says
something about the formatting for the output. So here, it's an HTML document that has certain attributes. The bit I've highlighted in red
is the bit we need to focus on. These are parameters that you can pass into your document; each one comes into your document as an R object that you can then use, compute on, and do all kinds of clever things with. So here in my
document, in my header, I've got a parameter
which is called endpoint. It has a default
value of HAMDTL17. But it also has three distinct choices, so it's not just free text. You have to choose one of those three options. Also, I've got a boolean parameter, which is called quantitative audience. If that is true, then I'm going to include a bunch of other stuff; but if it's false, then this is a stripped-back report just for the nonquantitative audience. OK?
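(A header along these lines would do it. The parameter name `quant_audience` and the two alternative endpoint labels are my guesses, not the ones from the real report.)

```yaml
---
title: "Exploratory analysis"
author: "Mike Smith"
output: html_document
params:
  endpoint:
    label: "Endpoint"
    value: HAMDTL17
    input: select
    choices: [HAMDTL17, HAMDTL24, CGIS]  # the alternatives are hypothetical
  quant_audience:
    label: "Quantitative audience?"
    value: TRUE
---
```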
Now, the specification of those parameters is very close to what you might do with Shiny inputs, if you're familiar with them. So then, when you're
ready, you can choose to knit that document. And if you just hit
the Knit button, you'll get the default values. But you can also choose to knit with parameters, and then a little pop-up box comes up, a bit like Shiny. You choose your
endpoints, you determine whether you're in the
quantitative audience, and you can knit your document. Now, parameters are
really cool because you can do things with them. So here, I've embedded the parameter endpoint-- params$endpoint-- and that value gets pasted into the text. So you can talk about the endpoint the report covers: you can pass the parameter in and use it within the markdown text. You can use it in headers or in the axis labels of your ggplot. It can be used in a variety of ways; the sky's the limit.
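(For instance, something like this, reusing the hypothetical study_clean data from the earlier sketch; the title text is just an illustration.)

```r
# Inline, in the markdown text itself:
#   The endpoint for this report is `r params$endpoint`.

# And in a chunk, for the plot labels:
library(ggplot2)
ggplot(study_clean, aes(x = week, y = response)) +
  geom_point() +
  labs(
    title = paste("Exploratory analysis:", params$endpoint),
    y     = params$endpoint
  )
```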
OK. But the other thing that's kind of cool is that you can use the parameters to determine whether a code chunk runs or not. So here I'm saying: if it's not the quantitative audience-- in other words, if it's the nonquantitative audience-- then hide all the code.
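(In chunk options, that's roughly this; `quant_audience` is my assumed parameter name.)

```r
# In the setup chunk: show code only when the audience is quantitative
knitr::opts_chunk$set(echo = params$quant_audience)

# Or per chunk, via the chunk header:
#   ```{r fit-model, echo = params$quant_audience}
```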
The other thing that makes this work is in the top box: what I did was rename the endpoint column in my data-- the one picked out by the endpoint parameter-- to a generic name. That way, downstream, I can fit a linear model on something called outcome and not on something called params$endpoint. It just cleans up the code and makes things easier to read further down. And you can decide whether code chunks run at all, based on parameters.
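(The rename step might look like this; `dat` and the `week` column are my sketch, not his actual code.)

```r
library(dplyr)

# Rename the column named by the endpoint parameter to "outcome",
# so downstream code never has to mention params$endpoint again
dat <- dat %>%
  rename(outcome = all_of(params$endpoint))

# Downstream, the model code reads cleanly:
fit <- lm(outcome ~ week, data = dat)
```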
The other thing that I've used is a child document because, if you remember, when I'm presenting this to the quantitative audience, I want to pull in some extra information. The code chunks will run, or not, according to the settings here, but you might also want some additional text that comes in and says: OK, for the quantitative folks, here's what I'm doing. That's where the child document comes in. Child documents are just plain-text R Markdown files, and they get pulled in if needed.
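(A child chunk along these lines; the file name is invented.)

```r
# An otherwise-empty chunk whose "child" option pulls in the extra text,
# but only when the audience is quantitative:
#   ```{r extras, child = "data-manipulations.Rmd", eval = params$quant_audience}

# Programmatic equivalent, inside a chunk with results = "asis":
if (params$quant_audience) {
  res <- knitr::knit_child("data-manipulations.Rmd", quiet = TRUE)
  cat(res, sep = "\n")
}
```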
So for the quantitative audience, what you see here is that we're showing the code. I've run the chunk, and you can see the data. And the bit at the bottom that says Data Manipulations is that child document's text. Now-- Tareef kind of stole some of
my thunder in his opening talk, but that's OK. He's the president, that's fine. What you can do
in RStudio Connect is that you can go across
to the left-hand side where it says Input. You can pop that open, and
it will have the same ability to select parameters
and define what the report is going to run. But the other nice thing
is that if you do that, and you save those
things as named objects, then in RStudio
Connect, you've just got this little drop-down
menu of precompiled reports. So what you might want to do is set up the commonly used versions-- here, that was fine, because with three endpoints and two audiences there are only six possible reports in total. But it's that ability: if someone has already run this report, you can just quickly grab it, and it doesn't have to recompile. So, more about parameterisation. So what we saw was when
we render the report, you can go to Render with Parameters, or Knit with Parameters. But from the command line, you can pass in your parameters just through a list like this, and it's really straightforward.
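(From the console, that's the `params` argument to `rmarkdown::render()`; the file names here are invented.)

```r
library(rmarkdown)

# Render one specific version of the report non-interactively
render(
  "exploratory-analysis.Rmd",
  params = list(
    endpoint       = "HAMDTL17",
    quant_audience = FALSE
  ),
  output_file = "exploratory-analysis-HAMDTL17-nonquant.html"
)
```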
The second thing is that you might want to change your analysis depending on which endpoint you're looking at. If you're looking at a categorical outcome, you're not going to want to fit a linear model-- or at least you shouldn't. So in that case, you may want to tailor your analysis depending on the endpoint parameter. But again, it's just another thing that you can compute on within a chunk. Also, if something goes wrong with your analysis, then-- if you're handling the error appropriately, using tryCatch or something like that-- you can compute on that and have a child document that says something helpful like: something's gone wrong, contact your friendly data scientist; here are their details.
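(Two sketches of what that can look like. The chunk contents, file name, and the binary endpoint label are all invented; only params$endpoint and tryCatch are as described in the talk.)

```r
# Tailor the model to the endpoint: logistic regression for a
# (hypothetical) binary endpoint, a linear model otherwise
fit <- if (params$endpoint == "RESPONDER") {  # invented binary endpoint name
  glm(outcome ~ week, data = dat, family = binomial())
} else {
  lm(outcome ~ week, data = dat)
}
```

And for the error handling:

```r
# Wrap the fit in tryCatch so a failure becomes an object we can compute on
fit <- tryCatch(
  lm(outcome ~ week, data = dat),
  error = function(e) e
)
analysis_failed <- inherits(fit, "error")

# Then an empty chunk pulls in an apologetic child document only on failure:
#   ```{r failure-note, child = "analysis-failed.Rmd", eval = analysis_failed}
```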
So the last thing, back to XKCD, is: how long should you spend parameterising your report and setting things up? Now, this is kind of alarming. I know that JD Long has also talked about this graph today. But the thing I want
to impress on you is that, basically, this chart is calculated over five years. So if you spend a day, or two days, or five days sorting this out, and only two people are going to use it once every six months, maybe I wasted my time. But on the other hand, I got
a conference talk out of it, so that's good. [LAUGHTER] Anyway, thank you very
much for your attention, feel free to ask questions. [APPLAUSE] Thank you very much, Mike. We have time for questions
before the break for lunch. Hands, please. OK, can I ask one myself? So you're building all of this wonderful machinery, and then you move on and somebody else has to maintain it. How many of the people that you work with understand and value the
machinery that you've just shown us? Oh, that's a tricky question. Not many at present. And how are you tackling that? Well, I've just done
the Train-the-Trainer on the tidyverse, and I'm here. So I'm going to go back and
evangelize and, you know. But yes, it is a challenge: we need to try to roll this out and get more and more people familiar with how it works. [MUSIC PLAYING]