MICHAEL CHOW: Hi, my
name is Michael Chow. And thanks for watching
this RStudio conf talk. I'll be going
through siuba, which is a port of dplyr to Python. Since you're watching
RStudio conf, I'm guessing that you're on board with R
and dplyr for data analysis. And I've got to say, I love
R, but I also love Python. And depending on
the project or task, I've often found myself having
to switch between R and Python, or juggle both at the same time. And I'm guessing this isn't
a super unusual experience since Python is just
incredibly popular. If you look at Stack Overflow
posts, the predominant tool for data analysis in Python,
pandas, accounts for about 3% of posts a month. This is just an
incredible amount. And so a lot of my
interactions have been with people who use
pandas for data analysis. But as time has gone on, I found
myself reaching more and more for dplyr. And I've really thought
a lot about that and tried to figure out
what's going on there. And to set the stage, I thought
I'd just use a simple analogy: that pandas is maybe a lot
like a double-decker bus. This one's in Hong Kong. And this thing's really
built for carrying capacity. If you have to ship
80 humans somewhere, this is your tool, right? You stack them on top of
each other and you're golden. The challenge is that
it's cumbersome, right? It can't go everywhere. It's probably scary to back up. And in contrast, in
Hong Kong, there's another vehicle, the
minibus, or siuba, which takes the opposite
approach: tiny capacity, it holds like 16 people, and
it's just a terror on wheels. The BBC describes it as
Hong Kong's wildest ride. And these things are just fast. They can go anywhere
and get you where you're going; they're super flexible. And that
really reminds me of dplyr: dplyr's incredible
for exploratory analysis. It can just get you to where
you want to go quickly. Pandas is great at computation
but can be a little bit cumbersome on the fly. So siuba aims to
be small but mighty by leveraging dplyr
like syntax in Python but doing computation behind
the scenes in pandas or SQL. And I've tried to live code
with siuba just to battle-test it and make sure it's
ready for the big time. I think a big question
is what dplyr is really doing that's so useful? And to really
understand, we have to go back to
Hadley's 2014 talk, where he introduced dplyr. And he mentions
that analysts have two bottlenecks, a
cognitive bottleneck and a computational bottleneck. The computational one
is the one we think of
most often: as the data
becomes bigger, it takes longer to run the code. But intriguingly,
Hadley mentions, he thinks a lot about the
cognitive bottleneck, which is how we should
think about the data and describe what we want
to do to it in code. And dplyr aims at this
cognitive bottleneck to help people
focus their thoughts and to give people
strategies for data analysis. It does this by slimming
down this big space of
options that people have. So there are five simple verbs,
and critically, all of these can combine with one
operator, group_by. All of these take a data
frame and return a data frame. And he mentions that overall,
it's a very constrained design, especially compared to
his previous tool, plyr. So siuba aims to let people
capture those dplyr-like thoughts and keep them, writing
dplyr syntax in Python and then doing the
computation in pandas. The gist is that siuba
lets you transfer your thoughts from R to Python.
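As a rough sketch of those five verbs plus group_by, here is what the same pipeline shape looks like in plain pandas. The data frame and column names here are made up purely for illustration, and the siuba spellings in the comments are the piped-verb form the talk describes:

```python
import pandas as pd

# A tiny made-up data frame, just for illustration.
df = pd.DataFrame({"cyl": [4, 4, 6, 8], "hp": [93, 109, 110, 245]})

# The five verbs, written as their rough pandas equivalents.
# In siuba each step is a piped verb: filter(), select(), mutate(),
# arrange(), summarize(), combined with group_by().
out = (
    df[df["hp"] > 100]                                # filter: keep rows
    .loc[:, ["cyl", "hp"]]                            # select: keep columns
    .assign(hp_per_cyl=lambda d: d["hp"] / d["cyl"])  # mutate: add a column
    .sort_values("hp")                                # arrange: order rows
    .groupby("cyl")                                   # the group_by operator
    .agg(avg_hp=("hp", "mean"))                       # summarize: reduce groups
    .reset_index()
)
print(out)
```

Each intermediate step returns a data frame, which is what makes the verbs compose.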
All right, so I'm going to show three motivating cases where it can be
tricky to transfer dplyr thoughts to pandas,
roadblocks that people hit. And I know it's
tempting to see dplyr and pandas as pretty similar. So in this example, they're
doing roughly the same thing and using
basically the same code. It's a grouped summarize: we're calculating the
average of a column and renaming
the result avg_hp. And we get roughly
the same output. But it's also
worth noting that if you go somewhere like
Twitter, or look around, you'll see some pretty
scathing reviews. "dplyr has forever
ruined pandas for me." Or Chelsea notes, every
picture shows pandas and chaos. And I think it's worth noting,
these people, I think they like pandas so this isn't a
straight critique of pandas. But it can be really
frustrating when you try to go from
one tool, and the way it structures
your thinking, to another. And so I would say,
these three cases try to bring that to the
surface what's going on. The first roadblock is select. So select is really
a very basic action. You can choose columns. You can drop columns. You might match certain
columns, or rename a column. And the trick is that
pandas uses different names for all these actions.
So there are four different methods, one for each action. The other thing is
that sometimes they have funky arguments
you need to pass. You might have to say
axis=1 to say, oh, I want to do
this to the columns. Or you might have to
say columns=. Those are different ways
of saying the same thing. The last is that sometimes you
have to use intense programming constructs, like a
lambda function, if you want to do things
like match on a string. In contrast, dplyr uses the
same verb for all these actions: select. And it's really easy to
change what you're doing, to go from keeping a
column to dropping a column, or even to combine
these actions together. So this is a case, I think,
where changing what you do requires changing
gears and shifting between
different functions and methods in pandas.
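A minimal sketch of those four different spellings in pandas. The data frame and columns are made up for illustration, and the siuba equivalents in the comments (all one verb, select) are approximations of its syntax:

```python
import pandas as pd

df = pd.DataFrame({"hp": [110, 93], "mpg": [21.0, 22.8], "cyl": [6, 4]})

kept    = df[["hp", "mpg"]]                        # choose columns
dropped = df.drop(columns=["cyl"])                 # drop a column
matched = df.filter(regex="^m")                    # match columns by pattern
renamed = df.rename(columns={"hp": "horsepower"})  # rename a column

# In siuba, select covers all of these (renaming syntax is approximate):
#   df >> select(_.hp, _.mpg)
#   df >> select(-_.cyl)
#   df >> select(_.startswith("m"))
#   df >> select(_.horsepower == _.hp)
print(matched.columns.tolist())
```

Four methods, three different argument conventions, versus one verb whose arguments you tweak.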
The second example is group_by. And grouping is
really a critical activity in data analysis in dplyr. The pandas docs describe
some translations from R to pandas. And it looks pretty
straightforward, at least for these cases. But actually, I
would say group_by in dplyr and pandas use
radically different underlying grammars. So in this example,
I'm showing a filter and a mutate in
pandas. And note, there are multiple ways
of doing each. So you can use query,
that's on the top left, or eval on the top right,
or an alternative way to filter on the bottom
left, or to mutate with assign
on the bottom right. Critically, we can
ask what happens when we try to group_by
before these operations? And the answer is that
none of them work. The trick is that grouped data
in pandas, a DataFrameGroupBy, doesn't even
have most of these methods on it to perform these operations,
and in the one case where the method does exist, the operation just fails. And this is probably
really surprising, I think, for dplyr users. The fix ends up being to
shift the group_by from outside the verb to inside it. So you basically
have to move the grouping into your column operations. And again, this isn't
necessarily bad. It means pandas can still do
the same computations as dplyr. It's just, I think,
surprising for dplyr users, and for me, challenging
to do when I'm trying to analyze data quickly.
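To make the roadblock concrete, here is a sketch with made-up data. The grouped object doesn't carry query or assign, so the grouping has to move inside the column operation; the siuba spellings in the comments keep group_by outside the verb, as dplyr does:

```python
import pandas as pd

df = pd.DataFrame({"cyl": [4, 4, 6, 8], "hp": [93, 109, 110, 245]})
g = df.groupby("cyl")

# The ungrouped verbs mostly aren't there on the grouped object.
assert not hasattr(g, "query")
assert not hasattr(g, "assign")

# A grouped mutate becomes transform on a single column...
df["demeaned"] = df["hp"] - g["hp"].transform("mean")

# ...and a grouped filter becomes boolean indexing with that result.
above_avg = df[df["hp"] > g["hp"].transform("mean")]

# In siuba, the group_by stays outside the verb, as in dplyr:
#   df >> group_by(_.cyl) >> mutate(demeaned=_.hp - _.hp.mean())
#   df >> group_by(_.cyl) >> filter(_.hp > _.hp.mean())
print(above_avg)
```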
The third roadblock, I'd say, is SQL. So, dplyr users: if you're
like me, working in industry, I lean heavily on dplyr and dbplyr being
able to generate SQL. You can just swap out
your data source for a SQL one, and dbplyr will generate
the SQL query for you. In contrast, if
you're using pandas, you just need to
switch to writing SQL. And that's what I see happen
quite a bit in industry: once the data hits the
database, people just switch and start coding SQL.
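That switch looks roughly like this: with pandas, the query becomes a handwritten SQL string. The in-memory SQLite database and table name here are stand-ins for a real warehouse, and the siuba spelling in the comments is the dbplyr-style alternative:

```python
import sqlite3
import pandas as pd

# A throwaway in-memory database standing in for a real one.
con = sqlite3.connect(":memory:")
pd.DataFrame({"cyl": [4, 4, 6], "hp": [93, 109, 110]}).to_sql(
    "cars", con, index=False
)

# In pandas, once the data lives in a database, you write the SQL yourself.
avg = pd.read_sql("SELECT cyl, AVG(hp) AS avg_hp FROM cars GROUP BY cyl", con)
print(avg)

# With siuba (like dbplyr), the same verbs generate the SQL for you:
#   from siuba.sql import LazyTbl
#   LazyTbl(engine, "cars") >> group_by(_.cyl) >> summarize(avg_hp=_.hp.mean())
```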
So your computation changes, and how you code ends up changing as a result. So to recap, dplyr
is really powerful. It uses one verb,
select, for all the cases of grabbing columns. For group_by, you have this
really nice, expressive syntax where you can modify
the full table action. So you can have a
grouped mutate, or grouped filter,
or grouped summarize. In contrast, pandas is
computationally powerful. And it's packed with options. But that means
that you might have to do a little bit more work
up front during the analysis. And this makes it great
for engineering but maybe a big challenge, I think,
for exploratory data analysis or quick analysis. All right, so we looked
at some core challenges in translating dplyr thoughts
about data analysis to pandas. Now I want to switch
gears and look how siuba can help you basically
preserve those thoughts and code them in Python. And it aims to do
this as faithfully to dplyr as possible. So I'm going to show
you an example of going from dplyr code to siuba. I apologize, I know I start
with these parentheses and put the pipe at the
beginning of the line and some people hate that. I'm sorry. My bad. So let's go ahead and
just start the switch. First, we'll
change our imports: rather than using library,
we'll do our Python imports. Then we'll change the pipe to
&gt;&gt;. Next, we'll put
an underscore-dot (_.) in front of the variable names; this has to happen
for it to be Python syntax. And the last thing
is a little bit tricky: we're going to take
this mean function call and
change it to a method, so it's more similar to how
pandas expresses operations. And so in a sense,
with these few simple changes, we've gone from dplyr
in R to siuba in Python. And looking side by
side, my goal is that you can squint
your eyes, and they just look like the same code,
and you can figure out how to swap between them.
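Put together, the four changes look like this. siuba itself isn't assumed to be available here, so both versions are shown as comments over mtcars-style toy data, with the pandas computation siuba would perform underneath:

```python
import pandas as pd

# dplyr (R):
#   library(dplyr)
#   mtcars %>%
#     group_by(cyl) %>%
#     summarize(avg_hp = mean(hp))
#
# siuba (Python), after the four mechanical changes:
#   from siuba import _, group_by, summarize   # 1. imports instead of library()
#   (mtcars
#     >> group_by(_.cyl)                       # 2. the pipe becomes >>
#     >> summarize(avg_hp=_.hp.mean()))        # 3. cyl becomes _.cyl
#                                              # 4. mean(hp) becomes _.hp.mean()

# Behind the scenes, siuba runs the equivalent pandas computation:
mtcars = pd.DataFrame({"cyl": [4, 4, 6], "hp": [93, 109, 110]})  # toy stand-in
result = mtcars.groupby("cyl").agg(avg_hp=("hp", "mean")).reset_index()
print(result)
```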
So now, I want to go back
through the three examples that I showed before
and give you a sense of what they
look like in siuba. So the first
roadblock was select. And now looking back at this
example with siuba on the left, it should be basically
the same thing, now you just have underscore
dot variable names. It's worth reading down to
the very bottom one, where you select
certain columns. It's really easy to
do, and it matches up with the pandas code. In pandas, you can do
.str.endswith, and it will return True
whenever a column name matches. And siuba just lines up with
that way of doing things, so it corresponds to
the pandas way.
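A small sketch of that correspondence, with made-up columns; the siuba line in the comment is an assumption based on its select helpers:

```python
import pandas as pd

df = pd.DataFrame({"hp": [110], "disp": [160.0], "wt": [2.62]})

# pandas: a boolean per column name, used to select with .loc
ends_p = df.loc[:, df.columns.str.endswith("p")]

# siuba lines up with the same string method:
#   df >> select(_.endswith("p"))
print(ends_p.columns.tolist())
```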
The second example is group_by. So we showed a
few different ways of doing filter and mutate; the filter and mutate
for pandas are now on the right. And let's show
how siuba does it. So here's the siuba filter
and mutate on the left. Notice that, hopefully, they're eerily
similar to dplyr. And we can just tuck in the
group_by above the filter and mutate to make it a
group filter or group mutate. So it's meant to be easy
to just swap in and out. The last example was SQL. I think this is the biggest,
most useful part of siuba: you can swap out your data
source from a pandas DataFrame to, essentially, a SQLAlchemy
connection, and siuba will generate the SQL
query, or run the query and return a table of
data, just like dplyr. Right now, it's mostly
PostgreSQL and Redshift that I've worked on
supporting, but it can be extended to more backends. So there's early support
for MySQL and SQLite, and I'm hoping to build
those out further. I think one of the
incredible things is that you get ggplot for free. And that's not by
any work of mine, but by the work
of a person named Hassan, who built
a package called plotnine, which is an incredibly
faithful port of ggplot. So ideally, you can just do the
full workflow, data transformation and data visualization, and carry over your hard-earned
skills from R to Python.
to use a very similar syntax to dplyr. There are some cases
where it needed to be tweaked a little
bit to be Pythonic or to just be Python syntax. And incredibly, it aims to
bring you SQL and ggplot. And I found that
these things really help me get back up to
speed in data analysis, and have been fun to
carry into Python and discuss with people. All right, so we talked
about three roadblocks you might face when translating
dplyr code and thoughts to pandas, and how siuba lets you
roll past those by essentially copying and pasting
dplyr into Python. Now I want to zoom out
a little bit and ask: why is siuba worthwhile
in the long term? Why should you try
out or adopt siuba? And the first thing I
want to hearken back to is Hadley's point in 2014
about cognition and computation: dplyr as a
cognitive tool, if you've used it over
the long term, has probably really helped you build skills
to ask important questions of your data. And those skills
maybe aren't even that related to programming. So why not just bring those
skills with you into Python? The next point is siuba
uses dplyr's architecture. And this lets it very
flexibly add new backends. So whether you run
against SQL or pandas, siuba can support it. And I'm hoping to extend
support in the future to Python-specific tools like
Dask, and tools like Spark, as well as
fleshing out MySQL support. The next thing is
that siuba runs just an enormous glut of
continuous integration tests. So it's incredibly
thoroughly tested. I would say it's
paranoid about ensuring that you get the same
result back, whether you're running on SQL or pandas. And every time I push code,
it runs thousands of tests. The last thing is
developer docs. So I've tried to leave a
nice trail of breadcrumbs. So if you're curious about
the internal workings of siuba or looking to
patch or extend it, there are just enormous
resources to do that. I'd suggest the
programming guide in the siuba docs, which goes
through all of siuba's parts. Or I have something called
architectural decision records on GitHub that document
key decisions I made, why they were made, and contain
sketches of those decisions. The last thing is if you're
interested in learning siuba, there are some nice
alleys you can go down. So the first thing I'd recommend
is Dave Robinson's Tidy Tuesday screencasts. These are actually in R,
so maybe not as related, but they were an
incredible resource when developing siuba and
actually a big motivation to work on it. I think being able to
see a person move quickly through data analysis
in a holistic setting is really important. And Tidy Tuesday
is a project that releases new data every week,
lets you see things in action. The other thing is there's
an interactive tutorial for siuba on learn.siuba.org. So if you're curious
to just get started, even if you've
never coded before, the tutorial's made to make
it easy to get started. It's something I've tested
on my family, and friends, and I'm really excited for
the opportunity for siuba to make it really
easy for learners to take their first steps
into coding and data analysis. The last thing is
that I've tried to put up live analysis on
YouTube of analyzing data for an hour, whether
it's translating Dave's analyses from
R into siuba and Python, or doing Tidy Tuesday
analysis. So I highly recommend
watching those, if you want to see
siuba in action, and then trying siuba
out on Tidy Tuesday. Just take it for a spin
and see what it looks like and how it compares, say, to
using R or another Python tool. So just to recap, you
can find siuba on GitHub at machow/siuba. You can pip install it. And I can't recommend
Tidy Tuesday highly enough. It'll let you get a feel for
what data analysis with siuba looks like. And there's learn.siuba.org. If you've never coded, I've
tried to design this for you. And I'd love to see people take
their first steps into data analysis through this course. So thanks to everybody who
helped contribute to siuba. Thanks to RStudio for
putting this together. And thanks to the army of people
who gave feedback on this talk and made it much, much better
than its first version. So thanks for watching. I hope you'll try siuba out. And if you have any
questions, please feel free to reach
out to me on GitHub or on Twitter @chowthedog. Thanks.