Michael Chow | Bringing the Tidyverse to Python with Siuba | RStudio

Video Statistics and Information

Captions
MICHAEL CHOW: Hi, my name is Michael Chow, and thanks for watching this RStudio conf talk. I'll be going through siuba, which is a port of dplyr to Python. If you're watching RStudio conf, I'm guessing you're on board with R and dplyr for data analysis. And I've got to say, I love R, but I also love Python. Depending on the project or task, I've often found myself having to switch between R and Python, or juggle both at the same time. And I'm guessing this isn't an unusual experience, since Python is just incredibly popular. If you look at Stack Overflow posts, the predominant tool in Python for data analysis, pandas, accounts for about 3% of posts a month, which is an incredible amount. So a lot of my interactions have been with people who use pandas for data analysis. But as time has gone on, I've found myself reaching more and more for dplyr, and I've thought a lot about why that is.

To set the stage, I'll use a dumb analogy: pandas is maybe a lot like a double-decker bus. This one is in Hong Kong, and it's really built for carrying capacity. If you have to ship 80 humans somewhere, this is your tool: you stack them on top of each other and you're golden. The challenge is that it's cumbersome. It can't go everywhere, and it's probably scary to back up. In contrast, Hong Kong has another vehicle, the minibus or siuba, that takes the opposite approach: tiny capacity, holds like 16 people, and it's just a terror on wheels. The BBC describes it as Hong Kong's wildest ride. These things are fast, they can go anywhere, and they're super flexible. That really reminds me of dplyr: dplyr is incredible for exploratory analysis and can get you where you want to go quickly. Pandas is great at computation but can be a little cumbersome on the fly. So siuba aims to be small but mighty, by using dplyr-like syntax in Python while doing the computation behind the scenes in pandas or SQL. And I've tried to live code with siuba just to battle test it and make sure it's ready for the big time.

A big question is: what is dplyr really doing that's so useful? To understand that, we have to go back to Hadley's 2014 talk, where he introduced dplyr. He mentions that analysts have two bottlenecks, a cognitive bottleneck and a computational bottleneck. The computational one is the one we think of most often: as the data gets bigger, the code takes longer to run. But intriguingly, Hadley says he thinks a lot about the cognitive bottleneck, which is how we should think about the data and describe what we want to do to it in code. dplyr aims at this cognitive bottleneck, to help people focus their thoughts and to give them strategies for data analysis. It does this by taking a big space of options and slimming it down: there are five simple verbs, and critically, all of them can combine with a grouping operator, group_by. All of them take a data frame and return a data frame. He mentions that overall it's a very constrained design, especially compared to his previous tool, plyr. So siuba aims to let people capture those dplyr-like thoughts and keep them, writing dplyr syntax in Python and then doing the computation in pandas. The gist is that siuba lets you transfer your thoughts from R to Python.
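(For reference, here is a rough sketch of the grouped summarize that the next comparison refers to, written in pandas; the mtcars-style data frame is made up for illustration.)

    import pandas as pd

    # Made-up mtcars-style data, just for illustration
    mtcars = pd.DataFrame({
        "cyl": [4, 4, 6, 6, 8, 8],
        "hp":  [93, 109, 110, 105, 245, 180],
    })

    # pandas version of a grouped summarize: average hp per cyl, renamed avg_hp
    avg_hp = (
        mtcars
        .groupby("cyl")
        .agg(avg_hp=("hp", "mean"))
        .reset_index()
    )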
All right, so I'm going to try to show three motivating cases where it can be tricky to transfer dplyr thoughts to pandas, through roadblocks that people hit. I know it's tempting to see dplyr and pandas as pretty similar. In this example, they're doing roughly the same thing with basically the same code: it's a grouped summarize, we're calculating the average of a column, and we're renaming the result average hp. And we get roughly the same output. But it's also worth noting that if you go somewhere like Twitter and look around, you'll see some pretty scathing reviews: "dplyr has forever ruined pandas for me." Or, as Chelsea notes, every picture shows pandas and chaos. And it's worth noting that these people, I think, like pandas, so this isn't a straight critique of pandas. But it can be really frustrating when you try to go from one tool, and the way it structures your thinking, to another. These three cases try to bring to the surface what's going on.

The first roadblock is select. Select is a very basic action: you can choose columns, you can drop columns, you might match certain columns or rename a column. The trick is that pandas uses different names for all these actions, so there are four different methods, one for each action. The other thing is that sometimes they have funky arguments you need to pass. You might have to say axis=1 to say, oh, I want to do this to the columns, or you might have to say columns=; those are different ways of saying the same thing. The last is that sometimes you have to use intense programming constructs like a lambda function if you want to do things like match on a string. In contrast, dplyr uses the same verb, select, for all these actions, and it's really easy to change what you're doing, to go from keeping a column to dropping a column, or even to combine these actions together. So this is a case where changing what you do requires changing gears and shifting between different functions and methods in pandas.

The second roadblock, I would say, is group_by. Grouping is a critical activity in data analysis in dplyr. The pandas docs describe some translations from R to pandas, and it looks pretty straightforward, at least for these cases. But actually, group_by in dplyr and pandas use radically different underlying grammars. In this example, on the left, I'm showing a filter and a mutate in pandas. And note, there are multiple ways of doing each: you can use query (top left) or eval (top right), an alternative way to filter (bottom left), or assign, to mutate (bottom right). Critically, we can ask what happens when we try to group_by before these operations, and the answer is that none of them work. The trick is that grouped data in pandas, a DataFrameGroupBy, doesn't even have most of these methods on it, and in the one case where it does, the operation just fails. This is probably really surprising for dplyr users. The trick ends up being that you have to shift the group_by from outside the verb to inside it; you basically have to move the group by into your column operations. And again, this isn't necessarily bad, and it means pandas can still do the same computations as dplyr. It's just, I think, surprising for dplyr users, and, for me, challenging to do when I'm trying to analyze data quickly.
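(A minimal sketch of these two roadblocks, reusing the made-up mtcars data frame from above; the column names are just for illustration.)

    # Roadblock 1: four different pandas spellings of what dplyr does with select()
    kept    = mtcars[["cyl", "hp"]]                            # keep columns
    dropped = mtcars.drop(columns=["hp"])                      # drop a column
    matched = mtcars.loc[:, mtcars.columns.str.endswith("p")]  # match columns on a string
    renamed = mtcars.rename(columns={"hp": "horsepower"})      # rename a column

    # Roadblock 2: a grouped mutate and a grouped filter in pandas; the grouping
    # moves inside the column operation via groupby(...).transform(...)
    demeaned = mtcars.assign(
        hp_demeaned=lambda d: d.hp - d.groupby("cyl").hp.transform("mean")
    )
    max_per_cyl = mtcars[mtcars.hp == mtcars.groupby("cyl").hp.transform("max")]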
The third roadblock, I'd say, is SQL. If you're like me, working in industry, you lean heavily on dplyr and dbplyr being able to generate SQL. You can just swap out your data source for SQL, and dbplyr will generate the SQL query for you. In contrast, if you're using pandas, you just need to switch to writing SQL. And that's what I see happen quite a bit in industry: once the data hits the database, people just switch and start coding SQL. So your computation changes, and how you code ends up changing as a result.

To recap, dplyr is really powerful. It uses one verb, select, for all the cases of grabbing columns. For group_by, you have this really nice expressive syntax where you can modify the full table action, so you can have a grouped mutate, or grouped filter, or grouped summarize. In contrast, pandas is computationally powerful and packed with options, but that means you might have to do a little more work up front during the analysis. That makes it great for engineering, but maybe a big challenge, I think, for exploratory or quick analysis.

All right, so we looked at some core challenges in translating dplyr thoughts about data analysis to pandas. Now I want to switch gears and look at how siuba can help you preserve those thoughts and code them in Python. It aims to do this as faithfully to dplyr as possible. I'm going to show you an example of going from dplyr code to siuba. I apologize, I know I start with these parentheses and put the pipe at the beginning of the line, and some people hate that. I'm sorry, my bad. So let's go ahead and start the switch. First, we'll change our imports, so rather than using library, we'll do our Python imports. Then we'll change the pipe to >>. Next, we'll put an underscore-dot in front of the variable names; this has to happen in order for it to be Python syntax. The last thing is a little tricky: we're going to take this mean function call and change it to a method, so it's more similar to how pandas expresses operations. And in a sense, with these few simple changes, we've gone from dplyr in R to siuba in Python. Looking side by side, my goal is that you can squint your eyes and they just look like the same code, and you can figure out how to swap between them.

Now I want to go through the three examples I showed before and give you a sense for what they look like in siuba. The first roadblock was select. Looking back at this example with siuba on the left, it should be basically the same thing, now with underscore-dot variable names. It's worth reading the very bottom one, where you select columns that match a string. In pandas, you can do .str.endswith, and it will return True whenever a column matches; siuba just lines up with that way of doing things, so it corresponds to the pandas way. The second example is group_by. We showed a few different ways of doing filter and mutate; the pandas filter and mutate are on the right, and here's the siuba filter and mutate on the left. Notice that they're, hopefully, eerily similar to dplyr. And we can just tuck a group_by in above the filter and mutate to make it a grouped filter or grouped mutate.
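(A sketch of that translation and of the grouped verbs just described, reusing the made-up mtcars data from above; the verbs and the _ expression follow siuba's interface as I understand it, so treat the exact calls as an approximation and check the siuba docs.)

    from siuba import _, group_by, summarize, filter, mutate

    # dplyr:  mtcars %>% group_by(cyl) %>% summarize(avg_hp = mean(hp))
    # siuba:  >> as the pipe, _. in front of column names, and mean as a method
    avg_hp = (
        mtcars
        >> group_by(_.cyl)
        >> summarize(avg_hp=_.hp.mean())
    )

    # Grouped filter and grouped mutate: tuck a group_by in above the verb
    max_per_cyl = mtcars >> group_by(_.cyl) >> filter(_.hp == _.hp.max())
    demeaned    = mtcars >> group_by(_.cyl) >> mutate(hp_demeaned=_.hp - _.hp.mean())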
So it's meant to be easy to just swap in and out.

The last example was SQL. I think this is the biggest, most useful part of siuba: you can swap out your data source from a pandas data frame to, essentially, a SQLAlchemy connection, and siuba will generate the SQL query, or run the query and return a table of data, just like dplyr. Right now it's mostly PostgreSQL and Redshift that I've worked on supporting, but it can be extended to more backends; there's early support for MySQL and SQLite, and I'm hoping to build those out further. I think one of the incredible things is that you get ggplot for free. And that's not by any of my work, but by the work of a person named Hassan, who built a package called plotnine, which is an incredibly faithful port of ggplot. So ideally, you can do the full data transformation and data visualization, and carry over your hard-earned skills from R to Python.
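(A rough sketch of that SQL swap and the plotnine hand-off; the connection string and table name are hypothetical, and the LazyTbl and collect names reflect my reading of the siuba docs, so double check them there.)

    from sqlalchemy import create_engine
    from siuba import _, group_by, summarize, collect
    from siuba.sql import LazyTbl
    from plotnine import ggplot, aes, geom_col

    # Hypothetical connection string and table name; swap in your own database
    engine = create_engine("postgresql://user:password@localhost/dbname")
    tbl_cars = LazyTbl(engine, "mtcars")

    # The same verbs as before, now generating SQL behind the scenes;
    # collect() runs the query and brings the result back as a pandas DataFrame
    avg_hp = (
        tbl_cars
        >> group_by(_.cyl)
        >> summarize(avg_hp=_.hp.mean())
        >> collect()
    )

    # plotnine: a faithful ggplot port on the Python side
    plot = ggplot(avg_hp, aes("cyl", "avg_hp")) + geom_col()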
Just to recap, siuba tries to use a very similar syntax to dplyr. There are some cases where it needed to be tweaked a little to be Pythonic, or to just be valid Python syntax. And, incredibly, it aims to bring you SQL and ggplot. I've found that these things really help me get back up to speed in data analysis, and they've been fun to carry into Python and discuss with people.

All right, so we talked about three roadblocks you might face when translating dplyr code and thoughts to pandas, and how siuba lets you roll past them by essentially copying and pasting dplyr into Python. Now I want to zoom out a little and ask: why is siuba worthwhile in the long term? Why should you try out or adopt siuba? The first thing I want to hearken back to is Hadley's point in 2014 about cognition and computation: dplyr as a cognitive tool, if you've used it over the long term, has probably really helped you build skills to ask important questions of your data. And those skills aren't necessarily even that related to programming. So why not bring those skills with you into Python? The next point is that siuba uses dplyr's architecture, and this lets it very flexibly add new backends. So whether you run against SQL or pandas, siuba can support it, and I'm hoping to extend support to Python-specific tools in the future, like dask and Spark, as well as fleshing out MySQL support. The next thing is that siuba runs an enormous glut of continuous integration tests, so it's incredibly thoroughly tested. I would say it's paranoid about ensuring that you get the same result back whether you're running on SQL or pandas, and every time I push code, it runs thousands of tests. The last thing is developer docs. I've tried to leave a nice trail of breadcrumbs, so if you're curious about the internal workings of siuba, or looking to patch or extend it, there are enormous resources for that. I'd suggest the programming guide in the siuba docs, which goes through all of siuba's parts, or the architectural decision records on GitHub, which document key decisions I made, why they were made, and contain sketches of those decisions.

Finally, if you're interested in learning siuba, there are some nice alleys you can go down. The first thing I'd recommend is Dave Robinson's Tidy Tuesday screencasts. These are actually in R, so maybe not as directly related, but they were an incredible resource when developing siuba, and actually a big motivation to work on it. I think being able to see a person move quickly through data analysis in a holistic setting is really important, and Tidy Tuesday is a project that releases new data every week and lets you see things in action. The other thing is there's an interactive tutorial for siuba at learn.siuba.org. So if you're curious to just get started, even if you've never coded before, the tutorial is made to make that easy. It's something I've tested on my family and friends, and I'm really excited for the opportunity for siuba to make it easy for learners to take their first steps into coding and data analysis. The last thing is that I've put up live analyses on YouTube of analyzing data for an hour, whether it's translating Dave's analyses from R into siuba and Python, or doing a Tidy Tuesday analysis for an hour. I highly recommend watching those if you want to see siuba in action, and then trying siuba out on Tidy Tuesday. Just take it for a spin and see what it looks like and how it compares, say, to using R or another Python tool.

So just to recap, you can find siuba on GitHub at machow/siuba, and you can pip install it. I can't recommend Tidy Tuesday highly enough; it'll let you get a feel for what data analysis with siuba looks like. And there's learn.siuba.org: if you've never coded, I've tried to design this for you, and I'd love to see people take their first steps into data analysis through this course. So thanks to everybody who helped contribute to siuba, thanks to RStudio for putting this together, and thanks to the army of people who gave feedback on this talk and made it much, much better than its first version. Thanks for watching. I hope you'll try siuba out, and if you have any questions, please feel free to reach out to me on GitHub or on Twitter @chowthedog. Thanks.
Info
Channel: RStudio
Views: 993
Rating: 4.9384613 out of 5
Keywords: rstudio, data science, machine learning, python, stats, tidyverse, data visualization, data viz, ggplot, technology, coding, connect, server pro, shiny, rmarkdown, package manager, CRAN, interoperability, serious data science, dplyr, ggplot2, tibble, readr, stringr, tidyr, purrr, github, data wrangling, tidy data, odbc, rayshader, plumber, blogdown, gt, lazy evaluation, tidymodels, statistics, debugging, programming education, forcats, rstats, open source, OSS, reticulate, siuba, Michael Chow, SQL
Id: w4Mi0u4urbQ
Length: 19min 17sec (1157 seconds)
Published: Fri Feb 19 2021