Lou Bajuk & Sean Lopp | RStudio: A Single Home for R & Python | RStudio

[MUSIC PLAYING] SEAN: Thank you all for taking some time out of your day. We're going to start by looking at the RMarkdown document, and how we can use R and Python together inside of that document with some special new magic that's available in the latest versions of RStudio. We'll also talk about how, once we finish that document, it can be published and shared in a reproducible way, so that others can view it and, perhaps, you might introduce some lightweight automation. We'll look briefly at how a similar outcome can be achieved with Jupyter notebooks using shared infrastructure. And then, we'll talk about how the model we're creating in this document can actually be put into place so that decisions can be made based on that model. And we'll look at how you can maybe do that with an interactive app or through something like a RESTful API. And then finally, we'll talk about how those APIs and applications can also be developed using some of the new options inside of the RStudio toolset. So we'll go ahead and jump in and set a little bit of context. I am using RStudio on a server. And so that means I'm going to access RStudio through a web browser. And there's a couple of really nice benefits to doing this. The first is that I have more resources available to me than I would on my laptop. And specifically, if I go in to start a new session, you'll see we're taking advantage of a tool called Kubernetes that allows me, on demand, to specify what resources I want for this work. Another benefit of [INAUDIBLE] a server is that it means everyone on my team, Lou and myself, are speaking from a common playbook. And in this case, we're explicitly defining our environment through a Docker image where our session is going to run. Even if you're not using Kubernetes or Docker, by having your work on a server, everyone has a common home, which helps make collaboration a lot easier.
And then finally, because we're on a server, we're a lot closer to our data, which means our work can scale in an easier fashion. Now, in a moment I'll be jumping into the RStudio IDE. But while I'm here I just want to point out that one of the things we've done in the last year for multilingual teams is added the ability for other editing tools to be accessed from this same interface. So it's almost like a workbench, where you can pick up the right tool from that workbench regardless of what language your current project demands. And that means for multilingual data science teams, everyone has all those benefits I just described, that common playbook, regardless of what editor they're using. And for IT, it's really nice to have this common front door where these different environments are going to be accessed. So we'll go ahead and jump into RStudio. And at this point, I want to point out that everything you're going to see inside of the RStudio IDE, even though I'm using RStudio Server, is going to be available in any flavor of the RStudio IDE. So whether you're using the open-source desktop or RStudio Server open source or the professional product like I am, you'll have access to the interoperability that we're about to go through. So to set the stage just a little bit more, I'm doing my work here inside of an RMarkdown document. If you haven't seen RMarkdown before, it's basically a format where I can combine prose, so what I'm thinking, along with my code and then the output of my code. But what's new in the latest version of RStudio is something that we're calling the visual editor. And what that means is that even though I'm editing an RMarkdown document, on the fly I'm getting a live rendering of what this document looks like. And so that means I can do things a little bit easier than I used to. So, for example, if I needed to put in a table, I could insert a table really quickly. I can add Markdown to that table and it will be rendered on demand.
I can even do things like insert citations or emojis. And so I can use all these different tools inside of the visual editor. And whatever I'm doing is going to show up as a real-time preview. And the reason why I think this is so important is that regardless of what language you're working from, ultimately, a data scientist's key job is to communicate. And so being able to write effectively in a tool that supports really rich technical communication is vital for any type of data scientist. And I found that this visual editor makes things that used to be challenging for that type of technical communication really easy. One of my favorites is I can go in here and actually insert an image. I don't have to worry about figuring out where that image lives on disk. And I can even resize it interactively. So we hope that regardless of what language, this visual editor is going to give you a head start as you're writing down what you're thinking and documenting your process. And it really combines the best worlds of a Jupyter Notebook [INAUDIBLE] IDE and RMarkdown. And if you haven't seen this before, inside the latest version of RStudio, all you need to do is click this button in the upper right hand corner. And what that then does is flip us back and forth between the plain text source code that is still available here perfect for version control and that pre-rendered view of the document. All right. So that's where we're doing our work. It is inside of this visual editor. What are we going to do? We have a made-up data science exercise. And so like most data science exercises, we're going to start by pulling in some data. And this data is coming from a database. And again, something that we're excited about regardless of what language you use is the ability within RStudio to seamlessly manage connections to databases. And so here I have a connection to a database called content. There's a number of different tables inside of it. 
And without switching to a different SQL editor, I can really easily preview the schema of this database table and even preview some of the records if I want. So I have this database and I'm starting out by writing some code that's going to operate within the database to do some basic transforms. And so if you haven't seen this before, using dplyr, you can write R code, and then that R code actually gets executed as SQL. And so that's what's going on here. So we're creating some data that we're then going to do an analysis on. So let's take a quick look at that data. Essentially, we have spatial-temporal data. And so we're looking at bike-share data from the D.C. area. So those bike shares, if you've ever been walking around the city and you want to rent a bike, you can usually find a station with bikes lined up and then you pay some money to be able to ride the bike around. And so the data that we have is how many bikes were available at any given station at any point in time. So we have this time series of bikes that are available, then we also have a spatial component of where those bikes were located. So it's a pretty cool data set. And what we want to try to do is forecast into the future, for a given station, how many bikes might be available. Say, if I commute home at 5:00 PM, is there going to be a bike there that I can take? And so we'll get started by doing that in R. And the first thing that we're going to do is create a test and training data set. And because we're working with time-series data, it's really important that we don't accidentally use the future to predict the past. And so, in R we've created a test and training data set, where the training data set sequentially occurs before the test data set. And now that we have these two things, we can go ahead and build our model. And I'm using a new R package, or an ecosystem of R packages, called tidymodels. There's a lot of great resources online if you want to dive into modeling in R.
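As an illustration of the chronological train/test split described above, here is a minimal Python sketch. It is not the code from the demo (which does this step in R); the function name `chronological_split`, the 80/20 ratio, and the made-up hourly bike counts are all assumptions for the example.

```python
from datetime import datetime, timedelta


def chronological_split(rows, train_fraction=0.8):
    """Split time-stamped rows so every training row precedes every test row."""
    # Sort by timestamp first so the cut point is a point in time, not a random draw.
    ordered = sorted(rows, key=lambda row: row["time"])
    cutoff = int(len(ordered) * train_fraction)
    return ordered[:cutoff], ordered[cutoff:]


# Hypothetical hourly bike counts for a single station (values are made up).
start = datetime(2021, 1, 1)
rows = [
    {"time": start + timedelta(hours=h), "n_bikes": (h * 7) % 15}
    for h in range(10)
]

train, test = chronological_split(rows)
print(len(train), "training rows, ending", train[-1]["time"])
print(len(test), "test rows, starting", test[0]["time"])
```

The point of sorting before cutting is that a random split would leak later observations into the training set, which is exactly the "using the future to predict the past" problem mentioned above.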
But essentially, what this set of packages is allowing us to do is pre-process our data. And so, we have here latitude and longitude. That's the spatial element. We have the date and time. That's the temporal element. But we've used some preprocessing tricks to create a factor for the day of the week and also add some information about whether a given day was a holiday. So you have a nice rich data set to do modeling on. And then, the first thing that we might do is create a model in R. And so to do that, again, within this tidymodels ecosystem, I'm essentially creating a workflow that's going to take our model preprocessing along with this gradient boosted engine and fit a simple regression model. So we can go ahead and do that. And if we look at the results here, we have a predicted number of bikes along with the actual number of bikes, kind of what we'd expect. And we can evaluate how our prediction did. So one thing that we can do for that quick evaluation is plot the prediction versus the reality. And if our model is really good, all of our data points would fall along this y equals x line, where predictions match reality. You can see, we have quite the variance from that line. We could look at the R squared value of our model and see that it's not very good. That's why I'm paid to give webinars and not fit models, but this, hopefully, gives you a sense of what your workflow might start to look like in R. We can even, because this is a tree-based model, look at feature importance, and so you can see that our model is emphasizing the location of our bike share station and not putting as much weight on some of those factors that we created. So we have this iffy model in R. What do we want to do next? Well, we could try a lot of different things to improve our model. One thing we might want to do is try something from the Python ecosystem.
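For readers who want to see what the R squared value mentioned above measures, here is a small Python sketch using the standard definition (one minus the ratio of residual to total sum of squares). The function name `r_squared` and the toy numbers are assumptions for illustration; they are not values from the demo.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_residual / ss_total


# Toy bike counts (made up for illustration).
actual = [3, 5, 7, 9]

perfect = r_squared(actual, [3, 5, 7, 9])      # every point on the y = x line
mean_only = r_squared(actual, [6, 6, 6, 6])    # always predicting the mean
backwards = r_squared(actual, [9, 7, 5, 3])    # worse than predicting the mean

print(perfect, mean_only, backwards)
```

Predictions that sit exactly on the y equals x line give a score of 1, always predicting the mean gives 0, and a model that does worse than the mean gives a negative score, which is why a poor fit can show up as a very low or even negative number.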
And so that's where the heart of this webinar begins, which is, how are we going to interoperate Python and R inside of the same context? Well, the first thing that we'll do is create a Python code chunk. And inside of that Python code chunk, one of the things that we're going to do right away is import some packages. So I'll go ahead and add the code to bring in these packages. And we can do a quick sanity check here to run the code chunk, and it appears that we have these packages loaded inside of Python. Now, I can already see in the chat that many of you are asking, how is Python managed? Where are these packages coming from? Those are really great questions. So in the latest version of RStudio, you can use a project or global option to specify what Python interpreter should be used in the context where you're mixing R and Python together. And so for this project, I've created a virtual environment and I have told RStudio to use that virtual environment. You can see that RStudio is actually aware of the multitude of Python installs that are available on this server. And that's one of those benefits of working on a server: if Lou and I were collaborating together, we'd have this common understanding of what Python installations are available. So that's where the Python engine is coming from. The Python packages themselves are coming from a tool called RStudio Package Manager. So in our case, our server is not online. Here, we have some sensitive data that lives in our environment. So we're not able to go reach out to the internet. So instead Package Manager acts as this intermediary, where we can install Python packages from a specifically governed mirror of PyPI. And so that's what you're looking at here-- I can search for packages, I can see information about those packages, how to install them, what they depend on. But something really special about this mirror is that it has a safety net that allows me to time travel as well.
And so if I ever got into a situation where my Python environment wasn't working, I could go backward or forward in time to reinstall packages from a specific point in the past. So that's where these packages are coming from. Let's go ahead and start writing some code. And this is where I think things get really magical. Because within this Python code chunk, inside of the RMarkdown document, I actually have access to everything that I've done up until this point in R. And you can actually see the IDE here is helping me autocomplete some of the objects that are attributes of this special r object that are available. And so what does that mean? Well, I'm going to cheat a little bit and grab some code that I've written ahead of time. We'll just put this inside of our code chunk and then I can walk you through it. So the first thing I'm doing, because we're going to fit this model in Python, is to load some training and test data. But you can see that I'm not starting from scratch. I'm using this magic r object to actually pull in the training and test data that I already created. And then, I can use that as a jumping-off point for fitting my model. And so we're going to do a little bit of pre-processing with pandas and then we're going to use some functions from scikit-learn to fit another type of model. And one of the things that is really interesting, and why we might jump into Python here, is that scikit-learn has native support for time series cross-validation. So remember, I said we can't use the future to predict the past. Scikit-learn knows how to handle that even in the case where you're doing a whole bunch of cross-validation. And so that's what we're doing here to fit an SVR model. I can go ahead and run this Python code chunk and we can take a look at the results. All right. So train results, that mean, that's the mean of our cross-validation R squared value. And you can see, whoo, it's really bad. OK. So I'm not a great Python model fitter.
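To make the idea of time series cross-validation concrete, here is a pure-Python sketch of an expanding-window split, where each fold trains only on observations that come before its test block. This is a simplified illustration in the spirit of scikit-learn's `TimeSeriesSplit`, not that library's implementation and not the code from the demo; the function name `expanding_window_splits` is an assumption.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs for ordered observations.

    Each test block comes strictly after its training block, and the
    training window grows from one fold to the next.
    """
    if n_splits < 1 or n_samples < n_splits + 1:
        raise ValueError("need at least n_splits + 1 samples")
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train_indices = list(range(test_start))
        test_indices = list(range(test_start, test_start + test_size))
        yield train_indices, test_indices


# Ten time-ordered observations, three folds.
splits = list(expanding_window_splits(n_samples=10, n_splits=3))
for train_indices, test_indices in splits:
    print("train up to index", train_indices[-1], "-> test", test_indices)
```

With ten observations and three folds, the test blocks are indices 4 to 5, 6 to 7, and 8 to 9, and each fold trains on everything before its block. A model scored this way is never evaluated on data older than what it was trained on.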
But hopefully, you get the idea that we can really seamlessly reuse things back and forth. And you might have noticed a couple of other things that the IDE is doing to help me with this Python development. So one of the things that I briefly mentioned is that autocomplete. And so say we wanted to use a different model from scikit-learn. You can see the IDE is providing me with all the different options that are available, all the attributes and methods of these objects, and the help for those methods and the arguments that they take are also all going to be available in the IDE with the rich auto-completion that you'd expect. The other thing that I'll point out is, if we look at the environment pane when we switch to the Python context, we've actually got a Python environment explorer as well. And so these objects that I'm creating inside of Python, you can actually see them and explore them inside of the IDE, and I could even preview them as well. So this becomes a really rich way to interactively do your work and debug problems. And we can see that even things like the filter capability inside of the IDE, as well as sort, work for these Python objects. I can also switch at any time back and forth between R and Python because the two ecosystems are coexisting together inside of this notebook. Speaking of which, there's one last trick that I want to show you. So I fit this Python model and I have some Python predictions. What if I then wanted to go and do some further analysis? Well, you saw how I can reuse R objects in the Python context, but I can do the same thing inside of R. So inside of R there's this magic object called py that gives me access to everything that I've created so far in the Python space. And so I can go ahead and use that object to do something like this. So I grab some code from my cheat sheet, place it in this chunk. I'm essentially taking the Python model predictions and plotting them with ggplot2.
And you can see a little bit here why our model fit is so bad. It's because we're capturing hardly any of the variance that is going on inside of our data. All right. So we've fit this model that combines R and Python together inside of a notebook. What do we do now? Well, at RStudio, we're big believers that you should share your data science work early and often. This is the best way for people with domain knowledge or stakeholders to validate that you're on the right path. And so one way to do that that's really powerful is through publishing. And within RStudio, we can publish-- you see this blue icon here-- and we can publish to a number of places. But within a team enterprise setting, our recommendation is to publish to RStudio Connect. And so I'll go ahead and do that. You can see the IDE identifies my dependencies. And what happens when I click Publish is that a reproducible unit is created here that contains not only my code, but also things like the version of R, the version of Python, the different packages that I need to reliably reproduce this document. And so I'll go and show you the end result here on RStudio Connect, which is that same rendered notebook. So we have kind of all the information that I have been working on up to this point, but in a context where I can easily share it with others. So I can specify who should be able to see this. And because it's that reproducible unit, at any point, I can reliably refresh this document to regenerate the results, or I could do something like set up a schedule. So my data might change over time, but I want this notebook to render on a regular basis. And I can set all of that up because the dependencies are intertwined in a reproducible unit with this document. Now you might be thinking, that's great if you're an RStudio user, which probably most of you are. But what about my colleagues on the Python side?
Well, the same option is available for sharing your work early and often if you're coming from Python. And so I want to quickly show you what that looks like. In the RStudio IDE we click that blue Publish button. And inside of Jupyter Notebooks, you can click that blue Publish button as well. But I want to show you a slightly different way that you can quickly and easily share your content. And that's to import content from a system like Git. So what I have here is a Git repository. And that Git repository contains all of the example code we've been talking about today. And it's a public repository, so you can go ahead and play around with this as well if you like. What I'm going to do is take this Git repository and tell RStudio Connect to import that content. And so I have different branches that I can choose between, and then this repository has a number of different directories. I'll go ahead and import the Jupyter Notebook. I'll give it a title here. And what happens when we import this content is the same thing that we saw in interactive publishing. The environment this content depends upon is recreated. So I have a reproducible unit of work around my notebook. If I go and open up that notebook, I can see the results. So I can share this, even with a non-technical user, who might be intimidated by the standard Jupyter Notebook interface. Here, they just have a really clean HTML document that they can read. And a data scientist can specify who should be able to see this content, but also do those same things I was talking about for scheduling or re-rendering the content on demand. So we have our notebooks. We've shared them early and often. That's great. We've gotten this domain feedback. And maybe we iterate a couple of times and really improve that model. What do we do with that? How do we ensure that the model is actually being used to generate decisions? That's kind of the key task of the data science team.
Well, there's two ways that we think about this at RStudio. One is to influence how decision makers are making decisions through your model. And a great way to do that is through interactive applications. So many of you might be familiar with tools like Shiny, which allow you to do that in R. But we've worked hard in the last year to ensure that multilingual teams have that same capability. So what you're looking at here is a Dash application, which is an interactive application written in Python. And what this application is going to allow us to do is help stakeholders understand our model. So they can come in here and click through different stations and see the forecast and the location of the station. So it's a pretty simple app. But hopefully, it kind of gets your wheels turning that even if you're a Python data scientist or you work with Python data scientists, they have the same ability to impact decisions by creating interactive content. And we see examples of that with Dash, with Streamlit, with Bokeh, with a wide variety of interactive Python frameworks that are supported through the RStudio stack. So that's how you might go about influencing a person with your model. But what if you need to use your model to make a whole bunch of automated decisions, or if you need to influence not a person, but a service? One way that's common to solve that problem is by creating an API. And again, there's options for doing this in R and Python. So on the R side, we have tools like Plumber. On the Python side, we have tools like Flask. And both of these can be shared just as easily through RStudio Connect. And so I'll just quickly show you what that API might entail. Again, we have all of the same controls. And so we can look at the logs of this Plumber API. We could do things like scale. If we know we're going to have thousands of requests to this API at the same time, we can specify how we want the system to handle those requests. 
But at the end of the day, the idea is pretty simple. It's that anyone can come in and pass a parameter to your model-- in this case, we're specifying the station that we want to make a prediction for and the time horizon that we want to make a prediction for-- and then those inputs are passed to our model and the results are returned. So here, we have the results of our forecast. But as you can see, that passing of inputs and outputs is done in a way that machines can understand. So here, we have the output in JSON. And just above that, we have the request that our interactive exploration of this API would generate. So other systems or services or software engineers can take advantage of your model at scale. All right. So we've covered quite a bit of ground. We've talked about how to make these models through notebooks. We've talked about the different ways that you can enhance those models and put them into production, but I didn't actually show you the code for either that Dash application or this API. So the last thing that I wanted to do is talk a little bit about how you can get started writing this type of code. And so if I go back to the RStudio IDE, if you're an R user, my recommendation is to just click New File, and you'll see a bunch of options, two of which are Shiny web applications and Plumber APIs. So those are going to get you started with creating either web apps or APIs. If you're a Python user, inside of the RStudio stack we looked at Jupyter Notebooks. But if you want to do this type of coding for APIs or applications, you're probably going to need a little bit more robust editor. And you can use either JupyterLab or VS Code for that purpose. If I open up Visual Studio Code, the last thing I want to show you here is just how easy that deployment of an API or an application is. So inside of VS Code again, I'm working off of that shared common Python environment.
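The request and response shape of the forecasting API described above can be sketched in a few lines of plain Python. This is an assumption-laden illustration, not the Plumber or Flask code from the demo: the function name `forecast_endpoint`, the parameter names `station_id` and `horizon_hours`, and the constant placeholder "model" are all hypothetical.

```python
import json


def forecast_endpoint(query):
    """Turn request parameters into a JSON response body.

    `query` stands in for the query-string parameters a web framework
    would hand to the handler; values arrive as strings.
    """
    station_id = int(query["station_id"])
    horizon_hours = int(query["horizon_hours"])

    # Placeholder model: a constant forecast. A real service would call
    # the fitted model here instead.
    predictions = [
        {"hour_ahead": hour, "n_bikes": 5}
        for hour in range(1, horizon_hours + 1)
    ]
    return json.dumps({"station_id": station_id, "predictions": predictions})


# Simulate one request for a three-hour forecast at a hypothetical station.
response_body = forecast_endpoint({"station_id": "42", "horizon_hours": "3"})
print(response_body)
```

Because the inputs are plain parameters and the output is JSON, another service can call the model without knowing whether it was fit in R or in Python.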
So it's easy for me to collaborate and automatically get the same Python environment as all my colleagues. I have my code here for a Dash application. And then all I need to do to deploy is use a utility that we've created called rsconnect. So this is just a Python package that you can install, really, wherever you're writing Python code. And it has commands to help you then take that Python code and wrap it up in that reproducible context. So for example, I'll do rsconnect deploy dash to a server called dev. And it's going to run through and identify all the dependencies of this application and then give me the link to the deployed app. And if we follow that link, you'll see the exact same bike share application we were looking at before. So to recap, we covered quite a bit of ground. We created that document using R and Python and some really cool RStudio magic. That document was shared in a reproducible way. We talked about how Jupyter Notebook users can do the same type of early and often sharing, and then how we can use that model to impact decisions, either through apps or APIs. And finally, how you might go about writing those things using some of the new features in RStudio Workbench. With that, I will hand things over to Lou, who's going to bring us home. And then we'll have the Q&A.
LOU: Thank you, Sean. That was great. So, while Sean was talking, I was taking a look at a lot of the questions coming in via Slido. There are a number of questions there that have been upvoted, some of which Sean covered in his demo after those questions came in. But we'll get to as many of those as we can. So, Sean showed off a number of different things here. For the data scientist, he showed how you can use these two languages closely together without a lot of overhead. So the data scientists can use each language for its own strengths.
He also illustrated some of the different IDEs that can be used, allowing data scientists to use their preferred IDE, again, making that easy. And we now support, in addition to the RStudio IDE of course, Jupyter and VS Code. Visual editing of R Markdown is a great advance, again, making the user experience, the developer experience for data scientists much easier. And to answer one of the questions in the Q&A, that visual Markdown is available in the open source version of the RStudio IDE. So there are a number of different ways that data scientists can use R and Python to deliver these wow results to the rest of the organization. For the DevOps and IT teams, using these centralized environments makes it easy to support these common tools for both R and Python, and to operationalize both languages without doubling the work. And by making both of these languages easy to use together, it helps data science leaders really optimize their team for the people, not for an arbitrary choice of a single language, to better enable collaboration within the team and with their stakeholders, and to really be able to access these wider talent pools to hire new data scientists into their team. And finally, for the business stakeholders in the organization, ultimately, they don't care about what the underlying language is. They just want to have reproducible, accurate, understandable data science insights that they can use to help make better decisions. So by sharing this data science work through platforms like Connect, they can access these up-to-date interactive analyses and dashboards, or get the information directly in their email, so they can get the answers they need in order to make better decisions. All of these capabilities are supported by the RStudio Team set of products, which together combine to provide a single home for R and Python data science teams.
Again, RStudio Server Pro is the centralized environment, allowing data scientists to use R or Python to analyze data and create these data products. RStudio Connect is a platform to publish the results, to make them available to business users and other stakeholders, using R or Python-based data science products. And RStudio Package Manager is there to manage open source packages for both R and Python. One of the questions that I saw in the Q&A was a question expressing pain around how difficult it is to manage packages in the Python ecosystem. And we've recently added support for managing packages from PyPI to RStudio Package Manager to help address that exact pain point. And I'll ask Sean to comment on that in the Q&A section. RStudio is, of course, used by millions of people every week using open source software, things like the IDE and the Tidyverse and Shiny, critical open source applications that we create as part of our open source mission. But we are also used by thousands of active commercial software customers, including over half of the Fortune 100 and many well-known brands such as the ones we see here. We also have been really gratified to hear from our customers via TrustRadius.com. So if you are an RStudio user, we encourage you to go to TrustRadius, check out their RStudio profile, read some of the reviews that people have left there, and add your own review, because we read every single one of these. We try and respond. And certainly, this is one of the ways that we hear from our users. We've gotten great feedback from our customers on combining R and Python in a single platform, and how it helps them collaborate among their team, and to essentially allow these teams to, as the second reviewer says, make use of their preferred language for data analysis so that they can create and publish products via RStudio Connect using both R and Python to share with their internal clients and stakeholders, as the third reviewer shows here.
Now we've talked a lot along the way about our pro products, but I want to emphasize that our core mission is to engage and support the R and Python community. And we do that in a number of different ways. The most important, of course, is creating the open source software that our users use every week. But there are a number of different ways we do it. We support the RStudio Community site, allowing R users and Python users to gather and ask each other questions and get answers to those questions. That's a great resource. We do our annual conference. This year it was a virtual conference, under current circumstances. But that was just a couple of weeks ago. And we were very gratified by the engagement there. Check out our blog. And there's a link there to all the global. All-- sorry. All the videos from RStudio Global. Many different speakers from all around the world. Those are all free to watch. We had a tremendous amount of positive feedback, both directly and on social media, on that conference. So I encourage you to check out those videos. Our education team is focused on supporting R education. We do a lot of train-the-trainer work, providing training materials, et cetera. So if you're interested in being a certified RStudio R trainer, check out our Education page. We're also supporters of the R Consortium, a multi-vendor group to support and advance the infrastructure around the R language, as well as a platinum sponsor of NumFOCUS, which provides tremendous support for the Python ecosystem, among other projects. And then finally, Ursa Labs-- we've been a supporter of Ursa Labs from the beginning. And Ursa Labs is devoted to developing cross-language capabilities such as using the Apache Arrow Project to provide access, both within R and Python, to those capabilities. And it's important to emphasize how our open source and pro products tie together. Of course, it's our core mission, as I said, to contribute open source software to the community.
And we spend over half of our engineering resources creating this free and open source software. As the data science community adopts open source software, this drives adoption within larger enterprises and commercial customers. These commercial customers in turn buy our pro products that are focused on helping scale out and operationalize open source data science. And buying our pro products provides RStudio the funds so that we can sustain our ongoing open source work. We call this idea the virtuous cycle, this idea that we're supporting our mission to deliver free and open source software to the community by selling the pro software to the enterprise companies that need those features. So if you'd like some more information, we have a number of different resources. Again, these slides and the recording will be sent out within a few days after the webinar. RStudio.com/Python is your one central portal to get to a lot of this information. We also did a blog post recently recapping all the Python related features we added in both our open source and pro products over the last year. So I encourage you to take a look at this. If you'd like to set up a time to talk to us one-on-one, you can use this URL here, rstd.io/r_and_Python, to learn more, set up a meeting, get some answers. The webinar recording will be available on our Resources site. If you'd like more technical information-- and a number of the questions in the Q&A were looking for more technical details-- check out some of these links: the reticulate package website, as well as some examples and a webinar on that topic. We also, on our solution.RStudio.com site, have a number of articles providing deeper information on how to integrate Python in RStudio Server Pro and RStudio Connect. Again, those are all accessible through the top level RStudio.com/Python portal. And our community site is a great place to ask questions about R and Python, open source and pro.
Now going through the Q&A, the most popular question was, as an R user, how can I learn Python? What would you recommend to experienced R users? I did a quick poll of our education team. And these couple of books floated to the top. Python for Data Analysis or Python Data Science, both of these books were recommended by our education team as being more data first, as opposed to programming first. I will try and get a few more recommendations to add to the slide before we share it with the participants. SEAN: And I would add to that, Lou, as part of this webinar afterwards, we'll be sharing information on the RStudio community page. That Rstud.io/RPyQA link, the link that is currently bringing you to the Slido with all the questions, will be redirected to that community thread. And we would actually love for all of you to give input to answer that question as well, because there's a lot of diversity in how people learn. We know these communities are made up of very different people from lots of diverse backgrounds. And so if you have something that's worked really well, we'd recommend replying to that community thread. We'd love to open source the answer to that question beyond us at RStudio. LOU: That's a great point, Sean. Thank you very much. So with that, Sean, are there any particular questions that you'd like to kick off with? SEAN: Yeah, absolutely. I think one of the most common questions was, what parts of that demo are available on the open source side? What parts are part of the professional products? And so apologies if I didn't do quite enough signposting there to delineate.
Essentially, everything that you saw inside of the RStudio IDE-- so that visual editor of R Markdown documents, the ability to combine R and Python inside of an R Markdown document, selecting what Python interpreter to use, the Python objects in the environment pane, the Python REPL even, those are all going to be in that open source desktop IDE, regardless of what version you use. And in fact, I would encourage folks to look at the reticulate website that talks a bit more about some of the options that I didn't dig into for combining R and Python in that open source way. One of the questions asked, if I just have a Python script, can I use that in RStudio? And the answer is absolutely. That's something that I didn't show, but it is available in the open source IDE. So I'd encourage folks to go there. The things that were specific to our professional products would be the selection of different editors from that common workbench as well as the deployment of work to RStudio Connect. I would encourage folks-- especially there were some folks who were saying, I work at a research institution, or I'm teaching at a university, would I benefit from some of those professional capabilities? I would encourage you to reach out. A lot of the professional products we give away for free if you're teaching, and are pretty discounted for research as well. And so hopefully that helps answer that kind of common question about what can I do on my own today? Everything inside of the RStudio IDE. What would be part of the professional products? That would be anything you saw in terms of sharing or the different editors within the RStudio workbench. LOU: Thanks, Sean. So one of the other really popular questions-- and I alluded to it earlier-- is this idea of the challenges of package management in Python. And this is an area I'm particularly excited about because of the recent addition of PyPI support in RStudio Package Manager, initially in beta.
Would you like to comment on that at all? SEAN: Yeah, absolutely. So I would tend to agree it is a bit of a mess. That certainly is kind of my experience. Right now, RStudio sits on top of the many Python management tools that are available. And so if you already have a tool of choice-- you saw in my demo, I was using a virtual environment-- if you're using something like Conda or Poetry or pyenv, RStudio will sit on top of all of those options. And in fact, the kind of key engine behind all of this is an open source package called reticulate. And within that package, you can see the different functions that help the IDE and R identify what Python environment to use. That's also something that you'll see in the Options menu that I mentioned, where if you go inside of the RStudio IDE and click Tools, Options, basically any of those Python environments, whether they're from Conda or virtualenv, are going to be available. Now that doesn't necessarily mean that the headaches go away. And so one of the things that we're working on in the future is extending the support for creating and managing those virtual environments from within the IDE as well. And so you can look out for that on the horizon. And then, as Lou mentioned, on the professional side, if you think those headaches are challenging as a single data scientist, as a team of Python users, that can often present even more of a challenge, which is one of the reasons why we're investing for those commercial teams in the package management repository that supports both R and Python to help make that work reproducible, to help the IT folks say what packages should be allowed, as well as that time travel capability that I briefly showed. So that's, I think, a long-winded way of saying we all can commiserate with the challenges in Python dependency management. We're going to continue to invest and make it better.
But all of the kind of work-- we're standing on the shoulders of giants here-- is available to you from within the RStudio IDE. LOU: And I just want to add to that another plug for RStudio's own Alex Gold, who's also part of the Solution Engineering team, and is going to be doing a webinar in a couple of weeks on the challenges of package management and how to address them. We're going to be doing a series of blog posts between now and then, talking about the package management problem. So I encourage anyone who's interested in that to check it out. Now Sean, I got a favorite next question. But anything that you want to jump to before I toss that one out? SEAN: So there was one question I really liked. It's a very RStudio question, which is asking, what are the limitations? And we try to be pretty upfront with that. So I'll just throw out there I personally use the RStudio IDE when I'm combining R and Python together. I have friends who use the RStudio IDE for all their Python work. I have other colleagues who use VS Code for their Python work. So kind of our common theme is the tools should be subservient to you as a data scientist, and not the other way around. So pick what works for you. But some of the specific limitations that you might run into that we're investing to make better, but you might hit today, are that kind of creation of Python environments-- I mentioned RStudio sits on top of that, but it doesn't really give you tools yet today for creating Conda or virtual environments. And then the other limitation I would call out is that the debugger inside of RStudio today is still pretty R-oriented. And so if you're spending all day kind of writing a long Python application and you are using the debugger as a critical element of that, I would tend to recommend something like VS Code as maybe a better option. So that's my favorite question, talking about limitations. What was your favorite question?
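As an aside to the environment discussion above: since the IDE is described as sitting on top of whatever Python environment already exists, it can help to check from the Python side which interpreter is actually running, and to create an environment yourself. The following is a minimal sketch using only the Python standard library. The helper names `describe_interpreter` and `create_env` are hypothetical and are not part of RStudio or reticulate, and the virtual-environment check only covers environments made with the standard `venv` mechanism, not Conda.

```python
import os
import sys
import tempfile
import venv


def describe_interpreter():
    """Report which Python is running and whether it lives in a venv.

    Hypothetical helper for illustration only.
    """
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        # Inside a venv, sys.prefix points at the environment while
        # sys.base_prefix points at the installation it was created from.
        "in_virtualenv": sys.prefix != sys.base_prefix,
    }


def create_env(env_dir):
    """Create a bare virtual environment (without pip) at env_dir.

    Returns the path of the pyvenv.cfg file that marks the environment.
    Hypothetical helper for illustration only.
    """
    venv.create(env_dir, with_pip=False)
    return os.path.join(env_dir, "pyvenv.cfg")


if __name__ == "__main__":
    print(describe_interpreter())
    # Build a throwaway environment in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        cfg = create_env(os.path.join(tmp, "demo-env"))
        print("environment created:", os.path.exists(cfg))
```

An environment created this way is an ordinary virtual environment on disk, so it should be the kind of environment the speakers describe selecting from the IDE's options, though how it appears there is not shown in this transcript.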
LOU: Mine was a closely related question, which was-- I can't find it now on the list-- but there was something to the effect of "can we now use Python within the IDE without a lot of shenanigans?" So-- and my view on that is the most recent release, RStudio 1.4, has lowered the bar of necessary shenanigans considerably. You want to comment on that? SEAN: Yeah, I would agree. As I said, I do have some colleagues, especially those who know R and are learning Python for the first time, and it can be really nice to not introduce yet another editor. It can be challenging when you're trying to learn a new language to also be learning a new tool. And so I think if you're someone who knows R, that barrier of entry is low enough now that you can get started using Python right within the RStudio IDE. Then if you want to graduate to another editor, that's fine. But at least you're not learning both those things at once. But I kind of echo that as well to educators that are out there. Someone asked, what language should I teach? Well, I think the key is to teach what you're going to be comfortable teaching to make sure that your students have a really effective data science experience from day one, that they're not stuck fighting their tools before they're able to create their first plot, whether that's a matplotlib or a ggplot2. You want to give them that gratifying moment early on. And it's my belief that a lot of folks at RStudio have done a ton of awesome work to ensure that the RStudio IDE isn't going to present that hurdle and isn't going to fight students. If you are a teacher interested in using R and Python, I'd also do a quick shout out to our RStudio Cloud, which can reduce that hurdle even further by allowing folks to start writing code without installing anything on day one. LOU: And that's a good segue to a question I wanted to answer about RStudio Cloud, which is we had a question on whether RStudio Server is running on premises or in the Cloud or what?
RStudio Server Pro-- typically, most of our customers will install RStudio Server themselves, but where they install it varies considerably. It could be on-prem and often is. Or it could be in a virtual private cloud on any of the major cloud providers. And so our customers do both. We also have marketplace offerings for our RStudio Server Pro on all three major clouds. So you can search for RStudio in those cloud marketplaces. And that's a quick way of spinning it up. And then of course, RStudio.Cloud is, as Sean just mentioned, a way of getting started with similar functionality without having to install anything, and that's a hosted service purchased on a monthly basis. There was also a question on-- someone asking, they have a license for RStudio Server Pro. Is the launcher available, or does that require the enterprise flavor? The launcher is actually-- Sean, let me-- I thought I knew the answer to that. And I caught myself. Could you clarify? SEAN: Yeah, I would say for those kind of specific questions, or if you have questions about the professional products, our sales team is happy to help. That sounds like a cheesy ad. But I can tell you firsthand at RStudio, our folks that work with our customers are all, in their own right, really talented data scientists. And they'll be able to help you navigate some of these questions. And we'd be the first to say you don't need a professional product. The open source stuff will work. Or we can help you solve some of the challenges that come up. So I think that's kind of how I would end things there, Lou, is that we'd love for you to dive into the resources, dive into that community thread. And then if you are encountering some of the challenges that we presented throughout this webinar, feel free to reach out to us. And we are happy to help you go through that path as well. [MUSIC PLAYING]
Info
Channel: RStudio
Views: 2,991
Rating: 4.9591837 out of 5
Keywords: rstudio, data science, machine learning, python, stats, tidyverse, data visualization, data viz, ggplot, technology, coding, connect, server pro, shiny, rmarkdown, package manager, CRAN, interoperability, serious data science, dplyr, ggplot2, tibble, readr, stringr, tidyr, purrr, github, data wrangling, tidy data, odbc, rayshader, plumber, blogdown, gt, lazy evaluation, tidymodels, statistics, debugging, programming education, forcats, rstats, open source, OSS, reticulate, lou bajuk, sean lopp, webinar
Id: 0Ty4y3ZYA1M
Length: 48min 4sec (2884 seconds)
Published: Mon Feb 22 2021