10 R Packages You Should Know in 2020

Captions
What's up, everybody? Welcome back to my YouTube channel, Richard On Data. If you're new here, my name is Richard, and this is the channel where we talk about all things data: data science, statistics, and programming. Subscribe for all kinds of content just like this, and hit the notification bell so YouTube notifies you whenever I upload a video.

I did a video a little while ago comparing R and Python in the year 2020, and I concluded that while Python is, relatively speaking anyway, the king of the TIOBE index and of data science jobs in 2020, R is still very much alive and well. While Python has been getting more popular and faster, R is still a fundamental part of a lot of companies' analytics infrastructures, and things like that simply don't change overnight. And when you consider a lot of different data science and analytics functionalities, R is still competitive with, or just downright better than, Python.

A lot of people ask, when they start picking up a new language: do I have to memorize this many functions? The short answer is: sort of. Most people know that R is open source and comes with a ton of different packages that you can install and then load. There are a few packages whose capabilities you really should know in full, because if you don't, you end up selling yourself short, and it can take you way too long to work through some problems that exist in the real world, just because you may try to create your own solution from scratch, not knowing that a solution to that kind of problem already exists in some package that's out there. But it's not a true exercise in memory, because a lot of R packages come with their own cheat sheets, and it's a surprisingly low number of packages that covers most of the functionality a lot of data scientists need from day to day. So I thought it might be helpful to run through ten of the packages which are going to serve a lot of people the best. Keep in mind, this is not a
comprehensive list, and I'm mostly going for general usability over specialization. Just as an example, if you're doing time series analysis you'll probably at least want to look at the zoo package, but I'm not including it here because time series is a fairly specific sort of application. If you know all ten of these packages, though, it'll truly be easy to handle a lot of the problems that real-world data will throw at you, and it'll also be easy to pick up new packages and add them to your workflow as needed.

These are generally in no particular order, but the first five packages are part of the core tidyverse. That means it's super simple: just run install.packages("tidyverse") followed by library(tidyverse), and you are ready to rock and roll with these packages.

First and foremost is dplyr. This is going to be your go-to for most of your data wrangling and data manipulation needs. One of the things that makes dplyr super nice is that it operates with the pipe operator, %>%. This thing looks pretty weird at first, but it's actually awesome, and I think of it like a "then" statement in English: basically, you're saying take your data, then select these columns, then filter these rows. Just as an example, this is the first page of the dplyr cheat sheet. Notice there are a lot of different functions in dplyr, but only a few absolute core functionalities. If we start on the right side, you've got the select() function, which you can use to select your variables, aka your features, for your data set. If we move into the center, you've got filter(), which you can use to subset rows based on a condition that you specify. distinct() is another self-explanatory function that I use a little bit still. If we look in the center towards the bottom, you've got arrange(), which is the equivalent of a SQL ORDER BY statement. Then on the right again you've got mutate(), which creates variables based on other variables, and there are other extensions of that.
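As a minimal sketch of those verbs chained together with the pipe (using the built-in mtcars data set, and assuming dplyr is installed):

```r
library(dplyr)

# Take the data, then filter rows, then create a column, then sort,
# then select columns -- the pipe reads like "then" at each step.
result <- mtcars %>%
  filter(cyl == 4) %>%                  # subset rows on a condition
  mutate(kpl = mpg * 0.425144) %>%      # new variable from an existing one
  arrange(desc(mpg)) %>%                # like SQL's ORDER BY
  select(mpg, kpl, hp)                  # pick just the columns you want

head(result)
```

Each step returns a data frame that feeds into the next, which is what makes the chain easy to read top to bottom.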
mutate_all() and mutate_at(), for example. But then, once we end on the left-hand side, we've got group_by() and summarize(). These are very quick, easy, and convenient ways to create detailed summaries, or even to create summary stats and then do more wrangling on the new data set that you've created.

Once you've got your data all wrangled into a form that you can at least somewhat work with, number two on the list is going to be ggplot2, and this is going to be your go-to for most things in the data visualization department. I really love this package because it's incredibly flexible, and that's because ggplot2 operates through what's called the grammar of graphics framework. This basically means that any sort of graph is composed of at least three separate parts; above all else, you need, number one, data; number two, a coordinate mapping system; and number three, objects — or as ggplot2 likes to call them, "geoms". This is just the first of two pages of the ggplot2 cheat sheet, and most of what's going on here is the different types of visualizations you can create, and the possibilities are endless: you've got density plots, histograms, bar charts, scatter plots, line plots, rectangles, pretty much whatever you need. If you're not used to the grammar of graphics framework, there can definitely be a bit of a learning curve, and that's why the example in the bottom left is super helpful. Just to walk through it: it's saying make a plot of the mpg data set, and under the aes() argument, the variable hwy goes to the x-axis and the variable cty goes to the y-axis. The geom_point() add-on tells the function to add points to the graph, so we're basically creating a scatter plot, and the color = cyl argument to it tells ggplot to color the points based on values of the cyl variable. Next we've got geom_smooth(), which adds a smooth line — basically a least-squares line there. Then there are the coord_cartesian() and scale_color_gradient() parts.
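A minimal sketch of that example (the mpg data set ships with ggplot2; method = "lm" is my addition to make the smoother an explicit least-squares line):

```r
library(ggplot2)

# Data + aesthetic mapping + geoms: the grammar-of-graphics recipe.
p <- ggplot(mpg, aes(x = hwy, y = cty)) +
  geom_point(aes(color = cyl)) +     # scatter plot, colored by cylinders
  geom_smooth(method = "lm") +       # straight least-squares trend line
  theme_bw()                         # black-and-white theme, for design

p   # printing the object draws the graph
```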
The coord_cartesian() and scale_color_gradient() pieces are maybe a little bit redundant here, because you're basically just specifying the coordinate system and the color scale, but then there's also the theme_bw() object, which imposes a theme on your ggplot2 graph for design purposes. Then, after all that work, here's our graph. Simple, straightforward enough: you've got the hwy variable on the x-axis, the cty variable on the y-axis, and the points are colored based on the cyl variable — so you've just got a scatterplot with a line of best fit through it. Nothing to it.

Number three on our list is tidyr. tidyr exists for exactly one purpose, and that's to get your data into a tidy format. This basically means that every row is an observation, every column is a variable, and every cell in your data has a value in it. tidyr is actually the next iteration of packages like reshape and reshape2, except it's a little more stripped down and actually has less functionality. There are really four core functions that I use consistently with tidyr. Starting on the right side, you've got separate() and unite(); these can be used for breaking a variable down into multiple variables, or turning multiple variables into one, respectively. Then you've got gather() and spread() in the middle. These have actually been superseded by the newer functions pivot_longer() and pivot_wider(), but the idea is you can condense multiple different columns into new rows, or vice versa. So if you look at the example under gather(), you've got 1999 and 2000 as different columns, but this function makes it easy to turn those into just one variable and call that thing "year". This can work wonders for getting your data nice and clean and tidy.

Now that your data is hopefully in a nice, clean, tidy format, it's probably time to pass different functions to it, and this brings us to number four, which is purrr. Is it "purr" or "purrr"? Either way, it's named after the weird vibrating sound that cats make, so it immediately gets points in my book for that.
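Before digging into purrr: the tidyr reshaping from a moment ago can be sketched like this (a made-up two-country table, in the spirit of the cheat sheet's 1999/2000 example):

```r
library(tidyr)

# Year values stored as column names -- not tidy.
cases <- data.frame(
  country = c("Afghanistan", "Brazil"),
  `1999`  = c(745, 37737),
  `2000`  = c(2666, 80488),
  check.names = FALSE
)

# pivot_longer() condenses the two year columns into one "year" variable.
tidy_cases <- pivot_longer(cases,
  cols      = c(`1999`, `2000`),
  names_to  = "year",
  values_to = "count"
)

tidy_cases
```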
purrr gives you ways to deal with lists, and also to map functions to the different elements of a vector or a list. This is sort of like what you would do with a for loop, except it's much easier to read and much faster. Here's the purrr cheat sheet. On the left-hand side we've got the map functions, so we can execute a function multiple times over the different elements of a list or a vector, and the different variants can return lists, data frames, character vectors, factors, numeric vectors — whatever you want. It's very similar to the lapply() function and the general apply family of functions in base R. But then you've got a ton of super useful functionality for wrangling your data when you've got lists: functions for filtering values, creating summaries, restructuring the data, all kinds of great stuff.

Number five on the list is going to be stringr, and stringr is going to be your go-to for your string, your text, and your regular expression needs. stringr does have a lot of functionality that also exists in base R, but it's a lot more user-friendly and frankly just easier to remember in stringr than in base R. stringr is fairly straightforward, because almost all of its functions begin with the prefix str_, but you've got a fairly comprehensive toolkit here: you can detect patterns, you can create substrings and easily turn those into new features, and you've got tools for replacing patterns and predefined substrings. This whole package will be at its most useful when you have a data frame that's in tidy form, where several of your features are of character type.

Number six on this list is the first one that's not part of the core tidyverse, and that's going to be lubridate. lubridate, on top of having an overall awesome name, is going to be your go-to for dealing with your problems involving date or datetime (aka POSIXct) variables. Let's face it: datetimes in R can be a giant pain to deal with sometimes, and they can have a bit of a learning curve when you work with them.
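A minimal sketch of those stringr tools (the file names are made up; str_detect(), str_sub(), and str_replace() are the kinds of prefixed functions on the cheat sheet):

```r
library(stringr)

files <- c("report_2019.csv", "notes.txt", "report_2020.csv")

str_detect(files, "report")              # pattern detection: TRUE FALSE TRUE
str_sub(files[1], 8, 11)                 # substring extraction: "2019"
str_replace(files, "\\.csv$", ".tsv")    # replace a predefined pattern
```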
With lubridate, it's usually a breeze. Right up front you've got the as_datetime() function, which is basically a wrapper for turning what looks like a datetime — but is really a character string — into a POSIXct, the main datetime class that you work with in R. Then your core functions are in the middle here, for extracting things like the date, the year, the month, and so on. Very helpful stuff, and naturally you can both extract and manipulate your data using these functions. There's also a ton of other functions on the second page for adding increments of time, as well as handling durations. Super handy package; lubridate is definitely one to have in your repertoire.

Number seven on this list is a package which is transparent and reflects light — I'm talking, of course, about Shiny. Shiny is one of the most powerful and overall awesome packages in the entire R ecosystem. What it is, is a tool for creating interactive web applications that end users can play with, and in my opinion, anyway, it truly blows anything that you can do in Python out of the water. Let's start on the left-hand side here: a Shiny app will consist of two distinct components, a UI and a server. You can build them in the same file, but I like organizing myself by keeping them separate. The UI is your front end — a web page which will have inputs that are configured by the end user. The server is your back end — basically, your computer running a live R session. On the right side of this cheat sheet you've got some examples of the different inputs the end user can play with in the UI, and those inputs get fed to different output objects inside of your server. Some examples are action buttons and checkboxes, which can be used to configure various things, but then also dates, files, numbers, text inputs — you name it. Then if you look down the middle, those outputs could be data tables, regular tables, text strings, plots, even other UIs, which is a super powerful feature in and of itself.
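The UI/server split can be sketched in a few lines (a hypothetical minimal app; the shinyApp() call is commented out because it starts a live session):

```r
library(shiny)

# Front end: the web page, with one input the end user configures.
ui <- fluidPage(
  sliderInput("n", "Number of points:", min = 10, max = 500, value = 50),
  plotOutput("scatter")
)

# Back end: a live R session; the output reacts to the input.
server <- function(input, output) {
  output$scatter <- renderPlot({
    plot(rnorm(input$n), rnorm(input$n))
  })
}

# shinyApp(ui, server)   # uncomment to launch the app in a browser
```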
It takes a long time, I think, to get the hang of Shiny and of reactive elements in general, but it's truly a remarkable framework, and the power of it will amaze you — and probably your client, too.

Number eight is R Markdown, and with this you're getting the functionality of a ton of different packages, all with the end goal of creating usable, easy-to-read documents and notebooks. R Markdown is incredible, and I really think the integration that it has with the RStudio IDE makes it easier to use and more powerful than Jupyter notebooks — and trust me, I have nothing but love for Jupyter notebooks. Here's the first page of an R Markdown cheat sheet. The long and the short of R Markdown is that it takes a lot of HTML and CSS and stuff that data scientists typically don't know very well — I know I certainly don't — and just does all of that for you, so you can quickly and easily create reports. This page of the cheat sheet basically walks you through a lot of the steps and syntax for creating an R Markdown file. You've got a step for creating a document, which you can then knit to turn into a doc or a PDF. Your R code goes inside various code chunks, and this is a super helpful tool for your workflow if you have to code but also constantly share your results with others, because it keeps everything in one place. Plus, there are also features for incorporating Shiny input into the R Markdown as well, so you can create interactive documents.

Moving on to number nine, we've got caret. caret and Shiny basically compete day to day for the distinction of my favorite R package of all time; which of the two I like more sort of depends on what mood I'm in, more than anything. caret is a front-to-back tool for all of your machine learning needs. Now, I will point out there's another package called tidymodels, which compiles a bunch of smaller machine learning packages into a tidy framework, but in the year 2020 I do think caret is still a bit more mature.
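To make the R Markdown workflow concrete: a minimal .Rmd skeleton looks something like this (the title and chunk name are made up; knitting renders the whole thing to HTML):

````markdown
---
title: "Example Report"
output: html_document
---

Plain Markdown text goes here, between the code chunks.

```{r mpg-summary}
summary(mtcars$mpg)   # this R code runs when you knit the document
```
````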
Future development is probably going into tidymodels rather than caret, but the same guy, Max Kuhn, works on both of them, and in the year 2020 I have to say I do still prefer caret. Here's your caret cheat sheet. Let's actually start down the middle column: caret makes processing your data for all of your machine learning needs super easy. You can even start by partitioning your data into test and training sets — which isn't listed here — but you've also got normalizing your predictors, transforming them, filtering, imputing missing data, what have you. Then you can select how your model gets trained and summarized using the trainControl() function, starting with your resampling options: you can pick between your usual cross-validation, bootstrap, out-of-bag resampling, etc. You can also use subsampling methods like upsampling or downsampling for those times when you're working on classification problems with imbalanced classes. You have summary functions like twoClassSummary for an ROC curve, prSummary for precision-recall info, etc. Then, last but not least, you have the train() function. You pick the machine learning method you want — and believe me, you have tons of options here — you can pick how many values per tuning parameter you want the method to evaluate, and then BAM, you've got yourself a very easily trained machine learning model that you can then use for prediction purposes.

Last but not least, we've got reticulate, and I'm putting this one on the list because I'm a firm believer in treating the two languages as "R and Python" rather than "R versus Python".
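Before the reticulate wrap-up, here's that caret flow as a minimal sketch (a decision tree on the built-in iris data; the method and tuneLength are illustrative choices, and the rpart package needs to be installed):

```r
library(caret)

# trainControl(): how the model is trained -- 5-fold cross-validation here.
ctrl <- trainControl(method = "cv", number = 5)

# train(): pick a method and how many tuning values to evaluate.
set.seed(42)
fit <- train(Species ~ ., data = iris,
             method     = "rpart",   # decision tree, one of many options
             trControl  = ctrl,
             tuneLength = 3)         # try 3 values of the tuning parameter

predict(fit, head(iris))             # use the fitted model for prediction
```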
There's no need to treat it as a competition all the time, because these two languages really do have the ability to complement one another very nicely. Also, let's just be real: writing Python code from the RStudio IDE is just awesome. My favorite way to do this is to write the Python code from inside an R Markdown document. It's super simple: inside the code chunk, you specify R or Python, and then you're ready to rock and roll. You can import the Python modules that you want and use your usual dollar-sign operator to extract attributes the way you typically would in a Python workflow. But then you've also got super convenient and easy mapping between R objects and Python objects: you can do this manually, too, because reticulate gives you the awesome py_to_r() and r_to_py() functions.

So these are 10 packages in R which I really recommend any data scientist learn, and I hope you realize that while there's a lot to learn in the data science world, there's a bit of a Pareto principle at work here with R packages. I listed off way less than 20% of all the packages in the R ecosystem, but still, if you master these packages, you're going to be able to do a huge number of the things you'd need to do in the data science universe. And yes, there will always be a learning curve, but me personally, I find learning 10 things a lot more manageable than learning fifty or a hundred things. Thanks for watching this video! If you enjoyed it and you'd like to support my work, the most helpful thing you could do for me would be to share this video; otherwise, please consider at least smashing the like button, and I'll see you all in the not-so-distant future. Until then — Richard On Data.
Info
Channel: RichardOnData
Views: 14,861
Rating: 4.9089694 out of 5
Keywords: top r packages in 2020, r packages to learn in 2020, r programming, top 10 r packages, r programming tutorial, r programming for beginners, r programming language, r programming for data science, r programming 101, tidyverse, dplyr, tidyr, purrr, stringr, ggplot2, lubridate, shiny, rmarkdown, caret, reticulate, machine learning, data science, r (programming language), data science with r, r tutorial, r package, learning r, r packages, r tidyverse, r shiny, r markdown
Id: Os9Um5GHj1w
Length: 18min 40sec (1120 seconds)
Published: Wed Apr 22 2020