R Programming Crash Course

Captions
Hello, my name is Bryan Jenks, and I am a research analyst with the California state government. I run a YouTube channel all about programming, tech, productivity, academia, workflow, and a little bit of lifestyle: basically all things tech, modern, and code. You can find me on YouTube under my name, Bryan with a Y, J-E-N-K-S: Bryan Jenks.

Today we're going to talk a little bit about R. R is a specialized language that deals with statistics and data science, and it's heavily used in academia and by those in the data science field. I use it both personally and professionally, and I have a couple of open source repos that use R. We're going to talk about its history and how to use it, and get you up and running with actually using the language: import a file, make some charts, graphs, and plots, and do a little bit of analysis with that. We'll cover the basics of what the language is and how it works, so let's get into it.

So, programming in R. R can be used for a lot of different things. It's widely used in data science and in statistical research. You can create a variety of different types of documents, reports, or any sort of analysis with R and its various packages and useful utilities. You can also create dashboards with various packages, such as the Shiny package, and it's widely used in academia for research and theses.

R is an incredibly popular language. On the TIOBE index from this most recent year, it went from the 20th spot all the way up to the eighth spot, and it was one of the fastest growing languages this last year. So how is R so popular? Well, given that it's a very specialized language for statistics, analysis, data science, and academia, the fact that it's so highly rated is a testament to how popular and great a language it is. It does have a lot of quirks: there is a lot of weird history, and things that have been left around in the language because of how old and strange it is. But it is a specially geared language for the aforementioned
purposes, and it's so widely used and popular that that's just a testament to how great it is. It does have something of a monopoly on academia: a lot of fields in academia primarily use R for their statistical analysis and for theses, documents, and publications, and it is widely used in the field of data science. Python is definitely one of the go-to languages for data science, and it is also used in academia, but R definitely does have its place. It is a specialized language that works in a lot of these areas, whereas Python is a more general purpose language.

R was created in 1993, but it is an implementation of the S language, which was made at Bell Labs in 1976. So R is a mature language, and it is conservatively maintained. Base R, meaning not the packages that you can import but R itself, the language, is very rarely changed, and when it is, it's usually with the utmost backward compatibility in mind. So changes to R will rarely break things: you can write base R code, and 20 years later it will usually still run exactly how it ran before.

It is highly extensible with its package system. There are an incalculable number of packages, and more are being made every single day; even I've made a package at this point. And R is a vectorized language, which means that even single values are just vectors of length one. There is a lot of power that you can get out of these vectors, and it's easy to perform analysis and bulk operations. If you've ever done JavaScript, it's kind of like using the map function: applying an operation to a whole array of values at once with minimal code. We'll get more into that later in the course.

R is also really interesting because it has elements of both object-oriented and functional programming, and you can do a lot, if not almost anything, with R. It's not a general purpose language, it is specially geared, but with most of these languages nowadays you can do just about
anything. The question is really: should you, and is there a better tool for that job? I mean, you're not going to use a wrench as a hammer. You could, but it's not what it's best made for.

R is also great as an interface or intermediary between other languages. If you've ever done anything with R, or if you have worked with RStudio, the IDE for R, there are a lot of other languages that go into R development and that it works closely with, whether for reproducible analyses, for creating good documents, or for improving speed, since R is single threaded and sometimes needs an extra boost for its calculation speed. So R interfaces very well with C and C++. It works with LaTeX and Markdown. It goes hand in hand with pandoc, so that you can create really great documents. JavaScript libraries come into play when you export to HTML documents. It's just a place that connects really well to other programming languages to produce really beautiful documents, reproducible analysis, and statistical interpretation of data.

It is an incredibly strange but worthwhile language to learn if you do anything that touches data. You don't need to be a PhD or an expert in statistics to use this language and get some value or benefit from it. When I first started with R, the first thing I did was just start writing really cool documents using RStudio and a little bit of R code. So you don't need to be an expert to get a lot of value out of it.

Now, it is worth mentioning that R and Python are kind of put up against each other in competition, especially in the field of data science, and there are pros and cons to both languages. I've used Python a little bit; I've used R a lot. But when it comes down to the right tool for the right job, when it comes to reproducible analysis, the tooling, the graphics, and just the ease of writing code that is legible and reproducible, I personally prefer R. But there are pros and cons to both sides of
the argument.

R is a single threaded language. This means that it runs slower; if you've done JavaScript, it's just like the event loop, where you can only have one thing running at a time and you can't have multiple calculations occurring simultaneously. There are people trying to work around this, and certain packages that try to take advantage of multi-threading, but for the most part R is single threaded. So it can be somewhat slow, but there are workarounds.

Python lacks R's robust ecosystem of stats libraries. There's just no other language, to my knowledge, that really has the amount of packages, tooling, and code built around statistics and statistical analysis that R has. R has an amazing wealth of code that's already been written and that has been widely used for decades. And R's graphics are famously better than Python's. R's graphics made with the ggplot2 package are used in publications like the New York Times, and I think the Chicago Tribune uses it. A lot of the infographics and visualizations you see in modern journalism or scientific literature are usually done with R.

Now, one thing where Python does have a leg up on R is the fact that it is a general purpose programming language. Sure, you're going to use Python for data analysis, but you could also use it to write a web application, or a command line application. Again, you can do a lot of the same things with R, but Python is geared towards more general purposes. You can do a lot of different things with it, and learning Python for data science is also something you could tie into learning Python for any other purpose. Since the data structures and the same sort of logic carry over, you could use Python to build whatever you want, whereas R is geared more towards specific purposes.

So that's all I have for this intro. Let's get into the language itself. All right,
everyone, so to get started with R, the first thing we have to do is install R. Search for "R project" online, or just go to the website r-project.org. When you come here, the first thing you're going to see is this landing page. We want to go to "download R", and it's going to ask you which mirror you want to download from. Just pick a place that is close to you. I'm in the U.S., in California, so for the closest place to me let's do Oregon State University. Honestly, it's not a big download; it shouldn't take too long, depending on your internet. This computer is running Windows, so I'm going to download R for Windows, and since I'm downloading it for the first time, I'm going to choose "base" right here, "install R for the first time". You get this nice big download button; click that and it begins to download.

Once your download is complete, open it up. As it is an executable, it's going to ask you to install, and we want to say yes. Once the installation is finished, you have R installed. You have the REPL, or read-eval-print loop, available, so you can actually just go to your command line, type R, and do some command line R. I never do this. Typically what I do is use an IDE, and the IDE I choose is RStudio, an IDE specifically designed for R development and analysis.

Now, you do have to have R installed before you put this on, but we just did that, so now we're going to install RStudio. You can get it by going to rstudio.com and then clicking this download button up here in the banner. We're going to use the free version; this is open source software, and you can find RStudio on GitHub. I'm going to click download, and here it is for Windows, so I'm going to click Windows, and that will download. Once that download is finished, we're going to run the executable for RStudio.

Now that it's finished, we are ready to go. So now I'm going to open up
RStudio. We don't need to open R; you can just open RStudio, the IDE, and it will already be connected with R, and everything will work exactly as you would expect. This is what you see when you first open it. You can customize this IDE and do a lot of different things with it. We're not going to go into the IDE itself too closely; we're mostly going to focus on the language. But this is my IDE of choice. I've been writing R code for a few years, and this is what I would recommend using. It is specifically geared for R, though you can use different IDEs and editors: VS Code, Vim, Emacs, and there are other IDEs built for R. But this is the one I chose and have been using since day one.

All right, so the first order of business: how do you make a file, and how do you execute it? The whole point is that we're trying to write code, execute that code, and achieve a result. Right before we do that, I need to change this theme to a dark theme, because I cannot stand light themes. Let's do something like Dracula. That's cool.

So, how to make a file in RStudio. There's this little button right here to make a new file. You'll get a lot of different options, and over time you'll learn what each of these does. The most important ones, which I'm going to focus on, are R scripts and R Markdown documents. The first thing we're going to do is start off with an R script. An R script is just a script file. It's just like when you write a Python file: a single file with code that you execute, and it runs everything inside of it. An R Markdown document is completely different; an R script is just something where you want to run the entire document in one go.

So the first thing we're going to do is type print, and in double quotes, "hello world", the typical program. Now we're ready to execute this. I could save it, and it actually might ask me to save if I run this, but here are your buttons to run. We can
run this document, and we can rerun different code regions. Some of this stuff is for R Markdown, but we're just going to click Run; you can see it also gives you the hotkey, Ctrl+Enter. I'm just going to click the button. If I run that, down here in the console you can see the code has been executed. Great. We don't even have to save the file for this to happen. This file doesn't even exist on my file system yet; it's all just in RStudio. So to save it, we're going to click the floppy disk, save it to the desktop, call it "R script", and now we have a file.

So R scripts are something we can execute in a single batch: we have all the code in here, and it runs when we execute it or whenever we source the document. An R Markdown document is something different entirely. We can create a new file and say, give me a new R Markdown file. I actually had to update the packages on my system before I could create a new R Markdown document, but after I do that, this is what I'm left with.

R Markdown documents can export to a variety of different formats. The universal format is HTML, because everyone's going to have a web browser. If you need PDF, you'll need to install a LaTeX distribution; I'm not really going to get into that. And if you have MS Word, you can also export to that format. We're just going to leave it on the default of HTML. I'm going to name this "R Markdown", and the author is me.

All right, it'll create a template document with a little bit of stuff in it already, just to show you what it looks like, and I'll run this so you can see what the output looks like. This is an R Markdown document, so it's a Markdown document with a little bit of YAML in the header for some meta options and settings. Then we also have these code chunks, which are written in a format similar to pandoc's, with the three backticks, but then in curly braces we also have
other options, like which language to use, and then options for this specific code chunk. So let's run this and see what the output is. I'm going to save it as "R Markdown" on my desktop. It will then run, go through pandoc, get knit all together, and the output will look like this. This is what you would see in the browser: you have your code chunks, you have some output, and this is how R Markdown documents work.

One thing you can do with R Markdown documents is write up text alongside your analysis or whatever code you're writing, and it doesn't have to be R code. There are dozens of supported languages that you can use in R Markdown documents, so in each of these code chunks you could put SQL code, Python code, R, Stan, Julia, whatever. There's even JavaScript: because this is exporting to HTML, you can actually write JavaScript to affect the HTML elements in the document as you're exporting it through pandoc. Needless to say, I have a whole playlist on R Markdown documents and the crazy things you can do with them, but what we care about most is creating an output and actually executing some R code.

So you could write up your analysis, and you have a code chunk here, but you don't want to have to knit your document every single time you're trying to look at some output. Sometimes you just want to quickly iterate, see some results, see a value, and you don't want to knit the document and wait for it to run through pandoc every single time, especially if it's a large file dealing with large data. Sometimes you just want little tiny results so you can move at an incremental pace. This is one of the reasons why R Markdown and R are some of the greatest things for data science, just like Jupyter and Python: they allow that quick iteration. So I could easily just click this little play button over here on this code chunk; you can see the banded rows. If I click on
the play button, it'll execute this code chunk after the setup chunk. So it runs the setup chunk, because that sets up the system, and then it runs the second chunk and gives me the output right here. I don't have to knit the document; it's only executing the essential setup and then this code chunk. Now I can see the output, and I can adjust it and do whatever sort of analysis I want: print "hello code chunk". Cool. I can run this again, and now we have additional output. This is how we're going to be working.

You can use R scripts and have them run the entire code all in one batch. Typically, in my workflow, and you can make yours however you want it to be, I will use an R script, sourced all in one go, just to clean and export my data sets, or my data frames, if you're familiar with Python, pandas, and data frames. So I use a single script to do all my steps and give me a cleaned set of data, ready to put into an R Markdown report, analysis document, dashboard, etc.

A little bit more of what you can expect from this IDE: we have several panes here. We have our files, which we can access in here if we have multiple scripts or documents. If we make a plot, say from an R script, it can open up in the plot viewer here. We have the packages that are installed. There's help, if you look up help documentation, and another viewer, if we want to run those HTML exports and show them in the viewer instead of a pop-out window. We also have the environment up here: if we have a variable assigned a value, both will appear up here, so we can see which things are already allocated. We have history, which I never really look at too much; connections, if you have a connection to a database; and tutorial, if you want to take a tutorial in RStudio.

So the first thing we're going to start off with is some basic syntax and some hotkeys. I live and
die by the hotkeys I learn for every program, so as I go through this, I'm going to teach you the hotkeys as I use them while I write code.

First, and one of the most important things: how do you comment out items in your code? For this, it's Ctrl+Shift+C, and it doesn't matter whether you're inside a code chunk or outside it. It'll comment out your code or your text using whatever syntax applies to the area you're in. So in this Markdown space, where it's actually a Markdown document, if I press Ctrl+Shift+C, it'll insert an HTML comment, because Markdown is basically HTML with simpler syntax. I also want to do that in my code chunks. In R code, comments use the hash symbol, or pound symbol, which is the same style of comment as Python. So I can do the same thing: with Ctrl+Shift+C I can toggle comments. This is really useful when you have multiple lines: it'll comment out multiple lines. Now, it is important to note that R does not have a multi-line comment. So something like a docstring in Python, or anything that spans multiple lines, like SQL comments or C comments where you have this style of syntax, or even JavaScript: this doesn't really work in R.

What we're going to do next is assign some values to some variables, one of the most important things in programming. I'm going to assign this string to a variable called "my string". Now I can run specifically just this line of code by pressing Ctrl+Enter. You can see that a little bit of green popped up here; that means that I actually ran this line. Now, with that variable assigned that string, you can see it has appeared in our environment pane: "my string" has the value "comment me with ctrl shift c" assigned to it. So now I can just print the variable, running it with Ctrl+Enter, and it will display below the code chunk. If I
want to run the whole thing, I could.

So next, let's just use R as a basic calculator. You could do this either in the console or in the code chunk, executing the line with Ctrl+Enter. If I run 7, it just displays 7. If I typed 7 plus 7 and ran that in the console, it would tell you the result. In a sense, R is basically a super complex calculator.

Now we're going to run several other lines and assign some variables: x is going to receive the value of one, y is going to receive the value of two, and z is going to be the sum of those two variables. If you look at this code, you'll notice something interesting: I'm doing assignment with an arrow symbol here, but I'm also doing it with an equals sign. What's the difference? Well, if this is going to bend your mind a little bit, we can actually do an arrow assignment to the right as well. In that case we'd have to switch which sides 2 and y are on, but that will also work. There are multiple ways to assign values in R, and all of these are valid. Some are more conventional, and the one that is considered best practice is to stick with the arrow symbol. But if you come from a more software-oriented background rather than academia, you might just want to stick with the equals sign. For ordinary assignments like these it honestly does not matter; just pick one and stick with it. I usually stick with the arrow symbol because the hotkey is Alt+minus, and I'm just used to doing that at this point.

So now let's say we have all this code in this code chunk that we want to run, and you don't want to knit the whole document and take up that time; you just want to run everything in this code chunk. To do that, it's not Ctrl+Enter, which runs a line; we're going to press Ctrl+Shift+Enter and execute the whole chunk. So 7 appears, the string was printed, and all of these variables were assigned; you can see that all of it exists up here. If I wanted to clean up the global environment and say, okay, I
don't want anything assigned to anything, I want to start fresh, I can click the broom icon here. It'll ask me if I want to do that; yes, I just want to clean out everything. Now there is nothing assigned, and x is not equal to anything. If I type x in here, I get an error, because x is not found. So now I can just run the whole code chunk all over again with Ctrl+Shift+Enter, and all that stuff exists in the global environment again.

All right, now we're going to get into data types. In contrast to other programming languages like C and Java, in R variables are not declared as a specific type of data. You don't have to say "int", the variable name, and then set it equal to something. Variables are assigned R objects, and the data type of the R object becomes the data type of the variable. There are a lot of different types of R objects, but the most frequently used ones are vectors, lists, matrices, arrays, factors, and data frames.

A little bit about each. Vectors: everything's a vector. If you just assign a single value to a variable, it's a vector of length one. A vector has to be all the same type: if I have a vector of a bunch of different words, I can't keep a number in there as a number. Lists are lists of other vectors, and each of those vectors could be its own data type. Lists can contain multiple vectors of differing lengths and different types. That leads to what a data frame is, which is basically a list of vectors that can all be different types, so you have a tabular data set; think of a table of data in Microsoft Excel. Matrices are commonly used in base R. I'm not going to get into the specifics of tidyverse versus base R; that's something you'll get more experience with. We're just going to cover some of the basic functionality of the language. Arrays: I've never really used arrays in this language. And factors are basically categorical variables. It's something unique that I've seen in R versus other languages, where you have some data, and you take that data and you say, this
is categorical. It's not a string, it's not a number, it's a category, and you want to treat it like a category: categorize it, and do operations and plots with a category rather than with specific counts of a string's occurrences. So you can make that string into a factor, and that means it is now treated as a categorical variable instead of a string. This comes in very handy when you're dealing with plots later on. But let's go over these data types in detail.

First, a vector. In its simplest form, a vector is a single value assigned to a variable. We already made some single-value vectors when we did x, y, and z. Now we're going to look at different types, and then at multiple values in a vector. For instance, we're going to assign TRUE to v here and then print the class. You can see at the bottom of the code chunk that the class is logical, because it's a boolean, which is true or false. Next we have numeric. Numeric is what you would call an int or a float; both of those are numeric. If I run that, you can see the float is considered numeric, and if I change this to just 23 and run it, it is also numeric. Now I want to assign an integer, and that's when you have to put the L at the end; now it's an integer. Why you might want integer versus numeric is that integers take up less memory than floating point numbers do. Complex: I've never done anything with complex numbers before, but they do exist. When you have an i, it is a complex number. Now we're going to assign TRUE, but in double quotes, to v, and that is a character. This is what you would call a string; character vectors are strings. And if we take this character string and convert it to raw values, its class is raw, and if I print the output of that, you can see this is the word "hello" in raw values.

So now let's make a vector with more than one
value. All we've been doing so far is putting a single value into the v vector, but now I'm going to add multiple values. We do that with the c function; c basically stands for concatenate or combine. Here I have three different comma-separated values. You can use either single or double quotes; it does not matter in this instance, as both are treated the same by R. So I can take these three values, combine them into a single vector, and assign that to the apple variable. If I run that, we now have apple up here in the environment pane. It's a character vector of three values, with index values one to three: red, green, yellow. Those are the three values in the apple variable. So now I can print that vector: red, green, yellow. And now let's print the class of this apple vector. It is a character vector, because the three values in it are character values, or strings.

Next is lists. Lists can contain multiple kinds of elements, like vectors, functions, and even other lists. In this case we call the list function to make a new one. The first item uses the c function to combine values into a vector: the values two, five, and three. We also have the float 21.3, the sine function, and the boolean value TRUE. So that's four items in a list, but one of those items is a three-value vector. Let's run it and see what this looks like. The double brackets here give the position in the list, and below each one is the output for that element. You can see that the first element is the numeric vector, and the second is the float. The third one is what happens when you give the name of a function: it's stored as a function, and it gives you some strange looking output. And then the fourth value is
that boolean TRUE.

Next up is the matrix. A matrix is a two-dimensional rectangular data set that can be made by passing a vector as input to the matrix function. In this case we have a vector here, which is a, a, b, c, b, a: six values, all letters, so it's a character vector. We give the function the parameters that the number of rows is 2 and the number of columns is 3, so you can basically cut this in half and say the second half is going to go below the first, for a 2 by 3 matrix, with byrow equal to TRUE. I don't really use matrices a lot; I typically use data frames. But it is important to know this. When I output that, you can see that we have two rows and three columns, and it's basically like wrapping that vector around itself.

Next we have arrays. I don't use arrays too often in R, but it is a common data structure used in pretty much all programming languages. We're going to create an array from a character vector, with the dimensions 3, 3, and 2. One very important thing to know about vectors applies whenever you stack them together, whether in data frames or, as in this case, in an array. Here we have a vector of two values, green and yellow. If you set a dimension to three, R does something called recycling: you need three values but you only have a vector of two, green and yellow, so it'll start over at the beginning and use green again. This is what vectors do: if you have a vector that's too short, it'll start repeating its values. If this is behavior that you do not want, you need to pay attention to the length of your vectors and what you're doing with them. So we're going to assign this array to the variable a and print it out, and you can see that we get repeats. We have green, yellow, then it starts over at green, and it repeats again, so it's green, yellow, green, yellow, green, yellow. You can see it repeats the two values to fill a three by three matrix. That's what this dim
parameter is saying: the dimensions of this array are given by the three values in this numeric vector, meaning I want three by three arrays filled from two values. The two values are this character vector, and so it basically does the same thing as a matrix in this instance, where it wraps around and around repeating that vector: green, yellow, green, yellow, green, yellow, and so on. And we actually have two of these three by three matrices.

So, I mentioned factors earlier. Factors are those categorical variables. We are going to say apple colors here are equal to green, yellow, red, green, and so on; we have a bunch of different pieces of data in this vector. We're going to take this vector of multiple values and assign it to apple colors. Now, we have repeats in here, but we want to treat this as categorical rather than as multiple strings. So we're going to take the variable that's holding this vector and put it through the factor function. This is going to take these strings, make them into factors, and group them categorically, and we're going to assign that to factor apple. Now let's print that. You can see it prints out all those values, but we also have the levels here. So if you had a long list of a variety of different values and you wanted to see what the unique factor levels are, you can see that here there are only three of them. You could also pass this into a different function and count how many instances there are of each of these categories. You might want to do this if you have values of male or female in a data set; you can treat those two string values categorically. We also want to print the number of levels with the nlevels function, and you can see the number of levels is three.

And finally, one of the most
commonly used data structures is data frames. Data frames are used in Python as well, with the pandas package, and they can even be interchanged between R and Python if you wrote Python code in here. Python and R can actually interface very well and pass objects back from Python to R, so you can use both languages for your data analysis if you so choose.

Data frames are created with the data.frame function call, and you can just pass in a bunch of different vectors with their variable names, which become the column headers. So in this case we're going to make a data frame with the variables, or column headers, gender, height, weight and age, and assign three values to each. Now, one reason data frames are such a useful structure is that all the columns need to have the same number of rows. All of these variables we have here are basically going to become column headers, with one of their values in each row. So male, 152, 81, 42 is one row; it's basically transposing this into a data frame. If you had a missing value, say you had four fields in here but somebody didn't have their weight recorded, then you would actually have to input an NA, or not available. We don't really use NULL for that in R; it's NA for a missing value, in which case it would actually work. But if you only have three values, you need to have three values in every single vector for the entire data frame. This is very important if you're going to do it manually like this.

Typically what you end up doing is importing data from a CSV, from JSON or from whatever data format you're working with, and what you'll need to do is clean that data as it's going into R, or R will say, hey, there are missing values, and you need to determine what you're going to do with them. This is all part of the data cleaning process that you'll go through with R. But in this case we have a very simple data frame, four columns, three rows, and we're
going to print this out, so let's actually run this. Reading this, the variable bmi is going to be assigned a data frame containing these four columns with each of these rows. Now when we print this out, RStudio is really interesting when it prints out data frames, because it actually gives you this nice graphical output below the code chunk. I can see that I have four columns here, three rows, and all the data is represented. If you have a very large data set it'll give you pagination, which means you can see other variables and more rows by clicking on buttons that page through this preview. Or you could go to bmi up here and see a little bit of information about it, such as that these three columns in this data frame are number columns, or numeric, and gender is a character. We can make it a factor with the as.factor function, and now in the environment we can see that gender is a factor with two levels, male and female, and it actually displays this change in here. But if you want to see the whole view of the entire data set, and you don't want to deal with the pagination down here, or you just want to see it in all its glory, you can click this little button right here in the environment pane and it'll open up a new window of your data set in this almost Excel-like grid view. This is very useful if you just want to get a giant view of your data set. And that is data frames.

So far we've talked a lot about data structures, vectors, data frames, your overall data structures in a programming language, and how to assign values to variables. So let's start doing something a little bit more fun that has a little bit more meat to it, that can immediately give you value when you're doing any sort of data analysis and actually help you with your work. Not just here's how to do a for loop, here's how to do an if statement. These are important concepts, but in
R sometimes it's just getting work done that's the most important thing. On that note, what we've covered so far is a lot of base R material. Base R is the base library of the language; it is the core of the language and where everything starts out. Now, you can develop packages and extend the language, just like how in Python you import new packages, or how you have other JavaScript libraries. The same concept exists throughout most programming languages.

With the package system in R there's an ecosystem of packages called the tidyverse. A single package called tidyverse, when imported, imports a whole group of these packages, and these packages all work together using what is called a tidy data format. This was created and written about in a research paper in the Journal of Statistical Software by Hadley Wickham, chief data scientist of RStudio. Tidy data is a way of structuring your data to work with the tidyverse and a way of making a consistent and robust analysis. So when we use the tidyverse and the functions and packages within it, it lends itself to a consistent format for performing an analysis. All the packages use consistent syntax and similar function design, and it will all work seamlessly together and be very familiar, and allow you to achieve a lot of results without quite as much of a learning curve as when you deal with all these different disjointed packages by a variety of developers who all do things different ways.

So let's just dive right into it. One of the main packages in the tidyverse is something called dplyr. To import the tidyverse we're going to call library(tidyverse), and now that we have it in, we are going to start doing some actions with dplyr. dplyr is really great for filtering, transforming and manipulating your data. You get data, it might be messy, it might be disjointed, it might be, you know, filled with null values or just things that need to be cleaned up
and taken care of, and then you want to start performing some analysis. You want to look at your variables, you want to look at this data and say, what is this telling me? What are some questions I can make up from this data that I want to answer and provide to others? So we're going to work with this little built-in data set from the tidyverse package called mpg, which is car mileage per gallon, and we have a variety of data points in here. You can see that this is a 200-something row data frame with several different variables in it.

The first thing we're going to do is use the pipe. If you've done anything with the command line on Unix, Linux or macOS, in Bash, you might know what a pipe is: you take one action, take its output, and then pipe that in as the input to another action. So if you're on the command line you might grab some text and then pipe that into something like awk, or whatever you might want to do. The same sort of concept exists in R, and this Unix-like pipe operator is probably one of the most powerful features. What it looks like is this. RStudio has a keyboard shortcut to automatically insert this character. Basically what this is doing here is that the filter function is going to receive, as its first argument, this entire data set. This tabular set of data right here, the data frame, is going to go in as the first argument of the filter function. And if we look into it, the first argument that these tidyverse functions expect is the data. This is how it is consistently set up, and it's really useful. So we can say, hey, use this data and then do all of your other actions on it, and it's going to work like that for pretty much most of the functions in all the packages in the tidyverse. The first parameter is the data, and that data is in the form of a data frame. When the data frame is operated on by the filter or
whatever function, the output is a data frame. If we continue to pipe this data frame through these other functions, we can achieve some quite dramatic and useful results. So let's do that.

Because mpg is going to be the first argument in filter, I don't even have to put it in as an argument; it's already in there by the nature of having been piped in. So what do we want to filter? I want to have only rows returned where the model of the car is equal to a4. So I can just do model, double equal sign because it's not assignment, it's comparison, and then the string "a4". You can see that this is a character vector, which means that it holds strings. When I run this with Ctrl+Enter, you can see that the green line executed. The pipe forces this to go on to the next line to process as well; if you have a pipeline like this, it'll run everything in the pipeline. And now those 234 rows are 7.

Next we're going to take all this output, these seven rows right here, and put it into the arrange function. So now the arrange function is going to be receiving the output of the filter function. The input to filter was a data frame, mpg. The output is a data frame of mpg, but only where model is equal to a4, and now that is the input to arrange. What we want to do is use the arrange function to do a variety of sorts. You can think of the arrange function as: I want to sort by this column, and then by this column, and then by this column, et cetera, and it can go on ad infinitum; you can just continue to sort as much as you'd like. So let's use some numeric values so it's easy to see. I'm going to use displ and cylinder (cyl), so we're going to see these two columns be sorted, but I'm going to list displ first. The first argument, the input, is still the resulting mpg data frame after being filtered, and then I'm going to say sort by displ first, then by cylinder. So you're going to watch these two columns change, and so now you
can see that displ is sorted smallest to largest. We have two values of 1.8, and those are both fours; the 2.0s are fours; and then these continue on to sixes. But this one is sorted first, followed by this one; it just so happens that they both coincide this way. So the arrange function can do a variety of sorts, just like I did with multiple arguments here, and it can continue to go on. A single filter function call can also take multiple filter criteria. This way you can easily pipe and chain all these different commands together to reshape all of your data in just a couple of lines of code.

So now I'm going to take all of this output and pipe it into the dplyr mutate function. The mutate function will create a new variable using other existing variable data while still retaining the old column. I want to say that anything before the year 2000 is an old car, so let's do that. We're going to say that age equals, let's say, year is less than... actually, let's make this old. We're going to ask a question, is it old, and it's going to be true or false. If the year is less than 2000, then it is an old car. This new variable is going to be tacked on to the end of this data frame here. Every operation in the pipeline is executed every single time, so this output is what we're currently working with, and this is what's going to be transformed. Now when I execute this, all the way at the end here we have old, and it's true and false, answering our question. The 1999s here at the top, yep, these are old. The 2008s, these are not old. So now we have this new piece of data. We've done a calculation, and you can make this as complex as you'd like and create new data. And because these columns, in this case the single-data-type vectors that make up a data frame, are all vectorized, you can do a single operation and it works on every single, quote unquote, row. Data frames are rectangular or square shaped.
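The filter, arrange and mutate steps described above can be sketched roughly as follows. This is a minimal sketch, not the exact code from the video: it assumes the pipe is the %>% operator that dplyr re-exports, that the columns spoken as "dispo" and "cylinder" are displ and cyl in mpg, and the name a4_cars is made up for illustration.

```r
library(dplyr)  # filter(), arrange(), mutate() and the %>% pipe

# mpg ships with ggplot2, which library(tidyverse) would also attach
mpg <- ggplot2::mpg

a4_cars <- mpg %>%
  filter(model == "a4") %>%   # keep only rows whose model is exactly "a4"
  arrange(displ, cyl) %>%     # sort by displ first, then by cyl
  mutate(old = year < 2000)   # add a logical column; existing columns are kept

print(a4_cars)
```

Each step receives the data frame produced by the step before it, which is why mpg only has to be named once, at the top of the pipeline.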
So you have to have as many rows for every single vector, and every value has to be filled, even if that value is a missing value.

Now we're going to use the transmute function. This is what you might want to use if, say, you want to calculate on an existing variable, but you want to transform it into something and then get rid of the old one. Say you have a piece of data where gender is represented by M's and F's, or whatever other characters you might use, and you want to completely replace that with male, female, or whatever else you might want to put there. Then you can transmute that into those full words, and then turn that into a factor, et cetera. You can see all the different ways you could take this type of analysis.

So let's see, what can we do? Oh, let's take drive (the drv column), and instead of f let's just make this full wheel. Before we take the output of mutate and do something with it, we actually need to pipe it, so the output of mutate is going to be piped into transmute. Let's look at what we have here so far. Let's run all this: we have drive, and we want to change that to full. So let's do transmute, full wheel drive equals drive equaling f. And so this is true. So what just happened? We transmuted and we kept the new variable, but all the other variables were discarded. In this case transmute says, okay, here's a variable name, full wheel drive, and we're going to calculate that, and that's what we're going to give you. But you didn't say anything about anything else, so sorry, buddy, that's all you're going to get.

In this case I'm not going to use transmute. You know, there's a use case for every single one of these functions; somebody somewhere has found it useful. But what I'm going to do is comment this out, and I'm going to say, hey, dplyr mutate, once we get
here, what does your output look like? I want to create a new column and I want to drop drive. I want to drop this column, so I don't want this to exist anymore, but I do want my full wheel drive column. So what I can do is say, hey, mutate, comma, I'm going to give you another argument: full wheel drive, if I could spell right today, equals drive being equal to the character f. What this gives us is a new column at the end. Yes, these are all full wheel drive, but we still have the drive column.

That's where we can use something like the select function. The select function does exactly what you might think: you're going to select some columns to keep. In this case I want to keep all these columns except drive. Select can work both ways. You can either select explicitly, I want this one and this one and this one, and I only want those specific ones I mentioned, or you can pipe all of this output into select and say, hey, I want everything minus drive. When I run this, because it's part of the pipeline now, it'll run all of that, and now the drive variable is gone, but we still have full wheel drive over here.

So now we have transformed our data through all these different operations, and look how little code it takes to do this versus something like any of the C languages, or maybe something like JavaScript. This language is specifically geared towards vectorized data analysis and manipulating and changing data, and this is just a couple of functions from one package in the tidyverse using the tidy data format. This is how powerful this language can be.

I'm going to show you one more function from dplyr before we move on from this little section, and to do it justice I'm going to take the full mpg data set and pipe it into the count function. This is a really useful function just to get a brief overview of your data. You want
to see, like, hey, how many times does this one specific value occur, and throw it into a histogram or a bar chart, whatever you want, to see what it looks like. Let's do that. First of all, let's just toss it in there. We have 234 records. If we don't specify a column, it'll just say, hey, this is how many rows are in this data set. But what I want to see is the count of model. So when I run that, ah, cool, we have all the different specific models. They're characters, but in this case we're counting each unique character value, and that works. You might want to change characters to factors; it can depend, you know, but for what we have right here this works. But I don't like the way this output looks. This n, n for number, yeah, okay, cool, but why is it ordered all weirdly like this? Well, we can go into the count function and say, hey, sort this. You say sort equals TRUE, because I do want it sorted. TRUE in R can be written either as T or as TRUE in all caps; for shorthand I just do T sometimes. When I run that, we now have a sorted count, and you can easily see which things occur most frequently and which are the least, which is actually a4, which we pulled out earlier. So dplyr is incredibly useful: really concise functions that do immediately what you want and let you clean, reorganize, get a good look at and investigate your data sets.

Probably one of the most popular packages in R, if not the most popular, is the ggplot2 package. ggplot is a plotting package; gg stands for grammar of graphics, and it's a graphical package that lets you create data visualizations with your data sets. It is used all over the world by high-level academics, PhDs and professional journalists. I think the Chicago Tribune, the New York Times, a lot of investigative journalism that actually reports on data will use ggplot and R to create these complex, highly detailed and visually appealing visualizations. So ggplot by itself, without getting super complex, is a relatively simple function.
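Before the plotting part, here is a rough sketch of the select and count calls described above. It assumes the spoken "drive" column is drv in mpg; the column name full_wheel_drive simply follows the name used in the narration, and the variable names no_drv, total and model_counts are invented for this example.

```r
library(dplyr)

mpg <- ggplot2::mpg  # mpg ships with ggplot2

# Add the derived column, then keep every column except drv
no_drv <- mpg %>%
  mutate(full_wheel_drive = drv == "f") %>%
  select(-drv)

# With no column given, count() returns a single row holding the number of rows
total <- mpg %>% count()

# Count rows per model; sort = TRUE puts the most frequent models first
model_counts <- mpg %>% count(model, sort = TRUE)

print(model_counts)
```

The result of count is itself a data frame with a column n, so it can be piped onward like any other step.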
ggplot takes some data. It says ggplot, here's the data: data is equal to whatever data set. Because the first argument is already assumed to be data, you don't even need data equals, but for the sake of understanding which arguments are going to which named parameter, we're just going to leave all this stuff in here, even if it is verbose. In this case I'm saying, hey, we're going to make a ggplot with the mpg data frame, which is the full data set, the whole 234 rows we saw earlier.

A geom is a layer. You can think of ggplot right here as a blank canvas. Yeah, the mpg data set is in here, but we didn't tell it to do anything. When you use geoms, like geom_point, you add to it. geom_point is a scatter plot, a graph with just a bunch of points on it. You give it an x and a y value, those get crossed on here, and where they intersect is where the value is plotted. All we have here is that I'm mapping the aesthetic, and the aesthetic is the x and y axes of the scatter plot, and then what x and y are: these two variables from the data frame that is listed in the ggplot data.

Now, one little interesting thing to note, which is likely not going to change until some eventual future iteration of ggplot, ggplot3 if it ever happens, is that Hadley Wickham created this package before he discovered that pipe operator we used earlier, so we can't just pipe this into the next function. ggplot is that one weird little, you know, red-headed stepchild that has this plus operator to carry on to the next action. So you take each of these items and you add a plus at the end, or you can structure it differently and have a plus at the beginning, and do some other weird ways of structuring this, but you have to combine these functions with the plus to add them all together. Because what these geoms do, and you can use multiple geoms on a single chart, which is how you create these very robust visualizations, is you're basically
adding a layer of paint over your chart. You have a blank canvas and you're adding a layer of paint. You can add another geom using additional criteria, colors, graphics, variables, and keep layering on that paint until you get a really pretty, visually interesting and communicative visualization.

Let's look at what this does; that's enough talking. Data equals mpg, displ and highway (hwy) as x and y; let's run it. When I run that we get the default plot. You just said x and y; everything else is defaulted. This is a very simple ggplot, and we can see, oh wow, this is actually a very interesting distribution here. We have a downward linear trend. There's no regression line here, but you can see a downward trend, so we could plot a regression line on this. But we also have these outliers right here, and so we might ask questions like, hey, why is the highway value so high for these guys when they have high numbers here? I'm not a car guy, so I don't really know what these variables mean when they're shortened like this, but in any case, interesting data.

We can also add additional criteria to this. We're going to take the exact same plot, but we're going to add a new variable in addition to x and y: hey, the color of the points is going to be differentiated by class. Let's see what happens when we do that. So now we have class. Each of these different classes, basically a factor, a categorical variable of the data, now changes the color. The points' location has not changed, because x and y have not changed, but their color has, and now we can see groupings and where these might land. Just with these three little variable assignments we're able to get a good, deep look at our data, try to understand something about it, and ask really important and pertinent questions that we can delve deeper into.

So let's do something even cooler, one of my favorite things to do. We
have mpg, we're doing x and y, but what's this facet_wrap stuff, and what's all this? Let's run that. So, facet_wrap: we're wrapping the facet by class. A facet you can think of like a diamond. A diamond has multiple facets, different faces, and it looks at the inside of the diamond in a different way through each face. In this case we're faceting by class, so each class has its own chart. You can see the scale here for each of these graphs is still the same, 0 to 7, and the y axis is also the same, 0 to 40 or whatever, up here, but each class is on its own separate chart, all lined up next to each other, so we can see the distribution of each class. And then we're wrapping, so basically going left to right, top to bottom, and we're only wrapping for two rows. That is what these arguments do.

But here's one change: instead of facet_wrap and wrapping around, we're going to do a grid, and we're going to do a grid by the drive and cylinder variables. Ignore the tilde; it's a little bit out of the scope for this, but that's just the syntax to use here. Long story. Let's run this facet_grid. Ah, interesting. So for drive we have the values on the x-axis, and cylinder we have over here. All right, I guess this is drive and this is cylinder. In any case, we have these two facets over here. We have x and y, highway and displ; those still have not changed, but now we're adding facets on the top and on the right by these two variables. You can begin to see how you can chop, slice and dice this data and how you're visualizing it, to really see it in new lights and in new situations, and to ask those questions that pop up when something doesn't look right, or that's interesting, why is it that way? Then you can really start to pinpoint these critical questions, especially if you have business use cases and you're trying to find answers to how we as a business can make more money. Well, what do our clients buy? What are their spending habits? What do they buy more often? When do they buy
that more often? You can see how all these data points work together, and then you can try to look at that using these tools and find out when is a good time, what is a good product, when is the right time to market this, and all these types of things that can be answered by this data.

So now we're going to do another geom_point right here. Go back to classic; this is what we started with, the plain old one. Now we're going to add a second layer of paint. We were still doing only one geom, geom_point, but now we're going to do another one, geom_smooth. We're going to use the exact same variables, x and y, highway and displ, and the exact same data set, but now we're adding another layer of paint. What's that going to look like? So now we have this. What geom_smooth is doing is adding a line, and it's like a curved regression line, where it's saying, hey, how close can I be to every single point as I run through it? It's trying to be as close as possible to every single point and find the minimal amount of distance. I'm probably explaining this very badly. And then the gray area is like the margin of error for where that line is. In any case, we added a new layer of paint to this, and it deepens our understanding of what this data might look like. It adds another visual and a way of interpreting this.

Now let's just add geom_smooth without actually giving it the variables, just a plain geom_smooth, but we also changed some stuff up here. So now we have ggplot, data, and the mapping of x and y up here; geom_point just has color equals class; and then an empty geom_smooth. What does this look like? Interesting. So now we have color by class, we have geom_smooth, we have colors, we have a little bit more visual information. We can see our clusters, and we can see our geom_smooth line.

And finally, let's just do a very simple one we haven't seen yet. Let's just do a bar, and we're going to do it with the diamonds data set. This one comes with ggplot2, and we can see, interesting, we have the count
and the cut of diamonds, and Ideal is the most common, with the highest count. So you can see, just by these very simple examples, what you can actually create with these very simple functions. I mean, this is like three lines of code, and look what I'm able to get as output just with a tabular data set. You could just write a bunch of data into an Excel spreadsheet or a CSV, throw that into R and start plotting, and then you can actually see and visualize your data like this. The more layers you add, the more paint you add to this canvas, the more you might be able to glean from your data set, and the better you'll be able to understand and ask those questions that you really need to be asking. So next we're going to look a little bit more at those questions: exploratory data analysis, looking at your data, asking those questions, and just looking at it.

All right, we've taken a look at how to filter our data, mess with it, clean it, and then how to do a little bit of visualization with that data. So now we're going to do a little bit of exploratory data analysis, or EDA. This is a very common term, and it's basically most of the job. Really, you're just looking at the data set, finding clues, looking for weird abnormalities and outliers, cleaning it, shaping it, visualizing it, and constantly iterating through those workflows to find and dredge up those interesting questions.

So we're going to take the diamonds data set and do something with it today. This looks like something interesting we haven't done yet. If you remember from the beginning, the arrow sign is the assignment operator, so it's basically saying smaller is equal to diamonds. But we have a pipe here, so how does this work? This is one of those weird little quirks of R that people sometimes get a little tripped up about: they think that diamonds is being assigned here and then something's happening with it, or, what happens here? So the way I like to
read this, and I read it out loud like this so that you can understand how it actually works: the variable smaller receives the value of the output of diamonds being filtered where carat is less than three. So those records in diamonds where carat is less than three are assigned to smaller, the variable. Now smaller is a data frame, a subset of diamonds where only carat less than three is included. Let's run this and see what it looks like. Ah, so there's no output, because we actually assigned it. Up here in the environment pane, if you're using RStudio, and I prefer to use RStudio, we have almost 54,000 observations of 10 variables. That means we have almost 54,000 rows and 10 columns. If we just do a little preview here by clicking this button, we can see some data types and some sample values. We could click on a little button right here to preview it and look at it; I'm not going to do that, just because I don't feel a need to. But that's still a very good size data set, so that's a lot of diamonds that have less than three carats. Now we have this in the environment, so it's declared, it exists, and we can use it in other things. That variable now holds data.

In this ggplot we're going to make here, data is equal to smaller, the new data frame we made that has been filtered, and the mapping of the aesthetic is x equals carat. Okay, so we're going to do a plot where the x-axis is equal to the variable carat, but there's no y-axis. Why is that? Because we're doing a histogram. A histogram is not like a bar chart; a histogram is a distribution. We're going to do a histogram where the bin width is 0.1. So what does this mean? A bin width is just saying, hey, if you fall in between 0.1 and 0.2, or 0.2 and 0.3, you're going to fall into this bin, and it's going to increment by 0.1. This is a good way of catching things if you have a lot of small values. So if you have less than 3
carats and you just use decimals like 2.8 or 1.4, like we saw maybe earlier, then this is going to be a good way of visualizing the distribution. So let's just look at what this looks like. We run that. Ah, interesting. We have a lot of values in this distribution that have a very small carat, and very few of them actually have three carats. So this is interesting. Now let's change the bin width. This is also just a little bit of an aside for statistics: it is very good practice to look at a variety of bin widths for your histograms when you're plotting these, just to see how the data looks under different bin widths. So let's do 0.5 and run it. Oh, interesting. You see how it gets all big and blocky like this and gives you less detail? Okay, that's a little too much, so let's bring it down to three and take a look at that. Oh, that's interesting. And, you know, we can begin to look at this and see it in a new light.

So we're going to go back to mpg here. We're going to say data is mpg. Ooh, new geom: instead of geom_histogram we're going to use geom_boxplot. Box plots are an incredibly useful chart; they tell you so much information. We're going to map the aesthetic of x equals, okay, what's all this: reorder of class, highway, and FUN equals median. Interesting, I wonder what all this does. And then y equals highway. Okay, so let's just run this. Ah, interesting. So we have a box plot here, and the box plot is a really interesting chart to use. All these dots are outliers that don't fit in. By default, this line in the middle of each box is the median, and these lines on the tops and the bottoms of the boxes are the interquartile range. So we have the middle 50 percent of your data set, or in this case of the variable, and the median value of that, and then the whiskers here are
actually the dregs of the data set, and then all these are the outliers. You can see the distribution as you go up here: the mileage on the highway goes up over here as we get up into pickup, suv, minivan. So based on these categorical variables we can see, oh, this goes up, and this is what reorder is doing. We're reordering class by highway and, yeah, the median. So in this case it's actually reordering all of these so we can see the trend upwards, and we can just see, okay, which one is the highest. This is how it reorders, and we can get a lot of really good, juicy information out of this box plot.

Now we're going to do something else interesting. I'm going to say, oh, diamonds, the diamonds data set, pipe that into the count function. You saw me earlier where I wrote dplyr, colon, colon. If you've ever used Python code, or you've seen Python code where it says from this package import this function, it's like that. You don't have to import the entire package. If I have a very simple script and I don't want to import an entire R package, the entire dplyr package with all of its fifty, hundred, whatever plus functions, and I only need one function, count or filter or whatever, I can just use the double colon. It'll say, hey, from that package grab this one function, and that way I can grab what I need and I don't have to import the entire thing. If you declared it like I did earlier, with library(tidyverse), tidyverse will load dplyr, but it's basically like saying library(dplyr) and then declaring that package. I spelled it wrong, but anyways.

So, dplyr count. We're going to pipe diamonds into count, and we're going to count by color and then cut. Ah, two variables this time. What does that look like by itself? I can just select this, without the other pipe here, and run that code selectively. Oh, okay, so we get color by cut, and then we can see, okay, oh, you see what's doing
here. So each color gets its own representation for each cut, so you're basically multiplying this across: all the unique values for cut times all the unique values for color, and then the count of all of those. But, oh, I don't like that, I want to add comma sort equals TRUE (sort = TRUE), and let's run that. There we go. So we can see that the color G and the cut Ideal is the most frequently occurring combination in the diamonds data set. That's really cool.

All right, so let's take all this output right here. The sort's not going to matter when we throw it into ggplot, so we're just going to take all of this and we're going to say map the aesthetic x to color, so these categorical characters here. Ooh, see, it says ord; I wonder if that's a factor or not. So let's see: x is going to be equal to color, y is going to be equal to cut, and then we're going to also add a geom_tile, what does that look like, where the aesthetic is fill equals n. Okay, interesting, you might be able to infer what this is. We'll see. I spelled diamonds wrong somewhere, didn't I? Oh, I just selected that wrong. Let's just do this right there. Ah, interesting, it's basically a heat map. So we could say, okay, we don't need to have a number scale like for a scatter plot, because it's two columns of categorical variables. These aren't sorted by a number; these are colors, so it's not like A, B, C, D, E, it's a color, so the order here doesn't really matter. Ideal could be sorted, but in the end it doesn't matter. And the count is just the intensity of the color on this heat map. So this way we can actually see, ah, interesting, all these different colors. We can see that right here this is the lightest, so G and Ideal, ah, that's the highest value. The darker values, J and Fair, these are the least likely to occur. So this is a very interesting geom_tile.

So all these examples have used these curated data sets that come with these R
packages. diamonds comes from the ggplot2 package; mpg, I don't remember exactly which package that comes from, but it might be ggplot2 as well. But what do you care about? You care about your data. If you're doing something with R, you're likely trying to do something with your data: I want to look at my own data, make my own analysis, I want to do these cool plots but with my stuff, not this built-in stuff, because that's not interesting. For one of my first projects, I have a side business on Etsy, and I just exported all my Etsy sales data into a CSV, threw it into R, and started making some plots with it. You can do the exact same type of thing when you find whatever data you want. You can just collect data on yourself if you wanted to: wear a Fitbit, get your Fitbit data, and then put it into some plots. So that's what we're going to look into next: how to import your own data into R.

So R is geared for data science, RStudio is an IDE geared for R, and that means it makes your life as a data analyst, or somebody working with data, all the easier. So let's say I have data. I need to get data into RStudio so I can run R code on it, and I need to analyze this, visualize this. How do I get the data into R? It's really, really easy if you just have a CSV. A lot of your data comes in as a CSV; this is a very simple data format, very easy to parse and use. But if you have JSON, if you have a SQL database connection, if you have Feather files, R can handle all of these things. It's mind-boggling how many different formats and kinds of data R has the capability to deal with, and if it doesn't have that capability, people develop packages to make it possible. But we're just going to deal with a simple CSV, something super simple. So I just took one of the data sets, I think it might have been diamonds, and I just saved it as a CSV file, so I have it right here in my files. So you can
see, in my files viewer over here, there's data.csv. It's 2.3 megabytes of information. I can click on this file right here, and, interesting, I can view it and I can import it. Let's view it first: what am I importing? It's large, so it's going to be a slow preview. Okay, let's see what it is. Okay, it's, oh, it is the diamonds data set, and it's just text. It's a lot of CSV values, comma-separated values, almost 54,000 of them. Great, that's a lot of data. So now I've viewed it and I know I want this in here. How do I get it in here? I can click import, and this works with Excel; there are Excel packages.

Side note: Excel's .xlsx files, with that x at the end, and basically all the Microsoft Office file formats that end with an x, are now basically considered XML documents, which is why pandoc can actually convert to Microsoft Word and PowerPoint. It makes these documents a lot easier to parse, which means you can actually get your data out of a proprietary format in Excel into other things like R.

So in any case, importing this way we can read the CSV. Just clicking import brings me to this window. It already has the file path, it gives me a preview, I can do some sorting in here, I can change the data types on these if I wanted to, I can have some other options, and it even previews the code. It wrote the code for me to get my data in here, so this is awesome. I'm just going to click import. I don't even need to change anything, only because this is a curated data set, but if it was your data set and you wanted to do a little bit of cursory analysis, a little bit of looking at it in this way, you could. But I'm just going to import it. Okay, so now it's in here, it's in the viewer, and there's some stuff spit out to the console. Interesting: it has all these cols, the columns, carat is equal to a double, then a character, character, double, double, so basically these are the data types of
those columns. And then data right here, ooh, so all that data that we saw visualized in here is all here now. So now if I go back into our actual file, I can say, hey, data, and then the IntelliSense tells me, ah yes, that data object does exist now; there's a little table icon in here that lets me know this is a data set. So I can click in there, run that, and then we get our output: this is the diamonds data set. So that's a very quick way of getting a tabular data set into R. It's already in here, just like everything else we were dealing with. So from this point on, everything we've already done you can now do on your data. You've got it into R, you know how to filter it, you know how to sort it, you know how to plot it, and if you don't, it's just a couple of quick Google searches. And ggplot is super easy to get started with: ggplot, put your data in there, the aesthetic mapping equals aes, the aesthetics are your x and y, make a geom_point, and you're off to the races. That's all you need to do just to get started.

So what are some other ways of getting data in here? Well, if you just have a very small data set, or you want it to look familiar, somewhat normal and organized to you, there is something called tribbles. Tribbles are part of a package in the tidyverse. A tribble is a row-wise tibble; a tibble is that without the r here. A tibble is like a data frame but with a few extra features and spice, I guess, and a tribble is just a row-wise tibble. So you can see our data sets when they're displaying like this: here are the column headers, here are the variables down here. But when we assigned to a data frame earlier, I think it was right here, yeah, you saw that it was column headers on the left and then the actual values, the rows, going horizontally this way. It's basically like transposing this 90 degrees over, and it's
a much easier way of actually reading your data. So in this way I can just say here's my header for pregnant, and each row has that entire observation. So you have a variable vertically and a complete observation horizontally. So if pregnant is yes, you're obviously not a male, because males don't get pregnant, and female is 10. And if pregnant is no, well, all 20 males would be no, and then 12 females were no. So in this case you could easily just lay this out, say, hey, make a tribble out of all of this, and then assign that tribble to the variable preg. So we can just run that. Cool, we now have that: two observations of three variables. The three variables are pregnant, male, female; the two observations are yes and no, and then the counts for those results. I can run preg, and there's the output, and it is assigned exactly how it looked here. So it looks familiar, it looks easy to assign, and it's all just formatted this way. But it is a lot of extra typing to do, and we also have these little tilde characters for each column header. Everything's basically a CSV, comma-separated values, except the last one. It could be a lot to write, which is why we're going to come to some of the cool packages I'm going to showcase here in a minute, the party winners, I like to call them, because there's just some really cool functionality where you don't have to do anything; they're just amazing tools by themselves, and one of them makes tribbles a hell of a lot easier to do. So let's take a look at those next.

So I'm going to show you a couple of packages today. These are ones I just want to call party winners. They're just amazing packages for what they do, and these are just three simple packages in the entire ecosystem of R. There are so many packages for so many different things. You have PhDs writing packages to do highly complex analysis in their narrowly scoped fields, you have people writing general-purpose tools, and just the general tooling of making
the R language better, not necessarily the actual analysis of data. It is just an amazing ecosystem of developers creating a robust, super flexible, and powerful language. I even made a package myself, dealing with conversion of different characters through Unicode for linguistics, and it's on my GitHub if you care. But in any case, the three packages I'm going to cover are datapasta; this one I can't pronounce, it's French, I'm not even going to bother, I'll just butcher it [esquisse]; and then rayshader.

So I mentioned earlier that tribbles can be a lot to type, and they can be. So there's this package called datapasta that makes copy-pasta, if you know that term, pasting data, super easy to get into R. If you don't want to have to import a file, you just want to copy and paste something and get it done, then datapasta is for you. When you load this library or install it, it shows up in the RStudio add-ins. You can see datapasta right here, and all this stuff right there. Let's look at what this does. So I have the data set open. I don't want this view, I actually want the file view, so I open up the diamonds data set CSV that I have here. View file, yes, it's large, I know. Open that up. Okay, here's our CSV, just a bunch of raw text values. I'm just going to select all, copy it, and now I'm not just going to paste in here, I'm going to go to Addins, datapasta, ah, tribble, Paste as tribble. It will take a little while to load, because that is a lot of data to spit out into an actual R document, but you can see that what it did right here is it made all of my column headers for me, everything's separated by commas, it actually has some syntax highlighting, and the tribble function call is there. All I have to do up here is say, hey, here's my data variable, and I'm going to assign it all of this tribble information. That's all I have to do. Now I have my data in a variable and I am good to go. And one really cool thing about datapasta is that
you can also do this with HTML tables, I think. You go to a web page, you see an HTML table, you copy that, and you can paste it in as a tribble. It'll actually recognize that, and it just works off of your clipboard.

So next I'm going to cover this French package. Again, I don't want to butcher the name, but this is amazing. I mentioned earlier that you just do a very simple geom_point scatter plot and you're off to the races with ggplot. I remember walking into R and thinking, whoa, this is some interesting syntax here; how do I get good at ggplot? That's the whole point, isn't it? I want to make really cool-looking graphics that tell a good data story, but I have to do a lot of googling just to figure out how to get something simple yet elegant created, and ain't nobody got time for that. So what's a simple way to just create something good, and then maybe analyze the code after the fact, when you have some time, to see how it's doing what it's doing? Well, that's where this package is really, really useful. It's also in the add-ins bar, like datapasta, but right under here we have ggplot2 builder. So when I go into here, there's a bunch of built-in data sets. First of all, I can choose a data frame I already have in my environment, so I can create a variable holding data and then make something with it. Or, if you don't have anything in the environment, if you have loaded data sets like diamonds, mpg, or iris, sometimes those will just appear in here and you can play around with those built-in curated sets. But in any case, we have data, so I'm going to pick data. Okay, so we have 10 variables. I can change some things about it; again, it's curated, so I don't need to, but I could change things to characters, factors, dates, etc. I could coerce those different things, but we're just going to go in and mess with it. All right, so we got our variables in here, we have, what's the, oh, interesting,
plots. Okay, some empty boxes; let's start playing around. So I have cut under x, color under y. That doesn't do anything interesting. Okay, cool, let's drag that out of there. Let's put price in y. Interesting. Oh wow, that's interesting, and it just immediately changes the plot type. So let's take that one out. I think these ones are the categorical and these the numerical, so this will give us more data. Now we've got a point plot. Oh, that's a very interesting distribution here. So we can start drawing some conclusions from this: most of these fall on these specific points and then scatter out away from those points, and these are more of the outliers. But it also does have more of a linear trend going positively up here in price as the carat increases. Interesting. So this is a way you could just play around with your variables really quickly, see what it's going to output, and say, oh wait, I want this one, I like this plot, I want to actually put this in my report and write up some analysis about it.

But we also have all these options down here. Labels and title: I can give this plot a title, 'here's my title'; I can give it a subtitle, 'a view of diamonds'; caption, 'hello caption', I don't know, I'm going to run out of stuff here; x is carat, y is price. Cool. Plot options: oh, interesting, add a smooth line, yeah, why not. And then let's do a color; I don't like that color, so I'm just going to do this one. And theme, minimal, okay, that's fine. Legend: I like my legends on top. Go to data: yeah, all this looks good, I don't want to filter out anything at this point. Export: okay, cool, I could save this as a picture, I could save it to a PowerPoint, or I could just grab the code itself. If I have my cursor right here, I can just insert this directly into my script. Here's all the code; insert it, and now it's in there. It's already done; that whole plot is now written for me. I don't have to do anything other than click some buttons and drag some fields. I can copy it to the clipboard and paste it, but
in this case what you can do is change this stuff up. So now I want to change the plot: I want to change the color to this, I want to remove the size of that, I don't want a smooth line anymore, I want to change the depth to that, I don't know, some random stuff. And then, after it finishes loading, I can just insert this into the script. So I clicked a little bit too fast, because this is a very large data set to calculate in this manner. But when you click insert this code into your script, you can basically create a bunch of different plots, mess around with your data points a little bit, switch some things around, click insert into your script, and then you just have these plots and all their minute changes dropped in here for you to mess around with later. So this way you can quickly build something, get it done, put it in there, analyze it later, figure out how it works, but just get something done and ready.

This next one is not something that I deal with; this is way beyond the scope of any sort of data sets that I deal with. I just think it is absolutely the coolest thing to take a couple of lines of code and do something really amazing. The talk that the creator of this package gave on it is simply incredible, with the work that he put into this. This package is rayshader. If you know anything about ray shading and game creation, game development, or low-level systems programming in C languages, you might have an idea of what this might do. But if I run this code, and I'm going to cut some clips because this might take a little while to render, there's some really cool output from using the rayshader package. So the output of these couple of lines of code here is this rasterized image of some actual physical terrain, using a matrix observation and a file that is actually imported from the web. So we just download a file into a temporary file, rasterize it, put some stuff, I don't
even know what half this stuff does. It just makes a really cool map, and it actually deals with terrain. I thought this was really cool, and the next one is really going to blow your socks off. So the output of all that second part of the code is this rasterized image of a 3D data visualization. We can actually see the x and y axes, but we also have the third dimension of height, and you can see from the legend over here that the height has to do with the count, and you can see the different colors as well. You can create these really cool, interactive plots using the rayshader package. When I found this package and saw the talks on it, I thought this was just the coolest thing ever. We also grabbed some snapshots from when the code was run, so you can see the types of visualizations that are produced. And again, with the right know-how and minimal code, this is not a lot of lines of code to produce these very rich data visualizations that tell really interesting stories about the data and the way it was collected, interpreted, and cleaned.

So I hope by now you can see the power and just the sheer amount of awesomeness that R is. I could go over how you add items together, a plus operator, a minus operator, how to do boolean comparison, this and this, this or that. A lot of those are just the fundamental programming concepts, ubiquitous across all languages for the most part, and that's not really where R shines. R shines in actually doing data analysis: get your data in, do something with it, clean it, shape it, plot it, write it up, and talk about it. This is where R is just a magnificent language. It's a great intermediary; you can interface with a variety of languages. Is R too slow on this function? You can write that function in C++, with its interface to R, have that function execute as a compiled binary really fast, return the result to R, and let R continue on its way while
taking advantage of the speed of a lower-level systems language. You can use code chunks in these R Markdown documents that are SQL code, Bash, R, Stan, Python, Julia, whatever you want. There are dozens of supported language engines in R Markdown, and then R Markdown itself can render to a variety of formats through pandoc. You can have custom LaTeX templates, you can have basically custom anything: your own workflow, your package development, your actual document templates. You can create whatever you want in R, and it's not even that hard to do. I've done a little bit of development in R, actually more on the development side than the analytics side, and I've done both sides, and it is a great language. It does have its weird quirks. It's an old language, it has a lot of baggage from its past, but if you stick with the tidyverse, and you understand where it falls short in some areas and where to pick up the slack with another language or tool, right tool for the right job, and put things together in the right way, it can be an absolutely amazing tool in your toolbox.

You don't have to be a PhD data scientist working for a large Fortune 500 company earning multiple six figures to actually use R, use it well, and get some value out of it. Honestly, when I started using it I was just a lowly analyst in my job. I went from Excel to using R to make some nice-looking documents and give some basic summary statistics. Before I really got started, that's all I was doing, and you know what, it was great. It produced really great-looking documents, I could automate a lot of my workflow with R scripts, and it was just a great tool to get into data engineering, data science, and better work as an analyst. So hopefully this talk has inspired you to look into R a little bit more. Even if you use it as just an overblown and super powerful calculator, it is just a really great language. It's my favorite language, and there's always more to learn with it, and
hopefully this was an interesting talk, you stuck around and made it this far, and, yeah, I hope you will let me know how you like the language and what you've done with it. I would always love to hear what people are doing with R. On that note, thank you, a big thank you to Brad Traversy for letting me put this video up on his channel and giving me this opportunity to gush about my favorite programming language and just the whole sphere of academia, data science, and dealing with this type of programming. There's always more to learn, but it is my favorite language, and I'm happy that you guys got to listen to me gush about it for a little while. So I will let you all go. Thank you so much for watching, and I'll catch you in the next one.
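As a recap, the steps walked through in this part of the talk (a package::function call, the color-by-cut heat map, importing a CSV, and laying out a tribble) might look roughly like this in R. This is a sketch rather than the speaker's exact code. It assumes the ggplot2, dplyr, readr and tibble packages are installed; csv_path is a hypothetical temporary file standing in for his data.csv; and the NA for pregnant males is an assumption about what he typed.

```r
# Sketch only: assumes ggplot2, dplyr, readr and tibble are installed.
library(ggplot2)  # attached for plotting; also provides the diamonds data set

# Count every color/cut combination without attaching dplyr,
# using the package::function form described in the talk.
counts <- dplyr::count(diamonds, color, cut, sort = TRUE)

# Heat map: one tile per color/cut pair, filled by the count column `n`.
heat <- ggplot(counts, aes(x = color, y = cut)) +
  geom_tile(aes(fill = n))

# Round-trip diamonds through a CSV to mimic importing your own data.
# `csv_path` is a hypothetical temporary file, not the speaker's data.csv.
csv_path <- tempfile(fileext = ".csv")
readr::write_csv(diamonds, csv_path)
data <- readr::read_csv(csv_path)

# A tribble lays a small table out row by row, headers marked with tildes.
# The NA for pregnant males is an assumption; the counts come from the talk.
preg <- tibble::tribble(
  ~pregnant, ~male, ~female,
  "yes",     NA,    10,
  "no",      20,    12
)
```

Printing heat in an interactive session draws the plot, and printing data or preg shows the tibbles. The namespace-qualified calls keep the script from attaching whole packages it only needs one function from, which is the point made about the double colon.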
Info
Channel: Traversy Media
Views: 47,184
Keywords: r programming, rlang, r#, r-lang, programming, bryan jenks
Id: ZYdXI1GteDE
Length: 93min 23sec (5603 seconds)
Published: Wed Aug 26 2020