R Programming Tutorial - Learn the Basics of Statistical Computing

Welcome to R: An Introduction. I'm Barton Poulson, and my goal in this course is to introduce you to R, which is arguably the language of data science. And just so you don't think I'm making things up off the top of my head, I have some actual data: a ranking from a survey of data-mining experts on the software they use most often in their work. Take a look at the top: R is first. In fact, its share is 50% higher than Python's, which is the other major tool in data science. Both of them are important, but you can see why I'm personally fond of R, and why it's the one I want to start with in introducing you to data science.

There are a few reasons R is especially important. Number one, it's free and open source, whereas other software packages can cost thousands of dollars per year. Also, R is optimized for vector operations, which means you can work through an entire row, or an entire table of data, without having to explicitly write for loops; if you've ever had to do that, you know it's a pain, so this is a nice thing. R also has an amazing community behind it, where you can find supportive people, examples of whatever it is you need to do, and new developments all the time. Plus, R has over 9,000 contributed, or third-party, packages available, which make it possible to do basically anything. Or, if you want to put it in the words of Yoda: "This is R. There is no if, only how." In this case, I'm quoting R user Simon Blomberg.

So, very briefly, in sum, here's why I want to introduce you to R: because R is the language of data science, because it's free and open source, and because the free packages you can download and install make it possible to do nearly anything when you're working with data. I'm really glad you're here, and that I'll have this chance to show you how you can use R to do your own work with data in a more productive, more interesting, and more effective way. Thanks for joining me.

The first thing we need to do for R: An Introduction is to get set up. More specifically, we need to talk about installing R. The way you do this is to download it: go to the home page of the R Project for Statistical Computing, at r-project.org. When you get there, click the link in the first paragraph that says "download R," and it will bring you to a page listing all the places you can download it from. I find the easiest is the top entry, labeled "cloud," because it automatically redirects you to whichever of the mirrors below is best for your location. When you click on that, you'll end up at the Comprehensive R Archive Network, or CRAN, which we'll see again in this course. Here you need to click on your operating system. If you're on a Mac, the version you want is the .pkg file, a zipped application installation file: click on it, download it, and follow the standard installation directions. If you're on a Windows PC, you probably want the one labeled "base": again, click on it, download it, and go through the standard installation procedure.
And if you're on a Linux computer, you're probably already familiar with what you need to do, so I'm not going to run through that.

Now, before we look at what R is actually like when you open it, there's one other thing you need to do, and that is to get the files we'll be using in this course. On the page where you found this video, there's a link that says "download files." If you click on that, you'll download a zipped folder called R01_intro_files. Download it, unzip it, and, if you want, put it on your desktop. When you open it, you'll see a single folder, and when you click on that, it opens up a collection of scripts. The .R extension is for an R source, or script, file. There's also a folder with a few data files that we'll be using in one of these videos. If you simply double-click the first file, it will open in R, and let me show you what that looks like.

When you open the R application, you'll probably get a setup of windows that looks like this. On the left is the source window, or script window, where you actually do your programming. On the right is the console window, which shows you the output; right now it's full of boilerplate text. Coming back over to the left: any line that begins with a pound sign, also called a hashtag or an octothorpe (#), is a commented line that isn't run. The other lines are code that can be run. By the way, you may notice a red warning just popped up on the right side; it's just telling us about something that has to do with changes in R, and it doesn't affect us.

What I'm going to do right here is put the cursor in this line and hit Command (or Control) and Enter, which runs that line. You can see over on the right that it has run, and what I've done is make a collection of data sets available to the program. Now I'm going to pick one of those data sets: the iris data set, which is very well known. It holds measurements of three species of the iris flower. We'll run head to see the first six lines, and there we have the sepal length, sepal width, petal length, and petal width; in this case, the rows shown are all setosa. If you want a summary of the variables, some quick descriptive statistics, we can run the next line, and now I get the quartiles and the mean, as well as the frequencies of the three different species of iris. On the other hand, it's really nice to see things visually, so I'm going to run the basic plot command on the entire data set. It opens a small window, which I'll make bigger, and it's a matrix of scatterplots of the measurements for the three kinds of irises, including a funny one showing the three different categories. I'll close that window. And that is basically what R looks like, and how R works, in its simplest possible version.

Now, before we leave, I'm going to take a moment to clean up the application and the memory: I'm going to detach, or remove, the datasets package that I added. I already closed the plot, so I don't need to do that one separately. And to clear the console, I can come up to Edit and down to Clear Console, and that cleans it out.
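For reference, here's a minimal sketch of that first session as typed commands; the exact lines in the course file may differ slightly:

```r
library(datasets)   # make the built-in data sets available
head(iris)          # first six rows of the iris data
summary(iris)       # quartiles, means, and species counts
plot(iris)          # scatterplot matrix for the whole data frame

detach("package:datasets", unload = TRUE)   # clean up when finished
```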
That's a very quick run-through of what R looks like in its native environment. In the next movie, I'm going to show you another application we can install, called RStudio, that sits on top of R and makes interacting with it a lot easier, a lot more organized, and really a lot more fun to work with.

The next step in R: An Introduction, and in setting up, is something called RStudio. This is a piece of software that you download in addition to R, which you've already installed, and its purpose is really simple: it makes working with R easier. There are a few ways it does this. Number one, it has consistent commands. Funnily enough, the different operating systems have slightly different keyboard commands for the same operations in R; RStudio fixes that and makes them the same whether you're on Mac, Windows, or Linux. Also, there's a unified interface: instead of having two, three, or seventeen windows open, you have one window with the information organized. It also makes it really easy to navigate with the keyboard and to manage the information you have in R. Let me show you how to do this.

First, we have to install it. You'll need to go to RStudio's website, at rstudio.com. From there, click on "Download RStudio," which brings you to this page, or something like it, and choose the desktop version. When you get there, you'll want to download the free, community version, as opposed to the $1,000-a-year version, so click here on the left. Then you'll come to the list of installers for supported platforms, down here on the left, and this is where you choose your operating system: the top one if you have Windows, the next one if you have a Mac, and then lots of different versions of Linux. Whichever one you need, click on it, download it, go through the standard installation process, and then open it up. Then let me show you what it's like working in RStudio. To do this, open up this file, and we'll see what it's like in RStudio.

When you open RStudio, you get one window that has several different panes in it. At the top is the script, or source, pane, and this is where you do your actual programming. You'll see it looks really similar to what we had when I opened the plain R application; the color is a little different, but that's something you can change in the preferences or options. The console is down at the bottom, and that's where you get the text output. Over here is the environment pane, which shows the variables you've saved, if you're using any, and then plots and other information show up in the bottom right. You have the option of rearranging things and changing what's there as much as you want; RStudio is a flexible environment, and you can resize things by simply dragging the dividers between the areas.

So let me show you a quick example, using the exact same code as in my previous example, so you can see how it works in RStudio as opposed to the plain R app we used the first time. First, I'm going to load some data using the datasets package. I do a Command (or Control) Enter to run that line, and you can see right here that it has run the command.
Then I want a quick summary of the data: head(iris) shows the first six lines, and there it is down in the console; I can make that a little bigger if I want. Then I can do a summary by coming back here and hitting Command (or Control) Enter. I'll use a keyboard command to make the console bigger, and then we can see all of it: the same basic descriptive statistics and the same frequencies as before. I'll put it back how it was, bring this pane down a little, and now we can do the plot. This time, you see it shows up in the pane here on the side, which is nice; it's not a standalone window. Let me make that one bigger; it takes a moment to adjust. And there we have the same information we had in the R app, but here it's more organized, in a cohesive environment. You can see that I'm using keyboard shortcuts to move around, and that makes life really easy for dealing with the information I have in R. I'll do the same cleanup: I'll detach the package I loaded, here's a little command to clear the plots, and then in RStudio I can run a funny little command that does the same thing as hitting Ctrl+L to clear the console. And that is a quick run-through of how you can do some very basic coding in RStudio, which, again, makes working with R more organized, more efficient, and easier overall.

In our very basic introduction to R and setting up, there's one more thing I want to mention that makes working with R really amazing, and that's the packages you can download and install. Basically, you can think of them as giving you superpowers for your analysis, because you can do nearly anything with the packages that are available. Specifically, packages are bundles of code, more software, that add new functions to R and make it possible to do new things.

There are two general categories of packages. First, there are base packages, which are installed with R, so they're already there, but they're not loaded by default; that way R doesn't use as much memory as it otherwise might. But more significant are the contributed, or third-party, packages. These need to be downloaded, installed, and then loaded separately, and when you get them, they make things extraordinary. So you may ask yourself where to get these marvelous packages that make things so super-duper. Well, you have a few choices. Number one, you can go to CRAN, the Comprehensive R Archive Network; that's an official R site that lists packages along with their official documentation. Two, you can go to a site called Crantastic, which really is just a way of listing these things; when you click on the links, it redirects you back to CRAN. And third, you can also get R packages from GitHub, which is an entirely different process; if you're familiar with GitHub, it's not a big deal, and otherwise you don't usually need to deal with it.

Let's start with the first one, the Comprehensive R Archive Network, or CRAN. We saw this previously when we were downloading R. This time we're going to cran.r-project.org, and we're specifically looking for the CRAN packages: click on "Packages," right here on the left.
When you open that, you'll have an interesting option, and that's to go to Task Views, which breaks the packages down by topic. We have packages that deal with Bayesian inference, packages that deal with chemometrics and computational physics, and so on and so forth. If you click on any one of those, it gives you a short description of the available packages and what they're designed to do. Another place to get packages, as I said, is Crantastic, at crantastic.org. This site lists the most recently updated and most popular packages, and it's a nice way of getting a sense of what people use most frequently, although it does redirect you back to CRAN for the actual downloading. And then finally, at github.com, if you go to /trending/r, you'll see the trending, or most frequently downloaded, R packages on GitHub.

Now, regardless of how you get them, let me show you the ones I use most often; I find these make working with R a lot more effective and a lot easier. They have kind of cryptic names. The first is dplyr, which is for manipulating data frames. Then there's tidyr, for cleaning up information; stringr, for working with strings, or text information; lubridate, for manipulating date information; httr, for working with website data; ggvis, where the gg stands for "grammar of graphics," which is for interactive visualizations; ggplot2, probably the most common package for creating graphics, or data visualizations, in R; shiny, which lets you create interactive applications you can install on websites; rio, for "R input/output," which is for importing and exporting data; and rmarkdown, which lets you create what are called interactive notebooks, or rich documents, for sharing your information. There are others, but there's one in particular I find useful. I call it the one package to load them all, and it's pacman, which, not surprisingly, stands for "package manager." I'll demonstrate all of these in another course we have here, but let me show you very quickly how to get them working. Just try it in R: open up this file from the course files, and let me show you what it looks like.

What we have here in RStudio is the file for this particular video. I said that I use pacman; if you don't have it installed already, run this one installation line, which uses the standard installation command in R, and pacman will show up here under Packages. I already have it installed, so you can see it right there, but it's not currently loaded. Installing means making a package available on your hard drive; loading means actually making it accessible to your current routines. So I need to load, or import, it, and I can do that in one of two ways. I can use require(), which gives a confirmation message, like this; you see it prints that little sentence there. Or I can use library(), which simply loads it without saying anything. You can see, by the way, that it's now checked off, so we know it's there.

Now, if you have pacman installed, even if it's not loaded, you can use pacman to install and load other packages. What I actually do, because I have pacman installed, is go straight to this line: pacman and then two colons, which says, use this command even though the package isn't loaded. Then I load an entire collection, all the things I showed you, starting with pacman itself. So now I'll run this command. What's nice about pacman is that if you don't have a package, it will actually install it, make it available, and load it, and I've got to tell you, this is a much easier way to do it than the standard R routine. For base packages, the ones that come with R natively, like the datasets package, you still want to load and unload them the standard way.

So now I've got all of that available, and I can do the work I want to do. I'm not going to do it right now, because I'll show it to you in future videos, but I now have a whole collection of packages that will give me a lot more functionality and make my work more effective. I'll finish by simply unloading what I have here. If you want, pacman can unload specific packages, but the easiest way is to do p_unload(all), which unloads all of the add-on, or contributed third-party, packages; you can see I've got the full list here of what was unloaded. However, for the base packages, like datasets, you need to use the standard R command detach(), which I'll use right here. And then I'll clear my console. That's a very quick run-through of how packages can be found online, installed into R, and loaded to make more functionality available to your code. I'll demonstrate how they work in basically every video from here on out, so you'll see how to exploit them to make your work a lot faster and a lot easier.
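Here's a hedged sketch of that whole package workflow in one place. The package names are the ones listed above, though their availability on CRAN may have changed since the video was made:

```r
install.packages("pacman")   # one-time install from CRAN

# pacman:: lets you call p_load without loading pacman first;
# p_load installs any missing packages and then loads them all.
pacman::p_load(pacman, dplyr, tidyr, stringr, lubridate, httr,
               ggvis, ggplot2, shiny, rio, rmarkdown)

# Clean up: p_unload for contributed packages, detach for base ones
p_unload(all)
detach("package:datasets", unload = TRUE)   # only if datasets is loaded
```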
Probably the best place to start when you're working with any statistics program is basic graphics, so you can get a quick visual impression of what you're dealing with. The simplest such command in R is the default plot command, also known as basic X-Y plotting, for the x- and y-axes of a graph. What's neat about R's plot command is that it adapts to data types and to the number of variables you're dealing with. It's going to be a lot easier for me to simply show you how this works, so let's try it in R: just open up the script file, and we'll see how we can do some basic visualizations.

The first thing we'll do is load some data from the datasets package that comes with R; we simply do library(datasets), and that loads it up. We're going to use the iris data, which I've shown you before and which you'll get to see many more times. Let's look at the first few lines; I'll zoom in on that. What this is, is measurements of the sepal and petal length and width for three species of irises. It's a very famous data set, about 100 years old, and it's a great way of getting a quick feel for what we can do in R. I'll come back to the full window.

First, let's get a little information about the plot command. To get help on something in R, just type a question mark and the thing you want help on. Since we're in RStudio, this opens up right here in the help pane, and you see we've got the whole set of information: all the parameters, additional links you can click on, and examples at the bottom.
I'm going to come over here and use the command on a categorical variable first, since that's the most basic kind of data we have. Species, which has three different species, is what I want to use right here. So I do plot, and then in the parentheses you put what you want to plot. What I'm saying here is that it's in the data set iris (that's our data frame, actually), and then the dollar sign says: use this variable that's in that data. So that's how you specify the whole thing. We get an extremely simple three-bar chart; I'll zoom in on it. What it tells you is that we have three species of iris, setosa, versicolor, and virginica, and that we have 50 of each. It's nice to know that we have balanced groups, and that we have three groups, because that might affect some of the analyses you do. It's an extremely quick and easy way to begin looking at the data. I'll zoom back out.

Now let's look at a quantitative variable, one that's at an interval or ratio level of measurement. For this one I'll do petal length, and you see I do the same thing: plot, then iris, then Petal.Length. Please note that I'm not telling R that this is now a quantitative variable; it's able to figure that out by itself. This one's a little bit funny, because it's a scatterplot (I'll zoom in on it) where the x-axis is just the index number, or row number, in the data set, so that axis really isn't helpful. It's the variable on the y-axis, the petal length, whose distribution you get to see. On the other hand, you know we have 50 of each species: first the setosas, then the versicolors, then the virginicas. So you can already see that there are group differences on this measurement.

Next, I'm going to ask for a specific kind of plot, to break things down more explicitly between the categories. That is, I'm going to put in two variables now: my categorical Species, then a comma, and then Petal.Length, my quantitative measurement. I'll run that; again, you just hit Control (or Command) and Enter. This is the one I'm looking for; let's zoom in on it. Again, you see that plot has adapted: it knows, for instance, that the first variable I gave it is categorical and the second is quantitative, and the most common chart for that combination is a box plot, so that's what it automatically chooses. And it's a good plot: we can see very strong separation between the groups on this particular measurement. I'll zoom back out.

Then let's try a quantitative pair. Now I'll do petal length and petal width, so it's going to be a little different. I'll run that command, and this one is a proper scatterplot, with one measurement across the bottom and another up the side. You can see there's a really strong positive association between the two: not surprisingly, as a petal gets longer, it generally also gets wider, so it just gets bigger overall.

And finally, if I run the plot command on the entire data set, the entire data frame, this is what happens: we do plot(iris). We've seen this one in previous examples, but let me zoom in on it. What it is, is an entire matrix of scatterplots for the four quantitative variables.
And then we have Species, which is kind of funny because it doesn't label the categories, but it shows us a dot plot of the measurements for each species. If you don't have too many variables, this is a really nice way of getting a very quick, holistic impression of what's going on in your data. So the point is that the default plot command adapts to the number of variables I give it, and to the kind of variables I give it, and that makes life really easy.

Now, I want you to know that it's possible to change the way these plots look by specifying some options. I'm going to do the scatterplot again, where I say plot, and then in parentheses give my two arguments saying what I want in it: the petal length and the petal width. Then I continue onto another line; I'm just separating the arguments with commas. If you want, you can write this all as one really long line; I break it up because I think it makes it a little more readable. I'm going to specify the color with col, using a hex code, and that code is actually the red used on the datalab homepage. Then pch is for "point character," and 19 is a solid circle. I'll put a main title on it, and then a label on the x-axis and a label on the y-axis. I'll run those lines now, doing Command (or Control) Enter for each, and you can see the plot build up. When we're finished, we've got the whole thing; I'll zoom in on it again. This is the kind of plot you could actually use in a presentation, or possibly in a publication. So even with the base command, we can get really good-looking, informative, clean graphs.

What's interesting is that the plot command can do more than just show data: we can actually feed it functions. If you want, for instance, to get a cosine curve, I do plot, then cos, for cosine, and then I give the limits, from zero to two times pi, because that's a relevant range for cosine. I run that, and you can see the graph drawing our little cosine curve. I can do the exponential function from one to five, and there it is curving up. And I can do dnorm, which is the density of a normal distribution, from minus three to plus three, and there's the good old bell curve in the bottom right. Then we can use the same kinds of options we used earlier for our scatterplot: do a plot of dnorm, the bell curve, from minus three to plus three on the x-axis, change the color to red, use lwd, which is for line width, to make the line thicker, and give it a title on top, a label on the x-axis, and a label on the y-axis. We'll zoom in on that, and there is my new and improved, prettier, presentation-ready bell curve, all from the default plot command in R. So this is a really flexible and powerful command, and it's in the base package. We'll see other commands that can do even more elaborate things, but this is a great way to start: get a quick impression of your data, see what you're dealing with, and shape the analyses you do subsequently.
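Here's a consolidated sketch of the plot() calls from this walkthrough; the hex color, titles, and axis labels are stand-ins for the ones used in the video:

```r
library(datasets)

plot(iris$Species)                         # categorical -> bar chart
plot(iris$Petal.Length)                    # one quantitative -> values by index
plot(iris$Species, iris$Petal.Length)      # categorical x quantitative -> box plots
plot(iris$Petal.Length, iris$Petal.Width)  # quantitative pair -> scatterplot
plot(iris)                                 # whole data frame -> scatterplot matrix

# The same scatterplot with options
plot(iris$Petal.Length, iris$Petal.Width,
     col  = "#cc0000",    # point color (assumed hex value)
     pch  = 19,           # point character: solid circle
     main = "Iris: Petal Length vs. Petal Width",
     xlab = "Petal Length",
     ylab = "Petal Width")

# plot() also accepts functions
plot(cos, 0, 2*pi)     # cosine from 0 to 2*pi
plot(exp, 1, 5)        # exponential function from 1 to 5
plot(dnorm, -3, +3)    # standard normal density

plot(dnorm, -3, +3,
     col  = "red",
     lwd  = 5,                           # line width: thicker line
     main = "Standard Normal Distribution",
     xlab = "z-scores",
     ylab = "Density")
```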
The next step in our introduction, and in our discussion of basic graphics, is bar charts. The reason I like to talk about bar charts is this: simple is good. Bar charts are the most basic graphic for the most basic data, so they're a wonderful place to start your analysis. Let me show you how this works. Just try it in R: open up this script, and let's run through it and see how it works.

When you open up the file in RStudio, the first thing we want to do is come down here and load the datasets package. Then we'll scroll down a little and use a data set called mtcars. Let's get a little information about it with the question mark and the name of the data set. This is Motor Trend (that's a magazine) car road-test data from 1974, so these cars are over 40 years old. Let's take a look at the first few rows of mtcars by doing head; I'll zoom in on this. What you can see is that we have a list of cars, the Mazda RX4 and the RX4 Wag, the Datsun 710, the AMC Hornet (and I actually remember these cars), and several variables on each of them. We have mpg, the miles per gallon; the number of cylinders; the displacement in cubic inches; the horsepower; the final drive ratio, which has to do with the axle; the weight in thousands of pounds; and the quarter-mile time in seconds (these are a bunch of really, really slow cars). vs is for whether the cylinders are in a V or in a straight line, and am is for automatic or manual transmission. Going on to the next line, we have gear, which is the number of gears in the transmission, and carb, for how many carburetor barrels they have; we don't even use carburetors anymore. Anyhow, that's what's in the data set. I'll zoom back out.

Now, if we want a really basic bar chart, you might think the most obvious thing would be to use R's barplot command, which is named for the bar chart, and then to specify the data set mtcars, the dollar sign, and the variable we want, the cylinders. You'd think that would work, but unfortunately it doesn't. Instead, what we get is this, which just goes through all the cases, one row at a time, and tells us how many cylinders each one has. That's not a good chart; it's not what we want. So we actually need to reformat the data a little. By the way, you'd have to do the exact same thing to make a bar chart in a spreadsheet like Excel or Google Sheets: you can't do it with the raw data, you first need to create a summary table. So what we're going to do here is use the command table. We say: take this variable from this data set, make a table of it, and feed it into an object, a data container, called cylinders. I'll run that one, and you see it just showed up in the top left; let me zoom in. Now I have in my environment a data object called cylinders: it's a table, it has a length of three and a size of about 1,000 bytes, and the environment pane gives us a little more information. Let's go back to where we were. Now that I've saved that information into cylinders, which holds just the counts for each number of cylinders, I can run the barplot command, and I get the kind of plot I expected to see. From this, we see that we have a fair number of cars with four cylinders and a smaller number with six, and, because this is 1974, a lot of eight-cylinder cars in this particular data set. We can also use the default plot command, which I showed you previously, on the same data; it does something a little different, making a line chart where the lines are the same height the bars would be. I'd probably use barplot instead, because it's easier to tell what's going on, but this is a way of making a default chart that gives you the information you need for a categorical variable. Remember: simple is good, and this is a great way to start.
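In code, the whole sequence looks something like this sketch:

```r
library(datasets)

barplot(mtcars$cyl)             # not what we want: one bar per car

cylinders <- table(mtcars$cyl)  # summary table: counts of 4, 6, and 8
barplot(cylinders)              # proper bar chart of the counts
plot(cylinders)                 # default plot: line chart of the counts
```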
In our last video on basic graphics, we talked about bar charts. If you have a quantitative variable, then the most basic kind of chart is a histogram. This is for data that is quantitative, or scaled, or measured, or at the interval or ratio level; all of those terms refer to basically the same thing. In all of those cases, you want to get an idea of what you have, and a histogram lets you see it.

There are a few things to look for with a histogram. Number one, the shape of the distribution: is it symmetrical, is it skewed, is it unimodal or bimodal? You'll look for gaps, big empty spaces in the distribution. You'll also look for outliers, unusual scores, because those can distort any of your subsequent analyses. And you'll look for symmetry, to see whether you have the same number of high and low scores, or whether you need to do some sort of adjustment to the distribution. But this will be easier if we just try it in R, so open up this R script file and let's take a look at how to do histograms in R.

When you open up the file, the first thing we need to do is come down here and load the data sets. We do this by running the library command, with a Control (or Command) Enter. Then we can use the iris data set again. We've looked at it before, but let's get a little more information by asking for help on iris. There we have Edgar Anderson's iris data, also known as Fisher's iris data, because he published an article on it, and here's the full set of information. It's from 1936, so it's 80 years old. Let's take a look at the first few rows: again, sepal and petal length and width for three species of iris, as we've seen before.

We're going to do a basic histogram on each of the four quantitative variables in here. I'll use the hist command: hist, then the data set iris, then the dollar sign to say which variable, and then Sepal.Length. I run that and get my first histogram; let's zoom in a little. What we get, of course, is a basic black-outline-on-white-background chart, which is fine for exploratory graphics. It gives us a default title that says "Histogram of" plus the variable's somewhat clunky name, which also appears on the x-axis at the bottom; it automatically adjusts the x-axis; and it chooses somewhere around seven to nine bars, which is usually a good choice for a histogram. On the left it gives us the frequency, or count, of how many observations fall in each group. So, for instance, we have only five irises whose sepal length is between four and four and a half centimeters, I think it is. Let's zoom back out and do another one. This time, for sepal width, you can see it's almost a perfect bell curve.
When we do petal length, we get something different; let me zoom in on that one. This is where we see a big gap: we've got a really strong bar at the low end (in fact, it goes above the frequency axis), then a gap, and then sort of a bell curve. That lets us know there's something interesting going on with this data that we'll want to explore more fully. Then we'll do another one for petal width; I'll just run the command. You can see the same kind of pattern here, where there's a big clump at the low end, a gap, and then sort of a bell curve beyond that.

Now, another way to do this is to draw the histograms by group, and that's an obvious thing to do here, because we have three different species of iris. What we're going to do is put the graphs into three rows, one above another, in one column. I'll do this by changing a graphical parameter; par is for "parameter," and I give it the number of rows I want to have in my output. I need to give it a combination of numbers, so I use c, which is for "concatenate"; it means treat these two numbers as one unit, where 3 is the number of rows and 1 is the number of columns. I run that, and it doesn't show anything just yet.

Then I come down and do this more elaborate command. I do hist, the histogram we've been doing, with petal width, except this time, in square brackets, I put a selector, which means: use only these rows. The way I do that is by saying I want just the setosa irises. So I write iris, that's the data set, then the dollar sign, then Species, which is the variable, then two equals signs, because in computing that means "is equivalent to," and then, in quotes, spelled exactly the same and with the same capitalization as in the data, "setosa". So that's the variable and the row selection. I'm also going to put in limits for the x-axis, because I want to manually make sure all three of my histograms share the same x scale. Then I'll specify breaks, for how many bars I want in the histogram; what's funny about breaks is that it's really only a suggestion you give to the computer. Then I'll put a title above this one, have no x label, and make the bars red. I'll do all of that right now, running each line. You see I get a very short, skinny chart; let's zoom in on it. It's very short because I'm going to have multiple charts; it will make more sense when we look at them all together. But you can see, by the way, that the petal widths of the setosa irises are at the low end. Now let's do the same thing for versicolor. I'll run through all of that; it's all the same, except we'll make it purple. There's versicolor. Then let's do virginica last, and we'll make those blue, and now I can zoom in. Now we have our three histograms: it's the same variable, petal width, but drawn separately for each of the three species, and it's really easy to see what's going on. Setosa is really low; versicolor and virginica overlap, but they're still distinct distributions.
This approach, by the way, is referred to as "small multiples": making many versions of the same chart on the same scale, so it's really easy to compare across groups or across conditions, which is what we're able to do right here. Also, any time you change the graphical parameters, you want to make sure to change them back to what they were before, so here I'm running par again, going back to one row and one column. And that's a good way of doing histograms: for examining quantitative variables, and even for exploring some of the complications that can arise when different categories have different scores on those variables.
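As a sketch, the whole sequence might look like this; the xlim, breaks, and color choices approximate the video's rather than reproduce them exactly:

```r
library(datasets)

hist(iris$Sepal.Length)   # default histogram: counts, auto-chosen bins
hist(iris$Sepal.Width)    # nearly a perfect bell curve
hist(iris$Petal.Length)   # the big bar, the gap, then the bell
hist(iris$Petal.Width)

par(mfrow = c(3, 1))      # graphics parameter: 3 rows, 1 column

# Square brackets select rows; xlim keeps all three plots on one scale,
# and breaks is only a suggestion to R.
hist(iris$Petal.Width[iris$Species == "setosa"],
     xlim = c(0, 3), breaks = 9,
     main = "Petal Width for Setosa", xlab = "", col = "red")

hist(iris$Petal.Width[iris$Species == "versicolor"],
     xlim = c(0, 3), breaks = 9,
     main = "Petal Width for Versicolor", xlab = "", col = "purple")

hist(iris$Petal.Width[iris$Species == "virginica"],
     xlim = c(0, 3), breaks = 9,
     main = "Petal Width for Virginica", xlab = "", col = "blue")

par(mfrow = c(1, 1))      # restore the parameters when finished
```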
In our two previous videos, we looked at basic graphics for one variable at a time: bar charts for categorical variables and histograms for quantitative variables. While there's a lot more you can do with univariate distributions, you'll also want to look at bivariate distributions, and we're going to look at scatterplots as the most common version of that. You use a scatterplot when you want to visualize the association between two quantitative variables. It's actually more flexible than that, but this is the canonical case for a scatterplot.

When you do that, what sorts of things do you want to look for in your scatterplot? Well, number one, you want to see whether the association between your two variables is linear, whether it can be described by a straight line, because most of the procedures we use assume linearity. You also want to check whether you have consistent spread across the scores as you go from one end of the x-axis to the other, because if things fan out considerably, you have what's called heteroscedasticity, and that can really complicate some of the other analyses. As always, you want to look for outliers, because an unusual score, or especially an unusual combination of scores, can drastically throw off some of your interpretations. And then you want to look for the correlation: is there an association between these two variables? That's what we're looking for, so let's try it in R; simply open up this file, and let's see how it works.

The first thing we need to do in R is come down and load the datasets package, with just a Command (or Control) Enter. We'll use mtcars; we looked at it before, here's a little information on it, it's road-test data from 1974. And let's look at the first few cases; I'll zoom in on that. Again, we have miles per gallon, cylinders, and so on and so forth.

Any time you're going to look at an association, it's a really good idea to look at the univariate, one-variable-at-a-time, distributions as well. We're going to look at the association between weight and MPG, so let's look at the distribution of each of those separately. I'll do that with a histogram: I do hist, then in parentheses I specify the data set, mtcars in this case, and then a dollar sign to say which variable in that data set. There's the histogram for weight, and it's not horrible, though it looks like we've got a few cars on the high end. And here's the histogram for miles per gallon: again, mostly kind of normal, but a few on the high end.

Now let's look at the plot of the two of them together. What's interesting is that I just use the generic plot command, feed the variables in, and R can tell that I'm giving it two quantitative variables and that a scatterplot is the best kind of plot for that. So we do weight and mpg, and let me zoom in on that. What you see here is one circle for each car, at the joint position of its weight and its MPG, and it's a strong downhill pattern. Not surprisingly, the more a car weighs (and we have some in this data set well over 5,000 pounds), the lower the miles per gallon; the heaviest get down to about 10 miles per gallon, while the smallest cars, which weigh under 2,000 pounds, get about 30 miles per gallon.

This is probably adequate for most purposes, but there are a few other things we can do. For instance, I'm going to add some options: I take the same plot and add additional arguments. pch is for "point character," and 19 is a solid circle; cex has to do with the size of things, and 1.5 means make the points 150% larger; col is for color, and I'm specifying a particular red, the one for datalab, in hex code; and then I give a title, an x label, and a y label. We'll zoom in on that, and now we have a more polished chart, one that, because of the solid red circles, also makes it easier to see the pattern: some really heavy cars with really bad gas mileage, then an almost perfectly linear association up to the lighter cars with much better gas mileage. So a scatterplot is the easiest way of looking at the association between two variables, especially when both are quantitative, on a scaled or measured outcome. It's something you want to do any time you're doing an analysis: visualize first, then use that as the introduction to any numerical or statistical work you do after that.
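Here's a sketch of those steps; the hex color, title, and labels are stand-ins:

```r
library(datasets)

hist(mtcars$wt)    # check each univariate distribution first
hist(mtcars$mpg)

plot(mtcars$wt, mtcars$mpg)   # two quantitative variables -> scatterplot

# Polished version
plot(mtcars$wt, mtcars$mpg,
     pch  = 19,          # solid circles
     cex  = 1.5,         # points at 150% size
     col  = "#cc0000",   # assumed hex value
     main = "MPG as a Function of Weight of Cars",
     xlab = "Weight (1,000 lbs)",
     ylab = "MPG")
```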
As we finish these necessarily very short presentations on basic graphics, I want to mention one more thing, and that is that you have the possibility of overlaying plots. That means putting one plot directly on top of, superimposing it on, another. Now, you may ask yourself why you'd want to do this. I can give you an artistic take on it. This, of course, is Pablo Picasso's Les Demoiselles d'Avignon, one of the early masterpieces of Cubism, and the idea of Cubism is that it gives you many views, several different perspectives on the same thing simultaneously. We're going to try to do a similar thing with data, so we can say, very quickly: thanks, Pablo. The technical explanation for overlaying plots is that you get increased information density: more information, and hopefully more insight, in the same amount of space, and hopefully in the same amount of time.

There is a potential risk here, though. You might be saying to yourself at this point: you want dense? I can do dense. And then we end up with something vaguely like The Garden of Earthly Delights, which is completely overwhelming and just makes you kind of shut down cognitively. No thank you, Hieronymus Bosch. (Actually, I like Hieronymus Bosch's work.) The point is that when it comes to data graphics, use restraint: just because you can do something doesn't mean you should. With graphics and overlaid plots, the general rule is this: use views that complement and support one another, that don't compete, and that give greater information in a coherent and consistent way.

This is going to make a lot more sense if we just take a look at how it works in R, so open up this script, and we'll see how to overlay plots for greater information density and greater insight. The first thing we need to do is open up the datasets package. We're going to use a data set we haven't used before, about lynxes (that's the animal): it records Canadian lynx trappings from 1821 to 1934. If you want the actual information on the data set, there it is. Now let's take a look at the first few lines of data. This one is a time series, so what's unusual about it is that it's just one line of numbers, and you have to know that it starts at 1821 and runs on from there.

Let's make a default chart with a histogram, as a way of seeing whether lynx trappings were consistent, or how much variability there was. We'll do hist, the default histogram, and simply put lynx in; we don't have to specify a variable, because there's only one variable in the data set. When we do that (I'll zoom in), we get a really skewed distribution: most of the observations are down at the low end (it's actually measured in thousands), and then it tapers off. So we can tell that there's a very common range of values at the low end. On the other hand, we don't know which years those were; we're ignoring that for the moment and looking at the overall distribution of trappings regardless of year. I'll zoom back out.

We can set some options on this one to make it a little more intricate. We do the histogram, and then in parentheses I specify the data. I can also tell it how many bins I want, though again that's only a suggestion, because R will do what it wants anyhow. I can say make it a density instead of a frequency, so it gives proportions of the total distribution. We'll change the color to "thistle1," because you can use color names in R, and we'll give it a title. By the way, I'm using the paste command because it's a long title and I want it to show up on one line, but I need to spread my command across two lines; you could go longer, but I have to keep my command lines short so you can actually see them while we're zoomed in. And then we'll give it an x label: "Number of Lynx Trapped." Now we have a more elaborate chart; I'll zoom in on it. It's in a kind of thistle purple-lilac color, and we've divided the bins differently: previously it was one bar for every 1,000, now it's one bar for every 500.

But that's just one chart, and we're here to see how to overlay charts. A really good overlay, any time you're dealing with a histogram, is a normal distribution: you want to see whether the data are distributed normally. We can already tell they're skewed here, but let's get an idea of how far they are from normal. To do this, we use the command curve, with dnorm, the density of the normal distribution. Here I give it x, just a generic variable name, and tell it to use the mean of the lynx data and the standard deviation of the lynx data. We'll make it a slightly different thistle color, number four, make the line two pixels wide with the line width, and then add says: stick it on the previous graph. And now I'll zoom in on that.
You can see that if we had a normal distribution with the same mean and standard deviation as this data, it would look like that. Obviously, that's not what we have, because we have this great big spike here at the low end.

Then I can do a couple of other things. I can put in what are called kernel density estimators. Those are sort of like a bell curve, except they're not parametric; instead, they follow the distribution of the data, which means they can have a lot more curves in them, though they still add up to one, like a normal distribution. So let's see what those look like here. We use lines for this one, and then we say density; that gives the standard kernel density estimator, and we'll make it blue. There it is, on top. I'm going to do one more, and then we'll zoom in: I can change a parameter of the kernel density estimator, adjust, to say average across a little more of the data, sort of like a widened moving average. Now let me zoom in on that. You can see, for instance, that the blue line follows the spike at the low end much more closely before it dips down, while the purple line is much slower to change, because of the instructions I gave it with adjust equals three.

Then I'll add one more thing, something called a rug plot: little vertical lines underneath the plot, one for each individual data point. I do that with rug, saying just use lynx, and then we'll give it a line width, or pixel width, of two and make it gray. And that, zooming in, is our final plot. You can see now that the individual observations are marked, so you can see why each bar is as tall as it is, and why the kernel density estimator follows the distribution it does. This is our final histogram, with several different views of the same data. It's not Cubism, but it's a great way of getting a richer view of even a single variable, which can then inform the subsequent analyses you do, to get more meaning and more utility out of your data.
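As one consolidated sketch (the breaks value and title text are approximations of the video's):

```r
library(datasets)

hist(lynx,
     breaks = 14,          # suggested number of bins
     freq   = FALSE,       # density scale, so curves can overlay it
     col    = "thistle1",  # R accepts color names
     main   = paste("Histogram of Annual Canadian",
                    "Lynx Trappings, 1821-1934"),
     xlab   = "Number of Lynx Trapped")

# Normal curve with the data's own mean and sd; add = TRUE overlays it
curve(dnorm(x, mean = mean(lynx), sd = sd(lynx)),
      col = "thistle4", lwd = 2, add = TRUE)

# Kernel density estimates: nonparametric curves that follow the data
lines(density(lynx), col = "blue", lwd = 2)
lines(density(lynx, adjust = 3), col = "purple", lwd = 2)  # smoother

rug(lynx, lwd = 2, col = "gray")   # one tick mark per observation
```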
Continuing in R: An Introduction, the next thing we need to talk about is basic statistics, and we'll begin by discussing the basic summary function in R. The idea here is that once you have done the pictures, the basic visualizations, you'll want to get some precision by getting numerical or statistical information. Depending on the kinds of variables you have, you'll want different things: counts, or frequencies, for categories, and things like quartiles and the mean for quantitative variables. We can try this in R, and you'll see it's a very, very simple thing to do; just open up this script and follow along.

We'll load the datasets package, Control (or Command) Enter, and then look at some data and do an analysis we've seen several times already: we'll load the iris data and take a look at the first few lines. Again, these are four quantitative measurements, on sepal and petal length and width, for three species of iris flowers. And we're going to get summaries in three different ways.

First, we'll do a summary for a categorical variable. The way we do this is with the summary function: we say iris, because that's the data set, then the dollar sign, and then the name of the variable we want, in this case Species. We run that command, and you can see it simply shows setosa 50, versicolor 50, and virginica 50; those are the frequencies, or counts, for each of the three categories in the Species variable.

Now we'll get something more elaborate for a quantitative variable; we'll use sepal length for that one, and I'll just run the next line. You can see it lays things out horizontally: we have the minimum value of 4.3, then the first quartile of 5.1, the median, the mean, the third quartile, and then the maximum score of 7.9. This is a really nice way of getting a quick impression of the spread of scores, and by comparing the median and the mean you can sometimes tell whether the distribution is symmetrical or whether there's skewness going on.

Then you have one more option, and that's getting a summary for the entire data frame, or data set, at once. I simply do summary and, in the parentheses for the argument, give the name of the data set, iris. For this one I need to zoom in a little, because now it arranges things vertically. We have sepal length, our first variable, with its quartiles and median; the same for sepal width, petal length, and petal width; and then it switches over for the last one, Species, where it gives us the counts, or frequencies, of each of the three categories. So that's the most basic version of what you can do with the default summary function in R: it gives you quick descriptives, gives you the precision to follow up on the graphics we did previously, and gets you ready for your further analyses.
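The three calls, as a sketch:

```r
library(datasets)

summary(iris$Species)        # categorical: counts per category
summary(iris$Sepal.Length)   # quantitative: min, quartiles, mean, max
summary(iris)                # entire data frame, variable by variable
```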
As you're starting to work with R and getting basic statistics, you may find you want a little more information than the base summary function gives you. In that case, you can use something called describe, and its purpose is really simple: it gives you more detail. Now, this is not included in R's basic functionality; it comes from a contributed package, the psych package. When you run describe from psych, here's what you get: n, the sample size; the mean; the standard deviation; the median; the 10% trimmed mean; the median absolute deviation; the minimum and maximum values; the range; skewness and kurtosis; and the standard error. Don't forget, you still want to do this after your graphical summaries: pictures first, numbers later. Let's see how this works in R; simply open up this script, and we'll run through it step by step.

When you open up R, the first thing we need to do is install the package. I'm going to go through my default installation of packages, because I'll use pacman, and that just makes things a little easier. We'll load all of these packages (this assumes, of course, that you have pacman installed already), we'll get the datasets package, and then we'll load our iris data; we've done that lots of times before: sepal and petal length and width, and the species. But now we'll do something a little different: we'll load a package using p_load from the pacman package (that's why I loaded it already). p_load will download psych if you don't have it already, so it might take a moment, and it downloads a few dependencies, generally other packages that need to come along with it.

If you want some help on it, you can use p_help; any time you see p and an underscore, that's something from pacman: p_help(psych). When you run that, it opens a web browser and gets the PDF help. I've got it open already, because it's really big; in fact, it's 367 pages of documentation about the functions inside. Obviously, we're not going to go through the whole thing here. What we can do is look at some of it in the R viewer: if you simply add the argument web = F, for FALSE (you can spell out the word FALSE, as long as you do it in all caps), the help opens here on the right. This is actually a web page we're looking at, and you can click on each entry to get information about the individual bits and pieces.

Now, let's use describe, which comes from the psych package. It's for quantitative variables only, so you don't want to use it for categories. We'll pick one quantitative variable right now: iris, then Sepal.Length. When we run that, here's what we get. The first number, the 1, simply indicates the row number; we only have one row, so that's all we get there. Then it gives us the n of 150, the mean of 5.84, the standard deviation, the median, and so on and so forth, out to the standard error at the end.

That's for one quantitative variable. If you want more than that, or especially if you want an entire data frame, just give the name of the data frame to describe: describe(iris). I'll zoom in on that one, because now we have a lot of output. It lists all the variables down the side, sepal length and so on, numbers the variables 1 through 5, and gives us the statistics for each one. Please note that it has produced numerical information for Species, but it shouldn't have, because that's a categorical variable; you can ignore that last line, which is why an asterisk appears right there. Otherwise, this gives you more detailed information, including things like the standard deviation and the skewness, that you might need to get a more complete picture of what you have in your data. I use describe a lot; it's a great way to complement histograms and other charts, like box plots, to give you a more precise image of your data and prepare you for your other analyses.
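A sketch of those steps, assuming pacman is already installed:

```r
pacman::p_load(pacman, psych)   # installs psych if needed, then loads it

p_help(psych)               # opens the package's PDF documentation
p_help(psych, web = FALSE)  # opens the help in R's own viewer instead

describe(iris$Sepal.Length)   # one quantitative variable
describe(iris)                # whole data frame; ignore the starred
                              # Species row -- it's categorical
```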
To finish up the basic statistics section of this R introduction, let's take a short look at selecting cases. This allows you to focus your analysis: you choose particular cases and look at them more closely. In R you can do this a couple of different ways: you can select by category, if you have the name of a category; you can select by value on a scaled variable; or you can select by both. Let me show you how this works; just open up this script. As with most of our other examples, we begin by loading the datasets package with library; just Ctrl+Enter or Cmd+Enter to run that command. Now it's loaded, and we'll use the iris dataset, looking at the first few cases with head. I'll zoom in for a second: there's the iris data, which we've already seen several times. We come down and make a histogram of the petal length for all the irises in the dataset, so I give it the name of the dataset and then petal length. There's our histogram off to the right; I'll zoom in on it. You see we've got this group stuck way over at the left, then a gap right here, and then a pretty much normal distribution for the rest of it. I'll zoom back out. We can also get some summary statistics, right here for petal length: there we have the minimum value, the quartiles, and the mean. Now let's do one more thing and get the species, our categorical variable, and the number of cases for each species. I run summary, it knows this is a categorical variable, and we have 50 of each; that's good. The first task is to select cases by their category, in this case by the species of iris, and we'll do it three times. Once for versicolor: I make a histogram where I say use the iris data, then the dollar sign, meaning use this variable, petal length, and then in square brackets I put a selector to indicate which rows, or cases, to choose. I say select where the variable Species equals, and you've got to use the two equals signs, "versicolor"; make sure you spell and capitalize it exactly as it appears in the data. Then we put a title on it that says Petal Length: Versicolor. So here we go, and there are our selected cases: just 50 of them going into the histogram. On the bottom right we do a similar thing for virginica, simply changing the selection criterion from versicolor to virginica, with a new title; and finally we can do setosa as well. That's three different histograms made by selecting values on a categorical variable, where you just type them in quotes exactly as they appear in the data. Another way to do this is to select by value on a quantitative, or scaled, variable. To do that, inside the square brackets that select the rows, you put the variable, specifying that it's in the iris dataset, and then say what values you want; I'm looking for values less than two, and I've changed the title to reflect that. What's interesting is that this selects the setosas again: it's the exact same group, so the diagram doesn't change, but the title and the method of selecting the cases did. Probably more interesting is using multiple selectors. Let's look for virginica, that'll be our species, and we want short petals only. So this says what variable we're using, petal length, and this is how we select: iris, dollar sign, Species is equal to, with the two equals signs, "virginica", then an ampersand, and then iris petal length is less than 5.5. I run that, get my new title, and zoom in: what we have are just the virginica, but only the shorter ones. That's a pair of selectors used simultaneously.
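Here are those selections as code; a sketch reconstructed from the narration, with the titles in my own wording:

hist(iris$Petal.Length)      # all 150 irises
summary(iris$Petal.Length)
summary(iris$Species)        # 50 of each species
hist(iris$Petal.Length[iris$Species == "versicolor"],
     main = "Petal Length: Versicolor")
hist(iris$Petal.Length[iris$Species == "virginica"],
     main = "Petal Length: Virginica")
hist(iris$Petal.Length[iris$Species == "setosa"],
     main = "Petal Length: Setosa")
hist(iris$Petal.Length[iris$Petal.Length < 2],       # picks out the setosas again
     main = "Petal Length < 2")
hist(iris$Petal.Length[iris$Species == "virginica" &
                       iris$Petal.Length < 5.5],     # two selectors at once
     main = "Petal Length: Short Virginica")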
Now, another way to do this, by the way: if you know you're going to be using the same subsample many times, you might as well create a new dataset that has just those cases. The way you do that is you specify the data you're selecting from, then in square brackets the rows and the columns, and then you use the assignment operator, the less-than sign and dash, which you read as "gets". So I'm going to create an object called i.setosa, for iris setosa, and I'll do it by going to the iris data and, on Species, reading just the setosa rows. I then put a comma, because that first part selects the rows and I also need to tell it which columns; if I want all of them, I just leave that spot blank. I do that, and now up in the top right, I'll zoom in on it, I have a new data object in the environment, a data frame called i.setosa. We can look at the subsample I've just created by getting the head of those cases; it looks just like the full data, except it has only 50 cases instead of 150. We can get a summary for those cases, this time for just the petal length, and we can also get a histogram of the petal length, which shows just those setosas. So that's several ways of dealing with subsamples, and again, saving the selection, if you're going to use it multiple times, lets you drill down on the data, get a more focused picture of what's going on, and inform the analyses you carry on from this point.
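Here's the saved-subsample workflow as code; a minimal sketch, with the object name following the narration:

i.setosa <- iris[iris$Species == "setosa", ]  # setosa rows; blank after the comma = all columns
head(i.setosa)                  # 50 cases instead of 150
summary(i.setosa$Petal.Length)
hist(i.setosa$Petal.Length)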
The next step in our R introduction is to talk about accessing data, and to get that started we need to say a little about data formats. The reason is that with data you're sometimes talking about apples and oranges: you have fundamentally different kinds of things. There are two ways in particular this can happen. First, you can have data of different types, and second, regardless of type, you can have your data in different structures; it's important to understand both. We'll start with data types. This is like the level of measurement of a variable. You can have numeric variables, which usually come as integer (whole numbers) or single or double precision. You can have character variables, with text in them; we don't have string variables in R, they're all character. You can have logical variables, which are TRUE or FALSE, otherwise known as Boolean. You can have complex numbers, and there's a data type called raw. Regardless of which kind you have, you can arrange the values in different data structures. The most common structures are the vector, the matrix or array, the data frame, and the list; we'll take a look at each. A vector is one or more values in a one-dimensional arrangement; imagine them all in a straight line. What's interesting is that in other settings a single number would be called a scalar, but in R it's still a vector, just a vector of length one. The important thing about vectors is that the data are all of the same type, for instance all character or all integer, and you can think of the vector as R's basic data object; most other things are variations on it. One step up from that is the matrix. A matrix has rows and columns, so it's two-dimensional data; the columns all need to be the same length, and all the data needs to be of the same class. Interestingly, the columns are not named; they're referred to by index numbers, which can make matrices a little weird to work with. You can step up from there to an array, which is identical to a matrix but for three or more dimensions. Probably the most common form, though, is the data frame: a two-dimensional collection that can hold vectors of multiple types. You can have character variables in one column, integers in another, and logicals in a third; the trick is that they all need to be the same length. You can think of the data frame as the closest thing R has to a spreadsheet, and in fact, if you import a spreadsheet, it typically goes into a data frame. The neat thing is that R has special functions for working with data frames, things you can do with them that you can't do with other structures, and we'll see how those work as we go through this course and others. Finally, there's the list, R's most flexible data format. You can put basically anything in a list: it's an ordered collection of elements of any class, any length, any structure. Interestingly, lists can include lists, which can include lists, and so on, like Russian nesting dolls, one inside the other. That may sound very flexible and very good, but lists are actually kind of hard to work with, and so the data frame is really the optimal level of complexity for a data structure. Then let me mention one more idea: coercion. In the world of ethics, coercion is a bad thing; in the world of data science, coercion is good. Here it means changing a data object from one type to another, changing the level of measurement or the nature of the variable you're dealing with. For example, you can change a character to a logical, a matrix to a data frame, or double precision to integer. It's easiest to see how this works if we go to R and give it a whirl, so open up this script and let's try it in RStudio. For this demonstration of data types you don't need to load any packages; we'll just run through things on their own. We start with numeric data. I'm going to create a data object, a variable called n1, my first numeric variable, using the assignment operator, the little left arrow, and this reads as "n1 gets 15". R uses double precision by default. I run that, and you can see the object showed up here on the top right. If I call the name of the object, it shows its contents in the console, so I type n1 and run it. There in the console at the bottom left it brought up a 1 in square brackets; that's the index number of the first element. This is a collection of just one number, but there it is, and we get the value of 15. We can also use the R command typeof to confirm what type of variable this is, and it's double, the default. We can do another one with the value 1.5 and pull up its contents, 1.5.
And we see that it also is double precision. Next we come down and do a character; I'm calling it c1, for my first character variable. You see I write c1, the name of the object I want to create, then the assignment operator, the less-than sign and dash, which reads as "gets", and then, in double quotes, a lowercase c; that's just a value I chose. In other languages you would use single quotes for a single character and double quotes for strings; they're the same thing in R. I feed that in, and you can see it shows up in the global environment on the right. We call it forward, and it appears with the double quotes on it; we get the typeof, and it's character. That's good. If we want an entire string of text, I can feed that into c2 just by putting it all inside the double quotes. We pull it out and see it's also listed as character, even though in other languages it would be called a string. We can do logicals: this is l1, for my first logical, and I'm feeding in TRUE. When you write TRUE or FALSE, they have to be in all caps, or you can use just a capital T or a capital F. I call that one out and it says TRUE; notice, by the way, that there are no quotes around it, which is one way you can tell it's a logical and not a character. If we put quotes on it, it would be a character variable. We get the typeof, and there we go: logical. I said you can also use the abbreviations, so for my second logical variable, l2, I'll just use F. I feed that in, and when I ask what it is, it prints out the whole word FALSE; typeof again says logical. Then we come down to data structures. I'm going to create a vector, a one-dimensional collection, by creating v1, for vector one, and using c, which stands for concatenate; you can also think of it as combine or collect. I put five numbers in there; you need commas between the values going in. Then I call out the object, and there are my five numbers; notice it shows them without the commas. I ask R whether it's a vector, with is.vector, and it says TRUE; yes, it is. I can also make a vector of characters, right here, and it's a vector too, and a vector of logical values, TRUE and FALSE, and that's a vector also. Now, a matrix, you may remember, goes in more than one dimension. In this case I call it m1, for matrix one, and use the matrix function: I say matrix, then combine a set of TRUE and FALSE values, and then say how many rows I want; R figures out the number of columns by doing the math. I put that into m1 and call it up, and it displays the values in rows and columns, writing out the full TRUE or FALSE. I can do a second matrix where I explicitly shape the values into rows and columns as I type them; that's just for my convenience, R doesn't care that I broke the line up to mirror the layout, but it's a handy way of working. And if I want to tell R to fill the matrix by rows, I can specify that with byrow = T, for TRUE.
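Here's this stretch as code; a sketch where the particular values are stand-ins based on the narration:

n1 <- 15                    # numeric, double precision by default
typeof(n1)                  # "double"
n2 <- 1.5
c1 <- "c"                   # character; quotes make a character in R
typeof(c1)                  # "character"
c2 <- "a string of text"    # still "character"; R has no separate string type
l1 <- TRUE                  # logical; all caps, or the abbreviations T / F
l2 <- F
v1 <- c(1, 2, 3, 4, 5)      # numeric vector; c() concatenates
is.vector(v1)               # TRUE
v2 <- c("a", "b", "c")      # character vector
v3 <- c(TRUE, TRUE, FALSE, FALSE, TRUE)              # logical vector
m1 <- matrix(c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE),
             nrow = 2)                               # filled column by column
m2 <- matrix(c("a", "b", "c", "d"), nrow = 2,
             byrow = T)                              # filled row by row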
When I run that, I have the a, b, c, d, and you see, by the way, the index numbers: on the left are the row indexes, [1,] for row one and [2,] for row two, and across the top are the column indexes, where the row position is left blank and the column comes second, so [,1] for the first column and [,2] for the second. Then we can make an array. What I do here is create the data with the colon operator, which says give me the numbers one through 24, still wrapped in c to combine them, and then I give the dimensions of the array, in the order rows, columns, tables, because I'm using three dimensions. I feed that into an object, and there's my array; you can see I have two tables. Let me zoom in on that one: it works from the last dimension, the table, and then the rows and columns are listed separately for each table. A data frame lets me combine vectors of the same length but of different types. What I'm doing here is creating a vector of numeric values, a vector of character values, and a vector of logical values, three different vectors, and then using cbind, for column bind, to combine them into a single object called dfa, for data frame all. Now, the trick here is that we get some unintentional coercion by using cbind on its own: it coerces everything to the most general format. I had numeric, character, and logical variables, and the most general of those is character, so it turned everything into a character variable. That's a problem; it's not what I wanted. I have to use a different function and tell R specifically to build a data frame, with cbind.data.frame. When I do that, I can combine them, and now you see it has maintained the data type of each variable; that's the way I want it. And finally, I can do a list. I create three objects: o1, which is numeric with three values; o2, which is character with four; and o3, which is logical with five. Then I combine them with the list function into list1, and we can look at its contents. It's kind of a funky structure, and it can be hard to read, but all the information is there. Then we do something that's, you know, a little hard to get your head around, because I'm going to create a new list that has list1 inside it. So list2 has the same three objects, plus list1 added on. I zoom in on that one, and you can see it's a lot longer, with a lot of index numbers in the brackets: the three numbers, the four character values, and the five logical values, and then there they all are repeated, because they're parts of list1, which I included in this list. So those are some of the different ways you can structure data of different types. But you'll also want to know that we can coerce data into different types to serve our different purposes.
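Here's the rest of the data-structures segment as code. A sketch: note that cbind.data.frame is the call that preserves each column's type, where plain cbind coerces everything to character, which is how I read the garbled narration here:

a1 <- array(c(1:24), c(4, 3, 2))         # 4 rows, 3 columns, 2 tables
vN <- c(1, 2, 3)                         # numeric
vC <- c("a", "b", "c")                   # character
vL <- c(TRUE, FALSE, TRUE)               # logical
cbind(vN, vC, vL)                        # unintentional coercion: all character
dfa <- cbind.data.frame(vN, vC, vL)      # a data frame that keeps each type
str(dfa)
o1 <- c(1, 2, 3)                         # numeric, three values
o2 <- c("a", "b", "c", "d")              # character, four values
o3 <- c(TRUE, FALSE, TRUE, TRUE, FALSE)  # logical, five values
list1 <- list(o1, o2, o3)                # any class, any length, any structure
list2 <- list(o1, o2, o3, list1)         # lists can contain lists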
The next thing we need to talk about is coercing types. There's automatic coercion, which we've seen a little of, where the data automatically goes to the least restrictive data type. So for instance, we do this where we have a 1, which is numeric, a "b" in quotes, which is character, and a logical TRUE, and we feed them all into an object called coerce1. By the way, by putting parentheses around the whole assignment, it both saves the object and shows us the result. You can see that what R has done is take all of them and make them character, because that's the least specific, most general format. That will happen on its own, so watch out: you don't want things getting coerced when you're not paying attention. On the other hand, you can coerce things explicitly if you want them to go a particular way. I can take this variable right here, coerce2, and put a 5 into it; we get its type and see that it's double. That's fine, but what if I want to make it an integer? Then I use the command as.integer. I run that and feed it into coerce3; it looks the same in the output, but now it's an integer, which is how it's represented in memory. I can also take character values; here I have "1", "2", and "3" in quotes, which makes them characters, and you can see they're all character. But I can feed them through as.numeric, which is able to see that there are numbers in there and coerce them to numeric. Now the quotes are gone, and it goes to the default double precision. Probably the coercion you'll do most often is converting a matrix. Let's take a look: I'll make a matrix of nine numbers in three rows and three columns, and there it is. We're going to coerce it to a data frame. That doesn't change the way it looks, but there are a lot of functions you can only use with data frames and not with matrices. This command, by the way, asks is.matrix, and the answer is TRUE. Now let's do the same thing but add as.data.frame on the front. That makes it a data frame, and it looks basically the same, just listed a little differently: the matrix had index numbers for both the rows and the columns, while the data frame has a row index and then variable names across the top, automatically calling them V1, V2, and V3. The numbers inside look exactly the same, and if we come back and ask is.data.frame, we get TRUE. So this has been a long discussion, but the point is that data comes in different types and different structures, and you're able to manipulate them so you get your data in the format and arrangement you need for your analyses in R.
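Here's the coercion segment as code; a minimal sketch following the narration:

(coerce1 <- c(1, "b", TRUE))   # automatic coercion: everything becomes character
typeof(coerce1)                # "character"
coerce2 <- 5
typeof(coerce2)                # "double"
coerce3 <- as.integer(5)       # explicit coercion to integer
typeof(coerce3)                # "integer"
coerce4 <- c("1", "2", "3")    # characters that happen to look like numbers
coerce5 <- as.numeric(coerce4) # now numeric, default double precision
coerce6 <- matrix(1:9, nrow = 3)
is.matrix(coerce6)             # TRUE
coerce7 <- as.data.frame(matrix(1:9, nrow = 3))
is.data.frame(coerce7)         # TRUE; columns auto-named V1, V2, V3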
To continue our introduction and accessing data, we want to talk about factors, and depending on the kind of work you do, this may be a really important topic. Factors have to do with categories and the names of those categories. Specifically, a factor is an attribute of a vector that specifies the possible values and their order. It's a lot easier to see if we just try it, so let me demonstrate some of the variations in R; open up this script and we'll run through it together, creating a bunch of artificial data to see how factors work. The first thing I do is create a variable x1 with the numbers one through three, and by putting the assignment in parentheses, it both stores the object in the environment and displays it in the console. So there we have the three numbers: one, two, three. I create another variable, y, with the numbers one through nine; there that is. Now I want to combine the two, and I'll use cbind.data.frame, which puts them together and makes them a data frame, saving them into a new object called df1, for data frame one. We see the result; let me zoom in a little. We have nine rows of data, the variable x1 that I created, then y, and the nine row indexes down the side. Please note that x1 only had three values, so R recycled it; you see it repeating three times: 1 2 3, 1 2 3, 1 2 3. What we want to find out is what kind of variable x1 is in this data frame. Well, it's an integer, and when we get the structure, it shows it's still an integer, looking at this line right here. But we can change it to a factor by using as.factor, and R will then treat it differently. I create a new one called x2 that, again, is just the numbers one through three, but now I'm telling R that these specifically represent a factor. Then I create a new data frame using this x2 that I saved as a factor, along with the one through nine in y. At this point it looks the same, but when we get the typeof, it's still an integer, that's fine, and then we get the structure of df2: now it tells us that x2, instead of being an integer, is a factor with three levels, and it gives the three levels in quotes, "1", "2", and "3", and then lists the data. If we want to take an existing variable and define it as a factor, we can do that too. I create yet another variable with three values, bind it to y in a data frame, and then use the function factor: I tell R to reclassify this variable, x3, as a factor, feed it back into the same place, and say that these are the levels. Because it's in parentheses, it shows us the result in the console, and there we have it. We get the type, and it's an integer, but the structure shows it again as a factor; that's one way to take an existing variable and turn it into a factor. If you want labels, we can do it this way. We do x4, again the one through three, and bind it to the one through nine to make a data frame. Here I take the existing variable in df4, x4, and give it labels, text labels: I say these are macOS, Windows, and Linux, three operating systems. Please note that I need to put those in the same order I want them to line up with the numbers, so one will be macOS, two will be Windows, and three will be Linux. I run that through, we pull it up, and now you can see how it changes the factor to the text labels, even though I entered it numerically.
I get the typeof to see what it is: it still says integer, even though it's showing me words. And the structure, this is an important one, let's zoom in for a second. The structure at the bottom says it's a factor with three levels, and it starts giving me the labels, but then it shows that underneath those are actually the numbers one, two, and three. If you're used to working with a program like SPSS, where you can have values and then put value labels on top of them, it's the same kind of concept. Then I want to show you how we can switch the order of things, and this gets a little confusing, so try it a couple of times and see if you can follow the logic. We create another variable, x5, again just the one, two, and three, and bind it to y; there's our data frame, just like in the other examples. Now I take that new variable, x5 in the data frame df5, and notice that here I'm listing the levels, but in a different order than I originally entered them, and then lining the labels up with that new order. When I run that through, you can see the labels here, Maybe, Yes, No, Maybe, Yes, No, showing us the nine values. And this is an interesting one: because the factor is ordered, it prints the levels with a less-than sign at each point to indicate which level comes first and which comes later. We can look at the actual data frame I made; I'll zoom in. We know the first case is a one, because when I created the variable it went 1, 2, 3, and you can see that Maybe is a one; by putting the levels in this new order, that value falls in the middle of the ordering. There may be situations where you want to do exactly that; I just want you to know you have this flexibility in creating your factor labels in R. Finally, we can check the type of this one: it's still an integer, because it's still coded numerically underneath, but we can get the structure and see how it all fits together. So factors give you the opportunity to assign labels to your variables and then use them as factors in various analyses; if you do experimental research, this sort of thing becomes really important, and it gives you an additional possibility as you define your numerical variables as factors for your own analyses.
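Here's the factors segment as code. This is a sketch: the narration is garbled in places, so the exact level ordering and label choices below are my reconstruction of the idea rather than the course's verbatim script:

x1 <- c(1:3)                       # 1 2 3, stored as integer
y  <- c(1:9)
df1 <- cbind.data.frame(x1, y)     # x1 is recycled: 1 2 3 1 2 3 1 2 3
typeof(df1$x1); str(df1)           # integer
x2  <- as.factor(c(1:3))           # same values, declared a factor up front
df2 <- cbind.data.frame(x2, y)
str(df2)                           # Factor w/ 3 levels "1","2","3"
x3  <- c(1:3)
df3 <- cbind.data.frame(x3, y)
df3$x3 <- factor(df3$x3, levels = c(1, 2, 3))  # redefine an existing variable
x4  <- c(1:3)
df4 <- cbind.data.frame(x4, y)
df4$x4 <- factor(df4$x4,
                 labels = c("macOS", "Windows", "Linux"))  # labels match 1, 2, 3
x5  <- c(1:3)
df5 <- cbind.data.frame(x5, y)
df5$x5 <- ordered(df5$x5,
                  levels = c(3, 1, 2),               # reordered levels
                  labels = c("No", "Maybe", "Yes"))  # so 1 = Maybe sits in the middle
df5
typeof(df5$x5)                     # still "integer" underneath
str(df5)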
The next step in our R introduction, still on accessing data, is entering data, where you type it in manually. I like to think of this as a kind of ad hoc data, because under most circumstances you would import a dataset, but there are situations where you need just a small amount of data right away, and you can type it in this way. There are several methods available: there's the colon operator; there's seq, for sequence; there's c, short for concatenate; there's scan; and there's rep. I'll show you how each of these works, and I'll also mention this little one, the less-than sign and dash, which is the assignment operator in R. Let's take a look; just open up this script and we'll give it a whirl. We begin with a short discussion of the assignment operator. The less-than sign and dash is used to assign values to a variable, which is why it's called the assignment operator. A lot of other programs would use an equals sign, but we use this one that looks like an arrow, and you read it as "gets": x gets five. It can point in the other direction, to the right, though that would be very unusual, and you can use an equals sign, R knows what you mean, but both are generally considered poor form. That's not just arbitrary: the Google style guide for R is specific about it. In RStudio you have a shortcut: press Option and the dash key and it inserts the assignment operator plus a space. I'll come down here right now, do Option+dash, and there you see it; a nice little shortcut for ad hoc data entry. Let's start with the colon operator, most of which you'll have seen already: you simply stick a colon between two numbers and it runs through them sequentially. So x1 is a variable I'm creating, then the assignment operator, and it gets 0:10, meaning the numbers zero through ten. There they all are; I'll delete the stray colon sitting in the console waiting for me to finish a command. If we want to go in descending order, we just put the higher number first: 10:0 goes the other way. Next, seq is short for sequence, and it's a way of being more specific about what you want. We can call up the help on seq, right over here, for sequence generation; there's the information. We can do ascending values: seq(10) gives one through ten, starting at one rather than zero. But you can also specify how much to jump by, so if you want to count down in threes, I do seq from 30 to 0 by negative three, meaning step down in threes. We run that one, and because it's in parentheses, it both saves to the environment and shows in the console right away. So those are ways of making sequential numbers, and they can be really helpful. If you want to enter an arbitrary collection of numbers in any order, you use c, which stands for concatenate; you can also think of it as combine or collect. We call up the help on that one, there it is, and we just take these numbers, you see, and combine them into the data object x5; we pull it up and it went right through. An interesting one is scan, which is for entering data live. We do scan here and get some help on it; you can see it reads data values. This one takes a little explanation: I create an object, x6, and feed into it scan with opening and closing parentheses, because I'm running the command. Here's what happens: I run it, and down in the console you see it shows a 1 and a colon. I can just start typing numbers, hitting Enter after each one, and type in however many I want; when I'm done, I hit Enter twice and it reads them all. If you want to see what's in there, come back up and call the name of the object; there are the numbers I entered.
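Here's the sequential-entry segment as code; a sketch following the narration, with my own example values where the originals weren't stated:

x1 <- 0:10                   # colon operator: zero through ten
x1
x2 <- 10:0                   # descending
(x3 <- seq(10))              # 1 through 10; parentheses save and print
(x4 <- seq(30, 0, by = -3))  # count down by threes
x5 <- c(5, 4, 1, 6, 7, 2, 2, 3, 2, 8)  # arbitrary collection
x6 <- scan()                 # type values in the console, Enter after each,
                             # Enter twice to finish
x6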
And so there may be situations where that makes it a lot easier to enter data, especially if you're working on a 10-key pad. Now, rep, as you can guess, is for repetition. We call up the help on that one: replicate elements. Here's what we do: x7 gets rep of TRUE, five times, and when we call x7, there are our five TRUEs, all in a row. If you want to repeat more than one value, it depends how you set things up. Here I repeat TRUE and FALSE as a set, using c, the concatenate, to collect the set, and rep repeats that set in order five times: TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, and so on. That's fine, but if you want the first value five times and then the second value five times, think of it like collating on a photocopier: if you don't want it collated, you use each, and that gives TRUE five times and then FALSE five times. So these are various ways you can set up data, entering it for an ad hoc, as-needed analysis. It's also a way of checking how functions work, as I've done in a lot of examples here, so explore the possibilities and see how you can use it in your own work.
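And here's rep as code; a minimal sketch:

x7 <- rep(TRUE, 5)                  # TRUE five times
x8 <- rep(c(TRUE, FALSE), 5)        # the pair, five times: T F T F ...
x9 <- rep(c(TRUE, FALSE), each = 5) # each value five times: T T T T T F F F F F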
The next step in our R introduction on accessing data is importing data, which will probably be the most common way of getting data into R. The goal is to make it easy: get the data in, get a large amount, get it in quickly, and get processing as soon as you can. There are a few kinds of data files you might want to import. There are CSV files; CSV stands for comma-separated values, and it's sort of the plain-text version of a spreadsheet. Any spreadsheet program can export data as CSV, and nearly any data program at all can read it. There are also straight text files, .txt, which can be opened in text editors and word processors. Then there are .xlsx files, Excel spreadsheets, as well as the older .xls version. And finally, if you're getting fancy, you can import JSON, JavaScript Object Notation; if you're working with web data, you might be dealing with that. Now, R has built-in functions for importing data in many formats, including the ones I just mentioned, but if you really want to make your life easy, you can use just one: a package I load every time I use R is rio, short for R input/output. What rio does is combine all of R's import functions into one simple utility with consistent syntax and functionality; it makes life so much easier. Let's see how this works; open up this script and we'll run through the examples. But there's one thing you'll want to do first: go to the course files we downloaded at the beginning of this course. Along with the individual R scripts, the significant thing here is this folder, a collection of three data files all called mbb. They're called that because they contain Google Trends information on searches for Mozart, Beethoven, and Bach, three major classical music composers, and it's all about the relative popularity of those three search terms over a period of several years. I have the data in CSV, comma-separated value, format; as a text file, .txt; and even as an Excel spreadsheet, .xlsx. Now let's go to R and open up each one. The first thing to do is make sure you have rio. I've done this already; rio is one of the packages I load every time, so I use pacman and do my standard loading of packages, and rio is available now. I do want to tell you one significant thing about Excel files, and for that we'll go to the official R documentation. If you click this link, it opens your web browser at the R documentation page, and here's what it says; I'll read it nearly verbatim: "The most common R data import/export question seems to be: how do I read an Excel spreadsheet? This chapter collects together advice and options given earlier. Note that most of the advice is for pre-Excel 2007 spreadsheets and not the later .xlsx format. The first piece of advice is to avoid doing so if possible. If you have access to Excel, export the data you want from Excel in tab-delimited or comma-separated form, and use read.delim or read.csv to import it into R. You may need to use read.delim2 or read.csv2 in a locale that uses comma as the decimal point. Exporting a DIF file and reading it using read.DIF is another possibility." Okay, so really what they're saying is: don't do it. Well, let's go back to R; as it says right here, you have been warned, but let's make life easy by using rio. If you've saved the three files to your desktop, it's really easy to import them. We start with the CSV. rio_csv is the name of the object I'm importing into, and all we need is the command import; we don't have to specify that it's a CSV, or that it has headers, or anything. We just use import and then, in quotes inside the parentheses, put the name and location of the file; on a Mac, the path to the desktop looks like this. I run that, and you can see it just showed up in my environment at the top right; I'll expand that a little, and I now have a data frame. Let's look at its first few rows. I'll zoom in: we have the months listed, and then the relative popularity of searches for Mozart, Beethoven, and Bach during those months. If I want to read the text file, what's really nice is that I can use the exact same import command; I just give the location and name of the file, adding the .txt. I run that, we look at the head, and it's exactly the same. No difference; piece of cake. What's nice about rio is that I can even do the .xlsx file; it helps that there's only one tab in it, and that it's set up to look exactly like the others. We run through, and once again it's the same thing. rio was able to read all of these automatically, which makes life very easy.
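Here's the rio segment as code. A sketch: it assumes the three course files are saved on the desktop under these names, and the Mac-style paths are illustrative:

p_load(rio)                               # assumes pacman is loaded
rio_csv  <- import("~/Desktop/mbb.csv")
head(rio_csv)
rio_txt  <- import("~/Desktop/mbb.txt")   # same command, different extension
rio_xlsx <- import("~/Desktop/mbb.xlsx")  # even Excel, no extra arguments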
Another neat thing is that R has something called a Data Viewer; we can get a little information on it from the help, and you invoke it with View, with a capital V, and then say what it is you want to see, in this case rio_csv. When we run that command, it opens a new tab, and it's like a spreadsheet right here. In fact, it's sortable: click a column header and you go from lowest to highest and back, and you can see that Mozart is actually setting the range here. That's one way to do it. You can also come over to the environment and click this little icon; it looks like a calendar, but it's in fact the same thing. Double-click it, and we get a viewer for that file as well. I'll close both of those and show you the built-in R commands for reading files. These are the commands rio uses under the hood, and while we don't have to go through all of them, you may encounter them in a lot of existing code, because not everybody uses rio, and I want you to see how they work. If you have a text file saved in tab-delimited format, you need the complete address, and you might try read.table, which is normally the command, saying header = TRUE because there are variable names across the top. But when you read this file that way, you get an error message, which is, you know, frustrating. That's because there are missing values in the top-left corner, so we need to be more specific about the separator. I do the same thing, read.table with the name and location of the file and header = TRUE, but this time I say the separator is a tab: sep = "\t", where the backslash-t indicates a tab character. If I run that one, it reads properly. We can also do the CSV, and the nice thing is that you don't have to specify the delimiter there, because CSV means comma-separated, so read.csv knows what it is. I read that one in the exact same way, and if I want, I can come over, click on the viewer, and see the data that way too. So it's really easy to import data, especially if you use the rio package, which automatically detects the format, gets the data in properly, and gets you started on your analyses as soon as possible.
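Here are the viewer and the base-R import commands as code; a sketch with the same assumed file locations:

View(rio_csv)   # capital V: opens the sortable data viewer
r_txt1 <- read.table("~/Desktop/mbb.txt",
                     header = TRUE, sep = "\t")  # must specify the tab separator
r_csv1 <- read.csv("~/Desktop/mbb.csv",
                   header = TRUE)                # comma is implied for CSV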
Now, the part of this R introduction that maybe most of you were waiting for is modeling data. Because this is a very short introductory course, I'm really just giving a tiny overview of a handful of common procedures; in other courses here at datalab.cc we'll have much more thorough investigations of common statistical modeling and machine learning algorithms. Right now I just want to give you a flavor of what can be done in R, and we'll start with a common procedure: hierarchical clustering. Clustering methods are ways of finding which cases, or observations, in your data belong with each other; you can think of it as the idea of like with like, which cases are like which other ones. This depends, of course, on your criteria, on how you measure similarity and distance, and there are a few decisions to make. You can take what's called a hierarchical approach, which is what we'll do, or you can ask for a set number of groups, called k. You also have many choices for measures of distance, and there's a choice between what's called divisive clustering, where you start with everything in one group and split it apart, and agglomerative clustering, where the cases all start separately and you selectively join them together. We're going to keep life simple and do the single most common kind of clustering: we'll use Euclidean distance, and we'll do hierarchical clustering, so we don't have to set the number of groups in advance. R's hclust function builds the hierarchy agglomeratively, merging cases step by step, although it's natural to read the finished tree from the top down, as if one big group were gradually splitting. Let me show you how this works in R, and what you'll find is that even though this may sound like a sophisticated technique, and a lot of the mathematics is, it's really not hard to do in practice. We'll use a dataset we work with frequently. I load my default packages to get things ready, then bring in the datasets package; we're going to use mtcars, which, if you recall, is Motor Trend car road test data from 1974. There are 32 cars in there, and we're going to see how they group, which cars are similar to which other ones. Let's look at the first few rows of data to see the variables: we have mpg, cylinders, displacement, and so on and so forth. Not all of these are going to be really influential or useful, so I'll drop a few and create a new dataset with just the ones I want. If you want to see how I do that: I come back here and create a new object, a new data frame called cars, and it gets the data from mtcars. Putting a blank in the space before the comma means use all of the rows, and then I select the columns: c, for concatenate, with columns one through four, skip five, six and seven, skip eight, and then nine through eleven. That's my way of selecting the variables. I run it, and you see cars now showing up in my environment at the top right. Let's look at the head of the new dataset; I'll zoom in, and you can see it's a little smaller: we have mpg, cylinders, displacement, horsepower, weight, quarter-mile seconds, and so on. Now we do the cluster analysis, and what we'll find is that with the defaults it's super, super easy. In fact, I'm going to use something called pipes, from the package dplyr, which is why I loaded it. The pipe is this %>% thing right here, and what it lets you do is take the result of one step and feed it directly in as the input to the next step; otherwise this would be several separate commands, but I can run it really quickly. I create an object called hc, for hierarchical clusters; we read the cars data I just created; we get the distance, or dissimilarity, matrix, which says how far each observation is, in Euclidean space, from each of the others; and then we feed that through the hierarchical clustering routine, hclust. That saves everything into an object, and now all we need to do is plot the results: plot hc, my hierarchical cluster object. We get this very busy chart over here, a tree diagram called a dendrogram.
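Here's the clustering demo as code, including the cluster boxes we'll draw in a moment. A sketch reconstructed from the narration; the border colors are my approximations of the ones described:

library(datasets)
library(dplyr)                        # for the pipe operator %>%
cars <- mtcars[, c(1:4, 6:7, 9:11)]   # all rows, selected columns
hc <- cars %>%                        # take the cars data...
  dist %>%                            # ...compute Euclidean distances...
  hclust                              # ...and cluster hierarchically
plot(hc)                              # dendrogram
rect.hclust(hc, k = 2, border = "gray")     # boxes around 2 clusters
rect.hclust(hc, k = 3, border = "blue")
rect.hclust(hc, k = 4, border = "green4")
rect.hclust(hc, k = 5, border = "darkred")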
It's called a dendrogram because it branches like a tree, although here it looks more like roots: the cases all start out together at the top, and then they split, and split, and split again. Now, if you know your cars from 1974, you can see that some of these pairings make sense. Here we have the Honda Civic and the Toyota Corolla, both still in production, right next to each other. The Fiat 128 and the Fiat X1-9 were both small Italian cars; they were different in many ways, but you can see they sit right next to each other. The Ferrari Dino and the Lotus Europa make sense together, and if we come over here, it's no surprise that the Lincoln Continental, the Cadillac Fleetwood, and the Chrysler Imperial are neighbors. What is interesting is this one here, the Maserati Bora: it's totally separate from everything else, because it was a very unusual, very different kind of car at the time. One really important thing to remember is that the clustering is only valid for these data points, based on the data I gave it. I only supplied a handful of variables, so it has to use those to make the clusters; with different variables or different observations, we could end up with a very different clustering. But I want to show you one more thing we can do to make the clusters even easier to read. Let me zoom back out. We're going to draw boxes around the clusters, starting with two boxes that have gray borders. I run that one and you can see it showed up; then we make three blue ones, four green ones, and five dark red ones. Let me zoom in again: now it's easier to see what the groups are in this particular dataset. We have here, for instance, the Hornet 4 Drive, the Valiant, the Mercedes-Benz 450SLC, the Dodge Challenger, and the AMC Javelin all clumping together in one general group, and then we have these other really big American cars. What's interesting, again, is that the Maserati Bora is off by itself almost immediately; that's a little surprising, because the Ford Pantera L has a lot in common with it. But this is a way of seeing, based on the information I gave it, how things cluster. If you're doing market analysis, if you're trying to find out who's in your audience, if you're trying to find out which groups of people think in similar ways, this is an approach you'll probably use, and you can see that it's really simple to set up, at least with the defaults in R, as a way of finding the regularities and consistent groupings in your data. As we continue our very brief introduction to modeling data in R, another common procedure to look at briefly is called principal components. The idea is that in certain situations, less is more: less noise and fewer unhelpful variables in your data can translate to more meaning, and that's what we're after in any case. This approach is also known as dimensionality reduction, and I like to think of it by analogy. You look at this photo, and what you see are big black outlines of people; you can tell basically how tall they are, what they're wearing, where they're going.
It takes a moment to realize that you're actually looking at a photograph pointing straight down: the people are there at the bottom, and you're looking at their shadows. We're trying to do a similar thing with data. Even though these are shadows, you can still tell a lot about the people; people are three-dimensional and shadows are two-dimensional, but we've retained almost all of the important information. If you want to do this with data, the most common method is principal component analysis, or PCA, and let me walk through the steps metaphorically. You begin with two variables, so here's a scatterplot, X across the bottom and Y up the side; this is artificial data, and you can see a strong linear association between the two. We draw a regression line through the dataset, there at about 45 degrees, and then measure the perpendicular distance of each data point to the line. Not the vertical distance, which is what we'd use for regression residuals, but the perpendicular distance; that's what those red lines are. Then we collapse the data by sliding each point down its red line onto the regression line, and that's what we have there. Finally, we have the option of rotating the result so it's no longer on the diagonal but flat, and that there is the PC, the principal component. Let's recap what we accomplished: we went from a two-dimensional dataset to a one-dimensional dataset but maintained some of the information in the data; I like to think we maintained most of it, and hopefully the most important information. The reason we do this is that it makes the analysis and interpretation easier and more reliable: going from something complex, in two or more dimensions, down to fewer dimensions generally means something simpler to make sense of. Let me show you how this works; open up this script and we'll go through an example in RStudio. We first load our packages, because I'll be using a few of them, along with the datasets package. I'm going to use the mtcars dataset, which we've seen a lot, and create a little subset of variables. Let's look at the full list of variables first; I don't want all of those in this particular analysis, so the same way I did with hierarchical clustering, I create a subset by dropping a few of them, and we take a look at it. Zooming in, there are the first six cases of my slightly reduced dataset. We're going to use it to see whether we can get to fewer dimensions than the nine variables we have here while still maintaining the important information. We start by computing the PCA, the principal component analysis, on the entire data frame, feeding it into an object called pc, for principal components. There's more than one way to do this in R, but I want to use prcomp, and this part specifies the dataset I'm going to use.
And I'm going to add two optional arguments. One is centering the data, which means shifting the variables so their means are zero. The other is scaling the data, which compresses or expands each variable's range so it has unit variance, a variance of one. Together they put all the variables on the same scale and keep any one variable from overwhelming the analysis. I run through that, and now we have a new object showing up on the right. If you want, you can also specify the variables explicitly with a formula. The tilde here, with nothing on its left, means build the components from all of these variables; I can give the variable names all the way through, then say what dataset they come from, data = mtcars, with the centering and scaling there as well. It produces exactly the same thing; it's just two different ways of writing the same command. To examine the results, we come down and get a summary of the pc object I created. I click on that and zoom in. The summary describes nine components, PC1, principal component one, through PC9, principal component nine; you get as many components as you had original variables, but the question is how the variation gets divided among them. Look at principal component one, with a standard deviation of 2.3391: since each standardized variable starts with a standard deviation of one, that single component carries as much spread as more than two of the original variables. The second has about 1.59, and the rest fall below one unit of standard deviation, which suggests they're probably not very important in this analysis. We can get a scree plot for the components, to see how much of the original variance each one explains, and zooming in, our first component is really big and important, our second is smaller but still clearly above the rest, and then we kind of grind down from there. There are several different criteria for choosing how many components to keep and what to do with them; right now we're just eyeballing it, and we see that number one is really big and number two is sort of a minor axis in our data. If you want the standard deviations and something called the rotation, just call pc; we zoom in on the console and scroll back up a little. It's a lot of numbers. The standard deviations here are the same as the first row of the summary, so that just repeats it: the first is really big, the second smaller. What the rotation shows is the association between each individual variable and the nine components; you can read the values roughly like correlations. Coming back, let's see how individual cases load on the PCs. I use predict, run it on pc, then feed the results through the pipe and round them off so they're a little more readable. Zooming in, we have the nine components listed and all of our cars, but the first two components are probably the ones that matter most, and you can see some large scores on them, values like 2.49 and so on.
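Here's the PCA segment as code; a sketch that assumes the cars subset from the clustering demo is still in the environment:

pc <- prcomp(cars, center = TRUE, scale. = TRUE)  # standardize, then extract components
# Equivalent formula interface (nothing on the left of the tilde):
# pc <- prcomp(~ mpg + cyl + disp + hp + wt + qsec + am + gear + carb,
#              data = mtcars, center = TRUE, scale. = TRUE)
summary(pc)             # importance of the nine components
plot(pc)                # scree-style plot of component variances
pc                      # standard deviations plus the rotation (loadings)
round(predict(pc), 2)   # each case's score on each component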
But probably the easiest way to deal with all of this is to make a plot, and we'll use something with the funny name of biplot. What that means here is a two-dimensional plot: really, all it does is chart the first two components, but that's fine, because based on our analysis it's only the first two that seem to matter anyhow. So let's do the biplot. It's a very busy chart, but if we zoom in we can see a little better what's going on. We have the first principal component across the bottom and the second one up the side. The red lines indicate, approximately, the direction of each individual variable's contribution to the components, and each case appears by name about where it falls. Now, if you remember from the hierarchical clustering, the Maserati Bora was really unusual, and you can see it's up there all by itself. What we seem to have over here is displacement, weight, cylinders, and horsepower: big, heavy cars going in this direction. Then we have the Honda Civic, the Porsche 914-2, and the Lotus Europa, small cars with smaller, more efficient engines; fast cars up here; and slow cars down there. So it's pretty easy to see what's going on in terms of clustering the variables: with hierarchical clustering we clustered cases, and now we're looking at clusters of variables. It might work to talk about big versus small and slow versus fast as the important dimensions in our data, as a way of getting insight into what's happening and directing our subsequent analyses. Let's finish our very short introduction to modeling data in R with a brief discussion of regression, probably one of the most common and powerful methods for analyzing data. I like to think of it as the analytical version of E Pluribus Unum, out of many, one: in the data science sense, out of many variables, one variable; or, to put it one more way, out of many scores, one score. The idea with regression is that you use many different variables simultaneously to predict scores on one particular outcome variable. And there's so much going on here that I like to think there's something for everyone: there are many versions and many adaptations of regression, which make it flexible and powerful for nearly anything you're trying to do. We'll take a look at some of these, so let's try it in R; just open up this script and we'll see how regression adapts to a number of different tasks. When we come to the script, we scroll down a little and install some packages; we'll be using several in this one, and I'll load them along with the datasets package, because we're using a dataset from it called USJudgeRatings. Let's get some information on it: it contains lawyers' ratings of state judges in the US Superior Court. We look at the first few cases with head; I'll zoom in. What we have are six judges listed by name, with scores on a number of different variables, like diligence and demeanor, finishing with whether they're considered worthy of retention; that's RTEN, for retention. Let's scroll back out.
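As code, the plot plus the setup for the regression example; a minimal sketch:

biplot(pc)              # PC1 across the bottom, PC2 up the side
library(datasets)
head(USJudgeRatings)    # six judges; the ratings end with RTEN, retention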
And what we might want to do is use all these different judgments to predict whether lawyers think these judges should be retained on the bench. Now, we're going to use a couple of shortcuts that can make working with regression kind of nice. First, we're going to take our data set and feed it into an object called data, so that shows up now in our environment on the top right. Then we're going to define our variables. You don't have to do this, but it makes the code really easy to use; plus, if you do this, you can reuse the same code without having to rewrite it every time you do an analysis.

So what we're going to do is create an object called x. It's actually going to be a matrix, and it's going to consist of all of our predictor variables simultaneously. The way I do this is with as.matrix: I take data, which is what we defined right here, and use all of the columns except number 12. That's the one called retention, which is our outcome; the minus means don't include that column, but do include all the others. So I run that, and now I have an object called x. For the second one, I go to data again, and this blank before the comma means use all of the rows, but read only the 12th column, the one that has retention, our outcome. So, following standard notation, x holds all of our predictor variables and y is our single outcome variable.

Now, the easiest version of regression is called simultaneous entry: you use all of the x variables at once, throwing them into one big equation to predict your single outcome. And in R we use lm, which stands for linear model. What we have here is y, our outcome variable; then the tilde, which means "is predicted by" or "as a function of"; and then x, all of our variables together being used as predictors. This is the simplest possible version, and we'll save it into an object called reg1, for regression one. Now, if you want to be a little more explicit, you can name the individual variables: you can say that RTEN, retention, is a function of, or is predicted by, all of these other variables, and then say that they come from the data set USJudgeRatings, so we don't have to write the data set name and a dollar sign before each of them. That gives me the exact same thing, so I don't need to run that one explicitly. There's a code sketch of all these steps just below.

If you want to see the results, we just call the object that we created from the linear model, and I'm going to zoom in on that. What we have are the coefficients. This is the intercept, which starts around minus two, and then the coefficient for each predictor: the amount the prediction changes for each one-unit step up on that variable. You'll see, by the way, that it has changed the name of each of the variables to add an "x", because they're coming from the matrix x now; that's fine. We can do inferential tests on these individual coefficients by asking for a summary. We click on that and we'll zoom in. Now you can see there's the value that we had previously, but also a standard error, then the t-test, and then over here the probability value. The asterisks indicate values that are below the standard probability cutoff of .05. Now, we expect the intercept to be below that. And I can see, for instance, that this one, integrity, has a lot to do with people's judgment of whether a person should be retained.
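Before we go on, here's a minimal sketch of that setup; reg1 is just my name for the model object:

```r
# Reusable shortcuts: a generic data object, a predictor
# matrix x, and an outcome vector y
data <- USJudgeRatings
x <- as.matrix(data[-12])   # every column except 12 (RTEN)
y <- data[, 12]             # all rows, just column 12 (RTEN)

# Simultaneous entry: all predictors at once
reg1 <- lm(y ~ x)

# The same model with every variable named explicitly
reg1 <- lm(RTEN ~ CONT + INTG + DMNR + DILG + CFMG + DECI +
                  PREP + FAMI + ORAL + WRIT + PHYS,
           data = USJudgeRatings)

reg1   # print the coefficients
```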
Back in that summary, this one, physical, which really asks whether they're healthy, matters too, and we have some others that are on their way. And this is a nice one overall: if you come down here, you can see the multiple R-squared. It's super high, and what it means is that these variables collectively predicted very, very well whether the lawyers felt the judges should be retained.

Let's go back now to our script. You can get some more summary data here if you want. We can get the analysis of variance table, the ANOVA table; we click on that and zoom in, and you can see that we have our residuals and the y. Come back out, and we do the coefficients: here are the regression coefficients we saw previously; this is just a different way of getting at the same information. We can also get confidence intervals. Let's zoom in on that. Now we have a 95% confidence interval, so the 2.5% value on the low end and the 97.5% value on the top end, in terms of what each of the coefficients would be. We can get the residuals on a case-by-case basis; let's do this one. When we zoom in on that, it's a little hard to read in and of itself, because they're just numbers. An easier way to deal with that is to get a histogram of the residuals from the model. To do that, we just run this command, and then I'll zoom in. You can see that it's a little bit skewed, mostly around zero; we've got one judge out on the high end, but mostly these are pretty good predictions. Come back out.

Now I want to show you something a little more complicated: we're going to do different kinds of regression. I'm going to use two additional libraries for this. One is called lars, which stands for least angle regression, and the other is caret, which stands for classification and regression training. We'll load those two. Then we're going to do a conventional stepwise regression. A lot of people say there are problems with it, but I'm just going to run it really fast. There's our stepwise regression. Then we're going to do something from lars called stagewise; it's similar to stepwise, but it has better generalizability. We run that through. We can also do least angle regression. And then, really, one of my favorites is the lasso; that's the least absolute shrinkage and selection operator. Now, I'm running through just the absolute bare-minimum versions of these; there's a lot more we would want to do to explore them. But what I'm going to do is compare the predictive ability of each of them, feeding the results into an object that holds a comparison of the R-squared values. Here I specify where the R-squared is in each of them, giving a little index number for each, then we round off the values, and I give them names: stepwise, then forward, then LAR, then lasso. And we can see the values. What this shows us at the bottom is that all of them were able to predict super well. But we knew that, because when we did the standard simultaneous entry, there was amazingly high predictive ability within this data set. You will find situations in which each of these varies a little bit, and maybe sometimes they vary a lot. But the point here is that there are many different ways of doing regression, and R makes them available for whatever you want to do. So explore your possibilities and see what seems to fit.
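Here's a sketch of those follow-up commands. The names stepwise, forward, lar, lasso, and r2comp are mine, and where the video indexes a particular step in each fit, I simply take the final R-squared value as a simplification:

```r
library(lars)       # least angle regression and related methods
library(magrittr)   # the %>% pipe
# (the video also loads caret, but it isn't needed for this sketch)

# Follow-up summaries on the simultaneous-entry model
summary(reg1)           # coefficients with t-tests and p-values
anova(reg1)             # the ANOVA table
coef(reg1)              # the coefficients alone
confint(reg1)           # 95% confidence intervals
resid(reg1)             # case-by-case residuals
hist(residuals(reg1))   # histogram of the residuals

# Four variable-selection methods from lars
stepwise <- lars(x, y, type = "stepwise")           # conventional stepwise
forward  <- lars(x, y, type = "forward.stagewise")  # stagewise
lar      <- lars(x, y, type = "lar")                # least angle regression
lasso    <- lars(x, y, type = "lasso")              # the lasso

# Compare R-squared values across the four fits
r2comp <- c(stepwise = tail(stepwise$R2, 1),
            forward  = tail(forward$R2, 1),
            lar      = tail(lar$R2, 1),
            lasso    = tail(lasso$R2, 1)) %>% round(2)
r2comp
```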
In other courses, we'll talk much more about what each of these means, how they can be applied, and how they can be interpreted. But right now, I simply want you to know that they exist and that they can be done, at least in principle, in a very simple way in R.

And so that brings us to the end of R: An Introduction. I want to make a brief conclusion, primarily to give you some next steps, other things you can do as you learn to work more with R. Now, we have a lot of resources available. Number one, we have additional courses on R at datalab.cc, and I encourage you to explore each of them. If you like R, you might also like working with Python, another very popular language for data science, which has the advantage of also being a general-purpose programming language. Almost everything we do in R can also be done in Python, and it's nice to compare and contrast the two with the courses we have at datalab.cc. I'd also recommend you spend some time simply on the concepts and practice of data visualization. R has fabulous packages for data visualization, but understanding what you're trying to achieve and designing quality visualizations is sort of a separate issue, so I encourage you to get the design training from our other courses on visualization. And then, finally, a major topic is machine learning, or methods for processing large amounts of data and getting predictions from one set of data that can be applied usefully to others. We cover that for both R and Python, among other tools, here at Datalab. Take a look at all of them and see how well you can use them in your own work.

Another thing you can do is look at the annual R user conference, useR!, which is "use" with a capital R and an exclamation point. There are also local R user groups, or RUGs. And I have to say that, unfortunately, there is not yet an official R Day. But consider September 19: it's International Talk Like a Pirate Day, and we like to think that pirates say "Arr". So that can be our unofficial day for celebrating the statistical programming language R. In any case, I'd like to thank you for joining me for this, and I wish you happy computing.
Info
Channel: freeCodeCamp.org
Views: 1,888,908
Rating: 4.8873711 out of 5
Keywords: R (Programming Language), Data Science, Feature Engineering, Visualization, Data Exploration, R Programming, R Programming Tutorial, R Programming Training, Data Science with R, Data Scientist, Machine Learning with R, Programming, Tutorial, Training, Data Science Training, Data Science Tutorial, Machine Learning, Data Analysis, Data Visualization, Data Science with R Programming, language, tutorial, programming, r tutorial, r course, r programming course
Id: _V8eKsto3Ug
Length: 130min 39sec (7839 seconds)
Published: Thu Jun 06 2019