Welcome to R: An Introduction. I'm Barton Poulson, and my goal in this course is to introduce you to R, which is arguably the language of data science. And just so you don't think I'm making stuff up off the top of my head, I have some actual data. This is a ranking from a survey of data mining experts on the software that they use most often in their work. Take a look: here at the top, R is first. In fact, its usage is about 50% higher than Python's, which is another major tool in data science. Both of them are important, but you can see why I personally am fond of R, and why it's the one I want to start with in introducing you to data science.

Now, there are a few reasons that R is especially important. Number one, it's free and open source, whereas other software packages can cost thousands of dollars per year. Also, R is optimized for vector operations, which means you can go through an entire row, or an entire table of data, without having to explicitly write for loops. If you've ever had to do that, then you know it's a pain, so this is a nice thing. Also, R has an amazing community behind it, where you can find supportive people, get examples of whatever it is you need to do, and see new developments all the time. Plus, R has over 9,000 contributed, or third-party, packages available, which make it possible to do basically anything. Or, to put it in the words of Yoda, you can say this: "This is R. There is no if, only how," and in this case I'm quoting R user Simon Blomberg.

So, very briefly, in sum, here's why I want to introduce you to R: number one, because R is the language of data science; because it's free and open source; and because the free packages that you can download and install make it possible to do nearly anything when you're working with data. I'm really glad you're here, and that I'll have this chance to show you how you can use R to do your own work with data in a more productive, more interesting, and more effective way. Thanks for joining me.

The first thing that we need to
do for R: An Introduction is to get set up. More specifically, we need to talk about installing R. The way you do this is to download it: you just need to go to the home page for The R Project for Statistical Computing, at r-project.org. When you get there, you can click on the link in the first paragraph that says "download R," and that'll bring you to a page that lists all the places you can download it from. I find the easiest is to simply go to the top one, labeled "cloud," because that'll automatically direct you to whichever of the mirrors below is best for your location. When you click on that, you'll end up at the Comprehensive R Archive Network, or CRAN, which we'll see again in this course. You need to come here and click on your operating system. If you're on a Mac, it'll take you to a page where the version you want to click on is just right here: it's a .pkg file, a zipped application installation file. Click on that, download it, and follow the standard installation directions. If you're on a Windows PC, then you're probably going to want the one labeled "base"; again, click on it, download it, and go through the standard installation procedure. And if you're on a Linux computer, you're probably already familiar with what you need to do, so I'm not going to run through that.

Now, before we get a look at what R is actually like when you open it, there's one other thing you need to do, and that is to get the files we're going to be using in this course. On the page where you found this video, there's a link that says "download files." If you click on that, you'll download a zipped folder called R01_intro_files. Download it, unzip it, and, if you want, put it on your desktop. When you open it, you're going to see something like this: a single folder on your desktop. If you click on it, it opens up a collection of scripts; the .R extension is for an R source, or script, file. I also have a folder with a few data files that we'll be using in one of these videos. If you simply double-click on this first file, whose full name you see here, it'll open up in R, and let me show you what that looks like. When you open
up the R application, you will probably get a set of windows that look like this. On the left is the source window, or the script window, where you actually do your programming. On the right is the console window, which shows you the output; right now it's got a bunch of boilerplate text. Coming over here on the left again: any line that begins with a pound sign (also called a hashtag, or octothorpe) is a commented line, which is not run. The other lines are code that can be run. By the way, you may notice a red warning just popped up on the right side; that's just telling us about something that has to do with changes in R, and it doesn't affect us. What I'm going to do here is put the cursor in this line and then hit Command (or Control) and Enter, which will run that line. You can see now that it's shown up over here, and what I've done is make a collection of datasets available to the program. Now I'm going to pick one of those datasets: the iris dataset, which is very well known, containing measurements of three species of the iris flower. We're going to run head to see the first six lines, and there we have the sepal length, sepal width, petal length, and petal width; in this case, the species is setosa. If you want to see a summary of the variables, to get some quick descriptive statistics, we can run this next line over here, and now I get the quartiles and the mean, as well as the frequency of the three different species of iris. On the other hand, it's really nice to get things visually, so I'm going to run the basic plot command for the entire dataset. It opens up a small window (I'm going to make it bigger), and it's a scatterplot of the measurements for the three kinds of irises, as well as some panels that include the three different categories. I'm going to close that window. And so that is basically what R looks like, and how R works, in its simplest possible version.

Now, before we leave, I'm actually going to take a moment to clean up the application and the memory: I'm going to detach, or remove, the datasets package that I added.
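The whole first session boils down to a handful of lines. Here is a minimal sketch of the same steps; the comments are mine, not from the course file:

```r
# Make R's built-in example datasets available
library(datasets)

# First six rows of the iris data
head(iris)

# Quick descriptive statistics for every variable
summary(iris)

# Scatterplot matrix for the whole data frame
plot(iris)

# Clean up: remove the datasets package from the search path
detach("package:datasets")
```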
I already closed the plot, so I don't need to do that separately. But what I can do is clear the console: I'm going to come up to Edit and down to Clear Console, and that cleans it out. And that is a very quick run-through of what R looks like in its native environment. In the next movie, I'm going to show you another application we can install, called RStudio, that sits on top of R and makes interacting with it a lot easier, a lot more organized, and really a lot more fun to work with.

The next step in R: An Introduction and
setting up is about something called RStudio. RStudio is a piece of software that you can download in addition to the R you've already installed, and its purpose is really simple: it makes working with R easier. There are a few different ways it does this. Number one, it has consistent commands. What's funny is that the different operating systems have slightly different keyboard commands for the same operations in R; RStudio fixes that and makes them the same whether you're on Mac, Windows, or Linux. Also, there's a unified interface: instead of having two, three, or seventeen windows open, you have one window with the information organized. It also makes it really easy to navigate with the keyboard and to manage the information you have in R. Let me show you how to do this.

But first, we have to install it. What you're going to need to do is go to RStudio's website, at rstudio.com. From there, click on "Download RStudio." That'll bring you to this page, or something like it, and you're going to want to choose the Desktop version. When you get there, you're going to want to download the free, community version, as opposed to the $1,000-a-year version, so click here on the left. Then you'll come to the list of installers for supported platforms, down here on the left; this is where you get to choose your operating system. Click the top one if you have Windows, the next one if you have a Mac, and then there are lots of different versions for Linux. Whichever one you need, click on it, download it, go through the standard installation process, and then open it up. Then let me show you what it's like working in RStudio. To do this, open up this file, and we'll see what it's like in RStudio. When you
open up RStudio, you get this one window that has several different panes in it. At the top we have the script, or source, window, and this is where you do your actual programming. You'll see that it looks really similar to what we had when I opened up the R application; the color is a little different, but that's something you can change in the preferences or options. The console is down here at the bottom, and that's where you get the text output. Over here is the environment pane, which shows the variables you've saved, if you're using any, and then plots and other information show up here in the bottom right. You have the option of rearranging things and changing what's there as much as you want; RStudio is a flexible environment, and you can resize things by simply dragging the divider between the areas.

So let me show you a quick example, using the exact same code as in my previous example, so you can see how it works in RStudio as opposed to the plain R application we used the first time. First, I'm going to load some data by using the datasets package; I press Command (or Control) and Enter to run that line, and you can see right here that it's run the command. Then I want to do a quick summary of the data, so I run head(iris), which shows the first six lines, and here it is down here; I can make that a little bigger if I want. Then I can do a summary by just coming back up here and pressing Command or Control and Enter, and, actually, I'm going to use a keyboard command to make the console bigger now. There we can see all of it: I have the same basic descriptive statistics and the same frequencies as before. I'll go back to how it was before and bring this pane down a little, and now we can do the plot. This time, you see, it shows up in this pane here on the side, which is nice; it's not a standalone window. Let me make it bigger (it takes a moment to adjust), and there we have the same information that we had in the R application, except here it's more organized, in a cohesive environment. And you see that I'm using keyboard shortcuts to move around, which makes life really easy for dealing with the information I have in R.

I'm going to do the same cleanup: I'm going to detach the package that I loaded; then there's a little command to clear the plots; and then, here in RStudio, I can run a funny little command that does the same as pressing Ctrl+L to clear the console for me. And that is a quick run-through of how you can do some very basic coding in RStudio, which, again, makes working with R more organized, more efficient, and easier to do overall.

In our very basic introduction to R and setting
up, there's one more thing I want to mention that makes working with R really amazing, and that's the packages you can download and install. Basically, you can think of them as giving you superpowers for your analysis, because with the packages that are available you can do basically anything. Specifically, packages are bundles of code, additional software that adds new functions to R and makes it possible to do new things.

Now, there are two general categories of packages. There are base packages: these are packages that are installed with R, so they're already there, but they're not loaded by default; that way R doesn't use more memory than it needs to. But more significant are the contributed, or third-party, packages. These are packages that need to be downloaded, installed, and then loaded separately, and when you get those, it makes things extraordinary. So you may ask yourself: where do you get these marvelous packages that make things so super-duper? Well, you have a few choices. Number one, you can go to CRAN, the Comprehensive R Archive Network; that's an official R site that lists packages along with their official documentation. Two, you can go to a site called Crantastic, which really is just a way of listing these things; when you click on the links, it redirects you back to CRAN. And third, you can also get R packages from GitHub, which is an entirely different process; if you're familiar with GitHub, it's not a big deal, and otherwise you don't usually need to deal with it.

Let's start with the first one, the Comprehensive R Archive Network, or CRAN. We saw this previously when we were downloading R. This time, we're going to cran.r-project.org, and we're specifically looking for the CRAN packages: that's going to be right here on the left, so click on Packages. When you open that, you'll have an interesting option, and that's to go to Task Views, which breaks the packages down by topic. So we have here packages that deal with Bayesian inference, packages that deal with chemometrics and computational physics, and so on and so forth. If you click on any one of those, it'll give you a short description of the packages that are available and what they're designed to do. Another place to get packages, as I said, is Crantastic, at crantastic.org. This is one that lists the most recently updated and the most popular packages, and it's a nice way of getting a sense of what people use most frequently, although it does redirect you back to CRAN for the actual downloading. And then, finally, at github.com, if you go to /trending/r, you'll see the most popular R packages on GitHub.

Now, regardless of how you get them, let me show you the ones that I use most often, and
I find these make working with R really a lot more effective and a lot easier. Now, they have kind of cryptic names. The first one is dplyr, which is for manipulating data frames. Then there's tidyr, for cleaning up information; stringr, for working with strings, or text information; lubridate, for manipulating date information; httr, for working with website data; ggvis, where the "gg" stands for "grammar of graphics," which is for interactive visualizations; ggplot2, which is probably the most common package for creating graphics, or data visualizations, in R; shiny, another one that allows you to create interactive applications you can install on websites; rio, for "R input/output," which is for importing and exporting data; and rmarkdown, which allows you to create what are called interactive notebooks, or rich documents, for sharing your information. Now, there are others, but there's one in particular that I find useful. I call it the one package to load them all, and it's pacman, which, not surprisingly, stands for "package manager." I'm going to demonstrate all of these in another course that we have here, but let me show you very quickly how to get them working. Let's try it in R. If you open up this file from the course
files, let me show you what it looks like. What we have here in RStudio is the file for this particular video. As I said, I use pacman; if you don't have it installed already, then run this installation line, which uses the standard installation command in R, and pacman will then show up here under Packages. Now, I already have it installed, so you can see it right there, but it's not currently loaded. That's because installing means making a package available on your hard drive, but loading means actually making it accessible to your current session.
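That install-versus-load distinction can be sketched like this; the guard around install.packages is my addition, so the one-time download (which needs an internet connection) only happens when pacman is actually missing:

```r
# Installing puts a package on your hard drive; loading attaches it to the
# current session. You install once, but load in every session you need it.

# One-time install from CRAN, skipped if pacman is already on disk
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman", repos = "https://cloud.r-project.org")
}

require(pacman)   # load with a confirmation message; returns TRUE/FALSE
library(pacman)   # load silently
```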
So then I need to load, or import, it, and I can do that in one of two ways. I can use require, which gives a confirmation message; when I do it like that, you see it prints that little sentence there. Or I can use library, which simply loads it without saying anything. You can see, by the way, that it's now checked off, so we know it's there. Now, if you have pacman installed, even if it's not loaded, you can actually use pacman to install other packages. So what I actually do, because I have pacman installed, is go straight to this line: you write pacman and then two colons, which says "use this command even though the package isn't loaded," and then I load an entire collection, all the packages I showed you, starting with pacman itself. So now I'm going to run this command. What's nice about pacman is that if you don't have a package, it will actually install it, make it available, and load it, and I've got to tell you, this is a much easier way to do it than the standard R routine. Then, for base packages, meaning the ones that come with R natively, like the datasets package, you still want to do it the standard way: you load and unload them separately. So now I've got that one available too, and I can do the work that I want to do. I'm actually not going to do that right now, because I'm going to show it to you in future videos, but now I have a whole collection of packages available that are going to give me a lot more functionality and make my work more effective.

I'm going to finish by simply unloading what I have here. If you want, with pacman you can unload specific packages, but the easiest way is to do p_unload(all), and what that does is unload all of the add-on, or contributed third-party, packages; you can see I've got the full list here of what was unloaded. However, for the base packages, like datasets, you need to use the standard R command detach, which I'll use right here. And then I'll clear my console. That's a very quick run-through of how packages can be found online, installed into R, and loaded to make more functions available to your code. I'll demonstrate how they work in basically every video from here on out, so you'll be able to see how to exploit their functionality to make your work a lot faster and a lot easier.

Probably the best place to start when you're
working with any statistics program is basic graphics, so you can get a quick visual impression of what you're dealing with. And the simplest command of all in R is the default plot command, also known as basic X-Y plotting, for the x- and y-axes on a graph. What's neat about R's plot command is that it adapts to the data types and to the number of variables that you're dealing with. Now, it's going to be a lot easier for me to simply show you how this works, so let's try it in R: just open up the script file, and we'll see how we can do some basic visualizations in R.

The first thing we're going to do is load some data from the datasets package that comes with R; we simply run library(datasets), and that loads it up. We're going to use the iris data, which I've shown you before and which you'll get to see many more times. Let's look at the first few lines; I'll zoom in on that. What this is, is the measurements of the sepal and petal length and width for three species of irises. It's a very famous dataset, about a hundred years old, and it's a great way of getting a quick feel for what we're able to do in R. I'll come back to the full window here.

What we're going to do first is get a little information about the plot command. To get help on something in R, you just type a question mark and the thing you want help for. We're in RStudio, so this opens up right here in the help pane, and you see we've got the whole set of information: all the parameters, additional links you can click on, and then examples here at the bottom. I'm going to come over here and use the command on a categorical variable first, because that's the most basic kind of data we have, and so Species, which has three different species, is what I want to use here. I'm going to type plot, and then in the parentheses you put what it is you want to plot. What I'm doing here is saying it's in the dataset iris (that's our data frame, actually), and then the dollar sign says "use this variable that's in that data." So that's how you specify the whole thing. And then we get an extremely simple three-bar chart; I'll zoom in on it. What it tells you is that we have three species of iris (setosa, versicolor, and virginica) and that we have 50 of each. It's nice to know that we have balanced groups, and that we have three groups, because that might affect some of the analyses that you do. And it's an extremely quick and easy way to begin looking at the data. I'll zoom back out.

Now let's look at a quantitative variable, so
one that's on an interval or ratio level of measurement. For this one, I'll do petal length, and you see I do the same thing: plot, then iris, then Petal.Length. Please note that I'm not telling R that this is now a quantitative variable; it's able to figure that out by itself. Now, this one's a little bit funny, because it's a scatterplot; I'm going to zoom in on it. The x-axis is the index number, or row number, in the dataset, so that one's really not helpful. It's the variable on the y-axis, the petal length, whose distribution you get to see. On the other hand, you know that we have 50 of each species: first we have the setosa, then the versicolor, and then the virginica, and so you can see that there are group differences on this measurement.

Now, what I'm going to do is ask for a specific kind of plot, to break it down more explicitly between the categories. That is, I'm going to put in two variables now: first my categorical Species, then a comma, and then Petal.Length, which is my quantitative measurement. I'm going to run that; again, you just hit Control, or Command, and Enter. And this is the one I'm looking for here; let's zoom in on that. Again, you see that it's adapted: it knows that the first variable I gave it is categorical and the second is quantitative, and the most common chart for that is a box plot. And so that's what it automatically chooses to do.
And you can see it's a good plot here: we can see very strong separation between the groups on this particular measurement. I'll zoom back out. Then let's try a quantitative pair, so now I'll do petal length and petal width; it's going to be a little bit different. I'll run that command, and now this one is a proper scatterplot, where we have one measurement across the bottom and one measurement up the side. You can see that there's a really strong positive association between these two: not surprisingly, as a petal gets longer, it generally also gets wider, so it just gets bigger overall. And then, finally, if I want to run the plot command on the entire dataset, the entire data frame, this is what happens: we do plot and then iris. We've seen this one in previous examples, but let me zoom in on it. What it is, is an entire matrix of scatterplots of the four quantitative variables, and then we have Species, which is kind of funny because it's not labeling the categories, but it shows us a dot plot for the measurements of each species. This is a really nice way, if you don't have too many variables, of getting a very quick, holistic impression of what's going on in your data. So the point of this is that the default plot command is able to adapt to the number of variables I give it, and to the kind of variables I give it, and it makes life really easy.

Now, I want you to know that it's possible to change the way these look, so I'm going to specify some options. I'm going to do the scatterplot again, where I say plot, and then in the parentheses I give the two arguments saying what I want in it: do the petal length, and do the petal width. Then I go to another line; I'm just separating with commas. If you want to, you can write this all as one really long line; I break it up because I think it makes it a little more readable. I'm going to specify the color with col, and then I use a hex code; that code is actually for the red that is used on the datalab homepage. Then pch, which is for "point character," is set to 19, which is a solid circle. Then I put a main title on it, and a label on the x-axis and a label on the y-axis. I'm going to run those now by doing Command or Control and Enter for each line, and you can see the plot builds up. When we've finished, we've got the whole thing; I'll zoom in on it again. This is the kind of plot that you could actually use in a presentation, or possibly in a publication. So even with the base command, we're able to get really good-looking, informative, and clean graphs.

Now, what's interesting is that the plot command
can do more than just show data; we can actually feed it formulas. If, for instance, you want a cosine, you do plot, and then cos, for cosine, and then you give the limits: I go from zero to two times pi, because that's the relevant range for cosine. I run that, and you can see the graph there drawing our little cosine curve. I can do an exponential function from one to five, and there it is, curving up. And I can do dnorm, which is the density of a normal distribution, from minus three to plus three, and there's the good old bell curve in the bottom right.
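Each of those curves is a single call; plot() accepts a function followed by the x-range to draw it over:

```r
# Cosine over one full period
plot(cos, 0, 2*pi)

# Exponential function from 1 to 5
plot(exp, 1, 5)

# Density of the standard normal distribution: the bell curve
plot(dnorm, -3, 3)
```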
And then we can use the same kind of options that we used earlier for our scatterplot. Here we say: do a plot of dnorm, so the bell curve, from minus three to plus three on the x-axis; change the color to red; set lwd, which is for "line width," to make the line thicker; and give it a title on the top, a label on the x-axis, and a label on the y-axis. We'll zoom in on that, and there is my new and improved, prettier, presentation-ready bell curve, which I got with the default plot command in R. So this is a really flexible and powerful command; also, it's in the base package. You'll see that we have a lot of other commands that can do even more elaborate things, but this is a great way to start: get a quick impression of your data, see what you're dealing with, and shape the analyses that you do subsequently.

The next step in R: An Introduction, and our
discussion of basic graphics, is bar charts. The reason I like to talk about bar charts is this: simple is good, and bar charts are the most basic graphic for the most basic data, so they're a wonderful place to start in your analysis. Let me show you how this works: just try it in R, open up this script, and let's run through and see how it works.

When you open up the file in RStudio, the first thing we want to do is come down here and load the datasets package. Then we scroll down a little and use a dataset called mtcars. Let's get a little information about it with the question mark and the name of the dataset: this is Motor Trend (that's a magazine) car road-test data from 1974, so you know these cars are over 40 years old. Let's take a look at the first few rows of what's in mtcars by doing head; I'm going to zoom in on this. What you can see is that we have a list of cars (the Mazda RX4 and the wagon version, the Datsun 710, the AMC Hornet; I actually remember these cars), and we have several variables on each of them. We have mpg, the miles per gallon; the number of cylinders; the displacement in cubic inches; the horsepower; the final drive ratio, which has to do with the axle; the weight; and the quarter-mile time in seconds (and these are a bunch of really, really slow cars). vs is for whether the cylinders are in a V or whether they are straight, or in line, and am is for automatic or manual transmission. Then, going to the next line, we have gear, which is the number of gears in the transmission, and carb, for how many carburetor barrels they have (and we don't even use carburetors anymore). Anyhow, that's what's in the dataset; I'll zoom back out.

Now if we want to do a really basic bar chart,
you might think that the most obvious thing to do would be to use R's barplot command (that's its name, for the bar chart), and then specify the dataset, mtcars, then the dollar sign, and then the variable that we want, cyl. You'd think that would work, but unfortunately it doesn't. Instead, what we get is this, which just goes through all the cases, row by row, and tells us how many cylinders are in each one. That's not a good chart; that's not what we want. So what we need to do is reformat the data a little. By the way, you would have to do the exact same thing if you wanted to make a bar chart in a spreadsheet like Excel or Google Sheets: you can't do it with the raw data; you first need to create a summary table. So what we're going to do here is use the command table. We say: take this variable from this dataset and make a table of it, and feed it into an object (you know, a data container) called cylinders. I'm going to run that one, and you see it just showed up in the top left; let me zoom in on that. So now I have in my environment a data object called cylinders: it's a table, it's got a length of three, it's got a size of about a thousand bytes, and it gives us a little more information. Let's go back to where we were.

Now that I've saved that information into cylinders, which just has the counts for each number of cylinders, I can run the barplot command, and now I get the kind of plot I expected to see. From this, we see that we have a fair number of cars with four cylinders, a smaller number with six, and, because this is 1974, a lot of eight-cylinder cars in this particular dataset. Now, we can also use the default plot command, which I showed you previously, on the same data; it does something a little different, making a line chart where the lines are the same length as the bars. I'd probably use barplot instead, because it's easier to tell what's going on, but this is a way of making a default chart that gives you the information you need for a categorical variable. Remember: simple is good, and that's a great way to start.

In our last video on basic graphics, we talked about bar
charts. If you have a quantitative variable, then the most basic kind of chart is a histogram. And
this is for data that is quantitative or scaled or measured, or interval or ratio level, all of those
are referring to basically the same thing. And in all of those, you want to get an idea of what
you have. And a histogram allows you to see what you have. Now there's a few things you're going
to be looking for with a histogram. Number one, you're going to be looking for the shape of the
distribution, is it symmetrical, is it skewed is a uni modal by modal, you're going to look for gaps
or big empty spaces in the distribution. You're also going to look for outliers, unusual scores,
because those can distort any of your subsequent analyses. He'll look for symmetry to see whether
you have the same number of high and low scores or whether you have to do some sort of adjustment
to the distribution. But this is going to be easier if we just try it in R. So open up this R
script file. And let's take a look at how we can do histograms in R. When you open up the file, the
first thing we need to do is come down here and load the data sets. We'll do this by running the
library command, I just do Ctrl or Command Enter. And then we can do the iris data set. Again, we've
looked at it before. But let's get a little bit of information from it by asking for help on Iris.
And there we have Edgar Anderson's Iris data, also known as Fisher's Iris data, because he
published an article on it. And here's the full set of information available on it from 1936. So
it's 80 years old. Let's take a look at the first few rows. Again, we've seen this before: sepal
and petal length and width for three species of iris. We're gonna do a basic histogram on the
four quantitative variables that are in here. And so I'm going to use just the hist command.
So hist and then the dataset iris and then the dollar sign to say which variable, and then Sepal
dot Length. I run that, and I get my first histogram. Let's zoom in on it a little bit. And what happens
here is of course, it's a basic sort of black line on white background, which is fine for exploratory
graphics. And it gives us a default title that says histogram of the variable, and it gives us
the clunky name, which is also on the x axis on the bottom. It automatically adjusts the x axis and
chooses about seven or nine bars, which is usually the best choice for a histogram. And then on the
left, it gives us the frequency or the count of how many observations are in that group. So
for instance, we have only five irises whose sepal length is between four and four and a half
centimeters, I think it is. Let's zoom back out. And let's do another one. Now, this time for
sepal width, you can see that's almost a perfect bell curve. And when we do petal length, we get
something different. Let me zoom in on that one. And this is where we see a big gap, we've got a
really strong bar there at the low end. In fact, it goes above the frequency axis. And then we
have a gap. And then sort of a bell curve that lets us know that there's something interesting
going on with the data that we're going to want to explore a little more fully. And then
we'll do another one for petal width, I'll just run this command. And you can see
the same kind of pattern here where there's a big clump at the low end, there's a gap. And
then there's sort of a bell curve beyond that. Now, another way to do this is to do the
histograms by groups. And that would be an obvious thing to do here, because we have three
different species of Iris. So what we're going to do here is we're going to put the graphs into
three rows, one above another, in one column. I'm going to do this by changing a parameter: par is
for parameters, and I'm giving it the number of rows that I want to have in my output. And
I need to give it a combination of numbers, I do this C, which is for concatenate, it means
treat these two numbers as one unit, where three is the number of rows, and then the one is the
number of columns. So I run that; it doesn't show anything just yet. And then I'm going to come down
and I'm going to do this more elaborate command, I'm going to do hist. That's the histogram that
we've been doing. I'm going to do petal width, except this time in square brackets, I'm going to
put a selector; this means use only these rows. And the way I do this is by saying I want to do it
for the setosa irises. So I say iris, that's the data set, and then dollar sign, and then species
is the variable. And then two equal signs, because in computers, that means is equivalent to, and then
in quotes, and you have to spell it exactly the same with the same capitalization: setosa.
So this is the variable and the row selection. I'm also going to put in some limits for the
x axis, because I want to manually make sure that all three of the histograms have the same
x scale, so I'm going to specify that. breaks is for how many bars I want in the histogram. And
actually, what's funny about this is it's really only a suggestion that you give to the computer,
then I'm going to put a title above that one, I'm going to have no x label, and I'm going to
make it red. Now I'll do all of that; I'll just run each line. And then you see
I have a very skinny chart, let's zoom in on it. So it's very short. But that's because I'm gonna
have multiple charts, it's gonna make more sense when we look at them all together. But you can see
by the way that the petal width for the setosa irises is on the low end. Now let's do the same
thing for versicolor. I'm going to run through all that. It's all gonna be the same, except we're
gonna make it purple. There's versicolor. And then let's do virginica last. And we'll make
those blue. And now I can zoom in on that. And now we have our three histograms; it's
the same variable, petal width, but now I'm doing it separately for each of the three species. And
it's really easy to see what's going on here. Now, setosa is really low; versicolor and virginica
overlap, but they're still distinct distributions. This approach, by the way, is referred to as small
multiples, making many versions of the same chart on the same scale. So it's really easy to compare
across groups or across conditions, which is what we're able to do right here. Now, by the way,
anytime you change the graphical parameters, you want to make sure to change them back to what
they were before. So here, I'm running par again, going back to one column and one row. And that's
a good way of doing histograms for examining quantitative variables, and even for exploring
some of the complications that can arise when you have different categories with different scores
on those variables. In our two previous videos, we looked at some basic graphics for one variable
at a time, we looked at bar charts for categorical variables, and we looked at histograms for
quantitative variables. While there's a lot more you can do with univariate distributions, you also
might want to look at bivariate distributions. We're gonna look at scatter plots as the most
common version of that. You do a scatter plot when what you want to do is visualize the association
between two quantitative variables. Now, I know it's actually more flexible than that. But
this is the canonical case for a scatterplot. And when you do that, what sorts of things do you want
to look for in your scatterplot? I mean, there's a purpose in it. Well, number one, you want to see
if the association between your two variables is linear, that is, if it can be described by a straight
line, because most of the procedures that we do assume linearity. You also want to check if you
have consistent spread across the scores as you go from one end of the x axis to the other, because if
things fan out considerably, then you have what's called heteroscedasticity. And it can really
complicate some of the other analyses. As always, you want to look for outliers, because an unusual
score, or especially an unusual combination of scores, can drastically throw off some of your
other interpretations. And then you want to look at the correlation: is there an association
between these two variables? So that's what we're looking for. Let's try it in R. Simply open up
this file, and let's see how it works. The first thing we need to do in R is come down and open
up the datasets package, just do Command or Ctrl and Enter, and we'll load the data sets. We're
going to use mtcars, we looked at that before; it's got a little bit of information, it's road
test data from 1974. And let's look at the first few cases. I'll zoom in on that. Again, we have
miles per gallon, cylinders, so on and so forth. Now, anytime you're going to do an association,
it's a really good idea to look at the univariate or one variable at a time distributions as well,
we're going to look at the association between weight and mpg. So let's look at the distribution
for each of those separately. I'll do that with a histogram, I do hist. And then in parentheses,
I specify the data set, mtcars in this case, and then a dollar sign to say which variable in that
data set. So there's the histogram for weight. And you know, it's not horrible there, it looks
like we've got a few on the high end there. And here's the histogram for miles per gallon. Again,
mostly kind of normal, but a few on the high end. But let's look at the plot of the two of them
together. Now, what's interesting is I just use the generic plot command, I feed that in, and r
is able to tell that I'm giving it two quantitative variables, and that a scatterplot is the best
kind of plot for that. So we're gonna do weight and mpg. And then let me zoom in on that. And
what you see here is one circle for each car at the joint position of its weight and its MPG, and
it's a strong downhill pattern. Not surprisingly, the more a car weighs (and we have some
in this data set that are over 5,000 pounds), the lower the miles per gallon; we get down to
about 10 miles per gallon here. The smallest cars, which appear to weigh substantially under
2,000 pounds, get about 30 miles per gallon. Now, this is probably adequate for most purposes.
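The workflow just described, univariate checks first and then the default bivariate plot, can be sketched like this:

```r
# Load the built-in datasets package and inspect mtcars.
library(datasets)
head(mtcars)

# Univariate distributions first: one histogram per variable.
hist(mtcars$wt)    # weight, in 1,000-lb units
hist(mtcars$mpg)   # miles per gallon

# Then the bivariate view; given two quantitative variables,
# the generic plot() draws a scatterplot.
plot(mtcars$wt, mtcars$mpg)
```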
But there's a few other things that we can do. So for instance, I'm going to add some colors here,
I'm going to take the same plot, and then add on additional arguments. I'll say use a solid circle:
pch is for point character, and 19 is a solid circle. cex has to do with the size of things, and I'm
going to make it 1.5, which means make it 150% larger. col is for color, and I'm specifying a particular
red, the one for datalab, in hex code. I'm going to give a title, I'm going to give an x label and
a y label. And then we'll zoom in on that. And now we have a more polished chart that also because of
the solid red circles makes it easier to see the pattern that's going in there, where we got some
really heavy cars with really bad gas mileage, and then an almost perfectly linear association up to
the lighter cars with much better gas mileage. And so a scatterplot is the easiest way of looking at
the association between two variables, especially when those two variables are quantitative.
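A polished version of that scatterplot with the arguments described above might look like the following; the exact hex red and the label wording are placeholders, not necessarily what appears in the course file:

```r
library(datasets)

# Same scatterplot, dressed up with point style, size, color, and labels.
plot(mtcars$wt, mtcars$mpg,
     pch  = 19,           # point character 19: solid circle
     cex  = 1.5,          # scale points to 150%
     col  = "#CC0000",    # a red specified in hex code (placeholder value)
     main = "MPG as a Function of Weight of Automobiles",
     xlab = "Weight (in 1,000 lbs)",
     ylab = "MPG")
```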
So they're on a scaled or measured outcome. And that's something that you want to do anytime
you're doing your analysis: first visualize it, and then use that as the introduction to any
numerical or statistical work you do after that. As we go through our necessarily very short
presentations on basic graphics. I want to finish by saying one more thing, and that is
you have the possibility of overlaying plots. And that means putting one plot directly on top of
or superimposing it on another. Now, you may ask yourself why you'd want to do this. Well, I can give
you an artistic version on this. This, of course, is Pablo Picasso's Les Demoiselles d'Avignon. And
it's one of the early masterpieces in Cubism and the idea of Cubism is it gives you many views,
or it gives you simultaneously several different perspectives on the same thing. And we're gonna
try to do a similar thing with data. And so we can say very quickly. Thanks, Pablo. Now, why
would you overlay plots? Really, if you want the technical explanation, it's because you get increased
information density: you get more information, and hopefully more insight, in the same amount of
space and hopefully the same amount of time. Now, there is a potential risk here. You might
be saying to yourself at this point, well, you want dense? Guess what, I can do dense. And
then we end up with something vaguely like this, the Garden of Earthly Delights, and it's
completely overwhelming, and it just makes you kind of shut down cognitively. No, thank
you, Hieronymus Bosch. Instead, well, I do like Hieronymus Bosch's work, but let me tell you:
when it comes to data graphics, use restraint. Just because you can do something doesn't mean that you
should do that thing. When it comes to graphics and overlaid plots, the general rule is this: use
views that complement and support one another, that don't compete, but that give greater information
in a coherent and consistent way. This is going to make a lot more sense if we just take a
look at how it works in R. So open up this script, and we'll see how we can overlay plots for
greater information density and greater insight. The first thing that we're going to need
to do is open up the datasets package. And we're going to be using a data set we haven't
used before about lynxes, that's the animal. This is about Canadian Lynx trappings from 1821 to
1934. If you want the actual information on the dataset, there it is. Now let's take a look at
the first few lines of data. This one is a time series. And so what's unusual about it is this
is just one line of numbers, and you have to know that it starts at 1821 and goes through 1934. So
let's make a default chart with a histogram, as a way of seeing: were lynx trappings consistent,
or how much variability was there? We'll do hist, which is the default histogram, and we'll simply
put lynx in; we don't have to specify variables, because there's only one variable in it. And when
we do that, I'll zoom in on it, we get a really skewed distribution. Most of the observations are
down at the low end, and then it tapers off. It's actually measured in thousands, so we can tell
that there is a very common value, and it's at the low end. On the other hand, we don't know
what years those were. So we're ignoring that for just a moment and taking a look at the overall
distribution of trappings, regardless of year. Let me zoom back out. And we can do some options
on this one to make it a little more intricate, we can do a histogram. And then in parentheses, I
specify the data. I also can tell it how many bins I want, and again, that's sort of a suggestion,
because R is going to do what it wants anyhow. I can say make it a density instead of a
frequency, so it'll give proportions of the total distribution. We'll change the color with col to
thistle1, because you can use color names in R. We'll give it a title here. By the way, I'm using
the paste command because it's a long title, and I want it to show up on one line, but I
need to spread my command across two lines. You can go longer; I have to use short
command lines so you can actually see what we're doing when we're zoomed in here. So there's that
one, and then we're going to give it a label; this says number of lynx trapped. And now we have
a more elaborate chart. I'll zoom in on it, and it's a kind of light thistle, purple-lilac color.
And we have divided the number of bins differently: previously, it was one bar for every 1,000; now
it's one bar for every 500. But that's just one chart. We're here to see how we can overlay charts, and
a really good one anytime you're dealing with a histogram is a normal distribution. So you want to
see: are the data distributed normally? Now, we can tell they're skewed here, but let's get an idea of
how far they are from normal. To do this, we use the command curve, and then dnorm is for density
of the normal distribution. And then here I tell it x, you know, just a generic variable name,
but I tell it use the mean of the lynx data, use the standard deviation of the lynx data. We'll make
it a slightly different thistle color, number four, and we'll make it two pixels wide; the line width
is two pixels. And then add says stick it on the previous graph. And so now I'll zoom in on that.
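Put together, the overlay sequence for the lynx data looks roughly like this; the breaks value and the exact colors are illustrative choices, not necessarily the course file's:

```r
library(datasets)

# Histogram on the density scale so curves can be overlaid on it.
hist(lynx,
     breaks = 14,
     freq   = FALSE,
     col    = "thistle1",
     main   = paste("Histogram of Annual Canadian Lynx",
                    "Trappings, 1821-1934"),
     xlab   = "Number of Lynx Trapped")

# A normal curve with the same mean and sd, for comparison.
curve(dnorm(x, mean = mean(lynx), sd = sd(lynx)),
      col = "thistle4", lwd = 2, add = TRUE)

# Kernel density estimators: the default, and a smoother one.
lines(density(lynx), col = "blue", lwd = 2)
lines(density(lynx, adjust = 3), col = "purple", lwd = 2)

# Rug plot: one tick per observation along the bottom.
rug(lynx, lwd = 2, col = "gray")
```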
And you can see if we had a normal distribution with the same mean and standard deviation as
this data, it would look like that. Obviously, that's not what we have, because we have
this great big spike here on the low end, then I can do a couple of other things, I
can put in what are called kernel density estimators. And those are sort of like a bell
curve, except they're not parametric, instead, they follow the distribution of the data, that
means they can have a lot more curves in them, they still add up to one like a normal
distribution. So let's see what those would look like here, we're gonna do lines. That's
what we use for this one. And then we say density, that's going to be the standard kernel
density estimator, we'll make it blue. And there it is, on top, I'm going to do
one more than we'll zoom in, I can change a parameter of the kernel density estimator,
here, I'm using adjust to say average across it, sort of like a moving average: average across
a little more. And now let me zoom in on that. And you can see, for instance, the blue line
follows the spike at the low end a lot more closely, and then it dips down. On the other hand,
the purple line is a lot slower to change, because of the way I gave it its instructions
with adjust equals three. And then I'm going to add one more thing, something called a rug
plot; it's little vertical lines underneath the plot for each individual data point. And
I do that with rug, and I say just use lynx, and then we're gonna make it a line width or
pixel width of two, and then we'll make it gray. And that, zooming in, is our final plot. You can
see now that we have the individual observations marked, and you can see why each bar is as tall as
it is and why the kernel density estimator follows the distribution that it does. This is our final
histogram with several different views of the same data. It's not Cubism, but it's a great way of
getting a richer view of even a single variable that can then inform the subsequent analyses you
do to get more meaning and more utility out of your data. Continuing in R: An Introduction,
the next thing we need to talk about is basic statistics. And we'll begin by discussing the
basic summary function in R. The idea here is that once you have done the pictures, that is, the
basic visualizations, then you're going to want to get some precision by getting numerical
or statistical information. Depending on the kinds of variables you have, you're going to want
different things. So for instance, you're going to want counts or frequencies for categories.
You're going to want things like quartiles and the mean for quantitative variables. We can
try this in R, and you'll see that it's a very, very simple thing to do. Just open up this script
and follow along. What we're going to do is load the data sets package, Ctrl or Command and
then Enter. And we're actually going to look at some data and do an analysis that we've seen
several times already, we're going to load the iris data. And let's take a look at the first
few lines. And again, this is four quantitative measurements, on the sepal and petal length
and width, for three species of iris flowers. And what we're going to do is we're going to
get summary in three different ways. First, we're going to do summary
for a categorical variable. And the way we do this is we use the summary
function. And then we say iris, because that's the data set, and then a dollar sign and then the name of
the variable that we want. So in this case, it's species, we'll run that command. And you can see
it just has setosa 50, versicolor 50, and virginica 50. And those are the frequencies or the counts
for each of those three categories in the species variable. Now we're going to get something
more elaborate for the quantitative variable, we'll use sepal length for that one, and I'll just
run that next line. And now you can see it lays it out horizontally, we have the minimum value
of 4.3, then we have the first quartile of 5.1, then the median, then the mean, then the third quartile,
and then the maximum score of 7.9. And so this is a really nice way of getting a quick impression
of the spread of scores. And also by comparing the median and the mean sometimes you can tell whether
it's symmetrical or whether there's skewness going on. And then you have one more option, and that is getting
a summary for the entire data frame or data set at once. And what I do is I simply do summary,
and then in the parentheses for the argument, I just give the name of the dataset, iris. And for
this one, I need to zoom in a little bit, because now it arranges it vertically. There we
have sepal length, that's our first variable, and we get the quartiles and we get the median.
And we have sepal width, petal length, petal width, and then it switches over at the last one, species,
where it gives us the counts or frequencies of each of those three categories. So that's the most
basic version of what you're able to do with the default summary function in R. It gives you quick
descriptives, gives you the precision to follow up on some of the graphics that we did previously.
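The three calls just walked through are simply:

```r
library(datasets)

summary(iris$Species)       # categorical: counts per category
summary(iris$Sepal.Length)  # quantitative: min, quartiles, mean, max
summary(iris)               # the whole data frame at once
```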
And it gets you ready for your further analyses. As you're starting to work with R, and you're
getting basic statistics, you may find you want a little more information than the base summary
function gives you. In that case, you can use something called describe, and its purpose
is really simple: it gets more detail. Now, this is not included in R's basic functionality.
Instead, this comes from a contributed package; it comes from the psych package. And when you
run describe from psych, here's what you're going to get: you'll get n, that's the sample size, the
mean, the standard deviation, the median, the 10% trimmed mean, the median absolute deviation, the
minimum and maximum values, the range, skewness and kurtosis, and the standard error. Now, don't
forget, you still want to do this after you do your graphical summaries: pictures first,
numbers later. But let's see how this works in R. Simply open up this script, and we'll run
through it step by step. When you open up R, the first thing we're going to need to do is
we're going to need to install the package. Now, I'm actually going to go through my default
installation of packages, because I'm going to use one of them, pacman. And this just makes things
a little bit easier. So we're going to load all these packages. And this assumes, of course, you
have pacman installed already. We're going to get the data sets, and then we'll load our iris data.
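The loading step and the describe calls that follow can be sketched like this; it assumes pacman is already installed, as mentioned above:

```r
library(datasets)

# p_load() from pacman installs a package if needed, then loads it.
pacman::p_load(psych)

# describe() gives detailed descriptives for a quantitative variable...
describe(iris$Sepal.Length)

# ...or for every variable in a data frame (ignore the row for the
# factor Species, which it shouldn't really summarize numerically).
describe(iris)
```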
We've done that lots of times before: sepal and petal length and width, and the species. But now
we're going to do something a little different: we're going to load a package. I'm using p_load
from the pacman package. That's why I loaded it already. And this will download it if
you don't have it already, it might take a moment. And it downloads a few dependencies, generally
other packages that need to come along with it. Now, if you want to get some help on it, you can
do p_help; anytime you have p and an underscore, that's something from pacman. So: p_help(psych). Now when you
do that, it's going to open up a web browser and it's going to get the PDF help. I've got it open
already because it's really big. In fact, it's 367 pages of documentation about the functions
inside. Obviously, we're not going to do the whole thing here. What we are going to do is we can look
at some of it in the R viewer. If you simply add this argument here, web equals F for FALSE
(you can spell out the word FALSE, as long as you do it in all caps), it then opens up here on the
right. And this is actually a web browser; this is a web page we're looking at. And each of
these, you can click on and get information about the individual bits and pieces. Now, let's use
describe that comes from this package. It's for quantitative variables only. So you don't want to
use it for categories. What we're going to do here is we're going to pick one quantitative variable
right now. And that is Iris and then sepal length. When we run that one, here's what we get. Now
I get a line here; the first number, the 1, simply indicates the row number. We only
have one row, so that's what we have anyhow. And it gives us the n of 150, the mean of 5.84, the
standard deviation, the median, so on and so forth out to the standard error there at the end. Now,
that's for one quantitative variable. If you want to do more than that, or especially if you want to
do an entire data frame, just give the name of the data frame in describe. So here we go: describe,
iris. I'm going to zoom in on that one, because now we have a lot of stuff. Now it lists all the
variables down the side, sepal length and so on, and it gives the variables numbers 1 through 5. And it gives us the
information for each one of them. Please note it's given us numerical information for species
but it shouldn't be doing that because that's a categorical variable. So you can ignore that
last line; that's why I put an asterisk right there. But otherwise, this gives you more detailed
information, including things like the standard deviation and the skewness, that you might need
to get a more complete picture of what you have in your data. I use describe a lot; it's a great
way to complement histograms and other charts like box plots, to give you a more precise image of
your data and prepare you for your other analyses. To finish up our section in R: An Introduction
on basic statistics, let's take a short look at selecting cases. What this does is it allows you
to focus your analysis, choose particular cases and look at them more closely. Now in R, you can
do this a couple of different ways. You can select by category if you have the name of a category,
or you can select by value on a scaled variable, or you can select by both. Let me show you how
this works in R. Just open up this script, and we'll take a look at how it works. As with most
of our other examples, we'll begin by loading the data sets package and by using library, just Ctrl
or Command Enter to run that command that's now loaded, and we'll use the iris dataset. So we'll
look at the first few cases; head iris is how we do that. Zoom in on it for a second. There's the
iris data, we've already seen it several times, we'll come down and we'll make a histogram of
the petal length for all of the irises in the data set. So I say hist, the name of the data set,
and then petal length. There's our histogram off to the right; I'll zoom in on it for a second. So
you see, of course, that we've got this group stuck way at the left, and then we have a gap
right here, then we have a pretty much normal distribution, the rest of it, I'll zoom back
out, we can also get some summary statistics. I'll do that right here for petal length; there
we have the minimum value, the quartiles, and the mean. Now let's do one more thing, and let's
get the name of the species. That's going to be our categorical variable, and the number of cases
for each species. So I do summary, and then it knows that this is a categorical variable.
So we run it through, and we have 50 of each; that's good. The first thing we're going to do
is we're going to select cases by their category, in this case by the species of Iris. We'll
do this three times. We'll do it once for versicolor. So I'm going to do a histogram where I
say use the iris data. And then dollar sign means use this variable petal length. And then in square
brackets, I put this to indicate select these rows or select these cases. And I say select when this
variable, species, equals (you've got to use the two equal signs) versicolor. Make sure you spell
it and capitalize it exactly as it appears in the data. Then we'll put a title on it; this says
petal length, versicolor. So here we go, and there are our selected cases. This is just 50 cases going
into the histogram. Now on the bottom right, we'll do a similar thing for virginica, where we simply
change our selection criteria from versicolor to virginica, and we get a new title there. And
then finally, we can do setosa also. So great: that's three different histograms
by selecting values on a categorical variable, where you just type them in quotes exactly as they
appear in the data. Now, another way to do this is to select by value on a quantitative or scaled
variable. If you want to do that, what you do is, in the square brackets to indicate you're selecting
rows, you put the variable, I'm specifying that it's in the iris data set, and then say what value
you're selecting. I'm looking for values less than two, and I have the title changed to reflect
that. Now what's interesting is this selects the setosas; it's the exact same group. And so
the diagram doesn't change, but the title and the method of selecting the cases did. Probably more
interesting one is when you want to use multiple selectors. Let's look for virginica; that will be
our species. And we want short petals only. So this says what variable we're using, petal length,
and this is how we select: iris dollar sign species. So that tells us which variable, and it's
equal to, with the two equal signs, virginica. And then I just put an ampersand, and then say iris
petal length is less than 5.5. Then I can run that, I get my new title, and I'll zoom in on it.
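The three selection styles from this section, written out:

```r
library(datasets)

# Select by category: only the versicolor rows.
hist(iris$Petal.Length[iris$Species == "versicolor"],
     main = "Petal Length: Versicolor")

# Select by value: petal lengths under 2 (which happens to be
# exactly the 50 setosas).
hist(iris$Petal.Length[iris$Petal.Length < 2],
     main = "Petal Length < 2")

# Select by both at once, joined with the ampersand.
hist(iris$Petal.Length[iris$Species == "virginica" &
                         iris$Petal.Length < 5.5],
     main = "Petal Length: Short Virginica")
```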
And so what we have here are just the virginicas, but the shorter ones. And so this is a
pair of selectors used simultaneously. Now, another way to do this, by the way, is if you
know you're going to be using the same sub sample, many times, you might as well create a new data
set that has just those cases. And the way you do that is you specify the data that you're
selecting from, then in square brackets the rows and the columns, and then you use the assignment
operator. That's the less-than sign and a dash here, which you can read as "gets." So I'm going to create
one called i dot setosa, for iris setosa. And I'm going to do it by going to the iris data, and in
species selecting just setosa. I then put a comma, because this one selects the rows; I need to tell
it which columns, and if I want all of them, I just leave it blank. So I'm going to do that. And now
you see up here in the top right, I'll zoom in on it, I now have a new data object in
the environment: a data frame called i.setosa. And we can look at that sub sample that I've just
created, we'll get the head of just those cases. Now you see, it looks just the same as the other
ones, except it only has 50 cases, as opposed to 150. We can get a summary for those cases, and this
time, I'm doing just the petal length. And I can also get a histogram for the petal length, and
it's going to be just those 50 cases. And so that's several ways of dealing with sub samples.
And again, saving the selection, if you're going to be using it multiple times, allows you to
drill down on the data and get a more focused picture of what's going on, and helps inform
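Saving the sub-sample as its own data frame, as just described:

```r
library(datasets)

# Rows where Species is setosa; the blank after the comma keeps all columns.
i.setosa <- iris[iris$Species == "setosa", ]

head(i.setosa)                  # same columns, only 50 rows
summary(i.setosa$Petal.Length)  # descriptives for just this group
hist(i.setosa$Petal.Length)     # histogram for just this group
```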
your analyses that you carry on from this point. The next step in our R introduction is to talk
about accessing data. And to get that started, we need to say a little bit about data formats.
And the reason for that is that sometimes with your data, it's like you're talking about apples and oranges: you
have fundamentally different kinds of things. Now, there are two ways in particular that this
can happen. The first one is you can have data of different types, different data
types. And then regardless of the type, you can have your data in different structures,
and it's important to understand each of these, we'll start by talking about data types.
This is like the level of measurement of a variable. You can have numeric variables,
which usually come in integer (whole number), single precision, or double precision. You can
have character variables with text in them. We don't have string variables in R; they're all
character. You can have logical, which are true/false, otherwise called Boolean. You can have
complex numbers, and you can have a data type raw. But regardless of which kind you have, you
can arrange them into different data structures. The most common structures are vector, matrix or
array, data frame, and list, we'll take a look at each of these. A vector is one or more numbers
in a one dimensional array. Imagine them all in a straight line. Now, what's interesting here is
that in other situations, if it's a single number, it would be called a scalar. But in R, it's
still a vector; it's just a vector of length one. The important thing about vectors is that the data
are all of the same data type, so for instance, all character or all integer. And you can think of
this as R's basic data object; most things in R are a variation on the vector. Going one step
up from this is a matrix, a matrix has rows and columns, it's two dimensional data. On the other
hand, they all need to be of the same length, the columns all need to be the same length,
and all the data needs to be of the same class. Interestingly, the columns are not named, they're
referred to by index numbers, which can make them a little weird to work with. And then you can step
up from that into an array. This is identical to a matrix, but it's for three or more dimensions.
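A minimal sketch of the three structures described so far (the particular values are chosen here just for illustration):

```r
v <- c(1, 2, 3)                     # vector: one dimension, one type
m <- matrix(1:6, nrow = 2)          # matrix: rows and columns, one type
a <- array(1:24, dim = c(4, 3, 2))  # array: three (or more) dimensions

length(v)  # 3
dim(m)     # 2 3
dim(a)     # 4 3 2
```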
On the other hand, probably the most common form is the data frame. This is a two-dimensional collection that can hold vectors of multiple types: character variables in one column, integers in another, logicals in a third. The trick is that they all need to be the same length. You can think of the data frame as the closest thing R has to a spreadsheet, and in fact, if you import a spreadsheet, it typically goes into a data frame. The neat thing is that R has special functions for working with data frames, things you can't do with the other structures, and we'll see how those work as we go through this course and others. And then finally, there's the list. This is R's most flexible data structure. It's an ordered collection of elements into which you can put basically anything: any class, any length, any structure. And interestingly, lists can include lists, which can include lists, and so on, like Russian nesting dolls, one inside the other. Now, the catch is that while that may sound very flexible, lists are actually kind of hard to work with, so a data frame is really the optimal level of complexity for a data structure. Then let me talk about one more idea here: coercion. In the world of ethics, coercion is a bad thing; in the world of data science, coercion is good. What it means here is changing a data object from one type to another, changing the level of measurement or the nature of the variable you're dealing with. So, for example, you can change a character to a logical, you can change a matrix to a data frame, you can change double precision to integer. It's going to be easiest to see how this works if we go to R and give it a whirl. So open up this
script, and let's see how it works in RStudio. For this demonstration of data types, you don't need to load any packages; we'll just run through things on their own. We'll start with numeric data. I'm going to create a data object, a variable called n1, my first numeric variable, using the assignment operator: that's the little left arrow, <-, and you read it as "gets", so n1 gets 15. R uses double precision by default. I run that, and you can see the object show up in the environment pane on the top right. If I call the name of the object, it shows its contents in the console, so I type n1 and run that, and there in the console at the bottom left it brings up the value 15 preceded by [1] in square brackets; that's the index number of the first element. We can also use the R function typeof() to confirm what type of variable it is, and it's "double", double precision, by default. We can do another one, n2, with the value 1.5; we print its contents, 1.5, and see that it too is double precision. Then we come down to characters. I'm calling this one c1, for my first character variable. You see I write c1, the name of the object I want to create, then the assignment operator, the less-than sign and dash, which reads as "gets", and then a value in double quotes. In other languages you would use single quotes for a single character and double quotes for strings; in R they're the same thing. I put in double quotes the lowercase letter "c"; that's just a value I chose. I run that, and it shows up in the global environment on the right; we can call it forward, and it prints with the double quotes on it. We get typeof(), and it's "character"; that's good. If we want an entire string of text, I can feed that into c2, again just by putting it all in double quotes. We pull it out, and it's also listed as character, even though in other languages it would be called a string. We can do logical: this is l1, for my first logical, and I feed in TRUE. When you write TRUE or FALSE, they have to be all caps, or you can use just the capital T or the capital F. I call that one out, and it says TRUE. Notice, by the way, that there are no quotes around it; that's one way you can tell it's a logical and not a character. If we put quotes around it, it would be a character variable. We get typeof(), and there we go: it's "logical". I said you can also use the abbreviations, so for my second logical variable, l2, I'll just use F. I feed that in, and when I ask R to print it, it writes out the whole word FALSE; typeof() again says "logical". Then we come down to data structures. I'm going to create a vector, a one-dimensional collection, by creating v1, for vector one, and using the function c(), which stands for concatenate; you can also think of it as combine or collect. I'm going to put five numbers in it, with a comma between the values, and then call out the object. There are my five numbers; notice they print without the commas, but I had to have the commas going in. Then I ask R whether it's a vector, with is.vector(), and it says TRUE: yes, it is. I can also make a vector of characters.
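In code, the numeric-vector step might look like this (the five values are my own example; the lecture doesn't specify them):

```r
v1 <- c(1, 2, 3, 4, 5)  # c() concatenates values into one vector
v1                      # prints without the commas: 1 2 3 4 5
is.vector(v1)           # TRUE
```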
I do that right here, print the characters, and is.vector() is TRUE for it as well. And I can make a vector of logical values, TRUE and FALSE; call it, and it's a vector also. Now, a matrix, you may remember, goes into more than one dimension. In this case, I'm going to call it m1, for matrix one, and use the matrix() function. So I say matrix, then combine a set of TRUE and FALSE values with c(), and then say how many rows I want; R can figure out the number of columns by doing the math. I put that into m1 and then ask for it. Now it displays in rows and columns, writing out the full TRUE or FALSE. I can do a second matrix where I explicitly lay out the values in the script to look like rows and columns. That's just for my convenience; R doesn't care that I broke the code up that way, but it's a readable way of working. And if I want to tell R to fill the matrix going by rows, I can specify that with byrow = TRUE (or byrow = T). I do that, and now I have a, b, c, d in the order I typed them. You see, by the way, the index numbers: on the left are the row indices, [1,] and [2,], and across the top are the column indices, [,1] and [,2]; the position before or after the comma is blank because it refers to the whole row or column. Then we can make an array. What I'm going to do here is create the data with the colon operator, which says give me the numbers 1 through 24 (still wrapped in c() to combine them), and then give the dimensions of the array; they go rows, columns, then tables, because I'm using three dimensions. I feed that into an object called a1, and there's my array right there; you can see I have two tables. In fact, let me zoom in on that one. The printout starts at the last dimension, the table, and then lists the rows and the columns separately for each table.
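The matrix and array steps just described might be sketched like this (the particular TRUE/FALSE and letter values are illustrative assumptions):

```r
# Two rows; R computes the number of columns from the data length.
m1 <- matrix(c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE), nrow = 2)

# Laid out to *look* like rows, and actually filled by rows
# with byrow = TRUE.
m2 <- matrix(c("a", "b",
               "c", "d"),
             nrow = 2, byrow = TRUE)

# 1:24 arranged as rows, columns, tables -- a 4 x 3 x 2 array.
a1 <- array(c(1:24), c(4, 3, 2))
```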
A data frame allows me to combine vectors of the same length but of different types. What I'm doing here is creating three vectors: one of numeric values, one of character values, and one of logical values. Then I use the function cbind(), for column bind, to combine them, and I call the result dfa, for "data frame a". Now, the trick here is that we get some unintentional coercion by using cbind() alone: it coerces everything to the most general format. I had numeric, character, and logical variables, and the most general of those is character, so it turned everything into a character variable. That's a problem; it's not what I wanted. I have to add another function and tell R specifically to make it a data frame, using as.data.frame(). When I do that, the combination maintains the data type of each of the variables; that's the way I want it. And then finally, I can do a list. I'm going to create three objects here: object one, which is numeric with three values; object two, which is character with four; and object three, which is logical with five. Then I combine them into list1 using the list() function, and we can see the contents of list1. You can see it's kind of a funky structure, and it can be hard to read, but all the information is there. And then we're going to do something that's a little hard to get your head around, because I'm going to create a new list that has list1 in it. So list2 contains the same three objects, plus list1 added onto it. I'll zoom in on that one, and you can see it's a lot longer, with a lot of index numbers in brackets. There are the three numbers, the four character values, and the five logical values, and then here they are repeated, because they're all parts of list1, which I included in this list. So those are some of the different ways you can structure data of different types. But you'll also want to know that we can coerce data into different types to serve different purposes. The next thing we need to talk about is coercing types. There's automatic coercion, which we've already seen a little of, where the data automatically go to the least restrictive, most general data type. For instance, if we combine 1, which is numeric, "b" in quotes, which is character, and a logical value, and feed them all into an object called coerce1 (and by putting parentheses around the assignment, R both saves it and shows us the result), you can see that it has taken all of them and made all of them character, because that's the least specific, most general format.
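That automatic coercion can be reproduced like this:

```r
# Mixing numeric, character, and logical values in one vector:
(coerce1 <- c(1, "b", TRUE))  # outer parentheses save AND print
typeof(coerce1)               # "character" -- the most general type wins
```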
So that will happen, but you want to watch out, because you don't want things getting coerced when you're not paying attention. On the other hand, you can coerce things explicitly if you want them to go a particular way. I can take this variable right here, coerce2, and put a 5 into it; we get its type, and we see that it's double. Okay, that's fine. But what if I want to make it an integer? Then I use the function as.integer(). I run that and feed the result into coerce3. It looks the same when we see the output, but now it's an integer; that's how it's represented in memory. I can also take character variables. Here I have "1", "2", and "3" in quotes, which makes them characters, and you can see that they're all character. But now I can feed them through as.numeric(), and R is able to see that there are numbers in there and coerce them to numeric. Now you see the quotes are lost, and the values go to the default double precision. Probably the coercion you'll do most often is with a matrix. Let's take a look: I'll make a matrix of nine numbers in three rows and three columns. There they are. What we're going to do is coerce it to a data frame. That doesn't change the way it looks much, but there are a lot of functions you can only use with data frames, not with matrices. This one, by the way, asks is.matrix(), and the answer is TRUE. But now let's do the same thing and add as.data.frame() on the front; that tells R to make it a data frame. And you see, it basically looks the same, just listed a little differently: the matrix had index numbers for the rows and the columns, while the data frame has a row index and variable names across the top, and R has automatically named the variables V1, V2, and V3.
But the numbers in it look exactly the same. And if we come back and ask is.data.frame(), we get TRUE. So it's been a long discussion, but the point is this: data come in different types and in different structures, and you're able to manipulate them so you can get them in the format and arrangement you need for doing your analyses in R. To continue our introduction to accessing data, we want to talk about factors. Depending on the kind of work you do, this may be a really important topic. Factors have to do with categories and the names of those categories.
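Before walking through the script, here's the core idea in miniature (with my own toy values):

```r
x <- c(1, 2, 3, 1, 2, 3)
class(x)            # "numeric"

f <- as.factor(x)   # same values, now treated as categories
class(f)            # "factor"
levels(f)           # "1" "2" "3"
```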
Specifically, a factor is an attribute of a vector that specifies the possible values and their order. It's going to be a lot easier to see if we just try it in R, so open up this script and let me demonstrate some of the variations; we can run through it together. What we're going to do here is create a bunch of artificial data and see how it works. First, I'm going to create a variable x1 with the numbers 1 through 3. By putting the assignment in parentheses, R will both store it in the environment and display it in the console, so there we have the three numbers 1, 2, and 3. I'll create another variable, y, with the numbers 1 through 9; there that is. Now I want to combine these two, and I'm going to use cbind.data.frame(), which puts them together and makes them a data frame, and it's going to save them into a new object I'm creating called df1, for data frame one. We get to see the result; let me zoom in on it a little. There you can see we have nine rows of data: the variable x1 from the one I created, then y, and the nine row indices down the side. Please note that x1 only had three values, so R recycled it; you see it repeat three times: 1 2 3, 1 2 3, 1 2 3. And what we want to find out now is what kind
of variable x1 is in this data frame. Well, typeof() says it's an integer, and when we get the structure with str(), it shows x1 is still an integer on this line right here. Okay, but we can change it to a factor by using as.factor(), and R will then treat it differently. I'm going to create a new variable, x2, that again is just the numbers 1, 2, and 3, but now I'm telling R that these specifically represent factors. Then I create a new data frame, df2, using this x2 that I saved as a factor and the 1-through-9 variable y that we had. At this point, it looks the same. But if we come back and get typeof(), it's still integer underneath, which is fine, and when we get the structure of df2, it now tells us that x2, instead of being an integer, is a factor with three levels. It gives us the three levels in quotes, "1", "2", and "3", and then it lists the data. Now, if we want to take an existing variable and define it as a factor, we can do that too. Here I'll create yet another variable, x3, with the same three values, and bind it to y in a data frame. Then I use the function factor() to reclassify the variable x3 as a factor, feeding it back into the same place and telling R that these are the levels of the factor. And because I put the whole thing in parentheses, it shows the result in the console. There we have it; let's get the type. It's an integer, but the structure shows it again as a factor. So that's one way to take an existing variable and turn it into a factor. If you want labels, we can do it this way.
We'll create x4, again the numbers 1 through 3, and bind it to y to make a data frame, df4. Here, I'm going to take the existing variable, df4$x4, and give it labels: three text labels, "macOS", "Windows", and "Linux", three operating systems. And please note, I need to put those labels in the same order as the numbers I want them to line up with, so 1 will be macOS, 2 will be Windows, and 3 will be Linux. I run that through, and we can pull it up here, and now you can see that it has changed the factor to the text labels, even though I entered the data numerically. I want typeof() to see what it is: it's still called integer, even though it's showing me words. And the structure, this is an important one; let's zoom in on it for a second. The structure here at the bottom says x4 is a factor with three levels, and it starts giving me the labels, but then it shows that underneath those are actually the numbers 1, 2, and 3. If you're used to working with a program like SPSS, where you can have values and then add value labels on top of them, it's the same kind of concept here. Next I want to show you how we can switch the order of the levels, and this gets a little confusing, so try it a couple of times and see if you can follow the logic. We'll create another variable, x5, that's just the 1, 2, and 3 again, and bind it to y, and there's our data frame just like in the other examples. Now what I'm going to do is take that new variable x5 in the data frame df5, and notice that here I'm listing the levels, but in a different order; I'm changing the order I put them in, and then lining up the labels with that new order. When I run it through, now you can see the labels, maybe, yes, and no, across the nine values. And this is an interesting one, because the levels are ordered: R prints them with a less-than sign between them to indicate which one comes first and which comes later. We can take a look at the actual data frame I made; I'll zoom in on that. We know the first value is a 1, because when I created this, it was 1 2 3, and you can see that 1 maps to "maybe"; but by putting the levels in this new order, "maybe" falls in the middle of the ordering. There may be situations in which you want to do that; I just want you to know that you have this flexibility in creating your factor labels in R. And finally, we can check typeof() on that: it's still an integer, because it's still coded numerically underneath, but we can get the structure and see how that works.
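The labeled and ordered factors from this demo might be sketched like this; the level order and labels follow the lecture's description, but treat the exact values as illustrative:

```r
# Value labels: 1 -> macOS, 2 -> Windows, 3 -> Linux.
x4  <- rep(1:3, 3)
df4 <- data.frame(x4 = x4, y = 1:9)
df4$x4 <- factor(df4$x4, levels = c(1, 2, 3),
                 labels = c("macOS", "Windows", "Linux"))

# Ordered factor with the levels rearranged so the value 1 ("Maybe")
# falls in the middle: No < Maybe < Yes.
x5  <- rep(1:3, 3)
df5 <- data.frame(x5 = x5, y = 1:9)
df5$x5 <- ordered(df5$x5, levels = c(3, 1, 2),
                  labels = c("No", "Maybe", "Yes"))

typeof(df5$x5)  # "integer" -- still coded numerically underneath
str(df5)
```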
So factors give you the opportunity to assign labels to the values of your variables and then use them as factors in various analyses. If you do experimental research, this sort of thing becomes really important, and it gives you an additional possibility for your analyses in R as you define your numerical variables as factors. Our next step in this R introduction to accessing data is entering data, where you type it in manually. I like to think of this as a version of ad hoc data, because under most circumstances you would import a data set, but there are situations in which you need just a small amount of data right away, and you can type it in this way. Now, there are several methods available for this: the colon operator, seq(), which is for sequence, c(), which is short for concatenate, scan(), and rep(). I'm going to show you how each of these works. I'll also mention this little one, the less-than sign and a dash, <-; that's the assignment operator in R. Let's take a look in R and I'll explain how all of it works. Just open up this script, and we'll give it a whirl. What we're going to
do here is begin with a little discussion of the assignment operator. The less-than-dash, <-, is used to assign values to a variable, so it's called the assignment operator. Now, a lot of other languages use an equals sign, but in R we use this one, which looks like an arrow, and you read it as "gets": x <- 5 is "x gets 5". It can go in the other direction, pointing to the right, but that would be very unusual, and you can use an equals sign and R will know what you mean, but both are generally considered poor form. That's not just arbitrary: if you look at Google's style guide for R, it's specific about that. In RStudio, you have a shortcut for this: if you press Option+minus (Alt+minus on Windows), it inserts the assignment operator with spaces around it. So I'll come down here right now, do Option+minus, and there you see it: a nice little shortcut you can use in RStudio when you're doing your ad hoc data entry. Let's start by looking at the colon operator; most of this you will have seen already. What it means is you simply put a colon between two numbers, and R runs through them sequentially. So x1 is a variable I'm creating, then I have the assignment operator and 0:10, which means it gets the numbers 0 through 10, and there they all are. (I'm going to delete the stray colon that's sitting there waiting for me to finish a command.) Now, if you want to go in descending order, just put the higher number first: 10:0 goes the other way. Next, seq() is short for sequence, and it's a way of being a little more specific about what you want. We can call up the help on it; it's right over here, for sequence generation, and there's the information. We can do ascending values: seq(10) gives 1 through 10; note that it starts at one, not zero. But you can also specify how much you want things to jump by. If you want to count down in threes, seq(30, 0, by = -3) means go from 30 to 0, stepping down by threes. I run that one, and because it's in parentheses, it both saves to the environment and shows in the console right away. So those are ways of generating sequential numbers.
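Those sequence helpers in brief:

```r
x1 <- 0:10                   # colon operator: 0 through 10
x2 <- 10:0                   # larger number first: descending

(x3 <- seq(10))              # 1 through 10 (starts at one, not zero)
(x4 <- seq(30, 0, by = -3))  # 30 down to 0, stepping by threes
```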
And that can be really helpful. Now, if you want to enter an arbitrary collection of numbers in any order, you can use c(), which stands for concatenate; you can also think of it as combine or collect. We can call up the help on that one; there it is. Let's just take these numbers and combine them into the data object x5; we pull it up, and there you see, it went right through. An interesting one is scan(); this is where we're entering data live. We'll do scan here and get some help on that one; you can see it reads data values. This one takes a little explanation: I'm going to create an object x6 and feed into it scan() with opening and closing parentheses, because I'm running that function with no arguments. Here's what happens: I run it, and down in the console you see a 1: prompt. I can just start typing numbers, hitting Enter after each one, for however many I want. When you're done, just hit Enter twice, and R reads them all in. If you want to see what's in there, come back up and call the name of the object; there are the numbers I entered. There may be situations in which that makes data entry a lot easier, especially if you're using a ten-key numeric pad. Now, rep(), you can guess, is for repetition. We'll call up the help on that one: replicate elements. Here's what we're going to do: for x7, we repeat, or replicate, TRUE five times, and when we print x7 there are our five TRUEs, all in a row. If you want to repeat more than one value, it depends on how you set things up. Here, I'm going to repeat the set TRUE and FALSE, collected with c(), and rep() will repeat that set, in order, five times: TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, and so on. That's fine, but if you want the first value five times and then the second value five times, think of it like collating on a photocopier: if you don't want it collated, you use the each argument, and that gives TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE. So these are various ways you can set up data for an ad hoc or as-needed analysis. It's also a way of checking how functions work, as I've done in a lot of examples here, and you can explore the possibilities and see how to use it in your own work. The next step in our R introduction to accessing
data is talking about importing data, which will probably be the most common way of getting data into R. Now, the goal here is to make it easy: get the data in, get a large amount, get it in quickly, and get processing as soon as you can. There are a few kinds of data files you might want to import. There are CSV files; CSV stands for comma-separated values, and it's sort of the plain-text version of a spreadsheet. Any spreadsheet program can export data as CSV, and nearly any data program at all can read it. There are also straight text files, .txt; those can be opened in text editors and word processors. Then there are .xlsx files, the current Excel spreadsheet format, as well as the older .xls version. And then finally, if you're going to get fancy, you have the opportunity to import JSON, JavaScript Object Notation; if you're using web data, you might be dealing with that kind of format. Now, R has built-in functions for importing data in many formats, including the ones I just mentioned. But if you really want to make your life easy, you can use just one: a package that I load every time I use R is rio, which is short for "R input/output". What rio does is combine all of R's import functions into one simple utility with consistent syntax and functionality. It makes life so much easier. Let's see how this all works in R. Just open up this script, and we'll run through
the examples all the way through. But there is one thing you're going to want to do first, and that is to go to the course files that we downloaded at the beginning of this course. These are the individual R scripts, and there's a folder here that's significant: it's a collection of three data sets. I'm going to click on that, and they're all called mbb. The reason they're called that is that they contain Google Trends information on searches for Mozart, Beethoven, and Bach, three major classical composers; it's all about the relative popularity of those three search terms over a period of several years. And I have the data in CSV, or comma-separated value, format, as a text file (.txt), and then even as an Excel spreadsheet. Now let's go to R, and we'll open up each one of these. The first thing we need to do is make sure that you have rio. Now, I've set things up so that rio is one of the packages I load every time, so I'm going to use pacman and do my standard loading of packages; rio is available now. I do want to tell you one significant thing about Excel files, and for that we're going to go to the official R documentation. If you click on this link, it opens your web browser on a shortcut page into the R documentation, and here's what it says; I'll actually read this verbatim: "Reading Excel spreadsheets. The most common R data import/export question seems to be: how do I read an Excel spreadsheet? This chapter collects together advice and options given earlier. Note that most of the advice is for pre-Excel 2007 spreadsheets and not the later .xlsx format. The first piece of advice is to avoid doing so if possible. If you have access to Excel, export the data you want from Excel in tab-delimited or comma-separated form, and use read.delim or read.csv to import it into R. (You may need to use read.delim2 or read.csv2 in a locale that uses comma as the decimal point.) Exporting a DIF file and reading it using read.DIF is another possibility." Okay, so really what they're saying is: don't do it.
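Despite that warning, the rio approach demonstrated next reduces each of these imports to a single call. A sketch, assuming the rio package is installed and the mbb files have been saved to the desktop (the paths here are hypothetical; adjust them to wherever you saved the course files):

```r
library(rio)

# import() infers each format from the file extension.
rio_csv  <- import("~/Desktop/mbb.csv")
rio_txt  <- import("~/Desktop/mbb.txt")
rio_xlsx <- import("~/Desktop/mbb.xlsx")

head(rio_csv)  # the same data frame comes back from all three files
```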
Well, let's go back to R. It even says right here in the script: you have been warned. But let's make life easy by using rio. Now, if you've saved these three files to your desktop, it's really easy to import them this way. We'll start with the CSV. rio_csv is the name of the object I'm going to import into, and all we need is the function import(). We don't have to specify that it's a CSV, or that it has headers, or anything; we just use import(), and then, in quotes inside the parentheses, we put the name and location of the file (this is how a desktop path shows up on a Mac). I'm going to run that, and you can see that it just showed up in my environment on the top right; I'll expand that a little bit. I now have a data frame; I'll come back out. Let's take a look at the first few rows of that data frame; I'll zoom in. You can see we have months listed, and then the relative search popularity of Mozart, Beethoven, and Bach during those months. Now, if I want to read the text file, what's really nice is that I can use the exact same import() command; I just give the location and name of the file, adding the .txt extension. I run that, we look at the head, and it's exactly the same; no difference, piece of cake. What's nice about rio is that I can even do the .xlsx file. Now, it helps that there's only one tab in that file and that it's set up to look exactly like the others. We run it, and you see that, once again, it's the same thing; rio was able to read all of these automatically, which makes life very easy. Another neat thing is that R has something
called a Data Viewer. Now we'll get a little bit of information on that to help and you invoke
the Data Viewer. Let's do this one we do with a capital V for view. And then we say what it is we
want to see. And we'll do rio underscore CSV. When we do that command, it opens up a new tab here.
And it's like a spreadsheet right here. And in fact, it's sortable, we can click on this, go from
the lowest to the highest, and vice versa. And you see that Mozart actually is setting the range
here. And that's one way to do it. You can also come over to here and just click on this little,
it looks like a calendar. But it is, in fact, the same thing, we can double click on that. And now
you see we get a viewer of that file as well. I'm going to close both of those. And I'm just going
to show you the built-in R commands for reading files. Now, these are the ones that rio uses on its
own, and we don't have to go through all this. But you may encounter these in a lot of existing
code, because not everybody uses rio, and I want you to see how they work. If you have a text file,
and it's saved in tab-delimited format, you need the complete address. And you might try
something like this: read.table is normally the command, and you need to say that you have a
header, that there's variable names across the top. But when you read this, you're going to get an error
message, and, you know, it's frustrating. That's because there are missing values in there
in the top left corner. And so what we need to do is be a little more specific about
what the separator is. So I do the same thing, I say read.table, there's the name of the file
in this location, we have a header, and this is where I say the separator is a tab: the backslash t
indicates a tab. So if I run that one, it shows up, it reads it properly. We can
also do CSV. The nice thing here is you don't have to specify the delimiter, because CSV
means that it's comma-separated, so we know what it is. And I can read that one in the exact
same way. If I want to, I can come over here and just click on the viewer, and I
see the data that way also. And so it's really easy to import data, especially if you use the
package rio, which is able to read the format automatically, get it in properly, and get you
started on your analyses as soon as possible. Now, the part of our introduction that maybe
most of you were waiting for is modeling data. On the other hand, because this is a very short
introductory course, I'm really just giving a tiny little overview of a handful of common procedures.
In another course here at datalab.cc, we'll have much more thorough investigations
of common statistical modeling and machine learning algorithms. But right now, I just want
to give you a flavor of what can be done in R. And we'll start by looking at a common procedure:
hierarchical clustering, which is a way of finding which cases or observations in your data belong with
each other. More specifically, you can think of it as the idea of like with like: which cases
are like other ones. Now, the thing is, of course, this depends on your criteria, how you measure
similarity, how you measure distance, and there's a few decisions you have to make. You can do, for
instance, what's called a hierarchical approach, which is what we're going to do. Or you can do it
where you're trying to get a set number of groups, often called k, the number of groups. You also
have many choices for measures of distance. And you also have a choice between what's
called divisive clustering, where you start with everything in one group and then you split
them apart, or agglomerative, where they all start separately and you selectively put
them together. But we're going to try to make our life simple here. So we're going to do the
single most common kind of clustering: we're going to use a measure of Euclidean distance,
and we're going to use hierarchical clustering, so we don't have to set the number of groups in
advance. And we're going to use a divisive view of it, where we start with them all together and gradually
split them. Let me show you how this works in R. And what you'll find is even though this
may sound like a very sophisticated technique, and a lot of the mathematics is sophisticated,
it's really not hard to do in reality. So what we're going to do here is we're going
to use a data set that we use frequently. I'm going to load my default packages to get some of
this ready, and then I'll bring in the datasets package. We're going to use mtcars, which, if you recall,
is the Motor Trend car road tests data from 1974. There are 32 cars in there, and we're going to see how
they group, what cars are similar to which other ones. Now let's take a look at the first few rows
of data to see what variables we have in here. You see we have mpg, cylinders, displacement, and so on
and so forth. Not all of these are going to be really influential or useful variables, and so
I'm going to drop a few of them and create a new data set that includes just the ones I want. If
you want to see how I do that, I'm going to come back here and create a new object,
a new data frame called cars. And this says it gets the data from mtcars. By putting the
blank in the space here, that means use all of the rows. But here I'm selecting the columns: c,
for concatenate, means I want columns one through four, skip five, then six and seven, skip eight, and
then nine through eleven. That's a way of selecting my variables. So I'm going to do that, and you see
that cars is now showing up in my environment there at the top right. Let's take a look
at the head of that data set. We'll zoom in on that one, and you can see it's a little bit
smaller: we have mpg, cylinders, displacement, weight, horsepower, quarter-mile seconds, and so
on. Now we're going to do the cluster analysis, and what we're going to find is that if we're using
the defaults, it's super, super easy. In fact, I'm going to be using something called pipes, which
come from the package dplyr, which is why I loaded it; it's this thing right here. What a pipe allows you
to do is take the results of one step and feed it directly in as the input data to the next
step. Otherwise, this would be several different steps, but I can run it really quickly. I'm going
to create an object called hc, for hierarchical clusters. We're going to read the cars data that
I just created, we're going to get the distance or dissimilarity matrix, which says how far
each observation is in Euclidean space from each of the others, and then we feed that through the
hierarchical clustering routine hclust. So that saves it into an object, and now all we need to do is
plot the results. We're going to do plot(hc), my hierarchical cluster object, and then we get this
very busy chart over here. But if I zoom in on it and wait a second, you can see that it's this
nice little thing called a dendrogram, because it has branches like a tree; it looks more like roots
here. You can see they all start out together, and then they split, and then they split, and they
split. Now, if you know your cars from 1974, you can see that some of these things make sense.
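The pipeline just described can be sketched like this; it's a sketch using the pipe that dplyr re-exports, with the variable selection from earlier in the walkthrough:

```r
# Hierarchical clustering of the mtcars subset, as walked through above.
library(datasets)
library(dplyr)   # provides the %>% pipe

cars <- mtcars[, c(1:4, 6:7, 9:11)]  # keep columns 1-4, 6-7, and 9-11

hc <- cars %>%   # start with the reduced data frame
  dist %>%       # Euclidean distance / dissimilarity matrix
  hclust         # hierarchical clustering

plot(hc)         # dendrogram of the 32 cars
```

The dendrogram this produces is the busy chart being read here.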
So for instance, here we have the Honda Civic and the Toyota Corolla, which are still in production,
right next to each other. The Fiat 128 and the Fiat X1-9, well, they were both
small Italian sports cars; they were different in many ways, but you can see that they're right next
to each other. The Ferrari Dino and the Lotus Europa make sense to put next to each other. If
we come over here, the Lincoln Continental, the Cadillac Fleetwood, and the Chrysler Imperial,
it's no surprise that they are next to each other. What is interesting is this one here, the Maserati
Bora: it's totally separate from everything else, because it was a very unusual, different kind of car
at the time. Now, one really important thing to remember is that the clustering is only valid for
these data points, based on the data that I gave it. I only gave it a handful of variables.
And so it has to use those ones to make the clusters. If I gave it different variables or
different observations, we could end up with a very different kind of clustering. But I want to
show you one more thing we can do here with these clusters to make them even easier to read. Let me
zoom back out. What we're going to do is draw some boxes around the clusters. We're going to
start by drawing two boxes that have gray borders. I'm going to run that one, and you can see
that it showed up. And then we're going to make three blue ones, four green ones, and five dark
red ones. Then let me come in and zoom in on this again, and now it's easier to see what the groups
are in this particular data set. So we have here, for instance, the Hornet 4 Drive, the Valiant,
the Mercedes-Benz 450SLC, the Dodge Challenger, and the AMC Javelin all clumping together in one general
group. And then we have these other really big American cars. What's interesting, again, is
that the Maserati Bora is off by itself almost immediately. It's kind of surprising, because the
Ford Pantera L has a lot in common with it. But this is a way of seeing, based on the information
that I gave it, how things are clustered. And if you're doing market analysis, if you're trying to
find out who's in your audience, if you're trying to find out what groups of people think in similar
ways, this is an approach that you're probably going to use. And you can see that it's really
simple to set up, at least using the defaults in R, as a way of seeing the regularities
and consistencies in the groupings in your data. As we go through our very brief introduction to
modeling data in R, another common procedure that we might want to look at briefly is called
principal components. The idea here is that in certain situations, less is more. That is, less
noise and fewer unhelpful variables in your data can translate to more meaning, and that's
what we're after in any case. Now, this approach is also known as dimensionality reduction, and I like to
think of it by an analogy, you look at this photo, and what you see are these big black outlines of
people, you can tell basically how tall they are, what they're wearing, where they're going. And
it takes a moment to realize that you're actually looking at a photograph that goes straight down.
And you can see the people there on the bottom, and you're looking at their shadows. And we're
trying to do a similar thing. Even though these are shadows, you can still tell a lot about
the people, people are three dimensional, shadows are two dimensional, but we've retained
almost all the important information. If you want to do this with data, the most common
method is called principal component analysis, or PCA. And let me give you an example of the
steps metaphorically in PCA. You begin with two variables, and so here's a scatterplot: we've got
x across the bottom, y up the side, and this is just artificial data. You can see that there's a
strong linear association between these two. Well, what we're going to do is draw
a regression line through the data set, and you know, there it is, at about 45 degrees. Then we're
going to measure the perpendicular distance of each data point to the regression line. Not
the vertical distance, which is what we would do if we were looking for regression residuals, but the
perpendicular distance; that's what those red lines are. Then what we're going to do is
collapse the data by sliding each point down the red line to the regression line, and
that's what we have there. And then finally, we have the option of rotating it, so it's not on
the diagonal anymore but flat. And that there is the PC, the principal component. Now, let's
recap what we've accomplished here, we went from a two dimensional data set to a one dimensional
data set, but maintained some of the information in the data. But I like to think that we've
maintained most of the information. And hopefully, we maintain the most important information in
our data set. And the reason we're doing this is we've made the analysis and interpretation
easier and more reliable. By going from something that was more complex, two dimensional or higher
dimensions, down to something that's simpler to deal with fewer dimensions, it means easier to
make sense of in general. Let me show you how this works in R. Open up this script, and we'll
go through an example in RStudio. To do this, we'll first need to load our packages,
because I'm going to use a few of these, and along with those I'll load the datasets package. Now I'm
going to use the mtcars data set, we've seen that a lot, and I'm going to create a little
subset of variables. Let's look at the entire list of variables. I don't want all of those
in my particular data set, so the same way I did with hierarchical clustering, I'm going to create
a subset by dropping a few of those variables. And we'll take a look at that subset. Let's zoom
in on that. So there are the first six cases in my slightly reduced data set. We're going to
use that to see what dimensions we can get to, so that we have fewer than the nine variables we
have here. Let's try to get to something a little less and see if we still maintain some of the important
information in this data set. Now what we're going to do is start by computing
the PCA, the principal component analysis. We'll use the entire data frame here, and I'm going
to feed it into an object called pc, for principal components. There's more than one way to do
this in R, but I want to use prcomp. This specifies the data set that I'm going to use,
and I'm going to add two optional arguments. One is centering the data, which means moving
the variables so their means are zero. The second one is scaling the data, which sort
of compresses or expands the range of the data so it has unit variance, a variance of one, for each of them.
That puts all of them on the same scale, and it keeps any one variable from sort of overwhelming
the analysis. So let me run through that, and now we have a new object that showed up
on the right. If you want to, you can also specify variables by including
them explicitly. The tilde here means that I'm making my prediction based on all the rest of these, and I
can give the variable names all the way through. Then I say what data set it's coming from.
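Both ways of specifying the model can be sketched like this, assuming the cars subset built earlier; note that in prcomp the scaling argument is spelled `scale.` with a trailing dot:

```r
# PCA on the nine-variable mtcars subset, two equivalent specifications.
library(datasets)
cars <- mtcars[, c(1:4, 6:7, 9:11)]

# 1) Pass the whole data frame:
pc <- prcomp(cars, center = TRUE, scale. = TRUE)

# 2) Or name the variables explicitly with a formula:
pc <- prcomp(~ mpg + cyl + disp + hp + wt + qsec + am + gear + carb,
             data = mtcars, center = TRUE, scale. = TRUE)
```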
I say data equals mtcars, and I can do the centering and the scaling there also. It produces
exactly the same thing; it's just two different ways of writing the same command. To examine the
results, we can come down and get a summary of the object pc that I created. So I'll click on that,
and then we'll zoom in on this. Here's the summary. It talks about creating nine components,
PC1 for principal component one through PC9 for principal component nine; you get the same number
of components as you had original variables. But the question is how it divvies up the
variation. Now, you can take a look here: principal component one has a standard
deviation of 2.3391. What that means is, since each variable begins with a standard deviation of
one, this one has as much as about 2.3 of the original variables, the second one has 1.59, and the others
have less than one unit of standard deviation, which means they're probably not very important
in the analysis. We can get a scree plot for the components and get an idea of how
much each one of them explains of the original variance. And we see right here, I'll zoom in on
that, that our first component seems to be really big and important. Our second one is smaller, but
it still seems to be, you know, above zero, and then we kind of grind down to that last one. Now
there are several different criteria for choosing how many components are important, depending on what you want
to do with them. Right now, we're just eyeballing it, and we see that number one is really big
and number two is sort of a minor axis in our data. If you want to, you can get the standard deviations
and something called the rotation; here, I simply call pc, and then we'll zoom in on that in
the console and scroll back up a little bit. It's a lot of numbers. The standard
deviations here are the same as what we got in that first row before, so that
just repeats it: the first one's really big, the second one's smaller. And then what
this part right here does, the rotation, is say what the association is between
each of the individual variables and the nine different components. So you
can read these like correlations. I'm going to come back, and let's see how
individual cases load on the PCs. The way I do that is I use predict, running
it through pc, and then I feed those results on using the pipe and round them
off so they're a little more readable. I'll zoom in on that. Here
we've got nine components listed, and we've got all of our cars, but the first two
are probably the ones that are most important. So we have here PC one and two, and it's easy to see we've
got a giant value there, 2.49273354, and so on. But probably the easiest way to deal with all this
is to make a plot. And what we're going to do is use something with the funny name of biplot. What
that means is a two-dimensional plot; really, all it does is chart the first two
components. But that's good, because based on our analysis, it's really only the first two that
seem to matter anyhow. So let's do the biplot, which is a very busy chart, but if we zoom in on
it, we might be able to see a little better what's going on here. What we have is the
first principal component across the bottom and the second one up the side. The
red lines indicate approximately the direction of each individual variable's contribution to
these, and then for each case we show its name about where it would go. Now, if you remember
from the hierarchical clustering, the Maserati Bora was really unusual, and you can see it's up
there all by itself. And then really, what we seem to have here is displacement and weight and
cylinders and horsepower; this appears to be big, heavy cars going in this direction. Then we
have the Honda Civic, the Porsche 914-2, and the Lotus Europa; these are small cars with smaller engines,
more efficient. These are fast cars up here, and these are slow cars down here. And so it's pretty
easy to see what's going on with each of these in terms of clustering the variables. With
hierarchical clustering we clustered cases; now we're looking at clusters of variables.
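The whole sequence of examination steps walked through above can be collected into one sketch:

```r
# Examining the PCA results, step by step as described above.
library(datasets)
library(dplyr)   # for the %>% pipe

cars <- mtcars[, c(1:4, 6:7, 9:11)]
pc   <- prcomp(cars, center = TRUE, scale. = TRUE)

summary(pc)               # sd and proportion of variance per component
plot(pc)                  # scree plot of the component variances
pc                        # standard deviations plus the rotation (loadings)
predict(pc) %>% round(2)  # how individual cases load on the components
biplot(pc)                # cases and variable directions on PC1 vs PC2
```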
And we see that it might work to talk about big versus small and slow versus fast as the important
dimensions in our data, as a way of getting insight into what's happening and directing us in our
subsequent analyses. Let's finish our very short introduction to modeling data in R with a brief
discussion of regression, probably one of the most common and powerful methods for analyzing data. I
like to think of it as the analytical version of E Pluribus Unum, that is, out of many, one; or in
the data science sense, out of many variables, one variable. Or, to put it one more
way, out of many scores, one score. The idea with regression is that you use many different
variables simultaneously to predict scores on one particular outcome variable. And there's so
much going on here, I'd like to think there's something for everyone: there are many versions and
many adaptations of regression that really make it flexible and powerful for almost anything
you're trying to do. We'll take a look at some of these in R, so let's try it in R and
just open up this script, and let's see how you can adapt regression to a number of different
tasks using different versions of it. When we come here to our script, we're going to scroll
down here a little bit and install some packages, we're going to be using several packages in this
one. I'll load those as well as the datasets package, because we're going to use a data set
from that called USJudgeRatings. Let's get some information on it: it is lawyers' ratings of state
judges in the US Superior Court. Let's take a look at the first few cases with head; I'll zoom
in on that. What we have here are six judges listed by name, and we have scores on a number of
different variables like diligence and demeanor, and it finishes with whether they're
worthy of retention; that's RTEN, for retention. Let's scroll back out. What we might want to
do is use all these different judgments to predict whether lawyers think that these judges should be
retained on the bench. Now, we're going to use a couple of shortcuts that can actually make working
with regression situations kind of nice. First, we're going to take our data set and feed
it into an object called data, so that shows up now in our environment on the top right.
And then we're going to define our variable sets. You don't have to do this, but it makes
the code really, really easy to use. Plus, you'll find that if you do this, you can actually
reuse the same code without having to redo it every time you do an analysis. So what we're going
to do is create an object called x. It's actually going to be a matrix, and it's
going to consist of all of our predictor variables simultaneously. The way I'm going to do this
is with as.matrix, and then I'm going to say read data, which is what we defined right
here, and read all of the columns except number 12. That's the one called retention, that's our
outcome, so the minus means don't include that one, but do all the others. I do that, and now I
have an object called x. Then for the second one, I say go to data, and this blank means use
all of the rows, but only read the 12th column; that's the one that has retention, our outcome.
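Those two shortcuts can be sketched as:

```r
# Define the predictor matrix x and the outcome y from USJudgeRatings.
library(datasets)

data <- USJudgeRatings        # 43 judges, 12 rating variables

x <- as.matrix(data[-12])     # all columns except 12 (RTEN), as a matrix
y <- data[, 12]               # column 12, RTEN: worthiness of retention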
So, following standard notation, x, those are all our predictor variables, and y, that's our single outcome
variable. Now, the easiest version of regression is called simultaneous entry: you use all of the x
variables at once, throw them into one big equation, and try to predict your single outcome. In R
we use lm, which is for linear model. What we have here is y, that's our outcome variable, and
then the tilde, which means is predicted by, or as a function of, and then x, all of our variables
together being used as predictors. So this is the simplest possible version, and we'll save it into
an object called reg1, for regression one. Now, if you want to be a little more explicit, you can
give the individual variables: you can say that RTEN, retention, is a function of, or is predicted by,
all of these other variables. And then I say that they come from the data set USJudgeRatings, so
we don't have to write data and then a dollar sign for each of these. That'll give me the exact same
thing, so I don't need to run that one explicitly. If you want to see the results, we just call on
the object that we created from the linear model, and I'm going to zoom in on that. And what we
have are the coefficients. This is the intercept, which starts at about minus two, and then for each step
up on one of these predictors, the outcome moves by its coefficient, 0.1, 0.36, and so on
and so forth. You'll see, by the way, that it has changed the name of each of the variables
to add the x, because they're in the data set x now; that's fine. We can do inferential tests on
these individual coefficients by asking for a summary. We click on that and we'll zoom in.
Now you can see there's the value that we had previously, but now there's a standard error,
and then this is the t-test, and then over here is the probability value. The asterisks indicate
values that are below the standard probability cutoff of .05. Now, we expect the intercept to be below that. You see, for instance,
that this one, integrity, has a lot to do with people's judgment of whether a person should be retained,
and this one, physical ability, really, are they sick, and we have some others that are kind of on
their way. And this is a nice one overall. If you come down here, you can see the multiple
R-squared. It's super high, and what it means is that these variables collectively predicted very,
very well whether the lawyers felt that the judge should be retained. Let's go back now to our
script. You can get some more summary data here if you want: we can get the analysis of variance
table, the ANOVA table. We click on that and zoom in there, and you can see that we have our predictors
and the residuals. Come back out, and we do the coefficients. Here are the regression coefficients; we saw
those previously, and this is just a different way of getting at the same information. We can
get confidence intervals. Let's zoom in on that, and now we have a 95% confidence interval, so the
2.5 percent on the low end and the 97.5 percent on the top end, in terms
of what each of the coefficients would be. We can get the residuals on a case-by-case basis;
let's do this one. When we zoom in on that, now, this is a little hard to read in and of
itself, because they're just numbers. But an easier way to deal with that is to get a histogram
of the residuals from the model. To do that, we just run this command, and then I'll zoom
in on this. You can see that it's a little bit skewed, mostly around zero; we've got one
person on the high end, but mostly these are pretty good predictions. Come back out.
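The model and the follow-up summaries described in this stretch can be sketched together; this is a sketch assuming the x matrix and y vector defined earlier:

```r
# Simultaneous-entry regression and its follow-up summaries.
library(datasets)
data <- USJudgeRatings
x <- as.matrix(data[-12])     # predictors: everything but RTEN
y <- data[, 12]               # outcome: RTEN

reg1 <- lm(y ~ x)             # all predictors at once

reg1                          # coefficients only
summary(reg1)                 # t-tests, p-values, R-squared
anova(reg1)                   # analysis-of-variance table
coef(reg1)                    # coefficients again
confint(reg1)                 # 95% confidence intervals
resid(reg1)                   # case-by-case residuals
hist(residuals(reg1))         # distribution of the residuals
```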
Now I want to show you something a little more complicated. We're going to do different kinds
of regression, and I'm going to use two additional libraries for this: one is called lars, which
stands for least angle regression, and the other is caret, which stands for classification and regression
training. We'll start by loading those two. Then we're going to do a conventional stepwise
regression; a lot of people say there are problems with this, but I'm just going to show it,
and I'm going to do it really fast. There's our stepwise regression. Then we're going to do something from
lars called stagewise; it's similar to stepwise, but it has better generalizability. We run that
through, and we can also do least angle regression. And then, really, one of my favorites is the
lasso; that's the least absolute shrinkage and selection operator. Now, I'm running through
just the absolute bare-minimum versions of these; there's a lot more that we would want to do to
explore them. But what I'm going to do is compare the predictive ability of each of them.
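A sketch of that comparison, assuming the lars package; the index [6] used to pull one R-squared value from each model's path is illustrative, one step along the path, not a canonical choice:

```r
# Four regression variants from lars, compared on R-squared.
library(lars)
library(dplyr)
library(datasets)

data <- USJudgeRatings
x <- as.matrix(data[-12])
y <- data[, 12]

stepwise <- lars(x, y, type = "stepwise")           # conventional stepwise
forward  <- lars(x, y, type = "forward.stagewise")  # stagewise
lar      <- lars(x, y, type = "lar")                # least angle regression
lasso    <- lars(x, y, type = "lasso")              # the lasso

r2comp <- c(stepwise$R2[6], forward$R2[6],
            lar$R2[6], lasso$R2[6]) %>% round(2)
names(r2comp) <- c("stepwise", "forward", "lar", "lasso")
r2comp
```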
And I'm going to feed the results into an object called r2comp, a comparison of the R-squared
values. Here I specify where each value lives; for each of them, I have to give a little index
number, then we round off the values. And I'm going to give them names: the first
one stepwise, then forward, then lar, then lasso. And we can see the values. What this shows us
here at the bottom is that all of them were able to predict it super well. But we knew that, because
when we did just the standard simultaneous entry, there was amazingly high predictive ability within
this data set. You will find situations in which each of these can vary a little bit; maybe
sometimes they vary a lot. But the point here is that there are many different ways of doing regression,
and R makes those available for whatever you want to do. So explore your possibilities and
see what seems to fit. In other courses, we will talk much more about what each of these
means, how they can be applied, and how they can be interpreted. But right now, I simply want you
to note that these exist, and that they can be done, at least in theory, in a very simple way in
R. And so that brings us to the end of our R introduction. And I want to make a brief
conclusion, primarily to give you some next steps, other things that you can do as you learn
to work more with R. We have a lot of resources available here. Number one, we have
additional courses on R at datalab.cc, and I encourage you to explore each of them. If you
like R, you might also like working with Python, another very popular language for data
science, which has the advantage of also being a general-purpose programming language. The things
that we do in R, we can do almost all of them in Python, and it's nice to compare
and contrast the two with the courses we have at datalab.cc. I'd also recommend you spend
some time simply on the concepts and practice of data visualization. R has fabulous packages for
data visualization, but understanding what you're trying to get and designing quality visualizations is sort
of a separate issue, and so I encourage you to get the design training from our other courses on
visualization. And then finally, a major topic is machine learning, or methods for processing
large amounts of data and getting predictions from one set of data that can be applied usefully to
others. We do that for both R and Python and other mechanisms here at datalab. Take a look at all of
them and see how well you think you can use them in your own work. Now, another thing that you can
do is try looking at the annual R user conference, which is useR! with a capital R and an
exclamation point. There are also local R user groups, or RUGs. And I have to say that, unfortunately,
there is not yet an official R Day. But if you think about September 19, it's International Talk
Like a Pirate Day, and we like to think that pirates say R, and so that can be our unofficial day
for celebrating the statistical programming language R. In any case, I'd like to thank you for
joining me for this, and I wish you happy computing.