Introduction to R and RStudio

Captions
So without further ado I'll hand over to... Thank you. Thank you for coming, guys. So everyone has got R and RStudio installed on their computer, so if you want to open RStudio. I think the best way for this workshop to work is basically, as long as you can all see what I'm typing, if you type along with me. So if I go too fast, or if something doesn't work, or you get an error, or something confusing happens, please just shout out and let me know, because I think it's really important that everyone goes along with it, because you can get lost quite easily.

So R is super cool. It's a statistical programming language, and it looks a bit daunting because it's not a GUI-based application where you can click through menus and get the analysis or the graphs that you want. You do everything at the command line by typing text. So it can seem a bit complicated, but when you start using it, actually it's not as complicated as it seems. The best way to get to grips with it is to go to a couple of workshops, practice and play around with it yourself, and then look at some online tutorials, and eventually you'll come to grips with it.

So I think what we'll do is split this into two parts, and it depends, I guess, on where we finish, but in my head I've got it that we'll start off just by working with R within RStudio. So R is the language, and people used to work in it just like this, at the command line, with no application to help them, just typing things at the command line. If you wrote some code that you wanted to save for later on, you'd have to open up Notepad and save it there. Whereas RStudio is an integrated development environment, they call it, which is basically a fancy way of saying a program that makes it easier for you to run R, and everyone who does things these days does it in RStudio. So we'll talk about
using RStudio. We'll talk about how you can perform mathematical operations in R, because at a very basic level R is just a big glorified calculator. We'll talk about creating objects in R and working with them; what I mean by data classes, so whether something is numeric, character or logical; data structures, so the way that R likes to store little bits of information in a way that we can work with and manipulate; functions, which are operators that take an input and give you some kind of output, so an example is the mean function, where you give it a list of numbers and it'll give you the mean afterwards; and exploring our data, so say we generate or read some data into R, we want to explore the data and see if there are any relationships there. And then we'll look as well at how we can read data into R from an external source. I think most of the time most of us will generate data that we organize into some kind of Excel spreadsheet that we'll then want to import into R to work with. As part of the email that you guys got there's an example data set called Pokemon, which I'm hoping today we'll end up reading into R and having a quick look at. But then in the next session, a week later, we'll actually start looking at analyzing that data and doing your t-tests, ANOVAs, two-way ANOVAs, post hoc tests and things like that.

So everyone's got RStudio open? Does everyone have something that looks a little bit like that? It might not be blue. Let's carry on. So everyone else's looks a little bit like this? And can you read this, by the way, this white text on a blue background? That's okay? Okay. So the first thing I want you to do is go up to the top left corner, click on this white sheet and say that you want to create a new R script, and then you should get something like this, so that your screen is basically split into four different places. Okay, so start in the bottom left-hand corner, what they call the console. This is, you
can think of it as actually what R is. This is the body of R, and this is where the R code is actually executed. Up here is the script, where basically you can store your R code as you type it. It'll make more sense in a second when we use it. In the top right we've got an environment and a history tab, which we won't really use so much. We'll see it being used as we go, but it's not so important for this first session. And then down in the bottom right-hand side you have this section where you have a little file browser for your computer, any plots that you generate appear in this tab, and if you look for any help for any packages and things, they appear in here as well.

So if you just start down in the console, you can think of this as R actually running, and at its very basic level R is a glorified calculator. So if we start using some mathematical operators, so 2 + 2, and then hit Enter, it'll run the code and print the output for you. So two plus two is four. So is everyone happy to type along as I go? You can also forget the spaces if you want and just do 2+2 without any spaces, but it's generally considered better, to make your code more readable, to have spaces between operators. So that's addition. 7 minus 8 is minus 1. Multiplication is the asterisk, so 4 times 6. And you can combine these together in the same kind of way that you would if you were writing some kind of equation, so you can do 2 + 2 / 3 to the power of 4, and hit Enter, and then you get the result. Yeah? So everyone is happy with that at the moment? So R is a glorified calculator, at its very basic level.

What you notice is that as we type code and hit Enter, the previous code that we've run moves further up the screen, and eventually you can imagine that if you're typing huge amounts of code this gets lost up the screen and you've got to scroll through and try and find it, which is quite difficult if you either make a mistake or you want to find something that you typed
earlier. If you keep tapping the up key, you'll see that it cycles through the most recent commands that you ran, and if you go down you can cycle back between them, and if you find the one that you wanted you can just hit Enter. So this is R, but actually a better way of keeping track of what you've typed, and keeping track of which bits of code work and which bits are rubbish, is by using the script up here. So if you click on the top left panel, up here, and I type 2 + 2 and hit... well, okay, if I hit Enter I just go to the next line and nothing happens. If you hit Control + Enter, what happens is the script passes this line into the console. It basically gives the line of code to R and R runs it, so R then says this is 4. I can keep typing code, and although the code that's been executed disappears up the screen, the code that I've written in the script stays. It's like a text file that you can put all the bits of code in that work well, and it's then easier to go back and say, oh no, I didn't want two plus two, I wanted two times two, and hit Ctrl + Enter. So most of the time, when you're writing R code that you are interested in keeping or editing and perfecting, you want to be writing it inside the script. Also, you can save the script, so you can save it as a .R file and then open it up again, and anything that you've written in there, just like a text file, will be there the next time.

In R, you know, everything is an object, or can be saved as an object. So if I want to save the value of ten times six so that I can reuse it later, say I want to call it a, you type the name of whatever you want to call the object and then this funny little operator, which is the assignment operator, which you can see is an arrow pointing to the name. You can do it either way around, but people tend to have the name of their object on the left, then the assignment operator, and then whatever the value is on the right
hand side. So this basically tells R: take this and assign it to an object called a. Hit Ctrl + Enter, and you can see now that over here in our global environment we've got an object called a and its value is 60. And so if I type a and hit Ctrl + Enter, it gives me the value of 60. So the result of this is stored in this. Make sense? So it's the less-than sign followed by a dash, a minus sign. Yeah? Okay.

In R, anything that follows a hashtag is ignored, so it's purely for you to write a comment to remind you of why you've done something. So, "assign object a". I hit Ctrl + Enter and all R sees is the a, and it just gets bored when it gets to the hashtag, so it doesn't read anything else. It's really useful to comment bits of your code so that you remember why you did certain things, so that when you come back through it later you don't think, oh, why the hell did I do that?

A really useful thing in RStudio as well is that if you include a comment line that ends in four dashes, then, well, a couple of useful things happen. So hang on, let's say this is Part A, Part B, and let's just put "hello". The first useful thing is that when you do this, start a line with a hashtag, have any text that you want, and four dashes, it indexes that part in RStudio so that you can easily flip between different sections. So you can use these basically as section headers. Another useful thing is that when you do this you get this little arrow, if you can see that on your screens, alongside the line, and if you click on the arrow you can collapse certain bits of code. So if you've got a massive piece of code that you don't want to see when you're scrolling through, you can click on that and toggle it open and closed. So does that work for everyone?

Would you rather that I had it white, or is it just you, just curious? That's okay. So if you go to Tools, then Global Options, and then Appearance, mine is
on Cobalt, but there are loads of different ones that you can choose from. Okay.

So we know how to create objects: you give the object name, you give the assignment operator, and then whatever value or equation or even plot you've got on the right-hand side. R is case-sensitive, so if I create an object with a capital A, give it the value of 67 and hit Enter, then if I call capital A I get 67, and if I call lowercase a I still get the 60 that we assigned over here. R is case-sensitive for everything, actually, so just because it's the same word doesn't mean it's the same object. You can overwrite objects, so if I reassign a to 2 and then call a, a is now 2. So you can overwrite something just by assigning a different value to it. Make sense?

R has conventions for naming objects, so you can't just name anything any way that you want. For example, you can't give something a name that starts with a number, so if we run that it says unexpected symbol in "1object", because it doesn't like you having a name beginning with a number. It also doesn't like names that have special characters in, like an exclamation mark, look, it's not happy with that, or a dash, oops, not happy with that, and actually even if I put the exclamation mark at the end of the name it's not happy with that either. What you can do is have numbers at the end, so you can have object3, that's fine. So we can call object3 and it returns the value 3.

If you want to name something with some kind of phrase that might have spaces in it, there are a few different ways that you can do it. Well, yeah, I mean, you can do it any way that you want, but you might see something like that, with a dot where you might put a space to make it easier to read, or an underscore, or camelCase, which is where you don't put any dot or underscore between the words but every time you start a new word the first
letter is capital. So these are fine, R accepts these. Most of the time you'll probably see these two. I generally prefer camelCase, but if you prefer underscores then you can do it whichever way you prefer.

If it's not clear to you at the moment why we're creating objects, by the way, it will be later. It's just so that you have some way of storing a value, or it could be a data table or a plot, under a single word that you can call later on. So rather than having to create one massive long equation that you type out all in one go, you can create it bit by bit and then just feed the objects into an equation.

And if you want to remove an object, we'll talk a little bit more about functions later on, but there's a function called rm, for remove. So you type rm and then put whatever object you want to remove inside brackets. So if I want to remove my object, I hit Control + Enter, and then we find that my object has disappeared from our environment. So sometimes you've got something that you don't want anymore and you just want to get rid of it, and you can do it like that.

So R is built, obviously, for statistical analysis and programming, and it has certain classes, or certain ways that it likes to classify data, because different processes apply differently to different types of data. What I mean is you can have data that is numeric, so it's based on numbers; you can have data that is based on characters, so text; or you can have logical data, which is basically either true or false. It's important to understand that they're different because R handles them differently for different processes. So for example, that is numeric, so is that, so is that, whereas that is character. Numeric is always a number, obviously, and you can see that R, at least on my color scheme, has made this pink, and it's just written as the number itself. For character, each little piece of text data needs to be surrounded in double quotes, and that's the way
that R understands that it's character. So if I do that, then it says "male". It basically just passes it back to me because I haven't done anything with it. But if I try and submit it like that, it says object 'male' not found, because it no longer sees it as a character; it sees it as an object that we've named, but it can't find that because we haven't created it. So character strings need to be in quotes.

And then the last type of data class is logical, so this is obviously either TRUE or FALSE, and you can see that R has colored it pink because it recognizes that it's, well, basically that it's not character. They have to be in capital letters, so that, you can see, isn't colored the same: that is not a logical value, that would just look for an object called true. But you can abbreviate them to T and F for short. It might not be obviously clear why you would need to use logical data values, but they're often useful as options for functions. So it might be that you run a function that plots a regression line and it has an option for whether you want error bands, and you can supply TRUE or FALSE, and then that will plot it differently.

So most of the time when we're working with data, we're not just working with single data values, we're working with collections of data. It might be a data table or a string of results or a set of random numbers or something; it's not just a single value. And R likes the data to be organized into certain structures, so again, it likes to perform certain things with certain functions. That isn't very clear, but it will be clear, I promise you. So there are a few main types of data structure, basically ways that data are organized and represented in R. The three most important ones are probably vector, list and data frame, and actually initially people probably aren't going to use lists very often. We'll play
with these in a minute and it will become more clear. I promise everything will come together.

A vector is basically a series of single values, so it could be a series of weights of different people, or a series of drug doses. It's just a single row of data, and the data is all of the same class, so it's either all numeric, all character or all logical.

Similar to a vector is a list, which again is like a single row of data, but it can hold lots of different types, so you could have character mixed in with numeric mixed in with logical. Also, a list can be a list of lists: you can have a list embedded within a list, you can have a plot embedded within a list. So it's a way of storing lots of different, complicated data types together, though to be honest most people initially don't play with lists that much.

A matrix we won't really talk about here, but you may come across it. It's basically a 2D version of a vector, so instead of just having a single row of data you have rows and columns of data, but they're all the same type, so they're either all numeric, all character or all logical.

And probably the most important one for us, as well as the vector, is a data frame, which you can kind of think of as loads of vectors pasted together as columns. So it might be that every column is a different variable, so one column is group and the other one is response, and within each column is a single data type, so it could be numeric, character or logical again, but they're all pasted together. If you've been using Prism, then the table that you put your data into in Prism is a bit like a data frame in R. So it's a way of organizing your data in a way that you can use for your analyses and your plots. We'll talk about that in a second.

So the first one is vector. Oops. Is everyone with me so far? I know nothing has come together yet, but I promise you it will, it
will make more sense. So if we want to create a vector of data, which is just a series of values that are of the same type, we use this function c, which stands for combine. Again, we'll talk about functions later; we know it's a function because it has a name and two brackets, and you put its arguments inside the brackets. So let's say that we want to create a vector that represents patient ID. We've got ten patients in our study, so we want a vector that basically has the values one through ten. We can do that by putting the values in and separating them with a comma. You don't have to put a space, but the standard in R is to put the comma and then a space to make it more readable. So we've got our values one through ten, and then if you hit Control + Enter it prints out our vector. So all a vector is is just a series of values that are of the same type. So, without being patronizing, what kind of vector is this? What data class is this vector? They are integers, yeah. I'm going to take one step back from integer: yes, it's numeric.

Another, quicker way that we can generate a series of numbers is using the colon symbol. If we say 1 colon 10, that basically says generate the sequence of numbers between 1 and 10, so if you hit Control + Enter it prints out exactly the same. It's shorthand rather than typing out every single number, and you can imagine that if we wanted to generate a vector of 10,000 values you wouldn't want to type all that out by hand, so you can do that quickly by using the colon.

So that's a numeric vector. We can make a character vector in basically the same way. Maybe we want to make a vector of days of the week. Remember, for it to be a character it needs to be inside double quotes, so let's do "Monday", separated by a comma, "Tuesday", etc. Let's go as far as Friday. So inside this combine function we've got, this is a character element, "Monday", this is a character element, this is a
character element, and they're all separated by commas. What this combine function does is stick them all together into a single vector, and this is a character vector. If we print that, we get the vector printed out, and you can see that it prints the elements with quotation marks as well. The quotation marks are important, so if I try and do the same thing but without quotation marks, it says object 'Monday' not found, because without quotation marks R is looking for an object called Monday and there isn't one.

We can have a vector of logical values, so we can have, oops, TRUE, TRUE, FALSE, and hit Ctrl + Enter, and again we get a vector of logical values. That doesn't work if we write true, true, false in lowercase, because it's looking for an object called true. But what you can do is abbreviate TRUE and FALSE to T and F, and R is happy with that; it expands the T and the F to TRUE or FALSE.

Earlier we talked about creating objects, and you can assign vectors to objects. So let's create an object called my vector and assign this to it; copy and paste if you want, to make it easier, and then hit Ctrl + Enter. So it's stored this vector inside the object called my vector, and then, actually let's do this in here, if I type my vector and hit Ctrl + Enter, it prints out the TRUE, TRUE, FALSE vector. Let's create another one, an object called numbers, and assign to this object the numbers 1 through 10, Ctrl + Enter, and then if I call numbers it prints me the vector. So is everyone happy with that, that we can create vectors of data, assign them to objects and then call them in R? The reason vectors are kind of important is that, you can imagine, if you're inputting your data manually into R you can do it via vectors. So you could create a vector for group and a vector for response, and then paste them together into a data table and use that for your analysis.

You can perform mathematical
operations with vectors. So we've created this vector, numbers, and let's say I don't just want the numbers 1 through 10, I want to find out what every number in my vector is times 2. If you run that, you can see that R performs this times-two operation on every single element inside our vector. We could do this with different operations, so plus 3, and we can even add different vectors together by name, so let's add this to itself, which of course is the same as doubling it. So not only can you save vectors as objects, you can use those object names to manipulate them.

Okay, so we've talked about vectors. The other, I guess similar, data structure to a vector is a list. We won't spend too much time talking about them because you probably won't find that you use them quite as often.

Yeah? No, no. The four dots? Oh, I see what you mean. So they're four dashes. Is it because it looks like dots on the screen? Oh, I'm sorry, yeah, they're four dashes. So if you do that, like this, and actually I haven't been doing it, then if I want to collapse this whole vector section I can click here and that collapses it away, and also if I want to go right the way back up to where we started with objects, I can click on there like that. Yeah, sorry guys, I know it's not very clear on the screen.

So a list is similar to a vector except that it's a little bit more flexible, in that a single list can accept different kinds of data, and you can have lists within lists, you can have regression models within lists, you can have plots within lists. So it's a way of storing lots of data. If the function for creating a vector was c, the function for creating a list is, handily, list. So let's create a list with the values 1, 2, 3, 4, 5, oh no, wait, sorry, let's go as far as 3. So we've got three numeric values. Let's add a character, "hello". These are all
separated by commas, and then let's add a logical value at the end as well, FALSE, and hit Ctrl + Enter. You probably can't see this on my screen, and also, if things don't fit on your screen, you can move things around in RStudio however you like. So you'll see that the list is printed a little bit differently from a vector. In a list, the main elements are printed down your screen instead, so element 1 is the number 1, this is 2, this is 3, this is "hello" and this is FALSE. It might seem a little bit weird that it does that, but the reason is that you can have lists within lists. So you can imagine that the first element could be 1, the second element could be 2, and the third element could be a list with the values A, B, C, and then they would be shown horizontally across your screen.

So let's have a look at what I mean, if that's completely confused you. We created this earlier, my vector, so that's still in our environment, and it's a vector of the values TRUE, TRUE and FALSE. So let's create a new list, and let's start it with my vector. We can just call the object name and R will look through your environment, find my vector, and know that it's a vector of these values. So the first element in the list is my vector, and then let's go 1, 2, 3, oops, "hello".

Sorry? Well, it could be whatever I want, so if I want it to be "HELLO" then we can do that. Oh no, it wasn't sensitive to the case, it was just that it needs quotation marks. Yeah, if I'd written it all lowercase then it just would have printed it out as lowercase, so if you need it to be capital or lowercase then you can do it as you need it.

So now if we hit Control + Enter, let's look up here again. Now you can see that the first element in the list is the vector, which has three values, TRUE, TRUE and FALSE, and then the other elements are just individual elements. So does that make sense,
how a list operates differently to a vector, and why it could be useful? Why can't I find the... oh, there we go. Also, this 1, 2, 3 in the middle, actually don't run it yet, let's save ourselves some typing: instead of having individual elements 1, 2, 3, let's make this a single vector, and now hit Control + Enter. Now it's done the same thing but with 1, 2, 3, so the first element in the list is the vector TRUE, TRUE, FALSE, the second element is the vector 1, 2, 3, the third is the character string "hello" and the fourth is the logical FALSE. Okay, if that's confusing and doesn't make sense to you, don't worry about it. Vectors and data frames are the most important data structures that you will be using if you're doing data analysis, at least initially.

So the next most important data structure is a data frame, but we're going to come back to that a little bit later, because I want to talk about how you can subset vectors. Is there anything that has really confused anyone so far? No? Everything's fine?

So let's copy and paste these days of the week and save them in an object called days. If we print out days, it just prints out these days of the week. I'm supposed to make that a bit bigger. So we have our vector days; we may have created it, or it might have been created through a piece of analysis that we've done. Let's say we actually only want the first element, we just want Monday. So we want a way of being able to extract that single element from our vector of elements, and the way that you do that in R is you give it the object, you open square brackets, and then you give the index of the element that you want to extract. Monday is the first element, so we put days, square brackets, 1, and hit Ctrl + Enter, and it returns Monday. So if we wanted to return Thursday, what would you type? Four. Yeah, cool. Okay, so at least that makes sense. Okay, so yeah, if we
wanted... well, if you use a value that is outside the bounds, then it says NA, which is, sorry, not applicable or not available.

Okay, so we'll talk a little bit more about functions later on, but brackets like that, round brackets, are used for functions. You always put the function name, you open the round brackets, and then whatever arguments or inputs there are for that function go inside those. Square brackets are always for subsetting, so if you want to take out individual values you use square brackets. It's a bit more complicated, but if you start writing your own functions then function definitions go inside curly brackets, but there aren't many times, really, when you would use those. So round for functions and square for subsetting. Mm-hm.

If we want to select just the first three days of the week, or actually every other day of the week, we say days, square bracket, and then we basically give it a vector of values corresponding to the days that we want. So what does this look like? c to create a vector, and we want day one, day three and day five, and then you hit Control + Enter and it gives you Monday, Wednesday and Friday.

If you do it like that, without the c, it'll complain. We'll find out a little bit later, but basically the square brackets are used for subsetting vectors, and data frames as well. For a vector it needs one value, because a vector only has one dimension. A data frame has two, so what you do then, and we'll do this later on as well, is you open your square brackets, you put the index of the rows that you want first and then the index of the columns that you want second. So what we've just told R to do is take days and index the first row, the third column, and it doesn't know what to do with that. So really what we did here is basically give it 1, 3 and 5 as one vector, and that fits into the first position. So, make sense? Yeah.
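The subsetting steps just described can be sketched as follows. This is a minimal illustration, assuming the days vector holds Monday to Friday as typed in the session; the names first_day, fourth_day, beyond_end and alternate_days are hypothetical and do not come from the recording.

```r
# Day names as typed in the session: a character vector built with c().
days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# Hypothetical object names, used here only to hold each result.
first_day <- days[1]                # a single index returns one element
fourth_day <- days[4]               # the fourth element is Thursday
beyond_end <- days[6]               # an index past the end returns NA, not an error
alternate_days <- days[c(1, 3, 5)]  # a vector of indexes picks several elements

print(first_day)
print(alternate_days)

# days[1, 3, 5] would fail: commas inside square brackets separate
# dimensions, and a vector has only one dimension.
```

Wrapping the indexes in c() is what turns three separate values into the single index vector that the square brackets expect.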
It doesn't look intuitive when you do it, but it needs a vector of indexes. If you wanted everything except Monday, because Monday's awful, you could do 2 to 4, because this creates a vector of the values between 2 and 4, so 2, 3 and 4. Oh, sorry, let's put this up here. So remember, hit Control + Enter and you get Tuesday, Wednesday... oh no, hang on, we forgot Friday, the best one. There, you get Tuesday, Wednesday, Thursday, Friday. If you wanted all the days except Friday, how would you do it with a colon? Yeah, so 1 colon 4. Another way that you could do it is by explicitly saying give me everything in days except index 5, so minus 5, and that gives you Monday, Tuesday, Wednesday, Thursday, and it pulls out five. Okay, so that makes sense, how you can select certain values from a vector?

Okay, we're still going to go on to data frames, but one thing that I want to talk about before we do, which actually is fun, I promise, is functions. Functions are really how R does stuff, because without functions R is just a pretty way of storing numbers and values. So a function is basically, you imagine a box that has inputs; the function does something and gives you an output. You know that something is a function in R because it has some name, whatever name the function has been given by whoever made it, and then open round brackets, and R knows that this is a function and turns it orange because it's recognized it as a function. Some functions don't need any arguments at all; you just give the name, and you still put the round brackets even if you haven't given any arguments. But most functions will accept arguments, so inputs, basically. It could be a series of numbers, it could be options, and all of those things go inside the round brackets. So
Let's create a vector called my values, and make it so that it contains all the values between 1 and 100 without typing each number. Has everyone got something that looks like this? Yep. And then if we call my values we get it printed out like this. These on the side, by the way, if you notice them, are the index of the first value in that row, so this is number 1, this is 19, this is 37. Here you can obviously read the numbers, but even if these weren't numbers, if these were days of the week or names or whatever, this is an easy way of seeing where you are in counting.

Okay, so we've got a vector called my values, and this could be data that we've collected. Okay, yes, you can do it either way around; to be honest that's probably a better way of doing it. The c in that case was superfluous: this says create a vector between 1 and 100, but actually the colon symbol does that anyway, so you're right, it's the same thing.

So we've got values 1 to 100 and we want to find out what the mean is. Conveniently there is a function called mean, so you type the name of it, open round brackets, put my values in there and hit Ctrl+Enter. R passes your vector of values to mean and it gives you 50.5. Other functions: median, which should be, yeah, 50.5; min, which gives you the minimum value, and obviously max. If you've got a vector of values that you want to add together you can use the sum function. There's standard deviation, sd. And a useful one as well: we mentioned data classes earlier, so whether something is numeric, character or logical. If you're not sure which, which sometimes might be the case, the function class will tell you. So class of my values, and it says integer, just like you said earlier on. Integer is a type of numeric, so basically if there's no decimal place then it's an integer, otherwise it's a float
or floating point. And, oops, can't spell: length is useful, which tells you how many elements there are in a vector. Oops, that was just to demonstrate that R needs the case to be exactly the same. It gives you 100: you've got a hundred elements in your vector. These functions are predefined; they come with base R and they're ready for you to use when you open it up, so you don't have to define them yourself, although you can make your own functions, which is one of the reasons why R is so flexible.

These functions are summary functions, so they return a single value, whereas other functions will return a different value for every element in your vector. For example, if you want to transform your vector, the log function will print out the log values of your vector. But be careful, because this is the natural log by default, so if you want the log to the base 10 the function is log10, and then my values. It prints them out, so you can see that whereas these guys printed out a single value for your entire vector, these will print out a value for every element in your vector. Another one is the square root. And you can save the output of any function as an object as well, so if we create an object called my square root and assign it the square root of my values, then when we type my square root we get the square root of every single value in our vector.

If you want to do something in R and you don't know how to do it, you don't know what function you need to achieve whatever it is you want, Google is your friend. R is open source and there's loads of help online. So if you want to generate random numbers from a normal distribution, you type into Google how do I do that, and someone says, oh, you need to use the rnorm function. And you say great, how do I use the rnorm function? What arguments do I need to give it? The way you can find out is that R has a really good help facility: if you
type a question mark and the name of the function that you want to know how to use, so there's a function called rnorm, which stands for random normal distribution, and hit Ctrl+Enter, then you get this help page that opens up in the help viewer on the right-hand side. The page is titled The Normal Distribution. It gives you a little description of what the function is supposed to do, and it gives you the usage. This can look a little bit complicated to begin with. It gives you a few different functions here; we're interested in rnorm. The way to read this is that it shows you how to type the name of the function and it shows you what arguments the function accepts. So the rnorm function will accept an argument called n (we don't know what n is yet) and an argument called mean, and this one has a default value. If, in this Usage section of the help, any argument has a default value, it means that if you don't specifically tell R what to use for that argument, it'll use its default. So if you just say rnorm (we'll come to the n in a second) and you don't supply mean or standard deviation, it'll generate numbers that have a mean of 0 and a standard deviation of 1. You can specifically tell it what to generate and it'll do that for you. But if there's an argument that is defined, like n, and there isn't a default value, you have to supply it, otherwise it will have a hissy fit and give you an error.

To find out what these arguments are for, what you need to give R, you scroll down to the Arguments section. We're interested in n, and it just says number of observations. So if we wanted to generate a hundred random numbers drawn from a normal distribution with a mean of 0 and a standard deviation of 1, we would just type rnorm(100). Does that make sense? Yeah.
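The function calls from this stretch of the session can be collected into one short script. It is a sketch rather than the exact code typed in the workshop: the object names my_values and my_square_root are assumed spellings of what the speaker calls "my values" and "my square root", and set.seed is added here only so that reruns give the same random numbers.

```r
# Assumed spelling of the vector the speaker calls "my values".
my_values <- 1:100

# Summary functions return a single value for the whole vector.
print(mean(my_values))    # 50.5
print(median(my_values))  # 50.5
print(min(my_values))
print(max(my_values))
print(sum(my_values))
print(sd(my_values))
print(class(my_values))   # "integer"
print(length(my_values))  # 100

# Other functions return one value per element.
my_log <- log(my_values)            # natural log by default
my_log10 <- log10(my_values)        # log to base 10
my_square_root <- sqrt(my_values)   # output saved as an object
print(head(my_square_root))

# In an interactive session, ?rnorm opens the help page.
# args() shows the same usage line: n has no default, mean and sd do.
print(args(rnorm))

set.seed(1)           # not in the session; makes the draw repeatable
draws <- rnorm(100)   # mean = 0 and sd = 1 are used by default
print(length(draws))
```

Leaving out n, as in rnorm(), is an error because n has no default value, whereas mean and sd can be omitted.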
And you can do that for any function, so if we type question mark mean then we get one for mean as well. So this is a really useful way of learning how to use functions.

So we've worked out how to use our rnorm function. Say we want to generate a random normal sample with 100 values and a mean of 5. We know from our example that we type rnorm, a hundred, and we want to set the mean as five, and if you hit Ctrl+Enter it produces a hundred values whose mean is about five. Another point to mention is that R will match arguments by position or by name, which means that I could have also written it like that. Because in the definition the number comes first and the mean comes second, even if I don't explicitly say that this is the number and this is the mean, it knows because of position which one's which. Sometimes it's fine to just not name them; for more complicated functions it can be better to be explicit, and you use the equals sign, so n equals 100 and mean equals 5. A lot of functions are written in such a way that you don't have to type the name of the first argument, because it should be obvious in whatever you're doing, but just so you're aware, you can do it different ways.

Another function is hist, which draws a histogram. This is nice and simple. Usage: hist has one argument, x, and we'll ignore the dots for now because they're not so important. The x that it needs is a vector of values for which the histogram is desired. So let's use the hist function, and let's just copy and paste this rnorm call inside our hist. So you can put functions within functions, and the way R will handle that is that it'll do the innermost function first: it runs rnorm with n equals 100 and mean equals 5, and that results in a vector of 100 values, and then it passes that vector to the outer function, hist. So then if we hit Ctrl+Enter,
what we get is a little histogram of our 100 values that appears on the right-hand side. Your histogram might look a bit different to mine because we used a random number generator, so it won't be identical. If you want to make this bigger you can always click Zoom, and it brings it up in a separate window, and you can stretch this and make it as big as you want, and then close that. Sorry, so up here in the plot area is Zoom, and then you can open it up like that. And yes, if you go to Export you can save the image; it's got a few different image formats, but it also has a PDF option as well, so you can save it like that.

So is everyone happy with how functions generally work, that you pass arguments to them and they give you some form of output, which could be a plot or could be a value? Okay.

So, data frames. This is the last important data structure, probably the most important data structure if you're analyzing data in R. A data frame is just a series of vertical vectors pasted together, so that each vector represents a different column of your data. It could be treatment group, it could be time, it could be whatever your output variable is. There are a couple of ways that you can create data frames. The way we're going to do it now is to make individual vectors and then paste them all together, so we manually create each column of our data and then put them all together.

The first column we're going to call ID, and we're going to assign to it the values between 1 and 200; this could be patient ID. Hit Ctrl+Enter. We're going to create another one called group. This is going to be a vector of the word vehicle 100 times and the word drug 100 times. So you could spend hours typing them out over and over and over
again, but it'll take you ages, you'll probably make a spelling mistake in there somewhere, and it kind of ruins the point of programming if you're doing things slowly and inefficiently. So we're going to use the repeat function, rep, which is super useful. You can see that at the moment we've got the repeat function inside the create-vector function, c. The repeat function takes whatever value you want to be repeated and the number of times that you want to repeat it. So we're going to repeat the character string vehicle, then a comma so that we can add the second argument, and we're going to do this a hundred times. That's the first vector inside our vector. The second vector that we're going to paste together with it, using the combine function, is another repeat call; this one is drug, and we're going to repeat that a hundred times. That's probably the most complicated line of code that we've written so far. Does it make sense what's going to happen when we run this? So we create group, and if we call group we get vehicle repeated 100 times and then drug repeated 100 times. That's so much faster than doing it yourself.

The next column that we're going to make is our response variable, and we're going to do something similar. It's going to be a vector, and this time we're going to paste together two random samples of data. We're going to use our favorite rnorm function, and we want to generate a hundred values with a mean of 25 and a standard deviation of five. This is the first part; can you see that these are going to be the values that correspond to the vehicle group? If you can't yet, don't worry. The second part of our vector is going to be another rnorm call. Again we want a hundred, these are going to be the values for the drug group, and this time we can have a mean of 23 and a
standard deviation of five. So that's now the most complicated line of code that we've written. Does that make sense to everyone, what's going to happen? Another thing you can do, if you start to find that lines of code are really long and confusing and your eyes are going blurry: R is quite happy for you to split code across multiple lines. So if we hit Enter here we can put this on a separate line, and now maybe this is a bit more readable. R is going to combine together this vector and this vector, and when you run the code from the first line it'll run through to the second line as well. And again, if we just call this, we've got our response variable, with the two random samples combined together.

Now let's make our data frame. We're going to call this my data, and the command to make a data frame is, handily, data.frame. You basically give it the column title that you want, so our first column is going to be, oops, patient, sorry. Okay, to rename something, it's probably easiest to just create another one and remove the old one, yeah. So the first column is going to be called patient: you give it the column title, then the equals sign, then the values that you want, so patient is going to be equal to the ID vector that we created earlier. Let's go on to the next line to make it easier for the next column. The second column in our data table is going to be called treatment, and this is equal to the group vector that we created. Finally, the last column is going to be called response, and this is going to be equal to the response vector that we created. Then if you hit Ctrl+Enter on the first line of this call it creates my data, and if we have a look at my data, you can see that R has printed out what actually looks something like a data table that we might have entered
into Excel. So we've got our three columns, patient, treatment and response, and each one is a vector. Basically what we did was define each vector individually and then use the data.frame function to paste them all together into a data frame. What we could have done instead is this: we could have just defined them on the fly. Does that make sense? But that isn't very readable, so it's often easier to define each vector individually and then paste them together afterwards. So that's ugly; let's put that back to where it was.

Okay, another little handy tip. When you call a data frame it just puts the whole lot on to your screen and dumps you at the bottom, which often isn't very useful. For a second let's just clear everything: if you push Ctrl+L it clears the console for you, so it's nice and clean. Instead of just being dumped with the entire data set (and sometimes, if your data set is too large, it simply won't show all of it to you), there are ways of exploring and understanding our data a bit more intuitively than having all the numbers printed at us. Instead of viewing everything, the head command is very useful: what this does is print out the top six rows of our data table. Say we were interested in the top 10, then you just supply the number of rows that you want and it shows you the top 10 rows. Similarly the tail command, well, it doesn't do the same thing, it shows you the last six rows, and again if I want to see the last two then I can do that. And rather than seeing everything all at once, if you want to know the dimensions of your data frame, how many rows and how many columns you have, the dim command, if you pass your data frame to it and hit Ctrl+Enter, tells you that you have 200 rows and three
columns. In R, rows always come before columns, in case you weren't sure, although it's quite obvious from my data set: 200 rows and three columns. The structure command, str, is useful but maybe isn't the most obvious to understand. It shows you the structure of your data: it says you've got a data frame with 200 observations of three variables, and then it shows you the name of each of your variables, each of your columns, and what the data class is. So patient is an integer, and it shows you the first ten or so values. Treatment is a factor with two levels, drug and vehicle, so it's clever enough to know that even though you entered it as a character it's a factor, and it has two levels. And response is numeric, and it shows you the first few values of that. So structure is useful as well. And then finally, summary just dumps a load of summary statistics about your data all at once, which is useful: for patient, or more usefully for response, the mean, median, the quartiles and the max, and if you have a categorical variable like treatment it tells you how many you have in each treatment. So that's a way of exploring your data a little bit and understanding it, rather than just having the entire data frame dumped in front of you at once.

Yes, so there are functions for that. You have the as.numeric function. Say we've got numbers stored as character strings like that, maybe because the software that ran your experiment outputs them like that: if you wrap the strings in this as.numeric function it should give you the numbers, so you could change it that way, and you could actually pass an entire column of data to it and it'll do that. Or you could do as.character, yes, exactly, so you could change an entire column into whatever you wanted.

So we've got our data frame that we've created manually, and we want a way of extracting certain features from it. We maybe don't want to analyze everything all at
once; we want to select certain rows or columns, and you can subset them in basically the same way that we did for vectors. So if we've got my data, the only difference between subsetting vectors and data frames is that you supply two values, one for the row number and one for the column, rows first like we mentioned earlier. So if you want the first row of the second column you just type it like that, and you get the value vehicle, and handily, because R knows it's a factor, it also reminds you that you have both drug and vehicle in this column. What's the value of row two, column three? Has everyone got... oh no, we were generating the data randomly, weren't we, so you'll all have different values. Yeah, that didn't work. Okay, but you all get the idea anyway: that gets the second row and the third column.

If you wanted the first 20 rows and the second and third columns, you're basically creating a vector here between 1 and 20 for the rows, and another one for the columns, so this is the first 20 rows for columns 2 and 3. If you wanted the first 20 rows and all the columns, we just leave the option for columns blank, and R assumes that we want everything and gives us all three. Similarly we can do it the other way around: if we want all rows but only the first column, we just leave the rows section blank and it gives us everything for the first column. It also gives it to us as a vector, you see, so it's no longer vertical.

You can also call columns by their actual names. So if we want all rows for just the response column you can do that, but you have to give the name of the column in quotes, and then again it gives us all the values for the response column. An easier way of doing this is by using the dollar sign, the subsetting operator. So if you have a data frame and you want to select a column from it, rather than doing it this way, the
shorter and faster way of doing it is just by inserting a dollar sign and then, without having to use quotes, typing the name of the column that you want. It basically tells R: look at my data frame and just give me the column called response. And it does the same thing.

Okay, you can use logical expressions to pull out rows or columns that meet certain criteria in your data frame. For example, let's say we want all rows that had a response higher than 26. Again we've got our row and column separator, and we want all rows for which my data response was larger than 26. This seems a little bit weird at first because you're having to supply my data twice, but all this is saying is: look at my data and just give me the rows for which the my data response vector is larger than 26, and give me all columns as well. So if we run this we get all those rows of the data frame, and you'll see that there aren't any values in the response column that are lower than 26. Does that make sense? Because that's really useful for quickly subsetting and pulling out certain rows of data that you want.

What else can we do? Yes, absolutely. So if you want to create a new column in your existing data frame, you give the data frame, you give a dollar sign, and then you give the name of the column. It doesn't exist yet, so what R will do is create the column for you and then put in whatever values you've given it, and you use the assignment operator as well. So this should work... no, it doesn't work, because, yeah, we didn't actually need to subset. So we just said, basically, look through the my data response column: if it's larger than 26 it gets assigned a value of TRUE, if it's not larger than 26 it gets the value of FALSE. And we've assigned that to the column positive, so if we now look at my data we've got an additional column called
positive, and it has a value of TRUE or FALSE for each row. Yeah, you could do it the other way around, so if I'd done it like this then it would swap it the other way around. Does that make sense?

Let's do something a bit more complicated. Sometimes you want to subset your data based on more than one criterion, so let's do a complicated one. We could do my data treatment is equal to... so when you're doing logical expressions, something that evaluates to either TRUE or FALSE, the is-equal-to isn't a single equals sign, it's a double equals sign. If we omitted this second equals sign it wouldn't make sense. So we want the rows where treatment is equal to vehicle and the response variable is smaller than or equal to 23, and I want all columns. Okay, so that's quite complicated. This is the section that evaluates which rows it's going to return, and this little bit is for the columns, and we've left that blank, so give us all the columns. It's basically said: if the row is part of the vehicle group and the response is smaller than or equal to 23, return that row. So if we run that, it returns this subset. Does that make sense? We've basically subsetted our data by combining two separate criteria with Boolean logic: you know, AND, NOT, OR.

So if we wanted rows where either it's part of the vehicle group or its response is smaller than or equal to 23, you can use the pipe operator. This is usually Shift and the key next to Z on your keyboard. So basically if a row has a value for treatment equal to vehicle, or its response is smaller than or equal to 23, it'll return that row. Does that make sense? Because I know it's a bit complicated, but it's really powerful and useful when you're confident with it. Another one: in this case we've only got two groups, so we could say drug here, but actually if you want
to say things that are not equal to vehicle, then you use the not-equal-to operator, which is an exclamation mark and an equals sign. So basically anything that is not vehicle, or has a response of less than or equal to 23, gets returned. Yeah? Okay.

And we talked about adding a new column. So what we can do is define a new vector called age. Let's say there's 200 of them, with a mean of 40 (I'm not going to explicitly type them out) and a standard deviation of 20. If you just highlight the code and run that on its own for a second, you'll see that R has a default number of decimal places. We're only simulating data here, but it would be a little bit silly for people to have ages of 11.871891, so we can wrap this inside the round function and R will round that off so that we have whole-number values instead, which makes more sense. Then we want to add this to our data frame, just like we added the other column. So we can say my data, dollar sign, age, and R recognizes that there isn't already a column called age, so it creates one for us, and this is going to take on the value of the age vector that we just created. Then if we call head of my data we can see that we've now got patient, treatment, response, positive, which we defined a second ago, and age.

Okay, so sometimes, if you've got a small data set, it might be useful to create data manually in R like that, but most of the time most of us will probably organize our data into some kind of spreadsheet and then read it in. That's probably the last thing for today. So you've all got the example data that came with the email. What I would suggest is that you put that in a folder that you can easily get access to. There's a CSV file, which is a comma-separated values file, a text file that a lot of spreadsheet programs can open, and there's also an xlsx file, because a lot of people work with Excel files, although it's
slightly easier to read in a CSV file; you can read Excel files in R too. They're both in there just to show you the differences for each of them. We'll start with the CSV file. Let's call this pokemon. The function for reading a CSV file into R is read.csv, and then you supply the name of the file in quotation marks. There are a couple of ways you can do this: you can either supply the full file path, or a much better way of doing it is to set what's called the working directory. At the moment, if I type in pokemon.csv as the name of the file, R will say it cannot open the connection and cannot open the file: no such file or directory. The reason is that if you don't supply a file path, R has a default place where it looks for files if you just give it the name. At the moment, at least for me, it's just looking in this folder, and the pokemon file isn't in that folder. You can explicitly set your working directory with the set-working-directory function, setwd, but actually RStudio has a much better way of doing this, which is to create a project file. Basically what that does is set the folder that you're working in, with your data, as your working directory. So go to Project in the top right-hand corner, then New Project, and if you've already got the pokemon data in a folder, then go to Existing Directory.

Okay, yeah, you can save the file; probably save it with the pokemon data, in whichever folder that is. So Project, New Project, Existing Directory, and then navigate to the directory. I actually already have one open, but let's do it anyway. Then Create Project. Once you've done that... okay, so it may get rid of the file that you just created. Has it done that? Okay, that's not the end of the world, because everything that
we've typed up is in the email that you guys got sent, but neater, so it's not the end of the world. Ideally the best way to work is, when you start working on something, to create a project at the start, and then you won't lose anything; it's just the way we're doing it now. No, no, it'll be set to wherever that R project file is, whatever folder that's in. Okay, so has everyone been able to do that now? Yeah, sure, sorry. Okay, so New Project, Existing Directory, and then navigate to wherever the pokemon files are, yep, and then just click Create Project in the folder. Oh yeah, that's fine, there doesn't have to be anything else in there. There's an R project file that just sits in there to remind R that that's its working directory. Yeah, sorry. Because the best way of working is to make the project first and then do everything else.

So if I quickly type get working directory, getwd, it now shows me that this is where I've put my file. We're nearly done: we're just going to read in the pokemon data and then that'll be it for today. So we're going to call our data pokemon, then read.csv, the file name is pokemon.csv, and there's an additional option, header, which by default is TRUE, which basically just controls whether R reads the first line of your spreadsheet as column names. Sometimes you'll have data where the first line is headings for your columns and sometimes it won't be. By default it's TRUE, so you don't actually have to set that, but it's useful to know. When you hit Ctrl+Enter, because this file is in our working directory, R just looks for this file and reads it in, and it's happy. So if we type pokemon, a lot of data is going to come up, and actually what you find is that it can't fit all the columns side by side and has to split them. There's a load of data, and it should look
something like that. So imagine that you've got the best PhD in the world: your PhD supervisor is Professor Oak, and the aim of your thesis is to go and catch them all and to measure things about them. These are the official data for the first 250 Pokemon, but we can't see everything because there's so much data on there. We learned earlier about some functions that we can use to explore our data a little bit better. So we can use structure, and we can see we've got a data frame with 251 observations of 15 variables. We've got the National Pokedex number, NAT, which is just the index of the Pokemon, the name, its hit points, attack, defense, special attack, special defense, speed, the total of all of these values, the main type, and then some of them have a secondary type as well. We can also have a look at head of pokemon, and you can see as well that we've got evolves from, evolves into, and there's an extra one, captive, that I added. I mean, I didn't do this manually, I found it online.

And then the very last thing that we'll do today. Although you can save CSV files quite easily from Excel, often for convenience people work in an xlsx file. The function to read an xlsx file is, conveniently, read.xlsx. But if we do this, so now we're reading the other file, not the CSV file, R will give an error: could not find function read.xlsx. But it definitely exists. The reason is that R is a modular language, so it comes with base functionality, and then it has additional packages that you can add into it, a bit like plugins, that different people write. So if you were able to follow the instructions and install a few packages, this will work. If you haven't, go to Tools, Install Packages, and in here just type xlsx, then hit Install and see if that works. But if you've already done it you don't need to do that. So you have to install
additional packages to be able to use the functions they contain, but also R doesn't load all of the packages into memory at once, because if you had loads of packages going on it could slow things down a lot. So if you want to use a particular package, which is like a library of different functions, you have to specifically tell R that you want to load it into memory, and the way we do that is with the library function. So library, xlsx, and hit Ctrl+Enter, and R has now loaded xlsx. It's sort of like a plugin, so we now have access to all of the functions contained within the xlsx package that some person wrote and made available for us.

Now if we run our line of code with the read.xlsx function, it again gives an error, because there's an additional mandatory argument for the read.xlsx function, which is the sheet index. Often you'll be working in an Excel spreadsheet with data across multiple sheets, so whether you want to read from sheet 1, 2 or 3, you have to tell R which one you want. There's only one sheet in our data set, so we want to read from sheet 1, and now if we call head on pokemon then again we've got our data. So we've read our data in as a CSV file and we've read it in as an xlsx file.

So I've probably gone on a lot longer than I intended to, sorry. I know some of the things were quite complicated. I think we'll stop there for today, and next week we'll spend some time exploring the pokemon data set and analyzing it, doing t-tests and ANOVAs and things on it. Is there anything that anybody has questions on now, or anything I didn't explain very well, or that you want me to go back over, or have you got specific questions about what you want to do?
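The data frame part of the session can be sketched end to end as follows. Several things here are assumptions rather than the workshop's own code: the object names (id, group, response, my_data) are guessed spellings, set.seed is added so the random numbers repeat, and a temporary CSV file written with write.csv stands in for the workshop's pokemon file, which is not available here. The xlsx lines are left commented out because they need the xlsx package to be installed.

```r
set.seed(42)  # assumption: fixed seed so reruns give the same numbers

# Build each column as its own vector, then combine them.
id <- 1:200
group <- c(rep("vehicle", 100), rep("drug", 100))
response <- c(rnorm(100, mean = 25, sd = 5),   # vehicle group
              rnorm(100, mean = 23, sd = 5))   # drug group

# stringsAsFactors = TRUE reproduces the factor column seen in the session;
# since R 4.0.0 the default is FALSE and treatment would stay character.
my_data <- data.frame(patient = id,
                      treatment = group,
                      response = response,
                      stringsAsFactors = TRUE)

# Explore without printing all 200 rows.
print(head(my_data))     # first six rows
print(tail(my_data, 2))  # last two rows
print(dim(my_data))      # rows first, then columns
str(my_data)
print(summary(my_data))

# Subsetting is [rows, columns]; a blank side means "everything".
print(my_data[1, 2])
first_twenty <- my_data[1:20, 2:3]
by_name <- my_data[, "response"]
by_dollar <- my_data$response

# Logical subsetting, with & (and), | (or) and != (not equal).
high <- my_data[my_data$response > 26, ]
both <- my_data[my_data$treatment == "vehicle" & my_data$response <= 23, ]
either <- my_data[my_data$treatment == "vehicle" | my_data$response <= 23, ]
not_vehicle <- my_data[my_data$treatment != "vehicle", ]

# Assigning to a column that does not exist yet creates it.
my_data$positive <- my_data$response > 26
my_data$age <- round(rnorm(200, mean = 40, sd = 20))

# Stand-in for the workshop file: write a CSV, then read it back.
csv_path <- tempfile(fileext = ".csv")
write.csv(my_data, csv_path, row.names = FALSE)
reloaded <- read.csv(csv_path, header = TRUE)
print(dim(reloaded))

# Excel files need the xlsx package to be installed and loaded first:
# library(xlsx)
# pokemon <- read.xlsx("pokemon.xlsx", sheetIndex = 1)
```

Inside an RStudio project the working directory is the project folder, so a bare file name such as "pokemon.csv" would resolve there; getwd() shows which folder is currently in use.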
Info
Channel: Hefin Rhys
Views: 413,562
Rating: 4.9389219 out of 5
Keywords: rstudio, statistics, programming
Id: lL0s1coNtRk
Length: 91min 20sec (5480 seconds)
Published: Mon Jul 10 2017