How to use ggplot2 in R | A Beginner's RStudio Tutorial

Captions
Hello, good afternoon everyone, and welcome to another episode of the RStudio tutorial. My name is Brendan and this is LiquidBrain. Today I'm going to talk through the package we call ggplot2, a very commonly used visualization tool for big data in R. I'm going to cover basically six points today. The first will be some basic structure and syntax of ggplot2: how to start off when you have a dataset you want to draw, and what the first thing is that you would do. Once you've got that, we have to look into what your data structure is and what you are aiming to draw: how to figure out the difference between continuous data and discrete data, and how to plot things with the correct graphics. Then we go into the aesthetic definition of the data itself, as well as the aesthetics for individual types of graph. Then we go into the actual plotting process. The first will be primitive plots: just a horizontal line in a plot canvas, a vertical line in the canvas, and, let's say, a slope in the canvas, and how you do that when the line does not actually come from the data itself. Then we go a little more advanced into single-variable plots, in this case something like distribution data, which is a histogram; you can also do geom_bar and so on, and you'll see some example data. Once we're done with a single variable, we try two variables, x and y, so we can do geom_text, geom_point, geom_jitter, a box plot and so on. Those are based on two parameters that can be continuous or discrete, so we'll go through the different combinations. Lastly, we draw data based on COVID-19, actual real-world case data I downloaded from the ECDC.

The first thing you need to do, of course, is to install ggplot2. That's fairly straightforward: go to Packages, click Install, and just type ggplot2,
so just click Install, find the ggplot2 package and press Install. That will walk you through the installation process, and usually it installs just fine. ggplot2 is a relatively straightforward package without too many prerequisites, so usually a straight clean install is fine. Once we've done that, let's run this command, library(ggplot2). This tells RStudio to load ggplot2 into the current environment so that you can use its commands directly later.

Today we're going to use a public dataset called mtcars, and I'm going to assign it to an object we call data. This is not a common thing to do, but it helps people easily see the data they are working with, because it appears in the top right corner over here. The source is from 1981, where they were trying to build multiple regression models, and you can have a look at the source material over here. The first thing is always to look at the dimensions of the data: how many columns and rows there are. There are two ways. The first is through a command, dim: d-i-m, open and close brackets, with your object inside. That tells you it's 32 by 11. The first number always refers to the observations and the second always refers to the variables. If you click in over here, you can see there are 32 rows of data and 11 columns. In R, the observations are always the rows while the variables are always the columns; it's just a different naming scheme. To have a quick look at our data over here: we have mpg, we have horsepower, we have... yeah, no idea what that is, let's go back to the previous one and have a look. OK, so to understand the data you can use a command called str, for structure: s-t-r, open and close brackets, with your object in the middle. That will give you the basic idea and
information on the data frame itself. As you can see from the output, this object is a data frame that has 32 observations and 11 variables, and this is the part you want to look at: in this case all eleven variables are numeric, so they're numbers, basically. I have summarized the individual definitions of the short forms over here; well, not summarized, actually, I pulled the information straight from the website. So you know that mpg is miles per gallon, which is the fuel consumption of the individual cars; qsec is the quarter-mile drag race time, which is how fast they accelerate; am is the transmission, where 0 refers to automatic and 1 to manual; and so on. You also have the number of gears, the number of carburetors and so on. These are the different types of data we work with. In this case all 11 of them are numeric, which means all of them are treated as continuous data; there's no discrete data over here. We'll get to discrete data when we reach it.

The first thing to understand is this: when you look at data like this, what kind of graph do you want to draw? In this case, let's say we want to visualize two parameters, gear and mpg. You want to know whether the more gears a car has, the better its miles per gallon. Try to think about all of this in your mind before you even start to plot the data. Once we have that thought out, let's see how we put the syntax to work. The first thing, of course, is to call a function, ggplot, and put your data in first; that is the first parameter you feed into ggplot. Then you define the aesthetics: what is x and what is y. You don't always have to declare both of them; if you only have a single parameter you can just put x equal to something, and that works as well. They are very specific to the plot
that I want to do later. This is what we call defining the data and the aesthetics for ggplot itself. ggplot has a layered structure, where you have the basic data and aesthetics defined and then you add layers to it to add different elements to your graph. Once we run the first command, you generate an object called a. If you click into a, you'll see what the data is, what the layers are (currently there are no layers), the scales, the mapping, the coordinates, the labels and so on. These define the different pieces that make up your graph later on.

Now let's add the first layer. The first layer adds a blank canvas, which is what we call geom_blank. The "geom" here just tells people that this is a ggplot object you want to add; every single graph type is preceded by geom_ and then something. geom_blank means it's empty. If you run the first command over here, you add the layer to the object from just now. Go to the object and you'll see we have more layers over here, because we have added the blank canvas. How do you print out the graph? Just type a and press Run, and that gives us the plot. Once we've done the blank canvas, let's go to the second step. ylim is the limit of the axis: you can see the y-axis over here runs from 0 to 35 while x runs from 0 to 5. That gives you a blank canvas with a very specific x-axis and y-axis. Once we've done that, we want to add a horizontal line. In this case geom_hline means a horizontal line, with yintercept at 25, so it intercepts at 25 over here and draws a horizontal line. Let's run it and check it by typing a and pressing Run. You can see we have drawn a horizontal line at 25. Now let's add the vertical line. A vertical line is geom_vline, with x
intercept equal to 4. Press it again and print, and we have added a vertical line to the plot. Now let's go to the third one, which is geom_abline. An abline is a straight line that isn't restricted to horizontal or vertical; it's basically a slope, if you want to understand it that way. To add the slope we just say a plus geom_abline, with intercept equal to 10 and slope equal to 5. Let's just run it; it's easiest to demonstrate directly. You can see it intercepts y where x is 0 over here. Because the plot gives us a little leeway in front of 0, it doesn't show the 10 here, but I can assure you that when x equals 0, y equals 10. The slope equal to 5 comes from y = mx + c, where c is the intercept and m is the slope, so m equals 5 and c equals 10. You know, the standard straight-line equation.

The final a over here prints out the plot. Usually, if you want to just draw the graph without checking each individual step, you can highlight the whole thing and run it as a whole, and that generates the graph directly. You can also go back to the object and see that the layers now hold a lot more, because we have been adding more and more things into this object. If you print a, you just view the graph itself, so you store the whole graph in an object. That concludes our very primitive, very basic plot, as well as a little bit of data structure.

Now let's get a little more practice by going into single-variable plotting. This time we're going to change the dataset a little so that it has more different types of data we can use. The next one is a publicly available
dataset called mpg. You can go to the website to see the sources, where it comes from and so on. First things first, always check the dimensions of the data. We have 234 rows, which is 234 observations, and 11 variables. So it's a lot bigger, and you can see there are a lot more things over here: there are words, which we didn't have just now. Just now everything was just numbers; here things are a bit more dynamic, so you have a lot more to look at. Let's again check the structure of the dataset... whoa, we don't want that... interesting. So the next one is, of course, checking the structure of the dataset itself. You can see it's a tibble. A tibble is a kind of table, and you can treat it as a data frame; they are very similar in this context. They are not similar in many other ways, but in this case it doesn't really make any difference. You can see the first one is manufacturer, which is a character. There's the model name of the car, displ is the displacement of the car, year is the year of the car, then cylinders, transmission and so on. If you don't know, cty is the city miles-per-gallon value, which is the fuel consumption in the city, and hwy is the fuel consumption on the highway, and so on. Of course there are a few more, and the different classes. These are all different properties of a car.

Now let's start with the first discrete parameter. The main difference between discrete and continuous is that discrete values are usually factors, so they can be classified into individual blocks; gender, for example, would be male, female, and everything in between as others. Let's say you have the number of cylinders, so you have 4, 5, 7, 8, 9. They're just discrete groups; there's no such thing as 2.5 cylinders in a car. There's no such thing as a year that is 2008.2 or 2008.3; when we define the year of a car it is always just 2008,
2009, 1990 and so on. There's no crossover in the middle, so everything is a very distinct class and that's it. Manufacturer and model are obviously like that too: Audi, A4 and so on; there's no half Audi and half BMW. That's the main difference from continuous. For data such as displacement, which is a number, there's 1.8, there might be 1.81, there might be 1.85, there might be 1.8153. For continuous data there's always something finer you can cut in between, which is what the word continuous means, whereas discrete data only falls into different buckets.

In this case we're going for one discrete parameter first. We're going to plot something called fl, which I don't really understand (it is listed as the fuel type); it's just one of the properties of a car. We use the same thing, ggplot, and then we put the data in. The data here is just called mpg and the aesthetic here is fl, which has c, d, e, p and r, so we can understand a little more about the dataset itself. This is just an example of how we can draw a bar chart based on a single variable. Here the aggregation function (not the aesthetic) is actually a count. You can see there's a lot of r, less p, a little bit of e, and c is the least among them. You can see the structure is very similar to how we drew the primitive graph above: first the ggplot function, define the dataset and the main parameters you want to plot, and then you add on the layers for the actual plotting you want, in this case geom_bar, so a bar chart.

Now, instead of plotting discrete, what if you want to plot continuous? In this case we want to use a histogram, which automatically buckets the values. Histograms and bar charts are very similar to each other, except that a bar chart is usually used for discrete data while a histogram is used for continuous, because
a histogram has an automatic bucketing function, and you can set the buckets to whatever size you want. First thing again: ggplot, define the dataset, add geom_histogram, and print it out over here. In this case the message says it is using bins = 30 and to pick a better value with binwidth; we will go through that when we reach it. Every single plotting function and layer over here has individual parameters you can tweak. You can see from the count that there are a lot of cars with four cylinders, this many cars with six cylinders and this many with eight cylinders, and there are a few cars with five cylinders, which is really weird. There are no cars with seven cylinders, which makes sense, since cylinder counts are usually even numbers. Five-cylinder engines usually come in a layout where there's a way for the cylinders to cancel each other out without imbalance problems, which is also why we can have three-cylinder cars with a 120-degree offset, and so on. But this is how we can easily visualize the data.

I feel this is not a very good example of a histogram, because the number of cylinders is not really continuous data. So we'll try another one with displacement. Displacement is basically how big the engine is: a 1.8-liter engine, a 2.4-liter engine and so on. Using the default bins over here, we can kind of see the distribution of displacement for the sample: around 2.3 to 2.5 over here is the most common, and there are very few cars with more than a six-liter engine, because obviously that is supercar territory. This is how you can bin them, and of course you can change the buckets so that they run from 2 to 2.5, 2.5 to 3 and so on. So you can customize the width of the bars using a histogram. You can also change the variable: let's say I want to do cty, the city mpg, the fuel consumption in the city. Again you can
see something similar, where most of it is between 10 and 20 and everything above 20 is a lot less common, which is most likely correlated with, you know, bigger engines being less efficient and so on. If you have other parameters, it's usually pretty straightforward to figure out whether each one is discrete or continuous and to choose the function that is suitable for it. In the meantime, I want to promote the cheat sheet a little; you can download it from the website. There you can see that for one continuous variable there are area, density, dot plot, frequency polygon, histogram and Q-Q that you can choose from, depending on the situation you are in and what kind of graph you want to visualize, basically what you want to plot. For discrete it's relatively straightforward: there is only geom_bar, which is the most straightforward and easiest way to represent what you have.

So let's go into two variables. We'll come back to this; let's go back to our example here. For two variables you have four different scenarios: each of the two variables can be continuous or discrete, and two to the power of two is four, so there are four scenarios. Let's go to the first one, which is continuous x and continuous y. We're familiar with plotting: the horizontal always refers to the x-axis and the vertical always refers to the y-axis. Let's try to plot our first one. Same thing: use ggplot, define the data, and then use aes to define the two main variables you want to plot. You can put them in directly without telling it which one is which, and it will automatically assume which one is x and which one is y; we'll see what happens in a moment. We'll just put the plot into object e. You can see here I just put e plus geom_point; there's no arrow over here. What this means is that I directly want the output
to be displayed rather than put into another object. You can do that if you just want to visualize the plot without needing to check the object. If you run this, you directly get a plot. If you don't define them, the first one, cty over here, will be defined as the x-axis while the second one will be defined as the y-axis. So this is geom_point, which is straightforward enough; it's just a scatter plot. However, the problem is that you actually have more than one data point at a single location. So ggplot also provides something called jitter, which adds random noise to each point so that they separate out and we can better see the number of dots at each location. That's just one of the functions I put in.

Of course we don't have to stop there; we can also do something called geom_label, which puts labels directly on the plot. Instead of dots, you can put in the numbers from one of the axes; in this case you can see 29 over here corresponds to the x-axis value. Of course you can also change it to hwy, which is the y-axis value, where the labels will correspond to the y-axis value. You can also use geom_text, so that it isn't constrained to a label box; the text sits right at the data point, so the point is replaced with text, basically. This is especially useful when, let's say, you don't just have numbers; you have names of countries, names of objects, people's names and so on, and you want to see directly where they are on the plot.

Once we're done with continuous x and y, which is usually the easiest one, let's go to discrete x and continuous y. In this case we'll be looking at the class of the vehicle versus the highway fuel consumption. Class has values like compact, SUV and so on, so it depends on the class of the car. Same thing for
object f: we just declare the basic variables, the basic structure of the plot, and then we add geom_boxplot. You can see the different classes of car: two-seater, compact, midsize, minivan, pickup, subcompact, SUV and so on. A box plot is very good for visualizing the average of a class (strictly speaking, the line in the box is the median) as well as the spread of the class. You can see the subcompact over here has a fairly average median, but the spread is huge, which means that within the subcompact category itself the fuel consumption can vary wildly, while something like a two-seater almost always has the same fuel consumption. You can observe the same with SUVs, where the median is relatively low but the spread is particularly high compared to a pickup truck, which is always going to consume more fuel on the highway because it's just bigger. So you can easily tell the relationship between two variables using a box plot over here.

Besides the box plot, we can also use what we call a violin plot, which is basically the same as a box plot, but instead of using a box to represent the spread and the 25%, 50% and 75% marks, the width of the violin gives you the share of the sample within a certain range, so we can see the distribution directly by looking at a violin plot.

Once we're done with one discrete and one continuous, let's go to discrete x and discrete y, which basically means that both sides are discrete. In this case we are using class, the same thing we used over here, and drive train, which is front-wheel drive, rear-wheel drive or four-wheel drive. Same thing: run this and run this. This shows us that in rear-wheel drive we have SUV, subcompact and two-seater, so a relatively small number in our dataset, but you can see there are a lot of SUVs with four-wheel drive and a few with rear-wheel drive, while there are a lot more midsize cars with front-wheel drive, and so on. This makes it easier to visualize the distribution between two discrete variables using geom_count.
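As a rough sketch of the three plots just described (box plot, violin plot and count plot), something like the following should work, assuming the mpg dataset that ships with ggplot2. The object name f follows the video; g, box_plot, violin_plot and count_plot are names made up here for illustration, not the ones used on screen.

```r
library(ggplot2)

# Discrete x (vehicle class) against continuous y (highway mpg).
f <- ggplot(mpg, aes(x = class, y = hwy))
box_plot    <- f + geom_boxplot()   # median, quartiles and outliers per class
violin_plot <- f + geom_violin()    # width shows how the sample is distributed

# Discrete x (class) against discrete y (drive train).
# geom_count sizes each dot by the number of cars at that combination.
g <- ggplot(mpg, aes(x = class, y = drv))
count_plot <- g + geom_count()

# Typing the object name (or calling print) draws the plot.
print(box_plot)
```

Storing each plot in its own object, as the video does with a, e and f, makes it easy to swap the layer (for example geom_violin for geom_boxplot) without redefining the data and aesthetics.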
Of course, you can color them and try shading or transparency with them afterwards; we'll get to that further down. You'll realize we have missed something out: what if we have a discrete y and a continuous x? You can just flip the axes and it works exactly the same as the one in the middle. You can just flip it, so that instead of a vertical box plot it becomes a horizontal box plot. You can switch between the x- and y-axes if you need to, but discrete x and continuous y is usually the better way to display it, because it's more familiar to most users.

Instead of just two dimensions, you can actually plot things in more dimensions. We'll go through that next time, but for now let's see how we can make our two-dimensional plot more interesting by adding, say, color, size and so on, so we're putting more than two dimensions onto a two-dimensional plot. Same thing: we're just using mpg with city and highway, so that's two continuous variables. We've done this already, the labeling, but now we want to add color. In this case we're coloring the different numbers in the plot by their manufacturer. You can see over here you have the purple color, which I believe is Subaru, and here the green color, which I believe is Hyundai or so. It's not very clear, because there are a lot of different manufacturers and my color vision isn't very good, but you can see there's a purple cluster in the middle, a green cluster over there and another green cluster over here. If you have better eyes, you should be able to tell directly which brand has dominated the market based on fuel consumption.

Besides just the color, let's say we want to see the cylinders. We can also color based on the cylinders, with the cylinders as the size, which is much easier to see. You can
see that all the cars with better fuel consumption are all the way up, and the cars with fewer cylinders are better in terms of fuel consumption, whereas cars with more cylinders consume more fuel, which makes sense, because cars with big engines consume more fuel. We can also color based on the manufacturer, like just now, and instead of just using color we can also change the shape of the points. This gives you a further dimension: you visualize directly the manufacturer, the number of cylinders, and the highway and city fuel consumption, so that's four dimensions in a single plot. You can still add another dimension, with the size of the points as the displacement, so you can see a lot more in the plot. But here there's a bit of overplotting; it's very hard to read, and usually we don't put too much data in one plot, because it becomes really difficult to tell what is what and where is where. So try to focus on what you want to tell, and plot a simple, straightforward plot that tells your story or your hypothesis. Do not try to squeeze everything into one.

That's about the limit for the practice data. We have talked through single-variable plotting, two-variable plotting, continuous data, discrete data, and how to add color, shape, size and so on. Now let's go to something a little more practical: how do we actually use ggplot2 in real life? Because of recent events, I'm trying to plot the data, basically the daily distribution of new cases in different countries. I found a dataset on the ECDC Europe website where they have all the newly found cases, nicely labeled with the geographical locations, their populations and so on. I'm going to try to read this; I could not find a function to read Excel, but it doesn't matter. We'll read this data through an inbuilt function called From Excel, which you
can do from here, From Excel. I'm going to go to the place where I saved my download, and you can have a look at the data to check that everything is correct before we import. OK, so we have imported the data over here as an object called covid19.

First things first: before we even sanitize the data, think about the plots you want to make and how you would make them. Here I define something very straightforward, two main graphs I want to draw. First, I want a comparison between countries using a line plot over time. Over time I want to see the growth of new cases in different countries, and for the y-axis I want the log10 value, because the spread of this pandemic grows roughly exponentially. If you plot it on a linear scale it's hard to show the real spread, and plotting it on a log scale lets you see a little more detail for countries with a lower number of cases, rather than having the plot dominated by, say, the US, Spain and so on. Second, I want the total new cases (it's not total cases, not a cumulative figure), so the total new cases in the world in a stacked column, with each country represented by its own color. I want a way to visualize how many new cases there are in the world, and I should be able to see directly which country is doing worst just by the color alone.

First things first, get the dimensions. Object covid19 not found; that was weird. OK, "geographical"; I forgot to delete a bit of stuff just now, so let's do the import again, and we need to delete this. This renames the object as covid19. I don't know why covid19 didn't work here; there was a bug just now. So that's imported, and now we have the object called covid19 over here. Let's look at the dimensions of the object: we have about eleven thousand rows and ten columns. Of course you can open up the object and have a look at the structure itself. We have the
date of the report; we have the day, month and year, which are not very useful for us; we have the cases, which is the new cases found that day, as well as the new deaths, the country, the country code, the population and so on. In this case we only want to use the date of the report, the number of cases and the country, so we mainly focus on these three. Same thing: run str and make sure which variables you're looking at are continuous and which are discrete. You can see that the first one, the report date, is a POSIXct (I don't know how people pronounce it), which is basically a date format. When you see POSIXct, it's the Unix-style date-time format that we can sort and plot very easily, so you know it's time-series data. Day is a number, month is a number, year is a number and cases is a number, so those are continuous data, and deaths too. Technically year is kind of discrete, so there's a blurred line in between, and the number of deaths is of course technically also a number. It's a fine line whether a numeric value is discrete or continuous; it depends on the context you're measuring it in. But if you see something like a character, you definitely know it's discrete, and if you see decimals you know it's not discrete, and so on.

We're going to use an external library here called dplyr. This is a very commonly used library for data manipulation and summarizing. What I do is take the object, group by country and summarize the number of cases per country, so that we can focus on the countries with the most cases. This gives us an overview of all the countries and the total number of cases they have right now. You can see
that Australia has about 6,500, Bosnia has about 1,200 and Brazil has about 36,000, which is pretty bad. This gives us the total number of cases reported in each country. It doesn't account for people who have been removed, for example those who have recovered, so these are just pure infection numbers; keep that in mind for the rest. Now let's go to the actual plotting, and because I can't turn on my fan, since it would affect the microphone, it's actually very hot in here. OK, let's get back.

The first one is the comparison between different countries using a line plot. A line plot usually has the x-axis as time; the y-axis here is the number of cases in log10. First, same thing: ggplot, define the object, the data frame, that we want to plot; x is the report date, obviously, and y is the number of cases. We don't do the scaling yet; we do the scaling later. The plotting is fairly straightforward: the object plus geom_line, so we add on a line plot, and if you want to scale the output you can just add this, and that gives us nicely scaled data of the total new cases over time. This gives an idea that in February we have a kind of big mountain over here; this is because of the outbreak in China, which slowly recovered before the virus was discovered in the rest of the world. The numbers have been growing ever since the end of February over here, growing strongly and then stabilizing a little over here, but beware that this is a log plot, so if we look at absolute values it's still increasing massively in a lot of countries. So now we have done the first chart; that was very fast.

For the second, we want the total cases in the world in a stacked column plot with each country in its own color. I believe everyone is already quite advanced, thirty-five minutes into the video right now.
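The line plot with a log-scaled y-axis described above can be sketched roughly as below. The ECDC file itself is not reproduced here, so toy_cases is a made-up stand-in with invented numbers, and the column names date_rep, country and cases, as well as the object name c_plot, are assumptions for illustration rather than the real names in the video.

```r
library(ggplot2)

# Toy stand-in for the imported case data.
# All values are invented purely to make the example runnable.
toy_cases <- data.frame(
  date_rep = rep(as.Date("2020-03-01") + 0:4, times = 2),
  country  = rep(c("A", "B"), each = 5),
  cases    = c(10, 20, 40, 80, 160, 1, 3, 9, 27, 81)
)

# One line per country: time on x, new cases on y.
c_plot <- ggplot(toy_cases, aes(x = date_rep, y = cases, colour = country)) +
  geom_line() +
  scale_y_log10()   # log10 y-axis so smaller countries stay visible

print(c_plot)
```

Note that scale_y_log10 drops zero or negative values with a warning, which may matter for days on which a country reports no new cases.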
So we can reuse cv, because we are using the same data and the same aesthetics over here. What we do is plot a column chart, which is what we talked about before this. If you look at the cheat sheet here, a column chart is actually for discrete x and continuous y, but it can be used in this situation as well. Actually, I need to explain a little bit about geom_line before we go forward. With geom_line, if you look here, it should actually be a smooth line with nothing below it over here. But the problem is that when the data is reported there will be things like weekends, and there will be missing data points. So what happens to this line is that it just goes up and down, up and down, because certain dates don't have any data point. Maybe we can try to remove that, but for visualization purposes and to understand the data itself, I don't think it's necessary here. If you do want to, just run a filter based on this with a cutoff and you should be fine. OK, so let's go back to our second one. We want a column chart, and we're trying to color the column chart based on the countries that we have, so each one would appear as a different country. Let's try it one by one and go through the actual process of building it. In the first one, we just want to build a column chart with the color as the country code and see what happens. Because it's quite big data and there are a lot of countries, it will take some time to generate, and you won't really be able to read it. So yes, the problem here is that what you see is just a legend and nothing else, because the size of the plot is so small that it gets dominated by the legend. You can resize it a little bit; it should reload, or it might not, because this is big data, it is a big plot, and it will
need some time to work it out. You can see there is a plot, but the legend is just too big and I don't care about it, so I need to get rid of it. Let's get rid of it in the second one before we go to scaling, shall we? To get rid of it, we have to run something called theme. theme is the theme of the plot itself, and we set legend.position to "none". This should get rid of the legend. The same thing applies: because it's a big data set, it takes a while to run. Of course, when we are not plotting the legend it's a lot faster. We can see the same kind of trend that we saw in the line plot just now; however, now you can actually see the colored bits that belong to different countries. I believe this is obviously China over here in February, while the biggest color over here is either the US or Spain. OK, so let's just run the last one. No, actually not the last one, there are a few more. In this one the number is scaled in log10, and it should give us a little bit more linear growth rather than exponential growth like what we had earlier. OK, so there is one row missing something, which is a bit of a problem, but it doesn't matter. It should appear later on. Yeah, I forgot to put this in. OK, so let's do this properly before we go to the next step. The problem is that because there are so many countries, they're hard to visualize, and a lot of countries with very few cases don't really affect our result, so we do not want to see them. What we usually do in data science and data visualization is remove those countries that don't really have enough cases for us to worry about. When I look at this graph, I just want to look at the countries with a very high number of infections. In this case you can just click here and sort by cases. Of course, this is small data so you can do that; with big data you can't really do it. So let's say we just want to select the top
maybe 10 or 20 of them. Let's set a cutoff point. The cutoff point I set up previously is 30,000, so we are just going to take all these countries and plot them on the same plot over here, and we should be able to put the legend beside it and still see everything clearly. So let's try that. We are going to use the dplyr way of filtering the data: we have the object, we group by the country first, and then we run a filter where a country is only included if its total number of cases is larger than 30,000. We send that whole filtered object into a new object called covid19s, so it's a trimmed-down version of the whole thing. We run the same thing into another object called cv2, so that we don't confuse it with the previous object. We run the check for cv2, and then we run another one, the same column chart. Just now we couldn't even tell which color belongs to which country. Now we can tell that this belongs to China, CH, the biggest chunk here belongs to the USA, and the little bit of green here is maybe SP, which is Spain. So that's basically that. Let's run our last one before we close. With a log curve, it's much easier to see countries with smaller numbers rather than having them dominated by just the US and Spain alone. Now we can see that Brazil is here, Spain is here, France is here, Great Britain is here, Iran is here, Italy is here, the Netherlands is here, the USA is here, and Turkey is over here. So very easily we can see the spread and growth in each country. So let's go for a recap. Today we talked about ggplot2. We started from the beginning, where we explained the basic structure: how to install ggplot2 in your environment, how to include it in your environment, and how to start with it. Of course, the first thing that you do in ggplot is to run a function like
this, where you tell ggplot what the data is and what the basic aesthetics are, so it can be the x-axis and y-axis, and maybe other axes if you want to include them. Then, how do you explore the data structure? There are many ways to do it. For myself, I just look at the dimensions, which is how many columns and how many rows there are and whether you need to run any filtering on it, as well as the structure of the data, so we can see which one is continuous, which one is numeric, and what we might need to convert to factors, so that it's easier for downstream operations. The aesthetic definition here refers only to the x and y axes and how you tell ggplot which variable belongs to which axis, so that is the aesthetic of the plot itself. Of course, there are different aesthetics when you go into different commands. I think we had one here where you can see the aesthetic of the label, so the aesthetic of the label is a different thing compared to the overall one from just now. You can put a different variable in the aesthetic of a geom if you have not included it in the original ggplot declaration above, so you can always try a different one and see if it works. Then we went into the plotting. We tried the primitives, which are the horizontal line and the vertical line. We tried single-variable plots, which are the histogram, the bar chart, and so on. We also tried two-variable plots, which are text, point, jitter, box plot, and various others such as geom_count. Lastly, we used the knowledge that we had just gained to plot the recent COVID-19 data that I downloaded from CDC. I think that concludes today's ggplot2 tutorial. For further information, there is something called gganimate, which takes ggplot to the next level by stacking plots to create an animation. That will be part 2 of this tutorial. For now, I hope you enjoyed ggplot2. We will see you
in the next video. Bye!
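The filtering and stacked column steps from the COVID-19 part of the video can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code shown on screen: the `covid` data frame, its column names (`dateRep`, `cases`, `geoId`), and the object name `covid_top` are hypothetical, and mapping the country code to `fill` (rather than `colour`) is a choice made here so the stacked segments are visibly colored.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the downloaded case data.
# Totals per country: AA = 36,000, BB = 600, CC = 35,000.
covid <- data.frame(
  dateRep = rep(seq(as.Date("2020-03-01"), by = "day", length.out = 3), times = 3),
  cases   = c(12000, 15000, 9000, 200, 300, 100, 20000, 8000, 7000),
  geoId   = rep(c("AA", "BB", "CC"), each = 3)
)

# Keep only the countries whose total case count exceeds the 30,000 cutoff.
covid_top <- covid %>%
  group_by(geoId) %>%
  filter(sum(cases) > 30000) %>%
  ungroup()

# New base object for the trimmed-down data, so the earlier one is not overwritten.
cv2 <- ggplot(covid_top, aes(x = dateRep, y = cases))

# Stacked column chart: geom_col stacks by default, one fill color per country.
p_col <- cv2 + geom_col(aes(fill = geoId))

# Variant without the legend, useful when there are too many countries to list.
p_no_legend <- p_col + theme(legend.position = "none")
```

Because `filter()` runs after `group_by()`, the condition `sum(cases) > 30000` is evaluated once per country, so whole countries are kept or dropped rather than individual rows.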
Info
Channel: Liquid Brain
Views: 11,215
Rating: 4.956284 out of 5
Keywords: Rstudio, Science news, Rstudio Tutorial, Bioinformatic, Computing, Research, statistic
Id: TgyWeKoK1HA
Length: 43min 37sec (2617 seconds)
Published: Wed Apr 22 2020