TidyTuesday: Analyzing CO2 Emissions in R using the Tidyverse

Video Statistics and Information

Captions
Hey, my name is Andrew Couch. I'm a business analytics student at the University of Iowa, and today I'm going to go over the TidyTuesday series of data sets. Basically, every week TidyTuesday releases a data set for you to analyze and make charts from, so I thought it'd be interesting to do a video of my process when I'm looking at a data set and how I do a basic exploratory data analysis.

We'll go to the most recent data set, which is the food carbon footprint one. We'll look at that, and they give you a way to load in the data, which is this little one-liner over here. We'll exit out of there and pull up RStudio. I'll do all this analysis in R Markdown, so I'll call it "TidyTuesday CO2 Emissions", and I'll do it from a template, which is a GitHub template. All right, sweet. Remove that junk and add a new chunk; creating chunks is Ctrl+Alt+I. We'll first load in the tidyverse. Okay, sweet. And then we'll load in the data set.

What's kind of cool about the TidyTuesday project is that they give you a wide range of data sets. For this one, we'll just look at it. It appears that there's a country, a category for each country, its consumption, and its CO2 emission. Let's look at how many countries there are, so we'll do a count on country. Okay, so we have about 130 rows of countries, and for each country there are 11 other things that go with it, so I'm going to guess that's the food category. All right, go through that. So each country has its food categories and, I guess, its respective CO2 emission and consumption of that product. It gives you the consumption of pork with its associated CO2 emission. All right, cool.

One thing that you should probably always do, since it's easy, is run the summary function. It doesn't tell us a lot, but it does give us some summary statistics for consumption and CO2 emission. So I'm going to add a comment: just viewing the basic
summary of the data set. What is it, 130 countries, right? Many countries and eleven food categories.

All right, so let's do a visualization of that. We'll do a gather, which will let us gather the consumption and CO2 emission. I just want a holistic approach, so we won't do it by country. We'll say that the key is the feature, with its value, and we'll ignore country, and minus food category as well. So now we have it in a more tidy format, and we'll plot this out. Let's see here. Yeah, we'll just do a box plot of it, so color equals feature, box plot, and I'm going to do a facet wrap so we can see each box plot individually, just so nothing crazy happens.

Okay, so right here it's kind of all over the place. It's pretty bad, and you can't really understand this chart right now. That's pretty normal when you're dealing with countries, because all countries have different population sizes and stuff like that, so normally you're going to have to do a log10 transformation. Okay, so right there, pretty easy to look at. Awesome.

One thing that's kind of interesting is that it seems proportional, where CO2 emission is basically equal to the amount of consumption, which kind of makes sense. Additionally, if you see right here, it's 1e+03; that's scientific notation, so we can remove that by doing the scipen thing. Okay, so there are slightly different scales, right? So that makes it way easier. And in fact, if we wanted to fix the scales, we can see how the distributions are a little bit different. Cool. So: distributions of CO2 and consumption are not as proportional as I imagined.

And what's cool is we can do a little bit more detailed plot. One cool thing that I like to add is a geom_jitter where the alpha is like 0.4. Let's add that. Okay, so that way we can kind
of see the basic density distribution of it, which tells the same story as the box plot, but it's a little bit more interesting, I think, because you can see the outliers. With the first graph, you kind of just think, oh, it's just a few of them, when in reality it's actually a decent amount; it's not just a single-digit number of outliers. Okay, that's kind of cool.

Actually, I have one question on it. Let's go back to TidyTuesday; I actually should probably have looked at the data dictionary first. So what are the actual consumption and emission? Consumption is kilograms per person per year, so how many kilograms of a food a person eats every year, and emission is how many kilograms of CO2 are produced per year. Okay, that's kind of cool. So one thing I see is similar scales, right? We can say kilograms per person, and then kilograms of CO2 produced. I'm going to write that down: consumption equals kilograms eaten by each person every year, and CO2 emission equals kilograms of CO2 produced by each person every year. Solid.

So one thing I think is, we can reformat this into almost an efficiency question. We can just do some basic algebra and remove the year. I want to see which countries actually have an efficient consumption process: for every kilogram of CO2 made, how many kilograms of food do they eat? And you can frame it differently: every kilogram of food eaten produces how many kilograms of CO2? That's probably easier to understand. These two questions seem like the same question; however, they're actually a little bit different. So, kilograms of food eaten produces how many kilograms of CO2. One thing that would be interesting for this is, I bet
countries that are richer are going to be eating more food, which will lead to more inefficient production of CO2. However, for how efficient each food is, it's probably produced by more agrarian people, but they should be pretty similar. So let's do something that's a little more understandable: every kilogram of food eaten produces how many kilograms of CO2, just to keep it normal. We'll do that.

Okay, so we'll copy that bad boy, make a new chunk, and do a mutate, and we'll call it co2 per food, which is just CO2 emission divided by consumption. And one thing I'd be interested in is, I bet food category plays a huge role in that, right? So we'll group by food category and summarize it, and say average co2 per food equals the mean of co2 per food. Okay, NA. Do I have to do a drop_na? I don't know yet what's going on over there. Okay, so there are some NAs. That's kind of weird. What's going on? Maybe, let's see here, filter where consumption equals zero. Oh, okay, there we go. That makes sense. So obviously in India they're not going to eat pork, I guess because of religious affiliation. So what we're going to have to do is filter where consumption does not equal zero.

How about that. So that way we'll do x equals food category, y equals average co2 per food, and add a geom_col. See what that looks like. Huh, it's kind of weird, so I'll do a reorder by the average. Okay. So, just some basic formatting: I'm a fan of flipping box plots or bar charts, and then since we don't really need the legend, we can just remove that, since we already have the labels. Sweet.

Okay, so this kind of makes sense, right? But one thing that I think is kind of weird is that usually when people
say how inefficient food is, we always talk about beef, especially in America, when in reality I guess lamb and goat is a pretty inefficient food group. I guess the reason why we don't really talk about lamb and goat is that no one really eats lamb and goat in America, especially where I'm at in the Midwest. But it's pretty interesting how there's a huge discrepancy between lamb and goat and beef and the remainder, right? On a CO2 basis it's like tenfold. It's insane. It kind of makes you re-evaluate how you should be eating. Additionally, you can see how rice is a little bit less efficient than poultry and eggs, so that's kind of weird. And rice is more inefficient than wheat. Kind of cool.

Okay, so let's write down what we observed. Lamb and goat is inefficient and should probably be avoided, same with beef, but I'm going to say everyone knows this. Then we'll say: also, rice is more inefficient than fish and wheat. That's kind of weird. Okay, cool.

So given these categories, let's look at the distributions of consumption. I want to see the distributions of consumption across all these food categories, and also CO2 emissions, and maybe see if there's some kind of relationship there. So let's do that. We want to select everything besides country and do a basic gather. Actually, let's do this other one. Normally I use the gather and spread functions, but let's do a pivot_wider, and I'm going to try consumption and CO2 emission, and I'll do values from consumption and CO2 emission. Let's see how that works. Oh god, it doesn't work. Oh, it's because I needed pivot_longer. Names to... okay, I don't really know. But yeah, pivot_wider and pivot_longer are definitely pretty
useful if you actually know what you're doing, but I do not. Okay, there we go. Jesus.

All right, and we'll do x equals food category, y equals value, and color equals food category. Box plot. Let's do a facet wrap by feature with fixed scales. I already know we don't want to see a legend, so legend position equals none. Again, we still have to do scale_y_log10. Okay, so we see some cool stuff. I'll do another coord_flip too, because why not. Cool. It does look pretty similar.

Let's look at each distribution one by one, and we have to note that our axes are fixed. So let's look at the ones we started with: lamb and goat. Right here we can see the median value is between 0.1 and 10 on the log scale, so I don't know exactly what that value is yet, but it's easy to compare how this median is much higher than the other median. It really shows how, even though it's not being consumed as much, it still emits a lot of CO2. Same with beef: the median consumption is about 10 kilograms per person, but its emission is probably like 400 to 500 kilograms of CO2 per person. Cool.

I think one thing we could look at is an efficiency plot. Let's put consumption and CO2 emission on the x- and y-axes. I would say CO2 is the resulting variable, so we'll put CO2 emission on the y-axis and consumption on the x-axis. Okay, that makes sense. And then the food category will still be the color, and maybe this will tell an easier story to understand, or a better story. Oh my god, that's why: it's aes, color equals food category. We'll do a
geom_point; might have to do an alpha, but I'm not sure. And we'll do a geom_abline, and we'll say the slope is 1. That's like total efficiency, which is completely impractical, just from a physical standpoint: if you consume something, there's going to be a lot of wasted energy too. But this assumes that whatever goes in always comes out, like a perfectly efficient thing. Oops.

Okay, so we can see something kind of cool. Again, with real-world data, most data on a large scale is, I guess, normal, but really it's mostly log-normal. So you can see how that is pretty cool, pretty interesting. We can see how it's pretty straight, and it kind of makes sense from an efficiency standpoint that each food category, no matter how much is consumed, will emit at the same rate of CO2, which makes complete sense. It'd honestly be weird if we saw anything crazy, but I think even then it's pretty cool.

However, I'm actually going to do a group by food category, because there's a lot of junk, right? We don't need each country's consumption, so we'll just do the averages. What's going on here? Object consumption... oops, that's why.

Okay, and this tells a fantastic story. We can see these residuals are completely off: we have beef right here, which is super high, and eggs, which are right on the line. They're pretty efficient. That's kind of a cool thing. In fact, I could even do a basic residual thing too if I really wanted to. Why not. We'll add some residuals, and that residual is going to be
CO2 emission minus consumption, right? Hopefully I don't get that confused. Yeah, there we go. Then we'll pipe that bad boy into ggplot, and with the points right there, we'll also do a geom_segment: xend equals consumption, yend equals consumption, and x equals consumption and y equals CO2 emission. That's pretty interesting. All right, there we go. We didn't actually need to compute the residuals, but if we wanted to do a normal residual plot, we could do that too.

You can see right here the difference is staggering, especially when you look at beef and lamb right there; they're just huge. I guess CO2 emission isn't a perfect measure of efficiency, but it's still something like proportional waste on a per-kilogram level, because you can see how soybeans and wheat are below the line. So that's kind of cool.

Okay, now what else do I want to ask? Let's find the top consumers of each food category. This is something where you probably don't need to create a plot, because when you're looking at the top end and the bottom end, it serves better as a table, I would say. So let's do that. We'll do a top_n on consumption, n equals five. Milk and cheese... actually, we'll do a group by food category first. That's pretty funny how these countries consume a ton of cheese, I guess. Cool. Arrange by consumption. Oops. I guess we should arrange by food category too. Okay, so beef: Argentina and Brazil. Those make sense, right? I want to go back to the milk and cheese thing. Right there: Finland consumes a lot of cheese, I guess. I guess
that makes... does that make sense? Because I would think Switzerland, France, and so on would consume a lot of cheese. I guess it includes milk, so that's a little bit different. And then pork: Hong Kong, China. Okay, that kind of makes sense. Austria. Poultry: Israel. Okay.

So we have the top five consumers of each food category. What I may be interested in is, given these top five consumers of each food, how many countries are on it twice or multiple times? How many countries are top consumers of multiple products, on the top end of these other food categories? So we'll just do a count on country, and we have to set sort equals TRUE. Okay, so now we have these. This is something I might want to put on a plot, but I'm not sure yet. Most of these countries appear once. However, there are some that are on it three times or two times. There's only one that's on three times, and it's Hong Kong, China. Honestly, it's kind of weird how I don't see America, the United States, in it. Maybe the United States doesn't consume as much food, or maybe they consume a lot of food but in different categories, where they're spread too thin. But okay, we'll write it down: countries that appear more than once in the top five consumers by food category. And I'll add a filter, n does not equal zero. Oops, one. Cool.

So one thing with TidyTuesday is that we have our foods, but we don't actually have categories of foods, right? If we look at it, we have animal and non-animal products. Let's compare
that, right? That's kind of a cool idea. Obviously, I think the common position is that non-animal products are definitely more efficient, or less harmful to the environment, than animal products, but I'm curious how much better they are for the environment. Let's do that. I'm pretty familiar with foods, so I think I can just manually do it. Let's see what we're working with: we'll select food category and do unique. Okay. I'm trying to figure out the best way to do it. I think the smartest way is an inclusion, an %in% statement, so let's find the ones in the minority class, and I believe that would be non-animal products. So we'll do wheat and wheat products, rice, soybeans, and nuts including peanut butter. Let's do a mutate and call it vegan: vegan equals if_else, food category is in this list, then "non-animal product", else "animal product". Does that make sense? Okay.

I think one way to make sure nothing's too crazy is to do a count of food category by vegan, just to make sure nothing crazy is going on. We can sort through it: rice, soybeans... there we go, we got it. Then I'm just going to select what we need: consumption, CO2 emission, and vegan. Okay, solid.

Honestly, one thing that I totally forgot was to do a correlation between consumption and CO2 emission. So we have a correlation. Okay, so it's actually not as high as we would think it is. It's not really what we care about, but okay. Maybe we should do a t-test, right? So let's do a group... actually, we'll do a gather first: key equals type, value equals value,
minus vegan. Boom. Then we'll do a group by type, and a pivot_wider, names from type, values from... there we go. Actually, we don't need that, since we're just doing a t-test. Let's do a t-test between animal and non-animal for consumption and CO2 emission. I'll pull in the broom package, though. Dope.

I'll use the do function, which is kind of the mutate function for broom. I'm going to say value tilde vegan, data equals dot. So we just ran two t-tests... what is going on here? Oh, that's why: it's not a factor. Okay, vegan equals as.factor of vegan. Hopefully this will work now. Group by type, do, t.test of value tilde vegan, data equals dot, then tidy. There we go. Jesus. So now we have a tidied t-test. I totally forgot the tidy part; that's ridiculous.

So what are we looking at? CO2 emission is statistically significant. However, consumption is not, so people may be consuming roughly the same amount of vegan and non-vegan products. That's pretty cool. And just to do a basic plot for it, I guess that's a geom_col, plus a geom_hline with y-intercept equals 0.05. So you see how CO2 emission is very statistically significant. What I'll do is a scale log10. Yeah, so it's very, very significant. That's not good. Okay.

So what else should we ask? Let's look at the countries again, so select country, unique. We've got a decent number of countries, 130. I'm curious what I'd want to do with this, though. Let's do most popular consumption, so we'll do a top_n;
actually, we'll group by country first, then top_n on consumption, n equals one. Okay, wow. Let's do the mutate from our previous thing and see if most of these countries are eating a lot of vegan, or non-animal, foods. Then we'll do a count on vegan. Okay, 90 and 40, so it's not too bad. In fact, let's count food category to find the most popular food categories too, and we'll do a sort. So milk and cheese is the most popular; that's the animal product. One weird thing is fish: one country's most consumed product is fish. I'm going to guess that's an Asian country.

But if we look at milk, rice, and wheat products against our previous chart: milk is a little inefficient, but milk and rice don't really contribute a lot to CO2, at least, and wheat products are pretty good. So if I was someone trying to reduce CO2 emissions through food consumption habits, I would probably try to promote eating more wheat, more grains basically. If you think about it, Asian food is mostly rice and fish, not really milk or wheat, so maybe an Asian diet would be kind of nice too. It's kind of interesting, especially if you think about how vegans really do have a lot of options too. Cool.

Okay, let's look at this again. Did we look at this already? Yeah, we did look at consumption for all these things, but maybe we should get a bigger picture than just box plots. Maybe we should do some density plots. Actually, we'll do more vegan stuff; that's kind of cool, right? And
then we'll do minus vegan. All right, so we have our thing, and actually we don't need country. So let's look at ggplot: aes, x equals value, and color equals food category. Actually, I'm trying to think of how I would do this. What I'm thinking about is maybe having a distribution chart of each food category on the y-axis... oh, I guess that doesn't make any sense. No, let's just do vegan distributions. Yeah, we'll do that. So ggplot, aes, x equals value, color equals vegan, geom_density. Maybe fill would help, so fill too. Oh, that's why: scale log10. So that's pretty interesting. In fact, let's add a facet_wrap by feature, with scales... let's just do that, and, what is it, nrow equals 2. So there we go.

What we see here is that for CO2 emission, the non-animal products' mean, or whatever, is definitely to the left of animal products. Animal products definitely produce more CO2. And on a consumption basis, we actually do consume a decent amount of non-animal products; however, the majority of countries still consume a lot of animal products. It's something where we would probably want it to be the inverse, where some countries eat a lot of animal products, but that would be the minority, and the majority of countries would be eating non-animal products. So if you just invert that, that's what we'd probably want to see.

Okay, I think that's kind of it for what I want to do right now with this data set. I think this is, like, Jesus,
probably a good hour-long video, like a 50-minute video, but this is basically just a preliminary analysis. Honestly, this data set just isn't that large. There are only two numerical variables and some categorical ones, so it's a pretty simple data set that I think you could dip your toes into. So I'm going to save this thing under TidyTuesday, as "TidyTuesday CO2 Emissions". When I save it, I'm going to knit it to a GitHub document to post to my repo. But yeah, that's basically it. I think that's all I really need to do.

I'm probably going to post more videos analyzing data like this. This is my first time doing this, and I might edit it, but hopefully you learned something, and I'll probably see you next week.
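
The main wrangling steps narrated in the captions can be sketched in R roughly as follows. This is a minimal sketch, not the code typed in the video: the toy data, its values, and the column names (`co2_emission`, `group`) are assumptions made for illustration (the published TidyTuesday file may spell its columns differently), `slice_max()` stands in for the `top_n()` call mentioned in the video, and base R's `t.test()` is used in place of the broom workflow.

```r
library(dplyr)

# Hypothetical toy data shaped like the food consumption table.
# All values are made up for illustration only.
food <- tibble::tribble(
  ~country, ~food_category, ~consumption, ~co2_emission,
  "A",      "Beef",          10,           300,
  "A",      "Pork",           0,             0,
  "A",      "Wheat",         80,            16,
  "A",      "Rice",           5,             6,
  "B",      "Beef",          20,           620,
  "B",      "Pork",          30,           105,
  "B",      "Wheat",         50,            10,
  "B",      "Rice",          40,            52,
  "C",      "Beef",           5,           150,
  "C",      "Pork",          20,            72,
  "C",      "Wheat",        100,            19,
  "C",      "Rice",          90,           115
)

# 1. Kilograms of CO2 per kilogram of food eaten, averaged by category.
#    Rows with zero consumption are dropped first, because 0 / 0 is NaN
#    and would turn the category mean into a missing value.
efficiency <- food %>%
  filter(consumption != 0) %>%
  mutate(co2_per_food = co2_emission / consumption) %>%
  group_by(food_category) %>%
  summarise(avg_co2_per_food = mean(co2_per_food)) %>%
  arrange(desc(avg_co2_per_food))

# 2. Top consumer of each food category (the video keeps the top five),
#    then the countries that show up in more than one category.
top_consumers <- food %>%
  group_by(food_category) %>%
  slice_max(consumption, n = 1) %>%
  ungroup()

repeat_countries <- top_consumers %>%
  count(country, sort = TRUE) %>%
  filter(n > 1)

# 3. Flag animal vs. non-animal products with %in%, then compare
#    CO2 emissions between the two groups with a Welch t-test.
non_animal <- c("Wheat", "Rice")  # assumed grouping for the toy categories

flagged <- food %>%
  mutate(group = if_else(food_category %in% non_animal,
                         "Non-animal product", "Animal product"))

co2_test <- t.test(co2_emission ~ group, data = flagged)

print(efficiency)
print(repeat_countries)
print(co2_test$p.value)
```

With the real data, the same pipeline would run once `food` is replaced by the loaded data set and `non_animal` lists the categories named in the video; filtering out zero consumption is what removes the NA averages the speaker runs into.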
Info
Channel: Andrew Couch
Views: 1,202
Rating: 4.9 out of 5
Keywords: Rstudio, Tidyverse, R Programming, Data Science, Analytics, University of Iowa, Statistics, ggplot2, ggplot, data wrangling
Id: VKCPYet9qLM
Length: 52min 5sec (3125 seconds)
Published: Mon Feb 17 2020