Intro to Data Analysis / Visualization with Python, Matplotlib and Pandas | Matplotlib Tutorial

Video Statistics and Information

Captions
Hey everyone, this is my introduction to data analysis / data visualization with Python. In this video, I'm gonna cover why you might want to use data visualization, and why you might want to use Python and Matplotlib for it. Then we're gonna go over some simple examples of how to actually use these tools, and then, using these tools, we're gonna do sort of a real analysis with a real data set at the end. In this video I'm only gonna cover line charts, just to keep everything simple. I'm also gonna put a more detailed version of this outline in the comment section below, so you don't have to watch the whole thing if you don't want to.

Okay, so why should you use data visualization in the first place? Well, data visualization is actually often the first step of any type of data analysis work, whether it's simple data analysis, statistical analysis, or machine learning analysis. The reason is that visualizing data often gives you an intuitive understanding of the data, and it often helps you see patterns that are otherwise hard to see. We're gonna see an example of that later.

Okay, and why should you use Python for this? Well, Python is not the only good choice, but I would say it's one of the best. The reason is, first of all, it's a general-purpose language that's pretty easy to use and learn, and it also has many libraries for scientific computing and data science, including Matplotlib. And if you work at a company, your company might already use Python for something else. If that's the case, that's really nice, because then you and your team are not gonna have to learn a totally new language to do some data analysis.

And why are we using Matplotlib for this? Well, Matplotlib is not the only good visualization library for Python, but it's still one of the most popular choices, and there are actually other libraries that are based on Matplotlib. So if you learn Matplotlib, it's gonna help you learn these other libraries, for example this one called Seaborn, later on if you want to. Matplotlib is also pretty easy to get started with.

Anyway, let's dive into a demo. For this demo we're gonna use something called Jupyter Notebook and a few other Python libraries, and we're gonna use Anaconda to install them. If you're not familiar with Jupyter Notebook and Anaconda, I have an explanation about them in my Python tutorial video, so I'm gonna leave a link to that in the description.

Anyway, to install Anaconda, just search for "Anaconda Python" or go directly to anaconda.org, and there find the button that says "Download Anaconda" and select whatever OS you are using. I'm using Mac here. Click Download under the Python 3.x version instead of Python 2.x, because we're gonna use Python 3 here. Select where you want to download this package and save it. Once it's downloaded, open up the package that you just downloaded, and then just click Continue, Continue, Continue, Continue, Agree. Choose "Install for me only" or install on a specific disk; it doesn't matter which one. Then Continue, and click Install. This process is gonna take a while. After some waiting, you might see a prompt to install Microsoft VS Code. We don't need that, so let's just continue here, and then close.

To launch Jupyter Notebook, you can do it through this thing called Anaconda Navigator. Just launch it like you launch any other application, dismiss whatever comes up, and then click Launch in the Jupyter Notebook section. You should then see a browser window show up with the Jupyter Notebook interface.

Now, if you want to follow this tutorial, the first thing you should do is create a new folder, let's say on the desktop, and let's call this one data visualization. We're gonna put all our data and our notebook file here. So let's first download our data. To do that, just go to csdojo.io/data and download these two files, sample_data.csv and countries.csv, and then put these CSV files in the folder that you just created, data visualization.

After that, go back to the Jupyter Notebook interface, and you can just navigate to Desktop and then to the folder that we just created, data visualization. To create a new Jupyter Notebook file here, just find the New button on the right and click Python 3. Right now this notebook file has "Untitled" as the title, so let's change it to "Data Visualization with Python". Click Rename, and you have a notebook called Data Visualization with Python. You can check it just by going to the desktop and then to the folder that you just created, and you should see that there's a file called "Data Visualization with Python.ipynb". It's really important that this notebook file is in the same folder as the data that you just downloaded, countries.csv and the other one.

Once everything is set up, just write in the first cell: import pandas as pd. This means we want to import a module called pandas as pd, or in other words we want to give it sort of a nickname, and that's going to be pd. You can run the cell by clicking this button, and now pandas is imported as pd. Here we're gonna use pandas for importing and using some data from our CSV files. We need to import another module here, so for that just write: from matplotlib import pyplot as plt. This says: from the matplotlib package, import the pyplot module, and then call it plt. Let's run this cell, and now pyplot is imported. We're gonna use pyplot from Matplotlib for making our charts.

So here, let's first take a look at a really simple example of how to use pyplot. I'm going to write x = [1, 2, 3], which is a list of three elements, and y = [1, 4, 9]. To plot this set of data, you can just write plt.plot(x, y). This plots x on the x-axis and y on the y-axis. Then you can show this graph by writing plt.show(). When you run the cell, you should see a graph like this. You see that the values of x are 1, 2 and 3, as expected, and the values of y are 1, 4 and 9.

If you want to add a title to this graph, you can do so by writing plt.title("test plot") right after the plot statement and before the show statement. Then you can add an x label and a y label as well by writing plt.xlabel, let's call the x label "x", and plt.ylabel, let's call the y label "y" here. When you run this cell, you see that there's a title called "test plot", the x label called "x", and the y label called "y".

Okay, what if you wanted to plot multiple lines here? Well, to do that, let's create another list. Let's call it z, and this one is gonna have 10, 5 and 0 inside. To plot x and z on top of x and y, you can just write plt.plot(x, z) right after plt.plot(x, y). Then let's fix the y label here to "y and z". When you run this cell, you should see these two lines. The blue line represents x and y, and the orange line represents x and z. So plt.plot(x, z) plots x on the x-axis and z on the y-axis. But right now it's kind of hard to tell which line represents which data, so we can fix that by adding a legend statement. Let's add that after the ylabel statement by writing plt.legend(["this is y", "this is z"]). Note here that this legend function takes a list as an argument. When you run this cell, you should see this legend that says the blue line is "this is y" and the orange line is "this is z".

Okay, that's the basics of pyplot. Now let's see how to load data from a CSV file. For that, you can just write sample_data = pd.read_csv (by the way, I just pressed Tab here to do autocomplete), and then in parentheses "sample_data.csv". Now, before you run this cell, make sure that the notebook file, Data Visualization with Python.ipynb, is in the same folder as sample_data.csv. When you run this cell, this data, sample_data.csv, is loaded by the pandas module, which we call pd, and then it's assigned to this variable called sample_data.

You can check what's inside this variable, sample_data, just by writing sample_data in a new cell, and when you run this cell you should see something like this. As you can see, this data has three columns, column_a, column_b and column_c, and five rows, and you see a bunch of values inside this table. If you want to check that this set of data is exactly the same as the original data, you can do so by opening up the original data file, sample_data.csv, with Excel or any other spreadsheet application. When you open it, you should see exactly the same data: column_a, column_b and column_c, with five rows and a bunch of values. The only difference you might see is that in Jupyter Notebook you might see these numbers 0, 1, 2, 3 and 4, and these are just indices for the rows.

You can check what type this variable is by writing type(sample_data). When you run this cell, it says that this is pandas.core.frame.DataFrame. So this is a DataFrame type that's defined by the pandas module, and the DataFrame type is used to contain a table-like piece of information, just like this one.

Okay, now what if you wanted to plot data in this data frame, for example the values of column_a on the x-axis and column_c on the y-axis? Well, to do that you need to be able to retrieve a specific column, and you can do that by writing sample_data.column_c. When you run this cell, you see that column_c is retrieved. It has the values 10, 8, 6, 4 and 2, and the numbers you see on the left are just indices, 0, 1, 2, 3 and 4, just like before. You can check what type this is by writing type(sample_data.column_c), and when you run the cell you see that this is pandas.core.series.Series. So this is basically a Series type that's defined by the pandas module, and it's a type that's used to store a series of values, for example these values 10, 8, 6, 4 and 2. Now, what if we wanted to retrieve a specific value out of this series?
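The pyplot basics from the walkthrough (two lines on one chart, a title, axis labels and a legend) can be collected into a single cell. This is a minimal sketch that follows the statements read out in the video; the contents of z are taken to be 10, 5 and 0, and the ax variable is an addition that only exists so the result can be inspected afterwards.

```python
from matplotlib import pyplot as plt

# Data from the walkthrough: x is shared, y and z are the two lines.
x = [1, 2, 3]
y = [1, 4, 9]
z = [10, 5, 0]  # assumed values for the second line

plt.plot(x, y)  # first line: x on the x-axis, y on the y-axis
plt.plot(x, z)  # second line, drawn on the same axes
plt.title("test plot")
plt.xlabel("x")
plt.ylabel("y and z")
# legend() takes a list; the labels follow the order of the plot() calls.
plt.legend(["this is y", "this is z"])

ax = plt.gca()  # handle to the current axes (not in the video, used for inspection)
plt.show()
```

Running this in a notebook cell should draw both lines with the legend; outside a notebook, plt.show() opens a window or does nothing visible, depending on the backend.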
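Loading the CSV file and checking the DataFrame and Series types can be sketched as follows. So that the snippet runs without the downloaded file, an inline string stands in for sample_data.csv; its layout is an assumption based on the column names and values mentioned in the video, and the real file may differ.

```python
import io

import pandas as pd

# Stand-in for sample_data.csv (assumed layout; values as read out in the video).
csv_text = """column_a,column_b,column_c
1,1,10
2,4,8
3,9,6
4,16,4
5,25,2
"""

# In the notebook this would be: sample_data = pd.read_csv("sample_data.csv")
sample_data = pd.read_csv(io.StringIO(csv_text))

print(sample_data)                 # the whole table, with row indices 0 to 4
print(type(sample_data))           # the DataFrame type
print(sample_data.column_c)        # a single column
print(type(sample_data.column_c))  # the Series type
```

Attribute access such as sample_data.column_c works because the column names are valid Python identifiers; sample_data["column_c"] is the equivalent bracket form.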
If you want to retrieve, for example, the second value here, 8, you can do so by writing sample_data.column_c.iloc[1], and this retrieves the second value of the series, 8. If you want to retrieve the third value, 6, you can write iloc[2], and that gets the third value. And if you want to retrieve the first value, you can write iloc[0], and this should give us 10, and it does.

Okay, and using what we've just learned here, we'll be able to plot the data in this data frame. Let's say we want to plot column_a on the x-axis and column_b on the y-axis. We can do that by writing plt.plot(sample_data.column_a, sample_data.column_b), and we can show it by writing plt.show(). Let's see how it looks. We have 1, 2, 3, 4 and 5 on the x-axis, and on the y-axis we have 1, 4, 9, 16 and 25, as expected. If you want to add column_c to this plot, you can write plt.plot(sample_data.column_a, sample_data.column_c), so let's use column_a as the x-axis again. When you run the cell, you see that there are two lines here, just like before. If you want to make this graph a little bit easier to read, you can add titles and a legend.

By the way, in this plot function you can use the third argument to change how the plot looks. For example, if you give it "o", as a string, as the third argument in the first line, for column_b, then when you run the cell the plot becomes dots instead of just a line. There's a lot more you can do; you can find more about it in the official documentation.

Anyway, let's move on and do sort of a real analysis with a real data set. For this analysis, we're gonna use this data, countries.csv. It should be in the same folder as well, and when you open it you should see this data. We have a bunch of countries and a bunch of years, ranging from 1952 to 2007 in five-year steps, and the population for each year for that country. You can see that there are a lot of rows in this data. So let's now import that data just like before by writing pd.read_csv('countries.csv'). By the way, this is a string, 'countries.csv' in single quotes, and in Python you can use either double quotes or single quotes to express a string. Let's assign that to a new variable called data by writing data = in front of it. When you run the cell, this data is loaded into data, so once you write data in a new cell and run it, you should be able to see this data in a data frame.

Now let's say that the analysis we want to do here is to compare the population growth in the US and China. To do this analysis, the first thing we want to do is isolate the data for the US and China. We can do that for the US by writing us = data[data.country == 'United States'], and when you run this cell, us now only contains the data for the United States.

Let's break down this statement a little bit more. Let's click Insert here and Insert Cell Below. When you write data.country == 'United States', this actually gives a series of a bunch of Trues and Falses. When the row is not for the US, this gives us False, and when it is for the US, it gives us True. We don't see any Trues here, but there are a bunch of Trues where the rows are for the US. Then when you write data, square brackets, and this series of a bunch of Trues and Falses, this gives us the portion of the data where the value of the series is True, and that's the data for the US, as you can see here. Then we just assign it to this variable called us.

Okay, let's now do the same thing for China by writing china = data[data.country == 'China'], and when you write china here and run this cell, you should only see the data for China. Using these two variables, us and china, we'll be able to compare their population growth.

So let's first plot the US population here by writing plt.plot(us.year, us.population). You can show this plot with plt.show(), and when you run this cell you should see this graph. You see that us.year is plotted on the x-axis and us.population is plotted on the y-axis. But you see this scientific notation thing, 1e8, because the numbers are so big. So let's divide the whole population, each number in the series, by 1 million, or 10 to the power of 6; that's 10 ** 6 in Python. When you run the cell again, you now see the population in millions. So this is 160 million, and it goes up to, I think, more than 300 million in 2007.

Let's plot China's data on top of this plot by writing plt.plot with china.year. Actually, you could use us.year or china.year, because we have exactly the same set of years, but for now let's just use china.year for the x-axis and china.population for the y-axis, and we're gonna divide this by 1 million as well to make the population show in millions. When you run the cell, you should see these two lines. Let's add a legend and titles here to make this graph easier to read: plt.legend(['United States', 'China']), then the x label, plt.xlabel, should be just 'year', and plt.ylabel should be 'population'. Run this cell again, and this graph is much easier to read. You can see that China's population started out much larger than that of the US in 1952, and it seems like it's growing faster as well.

Now, what if you wanted to compare, instead of the absolute amount that you see here, the percentage growth from the first year that we have in our data, 1952? Well, there are several different ways of doing this, but I'm gonna show you just one way. To do that, let's first copy this whole block of code over here. Now let's say that for each country we want to find the percentage growth from the first year. So we want to set the first year's amount to 100, as 100%, and show the rest of the data in percentages relative to the first year. We can do that by dividing this whole series, for example us.population, by the first year's population and then multiplying everything by 100.

To show you what I mean, let's just create a new cell here above by clicking Insert Cell Above. Here, first I'm gonna write us.population, and you see a series of population values here for each year. The first row you see here is the first year's population, or the population in 1952, I think. Let's insert a new cell below here. Now, to retrieve the first year's population, you can just write us.population.iloc[0], and this gives us the first year's population, which is this amount. Then we can divide the whole population, this whole series, by the first year's population just by writing us.population / us.population.iloc[0], and this gives us this series. As you can see, the first year is set to 1 and the rest of the years are shown in relative amounts. If you multiply everything by 100, just by writing * 100 here, you'll be able to show everything in percentage amounts. So you can see that the first year is shown as 100%, and from 1952 to 2007, which is the last year we have, the population grew by 90 percent. Like I said earlier, this is not the only method to show the relative growth in population, but I chose this method here because it's pretty simple to implement.

Anyway, let's copy this whole thing and paste it over here to replace the y-axis. Let's do the same thing for China as well, so copy the whole thing for China here and then replace us with china. Once you do that, let's change the y label from 'population' to 'population growth', and let's just write '(first year = 100)' for clarity here. When you run this cell, you should see this graph. You can see that even in percentage terms, China's population grew much faster than that of the United States. The US population grew by 90 percent from 1952 to 2007, but during the same time China's population grew by more than 120 percent.

Okay, this was a pretty simple example, and it actually came from my course called Introduction to Data Visualization. If you liked this video, I'd actually highly recommend it. It's a course with more videos just like this one, and I cover more realistic and complex examples and more types of data visualization techniques, not just line charts. If you want to check out the course, you can just go to csdojo.io/moredata. You can actually take this course for free by signing up for Pluralsight's 10-day free trial; that's the site the course is hosted on. Anyway, as always, thanks for watching this video, and I'll see you in the next one.
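The iloc lookups and the column-based plot from the sample data part of the video can be sketched like this. The DataFrame is built inline from the values mentioned in the walkthrough instead of being read from sample_data.csv, which is an assumption made so the snippet is self-contained.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Inline stand-in for the table loaded from sample_data.csv.
sample_data = pd.DataFrame({
    "column_a": [1, 2, 3, 4, 5],
    "column_b": [1, 4, 9, 16, 25],
    "column_c": [10, 8, 6, 4, 2],
})

# iloc selects by position, counting from 0.
first = sample_data.column_c.iloc[0]
second = sample_data.column_c.iloc[1]
third = sample_data.column_c.iloc[2]
print(first, second, third)  # prints: 10 8 6

# The third argument is a format string; "o" draws markers instead of a line.
plt.plot(sample_data.column_a, sample_data.column_b, "o")
plt.plot(sample_data.column_a, sample_data.column_c)
plt.legend(["column_b", "column_c"])
plt.show()
```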
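The filtering step, data[data.country == 'United States'], can be looked at in isolation. The sketch below uses a tiny made-up table with the column names used in the video (country, year, population); the population numbers are placeholders, not values from countries.csv.

```python
import pandas as pd

# Tiny made-up table in the shape described for countries.csv.
data = pd.DataFrame({
    "country": ["China", "China", "United States", "United States"],
    "year": [1952, 1957, 1952, 1957],
    "population": [500, 600, 150, 170],  # placeholder numbers, not real figures
})

# Comparing a column to a value gives a Series of True/False, one per row.
mask = data.country == "United States"
print(mask.tolist())  # prints: [False, False, True, True]

# Indexing the DataFrame with that Series keeps only the rows where it is True.
us = data[mask]
china = data[data.country == "China"]
print(us)
print(china)
```

The filtered frames keep the original row indices (2 and 3 for us here), which is why iloc, which works by position, is the convenient way to get "the first row" afterwards.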
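The two comparisons at the end of the video, population in millions and growth relative to the first year, might look like the sketch below. The table uses round placeholder numbers for three years only, not the real contents of countries.csv, and percent_of_first is a hypothetical helper name introduced here for the divide-by-first-value step.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Round placeholder numbers, NOT the real data set.
data = pd.DataFrame({
    "country": ["United States"] * 3 + ["China"] * 3,
    "year": [1952, 1982, 2007] * 2,
    "population": [150_000_000, 225_000_000, 300_000_000,
                   500_000_000, 900_000_000, 1_250_000_000],
})

us = data[data.country == "United States"]
china = data[data.country == "China"]

# Absolute population, divided by 10**6 to show millions instead of 1e8 notation.
plt.figure()
plt.plot(us.year, us.population / 10**6)
plt.plot(china.year, china.population / 10**6)
plt.legend(["United States", "China"])
plt.xlabel("year")
plt.ylabel("population")
plt.show()


def percent_of_first(series):
    """Scale a Series so that its first value becomes 100 (hypothetical helper)."""
    return series / series.iloc[0] * 100


us_growth = percent_of_first(us.population)
china_growth = percent_of_first(china.population)

# Relative growth: both lines start at 100 in the first year.
plt.figure()
plt.plot(us.year, us_growth)
plt.plot(china.year, china_growth)
plt.legend(["United States", "China"])
plt.xlabel("year")
plt.ylabel("population growth (first year = 100)")
plt.show()
```

With this scaling, a final value of 200 means the population doubled, that is, it grew by 100 percent. The plt.figure() calls start a fresh chart for each comparison, which matters when the code runs as one script instead of two notebook cells.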
Info
Channel: CS Dojo
Views: 1,062,360
Rating: 4.9578176 out of 5
Keywords: python data analysis tutorial, python data science tutorial, python matplotlib tutorial, matplot lib tutorial, matplotlib tutorial
Id: a9UrKTVEeZA
Length: 22min 0sec (1320 seconds)
Published: Mon Jun 11 2018