Matplotlib Tutorial (Part 2): Bar Charts and Analyzing Data from CSVs

Video Statistics and Information

Captions
Hey there, how's it going everybody? In this video we're going to continue learning about Matplotlib and seeing how to create some different types of charts. Specifically, we're going to be looking at bar charts. In this video we're also going to see how to load in data from a CSV instead of just having our data directly within our Python script, because most likely when you're plotting data, the data is going to be coming from another source like a CSV file. Now, I would like to mention that we do have a sponsor for this series of videos, and that is brilliant.org. So I really want to thank Brilliant for sponsoring the series, and it would be great if you all could check them out using the link in the description section below and support the sponsors. I'll talk more about their services in just a bit. So with that said, let's go ahead and get started.
Okay, so in the last video we learned the basics of Matplotlib and how to plot some data and customize our plots in different ways. I have a stripped-down version of the code that we wrote in that video opened up here in my editor, and I'll have a link to this code in the description section below if you'd like to follow along. But just in case you're not continuing from the previous video, let me go over this code really quick. So first we are importing pyplot up here at the top, pyplot from matplotlib. We are using the fivethirtyeight style for our plots. Our ages_x here, this is our x-axis; it's just a list of numbers. dev_y, this is the values that are going to be on our y-axis. And here we are plotting out that data, so we're plotting out our x values, which are the ages, and the y values, which is our dev_y here, and we're giving it a custom color and a label. I've got some commented-out code right here. All of this data is median salaries for different ages: this is for developers in general, this is for Python developers here, and this is for JavaScript developers here, but I've got those commented out for now. We are also putting a legend on our plot, giving it a
title, an x and y label, and giving it a tight layout, which just helps with the padding, and then lastly we are showing it.
So when we plotted our data in the last video, we used this plt.plot method, and when you use the plot method it will draw a line plot by default. So if we run this, then we'll see something kind of similar to what we saw at the end of the last video. We can see that we get a line plot here for the median salary of developers, and again, this is some data that I took from the annual Stack Overflow Developer Survey. But let's say that we wanted to show this as a bar chart instead. Well, to do that, we can simply use the bar method instead of the plot method. So if I just change this to use bar instead of plot, then we'll have a bar plot. Just like that plot method, we can pass in our x values first for our x-axis and the y values for our y-axis, and additional parameters here can be passed in as well, like color and label. So I'm just going to leave that as is, just like it was with the plot method, and if I run this, then we can see that now this is plotting our data and it's represented as a bar chart instead.
Okay, so that is plotting the data for all developers who answered the survey. Like I said, I also have the data for Python and JavaScript developers as well, and right now those are commented out. So what if I wanted to include those in our bar chart? Well, first of all, you can mix and match some plots. So if for some reason you wanted the Python and JavaScript data to remain as line plots and just overlay that onto our bar chart, then we could simply uncomment our code here and just run these as plots, and that will actually overlay line plots on top of our bar plot. Now, that doesn't make much sense in this situation, but depending on your data you might find that useful. Okay, but what if we wanted to include these in our bar chart as bars side by side with the other data? You might think that we could do this just like we did our
line plots and just run those using the bar method as well, but that's actually going to give us some issues. So let's try that real quick and see what that does. I'm going to change these to use bar, so plt.bar. I'm going to run that, and we can see that this doesn't quite look right. We can't even see the data for all of the developers, and the data for Python and JavaScript is overlapped. So how can we put these side by side? Because right now they're just all stacked on top of each other.
We can do this by offsetting the x values each time we plot some data. Now, I actually think this is a lot harder than it should be. It seems a bit hacky in my opinion, but this is just how we have to do it. So to do this, we're going to have to import NumPy and use that to grab a range of values for our x-axis. Now, if you've never used NumPy before, then don't worry too much about it; we're just going to use one simple function. I believe NumPy should be installed when you install Matplotlib, so we should just be able to import it without doing any additional installs. So up here at the top I'm going to say import numpy, and I'm going to import that as np. That's a convention when using NumPy, to import it as np. And now, below our x values here where we have our ages_x, I'm going to create a range from these values. So I'm going to say x_indexes, and I'm going to set this equal to np.arange, and I'm going to pass in the length of our ages_x list here. What that's going to do is create a variable called x_indexes, and that is an array of values, and those values are going to be a numbered version of our x values. So basically it's a lot like having a list with an index starting at 0 and counting up to our last item, but instead it's a NumPy array. So once we have that, we're going to use that for our x values within our bar chart method. I'm going to copy that, and instead of using our ages here, I'm instead going to use those x_indexes. So I paste those x_indexes into each
of our bar methods here. If I were to run this right now, then it would look very similar to what we had before, but now we're just using those indexes instead. But now that we're using these indexes, we can actually shift the location of these by adding to or subtracting from our values here. So if we think about it, they're all stacked up on top of each other right now, so let's shift our first bar to the left and the last bar to the right. But how far do we actually want to shift these? Well, we want to shift them by the exact width of a bar. So to do this, it would be nice if we specified an exact width for our bars so that this is explicit. I believe that they have a default width of like 0.8 or something like that, but just to be sure, let's create our own width variable. So up here underneath x_indexes, I'm going to create a width and set this equal to 0.25. I think the default of 0.8 is going to be a little thick with three bars being side by side, so I think 0.25 would be good here, but you can experiment with these different widths if you'd like to get different looks depending on your data.
So now that we have a width, let's subtract that width from our first plotted values, and we'll add that width to our last plotted values, and that should shift those bars to all be side by side. So with our first bar plot here, which is right here, we are going to say x_indexes - width. Then for our second bar chart we're not going to do anything, because that's going to be in the middle. And then for our last bar chart we'll say + width, since we want that to shift over to the right. Lastly, before we plot this, we're actually going to need to tell our plot that we want the width of the bars to be equal to the width variable that we just created, and we can do that just by passing in another argument here. So right before color on all these, I'm going to add a width, oops, let me spell that right, width equal to width, and I did that for all three of these bar methods, so width equals width
here, here, and there. So now that we've done that, if we run our code here, then now we can see, if I make this a little larger, that our bar chart has these all lined up side by side instead of being stacked on top of each other like they were before. Now, if you have more or fewer bars that you need to fit side by side, then you'll have to adjust the offsets accordingly for the number of bars that you have. The way that I did this was with three, but if you added another bar, then you'd need to do an offset with the width added twice, and so on.
Now, also, if we look at our x-axis down here, we can see that we no longer have the age ranges that we had before. It's using the indexes, since that's what we needed to do our offset. So to fix this, let's go back to our code. I'm going to shut that down, and down here towards the bottom we're going to need to use xticks to change the labels. So right here above the title, I'm going to say plt.xticks, oops, let me spell that right. Within this xticks method we need to pass in a couple of arguments. So I'm going to say ticks is equal to, and those ticks are equal to the x_indexes. Now, the labels for those ticks are going to be equal to our ages list here. So we are using those x_indexes for the ticks, and the labels, which are all of our ages that we saw before in the last video, we're going to use those for our labels. So now if I run that, then we can see that now our plot has our x-axis labeled correctly.
Okay, so we've looked here at vertical bar charts and how to add multiple different bars to that plot, and in a minute we're going to look at how to create horizontal bar charts. But first I want to load in some data that's more appropriate for a horizontal chart. You usually want to use horizontal bar charts when you have a lot of data and it looks too crowded in a vertical plot. And the data that I want to load in is going to be from a CSV file. So far we've only used data that has been directly in our Python script, but most of
the time you're likely going to be using data from external sources like a CSV file, and sometimes you're going to need to work with that data a little bit before it's actually ready to be graphed. So first, let me get rid of the data that we've been using so that we can make room for the data that we'll load in from our CSV file. I'm going to remove everything from our plt.xticks there all the way up to our ages, and remove all of that data. I'm also going to comment out our plot titles and plt.show and things like that.
And now let me open the CSV file and show you what this looks like. I have this open here in my current directory, and like I said, all of this is going to be available for download in the description section below if you want to follow along. So this is the CSV file that I'm going to be loading in here. This is also data from that Stack Overflow Developer Survey, but I cleaned it up a little bit and only grabbed the data for the programming languages respondents said that they worked with. We can see that the top line here tells us what information this is. This first column here is the responder ID, so these are just IDs for each person who answered the survey, and the languages worked with, these are the languages that that specific person said they knew. So this first person here said that they knew HTML/CSS, Java, JavaScript, and Python, and we can see that these languages are all delimited by a semicolon here. So each line here has all these different languages, and using these we can graph the most popular programming languages from that survey.
So let me go back to my script here, and like I said, let's say that we wanted to create a bar chart of the most popular programming languages that people said that they work with. So first, let's grab the data from that CSV file. Now, there are multiple ways that we can load in a CSV file. We could use the csv module from the standard library, we could use the read_csv method
from pandas, we could also use the loadtxt method from NumPy. Now, first let's use the csv module from the standard library, since most people are probably familiar with that, but then I'm also going to show you a faster way using pandas and that read_csv method. So first let's use the standard library to do this. At the top here I'm going to import csv, and now I'm going to read that file using the csv module. Now, if you don't know how to work with CSV files using the csv module from the standard library, then I do have a detailed video specifically on that, so I'll be sure to leave a link to that video in the description section below if anyone is interested.
Okay, so the way that we can read this in is I can say with open, and the file we want to open is called data.csv, and it's in the same directory as this script, so I don't have to specify a full path. And now we can just say as csv_file, and now we can use this csv module to read this in. So I'm going to say csv_reader is equal to, and I'm going to use the DictReader from the csv module to read in this CSV data. The DictReader actually gives us a dictionary for each row, where we can access the values by key instead of by index, and I find that pretty helpful. So to do that, that is csv.DictReader, and now we just want to pass in that csv_file. Okay, so now we should have that CSV data in our csv_reader variable, and this is an iterator that we can loop over. Now, I don't want to loop over all of these right now, because I think there are like 90,000 rows in that data there. So instead, let me just print out the first row so that we can kind of see what this looks like. I can grab that first row by saying row is equal to next(csv_reader), and that will grab that first line from that iterator. And now let's print that out, so I'll print out row. So if I save that and run it, let me make my output a little larger here. Okay, so we can see that this is an OrderedDict, and the keys are what we saw as the headers in the CSV file
and the values are the responses for that particular person. So like I said, we want to plot the most popular programming languages, so those are within the key languages worked with, right here. So let me just print out that key instead of printing out that entire row. If I save that and run it, then we can see that now we get those languages, and like I said, these are delimited by semicolons here. So to clean this up a bit and turn this into a list of languages, we can actually split the values on that semicolon. After we access that key, we can simply say .split and split on those semicolons. So if I save that and run it, then now we can see that we have a Python list of those languages. Sometimes you're going to run into data that you need to clean up or analyze a bit before you're actually able to plot the data that you want, so that's why I'm showing that process here.
So in our case, we want to plot the most popular programming languages from the results of this survey, so we need to keep a count of each language that each respondent said that they work with. There are a lot of different ways that we could do this as well. We could keep a list and count them at the end, or we could keep a dictionary and update the counts of that dictionary each time. But this is actually so common that Python has a built-in class for this kind of thing called Counter, and it's definitely the best way to do something like this. Now, if you don't know how Counters work, they can be extremely helpful, and I plan on making a video specifically about Counters in the near future, but I haven't put one together just yet. So first let me show you a quick example of how Counters actually work. Let me open up my terminal here, and I'm going to run Python and show you how Counters work really quick. To import these, I'm going to say from collections import Counter; they are from the collections module. And now that we have a Counter, I want to say c is equal to Counter, and I'm
going to pass in a list here. So I'm going to pass in a list of Python and JavaScript, those two values in my list. If I look at that Counter, we can see that this says, okay, I have a Counter here, I have a key of Python and that's currently set to 1, and I have a key of JavaScript and that's currently set to 1. So it's keeping count of how often it sees these values. To update this Counter, I can simply say c.update, and now I'm going to pass in a new list. So for this new list, let's say this time I say C++ and Python. Okay, so now let me look at this Counter. When we look at the Counter, we can see, okay, now Python is 2 because it's seen Python twice: we saw it up here when we first created the Counter, and we saw it right here when we updated the Counter. It's still only seen JavaScript one time, the first time we created it, and it's only seen C++ one time. So now let's do an update one more time. If I run that update statement again with C++ and Python and then look at our Counter again, now it's saying, okay, I've seen Python three times, C++ twice, JavaScript once. So this is what we're going to use to keep track of these languages. Let me exit out of Python here. I hope that all made sense to you, because these are the kinds of things that you need to do sometimes when you clean up data for plotting.
Okay, so I'm going to close down that output. Now, up here at the top of my script, I'm going to import that Counter, so again, that's from collections import Counter, spell that right. Okay, now I'm going to instantiate a new Counter right after we read in our CSV data. So right above our row here, I'm going to make a variable and I'm going to call this language_counter and set that equal to an empty Counter. Right now we only have the data for a single row, but we want to grab the exact same list of languages from every row. So in order to do this, we can copy what we've already
printed out here. This big long thing here is what got us that list of languages from that single row, so let's copy that. And now we can loop over all of the rows of our CSV data and update our Counter with the data that is within this list here. So I'm going to say for row in csv_reader, and this will loop over every row in that CSV file, and I'm going to say language_counter.update, and we want to update that with that list of languages for every single row. So I'm going to paste that in, and this section here is what's going to give us that list of languages. So now our language_counter gets updated with all those languages.
Okay, so now let's print out our language_counter to see if it looks like we have some coherent data. I'm going to do this back on the main level of the Python script, outside of this with context manager here. So above our plt.title I'm going to print out language_counter. Let's run that, and it looks like we've got some good data here. Okay, so since this is a Counter, it should print out sorted with the most responses at the beginning. We can see here that we have JavaScript with 59,000, HTML/CSS with 55,000, SQL with 47,000, Python with 36,000, Java with 35,000, and so on. Now, we can see that there are a lot of programming languages here. If I remember correctly, I think there are 28 total, so we probably don't want to plot all of these. So let's say that we just wanted the 15 most common languages. Well, the great thing about using a Counter like we did here is that it actually has a most_common method built in to do this for us. So whenever I'm printing this out, I could say print language_counter.most_common and just pass in a 15, and if I run that, then that is the 15 most common responses. That most_common method actually returned a list here, and each item in this list is a tuple, so this is one tuple here, and it's containing the language and the count.
So now let's try to plot this data. How would we do this? Well, first we need to split out the languages into
their own list and the corresponding counts into their own list. When we did our previous bar charts, we had our x and y axes, so we'll want all of our languages on one axis and the counts on another, and that's why we need to split those up. There are also a couple of ways that we can do this. Let me show you a way that takes a little bit more code, but I think it's going to be one that almost everyone will be able to read. So to do this, I'm just going to overwrite this line here. Actually, I will keep that there for now, but above this line I'm just going to say languages and set this as an empty list, and then I'll say popularity, and that's going to be for the numbers. So we want the languages in this list and the corresponding popularity in this list.
So now let's loop over all those tuples that we got back from this most_common method. I'll say for item in language_counter.most_common, whoops, and let me go to the next line here. Remember, this is going to be looping over a list of tuples, and the first value of that tuple is going to be the language and the second value is going to be the popularity. So I'll just say languages.append, item index of 0, to grab that first item and append that to our languages, and we want to append the second item to our popularity. So now if I print out our languages and our popularity, print languages, print popularity, save that and run it, then we can see that now we have one list here that is all of our top 15 most common languages, and the second list here is the corresponding popularity of each language according to that survey. So now we can actually use these two lists for our plot. Now, there's actually a way of doing this whole section right here with a one-liner using the zip function and unpacking values and things like that, but I wasn't sure how many people would find that confusing, and I think it's easier to read this way, so I just decided to do it this way instead. Okay, so now that we have
these lists here, let me exit that output there, and I'm also going to get rid of those print statements. So now that we have these lists, let's plot these just like we did before. To do that, we can just say plt.bar, because we want to make a bar chart here, and on our x-axis we're going to plot the languages, and on the y-axis let's plot the popularity. Let's also uncomment our titles and labels here and change those to match what we're actually plotting. So instead of median salary, I'm going to type in, let's just say, most popular languages. Spelled that wrong, that's okay. For the x label here, I can just say our x label is the programming languages, so I'll say programming languages, and for the y label here I'll say number of people who use. Okay, so now with that in place, let me save that and run this, and let's take a look at our chart.
Now, we can see right off the bat that when we have this many items, it's hard to see all of these using a vertical bar chart like we did here. When you have a lot of items, it might be more readable to use a horizontal bar chart instead, and we can do that easily just by changing our bar method to a barh method. So right here where we're saying .bar, I'm going to change that and say .barh. Now, we can leave our arguments exactly as they are, because the horizontal chart expects the y-axis values first, so we'll just keep our languages there. We will have to change our axis labels here, because those are going to be different now. So I'm just going to switch the x and y labels here real quick, so I'm going to have programming languages as our y label and number of people who use as our x label. Okay, and now I think that's about it. And actually, now that I think about it, I don't even think that we need this y label telling us that these are programming languages. That's pretty self-evident, since the names of the programming languages are actually the labels themselves, so I'm just going to get rid of that. That's one thing with plots: it's
nice to be descriptive, but you can also be overly descriptive. So I'm going to get rid of that; actually, let me just comment it out instead. Okay, so now let me run this, and now we can see that we have, whoops, a vertical bar chart here. Let me open this back up and make this a little larger. Okay, so what I meant to say is we have a horizontal bar chart here. We can see that this is much easier to read with a lot of values, and those aren't scrunched together like they were in that vertical bar chart. So whenever you're plotting things out, if you've got a lot of values to plot with a bar chart, then it might be a good idea to use a horizontal one for this type of thing.
Now, one thing here is that with a horizontal bar chart, maybe you want the most popular language at the top. Right now it's down here at the bottom, and maybe we want that at the top since we read from the top down. To do this, we could simply reverse the lists that we're passing into the barh method before we actually plot. So I'm going to close that down, and now up here before that barh method I'm simply going to say languages.reverse and popularity.reverse. The reverse method on a list actually reverses it in place, so we don't need to set languages equal to this or anything like that; it's actually going to modify that list in place. So now if I save that and run it, then we can see that we have the most popular languages up top, and I think that looks a lot better.
Now, I did say that I was going to show you a faster way to load in that data from the CSV using pandas, so let me show you how to do that, because for the rest of the series I'm probably going to use pandas to load in data since it's a bit faster and it's also a bit cleaner. So first of all, if we don't have pandas installed, then we'll need to do that, and it's really easy to install. So first let me install that. I'll just open up my terminal here and clear this out, and we can just install that using pip by saying pip install pandas, whoops, got the wrong
spelling there, pip install pandas. And now once that's installed, I'm just going to assume that that installs correctly, and it did. Okay, so back here in our script, up here at the top, we need to import this, so I'm just going to say import pandas as pd. That's another convention when you're using pandas, to import it as pd. Okay, so up here at the top of our file, instead of opening our file and using the DictReader to read in the data, we can instead replace that with a pandas method. So I'm going to get rid of this with context manager here, and since we got rid of that context manager, I'll unindent these other lines here. But now, where we were opening that file, instead I can simply say data is equal to pd.read_csv and pass in the name of that CSV file, and it was data.csv.
And now I can specify some columns. I'm going to create this ids variable, and I'm going to say ids is the, let me see exactly what that column name was, responder ID. So I'll pass in responder ID there, and that's going to set this ids variable equal to all of the IDs in that responder ID column. We can do the same thing with the languages, so I'll call this variable lang_responses, and that is equal to data, and we want the key to be languages worked with, so I'll grab that. We still want our language_counter, but now here for our loop, instead of saying for row in csv_reader, which doesn't exist anymore, we have this list of languages here, so I can just say for response in lang_responses, update that Counter. That simple update to our code there should work exactly the way that it worked before. So if I save this and I run it, then, whoops, name 'row' is not defined. Okay, so yeah, I got an error here that says name 'row' is not defined. I also meant to update this section here, because there's no row anymore, so we just want to split the response instead. So response.split, because
remember, with these lang_responses here, when we're looping through these, each response is going to be this entire section here of all of the languages, so we can simply just split that response. Okay, so I'll save that and run it, and this should work exactly like it worked before, and we can see that it does. That looks pretty good.
Now, like I was saying before, this is actually real-world data that I grabbed from their actual survey, and I actually have the charts that Stack Overflow put together when they analyzed their survey data. So let me open those up and see if we got similar results. I'm going to put my chart here on the right, and their chart I have open here in the browser, so let me open that up. Okay, so here is their chart plotting out the exact same thing that we just plotted. Now, there could be some small differences here based on how I sanitized the data compared to how they sanitized it, but you can see that as far as the order goes, we got the same results. They've also styled their plot a bit further, but with a little customization we could probably get something very similar. It looks like we just need to change up the colors a bit, add in a little spacing, and also make these lines a little thinner, and it would almost be identical. So that's why learning things like this can be extremely useful, because these companies are constantly looking for people who can analyze their data and present it in ways that can give insights like this. So this is definitely a skill that you're going to be able to apply to a lot of different situations, just like we did here.
Okay, so before we end, I'd like to mention the sponsor of this video, and that is brilliant.org. Brilliant is a problem-solving website that helps you understand underlying concepts by actively working through guided lessons. They have computer science courses ranging from algorithms and data structures to machine learning and neural networks. They even have a coding environment built into their website so that you can
run code directly in the browser, and that's a great way to complement watching my tutorials, because you can apply what you've learned in their active problem-solving environment, and that helps to solidify that knowledge. Their guided lessons will challenge you, but you also have the ability to get hints or even solutions if you need them. It's really tailored towards understanding the material. So their computer science material is fantastic, and I really like what they're doing. They also have plenty of courses depending on what you're most interested in, so they have courses in different fields of mathematics, or astronomy, solar energy, computational biology, and all kinds of other great content. So to support my channel and learn more about Brilliant, you can go to brilliant.org/cms to sign up for free, and also the first 200 people that go to that link will get 20% off the annual premium subscription. You can find that link in the description section below, and again, that's brilliant.org/cms.
Okay, so I think that is going to do it for this video. Hopefully you feel a bit more comfortable working with Matplotlib and how you can plot out the data that you need and create the types of charts that you'd like. In this video we covered bar charts, but in the next video we're going to learn how to create pie charts. Pie charts are great for seeing how our data is proportioned and quickly visualizing what different categories make up large and small pieces of your data, so be sure to check that out. But if anyone has any questions about what we covered in this video, then feel free to ask in the comment section below and I'll do my best to answer those. And if you enjoy these tutorials and would like to support them, then there are several ways you can do that. The easiest way is to simply like the video and give it a thumbs up. Also, it's a huge help to share these videos with anyone you think would find them useful. And if you have the means, you can contribute through
Patreon, and there's a link to that page in the description section below. Be sure to subscribe for future videos, and thank you all for watching.
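The side-by-side bar walkthrough earlier in the transcript (index positions from np.arange, a fixed width, shifted x values, then xticks to restore the age labels) could be sketched roughly as follows. This is a minimal sketch, not the code from the video: the helper names bar_offsets and plot_grouped_bars are hypothetical, the salary numbers are placeholders rather than survey data, and centering the group for any number of series is an assumption that reduces to the "minus width, nothing, plus width" shifts when there are three.

```python
def bar_offsets(n_series, width):
    """Return one x-shift per series so each group of bars is centered on its tick."""
    # For 3 series and width 0.25 this gives [-0.25, 0.0, 0.25].
    return [(i - (n_series - 1) / 2) * width for i in range(n_series)]


def plot_grouped_bars(labels, series, width=0.25):
    """Draw side-by-side bars; `series` maps a legend label to a list of values."""
    # Imported inside the function so bar_offsets() works without these packages.
    import matplotlib.pyplot as plt
    import numpy as np

    x_indexes = np.arange(len(labels))  # positions 0, 1, 2, ... one per label
    shifts = bar_offsets(len(series), width)
    for shift, (name, values) in zip(shifts, series.items()):
        # Same width for every series, shifted so the bars do not overlap.
        plt.bar(x_indexes + shift, values, width=width, label=name)

    # Put the original labels (the ages) back in place of the index positions.
    plt.xticks(ticks=x_indexes, labels=labels)
    plt.legend()
    plt.tight_layout()
    plt.show()


# Placeholder values for illustration only; these are not the survey salaries.
ages_x = [25, 26, 27]
salaries = {
    "All Devs": [10, 20, 30],
    "Python": [12, 22, 32],
    "JavaScript": [11, 21, 31],
}
offsets = bar_offsets(len(salaries), 0.25)
```

Calling plot_grouped_bars(ages_x, salaries) would then open a window with three bars per age. With widths much above 0.3 and three series, neighboring groups start to touch, which is presumably why the video picks 0.25 over the default.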
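The interactive Counter session described in the transcript can be reproduced with the short sketch below. The variable names first_counts and ranked are added here for illustration; the values passed in are the ones spoken in the video.

```python
from collections import Counter

# Start with one sighting each of Python and JavaScript.
c = Counter(["Python", "JavaScript"])
first_counts = dict(c)

# Each update adds the new sightings to the running totals.
c.update(["C++", "Python"])
c.update(["C++", "Python"])

# most_common() with no argument lists every key, highest count first.
ranked = c.most_common()
print(ranked)
```

After the two updates, Python has been seen three times, C++ twice, and JavaScript once, which matches what the transcript reports.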
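The CSV counting steps from the transcript (DictReader, splitting on semicolons, updating a Counter, most_common, then reversing the lists for a horizontal chart) might look like the sketch below. It reads from an in-memory string instead of data.csv so that it runs on its own; the three sample rows are made up, and the column names Responder_id and LanguagesWorkedWith are assumptions based on how the transcript describes the header. The zip line is one way to write the one-liner that the video mentions but does not show.

```python
import csv
import io
from collections import Counter

# Made-up stand-in for data.csv; the header names are assumed, not confirmed.
SAMPLE_CSV = """Responder_id,LanguagesWorkedWith
1,HTML/CSS;Java;JavaScript;Python
2,C++;Python
3,JavaScript;Python
"""

language_counter = Counter()
csv_reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
for row in csv_reader:
    # Each cell holds several languages separated by semicolons.
    language_counter.update(row["LanguagesWorkedWith"].split(";"))

# Keep only the top entries (the video uses 15; 2 suits this tiny sample).
top = language_counter.most_common(2)

# Unzip the (language, count) tuples into two parallel lists.
languages, popularity = map(list, zip(*top))

# Reverse in place so the most popular language ends up at the top of a barh chart.
languages.reverse()
popularity.reverse()
print(languages, popularity)
```

The two lists can then be passed to plt.barh(languages, popularity) as in the video. The explicit append loop from the transcript gives the same result as the zip line and may be easier to read.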
Info
Channel: Corey Schafer
Views: 197,725
Keywords: matplotlib, python, python matplotlib, data science, data analytics, data visualization, python plotting, python graphing, matplotlib tutorial, bar chart, bar graph, bar plot, python bar chart, python matplotlib tutorial, matplotlib (software), python (programming language), python tutorial, corey schafer, python programming, pandas, csv, python pandas, pandas read_csv, pandas plot
Id: nKxLfUrkLE8
Length: 34min 25sec (2065 seconds)
Published: Tue Jun 11 2019