A Visual Guide To Pandas

Video Statistics and Information

Captions
As a warm welcome to the guy who usually isn't up until the night owl, Jason Wirth. Come on, go and look it up.

Thank you very much. I'd first like to give a round of applause to Braintree for sponsoring this great event. Pretty good find, ChiPy.

So today I'd like to talk a little bit about pandas, and I'm going to title this "Visual Pandas". The reason is that there are a lot of pandas tutorials, particularly if you go on YouTube. You can see Wes give a bunch, and he's very code heavy, and that's awesome. But if you come from Excel, like I come from an Excel background, I'm a very visual thinker. I'm used to thinking in tables, so sometimes when I'm learning pandas or going through things it helps to have a visual understanding of what's going on. So a lot of information here will be redundant if you're a pandas ninja, but if you're new to the library or you're a visual thinker, this might give you a different way to think about how the library works.

So pandas is really built off of NumPy. Who here uses NumPy? Three, four, five, six. All right. So NumPy is array-based storage. You can have an array of numbers here, and you can index into that and get those numbers out. It's very similar to a Python list. You can have a 1D array and you can also have a 2D array, right? So you can see things here are stacked by rows and columns, and you can get into those and get columns out, you can get rows out, you can do different types of analyses on that. If you see here, we can index into them based off of numbers, right? So give me the first row, give me the zeroth row, give me the zeroth column, things like that.

So pandas basically takes that NumPy array and gives you a labelled index on that. So if you want value 14, how would you get that out? Anybody? B, absolutely. And so when you have an index and you have a 2D array, you can think of having two indices, right? You have one along the rows, one along the columns. So if you've used Excel, you can think about calling things in terms of the columns or calling
things in terms of the rows. And so we can just apply labels to these things. One of the nice things about pandas is that there are two different ways that you can work with it: you can work with a Series, which is like a 1D array, or you can work with what's called the DataFrame, which is a 2D array. And so when you're thinking about it, you really want to think about it as a kind of dictionary, where you're going into things with keys and you're going to pull out a bunch of values. And so pandas is really a dictionary-based NumPy.

So let's go through a couple of examples here. If you have this data frame on the top and you want to pull out column A, you can just index into it and you get column A, and you get that bottom data frame out. Along the top we have the column index, along the left we have the row index, the 1, 2, 3, 4. Now you can select multiple items as well. You can see here that if you pull into the index of A and C, you're going to get out this data frame on the bottom, which is a two-column one.

So now if you want to select more columns, you want to get different things out of your data set, what do you think will happen here if we want to get columns B, C and D? Anyone have any guess what this call at the bottom will return? You're slicing the data frame B to D. Who here, by a show of hands, thinks that we're going to get that data frame out? One, two, three, four. Who here thinks no? One person. Who here is uncertain? Ah, we've got people. So only one person says no, and you know what, it's not going to work. One of the weird things if you're coming to pandas as a beginner is that things don't always work the way that they might seem. We've indexed into it in a few similar ways before, but when we want to index into it in a kind of Pythonic way that we're used to, it doesn't always work that way. So we can index into it with this ix, this index, right? And you can see that we're going to have a colon and then B, and we're
going to slice all the way to D. And so pandas allows you to slice columns; it also allows you to slice rows. We just saw this before with our index; now we're going to go from two to four along the rows, okay? So if you wanted to get one, two, three out, you would just put that in there. You could put two in there and you'd just get that row out.

You can also do kind of creative things in terms of slicing and getting intersections of things. So here we're getting an intersection of a row and a column. You can see that we're going to pass into this data frame ix method, right, we're going to pass in a list: the first one is the rows that we want, the second one is the columns that we want. And we're going to get back that middle data frame, those values of sixes and fours. It's going to return back to us a data frame that has an index of B and C along the top and two and three along the rows.

And you can do other creative slicing with things. Here you can see that there's a very familiar kind of start, stop and step method of slicing, and using the ix method we can get out kind of creative shapes within our data. Along the rows we're going to start at one and go to nine with a step value of three, and then along the columns we're just going to take from five to eight.

Now one of the nice things about pandas is how it handles missing data. When you're working with a lot of data sets you'll have holes in things, right? It'll kind of match up in one but it won't necessarily match up in the other, and pandas handles that for you automatically. So we can take two series, which are one-dimensional arrays, we can stick them together into a data frame, and what we get here is one resulting data frame with missing values. I've tried to indicate those as white here. You can see series A is listed in the red and series 2, or B I guess, would be listed in the green. And the way you do that, it'll align them. I hope people can see that it's automatically joining on the C
and D, because that's common to both sets. And so if you're working with data, like maybe you have customer data, and you have one list that has a set of customers and another list that has a second set of customers, you can join those two with pandas and it will automatically intersect them where the two match up. Where they don't match up, it handles that flawlessly: it just puts nothing in there, it puts a NaN value.

There are a few different ways that you can create a series. In the first one we've created it out of a dict: you can see on that last line of code that we're going to pass in a dictionary of series, the first one being labeled A, the second one being labeled B. Here we're going to do it and we get a slightly different data frame result.

And so when we have these missing values we can do creative things with them. We can fill them with values: you can pass in a default value of zero, you can pass in a default value of, you know, -99999 if that's how your data works. In the top, you can drop those; you can just completely erase them. And so it's nice to be able to take something that's missing and exclude it.

Now this is really cool. If you start with this data frame that's in the upper left, and say what you really want is on the right-hand side, right, you want labels 4, 5, 6 and 7 along the rows and D, E, F and G along the columns. That's kind of a tricky thing to get if you're just working with standard lists in Python. And so here we can pass in an index, you can see along the right-hand side, and it's going to take that, it's going to grab that section of our data that we want, right, that lower right-hand corner, that 4x4 grid of cells, and then it's going to fill in missing values for all of the remaining cells that we don't have. And so you can get slices of data that you can fill in with values. This is pretty creative.

And so the basis of data frames and series is the index, and there are a few different types. There's a standard index,
there's what's called a multi index, which I'll cover here, and then there are a few that I don't cover. There's an integer index, so instead of dealing with named labels, right, A, B, C or customer names, something like that, you can deal with just plain numbers. Then there's also dates and times, if you're dealing with time series, right, where you want to slice out things based on years. And there's a period index, where you can take ranges of things: you take quarterlies, you take days, you can take weeks, things like that.

And so the hierarchical index is really cool, because up to now we've been dealing with a one-dimensional array, which is a series, and a two-dimensional array, which is a data frame, and if we have a data frame with a hierarchical index we can think of that as being an N-dimensional array. Pandas has a third data structure called a panel, and that's kind of the name pandas, and a panel is really meant for, say, three-dimensional or more, but when you're dealing with small data sets, 2D with a multi index works fine.

And so you can see here that we have a 2x2 grid of data, right, but we have three different indices along the rows and two different indices along the columns. When we look at that it can look like this. Say that you're dealing with different segmentation of customers, right: you have locations, Chicago might be a location, Detroit might be a location, you have different days of the week and different tests. Maybe you're a web developer and you're doing some kind of segmentation testing. So you can arrange your data here. And so a series that's one-dimensional but has a multi index is really 2D, and we can see here on the right-hand side how those labels, how those indices would correspond to each other.

And so when you get into what a multi index is, it's an array of tuples, and this will come in handy later when you're doing things like groupbys, which are going to give you back multi indices, and for selecting on higher
dimensional data. And so when we look into this it can get kind of tricky, but you can see that on the levels we have two different lists with two different indices: the first one being Python, right, the second one being Ruby; on the second level we have speed and lines. And it shows you on the labels how they're arranged. And because pandas is built on top of NumPy, it can do calculations, it can do slicing and selecting very, very fast.

So now if you have some data, how would you go about creating a multi index? Well, you can simply take a data frame and pass into it a multi index that you've defined with a list of values. You can see here that we're going to have four separate items, right; you can think of these as being like primary keys in a database. You could do it from tuples as well. You can also do it straight within a constructor: to create a multi index in a series constructor, you're just going to put two different lists inside of a list, and that creates your index.

Now pandas does a lot of database-style joins, and so there's merging, there's joining, there's concatenation, and this works well for bringing different types of data together. So if you have two different data frames, say that you brought in from separate CSV files or disparate sources, and you want to create one data frame so that you can work on them, you can merge them. Here, if you just call this pandas merge function and pass in two data frames, it'll give you a resulting data frame with those two joined. What's important to note here is that they're going to be joined based on a similar column name. Both of the data frames that I have pictured have a column named "name", and obviously it's going to join on that column, and you can see that it's going to join where the items in that column are equal. So each one has a B, each one has a C, and so our resultant data frame has that. Now if your data doesn't quite match that ideal fit, you can choose different things to join on, so
here you can join on a left column and you can join on a right column when they have different names. You can see that it aligns them; it also matches up so that we get out the union of the data. Now if you have something with an index, you can join on that as well just by specifying that you want it to be the index, and you can even do combinations of things: you can join on both indices and columns.

And so if you're familiar with working in databases, in SQL, right, you have outer joins, you have inner joins, you have left, you have right, things like that. Well, you can do similar stuff in pandas. You can see here in our first data frame, in the red, we have A, B and C, and we're going to join it with one that's missing an A value. If we choose the left, it's going to take that left one and choose all of the values from that, and give us the resulting data frame with NaNs in the missing spots. We can do similar stuff here with the right: you can see that it's taking all of the right values. You can do an outer, where it's going to take both and give you back missing values for everything, so if you want to create a big list of stuff.

Now there's a similar one called join. In my work I mainly use concatenation, which I'll get to in a second, and merge. Join is useful because it allows you to take a list of things and pass it in together. So we're taking one data frame, we're calling the join method on it, but we're passing it a list of subsequent data frames. So it's much more convenient than using merge, where you'd have to use just two.

And so concatenation is another way to take data from different sources and bring them together. Concatenation is basically a kind of gluing of things together. So if you have, say, a hundred different CSV files and you want to take segments of those and join them all together to create one resultant one, concatenation is a good tool. And normally when you do it, it's going to stack the rows and give you one long data frame.
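The merge, join and concatenation calls described above can be sketched as follows. This is a minimal illustration, not the code from the slides: the frames left, right and c, their values, and the column names x, y, z and label are made up, and only the shared column called "name" comes from the talk.

```python
import pandas as pd

# Hypothetical data; the values on the speaker's slides are not in the transcript.
left = pd.DataFrame({"name": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"name": ["b", "c", "d"], "y": [20, 30, 40]})

# Default merge: joins on the shared column "name" and keeps matching rows only.
inner = pd.merge(left, right)

# SQL-style variants are chosen with the `how` argument.
left_join = pd.merge(left, right, how="left")   # every row of `left`, NaN where `right` has no match
outer = pd.merge(left, right, how="outer")      # every key from either frame

# Key columns with different names on each side.
right_renamed = right.rename(columns={"name": "label"})
by_cols = pd.merge(left, right_renamed, left_on="name", right_on="label")

# DataFrame.join works on the index and accepts a list of frames.
a = left.set_index("name")
b = right.set_index("name")
c = pd.DataFrame({"z": [7, 8]}, index=pd.Index(["a", "d"], name="name"))
joined = a.join([b, c], how="outer")

# Concatenation: stack rows, label each piece with a key, or glue side by side.
stacked = pd.concat([left, left], ignore_index=True)
keyed = pd.concat([left, left], keys=["first", "second"])  # the keys become an outer index level
wide = pd.concat([a, b], axis=1)                           # axis=1 aligns on the index and adds columns
```

With these inputs, inner keeps only the rows for b and c, while left_join keeps a as well and leaves its y value missing.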
Now you can concatenate into a multi index, which we've covered before, and this might be useful if you're dealing with things with different lists or hierarchies and you want to give them some kind of ordering that makes sense, right, so you can pull it out later, so you can manipulate it in different ways. So here we're just going to concatenate together and we're going to give them key values. And then you can concatenate to expand it width-wise as well, and so here you would use axis equals one. Now sometimes if you're working with pandas and you're new and you're trying to join things together, it can be kind of confusing. I always think of axis one as joining along the columns: a one, to me, visually looks like a column. And so the same thing as when we join along rows: you have a multi index, you can generate the same thing out of the columns.

Now selecting. We covered some of that before, but selecting gets more interesting. Here are some of the different selections. When you have a hierarchical index, right, and you want to select, you can just call "Py" here and you'll get all of the performance values for maybe some testing done for Python. So on the left-hand side, right, we're going to get Python out of that, and you can see on the bottom that we get two values, speed and lines.

Now if we want to select only the speed values, right, and we want to get how that represents for Python and how that represents for Ruby, you might think that we just pass in speed there, and that seems kind of logical, but that's not what happens. Pandas will blow up when you give it that. So instead there's xs, what's called a cross slice, and the cross slice allows you to select out individual items from a level. So here, for our category of speed, we can pass it into the cross slice and we can call on it the level. Now that level here we've given a name, which is, say, category, "cat" for short. And then we could also do it by an item, right, so the level is number one,
which is moving from the left, which is zero, moving inward; that would be the first one. And so we can do the same thing for selecting all of the columns. So if you have a bunch of different data and you want to figure out, well, okay, give me all of the people who have done test two and give me what days of the week they've done it, it'll create a data frame based on that.

So now if we have this highly multi-dimensional data frame, right, we have Monday, Tuesday, and Tuesday could go for maybe an infinite number of tests, right, like maybe that was a really test-heavy day, we did a thousand tests, how would we select that out? Well, we can join these things together. So we can just call the column indexer, right, by taking Tuesday out of the data frame; that will actually slice off that kind of Monday portion, right. And then we're just going to take from the rows: give me Tuesday all the way to the end, or give me from the first value all the way to the end, excluding the zero value.

And then last, there's stack and unstack, and these are really cool because they allow you to change the ordering of the indices. So we can turn a data frame into a series by calling stack on it, we can also turn a series into a data frame, and we can switch them around, and so it'll basically reverse them in between. So when we call data frame stack and unstack, what we're doing is we're rotating the indices, and we're rotating the other one back. And so this is a really creative way to manipulate your data. You might have a hierarchical index where you have a main value and then you have that separated out into days of the week or maybe months of the year. I do a lot of stuff with finance, so we're dealing with calendar values, right, we want monthly returns for something. So say that you have a certain fund that you are looking at and you have a big data set, right: along one axis you're going to have the fund name, and then as a sub index on that you're going to
have their monthly returns, right, so January, February, March, April. You call unstack on that and just get, as the rows, the fund name, and then along the columns the monthly returns. So if you wanted to then see some analyses on that, you could just take that and say, well, give me all of the fund returns for the month of January or the month of December. And it's really, really easy to work with your data by flipping it around. So thank you very much. Any questions?

Me? How do I actually use pandas? So I use pandas actually as a big database. I'll pull information in from a CSV file, I'll join things together, I'll add things, I'll sum things. There's a lot more to pandas than just what's covered here. There are really nice features where you can take a series of data and just say multiply by five, and it'll take all those values and in one line of code give you that kind of scalar back. And so when I'm working with different data sets, right, so if I'm analyzing financial data, I can pull all of this in, multiply it by something and get that value out, and then when I have that I can output it as a CSV file, I can store it into a database, things like that.

Jason, it seems like you folks do financial reporting. I'm just curious, what about the same thing with qualitative, textual data?

So the difference between working with quantitative or qualitative data. So some of the examples I've given here are numbers, right, but instead of numbers those could be strings. Pandas allows for strings, allows for numbers, allows for a lot of things, and so a lot of the things that you'll work with are in fact names of things, right. So you might have a certain column where all of the names are your customer names, and they might be last name comma first name, or it might be a telephone number, which is numerical but it's stored in a string. So you can work with strings, and it's really easy. There's a map function in pandas that I use a
lot, where you can pass in a value and you can separate your data out by just calling a function on that. As I think about it now, that sounds really abstract, and in practice it's really, really easy.

So pandas has a lot of import and export tools. It connects to, like, HDF5, it connects to your standard databases, and there are a lot of import tools from the web, if you want to bring in things from, say, Yahoo Finance. There are ways to bring in things from, just like, there's a data frame from clipboard. So say you want to analyze some kind of sports statistics, right, you go to ESPN the day after the Bears play and you want to see if, you know, Cutler should be benched or something, right. So you're going to copy a table out of Excel and you can go into pandas and choose, like, you know, the data frame from clipboard and pull that in. So it has those types of imports and it has those types of exports as well. Now there's to CSV; you can, sorry, you can export to different databases and things like that.

So that question's always brought up every time pandas is mentioned. So Wes McKinney, the creator of pandas, was working with R, and R has a data structure very similar to this, which he kind of copied. And so I would say it's pretty much, I don't say it's one-to-one. I'm not an R user and I'm not a MATLAB user, so I'm not really qualified to point out the holes.

It's all in memory.

Do we have one last question? Well, thank you very much.
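For readers following along in a current pandas release, the selection, alignment and reshaping steps from the talk can be sketched as below. This is an illustration under stated assumptions rather than the speaker's slide code: the ix indexer used in the talk was later deprecated and removed in pandas 1.0, so the sketch uses loc (labels) and iloc (positions) instead, and every variable name and data value here is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with labels similar to those described in the talk.
df = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=[1, 2, 3, 4],
    columns=["A", "B", "C", "D"],
)

col_a = df["A"]                      # one label -> Series
cols_ac = df[["A", "C"]]             # list of labels -> two-column DataFrame
b_to_d = df.loc[:, "B":"D"]          # label slice over columns; the end label is included
rows_2_4 = df.loc[2:4]               # label slice over rows, also inclusive
block = df.loc[[2, 3], ["B", "C"]]   # intersection of chosen rows and columns
stepped = df.iloc[0:4:3, 1:3]        # positional start:stop:step, end excluded

# Two Series with partly different labels align automatically; gaps become NaN.
s_a = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
s_b = pd.Series([10, 20, 30], index=["c", "d", "e"])
aligned = pd.DataFrame({"A": s_a, "B": s_b})
filled = aligned.fillna(0)           # replace missing values with a default
dropped = aligned.dropna()           # keep only rows with no missing values

# reindex keeps the overlapping corner and fills the new cells with NaN.
shifted = df.reindex(index=[3, 4, 5, 6], columns=["C", "D", "E", "F"])

# A Series with a two-level index; the level names "lang" and "cat" are assumptions.
idx = pd.MultiIndex.from_product(
    [["Python", "Ruby"], ["speed", "lines"]], names=["lang", "cat"]
)
perf = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)
python_only = perf["Python"]               # select on the outer level
speed = perf.xs("speed", level="cat")      # cross-section on the inner level, by name
speed_by_pos = perf.xs("speed", level=1)   # the same level, by position

# unstack moves the inner index level to the columns; stack moves it back.
table = perf.unstack()
back = table.stack()
```

Because loc slices by label, df.loc[:, "B":"D"] returns columns B, C and D, whereas the positional iloc slice leaves out its end position, as a Python list slice does.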
Info
Channel: Next Day Video
Views: 134,610
Rating: 4.6638985 out of 5
Keywords: chipy, dec_2013, JasonWirth
Id: 9d5-Ti6onew
Length: 26min 58sec (1618 seconds)
Published: Fri Oct 16 2015