Alexander Hendorf - Introduction to Data-Analysis with Pandas

Captions
Welcome, everyone, including everybody who has just joined the room. There is a git repository with the Jupyter notebooks, the slides, and everything I'm going to show you, including all the solutions; it is already on GitHub, so feel free to download it. This will be a walkthrough tutorial, so I'm mostly going to walk you through what I'm doing. I'm happy to take questions up to a certain extent, because we only have 90 minutes and quite some ground to cover. I encourage you to play around by yourself with what you have just learned. There will be no programming riddles to solve, because I think they take too much overhead and we want to learn a lot today. So this is an introduction to pandas. Who knows pandas? Okay. Who has worked with pandas? Who is very confident with pandas? What are you doing here? Okay, welcome too. Now, about your background: who is a real Python programmer? And who is rather new to Python? All right. And who has experience in at least one programming language? Almost everybody, so you are quite confident with programming. You don't need super programming skills to use pandas; you can basically learn the methods if you know basic Python, and the better you are in Python, the more you can do with pandas, especially if you want to mangle the data. Okay, let's start. I'm Alexander. I work with Königsweg in Mannheim; we're a strategic consultancy and we consult startups and mid-sized industry, and my domain there is data science consulting. I'm also on the team for EuroPython, where I'm the program chair and one of the organizers, and the same for the upcoming PyCon.DE in Karlsruhe in October, and I like to speak and teach. So what are we going to talk about today? We're going to do an introduction to pandas. I'm going to tell you a little bit about the origin of pandas and why it is there, and
we're going to do a follow-along tutorial. We'll start with practical, really basic stuff, and then I'll try to introduce you, in a little more depth, to how pandas works, what its inner mechanics are, and how you can make use of them to work with your data. To follow along and play with the code, please get this git repository now. It's not necessary for following this tutorial, but it's handy to have. We're going to put the short URL here for everybody who is late, or just take a picture. Can everybody in the back read the slide? Okay, maybe we should make the room a little darker. I think that's better, isn't it? I hope that's okay for the recording as well. No? Okay, someone will try to solve the light situation; it's very binary here. In case you have trouble reading it, just download the slides and the Jupyter notebooks. So what are we going to do today? First we're going to start with reading and writing data: reading data into pandas and writing data out of pandas. Pandas is in itself something like a Swiss multi-purpose knife for data analysis, so pandas is very happy to read almost every data format you can think of, very easily. You can read CSV, Excel files, JSON; you can even load data from the clipboard, HTML, HDF, you name it. If you want a full list, please feel free to go to the pandas docs. This brings us to our first slide, a practical start into pandas. If you're new to pandas: we import pandas with import pandas as pd. It's a convention you find everywhere; nobody uses just import pandas, it's always import pandas as pd, and so we reference pandas as pd for the rest of our code. Let me try this; I think this will work. Okay. In our data directory, which is just a directory, we have some files, and this is more
like a Jupyter trick: we can use exclamation mark pwd, just like on a Linux console, to see what the current working directory is. Another handy thing is that you can also do an ls to see what's in our data directory. We see we have some sales data, the Bluth sales data, and it also includes a generator for fake user data, which is quite handy. The first thing we're going to do is take one of these CSV files and read it into a pandas DataFrame. What a DataFrame is I'm going to cover later; for now let's just imagine we get the CSV into a table we can work with. All we need to do is call the read_csv method and pass in the file path, and that's basically it. Let's explore our data set. This is what we have; this is basically our data set. That's the magic of Jupyter: all we need to do is print the sales data to the screen and we get this nice rendering of what we have. That's one of the simplest things you can do in pandas. A few things you probably already see: we have a header row in our CSV, and the header row is automatically used for the column names. We also see this thing here on the left which is not in the data; that's called the index. We're going to talk about that later, but basically it's quite simple. Since this output is a little too long, let's reset this. That's another thing you can do on the fly: we can set an option for pandas to display only ten rows, so it's a little more manageable. We also get the information that our data set has a thousand rows and seven columns. Let's do a little more exploration of what we have in the sales data. The type is a pandas DataFrame, a two-dimensional array. It also has a standard length, which is a thousand, and we can also
inspect it, and that's very handy, especially if you work with Jupyter notebooks and want to explore a data set: you can iterate, really explore the data, and see what you work with. You can use the head function, just like head and tail on UNIX systems: head gives you the first five lines, tail the last ones. We also have an info method, and the info method gives us more insight into what we actually have in our data set. We see here that we have a thousand entries in each and every column. Most of them are objects, but we already see that units and unit price are integers and floats. Basically I just read in a text file and pandas figured out the rest, which is one of the great powers of pandas. Depending on the quality of the data you're reading in, though, it is also something you may have to fight with pandas about, or rather not fight with pandas, but use pandas to fight with the data you're reading in to get what you want. What we can note here, if we go a little further up: we obviously have the birthday and the order date. These are a date and a timestamp, and if we just use the standard method on the CSV file, they are still stored as object. So it's not a datetime; an object in pandas is usually a string, so they are still stored as strings. Date strings are not that handy to work with, so let's see what we can do about it. We use the same method and pass in one more parameter, which is parse_dates, and in parse_dates we pass in the column names of the datetime columns, and pandas will figure out the rest for us. So we have a new sales data DataFrame now, and as you can see here, the birthday is now a datetime and the order date is now a datetime. This is super handy, and it's also highly customizable. There is still one caveat, something you
have to keep in mind: we did not give any instructions on how our dates are actually formatted. We can give instructions, and we can also change the data later, which I'm going to show you later. There are automatic parsers included in pandas and they're really good at guessing what the best format is, but they are not perfect for every data set. For example, the datetime parser is very US-date friendly, and as you probably know, the US is more or less the only country in the world that uses the month/day/year format, which is odd. But you can be more explicit in the parsing: you can also pass dayfirst=True, which says that all our dates have the day at the very beginning, and this is then handled automatically for you. Keep that in mind, especially if you have a data set with dates at the beginning of the year, or with low day numbers; it can be really tricky and you can end up with crazy dates. Okay, so we've read our first CSV file; it's quite easy. I don't want to go too deep into the methods, but I encourage you to read up on them yourself. For example, I think the read_csv method has more than forty parameters you can pass in, and this can be quite overwhelming for a beginner, because there is a parameter for everything. I myself was quite happy recently to find something for which there was no method built into pandas. So just read them through, don't panic, and stay focused on the problem you want to solve. Also, you don't need to solve all your problems while importing the data; you can mangle your data later, probably in a more flexible way, and I'm going to show you a little later how you can mangle and work with your data once you have it in your DataFrame. So there are many methods. And this, by the way, is a nice application called Dash, and it's good to store
offline documentation. Since I travel a lot, this is really handy if you want to look something up on the plane. So let's go back. JSON: let's read in some JSON. This is of course very simple; all the read methods are very similar, so this looks quite the same. Let's see what we have here. Again, everything is imported as an object. But I'm a little confused, because for JSON there is a conventional standard datetime format, and still pandas tells me it's imported as an object. The reason is that in our sample data set the datetime is not in the ISO format; it's a bit more free-form, so pandas does nothing here. We have one approach to solve it, which leads us to the next input method: we can also just read in the JSON ourselves. This is our JSON; we see here that we have an order date. It looks formatted, but not the way it's supposed to be in JSON, which is something like this here, with a T in the middle. Of course we can do all this cleanup and turn it into a date ourselves: we use the JSON we read in, then we have a dictionary, and now we change the date format in our dictionary. So this is just the Pythonic way we are taking here. And of course we can also read data from the dictionary into pandas, and this is what I actually want to show you here: we have the sales data from the dictionary in Python. So pandas is also really handy if you have data in Python and you want to mangle it or export it to a CSV file, for example. There is also the csv library in Python; I had to use it in earlier days, but nowadays I only use pandas for that because it's much handier, more powerful, and simpler to use. We see here the birthday: we changed the format to a datetime in our JSON.
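As a rough sketch of the steps described so far (reading a CSV, parsing date columns, and building a DataFrame from data that already lives in a Python dictionary), the snippet below uses a small made-up data set. The column names, the sample values, and the day-first date format are assumptions for illustration only; they are not the tutorial's actual files.

```python
import io

import pandas as pd

# Hypothetical stand-in for the tutorial's sales CSV; columns and values are made up.
csv_text = """name,birthday,orderdate,units,unitprice
Alice,1980-02-03,2017-01-05 10:15:00,3,9.99
Bob,1975-11-30,2017-01-06 14:40:00,1,24.50
"""

# Without parse_dates the date columns are read as plain strings, not datetimes.
raw = pd.read_csv(io.StringIO(csv_text))

# parse_dates takes the names of the columns that should become datetimes.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["birthday", "orderdate"])

# Records that already live in Python, here as a list of dictionaries.
records = [
    {"name": "Alice", "orderdate": "05.01.2017"},
    {"name": "Bob", "orderdate": "06.01.2017"},
]
from_dict = pd.DataFrame(records)

# Being explicit about the (assumed) day-first format avoids any guessing.
from_dict["orderdate"] = pd.to_datetime(from_dict["orderdate"], format="%d.%m.%Y")

print(raw.dtypes)
print(parsed.dtypes)
print(from_dict.dtypes)
```

Passing an explicit format is one way to avoid the month/day ambiguity mentioned earlier; dayfirst=True is the looser alternative when the exact format is not known.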
This works correctly now, but that was just one more step on top of what we did before. Probably a little more efficient is to use the convert_dates parameter, where we can just pass in the birthday and order date columns, and pandas does the rest of the heavy lifting. Let's look at our info to see whether this was successful. Yes. I encourage you, when you read data, to really use this info method; it's really handy to see what's in your data set. We're going to explain a little more later why the data type in pandas is important. As Pythonistas we don't care much about types; Python does most of that for us. But in pandas it's one of the things that brings all the performance. We also have a describe method for our DataFrames, and the describe method gives basic statistics about our DataFrame. So we have the count, a thousand rows; the mean of the unit price and the units; the standard deviation, the minimum, the maximum, and the quartiles. If you want to take a peek into your data, it's really handy to just read in the data and call describe, and you already have an impression of what's in your data set and whether there are probably outliers, for example if the maximum is very far from the mean. Okay, here we go. Of course, if you have to deal with sales support or classical industry, they still love Excel a lot. Actually I was never so fond of Excel, but recently I fell in love with it again because it's really handy, as you will see in a bit. So we can also work with Excel. Guess what: you can read the data just by giving it a path to the Excel file. One nice thing here is that you can specify which columns to parse. In Excel you have all the columns A, B, C, D and so on, and then the rows are
just numbered. We can, for example, take just the first two columns. So if we want to see the birthdays of the people in our data set, we can immediately instruct pandas to get only those two columns into a DataFrame. And here is the nice thing about Excel data: we see the birthday is already a datetime, because internally Excel supports datetime; it's stored as a number there. So we don't need to mangle anything here. Exporting to Excel, exporting to CSV and everything else I leave up to your imagination, because it's basically the same except for some parameters you can set. It's really easy: we just pass in a filename. We have, for example, index set to False here, because I don't want the index exported into my Excel file, I just want the columns, and we can also name the sheet "data". And the Excel file is now saved here. The rest is left to your imagination; let's save some time. Why did I recently fall in love with Excel again? It was when we were building the EuroPython schedule. EuroPython has around 200 talks with many constraints: speakers' availabilities, topics that should be grouped, and so on. This is a really hard problem; there are just too many ways to solve it. So I built a piece of software to do all the heavy lifting in the schedule, and then I wanted something to visualize everything, so I started exporting all schedules to Excel. This picture is just an example where Excel and its close connection to pandas can be really strong. If you use the XlsxWriter library, you can basically make it the engine of your pandas Excel writer, and you can even set colors, column sizes, row heights, everything. This was really handy because it was very simple to set up. For example, we have the data talks in red and another track in lilac. So this was really handy for producing a useful, simple report for the program
committee to decide whether this is a good schedule or whether we should work on it a little more. So this is a hint if you work with Excel and really want to export Excel sheets that make your colleagues from other departments happy: look into the XlsxWriter library. There is really good documentation with many examples, and it's quite simple to do. Let's go on a bit. Let me just mark this and copy it to my clipboard. This is also something that can be really handy: if you get structured data from another source and you can copy and paste it, you can even copy and paste it into pandas. Of course it does not always work that well. Let's give it a retry. Okay, so I copied these two lines and read them in from the clipboard. Of course pandas is just checking whether it's possible; it will easily break if there's half a column missing or something like that. But for example, if you have something from SQL queries, it's super simple to get it in via the clipboard if you want to work with it quickly. That's already the end of the first part of our session on getting into pandas, the end of our little light warm-up. What have we learned? Reading and writing data in pandas is really simple. It's very customizable, and it can be overwhelming for beginners, so don't be too overwhelmed; concentrate on the problem you want to solve and look for the right parameter to solve it. A lot of the handling is done really well by pandas by default, so you don't need to configure a lot of stuff. All this very much depends on the data quality; many times you have to fight with the bad data quality you have to work with, and pandas can be your friend. One of the caveats, which we haven't
seen here because there are no null values in our data set, is something that is sometimes confusing for beginners: if you read in a column with integers, you expect integers, and they are imported as floats. That's because there are NaN values in there, and by default only the float data type can handle NaN values. Just keep it in the back of your mind. So that's a little summary. Any questions? I think it was quite easy-peasy, just a warm-up, so you want to do more. Where does all the power of pandas come from? That's why we now hand over to NumPy: pandas is built on NumPy. NumPy is the numerical library, the library for working with numbers in Python, and that's where all the power comes from. If you know Python: Python is dynamically typed, so you don't really have to define types as in other programming languages. But of course dynamic typing costs performance, because while your code is running, in the background the type always has to be rechecked, guessed, and all that. NumPy types the data. We're going to see a little more of that when I show you a little more of the data model, and also the definitions in a little more depth. If we talk about a table: tables often refer to DataFrames. We have columns and rows, and in pandas there are Series and DataFrames. Let's have a closer look at the structure. Our data is just a NumPy array. This is an array of integers, for example. It's like a list, but a typed list: only one data type is allowed in this list. It's not like a Python list, where you can have strings, integers, whatever, all mixed up. There is only one data type here. So we have our NumPy array here, and a pandas Series is basically a labeled NumPy array. This is just a one
dimensional array, so you have the data and an index. For example, here the index is positional; it should actually start at 0, because we have zero-based indexing here as well. When you work with pandas it's very important to think in Series, and these Series are like verticals. If you work with data in a table, you would usually rather think in rows: you iterate through each and every row. Of course there are methods in pandas where you can take your DataFrame and iterate over the rows, because that's easier for our heads. But if you want to get more creative and have a better and much more pleasant way to work with pandas, really try to think: this is a Series, and this is one more Series, and multiple Series with the same index are a DataFrame. So basically the index is the link between the Series; it glues everything together into a DataFrame. There's also another thing: the three-dimensional object is called Panel. So if you stumble across it, you know, oh, Alex told me about it in Berlin. But you can already forget about it, because it has been deprecated; we can use multi-indexing for this now. So all we need to remember here: a Series is one-dimensional, and a DataFrame is two-dimensional, made of one or more Series. As definitions, like synonyms, we can say: a table is a DataFrame, a column is a Series, and a row is all the values at the same position in each Series of the DataFrame. I would have loved to come up with something more compact, but that was my best shot. Let's have a deeper look into Series. I should also say this is in the notebooks as well, but since I have added some extra stuff here for visualization purposes, we're going to use the slides for the tutorial. How do we create a Series? It's very simple: we call the Series constructor, pd.Series, and here we just pass in 10
random numbers, and this is our result. We see we already have a data type, because we asked for random integers, and a positional index has been created on the left side from 0 to 9, since there are 10 values. Very simple. How can we access data in a Series? We only have one dimension now. We can access it by position, just as we can in Python. If we pass in 0, we get the first value, and you can see our Series on the right-hand side so you can follow along. We can also do slices, as with Python slices, from 3 to 6, and we get this. There is also the iloc method we can use, and please note the square brackets: it is not called like a method, you use square brackets for iloc, and we can do the same. iloc is basically integer location, so we can pass the position in here. [Audience question about loc] Yes, loc; I'm going to cover it in just 10 seconds, thank you. As I told you, you can imagine a Series as a labeled array. So let's set an index; we're just resetting it to letters. Now we can still do slicing on the Series by position, the Pythonic way, and we can even do slicing by passing in labels, from d to f. This is pandas super magic. Can I just pass in any slices? No, of course not: you can only slice one block, where one value comes after another, so passing several slices will fail. For example, if you want two separate ranges, such as d to f and i to j, the better solution is to use the concat method and pass in the two slices, and you get this. And now we come to the loc method, which was just asked about. With the loc method we can pass in a label, and the label in our case is a string. So the loc method will look at the index; the iloc method will always go by
the position. It's really easy to remember: iloc, integer; loc, label; the i stands for integer location. The loc method also supports slicing, but something is different here, because I've set a new index. The index is no longer a to j; I passed in the letters of "Gattaca" plus the characters X, Y, Z. With the loc method I cannot actually slice from G to A, because the index is not unique. That is one of the learnings about the pandas index: indexes are not necessarily unique. Just a second. Yes? Oh, you were just stretching, okay. [Audience remark] Oh, G selects the first G, so the value 60. Yes, I'm sorry, I have to fix that. Thanks for finding that out. But here X, Y, Z are unique in our index, and here the loc method will work. This can be a little confusing for beginners, especially if you don't know about the index, and that is why I'm covering indexes at this quite early stage, because I think it's one of the most important things to understand. [Audience question about the order of the labels] Actually, yes: this is just a list comprehension taking each letter for renaming, so it is not sorted unless I instruct the Series to sort explicitly. And here you see, when the index is unique, it knows where to look, and you can see there is a match in between. I think we could exchange the Y for anything there. Yes, I get it now: it is the order in the index that is used for the slicing. It does not care about alphabetical order; it just looks for the start label and goes on until it hits the label at the end of the slice. So it's not intelligent; it's just iterating over the index to see what the slice can be. So, as I
mentioned, the label of the Series is usually called the index. It's automatically created if you don't give it one when you import your data or create your Series, as you already saw. It can be reindexed, it can be replaced. It's immutable, so you cannot just decide, oh, I have these duplicates here, go to the position and change one index value; you have to reset the whole index then. It can only contain hashable objects, so integers, tuples, strings, but of course not sets or dictionaries. It can have one or more dimensions; today we're only covering up to two dimensions because we won't have time. And remember, the index is not necessarily unique. There are multiple index types; I'm sorry we cannot cover them all. There's the Index; the MultiIndex, which is used a lot for groupings; the DatetimeIndex, my favorite index, I talk about it a lot; the TimedeltaIndex; and the newer CategoricalIndex and IntervalIndex, introduced in recent versions of pandas. We cannot cover them, so let's go on and see how we can select data from DataFrames. We have a DataFrame; it's two-dimensional data, and we see we have some Series as columns and ten rows with some random integers. Let's see how we can slice them. If we now use a DataFrame and pass in just an integer, we don't get the row, we get the column. This is very important, because remember, with the Series we got the value from the Series, which is more like a row, a row with one value. So if you select on a DataFrame, you select the columns, the Series. We can also slice, but this is very confusing: if you slice and put in a range, you get the rows back, which I think is a bit of a logical break that you have to get used to and which can be confusing at the beginning. We also have an iloc method
here, and with the iloc method we can also select a segment. You probably can't see it from the back; there are four values selected here. For that we pass in 2 to 4 twice, once for axis 0 and once for axis 1, and I'm going to explain the axes and give you a way to remember them in a second. We also have the loc method, and it's basically the same. If you want to use the loc method, the first axis, axis 0, is the rows, so we pass in, just as we can in Python, a range without start and end, so everything, and then the second to the fourth column, and this is what we get back. I really had a hard time at the beginning with axis 0 and axis 1: what's this? So let me explain. Basically it's very simple: axis 0 is the rows, so the horizontals, and axis 1 is the columns. Since I'm one of those guys who mixes up left and right all the time, I found a mnemonic to remember it. Axis 1 looks like a 1 because it's vertical, and this is a good way to remember: oh yes, 1 gives me back the Series because it looks like a 1. Next: indexes and column names. When we created the DataFrame we only had positional values here; of course we can set them. So I'm just resetting the index to some strings with two digits, and the same for the columns. We can rename them really simply in pandas. The selection here is the same: we can pass in a column label such as C05, which gives us back the column; we can pass in a range, and remember, ranges on DataFrames go for the rows, not the Series, the columns. We can also use a range of labels here; it works with the same logic we just looked at more deeply
with "Gattaca". We can also use the loc method to get a slice out of the middle of our data set, which follows the same logic. And now we come to something called boolean indexing. What is this boolean index? A boolean index has nothing to do with the indexes I just mentioned; it's just called indexing as well. We see this is the data we have in our C04 Series, and we can create a boolean index very easily: if we ask for the Series greater than 60, what we get back is a Series with the same labels whose values are just True and False. So a boolean index is just an array of yes, no, yes, no, yes, yes, yes, no. And basically, if we combine our data with the boolean index, we only get the data back where we hit a yes. That is all the magic. If we pass the boolean index into the outer square brackets, it returns all the rows where C04 is greater than 60. We can also do combinations of this: we can pass in an "or", with the pipe in the middle. Note that we have to put parentheses around the boolean indexers, here smaller than 60 and greater than 60, so we select anything that is smaller than 60 or greater than 60 and get those rows back. We can also do multiple "and"s and pass in anything that creates a boolean index. This is one of the things that comes in with NumPy, the broadcasting: if you work with vectors, you know you can take one vector and multiply it with another, and the same works here. So we can just say: give us back everything where the value mod 2 is 0, and we only get these two rows in the end. I have to speed up a little because there are only 15 minutes left. Now you're familiar with data selection
and indexing on Series and DataFrames. Please keep in mind that selection on Series and on DataFrames, the slicing, is different. This can be a little confusing at the beginning, because you're probably not always aware whether you're really working on a Series or a DataFrame. Also, when you select data you always get a copy of the data returned, and most of the operations you do on DataFrames are not applied to the DataFrame until you explicitly instruct pandas to do so. We will see that in a minute when we come to operations. With operations, I want to show you how we can add and remove series, do stuff like adding, subtracting and mangling the data, and how we can work with null values, with NaN. So let's go to the next one here.

Okay, yes, you're right. Oh, I'm very relieved now. You're right, thank you; I probably should have brought my watch. Oh yeah, we have 42 minutes left. Okay, let's do some exercises now. Well, since we have 40 minutes left, thank you again: any questions on selecting, indexing or anything else?

Yes, the question is how we can index columns. Actually, columns are not indexed; columns have names. Let me go back a bit and show you. Here at the top we reset the index, and we set the column names just by passing in a list of the names we want to have there.

Oh, boolean indexing, you mean boolean indexing for columns? Now that's a really good question. I think you basically work with Series, because then it's rows. So transposing is probably the best thing, or just getting the series. Why do you want a boolean index for the columns? What's the use case you're thinking of? You can also select a column just by name if you want to. Okay, yes, you can just
select one or multiple columns just by passing in a list of the column names, so you don't need boolean indexing there.

If you want to use boolean indexing for columns, you could just make a list of your columns and do boolean indexing on that one; then you could give the list to your pandas DataFrame, and then you get boolean indexing for columns.

Yeah, for example. Actually, I've never stumbled across a need for it, but thanks for bringing it up.

I just wanted to ask: if I select one column, is it automatically a Series?

Yes.

So I don't need to convert it to a Series?

No, no. If you select a column you get back a Series, including the labels, so the index. The whole index always comes back with it in the background.

All right, cool. So where were we? Okay, so what can we do to mangle our data, how can we work a little bit more with the series? That's what I summed up as operations. So let's get our favorite dataset back; remember, here it's stored in df, for DataFrame. We can work very dynamically with our DataFrames: we can add columns, delete columns, select rows. For example, if you select just a part of our DataFrame, you always get back a new DataFrame, or a Series if you only select one series, and you can work with that DataFrame or Series just like any other you created. So basically, if you have a huge DataFrame, it's very easy to use pandas to slice out a part of the DataFrame and work with it without changing anything else.

So let's just add another column, an eleventh one called C11, to our DataFrame. That's simple. As you see, this was just a list of the same size, ten numbers, that we added to our DataFrame. As I just explained, it's really easy to take a series out. Is it just as easy to create a new series and pass it
in? Let's see what we are going to do here. Here we create a new pandas Series by passing in the same list with ten integer values, and now we have a new series. Let's have a look: this is our new series. And let's just add this new series to our DataFrame.

Oh. Why did the stupid Python list work, and our super sophisticated pandas will not work? Ideas? The index, thank you, yes, of course. When you add a series to a DataFrame, pandas actually looks at the index. We have these R00 to R09 labels on our index here, and our new series just had an auto index, a positional index from zero to nine, so of course this did not work. So if I want to add a series to my DataFrame, I have to set an index here. Let's make it an extra step: here we create a new series, and it's really easy to just get the index from our DataFrame and set it as the index of our new series. Of course this has to have the same length.

We are not going to cover joins here, but you could also have used a smaller series with only six values, for example just from R00 to R05, and added it to our DataFrame. So they don't necessarily have to have the same length; pandas would then add NaN values, just like we saw here. We would have six values here from our list, and the rest would automatically be null values. This is really easy and handy if you have to join data from various sources: if you have corresponding labels, it's super easy to get everything together in one DataFrame.

So let's add this as another one, as C12. How can I get rid of it? I want to have my DataFrame smaller, a little bit more readable. There are many ways to do
this. One is just to use del, as in plain Python, and remove the series from our DataFrame. Just remember, here we go up to C10, with a capital C. Of course, if I pass in an unknown column name, I will get a KeyError. But if we're not sure which columns we have in our DataFrame, we can avoid KeyErrors if we use the drop method and instruct it to ignore errors. So we can drop any column by name with the drop method without raising errors if the column is not present. That's really handy if you get data from outside and there are different column names.

Of course, if you just want a smaller selection, if you're only interested in, for example, two columns of your DataFrame, there's no need to delete all the others. You can basically create a new DataFrame with the loc method: instruct it to give us back all the rows and just these two columns, and store the result in a new DataFrame or overwrite the DataFrame variable you're already working with. So this is very easy.

And how can we mangle the data we have in our DataFrames? Let's do a little recap of NumPy broadcasting and what that basically is. We create one array, a, with three values, and another one, b, also with three values. Broadcasting is pretty simple. We have these two arrays, three values each, and we can, for example, just add one to the other, and of course the values are added according to their position. We can subtract them, we can multiply them, we can divide them. We can even add a constant: since there's only one value, seven will be added to each and every value. This is just broadcasting. We can also do an integer division here, so this is all 0, and of course we can also multiply by another array. So this is the idea of broadcasting. I think
everybody got the idea, and the same comes in handy if you work with pandas Series. Imagine you have a series with multiple values, and there's often stuff you want to do with these values. So let's get back to our sales list. The sales data is from the Bluth store, and this is just some customer data. We have a customer, the customer is a company; we know the first name of the person who purchased; we know their birthday; we have an order date; we have a product name, these are the 10 most successful products of all time according to some internet list; the unit price, which is the price we asked when the item was sold; and how many units the customer ordered. This is just random data.

So what are we missing here? For example, management asks us: what was the turnover? We say, okay, we have a unit price and units, but we really miss the turnover data. So let's add a new series called turnover to our DataFrame. To our DataFrame we add a new series, we label it turnover, and we can just pass in the two series, the unit price, which is the price we actually asked, and the units, and multiply them. It's as simple as that, and now we have another series in our DataFrame for turnover.

Now we can also ask for more information, for example: what's our mean turnover? We have all these nice methods and can apply them directly. We can also easily get the sum of the turnover of our dataset here, and we can ask what our median turnover is. And of course we can also have a closer look at our data with the info method.

We can also work with the data we have in our series in a very flexible way, for example with the map method. The map method here is just like in any other programming language or in mathematics. Map means this: take
each and every value of a list and apply a certain function to it. That's a map. So what are we doing here? We want to create a new column with the year the purchase was made. We take the order date; we know the order date is already a datetime object, so we don't need to do any data conversions here, and we can pass in a lambda function. All we do here is take each and every value, a datetime object, and ask for the year of the datetime object, which we can access directly with .year in Python. And this is what we get: we have the year now.

The same we can do for the month. And in case we are interested in an easy report by year and month, for example for a plot, a report or a summary, we can also combine multiple columns. Here we're creating a new year-month column, and we can just pass in the two series and basically concatenate them by adding one to the other. One thing here: we cannot just add one integer to the other, because we are looking for a string. So what we do here is take the data from the year column and map str, so each and every value in this series will be converted to a string. There are other methods to do this, which we will see in a second. And of course we want two-digit months that are nicely formatted. It's as simple as this, and now we have our year-month column.

One hint: if I wanted to create a sales report on a yearly or monthly basis with data like that, I would not do it like this. It's good to do if you're new to pandas, if you want to have your data really visible, but this would be a super good case for the DatetimeIndex, which we won't cover here today. Just as a hint.

Yes? Oh yeah, sorry, thank you. What's the difference between map, apply and applymap? I was just about to explain that here. map is for
Series; apply, you can work on a DataFrame with that; and applymap is basically applied to the whole DataFrame. But as far as I know there is no big difference anymore between apply and applymap; that was only in the beginning. I think if we do this here we will get the same result. So as we see, this was basically applied to the whole DataFrame, and if we only use apply we get the same. I'm not one hundred percent sure; I think there was only a difference in the early days, and I think it was merged.

As we see, we can also apply lambda functions to a DataFrame. We can get a subset of our DataFrame, just two columns, and apply anything to them, for example multiply by 10, by 1.5, or whatever we want to do. But as you see here, and this is very often forgotten in the beginning: here we have sales data, we select, we apply something, we get new values back, but if we look at our DataFrame, nothing has changed. It's really important to know that pandas always returns a copy of the data and does not apply all this directly to the DataFrame unless you instruct it to do so with the in-place methods. So you can say: do something to a series, pass in inplace=True, and then the DataFrame will change. Otherwise you can also just reassign the variable for your DataFrame; that's the same.

Yes, actually that's another thing which is very overwhelming: there's not only one right way to do it. There are many ways to do the same thing in pandas, so basically it depends on you. [unclear]

Okay, thank you. The question is why broadcasting is faster than applying a method. Honestly, I have never measured it, because most of the time I'm not really worried about getting the most out of the performance. pandas claims for itself that it's good for handling data up to one gigabyte in memory
and I have had bigger datasets in memory without really having any trouble; pandas always is fast. I used to worry about performance a lot in the early days, and nowadays I only worry about it if I have a performance problem. There are methods to use for bigger datasets, but we won't cover them in an introduction here. There's a nice tutorial from Tom Augspurger on how to work with bigger datasets, and I think there's also an explanation of the performance differences.

I have another question: are there functions to perform computations in place, directly, without copying?

Are there functions to perform the computations in place? Yes, you can pass in inplace, and then the DataFrame will change. That's important: usually, if you take some data out of pandas and do something with it, you always have a copy, but you can also instruct it with inplace=True to apply directly to the DataFrame.

So here we also have a describe method on our whole DataFrame, so we can see the same statistical data we saw earlier for the series, with count, mean, standard deviation, minimum and maximum, and of course this also works on series, for example on a datetime series. We see we have 284 unique birthdays within the thousand people we have here, and the most popular birthday, that's the top birthday.

So what else? For example, we now have floats in our turnover. We can also easily change the type of a series in pandas. If we want to replace it here, we assign to turnover what we get back from astype. So I can just say, okay, make this an integer, with a NumPy integer or by just passing in int, and this will return the float
value as an integer, and we just save it back to the DataFrame.

One more thing, especially if you get data from the outside which is not clean or not as you want it. So let's sneak in some null values. As we see, we now have null values in our dataset for unit price and also for the order date; that is a null timestamp, then. You can check your DataFrame for any null values, because sometimes, if you pass in a more special function, a null value might lead to an exception, and we don't want that. isnull gives us back any row where we have a null value. We can also just drop the null values from our DataFrame. We see we've lost some lines here, but remember, the dropna method also returns a new DataFrame, so this is just a copy of the DataFrame and nothing has changed.

The same goes for fillna: we can also pass in default values, if we know, okay, the best guess is that the data will have a zero there, or something like that. You can also do more sophisticated fills, forward fills, which we won't cover here, and you can even write your own interpolations for this if necessary. But as we see, our DataFrame has not changed. So this is again an example for the in-place parameter: here we do fillna with 99.99 for every null value we had, with inplace=True, and as we see, now our DataFrame has actually changed and every null value is replaced. And as you can see here, the NaT values are replaced by this as well, so there's no difference between NaT and NaN here, and so we now have an order in the '70s, a very early order.

Okay, to sum this up [Music]: you saw how you can apply, map and mangle your data and work with NaN and all the stuff. NaN is the representation of null values; you can replace them or drop the lines. It very much depends on the dataset you work with and how you want to
handle null values: sometimes you can drop them, sometimes you need some default values. And when modifying Series and DataFrames, it is very important to know that you always get a copy returned. I'm emphasizing this a lot because it's often forgotten. With the inplace parameter, the result can be applied directly to the DataFrame we are working with.

So, where are we? Okay, any more introductory questions on this?

In the fillna method, is it possible to write some conditions, such as: if the turnover is less than 500,000, fill the null values with this number?

You can write a Python method as you wish and just pass it in.

So we need to include the fillna inside those if conditions? I was asking about something like a condition inside fillna.

I'm actually not sure whether fillna will do this. It probably will, I think so, but I would have to look it up in the documentation. It should accept a method as well. If not, you can also do a map on the series, and there you can definitely pass in any custom function you want. And if you hit a null value, you can also check whether the value is a NaN value and then apply whatever you want in a very custom fashion. Okay, all right.

Then, as the last part, just a sneak peek into visualization and grouping. pandas comes with many simple built-in visualization tools, and it's also very easy to customize them. That's all handled by matplotlib, and there are also many libraries involved. Here, for example, this is just the instruction to plot within our Jupyter notebooks, if you haven't seen it before. We read in our sales data again, calculate the turnover here, and now we do grouping. We want to visualize a little sales report, so let's do a groupby, which is a very simple aggregation. If you know SQL, it's the same. So
we just take one key, which is the product name here, group by the product and basically get the sum back. Let's just have a look, and yeah, that's basically it. Important here: we first do a selection of the columns we want to work with, just the product and the turnover. The groupby is done by the product, and it will of course only work on the columns which are selected. We could also take the whole DataFrame, so this is not absolutely necessary, but we are not really interested in getting the sum of the unit price here. That's why I'm doing a preselection, because the sum of the unit prices is quite pointless.

Then we store it in the TT variable here, and then it's really easy: we can take this and do a bar plot. We take the DataFrame or the Series, plot it, and add bar for a bar plot. We can do the same for the mean, or for the units sold. So this is a new grouping by how many units were sold, which is a little bit nicer to plot, so we have it as a bar plot. The nice thing here is that we also have more plots built in from matplotlib.

Another nice thing is that you can still work with these plots. It's not just: take this, plot it, and that's it. plot returns the ax object from matplotlib; that's basically the plot object, and you can still work with it and improve it. For example, here I was interested in the median of the units sold, which is quite obvious here. So we take the ax object we get back, I save it in a variable called ax as well, and draw a horizontal line with a method from matplotlib directly. We take the units we have just aggregated, take the median of them, define a color and a line style, and it's immediately added to the same plot. And another nice thing
here, it just looks so simple: all the labels from your index and the data you work with are automatically added to your plots as well. It's really handy, because in earlier days you sometimes had to do a lot of customization to get a nice plot that is understandable for outsiders who are not familiar with the dataset. So here pandas does all the heavy lifting for you as well.

Okay, let's see a little bit more of what we can plot or aggregate. This, for example, is just a not so useful plot. By default we get a line plot back, and if I group by each and every order date, which is actually a timestamp, it's not very readable. But there's something you can do about it, and so finally I get to talk a little bit more about how handy, for example, the time series index can be.

Here we have our sales data, and we reset the index now to the order date. The order date is already a timestamp, a datetime object, so we set our index to these datetime objects, and this creates my most beloved DatetimeIndex. This is what it looks like now: we had some positional values here in the beginning, and now we have the dates.

And now we can do something really nice: we can take our sales data and group it by the index year and week, for example. With the DatetimeIndex you can directly access year, day, month, quarter, you name it. So here we group by the index year and the week, take the sum and just plot it, and this is basically all: for each and every week we get this plot back.

You can also use other libraries, for example Bokeh or seaborn. You can customize matplotlib really easily, as I just demonstrated, but seaborn or Bokeh are probably more suitable if you
want to have interactive visualizations, but that's probably enough for a tutorial of its own.

And this is the last example of grouping and immediately plotting the data with matplotlib: we can also see the distribution of our sales across weekdays. We can even group by them with our DatetimeIndex here and plot it as a pie. And I would say, just have a look at the documentation to see all the [unclear].

We have to get ready? Okay, tutorial, yeah. All right, so that's all right, I was at the end anyway. So thank you very much. [Applause] [Music]

Yeah, time for questions now, but I'm sure the speaker will be available.

Yes, I'll be [unclear] later; feel free to ping me or talk to me. Thank you.
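
The selection and modification steps from this part of the walkthrough can be sketched in a few lines. This is a minimal sketch and not the speaker's notebook: the row labels R00 to R09, the column names C01 to C06 and all values are made up for illustration, and it assumes a reasonably recent pandas. It shows boolean indexing, why a plain list can be added as a column while a Series with a default index cannot, and the difference between getting a copy back and using inplace=True.

```python
import pandas as pd

# Hypothetical stand-in for the tutorial's DataFrame: row labels R00..R09
# and made-up values (the real notebook data is not in the transcript).
labels = [f"R{i:02d}" for i in range(10)]
df = pd.DataFrame(
    {
        "C01": [10, 75, 60, 42, 88, 5, 61, 30, 99, 60],
        "C02": [1.5, None, 3.0, 4.5, None, 6.0, 7.5, 9.0, 10.5, 12.0],
    },
    index=labels,
)

# Boolean indexing: the comparison yields a True/False Series with the same labels.
mask = df["C01"] > 60
above = df[mask]
# Combine conditions with | (or) and & (and); each condition needs parentheses.
not_sixty = df[(df["C01"] < 60) | (df["C01"] > 60)]
even = df[df["C01"] % 2 == 0]

# Adding columns: a plain list is assigned by position ...
values = list(range(10))
df["C03"] = values
# ... while a Series is aligned on its index. The default 0..9 index matches
# none of the R00..R09 labels, so every value becomes NaN.
df["C04"] = pd.Series(values)
all_missing = bool(df["C04"].isnull().all())
# Reusing the DataFrame's index makes the labels line up.
df["C05"] = pd.Series(values, index=df.index)
# A shorter Series fills only the labels it has; the rest become NaN.
df["C06"] = pd.Series(values[:6], index=labels[:6])
missing_in_c06 = int(df["C06"].isnull().sum())

# Removing columns: del raises KeyError for unknown names, drop can ignore them.
del df["C04"]
df = df.drop(columns=["C06", "no_such_column"], errors="ignore")

# Copy versus in place: fillna returns a new DataFrame by default.
filled = df.fillna(99.99)
nulls_before = int(df["C02"].isnull().sum())   # df itself is still unchanged
df.fillna(99.99, inplace=True)
nulls_after = int(df["C02"].isnull().sum())    # now the nulls are gone
```

Reassigning the result, as in `df = df.fillna(99.99)`, would have the same effect as the in-place call; which of the two to use is largely a matter of taste, as the speaker notes.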
Info
Channel: PyData
Views: 10,287
Rating: 4.9784946 out of 5
Keywords: python, pandas
Id: C9jU_200miw
Length: 84min 28sec (5068 seconds)
Published: Wed Jul 26 2017