Data Science Best Practices with pandas (PyCon 2019)


PyCon Cleveland 2019

https://www.youtube.com/watch?v=dPwLlJkSHLo
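
Setup (implied but not written out in these notes; a minimal sketch based on the start of the tutorial, assuming the ted.csv file from the tutorial's GitHub repo is in the working directory):

import pandas as pd
import matplotlib.pyplot as plt
# %matplotlib inline   (only needed inside a Jupyter notebook)

# each row of the dataset represents a single TED talk
ted = pd.read_csv('ted.csv')

ted.shape   # 2550 rows (talks), per the tutorial
ted.dtypes  # a mix of int64 and object columns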

Check for NAs

ted.isna().sum()
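
The sort below uses a views_per_comment column that these notes never create; from the tutorial's first exercise, it comes from normalizing comments by views and then inverting the ratio to make it easier to interpret:

ted['comments_per_view'] = ted.comments / ted.views   # tiny decimals, hard to read
ted['views_per_comment'] = ted.views / ted.comments   # lower value = more discussion per view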

Sort by column

ted.sort_values('views_per_comment').head()

Plot

Press Shift-Tab inside the parentheses (in Jupyter) to see the arguments in a help window

ted.comments.plot(kind='hist')

Drop outliers

ted[ted.comments < 1000].comments.plot(kind='hist')

See how many we lost

ted[ted.comments >= 1000].shape  # only 32 talks are excluded

query method

ted.query('comments < 1000').comments.plot(kind='hist')

loc method

ted.loc[ted.comments < 1000, 'comments'].plot(kind='hist', bins=20)

boxplot

ted.loc[ted.comments < 1000, 'comments'].plot(kind='box')

Random sample

ted.event.sample(10)

Convert film_date (a Unix timestamp) to datetime

pd.to_datetime(ted.film_date)  # guesses wrong if given a Unix timestamp
ted['film_datetime'] = pd.to_datetime(ted.film_date, unit='s')

Pull two columns to verify results

ted[['event', 'film_datetime']].sample(5)

Check dtypes again

ted.dtypes

The datetime dtype exposes a 'dt' namespace of convenient attributes

ted.film_datetime.dt.year
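
The transcript also mentions day-of-week and day-of-year; the corresponding dt attributes (names assumed from pandas, not spelled out in the notes) are:

ted.film_datetime.dt.dayofweek  # 0 = Monday
ted.film_datetime.dt.dayofyear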

String columns have a similar 'str' namespace for string methods

ted.event.str.lower()

Count values

ted.film_datetime.dt.year.value_counts()

Barplot problem: Missing years

ted.film_datetime.dt.year.value_counts().plot(kind='bar')

Lineplot with sorting issue

ted.film_datetime.dt.year.value_counts().plot()

Fix the sorting issue with sort_index()

ted.film_datetime.dt.year.value_counts().sort_index().plot()

Check the max date: the data ends in mid-2017, so the last year is incomplete

ted.film_datetime.max()
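
Since the data ends mid-2017, one of the takeaways is to consider excluding the incomplete year; a sketch of one way to do that before plotting:

# drop the partial final year so the line plot doesn't show an artificial decline
ted[ted.film_datetime.dt.year < 2017].film_datetime.dt.year.value_counts().sort_index().plot()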

Count the number of talks at each event

ted.event.value_counts().head()

Aggregating

ted.groupby('event').views.mean().sort_values().tail()

Modify to also show the number of talks at each event, using agg

Now the output is a DataFrame

ted.groupby('event').views.agg(['count', 'mean']).sort_values('mean').tail()

Sort by sum

ted.groupby('event').views.agg(['count', 'mean', 'sum']).sort_values('sum').tail()

6. Unpack ratings data

ted.ratings[0]

Unpack a stringified list of dictionary data

import ast
ast.literal_eval('[1, 2, 3]')

Make the list

ast.literal_eval(ted.ratings[0])

Make a helper

def str_to_list(ratings_str):
    return ast.literal_eval(ratings_str)

Apply the helper to the ratings series

ted.ratings.apply(str_to_list).head()

Or pass ast.literal_eval directly (no wrapper needed)

ted.ratings.apply(ast.literal_eval).head()

Save the result as a new column (a lambda also works here)

ted['ratings_list'] = ted.ratings.apply(lambda x: ast.literal_eval(x))

7. Count the total number of ratings received by each talk

def get_num_ratings(list_of_dicts):
    num = 0
    for d in list_of_dicts:
        num = num + d['count']
    return num
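
Per the transcript, the function is spot-checked on the first talk before being applied to the whole series (the ~93,000 total is the figure quoted in the video):

get_num_ratings(ted.ratings_list[0])  # roughly 93,000 ratings for the first talk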

Apply it

ted['num_ratings'] = ted.ratings_list.apply(get_num_ratings)

Describe for statistics

ted.num_ratings.describe()

8. Which occupations deliver the funniest TED talks on average?

ted.ratings_list.head()

Check whether 'Funny' appears in every talk's ratings (it does)

ted.ratings.str.contains('Funny').value_counts()

def get_funny_ratings(list_of_dicts):
    for d in list_of_dicts:
        if d['name'] == 'Funny':
            return d['count']

ted['funny_ratings'] = ted.ratings_list.apply(get_funny_ratings)

Calculate percent that were funny

ted['funny_rate'] = ted.funny_ratings / ted.num_ratings

Occupations of the 20 talks with the highest funny rate

ted.sort_values('funny_rate').speaker_occupation.tail(20)

Analyze funny rate by occupation, use groupby

ted.groupby('speaker_occupation').funny_rate.mean().sort_values().tail()

Check sample size, many are unique

ted.speaker_occupation.describe()

Focus on occupations that are well-represented

ted.speaker_occupation.value_counts()

Output of value_counts() is a series

occupation_counts = ted.speaker_occupation.value_counts()
top_occupations = occupation_counts[occupation_counts >= 5].index

Filter it down, using 'isin'

ted_top_occupations = ted[ted.speaker_occupation.isin(top_occupations)]

Do the groupby again

ted_top_occupations.groupby('speaker_occupation').funny_rate.mean().sort_values()

Captions
do you find the pandas library overwhelming have you been wondering how to write better more efficient pandas code but you just don't know how to do it I can help my name is Kevin and I'm the founder of data school you're about to watch a panda's tutorial that I taught at PyCon 2019 by the end of this video I promise that you'll be more fluent at using pandas to answer your own data science questions during the tutorial there will be 8 exercises that we'll work through together if you want to follow along at home you should download the data set from github and there's a link to that in the description below now this is an intermediate level tutorial so if you're new to pandas I recommend starting with my other video series data analysis with pandas and there's also a link to that below I hope you enjoy the video so let's get started okay um welcome everyone we're gonna go ahead and get started so thank you so much for being here my name is Kevin Markham I'll be your instructor for this tutorial this is data science best practices with pandas very happy to be here it's my privilege to be teaching you today so before I talk about the tutorial I'm just going to tell you some setup instructions if you haven't done them yet so that you can get caught up with that and then I'll introduce myself and what we're doing and then we'll get started so you'll want to go to the github repository and I snagged that sweet bitly link slash PyCon 2019 if you want to go directly there and there's some instructions here on downloading a CSV file checking that pandas and matplotlib are working and that's about it and there's a link here to the TED Talks data set you're welcome to pull that up we'll use it just a tiny bit but yeah go ahead and open your Python environment whatever environment you're most comfortable with and put that CSV file into the same directory or wherever you usually open files from and that's that if people run into problems I've got two backups backup solutions one backup is I have some flash drives with the data file on them and some flash drives with anaconda installers if your machine just isn't working and you want to install anaconda right now the other backup is that if at any time you are having problems with your machine just go to the top of the repo click launch binder and you'll be running Jupiter notebook in the cloud with the data file and with the libraries you already need installed takes a couple minutes to open but it's it's a great solution if you are having big big problems with your machine okay all right so while anyone's getting set up that's not all ready I'll just tell you about myself I live in Asheville North Carolina with my wife and my son and I used to live in Washington DC I'm the founder of data school I teach data science in Python mostly online I used to teach in person at General Assembly in DC I have a lot of YouTube videos anyone's seen one of my YouTube videos on something well awesome thanks cool well I have taught a couple tutorials I taught a machine learning one at PyCon 2016 and one on pandas at 2018 that was a lot of fun and I'm since I'm online only it's a real pleasure for me to be in the classroom these days so this is great what we're gonna be doing today is my goal is to help you become more fluent at using pandas to answer your own data science questions so I do expect you know some pandas there'd be doing a lot of exercises basically we're gonna have this single data set I'm gonna ask you a question or give you a task you're going to try 
to answer it using pandas and then we'll discuss the answer together we'll spend a good amount of time on that or I'll kind of walk through a solution we can talk about it and at the end of each section there are just some takeaways some best practices that I'm trying to communicate in terms of the pacing this is intermediate level so if you feel beginner to pandas it might go a little fast if you feel like pretty advanced it might go a little slow but I'm aiming for that middle middle ground final thoughts if you're not used to using the Jupiter notebook that's what I'm going to be using but you should use what you're most comfortable with questions are welcome at any time all the code I'm typing will be posted on github so you don't feel like you have to type everything if you're trying to like keep up that's it any questions for now before I jump into the first exercise awesome all right so what I'm gonna do so this is all live coding and I am this first exercise we're just kind of going to do it together so I would encourage you just to type along with me and we're just going to do an introduction to the to the data set and this is really just kind of a warm-up but go ahead and follow along in terms of so this is a dataset of TED talks has anyone ever seen a TED talk yes everyone who can someone just define quickly what a TED talk is talk about technology entertainment or design or that was the original idea there's short talks very popular they've been around for a long time and they have kind of a message they're trying to communicate this data set actually before I start typing there's this page on the Kaggle data sets website which by the way is a great way to find and explore data sets and this is a description of the TED Talks data set if you don't have this open it might be useful the thing that will may be useful is there's some metadata here about the data set that might be useful for you to understand so I would keep that page open but this was scraped from the ted.com website that's where this came from and yeah so I just wanted to kind of set the context for the data set but let's go ahead and just import des and sorry it's harder to type and talk than it seems so I'm using I think point two 4.2 but you can use any modern version should be just fine for today and we're just an import matplotlib lib dot pipe lot as p LT and if you're in the notebook is well matplotlib in line in line if you're not in the notebook that that's largely or completely irrelevant and then we're just going to read in the data set and we're just going to read it into a data frame called ted and in its in my local directory so i'm just gonna read right from ted dot csv and let's just take a quick look at the data set and if you if you're falling behind definitely feel free to ask your neighbor if you missed a line of code all right so each row in this data set represents a single TED talk and we're just going to check the shape just the shape attribute so there are twenty five hundred and fifty talks so those are rows and columns and then we'll just take a quick look at the D types this is just what I do when I get a new data set and want to be vaguely familiar with what's in it we'll just take a quick look at the D types it's just a blend of integer columns and object columns now object columns are usually strings but what else can an object column contain just just go ahead and raise your hand and I'll grab some one yeah Ville it for categories that will actually say category I believe there's 
something else so what else can an object type contain so other than strings other other things yep yeah so object data types although though they usually have strings they can contain arbitrary Python objects such as lists or dictionaries so that is something useful to keep in mind that they're not always strings but yeah the most important we have a bunch of int int columns and object columns and I'm just gonna write this and not explain it and then show you the result and just can anyone tell me what this code is checking for right there right so what I did is dot isn't it is na dot sum so I'm checking for missing values that's another useful thing to know when you open up a data set so is n/a which same as is null is n/a outputs a data frame of boolean values and then when you summit some sums along the 0 axis the rows axis which gives you column sums and true values are treated as ones false values are treated as zeroes so a lot happening in that line but anyway the bottom line is that speaker occupation has six missing values I'm not going to take any action at this time but this is something you should know about any data set you're working with and there's lots of different ways you might deal with missing values especially if you have a lot of them but we have very few so there's not not much to say this isn't really like a big best practice but the documentation has been migrating over to using is n/a and not in a rather than is null and not null for consistency with drop na and fill in a so that's why I've switched is na and you might I don't know if they'll eventually deprecated the is null and whatnot but that's that's what I'm doing here so anyway that's just a brief intro to this data set we're gonna jump into the first exercise which is going to be very short but anyone have any questions up till now ok so this exercise is the following question which talks provoke the most online discussion okay so that you don't actually you don't have to type but you can if you want but that's the exercise you're going to have to decide which column or columns are rel to answering this question the some of the columns are self-explanatory but you can check out the Kaggle website for descriptions I will give you about three minutes on this so that's just because it's not that deep of an exercise but if you've already got an answer maybe think a little bit beyond the surface as to what I'm trying to get at but go ahead and take three minutes try to figure out which talks provoke the most online discussion all right that has been three minutes some of you might have been like boy that was what am I gonna do for those three minutes and others are like I can barely get started okay so I'm gonna go through this one which talks provoked the most online discussion so the first thing you might think to do is to focus on the number of comments so these are the number of first level comments so the thing I would do is I would use the sort values data frame method and just pass it comments and then we're gonna look at the tail because this is sorting an ascending order so you'd want the ones with the most comments so that would be one idea okay um there are different ways to do it and you might just conclude okay I've decided that since this talk about militant atheism had the most comments in the data set that this talk provoked the most online discussion if I concluded that what would be a possible problem with that conclusion or that approach sub comments is one limitation right so like nested 
comments we're completely ignoring those because we don't have the data but what would be another possible problem right we don't there's a lot of data we do want but we don't have and that's an example how about there right how long it's been online right because that's really relevant some of these talks have been online for twenty years others have been online for a a and so it's relevant how long it's been online now the way I was thinking to correct for this bias is to normalize it by anyone want to guess views is what I'm thinking so if someone watches a talk the number of views if someone watches a talk and they do comment or they don't comment that would seem to indicate how provocative if you will a talk is so I'm gonna create a new column called comments per per view and it's going to be Ted comments divided by Ted dot views okay and I'm creating a new column I always use dot notation when I can which I'm using on the right side you have to use bracket notation when you're creating a new series in a data frame thus that's why use bracket notation I put underscores in my in my column names because I like to use dot notation so you need underscores there so we I do that and then let's go ahead and sort by comments for view instead use comments per view and we'll look at the tail again so now we actually so that Richard Dawkins talk is still up high but now we have a different talk we're defining as the most sorry I'll scroll over all the way to the right because when you add new columns to a data frame they end up all the way on the right and here's the comments per view column and all the way at the bottom we've got our number this is the greatest number in the data frame so this talk called the case for same-sex marriage is a different way you might define you know which talks provoke the most online discussion so we would interpret this number as for every view of this talk there are point zero zero two comments okay so that's the greatest number though that number is a little bit tricky to kind of put into real-world context how might I make this more interpretable than what I've got how might I make this the same information but slightly more interpretable views per comment that is a great idea that is exactly what I was thinking is we can just reverse the reverse the reverse the calculation invert the calculation and I'm gonna change this to views per comment and views per comment and then it'll be Ted dot views maybe the copy/paste wasn't the best idea by ted comments sorry I should have just typed it that was silly views per comments all right looks good and then we'll do Ted dot sort values use per comment use per comment and then this time we'll look at the head because we want the lowest number right it's the opposite so now I will just scroll over all the way over here and views for comment it says 450 which means that it takes 450 views to generate a comment or one out of every four hundred and fifty people who watches the talk comments on it which to me is a lot more in e which is a lot easier to interpret than comments per view with that that small decimal number so on the bottom line for this exercise is this is how I would probably measure with the data we have which talks provoke the most online discussion but it would be great to have other data all right I'm going to show you I just show you on screen a couple lessons I took away from this but any questions about this exercise before I do that okay so I'll just show you the lessons the the lessons I I was trying 
to get across from this exercise and these are lessons or best practice some of these will be data science lessons some of them will be pandas lessons but number one consider the limitations and biases of your data when analyzing it so the limitation as someone mentioned over here is we only have the first level comments whereas actually having those nested comments might actually be even more useful for which talks are most provocative because people are saying things and then other people are replying to them but we can't really correct for that without gathering the data ourselves and then the bias that we tried to correct for was that older talks have had a lot more time to accumulate comments so we tried to correct for this by normalizing by views so that was the first lesson I was trying to get across the second lesson is just to make your results understandable regardless of your audience regardless of whether it's just for yourself and that's why we used views per comment rather than comments per view so those are just a couple lessons I'm going to move on to the next exercise unless there are any final questions about this one okay all right so the next exercise is visualize the distribution of comments okay and we're going to talk about this one for just a few minutes before I let you loose on this one so let's just make a plot real quick with pandas so we're gonna do TED comments dot plot alright and here's what we get what kind of plot is this a line plot is that is correct so what do the axes represent what is the X represent and what is the y represent X is the the index and what is the y represent the value the value of the comments field right so it's basically plotted out two thousand five hundred and fifty points and connecting them my line that's what a line plot is a line plot is not what you would use here because a line plot is for measuring something over time so if there's not a time component to it I mean what you're showing a line plot is usually not the right choice this is I just want you to know how to plot using pandas if you haven't done it before we can't learn some things from this plot we can see that many talks have a low number of comments and a small number of talks have lots of comments but that's about it so this is not the plot I want you to create but you might know and and in Jupiter notebook you you put your cursor in the in the parenthesis and shift tab and I'm gonna do it four times to pull up the signature essentially the doc string in a lower pane and you'll see and it'll be different depending on upon which editor you use but you'll see that there's this kind argument okay the default is line but if you scroll down you'll see that there are also other options like bar bar h hist box KDE density area and pie okay so there are many different plot types supported by pandas so the exercise is really again I want you to visualize the distribution of comments so the first step is I want you to figure out what plot to do that and if you don't know the terminology you're looking for a plot that summarizes how often particular values are occurring this is called the frequency distribution and when you've made this it should be easy to see are there a lot of low values are there a lot of high values okay so that's step one is pick a plot type and step two is once you've picked a plot I want you to modify the defaults to make it as informative as possible so let's pretend you've got one shot to pick the perfect visualization to visualize the 
distribution of comments I want you to make it as informative as possible you don't need to add like labels but I want you to modify how it looks so that it's as informative as possible okay I am we gonna give you about four or five minutes on this and then we'll be back up in a couple minutes to talk about it all right we're gonna go ahead and go through this one what plot type did someone choose so tell me what plot type you used histogram okay so that's kind of when you talk about a frequency distribution that's kind of a natural thought of usually so a histogram if you're not familiar it shows the frequency distribution of a single numeric variable and the way it works is it creates ten equally sized bins by default between the minimum value and the maximum value divides those into ten bins and then it counts how many observations appear in each bin okay so this isn't kind of the endpoint but what have we learned so far by looking at this visualization well is that right so are there a lot with no comments this person is saying no and why are you saying no right right right okay so there's kind of two thoughts I want to get out one is adjusting the number of bins which we'll get to in a minute for that many of you may have thought about already and if you haven't are not familiar you'll see that the other thought which is just to make sure we understand this this is a bin that starts at 0 and ends at about 600 and something ok so this bar represents a lot of talks with between 0 and 600 comments it's not a lot of talks with 0 comments so it's not what's at the left you have to think about the bin itself which is 0 to 600 and something and we only know that because the max number is 6000 and something so when you divide it by 10 you get 600 and something okay so that's the first thing we've learned about the distribution is that most talks have a lot fewer than a thousand or 600 comments okay now what does this not tell us that we might want to know right so how are the what is the distribution look like within that first bar or within that subset so we've got the reason we have this kind of weird-looking histogram well weird in the sense that you've got a lot of what looks like empty space is because there's a talk with six thousand and something comments but that doesn't really tell you anything and it's artificially spread out this histogram so the vast majority of our data is at the left side and so my thought is that we we don't know whether most talks have five comments or 500 comments we just know that most talks have less than 600 comments so I thought it would be nice to modify this to be more informative by filtering it down so we're actually going to filter just like this Ted does comments is less than a thousand okay all right so we're just filtering out everything above a thousand okay so all I did is filter the actual data frame before selecting the comments series and we've actually only lost a little bit of data and I'll just show you in a cell above here so if we do Ted dot comments comments is greater is greater or equal to a thousand and we do out dot shape we can see that when we filter out the ones with a thousand or greater we're only losing 32 stalks out of 2,500 okay so this is my way of getting at a little like this to me is a much more informative plot because it tells me about most of the data it doesn't focus on the outliers okay all right so before I move on and show you a few more things I do want to mention that there are different ways you could write 
this code that would produce an identical plot so did anyone use the query method did anyone use query so it's a different way of writing of filtering your data frame so you pass it actually pass it a string and that plot kind equals hist I think I got that right so this does exactly the same thing using the query method some people like using query and then a third way you can write this and it's the way I like and sorry to move things off screen but that's kind of the nature of this is I use loke a lot instead because I think it's it is the most flexible and the most well it's to me it's the most readable way it's because it's very explicit about which rows do I want in which columns do I want so these are this is not really about plotting this is just showing you a couple different ways to do the same thing so this is with loke so with Lok I select which rows I want which columns I want and then plot it as a histogram okay yeah please right yes so Lok is great because you can select multiple columns you can you can pass it a single column you can pass it a list of columns you can pass it a range of columns literally column a colon column C and it will select a B and C or a boolean series and I that'll at least work for the rows and I I'm actually not sure about that for the columns but anyway yes there are different ways you can select with loke which is quite nice okay so the the final thing or well there's two things left I wanted to show you about this is what someone over here has already suggested which is maybe we should increase the number of Bin's so we can see a little more fine-grain detail and so instead of sticking with the default of of 10 bins we're going to change it to 20 and now we get a little bit more fine-grained picture and what we've learned so now we have 20 bins each represent like the first bin is 0 to 50 comments the next 50 to 100 comments etc etc now we've learned that the most popular at least in this context is 50 to 100 comments that's more common in this data set than 0 to 50 yep right right right right yes so briefly pandas is always calling matplotlib to draw its plots there are different there are many different ways to call pandas plots there are the generic like dot plot kind equals with a lot of options for keyword arguments and those keyword arguments ultimately get passed I think directly onto matplotlib but if you want to know what all those are you unfortunately have to go to the matplotlib documentation the flip side is you can use things like ted.com ents dot hist which allow things like grouped histograms but the downside isn't is most plots don't have a you know dot hist I mean there is dot hist there is not box plot but there's not I don't think like dot KDE and there's so many types of plots that there's not a specifically named one that I like to teach the dot plot generically because it has most of it but the downside being you have all these kind of hidden keyword arguments that you don't know what you can put in it and you end up at the matplotlib documentation like kind of guessing like okay I think this one will work here and then you try it and it usually does work so that answer your question okay great the final thing you might think about using here is did anyone try a box plot so some box plots yep so that's another good way to do distributions and it's kind of box and unfortunately it doesn't it's not quite as informative because of what are all the black dots outliers so anything that it defines as an outlier just gets a black 
dot there's a ton of things that it it shows you so you can think of this kind of like a in a sense it's like a histogram that's been turned vertically but it's just showing you your your quartiles and then but it's got so many outliers that it's not the most informative plot so I don't like the box plot in this case but sometimes the box plot is kind of what you're looking for all right the lessons that I take away from this from this exercise number one is you choose your plot type based on the question you're answering and the data types you're working with so histograms are good for distributions bar plots are for comparing categories line plots are good for time series data scatter plots are for comparing multiple numeric variables so yeah you have to choose based upon the type of data you're working with and pandas actually has a lot more plot types than are represented in that kind so if you're really interested I would encourage you to read the visualization page of the pandas documentation because it has tons more plots and you call them unfortunately in an array of different ways but there are actually a ton of support ones so that was my first lesson my second kind of takeaway is I like to use pandas one-liners to iterate through plots quickly as I said pandas is calling matplotlib under the hood but it's generally faster and easier to write pandas plots than matplotlib plots the downside being that this is this advice is mostly relevant for exploratory plots because um there are limits to how much you can customize it without using matplotlib directly okay the third lesson I wanted to impart is just to try modifying the plot defaults because it's rare that the default settings are going to give you exactly the plot that you are looking for the most informative plot and the final takeaway especially for anyone who is uncomfortable with the idea of like let's just cut out these 32 observations which is what I did creating plots involves decision making in in fact a lot of decision-making because you can't show everything on a visualization a visualization is inherently a summary so you have to accept that you're not going to show something there's no single correct way to visualize a certain piece of data so there's no right or wrong visualization but there are certainly misleading visualizations and it's very easy to create a misleading visualization even accidentally so your decision should be guided around you know how do I accurately answer this question and not mislead folks as to what's going on in the data okay so those are the the takeaways there I'm gonna jump into the next exercise and this one is going to be as I mentioned it's going to be longer and hopefully you will like kind of if if you found those to be easy then hopefully you'll struggle some a little bit at least so this is number four plot the number of talks that took place each year okay number four plot the number of talks that took place each year and I choose my words very carefully so look back at that if you're unclear I'm not gonna give you any advice this time I do have a bonus for anyone who finishes early I will just raise your hand and grab me I am gonna give you about 12 minutes for this exercise okay and again always grab me while I'm walking around with your questions alright this is I think I'm gonna go ahead and cut you off on this one because there are some sticking points that a lot of people are running into and I don't know if you'll get it even if I give you like a really long time so I 
don't want to just spend all day no and that's not I mean this is hard like this is hard if you you have to know I mean I have to guess what you might know and that's what I'm using when I create these exercises so if you don't know certain key things this would be very very hard so anyway with that being said let's go through it okay so what is wrong first with using the event column did anyone try the event column and then they got stuck why did did it work or did it not work did that always work and the answer's no but can someone tell me why the answers now some of them don't have the year so this is if you're not familiar with like the random sample method that's useful so you know I just want ten randomly selected events and you'll see that some of them end with a number that obviously represents the year and some of them don't so if you go to that column you are gonna get stuck you're gonna have at least a lot of missing values because you're not gonna have the year so what I was hoping you would do is go over to the TEDTalks dataset website and you would scroll down and you would see film date and you'd see it's the UNIX timestamp of the filming and that sounds like well that's when it took place so that sounds like the right column to focus on so let's take a look at it film date ahead okay and we've got our UNIX timestamp okay so I look at that and I think well I don't really know much about UNIX timestamps or what to do with them but my you know PD to date time which was a key function here if you did not get to PD to date time that was like a big something you needed to make this easy so PD to date/time converts a series to the date-time format and it does a lot of guessing as to what format it is and produces pretty good results now unfortunately these don't look like good results which is disappointing because usually you stick in whatever and out comes the real dates and you're like wow that's magical but in this case it did not work okay so here's what I was hoping you would do and I know this is a stretch because you'd have to happen to follow this same sequence but if it was me and you know what to date time is you would google pandas to date I'm due to date time and you might go looking to stackoverflow and you might stumble upon a good a good answer but in this case I would get here because I know I want to use this function I would search for the word eunuchs whoops eunuchs and I would get down at the bottom I would find this which is an example of using it to convert something that looks like our time stamps to something that looks like what we want and all they did was say unit equals s and magically they got the right thing so my thought was well let's just add unit equals s and then you get these dates that look like hey that looks like the right date so if you got if you didn't know about to date I'm getting to this stuff would be very hard but if you did then what I'm hoping you did is created a new column because it's best not to over at your existing column unless you're sure you don't need it so I'm gonna create a new column called film date time okay with this data and then I you know they looked like the right dates but I would want to verify that it works as we had hoped so what I'm actually going to do is just pull out two columns event and film date time so I'm pulling out the two columns that one is the one I just created and the other is one that has a name that often includes the year and I'm just going to randomly sample and when we do that you're 
gonna have a different random sample but you're gonna look at it and you're gonna say okay the the 2012 event has a 2012 film date time same thing with the 2007 and the two in the 2015 and so you think okay that probably did work so we don't want to assume it worked just because some Rose looked like they were correct it's good to use a random sample to kind of verify your results okay now the thing it's done the we're gonna check the D types again if the entire data frame and the thing it's done it's created the new column called film date/time and it's using the date/time datatype so that's a special pandas data type that is automatically output when you use PD to daytime okay and that's useful for a lot of reasons and I'll show you one reason right now which is Ted if I select film date/time then I have this little DT namespace and I can select convenient attributes like year okay I can select day of week I think is one I can select day of year and all these convenient attributes are available under this little dot DT namespace because it's the date/time datatype okay so this is the easiest way to get the year there are there are other solutions to this exercise but this is a good one so if you're like wait I that looks kind of familiar um this is the same as like string methods have this little stir namespace like stir dot lower so if you want to do string methods on a series that's a string you do column dot stir dot string method name and it will do it in the same way with date/time datatypes you have this little DT namespace with a variety of convenient attributes you can pull out okay um so I've got this year data how so I've got everything I need in this column year which is what we were going after how do we count the number of talks per year from where we are here value counts whoever said that so yeah the value counts method is good anytime you want to count values okay now we've got our value counts and now that we've got the data that we're looking for which is the number of talks that took place each year what kind of plot do we probably want to use bar plot is one suggestion anyone want to offer an alternative someone said histogram someone said line plot okay so a bar plot is okay but there are two think well let's just okay let's do it and then we'll talk about it for a second and whoops obviously I need to actually dot plot sorry my accounts dot plot okay so we get this and yes we can fix up the sorting but the thing there are kind of the main issue here with using a bar plot other than sorting which we can fix is that it's excluding all the years that did not get that are not in the data so there was no talk in 1973 but there was in 1972 but a plot that misses those time spans is a little bit misleading because you can't see the jumps now bar plots are best for categorical data and technically you could consider years to be categories but they're not really or it's not optimal and a line plot is better because line plots are for things that are plotted over time okay so a line plot is better in this case so what I'm gonna do the line plot is the default so I'm just going to run this and I want you to tell me what went wrong here so this is not the answer but what went wrong someone right okay so it is a sorting issue and let's look at the data okay let's look at the data if I take this and I give it to dot plot which produces a line plot pandas will plot them in the order in which they appear so it will first plot 2013 270 then it will plot 2011 270 and connected then 
it'll plot 2010 267 and if you actually follow this you'll see that that is why it produces this plot so it it plots things in the order you give it to them okay now if we want to fix this we're just going to do sort index okay so we're gonna throw in sort index and actually let me show you that before the plot sort index sorts the index which is the thing on the left not the values which is the thing on the right okay so you want to sort the index because the index is going to become the x-axis and the values are going to become the y-axis but because it's connecting the points you need those points to be in order so when we do that we do we get the plot that I was actually looking for so this is a one way to go about getting this particular plot and there are others but we could talk about that all day I do have one question though is the number of talks on a sharp decline you say no and why do you say no right so there's some incomplete data so you don't want to draw a conclusion that there is a big drop just because there's a drop in in the plot because it turns out that if you use film date time dot max so you can use things like max min with your date/time datatypes and you see that this data ends in the middle of 2017 which is why we have incomplete data for this particular year okay other questions about this one okay so a couple lessons and then let's see okay a couple lessons I wanted to go through as to some takeaways and first takeaway is read the documentation so in this case the documentation for the data set as well as the pandas documentation would both be useful to you so it's it's a good resource for to use the second lesson is to use the date/time data type for dates and times so number one it works well with plots number two it provides convenient attributes and pandas actually has extensive extensive functionality for working with time series data but that functionality depends upon you using these specialized data types so the takeaway here is if you have date and time data it will serve you well to use the date/time data type lesson number three is check your work as you go so hopefully you know that I use the random sample method in order to spot-check that certain conversions are working properly rather than just looking at like the head or the tail I like to use a random sample and then the fourth point here is to consider excluding data if it might not be relevant so as we saw we only have partial data for 2017 you might want to consider excluding that it depends on what you're trying to show it turns out that some of the talks from the Ted website which is this is it was scraped from the Ted website some of the talks on the tabbed website actually were not TED talks they're just like talks that they uploaded because they thought they were really Ted like talks Ted the TED conference was actually founded in 1984 not in 1972 so you might want to consider like if someone asked you this question for work for some reason it might be a good idea to research oh are all the talks in the data set actually TED talks and do I want to include data that is non TED Talks on the Ted website and that would depend on what question you're trying to answer so yeah that's that's really all I want to say about that is sometimes the data you have you should exclude certain things depending upon what you are trying to get across okay all right yeah please yeah oh he's asking if you can do conversion through using like as type in the same way that you have a column of strings that says 
like it's a string of the number three you can just do like as type int and it will become an integer column I think it would be a good thing if there was like as type date/time but as far as I know that does not work there's no it like as type does not support that particular conversion so you have to use this pd-2 date/time which is this top level level method it's not like a series method or a data frame method it's just this top level function actually other questions okay so I'm gonna start you on the next exercise and let me go ahead and type it out and let you get started okay this is number five five what were the best events in Ted history to attend okay what were the best events in Ted history to attend and I'm not asking about the best individual talks I'm asking about the best event that event column that's like Ted 2006 Ted whatever so what was the best event in Ted history to attend and it's your job to figure out how to define bass and then use that definition to calculate which event was best okay so I'm gonna give you at least eight minutes but I'm more likely give you until the break at three o'clock and then we'll discuss it after but yeah go ahead and get started all right 3:20 we are back cool hope everyone had a good break okay so what were the best events in Ted history to attend before I start going through this just want to hear what a couple folks had as ideas so if you want to just raise your hand and share the idea you had yep count the jaw droppings in the ratings okay all right okay now so for each talk you're you're calculating the percentage of positive views based upon those ratings and then are you taking that how are you using that to judge events as a whole okay so she aggregated by event at that point okay got it got it got it so he was didn't come to a conclusion but noted that some events did not have many talks and that seemed like it should be factored in yep the minimal number of views to get a positive comment but you won't know oh okay okay I halfway understand that but I won't dig too far into that okay lots of great ideas I really like this you may be disappointed that I came up with something simpler than most of these but have built upon some of these ideas which is fine and we're actually going to get into the ratings data because I think it's interesting we're sorry we're going to get into the ratings day in the next exercise explicitly because I want to show you some stuff with that but we're actually not going to touch ratings for this particular one because I stayed away from it okay so thank you for sharing all those ideas but um so one idea that no one mentioned but I thought was worth considering is just like hey why don't we just count the number of talks because this is not my complete answer but um you know a good event has a lot of talks and if you have a lot of talks you have some variety so if all you care about is variety of talks then this is not an unreasonable way to value to go about thinking about this question but then I thought well what's we want to know about talk quality and the talk quality proxy measurement I thought would be views because someone viewing it online is kind of like a vote for the talk doesn't mean the talk is great but it's there's something to that because we want to know about talk quality we don't have data I mean we have the rating state at which you can use it gets a lot more complicated but I thought views would be an interesting way to start so I did a group by so I thought because ultimately 
we need to aggregate based upon event okay so I used to group by and then we'll talk about group I in a second Ted dot group by event dot views dot mean dot head so I did a group by so the template for using a group by is for each event I want to to do some aggregation function on some column so I have a category that goes in your group by for each category I want to do some aggregation function in which in this case is mean over some particular column which in this case is views so this is the average or the the mean number of views for each of vent okay now let us go ahead and sort this because this is just sorted in alpha order so let's do sort values dot tail and we see that tedx Puget Sound had on average thirty four million views per talk so that must have been one amazing event because they averaged thirty four million views but as this gentleman over here mentioned some of these talks some sorry some of these events only have a few talks in them which would seem to be relevant so if I'm looking at this data and I say well TEDx Puget Sound may have been a great event but I want to know how many talks were at each of those events what would I do how would I modify this to see that data anyone know offhand that would not you can group by two things but you wouldn't get what you're looking for there's a different approach or a slight tweak on that that will give you what we're looking for here one more time so so da AG so what we're gonna do is you may not know that so this is mean as an aggregation function you may not know that you can do multiple aggregation functions here so you actually just do dot AG okay dot AG which is short for aggregate and then you pass it you actually pass AG a list of functions so we want count and mean so I know the notation is a little tricky here but we're grouping my event we're selecting views and then we want to say we want to aggregate separately on count and mean all right so we're passing it a list of the functions we want to aggregate on and then we want to actually let's just do that to star whoops what did I do survived okay so now that this is a data frame I need to specify what I'm sorting on so we're gonna still sort on mean and we still get that same mean column we got previously previously the output was a series now the output was a data frame we have count and mean columns and we can see that yes indeed tedx Puget Sound had thirty four million views per talk on average but there was only one talk so that would not necessarily be the best event in history it was probably quite a good talk and I doubt that was actually the only talk maybe that was the only talk they uploaded who knows we don't know how they decide exactly what to put on the website so this seemed like a good direction but didn't quite get what we were looking for which is tedx Puget Sound was probably not that the best head event in history so my new idea and again this is just one way to go about things was to add in another aggregation function namely some and my thought here was well how about we look at what events have had the most total number of views and sort by sum and this is what I ended up with okay so Ted 2013 across all of their talks had a hundred and seventy seven million online views so perhaps that was that's some proxy for quality of talk in that lots of people after the fact have been watching the talks now there are some definite weaknesses with this approach namely like we've discussed previously views are biased in favor of older talks that have had 
a lot of time to accumulate views and of course talks with lots of views are not necessarily good talks maybe they're bad talks and people are sharing them just to say how bad this is more likely they're probably good talks but we don't know that which the ratings data might be able to help us out with okay um I'm gonna just talk briefly my couple lessons to take away from this but any questions about this exercise before I do that okay so the lessons that I would take away here think creatively for how you can use the data you have to answer your question because you're rarely going to have the perfect data set to answer a particular question but when you do that always acknowledge the weaknesses of your approach so explicitly say well I summed based on views and here's the weakness of that approach but it's the best approach I could come up with and if you can't make it work with the data you do have sometimes you should just not answer the question or go gather additional data okay and then the second point is watching out for small sample sizes and as I like to say use count with mean to look for meaningless means so if you find yourself doing a lot of group by something select something mean make sure that you actually have a lot of data you were taking the mean of because it doesn't tell you but if you throw it in that if you throw in the AG it'll become a data frame and you have multiple columns you can look at okay here the sample sizes I'm working from when I do this okay all right um let's go on to the next exercise and actually this exercise which I'm calling sorry unpack the ratings data so this is actually going to be a fully guided exercise I just want you to follow along and I will kind of lead the way here so let's take a look at ratings which a number of people mentioned in the last exercise Ted ratings dot head so ratings there used to be a way on the Ted website to tag talks for visitors to the site to tag talks as funny or ingenious or boring or inspiring or any number of things now that it functionality doesn't exist anymore but anyway this was the dataset was scraped while it was still available so we've got this ratings data and let's take a look at that first line okay so I want to look at this first line to understand what's in it so we can figure out what to do about it if we want to access it so one way to do this is to use a Lok use the locus Esser and I say what row do I want I want row 0 what column do I want I want ratings okay so that's one way you can do things the other way I'm gonna do it which is maybe a little more familiar is just to select the rating series and then select element 0 in the series works just as well so that's what I'm gonna use but you can do either way okay so here are the ratings of the first talk in the data set and what is the data type of this Oh some different answers we've got string and array any other list of dictionaries okay well let's find out and string it is alright let's look at it again this is a string if I'd is the technical term a stringify list of dictionaries that is not a list of dictionaries that is a string representation of a list of dictionaries so can anyone tell me the absolute simplest way to unpack this data eval is there's there may I know of one really easy way there might be multiple but I heard eval which is very close to what I'm gonna do Oh replace the single quote with double quotes and load is JSON oh maybe that works um I have not tried it but that's an interesting strategy I will tell you a 
generic strategy which is a s/he it's a module that stands for abstract syntax tree and it has a function called literal eval okay literal eval and I want to show you an example and then we'll actually do it so if I pass literal eval a string that looks sorry keeps one there we go I passed it a string that looks like a list it returns a list okay I passed it a string that looks like a list and it returns an actual list okay so that's our magic function um literal eval allows you to evaluate a string containing a Python literal or container okay so if you have a string a fide integer you can use it if you have a string a fide list you can use it and that's the magic of it so that's what we're gonna use and what I'm gonna do is try st literal eval Ted dot ratings zero and little well it looks much better now this is not a string this is now a list okay this is an actual list it turned our stringify list into lists so now we actually want to apply this to the entire series because this is the list of dictionaries of ratings for a single talk but we want to do this for all talks okay so the spoiler is we're gonna use apply and I'm gonna define a custom function and then I'm gonna show you a different way using lambda but I if none of those are familiar to you that's totally fine so what I'm gonna do is define a custom function and I'm gonna call it stir to list and and I'm gonna pass it something called ratings stir so the goal of this function is to take in a ratings string which is what it starts as and then convert it to the list which is what it represents so all we're doing in the function is return a st dot literal eval literal eval ratings stir okay so there is my function now I always like to test my function and stir two lists Ted dot ratings ratings zero okay and that looks like it worked okay so I've defined this custom function I believe it works I want to apply it to every element in the ratings series so what you do is use the apply method so I'm gonna say Ted dot ratings dot apply and pass it stir to list okay and let's just look at the head of that and it looks pretty much the same as before except those are actual lists instead of strings okay oh sorry please that would not actually that might work let's actually try oh look at that you're right that is a superior method I will thank you I will finish showing you what I was going to show you because at the very least for anyone who's not familiar with using apply and lambda I do want to you to understand how lambda relates to custom functions but in this case my lambda doesn't do anything special other than a function pass a function with no arguments so in this case he's absolutely right I could have just passed this um good point thank you all right let's see so what I'm going to do is store the results of this and I'm gonna change back to my function just fun but in your you don't have to I'm just very attached to my function that I've made so we are gonna create a new series called actually let me show you the lambda version and then we will save it okay so anytime you have a really simple custom function that you're passing to apply you can use something like lambda and it looks like this X and this is the general format of the lambda function lambda X X just kind of becomes every element in the series and then you say for each element in the series do the following thing to it okay and again it looks like it worked so finally we're actually going to just save this as a new column now I will pause and just say I do want 
Now I will pause and say that I do want you to create this column, so if you've just been watching up to now, go ahead and type in this code. As he pointed out, you don't actually have to use the lambda, but you can. I do want you to create this column because it will be needed for the next exercise. I'm calling my new column ratings_list, and we're going to check that it worked: ted.ratings_list[0]. The goal here is to take our ratings, which was a Series of strings, and turn it into a Series of lists. As we talked about at the very beginning, you can actually put arbitrary Python objects in a Series, so I want to prove to you that this is a Series of lists, and that is an actual Python list. And if we go back to the dtypes, you will see that ratings_list is an object column, but it is not a string column; it is a column that contains lists. Any questions? Yes, in the back. Right, so map seems to have a subset of apply's functionality. I don't know if there are things you can do with map that you can't do with apply; there might be. In my head I've just decided I'm going to use apply when I want to apply a function to a Series, and I'm going to use map to essentially do a dictionary mapping of a to b, you know, 'a' becomes 1, 'b' becomes 2, whatever. That's how I use them, but it is confusing that they have very overlapping functionality.

That actually leads right into one of the lessons, so let me jump over to that, and I'll start with number two first. The lesson is: use apply any time it's necessary. apply is often discouraged because it's slow compared to built-in pandas operations, but sometimes you don't care about performance, so why not use something that's working? I would say use it last rather than first: don't use apply if there's a built-in function that can do the same thing, because built-in pandas functions are going to be faster, more reliable, and probably better designed, documented, and tested. The confusing thing about apply, which is my comment on her last question, is that there's the Series apply method, there's a Series map method which has some of the same functionality, there's a DataFrame apply method which is different, and there's a DataFrame applymap method (one word) which is also different. I've got a video about this if you really want to understand them, but the bottom line is apply gets confusing, so be careful, and apply gets a bad rap for performance, but that's okay if you don't care about performance. That was lesson 2. Lesson 1 is just paying attention to data types, because data types impact every aspect of pandas functionality. You have to know what data types you're working with at all times; otherwise you're going to use the wrong functions or miss out on functions you could use.

All right, let's jump into the next one. This one is back to on-your-own: number seven, count the total number of ratings received by each talk, and I want you to store that in a new DataFrame column named num_ratings. So count the total number of ratings received by each talk and put it in a new column called num_ratings. Your hint is: use what we just learned. If you finish early, let me know and I will give you a bonus. I'm going to give you eight to ten minutes and then we'll go through it.
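A short sketch of saving and checking the new column, plus a toy example (hypothetical data, not the TED dataset) of how apply and map are being distinguished here:

# save the unpacked ratings and verify the result
ted['ratings_list'] = ted.ratings.apply(str_to_list)
type(ted.ratings_list[0])     # an actual Python list
ted.dtypes                    # ratings_list is an object column holding lists

# toy illustration of the apply/map split described above (hypothetical data)
s = pd.Series(['a', 'b', 'a'])
s.map({'a': 1, 'b': 2})       # map: dictionary-style value mapping
s.apply(len)                  # apply: run a function on every element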
All right, we're going to go ahead and go through this one; I hope you enjoyed that. Let's start by talking about what we wanted to accomplish. We have ted.ratings_list, and we've got element 0 in ratings_list, so this is our first record, and the goal we're trying to accomplish is to sum the counts. If you were trying to do something else, you went off in a different direction, but that's what we were trying to do: sum the counts. I'm going to build a custom function to do this and then use apply, and I'm going to build the function in pieces, because that's how I build functions. Many of you might look at this and just write it in one shot because it's instantly obvious what to do; it's not as obvious to me, so I'm going to build it in pieces, especially for anyone who's a bit newer to programming or Python and wants to think through how to write a function. What I tend to do is define a minimal function that doesn't yet do what I want but gets me closer, and then build it up.

I'm going to define a function called get_num_ratings and pass it what I'll call list_of_dicts, and all it's going to do to start is return list_of_dicts[0]. That obviously does not accomplish our goal, but I want to make sure I understand the data structures, so when I pass it ted.ratings_list[0], I want to make sure I get out a dictionary. I'm going into the data structure and I successfully pulled out the first dictionary in the list, so it's getting me closer to my goal. My next step is to pull the count out of that dictionary, so I revise the function, rerun it, and now I'm pulling out that number. I'm close, in that I've pulled out one of the numbers I want, but what I really want is to iterate through the list of dictionaries, get all of those numbers, and sum them. So I'm going to revise this function to give us what we actually want, and if you didn't get this on your own, I'd advise you to copy it, because it might be useful for the next exercise. Here's how I did it: I started a counter called num, then iterated through our list of dictionaries with for d in list_of_dicts, saying num = num + d['count'], and at the end we return num. Those temporary steps were just to make sure that I understood the fundamental data structures, so that when I dive into them I'm getting the right thing. Oh, thank you.

All right, I think I've got that, and now we'll run it, and we get a number that looks reasonable based on the data we saw: ninety-three thousand votes. What I advise doing, if it's not too hard, is spot checking that it worked by adding up the round numbers in your head and seeing if you get about that value. Another way to verify it's working would be to pass in a couple of other records and see if the results look good. But I believe this is a good function, so because of apply, all we have to do is ted.ratings_list.apply(get_num_ratings). Now that I've got a working function, I can apply it to the Series, and I get numbers that seem reasonable: they're not all the same number (which would indicate a bug in your code), there's not a bunch of missing values, there are no zeros, and these values make sense.
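Here is the build-it-in-pieces progression as a sketch, assuming ratings_list already exists on ted:

# step 1: return the first dictionary, just to confirm the structure
def get_num_ratings(list_of_dicts):
    return list_of_dicts[0]

get_num_ratings(ted.ratings_list[0])     # a single dict

# step 2: pull one count out of that dictionary
def get_num_ratings(list_of_dicts):
    return list_of_dicts[0]['count']

# final version: iterate through the list and sum all the counts
def get_num_ratings(list_of_dicts):
    num = 0
    for d in list_of_dicts:
        num = num + d['count']
    return num

get_num_ratings(ted.ratings_list[0])            # roughly 93,000 for the first talk
ted.ratings_list.apply(get_num_ratings).head()  # apply it to every talk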
Because of that, I'm going to go ahead and store these in a new column called num_ratings. If you did not do this on your own, I would advise typing in this line as well so that you create that new Series. So again, here's the function, we applied it, and we saved it in a new column called num_ratings. Just for fun, because I like to do lots of spot checks, one thing I might do is ted.num_ratings.describe(). That allows us to check: 2,550, that's the right number of values; a minimum of 68 sounds reasonable; a maximum of about 93,000, well, we saw that one, so we know it's reasonable; we're not seeing any negative values; we're not seeing any zeros. We're always just looking to see whether we did it right. I can't prove it, since I'm not going to manually do the math for every row in the DataFrame, but I can at least do things to check that my results make sense. That's the end of this exercise; I'll go into the lessons in a moment. Did anyone do this exercise with a lambda? I wasn't sure if it was possible. You did, and it worked? Is it complicated? Right, and you can pass in... is this what you mean? Oh, interesting. Oh, nice. Okay.

One teaching point: I was about to do this, and who can tell me why I can't do .count.sum here? Because count conflicts. I use dot notation all the time, but I also understand the limitations, one of which is that if a column name in a DataFrame conflicts with a built-in method, you cannot use dot notation, which is why some people don't teach it, but I just like it so much. In those cases you have to use bracket notation, and I just don't like bracket notation, that's the bottom line, and there you go. Oh, great, very cool, I had not thought of that. You can? Oh, wow, cool, well, thank you, I appreciate that, and I do still want to see the other version just for my own knowledge, but that's awesome, I love it.

We've got one exercise left and it's pretty substantial, so I'm going to move us along, but just to briefly go through the lessons, and really these are more general principles of mine: write code in small chunks and check your work as you go, because that's how I write things; check whether results are reasonable, and it's easier to detect when bugs are entering your code if you're writing in small chunks; and to me, lambdas are best for simple functions, though it sounds like it can still be pretty readable using a lambda, and certainly the one he showed was pretty readable. Anyway, those are my lessons for that one, but I really want to dive into the last exercise because it's a bit more substantial and a bit more creative. It's number eight: which occupations deliver the funniest TED Talks on average? Now this is, as I mentioned, the final exercise. It's a multi-step process; there's not one correct way to do it, but I will show you a way. I would definitely encourage you to check your work as you go, and as before, watch out for small sample sizes. I'm going to give you 15 minutes, and I'll type up some bonuses on the screen just in case you fly through it, so go ahead and get started.

All right, I'm sure there are some folks who would like more time, but we're going to go ahead and go through this; the clock is ticking down, so there's no way around it. This, as I mentioned, is a multi-step process, so I'll walk you through the steps as I thought of them, and we'll talk about it as we go.
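A sketch of storing and sanity-checking the column, plus a hypothetical DataFrame (not TED data) illustrating why dot notation fails when a column name shadows a method:

ted['num_ratings'] = ted.ratings_list.apply(get_num_ratings)
ted.num_ratings.describe()    # 2,550 values, min 68, max around 93,000

df = pd.DataFrame({'count': [1, 2, 3]})   # hypothetical example
df.count        # dot notation reaches the built-in count method, not the column
df['count']     # bracket notation always reaches the column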
For me, step one is to count the number of funny ratings. I assume you thought, well, I've got this ratings column and it's got this Funny rating in there, and one thing to check, if you look at ratings_list and check out the head, is that I hope you looked past the very first dictionary and didn't just assume that Funny was always the first dictionary in the list, because it's not always the first. I hope you didn't assume it and actually looked for it, because sometimes it's not the first. Then one question you might have before you write a function to do this is: I wonder if Funny is always there, which is a good thing to check, and to me the fastest way to check it is just to use the string contains method. I'm going to look at ratings, which is a string, not ratings_list, which is a Series of lists, and do .str.contains('Funny'). This outputs Trues and Falses, but if we do a value_counts, what we're looking for is whether there are any Falses. This would seem to indicate that Funny is always going to be in our list of dictionaries. If it wasn't, then we might need to account for that: we don't know whether something with zero Funny ratings says Funny 0 or whether the entry just doesn't exist, and that's what you're looking for. It seems that it does always exist, so we don't have to check for that condition.

So here's the function I wrote, which I now realize you could definitely do in comprehension form, but I'm going to write it in the form I like. I'll just write the whole thing out for time's sake: get_funny_ratings takes list_of_dicts, and what you're doing is iterating, for d in list_of_dicts (whoops, forgot my underscores again), and if the name is equal to 'Funny', then we return that count. If I wrote all that right, this is the function I used to get that data out of there. Let's go ahead and apply it and see how the results look: ted.ratings_list.apply(get_funny_ratings). This looks like reasonable results; we recognize that 19,645 number, so I think this looks good. You could do some checks to make sure, but for time's sake I'm just going to go ahead and save this as our funny_ratings column, so ted['funny_ratings'] equals that, and then ted.funny_ratings to make sure it looks right again. Seems fine.

That's probably the least interesting part of this. The question is, now what do I do with it? The first thought that came to mind, and I hinted at this when I said you should save the num_ratings column from the previous exercise, was: let's calculate the percentage of ratings that are funny. So 19,645 out of how many were funny, and that would be some measure of how funny this talk was. So ted['funny_rate'], as I called it, is ted.funny_ratings divided by ted.num_ratings. Now we've got this measure of how funny a talk is. How might we spot check that this calculation makes sense, how might we do a gut check? Anyone have an idea? Here's what I'm going after, and I think my question is just too vague to really get at what I'm saying: let's sort by these values, look at which talks have the highest funny rate, look at the occupations, and see if that matches up with our human intuition.
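As a sketch, this step looks like the following, assuming the columns created in the previous exercise:

# is 'Funny' present in every ratings string? (looking for any False values)
ted.ratings.str.contains('Funny').value_counts()

# pull the count of Funny ratings out of each list of dictionaries
def get_funny_ratings(list_of_dicts):
    for d in list_of_dicts:
        if d['name'] == 'Funny':
            return d['count']

ted['funny_ratings'] = ted.ratings_list.apply(get_funny_ratings)

# share of all ratings that were funny
ted['funny_rate'] = ted.funny_ratings / ted.num_ratings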
Because if it doesn't, then maybe we did something wrong up to this point. So if we sort by funny_rate and then check the speaker occupation, you will get... this doesn't tell you anything, because we need to take the tail, which is where the talks we care about are. Actually, let's do even more, let's do 20. So we sorted by funny_rate, the sort is in ascending order, thus the highest funny rates are at the bottom, and thus these are the speaker occupations of the 20 funniest talks as judged by funny rate. This kind of matches our intuition that comedians are funny: a lot of comedians, satirists, actors, data scientists, jugglers. This makes sense, so even though it doesn't prove that we did a useful calculation, it indicates that we probably did something useful, because people whose job it is to be funny are getting people to click funny on the rating box. Similarly, and we're not giving any of these folks a hard time, these are the twenty least funny occupations: penguin expert, spinal cord researcher, thinker, educator, AIDS fighter. I'm sure these are great talks; they just weren't marked as funny, so this all makes sense, people aren't marking funny on these talks. So that's step two; I can't remember whether I wrote it down, but that was step two.

Here is what I'd call step three, which is to analyze the funny rate by occupation, and you might already be thinking there's a caveat here we need to be aware of, but I will get to that. If I want to calculate the mean funny rate per occupation, what structure would I use? A groupby, yes, sorry, that's what I meant. We're using a groupby because, again, for a groupby you're thinking "for each something, I want to take some aggregation function of some other column". So for each occupation I want the mean funny rate, and that's how I translate groupbys into code: group by speaker_occupation, then .funny_rate.mean(). Let's go ahead and sort this, and these are the mean funny rates by occupation, and the highest values are comedian, writer, juggler, actor, and so on. The problem with this calculation is that a lot of these occupations have a very small sample size. One way we could see that is by changing the mean to an agg and asking for count and mean, but another way is that, in general, if you do speaker_occupation.describe(), you'll see a count of non-null values, how many unique values there are, the top value, and the frequency of that top value. Out of all of these 2,544 non-null instances, 1,458 of them were unique, meaning there's only one instance of that occupation. That already tells us a weakness of our approach: we have a very small sample size for a lot of occupations.

My next step, and again there are other ways you could do this, but I think this is one of the more interesting ways, is step four: we're going to focus on occupations that are well represented in the data. This may not be quite the most efficient way to do things, and in fact there are some ways that would be a little less code, but I think you'll get something useful out of this particular approach. What I want to do is look at speaker_occupation.value_counts(). At the top of my value counts we've got writer, and there are forty-five of them, and down at the bottom there are occupations like 'Poet of code' that appear only once.
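A sketch of the gut check and the first groupby, assuming the funny_rate column from the step above:

# occupations of the 20 funniest talks sit at the tail of an ascending sort
ted.sort_values('funny_rate').speaker_occupation.tail(20)
ted.sort_values('funny_rate').speaker_occupation.head(20)   # and the least funny

# mean funny rate per occupation, plus two ways to see the sample-size problem
ted.groupby('speaker_occupation').funny_rate.mean().sort_values().tail()
ted.groupby('speaker_occupation').funny_rate.agg(['count', 'mean']).sort_values('mean').tail()
ted.speaker_occupation.describe()    # most occupations appear only once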
Because there are a lot of occupations that only appear once, what I want to do is generate a list of all the occupations that are somewhat frequent. One question is: what is the data type of the value_counts output? What are we looking at? It is a pandas Series. Many pandas operations output a DataFrame or a Series, and because of that you can manipulate the output of a lot of functions like any other Series or DataFrame. What I'm talking about here is that I'm going to save this as something called occupation_counts, and since this is indeed a pandas Series, I can filter it like any other Series: occupation_counts[occupation_counts >= 5]. I've now filtered it to only include occupations that appear at least five times. So I filtered a Series, occupation_counts, by its values, and then I'm going to take the index. I'm sorry it doesn't look great on screen because of the font size, but we're going to save that as something called top_occupations. I thought top_occupations would be a NumPy array, but oh, it's an Index, so it remains an Index, but an Index can be treated like a list, so you can think of this as just a list of the occupations that appear at least five times.

Rather than doing a weighted average, which is what I think two folks were suggesting, I'm going to use all the data, but I'm going to cut it off at an arbitrary value and say: let's only include occupations that appear at least five times in the data. What I need to do is filter it down, and there are a lot of ways you could do this, but here is the slickest way. We want to filter the DataFrame to only include talks by occupations with at least five instances, so I'm going to say ted.speaker_occupation.isin(top_occupations), and using that condition to filter ted gives us back a DataFrame. The way isin works is kind of in the name: it's checking for membership in top_occupations, and it's like a bunch of ORs. If you ever filter with multiple conditions, you might say something equals this, or something equals that, or something equals the other; as an alternative you do .isin(), pass it a list (or something list-like), and it checks whether each value is a member of that list. I hope that was clear, but I'm going to save this, and I know this is a lot of intermediate objects and I hope the names are good, but ted_top_occupations is my intermediate DataFrame that only includes talks by occupations represented at least five times. This still includes 786 talks; we've lost two thirds or three quarters of the talks, but we still have a lot of data to work from.

So what I'm really doing is just redoing my groupby: we want to get at the funny rate, but only when we have at least five data points. So it's ted_top_occupations.groupby('speaker_occupation').funny_rate.mean(), and it keeps going with .sort_values(). I know that was really long and you may have gotten lost in there, my apologies, but basically we've just eliminated all the small sample sizes, and we end up with a list that looks similar to our previous one. These are the least funny occupations that have at least five data points, so here is the bottom of the list, if you will, and here is the top of the list.
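The whole filtering pipeline as a sketch; the >= 5 cutoff and the intermediate names follow the walkthrough above, and the rest of the setup is assumed from earlier:

# occupations that appear at least five times
occupation_counts = ted.speaker_occupation.value_counts()
top_occupations = occupation_counts[occupation_counts >= 5].index   # an Index, usable like a list

# keep only talks given by those occupations
ted_top_occupations = ted[ted.speaker_occupation.isin(top_occupations)]
ted_top_occupations.shape    # roughly 786 talks remain

# redo the groupby without the tiny sample sizes
ted_top_occupations.groupby('speaker_occupation').funny_rate.mean().sort_values()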
And it still kind of makes sense. I've mentioned one weakness of this approach, which is that five is still a pretty small sample size; it's not a great sample size. What are some other weaknesses of this approach? Right, so that's a big weakness: we apparently have a "performance poet, multimedia artist" who has done at least five TED talks, but it happens to all be the same person, and that person is funny. So it doesn't really tell us that performance poets and multimedia artists are funny, maybe when combined, but rather that that particular person is funny, which is a big weakness of the approach. What's another weakness? Right, so ideally we would want to look at, say, data visionaries separately from global health experts. You could split this, and then how do you decide: do you assign them the first occupation, do you double count them? The point is, when people have multiple occupations listed, how do you deal with that data? We haven't really dealt with it, we've just left it as is, which is another weakness of the approach.

Well, I know we're at time, so I'm going to do two things: one is to go through the lessons to take away from this exercise, and the second is to wrap up this overall tutorial, but are there any questions about this exercise before I get to that? So, the lessons to take away, and I'll try to be brief because I don't want to keep you too long. Number one, check your assumptions about your data: did you assume Funny was always in those dictionaries, did you assume that Funny was always the first element, those kinds of things. Number two, check for reasonable results, which is when we do our funny rate calculation and check whether it makes sense, whether it matches our human intuition. Number three, take advantage of the fact that pandas operations often output a DataFrame or Series, and because of that you can do a lot of chaining and manipulate the output. That's an underappreciated feature of pandas in my opinion: you can do a lot of manipulation on the output, not just the input. Number four, watch out for small sample sizes, as we've talked about numerous times. Number five, consider the impact of missing data: pandas generally ignores missing values by default, so most calculations won't fail due to missing values. You need to be very cognizant of missing values, because they don't generally cause pandas to throw errors; you need to be aware that it's doing calculations on the non-missing data. And lesson number six, data scientists are of course hilarious, and the data proves it: they were number five in the list of funniest occupations, so tell your friends that you are hilarious.

That's it, so just to conclude, some housekeeping. Number one, let me go to the very top of this notebook and show you the survey link, in case you've already closed your notebook: it's bitly slash PyCon 2019 survey. I'd really appreciate it if you would fill it out; I assume they look at it when deciding who to bring back next year, and I'd love to come back, so please do fill out that survey. I'll be posting all the code from today, or most of it, with a lot of comments, on GitHub; you'll receive an email once I have done that, which will be perhaps a couple of days. If you want to keep in touch, I've got a website at Data School and I've got an email newsletter.
You can keep up to date with any of the videos I release or my courses. I have an existing course on machine learning, I have a pandas course on DataCamp, and I'm working on some courses about conda and about the Jupyter notebook, so if you want to hear about those, please sign up for the email list. And that is it. Thank you so much for joining me today, this was awesome, I really appreciate it. [Applause] Hi again, thank you so much for watching the video. Let me know if you have any questions and I'm happy to answer them. I want to give a special shout-out to members of my Patreon community, which is called Data School Insiders. Insiders make it possible for me to keep creating tutorials, and I'm grateful to every single one of them. If you want to become a Data School Insider, just go to patreon.com/dataschool
Info
Channel: Data School
Views: 124,211
Rating: 4.9612336 out of 5
Keywords: python, pandas, data science, data analysis, tutorial
Id: dPwLlJkSHLo
Length: 104min 16sec (6256 seconds)
Published: Thu May 23 2019