Data Science Best Practices with pandas (PyCon 2019)
Video Statistics and Information
Channel: Data School
Views: 124,211
Rating: 4.9612336 out of 5
Keywords: python, pandas, data science, data analysis, tutorial
Id: dPwLlJkSHLo
Length: 104min 16sec (6256 seconds)
Published: Thu May 23 2019
Is there a pdf version that lists this? 1.5 hrs is kinda long for pandas tips
PyCon Cleveland 2019
https://www.youtube.com/watch?v=dPwLlJkSHLo
Check for NAs
ted.isna().sum()
Sort by column
ted.sort_values('views_per_comment').head()
Plot
In Jupyter, press Shift-Tab to see a function's arguments in the help window
ted.comments.plot(kind='hist')
Drop outliers
ted[ted.comments < 1000].comments.plot(kind='hist')
See how many we lost
ted[ted.comments >= 1000].shape
query method
ted.query('comments < 1000').comments.plot(kind='hist')
loc method
ted.loc[ted.comments < 1000, 'comments'].plot(kind='hist', bins=20)
boxplot
ted.loc[ted.comments < 1000, 'comments'].plot(kind='box')
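The three filtering styles above (boolean mask, query, and loc) should select the same rows. Here is a minimal sketch to check that, assuming a tiny made-up 'comments' column in place of the real TED data:

```python
import pandas as pd

# Hypothetical stand-in for the TED dataset: only a 'comments' column.
ted = pd.DataFrame({'comments': [50, 120, 999, 1000, 6404]})

mask_result = ted[ted.comments < 1000].comments         # boolean mask
query_result = ted.query('comments < 1000').comments    # query string
loc_result = ted.loc[ted.comments < 1000, 'comments']   # loc: rows and column at once

# All three approaches return the same Series.
assert mask_result.equals(query_result)
assert mask_result.equals(loc_result)

# Rows excluded by the filter: the complement of '<' is '>='.
num_dropped = int((ted.comments >= 1000).sum())
print(num_dropped)
```

loc selects the rows and the column in one step, which is usually the safer habit when the result will later be assigned to.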
Random sample
ted.event.sample(10)
Results
pd.to_datetime(ted.film_date)  # guesses wrong when the values are Unix timestamps
ted['film_datetime'] = pd.to_datetime(ted.film_date, unit='s')
Pull two columns to verify results
ted[['event', 'film_datetime']].sample(5)
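To see why unit='s' matters: without a unit, pandas reads integers as nanoseconds since the epoch, so every date lands in 1970. A small sketch, assuming two made-up timestamps rather than the real film_date column:

```python
import pandas as pd

# Hypothetical Unix timestamps in seconds (2006-02-25 and the following day).
film_date = pd.Series([1140825600, 1140825600 + 86400])

wrong = pd.to_datetime(film_date)            # integers interpreted as nanoseconds
right = pd.to_datetime(film_date, unit='s')  # integers interpreted as seconds

print(wrong.dt.year.tolist())  # [1970, 1970]
print(right.dt.year.tolist())  # [2006, 2006]
```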
Check dtypes again
ted.dtypes
Datetime columns now expose the 'dt' namespace
ted.film_datetime.dt.year
String columns have the analogous 'str' namespace
ted.event.str.lower()
Count values
ted.film_datetime.dt.year.value_counts()
Barplot problem: Missing years
ted.film_datetime.dt.year.value_counts().plot(kind='bar')
Lineplot with sorting issue
ted.film_datetime.dt.year.value_counts().plot()
Fix the sorting issue with sort_index()
ted.film_datetime.dt.year.value_counts().sort_index().plot()
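The sorting issue comes from value_counts() ordering by count rather than by year. The sketch below uses a made-up series of years to show the difference. The reindex step is an extra assumption, not something from these notes: it is one possible way to make missing years show up as zeros in a bar plot.

```python
import pandas as pd

# Hypothetical film years; 2010, 2011, 2013 and 2014 are missing on purpose.
years = pd.Series([2012, 2012, 2012, 2009, 2009, 2015])

by_count = years.value_counts()              # ordered by frequency
by_year = years.value_counts().sort_index()  # ordered chronologically

print(by_count.index.tolist())  # [2012, 2009, 2015]
print(by_year.index.tolist())   # [2009, 2012, 2015]

# Assumption: fill the gaps so every year in the range gets a bar.
all_years = range(by_year.index.min(), by_year.index.max() + 1)
filled = by_year.reindex(all_years, fill_value=0)
print(filled.tolist())          # [2, 0, 0, 3, 0, 0, 1]
```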
max: Notice that data is incomplete
ted.film_datetime.max()
Count the number of talks per event
ted.event.value_counts().head()
Aggregating
ted.groupby('event').views.mean().sort_values().tail()
Modify to also show the number of talks at each event, using agg()
The result is now a DataFrame
ted.groupby('event').views.agg(['count', 'mean']).sort_values('mean').tail()
Sort by sum
ted.groupby('event').views.agg(['count', 'mean', 'sum']).sort_values('sum').tail()
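Passing a list of function names to agg() gives one column per aggregate, so the result can be sorted by any of them. A minimal sketch with made-up events and view counts:

```python
import pandas as pd

# Hypothetical data: three events with very different sample sizes.
ted = pd.DataFrame({
    'event': ['A', 'A', 'B', 'B', 'B', 'C'],
    'views': [100, 300, 10, 20, 30, 1000],
})

# One row per event, one column per aggregate.
summary = ted.groupby('event').views.agg(['count', 'mean', 'sum'])
by_sum = summary.sort_values('sum')
print(by_sum)
```

Keeping 'count' next to 'mean' makes it easy to spot that event C's high mean rests on a single talk.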
6. Unpack ratings data
ted.ratings[0]
Unpack a stringified list of dictionary data
import ast
ast.literal_eval('[1, 2, 3]')
Make the list
ast.literal_eval(ted.ratings[0])
Make a helper
def str_to_list(ratings_str):
    return ast.literal_eval(ratings_str)
Apply the helper to the ratings column
ted.ratings.apply(str_to_list).head()
Pass ast.literal_eval
ted.ratings.apply(ast.literal_eval).head()
Or pass a lambda, and save the result as a new column
ted['ratings_list'] = ted.ratings.apply(lambda x: ast.literal_eval(x))
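A short sketch of the unpacking step, assuming two made-up rating strings shaped like the ones in the dataset (a stringified list of dicts with 'name' and 'count' keys):

```python
import ast
import pandas as pd

# Hypothetical stringified ratings, mimicking the format of ted.ratings.
ratings = pd.Series([
    "[{'id': 7, 'name': 'Funny', 'count': 10}, {'id': 1, 'name': 'Beautiful', 'count': 30}]",
    "[{'id': 7, 'name': 'Funny', 'count': 5}]",
])

# literal_eval safely parses Python literals; it does not run arbitrary code.
ratings_list = ratings.apply(ast.literal_eval)

print(type(ratings[0]))       # <class 'str'>
print(type(ratings_list[0]))  # <class 'list'>
print(ratings_list[0][0]['name'])
```

The dtype of the new Series is still object, so checking the type of a single element is the quickest way to confirm the conversion worked.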
7. Count the total number of ratings received by each talk
def get_num_ratings(list_of_dicts):
    num = 0
    for d in list_of_dicts:
        num = num + d['count']
    return num
Apply it
ted['num_ratings'] = ted.ratings_list.apply(get_num_ratings)
Describe for statistics
ted.num_ratings.describe()
8. Which occupations deliver the funniest TED talks on average?
ted.ratings_list.head()
Check if 'Funny' is always there, yes, it is always there
ted.ratings.str.contains('Funny').value_counts()
def get_funny_ratings(list_of_dicts):
    for d in list_of_dicts:
        if d['name'] == 'Funny':
            return d['count']
ted['funny_ratings'] = ted.ratings_list.apply(get_funny_ratings)
Calculate the fraction of each talk's ratings that were 'Funny'
ted['funny_rate'] = ted.funny_ratings / ted.num_ratings
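Putting the two helpers together on made-up data gives a quick check of the funny_rate arithmetic. The rating lists below are invented. The fallback of 0 when 'Funny' is absent is an added assumption, since the notes found 'Funny' in every row.

```python
import pandas as pd

def get_num_ratings(list_of_dicts):
    # Total ratings a talk received, across all rating names.
    return sum(d['count'] for d in list_of_dicts)

def get_funny_ratings(list_of_dicts):
    # Count for the 'Funny' rating only.
    for d in list_of_dicts:
        if d['name'] == 'Funny':
            return d['count']
    return 0  # assumption: treat a missing 'Funny' entry as zero

# Hypothetical, already-unpacked ratings for two talks.
ted = pd.DataFrame({'ratings_list': [
    [{'name': 'Funny', 'count': 10}, {'name': 'Beautiful', 'count': 30}],
    [{'name': 'Funny', 'count': 5}, {'name': 'Inspiring', 'count': 45}],
]})

ted['num_ratings'] = ted.ratings_list.apply(get_num_ratings)
ted['funny_ratings'] = ted.ratings_list.apply(get_funny_ratings)
ted['funny_rate'] = ted.funny_ratings / ted.num_ratings

print(ted[['num_ratings', 'funny_ratings', 'funny_rate']])
```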
Funny rate
ted.sort_values('funny_rate').speaker_occupation.tail(20)
Analyze funny rate by occupation, use groupby
ted.groupby('speaker_occupation').funny_rate.mean().sort_values().tail()
Check sample size, many are unique
ted.speaker_occupation.describe()
Focus on occupations that are well-represented
ted.speaker_occupation.value_counts()
Output of value_counts() is a series
occupation_counts = ted.speaker_occupation.value_counts()
top_occupations = occupation_counts[occupation_counts >= 5].index
Filter it down, using 'isin'
ted_top_occupations = ted[ted.speaker_occupation.isin(top_occupations)]
Do the groupby again
ted_top_occupations.groupby('speaker_occupation').funny_rate.mean().sort_values()
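The whole filter-then-group pattern can be tried end to end on a toy table. The occupations and rates below are made up, and the threshold is lowered to 2 (min_talks is a hypothetical name) so the small sample still leaves something to group:

```python
import pandas as pd

# Hypothetical data: 'Comedian' appears only once and should be filtered out.
ted = pd.DataFrame({
    'speaker_occupation': ['Writer', 'Writer', 'Writer', 'Comedian', 'Artist', 'Artist'],
    'funny_rate': [0.10, 0.20, 0.30, 0.90, 0.05, 0.15],
})

min_talks = 2  # the notes use 5 on the full dataset

# value_counts() returns a Series, so it can be filtered like any other.
occupation_counts = ted.speaker_occupation.value_counts()
top_occupations = occupation_counts[occupation_counts >= min_talks].index

# Keep only talks whose occupation is well represented, then aggregate.
ted_top_occupations = ted[ted.speaker_occupation.isin(top_occupations)]
result = ted_top_occupations.groupby('speaker_occupation').funny_rate.mean().sort_values()
print(result)
```

Without the filter, the single 'Comedian' talk would top the ranking on the strength of one data point.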