ANITA KHAN: Hi, everyone. Welcome to Data Science
with Python pandas. This is a CS50 seminar and
my name is Ms. Anita Khan. Just to give you a little bit of an
introduction about myself, I'm a sophomore here at
Harvard and I'm in Pfoho. This past summer I interned at Booz
Allen Hamilton, a tech consulting firm, and there I was doing server
security and data science research. On campus, I'm involved with
the Harvard Open Data Project, among other things, where we try
to aggregate all of Harvard's data into one central area. That way students can work with
data from all across the university to create something special, and some
applications that improve student life. I'm also on the school's curling team. Just to give you a brief
introduction about data science, this is a rapidly evolving
field that's growing quickly. Right now on Glassdoor it's rated as the
number one best job in America in 2016. And you can have a median
base salary of $116,000. Harvard Business Review also listed it
as the sexiest job of the 21st century, and it's always growing. If we look at Indeed, we see the number
of job postings has been skyrocketing. In the past four years
alone, the number of postings has increased eightfold,
which is pretty incredible, just because data science is such a
growing field and every company now wants to use it. If we also look at job seeker
interest versus job postings, we see that at the maximum there
are sometimes 30 times more postings than there are people to fill them. And at the minimum it's still
almost 20, which is incredible. And we want to fill that demand. So here I'll be teaching
some data science. Stephen Few once said that numbers
have an important story to tell. They rely on you to give them
a clear and convincing voice. And today, I'll be
helping you to develop that clear and convincing voice. If we look at some
examples of data science, we have seen things like how
data science has been used to predict results for the election. And so we can see here,
there is a diagram about how many different
ways Clinton can win, how many different ways
Trump can win, just depending on the number of different results. And this results in a very interactive
and intuitive visualization. So if we want to look at
the brief article here, we see here some election results. And this was just released
pretty recently actually, updated 28 minutes ago. So we see here there are things
like the percentage over time on the likelihood of winning. We also have where exactly
the race has shifted, some things about state by
state estimates divided by time. And so this is just a really intuitive
way for people all across the world to be accessing this data. And so data scientists are always taking
this data and these huge spreadsheets that aren't always
accessible for people to see, and that way people can actually
observe what's going on. You also can see different forecasts,
some different outcomes that are pretty likely, and, again,
an interactive visualization for people to really understand
the data that's going on. Another example: we can see that Obamacare rates are rising,
and there is a graph showing how that's changing. We've also used data to catch
people like Osama bin Laden, and to fight crime. So data scientists have
a lot of different usages across many different fields. There are many steps to data science
and I'll be going through them today. So the first thing is you
want to ask a question. Ask a question, then you want
to-- it's important to ask a question because otherwise
there's nothing to answer. Data science is a tool, and
so once you have a question to answer you use data
science to answer that. You can't just use data
science on some arbitrary data set that you don't care too much about. Next you want to get the data. And so there are a wide variety
of places to get the data, but you just want to find a data
set that you also care about. After that you can explore
the data set a little bit, get a better sense of what
kind of descriptive statistics you're looking for. Next, you want to model the data. So what happens if you
are trying to predict something years into the future? What happens if this scenario occurs? Or what happens if this
predictor changes a lot? Then you want to see what could
possibly happen based on your model. And models always improve
when you have more data, and so it's always
good to get more data. Finally, you want to communicate
all of your information. Because while it's great
that a data scientist has all of this information that they
found, and all these visualizations, it's really important to share that
with your boss or other colleagues. That way there can be
something actionable about it. So in the examples we
showed before, we've seen things like how Osama bin
Laden was caught using data science. But if the person who's the data
scientist came up with the data and couldn't present that effectively,
then that couldn't have happened. There are a bunch of different
tools to help you get along, help you find all this information. So when asking a question, you can
think back to your own experiences. What are some issues
that you faced before? What is something that you
want to know more about? You can also look at
websites, like Kaggle, for example, which presents data
challenges pretty frequently. And so if a company poses a question,
you can always answer them yourself. You can also talk to some
experts about what kinds of things they are looking to answer
that they might not necessarily have the capability to address. And so you can help them using the data
that you find to answer their question. As for getting the data, there are
many different ways to get data. You can scrape a web page. So that you can get
information that way. You can also look at databases
if you have access to one. And finally, a lot of different places
have Excel spreadsheets or CSVs, Comma-Separated Values, and text files
that are really easy to work with. After that, you want to
explore the data a little bit. And so we have a couple of different
Python libraries, along with others, but Python seems to be pretty
common in the industry. You have libraries such
as pandas, matplotlib, which is more for visualization,
and then NumPy as well, which works with arrays. And so after that you want to
work with modeling the data. So, extrapolating essentially. And so you can also do
this with pandas and also a library that's gaining
a lot of traction is sklearn, which is more
for like machine learning. And finally, you want to
communicate your information. So matplotlib is great for creating
graphs and d3 is great for creating interactive visualizations. But as we've seen before, pandas is
used in both exploring and modeling. And also, pandas is built
on top of NumPy and integrates with matplotlib. So that's why pandas is great. So we're going to be
exploring that today. Just a little bit more
information about pandas. It's a Python library,
as I mentioned before. And it's great for a wide variety of
steps in the data science process. So, things like cleaning,
analysis, and visualization. It's super easy, very quick. And it's very flexible, so you can work
with a bunch of different data types, often many different types at once. You could have several different
columns with strings, but also numbers, and even strings within
strings, and it's great. And finally, you can integrate
well with other libraries because it's built off of
Python, it works with NumPy tools and other different libraries as well. So it's pretty easy to integrate. Next, we'll also be
using Jupyter Notebooks. So this is kind of
similar to the CS50 IDE, but this is preferred
for data science because you can see the
graphs in line and you don't have to worry about
loading things separately. You also have all of your tools and
all your libraries already loaded. So if you download a package
called Anaconda, it has all of these tools already. It also supports over 40 languages. So today, we'll be focusing
on Python but it's great that you can share notebooks and work
with many different languages as well. So we're going to just
launch into pandas. And so there are two different
data types in Python pandas. So the main one is
called series and there's another great one called DataFrame. And so series are
essentially NumPy arrays. They're essentially arrays. So you can index through
them, just as you did in CS50, but one difference is that you can
hold a lot of different data types. So this is kind of
similar to a Python list. So we can work on a couple
of different exercises. So here is going to
be our notebook where we're going to be working
with all of our information. This way you can see
everything as it goes. So you have the code here,
and then if you press Shift-Enter it loads the code for you. So here in this section, we're going
to be exploring different series. And so first you want to import the
library as you did in the last P set, for CS50. So if you import pandas
as pd, that pd means that you can access different modules
within pandas just using the word pd. So you don't have to
type pandas all the time. So if you want to create a
series, you just call pd.Series. Then inside it we use a NumPy command: we import NumPy as np, and that
NumPy command generates five random numbers. In the series
you'll also have an index. So let's see what it creates.
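Roughly, the series being created looks like this (a sketch; the exact notebook code may differ slightly):

    import pandas as pd
    import numpy as np

    # a series of five random numbers with a labeled index
    pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])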
index here, a, b, c, d, e. And then you have your five
random numbers from just here. Because this isn't saved
inside of a variable, it's just pd.Series, if you want
to save it inside of a variable, you can also do the same thing. You also don't need to have
an index, you can just have 0 and the default is just 0 through 4. Next, you can also index through them
because they work just like arrays. So can someone tell me what
ss[0] would return here? AUDIENCE: The first value. ANITA KHAN: Yeah, exactly. And then do you know what this one is? AUDIENCE: That's all the
values up to the third. ANITA KHAN: Yup, exactly. So here you have your first
value, as you had here. And then after, when you are slicing
through them, you get the values at positions 1 and 2. So that's a series in a nutshell.
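As a sketch, assuming the series is saved in a variable called ss:

    ss = pd.Series(np.random.randn(5))   # default index is 0 through 4

    ss[0]      # the first value
    ss[1:3]    # a slice: the values at positions 1 and 2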
The next type of data structure is called a DataFrame. And so essentially this is just multiple
series added together into one table that way you can work with
many different series at once and that way you can work with
many different data types as well. You can also index
through index and columns, that way you can just work with many
different data types very quickly. So here we're going to do a couple
of exercises with DataFrames. And so first we create a
DataFrame in the same way. So when we call pd.DataFrame, that
means you access the DataFrame command in pandas, and that means you create
a DataFrame out of s. So s, remember, was
the series from back up here. So we're going to create a
DataFrame out of that series, and we're going to call
that column Column 1. So as you can see, it's
the exact same series that we had before, these random
five numbers put into this DataFrame. And then its column is named Column 1. You can also access
the column by the name if you want to have a specific column. So if you call df, which is
the name of the DataFrame, and then in brackets ["Column 1"], kind
of like what we did in the last pset with accessing dicts, then
you can access that first column. It's also really easy to work with
different functions applied to that. And so for example, if we wanted to
create another column called Column 2, for example, and we want that
column to be the same as Column 1 but multiplied by 4, it would just
be like adding another element in that dict. So then it would be df,
and then in that dict we'd be creating something
else called Column 2. And then that's equal
to the Column 1 times 4. And so as you can see, we've added
a second column that's exactly the same, except it's multiplied by 4. So it's pretty intuitive.
You can work with many other functions as well: things
like df times 5, or subtracting, or you can even add or
subtract two different columns, or add multiple columns. It's
pretty flexible with what you can do.
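A minimal sketch of those DataFrame operations, assuming the series from above is saved in a variable s (this is one way to build the DataFrame; the notebook's exact call may differ):

    # wrap the series in a DataFrame with a single named column
    df = pd.DataFrame({"Column 1": s})

    df["Column 1"]                        # access a column by name, like a dict
    df["Column 2"] = df["Column 1"] * 4   # add a new column derived from another

    # element-wise arithmetic on the whole DataFrame or between columns
    df * 5
    df["Column 1"] - df["Column 2"]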
You can also work with other manipulations, such as sorting. So if you want to sort
by Column 2, for example, you can take this column and you can
call df.sort_values and then by Column 2. And if you want to preserve it,
make sure to set it to a variable because this just sorts it
and it doesn't actually affect how the DataFrame actually looks. And so if you sort by
Column 2, you can see that the whole DataFrame is just sorted
with these indices staying the same. So, for example, you see
that Column 2, this one has the lowest value so
it's going to be at the top, and then you also have the indices
preserved, sorted by that Column 2.
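A sketch of that sort, remembering to assign the result if you want to keep it:

    # sort the rows by Column 2; the original index labels come along
    df_sorted = df.sort_values(by="Column 2")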
You can also do something called Boolean indexing. So if you recall, with a NumPy array,
if you just ask, for example, whether the array is less than 2,
it returns trues and falses showing
whether each element is actually less than 2. So this same concept can
be applied to a DataFrame. And so if you call
this DataFrame, if you want to access things that
in Column 2 are less than 2, then you can just do syntax
like this and it would return every row where Column 2 is less than 2. As you can see, the first
row has been eliminated because its Column 2 value is not less than 2.
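In code, the Boolean indexing being described looks roughly like:

    df["Column 2"] < 2       # a column of trues and falses
    df[df["Column 2"] < 2]   # keep only the rows where Column 2 is less than 2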
You can also apply things called anonymous functions. And so if you have
something called lambda x, is the minimum of the DataFrame
plus the maximum of the DataFrame, then you can apply
that to your DataFrame and then that should return the
result of whatever this should be. So, for example, if you run this
you take the minimum of Column 1 and then you add it to
the maximum of Column 1. And this result is negative 1.31966. And then you do the same
thing for Column 2 as well. So you can run the same thing
to another-- you can also add another anonymous function. Do you want to try it out? Give an example? So it's something like
df.apply (lambda x). AUDIENCE: A mean? ANITA KHAN: Mean? OK. mean(x). Oh, whoops. That's why you don't do
live coding during seminars. You can also call np.mean(df),
and that should return the mean as well.
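A sketch of the apply call being described (column-wise minimum plus maximum), along with the mean:

    # apply an anonymous function to each column: its minimum plus its maximum
    df.apply(lambda x: x.min() + x.max())

    # the mean of each column
    df.mean()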
Finally, you can describe different characteristics of that DataFrame. And so if you do something
like df.describe, it returns how many values
are inside the DataFrame. You can also find things like the mean,
standard deviation, minimum, quartiles, and finally, the maximum. So it's pretty easy once you have all
that data loaded into DataFrame if you call df.describe, then that allows you
to access pretty essential variables about that DataFrame. That way you can work with
different things pretty quickly. So if you want to
subtract and add the mean, then you have these two
values here already. If you want to access
the mean exactly, you could take the table that describe returns
and index into its "mean" row, and that should access the means as well.
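Roughly, and assuming the df from above (the variable name stats is just for illustration):

    stats = df.describe()   # count, mean, std, min, quartiles, max for each column
    stats.loc["mean"]       # pull out just the row of means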
So we're going to go through the data science process together. So, the first thing we're
going to do is ask a question. So what are some data sets
that you're interested in and what kind of questions do
you want to answer with data? AUDIENCE: Who's going
to win the election? ANITA KHAN: Who's going
to win the election? That's a good one. AUDIENCE: Anything to
do with stock prices. ANITA KHAN: Stock prices. What kind of things with stock prices? Kind of similar to CS50 Finance? Or like if you want to predict
how a stock moves up and down? AUDIENCE: Yeah. ANITA KHAN: OK. All very interesting questions. And the data is definitely available. So for something like-- yeah,
we can go through that later. So today we're going to be
exploring how have earth's surface temperatures changed over time. And this is definitely a relevant issue
as global warming is pretty prevalent and then temperatures
definitely are increasing a lot. We had a very hot summer,
a very hot winter. So this might be something
we want to explore, and there are definitely
data sets out there. So for getting the data
in this kind of example, so where do you think you'd get data
about who's going to win the election? AUDIENCE: I'm sure
there's several databases. Or past results. ANITA KHAN: Past results
of previous elections? AUDIENCE: Yeah. And polls. ANITA KHAN: Where do you think you
could get data about elections? AUDIENCE: Previous polls. ANITA KHAN: Yeah, definitely. And as we saw before in the
New York Times visualization, that's how a lot of people
predict how the elections are going to go, just based on aggregating
a lot of different polls together. And we can take maybe the
mean and see who's actually going to win based on all of these. That way you account for any variance,
or where different places are, and who different polls
are targeting, and so on. So for something like stock
prices, what would you look at? Or where would you get the data? AUDIENCE: You could start
with Google Finance. ANITA KHAN: Google Finance. Yeah. Anything. AUDIENCE: Like Bloomberg
or something like that. ANITA KHAN: Yeah, for sure. Same thing? AUDIENCE: Same places, I guess. ANITA KHAN: Same places. Yeah. And what's really cool is
that there are industries that are predicated off of both of
the questions that you're asking. And so if you can use data science to
predict how stocks are going to move, that's how some companies operate. That's how they decide
what to invest in. And then for elections, if
you can predict the election, that's life changing. And so here we're going to get the
data from this place called Kaggle. As I mentioned before, it's where
a lot of different companies pose challenges for data science. And so if we look here,
there is a challenge for looking at earth's surface
temperature data since 1750. And it was posted by Berkeley
Earth pretty recently. What's great about Kaggle
is that you can also look at other people's
contributions or discussion about it if you need help about how do you
access different types of data. So if we look at a
description of this data, we see a brief graph of how
things have changed over time. So we can definitely see
this is a relevant issue. And you can see from this
example of data science already, it's pretty intuitive to see what
exactly is happening in this graph. We see that there is an upward
trend of data happening over time, and we see exactly what are the
anomalies over this line of best fit. We also see that this data set
includes other different files, such as global land and
ocean temperature, and so on. And the raw data comes from
the Berkeley Earth data page. So if we download this-- it might take
a little bit to download because it's a huge data file,
because it's containing every single temperature since 1750
by city, by country, by everything. So it's a pretty cool
data set to work with. There's a lot of different data sources. And while this isn't quite
like technically big data, this definitely is a chance
to work with a large data set. So if we look here, we can
look at global temperatures. So here you can see some pretty
cool information about the data. You see that it's
organized by timestamp. You can look at land average
temperatures, you can see here. Might be kind of hard to tell. Land Average Temperature Uncertainty,
that's a pretty interesting field. Maximum Temperature, Maximum Temperature
Uncertainty, Minimum Temperature. So it's always great to look at a data
set, like once you actually have it, what kinds of fields there are. And so there's things
like date, temperature. We see that there are a lot of different
blanks here, which is kind of curious. And so maybe this could get
resolved later in the data set? And we see that this goes all
the way up to the 1800s so far. And then we see here that the
other fields are populated here. So it's possible that
before 1850, they just didn't measure this at all, which is
why we don't have information before. So this is something to keep in
mind as we work with the data set. And so we see, there's a lot of
information, a lot of really cool data. And so we want to work with that. And so we open up our notebook. You import in all of the
libraries you already have. The great thing about
Jupyter Notebook is that it keeps in memory
things that you've loaded before. So up here we loaded
pandas and NumPy already, so we don't have to load them again. And so we just import matplotlib,
which is, again, for visualizations, and graphs, and everything. And we also import NumPy-- we
already imported that-- but it helps you work with arrays and everything. This matplotlib inline allows you to
look at graphs within Jupyter Notebook. Otherwise it would just open up a new
window, which can get kind of annoying. And so if you want to see
it inline, that way you can work with things pretty quickly
rather than switching between windows, it's a good thing to use. And then this is just a
style way of preference for how you want your graphs to look. And so if you use the
default, it's just like blue. I wanted it to be red and
gray, and nice so I changed it. So if you call pd.read_csv-- again,
remember that pd is referencing pandas. And so this is accessing a
function in pandas called read_csv. So it lets you load in a CSV,
just with a single command, and that way it loads
into your DataFrame. And so if we call that-- yeah. So this looks exactly the same
way we had it before, or had it in the Excel spreadsheet,
just loaded into a DataFrame. So again, very simple. If you want to see the rest
of the file, you just call df. I just chose head(), that way head shows
the first five elements rather than every single thing, because
it was a pretty long data set. But it does show the first 30, and
then also the last 30 I believe. And so you can see that there
are 3,192 rows and 9 columns, just from loading it in. You can also call tail(), and then that
should show you the last five elements. You can also change the number within
here, for example tail(10), to get the last 10 elements. So you can see things pretty easily.
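A sketch of loading and peeking at the data; the filename here is a guess at what the Kaggle download is called:

    import pandas as pd

    df = pd.read_csv("GlobalTemperatures.csv")
    df.head()     # the first five rows
    df.tail(10)   # the last ten rows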
Next, we want to look at just the land average temperature. That way we can work with
just the temperature for now. The others are a little
bit confusing to work with, and so we want to just
focus on one column for now. Plus, that's what we're interested in. We want to see how temperature
has changed over time. So we want to look at just temperature. And so this is a method to index. And so this takes the columns
from 0 all the way up to 2, where it stops right before 2. And then it gets to zeroth
column, and the first column. The zeroth column,
remember, is the datetime, and the first column is the
land average temperature. And then again, we want
to take the head(). So as you see, it's just the datetime
and the land average temperature. And we also changed the
DataFrame to be updated to this. That way we can work with just
these columns rather than the rest of them.
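A sketch of that column slice (the column positions are as described above):

    # columns 0 up to but not including 2: the date (dt) and LandAverageTemperature
    df = df.iloc[:, 0:2]
    df.head()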
Next, as we saw before, df.describe was a very helpful tool. And so if we run that
again, that will allow us to see basic information about it. And so we see that there
are in total 3,180. And then we also have
a mean temperature. We have a standard
deviation for temperature. We have our minimum and maximum as well. And we also see that we have NaN
values, which means it's not a number. So that's a little bit curious. We might want to explore
that a little bit. In all likelihood, it probably is
just that there are Not a Number values in there, and so it's hard
to find quartiles when some of them are not valid numbers. So once we have a
description, we can see we've gained insights already about it,
just from those couple lines of code up here. And so we see that the mean
temperature from 1750 to 2015 was 8.4 degrees, which is interesting. Next, we want to just
plot it, just so we have a little bit of a sense
of how the data is trending. We just want to plot it, just to
see we can explore some of the data. And plus, it's pretty easy to apply it. So even if it doesn't look too great,
then we aren't losing anything. And so, plt. Again, we
imported matplotlib, which is the library that helps you plot; specifically, matplotlib.pyplot helps you plot. And then if you import it as plt, you
can access all of its functions just by calling plt. And so we have plt.figure. plt.figure(figsize), that just defines
how big that graph is going to look. And so we say it's going to be 15 by 5. And so you have the width
is a little bit bigger, and that's to be expected because it
should be like a time series graph, and so there will be more years
than there are actual temperatures. Next, we're going to
actually plot the thing. And so since we have a DataFrame
that has all that information, we can just plug that in. And this command knows exactly
how to sort between the x and y, so you just need to call that DataFrame. The only thing is that matplotlib
in this case would plot a series. You can also plot multiple of them. But as the series, as
you remember before, is a one-dimensional
array with an index. And so in this case that land average
temperature, or the temperature itself, would be what you plot on your y-axis. And then the x-axis would be the index. So that would be what year you're in. You can also plot a whole DataFrame. And then this, we'd just plot all
the different lines all at once. So if you had a land
maximum temperature, then you can see the
differences between that. We also have plt.title, that
changes the title of the whole graph. You have the x label, year, and the y label. And finally, you want to show the graph. You don't strictly have to,
because with Jupyter Notebook the same thing happens inline.
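Roughly, the plotting code being described (the column name follows the Kaggle file; the title and labels are just examples):

    import matplotlib.pyplot as plt

    plt.figure(figsize=(15, 5))               # a wide figure for a time series
    plt.plot(df["LandAverageTemperature"])    # y is the temperature, x is the index
    plt.title("Land Average Temperature")
    plt.xlabel("Year")
    plt.ylabel("Temperature")
    plt.show()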
And so you see from this graph, it's a little bit noisy. And so we see that there
seems to be an upward trend, but it's kind of unclear because
it looks like things are just skyrocketing back and forth. Do you have an idea why
that might be the case? AUDIENCE: It's connecting the dots. ANITA KHAN: Yeah, exactly. Yeah, that's exactly right. And so we also see
from the table up here, there are different months located. And so, of course, the temperature
will decrease during the winter and increase during the summer. And so as it connects
the dots, as you said, then it'll just be connecting the
dots between winter and summer and it will just be increasing a lot. So this graph is kind of messy. We want to think about how
exactly we can refine it. But we do see that there is
a general upward trend, which is a good thing for us to see, probably
not good for the world, but it's OK. We can also pretty clearly
see what the ranges are. And so we see here, you can get from
as low as couple of negative degrees up to almost 20 degrees,
which is consistent with our df.describe findings. We also see that it goes from the
0 to the 3,000, or almost 3,200, which is not quite correct because we
only had the years from 1750 to 2015. And so there's something incorrect here. It's probably referencing
the months maybe. AUDIENCE: I think it's
referencing the indexes? ANITA KHAN: Yeah, exactly. Referencing the indexes,
but each row is a month. And so it would be like the zeroth
month, first month, and so on. So how do you think we can make
this graph a little bit smoother, so that it doesn't go
up and down by month? AUDIENCE: Make a scatterplot? ANITA KHAN: Scatterplot. But if you had the points--
yeah, we can try that. So plt.plot(kind=scatter). And then for a scatterplot, you
need to specify the x and the y. So we could have x equals
the index, as we said before. And the y equals the
actual thing itself. plt.scatter. Scatterplot. So we still see a couple different--
it's still a little bit messy. It's still kind of hard to see
exactly where everything is. What else do you think we could do? So right now we have
it indexed by month. What do you think we
could change about that? AUDIENCE: You can have dates by year. ANITA KHAN: Yeah, exactly. So if we ever-- AUDIENCE: Like the max temperature. ANITA KHAN: Max temperature, yup. All very good ideas and
something to definitely explore. So for now we can just look at the mean
of the year, or average of the year. That way we can see because
each year has all of the months, it would make sense just
to average all of them, just to see how that's been changing. However, we notice when we look
at the timestamp column, which is called dt, if we access
that and call the type, it's actually of type str. So that means all of these
dates are recorded inside of the file as a string
rather than a date. So that would mean if we
want to parse through them, we have to look through every
single letter inside of the DT. So what might be helpful is to
convert that to something pandas has called a DatetimeIndex. Pandas is very well adapted
towards time series data. And so, definitely, there are a lot
of tools in their library for this exactly. So if we convert it to a DatetimeIndex,
we can also group it by a year. And this is a syntax where we
take the year in the index, and then we also take the
mean of every single one.
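A sketch of that conversion and grouping, assuming df has the dt and LandAverageTemperature columns (the variable name yearly is mine):

    # parse the string dates into a DatetimeIndex
    df.index = pd.DatetimeIndex(df["dt"])

    # group the monthly temperatures by year and take each year's mean
    yearly = df["LandAverageTemperature"].groupby(df.index.year).mean()

    plt.figure(figsize=(15, 5))
    plt.plot(yearly)
    plt.show()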
So if we run that and plot it again, it's a little bit smoother. So we can definitely see that
there is a trend over time. And as there are a lot
of different spikes, so it's not incredibly
uniform, which makes sense because there are peaks
and valleys for years. But as a whole, this data
set is trending upwards. So this is wrapping up
the exploratory phase. But then we notice there is
something pretty anomalous here. We see right around the 1750,
in the beginning with 1750s, there's a huge dip down. So before while it was at 8.5 before,
it went all the way down to 5.7. So let's see. There might be a couple
of reasons why this might be the case,
such as maybe there was an ice age for that
one year or something and then it went back up to 8.5. But that's probably not what happened. So let's look into
the data a little bit. Maybe they messed up something,
maybe someone mistyped a number. So that it says negative 40,
or negative 20 instead of 20, or something like that. And so if we look at the data-- and it's
important to check in with yourself, make sure that what you're getting
is reasonable-- we can look in. And so we want to see what
caused these anomalies. Because it was in the
first couple of years, we can call something like .head(),
which shows the first five elements. And we see here that
1752 is what caused this. And for whatever reason, even though
all of the years previous and after had 8 degrees and then 9 degrees, it just goes down
to 6.4 degrees, which matches what we found in our plot. So let's look at that data set exactly. So, as you remember, we
can filter by Booleans. So if we want to see if the year of
that grouped DataFrame is equal to 1752, we can see what happened. And so we see here, in this case
we can see every single temperature from every single month, and
the land average temperature, as long as that year is 1752. And because it's a
DatetimeIndex, we're allowed to do something like that,
rather than searching the string of every single
row for 1752.
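In code, roughly, thanks to the DatetimeIndex:

    # every monthly row from the year 1752
    df[df.index.year == 1752]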
And so we see in this exploration that, while this January's land average temperature makes
sense that it's pretty low, we also have things like Not a Number. And you have things, like you
have a couple of the numbers but then all these summer
months are just gone. And so what happens is when you average
this, where it might not have a number, it'll just average the existing values. And so because you're
missing those summer months it'll be low, even though
it's not supposed to be. So what exactly can we do about that? So there are a lot of null values. You want to see what exactly we can do. Also, this might be affecting
results in the future. Because what happens if there are
other null values in other years? It wouldn't be just exclusive to 1752. And so again, as we tried
with the Boolean values, if we call numpy.isnan(), that can
access every single thing and determine which cells exactly are not a number. And specifically, land average
temperature is not a number. And so we see here that there
are a lot of different values that are all not a number. And so this is OK. It definitely makes sense, because
no data set is going to be perfect. As we saw before when we
were looking at the data set, it was missing all these columns. And so it's not ever going
to be perfect, which is OK. The thing that you have to do is
either work with data that is perfect, or you have to fill
in those null values. You have to make sure that it
has something that's reasonable that shouldn't affect
your data that much, but you should fill it in with
something that makes sense. So, in order to find out
what exactly makes sense, we want to look at possibly
other information around it. So if we wanted to predict
this February of 1752, how do you think that we could
estimate what that should be? AUDIENCE: Look at the
previous and past February's? ANITA KHAN: Yeah, exactly. Yeah, previous and past
February's are a good way. Another way to do it might
be looking at the January and the March of that same year. It should be somewhere
around the middle maybe. Because to get from that
January to the March, you have to be somewhere in the middle. And so February would make sense that
it should be right around the middle. And then you could do the same
thing for these values as well. It's kind of a little bit
more difficult because you don't have before and after values for
where there are a lot in the sequence, but definitely looking
at the year before, the year after might be helpful. So what we're going to
do today is we're going to be looking at the month before,
or previous thing that's most valid. So, for example, in February you
would look at the month before. So then this would be that January. For this May, you would be
looking at the April previously. And then for this June, because the
most previous value is that April, you'll be looking at
that April value as well. So you'd just be filling all
of these with this April value. So, not the most accurate, but it's
something that we can at least say it's reasonable. So you're going to be changing the
value of what that DataFrame column is. And so we want to set that
equal to something else. And it's going to be
exactly the same thing, but we're going to be calling
a command called fillna(). It's another pandas command, but
it fills all of the null values. So these are things like none,
NaN, any blank spaces, or anything, just things that would go under
na, that you would classify as na. And the way we're going to fill this
is going to be called something ffill, or forward fill. So this is going to
be things from before and then it's just going to
fill the things ahead of it. You can also do backward fill, and there
are some other ways as well.
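A sketch of checking for and forward-filling the missing values (assuming the monthly df from above):

    # which temperature values are missing, and how many?
    np.isnan(df["LandAverageTemperature"]).sum()

    # forward fill: each missing month takes the most recent valid value before it
    df["LandAverageTemperature"] = df["LandAverageTemperature"].fillna(method="ffill")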
And so once we call that, it changes, and then we can graph it again. And then we see it's a
little bit more reasonable. There still are some dips and
everything, but it can't be perfect. So we might want to try
different avenues for the future. That data set definitely looks a
lot cleaner than it was before. And we know that there are no
null values as of right now, so then we can work
with the whole data set and not have to worry about that at all. All the syntax for the
plots are pretty similar. So you can always definitely copy it,
or even create a function out of it, that way you don't have to worry too
much about styling and everything. You can also change things like
the x-axis, y-axis, font size. So it's pretty simple. So that concludes our
exploration of our data set. Next, we want to model
our data set a little bit to predict what would happen
based on future conditions or other variables that could happen. So in your example of predicting the
election, what would you want to model? AUDIENCE: Who gets electoral votes. ANITA KHAN: Yes, exactly. And then for stock price,
what might you want to model? AUDIENCE: Likely [INAUDIBLE]. ANITA KHAN: Yeah, exactly. And how that all change over time. And so there are
different ways to model. The model we're going to use
today is called linear regression. So, as you might have
learned before in class, just like creating a line of best fit. That way you can estimate how that
trend is going to change over time. So we're going to be calling
in a library called sklearn. So this is typically used
for machine learning, but it's also good for regression
models, or just seeing how things will change over time,
and it's pretty easy to use. And so this is just a
couple of syntax values, that way you can set what
that x is and what that y is. You just want to take just the
values rather than a series, and that creates a NumPy array. And then you import this as LinReg, and you can just call your
regression is equal to this. And then sklearn has a quirky syntax
where you want to fit it to your data first, and then you can predict the
data based on what you had there. That way if you want to
predict a certain value that wasn't in your data set, you
could call that in predict. And so if you call reg.fit(x, y),
that should find the line of best fit between x and y. And then if you want
to predict something, then you would call reg.predict(x). You can also do something
called score, which is where you compare your predicted
values against your actual values. And so here you put in x,
which would be your predictors, and y, which is like
your predicted values. So in this case x would
be the year, and then y would be what exactly
that temperature would be. And so you compare what the
actual temperature is against what your predicted temperature is.
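A minimal sketch of those regression steps with sklearn, assuming the yearly averages from before (the variable names and the exact import are illustrative):

    from sklearn.linear_model import LinearRegression

    # x is the year, y is the average temperature; sklearn wants a 2-D x
    x = yearly.index.values.reshape(-1, 1)
    y = yearly.values

    reg = LinearRegression()
    reg.fit(x, y)                    # find the line of best fit
    predictions = reg.predict(x)     # predicted temperature for each year
    reg.score(x, y)                  # r-squared: predicted versus actual
    reg.predict([[2050]])            # extrapolate to the year 2050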
Next, we want to find that accuracy, to see how good our model is. And so this compares how
far the predicted point is from the actual point,
using the residual sum of squares, and r-squared, if you've
heard of that in stats. And so we see that it's not very
accurate, but it's better than nothing. It would be better than a random point. And since this was a very basic model,
like this is actually not terrible. It's a good way to start. And so next we want to plot it to
see exactly how accurate is it. Because while this percentage could
mean something as to how accurate it is, it's not that intuitive,
and so we want to graph it. So again, graph it as we did before. Scatterplot is good for this. And we see how all of
these points-- you see that we have our straight line
of best fit here, that blue line, but then we also have all of our points. And we see that it's not
perfect, but it definitely matches the trend in data,
which is what we're looking for. And so if we wanted to
predict something like 2050, we would just extend that
line a little bit further. Or if you just wanted the number,
you could call reg.predict(). And so this is what we did here if
you call that reg.predict(2050). So this predicts that
the temperature in 2050 will be 9.15 degrees, which is pretty
consistent with what this line is. Do you have any ideas for
a better regression model? So instead of linear, what might we do? AUDIENCE: Like a polynomial? ANITA KHAN: Yeah, exactly. So it looks like this data set is
following a pretty curvy model. We see while it's pretty
straight here, it curves up here. And so, definitely, polynomial
might be something to look for. There's also another pretty
cool method of predicting called k-nearest neighbors. And what this is you find
the nearest points and then you predict based on that. So for example, if you
wanted to predict 2016, you would look at the
nearest points, which are 2015 and 2014, maybe
2013 if you want that. Average it together and then
that would be your prediction.
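As a sketch of that idea, sklearn has a k-nearest-neighbors regressor (this wasn't shown in the seminar, so treat it as an illustration):

    from sklearn.neighbors import KNeighborsRegressor

    # predict a year's temperature as the average of its 3 nearest years
    knn = KNeighborsRegressor(n_neighbors=3)
    knn.fit(x, y)             # the same x (years) and y (temperatures) as before
    knn.predict([[2016]])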
There are other regression methods as well. You could do logistic regression,
or you can use linear regression but use a few more parameters. That way you can decrease the effect
a certain predictor has, and so on. But linear regression is a good start. You should definitely look
at the sklearn library and there are definitely a lot of
different models for you to use there. And so the next part is
communicating our data. So how do you think we could communicate
the information that we have now? Who would we want to communicate
to on global temperature data? AUDIENCE: [INAUDIBLE] ANITA KHAN: What do you think? Same thing? OK. If you wanted to communicate something
about what your examples are, once you had data about
election predictions, how do you think you
could communicate that? AUDIENCE: Do something very similar
to what the New York Times did. ANITA KHAN: And what
about stock market data? Who would you communicate to,
what would you be sharing? AUDIENCE: Try to put it in
some type of presentation. ANITA KHAN: Yeah, exactly. That'd be great. And you could present to
one of these companies, or you could do it at a
stock pitch competition, or even invest, because maybe you
just want to communicate to yourself, and that's fine too. But the idea is once you have that
data, someone needs to see it. Once you have that data, it can
generate pretty actionable goals, which is a great thing about data science. So just talking about some
other resources since we've gone through the pretty
simple data science process. Other resources if you want
to continue this further. I'm a part of the
Harvard Open Data Project where we're trying to aggregate Harvard
data sets into one central area. That way students can work with that
kind of data and create something. So some projects that we're working on
are looking at energy consumption data sets, or food waste data
sets, and seeing how exactly we can make changes in that. So other than that, again, as
I showed you before, Kaggle. Definitely a great resource if you
want to just play with some simple data sets. They have a great
tutorial on how to predict who's going to survive
the Titanic disaster based on socioeconomic status,
or gender, or age. Can you exactly predict
who will survive? And actually, the best
models are pretty accurate. And so that's really cool that just
using a couple regression models and using exactly the same tools that
I showed you, you can predict anything. Your predictions might
not be very correct, but you can definitely create a model
that would be more accurate than if you shot in the dark. Some other tools are data.gov
and data.cityofboston.gov. So again, more open data
sets that you can play with and you can create actually
meaningful conclusions. And so in data.gov you could look
at a data set on economic trends. So, how unemployment is changing. You could predict how unemployment
will be in a couple different years. Or you can definitely
get information about how election races have gone in the past. You can definitely reach out to
organizations like Data Ventures that works with other
organizations, essentially like consulting for another
organization using data science. There are a lot of classes
at Harvard about this. Definitely CS50 had a sentiment analysis pset. You can work with that as well. So if you've got all the tweets of
Donald Trump and Hillary Clinton, and all the other
presidential candidates, and did some sentiment analysis on
that, or looked at different words, you could predict what
exactly might happen. You can also take other classes such as
CS109 A and B, which are Data Science and, I believe, Advanced
Topics in Data Science. CS181 is Machine Learning as well. There are other classes, I'm sure,
that are definitely helping with this. Also another good resource
is if you just Google things. If you search Python pandas groupby,
for example, if you forget the syntax, you can look through great documentation
on how exactly to use them. So it gives you examples,
like code examples. So in case you forget from this
presentation, or other tools that you might want to use as well. So, for example, if you
want to do a tutorial, or if you want to work
with time series, there are a lot of-- the documentation
for pandas is pretty robust. And same thing for the
other libraries as well. So sklearn linear regression. Definitely have looked that up before. And you can do the same thing, where
it has parameters that it takes in, and also what you can call after you've
called sklearn in your regression, what exactly you can get. So you can get the coefficients,
you can get the residuals, the sum of the residuals. You can get your intercepts. There are some other
information that you can use. They probably have examples as well. They have examples
using this, just in case like you want an example of what
exactly yours should look like, or you want code. That's definitely helpful. And finally, just to inspire
you a little bit further, I can talk a little bit about my data
science projects that I'm working on. For one of my final
projects for a class I'm trying to predict the NBA draft
order just from college statistics. So there's a lot of information, I
think going back to when the NBA started, on how exactly draft order
is selected, just based on that college student's statistics. And so definitely a lot
of people are trying-- like there are industries devoted
to predicting what will happen based on those college statistics. Like exactly what order,
how much they'll get paid, how does this affect their play time
while they're on their teams, so on. Also, over the summer at Booz Allen I
was developing an intrusion detection system for industrial control systems. Essentially what this
entails is industrial control systems are responsible
for our national infrastructure. And so if we observe
different data about them, we can possibly detect
any anomalies in them. An anomaly might indicate
the presence of an attack, or a virus or something on it. And so that is a possibly better
alternative to current intrusion detection systems that might
be a little bit more complex rather than just focusing on data. Something else I'm working on for
another final project for class is looking at Instagram friends
based on mutual interactions. And so each person on Instagram, maybe
they like certain people's photos more often than other people's photos. Maybe they comment more, maybe
they are tagged in more photos. And so looking at that information,
if you look at the Instagram API, it's pretty cool to see how there
is a certain web of influence, and you have a certain
circle that's very condensed and expands a little bit further. And what's interesting about
that is celebrities, for sure, they definitely interact
with certain people more or less, definitely get
in hate wars, or anything. For example, Justin
Bieber and Selena Gomez. People found out they
broke up because they unfollowed each other on Instagram. So I think that's interesting. Also some other things that I've
done are predicting diabetes subtypes based on biometric data. So this was in CS109. First P set, I believe. And so given biometric data, so it would
be information like age and gender, but also biometric data like presence
of certain markers, or blood pressure, or something. You can pretty accurately predict
what type of diabetes they'll have, or whether they'll have diabetes or
not, like type 1, type 2, or type 3. And we can also predict things
like urban demographic changes. Because a lot of this
information is available online, you know what socioeconomic
status people are in, but you also know where
exactly they're located based on longitude and latitude. And so based on how good
your regression model is, if you input in a specific
latitude and longitude, you can predict what exactly
socioeconomic status they're in, which I think is pretty cool. And over time as well, because their
data sets go back many different years. So those are a couple of ideas. Any questions about data science? AUDIENCE: It's pretty cool. ANITA KHAN: Thank you. OK. Well, thank you for coming. If you have any questions,
feel free to let me know. My information is here if you want
any advice or tips or anything. And also these slides and
everything will be posted online if you want to access that again. So, thank you.