[MUSIC PLAYING] CARL BERGSTROM: All right. So the next thing I
want to talk about in trying to spot
bullshit is being aware of unfair comparisons. There are so many
unfair comparisons that get put out there in
various forms of media. Also in scientific
papers-- when people are trying to make
arguments-- they say, oh, this is better than that. This is more
effective than that. Whatever the case may be. And you have to make sure that
these comparisons are actually reasonable. So let me just give you an
example of the kind of study someone might do. Maybe I want to know
whether fresh juice is sweeter than juice
in concentrate form. So how do I test that? I'm going to go get myself
a bunch of fresh juice. Buy a bunch of jugs
of Martinelli's. Going to go get myself
a bunch of concentrate. Buy a bunch of tubes
of Minute Maid. And now I'm going to give
that to a bunch of people-- let's say 25 subjects--
and I'm going to ask them, which one do you
think tastes sweeter? So I do that. Twenty-one people think the
Martinelli's is sweeter. Four people think that the
Minute Maid is sweeter. I do a little bit of statistics. We don't need to
worry about that. But I find that
looking at these data, I can conclude
that fresh juice is sweeter than from
concentrate, with a p value of less than 0.01
using an exact binomial test. I hope someone's a
little bit skeptical. What am I doing wrong here? STUDENT: It's literally
apples and oranges. CARL BERGSTROM: I'm comparing-- JEVIN WEST: [LAUGHTER]. CARL BERGSTROM: Thank you. I'm comparing
apples and oranges. So that was a very
unfair comparison. If I want to see,
obviously, what the process of concentrating
a juice does to its sweetness, I should use the same fruit and
probably from the same supplier and so on, and certainly not be
comparing apples and oranges. That was a silly example,
just to have some fun. But this is really,
really common in the sorts of
media reports that we see and in scientific
papers as well. And I want to give you
a couple of examples. I'm sure you've
seen these lists. These are all over the internet. They're exposed. They're easy to make, and
they generate a lot of clicks. And sometimes they make you
click through city by city and see every city. And so they get a lot
of ad revenue and so on. So here we go. The most dangerous cities
in America, and then your most murders,
most violent crimes, whatever they want to do. And so in this particular story
that came out not so long ago, the most dangerous list starts
off with St. Louis, Missouri. I was born there and lived
there at the start of my life. Number two, Detroit,
Michigan, I spent most of my teens hanging out there. And so this is starting
to get a little bit personal, because these
weren't such bad places. These were not such bad places. And so what's going on? Why are they up here? But on the other hand, a
city is a city is a city. What could be apples
and oranges about that? To answer that question,
we have to look into the sociological nature
of cities and how they work. And for reasons that would
take a lot more than one class to go through, it
turns out that-- as you know-- inner cities-- urban cores-- typically
have higher crime rates than the outer suburbs. So these are crime
rates in Seattle. And so we've got high
crime rate down here. We've actually got a relatively
high crime rate right here. But that's a very typical
pattern that we see in cities. So for example, here's
Atlanta, where I lived before I came to Seattle. Here in the central
part of Atlanta, we've got fairly
high rates of crime. And in the outer
suburbs, most of them have much lower crime rates. Now if you think about
what constitutes Atlanta-- what's the city of Atlanta-- the Atlanta metropolitan area
is really this whole huge range. Atlanta is this gigantic
suburban sprawl that's got 13 million people or
something like that-- you can look that number
up and call BS on me-- but something like that. What's the city of Atlanta? The city of Atlanta is
a local political unit. The city of Atlanta is just
that region right there. So the city of Atlanta is
just this little piece, even though Atlanta
metropolitan area is out here. And of course, this little
piece that's the city of Atlanta contains Atlanta's historic
center and its urban core, where much of the
crime is taking place-- where the crime
rates are higher. So that's Atlanta. Let's compare
Jacksonville, Florida. Here's Jacksonville, Florida. We see a similar pattern
to what we see in Atlanta. We've got this urban core and
then this larger suburban area around the outside. But now we see something really
different in terms of how the city limits are defined. So in 1967, the original
city was right here in the urban core. And you see that overlaps
quite cleanly with where we've got the high crime rates. But gradually the
city of Jacksonville has pushed its boundaries
out all the way to include all of the
encompassing surrounding suburban areas, where crime
rates are much, much lower. So what we've got going
on in Jacksonville is that in Jacksonville the city
is including the entire metro area. In Atlanta, the city is only
including the urban core. So if we go and we look at
the murder rate by city-- So here's these
different cities. These are murder rate data. Here's Atlanta. Here's Jacksonville. Now we can say, what
fraction of the metro area is included in the city? What fraction of the
population of the metro area is included in the city? And Atlanta is one of
the smallest in the US. Less than 10% of the
metro area is in the city. And so Atlanta-- we've
got this what seems like a pretty high crime rate. Over here in Jacksonville,
the crime rate's lower. But in Jacksonville,
60% of the metro area is included in the city. This is not an apples
to apples comparison, because in Atlanta we're
only counting the urban core, where the crime rate is highest. In Jacksonville, we're counting
the entire surrounding area. And we're getting almost
as high of crime rates as we are in Atlanta anyway. So this is very
much not an apples to apples comparison,
if that makes sense. So I did this. I put these data together
over the weekend, and I wanted to just be sure
that my story was legit. So I put together
a second graph. And I'm not going to
explain this graph today. This is a teaser for next class. I'll tell you what it is,
but I'm not going to tell you-- and what I would like you to
do is figure out why I did it and why I feel like it
strengthens my argument. So in the previous
graph, I'm graphing the murder rate in the city. Here I'm graphing
the murder rate in the entire metropolitan area. So if instead of
using just the city, I used the entire
metropolitan area. And I again graph it
against this same measure, what fraction of the
metro area is in the city? So you've got Atlanta over
here, Jacksonville out here. Now I don't have
any trend at all. We had massive
statistical significance when I do it against the
murder rate in the city. No significance at
all. p equals 0.5 when I do it against the
murder rate in the metro. So think about why I did
that, why that's convincing. And I think if we have time,
we'll talk about that some more in the next lecture. I want to do one more example of
apples to oranges comparisons. So after the
election, this issue of how many people came
to the inauguration turned into a huge deal-- which
is remarkably stupid and one could call
bullshit simply on caring about that because
it doesn't really have anything to do with someone's efficacy
serving as president. But it seemed very
important to people, including our president. And so it ended up
being talked about an enormous amount.
People say, oh, so many people came to
Obama's inauguration, and no one went to Trump's. And it's so
terrible-- or so good, depending on what news
source you're reading. And so here's a
conservative news outlet. And they say, the
media is so unfair, because the mainstream media
has ignored the fact that eight times more people watched
Trump's inauguration over streaming video than
watched Obama's. Now think about
that for a second. Just think about
it for a second. What's wrong with that? Streaming video was
hardly a thing in 2009. They didn't have streaming
video, or if they had it, you couldn't afford it
with the data charges. So I just grabbed a
couple of quick graphs. Here's internet video by
terabytes from 2010 to 2015. Here's mobile video. And so these things
are exploding. The point is, of course,
Obama had fewer viewers on streaming media
because people weren't using streaming media yet. More people drove to Trump's
inauguration in a Tesla as well, because they
weren't released for Obama's. So there you go. See the kind of energy-- Zero-carbon people love Trump. So now you know. And then to finish
this off, these guys-- I don't think they really
helped their case much with the way they ended this. They say, "The press left out
some important differences. Most importantly, millions
watched the inauguration on TV and streaming media-- probably millions
in Russia alone." I guess I would have
left that detail out, if I'd been writing for them. But there you go. So one of the things I
really want to stress-- in this whole
lecture, we've looked at a bunch of
statistics, and we're going to continue to
look at statistics. And in the process
of course, we're going to look at a bunch of big
data algorithms, all of that. But nowhere in that
process have we really dug in to what's going
on in the algorithm. We haven't criticized
the fine details of a particular chi-square test
and how many degrees of freedom were used, because a lot of
the time you don't need to. Very few of us are
going to be trained as professional statisticians
and able to really dig into that-- or as professional
data scientists like Jevin. But we can call bullshit
on work of guys like this simply by looking into what's
going into this algorithm and what's coming out. And so that's what we've
been trying to do today. What are you putting in? Are the data reasonable? Are they fair? Are the comparisons reasonable? How did they get those data? Are they pertinent to the
claims that are being made? What's the output? Does the output even make sense? Is the output the right
order of magnitude? And if it does make sense,
does it support the claim that somebody's making-- we should get rid
of food stamps. Or does it actually
refute the claim? It seems to me that if you've
got a government agency running 10 times as efficiently
with respect to fraud as the free market,
that might not be an argument
against that agency. There may be other
arguments against it, and people can find those. But thinking carefully
about these outputs is really important. So in the class, we're
really going to focus over here and over here. And I think you'll be
amazed at how much you can do without having to dig in. Not that that isn't fun as well. JEVIN WEST: That's something
you'll hear in every machine learning class. And I bet most of you haven't
taken a machine learning class. In my class or in any
machine learning class, there's this adage,
garbage in, garbage out. And one of the things that
really excited Carl and I about teaching this
class was the fact that we think that
we can teach you the same skills without a PhD in
machine learning or statistics. We think that you guys-- and some of you do have that
background and that skill level-- but you don't need that. You can call BS without those
really advanced degrees. [MUSIC PLAYING]
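
Supplementary note on the juice example: the p value quoted there (less than 0.01 for 21 of 25 tasters) can be checked by hand. The sketch below is one way to do it, not necessarily how the lecturer computed it. It assumes a two-sided test against a null hypothesis of no preference (each taster picks either juice with probability 0.5); the lecture does not say whether the test was one-sided or two-sided. The function name exact_binomial_two_sided is made up for this illustration.

```python
from math import comb


def exact_binomial_two_sided(successes, trials, p_null=0.5):
    """Two-sided exact binomial p-value.

    Sums the probability of every outcome that is no more likely
    than the observed one under the null hypothesis.
    """
    def pmf(k):
        return comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)

    observed = pmf(successes)
    # A small relative tolerance guards against floating-point ties.
    cutoff = observed * (1 + 1e-7)
    return sum(pmf(k) for k in range(trials + 1) if pmf(k) <= cutoff)


# 21 of 25 subjects said the Martinelli's tasted sweeter.
p_value = exact_binomial_two_sided(21, 25)
print(f"p = {p_value:.5f}")  # p = 0.00091
```

The arithmetic holds up: the p value is well under 0.01. That is the point of the example. The statistics are fine, and the conclusion is still wrong, because the inputs compare apple juice with orange juice.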