[♪ INTRO ] Data. The word is everywhere these days. Every company is dying to tell you about its
big data, data analytics, data privacy, data warehouse, data lake, data data data data. At the center of the data mania is data mining—the
practice of sifting through all those piles of information for insights. Data mining recently made big news with the
Cambridge Analytica scandal. The political consultancy reportedly sucked
up data about millions of Facebook users without their knowledge, then used it to profile and
sway voters in the US, UK, and elsewhere. And similar techniques let companies like
Amazon, Facebook, and Google work out what we want to see or buy—sometimes with shocking
accuracy. It’s a little creepy. It’s not just ads and politics, either. Data mining allows airlines to predict who’s
going to miss a flight; it tells big-box stores who’s pregnant; it helps doctors spot fatal
infections; and it’s even enabled cell phone companies to predict massacres in the Congo. The power of data mining and the hype surrounding
it can make it sound like a magic wand—one that will either save your business or sink
democracy. Of course, data mining doesn’t really involve
any unicorn hair or phoenix tail feathers. It’s just applied statistics, searching
lots of data points for patterns that humans might not spot. Those patterns are based not on human intuition,
but on whatever the data suggests, so sometimes they can seem incredibly subtle or even alien. But there’s no more magic in data mining
than there is in a weather forecast. In fact, data mining is a lot like meteorology. Meteorologists aim for two things: first,
they want to describe patterns in the weather—to boil down its massive complexity into a few
numbers and equations. And second, they want to predict Tuesday’s
weather. That’s the whole point. Similarly, Spotify’s data scientists might
be interested in describing medieval rock fans, recognizing them as a group distinct
from nerdcore or freak folk fans — yes, that's a real subgenre. Ultimately, though, what’s most important
to companies like Spotify is predicting what each person wants to listen to. The key with data mining is that it achieves
description and prediction not through careful study by experts, but by analyzing large amounts
of data. In Spotify’s case, that might mean scanning
for patterns in genre labels, acoustic attributes, Internet reviews, and anything else about
each track, plus the age, location, friend group, and other scraps of information about
each user. Data mining is more about spotting patterns
than explaining them. Of course, the words “pattern” and “data”
can mean just about anything. There are no clear definitions for data mining,
data science, or big data, and they’re sometimes used interchangeably with each other or with
machine learning. That’s why it’s so easy to slap these
buzzwords onto any project for instant venture capital karma. That being said, a few types of techniques
consistently earn the “data mining” label. The most broadly applicable one is classification,
where you try to categorize things. For example, Target famously realized as early
as 2002 that they could guess who was pregnant and send them baby-related coupons. That’s a textbook classification problem:
Target needed to assign each customer to one of two categories: either “probably pregnant”
or “probably not pregnant.” Classification typically works in several
stages. First, each example, or instance, has to be
broken down into a collection of numerical attributes, or features. For a store like Target, an instance might
be your mom 7 months before you were born. The features would be things like “How many
bottles of unscented lotion did she buy in the last three months? How about in the quarter before that?” And likewise for zinc supplements, Asian pears,
and every other product in the inventory. The store would also need labels for some
chunk of the data—the ground truth about whether those customers were pregnant. Target got those labels from baby registries
and due dates customers had shared. Once the data’s all lined up, it’s time
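To make that setup concrete, here's a minimal Python sketch of instances, features, and labels. The products, counts, and labels are all invented for illustration:

```python
# Turn each customer's shopping history into a fixed-order feature vector,
# paired with a ground-truth label. All names and numbers are made up.
def featurize(purchases):
    products = ["unscented lotion", "zinc supplements", "asian pears"]
    return [purchases.get(p, 0) for p in products]

# instance: (purchase counts for the last quarter, label from the registry)
customers = [
    ({"unscented lotion": 4, "zinc supplements": 2}, "pregnant"),
    ({"asian pears": 1}, "not pregnant"),
]

dataset = [(featurize(p), label) for p, label in customers]
print(dataset)  # [([4, 2, 0], 'pregnant'), ([0, 0, 1], 'not pregnant')]
```

A real feature list would cover every product in the inventory, but the shape of the training data is the same.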
for training. That’s where the system tries to tease out
patterns from all the labeled examples. Learning to classify is such a basic, common
need that dozens of algorithms (the mathematical procedures that computer programs follow) have
been devised for it. Which algorithm works best depends on all
kinds of factors, like how many categories there are and how different features are connected
to each other. But many classification algorithms are similar
in that they treat each feature as a drop of evidence for one category or the other. The features get weights indicating how strongly
they boost or weaken someone’s chances of falling into the “yes” category — that
they are pregnant, for example. Those weights are what the system learns during
training. Basically, it’s figuring out how informative
each attribute is. Finally, to classify instances the system
hasn’t seen before, it puts together all the weighted contributions, and maybe stuffs
the resulting number through a bit of mathematical machinery to slide it up or down. If the result is negative, that instance goes
in the “no” bucket. If it’s positive—load up the crib coupons! Each individual feature doesn’t tell you
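One of the simplest classifiers that works this way is the perceptron: during training it nudges a weight for each feature whenever it gets a labeled example wrong, and afterwards it classifies by the sign of the weighted sum. A minimal sketch, with invented features and labels:

```python
# A minimal perceptron: learn one weight per feature from labeled
# examples, then classify new instances by the sign of the weighted sum.
# Features and labels here are invented for illustration.
def train(examples, epochs=20, lr=0.1):
    w = [0.0] * len(examples[0][0])
    b = 0.0
    for _ in range(epochs):
        for features, label in examples:  # label is +1 or -1
            score = sum(wi * xi for wi, xi in zip(w, features)) + b
            if label * score <= 0:  # misclassified: nudge weights toward label
                w = [wi + lr * label * xi for wi, xi in zip(w, features)]
                b += lr * label
    return w, b

def classify(w, b, features):
    score = sum(wi * xi for wi, xi in zip(w, features)) + b
    return 1 if score > 0 else -1  # positive score: the "yes" bucket

# Features: [lotion purchases, zinc purchases]; +1 means "probably pregnant"
labeled = [([4, 2], 1), ([5, 1], 1), ([0, 0], -1), ([1, 0], -1)]
w, b = train(labeled)
print([classify(w, b, x) for x in ([3, 2], [0, 1])])  # [1, -1]
```

Real systems use fancier algorithms (logistic regression, decision trees, neural networks), but the weigh-and-sum skeleton is the same.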
much. In fact, many turn out to be irrelevant. But together they can be really powerful. Target’s approach worked so well that when
one customer complained that his teenage daughter was getting coupons for baby clothes, he ended
up apologizing to Target. Turned out the company knew about his daughter’s
pregnancy before he did! Classification is useful any time you want
to tell one group of things from another. Insurance companies use it to guess which
elderly patients will die soon so that they can start end-of-life counseling. Doctors use it to check whether premature
babies are developing dangerous infections, since the classifier can put together subtle
disease indicators before humans would notice any signs. I could spend all day listing uses for classification,
but it’s far from the only type of data mining. One close cousin is known as regression. And no, that doesn’t mean deciding you like
Limp Bizkit again. In regression, instead of predicting a category,
the goal is to predict a number. Take Target again. They wanted to know not just whether each
customer was pregnant, but when to send each coupon. So they managed to estimate due dates, too. That’s a regression question—how many
weeks until the customer gives birth. Regression often depends on dozens or even
thousands of variables—the features that describe each example. It finds an equation or curve to fit the data
points, telling you how high you’d expect the curve to be given any arbitrary input. Or in this case, how far away you’d expect
the customer’s due date to be. Like in classification, many regression techniques
give each feature a weight, then combine the positive and negative contributions from the
weighted features to get an estimate. And, also like classification, regression
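The simplest version is fitting a straight line by least squares. Here's a toy sketch with a single feature; the purchase counts and due-date numbers are invented:

```python
# Ordinary least squares with one feature: fit y = w*x + b, then the
# fitted line predicts a number for any input. Data points are invented:
# x = recent lotion purchases, y = weeks until the due date.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx  # slope and intercept

xs = [1, 2, 3, 4]
ys = [28, 24, 20, 16]  # a perfectly linear toy relationship: y = -4x + 32
w, b = fit_line(xs, ys)
print(w, b)         # -4.0 32.0
print(w * 2.5 + b)  # 22.0 -- predicted weeks for a new customer
```

With thousands of features the arithmetic gets heavier, but the idea (weight each feature, add up the contributions) doesn't change.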
is used everywhere. One of the better-known examples is Google
Flu Trends. In 2008, it began publishing real-time estimates
of how many people had the flu based on searches for words like “fever” and “cough.” Regression is also part of predictive policing
software — programs that look at historical data to guess how likely a crime is to occur
in each area. The third major data mining technique is clustering. As the name suggests, the goal here is to
group data points in a way that helps with the analysis. In the marketing world, clustering emerged
in the 1980s—well before data mining—with the work of a market researcher named Howard
Moskowitz. He struck gold when he realized there wasn’t
one best pasta sauce. Consumers showed three distinct types of preferences—and
the previously unrecognized group that craved extra-chunky turned out to be worth millions. Clustering is often used to analyze market
segmentation like this, but to understand how the techniques work, let’s take a different
example: eBay. On ebay, you can get millions of products,
from antiques to zip ties. Even within a single category, like electronics,
the selection is overwhelming. So eBay organizes things into subcategories. But it’s a pain for humans to trawl through
all the electronics, identify subcategories, and assign every product to a subcategory. Instead, the company can use clustering to
automatically group the products. Again, each product first has to be broken
down into numerical features, like how many times “printer” appears in the description,
or who manufactured it. The simplest clustering method is to guess
how many distinct subcategories there should be. Then you randomly lump items together into
that many clusters, and keep shifting items between groups to make each cluster tighter. In the end, similar products end up settling
into clusters together. But we don’t have to stop there! The blue and silver versions of the same camera
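That shuffle-until-tight procedure is essentially the classic k-means algorithm. A bare-bones sketch, using invented two-dimensional feature vectors for two obvious product groups:

```python
# Bare-bones k-means: assign each point to the nearest cluster center,
# recompute the centers, and repeat. Points and starting centers are invented.
def kmeans(points, centers, rounds=10):
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        for p in points:  # put each item in its tightest-fitting cluster
            dists = [sum((a - c) ** 2 for a, c in zip(p, ctr)) for ctr in centers]
            clusters[dists.index(min(dists))].append(p)
        centers = [[sum(col) / len(cl) for col in zip(*cl)] if cl else ctr
                   for cl, ctr in zip(clusters, centers)]
    return clusters

points = [[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]]
print(kmeans(points, centers=[[0, 0], [1, 1]]))
```

The two tight groups fall out after a couple of rounds; product-catalog versions just run this with many more features and clusters.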
don’t really deserve separate listings; they’re variants of the same product. So in addition to subcategories, it would
be nice to find listings to merge. Sites like eBay can do both simultaneously
with a technique called hierarchical clustering. Rather than a single set of categories, hierarchical
clustering produces a sort of taxonomic tree. For example, it might find that cameras are
much more like each other than like TVs. But within cameras, the DSLRs and point-and-shoots
each get their own subgroup, albeit slightly less distinct ones. And within those are many different models,
each with a few variants. Companies like Cambridge Analytica use these
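Here's a tiny sketch of the bottom-up (agglomerative) version: start with every item in its own cluster, then repeatedly merge the two closest clusters, which traces out the tree. The items and their single "price" feature are invented:

```python
# Agglomerative clustering: repeatedly merge the two closest clusters
# (single linkage), recording each merge to build the tree bottom-up.
# Items and their one "price" feature are invented.
def hierarchical(feats):
    clusters = [(name,) for name in feats]
    merges = []
    while len(clusters) > 1:
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: min(abs(feats[x] - feats[y])
                                      for x in clusters[ab[0]]
                                      for y in clusters[ab[1]]))
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

items = {"cam_blue": 200, "cam_silver": 201, "dslr": 230, "tv": 500}
for merge in hierarchical(items):
    print(merge)
```

The camera variants merge first, then the cameras as a group, and the TV joins last, mirroring the taxonomy described above.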
techniques to look for groups of voters who will respond to the same kinds of advertising,
and Spotify can use them to guess who will like similar music. The fourth staple of data mining is anomaly
detection. It’s basically a special case of classification—identifying
instances that are unusual or worrisome. The IRS uses anomaly detection to spot likely
tax evaders, and credit card companies use it to flag transactions that don’t fit your
usual buying habits. It also helps industries with heavy-duty equipment. For instance, power companies and airlines
can see when a generator or jet engine is starting to vibrate differently than usual. Some anomalies can be detected just by looking
for deviations from averages. Fancier techniques include looking for instances
that don’t match any cluster, or comparing instances with the closest other examples
to see if their feature values are far off. Finally, association learning reveals which
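The deviation-from-average version fits in a few lines. The sensor readings below are invented (say, daily vibration levels from one engine):

```python
# Flag readings that sit more than `threshold` standard deviations from
# the mean: the simplest anomaly detector. Readings are invented.
def anomalies(readings, threshold=2.0):
    mean = sum(readings) / len(readings)
    variance = sum((x - mean) ** 2 for x in readings) / len(readings)
    std = variance ** 0.5
    return [x for x in readings if abs(x - mean) > threshold * std]

readings = [10.1, 9.8, 10.0, 10.2, 9.9, 25.0]
print(anomalies(readings))  # [25.0]
```

Cluster-based and nearest-neighbor detectors follow the same pattern: score each instance, then flag the ones whose score falls far outside the norm.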
birds are of a feather. The idea is to look through, say, millions
of grocery store purchases to see what gets bought together and when. A classic example is the Osco drug store chain,
which once found that many customers bought beer and diapers together on Friday evenings. Contrary to popular legend, the store never
acted on this profound insight, but stores regularly use observations like this to optimize
their floor layouts and inventory. For instance, Walmart discovered that shoppers
buy lots of Pop Tarts immediately before hurricanes, so it started to stock up. Association learning has broader applications,
too. CellTel, an African cell phone company, realized
it could spot impending massacres in the Congo when everyone nearby started buying prepaid
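At its core, association learning is just counting co-occurrences. A toy sketch on invented shopping baskets:

```python
# Count how often item pairs are bought together (support), and how often
# one item implies another (confidence). Baskets are invented.
from collections import Counter
from itertools import combinations

baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "chips"},
]

item_counts, pair_counts = Counter(), Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

# "diapers -> beer": of the baskets containing diapers, how many had beer?
support = pair_counts[("beer", "diapers")] / len(baskets)
confidence = pair_counts[("beer", "diapers")] / item_counts["diapers"]
print(support, confidence)  # 0.5 1.0
```

Real systems (the classic one is the Apriori algorithm) prune the exponential space of item combinations, but support and confidence are the same idea.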
phone cards. The five strategies we’ve covered—classification,
regression, clustering, anomaly detection, and association learning—form the backbone
of data mining. What makes them so powerful is that they offer
standard mathematical tools you can use for everything from curating Facebook feeds to
optimizing store layouts. But that ease of use can also lead people
astray. Data mining is just one step in the process
of extracting knowledge from data—and it’s all too easy to whip out an algorithm without
carefully selecting the data, massaging it into the right form, and considering how to
interpret the results. Remember Google Flu Trends? It shut down after a few years, but not because
the algorithm was broken. Search autocompletion had totally thrown
off the data, and engineers had given it too much leeway to interpret seasonal words like
“snow” as evidence of the flu. Then there are the queasy social implications
of sharing data in the first place, and of letting companies form such an intimate understanding
of our behavior. In other words … the creep factor. So as powerful as it is, the math of data
mining is just the beginning. Sometimes the hardest part is all the messy
human stuff. Thanks for watching this episode of SciShow! If you’re interested in the ways companies
can use psychology to learn even more about you from your data, you can check out our
video about that over on the SciShow Psych channel. [♪ OUTRO ]