YUFENG GUO: Natural language has many unique challenges that separate it from other
data types like images and structured data. So it requires a slightly
different approach. Today, we'll explore
a foundational piece of modeling natural language,
called "bag of words." What does it mean? And how do we use
it to process text? Stay tuned to find out. [THEME SONG] Welcome to "AI
Adventures," where we explore the art, science,
and tools of machine learning. My name is Yufeng Guo. And on this episode,
we're going to look at how to use bag of words
to classify natural language. Natural language is
special because it has structure inherent in the
language while at the same time being very free-form. There are many ways you
can say the same thing. And you can also say
very similar words, and yet mean very
different things. So in much of
machine learning, we aim to turn our data
into matrices or tensors. This is very natural for
images since that's already their inherent representation. Structured data often
meets a similar fate, with numbers in a spreadsheet
mapping very directly to input matrix values. But with natural
language, we need to somehow find a way to turn
words into numbers so we can stick them into those matrices. There are many ways
that we can do this. And today, we'll focus on an
approach called "bag of words." Let's pretend for a moment
that we're learning English for the first time ever. And for some reason,
the first words we have chosen to learn
in our entire vocabulary are these 10 shown here-- words like "dataframe" and
"graph," "plot," "color," and "activation." And so we want to
be able to identify, given some arbitrary
text, whether that topic is about pandas,
Keras, or Matplotlib. How might we do that? Perhaps if we looked at
a sentence, like "how to plot dataframe
bar graph," we would recognize just the words "plot,"
"dataframe," and "graph." The rest of the sentence would
look like a foreign language, just gibberish. Knowing only those three
words in this sentence, though, we might
still be able to get some sense of what it's about. And the way you might capture
this information in an array or matrix would be to first
make an array that represents your entire vocabulary. So in this case, we have
an array of just length 10. We'd set all those
values to 0 and turn on the array indices
that correspond to the words in the sentence
by setting them to 1. Notice that this has
nothing to do with the order the words appear in
the input sentence, but everything to do with
the order of the words in our vocabulary list. So now we've encoded or
translated the English sentence into an array of numbers
based on our somewhat limited understanding of English. The words we don't
recognize, we'll just ignore. Notice that this has the
effect of scrambling up the order of the words,
like, say, a bag of words. Of course, we should do
Of course, we should do the same for our labels. This is much simpler, since
there are only three of them. In our case, however, some sentences have more than one label attached to them at the same time, since a sentence can talk about multiple topics at once. In that case, we want to set
all the relevant indices to 1, leaving the rest as 0, just
like we did for the words from our training data.
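The labels get the same treatment. Here's a small sketch; the ordering of the three labels is an assumption:

```python
labels = ["pandas", "keras", "matplotlib"]

def encode_labels(topics, labels):
    """Multi-hot encode: set every relevant label index to 1, the rest to 0."""
    return [1 if label in topics else 0 for label in labels]

# A sentence about plotting a dataframe touches both pandas and matplotlib.
print(encode_labels({"pandas", "matplotlib"}, labels))  # [1, 0, 1]
```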
Now we've turned the inputs as well as the outputs, which
both used to be words, into arrays of numbers. And we can let machine
learning do what it does best-- map one set of numbers to
another set of numbers. All the heavy lifting is
done in the preprocessing, as we transform or encode that text into numerical representations. Bag of words is a pretty simple
approach to this task, though it might surprise you how well it works in some situations. How might we build a bag of words model with code? Keras has a convenient
preprocessing library that we can use to handle
much of this for us. Using the Tokenizer
class, we can select the size
of the vocabulary we'd like to utilize. In our example, we just had 10
words, which is quite small, but in our code, let's choose
something bigger, like 400. This will then be fit on the entire body of text from your training data, selecting the most common 400 words.
With the tokenization process complete, building the model becomes quite
straightforward and similar to working with other
structured data. Since each row is now just
an input of 1s and 0s, using something as simple as
a standard, fully connected, deep neural network
can be quite effective. If you're doing multi-label classification, where more than one label might be true for a single input, as we do here, be sure to choose a sigmoid activation instead of the more common softmax activation function, and pair it with binary cross-entropy loss.
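Putting those pieces together, a fully connected model for this multi-label setup might look like the sketch below; the hidden layer sizes are arbitrary choices, not something from the video:

```python
from tensorflow.keras import layers, models

num_words = 400   # vocabulary size chosen for the tokenizer
num_labels = 3    # pandas, keras, matplotlib

model = models.Sequential([
    # Hidden layer sizes are assumptions for illustration.
    layers.Dense(128, activation="relu", input_shape=(num_words,)),
    layers.Dense(64, activation="relu"),
    # Sigmoid (not softmax) so each label gets an independent probability.
    layers.Dense(num_labels, activation="sigmoid"),
])

# Binary cross-entropy treats each label as its own yes/no decision.
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```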
So there you have it-- the bag of words model in a nutshell. Understanding how
bag of words works and its advantages
and drawbacks can help you build your foundation
in natural language processing as you move on to more advanced
approaches to encoding text. For more details
and examples, be sure to check out the expanded
blog post I have linked below in the description. Thanks for watching this episode
of "Cloud AI Adventures." And if you enjoyed
it, please like it and subscribe to get all
the latest episodes right when they come out. For now, get started on your
natural language processing journey by checking
out the TensorFlow word embedding tutorial
I've linked below in the description. [MUSIC PLAYING]