YUFENG GUO: Natural language has many unique challenges that separate it from other
data types like images and structured data. So it requires a slightly
different approach. Today, we'll explore
a foundational piece of modeling natural language,
called "bag of words." What does it mean? And how do we use
it to process text? Stay tuned to find out. [THEME SONG] Welcome to "AI
Adventures," where we explore the art, science,
and tools of machine learning. My name is Yufeng Guo. And on this episode,
we're going to look at how to use bag of words
to classify natural language. Natural language is
special because it has structure inherent in the
language while at the same time being very free-form. There are many ways you
can say the same thing. And you can also say
very similar words, and yet mean very
different things. So in much of
machine learning, we aim to turn our data
into matrices or tensors. This is very natural for
images since that's already their inherent representation. Structured data often
meets a similar fate, with numbers in a spreadsheet
mapping very directly to input matrix values. But with natural
language, we need to somehow find a way to turn
words into numbers so we can stick them into those matrices. There are many ways
that we can do this. And today, we'll focus on an
approach called "bag of words." Let's pretend for a moment
that we're learning English for the first time ever. And for some reason,
the first words we have chosen to learn
in our entire vocabulary are these 10 shown here-- words like "dataframe" and
"graph," "plot," "color," and "activation." And so we want to
be able to identify, given some arbitrary
text, whether that topic is about pandas,
Keras, or Matplotlib. How might we do that? Perhaps if we looked at
a sentence, like "how to plot dataframe
bar graph," we would recognize just the words "plot,"
"dataframe," and "graph." The rest of the sentence would
look like a foreign language, just gibberish. Knowing only those three
words in this sentence, though, we might
still be able to get some sense of what it's about. And the way you might capture
this information in an array or matrix would be to first
make an array that represents your entire vocabulary. So in this case, we have
an array of just length 10. We'd set all those
values to 0 and turn on the array indices
that correspond to the words in the sentence
by setting them to 1. Notice that this has
nothing to do with the order the words appear in
the input sentence, but everything to do with
the order of the words in our vocabulary list. So now we've encoded or
translated the English sentence into an array of numbers
based on our somewhat limited understanding of English. The words we don't
recognize, we'll just ignore. Notice that this has the
effect of scrambling up the order of the words,
like, say, a bag of words. Of course, we should do
Of course, we should do the same for our labels. This is much simpler, since
there are only three of them. In our case, however, some sentences have more than one label attached to them at the same time, since a sentence can talk about multiple topics at once. In that case, we want to set
all the relevant indices to 1, leaving the rest as 0, just
like we did for the words from our training data.
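The labels get the same treatment. Here's a small sketch; the ordering of the three labels is an assumption:

```python
labels = ["pandas", "keras", "matplotlib"]

def encode_labels(topics, labels):
    """Multi-hot encode: set every relevant label index to 1, the rest to 0."""
    return [1 if label in topics else 0 for label in labels]

# A sentence about plotting a dataframe touches both pandas and matplotlib.
print(encode_labels({"pandas", "matplotlib"}, labels))  # [1, 0, 1]
```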
Now we've turned the inputs as well as the outputs, which
both used to be words, into arrays of numbers. And we can let machine
learning do what it does best-- map one set of numbers to
another set of numbers. All the heavy lifting is
done in the preprocessing, as we transform or encode that text into numerical representations. Bag of words is a pretty simple
approach to this task, though it might surprise you how well it works in some situations. How might we build a bag of words model with code? Keras has a convenient
preprocessing library that we can use to handle
much of this for us. Using the Tokenizer
class, we can select the size
of the vocabulary we'd like to utilize. In our example, we just had 10
words, which is quite small, but in our code, let's choose
something bigger, like 400. This will then be fit on the entire body of text from your training data, selecting the most common 400 words.
With the tokenization process complete, building the model becomes quite
straightforward and similar to working with other
structured data. Since each row is now just
an input of 1s and 0s, using something as simple as
a standard, fully connected, deep neural network
can be quite effective. If you're doing multi-label classification, where more than one label might be true for a single input, as we do here, be sure to choose a sigmoid activation instead of the more common softmax activation function, and pair it with binary cross-entropy loss.
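Putting those pieces together, a fully connected model for this multi-label setup might look like the sketch below; the hidden layer sizes are arbitrary choices, not something from the video:

```python
from tensorflow.keras import layers, models

num_words = 400   # vocabulary size chosen for the tokenizer
num_labels = 3    # pandas, keras, matplotlib

model = models.Sequential([
    # Hidden layer sizes are assumptions for illustration.
    layers.Dense(128, activation="relu", input_shape=(num_words,)),
    layers.Dense(64, activation="relu"),
    # Sigmoid (not softmax) so each label gets an independent probability.
    layers.Dense(num_labels, activation="sigmoid"),
])

# Binary cross-entropy treats each label as its own yes/no decision.
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```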
So there you have it-- the bag of words model in a nutshell. Understanding how
bag of words works and its advantages
and drawbacks can help you build your foundation
in natural language processing as you move on to more advanced
approaches to encoding text. For more details
and examples, be sure to check out the expanded
blog post I have linked below in the description. Thanks for watching this episode
of "Cloud AI Adventures." And if you enjoyed
it, please like it and subscribe to get all
the latest episodes right when they come out. For now, get started on your
natural language processing journey by checking
out the TensorFlow word embedding tutorial
I've linked below in the description. [MUSIC PLAYING]