[MUSIC PLAYING] KARMEL ALLISON: Hi, and
welcome to Coding TensorFlow. I'm Karmel Allison, and
I'm here to guide you through a scenario using
TensorFlow's high-level APIs. This video is the first
in a three-part series. In this, we'll look at
data and, in particular, how to prepare and load your
data for machine learning. The rest of the series is
available on this channel, so don't forget to hit
that subscribe button. Building a machine learning
model is a multi-stage process. You have to collect, clean, and
process your data, prototype and iterate on your
model architecture, train and evaluate
results, prepare your model for production
serving, and then, you have to do it all over
again because the model is a living thing that will have
to be updated and improved. TensorFlow high-level APIs
aim to help you at each stage of your model's lifecycle-- from the beginning of an
idea to training and serving large-scale applications. In this series, I will
walk through the key steps in developing a
machine learning model and show you what TensorFlow
provides for you at each step. And then, I'll also cover
some of the new developments that we are working
on to continue to improve your workflow. We start with the problem and an associated data set. We will use the covertype data set from the US Forest Service and Colorado State University, which has about 500,000 rows of
geophysical data collected from particular regions
in National Forest areas. We are going to use the
features in this data set to try to predict the wilderness area that each region belongs to. And there is a mix of features
that we'll be working with. Some are real values-- elevation, slope,
aspect, and so on. Some are real values
that have been binned into an 8-bit scale, and
some are categorical values that assign integers to soil
types and wilderness area names. If we inspect the first
couple rows of our data, this is what we see-- integers, no header, so we have
to work from the Info file. OK, so here we can see that we
have some of our real values, and it looks like some
of the categorical values are one-hot encoded, and
some are just categories. Some features span
multiple cells, so we'll have to handle that. Where do we start? What's the first thing
we should do here? I'm going to suggest to you
that when you're prototyping a new model in TensorFlow, the
very first thing you should do is enable eager execution. It's simple. You just add a single line
after importing TensorFlow, and you're good to go.
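[Code sketch -- a minimal example of the single line described here, using the TensorFlow 1.x API that was current when this video was made (in TensorFlow 2.x, eager execution is on by default):

import tensorflow as tf

# Run ops immediately instead of building a deferred graph.
tf.enable_eager_execution()
]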
Rather than deferring execution of your TensorFlow graph, eager execution runs ops immediately. The result is that you can
write your models in eager while you're experimenting
and iterating, but you still get the full
benefit of TensorFlow graph execution when it comes
time to train and deploy your model at scale. The first thing we're
going to want to do is load our data in and process
the data in columns so that we can feed it into a model. The data is a CSV file with
55 columns of integers. We'll go over each of
those in detail in a bit, but first we will use
the TensorFlow CSV data set to load our data from disk. This particular data set doesn't
have a header, but if it did, we could process that as
well with the CSV data set. Now, a TensorFlow data
set is similar to a NumPy array or a Pandas DataFrame
in that it reads and processes data. But instead of being optimized
for in-memory analysis, it is designed to take data and run the set of operations necessary to process and consume that data for training. Here, we are telling TensorFlow
to read our data from disk, parse the CSV, and
process the incoming data as a vector of 55 integers.
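[Code sketch -- one way to load the file with the tf.data CSV data set; the file name is hypothetical, and the call shown is tf.data.experimental.CsvDataset from the TensorFlow 1.x line:

# Read the CSV from disk and parse each record as 55 int32 values.
defaults = [tf.int32] * 55
dataset = tf.data.experimental.CsvDataset(
    ['covtype.csv'], record_defaults=defaults, header=False)
]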
Because we are running with eager execution enabled, our data set here already represents our data, and we can even check to
see what each row currently looks like. If we take the first row, we can
see that right now, each row is a tuple of 55 integer tensors-- not yet processed,
batched, or even split into features and labels.
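[Code sketch -- with eager execution enabled we can iterate the data set directly and look at a row:

# Print the first row: a tuple of 55 scalar integer tensors.
for row in dataset.take(1):
    print(row)
]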
So we have tuples of 55 integers, but we want our data to reflect
the structure of the data we know is in there. For that, we can
write a function to apply to our
data set row by row. This function will
take in the tuple of 55 integers in each row. A data set is expected to return
tuples of features and labels. So our goal with each
row is to parse the row and return the set of features
we care about plus a class label. So what needs to
go in-between here?
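[Code sketch -- the empty shell of that parsing function; the name _parse_csv_row is just for illustration:

def _parse_csv_row(*vals):
    # vals is the tuple of 55 scalar integer tensors for one row.
    # The body, filled in further below, must return a (features, class_label) pair.
    ...
]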
This function is going to be applied at runtime to each row of data, but it
will be applied efficiently by TensorFlow data sets. So this is a good place
to put things like image processing or
adding random noise or other special
transformations. In our case, we will handle
most of our transformations using feature columns which
I will explain more in a bit, so our main goal in
the parsing function is to make sure we
correctly separate and group our columns of features. So for example, if you read over
the details of the data set, you will see that the soil type
is a categorical feature that is one-hot encoded. It is spread out over
40 of our integers. We combine those here into
a single length-40 tensor so that we can learn soil type
as a single feature rather than 40 independent features. Then we can combine
the soil-type tensor with the other features
which are spread out over the set of 55 columns
in the original data set. We can slice the tuple
of incoming values to make sure we get
everything we need. And then we zip those up with
human-readable column names to get a dictionary
of features that we can process further later. Finally, we convert our one-hot encoded wilderness area into a class label
that is in the range 0 to 3. We could leave them
one-hot encoded as well, and for some
model architectures or loss calculations that
might be preferable. And that gives us features
and a label for each row. We then map this function
to our data row-wise, and then we batch the rows
into sets of 64 examples.
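[Code sketch -- a version of the parsing, mapping, and batching described above; the column names and index positions follow the covertype info file and are assumptions rather than the exact code shown on screen:

# Human-readable names for the features we keep: 10 real-valued columns,
# the combined soil type, and the cover type column.
col_names = [
    'elevation', 'aspect', 'slope',
    'horizontal_distance_to_hydrology', 'vertical_distance_to_hydrology',
    'horizontal_distance_to_roadways',
    'hillshade_9am', 'hillshade_noon', 'hillshade_3pm',
    'horizontal_distance_to_fire_points',
    'soil_type', 'cover_type',
]

def _parse_csv_row(*vals):
    # Columns 14-53 are the one-hot encoded soil type; stack them into a
    # single length-40 tensor so soil type is learned as one feature.
    soil_type = tf.stack(vals[14:54])
    # Slice out the 10 real-valued columns, add the combined soil type and
    # the cover type column, and zip with the names above.
    features = dict(zip(col_names, vals[:10] + (soil_type, vals[54])))
    # Columns 10-13 are the one-hot encoded wilderness area; argmax turns
    # that into a class label in the range 0 to 3.
    class_label = tf.argmax(tf.stack(vals[10:14]), axis=0)
    return features, class_label

# Apply the parsing function row by row, then batch into sets of 64.
dataset = dataset.map(_parse_csv_row).batch(64)
]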
Using TensorFlow data sets here allows us to take advantage of
many built-in performance optimizations that
data sets provide for this type of
mapping and batching to help remove I/O bottlenecks. There are many other tricks for
I/O performance optimization, depending on your system,
that we won't cover here, but a guide is included
in the description below. Because we're using
eager execution, we can check to see what our
data looks like after this, and you can see that now we
have parsed dictionaries of ints with nice human-readable names. Each feature has been batched. So a feature that is a single
number is a length-64 tensor, and we can see that our
conversion of soil type results in a tensor with
a shape of 64 by 40. We can also see that we have
a single tensor for the class labels which has the
category indices as expected.
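[Code sketch -- checking a batch with eager execution, using the hypothetical column names from the sketch above:

# Pull one batch and check the shapes of the parsed features and labels.
features, labels = next(iter(dataset))
print(features['elevation'].shape)  # (64,)    one value per example
print(features['soil_type'].shape)  # (64, 40) batched length-40 tensors
print(labels.shape)                 # (64,)    class indices 0 to 3
]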
Just to keep our eyes on the big picture here, let's see where we are. We've taken our
raw data and put it into a TensorFlow data set
that generates dictionaries of feature tensors and labels. But something is still
wrong with the integers we have as features here. Anyone care to venture a guess? We have lots of feature
types-- some are continuous, some are categorical,
some are one-hot encoded. We need to represent
these in a way that is meaningful to an ML model. You'll see how to fix
that using feature columns in part two of this
series right here on YouTube. So don't forget to hit
that subscribe button and I'll see you there. [MUSIC PLAYING]