[MUSIC PLAYING] ALEX VAYSBURD: Hi. I'm Alex Vaysburd. I am a software
engineer at Google. I work on machine learning
models for Google ads. And before that, I was a
quant and portfolio manager for a number of years at some
of the largest hedge funds. MIKE STONE: And, hi. Welcome. I'm Mike Stone. I'm a global technology
director at Thomson Reuters. I worked with some of our
largest global clients and focused on helping them
advance their innovation and technology strategies. I'm excited to be
here with Alex, as we're going to explain how
you can leverage the Google Platform capabilities, along
with using Thomson Reuters data. Start us off, Alex. ALEX VAYSBURD: OK. So building and training
a stock price forecasting model in Google Cloud is a
very cool, exciting project. Yet it is also a
challenging project, because you need to
figure out not just how to use each of those
high-powered, leading edge tools that Google
Cloud provides, but you also need to
understand and figure out how to put all those
pieces together, how to build your complete
data processing pipeline, end-to-end. And this is exactly the
goal of this session, to show you how to do
just that, and the idea is that after this session, you
will be able to build and train your own model in Google Cloud,
and it will seem easy for you. So this session is going to be
very practical, very hands-on, with lots of code snippets,
as well as high-level diagrams and discussions. So what exactly does it mean
to build and train a stock price forecasting model in the cloud? What does it entail? There are three components. First of all, market data. It's supposed to
be in the cloud. What does it mean? It means that the data
is hosted in the cloud, and is available to you
via native cloud APIs. Second, what does it mean to do
data processing in the cloud? It means that you are using
tools and services that are also native to
the cloud and all your data processing
pipeline end-to-end is native to the cloud. And finally, what does it mean to train your model in the cloud? It means that you're using cloud services to train your model as a distributed application with multiple jobs, running
concurrently, and training your model at much higher scale
potentially, and much faster. So market data is key to models. This is what powers
financial models, and this is what
brings them to life. And one of the key leading
financial market data companies is Google Cloud's strategic
partner Thomson Reuters. And with this, I'm
turning it over to Mike. MIKE STONE: Thanks, Alex. What I'd like to do over
the next several minutes is just give you a quick overview
of the journey Thomson Reuters is taking, in
terms of bringing our data sets to the cloud,
and specifically what we're doing
with Google, and then give you a quick overview
of the actual data asset Alex is going to be using for
the stock forecasting model. Now as Alex
mentioned, market data certainly powers
financial models when you want to run those. Thomson Reuters itself,
our data and solutions power the financial
services industry. Our data, for those
of you who aren't aware of who Thomson
Reuters is, is used by 10 of the
top 10 global banks, 90% of all companies
managing over $10 billion use our solutions, and 87% of
the institutional investors' top global research firms also use our data. Now let me give you some insight into
where and how our data is actually used. Now Thomson Reuters is an open
platform that's cloud enabled. We call it-- it's branded as
the Elektron Data Platform. And basically what
it does is it serves the major functions of
an institutional firm around the globe. Let me highlight a few of those. The divisions, the investment
banking sales and trading, wealth management division,
commercial banking, retail banking,
risk and regulatory, to name a few of them. Now the Elektron Data Platform
itself is a set of services, analytics, integration tools,
and then, ultimately, content. And content is
really the foundation of the platform itself. If I take a look, I'm sure
this is an eye chart for those of you at the back,
but what it represents is one of the most extensive
data sets Thomson Reuters has in the industry. It spans all the way from
the award-winning and leading Reuters news and
commentary, which is available not only
as human-readable text, but also as machine-readable,
real-time feeds, and continues through market
data pricing, security reference data,
risk and regulatory, a very wide range of company
data, and then wrapping up, risk compliance and
supply chain data. Now it's one thing to
have all the data, though. How do we make that
available to our clients? As I mentioned previously,
we have the Elektron Data Platform. Which again, it's our
journey that we're moving that towards the cloud. Now most of the data within
the financial service industry functions in a cycle. And let me use an
example to explain what that cycle actually
represents with real-time data. If we were to look at what
we collect from exchanges, over 500 of them around the world, we take all the security
activity from the exchange, we bring it in, normalize
it, enhance it, and then tag it with common identifiers. From there, it's now ready for
distribution to our clients either through a suite of
our APIs, or, as we progress, making it more and more
available through the cloud itself. Let me give you some insight on
the strategy Thomson Reuters is taking as we go to the cloud. There's three primary
pillars to our strategy. For our own internal
systems it's cloud first. So everything we're doing,
we're building in the cloud. Secondarily, where
it's more important, how do we get our data
out to our clients? We're evolving with the
industry in providing that data through various methods as
I mentioned, native cloud service, we'll talk more
about Google BigQuery later, or through our own suite of APIs
available in the cloud already. The last area is transforming
our own previously deployed systems and making those as
a service in the cloud, where basically, again,
we've pulled them out and have a zero footprint in our clients' environments. Now let's look a little
more specifically at that. From recent surveys
from our clients, they've given this
perspective on how they're moving their
applications to the cloud. Down the right hand side
for financial institutions, is basically the
trade lifecycle, all the way from pre-trade,
trade, post-trade middle office activities, compliance, and
then ultimately, the back office technology wraps things up. Across the top
what we've got is, we're charting between what
clients have done today and what they expect to
do in the future in terms of prioritization and
moving to the cloud. And across the
bottom, it matches nicely with the type of
data, either relatively static historical
data right now, all the way through
to the future, where real-time transactional
services will be required. And what we've
plotted out is what clients have told us they've
been deploying to the cloud already. And there's a few general things
that we can observe from this. One is the applications
are across the entire trade lifecycle, so no
one's shying away from deploying applications
anywhere within their business. And then ultimately,
too, a lot of it's being used for analytics and
reporting type capability. Out in the future,
clients expect that there's some things that
still need to be worked out to really get truly low
latency data in the cloud, and it'll prove out over time
if that's a valid use case or not, for example,
algo-trading. Finally, to summarize things,
a lot of the data right now is relatively
high volume data, and therefore is used very
efficiently by cloud services. Which brings us to
tick history data. One of the largest data sets
within the financial services industry is tick data. Tick data, for those
of you not familiar, is really the collection and
storage of all trade activity, bid and ask prices, and everything else for a security in the market. So we collect that
up, and I mentioned before, that's part of what
we process through our EDP platform. Now back in 1918, you see
how they worked a board to store that information. Certainly there's been
a lot of evolution in that data since then. I don't think these two
working the board could keep up with today's volumes,
or necessarily worried about nanosecond time stamps. Specifically now, what
clients can do today in accessing Thomson Reuters
tick history data in the cloud is they can take it through
our web services APIs, or as a proof of concept right
now as we sort out the most efficient methods,
we've actually loaded a few years of the 20
years history of the New York Stock Exchange data right
into the BigQuery tables. It's very efficient for
training models and doing simulations on that volume of data. It's one of the ways you can access over 20 years of market-leading quality history from the 400 exchanges we've collected it from. And then ultimately too, when
I talked about enriching data, here's a perfect example. We've taken all the
corporate actions, whether it be
splits or dividends, and that's already
incorporated for you. And then ultimately,
too, if you're linking it with other data, it
comes in the same format with the same set of
identifiers you've already used. A few other benefits you
don't have to worry about by managing in the cloud. You don't have to worry
about collecting the data and managing that
infrastructure. You're not going to worry about
the database CPUs and servers and managing them. And then ultimately, for
disaster recovery and backups, also part of the built-in
services of the cloud. Now to go specifically,
what did we actually put in? So as I said, as a proof of concept at this point, we've worked with a
number of clients. We've taken three years or so
of the New York Stock Exchange data. We've put it in a table. It represents the trade and
quote data that goes on. So for example, we've
got an identifier, which is the RIC, or the
Reuters instrument code, we have a number of
other fields in terms of the time, the bid, the price. Ultimately the trade data
includes similar information. If we want to get access to it,
it's as simple as a SQL query. We just list the fields, the date, and the table we want to pull it from. In this case, we're pulling GS.N, which is Goldman Sachs on the NYSE, and then we say we want the quote data. Voila. This is what we're going to get back very quickly. If we want to do something similar for trade data, it's just a slightly different query. We modify the fields to say that we want to pull trade data, and here's what we've got.
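As a rough illustration of the kind of query Mike is describing, run from Python the way Alex shows later in the session. Note that the project, dataset, table, and field names below are hypothetical placeholders, not the actual Thomson Reuters tick history schema.

```python
# Hypothetical sketch: the project, dataset, table, and field names are
# placeholders, not the real Thomson Reuters tick history schema.
import pandas as pd

QUOTES_QUERY = """
SELECT RIC, Date_Time, Bid_Price, Bid_Size, Ask_Price, Ask_Size
FROM `my-project.tick_history.nyse_quotes`  -- placeholder table
WHERE RIC = 'GS.N'
ORDER BY Date_Time
LIMIT 1000
"""

# Runs the query in BigQuery and returns the result as a pandas dataframe.
quotes = pd.io.gbq.read_gbq(QUOTES_QUERY, project_id="my-project",
                            dialect="standard")
print(quotes.head())
```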
This is the data that Alex is going to be using to explain the model. So I'm going to pass it back to him now, so he can explain how he's leveraged Thomson Reuters data, along with the Google Cloud capabilities, to make it easier for yourself. ALEX VAYSBURD: Thank you, Mike. That was great. OK, data processing
pipeline, now the fun begins. We're going to take a look at
how to put together the data processing pipeline,
and before we get into nitty-gritty
details, I'll give a high-level overview of
what we're going to be doing. So the first step is we're going
to generate model features, starting from the Thomson
Reuters data in BigQuery. And this can be done from a Python script, or from a Colab. In case you don't know, Colab is an extremely useful tool built by Google. It hosts Python notebooks. It's hosted on Google Cloud. It's a free service. When you use it, you'll get
access to cloud services. And you even get a
free GPU on the cloud. You don't have to pay for it. And in the Colab notebooks
you can write your code, you can execute
the code, and you can see the output, all
within one notebook. And it's very convenient for
sharing with your teammates, for example. OK, a second step
in the data pipeline is we're going to write model
input features into what is called TFRecord files. So this is the standard format
for presenting input features data for TensorFlow models. The third step is we're going
to launch model training using Cloud ML engine. When the training job
starts, the workers will be loading model features
from TensorFlow record files that have been
previously written to Google Cloud storage. And periodically,
the worker jobs will be saving model state
checkpoints, as well as training and
evaluation statistics in the bucket in cloud storage. So that's the pipeline. Now, let's take a look
at model predictions. What we're going
to be predicting, let's start with that. What metric will the
model be forecasting? When we're talking about a stock price forecasting model, the key is to define what
horizon we're talking about. Are we going to be
forecasting for a year? For several months? Days? So to be specific, for
this example that we're going to show you, we're going
to be forecasting intraday stock returns with a
five-minute horizon. The reason we chose
this for this example is because this is
something useful-- practical. It's useful for
scheduling suborders for algorithmic
execution orders-- execution algos. And it's useful for
intraday trading strategies. So it's something relevant. Now, let's take a big
picture view for a moment and discuss a little
bit what exactly we're going to be doing here. We're going to be using
supervised learning, meaning that we're going to be
training our model on examples. Each training example is a
pair comprising input features and the correct answer. So input features
are whatever inputs the model is going to be using
to give us its prediction. And the label is
the correct answer. So given inputs and
correct answers, the model will be learning,
and learning until it hopefully is able to generate correct
or more or less correct predictions, even when it doesn't know the correct answers, based only on the inputs. So what are going
to be our labels? How do we construct the
labels-- predictions? So our goal is to
predict intraday returns for a five-minute horizon. And for this, we need
to define the starting price and the final
price-- the end price. For the starting
price, we're going to be using the average
bid/ask midpoint for the current
10-second interval. And for the final
price, we're going to be using five-minute
volume weighted average price, or a VWAP,
for the five-minute interval subsequent to the
current point in time. And that is actually
relevant, because when you schedule
suborders, you'll want to know what will be the VWAP. Not just some price at
some random point in time, but VWAP may be
more relevant here. So this is the picture that
illustrates what exactly we're doing. We are calculating
the average midpoint, we're calculating the VWAP for
the next five-minute interval. And we take the logarithm
of those two values. And the difference of
logarithms is the label that we're going to be using
for training the model. So after we're done with
building and training the model, we're
going to be taking a look at evaluations
of this to see how well the model performs. But I guess you're already
curious to see how it performs, and so we're going to have
dessert first, before we're done with the main entrée. OK. So this is what we did. For this example,
we trained the model on 16 days of Thomson
Reuters market data between May 1st and 24th. And we evaluate that model
on one day of market data-- May 25th. So you can see that
evaluation data should always be after the end of
the training data. Don't want to have
any look-aheads here. And we're using R
squared as a measure of the quality of the model. What is R squared? Approximately, it
tells you how good is the model compared
to a trivial model that always predicts zero return. So for this particular
example, R square was 0.1, which means that
the model was useful. If R square is zero, it means
the model is as good or as bad as the trivial model that
always predicts zero returns. If R square is 1, it means
the model is perfect. If it's between 0 and
1, it means that it has some predictive ability. So here, you can see that for the zero-prediction model, the mean squared error on evaluation data was 231 squared basis points, where a basis point is one hundredth of 1%. And the mean squared error for our model on evaluation data was 207. So it was slightly better than the trivial model.
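As a quick sanity check of that R-squared figure, using the two mean squared errors just quoted:

```python
# R squared relative to the trivial always-predict-zero model is one minus
# the ratio of our model's MSE to the trivial model's MSE.
mse_zero_model = 231.0  # squared basis points, evaluation data
mse_our_model = 207.0   # squared basis points, evaluation data

r_squared = 1.0 - mse_our_model / mse_zero_model
print(round(r_squared, 2))  # ~0.1, consistent with the figure quoted above
```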
And this is roughly what would be expected, because you do not realistically
expect to forecast all or even most of the variance at
five-minute intervals, because there is a lot of
noise in intraday prices. So let's take a look at
how we build the inputs. What are we going to be
feeding into the model? We're going to be defining
input features based on what types of factors
that affect stock returns we're trying
to capture here. So first, we're going to
learn intraday price patterns at different times of the day. And to do this, we're going
to include a feature based on the intraday sequence
number of the current 10-second interval. Then we want to capture mean
reversion or momentum price patterns. And for this, we're going to
be using a list of differences of logarithms of average
midpoint quotes in the last 120 10-second intervals. So essentially what this
means, this long sentence, is that we're looking
at returns between consecutive 10-second
intervals for the trailing 20 minute window. And we're using this list as
one of the input features. Then we want to capture
price patterns for stocks at different price levels. And for this, we're
looking at the logarithm of the price at the
current 10-second interval. Volume is also relevant. So as one of the
input features, we're going to be using volume in
the trailing 20 minute window preceding the current interval. But for this to
be useful, we want to normalize the volume by the
stock's average daily volume over some reasonably
long time range. And for example, we used
4-week average daily volume as a normalization factor. So we take traded volume for
this stock in the last 20 minutes, and we divided by these
stock's average daily volume on the past four weeks. And finally, we're going to be
using an input feature based on the stock's
security identifier, RIC, because different stocks may
have their unique distinct intraday price patterns. And I will show how to plug
in a RIC or string in general into the model. Now get ready for some heavy
construction work ahead. I'm going to start
showing you some slides with the actual
SQL code snippets. So first of all, we're
going to be using BigQuery. What is BigQuery? It's Google Cloud's
enterprise data warehouse, ideal for analytics. You can run standard
SQL queries on it. And also, it has some
very useful extensions-- analytic functions. I will show how to
use those functions. It's fully managed
and serverless, meaning that Google takes
care of provisioning CPU and storage capacity for your data and applications. You don't have to worry
about any of that. Feature number one--
intraday sequence number of the current
10-second interval. So what I want to
do here is basically calculate the number of
seconds since midnight and divide it by 10. And to do this, we extract
the hour, multiply by 360. We convert from UTC
time into local time, because we care about stock patterns specific to the exchange's local time, not UTC time. We extract the number of minutes multiplied by 6, and we extract the number of seconds divided by 10, rounded down. So this is our interval sequence number.
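A sketch of that expression in BigQuery standard SQL. The table and timestamp column names are hypothetical placeholders, and the time zone is assumed to be New York for NYSE data.

```python
# Hypothetical sketch: table name, timestamp column, and time zone handling
# are placeholders; adapt them to your actual schema.
INTERVAL_SEQ_SQL = """
SELECT
  RIC,
  Date_Time,
  EXTRACT(HOUR FROM Date_Time AT TIME ZONE 'America/New_York') * 360
    + EXTRACT(MINUTE FROM Date_Time AT TIME ZONE 'America/New_York') * 6
    + DIV(EXTRACT(SECOND FROM Date_Time AT TIME ZONE 'America/New_York'), 10)
    AS interval_seq_num
FROM `my-project.tick_history.nyse_quotes`  -- placeholder table
"""
```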
Building block for feature number two-- a list of 10-second interval midpoints. So one nice thing
BigQuery provides is the ability to use temporary
tables as part of the queries. So in this example, we
use interval midpoints as a temporary table. You write a query, and
the output of the query is stored in the
temporary table. And then we can use this table
from within another query. And here, we're using
the ARRAY_AGG (array aggregation) function, which is an analytic
function of BigQuery that allows you to apply a certain
function over a series of rows preceding each
row in the output. So what I want to do here
is to partition data by RIC. And the OVER clause
specifies how exactly we're going to do it. So we'll partition data by
RIC, by stock identifier. Within each partition,
we'll order rows by interval sequence number. And with this ordering, for each row, we take the 120 preceding rows, make a list of them, and return that as the output of the query.
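A sketch of that windowed aggregation in BigQuery standard SQL, assuming a temporary table of per-interval midpoints built from the quote data (all names here are placeholders):

```python
# Hypothetical sketch: interval_midpoints stands in for the temporary table
# described above, holding one midpoint per RIC per 10-second interval.
TRAILING_MIDS_SQL = """
WITH interval_midpoints AS (
  SELECT RIC, interval_seq_num, AVG((Bid_Price + Ask_Price) / 2) AS mid
  FROM `my-project.features.quotes_with_intervals`  -- placeholder table
  GROUP BY RIC, interval_seq_num
)
SELECT
  RIC,
  interval_seq_num,
  ARRAY_AGG(mid) OVER (
    PARTITION BY RIC
    ORDER BY interval_seq_num
    ROWS BETWEEN 120 PRECEDING AND 1 PRECEDING
  ) AS trailing_mids
FROM interval_midpoints
"""
```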
Feature number three-- normalized traded volume. So there are three steps here. As I mentioned
before, we compute each stock's average daily
volume for the last four weeks. We compute interval volumes,
normalize them by ADV. And then for each interval,
we compute the sum of normalized volumes for
those 10-second intervals in the preceding
20-minute window. So there are three steps here. And similar as before,
we use a temporary table to store average daily volumes. Then we use this temporary
table from another query to calculate interval
vols normalized by ADV. And then we use those values
from yet another query in which we use an analytic
function called SUM. Which instead of making
a list of all the values from the preceding rows, here it just adds up the values from the preceding rows. And as before, we
specify that we want to partition the data by RIC, we want to order the rows in each partition by interval sequence number, and for each row, we want to apply the sum over the 120 preceding rows.
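And a matching sketch for the SUM analytic function, assuming the per-interval volumes have already been normalized by the 4-week average daily volume (again, every name here is a placeholder):

```python
# Hypothetical sketch: interval_volumes stands in for the temporary table of
# per-RIC, per-interval volumes already normalized by 4-week ADV.
TRAILING_VOLUME_SQL = """
WITH interval_volumes AS (
  SELECT RIC, interval_seq_num, norm_volume
  FROM `my-project.features.normalized_interval_volumes`  -- placeholder table
)
SELECT
  RIC,
  interval_seq_num,
  SUM(norm_volume) OVER (
    PARTITION BY RIC
    ORDER BY interval_seq_num
    ROWS BETWEEN 120 PRECEDING AND 1 PRECEDING
  ) AS trailing_20min_volume
FROM interval_volumes
"""
```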
Now I'm going to show how to construct a building
subsequent to each interval. This is something
that we won't have when we're generating
forecasts, but this is something that we're going to have
when training the model. So once again, we
use a temporary table for interval price volumes. We use this table
from another query, which for each row
for each interval, computes the sum
of price volumes, and then divides by
the sum of volumes. Now, one thing that's very
convenient about using BigQuery for data and Google Cloud
is that it's very nicely integrated with Python. There is a pandas package
called io.gbq that has a method called read_gbq. You pass the query and your project ID in Google Cloud, and it returns the output of the query as a pandas dataframe. And then you can do additional transformations of the data and generate the final features from the intermediate query results directly within pandas
dataframes, which have very rich functionality. So it's very convenient. Now, once we have this final
dataframe with all the features that you need, you can save it into a JSON file, with the to_json function. JSON format is very convenient
in that it preserves the structure of data. So for example, it will
save lists as lists. If you use the to_csv function, for example, it will save lists as strings and you will lose the structure. So I recommend to_json. And then finally, if you wish, you can use the gsutil tool. It's one of the tools in Google Cloud to copy your JSON file into one of the buckets in Cloud Storage.
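Putting those pieces together, a minimal sketch of the pandas side of the pipeline might look like this. The query, project ID, file name, and bucket are all hypothetical placeholders.

```python
# Minimal sketch, assuming a placeholder query, project ID, and bucket name.
import pandas as pd

FEATURES_QUERY = "SELECT * FROM `my-project.features.model_features`"  # placeholder

# Run the query in BigQuery and load the result into a dataframe.
features_df = pd.io.gbq.read_gbq(FEATURES_QUERY, project_id="my-project",
                                 dialect="standard")

# Any final feature transformations can be applied to the dataframe here.

# to_json preserves structure, so list-valued features stay lists.
features_df.to_json("features.json", orient="records")

# The file can then be copied into a Cloud Storage bucket, for example with:
#   gsutil cp features.json gs://my-bucket/features/features.json
```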
Generating TFRecord files. So a TFRecord is the recommended format for storing serialized model
features in TensorFlow. So what we do is we run queries to get the data, we put the data into
dataframes, we'll apply additional
transformations. And then we use those dataframes
with everything prepared to generate TFRecord files. And those files will be later
fed to the model for training. So there are three steps here. Generating training examples
comprising model input features. Serializing training
samples into TFRecord files. And feeding TFRecord
files into the model. So what is a training example? It's actually a Python
dictionary or a map. It maps feature names to feature
protos, so protocol messages. And there are three types. You can have a list of
bytes, a list of floats, or a list of integers. You will always have a list. And what do you do with features that are scalars? If it's not a list, you simply have a list with one element in it. So that is an example. Here is how we construct an example-- assuming that each row holds values for features from
our dataframe that we have constructed earlier. So delta logs VWAP mid--
this is what we're going to be using for our label. This is the correct answer that
we only have during training. We don't have it when we're
actually using the model to make predictions. And we have five features
that I described earlier. The RIC, stock identifier,
interval sequence number, deltas of log mids, sum
of interval volumes, and logarithm of
the current mid. How do you use these to actually
generate TFRecord files? The first step is to
initialize the TFRecord writer. Then you iterate over
rows in the dataframe that contains the features. And for each row, you
construct an example based on the feature
dictionary that I showed you on the previous slide. Then you serialize this
example and you write it into the TFRecord writer. Finally, when you are done, you close the writer. And you have this file with serialized TFRecords.
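Here is a minimal sketch of those steps in recent TensorFlow 1.x, assuming a dataframe like the one built earlier; the column names and the output path are hypothetical, not the talk's exact code.

```python
# Minimal sketch: writes one serialized tf.train.Example per dataframe row.
import tensorflow as tf

def make_example(row):
    return tf.train.Example(features=tf.train.Features(feature={
        # The label: log(5-minute VWAP) minus log(current midpoint).
        "delta_log_vwap_mid": tf.train.Feature(
            float_list=tf.train.FloatList(value=[row["delta_log_vwap_mid"]])),
        "ric": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[row["ric"].encode("utf-8")])),
        "interval_seq_num": tf.train.Feature(
            float_list=tf.train.FloatList(value=[float(row["interval_seq_num"])])),
        "sum_interval_volumes": tf.train.Feature(
            float_list=tf.train.FloatList(value=[row["sum_interval_volumes"]])),
        "log_current_mid": tf.train.Feature(
            float_list=tf.train.FloatList(value=[row["log_current_mid"]])),
        # A list of 120 trailing 10-second log-mid differences.
        "delta_log_mids": tf.train.Feature(
            float_list=tf.train.FloatList(value=row["delta_log_mids"])),
    }))

writer = tf.io.TFRecordWriter("gs://my-bucket/features/train-001.tfrecord")
for _, row in features_df.iterrows():
    writer.write(make_example(row).SerializeToString())
writer.close()
```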
If you only intend to train your model locally, you can generate this
file in a local directory. But if you are planning
to use Cloud ML Engine, you have to copy these TFRecords
file into Cloud Storage so that it can be accessed by
multiple jobs, multiple tasks that are running as part
of the Cloud ML Engine job. Preparing data to
train the model. So TFRecords can be efficiently
fed into TensorFlow models for training and evaluation. And in the next
several slides, we're going to look at
exactly how we're going to be feeding those
models to our model. And there are three steps here. The first step is we're going
to define feature columns. And I will show you in
a moment what they are. Second step, we're going
to create a parsing spec. And third step, we're going to
define an input function that will use the parsing spec. Step one, defining feature
columns for the model. So a feature column is a list. It specifies the
type of each feature. So here, for
example, for the RIC, we're using what's called
an embedding column. We're plugging stocks
identifier, a string, as an embedding. Which means essentially,
that we're making the string into a numeric value. And this numeric
value may represent, for example, a stocks propensity
to have mean-reverting versus momentum price patterns. And the one caveat that
you need to be aware of is when they use
embeddings, you want to make sure that
you set hash bucket size to a large enough value. Because you certainly
want to avoid collisions in the hash map used from
embeddings as much as possible. Because essentially,
when you have collisions, the model doesn't distinguish
between two different RICs in this case, and you
don't want to do that. Then you have feature columns
for other model inputs. In this case, interval
sequence number, sum of interval volumes
for the trailing window, and logarithm of
the current mid. You can see that we
specified their shape. The first dimension
is one in all cases. Meaning that they
all have width one-- they're all scalars. And then we have delta log mids,
for which we specify shape, with the first
dimension set to 120, reflecting the fact that it's expected to be a list of 120 values. Step two here is
creating parsing spec. And this is very easy to
do with a function called make_parse_example_spec. So you pass this list of feature specifications to this function, and it gives you a parsing spec that we can use in the input function. Which is the next step.
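A minimal sketch of those two steps, using the hypothetical feature names from the TFRecord example above (the hash bucket size and embedding dimension are made-up values):

```python
# Minimal sketch of feature columns plus the parsing spec; the names mirror
# the hypothetical TFRecord features above.
import tensorflow as tf

ric_embedding = tf.feature_column.embedding_column(
    tf.feature_column.categorical_column_with_hash_bucket(
        "ric", hash_bucket_size=100000),  # large enough to make collisions rare
    dimension=4)

feature_columns = [
    ric_embedding,
    tf.feature_column.numeric_column("interval_seq_num", shape=(1,)),
    tf.feature_column.numeric_column("sum_interval_volumes", shape=(1,)),
    tf.feature_column.numeric_column("log_current_mid", shape=(1,)),
    tf.feature_column.numeric_column("delta_log_mids", shape=(120,)),
]

# The parsing spec tells TensorFlow how to deserialize each serialized example.
parsing_spec = tf.feature_column.make_parse_example_spec(feature_columns)
```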
We're going to define the input function in the next couple of slides. Let's see what we're doing here. The first step is
constructing a dataset. And for this, I recommend
using a function called make_batched_features_dataset. It creates an already performance-optimized dataset with batching and
shuffling built-in. So you don't need to worry about
reading your data efficiently. You don't need to
worry about setting up multiple threads to read your
data because sometimes it's a bottleneck. It does everything for you. So where previously you would
make several function calls, here you can just
use this function. So let's look at
the parameters here. The file pattern specifies
the list of TFRecord files that you have
generated previously. So for example, you can
have a separate TFRecord file for market data from
each trading date. Then batch size. So here, we'll set it to 128. What does this mean? It means that the
training examples are going to be passed through
the model not one by one, but all together in batches. And each batch is going to
have 128 training examples. When you feed training
examples through the model, it generates predictions. And the loss is computed from the difference between the predictions and the actual correct answers. And then based on the loss, the back propagation of gradients is applied. So when you use a batch, the back propagation happens only once for the entire batch. All gradients are averaged
over all training examples, so then there is back
propagation happening once. So this is called mini-batching. And usually it improves
model performance with the batch size set around
a hundred to a thousand. Now, the feature spec is
what we constructed earlier. Number of epochs is set to one. This simply means that we're
reading the dataset just once. We can read it several
times if you wish. Shuffle is set to true. What this does, it
randomizes the order in which it puts training
examples that it reads from the file into batches. It makes the model more robust. Then we make an iterator. So make one shot
iterator, it simply creates an iterator that reads
the data from the dataset just once. There are other
types of iterators. Some of them read
data more than once. And this is interesting. The get next function. So it looks like, if you
look at this function, it returns features. But in fact, it does not. What this function returns
is an operation node in the computation graph. And then later on when
this node is evaluated, it will return the next batch of examples, again and again, upon each evaluation. And this is a very
important, subtle point, so I have a separate
slide just for this. Because it's a key
point to understand. TensorFlow programs
in graph mode have two phases. In phase one, you construct
the computation graph, but no data is flowing
through the graph yet. Then in phase two, you
execute the computation within the TensorFlow session. And this is when data actually
flows through the graph. So in our dataset input
function, as it's implemented, it doesn't return any
batches of data yet. But instead what it does, it
adds nodes to the TensorFlow computation graph. And then, when
later on the nodes are evaluated within a session, they will be returning the next
batch of training samples upon each evaluation. So we're back to the dataset input function. It returns features and labels. Features is an operation node in the TensorFlow graph, and labels is another operation node in the TensorFlow graph.
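Here is a minimal sketch of such an input function, assuming a recent TensorFlow 1.x release (in older 1.x versions the same helper lives under tf.contrib.data) and the parsing_spec built above; the file pattern and label name are placeholders.

```python
# Minimal sketch of the dataset-based input function described above.
import tensorflow as tf

def dataset_input_fn(file_pattern):
    # The label has to be present in the spec so the dataset can split it out.
    spec = dict(parsing_spec)
    spec["delta_log_vwap_mid"] = tf.FixedLenFeature([1], tf.float32)

    dataset = tf.data.experimental.make_batched_features_dataset(
        file_pattern=file_pattern,   # e.g. "gs://my-bucket/features/train-*.tfrecord"
        batch_size=128,              # mini-batches of 128 training examples
        features=spec,
        label_key="delta_log_vwap_mid",
        num_epochs=1,                # read the dataset just once
        shuffle=True)                # randomize the order of examples

    # This only adds nodes to the graph; data flows when the session runs them.
    features, labels = dataset.make_one_shot_iterator().get_next()
    return features, labels
```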
Training the model. First, we're going to look at how to train the model locally. So the model will be training on the examples we feed it. And given some number of examples to train on, hopefully it will learn how to make correct predictions. That's how supervised learning works. So the first step is to
construct a DNNRegressor. TensorFlow has several
built-in estimators. So a DNNRegressor is
one of the estimators, and it's intended
for use with models that make numerical
predictions, which is what we're doing in our example. Hidden units specifies
the structure of the neural network. So here, we are
going to be using two fully connected layers. And each layer is going to have 128 hidden units. Feature columns-- so this is
exactly the feature columns that we defined earlier. It's a list of specifications
of the dimensionality and type of each input feature. Model directory is another argument; it specifies where the model is going to be saving its periodic checkpoints, so its state, and where it's also going to periodically save its training and
evaluation statistics. So here, I'm again going
over the feature columns. There is embedding
column for the RIC, there is a numeric column for
interval sequence number with shape one, sum of interval volumes with shape one, log of the current mid-- they all have width one. And delta log mids
has width 120. And I already said
that the model directory is where the model stores its checkpoints. And this is also, importantly, the directory from which the model reads its state to initialize itself when you construct
DNNRegressor later on. And by using this feature,
you can train the model iteratively or incrementally. So the way it works is you
initialize DNNRegressor from the latest checkpoint. Then you train the model on newly
available data, for example, or newly available dataset. And your checkpoints will
be automatically saved in the model directory. So it will create
new checkpoints with files under new file
names and automatically garbage collect old checkpoints. So for example, in a
case of financial data, you can incrementally
train the model at the end of each trading date. So at the end of
the trading date, you collect all market
data for this date. You train the model on training
examples for that date. The model starts
with a checkpoint from the previous date. After it trains, it saves a checkpoint representing its state after it has incrementally trained on the market data from today. And then tomorrow it can go on. So how to train the model. One thing before
we get to training is you may want to add some
custom performance metrics-- evaluation metrics. By default, DNNRegressor
has one metric. It shows L2 loss values,
which is mean squared error. And it may be useful to have
root mean squared error, as well. So this is how you add the
custom evaluation metrics. So we're almost
at the point where we're training the models. Before we train
the model, we need to construct training
spec and evaluation spec. And what are those? Primarily, they specify input functions. So we can use the
same input function, but simply initialize
them differently with a different list of files. We pass the list
of training files into the trained
specification, and we pass the list of
evaluation files for the eval specification. And very importantly,
when training models, evaluation data should start after the end of the training data. You don't want to have
any look-aheads here. And also, it's important
to specify the number of maximum steps for training. Because depending on how you
implement your dataset input function, if you
don't specify these, your model may train forever. It will never stop. And finally, train
and evaluate function is the recommended way
of training a TensorFlow model that uses DNNRegressor. What it does, it
trains the model, it periodically
checkpoints the state and it periodically checkpoints
training and evaluation performance metrics, all
in the same directory. But there is a separate subdirectory for evaluation metrics.
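Pulling the local-training pieces together, a minimal sketch might look like this, reusing the feature_columns and dataset_input_fn sketched above. The paths and max_steps are placeholders, and the custom RMSE metric is omitted for brevity.

```python
# Minimal sketch of local training with the canned DNNRegressor estimator.
import tensorflow as tf

estimator = tf.estimator.DNNRegressor(
    hidden_units=[128, 128],           # two fully connected layers of 128 units
    feature_columns=feature_columns,
    model_dir="gs://my-bucket/model")  # checkpoints and stats are saved here

train_spec = tf.estimator.TrainSpec(
    input_fn=lambda: dataset_input_fn("gs://my-bucket/features/train-*.tfrecord"),
    max_steps=20000)                   # always bound the number of training steps

eval_spec = tf.estimator.EvalSpec(
    input_fn=lambda: dataset_input_fn("gs://my-bucket/features/eval-*.tfrecord"))

# Trains, periodically checkpoints, and periodically evaluates, all under
# model_dir (with evaluation metrics in a separate subdirectory).
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```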
Now, we have had a look at how to train the model locally. Next, let's take a look at how
you will do it with the Cloud ML Engine. So why would you want
to use Cloud ML Engine? The primary purpose is
it gives us scalability. It gives the ability
to train the model with multiple workers, multiple
tasks at the same time. And this way you can train
your model many times faster and you can have a much faster turnaround over different versions of the models you want to try. You can do your research
much faster this way. So the good thing is, when
you transition your model from local training to
training with Cloud ML Engine, you don't need to
change your code at all. When you're using
estimators, they already have built-in support
for Cloud ML Engine. And that's extremely convenient. So all you have to do is
specify a configuration file, you need to specify code to
construct training package-- the setup-- and you need
to provide the main program to train the model,
which you already have because it's the same program. So configuration settings. You need to specify
the scale tier. So in this example,
we're going to be using a custom tier because
it's the most interesting-- most flexible. You can specify exactly
the hardware configuration for different types of
nodes in your training job. So let's take a look at the
different types of nodes. There are three types. There are worker nodes,
which calculate gradients. And workers are
the nodes that do lots of numerical computations,
such as matrix multiplications. And they are the
nodes that you will want to run with GPUs to
get faster performance. Then there are
parameter server nodes, which update parameters with
gradient vectors from workers. And then there is a master
node, which coordinates everyone and also operates as a worker. So based on that
usage, we are going to be using standard GPU
for master and worker nodes. And we're going to be using standard CPU machines for parameter servers. And we'll also get to specify
how many workers or how many parameter servers
we want to use. For this example, we're going to be training the model with eight workers and
four parameter servers. The code to construct
training package-- there is a separate
file we need to provide. And you specify the lists
of required packages. In many cases, this list
will be actually empty, because each Cloud ML
Engine runtime comes with an already built-in list of
many popular Python packages. So if you are using something
that's very nonstandard, you may need to specify it
here in this file in setup. But in many cases,
we won't need to. And the main program
to train the model is the same as for
local training. And again, let's go
over what it does. It parses command
line parameters, it defines feature columns,
it constructs DNNRegressor, it defines evaluation
metric if you need any, it defines the input function
that will be reading TFRecords from a list of
files, and it will start training and
evaluating the model. So this is the command they
use to submit a job to Cloud ML Engine. You need to specify
a unique job name. You specify the path
to your main program that does the training. This is the program that
creates DNNRegressor and calls the train and evaluate function. You specify the path to
your configuration file. You specify the
path of your staging bucket in cloud storage. And you specify the
runtime version. So this number in
the runtime version-- runtime version is exactly
the version of TensorFlow. So depending on what version of
TensorFlow you intend to use, depending on your
code dependencies, you will specify
the version here. And as I said, each runtime
version of Cloud ML Engine comes with a different
list of packages already built-in the runtime. Then you need to
specify the region. There are two regions supported at the moment, us-central1 and us-east1. And finally, if you need to, you can specify a list of
custom parameters that will be passed to
your training application. Evaluation of the model. So what is evaluation? The key question
I want to answer is, how precise is our model? What's the quality
of predictions? How well does it perform
on evaluation data? So we're going to be using
TensorBoard for this. This is how you
run TensorBoard: you run the tensorboard command, and then you specify the directory where the model checkpoints are, along with the evaluation statistics and training statistics. This is the same model directory that we used when we constructed
DNNRegressor object. So this is what L2 loss
looks like on the training data that were used
for this model example. This is what the loss looks
like for the first 3 and a 1/2 hours of training. You can see that L2 loss
started around 90 probably, and it was gradually decreasing
over the next 2 and a 1/2 hours. Meaning that as the model trains, its loss on training data becomes gradually lower and lower. But this doesn't
necessarily mean that the model is becoming
better in its predictions. Because what we
really care about is the quality of generalization. And when we look at the L2
loss for evaluation data, we see that the model
starts around 206 after 30 minutes of training. And then actually, you can see
that the loss from evaluation data increases slightly
to around 210 to 211. So what this means
is you probably want to stop training your model
after the first half an hour. Because after that, L2
loss on training data will keep decreasing. But this simply means that the model gradually starts overfitting. So you want to stop
training, because you're not going to improve quality
of predictions after that. The longer it trains after
the first half an hour in this example, the
worse actually the model predicts evaluation data. So there is usually
an optimal amount of training you want to apply. So I said this already. Let's take a look at R squared. So as I mentioned
before, R squared gives you an idea of how well the model
performs on evaluation data compared to a trivial model that
always predicts zero returns. And as I said, if
the value is zero, it means the model is worthless. If it's one, it means
the model is perfect. And usually, we expect
the value to be closer to zero than the one,
but to be positive. And as it happens,
that's the case here. R squared is approximately
0.10 to 0.11 after the first 20 to 30 minutes of training, and
then it declines to a range between 0.08 and 0.09, which
is actually decent for when you're predicting
five-minute intraday prices. But remember, this
was based on a model trained on only 16
days of market data and evaluated on
only one day of data. So this is fine for an example
of how to build the model and how to train
it, but this is not what you would want to do
for actual use in production. For use in production
training, you would want to train your model
with several years of data and evaluate on
several years of data. Generating forecasts. Once we have the model, how
do we generate forecasts? For training, it's convenient to
save data into TFRecord files. When you want to build
a forecast in real time, it may actually be more convenient to store the data in pandas dataframes in memory than to read it from files. So I will show how
to do it a bit later. So the key is that for
generating predictions, you use the same
initialization settings as when you were training the model. Same hidden units,
same model directories, same feature columns. That's important. As I said, we're going to
be using a pandas dataframe, and we define an input function based on the pandas dataframe-- so each dataframe row is going to include feature values for one training example. And each column is going to have feature values for one feature type. And the input function is going to
return a dictionary, mapping names of input
features to tensors containing the feature's
values for the entire batch. So essentially, values for one
column in the pandas dataframe. So the feature tensors
will be two-dimensional. The first dimension is
the size of the batch, which is the number of
rows in the dataframe. And the second
dimension is the width of the corresponding
feature column. So we have RIC, we have
interval sequence number, sum of interval volumes, log of
the current mid, delta log mid. So here, the second
dimension is 120. And there is a little bit
different transformation here, because first, we need to
convert the list of lists into numpy
multidimensional array. And then we shape the
array as required, with the second
dimension set to 120. Then we use the predict method
that returns predictions generator. We use islice method
to get the iterator. And here, it's important to
specify the number of steps in the iterator. Otherwise, it may never stop. And then as we iterate,
we get this list of predictions which correspond
to rows in the dataframe. So the length of this list will be exactly the length of the dataframe-- the number of rows.
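A minimal sketch of that prediction path, assuming the same estimator settings as during training, an in-memory dataframe called live_df with the hypothetical column names used earlier, and one forecast per dataframe row:

```python
# Minimal sketch of generating forecasts from an in-memory pandas dataframe.
import itertools
import numpy as np
import tensorflow as tf

def predict_input_fn():
    # Reshape the list-of-lists feature into a (num_rows, 120) float array.
    delta_log_mids = np.array(list(live_df["delta_log_mids"]),
                              dtype=np.float32).reshape(-1, 120)
    features = {
        "ric": tf.constant(live_df["ric"].tolist()),
        "interval_seq_num": tf.constant(
            live_df["interval_seq_num"].values.reshape(-1, 1).astype(np.float32)),
        "sum_interval_volumes": tf.constant(
            live_df["sum_interval_volumes"].values.reshape(-1, 1).astype(np.float32)),
        "log_current_mid": tf.constant(
            live_df["log_current_mid"].values.reshape(-1, 1).astype(np.float32)),
        "delta_log_mids": tf.constant(delta_log_mids),
    }
    return features  # no labels when predicting

predictions = estimator.predict(input_fn=predict_input_fn)
# Take exactly one prediction per dataframe row; with constant input tensors
# the generator would otherwise never stop on its own.
forecasts = [p["predictions"][0]
             for p in itertools.islice(predictions, len(live_df))]
```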
So the wrap-up. What have we accomplished? We have shown how to build model features using Thomson Reuters data in Google Cloud using BigQuery. We have looked at how to serialize
input features in TFRecord files in Cloud Storage. We have seen how to construct
DNNRegressors and build and train the models locally. And we have seen how
to train the model as a scalable distributed
application using Cloud ML Engine with multiple workers. And now we invite you to apply
the knowledge that you have gained from the session
to try and build and train your own machine learning
model in Google Cloud, and put all pieces
of your own puzzles together into a
beautiful picture. Thank you. [MUSIC PLAYING]