Forecasting Stock Returns with TensorFlow, Cloud ML Engine, and Thomson Reuters

Captions
[MUSIC PLAYING]

ALEX VAYSBURD: Hi. I'm Alex Vaysburd. I am a software engineer at Google. I work on machine learning models for Google ads. And before that, I was a quant and portfolio manager for a number of years at some of the largest hedge funds.

MIKE STONE: And hi, welcome. I'm Mike Stone. I'm a global technology director at Thomson Reuters. I work with some of our largest global clients, focused on helping them advance their innovation and technology strategies. I'm excited to be here with Alex, as we're going to explain how you can leverage Google Cloud Platform capabilities along with Thomson Reuters data. Start us off, Alex.

ALEX VAYSBURD: OK. So building and training a stock price forecasting model in Google Cloud is a very cool, exciting project. Yet it is also a challenging project, because you need to figure out not just how to use each of those high-powered, leading-edge tools that Google Cloud provides; you also need to understand how to put all those pieces together -- how to build your complete data processing pipeline, end to end. And this is exactly the goal of this session: to show you how to do just that. The idea is that after this session, you will be able to build and train your own model in Google Cloud, and it will seem easy to you. So this session is going to be very practical, very hands-on, with lots of code snippets, as well as high-level diagrams and discussions.

So what exactly does it mean to build and train a stock price forecasting model in the cloud? What does it entail? There are three components. First, market data: it is supposed to be in the cloud. What does that mean? It means that the data is hosted in the cloud and is available to you via native cloud APIs. Second, what does it mean to do data processing in the cloud? It means that you are using tools and services that are also native to the cloud, and your entire data processing pipeline, end to end, is native to the cloud. And finally, what does it mean to train your model in the cloud? It means that you're using cloud services to train your model as a distributed application, with multiple jobs running concurrently, potentially training your model at much higher scale and much faster.

So market data is key to models. This is what powers financial models, and this is what brings them to life. And one of the key leading financial market data companies is Google Cloud's strategic partner, Thomson Reuters. And with this, I'm turning it over to Mike.

MIKE STONE: Thanks, Alex. What I'd like to do over the next several minutes is give you a quick overview of the journey Thomson Reuters is taking in terms of bringing our data sets to the cloud, and specifically what we're doing with Google, and then give you a quick overview of the actual data asset Alex is going to be using for the stock forecasting model.

Now, as Alex mentioned, market data certainly powers financial models when you want to run those. Thomson Reuters itself -- our data and solutions power the financial services industry. Our data, for those of you who aren't aware of who Thomson Reuters is, is used by 10 of the top 10 global banks, 90% of all companies managing over $10 billion use our solutions, and 87% of the institutional investors' top global research firms also use our data. Now let me give you some insight into where and how our data is actually used. Thomson Reuters is an open platform that's cloud enabled. We call it -- it's branded as -- the Elektron Data Platform.
And basically what it does is serve the major functions of an institutional firm around the globe. Let me highlight a few of those divisions: investment banking, sales and trading, wealth management, commercial banking, retail banking, and risk and regulatory, to name a few. Now, the Elektron Data Platform itself is a set of services, analytics, integration tools, and then, ultimately, content. And content is really the foundation of the platform itself.

If I take a look -- and I'm sure this is an eye chart for those of you at the back -- what it represents is one of the most extensive data sets Thomson Reuters has in the industry. It spans all the way from the award-winning and leading Reuters news and commentary, which is available not only as human-readable text but also as machine-readable, real-time feeds, through market data pricing, security reference data, risk and regulatory data, a very wide range of company data, and then, wrapping up, risk, compliance, and supply chain data.

Now, it's one thing to have all the data, though. How do we make that available to our clients? As I mentioned previously, we have the Elektron Data Platform, which, again, we are on a journey to move toward the cloud. Now, most of the data within the financial services industry functions in a cycle. Let me use an example to explain what that cycle actually represents with real-time data. If we were to look at what we collect from exchanges -- over 500 of them around the world -- we take all the security activity from the exchange, we bring it in, normalize it, enhance it, and then tag it with common identifiers. From there, it's ready for distribution to our clients, either through a suite of our APIs or, as we progress, making it more and more available through the cloud itself.

Let me give you some insight into the strategy Thomson Reuters is taking as we go to the cloud. There are three primary pillars to our strategy. For our own internal systems, it's cloud first, so everything we're doing, we're building in the cloud. Second, and more important: how do we get our data out to our clients? We're evolving with the industry in providing that data through various methods, as I mentioned -- native cloud services (we'll talk more about Google BigQuery later) or through our own suite of APIs already available in the cloud. The last area is transforming our own previously deployed systems and making those available as a service in the cloud, where, basically, we've pulled them out and have a zero footprint at our clients.

Now let's look a little more specifically at that. In recent surveys, our clients have given us this perspective on how they're moving their applications to the cloud. Down the right-hand side, for financial institutions, is basically the trade lifecycle, all the way from pre-trade, trade, and post-trade middle-office activities to compliance, and then, ultimately, the back-office technology wraps things up. Across the top, we're charting what clients have done today against what they expect to do in the future, in terms of prioritization in moving to the cloud. And across the bottom, it matches nicely with the type of data: relatively static historical data right now, all the way through to the future, where real-time transactional services will be required. And what we've plotted out is what clients have told us they've been deploying to the cloud already.
And there are a few general things that we can observe from this. One is that the applications span the entire trade lifecycle, so no one is shying away from deploying applications anywhere within their business. And, ultimately, a lot of it is being used for analytics- and reporting-type capabilities. Out in the future, clients expect that some things still need to be worked out to get truly low-latency data in the cloud, and it will prove out over time whether that's a valid use case or not -- for example, algo trading. Finally, to summarize, a lot of the data right now is relatively high-volume data, and therefore is used very efficiently by cloud services.

Which brings us to tick history data. One of the largest data sets within the financial services industry is tick data. Tick data, for those of you not familiar, is really the collection and storage of all trade activity -- bid and ask prices and everything -- for every security in the market. So we collect that up, and, as I mentioned before, that's part of what we process through our EDP platform. Now, back in 1918, you can see how they worked a board to store that information. Certainly there's been a lot of evolution in that data since then. I don't think these two working the board could keep up with today's volumes, or would want to worry about nanosecond timestamps.

Specifically, what clients can do today in accessing Thomson Reuters tick history data in the cloud is take it through our web services APIs, or -- as a proof of concept right now, while we sort out the most efficient methods -- we've actually loaded a few years of the 20 years of history of New York Stock Exchange data right into BigQuery tables. It's very efficient for training models and doing simulations on that volume of data. It's one of the ways you can access over 20 years of market-leading-quality history from the 400 exchanges we've collected it from. And, when I talked about enriching data, here's a perfect example: we've taken all the corporate actions, whether splits or dividends, and that's already incorporated for you. And if you're linking it with other data, it comes in the same format, with the same set of identifiers you've already used. There are a few other benefits you don't have to worry about when it's managed in the cloud. You don't have to worry about collecting the data and managing that infrastructure. You're not going to worry about the database CPUs and servers and managing them. And disaster recovery and backups are also part of the built-in services of the cloud.

Now, specifically, what did we actually put in? As I said, as a proof of concept at this point, we've worked with a number of clients. We've taken three years or so of the New York Stock Exchange data and put it in a table. It represents the trade and quote data that goes on. So, for example, we've got an identifier, which is the RIC, or the Reuters Instrument Code, and we have a number of other fields in terms of the time, the bid, and the price. The trade data includes similar information. If we want to get access to it, it's as simple as a SQL query: we just list the fields, the date, and the table we want to pull from. In this case, we're pulling GS.N, which is Goldman Sachs on the NYSE, and then we say we want the quote data. Voila. This is what we're going to get back, very quickly. If we want to do similar for trade data, it's just a slightly different query.
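As a rough illustration of the kind of query Mike describes, here is a minimal sketch using pandas' read_gbq (which Alex introduces later in the session). The project, dataset, table, and column names are hypothetical placeholders, not the actual Thomson Reuters tick-history schema.

```python
import pandas as pd

# Hypothetical quote table loaded into BigQuery; substitute your own
# project, dataset, table, and column names.
QUOTES_QUERY = """
SELECT
  RIC,          -- Reuters Instrument Code, e.g. 'GS.N'
  Date_Time,    -- quote timestamp
  Bid_Price,
  Ask_Price
FROM `my-project.tick_history.nyse_quotes`
WHERE RIC = 'GS.N'
  AND Date_Time BETWEEN '2018-05-25 09:30:00' AND '2018-05-25 16:00:00'
ORDER BY Date_Time
"""

# read_gbq runs the query in BigQuery and returns the result as a dataframe.
quotes_df = pd.read_gbq(QUOTES_QUERY, project_id='my-project', dialect='standard')
print(quotes_df.head())
```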
We're going to modify the fields to say that we want to pull trade data, and here's what we've got. This is the data that Alex is going to be using to explain the model. So I'm going to pass it back to him now, so he can explain how he's leveraged Thomson Reuters data along with Google Cloud's capabilities, so that you can make it easier for yourself.

ALEX VAYSBURD: Thank you, Mike. That was great. OK, the data processing pipeline -- now the fun begins. We're going to take a look at how to put together the data processing pipeline, and before we get into the nitty-gritty details, I'll give a high-level overview of what we're going to be doing.

The first step is that we're going to generate model features, starting from the Thomson Reuters data in BigQuery. And this can be done from a Python script, or from a Colab. In case you don't know, Colab is an extremely useful tool built by Google. It hosts Python notebooks on Google Cloud. It's a free service. When you use it, you get access to cloud services, and you even get a free GPU in the cloud -- you don't have to pay for it. And in Colab notebooks you can write your code, execute the code, and see the output, all within one notebook. It's very convenient for sharing with your teammates, for example.

OK, the second step in the data pipeline is that we're going to write model input features into what are called TFRecord files. This is the standard format for presenting input feature data to TensorFlow models. The third step is that we're going to launch model training using Cloud ML Engine. When the training job starts, the workers will load model features from the TFRecord files that have previously been written to Google Cloud Storage. And periodically, the worker jobs will save model state checkpoints, as well as training and evaluation statistics, in a bucket in Cloud Storage. So that's the pipeline.

Now, let's take a look at model predictions -- and let's start with what we're going to be predicting. What metric will the model be forecasting? When we're talking about a stock price forecasting model, the key is to define what horizon we're talking about. Are we going to be forecasting for a year? For several months? Days? To be specific, for the example that we're going to show you, we're going to be forecasting intraday stock returns with a five-minute horizon. The reason we chose this for this example is that it's something useful and practical: it's useful for scheduling suborders for algorithmic execution orders -- execution algos -- and it's useful for intraday trading strategies. So it's something relevant.

Now, let's take a big-picture view for a moment and discuss a little bit what exactly we're going to be doing here. We're going to be using supervised learning, meaning that we're going to be training our model on examples. Each training example is a pair comprising input features and the correct answer. The input features are whatever inputs the model is going to be using to give us its prediction, and the label is the correct answer. So, given inputs and correct answers, the model will keep learning until, hopefully, it is able to generate correct -- or more or less correct -- predictions from the inputs alone, even when it doesn't know what the answers are going to be.

So what are our labels going to be? How do we construct the labels -- the prediction targets? Our goal is to predict intraday returns over a five-minute horizon.
And for this, we need to define the starting price and the final price -- the end price. For the starting price, we're going to be using the average bid/ask midpoint for the current 10-second interval. And for the final price, we're going to be using the five-minute volume-weighted average price, or VWAP, for the five-minute interval subsequent to the current point in time. And that is actually relevant, because when you schedule suborders, you want to know what the VWAP will be -- not just some price at some random point in time; the VWAP may be more relevant here. So this is the picture that illustrates exactly what we're doing: we calculate the average midpoint, we calculate the VWAP for the next five-minute interval, and we take the logarithm of those two values. The difference of the logarithms is the label that we're going to be using for training the model.

After we're done with building and training the model, we're going to take a look at evaluation results to see how well the model performs. But I guess you're already curious to see how it performs, so we're going to have dessert first, before we're done with the main course.

OK. So this is what we did. For this example, we trained the model on 16 days of Thomson Reuters market data between May 1st and 24th, and we evaluated the model on one day of market data -- May 25th. You can see that the evaluation data should always be after the end of the training data; we don't want to have any look-ahead here. And we're using R squared as a measure of the quality of the model. What is R squared? Approximately, it tells you how good the model is compared to a trivial model that always predicts a zero return. For this particular example, R squared was 0.1, which means that the model was useful. If R squared is zero, it means the model is as good -- or as bad -- as the trivial model that always predicts zero returns. If R squared is 1, it means the model is perfect. If it's between 0 and 1, it means that it has some predictive ability.

So here, you can see that for the zero-prediction model, the mean squared error on evaluation data was 231 squared basis points, where a basis point is one hundredth of 1%. And the mean squared error for our model on evaluation data was 207. So it was slightly better than the trivial model, and this is roughly what you would expect, because you do not realistically expect to forecast all or even most of the variance at five-minute intervals -- there is a lot of noise in intraday prices.

So let's take a look at how we build the inputs. What are we going to be feeding into the model? We're going to be defining input features based on the types of factors affecting stock returns that we're trying to capture. First, we want to learn intraday price patterns at different times of the day, and to do this, we're going to include a feature based on the intraday sequence number of the current 10-second interval. Then we want to capture mean-reversion or momentum price patterns, and for this, we're going to be using a list of differences of logarithms of average midpoint quotes over the last 120 10-second intervals. Essentially, what this long sentence means is that we're looking at returns between consecutive 10-second intervals over the trailing 20-minute window, and we're using this list as one of the input features. Then we want to capture price patterns for stocks at different price levels, and for this, we're looking at the logarithm of the price at the current 10-second interval.
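To make the label and the R-squared measure concrete, here is a minimal sketch in Python. The function names are illustrative, and the only numbers plugged in are the evaluation MSEs quoted above.

```python
import numpy as np

def make_label(avg_mid, fwd_vwap_5m):
    """Log return from the current interval's average bid/ask midpoint to the
    VWAP of the subsequent five-minute window (the training label)."""
    return np.log(fwd_vwap_5m) - np.log(avg_mid)

def r_squared(y_true, y_pred):
    """R^2 measured against the trivial model that always predicts a zero return."""
    mse_model = np.mean((y_true - y_pred) ** 2)
    mse_zero = np.mean(y_true ** 2)
    return 1.0 - mse_model / mse_zero

# Plugging in the evaluation numbers quoted in the talk (in squared basis
# points): MSE of the zero-prediction model ~231, MSE of the trained model
# ~207, which gives an R^2 of roughly 0.10.
print(1.0 - 207.0 / 231.0)  # ~0.104
```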
Volume is also relevant. So, as one of the input features, we're going to be using the volume in the trailing 20-minute window preceding the current interval. But for this to be useful, we want to normalize the volume by the stock's average daily volume over some reasonably long time range. For example, we used the 4-week average daily volume as a normalization factor: we take the traded volume for the stock over the last 20 minutes and divide it by the stock's average daily volume over the past four weeks. And finally, we're going to be using an input feature based on the stock's security identifier, the RIC, because different stocks may have their own distinct intraday price patterns. I will show how to plug a RIC -- or a string in general -- into the model.

Now, get ready for some heavy construction work ahead: I'm going to start showing you some slides with the actual SQL code snippets. First of all, we're going to be using BigQuery. What is BigQuery? It's Google Cloud's enterprise data warehouse, ideal for analytics. You can run standard SQL queries on it, and it also has some very useful extensions -- analytic functions. I will show how to use those functions. It's fully managed and serverless, meaning that Google takes care of provisioning CPU and storage capacity and of data replication. You don't have to worry about any of that.

Feature number one: the intraday sequence number of the current 10-second interval. What I want to do here is basically calculate the number of seconds since midnight and divide it by 10. To do this, we extract the hour and multiply by 360 -- we convert from UTC into local time, because we care about stock patterns specific to the exchange's local time, not UTC -- we extract the number of minutes and multiply by 6, and we extract the number of seconds, divide by 10, and round. So this is our interval sequence number.

The building block for feature number two: the list of 10-second interval midpoints. One nice thing BigQuery provides is the ability to use temporary tables as part of queries. In this example, we use interval midpoints as a temporary table: you write a query, the output of the query is stored in the temporary table, and then we can use this table from within another query. And here, we're using the array aggregation function, which is an analytic function in BigQuery that allows you to apply a certain function over a series of rows preceding each row in the output. What I want to do here is partition the data by RIC, and the OVER clause specifies exactly how we're going to do it. We partition the data by RIC -- by stock identifier. Within each partition, we order the rows by interval sequence number. And with this ordering, for each row we take the 120 preceding rows, make a list of them, and return that as the output of the query.

Feature number three: normalized traded volume. There are three steps here. As I mentioned before, we compute each stock's average daily volume for the last four weeks. We compute interval volumes and normalize them by ADV. And then, for each interval, we compute the sum of normalized volumes for the 10-second intervals in the preceding 20-minute window. So there are three steps. And, as before, we use a temporary table to store the average daily volumes. Then we use this temporary table from another query to calculate interval volumes normalized by ADV. And then we use those values from yet another query, in which we use an analytic function called SUM.
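Here is a rough reconstruction of the kind of feature query described above, not the exact SQL from the slides. The project, dataset, table, and column names, the time zone, and the single-day filter are placeholder assumptions for your own tick-history schema.

```python
import pandas as pd

MIDPOINT_FEATURES_QUERY = """
WITH interval_midpoints AS (
  SELECT
    RIC,
    -- Intraday sequence number of the 10-second interval, in exchange local time:
    -- hour * 360 + minute * 6 + floor(second / 10).
    EXTRACT(HOUR FROM Date_Time AT TIME ZONE 'America/New_York') * 360
      + EXTRACT(MINUTE FROM Date_Time AT TIME ZONE 'America/New_York') * 6
      + CAST(FLOOR(EXTRACT(SECOND FROM Date_Time) / 10) AS INT64) AS interval_seq_num,
    AVG((Bid_Price + Ask_Price) / 2) AS avg_mid
  FROM `my-project.tick_history.nyse_quotes`
  WHERE DATE(Date_Time) = '2018-05-25'   -- one trading day, for simplicity
  GROUP BY RIC, interval_seq_num
)
SELECT
  RIC,
  interval_seq_num,
  avg_mid,
  -- Trailing list of the 120 preceding 10-second interval midpoints.
  ARRAY_AGG(avg_mid) OVER (
    PARTITION BY RIC
    ORDER BY interval_seq_num
    ROWS BETWEEN 120 PRECEDING AND 1 PRECEDING
  ) AS trailing_mids
FROM interval_midpoints
"""

features_df = pd.read_gbq(MIDPOINT_FEATURES_QUERY, project_id='my-project',
                          dialect='standard')
```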
Which, instead of making a list of all the values from the preceding rows, simply adds up the values from the preceding rows. And, as before, we specify that we want to partition the data by RIC, order the rows in each partition by interval sequence number, and, for each row, apply the sum over the 120 preceding rows.

Now I'm going to show how to construct a building block for the label: the VWAP in the five-minute period subsequent to each interval. This is something that we won't have when we're generating forecasts, but it is something that we will have when training the model. So, once again, we use a temporary table for interval price-volumes. We use this table from another query, which, for each row -- for each interval -- computes the sum of price times volume and then divides by the sum of volumes.

Now, one thing that's very convenient about keeping the data in BigQuery on Google Cloud is that it's very nicely integrated with Python. There is a pandas module called pandas.io.gbq that has a method called read_gbq. You pass the query and your project ID in Google Cloud, and it returns the output of the query as a pandas dataframe. And then you can do additional transformations of the data -- generating the final features from the intermediate query results -- directly within pandas dataframes, which have very rich functionality. So it's very convenient. Now, once you have this final dataframe with all the features that you need, you can save it into a JSON file with the to_json function. The JSON format is very convenient in that it preserves the structure of the data; for example, it will save lists as lists. If you use the to_csv function, for example, it will save lists as strings and you will lose the structure. So I recommend to_json. And then, finally, if you wish, you can use the gsutil tool -- one of the tools in Google Cloud -- to copy your JSON file into one of your buckets in Cloud Storage.

Generating TFRecord files. TFRecord is the recommended format for storing serialized model features in TensorFlow. So what we do is run queries in BigQuery to get the data, put the data into dataframes, and apply additional transformations. And then we use those dataframes, with everything prepared, to generate TFRecord files, and those files will later be fed to the model for training. There are three steps here: generating training examples comprising model input features, serializing training examples into TFRecord files, and feeding the TFRecord files into the model.

So what is a training example? It's actually a Python dictionary, or map. It maps feature names to feature protos -- protocol messages. And there are three types: you can have a list of bytes, a list of floats, or a list of integers. You will always have a list. And what do you do with features that are scalars? If it's not a list, you simply make a list with one element in it. So that's an example. Here is how we construct an example, assuming that each row holds the feature values from the dataframe that we constructed earlier. Delta log VWAP mid -- this is what we're going to be using as our label. This is the correct answer that we only have during training; we don't have it when we're actually using the model to make predictions. And we have the five features that I described earlier: the RIC (the stock identifier), the interval sequence number, the deltas of log mids, the sum of interval volumes, and the logarithm of the current mid. How do you use these to actually generate TFRecord files?
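Here is a minimal sketch of building such an example with the TF 1.x API the talk relies on; the column and feature names are the placeholders used in the earlier sketches, not the talk's exact code.

```python
import tensorflow as tf

def make_example(row):
    """Build a tf.train.Example from one dataframe row of features.

    The column names follow the features described in the talk (label, RIC,
    interval sequence number, trailing log-mid deltas, trailing normalized
    volume, log of the current mid); treat them as placeholders for your own
    dataframe's column names.
    """
    def float_list(values):
        return tf.train.Feature(float_list=tf.train.FloatList(value=values))

    def bytes_list(values):
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))

    features = {
        # Label: log(forward 5-minute VWAP) minus log(current average mid).
        'delta_log_vwap_mid': float_list([row['delta_log_vwap_mid']]),
        # Model inputs.
        'ric': bytes_list([row['ric'].encode('utf-8')]),
        'interval_seq_num': float_list([float(row['interval_seq_num'])]),
        'delta_log_mids': float_list(list(row['delta_log_mids'])),   # 120 values
        'sum_interval_vols': float_list([row['sum_interval_vols']]),
        'log_current_mid': float_list([row['log_current_mid']]),
    }
    return tf.train.Example(features=tf.train.Features(feature=features))
```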
The first step is to initialize the TFRecord writer. Then you iterate over the rows in the dataframe that contains the features, and for each row, you construct an example based on the feature dictionary that I showed you on the previous slide. Then you serialize this example and write it to the TFRecord writer. Finally, when you are done, you close the writer, and you have a file with serialized TFRecords. If you only intend to train your model locally, you can generate this file in a local directory. But if you are planning to use Cloud ML Engine, you have to copy the TFRecord files into Cloud Storage, so that they can be accessed by the multiple tasks running as part of the Cloud ML Engine job.

Preparing data to train the model. TFRecords can be efficiently fed into TensorFlow models for training and evaluation, and in the next several slides, we're going to look at exactly how we're going to feed those records to our model. There are three steps here. The first step is to define feature columns -- I will show you in a moment what they are. The second step is to create a parsing spec. And the third step is to define an input function that will use the parsing spec.

Step one: defining feature columns for the model. The feature columns are a list that specifies the type of each feature. Here, for example, for the RIC, we're using what's called an embedding column. We're plugging in the stock identifier, a string, as an embedding, which essentially means that we're turning the string into a numeric value. And this numeric value may represent, for example, a stock's propensity to have mean-reverting versus momentum price patterns. The one caveat you need to be aware of when you use embeddings is that you want to set the hash bucket size to a large enough value, because you want to avoid collisions in the hash map used for the embeddings as much as possible. Essentially, when you have collisions, the model doesn't distinguish between two different RICs, in this case, and you don't want that. Then you have feature columns for the other model inputs -- in this case, the interval sequence number, the sum of interval volumes for the trailing window, and the logarithm of the current mid. You can see that we specified their shape: the first dimension is one in all cases, meaning that they all have width one -- they're all scalars. And then we have delta log mids, for which we specify a shape with the first dimension set to 120, reflecting the fact that it's expected to be a list of 120 values.

Step two is creating the parsing spec. This is very easy to do with a function called make_parse_example_spec. You pass the list of feature specifications to this function, and it gives you a feature spec that we can use in the input function, which is the next step.

We're going to define the input function over the next couple of slides. Let's see what we're doing here. The first step is constructing a dataset, and for this, I recommend using a function called make_batched_features_dataset. It creates a ready-made, performance-optimized dataset with batching and shuffling built in. So you don't need to worry about reading your data efficiently; you don't need to worry about setting up multiple threads to read your data, which can sometimes be a bottleneck. It does everything for you. Where previously you would make several function calls, here you can just use this one function. So let's look at the parameters.
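A minimal sketch of these steps with the TF 1.x APIs the talk relies on. The embedding dimension and hash bucket size shown are illustrative choices, and the feature names are the same placeholders as before.

```python
import tensorflow as tf

# Writing the serialized examples (from the previous sketch) to a TFRecord
# file. Use a gs:// path if the file needs to be visible to Cloud ML Engine.
def write_tfrecords(df, path):
    writer = tf.python_io.TFRecordWriter(path)
    for _, row in df.iterrows():
        example = make_example(row)            # from the earlier sketch
        writer.write(example.SerializeToString())
    writer.close()

# Feature columns: an embedding column for the RIC string, numeric columns
# for the scalar features, and a width-120 numeric column for the trailing
# log-mid deltas. hash_bucket_size is deliberately large to limit collisions.
ric_column = tf.feature_column.embedding_column(
    tf.feature_column.categorical_column_with_hash_bucket(
        'ric', hash_bucket_size=100000),
    dimension=8)

feature_columns = [
    ric_column,
    tf.feature_column.numeric_column('interval_seq_num', shape=(1,)),
    tf.feature_column.numeric_column('sum_interval_vols', shape=(1,)),
    tf.feature_column.numeric_column('log_current_mid', shape=(1,)),
    tf.feature_column.numeric_column('delta_log_mids', shape=(120,)),
]

# Parsing spec used by the input function to deserialize the TFRecords.
feature_spec = tf.feature_column.make_parse_example_spec(feature_columns)
```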
The file pattern specifies the list of TFRecord files that you generated previously -- for example, you could have a separate TFRecord file for the market data from each trading day. Then, the batch size: here, we set it to 128. What does this mean? It means that the training examples are going to be passed through the model not one by one, but together, in batches, and each batch is going to have 128 training examples. When you feed training examples through the model, it generates predictions, the loss is computed from the difference between the predictions and the actual correct answers, and then, based on the loss, backpropagation of gradients is applied. When you use a batch, backpropagation happens only once for the entire batch: the gradients are averaged over all the training examples, and then backpropagation happens once. This is called mini-batching, and it usually improves model performance, with the batch size set somewhere around a hundred to a thousand.

Now, the feature spec is what we constructed earlier. The number of epochs is set to one; this simply means that we're reading the dataset just once. You can read it several times if you wish. Shuffle is set to true. What this does is randomize the order in which the training examples read from the files are put into batches. It makes the model more robust. Then we make an iterator. make_one_shot_iterator simply creates an iterator that reads the data from the dataset just once; there are other types of iterators, some of which read the data more than once.

And this is interesting: the get_next function. If you look at this function, it looks like it returns features. But in fact, it does not. What this function returns is an operation node in the computation graph, and later on, when this node is evaluated, it will return the next batch of examples, each time it is evaluated. This is a very important, subtle point, so I have a separate slide just for it, because it's key to understanding TensorFlow programs in graph mode. In graph mode, they have two phases. In phase one, you construct the computation graph, but no data is flowing through the graph yet. Then, in phase two, you execute the computation within a TensorFlow session, and this is when data actually flows through the graph. So our dataset input function, as it's implemented, doesn't return any batches of data yet. Instead, it adds nodes to the TensorFlow computation graph, and later on, when those nodes are evaluated within a session, they will return the next batch of training examples upon each evaluation. So, back to the dataset input function: it returns features and labels, where features is an operation node in the TensorFlow graph and labels is another operation node in the TensorFlow graph.

Training the model. First, we're going to look at how to train the model locally. The model will train on the examples we feed it, and after some number of examples on which it trains, hopefully it will learn how to make correct predictions. That's how supervised learning works. The first step is to construct a DNNRegressor. TensorFlow has several built-in estimators; DNNRegressor is one of them, and it's intended for models that make numerical predictions, which is what we're doing in our example. Hidden units specifies the structure of the neural network: here, we are going to be using two fully connected layers, and each layer is going to have 128 hidden units.
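Putting the input function and the estimator together, a minimal sketch under the same assumptions as before (TF 1.x, where this helper lived under tf.contrib.data; in later releases it moved to tf.data.experimental). The label key and the bucket path are placeholders.

```python
import tensorflow as tf

def dataset_input_fn(file_pattern, num_epochs=1, batch_size=128):
    """Input function built on make_batched_features_dataset, as described above."""
    # The parsing spec needs an entry for the label as well as the inputs.
    spec = dict(feature_spec)                      # from the previous sketch
    spec['delta_log_vwap_mid'] = tf.FixedLenFeature([1], tf.float32)

    dataset = tf.contrib.data.make_batched_features_dataset(
        file_pattern=file_pattern,
        batch_size=batch_size,
        features=spec,
        num_epochs=num_epochs,
        shuffle=True)

    # In graph mode this only adds nodes to the graph; each evaluation of
    # get_next() inside a session yields the next batch of examples.
    features = dataset.make_one_shot_iterator().get_next()
    labels = features.pop('delta_log_vwap_mid')
    return features, labels

estimator = tf.estimator.DNNRegressor(
    hidden_units=[128, 128],                  # two fully connected layers
    feature_columns=feature_columns,          # from the previous sketch
    model_dir='gs://my-bucket/stock-model')   # placeholder bucket path
```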
Feature columns -- these are exactly the feature columns that we defined earlier: a list of specifications of the dimensionality and type of each input feature. The model directory is another argument; it specifies where the model is going to save its periodic checkpoints -- its state -- and where it's also going to periodically save its training and evaluation statistics.

So here, I'm again going over the feature columns. There is an embedding column for the RIC, and there are numeric columns for the interval sequence number (shape one), the sum of interval volumes (shape one), and the log of the current mid -- they all have width one. And delta log mids has width 120. And, as I said, the model directory is where the model stores its checkpoints. Importantly, this is also the directory from which the model reads its state to initialize itself when you construct a DNNRegressor later on. By using this feature, you can train the model iteratively, or incrementally. The way it works is that you initialize the DNNRegressor from the latest checkpoint, then you train the model on newly available data -- a newly available dataset, for example -- and your checkpoints will be automatically saved in the model directory: it will create new checkpoint files under new file names and automatically garbage-collect the old checkpoints. So, for example, in the case of financial data, you can incrementally train the model at the end of each trading day. At the end of the trading day, you collect all the market data for that day and train the model on the training examples for that day. The model starts from the checkpoint from the previous day, and after it trains, it saves a checkpoint representing its state after it has incrementally trained on today's market data. And then tomorrow it can go on.

So, how to train the model. One thing before we get to training: you may want to add some custom performance metrics -- evaluation metrics. By default, DNNRegressor has one metric: it shows the L2 loss values, which is the mean squared error. And it may be useful to have the root mean squared error as well. So this is how you add custom evaluation metrics.

We're almost at the point of training the model. Before we train, we need to construct a training spec and an evaluation spec. What are those? Primarily, they specify the input functions. We can use the same input function, simply initialized differently with a different list of files: we pass the list of training files into the train spec, and we pass the list of evaluation files into the eval spec. And, very importantly, when training models, the evaluation data should start after the end of the training data -- you don't want to have any look-ahead here. It's also important to specify the maximum number of steps for training, because, depending on how you implement your dataset input function, if you don't specify it, your model may train forever; it will never stop. And finally, the train_and_evaluate function is the recommended way of training a TensorFlow model that uses a DNNRegressor. It trains the model, periodically checkpoints its state, and periodically checkpoints training and evaluation performance metrics, all in the same directory, with a separate subdirectory for the evaluation metrics.

Now, we've had a look at how to train the model locally; let's take a look at how you would do it with Cloud ML Engine. Why would you want to use Cloud ML Engine? The primary reason is that it gives you scalability.
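A sketch of the custom metric and the train/eval specs described above, again assuming the TF 1.x APIs current at the time of the talk (add_metrics lived under tf.contrib.estimator); file patterns and max_steps are placeholders.

```python
import tensorflow as tf

# Add a custom evaluation metric (RMSE) on top of the default L2 loss.
def rmse_metric(labels, predictions):
    return {'rmse': tf.metrics.root_mean_squared_error(
        labels, predictions['predictions'])}

estimator_with_rmse = tf.contrib.estimator.add_metrics(estimator, rmse_metric)

# The train and eval specs wrap the same input function, initialized with
# different file lists; evaluation files should start after the training
# data ends. max_steps stops training even if the input function could keep
# producing data.
train_spec = tf.estimator.TrainSpec(
    input_fn=lambda: dataset_input_fn('gs://my-bucket/tfrecords/train-*'),
    max_steps=100000)
eval_spec = tf.estimator.EvalSpec(
    input_fn=lambda: dataset_input_fn('gs://my-bucket/tfrecords/eval-*'))

# Trains, checkpoints, and evaluates, all under the estimator's model_dir.
tf.estimator.train_and_evaluate(estimator_with_rmse, train_spec, eval_spec)
```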
It gives you the ability to train the model with multiple workers -- multiple tasks -- at the same time. This way you can train your model many times faster, and you get a much faster turnaround over the different versions of the model you want to try. You can do your research much faster this way. The good thing is that when you transition your model from local training to training with Cloud ML Engine, you don't need to change your code at all. When you're using estimators, they already have built-in support for Cloud ML Engine, and that's extremely convenient. So all you have to do is specify a configuration file, provide the code to construct the training package -- the setup file -- and provide the main program to train the model, which you already have, because it's the same program.

Configuration settings. You need to specify the scale tier. In this example, we're going to be using a custom tier, because it's the most interesting and most flexible: you can specify exactly the hardware configuration for the different types of nodes in your training job. So let's take a look at the different types of nodes. There are three types. There are worker nodes, which calculate gradients; workers are the nodes that do lots of numerical computations, such as matrix multiplications, and they are the nodes that you will want to run with GPUs to get faster performance. Then there are parameter server nodes, which update the parameters with gradient vectors from the workers. And then there is a master node, which coordinates everyone and also operates as a worker. Based on that usage, we are going to be using standard GPU machines for the master and worker nodes, and standard CPU machines for the parameter servers. We also get to specify how many workers and how many parameter servers we want to use. For this example, we're going to be training the model with eight workers and four parameter servers.

The code to construct the training package -- this is a separate file we need to provide, and in it you specify the list of required packages. In many cases, this list will actually be empty, because each Cloud ML Engine runtime comes with an already built-in list of many popular Python packages. So if you are using something that's very nonstandard, you may need to specify it here in the setup file, but in many cases you won't need to. And the main program to train the model is the same as for local training. Again, let's go over what it does: it parses command-line parameters, it defines the feature columns, it constructs the DNNRegressor, it defines evaluation metrics if you need any, it defines the input function that will read TFRecords from a list of files, and it starts training and evaluating the model.

This is the command you use to submit a job to Cloud ML Engine. You specify a unique job name. You specify the path to your main program that does the training -- the program that creates the DNNRegressor and calls the train_and_evaluate function. You specify the path to your configuration file. You specify the path of your staging bucket in Cloud Storage. And you specify the runtime version. The number in the runtime version is exactly the version of TensorFlow, so depending on what version of TensorFlow you intend to use and on your code dependencies, you will specify that version here. And, as I said, each runtime version of Cloud ML Engine comes with a different list of packages already built into the runtime. Then you need to specify the region.
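As a sketch of the custom-tier configuration and submission command described here, using the gcloud ml-engine CLI current at the time of the talk: the job name, package path, bucket, runtime version, and custom arguments are placeholders.

```bash
# config.yaml -- the custom scale tier described above: GPUs on the master
# and the 8 workers, plain CPU machines for the 4 parameter servers.
cat > config.yaml <<'EOF'
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerType: standard_gpu
  parameterServerType: standard
  workerCount: 8
  parameterServerCount: 4
EOF

# Submit the training job; arguments after the bare "--" are passed through
# to the training application itself.
gcloud ml-engine jobs submit training stock_returns_job_001 \
  --module-name trainer.task \
  --package-path trainer/ \
  --config config.yaml \
  --staging-bucket gs://my-bucket \
  --runtime-version 1.8 \
  --region us-central1 \
  -- \
  --train-files gs://my-bucket/tfrecords/train-* \
  --eval-files gs://my-bucket/tfrecords/eval-*
```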
There are two regions supported at the moment: us-central1 and us-east1. And finally, if you need to, you can specify a list of custom parameters that will be passed to your training application.

Evaluation of the model. So what is evaluation? The key questions I want to answer are: how precise is our model? What's the quality of its predictions? How well does it perform on evaluation data? We're going to be using TensorBoard for this. This is how you run TensorBoard: you run the tensorboard command and specify the directory where the model checkpoints are -- the evaluation statistics and training statistics are in the same place. This is the same model directory that we used when we constructed the DNNRegressor object.

This is what the L2 loss looks like on the training data used for this model example -- the loss over the first 3 and a 1/2 hours of training. You can see that the L2 loss started at around 90, probably, and it gradually decreased over the next 2 and a 1/2 hours, meaning that as the model trains, its loss on the training data becomes gradually lower and lower. But this doesn't necessarily mean that the model is becoming better at its predictions, because what we really care about is the quality of generalization. And when we look at the L2 loss on the evaluation data, we see that it starts at around 206 after 30 minutes of training, and then, actually, you can see that the loss on the evaluation data increases slightly, to around 210 to 211. What this means is that you probably want to stop training your model after the first half hour. After that, the L2 loss on the training data will keep decreasing, but this simply means that the model gradually starts overfitting. So you want to stop training, because you're not going to improve the quality of the predictions after that: the longer it trains past the first half hour in this example, the worse the model actually predicts the evaluation data. There is usually an optimal amount of training you want to apply.

Let's take a look at R squared. As I mentioned before, R squared gives you an idea of how well the model performs on evaluation data compared to a trivial model that always predicts zero returns. As I said, if the value is zero, it means the model is worthless; if it's one, it means the model is perfect. Usually, we expect the value to be closer to zero than to one, but positive, and as it happens, that's the case here. R squared is approximately 0.10 to 0.11 after the first 20 to 30 minutes of training, and then it declines to a range between 0.08 and 0.09, which is actually decent when you're predicting five-minute intraday prices. But remember, this was based on a model trained on only 16 days of market data and evaluated on only one day of data. This is fine for an example of how to build and train the model, but it is not what you would want to do for actual use in production. For production, you would want to train your model on several years of data and evaluate it on several years of data.

Generating forecasts. Once we have the model, how do we generate forecasts? For training, it's convenient to save data into TFRecord files. When you want to build a forecast in real time, it may actually be more convenient to keep the data in pandas dataframes in memory than to read it from files. So I will show how to do that a bit later.
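For the TensorBoard step described above, a one-line sketch (the model directory path is the same placeholder bucket path used earlier):

```bash
# Point TensorBoard at the estimator's model_dir, which holds the checkpoints
# plus the training and evaluation statistics.
tensorboard --logdir gs://my-bucket/stock-model
```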
The key is that for generating predictions, you use the same initialization settings as when you were training the model: the same hidden units, the same model directory, the same feature columns. That's important. As I said, we're going to be using a pandas dataframe, and we'll write an input function based on the pandas dataframe. Each dataframe row is going to include the feature values for one example, and each column is going to hold the feature values for one feature type. The input function is going to return a dictionary mapping the names of the input features to tensors containing the feature's values for the entire batch -- essentially, the values of one column in the pandas dataframe. So the feature tensors will be two-dimensional: the first dimension is the size of the batch, which is the number of rows in the dataframe, and the second dimension is the width of the corresponding feature column. We have the RIC, the interval sequence number, the sum of interval volumes, the log of the current mid, and the delta log mids, for which the second dimension is 120. And there is a slightly different transformation for that one, because first we need to convert the list of lists into a numpy multidimensional array, and then we reshape the array as required, with the second dimension set to 120. Then we use the predict method, which returns a predictions generator. We use the islice method to get an iterator, and here it's important to specify the number of steps in the iterator; otherwise, it may never stop. Then, as we iterate, we get a list of predictions that correspond to the rows in the dataframe, so the length of this list will be exactly the length of the dataframe -- the number of rows.

So, the wrap-up. What have we accomplished? We have shown how to build model features from Thomson Reuters data in Google Cloud using BigQuery. We have looked at how to serialize input features into TFRecord files in Cloud Storage. We have seen how to construct DNNRegressors and how to build and train models locally. And we have seen how to train the model as a scalable, distributed application using Cloud ML Engine with multiple workers. And now we invite you to apply the knowledge that you have gained from this session to try to build and train your own machine learning model in Google Cloud, and put all the pieces of your own puzzle together into a beautiful picture. Thank you.

[MUSIC PLAYING]
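Pulling the prediction path described above into one sketch, using the TF 1.x estimator API and the placeholder feature names from the earlier sketches; `estimator` is the DNNRegressor constructed earlier (with the same settings as for training), and `live_df` stands in for a dataframe of freshly computed features.

```python
import itertools
import numpy as np
import tensorflow as tf

def pandas_predict_input_fn(df):
    """Inference input function: turns a dataframe of live features into a
    dict of feature tensors, one column per feature (placeholder names)."""
    def input_fn():
        # The trailing-returns feature is a list of lists; stack it into a
        # (num_rows, 120) numpy array before turning it into a tensor.
        delta_log_mids = np.array(df['delta_log_mids'].tolist(),
                                  dtype=np.float32).reshape(len(df), 120)
        features = {
            'ric': tf.constant(list(df['ric'].astype(str))),
            'interval_seq_num': tf.constant(
                df['interval_seq_num'].values.astype(np.float32).reshape(-1, 1)),
            'sum_interval_vols': tf.constant(
                df['sum_interval_vols'].values.astype(np.float32).reshape(-1, 1)),
            'log_current_mid': tf.constant(
                df['log_current_mid'].values.astype(np.float32).reshape(-1, 1)),
            'delta_log_mids': tf.constant(delta_log_mids),
        }
        return features  # no labels at prediction time
    return input_fn

# predict() returns a generator; islice bounds it by the number of rows so
# the iteration stops, giving one forecast per dataframe row.
predictions = estimator.predict(input_fn=pandas_predict_input_fn(live_df))
forecasts = [p['predictions'][0]
             for p in itertools.islice(predictions, len(live_df))]
```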
Info
Channel: Google Cloud Tech
Views: 14,228
Rating: 4.94 out of 5
Keywords: forecasting stock returns, cloud ml engine, thomson reuters tick data, BigQuery, stock price forecasting model, distributed training tensorflow, tensorflow distributed training, Google Cloud next 2018, cloud next 2018, cloud next 18 livestream, next 2018, google cloud next, cloud next, google cloud next ’18, google cloud conference, machine learning, AI, GCP, G Suite, cloud developers, AWS, Kubernetes, CloudSQL, Cloud ML
Id: VAkLSLuJCgc
Length: 50min 56sec (3056 seconds)
Published: Wed Jul 25 2018