Bigtable, BigQuery, and iCharts for ingesting and visualizing data at scale (Google Cloud Next '17)

Captions
CARTER PAGE: Thanks everyone for coming. I know it's late in the day after a couple of very busy sessions. If any of you want to go grab a beer and come back, that's cool too. It's been going on for a while. But anyway, thanks very much for joining us. I'm Carter Page. I'm the Lead Manager for Bigtable at Google, which includes the Cloud Bigtable team, and we have George McMullen, who's the Senior Director of Products at iCharts. GEORGE MCMULLEN: Hi, everybody. CARTER PAGE: And the two of us are going to be presenting about BigQuery federated queries on Cloud Bigtable and how iCharts uses them.

So cutting straight to the chase, we just announced this morning that you can now run high-speed SQL queries on unstructured data in Bigtable. You don't need to hook up Hive or anything else like that. You can just connect BigQuery, and now you've got SQL. You can basically turn Bigtable into another virtual table in BigQuery, and do everything you can do with BigQuery on top of it. The advantage of using this approach is that you can query row ranges. If you select on row keys -- basically bounding the row keys in Cloud Bigtable when you're doing queries through BigQuery -- you can get very fast results, as long as you're ranging over that data. If you do full table scans over terabytes, it's not going to be as fast as BigQuery is. That's not exactly what this connector is optimized for, but you can do it in batch mode. I'll talk a little bit more about the performance tradeoffs, why this is, et cetera as we go along. The important thing is that it supports SQL. It supports both legacy SQL and standard SQL. There are some performance optimizations coming for standard SQL in the next couple of weeks, and I'll talk about that a little bit too. And all the while you're running BigQuery against a live, real-time Cloud Bigtable database, you can still use the Cloud Bigtable interface for the low-latency reads and writes that it provides. And by connecting Bigtable to BigQuery, it also connects Bigtable to the BigQuery ecosystem. So anything that can read from a BigQuery table can now read from a Bigtable table.

Here's the agenda. Basically I'm going to run through real quick how BigQuery works. Then I'm going to go into some detail about how Bigtable works. I'm going to talk specifically about the BigQuery-Bigtable connector and some of the tradeoffs -- how you use it, when to use it, when not to use it. Then we're going to get in and do a demo. We're actually going to set up the connector and run BigQuery against a Bigtable. And then iCharts is going to come in and show how they plug into this connector as well.

So stepping back a little bit, the broader roadmap here. Google Cloud Platform has seven different storage products ranging from a memcache all the way up to a data warehouse. What we're going to be focusing on today are these two, Cloud Bigtable and BigQuery. These two databases are what we refer to as our analytics databases, and they have different strengths and weaknesses. There are tradeoffs to them. What Cloud Bigtable is designed for is being able to scale to petabytes and beyond, and being able to provide very fast single-value reads and writes. Spot reads and writes in single-digit milliseconds, even out on the long tail, like the 99th percentile and beyond. What BigQuery is specialized for is the ability to take terabytes of data and do aggregation across it. It can do aggregation analysis and return results in just a few seconds.
The inverse of those is not actually true, though. BigQuery cannot do very fast individual spot reads and writes. And while you can do scans over Cloud Bigtable, it's not optimized for them in the same way that BigQuery is, so large scans will end up taking minutes. I'll talk again a little bit about why that is, but the basic gist of it is that the architectures of the two systems are different. Cloud Bigtable is row based. All of your data is basically stored one row at a time, and so the files that underlie Bigtable are sets of rows. One file contains a set of rows. BigQuery, on the other hand, stores each column as a set of files. So if you have 20 different columns in BigQuery, and you read two different columns from the same row, you're reading from different files. And I'll explain a little bit more why that affects the relative performance characteristics of the two.

So, a quick dive into how BigQuery works. It's a fully managed, no-ops data warehouse. When you hear data warehouse, star schema, those sorts of things, that's where BigQuery comes in. Petabyte scale, and it's super, super parallelized for speed. If you ask a question of it, we'll spin up tons and tons of workers to analyze your data, charge you just for the seconds that they're up, give you the response back, and spin them down. It has the convenience of SQL, so you don't need software developers to ask simple questions of your data. And it's durable and highly available, with similar characteristics to Bigtable. The original architecture for its first decade -- I think this was published in a paper in the late 2000s -- is sort of like an on-the-spot MapReduce. You essentially have these leaf nodes that will each be assigned to a chunk of data in BigQuery. They will go analyze it, aggregate the results, and send them up to the next layer, the layer of mixers, which in turn will aggregate again and send it up to the top mixer. In 2015 we developed the Dremel X architecture, which simplified this and actually sped it up, where you really just have one master talking to all the shards to aggregate things up. And underneath, these talk to the columnar storage, which in 2016 we replaced with Capacitor, which is faster, especially when dealing with the encryption at rest that we do for everything at Google. And as I mentioned, the important thing about having a columnar store when you're doing large aggregate queries is -- say you have a table with 100 different columns and you want to aggregate the values in column A based on the values in column J. BigQuery will basically rip through just those two columns, spread those out and aggregate them, and return results back. If you did the same thing on Bigtable, however, Bigtable would be reading every single row as you go along, just because of the way it's stored. And that's by design, in terms of what they're intended for.

So I'll go into how Bigtable works. There are some slides here that were in some of the other Bigtable talks today. I just want to gauge how much I need to go into a deep dive on this. How many people have been to other deep-dive Bigtable sessions today? There are enough people who haven't, so I'm not going to rush through this. Cloud Bigtable is a fully managed NoSQL database. It basically kicked off the NoSQL revolution in the mid-2000s.
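To make that columnar point concrete, here is a minimal, hypothetical sketch (the dataset, table, and column names are made up): an aggregate like the one below only has to read the two columns involved from BigQuery's columnar files, while the equivalent scan over a row-oriented store such as Bigtable would read every column of every row it touches.

# Hypothetical aggregate over a wide table with ~100 columns.
# Columnar storage means only col_a and col_j are read from disk;
# the other ~98 columns are never touched.
bq query --use_legacy_sql=false \
  "SELECT col_j, SUM(col_a) AS total
   FROM my_dataset.wide_table
   GROUP BY col_j"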
It led to a bunch of fun open source solutions that either were direct clones of the architecture described in the paper or were heavily influenced by it. This includes HBase, Cassandra, Hypertable, the one the NSA built -- I think it's called Accumulo. It basically inspired a whole family of databases out there. The important characteristic about it is that it's incredibly scalable. It can scale to petabytes, hundreds of petabytes, whatever. And it's very, very low latency. So even with large data, even if you're pounding it with a million QPS, you get very tight, low latency, even at the tail -- single-digit milliseconds typically. It supports sequential row scans. The row keys are stored sequentially; think of it like alphabetical order. And this allows you to do some clever things in terms of optimization. For example, one of the examples we'll give in the demo is the idea of stock ticker data over time. If you prefix your row key with a stock ticker, you can tell it, hey, I want to read the data for this stock ticker during this window, and it will only read that data. The other thing that's nice about it is, under the covers, because the rows are stored together -- say we're going to look at the stock ticker GOOG, for Google -- when you start reading the first row for that, it's going to go to the file system, and those rows are stored together in the same files. It's already going to have pulled that file into memory, into the block cache of the file system, and so it's actually very fast doing scans. Not quite as fast as BigQuery for the reasons outlined, but still pretty good. And then the other thing is it learns and adjusts to your access patterns. So as you have diurnal patterns during the day, where different areas of the data go hot or cold, Bigtable will learn about this and rebalance. And I'll show how this works.

So here's basically what a Bigtable looks like. Conceptually, you could have millions of columns, so it can be a very wide table. You want a single row to be less than 100 megs, but apart from that, you have no constraint on the number of columns you throw into a row. It's a sparse table, meaning that you could have a billion rows, and each of those billion rows could have a different million columns than each of the other rows, and it doesn't affect the space. You only pay for the row/column intersections that you use. You'll notice that the row keys are sequential. That's the lexicographic or alphabetical ordering of the row keys I mentioned. And then at the top, there's this notion of column families. In this case it's follows and friends. And that's the only schema definition that happens in Bigtable. You'll define a column family, or you may define many column families. The columns themselves are actually not defined until insertion time. So when you make a mutation, that's when that column first appears. And it may only appear in one row, or it may appear in many rows. That's really up to you, the application developer. So there's a lot of flexibility, particularly if you're ingesting things with unpredictable schemas. This is one of the reasons that it's a great fit for IoT, where you can get some surprising properties from different devices. And the last thing I'll point out on this slide is that the database is three dimensional. You have your row, you have your column, and then we'll store whatever number of versions you want.
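A minimal sketch of that row-key idea for the ticker example, assuming a hypothetical key scheme of TICKER_YYYYMMDD and the project, instance, and table names used later in the demo (the cbt read command accepts prefix, start, end, and count arguments):

# Row keys like GOOG_20161031 sort lexicographically, so one ticker's
# history is contiguous and a prefix or range scan touches only those rows.

# Read the first few GOOG rows with the cbt CLI (a gcloud component).
cbt -project my-project -instance gcp-demo read ticker prefix=GOOG_ count=5

# Or scan a bounded date window for the same ticker.
cbt -project my-project -instance gcp-demo read ticker \
  start=GOOG_20161001 end=GOOG_20161101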
You could store them forever, or you could put a TTL or a max number of versions on it to manage your data size. So this is the Bigtable architecture in a nutshell. The left is kind of a deep dive into what's inside a tablet server, and I won't go into the details of that -- that would take a while. If you're curious, go read the OSDI paper from 2006 on Bigtable. It goes into gory detail about how Bigtable works, and how it operates and scales the way it does. The important thing to notice here is you've got your Bigtable client, and it's talking to a pool of tablet servers. Each tablet server will represent a certain amount of data that's stored in Colossus, which is our file system. And the important thing there is the data is not actually stored in the tablet servers. They're essentially stateless. They represent and serve the data from Colossus. And it's our high-speed network that allows this separation to happen without any performance degradation. Then we have a master that does coordination, and we have Chubby, which is kind of like -- if you've ever heard of ZooKeeper, it's kind of the same thing. So when you connect to Cloud Bigtable from outside, you talk to a load balancer proxy, and then it will actually send you to the Bigtable nodes. We're not going to talk about the load balancer proxy right now. It's kind of abstracted out, as far as you're concerned. And we'll talk about how things work inside.

So we talked about the nodes being separated from the storage. This allows you to do two cool things. One is that you can easily move data around. Imagine you have a situation where you have this node on the left that's getting a lot of access. Maybe it contains data for a certain region of the country that just woke up. What Bigtable does is it will recognize this in a matter of seconds, and then it will send a signal to that node to unload the data, and a signal to the other node to load the data and start serving it. This happens in a couple of seconds, sometimes hundreds of milliseconds. It's practically imperceptible. Whoever happened to be reading right at that moment might get a little bit of latency out on the very long tail, like four nines or something like that. But for all intents and purposes, it's transparent. And this is happening all the time. It's very aggressive about rebalancing to keep things even.

The other nice thing is you can resize on the fly, too. If you need to throw more nodes on, unlike Cassandra or HBase, you're not also adding more disk with the nodes themselves. You can spin up and spin down nodes to increase your throughput, increase your QPS. You can do that for a batch, and then after the batch is done, spin it back down again if you like. And because it keeps the balancing nice and even, and because the nodes don't actually talk to each other, you don't get the kind of exponential cost that typically comes with scaling distributed systems. So it goes in a straight line from 3 nodes to 30 nodes. By the way, we give a guidance of about 10,000 QPS per node, and that's what we show on the console. That's based off of a load test of 1-kilobyte payloads, mixed read/write 50/50. If you have very large rows that are like 100 megs, you're going to get much lower than this. If you have much smaller rows that are 10 bytes, you'll get much, much greater than this. So this is just a rule of thumb.
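Because the nodes are stateless and the data lives in Colossus, resizing is just a matter of changing the node count. As a rough sketch (the exact gcloud command and flags may differ by SDK release, and the instance and cluster names here are hypothetical), scaling up for a batch job and back down afterwards looks something like this:

# Scale the cluster up to 30 nodes before a heavy batch workload.
gcloud bigtable clusters update gcp-demo-c1 --instance=gcp-demo --num-nodes=30

# ...run the batch job...

# Scale back down once it finishes, to stop paying for the extra throughput.
gcloud bigtable clusters update gcp-demo-c1 --instance=gcp-demo --num-nodes=3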
The most important thing to do, if you're trying to figure out how to scale it, is to run your own load against it. The important point from this is that you can trust Bigtable to scale linearly. As you keep adding nodes, your throughput will increase proportionally. In fact, we had a customer last year who did a load test, just to prove they could -- and we wanted to prove we could -- where they ran 3,500 nodes and got 35 million QPS. Now the thing that's really cool about 35 million QPS is they were actually persistent and durable QPS. These were writes to disk. If you had turned off the power right in the middle of it, you wouldn't have lost a single byte when it came back up again. So its performance is comparable to an in-memory database.

So, the connector. As I mentioned, each database has its own tradeoffs. Because BigQuery is column based, it allows you to query terabytes in seconds and aggregate that data really quickly. Cloud Bigtable, being row based, and because of other parts of its architecture, gives you single-digit millisecond latency. So very fast. Sometimes people want both aspects of this, and that's why we built this connector. To use both, you previously had to dual ingest -- building a system, maybe using Pub/Sub with Dataflow, writing into both systems and hoping to stay in sync, and then pointing one program at one system or the other depending on your use case. This federated connector that we're releasing today basically replaces the columnar store that's at the bottom of the BigQuery diagram and just puts Bigtable in there. So you can basically tell BigQuery to either go to its native store, or you can point it at the Bigtable APIs, and it treats Bigtable the way an application talking to Bigtable would.

The benefit of this is you get SQL on Cloud Bigtable, and that's pretty cool. Because if you have a simple question about your data, previously you would have to either go instrument something like Hive, or have a developer write a program and run a MapReduce and hopefully get that answer to you in the next few hours or something like that. Now you can do ad hoc queries, or you can keep them canned. People who are not developers can actually ask questions about the data that's in there. Once you have this connector in place, you can use the same tools that work against BigQuery. That's why we have iCharts up here -- they're going to show off a little bit of what that allows you to do. No table partitioning is required. If you're familiar with BigQuery, BigQuery does full table scans, and if you're doing something like warehousing all your data over time, you'll want to partition your data into days or weeks or something like that, so that you're not reading five years of data when you're just curious about two days. Bigtable solves this by providing row ranges. Rather than defining partitions ahead of time, you define your row ranges at query time, and it will just select the data that you requested and return that. As long as you're keeping the row ranges small, the query times will be very fast, one or two seconds. And for the people this connector is a good fit for, you're only storing your data in one place, and that simplifies things a bit too. BigQuery is seeing the same real-time data that Bigtable is. The tradeoff, as I mentioned before, is that full table scans are going to be slower.
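Here is a minimal sketch of what "defining the row range at query time" looks like once a federated table exists (the demo below sets one up as demo.bq_ticker; this uses legacy SQL, which is the row-range-optimized path as of this talk, and assumes the row-key scheme from the demo):

# Aggregate over just one ticker's October 2016 rows by bounding the row key;
# only that slice of the Bigtable is read, not the whole table.
bq query --use_legacy_sql=true \
  "SELECT MIN(FLOAT(t.low.cell.value)) AS low,
          MAX(FLOAT(t.high.cell.value)) AS high
   FROM [demo.bq_ticker]
   WHERE rowkey BETWEEN 'GOOG_20161001' AND 'GOOG_20161101'"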
If you still want to do lots of ad hoc full table scans all day long and return in really, really quick times, you may still need to do the dual ingest. If you're doing smaller subsets of the data, you're probably going to be just fine with the connector on top of Bigtable. The other thing I want to point out is that the Bigtable connector is read-only. With BigQuery, you can read out of a BigQuery table, get some results, and write them into another BigQuery table. You can't do that with Bigtable. However, you can do everything else you could do with a read-only table in BigQuery. You can join a Bigtable read with data in a BigQuery table. You could even join two Bigtable tables together.

So let's stop talking and start typing. Let's see, can we switch over to the laptop? So I have an instance here. It's called gcp-demo, and I have it sitting in us-central1-c. This is important because BigQuery resides in regions. It resides in the US and it resides in Europe. And because of the amount of data that it's reading, you need to co-locate your Bigtable in the same region. So here we have it in the zone us-central1-c, which is in the same region as the BigQuery US region. So let's take a look at the data here. We're going to use this tool called cbt. If you use Bigtable and you haven't used cbt before, you should grab it. It's a gcloud component, and it makes it really easy to do simple admin things from the command line. It's a lot easier than using the HBase admin tools or whatever. So here I just basically said, read the first row of the data in this table, ticker, starting with the first row that matches the prefix GOOG. We see the opening day on the stock market for Google, on August 19, 2004, and it closed at around $50 that day. We've got a high and a low and an open. The volume there is 0. Depending on what year it was, both the volume and the open can sometimes be 0; it's just an artifact of the data essentially. But if you wanted to actually start looking at more of this data, using this interface quickly becomes very clumsy and hard to parse. So what we want to do is make a BigQuery table so we can actually do something interesting with it.

So let's see. We're going to create a new dataset first. We're going to call it demo. As I mentioned before, we're going to put it in the US, so it's co-located with the Bigtable that we have running in us-central1-c. Click OK. And that's great -- it's pretty fast. Then what we have to do is tell BigQuery how to talk to Bigtable. The trick here is Bigtable has no notion of typing. It's all binary data. Bigtable doesn't know if you're storing an integer or a string or a movie or some ciphertext or whatever. You just store bytes in it. So we need to tell BigQuery, when it's reading from certain columns, what the type of that data is. The way we do that -- this is the only slightly complicated aspect of the connector -- is you just define a JSON file. If you get this, everything else is brain-dead simple. This JSON file starts with a format, and there's a reserved term for the Bigtable format: you put in, all caps, BIGTABLE. Then it will be looking for a source URI to define where that Bigtable is. So we point to the Bigtable API -- googleapis.com/bigtable. We point to the project ID, we point to the instance ID, and we point to the table ID. Then we set up our options.
So the first option that's here is actually pretty important. It's "read row key as string." 90% of the time, probably more, you want to use strings as your row keys in Bigtable. It gives you the most flexibility, particularly if you're concatenating a bunch of different types of data together. The important thing in this case is, if you give BigQuery a string row key and you say, my row keys are all strings, then BigQuery will know how to select between individual ranges. And that's the key to having a fast connector: when you're selecting between two particular rows, you're reading 100 rows or 1,000 rows or 10,000 rows or whatever, not the whole billion rows that are in that table. The second thing, "only read latest" -- you usually want this set to true. If you recall, Bigtable is a three-dimensional database, so each row/column intersection can have any number of versions. Unless that history is interesting to you, you usually just want the latest one. You can change the default if you don't, but you typically want this set to true. And then you actually define your column families. In the simple example we're working with today, I have one column family, simply called t. So we specify the column family ID there. Then I specify a type, which in this case is a string. I'm saying that all the data in this column family happens to be a string. That doesn't have to be the case. You could either set a default and then override it in columns that are different, or you could set each column to be a different type: this is a string, this is an integer. But it's important to set the correct type so BigQuery knows how to correctly parse your data -- how to turn those bytes into the data you're looking for. And then here are the individual columns. You can see these match up with what we saw before, looking at the raw Bigtable data: symbol, close, date, et cetera.

So the way we actually create the new virtual table is fairly simple. We use another command-line tool. This is BigQuery's gcloud component called bq. Again, if you use BigQuery, you should probably have this tool, because it makes your life a lot easier. We'll call the mk command to make a new table. We specify the project ID for BigQuery. They're coincidentally the same here, but just to point out, this is not the Bigtable project, this is the BigQuery project. We specify the external table definition, the JSON file that we just looked at. And then we give the new dataset that we created, demo, then a dot, and then the name of the new BigQuery table. Note that's not the name of the Bigtable table -- that's specified inside the JSON file itself. We're specifying the name of the virtual table that we want to be queryable inside of BigQuery. And we do that. And after a long time, it's done. So we go back to BigQuery here, and we'll refresh. Generally it takes about five seconds, so pretty fast. And now we have the bq_ticker table here. Let's take a look at the schema. You can see the schema has a lot more layers in it, but the things you generally care about are things like your t.symbol, t.close. In fact, what we'll do here is grab the close price on Halloween. So select t.close.cell.value from bq_ticker where rowkey equals 'GOOG_20161031'. I'll run this.
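Putting the pieces from the last two paragraphs together, here is a hedged reconstruction of the table definition file and the bq commands being described. The dataset, instance, and table names follow the demo; the project ID is a placeholder, and the exact JSON field spellings should be checked against the connector documentation.

# ticker_def.json -- external table definition pointing BigQuery at Bigtable.
cat > ticker_def.json <<'EOF'
{
  "sourceFormat": "BIGTABLE",
  "sourceUris": [
    "https://googleapis.com/bigtable/projects/my-project/instances/gcp-demo/tables/ticker"
  ],
  "bigtableOptions": {
    "readRowkeyAsString": true,
    "columnFamilies": [
      {
        "familyId": "t",
        "onlyReadLatest": true,
        "type": "STRING",
        "columns": [
          {"qualifierString": "symbol"},
          {"qualifierString": "date"},
          {"qualifierString": "open"},
          {"qualifierString": "close"},
          {"qualifierString": "high"},
          {"qualifierString": "low"},
          {"qualifierString": "volume"}
        ]
      }
    ]
  }
}
EOF

# Create the virtual table demo.bq_ticker from that definition.
bq mk --project_id=my-project \
  --external_table_definition=ticker_def.json demo.bq_ticker

# The point lookup from the demo: Google's closing price on 2016-10-31 (legacy SQL).
bq query --use_legacy_sql=true \
  "SELECT t.close.cell.value FROM [demo.bq_ticker] WHERE rowkey = 'GOOG_20161031'"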
And so we get a closing price of $785, basically, on that day. But you know, this is BigQuery, and you could already do spot lookups inside of Bigtable, so that's not very interesting. Let's do some more interesting sorts of things. Let's look at the min and max for the entire month of last October. Because the values are strings, you want to cast them to floats. We're going to indicate that the min of the low for the month is going to be called low, and the max of the high is called high. And we're going to give it a row key range: it's going to be between this and this. We could even do a prefix -- you could have a row key prefix that just goes up to 201610, and it will basically do the same thing. So we do that, and we find out that the low for the whole month was $770 and the high was $816. But usually, when you're plugging this into reporting engines, you want to be able to do more interesting things, like build histograms of data. So let's look at a histogram of the highest high and the lowest low of each month ever since Google went public. You run that, and after 1.3 seconds we've got our answer back. We've got the range for each month, for all 151 months that the stock has been trading. So that's just showing the raw functionality. But I think what gets interesting is when we can actually turn this into some kind of visualization. And for that, George is going to come up and show us iCharts. So can we switch back to the slides, please?

GEORGE MCMULLEN: Awesome. Thanks. CARTER PAGE: Thanks, George. GEORGE MCMULLEN: It's pretty incredible stuff. Hey, everybody. I'm not going to do a demo yet -- another demo is coming. So just for anybody who doesn't already know what iCharts does, we help folks like yourselves understand the trends in massive, massive amounts of data, so that you can answer questions, make decisions, and drive actions that can drive your business. Some of the things that our customers really like about us are the simplicity and ease of use. It's a drag-and-drop type of interface. And that's not to say that we are trying to hide any of the power. We actually do give you some of the power as well to do some really native stuff within our interface, and I'm going to show you some of that. Also, multiple levels of aggregation, so you can drill down into the details of the data while still seeing those higher levels of aggregation, in a hierarchy that you define on the fly within the interface. And something else that's really important is single sign-on -- being able to use Google OAuth to get into our platform. That's really important for folks who don't want to have another username and password. It simplifies the whole data management. It also enables embeddability, whereby you can use our visualizations or the entire reporting system within the products that you're already using. A simple example is, if you have a dashboard, you could put that into Google Sites, or you could put a chart in Google Slides.

So how do we work? We start at the visualization layer. To us, everything needs to be a visualization, because that's the goal. Those are the questions that you're trying to answer, those are the decisions that you're trying to make, and so that's where we start. For this demo, though, I'll be going into a little bit more of the meat and the details, and peeling back the covers to show you how it's all working.
And then what our system basically does is take that visualization configuration and create a dynamic query that includes all the calculations and aggregations, segmentation, filtering, windowing, all that wonderful stuff, and push it out to APIs like Google BigQuery's API. It gets all the data back and pushes it to the user, where they can then interact with it.

So before I get into the demo, I've got a bit of a confession to make. We've never worked with Bigtable at all before. We've never done any kind of integration or anything. Yeah, I know, right? But that's one of the reasons why we're so excited about this. Because now we can offer this to the customers we have that have Bigtable data, and they can access all that information alongside their BigQuery information in one single place. That also helps with the simplification of data management. And they can also query it using SQL, the language that they already know -- these analysts already know SQL. And most importantly, I think, is the fact that they can take this unstructured data that they have in one system, combine it with structured data that they have in another system, and see it all cohesively in a single visualization and report. But I think the most important thing for us as a company is that we didn't have to do any work. So I really like that. On with our demo. If you could switch back over here.

A little bit more about iCharts: we run on standard SQL. As Carter mentioned, they're still doing some query optimization, so things might run a few seconds slower. You might have to wait a couple of seconds for things to load up. One of the things that Carter just showed -- and I'm going to pick up where he left off -- is that query showing a histogram of the Google stock ticker over months. I did a little conversion over to standard SQL. It was really simple to do, just a few lines that needed to be changed, and there's plenty of documentation on that. I'm already logged into iCharts through Google OAuth. I'll just refresh the screen, make sure that I'm still here. And I have access to all the datasets and all the projects that I already had access to. So my BigQuery -- there's that demo dataset that he created. Here's the stock ticker dataset. And I can actually use that query as a dataset. I could drive it directly off the tables themselves, but I wanted to use this particular use case as an example. So I'm just going to paste the query in there. I've already done this a few times; practice makes perfect. I'm going to go create this dataset. Actually, you know what I'll also do is take this query and run it in the BigQuery console as well, so you can see that. Show options, disable legacy SQL, and get rid of cached results also. So right now it's basically running the same query. Actually, I think it returned already. I'll hide the options. It didn't take that long at all. So we see the same exact data that we saw in Carter's original demo. It's there. So now I'm in our editor, and we have the fields that are defined in that query: the month, the low, and the high. And I could basically take this and replicate the same exact output that Carter showed us, and that I just showed you in the BigQuery console. It's going in, basically getting every single month out of that dataset, piping it through BigQuery's API, and then going down into the Bigtable through the federated system. And now it's going to do the same thing and get the highs.
And it's also aggregating it at the same time. So if it were row-level information, not preaggregated through a query, then you'd basically be able to see a sum, or you could also choose things like a count, average, min, max, distinct. Or you could do other things, which I'll show in a moment. So this is great. It's pretty much the same thing. But I promised a visualization, so I'm going to show a visualization instead. All of this is kind of difficult to interpret just by reading a few rows, but this is much easier. And you could obviously see, hey, if I had invested in Google at the beginning of time, I'd be doing pretty well right now. And if you had done that, you'd be at 10x your original investment, maybe even more.

But most people don't have just one stock symbol in their portfolio. They didn't put all their eggs in one basket. They're diversifying; they may have multiple portfolios. And if you're building an application for portfolio trading, or some kind of trading or brokerage application, you're dealing with thousands or even more portfolios that contain hundreds of thousands of ticker symbols, with billions and billions of trades. Not just purchases -- every single time you've made a trade, that's being recorded. So that's the next use case I wanted to show you: how can we combine this information that's coming from Bigtable with some other data, representing my portfolio, within BigQuery? I'll just take a look at BigQuery here and show you what my portfolio looks like. And hide this. It's a basic table: portfolio name, ticker, purchase date -- when I purchased the stock -- how many shares I purchased, and what price I purchased at. So I've got another query below. You could do this all within our interface, but I think it's important here to show the core SQL, since this is a little bit more of a technical audience. A basic select statement: taking the information from my portfolio, taking some of the information from the ticker table that's driven from Bigtable, selecting from the portfolio, and joining it just like you would join any two tables together -- which is great, because these two tables are in different systems -- joining on the symbol. And then we have a little bit of optimization here, where we're saying the row key is between the time that I purchased the stock and the present time. And it's doing some preaggregation for us as well, just to make things fun. So I'll copy and paste this. I can even use this as a dataset. I just basically create a user-entered query here, give it a name -- portfolio by month -- and create a dataset. I'll also show you what it looks like on BigQuery itself, to give you an example of what happens on the query. Hide options. Wow, that was fast too. So here, now we're seeing a portfolio, all the tickers, when I purchased the shares, how many shares I purchased, the purchase price, the price, and the month together. The price is actually the price during that month. And you know what I'm going to do? I'm actually going to do a little something extra here. Because the current price of the shares versus the number of shares that I have is great, but I actually want to see what value currently exists in my portfolio, and how it's changed over time. And that's a really straightforward, simple calculation: I'm just going to multiply purchased shares by price.
And certainly, you could have created a view to do this. But I think it's more important to actually do it directly within the dataset layer, within our semantic layer. Because if this were a more complex query put onto a preaggregated dataset, then you'd lose the ability to filter and drill down into the data. So it's better to have that directly within our layer than to try to do it directly within the data. So I'm just going to copy that and edit our dataset. I'm going to call it value. Value. Give it a float type, and paste in my formula. And you know what? I'm also going to do a little extra formatting just so it looks good. So now I can create a pivot table directly on this data. What I'll do is take the portfolio name that I have, and what it does is go directly into BigQuery automatically and get all the portfolio information out. And then maybe I want to see how each portfolio has evolved over time. Take that, put that into the columns. And you know what? I'll also add it to the filter -- my trading started in 2014, but maybe I'm only interested in the past six months. So I'll just pick a date range; February to October sounds nice. And then I'll take the value, which is the calculation that I entered into our semantic layer, and I'll add it right in there. So now it's actually going in, creating that query that breaks it down by portfolio name over the trading months, and multiplying the current price for each individual month by the number of shares. So I can see each individual portfolio and how they've evolved over time.

This is great, but maybe I want to get into a little bit more detail. What about each individual stock ticker? So I'll just drag this into our rows, and it will perform the same kind of query over again, but drilling into each individual stock ticker. And now we see energy, financial, every stock ticker, all the individual amounts. What it's doing here is we have the individual amounts, but we also have the higher-level aggregation based off of that hierarchy that I defined. If there were some other data here, I could switch these things around. Or what I can also do is take the trading month -- maybe I don't want to see it at the column level; I want to put that up into the rows and show it at a more hierarchical level. So I can see now, for the month of October 2016, which is the first month that I filtered by, that the total value is $30,000. I'm not doing too bad. I don't know how to double-click on your laptop. And I can see every single month over time. Cool.

But again, what does this look like? Let me just switch this over to a line chart, or maybe even an area chart, so we can actually see how this has changed over time. So this is the past six months. Maybe I just want to see it from the beginning, all the way back into 2014. Back in the day. This might be a little bit easier to read. I'm looking at my portfolio, and I'm doing pretty well. Maybe only 6x -- not as good as if I had just invested in Google from the beginning. But I'm also seeing there's a little dip, and I'm wondering where that's coming from. I can tell that somewhere around February or April there's this dip. Where is that coming from? Because it looks pretty drastic here. But if you get into the details, what we might see is a little bit of a different story. A more detailed story.
So I'm actually going to take the same exact chart and segment it by the portfolio of every single trade. It goes in and creates the query that says, give me the current value of every one of my stocks in every single one of my portfolios for all the months that I want to display. So what I'm actually seeing here is that it's not such a massive, drastic drop. What I'm actually seeing is that the stocks in my health portfolio, and even a little bit of the stocks in my financial portfolio, have experienced a bit of a downtrend over the course of a year. And at the same time, I've got a nice big increase in my tech stocks portfolio. So there was just a slow-moving downward trend, and a nice sharp upward trend for a couple of things. And these are exactly the types of questions that people have that, until now, they wouldn't be able to answer dynamically -- especially with data that comes from multiple sources that they want to combine into one. This is pretty groundbreaking stuff.

Could we switch back over to presentation mode? So you can try it out if you want. Go for it: iCharts.net/Googlepivot. It's available now. Again, we use standard SQL. All that good optimization is coming real soon -- you can ask Carter about it. But really, I want to thank Carter and the team for creating this. It's incredible. I appreciate it. Thanks.

CARTER PAGE: Thanks, George. It's a really cool tool. What's really neat is he didn't do anything. He just pointed it at BigQuery, and he can drag and drop and get a lot of information out of the data that's already there. Whereas if I wanted to do that with a MapReduce, I'd be programming for about two hours to try to get the same results.

So, a quick recap. As of today you can analyze Bigtable data using BigQuery. It is row-range optimized for legacy SQL. The range optimization for standard SQL is coming soon -- that's why his queries were taking 8 seconds and not 2 seconds. That was our fault, not his. In a couple of weeks, standard SQL will also be as lightning fast as legacy SQL. And in short, you get the best of both analytical data stores by using this connector. So thank you. There is a URL here where you're welcome to go take a look at the documentation. It will show you how to set up the connector and play around with it. It will be interesting to see what you find. It's in beta right now, so we're always eager to hear your feedback. That is the end of our presentation, but we have some time for questions for anyone who's not running off for beer right now. But if you are running out for beer, I do not begrudge you, because I would.
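For reference, here is a hedged sketch of the kind of cross-system join George describes, in standard SQL. The portfolio table, its column names, the date handling, and the value calculation are reconstructed from his description and are hypothetical, not the exact demo query; the nested t.close.cell access assumes the Bigtable external table exposes cells as a repeated record, as in the legacy SQL examples above.

# Join a native BigQuery table (demo.portfolio) with the Bigtable-backed
# external table (demo.bq_ticker), bounding the Bigtable side by row key
# so only each ticker's rows from the purchase date onward are scanned.
bq query --use_legacy_sql=false "
SELECT
  p.portfolio_name,
  p.ticker,
  SUBSTR(tk.rowkey, STRPOS(tk.rowkey, '_') + 1, 6) AS trading_month,
  p.purchased_shares
    * CAST(tk.t.close.cell[SAFE_OFFSET(0)].value AS FLOAT64) AS value
FROM demo.portfolio AS p
JOIN demo.bq_ticker AS tk
  ON tk.rowkey >= CONCAT(p.ticker, '_', FORMAT_DATE('%Y%m%d', p.purchase_date))
 AND tk.rowkey < CONCAT(p.ticker, '_', '99999999')
"

In practice, the month bucketing and aggregation would be layered on top of a query like this, which is what George does through the iCharts pivot table and semantic layer.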
Info
Channel: Google Cloud Tech
Views: 8,103
Rating: 5 out of 5
Keywords: bigtable, low latency, store data, data, retrieve data, time series data, application architecture, application, app, apps, applications, scale, scalable, scalability, NoSQL, analytical database, SQL, database, iCharts, visualize data at scale, Cloud NEXT, Google Cloud, GCP, Cloud, #GoogleNext17
Id: 2HUB4uu7K2k
Length: 47min 56sec (2876 seconds)
Published: Thu Mar 09 2017