Migrating your data warehouse to Google BigQuery: Lessons Learned (Google Cloud Next '17)

Captions
[MUSIC PLAYING] CHAD JENNINGS: Hey. Good afternoon, everyone. It's 4 o'clock on the dot. And we have practiced this talk with military grade precision. So I thought I would get started right away as folks can come in and fill in the extra seats. My name is Chad Jennings. And I'm one of the managers for BigQuery. And thanks for spending this hour with us. Obviously, we're going to be talking about Data Migrations: Lessons Learned. And we put that in there on purpose. I'm going to spend a little bit of time talking about some of Google's tools for folks who aren't maybe as familiar with GCP as some of the other vets here. And then our customers, we've got an excellent field of customers who are going to come up and talk about their sessions and actually give you their war stories. I couldn't put War Stories up here, or like, whipping post, or like anything negative. So I had to put lessons learned on the slide. I guess that's captured in the video now. Anyway, they're going to come up and give you the real skinny. Before we get into that, though, can we do a little bit of audience participation? How many folks here are relatively new to GCP and BigQuery? OK. A fair few. Do you guys come from the G Suite side? And how many have like gotten really deep, dark, and dirty with BigQuery and know it pretty well? OK. A fair smattering. And how many folks are in this room who are thinking about migrating to BigQuery as your enterprise data warehouse? OK. Great. I think we've got a range of materials for you all. And I think that was about the distribution I was hoping for. So we have three customers, kind of in data volume migration sizes, you know, small, medium, and large, and representative across a wide set of industries and such. So Nikhil Mishra from Yahoo is going to come up and take the stage and tell you about their effort migrating to BigQuery. Then Nick Ursa will come up and do the same. And then getting into the agenda, starting down at the green dot, why migrate at all? What we're going to do with the last 20 minutes of the session is I'll be joined by the gentleman up here on stage, and we'll all just have a bit of a panel discussion about why go ahead and do it? If you've got a system that's working well enough for you, the thought of moving encompasses a lot of risk, maybe a fair bit of fear, maybe a career is on the line. And so there has to be a real big benefit. So we'll talk about that. And that will also gives us an opportunity to have an extended question and answer period. So if you're thinking about moving, this is a time to ask people who have actually done it what their experiences were and what roadblocks or hiccups they had to overcome. OK. I'll take the first two sets. So I'll set the stage a little bit. And then I'll talk about Google and partner tools to help get this kind of work done. Okeydoke. So there are a number of reasons why you might find yourself compelled to move. Maybe it's time to renew your license on your on-prem system. Maybe you're frustrated with the inability to scale storage and compute. Maybe you're just frustrated with the inability to scale them at all. Maybe your system works well enough, but it's not going to keep up with your forecasted growth if you're lucky enough to be in that problem. Maybe your current system is just too hard to maintain, or takes too much work to keep it fed and watered and humming along. Maybe it's just too dang expensive, and you're looking for an option. So now you're faced with a choice, right? 
You can either move now or move later. At some point you're going to be forced to make some change, whether it's buying more of your legacy system or thinking about something else. And so that's what we're here to talk about. So for those of you new to BigQuery, we'll talk about this for a little bit. If you've known BigQuery for a long, long time, I'll call you back in, you can just check your email for a couple of minutes here. So BigQuery is Google Cloud's enterprise data warehouse. It's an analytics tool. So it's not an OLTP store. We have other products for that, like Bigtable. This one is focused on analytics. BigQuery goes really big and really fast. You interact with BigQuery in SQL. It's the lingua franca of data analytics. And so standard SQL is the coding language that you use. And because BigQuery is basically the public face or the productization of a number of the elements of Google's infrastructure, our storage is on the same system as storage of products like AdWords, YouTube. Let's see, another small one-- oh yeah, Gmail. What's that other one that Google does? Search. So we get a lot of benefit at BigQuery by being built on the exact same infrastructure that these other, you know, billion plus user products utilize. Now, something that is unique to BigQuery and not something that we get from Google's infrastructure is that we are fully managed and serverless. So what that means is you do not have to worry about spinning up reservations, or provisioning, or figuring out when you're going to have peak loads. You just load your data, write your SQL, and then press the red button, and you execute your queries. Google takes care of managing all the computation resources that you need. And we can scale up storage almost infinitely. Our largest customers store well over 100 petabytes.
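[Note: as a rough illustration of the standard SQL workflow described above-- not something shown in the talk-- here is a minimal query run through the bq command-line tool against one of the public sample tables. Treat it as a sketch; swap in your own project.dataset.table.]

```
# Minimal BigQuery workflow: no provisioning, just write standard SQL and run it.
bq query --use_legacy_sql=false '
  SELECT word, SUM(word_count) AS occurrences
  FROM `bigquery-public-data.samples.shakespeare`
  GROUP BY word
  ORDER BY occurrences DESC
  LIMIT 10'
```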
OK. I'm not going to go through this slide in detail. But I do think it's a really good slide for context. So Google has an awful lot of storage options. And if you're in this migration, like should I migrate or should I not? It's important to know what these other things do, or might do for you, or might not. So BigQuery is all the way over there on the left in case you couldn't find it amidst all of the other blue hexagons here. And that is the data warehouse. So it's relational in nature, but it's not an OLTP system. It's not a database. It's our data warehouse. Okeydoke. So let's talk about the migration path. And so here I have the legacy enterprise data warehouse on the left. BigQuery there kind of in the middle. And then I put this little chart, this last icon in to really drive home the fact that when you're migrating from one place to another, getting the data landed in BigQuery and in the right shape, and getting the queries to run appropriately quickly, that's actually not the end line. The end line is the understanding and the insight that you're able to implant in people's brains, right? It's not even the charts, or whether it's a bar chart, or a line chart, or how pretty the colors are. It's the understanding that people take away. And so I wanted to make that point explicitly here so basically that we make sure we cross all the way to the finish line. Now, to get data from a legacy on-prem system to understanding in your users' minds, there are a couple of steps along the way. Google Cloud Storage is one such stop for BigQuery. We're trying to make this process smoother and be able to load things directly into BigQuery. But at the moment, if you want to migrate even little amounts of data, or significant amounts of data, then doing it through GCS is the way you're going to do it. However, we do have some people, probably in the room, who make this much easier by abstracting that layer out of the way. And that's our partners. We'll go through some of these in detail a little bit later on. But our partner tools will abstract this step from you. And so they'll help you with schema migration, they'll help [AUDIO OUT]. Are we good? They'll help you with migrations, data cleanliness, deduplications, et cetera, and make it a lot smoother to get your data into BigQuery. If, however, you want to control that process in detail, you can do it all on your own. Maybe it's more cost effective for you to do some transformations or shift your schema into the right format before you move it out. That's fine. If you're moving data from the legacy into cloud storage, you can use online transfer, which is what most people do. If you don't have really good internet connections, we even offer offline transfer where you can actually pull the hard disks out of your system, put them in a box, and send them to us, and we'll load them up for you. Once you get the data there, bq load is the command that takes things from cloud storage and drops it into BigQuery. Now, you have to be careful here, right? When you do that you have to specify the schema. And doing it this way assumes that you've got the schema appropriately partitioned or appropriately designed inside of cloud storage. If that's not the case, there are a number of Google tools that you can use to make that happen. And then second to last but not least, and Nick may talk about this, doing ETL inside of SQL is a way to generate a much more transparent ETL pipeline for your users. And last but not least, you have to make sure that you have an efficient pipeline between BigQuery and whatever viz tool you're going to use. Okeydoke. We'll spend just a moment or two, because I'm running at the edge of my time. We'll spend just a moment or two talking about the tools that exist here for you. So everyone in this room is on this chart somewhere. The x-axis is the distance to Google Cloud Platform. And it's virtual distance, right? You're very, very close if your data is already in GCS on the left side of the axis. And you're very far away if you're on-prem with no network whatsoever. And then the y-axis here is the amount of data. And so we're going to go through some strategies depending on where you line up. So Google has a transfer service. So if you're in GCS or in another cloud, you can use our transfer service to migrate data into Google Cloud Storage. If you're a little further away in this virtual distance scale, if you're in a colo, or if you're on-prem with good connectivity, you can use gsutil, or these partners, to move your data over. If you're a little bit further away and you've got less than 100 terabytes, then Iron Mountain and Prime Focus are good bets for you. And then if you're far away and a lot of data, like way off in right field there, then Zadara is the place to go.
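[Note: a rough sketch of the GCS staging path described above-- stage files with gsutil, then load them with bq load, supplying a schema. The bucket, dataset, table, and column names are hypothetical.]

```
# Stage extracted files in Cloud Storage, then load them into BigQuery.
gsutil -m cp /export/orders/*.csv.gz gs://my-staging-bucket/orders/

# bq load needs an explicit schema for CSV (self-describing formats like Avro don't).
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  mydataset.orders \
  'gs://my-staging-bucket/orders/*.csv.gz' \
  'order_id:INTEGER,customer_id:INTEGER,order_ts:TIMESTAMP,total_usd:FLOAT'
```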
I thought I had moved that slide out of there. I'll spend just a moment here talking a little bit more about one of our new products in the Google ecosystem. And then we'll go through partner tools, and then I'll turn the mic over. So Google Cloud Dataprep was announced earlier today at the keynote. And so this is a tool. It doesn't do migrations. However, when you bring data in it can automatically detect schema. It runs analysis on the columns to draw histograms, as shown here along the bottom of the screen, to identify what type of data it is. And then it can automatically do things like deduplication, or finding nulls, and things like that. So this is a new tool. We just launched it today. And it can help clean up your data before you start giving it to your analysts. And the partners, I won't go through this one in detail. There's too much. There are too many different axes on this four legged octopus. What I will say is that depending on the source, we have partners to help you ingest data from that source. Since BigQuery is a data warehouse, I'll just spend a moment on that. We've had a lot of success using partner tools and even folks managing it themselves from data sources such as, you know, Teradata, Vertica, Netezza, AWS, Hadoop, and Hive. OK. So I will now turn it over to Nikhil Mishra, who will talk about Yahoo's experience migrating an awful lot of data to BigQuery. [APPLAUSE] As you like. NIKHIL MISHRA: I'm Nikhil Mishra. I'll just walk through some of the experiences that we had with BigQuery. And just probably share some of the learnings that we have had as part of this migration, and share as much detail as I can in this talk. So I wanted to give you some context on how the Yahoo data pipelines work. And this is a general model in most of the companies where you are using primarily Hadoop for your data processing, and ETL processing in Hadoop. So a lot of data gets generated through our user activity over the web, on native apps, on ad servers, and a bunch of data centers collecting that data, and moving it out into an HDFS storage system. ETL processes run on Hadoop. They use MapReduce, Pig, Hive, whatever you're comfortable with and whatever your data transformation needs are. And then you put this data on Hive, Hive on Tez, which is the fastest possible querying interface you have there. And it goes into the hands of users through a BI tool. And that's typically the pattern that we follow. What we are trying to do is not to replace this data flow with BigQuery or something on GCP. What we are trying to do is add BigQuery into the mix. And I'll talk to you about the reasons for going through this modified architecture. And what the pain points were with our existing systems, and what we're trying to solve here. The rest of our system works pretty well. And you know, the good parts: as I said, the data size is humongous. Now, I've been advised not to use the exact numbers that we process. But I can always say humongous, freakingly large, multi-something. So I'm going to use that lingo here. So the data size that we process is humongous. It's big. We do it throughout the day, every hour. The batch system is extremely stable. The pipeline's strong. There are massive joins happening with multiple data sets. And we've never had any major issues in our system. So this part is working pretty well. Now, the pain point, and this is primarily from the customer standpoint. So once we deliver the data to our users-- we massage this data, typical data transformation. We cleaned up the data. We standardized it. We enriched the data with a lot of additional annotations. We added a lot of metrics in there and made it available for our users. And they would just go and query through a BI tool, and that's pretty much it. That's what you need to make your data accessible. But what was missing for us was the speed.
And the speed was needed at Yahoo scale. And there are not many options available. We tried a lot of things. And we keep trying new stuff. So every once in a while we'll just see, you know, what eventually turns out to be the best architecture. So the scale at Yahoo, as I said, is freaking big. It's a multi-petabyte data warehouse. The incremental size, as I said, our processes run every hour. And there is a reason behind that. And I'll talk a little bit about that. And every hour when we process the data, again, it is multi-terabyte data. The concurrency is high. And this is the result of a cultural shift happening in the company, or has been going on in the company, where we are moving away from the model where they've got a set of analysts, set of data engineers, who would prepare data for you, massage it for you, build the dashboards, run your queries, and hand it over to you on a platter and just explain to you what exactly it is. And you make your decisions on what you wanted to do. What we are doing is instead we're just handing all the data to all of our engineers, all our product advisors. No matter who you are in the company, you should have access to the data. And you should be running your own query. Now, when you do that you need to understand that those guys have a daytime job. They cannot be wrangling with the data the whole day. And one experience for me was, you know, one of our colleagues, a pretty smart product manager-- I was talking to him, and he was using my system, and he just ran a query. And he said, you know, I'll go for a smoking break, and probably by then it could be done. And if you're managing an analytic system, that's as humiliating as it can get. If your queries start taking that much time, because that's eating up a lot of productivity from your company. So a key consideration for us was that at this scale there are very few options as it is. So for us, being hosted, being managed was good, because we didn't have to worry about setting up the system, because that just adds to the lead time before we start testing it out. So that's pretty good, pretty convenient for us. SQL, again, it's the language of choice for our data engineers, analysts, and pretty much everywhere. And with SQL comes BI integration-- pretty good BI tool integration. And one really important thing for us was independent scaling of compute and storage. That was really important for us. Because even though we have multiple petabytes of warehouse-- and that's true for everyone's case-- even if you have this multi-petabyte warehouse, users are not going to come, and each user is not going to come and query all of the data all the time. He's going to query a subset of data, what he needs, and probably work with it. So we wanted to be able to adjust our capacity for storage and compute as we see fit. So we set out to test BigQuery. It fit in most of the [INAUDIBLE] we laid out. And another point we added was we defined what fast means. When we said that we wanted our queries to be interactive and fast, what does fast mean here? So we said that, you know, when you're running your queries, for typical day-to-day queries that you are running, daily usage queries, you know that you're going to run it every single day. We don't want you to be wasting your time running those queries. We want you to do your daytime job. So we said, you know, 30 seconds seems to be a good threshold.
If a user is typically going through a seven step, eight step process now to get answers to his query, he should be back probably in a few minutes, or in half an hour. And we said if you are going to do your monthly reporting, your weekly reporting, which is typically the case with most of our business users, we said less than a minute should be good enough, because that's something you do once a month, or a few times in a month. So with that we started our process of migration. And there's a very first step that we do at Yahoo when we are doing some of these things, and this was something new-- we were moving our data for the first time, or at least what I know of as the first time, outside of our data centers. So it ran through a comprehensive, you know, legal review. The data that we are moving should not be violating any policies, any agreements we have with our users. And that's an important consideration. Another important consideration was network bandwidth, because we were going to move many, many petabytes of data across the wire. So we had to be very clear how we phase it out, and if we have enough bandwidth to support that. Another interesting consideration, mostly for engineers on the team, was that as we move the data between the two systems-- because we already had our data prepared. All we were doing is moving it from one database, one system to another. And we wanted the schema to move along with the data. And partitioning was important for a couple of reasons. We wanted to manage the retention, because as we load the data into the warehouse, over a period of time the warehouse size increases. You don't need all the history in some cases. So you should be able to manage it. And you should also be able to optimize your queries by leveraging partition pruning. One interesting thing we learned very early in our cycle was that even though BigQuery stores data in an uncompressed format, there is no need to move your data uncompressed. You should be able to move your data, and you should be moving it, in a compressed format. And we realized that BigQuery supports Avro. GCP supports Avro natively. And what we wanted to do is we wanted to leverage that. And we tested it out. And it turns out that using the Avro with [INAUDIBLE], we ended up reducing our size by 6x. And that was a pretty good deal. So we said, you know, that's what we're going to use. And the schema migration, which was an interesting criterion for us, and a needed one, came for free. So with Avro, with every single row, you get your schema. And we don't have to worry as the schema changes; it gets propagated in both systems as the ETL progresses. Now, we moved the data. We moved it in Avro format, loaded it into BigQuery. The loads were efficient. We monitored the load processes, and we benchmarked some of our queries. We benchmarked some of our loads. And we kept it within the SLA when the data was available. And one interesting thing we learned was that in order to utilize the capacity efficiently and not have this blip every day, where you're using the entire capacity in a daily batch by moving your hundreds of terabytes, or multi-terabytes, of data in one go-- if you split it out throughout the day, every hour, you would not only be speeding up the load process, you'd also need less capacity throughout the day instead of massive capacity once every day. And you know, that helps a lot. That keeps your pipeline stable. And that helps you leverage your capacity much more efficiently.
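[Note: a hypothetical sketch of the load pattern just described-- a day-partitioned table with a retention window, fed by compressed Avro batches through the day. Avro files carry their own schema, so no schema argument is needed. All names here are illustrative, not from the talk.]

```
# Create an ingestion-time day-partitioned table with roughly a 400-day retention window.
bq mk \
  --time_partitioning_type=DAY \
  --time_partitioning_expiration=34560000 \
  mydataset.events

# Load one hourly Avro batch into the partition for its data date ($YYYYMMDD decorator).
bq load \
  --source_format=AVRO \
  'mydataset.events$20170309' \
  'gs://my-staging-bucket/events/2017-03-09/hour-04/*.avro'
```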
Query benchmarking-- what we did, we set out with our 30 second and one minute thresholds. And we were pleasantly surprised. The query response time was pretty amazing. It was way, way below what we set out to achieve. So it was a big win. An interesting thing we learned was there is something called coalescing that BigQuery does. We don't know much about it. We typically work with the Google team to make sure that all of our tables are coalesced. And there is a huge difference between the query performance when the tables are coalesced versus when they are not. You don't have any visibility into it. Typically you work with your Google partners to make sure that [INAUDIBLE]-- if there are query problems, just talk to them and make sure that, you know, your tables are being coalesced. Another thing we learned was partitioning was a new feature when we started off. And it was an important feature. And there's a pseudo column called partition time. And you pick your date to partition on, which is typically your data date. And what you need to make sure is what your users end up accessing-- if you can obscure it, if you can have your BI tool translate partition time into the date and give that to your users to query, you would be able to easily prune partitions from your data, scan less data, make it less expensive, make it faster for your users. What we also did when we launched BigQuery with our customers was we gave them-- we are trying out another hosted system, which is something, again, that I'm not going to talk much about-- and we gave a set of users the new back-end, but exactly the same user interface they were using with the old system. And we kept both the systems running in parallel. And we started monitoring their usage pattern, the query average, the query response time, and all those things. And we benchmarked both the systems side by side. And BigQuery was like way better than the other one in all parameters. Its speed, its stability, its query performance. Everything else was pretty good. And the users didn't even know that the back-end had changed. So they were kind of in the dark. So what that allowed us to do was over the weekend we just flipped the switch. We initially thought that we were going to keep running the systems in parallel for a few months. But we flipped the switch in a few weeks, because we thought, you know, it's working out pretty well. And we started moving all our users and our use cases over to BigQuery. And again, the most important part here is when you're doing this scale of data migration, you would be working pretty closely with your Google team, because there are a lot of things that happen. It's a black box, and I like it that way. You don't want to worry about what's happening behind the scenes. If you run into a problem, you just tell Google, go fix it. And they are usually pretty prompt in fixing it instead of you trying to figure out through the documentation what's going on. So we'll talk a bit more later on in the panel, but it was a pretty interesting experience moving this scale of data. And some of the key learnings I'll share with you. Next up, I guess, it's Nick from "The New York Times." [APPLAUSE]
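[Note: a hypothetical sketch of the partition pruning Nikhil describes-- filtering on the partition-time pseudo column so a query only scans a few daily partitions-- combined with a dry run, which reports how many bytes the query would process before anything is billed. Table and column names are made up.]

```
bq query --dry_run --use_legacy_sql=false '
  SELECT user_id, COUNT(*) AS events
  FROM `myproject.mydataset.events`
  WHERE _PARTITIONTIME BETWEEN TIMESTAMP("2017-03-01") AND TIMESTAMP("2017-03-07")
  GROUP BY user_id'
```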
NICK URSA: Hi, everybody. My name is Nick. I work as a data scientist and data engineer at "The New York Times." I'm actually going to interrupt my normally scheduled performance to tell you about a little piece of history I think we just witnessed. In 2006, Doug Cutting was working at Yahoo. And if you don't know Doug, he was the one who came up with the Hadoop elephant. It was his son's plush toy. And a lot of Hadoop, to which many of us owe some part of our careers, was really born at Yahoo. So it's kind of historic that Yahoo has kind of merged back using elements of the original Google File System under BigQuery, and they were the developers of the Hadoop that we all used. There's kind of this family reunion that you just saw. Anyways, I want to talk to you about what you're interested in in your 20s and your 30s. A lot of us kind of grew up getting interested in the technology behind data. And we loved scale. And we loved getting things to work. But when I got a little bit older and started kind of working with the data, instead of getting the data to work at all, I saw a lot of people who were disenfranchised by the kind of problems that you're having when you're working with technologies like Hadoop. What I realized is that a lot of people are left out in the cold as soon as they hit like a Java stack trace or something. And it led me to switch kind of from doing data science to trying to go back to data engineering at "The New York Times," because I saw this problem around us. We had a situation where we had a lot of people staring at Java stack traces when they should be kind of computing things. So it caused us to take a step back and look, not at the technologies that were available at the time, but the technologies that maybe were nascent at the time. And this was last year. So to the people in this room who maybe are flirting with the idea-- you're running Hadoop stacks, and you're not sure if you really want to take a chance on this thing, which is, after all, as a commercial product, quite new. I was exactly where you are last year at last year's Google Next wondering if I was going to be risking anything. So this is potentially you one year later. At "The New York Times," you know, we're an older company. I think our founding predates the Civil War. And we had a similar archaeological cross-section of data storage technologies. We still have a few mainframes, and they work-- much respect to mainframes. They're just so bulletproof after all these years, but we've got to move a lot of these things into a common place. Because if you're going to work in media today, you've got to be able to know all the touch points for a given person in the data silos that have been created over time. Different IT systems for different departments, you know, that's not going to work when you need to respond in 10 milliseconds to a web request. So as you can see, we've gone through Oracle data warehouses. We've spun up at least 500 MySQL instances on Amazon. Redshift, which I like, but-- and I'll tell you the but later. And maybe a year ago-- well, maybe a year and a half ago-- we spent $1 million on our hardware Hadoop cluster to cut down all these cloud costs. And I got to tell you, after we like kicked the tires on Hadoop, it was a no-brainer to junk all that capital and just go for this system. So we did that with that kind of outlay. You have people trying to answer a simple question about the business that we're in, and they get this kind of thing. And what happened is that the kind of people who can deal with this problem are not the people who know what our quarterly earnings should be. Are the business metrics at the right level?
And we separated the people who have the technical capability to answer the question from the people who originated the question. I think that's one of the unfortunate sides of big data. And so what we want to do is really bring the capability to the outer part of the circle. People who are closer to the product design, people who are closer to the heartbeat of the business. We fetishized data engineering a little bit, and it's been fun for a lot of it. But I think we need to make these tools as polished as the tools that we used to use, as polished as those old mainframes, you know? And there's a problem with opensource, I think, because the kind of thing that takes you to the 100%, the polish, is really boring QA work of edge cases and corner cases. And from what I know of engineers, they want to solve scaling problems and new core engines. Unfortunately, we've got businesses to run in the meantime. So we took a look at kind of what our priorities were. And although machine learning gets a lot of headlines, and certainly everything in there is important, it relies on stable infrastructure. Nothing works unless you can count on it to work without a smoke break. We need it to be in SQL. We don't want to make people write Java just to answer some question. And then once you do that, you can start taking control of the probably octopus of ways you're storing data in your company. And then you can have regular normal people ask questions. Now you have four times as many people getting the answers they want. So I'm going to briefly show you how we conducted our trial. We benchmarked performance. We took our six most common queries. We estimated price by looking at the Hive logs and seeing what was the data footprint on each of them and adding it up. My estimate was that we were actually going to be at 1/3 the costs of our Redshift cluster. It turned out to be 1/4 when we actually put it in. And people are running their queries a lot faster. So then we went to an internal beta. And here's a little bit of social engineering I'll talk to on the next slide. It's really important if you do the shift you budget for two systems running at the same time. Because this is data, and you have to make sure it is correct. So that means you need an audit benchmark. So we kept-- I'll show you this a little in zoom. So the bottom is our old flow. The top is what we set up for GCP. So we split our thing going to a pub/sub. The pub/sub calls the AWS SNS. And that goes into dataflow. And now, you don't have to do data flow. We wanted to move from mini-batch systems to near real-time. And it took three weeks in the POC for us to get latency from 45 minutes, log files, gzip-- I know some of you still do this-- to 45 seconds. From the time that you click to "The New York Times," to the time that I can see your click in a query is down to 45 seconds. And internally you get a lot of love. No one likes queries to fail. But this is the kind of thing we saw pretty well immediately with the beta group. And once the beta group started getting this everyone else started getting a little jealous of it. And this is just organizational information. That made the rollout across the organization really good, because everyone's like, oh, can I try that? Oh, can I try that? And through that we learned a little something about how to onboard people to new technology. It's really important, if you do this, you write your internal getting started guide that uses your queries and your data. 
Because if you hit the Google Docs, their first day may be one of frustration. Then we made sure we had like a chat group internal to the company so that you could get the early users helping each other. They became your local heroes. And it formed kind of a Hive mind when it came to solving queries. And we tried to notice the little things that were keeping people from working. There were a few wins with Google here that we didn't expect. At first I thought not having JDBC/ODBC was going to be a problem. Turns out that having a web client was a bigger advantage, because it enabled so much casual use. When you can give someone an URL and say, hey, check out this query, versus having to get them to set up a whole kind of client, and like download the opensource JDBC driver, it was such a winner to have the web interface. Email account creds, it helps a lot. You don't have to set up all kinds of LDAP things here and there and make it permissions in the database. There is a module in Pandas called, I think, from GBQ, or something like that, which goes straight to Pandas data frame. So all the data scientists were absolutely thrilled. And because of the permissioning you can actually use the Google Groups as mailing lists to alert people of schema changes. The first thing you're going to have to do when you start trying to get useful stuff in there is get some of your relational data in there as well. The click stream I showed you, we did with data flow. Ironically, the large data is the easy stuff to do. It's the thousands of tables all through your organization that you're going to finally do this data lake on that end up being kind of the real pain. But I don't know, BigQuery-- all Google Cloud is Google commercializing the technology that allowed their developers to be successful. So what you're seeing is a productization of what has made Google successful. That's what we're all trying to tap into here. And they know that when a Google Developer sits down and they're paying them what they're paying them, they want it to be productive on the first day. And I got to say, when you start loading data into BigQuery, you go hit your database, push it to a FIFO or a file, and push it into BigQuery, it works like right away. And you know, this thing, there's no indexes. There's no sort order. At first I thought, well, what are the wastes to get all this performance? But the ability for me to tell an analyst that they can upload data without worrying about the layered on disk or anything like that, it's been a huge win. Because now people can do these loops where they have a question, they have an external data source, they push it up into the database themselves, and they can answer their own question. And so I think what happened is we really changed our performance metrics from-- how does this query run? To-- given a question, how long does it take until a reasonable skilled person can get an answer? And so the less people you have in their loop the better. I'm not going to read all these slides. The slides are more for you if you check it out later and you want to know what are the best practices for every type of data. These are the things that I came up with. But I do want to talk about data flow a little bit. If you don't need real-time-- and be honest with yourself about it-- don't do it. Little small batches are totally fine. A lot of people say real-time, they're still fine with five minutes. 
Dataflow can be a little bit expensive to use for streaming, because you're just keeping these connections alive no matter what the data volume is. And here are a few other things. Telling people to use BQMate-- for the first time people will become aware of the cost of what they're doing, which is sometimes a shocker. But it teaches them how to write a better query. And you probably need to get out of the habit of ordering. I think relational databases made us do that. But on distributed systems ordering can really screw up your performance. So don't order if you don't need to. Finally, don't think of BigQuery as a database. And it's in this uncanny valley right now, because it almost looks like a relational database. But you can't do a lot of mutations on it. But the storage is the same price as cloud storage, which means you can store your data as cheaply in BigQuery, where you have query access to it, as you can if it were on disk, where you'd need to spin up like a Hive cluster to get at it. So you basically have-- all structured data is just as happy in BigQuery price-wise, opex-wise, as it is in any other system. So finally, let's talk about what your KPIs are. When we started this project we were like, what should we target? Deprecated systems? Tables migrated? And I said, no, let's do a survey of people. And we did a survey before and after BigQuery. And we went from 3.5 to 4.7 out of 5 in terms of analyst satisfaction. And around Christmas I get an email like this. So I hope that if you go down this path-- and you can grab my email, or grab anybody here, because it's a small enough community that we can help each other a lot-- that you end up with an email like this when you're into your attempt at trying a technology, which I think really deserves a break. That's it. Chad? [APPLAUSE] CHAD JENNINGS: Thanks very much. The last part of the session here-- we've got right on 20 minutes or so. I'll invite my co-presenters up onto the stage. So also acknowledging the-- Nick, please-- and acknowledging the dual purpose of these slides, we're going to talk a little bit right now about why bother doing this crazy migration thing if it has risks, and it has work to do, and et cetera, et cetera. So we'll talk a little bit about that starting with Jason in just a second. But I also wanted to point out that in the deck the gentlemen here were kind enough to add some bullet points about what the benefits were. So like you can get the deck, you can get the video, and have a look through these. Oh, yeah. Sorry. This is also for reference of the related-- oh, there's the clicker. Thanks. Some other sessions that may be interesting to you. And then because I chose the slide with the animation-- yeah, I won't leave this one up forever. But if you want to grab our handles, you can get a hold of us another way. I'll leave this one up for a couple of minutes. But I wanted to start off the conversation here. And, Jason, ask you to lead us off. What was the benefit of doing the migration that you did, and can you describe what the migration was? JASON JHO: Yeah, sure. So I guess I'll start with just first introducing myself. So for those of you who may not know about Blue Apron, we're a company whose mission is to bring incredible home cooking to everybody. And in the process of doing so, build a better food system, which is essentially how food is grown and distributed. So we're not actually as big as Yahoo and perhaps "The New York Times," in terms of data size and scale.
But we actually do have a lot of complex data sets-- everything from our warehouses, the physical kind, where we actually ship boxes out to our customers. To the consumer website that we have collecting data about consumer behavior, to all the data that is relevant to shipping and logistics. So a lot of different fragmented, heterogeneous data sets that are relatively complex. We rely on that data to perform analytics as well as to help us forecast demand. And to also drive personalized content to the end user. So data is very important for our business success. BigQuery has been an incredibly important asset to our platform and pipeline. So one of the things that really shines about BigQuery, it's really simple. It's our ability to not think about performance-- so tuning query performance, thinking about indexes and capacity planning. All of that sort of magically went away. And we instead took that time to think about the data model. What is the right data model that we want to curate for our analytics and machine learning teams, so that they don't have to run all of these complex processes to get data from different places? That semantic layer that we're building can be consistently shared across the company and scale as a business grows. And BigQuery is a real incredible enabler for that sort of process, which leads me to my next point. I think having that sort of scalable petabyte scale data warehouse at your fingertips, without having to really think about the management part of it, has allowed our analysts to really explore data at interactive scale. And I want to emphasize interactive. I think for the very first time using Looker, which is a tool that we use for BI, we've been able to answer questions as they come. You know, like, the ability to do full table scans, to aggregate data on demand is an incredibly game changing thing. Before we would have wasted hours and perhaps, embarrassingly, days to get answers to our questions that we had. So BigQuery, again, has been a tremendous tool to our tool box. And finally, I would say that, to Nick's point earlier about storage costs, having data in BigQuery is-- number one, it's available for hot querying. But you can actually have large amounts of historical data just sitting there. And that's sort of what we've adopted in terms of getting historical data into the data warehouse without having actual time series data. So we've been able to batch load consistent snapshots of our data into BigQuery using day partitioning and essentially have the ability to query over time. So that's been a huge boost in our ability to understand customer behavior over time. So all in all BigQuery has been an incredible tool. And it's allowed us to stay ahead of the insights instead of playing catch up all the time. CHAD JENNINGS: Great. Thank you. Thank you very much for that. I'll say that the team will be particularly pleased to hear that story. Actually it was at this conference last year that we introduced long term storage. Where if your table hasn't been modified for 90 days, we just reduce the price of storage by 50%. And it was exactly to induce people to say, hey, let's keep this around and continue to get value out of it. So that's terrific. I wanted to start out with a quick question here for you. How did you guys, when you were first adopting BigQuery, or first thinking about it, how did you design the proof of concept? JASON JHO: You want me to take this one? CHAD JENNINGS: Yeah, why don't you lead us off? JASON JHO: Sure. 
For us it was really simple. To Nick's point, getting started with BigQuery is as easy as opening up a URL on your web page and putting in a query. So we actually took a couple of the most expensive queries that we have, queries that would actually run for hours, and without much migration work at all we demonstrated that this thing can run in less than a couple of minutes. So that was a huge just proof of concept right off the bat. So yeah, it was really as simple as that. CHAD JENNINGS: And, Nikhil, you had a very different, like orders of magnitude, proof of concept to design. How did you pick that? How did you design it? NIKHIL MISHRA: So for us, yes, it took us time to get started. So we had to think through a lot of stuff. And the way we decided was we didn't want to go to a specific query model, or benchmark something-- that's industry standard benchmark-- or one of those things. We essentially wanted to replicate a real life use case. What that required was to actually move our real data out there. And that requires all the kind of prep that I talked about. Go through the legal review. That's a very important step. And we started off with that. The network bandwidth was something that we actually did not start off looking at initially when we did it. We actually ended up choking our pipe between Yahoo and Google by moving a massive amount of data. We brought down some of our internal systems. And that's where the consideration has become so important for everybody at Yahoo. And it was something new. And we assumed a lot of stuff. And we were wrong. And as we started off migrating the data, we also were not aware-- and I believe Avro is still beta. I don't know. But at that point we were not aware that Avro is natively supported, which is a huge deal for us. So when we got those pieces together what we did was we really picked up our real life use case. And we were using another third party hosted cloud. And we had something going on there. And we just picked it up and moved it here and moved some additional data sets to just try out Google's claim of petabyte scale. And it worked out pretty well. And we replicated the use case. And as I said, without telling our users we flipped the switch. They didn't even know that they were querying another database. We used Looker as our BI tool. And Looker had pretty interesting integration with legacy SQL. The legacy Google SQL, which is not on single plan. Which I love, by the way, it is pretty interesting. It's fun to use. So we just hooked it up. We created exactly-- all we had to do was flip the connection, reuse the same model, same stuff, hand it over to set up users, and just monitor them. And that's what we did. And that allowed us, and that gave us the confidence that this is something that we could possibly work with. CHAD JENNINGS: Yeah, you were in a fortunate position where you had a cloud workflow running, and you were able to switch that one over and do a direct compare. I wanted to ask Nick a question here. So BigQuery, if you're thinking about just doing lift and shift from a Teradata instance or something like that, BigQuery is not optimized for a star schema, right? BigQuery wants the tables flattened. So there are definitely some-- let me not sugarcoat that. There was a session yesterday on appropriate data models, so we're not going to go into that in detail here. 
But I did want to ask Nick-- when you do that shift there's an opportunity for you to, A, endure some heartache and do some work, like redesign-- NICK URSA: We did, and it was heartache. CHAD JENNINGS: So the question here is, how much schema migration do you want to do? The analogy is when you do remodeling at your home, you pull the plasterboard off and you're like, oh, geez. Do I do the electrical as well? So how far do you take that? NICK URSA: I don't want to say that you can't do star schema in BigQuery, especially if your dimension tables are small. They're going to like-- CHAD JENNINGS: You can, it's just not optimal. NICK URSA: Right. Right. Well, the thing is-- maybe other places have this, too, where there's a clickstream data flow. And clickstream dataflows are high volume, wide tables. Often you're only taking a subset of the data drop. And then you have like a bunch of data warehouses with smaller tables. And we're at the point that those two worlds are merging inside BigQuery. So we used to do a lot of stamping in applied values, because we knew we couldn't join it later. I think that we were a little more conservative in BigQuery, because we saw that we could actually get good join performance in some of the smaller tables. But if anything, we had to scale back on how much we denormalized. We had denormalized as if we were going to run on Hadoop and never do a join in the world. And we found BigQuery actually had better join performance coming from a Hadoop workflow. But when you revisit things it's a chance to do those things that you kind of never had the chance to. So we did. I think more important than the performance was just a data governance exercise of-- what are these fields? Are we still using them? Do they mean what people think they mean? And although it delayed our rollout by a month, it's important that everybody who sees a column name and makes a logical assumption of what that thing means-- that they're right. Because I think worse than slow queries, worse than anything is getting two different answers. So I think it's more of the data governance that drove the schema redesign than any performance consideration. CHAD JENNINGS: Interesting. I'll also invite the audience here. So I have a couple more questions, but I'm going to run out before the session is over. So I'd like to invite the audience to come up and ask questions of us as well. Please, use the microphone so that it's on the record. I'll give you a second. I did want to ask Nikhil, what's been the biggest impact of migrating to BigQuery and having BigQuery be that like fast store at the end of your block diagram? NIKHIL MISHRA: It's productivity. There was one of the PMs in Yahoo, a pretty smart guy, he actually put a sheet together and handed it over to me after we did this migration. And he had a dollar amount of how much we saved improving the productivity of different product managers. And he took himself as a base, which is not the right thing to do, because he did all kinds of crazy steps there. But I think that to me is a big win. Initially, even though we had all the data easily available and accessible to the users, users knew that that's where the data is, and that they could query it, there was a fear. There was a barrier. And the barrier was they didn't know whether they would run the query and whether they'd be able to get the answer back, right? Depending on how much data they're querying. Now, what's the load on the system at that point in time?
And by nature, you know, it was more suited for batch loads. And it kind of, you know, with concurrent users, we just ended up waiting way longer than we would anticipate when a single user or a few set of users are running the queries. So I think that was a big win. When users don't think twice before just logging into the system and running their queries, that's a huge win. CHAD JENNINGS: Fantastic. NICK URSA: I just wanted to add to that. Which is that it's not only an efficiency thing, it's an innovation thing. Because you know what it's like when you try something out and it's five minutes. You lose your train of thought. Imagine if your IDE didn't respond in five minutes. But that's what it's like for an analyst when they run in BigQuery. And so when you're trying to refine a business model or you're trying to search for the cause of a problem, the ability to retain your train of thought as you query is hugely valuable. And people like Justin are literally having so much more fun now. CHAD JENNINGS: That's a wonderful story. Yes, sir. Please. AUDIENCE: I wanted to ask you, do you have any scenario where the data needs to be updated, and how frequently does the data need to be updated? How do you solve it? And does that affect your schema design? NICK URSA: I mean, I might want to talk to you about what it is. When it comes to, like, logs and aggregations, append only, obviously those don't need to be updated. So let's look at the categories. Large things that need to be updated, small things that need to be updated, small things that don't need to be updated-- well, let's not even talk about that-- and large things that don't need to be updated. So the last category is very easy. Those are just append only. Small things that need to be updated, we have relational databases for that. So for me, the solution to that is just replicate, replicate, replicate. And there's a company called Alooma, which can even read MySQL binlog files directly and move them into BigQuery for you. And so the tricky one is the first column. Things that need to be updated that are large. This is the use case that Bigtable was designed for. When you're in that category you maybe need to look at different storage options. You know, Spanner may actually be a good choice for you if you have large scale. But a lot of people can get by on a 256 gigabyte, SSD-backed PostgreSQL instance. The 9.6 is amazing. SSDs are amazing. And a lot of RAM solves a lot of your problems. CHAD JENNINGS: If I may, I'll just add one thought to that. Nick mentioned Bigtable. I can't say what it is now, but there's a talk at 5:20 that has a lot to do with how Bigtable and BigQuery might possibly, maybe possibly work together. NICK URSA: Hint, hint. CHAD JENNINGS: Yeah, no kidding. Hint, hint. So to take Nick's point, if you have to do a lot of streaming, then you can load it into BigQuery, and then you can export it from there. 5:20. JASON JHO: Yeah, so we actually do both. So we take consistent transactional snapshots of our relational data. And we batch load them into BigQuery several times a day. But we also have event data that's coming in, in like an append-only fashion, streaming that directly into BigQuery. So I think depending on your needs, for us we have the benefit of taking our relational data, sort of loading them into a big history table. But we also have sort of real-time event data coming in, as well. So it really serves well for both of those use cases.
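[Note: a hypothetical sketch of the "do both" pattern Jason describes-- batch loading a consistent daily snapshot into a date partition, alongside streaming event rows as they arrive. Table, bucket, and field names are made up.]

```
# 1) Batch: replace today's partition with a consistent snapshot of a relational table.
bq load \
  --source_format=AVRO \
  --replace \
  'mydataset.customers_snapshot$20170309' \
  'gs://my-staging-bucket/snapshots/customers/2017-03-09/*.avro'

# 2) Streaming: append event rows as they arrive (newline-delimited JSON).
echo '{"event_ts":"2017-03-09T16:00:00Z","user_id":"abc123","action":"page_view"}' > rows.json
bq insert mydataset.events rows.json
```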
CHAD JENNINGS: Yes, sir? AUDIENCE: My question is about separation for the data. So do you have role-based security implemented in your systems, guys, somehow? CHAD JENNINGS: Yeah, actually. Thank you for bringing this one up. So the GCP security model is based on GCP's IAM-- identity and access management. And so you can get granularity down to row and column access by creating views and then giving particular sets of people access to that view. AUDIENCE: OK. Thank you.
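[Note: a hypothetical sketch of the view-based approach Chad mentions-- expose only certain rows and columns through a view in a separate dataset, give a group access to that dataset, and authorize the view against the source dataset. Dataset, table, and column names are made up.]

```
# Put restricted views in their own dataset.
bq mk restricted_views

# A view that exposes only some columns and rows of the underlying table.
bq mk \
  --use_legacy_sql=false \
  --view='SELECT order_id, order_ts, total_usd
          FROM `myproject.mydataset.orders`
          WHERE region = "EMEA"' \
  restricted_views.emea_orders

# Then grant the analyst group READER access on restricted_views, and add the view to the
# source dataset's access list (an "authorized view"), for example by editing the dataset
# metadata: bq show --format=prettyjson mydataset > ds.json, edit "access", then
# bq update --source ds.json mydataset.
```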
CHAD JENNINGS: Yes, sir? AUDIENCE: I'm just curious how you guys control costs by like throwing Looker on top of BigQuery and releasing it out in the wild. JASON JHO: I think Looker comes with a dry run functionality. You can actually like estimate the number of bytes scanned since that's how you actually get charged. So that's one thing that we do. We also have budget calculators. And I think Google actually has a tool that allows you to cap-- CHAD JENNINGS: Yeah. There's a form that you can fill out online. Apologies, it's a bit clunky. But you can fill out the form and say, hey, I want to cap spend on X project at Y dollars. It's basically quota control. AUDIENCE: Thanks. NICK URSA: By far, the largest cost is mistaken queries, not like the normal ones. CHAD JENNINGS: Yes, sir? AUDIENCE: First of all, I appreciate you said that migrating star schema is not easy. You didn't sugarcoat. So the reality is that there are a lot of legacy star schemas, Teradata, Netezza, or whatever, right? I'm just curious if any of you guys had the experience, how you went about it. Did you just try to break it down? Chunk it down? What stays as star schema? What are some things which you can refactor, redesign? What are some effective strategies? I'm just curious on that. CHAD JENNINGS: Thank you. JASON JHO: So I might not be the best person to answer this, but we just threw the same data model into BigQuery because it joins. So that actually allowed us to buy time to do that, get the benefits of the query performance, and in parallel think about, well, how do I denormalize and flatten this data so that we can really optimize? So I think just the fact that it supports standard SQL and big joins is a huge win, because it allows you to be effective immediately and gives you time to really think about the optimal data schema. AUDIENCE: Because some of the other sessions talked about using the nested repeated features, but that requires a rewrite in SQL, to your point. You've got to train all the analysts to un-nest, which is an investment. That's what happened. CHAD JENNINGS: Yeah. I might add also-- so the session on the data models is the best place to go for that. The video actually should be available today. But I will say, in the experience that I've had helping people specifically from Teradata migrate, the lift and shift, with the expectation that it's going to work just like it did before, is a fool's errand. That's just not going to work, and we found that out the hard way. Because we tried it once, and it was like, oh, that was not a good idea. And so I can't give you too much advice here without like sitting down with a piece of paper, but it definitely takes some doing. We found that the partner tools have been very useful here. Although, partner tools, it's basically like they'll give you a pair of scissors and invite you to go running. They're not going to give you advice about what shape to put your data in. They're going to let you code that up any way you want. So it's a useful tool, but you still need to do that design work. Do reference that other talk from yesterday. AUDIENCE: All right. Thank you. CHAD JENNINGS: Yes, sir? In the blue shirt, please. AUDIENCE: Go first. [INTERPOSING VOICES] CHAD JENNINGS: All right, you guys can fight it out. AUDIENCE: I was just wondering if the custom partitioning was going to come in the near future. CHAD JENNINGS: So BigQuery's partitioning is not as full featured as some others. And yeah, so custom partitioning based on any user's hash is definitely on our roadmap. I really can't speculate. I know where it is on the roadmap. I can't share it here at a public forum. But let me say it this way. So BigQuery is well aware that as we market ourselves as an enterprise data warehouse, and as we talk about things like Teradata, and we talk about Oracle, and moving from those things, that we need to answer use cases with features that people are accustomed to having. And so the team has that really very much in the front of their mind. You guys have heard this story before. When we look at features that we're prioritizing, I put them into two buckets. It's table stakes and market differentiators. And things like hashed partitioning are table stakes. We have to do that to be really useful. And so those things tend to get prioritized when we're doing our work. AUDIENCE: OK. CHAD JENNINGS: Thank you. AUDIENCE: Hello. Hi. I'm Tiger, coming from Seattle. First I want to say, this session is a very solid session. It's the best I have ever taken since yesterday. It's full of good practice and good knowledge. Thank you so much. CHAD JENNINGS: It's the best one since yesterday? AUDIENCE: Yeah. CHAD JENNINGS: That's awesome. [LAUGHING] NICK URSA: What was yesterday? CHAD JENNINGS: Shh, shh, just take the compliment. AUDIENCE: My question is, as artificial intelligence and machine learning are in fashion, do you guys think you want to integrate it with artificial intelligence somehow in with your data warehouse solution? Or any practice already? NICK URSA: No. [LAUGHING] CHAD JENNINGS: OK. Give me. NICK URSA: So most ML starts with clean data. I guess they can write clustering functions or whatever, but the computational requirements of most machine learning algorithms are not appropriate for the thing that BigQuery's based on. BigQuery is based on being able to subdivide your data tremendously. But a lot of machine learning is about global optimums, and there's so much shuffling in between. So who's doing this? I'll tell you. Spotify is now using a lot of BigQuery as the source data, because good ML starts with clean data. And having a nice repository that you can pull training data from quickly and know that it's right-- and then the algo is like icing on the cake. CHAD JENNINGS: Yeah. We're being told that we're out of time here. So let me just add to that. And if there are folks with questions remaining, I think the speakers will be available a little bit after. But to talk about integrating BigQuery with machine learning, that is something that is very much top of mind in that market differentiator column. So it's prioritized often behind things like hash partitioning. But you can use it now. It's a bit clunky. But you can export data out of BigQuery, store it as a CSV in Google Cloud Storage. And then use that as input to a Cloud ML call or something like that. But we definitely have it top of mind to smooth out that workflow.
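[Note: a hypothetical sketch of the export step Chad describes-- extracting a table from BigQuery to CSV files in Cloud Storage so they can be fed to an ML training job. Names are made up.]

```
bq extract \
  --destination_format=CSV \
  --compression=GZIP \
  mydataset.training_examples \
  'gs://my-ml-bucket/training/examples-*.csv.gz'
```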
All right. Folks, thank you so much for your time. To my co-presenters. [APPLAUSE] [MUSIC PLAYING]
Info
Channel: Google Cloud Tech
Views: 19,109
Keywords: BigQuery Data Warehousing, BigQuery, data warehousing, data, data migration, scaling, Cloud NEXT, Google Cloud, GCP, Cloud, #GoogleNext17
Id: TLpfGaYWshw
Length: 61min 51sec (3711 seconds)
Published: Thu Mar 09 2017