Google BigQuery introduction by Jordan Tigani

Captions
Thanks everybody for coming out; it's great to be here. I hope this is interactive, so if you have any questions or I say something silly, feel free to stop me and raise your hand. I think we are recording this, so if you raise your hand I'll give you a mic; otherwise, if I remember, I'll repeat your question. I'm a developer, not someone who goes around giving talks all the time, so I won't always have the most polished presentation, but I know a lot about BigQuery. I wrote the book on BigQuery; Paul's got a copy if anybody wants to wrestle him for it. I'm happy to answer any kind of questions, and I can go on at length about BigQuery, so I'll probably bore you all to tears; feel free to start nodding off and I'll take that as a signal that I should start wrapping things up.

So what am I going to talk about today? First, what BigQuery is. How many people have used BigQuery? Anyone? Okay, so we have a few users. How many people don't know what BigQuery is? A couple. Okay, some honest folks here. Hopefully in the next ten minutes you'll know what BigQuery is. I'm going to show a couple of demos of just how fast it is and what kind of stuff it can do, go a little bit into the architecture (I can go into more depth depending on what people are interested in), and then have a couple of slides on how you can use it and the kinds of things you can do with it. Finally, I have a demo: you're in the city of Spotify, so I thought I would come up with a demo that was appropriate. I have a playlist data set that I'm going to use to build music recommendations in IPython at the end, and then hopefully we'll have plenty of time for questions and answers. You can ask questions about BigQuery, about Google Cloud, about working at Google, what I think of Sweden, American politics; I may not answer, but hopefully you won't ask me anything about American politics.

I stole these slides from someone on the sales team, and I don't always know when you have to click multiple times to get everything to show up. So what is BigQuery? It's a big data analytics engine. It lets you run SQL queries in the cloud over your data that's stored in the cloud, and it doesn't matter how big that data is or what kinds of queries you want to do. It's easy to use and it's got open interfaces: there's a REST API to access it, we provide a number of tools you can access it through, and there are third-party tools as well. It should be familiar to people who are familiar with databases. You can share your data, and you can easily import your data; when you have large data in the cloud it becomes really important how you get your data there, because you don't want to have to keep moving data around.

So we like to think of it as a data warehouse. People coming from a traditional data warehousing background may say, well, you guys don't have X, Y, and Z. Perhaps we don't, but I think a lot of the reasons people use data warehouses are reasons you can use BigQuery. It's fully managed: you don't have to worry about spinning up instances, you don't have to worry about how large those instances are, and you don't have to worry about disks crashing or things crashing. If things crash, I'll get a page, or one of my co-workers will get a page, and hopefully we'll fix it quickly.
If you've got petabytes of data, we can take your petabytes of data. As I mentioned, it has a nice SQL interface, and I'll show just how fast it is in a couple of minutes. I think it was The Register that called this "analytics as a service"; the acronym is "ass", so we don't use the acronym as much as we might, but you could call it that.

So what's under the hood? We run your queries in parallel; that's how we make it fast. Google decided a long time ago that scale-up wasn't going to work. They didn't want to buy, and at the time couldn't afford to buy, expensive million-dollar database machines where every time you need something bigger you have to buy something an order of magnitude more expensive. So they said, we can apply brain power to scaling out rather than scaling up. These days that's obvious, but ten years ago it was pretty revolutionary. The goal was to be able to read a terabyte of data per second, and we can pretty much get close to that on your data. It's a shared cluster: basically all the queries are running in the same cluster, and we have some nice mechanisms to keep your queries from stomping on somebody else's queries. It also means that you get a slice of a giant cluster of machines, rather than having to spin up a medium-sized cluster or deal with all the hardware requirements for the queries you need at the time.

It's got a rich SQL language that extends what you can do in normal SQL with nested and repeated fields, so your data doesn't have to be purely rectangular. It has JSON functions, so if you have fields that contain raw JSON you can do JSON-path-style queries over them, plus IP address parsing, regular expressions, and other things that can be really expensive to do in a normal relational database that relies on indices.

We also have a streaming ingestion API where you just send us your data: send an HTTP POST request as your data comes in, and you can send up to a hundred thousand rows per second per table. We have some customers for whom that's not enough, so they shard that over a whole bunch of tables, and that generally just works. You can also ingest your data from Cloud Storage, via Hadoop, via Google Cloud Dataflow, or through a number of other mechanisms.
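As a hedged sketch of what that streaming path looks like from code (the project, dataset, table, and row fields below are made up for illustration; the underlying call is the tabledata.insertAll method of the BigQuery v2 REST API, here via the generated Python client):

```python
# Hedged sketch: stream a few rows into an existing table via tabledata.insertAll.
# Project, dataset, table, and row contents are hypothetical.
from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()
bigquery = build('bigquery', 'v2', credentials=credentials)

body = {
    'rows': [
        # insertId lets BigQuery deduplicate retried rows on a best-effort basis
        {'insertId': 'event-0001', 'json': {'user': 'alice', 'ts': '2015-04-05T12:00:00', 'action': 'play'}},
        {'insertId': 'event-0002', 'json': {'user': 'bob', 'ts': '2015-04-05T12:00:01', 'action': 'skip'}},
    ]
}
response = bigquery.tabledata().insertAll(
    projectId='my-project',   # hypothetical
    datasetId='events',       # hypothetical
    tableId='plays',          # hypothetical
    body=body).execute()
print(response.get('insertErrors', 'ok'))
```

Each POST can carry many rows; the per-table rate the speaker mentions (around a hundred thousand rows per second) is across all such requests.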
There are some third-party connectors that let you access your data: if you use R or IPython and pandas, there are some nice integrations there; there's a Hadoop connector so you can use BigQuery tables as either a source or a sink for your Hadoop jobs; and there's an ODBC connector for people who might be more old-school, who like their Microsoft technologies and are going to use them.

We also allow you to join data from anywhere; it's a single namespace. So you might have a table that you want to share with a customer or a group of customers: you can create a view over your underlying data and allow your customers to query over that data just by updating the ACL. It can be pretty easy. You can also do ETL in BigQuery, because we have the ability to write large data sets back out in parallel. Say you need to do data cleaning, you need to coerce your data from one format to another, or you need to deduplicate: you can do all these things within BigQuery.

BigQuery has essentially unlimited storage. If you're going to store more than, I don't know, ten petabytes or something, let us know. At Google, you can say "I need some disk quota" and they ask how much, and you say "a petabyte" and they say "oh, I thought you needed a lot of quota." It's generally not a big deal. We replicate your data geographically and store it in an encoding that's durable, so we can store your petabytes of data without losing it. It also means you don't need to throw away data. One definition of big data I heard was: big data is when it becomes less expensive to keep your data than it is to figure out what you need to throw away. We hope we're enabling that kind of model, where you just accumulate data; you might need it again, and if you don't, it's not costing you very much to keep it around. All that old data is immediately accessible too: you don't have to re-ingest it or change the format, you can just run queries over all of your data.

Okay, I'll do a little demo just to show how fast a query is, and hopefully it's fast tonight. This is benchmark data that actually comes from Wikipedia: basically one row for every view of every Wikipedia topic per day, so we have some very large data, and this is just a small slice of it. This table has 1 million rows of the Wikipedia data set. That result was cached, and you probably don't want to see cached data, because it's not very impressive to run a cached query fast. There we go, no cached results: two seconds to do a million rows. That's probably not impressing anybody; a million rows isn't cool. You know what's cool? A billion rows. How long does a billion rows take? 4.7 seconds. Not bad: we went up three orders of magnitude and just over doubled the time. We have some bigger data here. How about a hundred billion? I don't have a movie quote to go with a hundred billion, but it's another two orders of magnitude, and it's a lot of data; this is going to scan about four and a half terabytes. And there we go: 7.4 seconds. That's reasonably fast; hopefully somebody is suitably impressed. The query we were running was actually running a regular expression over every single row to find titles that matched "Google", and it's doing an aggregation over those.
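The captions don't show the query text itself, so here is a hedged reconstruction of that kind of benchmark query in the legacy SQL dialect BigQuery used at the time, run through the pandas connector mentioned earlier. The table and column names are guesses based on the public Wikipedia benchmark samples, not taken from the talk.

```python
# Hedged sketch: regex filter plus aggregation over a large public sample table.
import pandas as pd

query = """
SELECT title, SUM(views) AS total_views
FROM [bigquery-samples:wikipedia_benchmark.Wiki1B]   -- assumed public benchmark table
WHERE REGEXP_MATCH(title, 'Google')                  -- regex evaluated over every row's title
GROUP BY title
ORDER BY total_views DESC
LIMIT 100
"""

# project_id is whatever project you bill the query to (hypothetical here)
df = pd.read_gbq(query, project_id='my-project')
print(df.head())
```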
But you may say, well, what about joins? I want to do joins. So this is the same 1-billion-row data set, and I'm joining it against itself, and I'm doing a GROUP BY here, not to decrease the row count but because I don't want a join explosion. When you join, relational algebra says that if there are multiple rows with the same value on each side of the join, you get the cross product of those. I don't want that expansion; I just want to see how fast the join is. I probably should have started this before I started my spiel, because this usually takes 20 or 30 seconds, sometimes a little slower. So this is joining two billion-row tables; when I do the GROUP BY, the table on the right shrinks down to about 200 million rows, so it's a billion rows against 200 million rows. And there we go: 22.9 seconds. It's shuffling that data, sending it across the network, and then doing the same sort of aggregation. So BigQuery is pretty fast, and I didn't need to do anything special to get that kind of speed. You can run the same query and validate that you get the same sort of speed; if you have your own data you can use that, or, I believe this data set is public, I used it for the book.

Oops, I want to present; there are three buttons they throw on here. Okay. So here's a chart of what performance looks like. This was from the book, from last spring, so it's about a year old; since then our clusters have gotten much larger and we added some technology to make joins much faster. It looks like a straight line, but this is a log-linear plot, so if it's a straight line the time is actually increasing only as the log: we go from two seconds to scan a thousand rows to eight seconds to scan something seven orders of magnitude larger. For joins you do see a kind of knee here; this was with the old join technology, and it shows a billion rows taking about 30 seconds.
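Again, the demo's SQL isn't visible in the captions, but the shape of the self-join described above is roughly this. It's a hedged reconstruction in legacy SQL; the table name and the grouped-down right-hand side are assumptions, not the speaker's exact query.

```python
# Hedged sketch: self-join a large table against a GROUP BY'd copy of itself,
# so the right side collapses to one row per title and there is no join explosion.
join_query = """
SELECT wiki.title, wiki.views, grouped.total_views
FROM [bigquery-samples:wikipedia_benchmark.Wiki1B] AS wiki
JOIN EACH (                          -- JOIN EACH was the legacy-SQL hint for large-vs-large joins
  SELECT title, SUM(views) AS total_views
  FROM [bigquery-samples:wikipedia_benchmark.Wiki1B]
  GROUP EACH BY title                -- shrinks the right side to roughly hundreds of millions of rows
) AS grouped
ON wiki.title = grouped.title
LIMIT 10
"""
# Run it the same way as before, e.g. pd.read_gbq(join_query, project_id='my-project').
```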
BigQuery is an externalization of an internal technology called Dremel. Dremel is a tool that's been used inside Google for about ten years. The inventors of Dremel asked: what if we dropped the constraints of normal databases? One of the hardest things about building a normal database is that you spend a lot of time worrying about table scans. You don't want to do a table scan, because table scans are slow: every time you do one you have to read all your data off disk, disks are slow, memory is much faster, so you want indices and heuristics about your data so you do as little disk reading as possible. So they said: what if we made it so every query was a table scan, every query reads every row, and we just made a table scan really fast? And if we made a table scan fast, are there cool things we could do by leveraging Google's technology and infrastructure scale?

So they built Dremel, and the way it essentially works is that there's a root server, which we call a mixer, some intermediate servers, and some leaf servers that do the work. When you execute a query, the root server partitions that query up by the sources of your data. We may store your table in, say, ten thousand files, and we'll send each leaf server just one of those files to read. One of the nice things about SQL is that it's nicely parallelizable; it's essentially functional, so you can split it up into multiple pieces and do them in parallel. The hard part is aggregation, and that's where the intermediate servers come in: the queries flow down to the leaf servers and the aggregations flow back up. Say you're doing a SUM: the leaf servers compute the sum of the data they know about, the intermediate servers compute the sums of those partial sums, and the root computes the total sum. So you can split things apart nicely and put them back together. Some things are tricky: distinct counts are tricky, and joins are tricky when you have two large data sets. When you have two large data sets, what actually happens is a hash partitioning: we read the data, assign a hash value to your join keys, and then send all the rows that match a given join key to the same server. I'm probably going into more detail than I need to; I'm happy to answer questions about this if people have them.

And then we store the data in the successor to GFS, which at Google is called Colossus; it's very similar to GFS. Think of it as storing your data on hundreds of thousands of disks. One disk is slow, but if you can read from a hundred thousand disks in parallel, that's pretty fast. So BigQuery can leverage Colossus to read from a massive number of disks at a time, and that's essentially what allows a table scan to be super fast. The goal was to do a terabyte a second, and on some queries we can actually get that; the query we saw was, what, seven or eight seconds to do four and a half terabytes, and that was also doing a bunch of regular expression processing while sharing the cluster with a bunch of other clients.

Yes, so the question was whether the query planning takes place in the root server, and whether I can talk about the query planning. I can't talk too much about it, other than to say that we don't do much query planning. That's one of the nice things about a massively parallel system: to make things fast you don't need to be smart, you just need lots of hardware. There are clearly some cases where that's not quite true, and we are investing in some query planning. We started with this amazingly fast system, but there are a couple of corner cases where, because it's a totally different architecture, things don't behave the way people expect: certain types of queries that are fast in a relational database may be more difficult for us, and count distinct is a perfect example. As we become a more mature technology, we've realized there are some of these nice-to-have things we need to go back and do, so the service is getting smarter: we're building a lot more intelligence into it, and a lot more predictability. One of the things we're really focusing on now is making sure that if you ran a query yesterday and it took ten seconds, you can run it tomorrow and it will take ten seconds plus or minus a reasonable delta. I hope that answers your question.
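To make the scatter-gather idea above concrete, here is a toy illustration, purely conceptual Python with made-up data, not anything from Dremel itself: leaves compute partial sums over their own shard, intermediates combine partial sums, and the root adds up the intermediate results.

```python
# Conceptual sketch of tree aggregation for SUM, with made-up data shards.

def leaf_sum(shard):
    """A leaf server sums only the rows in the one file it was handed."""
    return sum(shard)

def intermediate_sum(partial_sums):
    """An intermediate server combines the partial sums of its leaves."""
    return sum(partial_sums)

# Pretend the table is stored as many small files (shards).
shards = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]

# Fan out: every leaf works on its own shard in parallel (sequential here for clarity).
leaf_results = [leaf_sum(s) for s in shards]

# Fan in: group leaves under intermediates, then let the root total the results.
intermediate_results = [intermediate_sum(leaf_results[:2]), intermediate_sum(leaf_results[2:])]
root_total = sum(intermediate_results)

print(root_total)   # 55, the same answer a single-machine SUM would give
```

A SUM composes this way trivially; as the talk notes, COUNT(DISTINCT ...) does not, which is why it's one of the tricky cases.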
The other thing that helps us do table scans very fast is column-oriented storage. Column-oriented storage is very popular now: instead of storing your data in record-oriented form, where you store one row, then the next row, then the subsequent row, you store essentially each column in a different file. You might ask why that would be faster, and there are a couple of reasons. One is that column stores are very compressible. Say you have a very large number of columns and you try to compress that data: a compression algorithm works by removing redundancy, and there's not a whole lot of redundancy as you go across a row, because all those columns are doing something different. But there's plenty of redundancy as you go down within a column. Say country is one of the fields and it's spelled out in every row, and maybe eighty percent of your users are in Sweden or in the US: a column with just a few distinct values compresses very nicely. So column stores compress well, and they also work really nicely with a distributed storage system, because you can read multiple columns in parallel without having to worry that you're doing disk seeks each time. If you have a column store on just a standard spinning disk on your machine, it may not actually be particularly fast, because each time you switch from reading one column to another you have to do a disk seek; there's obviously some read-ahead the operating system will do, but it's much nicer if you can read from distributed storage like Colossus.
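A quick way to see the compression point for yourself, as a hedged toy example with made-up data and plain Python plus zlib, nothing BigQuery-specific: serialize the same records row by row and column by column, then compare the compressed sizes.

```python
# Toy illustration: the same data typically compresses better when laid out by column,
# because the repeated values (here, the country field) end up adjacent to each other.
import zlib

rows = [('user%05d' % i, 'Sweden' if i % 10 < 8 else 'US', str(i % 7)) for i in range(10000)]

row_oriented = '\n'.join(','.join(r) for r in rows).encode('utf-8')
column_oriented = '\n'.join('\n'.join(col) for col in zip(*rows)).encode('utf-8')

print(len(zlib.compress(row_oriented)))      # row layout
print(len(zlib.compress(column_oriented)))   # column layout: usually smaller for the same information
```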
What about indexing? There are no indexes in BigQuery, so you don't have to worry about it.

So the question was: the block size in Colossus is a megabyte, while Hadoop block sizes are going much larger, to half a gigabyte; how is that going to affect query times? I can't talk about that very much, but all I can say is that it's not particularly important. I don't want to sound cocky when I say this, but Google built GFS, which was amazing; we published the GFS paper and that turned into, what is it, HDFS, sorry, I was blanking. It was implemented externally, and in the meantime we had a lot of experience running those systems, so Colossus was built to solve the problems with GFS. It's possible that what Hadoop is doing is better, or that larger block sizes would help, but I think for the most part it doesn't matter, and I can't go into much more detail about it, unfortunately. I also think you may have some misleading data about the Colossus block size, but I don't like getting in trouble for saying more.

So the question is: for tables with a large number of columns, do you need to define sort orders to make the queries fast? Not that I know of. I'm not sure why you need to do that on Vertica, but to the best of my knowledge it doesn't affect BigQuery. Maybe with nested and repeated fields there might be some performance improvements to be had that way, but not that I know of. It's always a full scan, with caveats: we can be smart about storing metadata, and we're getting smarter about storing metadata about what data we have to read in every file, so it's possible that we can skip some data.

So, BigQuery is an API, a RESTful JSON API. Anybody who wants to connect to it can just send us raw HTTP requests with curl. Google also has a technology that takes the definition of your JSON API and generates clients in multiple languages, so there's a Python client, a Java client, C#, a whole bunch of different languages. Some people get frustrated that those clients can be a little bit weird: since they're automatically generated from the API definition, we don't have a whole lot of control over certain things, so the fact that you might have to specify something as a string rather than as an enum in Java can be a little wonky. But it also makes it really easy to send what turn into JSON requests, so anybody can connect to it.
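For example, a hedged sketch of issuing a query through the generated Python client, using the jobs.query method of the REST API; the project is a placeholder, and the query runs against a public sample table:

```python
# Hedged sketch: run a synchronous query through the generated client and print the rows.
from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()
bigquery = build('bigquery', 'v2', credentials=credentials)

request_body = {
    'query': 'SELECT word, word_count FROM [publicdata:samples.shakespeare] LIMIT 5',
    'timeoutMs': 10000,   # wait up to 10 seconds for the job to finish
}
result = bigquery.jobs().query(projectId='my-project', body=request_body).execute()

for row in result.get('rows', []):
    print([cell['v'] for cell in row['f']])
```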
We also have a bunch of first-party tools that we provide. There's the BigQuery web UI, which I showed a little while ago. There's a bq command-line client: some people just don't trust UIs, and if they can't do it from their Unix command prompt then it might as well not exist, so we provide that, and it's also a good way to read the code and see what the best practices are for using BigQuery from Python. There's a connector for Excel, so if you like your Excel spreadsheets you can use that. I joke, but Excel can be really nice for things like graphing and pivot tables, and you can have your Excel spreadsheet run against BigQuery and update your graphs. It also integrates with Google Sheets: there's something called Apps Script which lets you write scripts against BigQuery and import the data into your Google Sheets. For people who are using Hadoop, there's a Hadoop connector, so you can use BigQuery as a source or a sink. And Google Cloud Dataflow is a new technology that lets you tie together a bunch of MapReduces in a really nice way that we can optimize, with lots of nice features; you can easily read or write a BigQuery table with it.

There are also a number of third-party tools that integrate with BigQuery. Tableau is an awesome way of doing data visualization; they did a really good job, and they're actually right across the street from the office in Seattle where we work, which was fortunate, so we've been able to work with them on their connection to BigQuery. For people doing scientific programming, there's a good chance they're using R, and Hadley Wickham, who's pretty big in the R community, has written a BigQuery connector for R. If you're into scientific programming and you're not using R, you're probably using pandas and IPython; there's a nice connector there too, and I'm actually going to show that a little bit later. There's ODBC for people who are more old-school, and a whole bunch of other things, with additional ones being added all the time.

So how much does it cost? It seems like a great thing, but we do charge money for it. The first terabyte of data you scan per month is free, and that counts just the columns that you read: you might have a hundred-gigabyte table, but if there are 50 columns and you're only reading two of them, you're only charged for the columns that you read. Because we do a full table scan every time, though, we do charge you for the full size of those columns. I did the conversion math before I started; these numbers may fluctuate, and I'm not sure what the actual published conversion rates are, but this is more or less what you pay. For storing data we charge on bytes per month, and if you store your data for less than a month you only get charged for the actual amount of time you store it, prorated down to the second. The other way we will get you is if you stream data in: we charge per ten million rows, at one US dollar per ten million rows, so it only adds up if you're doing massive numbers of rows, and that's only for the streaming ingestion path. If you use the batch load, which lets you load from Google Cloud Storage, or media upload, which lets you post a large file to Google, that's free. Prices have come down; when we started it was kind of ridiculously expensive, and I won't pass any judgment on that, certainly not while being recorded, but the numbers have come down a lot and I would expect them to come down more. We might change them slightly to make sure we are more accurately modeling our costs. The problem is that you want a simple metric by which to charge customers, but if we allow customers to do things that are really expensive and only charge them as if they weren't doing those things, it can be a little bit difficult. So those things might change, but it shouldn't get more expensive unless you're doing something really abusive.

So the question was: is it the compressed or the uncompressed data that you're charged for, and if you're doing a join, do you get charged for the join output? Currently we charge for the uncompressed bytes. One of the reasons is that it's predictable, and people know how well their data compresses. We also may rewrite your table to make it more queryable over time, and that might change its size on disk; if we were charging for the compressed size and we changed the compression, the cost would change, and people would say, wait, I paid a dollar for this last month and now you're charging me a dollar ten, why did that happen? So we want to make it predictable.
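To make the "charged for the columns you read" point concrete, here is a small back-of-the-envelope calculation. The per-terabyte rate and the equal-column-size assumption below are illustrative guesses, not figures quoted in the talk, so substitute whatever the current price sheet says.

```python
# Back-of-the-envelope query cost: you pay for the full size of the columns you touch,
# not for the whole table, and the first terabyte scanned each month is free.
TB = float(1024 ** 4)
GB = float(1024 ** 3)

table_size_bytes = 100 * GB            # a 100 GB table...
columns_read_fraction = 2 / 50.0       # ...with 50 columns, 2 of which the query reads
                                       # (assumes equally sized columns, a simplification)
bytes_billed = table_size_bytes * columns_read_fraction   # ~4 GB billed, not 100 GB
price_per_tb = 5.0                     # assumed on-demand USD per TB scanned; check the price sheet

print('billed per query: %.1f GB' % (bytes_billed / GB))
print('cost per query once past the free monthly terabyte: $%.4f' % (bytes_billed / TB * price_per_tb))
print('queries like this covered by the free terabyte each month: %d' % int(TB / bytes_billed))
```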
The join question was an excellent one, though. You can cross join a million-row table against a million-row table and the outcome is a trillion rows. That's clearly going to be difficult for us to handle, and more expensive than a million-row table, so that might be the kind of thing that gets more expensive, but it shouldn't be in a ridiculous way. I won't go much more into that; it's an open problem.

I talked about this a little bit already: to get your data in, you can load CSV or JSON files from Google Cloud Storage, you can post them directly, you can stream records in, and you can send them via batch ETL through Google Cloud Dataflow or Hadoop, or via streaming ETL. Dataflow has a streaming API that lets you do essentially streaming MapReduces, so you can massage the data as it's coming in and then post it to BigQuery.

On that question: I shouldn't talk about what we're working on, but I wouldn't expect that anytime soon. Any questions before I go on? This is just a nice slide that I think Paul created about how you can move your data between storage locations. If your data is in Google Cloud Datastore, there's a Datastore connector for Hadoop; there's actually another line that should go to BigQuery that's in an experimental phase, where you can back up your data to BigQuery, so you can take a snapshot of your App Engine Datastore and import it into BigQuery. And you can obviously go back and forth with Google Cloud Storage.

Okay, so I have one more demo. Let me know when you can read this better. There we go; hopefully that all works. I have a playlist data set that I found on the Internet; somebody scraped it, I think from Deezer, which I guess is a French streaming service. We've got about 12 million songs in half a million playlists, so it's not huge data, but this would scale up to huge data; I hope I convinced you of that with the first couple of queries I showed. To give you an idea of what's in this table, we've got the rating, the title of the album, the track, the artist, and the ID of the playlist, and it's stored as a nested data structure; when I do this SELECT * it'll flatten it. The next query just shows how big the data is: half a million playlists, 12 million songs.

So let's say we want to do recommendations. One way of doing that is by similarity: you can compute how often two artists appear in the same playlist together and use that as a proxy for "if somebody likes this, they also like that." Playlists are actually kind of nice in that people have already curated things that have some sort of similarity, because they might have a workout playlist or an electronica playlist that often has similar bands or artists in it. So here's the query that I use for this: I'm self-joining the playlists on playlist ID, but where the artist names are different, so it generates the cross product of every artist in a playlist against every different artist in that same playlist. Then, for example, we can find artists that are similar to Daft Punk and sort them by how often they appear together.
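The captions don't include the query text, so this is a hedged reconstruction of the naive co-occurrence query in legacy SQL; the dataset, table, and column names (playlist_id, artist) are guesses from the field list described above, not the speaker's exact schema.

```python
# Hedged sketch: naive similarity, counting how often two different artists
# share a playlist, filtered to one seed artist.
naive_similarity = """
SELECT b.artist AS similar_artist, COUNT(*) AS together
FROM [playlists.tracks] AS a          -- hypothetical table of (playlist_id, artist, ...)
JOIN EACH [playlists.tracks] AS b
  ON a.playlist_id = b.playlist_id
WHERE a.artist = 'Daft Punk'
  AND a.artist != b.artist
GROUP EACH BY similar_artist
ORDER BY together DESC
LIMIT 20
"""
# e.g. pd.read_gbq(naive_similarity, project_id='my-project')
```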
If you run this query, the results you get are David Guetta, Studio Group, Justice, Rihanna. Wait, Rihanna? How is Rihanna like Daft Punk? Some of it might be reasonable; Chemical Brothers, I guess that kind of makes sense, but Queen? Perhaps not. So certainly something is not right here. And I wish I didn't have to scroll through all of these. What happens when I just run the entire script? Come on. I told you I don't do this for a living. Oh wait, I just went to the end, sorry. All right, there we go, same thing.

So why didn't we get better results? Amazon likes to give this as an interview question, and they call it the Harry Potter problem. If you're trying to do recommendations, "people who liked this book also liked...", and you use a naive recommendation, chances are you're going to come up with "they also liked Harry Potter," because lots of people have read Harry Potter. But if you're reading Fifty Shades of Grey, you might not be in the mood for Harry Potter, or vice versa. So you need something else. What we have here is that the artists that come up most often are way too heavily weighted. If we look at the play counts for various artists, we see that the ones that came up as "similar", Studio Group, Rihanna, Queen, are way at the top. One thing we can show here is that musical artists, like a lot of other things, have a popularity that follows a Zipf or power-law distribution, meaning it's a fat-tailed distribution where the counts jump up very quickly as you go toward the most popular end. That's a really poor description of it, but it's going to have to do for today. We actually fit a Zipf distribution really nicely; I took out most of the tail, anything with a play count of less than fifty, because otherwise you wouldn't even be able to see the blue line in the corner there.

One thing you often want to do when you're writing complex queries is to start with a couple of simple queries and then combine them. There was another problem with the naive query, which is that a lot of artists show up multiple times in a single playlist, and we were weighting things too strongly: if an artist showed up twice in a playlist, we'd weight them double against everything else in that same playlist. We probably don't want that, especially since a lot of people put all the tracks from an album into a playlist, and those shouldn't get more weight. So what we can do is compute a subquery of the unique artists that show up in each playlist, save that off as the unique-artists result, and then compute the similarity by unique artist within a playlist. It's similar to the query I ran at the beginning, but instead of running against the raw table, I'm running against this unique-artists result, and I'm limiting it to artists that show up more than twenty times, because it's not all that useful to say something is similar to something that's only been played twenty times.
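Again a hedged reconstruction rather than the speaker's exact SQL: deduplicate artists within each playlist first, then self-join the deduplicated result. The final filter here is applied to the co-occurrence count for brevity; the talk describes filtering on the similar artist's overall play count instead, which would need one more subquery.

```python
# Hedged sketch: same idea, but each artist counts at most once per playlist,
# and low-count results are filtered out before ranking.
deduped_similarity = """
SELECT b.artist AS similar_artist, COUNT(*) AS together
FROM (
  SELECT playlist_id, artist
  FROM [playlists.tracks]
  GROUP EACH BY playlist_id, artist   -- one row per (playlist, artist): no double counting
) AS a
JOIN EACH (
  SELECT playlist_id, artist
  FROM [playlists.tracks]
  GROUP EACH BY playlist_id, artist
) AS b
  ON a.playlist_id = b.playlist_id
WHERE a.artist = 'Daft Punk'
  AND a.artist != b.artist
GROUP EACH BY similar_artist
HAVING together > 20                   -- simplification of the talk's play-count filter
ORDER BY together DESC
"""
```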
Then we have one more step before we get a better similarity: we join the similarity output against the play counts and scale the similarity by the play count. You can think of this as computing the percent of the time that, if one artist is in a playlist, the other artist will also appear in that same playlist, and we do it from the similar artist to the artist, because that also helps prevent the Rihanna problem.

So let's see how we do here; make sure this is live. We've got Jay-Z. I didn't know how they were going to spell Jay-Z, so I just did a "contains" here, which is another nice thing about BigQuery: in a relational database you might need to do an exact string match because it's going to need to hit an index, whereas here I can just do a regular expression or a string containment. So what do we get? Pharrell, Busta Rhymes, Biggie; these are pretty reasonable. Let's try somebody totally different: Andrew Bird, a popular indie artist who plays like six instruments at once, a pretty amazing guy. Let's see who's similar to Andrew Bird. Come on; this is the kind of thing that loves to happen when you're doing a demo, so I'll blame it on being 3,000 miles away from our servers. That took way longer than it should. There we go: we've got a bunch of people, Bon Iver, Sufjan Stevens, Fleet Foxes, Belle and Sebastian, The National, Beirut, and a bunch of these seem kind of similar. And then there's a Swedish artist named Kent, so let's try this one. I won't be able to tell whether anybody here is similar or not, but you guys may be able to. There we go. Anybody seem similar? Yes? No? Okay, any other suggestions? How do they spell it? Oh, it's all caps. Here we go: Bo Diddley. I'm not sure I'd necessarily agree with these, but the Shangri-Las might be similar. Anyway, at least we can see that it's better than the original one.

This came from a blog post somebody had written about using BigQuery to ask questions about a music database, which I guess they had ingested from Deezer, and they had a couple of sample queries that were doing things the naive way. I thought, I bet I can do better than that, so hopefully I can contribute those back. Anyway, that's all I have.
Info
Channel: Google Cloud Tech
Views: 79,455
Keywords: Google, Google Cloud Platform, Google BigQuery, tigani
Id: kKBnFsNWwYM
Length: 52min 3sec (3123 seconds)
Published: Sun Apr 05 2015