Cloud Data Warehouse Benchmark Redshift vs Snowflake vs BigQuery | Fivetran

Captions
My name is George, I'm the CEO at Fivetran. Fivetran is an automated data pipeline that syncs data from all of these places and many more (this slide is actually now quite out of date) into your data warehouse. Last summer we did this benchmark. As was just mentioned, benchmarks depend a lot on how you do them. We looked and looked and found that there was not, in my opinion, a good benchmark out there comparing data warehouses, and in particular we wanted a benchmark that described the kind of scenario our customers actually face. Our customers are typically syncing data from lots of places, they have complicated schemas (we deliver normalized schemas), so they have complex queries, and generally the data sets are not huge. It's not the size of the data that's the primary problem, it's the complexity of it. So we compared these three data warehouses, which are the main data warehouses we support as destinations.

Before I get into the details of the benchmark, I want to take a moment to explain the difference between OLTP and OLAP databases. I'm sorry if this is old news to you, but I'm always amazed by how many very technical people don't realize this distinction and just how different these systems are, so every chance I get I explain it again, just trying to put out into the world how different data warehouses are from transactional databases. In the world of SQL databases there are really two kinds. There are SQL databases designed for online transaction processing (OLTP): that's where you do something like "go get one row out of my database, here's the primary key, get me this one row." You might grab all the columns, but it's a very shallow query; you're only grabbing one or a few rows. The other kind of database is designed to support online analytical processing (OLAP): that's where you say "scan through all of my data, but only look at a couple of columns." If we imagine we're querying a table of GitHub commits, you'd grab just the author's email and count up how many commits each author made. It's the opposite query pattern: we're reading few columns, but we're scanning the entire data set every time. This talk is about that second scenario.
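To make the contrast concrete, here is a minimal sketch of the two query patterns against a hypothetical commits table (the table and column names are illustrative, not from the benchmark):

-- OLTP-style query: fetch one row by primary key (shallow, touches every column of that one row)
SELECT * FROM commits WHERE commit_id = 'abc123';

-- OLAP-style query: scan the whole table but read only one column,
-- counting commits per author (deep, touches every row)
SELECT author_email, COUNT(*) AS commit_count
FROM commits
GROUP BY author_email;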
Even though they both use SQL, the data stores designed for that second kind of query are organized in a different, in fact the opposite, way. Classic databases like Postgres and MySQL are row stores. We often visualize data in a table format, but you can't actually store data that way; you have to store it linearly, and the fundamental strategy most transactional databases use is the row store: you go from left to right and then from top to bottom. You store the value of commit, then the value of file, then the value of added, which means all the elements from one row sit next to each other, like you see here. Column stores do it the opposite way: they store an entire column sequentially, so we store all of the commit values, then all of the file names, then all of the added counts. These column-store data warehouses are designed to support that second query pattern. The most obvious way this is better, if you're running that second kind of query, is that if you're only interested in the file column and the change column, you can skip reading most of the data; you just read those two columns. All three data warehouses we're benchmarking here are of the second kind, column stores, and the benchmark is designed to test that kind of query pattern.

First, a little history of when these data warehouses came out. The first true column-store data warehouse that really put it all together was C-Store, way back in 2005. C-Store was an academic project that later became Vertica. It's not just a matter of having a columnar disk format to make a good column-store data warehouse; you need a different kind of execution engine. Really, every layer is designed differently except the top layer, the SQL. So C-Store came out in 2005. The first of the data warehouses we're benchmarking today to come out was actually BigQuery, but there really are two BigQueries. The BigQuery that became available in 2011 was very different from the BigQuery we have today. It was not very good at doing joins; it could only do broadcast joins. Google decided to make their own dialect of SQL for some reason, so it had a non-standard language that kind of looked like SQL but wasn't the same, and not a lot of people used it. You couldn't do complex queries with fact-to-fact joins, you had to adapt to this weird pseudo-SQL language, there were no updates; a lot of things were missing, and it was not really usable for anybody outside of Google at that time.

The big thing that happened in our space was Redshift coming out in 2013. Redshift was not the first good columnar data warehouse, but it was the first good columnar data warehouse that was cheap: you could go into the AWS console and, for not that much money, have a really fast columnar data warehouse. Redshift has kind of a funny provenance; it came from ParAccel. Amazon actually bought the source code from another company, which is a weird thing to do, you don't often see that happening, but they bought the ParAccel source and turned it into Redshift. That has, I think, proven to be somewhat of a problem for them in recent years; they've had trouble evolving it and keeping up with some of the other data warehouses in the last couple of years. It got them to market quickly, but it has later proven hard for them to change. Snowflake came out in 2015. Snowflake is similar to BigQuery in that it stores the data separately from where it does the compute: they store the actual data of your tables in S3, and then they can provision any number of compute nodes to read that data. Other than that, it's fundamentally similar to Redshift and other columnar data warehouses, but it had this marquee feature, the separation of compute from storage. They came out in 2015 and really started to take off in early 2017. BigQuery was not to be counted out, though: in 2016 BigQuery started getting a lot better. Google released a new dialect of SQL that follows the standard, they improved their query planner so you could actually do fact-to-fact joins, they implemented DML, and they've kept going; BigQuery in the last couple of years has gotten a lot better and become much more usable for regular analysts who don't work at Google. So those are the three major data warehouses we support and the ones we tested in this benchmark.
So what the heck did we query? We queried a data set from the TPC-DS benchmark, which is a standard benchmark used for data warehouses. It's a pretty conventional dimensional schema; this is actually only part of it, there are more fact tables and more dimension tables. It's a pretty complicated schema, and we're going to run some pretty complicated queries, which is important, but it looks like a pretty normal dimensional data warehouse schema. We ran it at a 100-gigabyte scale, which is pretty low. Usually people run much larger scales, especially vendors; they like to say "we can handle all of these terabytes and terabytes of data." But what we find is that most of our customers, who are loading data from relational databases, from Salesforce, from support systems, have data sets that are not that large, yet they don't want to allocate a giant data warehouse and they want sub-second or sub-ten-second queries. So it's not about whether you can crunch through a petabyte of data in half an hour; it's about whether you can crunch through 100 gigabytes of data in 5 seconds. That's what we're really testing here.

The queries we ran are the TPC-DS queries, and these are pretty complex queries. The one on the left is a pretty simple one, and you're still joining a bunch of dimension tables onto a fact table: the fact table is web_sales, and item and date_dim are dimension tables. They contain complicated predicates, they contain aggregations, many of them contain multiple subqueries that later get joined together, and they sometimes have fact-to-fact joins. These are complicated queries with complicated query plans, which is an important part of this benchmark, because this is the kind of query we see in the wild when real customers query their real data sets. It's not just one nice giant wide table; there are lots of joins and crazy stuff going on. Our view is that it's really important to run these kinds of complicated queries, and that's why we chose the TPC-DS benchmark. In case you're wondering who TPC is: TPC is the Transaction Processing Performance Council, a group of vendors and academics who created this and several other benchmarks. They have this lovely website which looks like it came through a time warp from 1998, but it's a good benchmark and it's a standard that a lot of people follow.
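For a feel of what these queries look like, here is a simplified sketch in the spirit of the simpler TPC-DS queries, joining the web_sales fact table to two dimensions, filtering, and aggregating. The column names follow the usual TPC-DS naming convention, but this query itself is illustrative, not one of the official 99:

SELECT i_category,
       d_year,
       SUM(ws_ext_sales_price) AS total_sales
FROM web_sales
JOIN item     ON ws_item_sk = i_item_sk
JOIN date_dim ON ws_sold_date_sk = d_date_sk
WHERE d_year = 2001
  AND i_category IN ('Books', 'Electronics')
GROUP BY i_category, d_year
ORDER BY total_sales DESC;

The real TPC-DS queries pile several such joins, subqueries, and predicates on top of each other, which is exactly the complexity the benchmark is meant to exercise.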
Now, just running the TPC-DS benchmark isn't specific enough. It's very common for vendors to run TPC-DS on their own data warehouses, and it's very common for them to cheat, as I put it. You will not be surprised to find that whenever a vendor runs this benchmark, their data warehouse is always the best one. There are lots of little choices you make when you run these things that matter a lot for the result you're going to get, and we made these choices in a careful way designed to test what we think are the important characteristics of data warehouses. I want to talk a little bit about why we made these choices, and you can form your own opinion on whether we did it the right way. Sometimes people agree and sometimes people get very mad at me and think we did it the wrong way, but at least I'll tell you exactly what we did, and we did it fairly across all of them.

The first thing you need to do is not run the same query twice in a row. If you run the exact same query twice in a row, BigQuery, for example, will just give you the cached result from the previous run. The other data warehouses aren't quite as aggressive about caching; well, Snowflake will do that too, and it also has the ability to cache intermediate values of queries. Some data warehouses have this, and it's just not a very realistic scenario; we don't want to test whether you can cache previous results and hand them back. So you need to set up your data warehouse and warm it up, but make sure you don't give it the exact same query twice in a row.

The second choice we made is that we did not use dist keys. Dist keys are an optimization you can apply: data warehouses are multi-node systems, and if you want, you can designate one column and say "distribute the data according to this column." When you do this, it will make some joins really fast. For example, if I want to run a query where I join web_sales to item and I know I'm going to join on the item key, I can tell the data warehouse in advance to distribute the data according to the item key. Then, say it has two nodes and there are two item keys: it will put all the data for item one on node one and all the data for item two on node two, and when I do this equi-join on the item key, the data warehouse knows the data is already distributed according to that key, so it can skip the shuffle step. If you don't do this, the first step of every join is shuffling the data between nodes on whatever the equi-join key is before you can do anything else. If you can skip that huge step, those joins will go way faster. This is a good optimization that you should use if you can. The problem is you can't always use it; it only sometimes works, you can't distribute on every key. Our view is that the important performance measure is how fast you are when this doesn't work, when this trick is not available; that's the thing that's really going to differentiate one data warehouse from another. So we chose not to use dist keys at all.

The second, somewhat controversial, choice we made is that we did not use sort or partition keys. This is another optimization you can do where, if you know in advance what your queries are going to look like, you can give the data warehouse a hint. If you have, say, a date column in your table, you can tell the data warehouse to sort the data on that date. In Redshift it's called the sort key; Snowflake and BigQuery have a slightly different but similar strategy where you don't exactly sort each file on that key, but you partition the data on that key. Then, if you run a query with a WHERE clause that says the date is in some range (you can't really see it, but I've highlighted it in yellow right here), the query planner will jump to the appropriate place in the file: if it knows in advance that the data is laid out in this order, it can skip all the data up to the point where the predicate applies. There are two reasons we opted not to apply this optimization to any of these data warehouses. The first is the same reason as before: the date-partitioning trick is not always available, and it's really important to measure the performance of the data warehouse when the tricks don't work. The second reason is that the TPC-DS benchmark contains a lot of date predicates, and if you use this trick, the benchmark ends up just being about how well you can do date partitioning. In TPC-DS in particular, many, many of the queries have these predicates on dates, so it ends up dominating the benchmark. So we opted not to use sort or partition keys.
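For concreteness, this is roughly what the optimizations we deliberately skipped look like in Redshift-flavored DDL; the table definition is abbreviated and illustrative (Snowflake expresses a similar idea with clustering keys, and BigQuery with partitioned tables):

-- Distribute rows across nodes by the join key and sort by the date key.
-- The benchmark did NOT apply either of these, for the reasons described above.
CREATE TABLE web_sales (
    ws_sold_date_sk    INTEGER,
    ws_item_sk         INTEGER,
    ws_ext_sales_price DECIMAL(7,2)
    -- ...remaining fact-table columns omitted...
)
DISTKEY (ws_item_sk)       -- co-locate matching item keys so equi-joins on item skip the shuffle
SORTKEY (ws_sold_date_sk); -- let date-range predicates skip blocks outside the range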
We did apply compression; we did the basic optimizations that are universal and robust and that just make queries faster in general. So we did all the basic optimizations of the data, like compression, but we didn't do any of these tricks that make a huge difference but only apply sometimes.

The first way we evaluated the results is we looked at the execution time for the 99 TPC-DS queries. We used a pretty small warehouse: the Snowflake and Redshift warehouses we used were both configured to cost about two dollars an hour, which is a pretty small data warehouse, but again, what we see a lot in the wild is people who have data sets that are not that huge but who are looking for really fast response times. For BigQuery there is no such thing as a warehouse size; BigQuery is just a pure web service, like Gmail. You send your query off and the results come back, and it ran somewhere. When you run BigQuery, you're just running on BigQuery; there's no such thing as a node size. That's going to be tricky when we compare pricing in a second, but for time we just measured the raw execution time, and we found that they're all similar. BigQuery was the fastest by a little bit, but these are not big differences. You should not choose a data warehouse because in one benchmark one finished in eight seconds and another finished in six. And the fact that they are so similar is a good sign: if we saw dramatically different results from similarly configured data warehouses, that would be extremely surprising. The basic techniques for making a fast columnar data warehouse have been known for more than ten years now; it would be very weird if Snowflake were way, way faster than Redshift. So the fact that we got similar times is a good sign that we probably did the benchmark right.

The second way we compared them, and this is the more interesting one, is on cost. Comparing Snowflake and Redshift on cost is simple: you just multiply the execution time by the dollars-per-hour number. But comparing to BigQuery is tricky, because BigQuery has a fundamentally different pricing model: it doesn't charge you per hour that your warehouse is running, it charges you per query. This is a really important factor to consider, and to address it we looked at data from our 300 customers on how they actually run their data warehouses. We looked across Fivetran's customer base at what percentage of time their data warehouse is actually running a query, because that's what you need to know to make an apples-to-apples comparison between, say, Redshift, which charges you for every hour the warehouse is running, and BigQuery, which charges you per query. If you're running Redshift, you're not going to have perfect back-to-back-to-back queries using it all the time; some of the time you're paying for is simply going to be wasted. So if you know the percentage of time your warehouse is actually running something, you can calculate an actual per-query cost for Redshift. It turns out that across our customer base, the average percentage of time the warehouses are actually doing something is about 20%, which may surprise some people in this audience by how low it is. Everyone imagines their data warehouse is this incredibly important, very expensive resource that must be running things all the time, but the reality is that the way most of our customers use their data warehouse is to power dashboards in BI tools, and that means there are a lot of times of day when nobody is clicking on anything in the dashboards. Unless you have many thousands of users on those dashboards, most seconds of the day nobody is requesting a query, especially if you've sized the warehouse to give you something like five-second response times. If you're getting five-second response times, then unless you have thousands of BI users, you are not going to be running anything on your warehouse most of the day. Using that 20% factor, we were able to convert the Snowflake and Redshift numbers from a price per hour to a price per query and make an apples-to-apples comparison to BigQuery, and on that basis BigQuery ends up being about twice as expensive as the other warehouses. That wasn't a surprise to us; we figured they had probably calibrated their pricing model to be just about equivalent to their competitors, and it squares with what we've heard from users who have used both systems: BigQuery is a little bit more expensive. At the end of the day you're going to spend more money on your own staff and on all the tools around your data warehouse, so this isn't a reason that no one should ever use BigQuery, but you should know that it is a little bit more expensive.
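To make the conversion concrete, here is the arithmetic with illustrative numbers; only the roughly $2/hour configuration and the 20% utilization figure come from the talk, and the query runtime is made up for the example. At 20% utilization, one hour of a $2/hour warehouse buys 0.20 × 3600 = 720 seconds of actual query execution. If a typical query runs for about 10 seconds, that hour covers roughly 72 queries, so the effective cost is about $2 / 72 ≈ $0.028 per query. That per-query figure is what you can line up against BigQuery's per-query charges, and it shows why the comparison swings so much with utilization: at 80% utilization the same warehouse works out to about $0.007 per query, while at 5% it works out to about $0.11.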
The other thing you need to keep in mind when interpreting this plot is that it is entirely dependent on that percentage-of-use factor. If you use your data warehouse 80% of the time, because for whatever reason you have jobs queued up all the time and it's always busy, then BigQuery is going to be a comparatively bad deal for you. On the other hand, if you use your data warehouse very rarely, if you have a very spiky workload so that it's only active something like 5% of the time, as some people do, then BigQuery is going to be a great deal; it's going to be way cheaper because of its different pricing model. So interpret this carefully, and adjust it based on your own scenario of how active your data warehouse is.

We're not the only people ever to run benchmarks comparing these data warehouses, and I want to talk quickly about how our benchmark compares to others. The first is Amazon's benchmark comparing Redshift to BigQuery. Unsurprisingly, they found that Redshift was much better, as vendors always do when they compare their products to others. They ran TPC-H, not TPC-DS; TPC-DS is a better benchmark for comparing warehouses, so that was not a great choice. They also ran an extremely large Redshift cluster, a very big, expensive cluster, in order to get those really short times, so you have to take the result with a grain of salt; it's not clear it's really an apples-to-apples comparison. They also ran a really large data set, and I'm guessing they probably didn't call up Google and ask for a larger slots quota, which is what you would do if you had massive data sets to run through BigQuery. There are quotas on BigQuery for how much compute it will allocate to you unless you call up and ask, and I'm guessing they didn't ask for a larger quota when they ran this benchmark. They didn't publish any of those details, so we don't know; they didn't publish the code they used to produce the benchmark, so we can't reproduce it, but that is my suspicion.
Periscope, who are also here at the conference, also ran a Redshift vs. Snowflake vs. BigQuery benchmark. They did it a little bit earlier than ours, and I think Snowflake was less fast at that time, so Snowflake came out a little bit worse; this is a cost-effectiveness plot, and I think Snowflake then was a little slower than when we did our benchmark last summer. They also attempted to compare cost, and they had similar numbers: they looked at their own customers and found it was significantly less than 50% of the time that the data warehouse was actually running anything. The last difference between Periscope's benchmark and ours is that where we ran TPC-DS, they ran a much simpler benchmark with just two tables and one join between them. So their benchmark is more a measure of the flat-out scan speed of the warehouse, as opposed to your ability to run complex queries. It probably is the case that Redshift is faster at those flat-out scans if Snowflake is loading the data out of S3. It's a little complicated, because once you query a table repeatedly, Snowflake will cache the data locally and it's no longer reading out of S3, but if you measure from a cold start, Redshift should be faster at a flat-out scan. Those are the differences that account for why their benchmark results differ from ours. I think ours is a better measure if your goal is to run complex queries; running TPC-DS is more relevant than the simpler query they ran. But it's good to know we're in the same universe, and they had similar numbers on that percentage-of-use time of data warehouses, which is a really important factor in comparing BigQuery to the others.

The last one is the taxi benchmark, which some people may have heard of, which covers a very comprehensive set of data warehouses. The taxi benchmark uses very differently sized data warehouses across the comparison, so it's pretty apples-and-oranges, and it's a very simple query; they're just querying one big table, so it really is just a measure of scan speed. I don't think it's very useful for predicting how your data warehouse will actually perform; it's almost all GPU databases that win that particular benchmark, which is just scanning one giant one-terabyte table of taxi information. I did get into an argument with the author of that benchmark on Hacker News, so I'd like to give him the chance to respond to that criticism. His view, and he's not the only one who thinks this way, is that scanning one giant table is the right thing to do. There are people who set up their data warehouses this way, who denormalize everything into a few giant tables, and then when they run their queries they're not doing any joins; they're really just hitting that one table, and scan speed is what matters. If you're going to set up your data warehouse like that, then that actually is a good benchmark for you. So there is a community of people who think that way, and that benchmark is right for them.
The last thing I want to say is that the differences in speed between these data warehouses are small, and the differences in cost are medium-sized, but like I said, you're probably going to spend more money on your team and on the tools around your data warehouse than on the data warehouse itself, so that shouldn't be decisive for you. What really matters, in my opinion, when you're comparing these data warehouses (and these are all good choices) is ease of use: how productive are you going to be, how many corner cases are you going to run into, what kind of surprising behavior are you going to encounter. In our experience, as of today, BigQuery and Snowflake do have a leg up over Redshift on that front. Redshift requires more babysitting, more management; it has more weird corner cases where things will stall, where weird stuff will happen. We've seen more trouble with big customers running busy workloads on Redshift than on these other data warehouses. This wasn't true a few years ago, but from where we stand, Redshift has fallen behind a little bit on the ease-of-use front. Between Snowflake and BigQuery, the big difference is in their support for transactions and for the full range of DML and DDL, like CREATE TABLE, weird updates, things like that. Snowflake has every kind of SQL statement you would expect as an analyst; BigQuery only has a small fraction of the DDL and DML statements. It depends whether you care about that: some people don't use those things, so it doesn't matter to them; some people care a lot. And if you're looking to load data into your warehouse, we can help you with that. We are, as I said, a fully managed, fully automated data pipeline that can help you get all those complex data sources into your data warehouse. We have a booth downstairs and I'll be doing office hours at one across the hall. With that, if there are any questions, I'd love to take them.

Question: I agree with you about caching and it not being realistic to run the same identical query twice, but if you're doing interactive business intelligence, people will run one query and then they, or the tool, will run a very similar query on a very similar set of data, so I would hope a good system would run faster if I'm querying the same data set with a slightly different query. Did you test for that?

Answer: You're totally right, and we tried to be fair to that, because that is what people do; they don't just show up one day and run a query against a table that's never been queried before. I didn't mention this, but as part of this benchmark we ran a SELECT * against every table before we ran the benchmark. We ran SELECT * once against every table as a hint to the data warehouse, like "hey, I'm about to query this table, you might want to cache it," and then we ran each query from the benchmark one time.
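As a rough sketch of that warm-up procedure (the specific table names are just examples from the TPC-DS schema; the talk only specifies that every table got one SELECT * before timing began):

-- Warm-up pass: touch every table once so the warehouse can pull data into
-- any local caches before measurement starts.
SELECT * FROM web_sales;
SELECT * FROM item;
SELECT * FROM date_dim;
-- ...and so on for every other table in the schema.
-- After the warm-up, each of the 99 TPC-DS queries is run exactly once and timed.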
Question: I have a question about ease of migration. One of the struggles we ran into when we were moving to Redshift was the large number of functions that Redshift didn't support, and we had to change all of our transforms. I was wondering how Snowflake and BigQuery compare; it sounds like BigQuery is clearly lacking in that area, but what about Snowflake?

Answer: These functions, especially if they're not in the ANSI standard, are kind of a toy that the data warehouse vendors use to try to lock you into their ecosystem, so the first thing I would say is to encourage everyone to always try to use the standard function, not the non-standard one, because you will someday want to migrate to a different data warehouse. Redshift, I think, has pretty good coverage of functions; Snowflake also has excellent coverage of SQL functions. BigQuery used to be really bad on this before they introduced standard SQL; now they're a lot better. But I'm not a big user of obscure SQL functions, so I don't know that I can give you a great answer. My instinct is that, in general, Snowflake and Redshift have better coverage of SQL, but BigQuery is catching up; they're making steady progress, finally, after five years of not.

Question: How did you run the queries: concurrently or one after the other?

Answer: One after the other. We did do some tests, and these data warehouses are pretty linear: if you run two queries simultaneously, it takes about the same total amount of time as running them sequentially. It's not perfect, but they tend to be pretty linear, so it doesn't matter much whether you use concurrency or not.

Question: Do you have any other benchmarks against Presto, or a hosted Presto service like Athena?

Answer: First of all, our benchmark is on GitHub; there was a link on some of those slides, at github.com/fivetran/benchmark. Presto is in there; we actually ran Presto as part of this, and it was way behind in terms of speed. We suspected we had not configured it correctly, that we had just done it wrong, so we didn't publish it. We later determined we were right: it was badly configured. Even when we configured it better, which may still not be perfect, it was still pretty far behind. But Presto has now, well, Starburst, the company that does a lot of the commercial development of Presto, has released a statistics-based query planner, which is going to make a huge difference to its performance. So in the next iteration of this we will include Presto. It's not going to be as fast as the commercial data warehouses, but it should be in the same ballpark.
Info
Channel: Data Council
Views: 60,717
Keywords: fivetran, redshift, bigquery, snowflake, data warehouse, data warehouse benchmarking, cloud data warehouse, cloud data warehouse solutions, cloud data warehouse benchmarking, cloud data warehouse benchmarks, OLTP, OLAP, online transaction processing, online analytical processing, c-store, c-store a column-oriented dbms, amazon redshift
Id: XpaN-PqSczM
Length: 34min 28sec (2068 seconds)
Published: Fri May 25 2018