Comparing Azure Synapse, Snowflake and Databricks for Common Data Workloads

Captions
[Music] Hey everyone, welcome back to another edition of my video blog, where I talk about Azure data services and a little bit of professional development. Today we're going to talk about how Azure Synapse, Azure Databricks, and Snowflake compare on regular, everyday data workloads, along with some use cases and things like that. Before I kick things off, though, I want to thank my friends over at Mill5. They hooked me up with this sweet t-shirt, and they're a great group of guys over there doing some great stuff; check them out at mill5.com.

Okay, so now let's look at a high-level overview of each of the services we're going to talk about today and take a look at what some of those common workloads are. We've got a good handful of them here, and we'll see where it goes.

First things first: Azure Synapse. When we talk about Azure Synapse, we're talking about the next generation of Azure SQL Data Warehouse, and before that the APS (Analytics Platform System) and Parallel Data Warehouse. It has had a lot of different iterations in different forms, and its roots are in a massively parallel processing (MPP) engine built primarily for OLAP workloads. It came out as Azure SQL Data Warehouse several years ago, and then Gen2 came along a couple of years back with a ton more performance and capability. Then we flash forward to just about a year ago, last November, when Azure Synapse was announced. Synapse is the MPP engine and more: now you have a Spark engine that coexists with the MPP SQL engine, plus a development interface that lets you develop notebooks, SQL queries, and some Power BI, and orchestrate and transform your data, since Azure Data Factory has been ported in as part of the interface. So it really is an all-in-one solution for a variety of different data workloads.

Now, when we look at Azure Databricks, we've got a different approach to a similar solution: Databricks is
primarily built on Spark. Spark came out of the Hadoop world and gave us a way to process big data workloads in memory, in an open-source environment. Spark was originally created at UC Berkeley and is now an Apache project, and the team that created Spark went on to found Databricks as a company. Databricks gives you the ability to optimize Spark and more: the newer iterations of Spark bring some pretty significant performance benefits, so with Databricks, its interface, and some of the other tooling they've introduced over time, you again have a relatively complete solution when it comes to data workloads.

Lastly, when we look at Snowflake, this is a cloud-first data warehousing tool. It was built by folks who came out of Oracle to be a better, cloud-native version of a data warehouse tool. It can't run on-premises: whereas you can still run SQL OLAP workloads, and some Databricks-style workloads, on your own local hardware, Snowflake was cloud-born and is cloud only. It runs in the three major public clouds (Azure, AWS, and Google), and it's a very nice offering; as we get into it, you'll see it has some definite strengths and some definite gaps, so it will be interesting to see how the product evolves over time.

Jumping in, let's look at the first use case: traditional data warehousing. This one's an interesting category, because when we look at a traditional data warehouse, we want to look at the four stages: we ingest, we store, we transform or process the data, and then we serve it up to whatever our presentation tool is going to be. Starting with Synapse, it has that SQL OLAP engine running in the background, but you've also got some of the nice interface tools, so you can do a lot of the data processing right inside the Synapse Studio
environment, which at the time of this recording is in public preview; we expect it to go generally available within the calendar year, so we're relatively close to GA. It has a very familiar feel because it includes a notebooks interface, which has been out for a few years now, and you can do a lot of your data processing, transformation, and those types of activities within it. It also now allows for big data workloads, because a Spark engine is included. The Spark engine is relatively new, but for pure data warehousing needs and requirements, Synapse is definitely tops.

Snowflake is a similar story, except that it might actually have a slight edge over Synapse and Databricks (which we'll talk about in a second), because it was designed for data warehousing. It has a great clean, responsive, and intuitive interface. Some niceties: all your historical queries, up to a certain number, are just available for you, so you can rerun them at a later time; it kind of reminds me of being able to rerun commands in a CLI or shell. It's definitely different from the Synapse or SQL interfaces we're used to: it's cleaner, it's more responsive, and again it's all cloud native, so you can just use the tools that are given to you and run all of your queries, transformations, and everything else in that interface. With regard to completeness, you do need to use some other tools to bring the data into your data warehouse environment; but from a pure data warehousing perspective, and for serving up that data, it's a great interface and a great tool.
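As a rough illustration of that query-history nicety, Snowflake exposes your recent queries through a table function, so you can look an earlier query up and rerun it. This is just a sketch; the function and columns are real Snowflake objects, but the filter values are illustrative:

```sql
-- Pull the last 100 queries for the current session's context,
-- most recent first, so an earlier query can be found and rerun.
SELECT query_id, query_text, start_time
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(RESULT_LIMIT => 100))
WHERE execution_status = 'SUCCESS'
ORDER BY start_time DESC;
```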
Databricks is a little different: it's primarily a code-first, CLI-style interface, where you're developing in something like PySpark, Python, or Spark SQL. But there's more capability coming: they recently announced the Photon engine, which opens up more opportunities for Databricks, and I'm curious to see where it goes. The Photon engine, coupled with the Delta Lake, creates this "Delta Lakehouse" architecture, which is interesting, but a lot of it is still in the early phases of preview, and we won't see the complete picture until probably the middle of next year. So from a pure data warehousing perspective, Databricks is probably not the strongest at this time, but it will be interesting to see how it evolves as the Delta Lake and the Photon engine come up to speed.

Okay, switching gears, let's look at data processing. From a Synapse perspective, this is definitely a strong capability because of the porting of Azure Data Factory: the Data Factory elements are built in, so from an orchestration and transformation standpoint, you have all of the capability of Data Factory. You've got over 80 connectors to different sources, and a fair amount of transformation tooling built in (all the traditional ones: sorts, aggregations, and those kinds of things). On top of that, if there's something the visual elements, the GUI that ADF provides, or the dynamic coding environment from ADF doesn't cover, you can always use SQL stored procedures in the background. ADF is actually running Spark for the data processing; it's essentially translating that GUI into code for you and running it on Spark in the background.
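To make that stored-procedure fallback concrete, here's a minimal T-SQL sketch of the kind of transformation you might push into a procedure when the ADF visual tools don't cover it. All the table and column names here are invented for illustration:

```sql
-- Hypothetical transformation procedure in a Synapse SQL pool:
-- aggregate raw sales rows into a daily summary table.
CREATE PROCEDURE dbo.LoadDailySalesSummary
AS
BEGIN
    INSERT INTO dbo.DailySalesSummary (SaleDate, Region, TotalAmount)
    SELECT CAST(SaleTimestamp AS date), Region, SUM(Amount)
    FROM dbo.RawSales
    GROUP BY CAST(SaleTimestamp AS date), Region;
END;
```

An ADF pipeline can then invoke this via a Stored Procedure activity as one step in the orchestration.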
Now, looking at Snowflake, the data processing side is pretty minimal. You have Snowpipe, which will help you move the data, but there are really limited aggregation and transformation capabilities, so it's not a great option for data processing. Databricks, on the other hand, is a great option. This is basically a code-first environment, so one of the challenges is that your team has to be familiar with the tools used for that data processing. If you have those team members, great: it's a great solution, a very familiar environment using notebooks for any transformation and aggregation, and it's really designed for heavy data workloads. It runs on a Hadoop-style file system (they actually have their own iteration called the Databricks File System), and it's definitely optimized for running massive amounts of data and parallelizing the computation to output the information you need. Whether you load the data into a DataFrame in memory or persist it in some kind of Parquet format, the data processing here is very capable. So I would say Databricks is on par with Synapse and Data Factory; it's just a code-driven development environment versus a GUI-based one.

Okay, next we look at real-time data streaming workloads, or the "hot path," to use the common term. Synapse in and of itself doesn't have a way to process streaming data workloads. But it's in the Azure environment, it's a first-party Microsoft service, and it has the ability to work with other services for streaming workloads: you can couple Synapse with Event Hubs and Stream Analytics (some pretty cool optimizations for Stream Analytics with Synapse were recently announced). You could also use it in conjunction with Databricks; that's another option.
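As a sketch of that Synapse-plus-Event-Hubs-plus-Stream-Analytics pattern, a Stream Analytics job query (written in its SQL-like query language) can aggregate events from an Event Hub input and land them in a Synapse output. The input and output aliases below are hypothetical job configuration names, not real objects:

```sql
-- Stream Analytics job query: 30-second tumbling-window averages
-- flowing from an Event Hub input into a Synapse output sink.
SELECT
    deviceId,
    AVG(temperature) AS avgTemp,
    System.Timestamp() AS windowEnd
INTO [synapse-output]
FROM [eventhub-input] TIMESTAMP BY eventTime
GROUP BY deviceId, TumblingWindow(second, 30)
```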
And down the road, as the Spark clusters become GA and more capability gets added, there may be some options there as well; but as of this recording, there is no built-in streaming capability in Synapse itself. Snowflake has none: it's not intended or designed for the hot path. There may be something down the road, but as of this recording there is nothing. Databricks is generally accepted as one of the best options here. Again, you can run it alongside Event Hubs and Stream Analytics, or use Kafka to pull in the data, so you've got some good solutions there; but what I would say is that if you're familiar with the development environment, Databricks is probably the best option, and it also simplifies your architecture, because you're really only using one tool, as opposed to the three or more in the Synapse pattern.

Okay, ad hoc data discovery is our next category, and I would say that here again Synapse and Databricks are tied for the most capable options. In the Synapse environment, the tooling we use for this is PolyBase, newly branded as "data virtualization" (a term that's coming through on a lot of the database products, so you'll see it in SQL Server, in Synapse, and in some of the others). What this allows you to do is connect to a Hadoop-based file system and pull data from those file systems, Parquet being the current flavor of choice. We'll see more support down the road: I believe support for the ORC file type was just released, and I'm sure we'll see Avro, as those are very popular standards. In essence, you can connect to those external sources and pull that data into memory using data virtualization; it's treated as an external table when you hang it off of Synapse. Or you can use a CREATE TABLE AS SELECT (CTAS) statement and bring that data right into your data warehouse as a table.
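That external-table-plus-CTAS flow might look roughly like this in a Synapse dedicated SQL pool. This is a sketch: the external data source, file format, and paths are assumed to already exist and are named here for illustration only:

```sql
-- Hypothetical data virtualization over Parquet files in a data lake.
CREATE EXTERNAL TABLE ext.Sales (
    SaleId  int,
    Region  nvarchar(50),
    Amount  decimal(18, 2)
)
WITH (
    LOCATION = '/sales/',         -- folder in the lake
    DATA_SOURCE = MyDataLake,     -- assumed EXTERNAL DATA SOURCE
    FILE_FORMAT = ParquetFormat   -- assumed EXTERNAL FILE FORMAT
);

-- Then materialize it inside the warehouse with CTAS.
CREATE TABLE dbo.Sales
WITH (DISTRIBUTION = HASH(SaleId))
AS SELECT * FROM ext.Sales;
```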
In Snowflake, there's no capability for any kind of ad hoc data discovery; you essentially have to treat it as a data warehouse: import the data into the database you're using and then run your queries against it. And finally, Databricks again comes down to the interface: if you're comfortable with that code-driven, CLI-style interface, you've got a great experience here. You can connect into Azure Data Lake Store and some other sources via API, pull that data into memory, or bring it in and persist it as Parquet. You can also run some lighter workloads on ADLS itself, which can help optimize the results you get, and more features are being added to ADLS to help with these types of workloads. So in this case Databricks and Synapse are probably on par, with Databricks having a little more history with this type of operation; it depends on what your team looks like.

Next: data science and machine learning workloads. This one has to go to Databricks. Databricks is far and away designed around these types of workloads, so this is a built-in capability, a very strong suit for Databricks, and very commonly used. In Synapse, as I mentioned, Spark clusters were introduced, and there's going to be a lot of capability that grows over time; it's open-source Spark running in the background, and you can use a Delta Lake as part of that, but this is all currently in preview, so it's going to take a little while to tell how strong a tool it is from a Spark cluster standpoint. Otherwise, you can use some of the Azure machine learning tools that are available (AML Studio, automated machine learning, and those capabilities) with direct connectivity, but again you're somewhat complicating your architecture, so it depends on what your requirements are. Sorry, Snowflake: you just don't have this capability; it wasn't built for this, and unfortunately what we actually have to do is we
have to pull the data back out of Snowflake in order to run these kinds of workloads. Basically, the other option would be to multicast the data to wherever you want to run your data science and machine learning workloads, and then choose one of the Azure tools, or Databricks, something like that. With Databricks, you've also got the new notion of Workspaces 2.0, which they just recently announced as part of the 7.x version of the tool, and this offers a lot more capability around things like DevOps for data science, MLOps, and those types of things: you've got more integration now with Git and a lot of the different Git providers (Bitbucket, GitHub, Azure DevOps, and those tools). Databricks is really coming on strong and adding even more capability than it already has.

Okay, next we're looking at business intelligence. This one's kind of tricky, because when I look at pure business intelligence, as I define it, it's the ETL, it's the data warehouse, and it's the presentation layer. When it comes to those three elements, I think Synapse has the best story, and the reason I say that is the native integration and development of Power BI within your Synapse Studio environment: you've got the ETL tools, you've got the compute engines, and everything is integrated, which just makes the business intelligence developer's world a lot easier. As the product goes GA, you're going to have more capability with regard to things like source control and templates, but right now it's probably the most complete solution. You also have the capability to push the results of your data to something like Azure Analysis Services, for XMLA connectivity, if you're not going to use Power BI; and you could also put the data in a data lake and consume it there, if your presentation tool is going to
allow you to connect to something like a Parquet format or a CSV format. So that's a great solution there. With Snowflake, when you look at the pure data warehouse element and how good a solution it is for that, that's obviously a great basis for your business intelligence, and it has good connectivity for other tools: your Power BIs, your Tableaus, all of the leading presentation tools. There's a lot of capability in there, so it's definitely a very good option for BI workloads, but you do still have that ETL portion that's lacking a little bit, so I'm going to put it in the middle of the range. Now, Databricks is kind of TBD for me. The reason I say that is there's a lot of interesting stuff happening with the Photon engine; they added Redash to the environment, which gives more traditional SQL developers a familiar environment for charting and things like that; and you can connect to it with Power BI. But it feels incomplete to me right now. Certainly you have the ETL elements; you can use a Delta Lake, which guarantees your transactions similarly to a traditional data warehouse; you can store those files in Parquet; and you've got some nice features like time travel. It just doesn't feel like a truly complete solution yet: if you look at the Delta Lakehouse and connecting to it with a third-party presentation tool or Power BI, there is some capability there, it just feels slightly incomplete. So I'm going to say that's a TBD.

I've covered the majority of the solutions that I feel our customers are asking about today. Now, we've got a ton of other tools in Azure that offer a great amount of capability; you think of things like Azure Data Explorer, and some of the recent partnerships with things like SAS and other analytical
tools that are somewhat industry focused. So there's really a great range of offerings when we look at what's available, but what I wanted to do was look at the most complete solutions, what I would consider my first three choices for these particular workloads. From an overall standpoint, I give the leg up for most complete solution to Azure Synapse. Now, this could be highly contested, because again there are a lot of capabilities built into all three services. The reason I say Synapse is the most complete solution is that you really have that end-to-end story, and you have native connectivity for all of the machine learning and data science workloads, as well as being able, down the road, to use the Spark engines to combine your big data and data warehouse needs into one solution. The reason I give a slight edge to Synapse is that you can build your reports directly with Power BI and then publish them to your Power BI workspaces, or whatever you're going to use them for. You can connect to Databricks with Power BI, but you can't develop reports in the same environment, and with Databricks you have to have that compute running in order to connect to it with Power BI; whereas with Synapse you can use XMLA endpoints and publish directly to Power BI datasets in the background, or use Azure Analysis Services to stage that model for delivery, and some of those costs can be mitigated as a result. With Databricks, you have all of the other portions of the total solution, but you can't develop the presentation layer. Snowflake, again, is a great option for a data warehouse environment, and purely that, from a pure compute standpoint; but it's really lacking in the other areas I've discussed here, so I would not put it as a tier-A solution or a top choice. The big delineation
again between Databricks and Synapse is what your team is familiar with. When it comes to reporting, it's a presentation-layer question no matter what you do; whereas for development and all of those other areas, it really comes down to what your company is used to and what they're comfortable with. If they came from the open source world and have a ton of Hadoop and Spark history, great, Databricks is probably your best solution. If you're looking for a more visual environment, sort of the next evolution of your on-premises SQL Server and SSIS solutions or other graphical tools, then Synapse is probably the better solution.

That wraps up what I wanted to cover today. I hope you found this valuable. Please ask questions below and let me know what you like and don't like about this; I really try to help out and offer as many solutions as I can, in an unbiased way, based on how I'm seeing customers work with these various tools. I really appreciate all the opportunities people are taking to share, like, comment, and subscribe; it really helps out, and it's really motivating to keep putting this stuff out for y'all. So thanks again, thanks for watching, have a great day. Bye!
Info
Channel: Chris Seferlis
Views: 4,915
Rating: 4.8048782 out of 5
Keywords: Azure, Azure Databricks, Azure Synapse, Azure Snowflake, Azure Data Factory, Power BI, Azure Event Hubs, Azure Stream Analytics, Azure Machine Learning, Azure AutoML, Azure Spark, Azure Synapse Spark, Data Science, Machine Learning, Data Processing, Data Warehousing
Id: j8x_WqKmbKc
Length: 26min 9sec (1569 seconds)
Published: Fri Nov 06 2020