The Azure Spark Showdown - Databricks VS Synapse Analytics - Simon Whiteley

Video Statistics and Information

Captions
Hello and welcome to another virtual SQLBits session. My name is Simon Whiteley, and I want to talk about Spark. Spark is a really cool open source big data processing platform that can revolutionize everything you're doing when building analytics platforms, be it data engineering, data science, even a humble analyst can get a whole load of use out of Spark. For the past couple of years the de facto approach has been to jump on Databricks: Databricks is a company started by the people who invented Spark, and they're pushing the boat out in terms of what Spark can do. However, this year we're going to see the release of Azure Synapse Analytics, which has its own in-built Spark engine, and that begs the question: if you want to do Spark, if you want to harness the power of lakes in your analytics platform, which one should you be looking at?

So in this session we're going to look at Databricks versus Synapse Analytics, specifically the Spark engines, and talk about in theory what they offer, what they don't, which one you should choose, where you can use them, and hopefully shed a bit of light on what the differences are. This is all about the theory; it isn't a technical demo session, so to set expectations: we're not looking at code, we're just looking at what's under the hood and how it works. Let's have a look.

Okay, so as I said, we're looking at the big beatdown that is Databricks versus Synapse Analytics. Again, my name is Simon Whiteley and I run a company called Advancing Analytics in the UK. We do a lot of data consultancy, from data engineering (how to build a lake, how to use Spark to make your lake amazing) through to data science (how to do custom vision, how to build recommenders, all of that kind of fun). We have a booth, so feel free to stop by and say hello. Any questions about this, I'm live in the session answering them, or pop by the booth later and we'll get you sorted out.

Right, on to the meat. Number one: why Spark? Why am I even having this conversation with you? Why do we care about Spark? The age-old approach to building a warehouse is those three steps: some staging data, some cleaned data, some warehoused data. That's old hat, but we've all done it and most of us still do it. It's building a warehouse on SQL Server, usually a single SQL Server with lots of databases, and some kind of ETL job, be it SSIS, be it Talend, be it just stored procs, doing some kind of transformation for us. And that's not great: there are lots of bottlenecks in there, lots of things that slow us down, the code's not that flexible, a whole load of reasons why people are moving away from that one-box SQL Server, away from just using a relational database, and going: you know what, lakes are pretty cool. Lakes give us a whole load of flexibility, exotic file types, machine learning integration, scalability, loads and loads of reasons why having a lake, and having Spark on top of it, is fantastic.

So this is what people are doing now. There are arguments as to whether you do it all in one box or have a separate warehouse, but that's a separate conversation. For now: if we have a lake and we're doing some preliminary data prep, some work in the lake on flat files to prepare them to show to end users or to put them into a data model, then the tool we're using to do that lake-based data preparation is
Spark, hands down. That's the winner these days, especially in Azure: we're using some kind of Spark-based platform, and I'd say that won't change in the next few years. Currently Spark is one of the best things for doing that.

So what actually is this Spark thing? There are a few different layers, and I want to give you a little bit of context as to how Spark has changed in the past few years. You've got the engine: it's an open source project, basically a big old engine written largely in a combination of Scala and Java (and that will come up later). It's a load of data processing libraries, a lot of things to help you do common tasks: grouping things, aggregating things, pulling data, data readers, data writers, all of that kind of stuff, encapsulated in a data processing engine that, very importantly, is all based around doing things one, in memory, and two, in parallel. So saying "I want to take 100 machines and spread my workload across them", or "I'll take two machines and spread the work across them", at any scale it's the same code, and that is amazing. I can write one little bit of code and that same code can be used across massive volumes or small volumes, it doesn't matter, which brings a whole load of flexibility. There's a lot of good stuff in there.

Now, the way it scales, the way we can take a data set (one big chunky table of data, essentially) and spread it across all these different workers, happens because of this thing called the RDD, the resilient distributed dataset. That used to be where all of the magic happened: all of the work you used to do in Spark was coding directly against these blocks of data, saying take that block, iterate through it, give me a number, add it to that block. It was fairly manual, but it allowed you to do distributed queries, which is good.

These days we have two abstractions on top: the DataFrame API and the SQL API. DataFrames just mean we can write Scala, we can write R, we can write Python, and it'll turn it all into the same plan, which means we have language parity: different languages end up running the same thing, which is amazing. You can have people writing R, people writing Scala, people writing Python, no performance difference between them, all running on the same engine, on the same platform. So it's less of an argument of "which one do we pick for our users"; each user can choose (not necessarily the wisest thing, but you have that flexibility). And then the SQL API is an absolute killer, because so many people know SQL; most people working in the data world know at least a little bit of SQL. That ability to query your lake via SQL is huge, a massive, massive thing.

So there are a few bits in Spark that just mean it's really geared towards the kind of thing we as data people do: writing data transformation scripts using Python, and then having users come in and query all those different data elements just using SQL, using the same ANSI-standard syntax they know and love. That's why we're using Spark; it's really, really good for those elements.
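To make that concrete, here's a minimal PySpark sketch of those two abstractions doing the same work. The file path and the column names (region, amount) are made up for illustration, not anything from the session itself.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("api-parity-sketch").getOrCreate()

# DataFrame API: read a CSV from the lake and aggregate it.
sales = (spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("/mnt/lake/raw/sales/"))          # hypothetical path

df_result = sales.groupBy("region").agg(F.max("amount").alias("max_amount"))

# SQL API: exactly the same work expressed as ANSI-style SQL over the same data.
sales.createOrReplaceTempView("sales")
sql_result = spark.sql("""
    SELECT region, MAX(amount) AS max_amount
    FROM sales
    GROUP BY region
""")

# Both routes compile down to the same engine and the same execution plan,
# so results and performance are the same.
df_result.show()
sql_result.show()
```

The same DataFrame code could equally have been written in Scala or R against the same engine; that's the language-parity point being made above.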
Just to set the scene a little: imagine I've got a CSV sitting in my lake, inside ADLS, inside any HDFS-compatible storage. That's the Hadoop Distributed File System idea, so that's Blob Storage, it's Data Lake Storage Gen1, Data Lake Storage Gen2, it's S3 buckets over in Amazon; anywhere that has this HDFS idea. Essentially, if I've got a big file like a CSV and I plonk it down in my storage, it's cut up into lots of different readable chunks called extents. So even though I've just got one file, I can actually read from it in parallel by reading those different chunks, and saying to all the different workers in my Spark cluster: that worker can read that chunk, that one can read that chunk, that one can read that chunk, and suddenly parallelism happens even on a single file. And when you imagine you've got a folder with thousands of files, suddenly everything scales fantastically.

So the kind of thing we're going to see: we say "hey, I've got a Spark job, go and aggregate this, give me the maximum for a certain column", whatever it happens to be. That gets passed to the driver, a little brain node at the top, which takes it and tells each of the workers "this is the work I want you to do, and you, and you". In this case they're each being told which of those extents, which chunk of that file, they should go and access, and they'll each work completely independently of each other and do their own bit of work. They'll get the data, pull it onto the cluster, do some processing, return the results; we'll collect all the results onto one of our workers to do another job and aggregate all the answers (what's the maximum of all my individual maximums?), return that to the driver, and I've got my data. Everything we do in Spark has a similar kind of pattern: tell the cluster to do something, it figures out how many steps it needs, how many of those workers it needs, how much it needs to chop up the data, and then how many steps to get it back to us. That's just what Spark does for us. We don't have to tell it the threads, we don't have to tell it which worker gets which bit of data, we just say "go and read some data" and it figures all that out, which is awesome. That's all the goodness that's in the Spark engine these days, and that's just regular Spark, nothing fancy: that's not Databricks or Synapse, that's just how Spark, the open source project, works.

So, having a look at our two options then. We have the entrenched champion, the old boss that is Databricks, and we've got the new challenger, the Synapse Analytics Spark pools, which now present two managed options for working with Spark in Azure. Now, you're sitting there going "um, he's kind of missing the other one", because there is a thing called HDInsight, and you've been able to run Spark in Azure for years using HDInsight. However, it is fairly arduous to set up: there's a lot of config needed, a lot of setting up the various nuts and bolts, and it's not that platform-as-a-service, it still has lots of elements you need to control. So I'm not looking at that in today's comparison. The comparison is the ones that are real turn-key: push a button, you've got a Spark cluster, you can start working. And that's these two: Synapse Analytics and Databricks.

Okay, so let's learn a bit about the competitors, starting with Databricks. The old champion, released back in 2016 on AWS, so it is cross-platform, and that's an important piece: if you build a ton of scripts out using Databricks, you do have the
option to port that to Amazon in the future, and you have quite close parity between the two versions running across the clouds. It has its own runtime: the people at Databricks contribute something like 70 or 80 percent of the content that goes into the Spark open source project. The people who invented Spark started the company, that company has exploded in the last year, and they are pumping so many changes into the Spark project. However, they don't pump everything into the open source project: you have open source Spark, which is what Spark is, and then you have all the extra wrappers that Databricks add around the top of their own Spark engine to make it faster, to make it fancier, to give them a premium offering worth paying Databricks a license fee for. There's also a lot of workspace stuff, and we'll talk about what's inside the Databricks workspace, and we'll talk about this thing called the Delta Engine, because Databricks are currently about to release a whole new version of everything they do. That makes this comparison really interesting, because most of the comparisons we've seen say "this is current Databricks, this is what Synapse will be when it's released" and compare them like for like, when actually, just as we're about to see this brand new first release of Synapse Analytics Spark, we're about to see the same on the Databricks side with the Delta Engine, and then it's a completely different path. I'll pull out what some of those features are as we go through.

Okay, so that was Databricks. On the other side we've got Synapse Analytics, our plucky young contender coming in to try and knock Databricks off that champion spot. It's still in preview, and we have to remember that going through: there are lots of things that still don't quite work, lots of things that are still very much a beta release, and we keep seeing evolutions of it. Since I started looking into Synapse Spark it's changed a hell of a lot: a lot of new features have come in, they've gathered user feedback, they've changed how they're implementing things. It's very much a work in progress. It's very easy to be fairly down on the Synapse Spark implementation because it's not finished yet; you look at it and go "well, I can't use that in production because it doesn't work yet", and that's because it's in preview. When we get to general availability we should see a slightly different beast, but we have what we have to compare for now.

It is Azure-only. However, it is very, very similar to the vanilla Spark runtime: whereas the Databricks runtime has loads of extra tweaks and features and options you can enable, the Synapse one is much more vanilla, which means that if you take that code, the same code should run on almost any other Spark instance, including Databricks, including ones you're running locally, including HDInsight. Because it's vanilla, it's quite portable, even though the actual Spark instance itself is proprietary inside Microsoft. And it's different: we're so used to thinking about Spark in terms of clusters, what different cluster designs do I need, how big they need to be, and they've taken a different stab at it; we'll talk about that in a later section when we look at how we actually ramp it up and scale it. But yeah, it's different, and interesting. And there are special skills: what are the good features, why would we
even think about this, given we've got Databricks there as a mature contender already? Integration is definitely going to be one of the interesting arguments. Because it's inside Synapse Analytics, this thing automatically talks to the other compute elements inside it: the thing that used to be SQL Data Warehouse, really cool things like Azure ML, Cosmos DB. There are lots of Azure-native integrations being built directly into the Spark pools, and that's only going to grow and get bigger and bigger. So that's a super interesting area: if all you're interested in is integrations and how it works with the rest of your Azure architecture, there are a lot of really good points here. Then there's Spark.NET, so C# for Spark. In normal Spark we can write, as I mentioned earlier, in Scala, in Java, in Python, in R and in SQL; if we're in Synapse we can also write in C#. So if you're a big .NET house and they need to integrate directly, they can write C# against the Spark engine itself, against the DataFrame API, and that's really cool. There's a load of extra stuff built in there: if you've got a load of .NET libraries you'd love to be able to use to control your Spark packages, that can now be done straight away. Interesting stuff.

Okay, so I mentioned there are some benefits to the Azure Databricks workspace. The kind of things we get in there: there's a whole user management suite, which is good, although it doesn't actually link into Active Directory groups, which is a pain; but still, the ability to have users, onboard them, have security and say who can start clusters and who can't, that's all baked into the workspace. We've got Databricks notebooks: they're Jupyter-based but not quite Jupyter notebooks, they've got a lot of extra things in there, so we can do charting and dashboards directly in them, we've got widgets for controlling drop-downs and having parameters; there are loads of really nice quality-of-life features inside the Databricks notebook experience. You've got jobs: if I want to schedule something it's a bit like SQL Agent, fairly dumb scheduling where you just say "run this job with these parameters" every hour, every day, whatever it happens to be; but the ability to have that schedule, and the ability to call it from things like REST APIs and from Data Factory, becomes really cool, and notebooks can be heavily parameterized. We've got a whole library management system for pulling in things from PyPI or maybe CRAN, all the online repositories where people tend to keep their open source libraries: we can just connect and pull them down and say "whenever my cluster starts, go and get the latest version". That's really cool; it means a lot of my library, dependency and package management is baked into the workspace itself. We have a thing called DBFS, a file system baked into the workspace, which is okay but has some issues: essentially, because it's storage that's walled off, you can only see it from within Databricks. However, one of the things we can do is mount storage, so we can say "I've got my Data Lake Storage Gen2 lake with all my data in it, just mount it and treat that whole file system as if it were part of DBFS, the Databricks file system", and then you can go from there. And there are a load of utilities for files.
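As a flavour of those workspace utilities, here's a hedged sketch of a few dbutils calls. It only runs inside a Databricks notebook (dbutils is injected there, it isn't a pip-installable module), and the widget name, secret scope, storage account and container are all invented for illustration.

```python
# Widgets: notebook parameters that can be set by jobs, REST calls or Data Factory.
dbutils.widgets.text("load_date", "2021-05-01")
load_date = dbutils.widgets.get("load_date")

# Secrets: pull a credential from a secret scope (which can be backed by Azure Key Vault).
client_secret = dbutils.secrets.get(scope="my-keyvault-scope", key="sp-client-secret")

# Mount an ADLS Gen2 container so it appears under DBFS like a local folder.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": client_secret,
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}
dbutils.fs.mount(
    source="abfss://data@mylakeaccount.dfs.core.windows.net/",
    mount_point="/mnt/lake",
    extra_configs=configs,
)

# File utilities against the mounted path.
display(dbutils.fs.ls(f"/mnt/lake/raw/{load_date}/"))
```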
Finally, we've got a whole cluster management piece, and this is one of the things that's a little bit weird when you first start using Spark, certainly when you first see Databricks. In so much of Azure you've got this idea that you're provisioning a service and you have to pick how big it is: I want a database and I need to say how big it should be, functions need to say how big they are. With Databricks you just provision the workspace, and there's nothing in there that really costs you; then inside the workspace you can provision lots of different types of cluster, and they're the things that cost you money when they're turned on. So it's like a whole little resource management portal of its own. There's a lot of really cool stuff in the Databricks workspace.

Now, on the Synapse side, there's just a whole load of other things baked in. Whereas Databricks is very focused on being the Spark engine, with lots of features to make Spark richer and let you interact with the engine better, it is just Spark; inside Synapse Analytics, as we can see on screen, lots of different tools are being baked into this eventual thing. We've got a bit of Data Factory; we've got some of the thing that used to be called Azure SQL Data Warehouse (as of this recording it's currently called Azure Synapse Analytics, but that's the current GA version, which is just the data warehouse). When we eventually get general availability of this new thing, the Azure Synapse workspaces, that's going to have the provisioned SQL pools, what used to be SQL Data Warehouse, baked inside it; it's going to have on-demand SQL, the ability to just write SQL without a database turned on, no ongoing compute, charged per terabyte read, which is awesome: you can write queries against the data lake and spin up a proper, fully fledged SQL engine just for the single-serving lifetime of your query. You've got this Spark engine, which is what it sits inside, very much part of the whole thing. And there's again what's becoming the new Data Factory: there's a whole version of it, called orchestration pipelines or something, within Synapse workspaces, but you open it and it's Data Factory. So rather than Databricks, which is just Spark plus some gubbins to make it nicer, we have this whole suite of different tools, of which Spark is just one piece of the jigsaw.

To look at it a different way, this is my little Synapse estate picture. At the top, Data Factory, with mapping data flows baked in; mapping data flows is a Spark-based GUI, so you can drag and drop and say "pull data from there and there, combine it, add some derived columns, do an aggregate, write out to there", and when you hit go, that runs a Spark job for you. That's really interesting: it's another option for exposing Spark to users who don't have to be Python or Scala or R savvy. At the bottom we've got the two data-warehouse-style things: provisioned SQL pools, which used to be called SQL Data Warehouse, and SQL on-demand (or SQL serverless, it has different names) for querying data in place. We've got our Spark engine. And then we've got a data lake: when we provision Synapse we have to say it's sitting on a lake, so any data Synapse has is held within that same lake.
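Because the workspace is pinned to that lake, reading from it inside a Synapse Spark notebook is usually just a direct path read. A minimal sketch, assuming a made-up storage account, container and folder; in a Synapse notebook the Spark session already exists, it's created here only so the snippet stands alone.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already provided for you in a Synapse notebook

# Read straight from the workspace's ADLS Gen2 account via an abfss:// path.
schools = spark.read.parquet(
    "abfss://curated@myworkspacelake.dfs.core.windows.net/schools/"
)

schools.printSchema()
schools.show(10)
```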
Whereas Databricks, with its DBFS, hides some data off in a storage account we can't really access, we can see everything Synapse is doing. There's also a metadata store, again similar to Databricks, which means we can save tables: we can take a folder structure with thousands and thousands of parquet files and just call it, you know, dbo.schools, and then I can do "select star from schools" and that works, which is awesome. Again, it's abstraction, making it easy for people to use Spark without knowing they're using Spark. And then, similar to Data Factory and a lot of the other Azure components, we have this kind of monitoring and management plane. It looks very much like Data Factory, so if you're used to those tools you can just start using it: over in management I've got linked services, all of that kind of stuff is baked in. So there's a similar experience between the two, and the main thing I want to talk about is the Spark pools, not worry too much about the other stuff. But yeah, it's interesting: two different options with different pros and cons.

So how do we actually decide? We'll look at three different sections, and number one is power: when we're talking about that cluster, what's actually going to happen? I alluded earlier that you've got this core open source project, Apache Spark, and then there's a runtime around it that Databricks have built. This is a proprietary Databricks runtime, so you're not just using vanilla open source Spark, you're using Databricks Spark, which comes with a load of optimizations, a lot of extra stuff, a lot of little hooks into some of their own features, which generally means you're looking at something quite fast. And certain things, just by the nature of the fact that they're the ones contributing so heavily to the open source project, they're the first to get. Recently we've had a load of extra stuff: a lot of functions to make life easier, my favourite being the thing called badRecordsPath. If you're reading some data and some of the rows don't fit the schema you're expecting, you can reject those rows automatically and put them into a JSON file with a tag saying why they failed. That's a Databricks function, it's not in the open source part, and it makes life as an ETL developer building a data-engineering-style pipeline so easy. And some of those nice little things are all Databricks-specific.
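A hedged sketch of how that badRecordsPath option tends to be used; the option lives in the Databricks Runtime rather than open source Apache Spark, so it only does something useful there, and the schema, paths and column names here are invented.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("school_id", IntegerType(), True),
    StructField("school_name", StringType(), True),
    StructField("pupils", IntegerType(), True),
])

schools = (spark.read
           .schema(schema)
           .option("header", "true")
           # Rows that don't fit the schema are written here as JSON,
           # tagged with the reason they were rejected, instead of failing the job.
           .option("badRecordsPath", "/mnt/lake/quarantine/schools/")
           .csv("/mnt/lake/raw/schools/"))

schools.write.mode("overwrite").parquet("/mnt/lake/clean/schools/")
```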
We've also got a lot of optimizations, and it tends to be that Databricks make some of the really cool optimization work open source and then add some more that they keep for themselves, and a lot of that has gone into Spark 3. Spark 3.0 came out a couple of months ago and has a ton of stuff that makes building complex data models really achievable. If you've got something like a Kimball star schema, with facts and dimensions and you're joining across the two, there's a load of optimizations that went into Spark 3, in the open source, that let you filter a dimension, have that cross-filter, do partition filtering dynamically, do data pruning, loads of cool stuff like that which makes the whole experience better for the analyst. Traditional Spark didn't have that focus on the analyst: it had a SQL platform, but it was very easy to write really poorly performing code, and there's been a massive mind shift for Spark 3 to say "let's make this perform as well as a warehouse"; that's the current focus. Because Databricks had committed a lot of the code behind Spark 3, they were the first to adopt it: it's already been in the Databricks runtime for a while, and that's one of the benefits you get on the Databricks side. They're going to be quick, just behind the actual core Spark releases, because they're writing it, so they can get there faster. However, there is the beginning of a Synapse runtime, things Microsoft are putting in to make it better, to strengthen the argument a bit and add a bit more oomph into the Synapse Spark pools. We don't know too much about what that looks like yet, and it's going to be an interesting race: obviously Databricks have a head start, they've been doing this for years, since they invented Spark, and Synapse is just coming in; but Synapse has all the power of the rest of the Microsoft product teams behind it, so it'll be interesting to see how they jockey for position.

To put it another way: we've got Databricks and Synapse. Databricks is straight out there with Spark 3.0, plus some optimizations of their own built on top of it. Synapse, I imagine (I don't know, but I'm assuming), is going to have Spark 3 fairly soon; they don't have it yet, so they're at least a couple of months behind Databricks in adopting this new version of Spark. And I don't know if that's a vision of things to come: are we constantly going to be on the back foot, going "oh, what are we getting? there it is, we've got it"? The interesting thing, as I alluded to earlier, is that Databricks are about to bring out this new version, this thing called the Delta Engine with the Photon query engine, and there's a whole raft of things in there that are going to make Spark a hell of a lot more efficient and a lot faster, especially for that analyst-style SQL query. That's coming, and it'll be interesting to see whether, by the time Synapse adopts Spark 3.0, Databricks are then making that next leap, so you're constantly going to see that kind of leapfrogging, that trying to catch up. That's the thing to watch over the next six months to a year: how much is Synapse Spark going to be able to catch up? Currently, obviously, they're behind, but they're still building the rest of the platform; they still have to put in basic things like parameterization. So are they behind because they're currently filling out all the other features, and as soon as that's done and they've got a steady platform they can really start focusing on making the engine fly? Or is this the story we're going to keep seeing?
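For context on the Spark 3 features being described, here's a small PySpark sketch of a star-schema join with adaptive query execution and dynamic partition pruning switched on; the table paths, partitioning and column names are invented.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark3-star-schema-sketch")
         .config("spark.sql.adaptive.enabled", "true")
         .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
         .getOrCreate())

fact_sales = spark.read.parquet("/mnt/lake/warehouse/fact_sales/")  # partitioned by date_key
dim_date = spark.read.parquet("/mnt/lake/warehouse/dim_date/")

# Filtering the dimension lets Spark 3 prune fact partitions at runtime,
# rather than scanning the whole fact table and filtering in memory.
summary = (fact_sales
           .join(dim_date.filter("calendar_year = 2021"), "date_key")
           .groupBy("calendar_month")
           .sum("amount"))

summary.explain()  # the physical plan shows the dynamic pruning filter on the fact scan
summary.show()
```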
Is Databricks always going to be ahead, always pushing the next generation, the next load of innovation, with Synapse as the fairly steady one, the tortoise and the hare kind of thing? We don't currently know, and that's the biggest thing when looking at the two: do you want to be on the cutting edge, the latest, with a slight price tag on it, or are you okay being on the one that's a little bit behind but integrated with everything, with a focus on ease of use? I don't know.

In terms of how things are plugged together: I mentioned earlier that Spark is based on Java. You've got this Spark cluster, a driver and my workers, and they all run on the JVM, the Java virtual machine, so everything is running in Java. That's why it's always been a little bit weird when we're using things like Python and R, which don't compile down to Java. If you're using the core Spark engine via the DataFrame API, that's fine, because you're submitting Python or R to the engine and it turns it into Java under the hood; but it's always been slightly painful if you bring in something extra and say "I want to run my own Python library, or a separate R library I found somewhere", because those don't compile down to Java. That's a performance hit: on each machine it has to go outside the Java virtual machine, use the library, pull the results back, and that interop is normally very, very slow. So there's a whole thing around that which is worth keeping an eye on.

Inside there you've got these things called executors: each of my workers has an executor on it, and depending on the number of CPUs I have on each, I get a number of slots. This comes down to how you size your cluster, which is a big question between Databricks and Synapse, because they've taken very different approaches. In Databricks, this is what I'm looking at: I've got this amount of data, it breaks up into so many chunks, and to process a chunk of data I need a spare CPU; so if my data splits nicely into 16 chunks and I've got 16 free slots, I can process it entirely on this cluster. Makes sense. So Databricks is all about thinking how big your driver should be, how big your individual workers should be, and how many of them you want, and then any work that you have just shares the cluster you've given it. You can have 10, 20, 100, 2,000 users all trying to use the same cluster, and they just queue up waiting for slots to come free.

Now, Synapse have taken a very different approach: essentially, absolutely zero crossover between users. Databricks is "I provision one thing, lots of people use it, and it carves itself up as efficiently as it can between lots of queries". In Synapse, I provision a Spark pool, essentially the maximum amount of CPU I want to use at any one time; in this case, similar thing, I've said I want four workers' worth. Then, when someone runs a query, that creates a session, and the session pencils in, earmarks, so many of those CPUs and says "I'm going to run on these particular parts of my pool", and it keeps a lock on those particular slots, that hardware. Another session comes in and just grabs the next batch. And if a further session comes in when there's nothing left, currently that will error: you don't have any spare CPUs, your pool isn't big enough.
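That chunks-versus-slots idea is easy to see from a notebook. A small sketch, with a made-up path, showing how many partitions Spark has split the data into versus roughly how many task slots the cluster offers:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

df = spark.read.parquet("/mnt/lake/raw/sales/")

print("data chunks (partitions):", df.rdd.getNumPartitions())
print("task slots (roughly total worker cores):", sc.defaultParallelism)

# If the data arrives as a handful of huge partitions, spreading it closer to
# the number of available cores keeps every slot busy.
df = df.repartition(sc.defaultParallelism)
```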
And that's a very different way of working. I'm used to the Databricks world of just firing off all my jobs in parallel and knowing my Databricks cluster will chunk through them: things sit in the queue until some slots come free, then they start, finish, and the next one kicks off. In Synapse I need to plan for the biggest amount of capacity I need at any one time, and that's the slant they've taken in designing this. So far it's very much designed for fairly fixed jobs, not that dynamic scaling, not dynamically determining how many jobs to run; it's geared towards "I know what I'm going to run, I need to plan for that, size it, and then kick it off". So yeah, different ways of thinking about how you plan for this stuff.

Okay, so, special things they can do. Databricks has dbutils, a whole load of special functions built into the notebooks, allowing us to do things like installing libraries on a one-off cluster or just for a notebook, moving files, renaming files, changing folders, through to secrets, credential and store management baked into Databricks, which is cool: I can call Key Vault, bring back a credential and use that to connect somewhere. Widgets, as I mentioned, are the drop-down parameters that let you connect to different things. So many cool things in there. We've also got things like version control: we can be working on a particular notebook, every change gets synced and saved automatically, and then I can hit a button and commit it to my git repo, Azure DevOps, whatever it happens to be. So there are lots of really nice quality-of-life features in Databricks, because it's fairly mature.

Now, Synapse is less about that. It's things like: one, it's got a native lake browser, which is just so useful. Even in Databricks, if I want to browse the lake I need to open up a separate Azure portal or Azure Storage Explorer, go to my lake and have a look what's in there. It's not a massive hardship, but it's just so easy in Synapse to go "what's in my lake?", browse it, right-click, say "new script" and it'll generate a new notebook for me. Some of those integrations Microsoft can do a lot better, because it's all their stack. It also links into the other parts of Synapse: the data warehouse parts and Data Factory just integrate, they're natively part of it, so in Data Factory you can just run a Synapse Spark job, then run the next thing, call a stored proc inside the provisioned SQL pools; that's all baked in. And this one is super interesting: Cosmos DB, the potentially giant document database, has this thing called Synapse Link. If we've got a Cosmos DB that holds all of our application data, and we're throwing thousands and thousands of single writes at it, keeping all our data up to date, we can just click a button and it will automatically take that data, write it down into its own little mini managed lake, and make it instantly queryable from Synapse: without having to do any ETL, without having to say "take the change feed coming from Cosmos, land it in a lake so I can query it, register that table". It's just registered; as you see in that picture, we can see our linked Cosmos DB, see the containers underneath it and start querying them.
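In a Synapse Spark notebook that tends to look something like the sketch below; the linked service, container and column names are invented, and this assumes Synapse Link (the analytical store) has already been enabled on the Cosmos DB account.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the Cosmos DB analytical store through the workspace linked service,
# without touching the transactional side or writing any ETL.
orders = (spark.read
          .format("cosmos.olap")
          .option("spark.synapse.linkedService", "MyCosmosLinkedService")
          .option("spark.cosmos.container", "orders")
          .load())

orders.groupBy("orderStatus").count().show()
```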
One of the big things is that that analytical data store is held in parquet, which is a column store and therefore incredibly effective, very good for analytical queries. So if you're using something like Cosmos DB, the Synapse integration is awesome. There's a bit of hype around the HTAP side, the hybrid transactional and analytical processing; there's a lot of marketing saying you no longer need a warehouse because this does it for you, you can report directly on operational data, and that's crazy talk. But it does mean you can skip a load of ETL steps and do operational reporting directly from it, which is really cool. So that's a massive thing for Synapse in terms of integration if you're on Cosmos DB.

Okay, so, in terms of storing data: both have this idea of Hive-style tables. Even in Databricks, if I've got my parquet or my Delta or various other things, I can say "take that directory structure, make it look like a SQL table", so my SQL users can connect, get a list of databases and a list of tables, and query them as if they were connecting to a SQL Server, which is just awesome anyway. Now, in Synapse I can do that on the Spark side and register things, but then that automatically copies over and makes the same metadata available to SQL on-demand. So if I'm in Synapse using the SQL on-demand query engine, I can query all the tables I've registered in Spark. I can do a load of processing in Spark, write nice generic scripts, get the proper scale-out, all that kind of stuff, register the tables at the end, and then my users instantly see them in SQL on-demand and can start querying them. That live integration is really, really cool.

Okay, a bit more about Spark.NET. As I mentioned, we can now do C#, and that goes through that same DataFrame API; it's not some other language they've had to write their own optimizer for and jemmy in at the side. Because it talks to the DataFrame API, it goes through the same query optimizer, it produces the same plan, it is the same thing that gets run in the end as the rest of Spark, and that is really cool. Now, UDFs, the bit where it's not a built-in Spark function and I want to write my own user-defined function against some other library, that's the part that goes slowly if you're doing Python, and while there are optimizations that make it reasonably quick for C#, it's still not as fast as just doing straight Spark jobs. Weirdly, in Synapse they've got rid of R: they've let C# in, but they've taken R away. It would have been a massive win for Synapse if it was "we do Python, Scala, R and .NET"; instead they've just swapped R for .NET. That's going to alienate a lot of academia, a lot of data science people who live in R and aren't going to be that .NET-savvy, and it shows a bit of an angle to me: Synapse is really gearing towards application-style jobs, towards data processing, the assumption that Spark is in this mix so we can prepare data, not so we can do ad hoc analytics. Which is a strange choice, to me.
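Going back to that table-registration point for a second, here's a minimal sketch of what it looks like from the Spark side; the database, table and path are invented, and on Synapse the registered table is what then shows up for the SQL on-demand endpoint as well.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

schools = spark.read.parquet("/mnt/lake/clean/schools/")

# Register the folder of parquet files as a named table in the metastore.
spark.sql("CREATE DATABASE IF NOT EXISTS lake")
schools.write.mode("overwrite").saveAsTable("lake.schools")

# SQL users (or SQL on-demand, in Synapse) can now query it by name.
spark.sql("""
    SELECT school_name, pupils
    FROM lake.schools
    ORDER BY pupils DESC
    LIMIT 10
""").show()
```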
On the other side, back in Databricks, we've got this thing called the Delta Engine, and that is doing some crazy stuff. I mentioned earlier that Spark is based on Java, hence all those JVMs; well, Databricks have recently rewritten the entire Spark execution engine so that it runs in C++. That lets them do all sorts of really interesting things, like packing up the data before it's sent to the CPU so they can process four or eight or more values at once and get that kind of factor of speed-up. This is the big shift in Databricks performance I was mentioning: it's coming, and assuming it lands in the next few months we're going to see a massive speed-up in how Databricks works, which is going to be that jump ahead again, like Spark 3 currently is. So the Delta Engine is all the Delta Lake functionality plus this Photon engine doing the C++ query execution.

However, Synapse are also doing some interesting stuff. They've come up with this thing called Hyperspace, which is kind of like a non-clustered index for Spark. Currently there's no such thing as an index in Spark: if you want a particular set of records, you have to read everything in and then filter it down in memory. Hyperspace says "I'll take a copy of that data, ordered in a specific way, that just has the data for certain columns in the data set", aka a covering non-clustered index, and it's automatically kept up to date. That's a weird, interesting idea, unlike anything we currently have in Spark.

Meanwhile, over in Databricks land, there are two other things that are really interesting on a similar front. For data scientists there's MLflow, a whole data science experimentation tracking system: you can run lots of different jobs, see which of your models performed best, what the different parameters were, what the accuracy curves were, go back to the one that performed best and promote it; loads of really cool stuff in there. Synapse, however, has Azure ML, which gives a lot of the same things in the baked-in Azure version; so again, similar approaches, different ways of tackling it, but similar parity of functionality.

The difference is on the Delta Lake side. Delta Lake is a file format that Databricks have open sourced, which takes parquet and puts some really cool stuff around it, so you can have things like transactional consistency, temporal queries, and merge statements; all of which, from a relational engine point of view, is "yeah, that's what a database engine does, that's the point". Why it's so cool is that you never had any of that in a data lake: the lake was always seen as this kind of messy thing because you didn't have those nice rigid data model structures on top, and Delta Lake gives you that. Now, Databricks have their own proprietary version of Delta Lake: they open sourced the core of it, and that's baked into Synapse automatically, but they have their own special version in Databricks which is faster, it's optimized, it has special things. One of those is called Z-ordering, which reorganizes the data and does some file compaction to make it faster to bring back certain data if you're querying on certain columns, aka a clustered index. So whereas Synapse have gone the non-clustered covering index route, Databricks have gone "we'll just reorganize your data as an asynchronous maintenance job". Both are tackling similar problems and coming out with competing approaches.
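To give a feel for what the open source Delta Lake format adds on top of parquet, here's a hedged sketch of an upsert using the delta-spark DeltaTable API (built into Databricks, and available on Synapse Spark pools); the paths and join key are invented. Databricks' proprietary extras, like the Z-ordering maintenance, sit on top of this rather than replacing it.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# New and changed rows arriving from upstream.
updates = spark.read.parquet("/mnt/lake/raw/customer_changes/")

# The existing Delta table sitting in the lake.
customers = DeltaTable.forPath(spark, "/mnt/lake/delta/customers/")

# An ACID merge: update matching rows, insert new ones, all in one transaction.
(customers.alias("tgt")
 .merge(updates.alias("src"), "tgt.customer_id = src.customer_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```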
On that particular one, Databricks is slightly ahead, because Z-ordering is really cool: it's changing your data, and the way they've implemented it within Delta is actually very good. So there's a lot of jockeying for position, doing similar things and coming out with competing attacks.

Finally, a bit on price. On the face of it, looking at the pure figures, Synapse is cheaper. However, that's preview costing, so we don't know the actual final cost yet, and Databricks has a license fee, so we kind of assume Databricks is a little more expensive like for like; and that's assuming we turn them both on for the same amount of time at the same scale, which isn't always going to be the case. What we've seen from Synapse, apparently, is that with those sessions, when I've got my Spark pool and I start a session, that session stays on and has a time-to-live: I start using it, it stays active for 10 minutes, 20 minutes, whatever it happens to be, and during that time we assume I'm being charged for the session being active. It can auto-scale, but only the Spark pool: the pool can change how many sessions it can have at a time, but those sessions, each locked to a single usage, don't scale natively. That's okay. However, when we look at Databricks: auto-scaling in Databricks premium is excellent. We can have it turned on and say "make a really small, cheap cluster I'm not paying a lot of money for, and as soon as I start piling a load of things into the queue, just get bigger, grow, expand the cluster and use it for a lot of work". And in premium mode the scale-down is really good too: it'll go "no one's using it, okay, take down fifty percent of my workers, another fifty percent, another fifty percent" and really quickly get back to the lower level. So it scales very quickly up and down, which means how much we're paying and how much we need to be paying are quite closely coupled in Databricks. Whereas what we currently see in Synapse is quite chunky: it's one user, one usage, the session stays alive, and it doesn't scale across a high-concurrency approach. But then it's in preview; they've not actually released how that's going to work yet. So currently, if you compare costs on the face of having the two turned on together, Databricks is more expensive in terms of actual usage; but if you're not consistently using it, and you have peaks and troughs of usage as almost all of us do, Databricks will end up cheaper because it can scale that much more effectively. By the time Synapse goes GA, is that still going to be the case? Will they have implemented a lot more session sharing and auto-scaling? We don't know. So that's an unknown: as things stand, Databricks will be cheaper in some circumstances because of the auto-scale feature, but we just don't know.
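For reference, the auto-scaling being described is set on the Databricks cluster definition itself. A hedged sketch of the sort of payload you might send to the clusters/create REST endpoint, written as a Python dict; the names and sizes are illustrative, not a recommendation.

```python
cluster_spec = {
    "cluster_name": "shared-etl",
    "spark_version": "7.3.x-scala2.12",                   # a Databricks runtime version
    "node_type_id": "Standard_DS3_v2",                    # Azure VM size for the workers
    "autoscale": {"min_workers": 2, "max_workers": 8},    # grow with the queue, shrink when idle
    "autotermination_minutes": 20,                        # turn the cluster off entirely when unused
}
```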
So yeah, it's an interesting one; there's a lot going on between these two. It's hard to visualize because we've not done a tech demo, we've not dived into "this is what the two of them look like", but honestly they look very similar: they're both using notebooks, they both have similar languages, there are slightly different bits of syntax depending on which one you use. The main question is which one you use based on all those features, based on the problems you're trying to solve.

Now, it might have been evident that there's a bias in what we've been talking about: as a heavy Spark user myself, the winner is fairly obvious, and it's very much Databricks as the more premium, more robust, more mature Spark offering currently. However, Synapse is in preview, so comparing something that's been out for a few years, with a ton of investment, currently in the middle of optimizing everything, to something brand new that hasn't even gone live yet, isn't really fair; and we're being forced to make that comparison a hell of a lot at the moment, and it's not a fair comparison. Currently, if you're looking to start working with Spark right now, Databricks is a no-brainer, because it's out, it's mature, there's a lot of stuff that's in and works already; you cannot use Synapse Spark in production because it's in preview. That's not to say the answer will be as clear-cut in six months, and definitely not in 12 months or two years. But it's worth thinking about: with all the marketing and all the hype going on, taking a step back and asking which one you should use right now, it's definitely Databricks.

Which one should we be using later? Let's assume Synapse goes live with all these features, they fix some bugs, it's all out there and happy. Then the way we're trying to talk about it is this idea of Enterprise versus Standard edition. Databricks, on face value, ignoring auto-scaling, is slightly more expensive because it has the Databricks license: you're paying a premium cost because it's got a lot of premium features, a load of extra stuff, a lot of optimizations, a lot of things they're baking into it. Synapse currently has slightly less functionality, a much more vanilla runtime, fewer optimizations; it's a little less fully featured, partly because it's a preview, but we're kind of expecting that's what it's going to look like in six or twelve months anyway, and Databricks aren't going to stop ramping: they're turning the entire engine of a whole company towards making that one product better, whereas on the Microsoft side it's just one product among many, many things.

So this is how we're putting it: if you're doing lots of Spark, Databricks is a no-brainer; if you're doing a little bit of Spark and you want integration, baked-in tooling and a common platform, it's a much less clear-cut argument. Put it this way: if I'm building a platform that's largely based in SQL, I've got an existing SQL data warehouse that I want to lift inside here, I want to use SQL on-demand, I'll use Data Factory and have all these things baked in, and there are one or two Spark use cases I might want, then Synapse is probably easier: there's less integration work, less thinking about the cluster and the design and so on. If you're just doing the occasional odd notebook as part of a much bigger picture, it probably makes sense to go with Synapse and bake that in, because you don't need the full sledgehammer that is Databricks; you can do a little bit of Spark to augment your current processes. If you're going for big, heavy data engineering, you're
building a lakehouse, you're trying to do a very Spark-based project, then it's a no-brainer for us currently that Databricks is going to give you a premium experience: you're going to have a better time, it's got more features, more functionality, and it goes faster. All of those things mean Databricks is the assumption there. So it's that skew of how much Spark you're expecting in your data platform: if you're expecting lots and lots of Spark, Databricks; if it's a tiny bit of Spark, probably Synapse; somewhere in the middle, it depends on integrations, it depends on your users, it depends on which features are important to you. So it's very much "it depends", but hopefully I'm equipping you with the options you need to work through to make that decision.

Now, what most people are actually heading towards is a combined platform. There's a tagline of "better together": do some of the processing and prep work in Databricks, because it's the more premium Spark engine, and then hand over to the other elements of Synapse. Because Synapse can sit on an existing lake, and Databricks can just mount that lake, they can share a common ecosystem without a lot of integration work; it's really easy to have the whole thing working as a single platform, and that's kind of our default answer. So it's not "is it Synapse or is it Databricks": it's bits of both, depending on what you need. If the question is "is it Spark pools or is it Databricks", then it comes down to how much premium content you need, how optimized you need it to be, how important it is that you're on the cutting edge of optimizations and new features, versus having something that's really simple to integrate with your other stuff. And that's going to be the interesting thing to keep an eye on over the next few months.

So that is all I wanted to go through. Thanks very much for listening, and I hope that's given you some food for thought. I hope it made you think that maybe it's not as clear-cut as we thought, and hopefully there's a bit of excitement about how many things are coming in the near future: over the next three, six, twelve months we're going to see a whole load of new features on both sides, both Databricks and Synapse Analytics, so it's going to be a really interesting space to watch regardless. Again, I'm around answering questions, so feel free to grill me about anything we know about Azure Synapse Analytics, about Databricks, about the Delta Engine and Photon and all that crazy stuff. And don't forget to drop by our booth, Advancing Analytics, come say hi, ask us any questions, and enjoy the rest of SQLBits. Cheers!
Info
Channel: SQLBits
Views: 2,418
Rating: 5 out of 5
Keywords: AI and data science, Big data analytics, Cloud, Data Lake, Databricks, Developing, Python, Spark, Synapse Analytics
Id: FjsnVueXijQ
Length: 49min 17sec (2957 seconds)
Published: Tue May 11 2021