Oh, welcome back to the Synapse Sessions! This week we're going right back to the start. Azure Synapse Analytics went into public preview fairly recently, and I got super excited and dove straight into the Spark side and the SQL on-demand side, roaming around all the goodies we've not really been able to see before now. Over the weekend a few people got in touch and said, "I mean, it sounds good, but what is it? I'm using Azure SQL Data Warehouse; isn't this just that with a different name? Where did Spark come from?" So this time we're going to have a quick run through the beginning: what it is, where it came from, and how to put it in the right context. What's live, what's not live. Less tech today, more context. And if you're joining us for the first time, welcome! Don't forget to hit like and subscribe; I'm new to YouTube and keep forgetting to tell people that, but if you do, I'd appreciate it. So, let's dig in. Okay, in true consultant fashion, I have some slides. I just want to tell you the story of where it came from.
So back in the day we had the granddaddy, known as Parallel Data Warehouse (PDW), which got rebranded as the Analytics Platform System, which is a whole mouthful. Now, this thing was an appliance, so you had to buy the actual physical server, with a specialist version of SQL Server that sat on top of it. It worked as an MPP solution, that's massively parallel processing: essentially a whole load of SQL Servers all stuck together, working in unison, spreading the work out and then giving a single answer at the end. It's parallel scaling, but SQL style. So that was the Analytics Platform System, and it was very expensive, so not that many people used it, but the people who did use it loved it.

Now, when people started using the cloud more and more, they said, well, wouldn't that be a great fit? Rather than having just one fixed server that can't change, we can spread the work out across more servers, or fewer servers. It's the perfect story for something like Azure. So we had that come along in the form of Azure SQL Data Warehouse: a lot of the good ideas, the good thinking and the core SQL engine from PDW, the Parallel Data Warehouse, rebranded, with changes to how it works. And it is different now: it gave us scalable amounts of compute, which wasn't a concept back in the APS/PDW days. That's been in Azure for an age, so loads of people are already using it; it's live, lots of people love it. SQL Data Warehouse is a thing, it is good, and it's fairly mature and stable now.

So last November there was a big fanfare of announcements. Was it Ignite? Anyway, there was a big announcement saying "hey, it's now Azure Synapse Analytics". That's just a simple, straight rebranding: we've taken Azure SQL Data Warehouse, we've changed the name, we've changed the logo, and it's now called Azure Synapse Analytics. All the docs, everything changed, except the actual product. It's the same thing; it's just Azure SQL Data Warehouse with another name. The confusing part is that it's also part of an evolution, because there's a new, bigger thing coming, also called Azure Synapse Analytics. That's right: the new big crazy thing is called the same as the thing that's currently live, and they wonder why people are getting confused by their marketing. So we've got these two concepts: Azure Synapse Analytics (SQL Data Warehouse), and then the new thing, the thing that's just in preview, which is Azure Synapse Analytics workspaces, and that's so much more than just the data warehouse part. Synapse Analytics (SQL Data Warehouse) is still present in the new, big, altogether thing, but it's now called "provisioned SQL pools". What a name evolution: we had PDW, and we knew what that was. Azure SQL Data Warehouse; OK, it's strange that you'd name the product after what it's meant to be used for, but not all the use cases, OK, cool. Azure Synapse Analytics; OK, yeah, fine, just a rebranding. Azure Synapse Analytics, a bigger thing... oh, and this thing is still part of it, but now it's a different thing. Super, super confusing, and I get why people don't know what's going on.
Azure Synapse Analytics, the SQL Data Warehouse version, is currently live. It's GA, generally available, meaning you can use it in production; it's supported, a full-blown, SLA-covered member of Azure. Azure Synapse Analytics workspaces, with the other things in there, is currently in public preview, so don't use it for production things. They're still building it, it's still buggy, and they're still building up the story, which doesn't quite explain what's in this big new thing and why you should care.

So there's another slide, the marketing slide. This is the slide that every consultant under the sun pushed out saying "hey, this is what it's going to be", including Advancing Analytics. We were like, "this is a new thing, look at all the boxes, that sounds great", and it doesn't really tell the full story. Some great things: down the side we've got management, security, monitoring and the metastore, and having those all in one box, where previously they were managed in different places, is great. Makes sense.
The core confusing bit is all of those different languages. I've got SQL, Python, .NET, Java, Scala and R, and then I've got two different flavours, so do I want provisioned or on-demand? That kind of looks like it's all a single story, but it's not. SQL I can use provisioned or on-demand. The Spark side of things, when we're talking about Python, .NET, Java, Scala or R, doesn't work on-demand; it's provisioned only, and that's not clear from this diagram. It doesn't quite tell you how those things fit together: you have two different compute runtimes, one SQL based and one Spark based, which is why you have that different functionality. And yeah, that's kind of what we've had to go on. So I've had a stab at pulling it back and really simplifying it; let's talk about what's actually under the hood in this thing.

Super easy: we have four boxes. First we'll talk about the compute options, what these different versions of SQL are, and whether they're on-demand or not. Then orchestration, getting it working and what that's based on. Then storage, and then what else is in the workspace. Let's see if that makes a bit more sense than the last diagram.

So... options for compute. I've got two things on the SQL side. First, provisioned SQL pools, and that's when I want to say: I know how many compute nodes I need, I want a certain number of SQL Servers in my cluster doing parallel work, and I want to provision it in advance. So I say, OK, turn my server on, I expect I'll get some queries coming at some point, I get charged by the hour, and I size it in data warehouse units. And if that's all sounding familiar, it's because that is exactly Azure SQL Data Warehouse. The rebranded Synapse Analytics that used to be SQL Data Warehouse is now inside the Synapse Analytics workspace, called provisioned SQL pools, and they are exactly the same thing. Provisioned SQL pools are Azure SQL Data Warehouse, just built into this larger workspace, so it should all be super familiar if you're coming from there.

Now, on-demand SQL pools are something else entirely. That is a serverless version. That is: I write a query, I hit go, it instantly runs, and it doesn't run on any server of mine. I don't have to scale it, I don't have to say how big it should be, and I don't have to predict workload in advance, which is great. I get charged per terabyte read, so if I use it a hell of a lot, constantly hammering my lake and reading lots of data, that's going to cost me more than provisioning my SQL data warehouse. It's a balancing act, and it depends on how much I use it and what I do. But if I don't know yet, and I've got really, really unpredictable workloads, I can just use on-demand SQL pools to try things out really quickly, get things going, and go from there. And that's great!

On the other side I've got my Spark element, so I've got provisioned Spark pools; I don't have on-demand Spark pools. Again, you can't quite see that from that last diagram. So I can choose a Spark cluster: I can say I'd like a cluster with this many nodes, this big, with this balance between memory and compute, all of that kind of thing. And I can run some queries, and a query can be Python or R or C#, it can be Spark SQL, it can be Scala. We've got lots of different flavours because it's Spark.
The one weird one in there, the C# one, is kind of cool: that's the first time we've had a Spark engine where C# is a native language, which is great.

Now, it's got to be said, that is not Databricks, and that's another thing that's caused a lot of confusion. Everyone's like, "we've got Spark in Azure, it's called Databricks", which, to be fair, is what people said about HDInsight when Databricks came out. This is a different runtime, much closer to vanilla Spark. If I went, "I'm going to spin up a load of little local machines and install Spark on them by going to the Apache Spark foundation and downloading the latest cut", that would be a very similar version of Spark to the one currently installed inside Azure Synapse workspaces. If you go on Databricks you'll see slightly different, premium functionality; essentially they've taken Spark and built a load of extra stuff around it so they can charge a premium Databricks price. They're two different things, but we do have Spark inside here. It runs on a cluster, so when you want to run it you've got to wait for your cluster to start, and so far that's been like three or four minutes. It's pretty quick, but you do have that wait time; it's provisioned. And when your queries are finished you can say "stay alive for 10 minutes, 15 minutes", whatever it happens to be. But that's something you've got to worry about.
Up at the top, lots of familiar friends: we've got Data Factory. For orchestration anywhere else in Azure we use Azure Data Factory, and now this is going to be part of Azure Synapse Analytics. If you go into this workspace, this one pane of glass for doing everything, it's got a version of Data Factory in there, called "Orchestrate" or something like that. But if you look at it, it is just Data Factory, albeit a slightly different version: it doesn't have exact feature parity with the one that's live, because this is in preview, and I'm assuming they will eventually just be the same thing. For now, you can log into Synapse Analytics and see orchestration and some features, then log into live ADF and see something slightly different. The management layer, for example, only went over recently, because they're building stuff here to make it Synapse friendly. I'm hoping that as soon as everything is live, they'll be the same product and it won't matter where you use it, or maybe it'll just live inside Synapse; but for now, they're slightly different.

You've also got ADF Mapping Data Flows. I've kind of put that under orchestration, but actually it's another flavour of compute. Mapping Data Flows is a draggy-droppy, GUI-driven Spark engine. Essentially, like in the old days of SSIS, I can say: take some sources, combine them, do some aggregations, do a lookup, write it out to another sink, and I can drag and drop to do all of that. When I hit go, it runs on a Spark engine, which again has to be provisioned, stays alive for a certain amount of time, and gets charged for. So that's a different compute option baked into the Orchestrate layer, but it's kind of a fourth compute choice.
And then finally we've got storage down at the bottom. When we provision Synapse Analytics we can pick a lake for it to sit on, and that is just an existing Data Lake Storage Gen2 account, so blob storage with the hierarchical namespace enabled. That is exactly the same as it normally is; I just used a production one and it's all fine. The ability to see it inside Synapse Analytics is the new, preview-y bit.
We've also got a metastore. So as I've been going through doing things in Hive, in Spark, or in SQL on-demand, I can save things. I can say, "well, actually, there's a load of Parquet here, but I don't want people to have to go and find it in the lake each time, so just call it dbo.table", and then people can select from that and it'll go off to the Parquet. So I can have a metastore on top of my lake that describes all these different entities and objects, which is awesome. The plan, obviously, is for that to work across languages. It's not at full parity yet, but I can create some Hive tables in Spark and then read them from SQL on-demand, and that's great: I can just flex and switch between languages. That's awesome.

The workspace elements are less core; they're more just what you get in the box that wraps all this stuff. You've got the Studio, so there's a new dev environment, a whole place I can go to write notebooks with IntelliSense, write SQL scripts, hook them together and manage all my codebase. I can use linked services from ADF, kind of, and say, well, here are all the things Synapse is linked to, and use them elsewhere. So it's this one big wrapper around it all, with a bow on top. I've got monitoring, which, again, if you're from Data Factory land, will be very familiar, but it is a one-stop monitoring pane. And I've got a management place where I can go and tinker and say, well, this is the size of my Spark pool, this is how many Spark pools I have, this is what I've done in SQL on-demand recently, and this is how big my Azure SQL Data Warehouse (provisioned SQL pools) is now. So, lots of things all baked into one box. That, to me, is what Synapse Analytics is, and I wish they'd come out with a super simple diagram saying "it's just these things, guys", but it's taking a while.

A final thing to hammer home is the different flavours we've got in here, and the point that we're in preview. Provisioned SQL pools (aka SQL Data Warehouse) is the only bit here that is live, that you can use in production, that is supported and has a full SLA behind it. Everything else, so the Data Factory version that's inside Synapse, the lake integration inside Synapse, the whole of the SQL on-demand pools, the Synapse flavour of Spark: all of that stuff is in public preview, and it's currently growing, having functionality added and bugs found, all of that. So that's kind of the hardest question: what's live and what's not. The thing that used to be called SQL Data Warehouse, that's now been rebranded, is live; of course, it's been around for ages. The rest of the stuff, aka Synapse workspaces, is not live yet; that's in public preview.
Hopefully that helps clear those bits up. Another bit I was going to go on to, finally: those different compute options, those four computes. It's kind of tricky to say which one you should use; it's almost like there are too many options, so which one do you actually pick to do your work? So let me give you a quick overview of the four, recapped: how they're charged and where you should use them.

So, SQL Data Warehouse, aka provisioned SQL pools (I can't say that enough): that's billed by the hour. So if you say, I want six compute nodes, and you run a query, and then you turn it down to two nodes, you're going to get charged for six for that full hour, and only then for the lower size. You get charged for the highest scale you've had within any given hour, so be careful, because if you're doing just small little bits of work you can rack up costs really quickly.
So SQL data warehouse is like that, and it's full T-SQL... well, it's not FULL T-SQL, it's the PDW variation of T-SQL, so it's MPP T-SQL, and certain things don't work. You can't do a recursive CTE, for example, and you can't use a CTE to feed the insert inside a CTAS; there are certain restrictions on it. But there are also extra bits of language, because we're dealing with a SQL data warehouse. You have to know which distribution you're using: when you create a table, should it be round-robin distribution, a hash distribution, or should it be replicated? These are all very MPP-related terms.
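To make that concrete, here's a rough sketch of what those distribution options look like in the CREATE TABLE syntax (the table and column names are made up for illustration):

```sql
-- Hash distribution: rows are spread across the underlying
-- distributions by hashing a column; good for big fact tables.
CREATE TABLE dbo.FactSales
(
    SaleKey     BIGINT         NOT NULL,
    CustomerKey INT            NOT NULL,
    Amount      DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
);

-- Round-robin: rows are dealt out evenly with no particular key;
-- a common default for staging tables.
CREATE TABLE dbo.StagingSales
(
    SaleKey     BIGINT         NOT NULL,
    CustomerKey INT            NOT NULL,
    Amount      DECIMAL(18, 2) NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);

-- Replicate: a full copy of the table on every compute node;
-- handy for small dimension tables that get joined a lot.
CREATE TABLE dbo.DimCustomer
(
    CustomerKey  INT           NOT NULL,
    CustomerName NVARCHAR(100) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);
```

None of that WITH clause exists in the traditional SQL Server engine; it's exactly the MPP-flavoured extra syntax being described here.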
So it's a slightly different flavour from the traditional SQL Server engine, and you need to be aware of that in some of the syntax you're using. Generally, what's it for? Huge data sets. If we're talking about data sets in the terabytes, SQL data warehouse is really good at scaling and working at that kind of size. If we're talking about huge, deep fact tables with only a few joins, it's really good at that kind of thing. Big, hefty aggregations: SQL data warehouse will excel. (And it's going to be a long time till I stop calling it SQL data warehouse rather than provisioned SQL pools, but we'll see.)

On the other side, SQL on-demand, as I said, is billed per terabyte read, so there's an awkward balance in how you're going to use it. If I write one query a day that reads a couple of terabytes, that's going to be far cheaper than leaving a SQL data warehouse turned on all day just in case I happen to write a query. But if I run that same query every ten minutes, that's going to blow my budget.
So you really have to be careful, because how much you're going to use it dictates which one you should use. For me it's experimentation, quick things, ad-hoc and occasional workloads: SQL on-demand is great for those, and it will end up being super cheap and really useful. Things that touch a lot of data, a lot of the time, will rack up costs very quickly, so be a little careful about which one you choose for which workload. Maybe start off with SQL on-demand, and then, when you've got a better idea of what your usage pattern looks like, you can say, we've just reached the point where we're now more expensive than just having a cluster, so we'll switch over and use a provisioned cluster. That kind of makes sense. So yeah: ad-hoc, occasional access, lots of different things; that's where I see it fitting.
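That break-even point is easy to sketch in a few lines. The prices below are made-up placeholders, not real Azure rates; the shape of the comparison is the point:

```python
# Rough cost comparison: on-demand SQL (pay per TB scanned) vs a
# provisioned pool (pay per hour while it's on).
# All prices are illustrative placeholders, NOT real Azure rates.

PRICE_PER_TB_SCANNED = 5.00        # on-demand: $ per terabyte read
PRICE_PER_HOUR_PROVISIONED = 1.50  # provisioned: $ per hour, always on


def monthly_cost_on_demand(tb_scanned_per_day: float, days: int = 30) -> float:
    """On-demand cost: you only pay for the data your queries read."""
    return tb_scanned_per_day * days * PRICE_PER_TB_SCANNED


def monthly_cost_provisioned(hours_on_per_day: float = 24, days: int = 30) -> float:
    """Provisioned cost: you pay for every hour the pool is running."""
    return hours_on_per_day * days * PRICE_PER_HOUR_PROVISIONED


# One 2 TB query a day: on-demand is far cheaper than an always-on pool.
light = monthly_cost_on_demand(tb_scanned_per_day=2)    # 2 * 30 * 5 = 300.0
always_on = monthly_cost_provisioned()                  # 24 * 30 * 1.5 = 1080.0

# The same 2 TB query every 10 minutes (144 runs a day) flips the picture.
heavy = monthly_cost_on_demand(tb_scanned_per_day=2 * 144)

print(f"light on-demand: ${light:,.2f}")
print(f"always-on pool:  ${always_on:,.2f}")
print(f"heavy on-demand: ${heavy:,.2f}")
```

Past whatever your real break-even is, the provisioned pool becomes the cheaper option, which is exactly the "start on-demand, then switch" pattern above.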
Mapping Data Flows, again, is billed Data Factory style, per execution of the different activities, but when you execute it, it also has to spin up a Spark cluster, and while that cluster is provisioned you're paying for it. So you've got a cluster cost as well as your Data Factory poking-things cost. And that cluster has an uptime: it'll finish the job, and then you give it a time to live. You can say shut down immediately, or stay alive for the next 15 minutes in case another job comes in, so it can execute immediately rather than having to start up again. But you're paying for that time while the cluster is turned on, and if you've got lots of things running occasionally, they might just chain up, so you've got a cluster turned on permanently, and that's going to cost a lot of money. You've got to be a little bit careful about how you balance that cluster uptime against the cost.

Finally, you've got Spark, and Spark is billed in the same way as the Mapping Data Flows clusters: it's a Spark cluster that has to be turned on. When you run your job it'll take three or four minutes to start, and you can set the cluster to stay on for an amount of time after the last query ran; if it gets no more queries, it'll turn off again. Same problem: lots of occasional queries can play keepy-uppy with that cluster, keep it alive, and that's going to cost you money.
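That keepy-uppy effect is worth seeing in numbers. A tiny simulation (the job times and TTL here are made up, and it only counts the post-job keep-alive time, ignoring cluster start-up and job run time) shows how occasional jobs chain together into near-constant billing:

```python
def billed_uptime_minutes(job_end_times, ttl_minutes):
    """Minutes of keep-alive time billed, given the minute each job
    finishes and a time-to-live: the cluster stays up for ttl_minutes
    after a job, and a job landing inside that window just extends the
    same session instead of starting a new one."""
    total = 0
    window_end = None
    for t in sorted(job_end_times):
        if window_end is None or t > window_end:
            # Cluster had shut down (or never started): a fresh TTL window.
            total += ttl_minutes
        else:
            # Job arrived while the cluster was still alive: extend it.
            total += (t + ttl_minutes) - window_end
        window_end = t + ttl_minutes
    return total


# A job finishing every 10 minutes with a 15-minute TTL never lets the
# cluster shut down: 6 hours of "occasional" jobs keeps it billed the
# whole time (365 minutes of keep-alive, end to end).
every_10 = list(range(0, 360, 10))
print(billed_uptime_minutes(every_10, ttl_minutes=15))   # 365

# The same 36 jobs batched into one burst are far cheaper (50 minutes).
burst = list(range(0, 36))
print(billed_uptime_minutes(burst, ttl_minutes=15))      # 50
```

So the scheduling of jobs, not just their size, decides what you pay for a TTL-based cluster.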
Spark can run Scala, Python, C#, SQL and R, so its language flexibility is much greater. There's maybe a little more of a learning curve, because you're having to write code, so someone who's just used to SQL, or just used to graphical interfaces, is going to find it a bit steep when they first start. But it's so powerful. If I'm doing a dynamic workflow, where I want to set out a set of tasks and run the same thing for every different input, I can write very generic workflows in Spark and apply them to a ton of stuff; it's very powerful. If I want to do machine learning, I'm probably going to do it in Spark, just because it's quite a rich ecosystem, with tons of open-source stuff I can integrate and bring in as libraries. And if I'm working with anything really strange, like complex, gnarly, nested JSON data structures, or doing computer vision and analysing videos in real time, I can do that in Spark; I can't really do that on the SQL side. So maybe there are use cases for all of them.

The graphical one is for your business analysts, the people who aren't that developer savvy; they're not software engineers, they're not data engineers, and they might want to drag and drop and visualise their data flows. Then you've got people doing data science at scale, and they'll be writing Spark. And then there are a lot of traditional SQL analysts who want to use Management Studio and write queries against a proper Kimball warehouse, all of that kind of stuff; they want to be over on the other side. The thing to be aware of is that if you've got a Spark cluster turned on, and a SQL data warehouse turned on, and people writing SQL on-demand, that's three different cost bases, and you're not getting economies across them. So you've got loads of flexibility, loads of choice, loads of options, but be a little bit careful, because you might end up paying three or four times what one system would cost. Everything's in one place, but with great power comes great ability to spend huge amounts of money.

All righty, so that is my quick overview of what Azure Synapse Analytics is and how it all fits together. I hope that was useful!
Now, the first thing I did was go and record a bit of a walkthrough of my first 20 minutes of using Synapse, so you might want to find that video; it'll be over on one side, and you might find it useful. And if you haven't already, please subscribe and like the video. We do appreciate it, and in any case, we'll catch you next time! So... cheers!