(electronic music) - Coming up on a special
edition of How We Built It, we're joined by UK-based data
engineering consultancy endjin, who specialize in big data analytics solutions for customers in the ocean research, financial services and retail industries, to see how they're
evaluating Azure Synapse for on-demand serverless
compute and querying. So I'm joined today by Jess Panni, Principal and Data Architect at endjin. Welcome to Microsoft Mechanics. - Thank you, it's great to be on the show. - Thanks for joining us today, and let's take a step back though. Before we get into this,
just a quick recap. If you're new to Azure Synapse, this is Microsoft's
limitless analytics platform that really brings together enterprise data warehousing
and big data processing into a single service. In fact, you can learn more by watching our recent introduction to Azure Synapse with John McIntyre at aka.ms/mechanicssynapse. All right, so Jess, your company endjin really started its
journey with Azure Synapse for data exploration. You're using things like SQL serverless for on-demand compute, and also the data visualization capabilities, but can you tell us what's behind this? - Yeah, so we've been working with a number of different customers on their big data challenges across a range of different industries. We try to make their data more accessible and take away the friction of getting insights from it. We also look after their security and governance processes and needs as well. So we've been putting SQL serverless through its paces on some of their most demanding workloads to see how it stacks up against other services and technologies. - All right so Jess, let's
make this real for people. Can you give us an example of how this would work? - Yeah, absolutely. So imagine you're an ISP or a telco and you're providing
broadband services to people. And you want to be able to identify who and where problems are occurring on your network, so that you can prioritize where to send your network engineers. Now, clearly the telco doesn't have a direct connection into your home, but there's a network of
interconnected devices as part of their network infrastructure between you and them, which introduces multiple
points of failure. Now these devices are
constantly producing telemetry, which we would typically store in an Azure Data Lake, and in this particular case we're collecting about 60 gigabytes of data per day. We then need to provision a Spark environment, configure it, secure it, and connect it to the Data Lake. And only then can we spin up a cluster, which would usually comprise at least three VMs for the duration of the exploration and development. And of course if you're new to all this, you need to get skilled up in something like Python or Scala. - And that's really
familiar territory for me, because in IT that becomes a dependency on us, which might mean additional time or cost to get those resources spun up. - Yeah absolutely, that can
take days, weeks, months even. And once you've found the insights and worked out how to extract them, you need to operationalize the solution and build data pipelines so that you're continuously monitoring data as it arrives, and can measure and find issues in the future. - Right, and those data pipelines also pose a challenge themselves, because they might be running on different infrastructure, and bringing all of that together can be hard. So how do things then
change with Azure Synapse? - Yeah I mean, I can show
you how you can achieve all of this securely from a single environment without spinning up any
pre-provisioned resources, and using the language that
you probably already know. Now there are a number of steps we need to go through, and the first of those
steps is understanding what data we have. So here we are in Synapse Studio, and one of the great things about Synapse Studio is we can now browse the files and folders within our Data Lake. You can see here the contents of one particular folder, and we can see we've got
a network telemetry folder which looks interesting. So let's go and have a look inside there. Now we can see we have
data partitioned by date, and then partitioned by hour, and finally we can see the raw files which are in this case in Parquet format. Now if I wanted to analyze these files traditionally I'd have
to run a Spark cluster, or I'd need to download those files. But with SQL serverless, all I need to do is select the files, right-click on them, and choose "Select TOP 100 rows". So I'm just going to run this, and you can see we're running a SQL query over one or more files in the Data Lake without any provisioned servers.
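For a sense of what that auto-generated script looks like, here is a minimal sketch of an OPENROWSET query over Parquet files in the lake. The storage account, container and folder path below are hypothetical stand-ins for whatever your Data Lake actually uses:

```sql
-- Minimal sketch of the kind of script "Select TOP 100 rows" generates.
-- The storage account, container and folder path here are hypothetical.
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://contosodatalake.dfs.core.windows.net/telemetry/network-telemetry/2020/06/01/09/*.parquet',
    FORMAT = 'PARQUET'
) AS [result];
```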
But what if we wanted to query over all our telemetry without worrying about individual files or where they live in the Lake? I can show you an example. So here we've got a view that I created called Radius Raw, which represents all our network telemetry. And I can use this query to find the total number of events that we received on a single day. So if I run this, we can see that we've received about 50 million events on that day. Now this view is completely reusable: we've hidden all the details of the individual files, and even the formats of the data that we're querying. And as I say, all of this is being done without using any provisioned resources. It's fully serverless, and I didn't have to spin up a single thing.
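For context, a view along the lines of the one being described might look roughly like the sketch below: it hides the file layout and format behind a name, and surfaces the folder partitions as columns. The paths, column names and example date are all hypothetical:

```sql
-- Rough sketch of a reusable view over all the telemetry files, along the lines of
-- Radius Raw. Paths, partition columns and the example date are hypothetical.
CREATE VIEW dbo.RadiusRaw AS
SELECT
    r.filepath(1) AS [EventDate],   -- first wildcard in the path: the date folder
    r.filepath(2) AS [EventHour],   -- second wildcard: the hour folder
    r.*
FROM OPENROWSET(
    BULK 'https://contosodatalake.dfs.core.windows.net/telemetry/network-telemetry/*/*/*.parquet',
    FORMAT = 'PARQUET'
) AS r;
GO

-- Total events received on a single day (about 50 million in the demo).
SELECT COUNT_BIG(*) AS TotalEvents
FROM dbo.RadiusRaw
WHERE [EventDate] = '2020-06-01';
```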
Well, I've got a slightly more interesting example that I can show you. So here's a query that returns network issues for a given piece of equipment, ranked by the number of customers who were impacted. And we're joining these issues against another set of files which provide more information about the local exchange of the devices, including geographical location. We're limiting this entire data set down to a single day, and also just down to the top 100 customers that are impacted. So if I run this, I just want to point out that the exchanges view is actually reading a CSV file, whereas the other view we're querying over is raw Parquet, so we're actually joining data stored in completely different formats. And as you can see, we've got some results back quite quickly.
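As a hedged sketch of what that kind of query might look like, the point is simply that it's ordinary T-SQL joining two views, regardless of the underlying file formats. The view names, column names and date below are hypothetical:

```sql
-- Sketch of the kind of join being described: the Parquet-backed telemetry view
-- joined to a CSV-backed exchanges view, limited to one day and the top 100 results.
SELECT TOP 100
    r.EquipmentId,
    e.ExchangeName,
    e.Latitude,
    e.Longitude,
    COUNT_BIG(*)                 AS EventCount,
    COUNT(DISTINCT r.CustomerId) AS ImpactedCustomers
FROM dbo.RadiusRaw  AS r
JOIN dbo.Exchanges  AS e ON e.EquipmentId = r.EquipmentId
WHERE r.[EventDate] = '2020-06-01'
GROUP BY r.EquipmentId, e.ExchangeName, e.Latitude, e.Longitude
ORDER BY ImpactedCustomers DESC;
```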
Now the next thing to do is to start visualizing it, seeing what's in that data set and how we can represent that table in a better way. What I can do is go down to the results view and change the view to a chart view. And I'm just going to make a couple of changes here: I'm just going to remove some of these columns and leave the number of events and the number of distinct customers. And now what I've got is a time series showing when peak issues are occurring, the number of distinct customers when that occurs, and obviously the number of events. So, now that we've discovered how to extract insights, we
need to operationalize these. And to do this, we would typically run this query continuously as new data arrives and store the results in the Data Lake, to make it easier for upstream processes and visualizations. So I can show you that here as well. What we've got is a CETAS expression, a CREATE EXTERNAL TABLE AS SELECT, and if you're familiar with SQL Database or SQL Data Warehouse you may have seen something similar in the past. Now what this is doing is running the same query, but writing the output to a location in the Data Lake. We're writing to a folder called Output, and within that a subfolder called Network Health.
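To make that concrete, here is a minimal sketch of what a CETAS statement like the one being described could look like in a serverless SQL database. Everything below — the external data source, file format, paths and the query itself — is a hypothetical stand-in for what's shown in the demo, and the data source and file format would need to be created once up front:

```sql
-- One-time prerequisites (hypothetical names): where and how the output is written.
CREATE EXTERNAL DATA SOURCE TelemetryDataLake
WITH (LOCATION = 'https://contosodatalake.dfs.core.windows.net/telemetry');
GO

CREATE EXTERNAL FILE FORMAT ParquetFileFormat
WITH (FORMAT_TYPE = PARQUET);
GO

-- CETAS: run the query and write its output to output/network-health in the Data Lake.
CREATE EXTERNAL TABLE dbo.NetworkHealth
WITH (
    LOCATION = 'output/network-health/',
    DATA_SOURCE = TelemetryDataLake,
    FILE_FORMAT = ParquetFileFormat
)
AS
SELECT
    r.EquipmentId,
    e.ExchangeName,
    e.Latitude,
    e.Longitude,
    COUNT_BIG(*)                 AS EventCount,
    COUNT(DISTINCT r.CustomerId) AS ImpactedCustomers
FROM dbo.RadiusRaw  AS r
JOIN dbo.Exchanges  AS e ON e.EquipmentId = r.EquipmentId
WHERE r.[EventDate] = '2020-06-01'
GROUP BY r.EquipmentId, e.ExchangeName, e.Latitude, e.Longitude;
```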
And now what we can do is go back into our Data Lake. I'm just going to drop out, and you can see we've got a new folder called Output with a subfolder called Network Health, and here are all our issues ready to go. - So now you've kinda got the data and the visualizations to really see where those different points of failure are on the network, to be able to dispatch the right field techs, but what would I do next? - Yeah, so the next thing we need to do is make sure
that our field engineers have good visualizations
that are continuously running, pointing them at the issues so they can react based
on customer impact. Now Power BI is a great
data visualization tool and the good news is that it
comes integrated with Synapse. But one of the first challenges is: how do we get data into Power BI? Now Power BI comes with a whole bunch of great connectors, and because
we're just talking SQL here, we can use the existing SQL connectors with SQL serverless
without any changes at all. So let's have a look in our Develop hub. In our Develop hub you'll see we now have a Power BI option and we can see our Power BI
workspace that we set up. And from there, we can
go and directly query the Power BI data sets within Synapse. So I'm going to add a new data set and you can see that from Power BI's perspective SQL serverless looks like
any regular SQL database. And I've got all my views in
a database called TelcoDemo and I can download what's
called a PBIDS file, which is a Power BI file that has all the connection details set up, so I can immediately start querying our SQL serverless views and tables without any other problems or delay. So, to save a bit of time, I've already done that and I've pushed up a data
set with our network issues. And what I can do is create a new Power BI report, and you'll see the report experience is now embedded right in. Here's our network health data from before, and I'm just going to drop a map onto our report and start hooking that up. So I'm going to pull the latitude in and the longitude in, and I'm going to make the size of my data points the distinct customer count. So you can see now we've
got a nice hot spot graph. So to make this a little bit more interesting, we're just going to drop in a tree map for some navigation, and make it a little bit bigger. And we're going to do that by exchange, and again by the number
of customers impacted. And now what I can do is I can start to navigate around the various different exchanges and see exactly where my issues are occurring at any given time. - And this is really
great, but I know a lot of people watching are
probably using Spark tables, so in that case, do I have to spin up and kind of run a Spark
cluster against that data, or can I use SQL serverless to do that? - Right, that's a really good question. So, Synapse comes with its own managed Spark environment which is great if you've
got existing Spark assets and skills, but SQL on-demand also plays quite well with it. So I can show you an example
of how that all works. So what I'm going to do is
I'm just going to go back to my Data hub, and you'll see that
we've got our databases and if I open that up you'll see we've got a special database
there called default, which is a Spark database. Now if I open that up, you'll see I've got a bunch of tables. These are Spark tables, not SQL serverless views, and that last one there, radius raw, is a Spark table that I previously set up and created. Now in the past, what I would have to do is spin up a cluster to actually view the contents of that table, but with SQL serverless all I need to do is start querying it straight away. So here we're doing it again, doing another select top 1,000 rows, and here I am querying a Spark table without a cluster running at all.
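As a rough sketch of what that looks like from the serverless side, Spark databases in the workspace are surfaced to the serverless SQL pool, with their tables appearing under the dbo schema. The database and table names below are assumptions based on the demo:

```sql
-- Sketch of querying a Spark table straight from SQL serverless, no cluster running.
-- The Spark database ("default") and table ("radius_raw") names are hypothetical.
SELECT TOP 1000 *
FROM [default].[dbo].[radius_raw];
```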
- All right, so we've been talking a lot about serverless capabilities, really for that on-demand compute, but when would we start using provisioned resources then? - So it depends really. We're finding with SQL serverless that we're getting really good performance, and given it's pay-per-query, it also allows us to attribute cost to business units. But you know, there may be a time when you need the performance of SQL pools, and because we're talking
SQL in both cases, it actually makes it easy to transition from SQL serverless to SQL pools. Remember the query I ran before? I can show you at the top here where we're switching between
SQL serverless environments and our SQL pool environments. I'm just going to run that and while that's running I'm just going to explain what I've done. I pulled the same raw
data into a SQL pool. And I've created a similar set of views in the SQL pool environment. And now I'm able to simply switch, run my existing original query and I'm getting exactly the same results, but this time using
pre-provisioned resources. - All right, this is really great stuff. Thanks so much for sharing kind of what you've been doing
with Azure Synapse, Jess, but any tips you would give people who are watching from home about how they can start using Azure Synapse? - Yeah, sure, well historically analytics has been seen as a big, scary, big-budget kind of exercise, and Synapse provides a path to get you productive really quickly, but also allows you to graduate to more advanced services
as and when necessary. So I suppose I'd say look
for the simple opportunities within your data and grow your analytic solutions from there. - Good tips. Thanks again for joining us today, Jess, and also if you liked hearing from endjin, please check out the rest of
our How We Built It series at aka.ms/AzureSynapseSeries to learn from early adopters who are also kicking the
tires with Azure Synapse. That's all the time we have for this show. Thanks for watching. We'll see you next time. (electronic music)