(electronic music) - Coming up on a special
edition of How We Built It, we're joined by UK-based data
engineering consultancy endjin, who specialize in big data analytics solutions for customers in the ocean research, financial services and retail industries, to see how they're
evaluating Azure Synapse for on-demand serverless
compute and querying. So I'm joined today by Jess Panni, Principal and Data Architect at endjin. Welcome to Microsoft Mechanics. - Thank you, it's great to be on the show. - Thanks for joining us today, and let's take a step back though. Before we get into this,
just a quick recap. If you're new to Azure Synapse, this is Microsoft's
limitless analytics platform that really brings together enterprise data warehousing
and big data processing into a single service. In fact, you can learn more by watching our recent introduction to Azure Synapse with John McIntyre at aka.ms/mechanicssynapse. All right, so Jess, your company endjin really started its
journey with Azure Synapse for data exploration. You're using things like SQL serverless for on-demand compute, and also the data visualization capabilities, but can you tell us what's behind this? - Yeah, so we've been working with a number of different customers on their big data challenges across a range of different industries. We try to make their data more accessible and take away the friction of getting insights from it. We also look after their security and governance processes and needs as well. So we've been putting SQL serverless through its paces on some of their most demanding workloads to see how it stacks up against other services and technologies. - All right so Jess, let's
make this real for people. Can you give us an example of how this would work? - Yeah, absolutely. So imagine you're an ISP or a telco and you're providing
broadband services to people. And you want to be able to identify who and where problems are occurring on your network, so that you can prioritize where to send your network engineers. Now, clearly the telco doesn't have a direct connection into your home, but there's a network of
interconnected devices as part of their network infrastructure between you and them, which introduces multiple
points of failure. Now these devices are
constantly producing telemetry, which we would typically store in an Azure Data Lake, and in this particular case we're collecting about 60 gigabytes of data per day. We then need to provision a Spark environment, configure it, secure it, and connect it to the Data Lake. And only then can we spin up a cluster, which would usually comprise at least three VMs for the duration of the exploration and development. And of course if you're new to all this, you need to get skilled up in something like Python or Scala. - And that's really
familiar territory for me, because in IT that becomes a dependency on us, which might mean additional time or cost to get those resources spun up. - Yeah absolutely, that can
take days, weeks, months even. And once you've found the insights and worked out how to extract them, you need to operationalize the solution and build data pipelines so that you're continuously monitoring data as it arrives, and can measure and find issues in the future. - Right, and those data pipelines also pose a challenge themselves, because they might be running on different infrastructure, and bringing all of that together can be hard. So how do things then
change with Azure Synapse? - Yeah I mean, I can show
you how you can achieve all of this securely from a single environment without spinning up any
pre-provisioned resources, and using the language that
you probably already know. Now there are a number of steps we need to go through, and the first of those
steps is understanding what data we have. So here we are in Synapse Studio, and one of the great things about Synapse Studio is we can now browse the files and folders within our Data Lake. You can see here the contents of one particular folder, and we can see we've got
a network telemetry folder which looks interesting. So let's go and have a look inside there. Now we can see we have
data partitioned by date, and then partitioned by hour, and finally we can see the raw files which are in this case in Parquet format. Now if I wanted to analyze these files traditionally I'd have
to run a Spark cluster, or I'd need to download those files. But with SQL serverless, all I need to do is select the files, right-click on them, and choose "Select TOP 100 rows". So I'm just going to run this, and you can see we're running a SQL query over one or more files in the Data Lake without any provisioned servers.
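For a sense of what that auto-generated script looks like, here is a minimal sketch of an OPENROWSET query over Parquet files in the lake. The storage account, container and folder path below are hypothetical stand-ins for whatever your Data Lake actually uses:

```sql
-- Minimal sketch of the kind of script "Select TOP 100 rows" generates.
-- The storage account, container and folder path here are hypothetical.
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://contosodatalake.dfs.core.windows.net/telemetry/network-telemetry/2020/06/01/09/*.parquet',
    FORMAT = 'PARQUET'
) AS [result];
```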
But what if we wanted to query over all our telemetry without worrying about individual files or where they live in the Lake? I can show you an example. So here we've got a view that I created called Radius Raw, which represents all our network telemetry. And I can use this query to find the total number of events that we received on a single day. So if I run this, we can see that we've received about 50 million events on that day. Now this view is completely reusable: we've hidden all the details of the individual files, and even the formats of the data that we're querying. And as I say, all of this is being done without using any provisioned resources. It's fully serverless, and I didn't have to spin up a single thing.
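For context, a view along the lines of the one being described might look roughly like the sketch below: it hides the file layout and format behind a name, and surfaces the folder partitions as columns. The paths, column names and example date are all hypothetical:

```sql
-- Rough sketch of a reusable view over all the telemetry files, along the lines of
-- Radius Raw. Paths, partition columns and the example date are hypothetical.
CREATE VIEW dbo.RadiusRaw AS
SELECT
    r.filepath(1) AS [EventDate],   -- first wildcard in the path: the date folder
    r.filepath(2) AS [EventHour],   -- second wildcard: the hour folder
    r.*
FROM OPENROWSET(
    BULK 'https://contosodatalake.dfs.core.windows.net/telemetry/network-telemetry/*/*/*.parquet',
    FORMAT = 'PARQUET'
) AS r;
GO

-- Total events received on a single day (about 50 million in the demo).
SELECT COUNT_BIG(*) AS TotalEvents
FROM dbo.RadiusRaw
WHERE [EventDate] = '2020-06-01';
```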
Well, I've got a slightly more interesting example that I can show you. So here's a query that returns network issues for a given piece of equipment, ranked by the number of customers who were impacted. And we're joining these issues against another set of files which provide more information about the local exchange of the devices, including geographical location. We're limiting this entire data set down to a single day, and also just down to the top 100 customers that are impacted. So if I run this, I just want to point out that the exchanges view is actually reading a CSV file, whereas the other view we're querying over is raw Parquet, so we're actually joining data stored in completely different formats. And as you can see, we've got some results back quite quickly.
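As a hedged sketch of what that kind of query might look like, the point is simply that it's ordinary T-SQL joining two views, regardless of the underlying file formats. The view names, column names and date below are hypothetical:

```sql
-- Sketch of the kind of join being described: the Parquet-backed telemetry view
-- joined to a CSV-backed exchanges view, limited to one day and the top 100 results.
SELECT TOP 100
    r.EquipmentId,
    e.ExchangeName,
    e.Latitude,
    e.Longitude,
    COUNT_BIG(*)                 AS EventCount,
    COUNT(DISTINCT r.CustomerId) AS ImpactedCustomers
FROM dbo.RadiusRaw  AS r
JOIN dbo.Exchanges  AS e ON e.EquipmentId = r.EquipmentId
WHERE r.[EventDate] = '2020-06-01'
GROUP BY r.EquipmentId, e.ExchangeName, e.Latitude, e.Longitude
ORDER BY ImpactedCustomers DESC;
```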
Now the next thing to do is to start visualizing it, seeing what's in that data set and how we can represent that table in a better way. What I can do is go down to the results view and change the view to a chart view. And I'm just going to make a couple of changes here: I'm just going to remove some of these columns and leave the number of events and the number of distinct customers. And now what I've got is a time series showing when peak issues are occurring, the number of distinct customers when that occurs, and obviously the number of events. So, now that we've discovered how to extract insights, we
need to operationalize these. And to do this, we would typically run this query continuously as new data arrives and store the results in the Data Lake, to make it easier for upstream processes and visualizations. So I can show you that here as well. What we've got is a CETAS expression, a CREATE EXTERNAL TABLE AS SELECT, and if you're familiar with SQL Database or SQL Data Warehouse you may have seen something similar in the past. Now what this is doing is running the same query, but writing the output to a location in the Data Lake. We're writing to a folder called Output, and within that a subfolder called Network Health.
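To make that concrete, here is a minimal sketch of what a CETAS statement like the one being described could look like in a serverless SQL database. Everything below — the external data source, file format, paths and the query itself — is a hypothetical stand-in for what's shown in the demo, and the data source and file format would need to be created once up front:

```sql
-- One-time prerequisites (hypothetical names): where and how the output is written.
CREATE EXTERNAL DATA SOURCE TelemetryDataLake
WITH (LOCATION = 'https://contosodatalake.dfs.core.windows.net/telemetry');
GO

CREATE EXTERNAL FILE FORMAT ParquetFileFormat
WITH (FORMAT_TYPE = PARQUET);
GO

-- CETAS: run the query and write its output to output/network-health in the Data Lake.
CREATE EXTERNAL TABLE dbo.NetworkHealth
WITH (
    LOCATION = 'output/network-health/',
    DATA_SOURCE = TelemetryDataLake,
    FILE_FORMAT = ParquetFileFormat
)
AS
SELECT
    r.EquipmentId,
    e.ExchangeName,
    e.Latitude,
    e.Longitude,
    COUNT_BIG(*)                 AS EventCount,
    COUNT(DISTINCT r.CustomerId) AS ImpactedCustomers
FROM dbo.RadiusRaw  AS r
JOIN dbo.Exchanges  AS e ON e.EquipmentId = r.EquipmentId
WHERE r.[EventDate] = '2020-06-01'
GROUP BY r.EquipmentId, e.ExchangeName, e.Latitude, e.Longitude;
```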
And now what we can do is go back into our Data Lake. I'm just going to drop out, and you can see we've got a new folder called Output with a subfolder called Network Health, and here are all our issues ready to go. - So now you've kinda got the data and the visualizations to really see where those different points of failure are on the network, to be able to dispatch the right field techs, but what would I do next? - Yeah, so the next thing we need to do is make sure
that our field engineers have good visualizations
that are continuously running, pointing them at the issues so they can react based
on customer impact. Now Power BI is a great
data visualization tool and the good news is that it
comes integrated with Synapse. But one of the first challenges is: how do we get data into Power BI? Now Power BI comes with a whole bunch of great connectors, and because
we're just talking SQL here, we can use the existing SQL connectors with SQL serverless
without any changes at all. So let's have a look in our Develop hub. In our Develop hub you'll see we now have a Power BI option and we can see our Power BI
workspace that we set up. And from there, we can
go and directly query the Power BI data sets within Synapse. So I'm going to add a new data set and you can see that from Power BI's perspective SQL serverless looks like
any regular SQL database. And I've got all my views in
a database called TelcoDemo and I can download what's
called a PBIDS file, which is a Power BI file that has all the connection details set up, so I can immediately start querying our SQL serverless views and tables without any other problems or delay. So, to save a bit of time, I've already done that and I've pushed up a data
set with our network issues. And what I can do is create a new Power BI report, and you'll see the report experience is now embedded right in. Here's our network health data from before, and I'm just going to drop a map onto our report and start hooking that up. So I'm going to pull the latitude in and the longitude in, and I'm going to make the size of my data points the distinct customer count. So you can see now we've
got a nice hot spot graph. So to make this a little bit more interesting, we're just going to drop in a tree map for some navigation, and make it a little bit bigger. And we're going to do that by exchange, and again by the number
of customers impacted. And now what I can do is I can start to navigate around the various different exchanges and see exactly where my issues are occurring at any given time. - And this is really
great, but I know a lot of people watching are
probably using Spark tables, so in that case, do I have to spin up and kind of run a Spark
cluster against that data, or can I use SQL serverless to do that? - Right, that's a really good question. So, Synapse comes with its own managed Spark environment which is great if you've
got existing Spark assets and skills, but SQL on-demand also plays quite well with it. So I can show you an example
of how that all works. So what I'm going to do is
I'm just going to go back to my Data hub, and you'll see that
we've got our databases and if I open that up you'll see we've got a special database
there called default, which is a Spark database. Now if I open that up, you'll see I've got a bunch of tables. These are Spark tables, not SQL serverless views, and that last one there, radius raw, is a Spark table that I previously set up and created. Now in the past, what I would have to do is spin up a cluster to actually view the contents of that table, but with SQL serverless all I need to do is start querying it straight away. So here we're doing it again, doing another select top 1,000 rows, and here I am querying a Spark table without a cluster running at all.
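As a rough sketch of what that looks like from the serverless side, Spark databases in the workspace are surfaced to the serverless SQL pool, with their tables appearing under the dbo schema. The database and table names below are assumptions based on the demo:

```sql
-- Sketch of querying a Spark table straight from SQL serverless, no cluster running.
-- The Spark database ("default") and table ("radius_raw") names are hypothetical.
SELECT TOP 1000 *
FROM [default].[dbo].[radius_raw];
```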
- All right, so we've been talking a lot about serverless capabilities, really for that on-demand compute, but when would we start using provisioned resources then? - So it depends really. We're finding with SQL serverless that we're getting really good performance, and given it's pay-per-query, it also allows us to attribute cost to business units. But you know, there may be a time when you need the performance of SQL pools, and because we're talking
SQL in both cases, it actually makes it easy to transition from SQL serverless to SQL pools. Remember the query I ran before? I can show you at the top here where we're switching between
SQL serverless environments and our SQL pool environments. I'm just going to run that and while that's running I'm just going to explain what I've done. I pulled the same raw
data into a SQL pool. And I've created a similar set of views in the SQL pool environment. And now I'm able to simply switch, run my existing original query and I'm getting exactly the same results, but this time using
pre-provisioned resources. - All right, this is really great stuff. Thanks so much for sharing kind of what you've been doing
with Azure Synapse, Jess, but any tips you would give people who are watching from home about how they can start using Azure Synapse? - Yeah, sure, well historically analytics has been seen as a big, scary, big-budget kind of exercise, and Synapse provides a path to get you productive really quickly, but also allows you to graduate to more advanced services
as and when necessary. So I suppose I'd say look
for the simple opportunities within your data and grow your analytic solutions from there. - Good tips. Thanks again for joining us today, Jess, and also if you liked hearing from endjin, please check out the rest of
our How We Built It series at aka.ms/AzureSynapseSeries to learn from early adopters who are also kicking the
tires with Azure Synapse. That's all the time we have for this show. Thanks for watching. We'll see you next time. (electronic music)