Getting Started with Databricks SQL

Captions
Hello, my name is Simon Whiteley, and this is Getting Started with Databricks SQL Analytics. I run a consultancy called Advancing Analytics, and we help a lot of clients who are taking their traditional warehousing and trying to build it in the cloud: trying to use data lakes, trying to use Databricks to do a lot of the transforming and preparing of their data. But what happens when you try to show it to the business? How do people do their BI-style queries in that architecture? SQL Analytics can really help with that. So we're going to run through what the tool is, how to get started with it, and where to go with it once you actually open it up. So stay with me.

We'll talk a little bit about this lakehouse challenge. The lakehouse is this emerging architecture that says: we don't need different platforms, we don't need something just for BI people and something for the lake and something for data scientists; can't we just do it all with one thing? We'll talk about what that is and what it actually means for what our tools need to do. Then we'll have a look at SQL Analytics itself, where it came from, and step through the key areas: essentially, when you open it up, what are you going to see, how do you work with these different things, and what do these concepts mean, especially if you're coming from a traditional Spark background, because a lot of it will be familiar but different.

So let's talk about this lakehouse challenge, this emerging architecture, this idea of saying I just want one platform to rule them all: one thing to do everything, but do it well. What does that mean? This is where we've been for a fair while now, the thing called the modern data warehouse, and it's an architecture that loads and loads of people have already built in the cloud. Now, I work a lot in Microsoft, so this is an Azure architecture, but you could do a similar thing in AWS, the same in GCP, and the same on-prem. Essentially, it's the idea of building out a lake to get all the real benefits of the lack of schema: you can put data in there and not really worry if the data changes, if you've got changes in structure; you can put all sorts of file types in there, lots of video and film; you can put loads of just interesting data in there, which you can't really do if you're working in a relational environment. So many people have bought into the idea of Databricks and Spark, and they're doing their data processing in that lake. We're getting data in, we're really dynamic, we're scaling out, we're preparing the data and cleaning the data and transforming the data. All that good ETL, that old-fashioned data engineering, is really, really key and really embraced in the lake. We've also got a lot of data science happening there; it's just a great environment to do it in. It supports Python and Scala and R, and you can work in a scalable way across that same data. So our data engineers and our data scientists are very happy in this architecture.

But then, because of this traditional thought that no, BI has to be in a relational database, people are picking up the data that they've prepared and curated in the lake and putting it into a relational database, and then we have our BI developers developing facts and dimensions and Kimball-style analytics inside this separate environment. Essentially, we've got two copies of the same data. And then we're taking that data and putting it out to a BI tool, be it Tableau, Looker, Qlik, or Power BI, for our report consumers, and honestly, they don't really care where the data comes from; they just want that dashboard to work.

So the challenge that we're trying to work towards is to say: actually, we're getting better at doing the whole BI thing in the lake. With Delta coming in (Delta being that specialized file format that sits on top of Parquet, which is already columnar, compressed, and really good for analytical-style queries), it gives us things like an audit trail, it gives us schema governance, it gives us the ability to roll back if we make a mistake, and it means we can do things like MERGE statements, which are huge when you're managing things like slowly changing dimensions. So suddenly we have the ability to engineer our BI-style data models in the lake.
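To make that concrete, here is a minimal sketch of the kind of Type 1 slowly-changing-dimension merge that Delta enables; the dim_customer and staged_customers tables and their columns are made up for illustration:

    -- Hypothetical tables: dim_customer (the dimension) and staged_customers
    -- (today's extract). A Type 1 merge overwrites changed attributes in place.
    MERGE INTO dim_customer AS t
    USING staged_customers AS s
      ON t.customer_id = s.customer_id
    WHEN MATCHED THEN
      UPDATE SET t.email = s.email, t.city = s.city
    WHEN NOT MATCHED THEN
      INSERT (customer_id, email, city)
      VALUES (s.customer_id, s.email, s.city);

A plain Parquet table couldn't do that update in place; it's Delta's transaction log that makes the merge safe.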
So we're kind of left saying: great, we could just start doing BI entirely from our lake, using Delta, serving it through Databricks. And then you start to realize that it's not quite that easy. Our BI developers, who've shifted up to the top of the diagram, are now accessing Databricks directly, and they're coming from a very mature place. They've been using SQL Server tools, which have been around for decades and are really, really mature: tools that let them query SQL and develop SQL, give them query hints, finish their typing for them. It's just a really nicely designed environment for writing SQL. And now they're writing SQL in a notebook, which doesn't quite work, doesn't quite fit. You can do it; it's just not as nice as where they're coming from. And that's a challenge when you're moving to a new technology: you need to bring people with you, and you need to be at least as good as where they're coming from. It's the same with report consumers: if they're connecting to a traditional Databricks cluster and they've got to piece together the connection string from different parts of an advanced configuration pane, that's not really good enough; people aren't going to do it.

So SQL Analytics is there to address this problem. It's trying to say: how do we work with these users? How do we take our BI developers, who love writing SQL, who are passionate SQL developers, who care about performance and about how our data is modeled, but who need an environment that has type-ahead, has code snippets, has all these interesting things to make them feel at home and make them productive? They're not going to want to write Python or Scala. And our report consumers just want it to work; it needs to be seamless, it needs to not seem threatening, it needs to be really easy for them to just pick it up and run with it. That's what SQL Analytics is trying to do: essentially, it's saying we need to invite and embrace these different types of users into the environment that's already working so well for other people.

OK, so let's talk about SQL Analytics: where does it actually fit, and what does it actually do? In essence, it's an entirely new workspace inside Databricks. The traditional Databricks environment is now known as the Data Science and Engineering workspace, and inside there you've got absolute control of everything. When you design a cluster, you pick the size of VM, how many virtual machines you need, how many workers, how it should scale, how long it stays on, what kind of initialization scripts you should have; there's a lot of technical config that you can dig down into the depths of. You can write in Scala, R, Python, or SQL, you've got loads of flexibility, and most development is either in notebooks or spark-submit jobs. Essentially, it's all about software development on top of data.

On the new side, in the SQL Analytics workspace, it's heavily simplified. Designing a cluster is just picking a t-shirt size: there's no more understanding the VMs and the memory constraints and tailoring it to your workloads; it's just "how big do you want it to be?". It's deliberately made really accessible and easy for people who aren't coming from a heavily technical background. It's SQL only, so you can't write Python and you can't write Scala in there; it is laser-focused on this SQL community. And so it has things like query builders to help you write SQL, snippets, type-ahead that finishes the SQL as you're typing it, and then the ability to put results into a lightweight dashboard. It's designed for exactly that kind of person. So when you start using it, you'll see these two different areas: am I in Data Science and Engineering, or am I in SQL Analytics? We'll have that choice going in: which option do we want to go with?

The kind of workflow we're going to see leans heavily on the engineering side: we're getting the data in, we're preparing the data, and we're surfacing the data as Hive tables, and that is core to how SQL Analytics works. SQL Analytics expects data to already be registered as Hive tables; I'm expecting to go in there and see a list of databases and tables I've got access to. So the engineering workspace is key to that data preparation: acquiring the data, bringing it into the system, getting it into the right shape, so I can then present it over the fence to the SQL Analytics side. At that point we'll be writing queries on it, shaping it, analyzing it, and either pushing quick SQL directly from a BI tool (Power BI, Tableau, something else) or building a dashboard. So we've got that separation of the kind of work people are doing. That's not to say you can't still write SQL on the data engineering side of things; it's just not as geared towards it, not as optimized for writing SQL in that environment.
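As a hedged sketch of that hand-over, this is roughly how curated Delta files might be registered as Hive tables from the engineering workspace; the saleslt database name matches the one used later in the demo, but the storage path is made up:

    -- Run from the Data Science and Engineering workspace so the table shows
    -- up in the SQL Analytics schema browser. The LOCATION is illustrative.
    CREATE DATABASE IF NOT EXISTS saleslt;

    CREATE TABLE IF NOT EXISTS saleslt.product
    USING DELTA
    LOCATION '/mnt/lake/curated/saleslt/product';

Once registered like this, the database and table appear in the list that SQL Analytics users browse.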
OK, so now we're going to step through some of the key concepts, and again, it's built to be very recognizable if you know the Databricks workspace. We've got these areas going down the side, which are the different key concepts, and I'll take a moment to step through them to give you an idea of what to expect.

I've skipped down to the bottom to start off with: this idea of endpoints. Where in Databricks, and any other Spark development, we're expecting a cluster, with a driver and worker nodes, and we're expecting to design all the actual machines that our work is going to take place on, in SQL Analytics we've just got an endpoint, deliberately simplified. You can see that we've got a cluster size of Medium, so it's completely obfuscated. It's really nice to just say I want a bigger cluster, or a smaller one, a Large, an Extra Large, and to remove all the complexity of actually building and designing a cluster. We do still need an endpoint, because it's still Spark, still running on a Spark cluster; we have to have some kind of cluster to run things on, and in SQL Analytics that's called a SQL endpoint. So we'll spin up an endpoint and then be able to do some work on top of it.

Then we have this idea called a query. A query is a reusable data object: essentially, I'm writing some SQL, that's going to return some results, and I can use those results in a few different things; it's a building block of data. It can be really simple, just a SELECT * from a particular table, or I can have subqueries and CTEs; I can have a fairly complicated SQL statement, as long as it returns me some data. And that data I can cache and use in several different visualizations. So I can have just one dataset, one query that goes to the data, does some calculation, and returns a set, and then use that set in several different visualizations, which makes it nice and efficient to refresh. And this is where all of the advancements in terms of SQL writing are: in that SQL box I'm going to write some SQL, it gives me type-ahead, it completes words, and it lets me browse which Hive objects I've got access to. It's a nice, easy area to write SQL, save it, and then reuse those cached datasets in various different places.
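To give a feel for that, here is one possible shape for a reusable query, a CTE that aggregates product sales; the saleslt tables echo the demo dataset, but the column names are assumptions:

    -- A hypothetical "product performance" query whose cached result set
    -- could feed several visualizations at once.
    WITH product_sales AS (
      SELECT p.productid,
             p.name,
             SUM(d.linetotal) AS total_sales
      FROM saleslt.product AS p
      JOIN saleslt.salesorderdetail AS d
        ON p.productid = d.productid
      GROUP BY p.productid, p.name
    )
    SELECT name, total_sales
    FROM product_sales
    ORDER BY total_sales DESC;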
Once we've got a query and we've built some visualizations off it, we can pin them into a dashboard. So we have these collections of all the different visuals we've built and made, we can have a refresh schedule that will go off and run each of the individual queries behind it, and we can share it with people. Essentially, this is our way of presenting our various visualizations. Now, this isn't to say it's going to replace your enterprise BI tool: if you're already using Power BI, Tableau, or one of the premium BI offerings, the dashboards in SQL Analytics are a little bit simpler. But if you don't currently have those, or you just need something really lightweight and quick, it's a really nice option for presenting some good-looking visuals that you can share with other people, so they can do their own analysis and dig further into what's going on.

We've got alerts, and these are super useful. If I've got a dataset, I can trigger various alerts based on it. I can say: for that value, next time you refresh, if it's under a certain amount, send me an email; alert these people if there's an anomaly. We only expect a thousand requests a day; if we get ten thousand, something's gone horribly wrong, so let the support team know. We can set various data alerts up, depending on the refresh schedule of our data, and so trigger various actions based on the things we're watching inside the data models we're building.

And finally, one of my favorite areas: history. A big piece that's really been missing from traditional Spark is the ability to understand how people are using the data. What kind of queries are people writing? If people are writing hundreds of different queries on different tables across all of my data, which is the table that's not performing well, which is the thing that's slowing down the whole of my cluster? We've always had that detail in the Spark UI: you can dig into the depths of the RDDs, see the actual execution plans, and understand what's happening at the Spark level. But that's really hard for someone who works at the data-model level, at the SQL level; you need to really understand Spark to make sense of the performance of the different stages and tasks and what's actually happening in the Spark engine. Here, instead, we've got a really simple list: this is the actual SQL that they wrote, and this is how it broke down, performance-wise, into different steps and stages. That gives us a really great way of keeping track of who's using our system, how many queries they wrote today, and what tables they're hitting, because then we can give that to our engineers and say: we probably need to optimize these; here are the ones we need to work on; maybe we need some Bloom filter indexes to go and speed that table up. It's that telemetry, that information about how users are using our system, which is so valuable.
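For instance, if the history showed lots of slow point lookups on one column, the engineering team might add a Bloom filter index on it. A hedged sketch, using Databricks' Bloom filter index syntax as I understand it at the time of this preview, with a made-up column and option values:

    -- fpp is the target false-positive rate; numItems the expected number of
    -- distinct items. Values here are illustrative, not tuned.
    CREATE BLOOMFILTER INDEX ON TABLE saleslt.salesorderdetail
    FOR COLUMNS (productid OPTIONS (fpp = 0.1, numItems = 1000000));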
OK, so those are the main bits and pieces we'll see inside SQL Analytics. What we'll do now is a quick tour: we'll show you the dashboard, step through some of those areas, an endpoint and some queries, and give you an idea of what it's like to work inside that environment.

So here we have the familiar look of the Azure Databricks workspace, or what we now know as the Data Science and Engineering workspace, and you'll see right down at the bottom I've got this little switch that allows me to toggle between the two different workspaces. If I've done some data engineering, created my tables, and registered them with Hive, and I now want to query them with SQL, I can switch over to my SQL Analytics workspace. It takes a moment to switch over, we get a nice little Redash logo, and then I'm in the new SQL Analytics environment. Now, this is currently in preview, so you might not have that little button just yet, but when you've been enabled for SQL Analytics you'll be able to switch around like that.

So I'm now in the SQL Analytics workspace. First things first, let's make sure I've got an endpoint set up, because again, it is still Spark: I need to make sure I have a cluster that's actually running. I go into Endpoints and get my list of endpoints. You can see I've got different endpoints set up for my different end users: for my marketing people I've got one, for customer sales I've got another, and the one we're going to look at is Business Intelligence, which is already running. If I wanted to create a new endpoint, I can just create a new SQL endpoint in a nice simple little box: I give it a name (is it a cluster? not really) and I pick a cluster size, again with this idea of t-shirt sizing. Looking at the DBUs, you can map it back to how big a Spark cluster you'd need in the traditional Databricks workspace to match it, but the whole idea is that you don't need to think in that much detail: they've already picked the right kind of cluster, the right type of high-concurrency cluster with Delta caching, so you don't have to think about it. It's been designed to make it really easy to pick the right type of cluster for your workload.

Now, we still pay for the cluster while it's turned on, so if you want it to turn off automatically, without having to remember to press start and stop, we've got Auto Stop. I could say: if no one's submitted any SQL for 20 minutes, turn off the cluster, and then the next person who tries to use it will need to wait three or four minutes while it turns back on.

There are also some experimental preview options in here, and I'm not going to go into too much detail on those. We've got this idea of multi-cluster load balancing: essentially, if I've got several different use cases, I can have several clusters, rather than having one person accidentally write a huge query that slows down everyone else who's trying to use that same cluster.
What we can actually do is load-balance across several independent clusters behind that endpoint. That's coming soon, so we don't have access to it just yet. I've also got this idea of Photon. Photon, to throw some terms at you, is a vectorized query engine: essentially, they've built something really cool that can calculate more values at once than they were previously able to, meaning your queries go faster. It's a whole flavor of Spark made to go faster for these kinds of SQL analytics queries. It's something you can turn on, but it is experimental at the moment; they haven't covered absolutely every different type of SQL you might throw at it, so use it at your own risk. But it's really cool, it's coming, and it's going to really speed up the kind of BI questions people ask.

OK, in this case we're going to use my existing endpoint, which is set up and running. If I need other things, I've got my Connection Details (we'll have a look at those; they allow us to connect to this endpoint using Tableau, Power BI, etc.), and I can also see how busy the cluster is and how many queries people are running against it. If you're used to normal Spark, you're used to looking at Ganglia, which is horrendously complicated and has huge amounts of detail; this is just a lot easier, a really simple view of how busy my endpoint currently is.

OK, moving on, we've then got Queries. This gives me a list of all the queries I've got saved, all those data building blocks I've created in the past. I can write a new query and just start typing. You can see I've already got a database over here: this is a Hive database I created in my normal Databricks workspace; I registered the database, created some tables, and I can see all the things I've got access to as an Active Directory user. So I can start typing SELECT * FROM sales... and it knows I'm typing saleslt; I can pick product, and it completes as I'm going. It's been built for people who are writing SQL, which is great. When I'm happy with how that's working, I hit Execute; that goes off, runs on the cluster, returns the results, and gives me a table of data down at the bottom that I can go ahead and start using.

Let me take one that I built earlier: I'll discard that and go into my product performance query. You can see there's a bit of SQL there, and one of the hard things is that people write SQL in all sorts of different ways. I'm not really happy with the formatting of this one, and one of the small niceties is that we've got a SQL formatting button: I can just hit Format and it applies standard formatting to my SQL. Just the little SQL niceties we're used to in a SQL editor, which we've now got access to in SQL Analytics.
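As an illustration (with a made-up query; the exact style the Format button applies may differ), the effect is along these lines:

    -- Before formatting:
    select name, sum(linetotal) from saleslt.salesorderdetail d join saleslt.product p on p.productid = d.productid group by name

    -- After formatting:
    SELECT name,
           SUM(linetotal)
    FROM saleslt.salesorderdetail AS d
    JOIN saleslt.product AS p ON p.productid = d.productid
    GROUP BY name;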
So now that I've got this dataset, you can see I've already created a visualization against it, and that will refresh whenever I refresh my data. Right down at the bottom I've got a little refresh schedule, so I can say how often I want it to refresh automatically. I'm not dealing with jobs, I'm not dealing with a whole ETL schedule, I'm not having to use an external tool; I'm just saying: could you refresh my data every hour, or every day? And it will automatically turn on the SQL endpoint, run the query, update the visuals, and turn off, so I can go and see the latest visuals without the endpoint even being on. If people just need to see what today's data looks like, I can refresh it overnight and have the endpoint turned off, and they just look at the most recent data. It's really nice and really cheap.

I can add a visualization, and this gives me a whole list of different types of advanced visualizations I can dig into. In this case I'll just do a basic chart; it gives me lots of options for what kind of chart I want, so I can do a really simple bar chart, put a category on there, give it some sales, and then I've got a little visual I can start playing with. The chart one is very similar to the ones that are baked into the notebook experience, but then I've got a whole load of other advanced visuals to play with, so I can go to the nth degree of changing fonts and formatting and colors and all sorts of things. So I can take that visual and say: you know, that's a nice visual, I'm going to stick with it. Then, what am I going to do with it? I probably want to include it in a dashboard, so I've got some options: do I want to take the data and export it to Excel to go play with it somewhere else, or do I want to take this visual and add it to a dashboard? In this case I have a product dashboard, and I think I've got a sales dashboard, so I can tag this visual and push it to an existing dashboard.

If we have a look at the dashboards, I can see which dashboards I've got access to, I can give different people different shares of my dashboards, and I can set up a refresh schedule for a dashboard that will automatically run all the queries that its visuals are based on. And there's an amount of interaction: it's not as super rich in interactivity and cross-slicing between visuals as you'll get with a premium BI tool, but it's quite nice for exploring bits of data and digging into things, and the map visuals are all nice and tactile and interactive. It's still not a premium BI tool; it's just a nice, cheap way of visualizing data if you're already using Databricks and you're in this ecosystem.

OK, moving on, we've got our History. You've seen I've been writing some SQL and running some queries, and I've got this list of everything I've just been running. I can say: well, that took two seconds, that took 17 seconds, that's more than I expect, and I can dig into it and get some information. What was the SQL that the person wrote to trigger that workload? How long did it take? I can open up the Spark UI and get the full details of the SQL execution plan, what actually happened under the hood, if I need that level of detail. I can also see the breakdown: did it take too long compiling the query and turning it into a Spark job, or was it just the pure grind of working with the data? So I get a bit more detail about how efficient the SQL that's been written is. And again, that's not just for me: as an admin I can go in, see all my different users, what kind of things they're writing, how slow it's going, and where I need to target my optimizations.

And then finally we've got these alerts, where I can put a trigger in. I can say: based on a query I've just written, how often do you want to check (it will check every time the data is refreshed), and what kind of trigger should it have? You can see this one's been triggered, so it will have sent me an email, because my total sales were above a certain amount.
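As a hedged example, the query behind an alert like that might look like the sketch below, with the threshold comparison configured in the alert UI rather than in the SQL; the table and column names are assumptions:

    -- The alert watches a value in the result set and fires when it crosses
    -- the threshold configured in the UI on a scheduled refresh.
    SELECT SUM(totaldue) AS total_sales
    FROM saleslt.salesorderheader
    WHERE orderdate = current_date();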
And again, it is just that easy: go in, pick the measure on that particular query, how you want to compare it, the value to compare it against, and what kind of email it should send. If you're using a help-desk tool and it should raise a service request, you can bake in a standard template. So as a tool, there are loads and loads of different things in there, lots of niceties built for people doing reporting and exploratory analytics, people just trying to write SQL against the data in a way that's comfortable and makes sense for them.

OK, so we've seen lots of features around dashboarding, visualization, and query analysis inside SQL Analytics, but it's also an optimized engine for serving analytics through other tools. One of the big questions is: how do I use things like Power BI and Tableau and Qlik against the SQL endpoints from SQL Analytics? Using Power BI as an example, there's already an Azure Databricks connector, so I can just open up Get Data in Power BI, choose Azure Databricks, and it's going to ask me the right questions to get that connection working. That will work either against the straightforward Data Science and Engineering workspace, against a normal cluster, or against the new SQL Analytics endpoints. So I set this up and it asks me a few questions: what's the server hostname, what's the HTTP path? Previously, on a normal cluster, you had to dig into the cluster settings and the advanced config to pull that info out, and again, that's just not particularly accessible to a BI end user, someone who just wants to plug in a BI tool. But we saw that on the endpoint, when I clicked on Connection Details, we were given exactly those bits of information, server hostname and HTTP path, and we just copied them straight in. It's built in such a way that you get the information you need and nothing else; you don't need access to all the other bits and pieces you might use for a straight Spark connection with the big data Spark tools, because it's built expecting you to connect from a fairly traditional BI tool.
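For reference, the two values shown in the Connection Details pane have roughly this shape; these are made-up example values, not real ones:

    Server Hostname: adb-1234567890123456.7.azuredatabricks.net
    HTTP Path:       /sql/1.0/endpoints/abc123def456ghi7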
OK, so to wrap up: we've seen all these different tools, all these different bits and pieces being brought in to help people achieve that lakehouse journey. Our new users, our BI developers, have now joined our data engineer and our data scientist in accessing Databricks directly, but not quite in the same way, because they're using a different workspace: they've got an area designed just for them, for their kind of workload and their kind of development style, to help them do what they're trying to do, and it's optimized in exactly that way. And we've got our report consumers, who can now much more easily access a SQL endpoint and point their BI tools at it. As that engine, that SQL Analytics cluster, improves, as we see Photon come online, it's going to get faster, more optimized, and better at serving data out to those kinds of BI requests. That's what SQL Analytics is doing: it's not really changing the fact that people can use Delta, do all the BI engineering and all the data engineering, and prepare the data in the right way; it's just allowing these various different people, who just want to access the data, use the data, and gain insights from the data, to come on board and use the same platform to serve that data.

So thank you very much. I'll be around to answer any of your questions, and as always, please do leave feedback: it's incredibly valuable for the conference organizers and for us as speakers. Otherwise, I hope you have a great session.
Info
Channel: Databricks
Length: 27min 21sec (1641 seconds)
Published: Fri Aug 27 2021