Advancing Spark - Introduction to Databricks SQL Analytics

Video Statistics and Information

Captions
Hello Spark fans, and thanks for joining me on day one of the Data + AI Summit Europe 2020. We've just had the opening keynote, where we heard from Ali Ghodsi about the lakehouse vision, with Reynold Xin and Brooke Wenig demonstrating SQL Analytics. This is the brand new, just-announced workspace inside Databricks that is purely geared towards SQL people: people who want to write queries interactively in a much nicer SQL experience, build visualizations, build dashboards, connect to BI tools, and do all that stuff. So we're going to take a look, and I'll give you a bit of a walkthrough. We've got the preview version of it, and you can apply for the preview today, so if you want to have a play you can go and do that. Any questions, anything you want us to look into, let us know down in the comments, don't forget to like and subscribe, and there are going to be a lot of videos from us over the next few weeks as we dig into different parts of it, so if there are elements you're interested in, let us know and we'll see what we can do to explore and share.

First things first: there is a big launch blog, which we'll share, and it talks through the vision. It's really good for understanding who Databricks are targeting this at and who it's for. They're aiming to simplify things: a lot of it is just Databricks, just a Spark cluster, but wrapped in a layer of abstraction that makes it a lot easier. They've removed a lot of the complexity around the Spark UI and cluster configs, all geared to make it easier to adopt. As they've said, data engineers and data scientists can hop onto Spark, learn it, and it's fine; but talking to your data analysts, your BI people, the people who are very used to working in a purely SQL warehouse world, they find it a bit complex. Looking in the Spark UI to see what happened, there's a lot of info in there. So it's all about reducing complexity. The blog also talks about the lakehouse, which we know well, and about a few different product pieces, which I'll skip over because we'll look at them directly.

One of the big things is connectivity: connecting to Tableau, to Power BI, to Looker, to Qlik, all the good enterprise BI tools. For me that's a very important message, because when I've spoken to people about Databricks SQL Analytics they've asked, "Is that instead of Power BI? Do I not need Tableau anymore, can I just use this?" And the answer is no. It's an experience for ad-hoc exploration, and it's very good for people doing SQL, but there's no drag-and-drop data modelling editor; it's not going to be a semantic model with DAX and all the things you get in Power BI. It's no replacement for a premium enterprise BI tool. It is, however, a really nice, simple, easily accessible dashboarding tool, so if you don't have big premium reporting and dashboarding requirements, it's awesome. And if you do have those requirements, it's a new endpoint you can connect your BI tools to, one that simplifies things and has a ton of optimizations in it. So it's for both camps: whether you're currently in Tableau, Power BI or Qlik, there's stuff here for you; and if you don't have a BI tool and you're wondering whether you need to buy one, you can use this as an entry-level BI tool, and it has a load of other stuff in there besides.

One thing I've got to say: we are super honoured at Advancing Analytics, because we are one of the launch consulting partners for SQL Analytics. So if you're figuring out how to build a lakehouse with Databricks and all that kind of stuff, come and talk to us.
There's also a lot about performance. Performance is a big thing because they've got the Photon engine in there. We heard about Photon around the Spark 3 release at last summer's Summit: the vectorization, the new C++ engine. This is the first glimpse of it, baked into SQL Analytics as its first home, which is really cool. There's also history, alerting, administration, that kind of stuff, and we'll have a look at those bits. Basically there's loads of new stuff, all to do with SQL.

So what is it? Here's my Databricks portal, and it's nowhere obvious; there's this cheeky little thing down in the corner: Switch Workspace. You've now got this concept of the Data Science & Engineering workspace, aka normal Databricks, and the SQL Analytics workspace. Clicking on that opens an entirely separate workspace; you get the little Redash logo, so we know what this is based on. People can come directly here: you can give people the link to SQL Analytics and they'll just work in this space, none the wiser about all the Python, Scala and R stuff happening across the fence. It's deliberately creating a segregation: don't worry about the other complexity, we're not going to give you a load of extra buttons to click, here's a real simple, accessible workspace for doing SQL-style analytics.

A quick tour down the side. There are Dashboards, which are similar in spirit to the old-school notebook dashboards: essentially lots of different visualizations and bits of markdown, tagged and assembled into a nice, easy-to-use dashboard.
There are Queries: little snippets of SQL. Each bit of SQL I write that produces a dataset is a query, so essentially I'm building up assets: here's a query that produces this visual, here's a query that produces that visual. It's less of a massive data model reused for lots of things, and more SQL queries upon which I build visualizations. Again, we'll have a look.

There are things called Endpoints. Endpoints are essentially clusters, but clusters made super easy: don't worry about drivers and workers, the number of workers, scaling, or any of that. It's literally "I would like an endpoint, this is the t-shirt size it should be", hit go, start using it. It's really trying to make that stuff simple and accessible. And there are History and Alerts, which are new and based on queries: it's all about regular scheduling, refreshing of data, "did something go wrong", "send me an email if I'm outside a threshold", and good BI-style things. So let's have a look through those bits and pieces.

Starting off with endpoints. I need an endpoint to be running, and I've got a few different endpoints; these are for my different groups of users, so I can say "this is what you should be using". You can see I've got a BI endpoint currently working. Now, I've been using Databricks for so long it kind of hurts me calling it an endpoint, because it's just a cluster under the hood; I know why they've divorced the name from "cluster", and we'll talk about that in a second. It's running, and that's key: similar to clusters, this costs money when it's turned on, I need to spin it up if it's off, and it still takes a couple of minutes to start, the same as running a normal Spark cluster, because it is a Spark cluster under the hood.

I give it a name, it gets an ID, and there's the cluster size, which is the t-shirt sizing. If I go to edit, I get a drop-down of sizes, and you can see the DBUs, so we can link it back to how we think about clusters in the Databricks world. My 2X-Small is like your most basic setup, a driver and two really small workers at about 4 DBUs, which makes sense, and as you go up you get more RAM, more CPU, and more workers to distribute across. But you don't really see any of that: it's been made a lot more black box, very much "don't look behind the curtain". There's Auto Stop, which we always love: stop paying for it if no one's using it. We can adjust that; I think it defaults to 120 minutes.

Then there are two experimental features. Multi-cluster load balancing isn't generally available yet: you currently can't just turn it on; you speak to Databricks and the engineers flick a switch for you. Here's the idea. In normal clusters we have autoscaling of the workers: I've got one driver and, say, two or ten workers, and it can scale up to 80 or 100, dynamically changing the size of the cluster depending on workload. This is different: it says "actually, I'm just going to make another cluster, or three clusters", and your SQL endpoint will use whichever cluster is free, load balancing across multiple clusters. That gets over one of the limits of a high-concurrency cluster: you're still going through the driver, and the driver is still the point that has to manage all your work, so even if you scale out the workers you can still hit a concurrency maximum, because your driver isn't changing size in normal Databricks. Here they can just keep adding cookie-cutter-stamped new clusters behind the same endpoint, so we don't have to repoint our BI tools to different locations, and we can scale out concurrency a lot better. Multi-cluster load balancing is super interesting for servicing large numbers of users all writing queries at once, without having to invest in a massive driver that most of the time sits at something like 10% capacity. Not available yet, unless you've got a use case and speak to the engineering team, but super, super interesting.

Now, Photon. Photon is the new C++-based engine, and we can click the button and enable it. It is still very experimental, with a pretty big "beware, this might occasionally break", so not for production workloads yet, but you can have a play. I've not yet done the real "what difference does this make" testing with different combinations to see which kinds of query it really makes faster; the plan is to do that, and I'll be doing some videos on it over the next few weeks, but for now there isn't much to it.

And that's it. There's a bit of security, and that is entirely everything we can do with our cluster. Forget all the Spark init scripts, forget all our advanced settings, forget the low-level tweaking of various different parts of the cluster. It's "a small cluster, please", and you're done, up and running. We still get our JDBC endpoint connection details, so we can connect to it and use it as an endpoint for our tools, and we can do some monitoring to see how busy the endpoint is: again, no Spark UI with tons and tons of information and Ganglia and all that, just "what's running, what's queued", really trying to keep it simple and accessible. Endpoints are our viewpoint into the world.
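To make the multi-cluster load-balancing idea concrete, here's a minimal sketch in plain Python. This is my own illustration of the routing concept, not Databricks code: the `Cluster` and `SqlEndpoint` names are invented, and a real implementation would queue work or spin up a new cluster when everything is busy.

```python
class Cluster:
    """Stand-in for one Spark cluster hidden behind the endpoint."""
    def __init__(self, name):
        self.name = name
        self.busy = False

    def run(self, sql):
        self.busy = True
        result = f"{self.name} ran: {sql}"  # pretend to execute the query
        self.busy = False
        return result


class SqlEndpoint:
    """One stable connection point; work fans out to whichever cluster is idle."""
    def __init__(self, clusters):
        self.clusters = clusters

    def submit(self, sql):
        # Route to the first idle cluster. The caller (a BI tool) never
        # sees which cluster served the query - that's the whole point.
        for cluster in self.clusters:
            if not cluster.busy:
                return cluster.run(sql)
        raise RuntimeError("all clusters busy - query would be queued")


endpoint = SqlEndpoint([Cluster("bi-1"), Cluster("bi-2"), Cluster("bi-3")])
print(endpoint.submit("SELECT 1"))
```

The key design point is that the BI tools only ever hold a reference to `endpoint`; clusters can be stamped out or removed behind it without anything having to repoint.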
With my endpoint running, I can run some queries, and this is the hook, the nice thing for all of the SQL people: I can go in, edit my query, and it's just SQL, just a nice space to be writing SQL. Let's do a new query. So I've got my new query, "this is my test query" (it took me a moment to spot the little naming button, but that's where you set what things are called), and I can write a query. Over on the side, this is Hive: anything in normal Databricks that I've registered with Hive, I will see inside my SQL Analytics workspace. Over here we're not really creating new tables so much; we might create views, maybe, but we're not going off reading the lake, registering new tables, creating Delta tables and that kind of stuff. Generally this is a querying layer: in the Data Science & Engineering workspace we do a lot of work bringing data in, transforming it, getting it ready and registering those tables with Hive, and at that point I can come over here and they're available for my analyst users to query.

So I've got my SalesLT database, I've got a load of tables in there, and I've got a nice explorer: finally I can just visually see a list of tables and columns while I'm writing a query. We used to have the describe functions in normal Databricks and normal Spark, so I could do SHOW TABLES, then DESCRIBE TABLE, and I could get there, but trying to write some SQL and then having to stop, go "what was that column called", check, run it quickly, come back, was just painful. Now we've got type-ahead, completing as we go. If I type FROM, I get saleslt and my list of tables, pick product, great, and it's straight type-ahead SQL. "Now what was it I wanted: my Name, my ProductNumber," and it'll add them in for me, building out the query. None of this is mind-blowing if you're coming from a SQL background and you're used to Management Studio, Azure Data Studio, or any of the normal SQL IDEs; this is the minimum acceptable level of SQL tooling. But the important thing is we didn't have it before. Previously I'd introduce people from that background to Databricks, going "look at the stuff you can do, it's amazing", and they'd say "it doesn't even do auto-complete, it doesn't type-ahead SQL", and it was a real barrier to adoption. All of this coming in just makes it an acceptable environment for those people.

It's not yet finished; this is still preview, and there's still work happening. I can still hit the one thing that grates on me: I can alias things and that works, but then the type-ahead doesn't yet complete on an alias. It will, it'll get there. There are a few little niggles and quirks, but it's still just a much nicer environment for doing your querying, for working in a SQL kind of way, and there's lots and lots in here.

So now I've got a dataset: I've written a query, hit Execute, and it ran on my business intelligence endpoint. I can see the results and download them, which is nice. There are drop-downs for which of my various databases I'm using and which table I'm exploring, and I can filter my tables, which is quite handy: "only show me things that look like this". And I can do parameters; I'll do a separate video on getting parameters and widgets working, but you can parameterize the queries, which is great.
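To show the shape of the kind of query being built up with the type-ahead editor, here's a quick standalone sketch using Python's built-in `sqlite3`. The `product` table and its rows are illustrative stand-ins for the AdventureWorks SalesLT.Product table, not the real schema:

```python
import sqlite3

# In-memory stand-in for a Hive-registered table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (Name TEXT, ProductNumber TEXT, Color TEXT)")
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [("HL Road Frame", "FR-R92B-58", "Black"),
     ("Sport-100 Helmet", "HL-U509-R", "Red"),
     ("Sport-100 Helmet", "HL-U509", "Black")],
)

# The editor's type-ahead builds up exactly this sort of statement:
# pick the table, then pick the columns you wanted.
rows = conn.execute(
    "SELECT Name, ProductNumber FROM product WHERE Color = ?", ("Black",)
).fetchall()
for name, number in rows:
    print(name, number)
```

Each saved statement like this becomes one "query" asset in the workspace, on top of which visualizations are layered.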
There's also a formatter, so I can just smash SQL in, hit go, and it'll tidy it up; I don't need to worry about how I write. I can turn autocomplete on and off, and it's currently applying a limit of a thousand rows to results. Most importantly, most of the time when we're in here we write a query, get some results back, and go "I need to visualize that". So, just so we have some data (it won't make much sense), I'll pop a GROUP BY in there, format it again because I wrote it badly, and now I've got something I can put a quick visual over. I hit Add Visualization and get the visualization pop-up. At the top I pick a chart: "I would like a chart, please." This gives you a lot of the same options we had in a normal Databricks notebook, so if you've got visualizations you normally use and quite like, you've got them here. Then there's a whole ton of other, more advanced visualizations baked in, and some of them are fantastic. Funnel views are still quite cool; my data's not at all the right shape for one, but it's the thing where you see data stepping down and down and you understand how things move between stages. We can do cohorts and relationships, and all the mapping stuff is great. There's Sankey; I love a Sankey diagram. It's the Google Analytics style of "this many people landed on your page, this many dropped out, this many went to the next page", the "how did all these people go on a journey" view, and it comes out useful in so many different places. There's sunburst, basically a pie chart with more rings fanning out, and word cloud; interesting stuff that actually comes up fairly regularly.

Now, when you're happy with a visualization, there's an Add to Dashboard button, so I can take it and add it to a dashboard. That's where dashboards come in, so let's go over there. I've got a dashboard already set up, so we can click in and see it: a map, a word cloud, various things, all AdventureWorks data (it's SQL, we had to use AdventureWorks). First, if I hit Refresh, it goes off and runs each of those saved queries; and if several visualizations hang off one saved query, it runs the query once and updates all of them, rather than running once per visual, which is some interesting behaviour. I've also got an image in here, and that's just based on markdown, the same markdown we use in a Databricks notebook. I lifted and shifted it from one of our training courses, grabbed the image and popped it in, so any markdown content can go inside your dashboards.

We can also set a refresh schedule. I've just hit refresh manually, but I can say "refresh every hour" and it'll automatically trigger: it turns the endpoint on if it's not already on, refreshes my dashboard, and turns it off. The most interesting consequence is that I can access my dashboard when the endpoint is not turned on. I can have this run every night, and then users log in, look at the dashboard, and explore the cached data in the dashboard without the cluster being on. So you can turn a thing on, update all of your reports, turn the thing off, and that's a super cheap way of exposing a load of data.
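The refresh-then-serve-from-cache pattern described here is simple enough to sketch in a few lines of plain Python (my own illustration of the idea, not Databricks code): the scheduled job is the only thing that touches the billable endpoint, and dashboard viewers only ever read the cache.

```python
import time

cache = {}  # last materialized results per saved query


def expensive_query(sql):
    """Stand-in for running SQL on the (billable) endpoint."""
    return {"sql": sql,
            "rows": [("Bikes", 120), ("Helmets", 45)],
            "refreshed_at": time.time()}


def scheduled_refresh(queries):
    # Runs on a schedule: endpoint on -> run each saved query once -> endpoint off.
    for name, sql in queries.items():
        cache[name] = expensive_query(sql)


def view_dashboard(name):
    # Viewers read cached results; no cluster needs to be running.
    return cache[name]["rows"]


scheduled_refresh({"sales_by_category":
                   "SELECT Category, COUNT(*) FROM product GROUP BY 1"})
print(view_dashboard("sales_by_category"))
```

The cost saving falls out of the structure: `expensive_query` runs once per schedule tick, however many people call `view_dashboard`.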
And there are lots of other things we can do in here: normal highlighting, visualizing, drilling down, lots and lots of things. We'll talk more about that in future videos, but it's quite nice.

Okay, so I've been running some queries, which means I've got History, and I can see a rundown of all the different queries people have been running recently and how long they took. "That one took 30 seconds, why did it take so long?" You can drill in, see the SQL that ran, see who ran it, and dig into the details to get the query plan breakdown of what actually went on inside that query, so you can understand why it was such a terrible one to run. That's quite a lot of detail, but this high-level admin view of "here are the SQL queries people have been running" hasn't really been available before: it's quite hard to see in the Spark UI, and there's no obvious place there to ask for the longest-running queries. Now we can query the history, filter it down, and ask "why did that one break, that one had an error". Oh, it's because this guy Simon can't write a query and put the wrong ID in. Fine, that's not our problem, just a little user error; maybe we'll reach out and teach them how to actually write SQL. Normal stuff. Again, not mind-blowing, plenty of tools have this kind of thing; it was just something missing inside Databricks, and having it makes Databricks that much more acceptable for that warehousing style of working.

Final thing. Say I'm running my product sales query on a regular basis. Right down at the bottom there's a refresh schedule: run every day, at this time, never end, keep going, thanks. That's fire-and-forget; I just don't need to think about it, it's running. But if I ever actually want it to tell me about something, I can come in and set an alert. I want an alert on this query that's got a regular refresh (it needs to be a regularly refreshing query for an alert to make sense; otherwise you'd never see anything). Trigger when: I pick a column and a condition, greater than or less than, say when the sum of that value is less than 10, because I'm expecting at least 10 records to come through here. Then, how often should it tell me? If it runs every day, should it tell me only the first time the threshold is crossed and then stay quiet because it's told me once, or tell me every time, constantly poking me with "this is broken, please fix this"? So that's data-based threshold validation alerting, which is great. We can also decide what the email looks like: because you can edit the template, you could have it raise an issue in your support queue, sending an email to your support team saying "this is a validation failure, can you look at it please", automatically tagged with the references it needs to land in the right bucket. Lots of different things we can do there. Then we create the alert: who should it go to (I've only got one user in here, so it goes to me), trigger when the condition's met, and there we go, I get alerts. Super interesting, useful bits and pieces.
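The alert logic described here boils down to a threshold check plus a "notify once vs. notify every time" switch. A minimal illustration in plain Python (my own sketch of the behaviour, not how Databricks implements it):

```python
def check_alert(rows, column, threshold, state, notify_every_time=False):
    """Return True if a notification should fire for this refresh.

    rows: list of dicts produced by the scheduled query
    state: mutable dict remembering whether we already notified
    """
    total = sum(row[column] for row in rows)
    breached = total < threshold  # e.g. "sum of the value is less than 10"
    if breached and (notify_every_time or not state.get("notified")):
        state["notified"] = True
        return True
    if not breached:
        state["notified"] = False  # healthy again: re-arm the alert
    return False


state = {}
rows = [{"order_count": 3}, {"order_count": 4}]        # sums to 7 -> breach
print(check_alert(rows, "order_count", 10, state))     # fires the first time
print(check_alert(rows, "order_count", 10, state))     # suppressed: told you once
```

The `state` re-arm on a healthy run is what makes "tell me once" behave sensibly: you hear about a breach when it starts, not on every scheduled refresh while it persists.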
So there's a lot of stuff in there, and obviously that was just a whistle-stop, high-level tour. Essentially we have this new environment, all geared towards people writing SQL queries and exploring data through SQL. It exposes the Hive data, so it's all assumed you're using Hive for your databases and tables, and you can secure those Hive tables using the normal table access control security. It lets you build queries and multiple visuals on top of those queries using that nice new SQL editor, then save visualizations into different configurations of dashboards, each with its own refresh schedule. It gives you a history for exploring the queries that have run and their performance, and it gives you the ability to set alerts: "notify me if one of my queries brings back a result set that meets these criteria", so I can do something about it.

Interesting stuff, and it's going to be interesting as it matures, as we see more of the Photon engine, and as we see more visualizations coming through. One thing that's definitely missing for me, if you were trying to push it as a real dashboarding tool (which, to be honest, they're not; they're pushing it as a SQL analytics environment), is cross-filtering and interaction between your different charts: clicking on one visual should filter the others. And theming: we're Advancing Analytics, I want my themes, my colours, all that kind of stuff. There are a few bits and pieces missing for it to be a premium BI tool, but again, that's not what they're going for. Having those endpoints, and especially when multi-cluster load balancing comes in, the ability to be the endpoint for thousands of users in Tableau, Power BI and Qlik environments, all serviced from here, is actually much, much better than it's ever been.

Which again ties into the whole lakehouse approach. If you're saying "I want a lake that is managed and governed and has a ton of useful stuff in it, and I want to use it to serve analytics to the rest of the business", these were exactly the features that were missing. Delta enables a ton, just having these data models and being able to query them, and connecting in and querying Hive was okay, but all of these features, all of this direction, make that story stronger and stronger. The journey's not over; there's plenty more that can be done and plenty more we'll see in that evolution, but the direction they're going and the stuff they're putting into it is heading the right way for me. If we're talking about analytics, about people doing SQL, about people exploring the data you've stored in your lake, these are all fantastic tools lining up for it.

So that's the end of my little tour of the new Databricks SQL Analytics environment. Let me know if you've got any questions or any areas you want to dig into. I'm going to be doing spotlights on some of the more interesting advanced visualizations, how you need to shape your data to get them to work, the features in there and how you use them, all that good stuff; just let me know what you'd like to see. Otherwise, hope you enjoy the rest of the Data + AI Summit, and we'll see you soon. Cheers!
Info
Channel: Advancing Analytics
Views: 7,535
Rating: 4.8974357 out of 5
Id: 3vsbpwB6Kvs
Length: 25min 38sec (1538 seconds)
Published: Wed Nov 18 2020