Data Collab Lab: Super Powered PowerBI with Databricks SQL

Video Statistics and Information

Captions
Nice and humid today — good afternoon from sunny skies in North Carolina. And my name is Tori; I'm from Lake Tahoe, Nevada, I spent the last seven years in Kansas City, and now I'm in New York, so nice to meet you all. I'll tell the story of my squirrel when I give my introduction during the agenda, because it's a funny story and you'll likely see it through the rest of the presentation — there are people in the chat who are really interested in the story. [Laughter] I'm from Chicago, born and raised in Chicagoland — I was born in the city and now live in the suburbs — and it does get really cold here, unlike Max, who's got the nice weather down there. And Tori, I didn't know you'd moved around a bit — you were born in Lake Tahoe? Yes, I was the luckiest child to grow up there; it was amazing — you could ski and sunbathe on the same day in July. It's one of the prettiest places. My twin sister actually still lives there; she's a mechanical engineer and is building her own cabin. We're so different even though we're twins — we look completely different too. You have a twin? I do. That's awesome — she's an engineer? Yes, a mechanical engineer, and her demo is her own house. I keep asking her on Sundays, "What are you doing?" "Oh, I have a hammer, I'm rebuilding something." And I'm like, "Great, I'm at happy hour." That's a great picture — her hammer is kind of like you taking a hammer to a database, writing some SQL, building some tables. You both have hammers in different ways; SQL is definitely your hammer. Not until you started taking vacations at the lakehouse — now things are just splendid. I see what you did there; that was a slick transition, Tori. Somebody in the chat — Harry — said, "One sister does software, the other sister does hardware." That's awesome. All right, I just have a few intro slides, so I'll run through those real quick and then we can get to the good stuff. For the panelists, I put the LinkedIn and YouTube links in there so we can keep an eye on the chat spaces.

Okay, let's get started. Welcome, everyone, and thank you so much for joining us. This is another Data Collab Lab session, and this time our hosts are Franco and John. The session is about super-powered Power BI with Databricks SQL, and our guests today are Tori and Max from Databricks. A quick reminder that this is part of the Data + AI Online Meetup group, so if you're joining us from LinkedIn or YouTube, I'd love for you to check out the group and join us there — that's where we post all of the live events we stream, and everything is recorded too, so you can revisit it on YouTube at a later date. We have a Data Collab Lab playlist, and I'll drop those links in the chat spaces for everyone. The meetup group is where you'll get notified of upcoming events so you can put them on your calendar. As always, subscribe to our YouTube channel and follow us on LinkedIn to stay up to date on all of our announcements and the content we're putting out. I also wanted to give a shout-out to Denny and Brooke, who have a podcast series called Data Brew by
Databricks. They're in their second season already, and this season is all about machine learning, from research to production. They have a lot of awesome guests — I think eight or nine episodes this season — so I invite you to check it out. I'll drop the links in the chat on YouTube and LinkedIn, and the show is available wherever you get your podcasts: Spotify, Apple Podcasts, all those places. I also wanted to share that we have a guest blog program, so if there are any writers out there, or aspiring writers, interested in sharing what you're working on, I'd love to hear from you. I helped put this program together and I'd love to hear your ideas — check out that page on our website to learn more, or ping me on LinkedIn and I'm happy to share suggestions for topics we'd like to hear about. Just a quick bit of housekeeping: people like to ask questions in the chat spaces, so we'll be monitoring those while we go through the presentation and answering some questions live. This session is recorded and will be available on YouTube pretty much right afterwards — it's the same link we're live-streaming through, and I'll drop it in the LinkedIn and Zoom chats — so you're welcome to revisit the session at a later date. Without further ado, I'd like to pass it to our hosts, Franco and John.

Thanks for having us, everybody. My name is John O'Dwyer; as I mentioned before, I'm on the developer relations team here at Databricks, and I'm from Colorado — I've been living here for about 25 years. Thanks for joining us. Over to you, Franco. Hey everyone, thanks for coming along today. My name is Franco Patano, and I'm a solutions architect here at Databricks, in the Chicagoland area. My background is data warehousing and BI, so this episode is really close to my heart, and when I started seeing what our guests Tori and Max were working on, I said we have to do a Data Collab Lab episode on this. So without further ado, Tori, do you want to introduce yourself?

Hi, my name is Tori. I've been at Databricks for about a year and four months, and, similar to Franco, most of my career so far has been in analytics and data warehousing. Personality-wise, my first boss — the director of strategic development — told me that I'm like a squirrel, and he actually bought me a taxidermy squirrel that I still keep today. The reason he did, as you're going to find out, is that I get very excited about pretty much any opportunity, and the same with BI and any report — I get this crazy look and then I talk really fast. Squirrels, just so you know, cause about 20% of power outages because they're really impulsive, and they also lose about 20% of the nuts they try to stash, so even when I find opportunities I often get distracted. When he bought it for me he said, "Be less like a squirrel, but remain yourself." It stuck with me, so you're going to see a squirrel for the rest of the presentation. It's really nice to meet you all. Max, over to you.

Wow, I don't think I can beat that introduction, but I will try to just be myself — I think that's the only thing I can do after
that, Tori. My name is Max Nienu, and I'm a solutions architect here at Databricks — I've been here two years. Prior to Databricks I was a data engineer; my career has been in data warehousing. I started out as a software engineer, but I quickly realized my love for data and I've been in data ever since. I'm really excited to talk to everyone here today because, if you know me, I'm a self-proclaimed data nerd.

Awesome. Let's set it up — we've got a story, right, Tori? Who's going to take us through the beginning of it? I am. Max, if you'll share your screen, we'll go through our slides. To start with the agenda, so we keep on track (as you know, I am a squirrel): I'm going to begin with a summary of our project. I've just introduced who I actually am, but for this project and presentation I'm going to be the lead analyst for our company, fantasydata.ai. I'll give a summary of our problem — our original statement of what and who we are — and how we delivered on our product analytics journey. Then we'll get into an overview of the lakehouse with Max. A lot of you on the call may already have Databricks or know what it is, but we always like to level-set on the foundation and on what Databricks is today, because we've evolved, and we've evolved fast. Then we're going to do a hands-on workshop with Databricks SQL, and after that I'll finish with our Power BI integration. I've had a lot of experience in Power BI, and I'm going to hone in on two features: incremental refresh, and what's about to be released in September — NQS, or native query support — which is going to be really exciting for any of the Power BI users on the call. Go ahead, next slide.

This is a fictitious company, so if you search for it you're not going to find anything except probably fantasy sports, but fantasydata.ai, for our purposes, is a financial services company focused on innovative banking experiences in the fintech space. We specialize in rewards programs, so for this use case we're going to focus on the transaction data and the events within the application. Prior to using Databricks, fantasydata.ai relied on unreliable tools — similar to what I've used in my own background, like MongoDB — which, when they broke, slowed down the analysts' process for creating new reports and created a real headache for engineering in keeping that data accurate and available. The Databricks Lakehouse Platform significantly improved the quality, reliability, and performance of our pipelines, and for analysts, downstream consumption became about four times faster while handling two times the data. I'm using a fictitious company name, but this mirrors an experience I had with a customer this last quarter. In addition, enabling our analysts to easily query JSON structured data resulted in a lower TCO, and I'm going to show you that hands-on as we go forward. So what benefits did we get with Databricks, especially as you're about to see the hands-on interaction between me and Max, from engineering to analyst?
When it comes to increased productivity, we always talk about the lakehouse from a performance aspect and about being able to deliver all of your workloads within one platform, but we've also really matured the IDEs for the different personas so they can work productively and still collaborate. What we found after this project is that we reduced the effort to maintain and provision our infrastructure — Max, buddy that he is, is managing 20 streaming pipelines and 170 tables — and we increased reliability across all of our streams, from Spark to Delta, with two times data growth in the last month, because our business is doing fantastic and people are using their cards again now that COVID is winding down. And we improved our delivery times. There are some metrics on this slide to give you a sense of the overall benefit, not only in productivity and TCO, but in what I was able to deliver faster: if you think about product analytics across application development, engineering, and analytics, it was usually a one- to six-month time frame to deliver a project like this, and that was just getting the data available to the end analyst so they could do their proof of concept. Instead, I was able to deliver in weeks: risk management, customer cash back and credit rewards, and our customer 360 journey, directly to our CEO, CRO, and other executives.

Can I just draw something out here? I think the underlying point to take away is that in today's world JSON is becoming a really prevalent format for transporting data between organizations — REST APIs and websites generally give you JSON back — so the need to manage JSON data keeps growing, and it's not limited to this one specific use case. A lot of different companies and departments are now receiving JSON data, and there's a lot of lift in taking JSON and actually making it queryable. That's the big thing to pull out of what Tori is describing: you can start from whatever JSON you're getting — this is applicable anywhere — and get to valuable insights much, much faster. Max, do you have any comments? Are you seeing JSON data a lot more nowadays? We definitely are, and we're going to show you how Tori and I worked through it. Using Databricks SQL, Tori is able to query the data on her own, without me having to explode all the nested JSON fields for her. Databricks SQL has given her the autonomy to take a lot of the tables I built and run her own product analytics on top of them without me in the loop, and that's helping her iterate a lot faster and do her job a lot better.

Before we get into how we did all of this and show you the new slick stuff, I want to walk us back in time to when we had data warehouses. Back in the day, data warehouses held a lot of data — you would ETL your data from your operational data stores into the warehouse and then build
some reports on top. This took a long time, and those data warehouses had a lot of problems: if you needed to expand your footprint, you basically had to go buy more servers; it took anywhere from six months to a year to build a data warehouse out to full functionality; and it took a lot of manpower. One of the problems with the data warehouse was that it couldn't handle much unstructured data like JSON, and so along came data lakes. Data lakes were very popular — they were very good for machine learning and for semi-structured and unstructured data — but the more we used them, the more problems we found: the complexity of actually operating a data lake, and the difficulty of getting data back out, because data was simply dumped in there, creating what we call data swamps. If you've lived through a data warehouse and a data lake, you're probably thinking back to all the things you had to do to get data out for your analysts or for the business. I get PTSD thinking about everything I had to do in my data warehouse and data lake days, and that's why I'm really excited to talk about the lakehouse.

The lakehouse takes the good things from data lakes and the good things from the data warehouse and puts them together, powered by the Delta engine that Databricks has built. We're able to serve unstructured, semi-structured, and structured data to machine learning, data science, and BI use cases, and that's what powers the things Tori is going to show — your customer reports and your product analytics at large scale. What we've seen is that coexistence is not a desirable strategy. Today, organizations want to keep both a data warehouse and a data lake, but the lakehouse strategy is the better way to go: you can do most of what you do in your data warehouse because Delta gives you ACID transactions — you don't have to keep a traditional database management environment just to get ACID transactions; you get that out of the box with Delta. We have optimizations under the covers to help with performance, so you don't have to create indexes or maintain separate strategies for reporting reads versus machine learning; all of those optimizations are inside your lakehouse. So instead of coexisting between the old architecture and a data lake, we want you to adopt the lakehouse strategy, because we've combined the good from both and we're giving you a new paradigm that lets you do everything you need to do across your data lake and your data warehouse. Today, though, our focus is the BI use case. From the ground up, we handle your semi-structured, unstructured, and structured data using Delta Lake, and while the focus here is not Delta itself, Delta Lake is what
powers the lakehouse. We deliver high query performance and a single platform for your machine learning, data science, and streaming analytics, but today we're going to talk about your BI use cases, because we think BI is going to unlock the power of your business. So how do we get started? The way we like to think about this is with what we call the medallion architecture: you land your raw data, filter that raw data into what we call the silver layer, and then move to a gold layer. In data warehouse terms it's essentially your raw data, your staging, and your reporting tables. This curation flow lets you take your data as it arrives, filter it, clean it, put dates and filters on it, and aggregate it to the point where your business users can run queries on top of it. We've seen a lot of customers use a Kimball methodology for their gold tables, simply because with the power of cloud computing it's easy to create all of these tables. Then I'm going to show you Databricks SQL — or rather, Tori is going to walk us through it — which is built with a native interface for BI and SQL analysts. It gives them an interface they're used to, so they can easily run queries and do all the good stuff you do with SQL. Tori is a SQL person, and she's going to talk you through a lot of this; I'll let her show you the magic she does every day to get our product analytics to the C-suite. So what we're going to do next is walk through how we go from raw data to data that's prepped and ready for our SQL analysts to query, and since Tori is our lead data analyst, she's going to walk us through a lot of that.

There's a question that came in that I want to clarify: it's not just JSON that's supported. For Databricks SQL, Delta Lake — which is Parquet with a transaction log under the covers — is the primary format we advise, the best way to use the lakehouse. What we're walking through here with JSON is one common example, but JSON is not the only thing supported — there's a long list: CSV, text files, anything Spark can read. Just wanted to call that out. Thank you.
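For readers following along, a minimal sketch of the bronze/silver/gold flow described above might look like the following. The database, table, column, and path names are hypothetical stand-ins, not the presenters' actual demo objects, and the raw files are assumed to be JSON strings landed in cloud storage.

```sql
-- Bronze: land the raw payloads as-is (one JSON string per record).
CREATE TABLE IF NOT EXISTS demo.events_bronze
USING DELTA
AS SELECT value AS raw, current_timestamp() AS ingest_ts
FROM text.`/mnt/landing/events/`;          -- hypothetical landing path

-- Silver: parse the payload into typed columns and clean the timestamp.
CREATE TABLE IF NOT EXISTS demo.events_silver
USING DELTA
AS SELECT
  raw:event_id::string             AS event_id,
  raw:user_id::string              AS user_id,
  raw                              AS attributes,      -- keep nested JSON for exploration
  to_timestamp(raw:event_time)     AS event_ts,
  to_date(to_timestamp(raw:event_time)) AS event_date
FROM demo.events_bronze;

-- Gold: aggregate to the grain the business reports on.
CREATE TABLE IF NOT EXISTS demo.events_gold
USING DELTA
AS SELECT user_id, event_date, count(*) AS events
FROM demo.events_silver
GROUP BY user_id, event_date;
```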
I'm going to walk through a high-level example of how we think about data from ingest to actually getting it to the business. Here's a wind turbine predictive maintenance model: our wind turbines push data into a message queue, and Databricks ingests that data into our bronze layer. The data comes in as JSON files, and we clean and parse those JSON files into aggregate tables in our gold layer. Sometimes we do have fields that are nested JSON, and what we'll show at the end is how, when I pass these tables to Tori for product analytics, she can do the rest of her analysis by querying the JSON fields directly. That's the power of Databricks SQL: querying tables that have never been pushed through a traditional database management system — these tables are still sitting on your storage layer, be it S3 or ADLS. So first we do our ingest. Here we're streaming, with data arriving every ten seconds, and once the data comes in and we sanitize it and look at it, we can see we have key-value pairs. We parse that data into our silver table, which splits the payload into individual columns — we have our ID and our timestamp at the end. Then we do a join, add some columns — we add the status column for our gold table — filter the data, and aggregate it by ID. Once I have this final gold table, I know it's the table Tori needs, so I hand it over to Franco, the admin on our system, to give access to Tori, our lead data analyst. And instead of me running a SQL GRANT statement, we're going to show how to do that inside Databricks SQL. Franco?

Thanks, Max. I'm going to take this opportunity to show you a brand-new tool available in Databricks SQL — you might have seen a new option pop up in your workspace, and if you haven't, I'm about to show it to you. In the left-hand pane there's a new Data Explorer. Data Explorer lets you traverse all of the databases in the metastore you have registered to Databricks SQL. In this example we have the database Max created in his previous step — he's done all the ETL to get the data into those structures — and now we need to grant Tori access. As Max said, you could run GRANT statements in SQL, and that's fine, but is there a better way with Databricks SQL? There is. In the new Data Explorer we can select the database, look at the tables, view the schema of each table, and even see sample data — but most importantly, we can set permissions through the UI. So if I need to grant permissions to Tori, I go into this screen, look up Tori's ID, grant her the appropriate permissions, and click OK. Now Tori can access the data Max just provisioned, and I, as the administrator, have a very convenient UI for processing requests for data permissions. With that, we'll take it back over to Tori — she now has the data and the access she needs to be successful.
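As Franco notes, the UI flow is equivalent to SQL GRANT statements; a minimal sketch under the same hypothetical names (database, table, and user are made up) might look like this:

```sql
-- Grant an analyst read access to the gold table Max produced.
GRANT USAGE  ON DATABASE demo             TO `tori@fantasydata.ai`;
GRANT SELECT ON TABLE    demo.events_gold TO `tori@fantasydata.ai`;

-- Check what that principal has been granted.
SHOW GRANT `tori@fantasydata.ai` ON TABLE demo.events_gold;
```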
Tori, over to you. Okay, can you all see my screen? Yes. I'm super excited as the lead analyst, because my desk is actually within the business — I sit right outside the CRO's office — and I got this Slack message from Max in a week, instead of the months I was used to at previous companies, saying the data is available, and I'm thinking, is this real? One of the issues I had, having just gotten access to Databricks, is that I'm used to Oracle from a previous company, and to Microsoft, and I wondered: Spark SQL — how different is that? Max sent me a great starter query in Slack and said, "Just pop this into the query editor" — I'd done a little tutorial on navigating Databricks SQL — "copy it and run it," and he even gave me some markdown. To give anyone on the call who isn't familiar with Databricks SQL some context: building on what Max just showed you, as Databricks has matured we've realized that different personas interact with data in different ways, and before, when all we had was a notebook, we were trying to fit — I'm going to get this wrong, that's my squirrelness — a square peg into a round hole for everyone who had to work through the notebook. So we've made the experience for each persona much more specific to what they're trying to do, without limiting collaboration or the ability to switch between languages on the platform. To get to the SQL workspace, there's a drop-down in the top navigation; I click Databricks SQL, and this is my home in Databricks. For the first thing I'm going to pop into, I'll assume I already have one of these clusters — a SQL endpoint — running. Max copied and pasted this query for me in Slack; I'm not going to show you my whole Slack — I've already saved it — so bear with me while I go to my queries. You'll see one here called "event analysis." He was so nice: because he knew I have experience with Oracle and Microsoft, he showed me the dot notation and the simple colon notation we have as of 8.3 for un-nesting the attributes within our bronze or gold table. When you think about product analytics, an action — an event — happens within an application, and then you have all of the attributes associated with that action. Most of the time those attributes are managed by application development, and it gets really sticky when you have both a mobile application and a web application, because then you have iOS, Android, and web, and they're generally not consistent across the three unless you pay for something quite expensive like Google Analytics 360.
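A minimal sketch of the dot and colon notation Tori describes, against a hypothetical events table whose attributes column is a JSON string (all names and paths are made up for illustration):

```sql
SELECT
  event_id,
  event_ts,
  attributes:action                           AS action,          -- top-level field
  attributes:context.page_name                AS page_name,       -- nested field via dot path
  attributes:request_params.user_id::string   AS user_id,         -- cast with ::
  attributes:items[0].sku                     AS first_item_sku   -- array element
FROM demo.events_silver
WHERE attributes:action = 'card_swipe'
LIMIT 100;
```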
Can I ask you a question real quick, from something in the chat, that I think you'd be good at answering? Why do you want the nested data structures? Why not have the data engineer parse them all out for you and create a tabular structure? Why do you want to be empowered with the tools to un-nest and parse it yourself? That's a great question, and it brings me back to why I'm excited that there is a nested column, beyond just the notation. If you think about all the actions that happen within an application, that set grows large and it changes — if they even change the name of a button, that becomes a new attribute within that action. Max being able to lean on schema evolution in the exploratory phase of the refinement pipeline, from bronze to silver, means he doesn't have to manage the unification of that data for every feature application development ships, which is really difficult — somewhat impossible across those three apps. On the consuming end, the engineer doesn't want to un-nest all of those attributes into columns, because I likely only need three of them, and I have to explore to find out which. The way analysts usually do this is by going to the application, giving themselves access, filtering down to their own user ID, and clicking on things — because, as you can see here, most of the time what's displayed in the application isn't intuitive from what lands in the data. From an exploration standpoint, it's much simpler for me to search and scroll through the attributes within a single column than to go through dozens of columns looking for the right value — that's speed of productivity. And from a performance perspective, since I don't need most of them, having all of those attributes un-nested into columns is unnecessary, even as a table you'd feed into Power BI or any downstream application. Is there anything you'd want to add to that, Franco or Max? I'll just say that I'm looking at your screen and I can't believe that query returned in 256 milliseconds — this is the kind of experience you'd get out of a data warehouse, but we're not in a data warehouse, right? Yes, exactly. And I'll add: if you thought about doing this in Power BI — even from a point-and-click perspective, Power BI does have the ability to un-nest JSON — it's really slow, like doing it in Excel on your desktop, and even when you make a selection it still un-nests all of the attributes. As you can see here, this can grow to maybe 50 attributes associated with an action. If you did that in Power BI across the history of all your events, even within one year, you're looking at six hours of the spinning wheel of death on your desktop. So not only from a speed perspective but from a discovery perspective, you have the ability to do this in your SQL IDE in the cloud directly. Thanks for highlighting what Databricks SQL brings to the table — this is why Tori is so excited to have Databricks SQL in her
hands: now she can do a lot of the things she's always wanted to do without me being in the way. All I do is say, "Here's your data — go figure it out, play around with it, and decide what to do with it." I know Tori is about to show us her magic, so I can't wait. Right — and consider the collaboration between me and Max: he knows I don't know the syntax differences between Oracle and, say, Spark SQL, so he's already given me a bit of a guide, which is what analysts usually look for anyway. Think about how you work today: you search Reddit or Google for "how do you do this" rather than going straight to the documentation that shows you the exact notation. He sped me up by giving me the actual Databricks doc for the new JSON operators, and he's already shown me a huge time-saver: "Don't get too worried — I know I've explained this to you, but let me give you this query." He's un-nested some of the attributes within the request parameters, selected a parent field, and shown how simply I can cast it to whatever data type I need, because again, this is the exploratory phase. Any time you're in the exploratory or POC phase between an engineer and an analyst, that "QA" period — when you hear quality assessment, it's always the period that takes a year and is really just dev plus clarification and discovery on the analyst side, and vice versa. This speeds up that QA and exploratory phase by putting the onus, and the ability, on the analysts to do their own POC work. He showed that to me and said, "I know you don't know what you need yet, and I'm aware of how difficult it is to find, validate, and keep track of your application events when you're doing data analysis, so let me show you what I've understood so far and how you can explore — and if this wasn't the right thing, I'm sorry; I'm an engineer and I don't know exactly what you're trying to do or what you're iterating on in your mind for the business, so I want you to be able to do that yourself." So I can go in here, drill down to my user ID, easily search across those values, and find out, "Oh, that's what it's called — the ARN for AWS IAM," which is not what it's called on the screen in the app when the customer uses it. That gives you an overview of that query.
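A sketch of the kind of exploration Tori describes — filtering to her own (hypothetical) user ID and counting distinct action values to discover what an event is actually called in the data; table, column, and ID names are assumptions, not the demo's real objects:

```sql
SELECT
  attributes:action::string                    AS action,
  attributes:request_params.resource::string   AS resource,
  count(*)                                     AS hits
FROM demo.events_silver
WHERE attributes:user_id::string = 'tori-1234'           -- hypothetical user id
  AND event_date >= current_date() - INTERVAL 7 DAYS     -- only my recent clicks
GROUP BY 1, 2
ORDER BY hits DESC;
```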
Compare that to doing this in the Data Science workspace, or to how you do it today with Tableau or Power BI: in Power BI you click through different data sources, the ODBC connection pops up, and sometimes even while searching you're still getting spinning wheels, depending on your computer and what your business has allotted you. Here you have a really easy interface, based on the data you've been granted, right where you're executing the SQL query: you can browse across the databases you have access to, and instead of having to find, copy, and hand-write all of the column names — which, thanks to Max, have already been cast into names that are easier to write SELECT statements against — you can just click and it populates them into the editor on the right. That saves a bunch of time, especially for things you can't control, and it works for tables and databases as well. Then Shift+Enter does a quick execute. You can even run just a specific portion — let's say you're not sure whether part of it is right (hold on, let me make sure this is right... there we go — sorry, this happens when I execute live): you can select part of the statement and run only the selected text. And this is a great point to add: when you have these demo failures, or you're working through code in SSMS and realize you were wrong — or that you were right three edits ago — Databricks SQL has what I think is the most amazing feature carried over from the Data Science workspace: you can go back to the version of the query that executed correctly and run it from there. Before, in something like SSMS, you had to keep track of that on your own. That has saved me in many demos.

So that's an overview of the democratization you get for JSON and for exploring your data. But let's think about it from a lead analyst's perspective: how do you proliferate this to the 50 analysts sitting at different desks across different departments, with different levels of comfort and knowledge of Databricks SQL? Max empowered me as the lead analyst, and I've learned that the syntax stepping stone isn't too difficult — it's a really small barrier relative to what I can now do for my organization. Now think about doing that for other analysts who may not know SQL as well, or the syntax differences between Microsoft and Spark. Going from Excel and Power BI to SQL: SQL is the easiest querying language for a business analyst, or a business in general, to learn, compared with Python or Scala. A really cool feature — one I sometimes gloss over, but I have a way of thinking about it — is query snippets. This is a feature you don't get in tools like Power BI. A query snippet is a bit like a UDF: you define a reusable piece of code and make it available for others to drop into their queries. I'm about to show this with that SELECT statement, and think about it for CASE WHEN statements too, because you're going to have CASE WHEN statements in product analytics — you need to rename not only the attribute columns but the actual action values inside those columns. I used to have to do that constantly. To give you perspective: whatever an event is labeled in the actual application, you need to clean it up so that the CEO or CRO sees it in a way that makes sense for what they're trying to do.
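The kind of CASE WHEN relabeling Tori describes saving as a query snippet, so other analysts can insert it and adapt it — the event names and friendly labels below are invented for the sketch:

```sql
-- Hypothetical snippet body: map raw application event names to readable labels.
SELECT
  CASE attributes:action::string
    WHEN 'btn_cc_rewards_v2' THEN 'Viewed credit rewards'
    WHEN 'card_swipe'        THEN 'Card transaction'
    WHEN 'cashback_redeem'   THEN 'Redeemed cash back'
    ELSE attributes:action::string
  END                                AS action_label,
  count(*)                           AS events
FROM demo.events_silver
GROUP BY 1
ORDER BY events DESC;
```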
CASE WHEN statements for marketing and other use cases are similar: hard to standardize with your engineer, and ultimately they have to be managed and evolved by the analysts. With snippets, you create that piece of code, give it a name that makes sense for what you're trying to do — I'll just save this one as "json template sql" — and you can even give it a description that makes it more discoverable. Then you go back to your analysts and say: okay, analysts in 50 different business units across the organization, if you want to understand the JSON, I've already given you the markdown pointing to the doc; I'm going to run a virtual Zoom meeting to get people excited about Databricks SQL, understand where they're having trouble, and get them excited about exploring the data in a much faster, more efficient way, because they're experiencing real pain on their current desktops. Then they can go in, start typing, and you'll see the word "snippet" in grey; selecting it populates the code from that snippet. Is that exactly what they'll end up using? No — it's there so they can see an example, use it, change it as they need, and rename the query to whatever they're exploring. You can fork a query, but that's where things get messy — with snippets you're just sharing code and trying to proliferate the example, and the excitement, of what you can do with Spark SQL.

This is great. The way you're articulating it is very similar to how I describe it: it's like Management Studio for the analyst, in the cloud, with a lot of extra bells and whistles. You can write your SQL, get your result, and then also add visualizations; and when you find the value — when you've built the SQL that gets you what you need — you can send it back to the data engineer and they can put it into production or take it wherever it needs to go. There are some questions coming in about Power BI — we are superpowering Power BI here, we're not just showing you Databricks SQL, so everyone please hold on, we are getting there. Yep, and I'm about to show that. Just a two-minute aside first: you do have full visualization capabilities as well as condition-based alerts in Databricks SQL, and why is that exciting as a way to augment your existing Tableau and Power BI investment? Because there's no per-user licensing cost associated with Databricks SQL — you pay for what you use. With product analytics, a lot of your product owners or your head of product need to see prototypes or internal views that don't have much business value for executives but are still super important, and it's often hard to get licenses approved for that — an unnecessary cost.
You have the ability to build all of those dashboards and that functionality for other use cases — product analytics, data quality, and so on — in Databricks SQL without that additional cost and without having to learn a new tool. So, to supercharge it, I'm going to go back over to Power BI and talk about the two features. Is there anything else you want to add about Databricks SQL before I jump into that? Absolutely not — you hit on all of the features. I love snippets, the SQL editor, and all the new parsing techniques you can do; those are excellent features. Thank you — I'm ready to do my engineering in Databricks SQL now. Yes — and I always tell people, from a collaboration perspective: right tool, right time. A lot of engineers are excited about Databricks SQL too, because it works out of the box — you don't have to build your own visualizations in R or any other language for things like audit log delivery or the underlying application data. The amazing thing about my engineer — because Max is also a buddy — is that he gives me an understanding of what's available in the data and what it means. For this specific schema, these are events happening within an application; they're not going to change. It's not like a microservice where a customer looks up a credit record and that record changes or evolves. When you think about that from an incremental refresh perspective — and I'm going to give an overview of what incremental refresh is — this is an easy use case, but it's easy specifically because Max, knowing that we get a UTC timestamp, has a consistent practice of giving analysts and data scientists a readable, formatted date and time, plus a version aggregated to just a date. He's already done that for me in the silver layer, knowing it's a cleansing step you'd need across any use case. When I show you incremental refresh, I'll talk about the different use cases and how you achieve it within Power BI, but the key point is this: the reason incremental refresh is hard to get right today is, first, query folding — which is hard even for me to understand, because it depends on your third-party application, Power BI, the vendor you're working with, Databricks, and what's supported and what's limited — and second, analysts often don't even know the difference between truncate-and-load and incremental refresh, because it's traditionally an engineering task. So if you're trying to do incremental refresh, or thinking about improving Power BI, the collaboration is the first thing to hone in on: it's much simpler and faster for your engineer to define the date that incremental refresh needs — they might already have it on the back end from how they refresh that table today — and to persist it, or just point you at the exact column you need.
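A sketch of the silver-layer cleansing Tori credits Max with — turning a raw UTC epoch into a readable timestamp plus a date column that Power BI can later increment on. The view, column names, and the assumption that the raw value is epoch seconds are all hypothetical:

```sql
CREATE OR REPLACE VIEW demo.events_silver_dated AS
SELECT
  *,
  timestamp_seconds(event_epoch_utc)                             AS event_ts_utc,      -- epoch seconds -> timestamp
  date_format(timestamp_seconds(event_epoch_utc),
              'yyyy-MM-dd HH:mm:ss')                             AS event_ts_readable, -- formatted for humans
  to_date(timestamp_seconds(event_epoch_utc))                    AS event_date         -- the column incremental refresh filters on
FROM demo.events_silver;
```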
But to give an overview of incremental refresh — and thinking about a case where your data does change, rather than being append-only transactions — why think about incremental refresh at all, in data generally but specifically in Power BI? For Power BI, the reasoning is simple. First, refreshes are faster: you really only want the most recent data, and it's redundant, unnecessary compute to refresh things that haven't changed. Second, refreshes are more reliable: every time Power BI refreshes a dataset it duplicates the model — because of course you don't want the end user's report to say "sorry, out of order, come back in an hour" — so the faster you make the refresh, especially as your data grows, the better for the end-user experience and for the cost of Power BI Premium. Third, it enables large datasets in the cloud for Power BI — I believe up to around 100 gigabytes, though I'm actually not sure of the exact limit. If you work with Power BI Desktop, you know that getting data into Desktop is the biggest problem, because you're using your own computer and its resources. With incremental refresh you load only a small portion, deploy it to the service, and do the first full truncate-and-load up in the cloud, leveraging your Premium capacity in Azure. And it's easy to set up, which I'll show in Power BI in a moment. Architecturally, though, you're not going to incrementally refresh all of your sources and tables. Why? In this use case, alongside the table of all events in the application, you'll have a table of distinct users, and your users and their attributes don't change all the time. If you only have 100,000 customers right now, loading 100,000 records isn't going to take long and isn't troublesome. The transactions are different: with users doing 10,000 transactions a day across 100,000 customers, that's the table that grows toward 100 gigabytes. It's not your dimension tables or your slowly changing dimensions — so you really only need to think about the date column and the format of the fact tables when it comes to incremental refresh. And why not DirectQuery, or dual mode, in Power BI? DirectQuery means you connect live to Databricks and leverage Spark in real time; dual mode means you both import — cache the data into Power BI — and DirectQuery. You really only want DirectQuery for the specific use case where you need results in near real time. It's costly, because you have to keep your compute up 24/7,
so you're paying not only for Power BI licensing and that processing, but also for Databricks; it's not needed if you don't need real time; and it's fault-prone when it comes to two-party integrations — when you need real-time reports and have to reason about latency, you end up debugging between Power BI and Databricks. It's a use case you want to make sure is really fine-tuned; it seems like an easy button, but think about why you're doing it before you move to it. I call this out because I've had analysts go with DirectQuery just so they wouldn't have to worry about refreshes — "it's just always the right data" — and then you get to explain the downsides and what that actually means. Anyway, I'm going to jump over to Power BI. You can see I'm already showing incremental refresh here; I've created simulated data, grouped by the events and added under a new day, so you can see I've added 12/20 or 12/19/2019 to the original records, and I'm going to add today into Power BI as well. While you're switching over to Power BI, I just want to highlight the main point: in order to do incremental refresh, you have to have something to increment, and a date is very common. Is it limited to dates in Power BI, or are there other min/max things you can use? Basically your data has to have a date — work with your data engineer to figure out what that is — and then you can do the incremental refresh, correct? Yep, that's exactly it, and I'll show you here, if you can see my Power BI Desktop. I've already connected to Databricks — super easy to do: new source, type in "Azure Databricks" (don't worry about "Azure" being in the name; it can connect to Databricks in all three clouds) — and I've configured it for that specific table. You can see I've created the two new parameters, following the Power BI documentation (I can follow up with the link, or Max, feel free to drop it in the chat): RangeStart and RangeEnd. I'd already talked with Max and understood, from the data source, which date these events should use. I created the parameters exactly as the documentation instructs — named as required, typed as Date/Time — and gave them a current value, which really only needs to start from somewhere you know is the very beginning of your data set. What you configure afterwards, in the service and as you load the data in, is the logic for how much data to keep and how much to reload: for instance, every time you refresh you might only want to reload the last two days' worth of data, or one day, depending on your source. This particular one we refresh every 15 minutes now, but in terms of how I deliver it to the end user, they really only look at it from a daily perspective — it's aggregated up enough that I'll only bring in that day's worth of data. But first and foremost, you need to create these two parameters: RangeStart, for where you want the refresh — the truncate-and-load window — to start, and RangeEnd for the end of the period.
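The RangeStart/RangeEnd filter itself is authored in Power Query, but conceptually, when the filter folds, what reaches Databricks is roughly a predicate on the agreed date column. A hedged sketch with made-up table, column, and parameter values:

```sql
-- Roughly what a folded incremental-refresh filter looks like by the time it
-- reaches Databricks: Power BI substitutes the RangeStart / RangeEnd values
-- into a predicate on the date column the engineer and analyst agreed on.
SELECT *
FROM demo.events_gold
WHERE event_date >= DATE'2021-08-01'   -- RangeStart (example value)
  AND event_date <  DATE'2021-08-02';  -- RangeEnd   (example value)
```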
Any questions so far, Franco, or am I explaining this clearly? No, this is great — I think we're on the right track. John, are there any relevant questions coming in that would help clarify this? I can't see the Q&A in Zoom, but on the other channels I'm not seeing much that's complicated — people generally understand what's going on, nothing that isn't answerable by text. No worries. So here is where you configure those parameters. Now, you're going to hit this often with Power BI, which is why testing is so important — I've already tested this with Databricks and Power BI — and it's why I really wanted to hone in on the next feature, native query support, and on the query folding aspect. This pop-up saying "unable to confirm if the query can be folded" is not saying it can't be folded; it's that, from how Power BI understands folding between itself and Databricks on the back end, it simply can't confirm it for this connector. That sometimes gets really troublesome for analysts — "Is this supported? It says it isn't" — but it doesn't necessarily mean that. It's that uncertainty you get about whether something works between two different tools. I'll show native query support after this; I just wanted to give that perspective, because it's exactly what I was getting at earlier. All I did was double-click the table in the query editor to bring this up, after creating my parameters, and here you turn on incremental refresh. You select how many years you'd like to hold before data drops off — for instance, if you only need the last five years of transactions and don't need to keep them for all eternity, you set that threshold. That's super important, because when it comes to how much data you store, and to performance, it's always better to keep only exactly what you need. Sometimes with analysts the attitude is "all is better than nothing," but even from a Power BI model perspective you do not want a cube that tries to answer every single question in every report an operations department could ask — it gets really troublesome for dependencies, for performance, and even for figuring out where the performance pain is between different reports. So that's where you select how much data you'd like to keep. Then, for the refresh itself — the window it actually checks and reloads — you can select, say, a week: just reload the last week's worth of records that may have changed, if you know they might have. This depends on your use case and your data, and you can also tune the refresh window to how your data actually changes.
For instance, if you know that nothing else changes after a certain period — pending transactions only exist within a certain time frame, say seven or ten days — you can select that window and do an incremental refresh of the last ten days, so you're not limited to daily, and you can choose to only refresh complete days. Once you're done, you select your specific needs for incremental refresh and deploy it to the Power BI service. So I'll pop over to the Power BI service. Once I've deployed, I go into the dataset settings. You do not need a gateway connection — you have a direct connection between Databricks and Power BI — so all you have to do is configure your credentials, either through a token or with OAuth using Azure Active Directory within your organization. (I might just need to update my token, which is fine; I can show that later.) Once it's deployed, all you do is refresh, and it refreshes within about two minutes, versus the 30 minutes or an hour it took originally. One thing to note for development later on: when you deploy an incremental-refresh model to the Power BI service and it starts to grow, you can't export it back out to Desktop the way you can with normal models — for the same reason as the large dataset format: you deployed it to the cloud and you're not going to pull it back down to your computer, and Power BI won't let you anyway. So it's super important, while you're in your POC or proof-of-value phase — which is usually about six months of an analyst working with the end consumer to figure out exactly what's needed — to filter down and fine-tune the data with Databricks SQL and with Max before you productionize it into the Power BI service, because you won't be able to export it back down and edit it. As you know, Power BI Desktop is the authoring tool for Power BI, so if you can't change the data formats in your model or adjust the connection points, you're stuck — that's what happens once incremental refresh is deployed, so productionize before you deploy.

Okay, we have about two minutes left, right? Yes, about two minutes — hit the last one home. So, this is going to be released in September: we have a lot of new features coming for Power BI and Databricks, and the two I'm really excited about are native query support and Cloud Fetch. Cloud Fetch is about performance, especially as you get into DirectQuery use cases — those business-critical real-time data and reporting scenarios between Power BI and Databricks — and embedding. And for native query support: in the ODBC connector, or the Azure Databricks connector, just as you'd see in the SQL Server connector within Power BI, you'll be able to expand the advanced options and get a SQL statement box.
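For illustration, this is the kind of statement you could paste into that SQL box once native query support lands, instead of rebuilding the shaping in the query editor — the table and column names are hypothetical, reusing the earlier sketch:

```sql
-- A hand-written statement passed straight through to Databricks SQL,
-- bypassing the point-and-click shaping steps in the query editor.
SELECT
  event_date,
  attributes:action::string AS action,
  count(*)                  AS events
FROM demo.events_silver
GROUP BY event_date, attributes:action::string
ORDER BY event_date;
```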
You just saw Databricks SQL — so when you think about copying and pasting queries, and proliferating Spark SQL and Databricks SQL within your organization, really building a center of excellence, you now have the ability to copy a query, collaborate on it in your Slack channel, and have someone paste it straight into Power BI and load it automatically, without them having to go into Databricks SQL and copy it out themselves. Why else is this important? Beyond skipping the point-and-click steps: query folding. Instead of worrying about which of the many buttons and features in the query editor will fold, you're running the statement natively on Spark, so you don't have to worry about what Power BI supports folding down to Spark.

That's great. I want to expand a little on Cloud Fetch, and then, Karen, you can take us out. Cloud Fetch is an amazing new technology that uses object storage as the intermediary, rather than transporting large amounts of data over the JDBC/ODBC connector, which is known for poor data throughput. Databricks SQL endpoints have this new capability, rolling out over the next month or two, called Cloud Fetch: it uses fast cloud object storage so the data transfer can be parallelized, giving much better throughput to your application. Both sides have to support it, and the capability is in Databricks SQL today. You can download our ODBC driver from our website — if you're on Microsoft technologies, maybe using Excel — plug it into your DSNs in Windows (reach out if you want directions on how to do this), and then connect Power Query in Excel, or anything that supports Power Query, to Databricks SQL using Cloud Fetch. Essentially, if you've ever had problems with data coming back into your Power BI application — refreshes taking too long, data transport getting in your way — Cloud Fetch is really going to help you out. Awesome calls, Tori, thank you very much for that. Karen, do you want to take us out?

Yeah, thanks, Franco. I just dropped the link to the session recording in the Zoom chat and on LinkedIn, so you're welcome to revisit the session at a later date. To close out: thank you so much, Max and Tori — that was an awesome presentation, and we're so happy to have had you — and thank you, everyone, for joining us. I hope you enjoyed the session. We have another one coming up next week, which I'll announce later today, so we hope you'll join us. Thanks to our hosts, Franco and John, and with that, take care, everyone — have a great rest of your day. Bye. Thanks, everyone.
Info
Channel: Databricks
Views: 954
Rating: 4.8333335 out of 5
Id: 43Zk4tK4Hu4
Length: 60min 44sec (3644 seconds)
Published: Thu Aug 05 2021