Ask Me Anything about Delta Lake!

Video Statistics and Information

Captions
All right, y'all. I just want to count down about 15 seconds so we can confirm the stream's going: ten, nine, eight. Welcome, everybody, and thanks very much for joining us today for this fun session, Ask Me Anything about Delta Lake. We've got a really cool panel here, so I want to go ahead and do some introductions, starting at least to my left. Vinnie, why don't you introduce yourself?

Hi everyone, thank you for attending our session. I'm Vinnie Jaiswal, a senior developer advocate at Databricks. Happy to join my other panelists as well.

Perfect. TD, you're up next, at least according to my view of the world.

Hi everyone, thank you for joining the session and thank you for having me. I'm a software engineer at Databricks working on Structured Streaming and Delta Lake.

Perfect. Ryan, you're next.

Hi everyone, I'm Ryan. I'm a software engineer at Databricks, mostly working on Delta Lake and Structured Streaming.

Perfect. Burak, you're next.

Hey everyone, my name is Burak. I'm an engineering manager at Databricks. I've been working on Delta and Structured Streaming, and I now focus on ingestion.

Perfect. Okay, well, thank you very much. My name is Danny, by the way; I completely forgot to introduce myself while I was too busy introducing everybody else. I'm a developer advocate here at Databricks, and apparently I know a little bit about Delta, so I figured that's why I would emcee today's session. We've got a lot of great questions, and we will stop exactly at 1 PM Pacific today because we've got a lot of good afternoon keynotes, so I'm going to go through the questions that we have. Folks who are interested in asking questions, please chime in here and let other people vote for your question so we can get to it.

All right, first question. TD, I'm going to direct this right at you, buddy, because it is about merge. This one's a little bit vague, so I figured I'd give you a heads-up. An attendee has asked that they're experiencing significantly longer run times when performing merge operations with Delta, and after the merge, the auto compaction seems to take a very long time. So the questions are: one, why could this be happening, and two, are there things we can do to improve the performance?

Okay, since this person is talking about auto compaction, I'm going to assume this is Delta on Databricks. What could be happening is that merge is writing out too many small files, because internally it does a shuffle and then writes out files. It could be a partitioned table; for various reasons it could be writing out too many small files. One way to reduce the number of small files it writes out is to enable optimized writes. In some cases optimized writes are enabled by default, but that might not be the case for you depending on which runtime version you're using, so enabling optimized writes might help. In other cases it's worth looking at controlling the number of shuffle partitions, which is set to 200 by default and could be the cause of the many small files. But I have to say, going forward we are focusing a lot of our effort on making merge perform better right out of the box, so that you don't have to do any of this sort of tuning. There's a whole lot of features coming: there's a faster version of merge on Databricks called Low Shuffle Merge that is in preview right now (talk to your Databricks contact for more details), and there are other features coming, like auto-optimized shuffle in Spark itself, that eliminate the need to set the number of shuffle partitions. So yes, right now it takes a little bit of tweaking and understanding what's going on to do that tuning, but going forward this will not be a problem at all; things will just magically work better.

I love it when you provide the answer "magic." Great, thank you very much, TD, for that answer. The hand movement is important. Exactly, you have to emphasize it.
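To make TD's tuning suggestions concrete, here is a minimal PySpark sketch of a Delta merge with the knobs he mentions. The table paths and column names are hypothetical, and the optimized-write and auto-compaction settings assume Delta on Databricks; on open-source Delta Lake only the shuffle-partition setting applies.

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Assumes a Spark session with Delta Lake configured (on Databricks, `spark` already exists).
spark = SparkSession.builder.getOrCreate()

# Fewer shuffle partitions -> fewer, larger output files from the merge shuffle.
spark.conf.set("spark.sql.shuffle.partitions", "64")

# Databricks-only settings: bin-pack small files during and after the write.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

updates = spark.read.format("delta").load("/tmp/updates")   # hypothetical source of changes
target = DeltaTable.forPath(spark, "/tmp/events")           # hypothetical target table

(target.alias("t")
    .merge(updates.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())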
All right, the next question on the list is actually a really good one, and it's open to anybody here. They want to understand the line between Delta Lake and the lakehouse. These are two concepts that sometimes get merged together a little too tightly. For example, the question they're asking is: what do I need to add to convert my Delta Lake into a lakehouse? It's quite open-ended, but I'm sure there are going to be a lot of good answers. Anybody want to tackle this question? Want to go first, Vinnie?

Sure. Delta Lake is basically a format on top of your data lake which allows you to enhance your lakehouse architecture. Think about the lakehouse architecture as a unified platform, and Delta Lake is just one part of it. If you have data in the Delta format, you'll be able to leverage all the components, like machine learning and SQL analytics, and run them on top of your Delta tables, so you don't have to replicate copies. Basically, think of the lakehouse as a house built on top of your lake: a clean Delta Lake. That's how I would describe it.

I can add a couple more things on top of that. As Vinnie said, Delta Lake is essentially the storage layer of your lakehouse. To have a complete lakehouse, you need an execution engine on top of it, and having this open format, Delta Lake, allows you to use different types of execution engines, such as Presto, Trino, proprietary systems like Redshift Spectrum, or Delta Engine, to query the data that lies in your lakehouse. On top of this you need a security layer to help you govern what is in your lakehouse, and here Databricks provides additional services; we just announced Delta Sharing, which is a very new way to share your data in a secure manner, as well as Unity Catalog. So essentially, Delta Lake is the secure storage layer that gives you that table-level abstraction that can work across different execution engines in a lakehouse architecture.

I think another way to think about how crucial Delta Lake is to the lakehouse architecture is that you could build a lakehouse with other file formats, but it would be a very poor lakehouse that cannot compete with the data warehouse architecture that you're used to, because without Delta Lake you do not get the transactional and data quality guarantees that are present in a data warehouse. So to build an effective lakehouse, storing data in the Delta Lake format is absolutely crucial.

One last thing on top of that: with data lakes, everybody had to build their own database, essentially. With Delta Lake you abstract out all the complexities of file management, transactions, and schema enforcement and evolution, and you get all the data quality guarantees that you used to get with databases and data warehouses, on top of a massively scalable and cheap storage system.

Cool, any others? Okay, we're good. All right, perfect. Thank you very much.
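As a minimal illustration of the "Delta Lake is the storage layer" point above, here is a short PySpark sketch (paths, table name, and columns are hypothetical) that writes a DataFrame as a Delta table and queries it back with Spark SQL; the same files could then be read by other engines with Delta connectors.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: write it once as a Delta table on cheap object storage.
events = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "action"])
events.write.format("delta").mode("overwrite").save("/tmp/lakehouse/events")

# Register the location as a table and query it like a warehouse table.
spark.sql("CREATE TABLE IF NOT EXISTS events USING DELTA LOCATION '/tmp/lakehouse/events'")
spark.sql("SELECT action, COUNT(*) AS n FROM events GROUP BY action").show()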
I did want to call out that there has been a big question folks have asked: could you share more information about Delta Live Tables? We're not going to answer that in today's session, but not because we're trying to ignore it. In fact, you can join myself and, more importantly, Matei Zaharia and Michael Armbrust later today at 3:15 Pacific; in that session we will dive into Delta Live Tables, so let's wait until then for those questions. Again, anything on Delta Lake, OSS Delta, or Delta Sharing, we'd love to answer here, but for anything on Delta Live Tables we have a session later today specifically for that.

Okay, now that I've said my little piece, the next question on the list, again open to anybody here, is: how frequently should the Z-order index be refreshed on a table? And they're wondering whether having a composite Z-order index actually impacts the refresh rate.

I can try to answer this. For Z-order optimization, what we recommend basically depends on how frequently you ingest data into your table. For example, if you use a streaming query that ingests pretty frequently, every second or maybe every minute, you probably want to schedule the Z-ordering roughly every hour to make sure you don't accumulate a lot of small files in your table. And if your Z-order index is pretty complicated, it may take quite a long time, so you may consider increasing this interval. But basically we'd need to look at the details of your data to know exactly what to do here.

It also depends on your query patterns: how frequently you're going to query the table, and whether you need the latest data to be available when you query it. If you're streaming with low latency, you probably do want to see the latest data, so you might want to do the Z-ordering more frequently.

On the second piece of the question, composite keys: I don't think that changes how frequently you should be doing it, or at least I don't see any specific reason why you would do it more or less frequently based on having more keys. It's really about how the data itself changes, and also the queries that you're running, not so much the keys. Cool.
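For reference, a hedged sketch of what that scheduled Z-order job might look like. The table path and columns are hypothetical, and the OPTIMIZE ... ZORDER BY command is available on Databricks and in newer open-source Delta Lake releases.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical maintenance job, scheduled (for example, hourly) outside this script.
# OPTIMIZE bin-packs small files; ZORDER BY co-locates data on the listed columns
# (a composite Z-order index in this example) to speed up selective queries.
spark.sql("""
    OPTIMIZE delta.`/tmp/lakehouse/events`
    ZORDER BY (customer_id, event_date)
""")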
All right, I think we did a great job answering that question. Now our buddy Simon Whiteley is here with his question, and my apologies, I cannot pull off a British accent at all. We've seen a ton of awesome features coming into Auto Loader recently; should we expect to see the same schema evolution coming to standard data readers, or is Auto Loader the ingestion focus going forward?

I can take that question. All of these schema inference and evolution capabilities will be available in the vanilla readers by specifying the proper options on those readers, but Auto Loader will be our main focus of development for how you bring new data into Delta Lake from cloud object stores, and you can imagine that it'll be the easiest to set up, with the least amount of options to configure to get all of these features out of the box.

Rock on, perfect.
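A rough sketch of the Auto Loader pattern described above, assuming a Databricks runtime where the cloudFiles source is available; the paths, checkpoint locations, and landing zone are hypothetical placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # on Databricks, `spark` already exists

raw = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Schema is inferred and tracked here; newly seen columns evolve the schema.
    .option("cloudFiles.schemaLocation", "/tmp/lakehouse/_schemas/events")
    .load("s3://my-bucket/raw/events/"))  # hypothetical landing zone

(raw.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/lakehouse/_checkpoints/events")
    .option("mergeSchema", "true")  # allow evolved columns into the Delta table
    .start("/tmp/lakehouse/events_bronze"))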
All right, let's go to the next question then; we're clipping along quite nicely. With the new capabilities such as Delta Sharing and, well, the aforementioned Delta Live Tables, can I leverage these lakehouse capabilities on federated data when my own data lake is rather small? The context, of course, is that instead of the traditional conversations about very large data lakes, in the petabytes range, now we're talking about having only a small data lake; it's not that massive. So can things like the lakehouse architecture, Delta Sharing, or Delta Live Tables really come into play?

Yeah, for Delta Sharing we actually have a Python connector: you can load a shared small table as a pandas DataFrame, so you can use it to read any shared small table. We also see that people have a lot of small tables they just want to play with and share with other people, and we host an open server with several small tables that everyone can try with the Delta Sharing connector. For large tables we also provide a Spark connector; you can use it with large tables that, for example, cannot fit into memory and so aren't easy to handle with pandas. In that case you need a bigger data engine to handle it.

Excellent. One thing I'd love to add to this question is what people need to remember when working with these types of technologies. For example, you can absolutely use Delta Lake even in a smaller environment, not just for big data, because it is about protecting your data. It doesn't matter whether you're talking about small amounts of data or big data; you still need transactional protections around your data, and this is where Delta Lake itself is very helpful. The same goes for Delta Sharing: just because you have small amounts of data doesn't mean you don't want an open standard for sharing your data, so it's absolutely helpful there too. Even for technologies like Delta Live Tables: yes, it has the ability to scale for massive amounts of data, but the reality is there are things like governance and environment configuration that are applicable to you whether you're dealing with small amounts of data or very large amounts. So all of these lakehouse capabilities are absolutely applicable whether your data is small or large, and what's great about Delta Lake and these environments is that as you grow you can scale up extremely easily, because the systems are already designed for that. So that's just a quick call out. Anything else, anybody? Cool, I'm getting nodding approval, so let's go ahead and dive into some other questions. All right, perfect.
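A small sketch of the connectors mentioned above, using the open-source delta-sharing Python library. The profile file path and table coordinates are hypothetical placeholders; see delta.io/sharing for the actual example server and credentials.

import delta_sharing

profile = "/path/to/open-datasets.share"  # hypothetical credentials file from the provider

# Discover what the provider has shared: shares contain schemas, schemas contain tables.
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

table_url = profile + "#my_share.my_schema.my_table"  # hypothetical coordinates

# Small tables: pull straight into pandas.
df = delta_sharing.load_as_pandas(table_url)

# Large tables: load as a Spark DataFrame instead (requires running under PySpark
# with the Delta Sharing Spark connector on the classpath).
sdf = delta_sharing.load_as_spark(table_url)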
All right, we're going to dive into some details again, so I'm pretty sure the engineers here are going to love this one. Since you're defining various properties like checkpointing when you create a Delta table yourself, do you need to actually trigger or specifically set up explicit checkpointing or explicit vacuuming of the table? I hope that makes sense. Go ahead, TD.

Oh, me? Okay. Regarding checkpointing, everything is automated: you don't have to set up anything to explicitly checkpoint. The log will automatically checkpoint every 10 versions and take care of itself. Regarding vacuuming old versions, that is something you need to schedule and run explicitly, to clear out files that are no longer needed by the last N days of versions.

Just a quick add-on to that: if you join the Delta Live Tables session, they'll talk about how these maintenance jobs are automated by Delta Live Tables as well.

Yeah, that's pretty cool, isn't it? One more thing I would like to add: you can also define your retention period, so that if you are not expecting any stale data you have retention policies in place to ensure compliance, for example if you are dealing with GDPR laws. By default the retention is seven days, but if you would like to retain data for less time or more time, specify that in your code. Perfect, thanks very much, Vinnie.
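A hedged sketch of the explicit vacuum and retention settings just discussed; the table path and the specific interval values are illustrative only.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Retention-related Delta table properties; values here are examples, not recommendations.
spark.sql("""
    ALTER TABLE delta.`/tmp/lakehouse/events` SET TBLPROPERTIES (
      'delta.logRetentionDuration' = 'interval 30 days',
      'delta.deletedFileRetentionDuration' = 'interval 7 days'
    )
""")

# VACUUM must be scheduled and run explicitly; it removes data files no longer
# referenced by versions within the retention window (default 7 days = 168 hours).
spark.sql("VACUUM delta.`/tmp/lakehouse/events` RETAIN 168 HOURS")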
Next in the queue is actually a different question that I'm going to segue off. There's a great question about Delta Engine, Photon, and SQL Analytics, but because we're talking about Delta Lake, OSS Delta, and Delta Sharing here today, we're going to put that question aside for now. The only reason I'm doing that is that, I believe at 3:30 Pacific today, there is an Ask Me Anything on Photon, and those are the right people to ask. So please join that session at 3:30 Pacific for any of your Photon questions.

Okay, now let me shift back. Ryan, I'm going to direct this at you because, guess what, we're talking about Delta Sharing. How does Delta Sharing manage the sharing of data and model versions?

So basically we define a protocol. A share can have multiple schemas, and each schema is like a database that contains multiple tables; that's the data model. For the server itself, we have an open-source reference implementation: you configure your server and define all of these objects in a configuration file, and after that you can share tables with any people you'd like to share with. On Databricks, we are going to start a private preview in which we will ship the server with Unity Catalog, because we need Unity Catalog to provide proper access controls for all these tables, so on Databricks we will provide better access-control features for Delta Sharing through Unity Catalog. Then, for everyone using Databricks with Unity Catalog, you can, for example, assign a share to a specific user, and only that user can access the share. In the future we may try to improve our open-source server to do all of this work as well, but in the current initial version you use the configuration file to do it.

This is great, thank you very much; this is awesome. And don't forget, go to delta.io/sharing. That gives you a reference to the blog and also to the GitHub repo, and in the GitHub repo we've got plenty of examples, thanks to Ryan for doing all this work. So go ahead and give it a try, because we're really happy with Delta Sharing; I'm pretty excited about it, as you can probably tell.

Okay, the next question is a more generic one that I'd love to call out: could you give some customer insights when comparing data warehousing to lakehouse performance and/or user experience? Actually, Vinnie, why don't you take this up? You have a lot more experience working with customers; that's what they tell us, at least.

Vinnie, would you like me to go first, or would you like to go first? Go first and then I'll add. No problem.

Okay, so to provide a little context, I think the reason TD teed this up for me is that I'm actually a former data warehousing guy myself; I'm formerly of SQL Server, with the SQL Server team itself, so I used to build a lot of data warehouses. Data warehouses obviously have a lot of really cool advantages: they gave you one central location, and from a user experience it was great; I had just one DBA to choke, and it was awesome. But the problem it usually came down to is that your data was never all structured; it was never at a point where I could always have purely structured data. Even when I had structured data, I had to clean it, and even when I cleaned it, it was inconsistent because it came from multiple sources. So ultimately it required us to have data lakes, because we bought into the fallacy of schema-on-read, the idea that we could just build a schema after we dumped the data into our data lake. That's why lakehouses became so important: it was the understanding that particular concepts from data warehouses, and especially databases, like ACID transactions, are extremely important. From my end users' perspective, what happens is that they can finally trust the data sitting in their data lake; before this, they were hoping their data was okay, as opposed to being truly able to trust it. So now with Delta Lake we're able to say: I know what data exists, I'm comfortable with that, and we're good to go.

Then there's also the manageability perspective. For anybody that debates me on the manageability side of the house, I usually turn around and say: yes, it's true that managing SQL Server would be easier; I'm not going to debate that. But, and here's the big gigantic but, that's assuming all of your data was structured in a particular way, that's assuming you could actually put everything into a single server, that's assuming you didn't have connectivity issues with all these different sources talking to REST APIs, that's assuming you also didn't want to run machine learning, and so forth and so forth. And not to mention, I forgot the most important one, my apologies to TD: streaming. The fact is that in this new day and age, yes, I could have the simplicity of managing a data warehouse, but it was very limiting, and there are all these other things I need to do today that let me solve a whole new set of business problems I couldn't solve before. That's why the lakehouse became so paramount, that's why it's so important, and that's why Bill Inmon, for crying out loud, the father of data warehousing, is talking about lakehouses; he was in the keynote today with Ali. Even to me, as a former data warehousing guy, that's pretty mind-blowing. All we need to do is get a Kimball in here and then we're good to go, right? So that's more or less the context: back in the day, if you could afford to work only in a data warehouse, maybe that was okay, but in this day and age, with all the different data domain problems we have, the lakehouse is an important shift, and it's completely worth it. And so I now shall get off my soapbox. Anything else you'd like to add, Vinnie, or anybody else?

Yeah, that was an amazing explanation, Danny. A few things I have also seen from customers. For example, right now when we think about big data, it's not just structured data; we want to do much more with data which is not structured, like video files or audio files. If you are in a sensor-heavy industry, you want to understand those signals and how you can process them, and data warehousing doesn't go into that aspect; how would you handle unstructured data? That's where you have the data lake in the picture. The second thing is that with data warehousing you might have to create a bunch of copies of your data, and if you do, first of all you are running into a lot of storage complexity and storage cost; secondly, all these data warehouses are great from a data management perspective, but everything is in a proprietary format, so if you want integration with your other tools you will need to build connectors or do a lot of manual processing, and you are wasting your time and resources getting that data into other formats. In this time and age you want to make sure that you are accelerating innovation and getting much more value out of your data, rather than getting bogged down in infrastructure problems. That's where I have seen a lot of customers with ten years' worth of legacy infrastructure who want to migrate to the modern lakehouse, and that's where the lakehouse builds that link. The other thing is, when you think about data teams, you are no longer talking about just SQL analysts; you are talking about machine learning users, about building deep learning models and AI. If you have so many proprietary systems, your teams are siloed, so where are you going to get that vision? With the lakehouse platform everything is unified, so your teams are no longer siloed, you have a much better understanding of your data, and you can work together as a team to build that next AI innovation for your company. Awesome, thanks very much, Vinnie.

Okay, perfect. We actually only have time for one more question; we've only got a minute or so left, so that there's enough time for people to shift over to the afternoon keynotes. Posing this to the engineers, whoever would like to tackle it: does having too many versions of a Delta table cause performance issues? It's the old data versioning, log versioning question. Anybody like to tackle that one? Go ahead, Burak.

Yeah, not when you're querying the latest version of the table. Having many versions doesn't matter; Delta is very quick at figuring out which version is the latest and computing the state of the table at that latest version. So having more versions will only show up as storage cost for you, but on the query side you shouldn't expect much of a performance hit because of that.
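As a hedged aside on table versions, a short sketch (hypothetical path) showing how to inspect a table's version history and read an older version via time travel; old versions remain readable until VACUUM removes their underlying files.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

path = "/tmp/lakehouse/events"  # hypothetical table

# Inspect the version history recorded in the Delta transaction log.
spark.sql(f"DESCRIBE HISTORY delta.`{path}`").show(truncate=False)

# Reads default to the latest version; older versions stay queryable (time travel)
# until VACUUM removes their data files.
latest = spark.read.format("delta").load(path)
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)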
Perfect. Okay, we apologize in advance: there are still a whole bunch of other questions, but we had limited time, so I ask all of you to join us at delta.io. There is a Slack group; since I can't really spell out that URL here, I'm not going to try right now, but join us on the Delta Users Slack, the Delta Users Google Group, and for that matter even the Delta Lake YouTube channel, if you ever want to be bored by us digging into the code itself; you're more than welcome to join us there. Go to delta.io for the latest information and links to all those different forums. All of us here are there pretty regularly, so we hope to talk to you there. That's it for us. Again, thank you very much for spending your time with us, and we hope you continue having a lot of fun today at the summit. All right, thanks very much.
Info
Channel: Databricks
Views: 2,020
Rating: 5 out of 5
Keywords: Databricks
Id: 3kl5BhpOQ6c
Length: 29min 59sec (1799 seconds)
Published: Tue Sep 21 2021