Data + AI Summit 2024 - Keynote Day 2 - Full

Captions
[Applause] Hey everybody, super excited for day two here. We have an awesome program in front of us, but I want to start by again thanking our partners; without them this program would not be possible. So I want to thank the GSIs, the hyperscalers, and all the ISVs that you see on this picture. Please go to the expo hall and check out what they're up to.

So we have a really awesome program in front of us today. You're going to hear from the Texas Rangers, you're going to hear from the DuckDB creator, you're going to hear from Matei Zaharia, who started the Spark project; he's going to talk about UC. We're going to hear about Apache Iceberg from the original creator of the project, Ryan Blue. Then we're going to hear from Tareef from Posit (formerly RStudio), and then Professor Yejin Choi from UW. And then we have lots and lots of announcements today.

Before I jump in, I wanted to quickly recap yesterday, in case you missed it. Yesterday we talked about the acquisition of Tabular, which was a company started by the original creators of Apache Iceberg, and what we talked about is how we intend to bring these formats, Delta Lake and Apache Iceberg, closer and closer together. And if you want compatibility or interoperability today, we announced the GA of UniForm. UniForm stands for Universal Format: store your data with UniForm and you get the best of both of those formats. We've got the original creators of both of those projects, and we're making sure that UniForm really works well with both. So that was the first thing we announced.

Second, we talked about gen AI, and there we talked about how lots of companies are focused on general intelligence, which is super cool: models that are really good at anything you ask them about, history, math, and so on. But we're focused on data intelligence. Data intelligence is not just general intelligence; it's intelligence on your data, on your custom data, on the proprietary data of your organization, and being able to do that at a reasonable cost and with privacy intact. We talked about compound AI systems and the agent framework that we released yesterday, which lets you build your own compound AI systems.

And then we heard from Reynold yesterday about data warehousing, and he talked about the performance improvements we've seen over just the last two years. On BI workloads, concurrent BI workloads, we saw a 73% improvement on the workloads running on Databricks over the last two years. We're just tracking those over two years, and it's a massive improvement, so check it out. And then I was very excited about AI/BI. AI/BI is a project that we built from the ground up with generative AI in mind to completely disrupt how we do BI today, and that's also available in Databricks, so check it out. Okay, so that's what we did yesterday; those were the launches we had yesterday.

But today... let me see here, my clicker is not working. How do we get to the next slide? Maybe one more time, like that? Nope. Maybe it's out of battery. Okay, can someone go to the next slide please? No? Okay, we're just going to talk about this slide today. All right, something is happening, and now I'm worried about the next speaker.

Okay, I will introduce the next speaker. Her name is Yejin Choi, she's a professor at the University of Washington, and she's going to be talking about SLMs. What are SLMs? Everybody's talking about large language models; these are small language models. How do they work? What makes them tick? What's the secret sauce to make SLMs work really, really well? Super excited to welcome on stage Yejin Choi. [Music]
All right, so I'm here to share with you impossible possibilities. Last year, when Sam Altman was asked how Indian startups could create foundation models for India, he said don't bother, it's hopeless. Whoa. First of all, I hope that Indian startups didn't give up and will not give up. Second of all, this conversation could have happened anywhere else: in the US, at any university or startup or research institute without that much compute.

So here comes Impossible Distillation: how to cook your small language models in an environmentally friendly manner, so that they taste as good as the real thing. Currently, what we hear as the winning recipe is extreme-scale pre-training followed by extreme-scale post-training such as RLHF. What if I told you I'm going to start with GPT-2, that small, low-quality model that nobody talks about, and somehow, I don't know why or how, but somehow we're going to create, or squeeze out, a high-quality small model and then compete against a much stronger model that may be two orders of magnitude larger?

Now, this should really sound impossible, especially if you've heard of a paper like "The False Promise of Imitating Proprietary Large Language Models." Although what they report is true for that particular evaluation and experimental setup, please do not overgeneralize and conclude that all small language models are completely out of the league, because there are numerous counterexamples demonstrating that task-specific symbolic knowledge distillation can work across many different tasks and domains, some of which are from my own lab.

Today, though, let me focus on one task, which is about how to learn to abstract in language. To simplify this task, let's begin with sentence summarization as our first mission impossible. Here the goal is to achieve this without extreme-scale pre-training, without RLHF at scale, and also without supervised datasets at scale, because these things are not always necessarily available. But wait a minute: we usually have to use all three, or at least some of them. How are we supposed to do any good against a larger model without any of this?

The key intuition is that current AI is as good as the data it was trained on. We have to have some advantage; we cannot have zero advantage. That advantage is going to come from data. And by the way, we have to synthesize data, because if it already exists somewhere on the internet, OpenAI has already crawled it; that's not your advantage, they have it too. So you have to create something genuinely novel that's even better than what's out there.

Usually, distillation starts with a large model, but we're going to toss that out, just to show how we may be blinded to the hidden possibilities. I'm going to start, just for demonstration purposes, with GPT-2, that poor, low-quality model, and then do some innovations, which I'll sketch in a bit, to make a high-quality dataset that can then be used to train a small model that becomes a powerful model for a particular task. The only problem, though, is that GPT-2 doesn't even understand your prompt; you cannot do prompt engineering with GPT-2. You ask it to summarize your sentence, and it generates output that does not make any sense. So then you try again, because there's usually randomness to it; you can sample many different examples, like hundreds of examples, and we find that it's almost always no good, less than 0.1% good.
But where there's a will, there's a way. So we had a sequence of different ideas, which included our NeuroLogic decoding. This is a plug-and-play, inference-time algorithm that can incorporate arbitrary logical constraints into a language model's output, for any off-the-shelf model; we can use it to guide the semantic space of the output. But because GPT-2 is so bad, even with this the success ratio was only about 1%. Still, this is not zero; now we are going somewhere, because if you over-generate a lot of samples and then filter, you can actually gain some good examples this way. And then brilliant students came up with many different ideas; I'll gloss over the technical details, but we found ways to increase the success ratio beyond 10%, just so it's a little easier to find good examples.

So the overall framework goes something like this: you start with a poor teacher model, you over-generate a lot of data points, and then, because there's a lot of noise in your data, you have to do serious filtration. Here we used a three-layer filtration system. The details are not very important, but let me highlight the first one, the entailment filter, which was based on an off-the-shelf entailment classifier that can tell you whether a summary is logically entailed by the original text or not. This off-the-shelf model is not perfect, maybe about 70 to 80% good, but it's good enough when you use it aggressively to filter your data. Then we use that data to train a smaller model, a much smaller model, which can then become the teacher model for the next generation of students. We repeat this a couple of times to end up with a high-quality dataset and a high-quality model.

We evaluated this against GPT-3, which was the best model at the time; this was actually done before ChatGPT came out, and we were able to beat GPT-3, which was then the best summarization model out there. But since ChatGPT came out, people are like, whatever, ChatGPT can do everything, including summarization, so why should we bother?

So here comes mission impossible two, where we are now going to compete against ChatGPT 3.5, and to make the challenge even harder for ourselves, we are going to summarize documents, not just sentences, and we are also going to do all of the above without relying on that off-the-shelf entailment classifier. In practice you could use it; it's just that, academically, we wanted to see how far we can push the boundary against the commonly held assumptions about scale. Our new work, InfoSumm, is an information-theoretic distillation method where the key idea is that, instead of that off-the-shelf entailment filtration system, we use some equations. The equations are actually only three lines of conditional probability scores that you can compute using off-the-shelf language models. It's too early in the morning, so let's not drill into the details, but I can tell you, hand-wavingly, that if you shuffle them around, you can interpret them as special cases of pointwise mutual information, which you can use for the purpose of filtering your data.

So we use the same overall framework as before. We now use the Pythia 2.8-billion-parameter model, because we liked it a little better than GPT-2, and for the filtration we now use the three short equations I mentioned. Then we do the same business, but this time we make the model even smaller, only a 0.5-billion-parameter model, which leads to a high-quality summarization dataset as well as a model.
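For illustration, here is a minimal, hedged sketch of the over-generate-then-filter distillation loop described above, written with Hugging Face `transformers`. The prompt format, the PMI-style scoring function, and the filtering threshold are illustrative assumptions, not the papers' exact recipes; in the papers the filter is an entailment classifier (Impossible Distillation) or the three-line information-theoretic criterion (InfoSumm).

```python
# Sketch of task-specific symbolic knowledge distillation: over-generate from a
# weak teacher, filter aggressively, and keep (source, summary) pairs to train a
# smaller student. Assumes GPT-2 purely for demonstration, as in the talk.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("gpt2")
teacher = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

def overgenerate(sentence, n_samples=128, max_new_tokens=30):
    """Sample many candidate summaries from the weak teacher (assumed prompt format)."""
    prompt = f"{sentence}\nTL;DR:"
    ids = tok(prompt, return_tensors="pt").to(device)
    out = teacher.generate(
        **ids, do_sample=True, top_p=0.95, temperature=1.0,
        num_return_sequences=n_samples, max_new_tokens=max_new_tokens,
        pad_token_id=tok.eos_token_id,
    )
    prompt_len = ids["input_ids"].shape[1]
    return [tok.decode(o[prompt_len:], skip_special_tokens=True).strip() for o in out]

def lm_logprob(text, condition=""):
    """Average token log-probability of `text`, optionally conditioned on `condition`."""
    ids = tok(condition + text, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = teacher(**ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    target = ids["input_ids"][:, 1:]
    token_lp = logprobs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    start = max(len(tok(condition)["input_ids"]) - 1, 0) if condition else 0
    return token_lp[0, start:].mean().item()

def pmi_score(summary, source):
    """PMI-style score: how much the source raises the probability of the summary.
    A stand-in for the paper's three-line filtering equations."""
    return lm_logprob(summary, condition=source + " ") - lm_logprob(summary)

def distill_round(sentences, keep_threshold=1.0):
    """One round: over-generate, filter, and return pairs for fine-tuning the student."""
    pairs = []
    for s in sentences:
        for cand in overgenerate(s):
            if cand and pmi_score(cand, s) > keep_threshold:
                pairs.append((s, cand))
    return pairs
```

The pairs returned by `distill_round` would then fine-tune a smaller student, which becomes the teacher for the next round, exactly the iteration the talk describes.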
So how well do we do? Well, as promised, we do either as well as ChatGPT 3.5, at least for this task, or better, depending on how you set up the evaluation challenges and benchmarks. You can check out more details in our paper.

To summarize, I demonstrated how we can learn to summarize documents even without relying on extreme-scale pre-trained models and many other things at scale. The real research question underlying these two papers, though, is this idea of how we can learn to abstract. Right now the recipe is: let's just make models super big, the bigger the better. But humans, you and I, cannot really remember all the context, like a million tokens; nobody can remember a million tokens of context. You just abstract away everything I told you, instantaneously, and yet you still remember what I've said so far. That's really amazing human intelligence that we don't yet know how to build efficiently into AI models, and I believe it's possible; we're just not trying hard enough, because we're blinded by the magic of scale.

Okay, so finally, Infini-gram, as the third mission impossible. Switching topics a little bit: now the mission is to make classical, statistical n-gram language models somehow relevant to neural language models. How many of you even talk about n-gram models anymore? Do you even learn this these days? Here we're going to make n equal to infinity, we're going to compute this over trillions of tokens, the response time should be essentially instantaneous, and we're not even going to use a single GPU for it. Wow. Let me tell you how hard this is. Hypothetically, if you're going to index five trillion tokens in a classical n-gram language model with n equals infinity, then roughly speaking you're looking at two quadrillion unique n-gram sequences that you somehow have to enumerate, sort, count, and store somewhere, which might take maybe 32 terabytes of disk space, maybe more, who knows. It's too much; we cannot do that. And if you look at the largest-scale classical n-gram models anyone has ever built, it was Google in 2007, due to Jeff Dean and others, who scanned two trillion tokens, which was a lot back then, up to five-grams only; and five-grams already gave them about 300 billion unique n-gram sequences to enumerate, sort, count, and so on. It's too many, so people didn't go much beyond that. So how on earth is it possible to blow this up to infinity?

Before I reveal what we did, I invite you to go check out the online demo if you so desire, at infini-gram.io.
There you can look up any token you want. Here's one example, highlighted, which has 48 characters; I don't know why that word even exists, but not only does it exist, if you look it up there are more than 3,000 instances, and it shows you how many milliseconds the lookup took: 5.5 milliseconds. It also shows you how that long word gets tokenized. You can also try multiple words to see which word comes next; for example, "actions speak louder than" what? It's going to show you, on the web, what the next words are, and again, it's just super fast.

So what did we do? You'll be surprised to hear how simple the idea actually is. There's something called a suffix array, which I think not all algorithms classes teach, but some do. It's a data structure that, implemented very carefully, lets us index the entire web corpus. And the truth is we don't precompute any of these n-gram statistics; we just have this data structure ready to go, and when you issue a particular query, we compute the answer on the fly. Thanks to the data structure we can do this super fast, especially with a C++ implementation. I know people don't usually use that language anymore when it comes to AI research, but it's good stuff that actually runs much faster.

How cheap is this? We spent only a few hundred dollars indexing the entire thing, and even for serving the APIs you can get away with a pretty low cost. And it's really, really fast, even without GPUs: the latency for the different types of API calls is just a few tens of milliseconds. You can do a lot of things with this. One thing I can share right now is that you can interpolate your neural language models with our Infini-gram to lower perplexity, the metric people often use to evaluate language model quality, across the board. And this is only the tip of the iceberg that I expect to see; I'm actually working on other things that I wish I could share but cannot yet. We started serving this API endpoint a few weeks ago, and we've already served 60 million API calls, not counting our own access, so I'm really curious what people are doing with our Infini-gram.
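To make the suffix-array idea concrete, here is a toy Python sketch of the core trick: index a token sequence once, then count any n-gram (of arbitrary length) on the fly with two binary searches instead of precomputing n-gram tables. The real system indexes trillions of tokens with a carefully engineered C++ implementation; this is just the algorithm on a tiny made-up corpus.

```python
# Toy infini-gram-style lookup: suffix array + binary search, counts computed on the fly.
def build_suffix_array(tokens):
    """All suffix start positions, sorted by the suffix they begin.
    (Naive O(n^2 log n) construction -- fine for a toy corpus only.)"""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def _lower_bound(tokens, sa, q):
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + len(q)] < q:
            lo = mid + 1
        else:
            hi = mid
    return lo

def _upper_bound(tokens, sa, q):
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + len(q)] <= q:
            lo = mid + 1
        else:
            hi = mid
    return lo

def count_ngram(tokens, sa, query):
    """Count occurrences of `query` (any length) without precomputed n-gram tables."""
    q = list(query)
    return _upper_bound(tokens, sa, q) - _lower_bound(tokens, sa, q)

def next_token_counts(tokens, sa, context):
    """Distribution of the token following `context`, computed on the fly."""
    q = list(context)
    lo, hi = _lower_bound(tokens, sa, q), _upper_bound(tokens, sa, q)
    counts = {}
    for i in sa[lo:hi]:
        if i + len(q) < len(tokens):
            nxt = tokens[i + len(q)]
            counts[nxt] = counts.get(nxt, 0) + 1
    return counts

corpus = "actions speak louder than words , actions speak louder than deeds".split()
sa = build_suffix_array(corpus)
print(count_ngram(corpus, sa, ["speak", "louder", "than"]))   # 2
print(next_token_counts(corpus, sa, ["louder", "than"]))      # {'words': 1, 'deeds': 1}
```

The production version replaces the naive construction and Python comparisons with a disk-backed suffix array over token IDs, which is where the "few hundred dollars, no GPUs, tens of milliseconds" numbers come from.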
So, concluding remarks. The TL;DR of my talk is that AI, at least in its current form, is as good as the data it was trained on. Past and current AI depends primarily on human-generated data, but it could really be that the future will rely on AI-synthesized data. I know there are a lot of concerns about this: maybe the quality is not very good, there may be bias. So you cannot do this in a vanilla way; you should do it in a more innovative way. But there is a lot of evidence piling up that this actually works. Segment Anything by Meta, SAM, is an example of AI-synthesized annotation for image segmentation, helped by human validation, but humans alone could not have annotated that many images. Here's another example, "Textbooks Are All You Need" by Microsoft, the Phi models: again, this is a case where, when you synthesize really high-quality, textbook-quality data, you can compete against larger counterparts across many, many different tasks. Maybe it's still not as general as larger models in some capacities, but it's amazing for serving a lot of business needs, where you may not need a generalist, you may need a specialist. And what the textbooks work alludes to is that quality is what matters; it's not just brute-force quantity, it's quality. DALL-E 3 is yet another example: why is it better than DALL-E 2 all of a sudden? In large part because of better captions. But wait, better captions? The previous model already used all the good captions. Well, they synthesized the captions; that's how you get high-quality data. Of course you have to do this with care, but there are many more examples piling up of task-specific symbolic knowledge distillation, including work from my own lab, demonstrating that this can really unlock the hidden capabilities of small models. So it's really about the quality, novelty, and diversity of your data, not just the quantity. And I'll end my talk here. Thank you. [Music]

Awesome. Okay, so there we had it: mission impossible. The secret sauce behind these SLMs, small language models, is the data. Surprise! Okay, awesome. So, back to this slide; we saw it yesterday. This is the Data Intelligence Platform, and it's guiding us through the different portions of the platform. We went through a bunch of them yesterday, and today the next layer we're going to go through is Delta Lake and UniForm. So we have a talk on Delta Lake; that was our agenda a month ago when we put this together, but it turned out that we have now acquired the company Tabular, so we really, really wanted you to hear from Ryan Blue, the original creator of Apache Iceberg. I want to welcome him on stage and bring him on. [Applause] [Music]

Hey Ryan. Hey, good to be here. Awesome. Okay, so congratulations, and welcome to Databricks. Thank you. We are really excited to be here, and also excited to get started on this new chapter in data formats. Awesome. So what's the main benefit of joining Databricks? Why join forces? You know, I've never wanted people to worry about formats. Formats have always been a way for us to take on more responsibility as a platform and take responsibilities away from people who worry about things. When we started this, people were worrying about whether or not things completed atomically, and so this next chapter is really about how we remove the choice and the need to stress over, you know, am I making the right choice for the next ten years? That weighs a lot on people, and we just want to make sure that everything is compatible, that we're all running in the same direction with a single standard if possible; hopefully we can get there. Yeah, I think we're going to get there.

Actually, you gave a talk a while ago with a title something like "I want you to not know about these formats and Iceberg," right? Exactly. I don't want anyone thinking about table formats or file formats or anything like that; that's a massive distraction from what people actually want to get done in their jobs. I want people focusing on getting value out of their data and not the minutiae. That's the kind of nerdy problem that I get excited about; leave that to us. Hey, I like it; as a nerd I think it's awesome. We got thousands of people to learn how to do ACID transactions and understand all the underpinnings of stuff they otherwise would not give a damn about.

Okay, well, everybody wants to hear origin stories, so can you tell us a little bit about how Iceberg got started? What's the history? Well, at Netflix we were really grappling with a number of different problem areas. Atomicity was one:
we didn't trust transactions and what was happening to our data. We also had correctness issues, like you couldn't rename a column properly, those sorts of things. And we realized that the nexus of all the user problems was at the format level; we just had too simplistic a format with the Hive format, and we decided to do something about it. And then I think the real turning point was when we open-sourced it and started working with the community, because it turns out everyone had that problem, and we could move so much faster with the community. It's been an amazing experience.

And you were involved in starting the Parquet project before that, right? Were some of these thoughts about atomicity and so on discussed back then, or no? So, part of my experience in the Parquet project informed what we did here, because there were several things that just were not file-level issues; they were this next level of table-level concerns, like what's the current schema of a table? You can't tell that from just looking at all the files. Yeah. You know, a lot of people think this is the first time we're talking about these things, you and I and others, but this isn't the first time we've actually talked about interoperability and how to make this work, right? That's true. We've been in touch over the years, talking about this several times. I'm glad we finally got to the point where it made sense. I think we were always off doing our own things, but now both formats are good enough that we're actually duplicating effort, and the most logical thing to do is to start working together and avoid any duplication, if possible, between the two. Yeah, that's super awesome.

Okay, so I think a lot of people here are wondering: what does this mean for the Apache Iceberg community? Well, I'm really excited, because I see this as a big commitment and a pretty massive investment in the Iceberg community and the health of both Delta Lake and Iceberg in general. I'm very excited, personally, to work on this and do a whole bunch of fun engineering problems, and that'll be really nice. Awesome, man. Super excited to partner with you and collaborate on Delta, UniForm, Iceberg, all these formats, and make it so that no one here ever needs to care about this ever again. Thanks so much. Thank you. [Applause] [Music]

Okay, so as I said, originally this talk was just going to be about Delta, so now I want to welcome to the stage the CTO of data warehousing at Databricks, Shant Hovsepian, to talk about Delta and UniForm. [Applause] [Music]

Welcome. Thanks, Ali. A lot of us used to work together with Ryan in the past, and it's really exciting to have him here so we can work together again. So this talk is going to be very exciting: Delta Lake. First of all, I can announce the general availability of Delta Lake UniForm. What is UniForm? Really, it's just short for two words, Universal Format. It's our approach to allow full lakehouse format interoperability. See, with all of these different formats, Delta, Iceberg, Hudi, a table is essentially a collection of data files in Parquet and a little bit of metadata on the side, and all of the formats use the same MVCC transactional techniques to keep that together. So we thought to ourselves, in this age of LLMs transforming language left and right, couldn't we just find a way to translate that metadata into the different formats, so that you only need to have one copy? And that's exactly what we're doing with UniForm. The UniForm GA allows you to essentially write data as Delta and be able to read it as Iceberg or Hudi.
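As a rough illustration of what "one copy, multiple formats" looks like in practice, here is a hedged sketch of enabling UniForm on a Delta table from a Databricks notebook. It assumes a `spark` session is already available (as in a notebook); the table name and columns are made up, and the two table properties shown are the ones the Delta/UniForm documentation describes for exposing Iceberg metadata, so check your runtime's docs for the exact set it expects.

```python
# Create a Delta table with UniForm enabled so Iceberg clients can read it too.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.orders (
        order_id BIGINT,
        amount   DOUBLE,
        ts       TIMESTAMP
    )
    USING DELTA
    TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")

# Writers keep writing plain Delta; UniForm maintains Iceberg metadata on the side,
# so Iceberg-native engines can read the same Parquet files through a catalog.
spark.sql("INSERT INTO main.sales.orders VALUES (1, 99.5, current_timestamp())")
```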
We've worked very closely with the Apache XTable and Hudi teams to make that possible, and we're going to be working with the Iceberg team to make it even better. The great thing about UniForm is that there's barely a noticeable performance overhead; it's super fast, you get awesome features like liquid clustering, there's support for all of the different data types, from maps to lists to arrays, and best of all, it comes with a production-ready catalog with UC. UniForm is one of the only live implementations of the Iceberg REST catalog API, and that's available for everybody using UniForm. Over four exabytes of data have already been loaded through UniForm, we have hundreds of customers using it, and one of them in particular, M Science, as you can see here, was very happy to have one copy of their data, which allowed them to reduce costs and improve time to value.

It's innovations like UniForm that are making Delta Lake the most adopted open lakehouse format. Over nine exabytes of data are processed on Delta every day, with over a billion clusters per year using it, and this is tremendous growth, 2x more than last year. If you're like me when I saw these numbers, I did not believe nine exabytes; literally up until yesterday I was going back through the code making sure we calculated it correctly, because it's just a tremendous amount of data going into Delta every day. It's adopted by a large percentage of the Fortune 500, with 10,000-plus companies in production and lots of new features. But most interesting is that last number: there are over 500 contributors, and best of all, according to the Linux Foundation's project analytics site, which is open and anyone can go to today, about 66% of contributions to Delta come from companies outside of Databricks. It's this community that really makes us super excited and enables a ton of the features that are now available.

And these are time-tested, awesome, innovative pieces of functionality: things like change data feed, log compaction, and I love the row IDs feature that just came out. But there are also things like deletion vectors. Deletion vectors let you do fast updates and DML on your data; in many cases it's ten times faster than merge used to be. So if you have dbt workloads or you're doing lots of operational changes to data, deletion vectors make your life easier. Over 100 trillion row rewrites have been saved because of the deletion vector feature, and it's enabled by default for all Databricks users.
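For context on why deletion vectors speed up DML: instead of rewriting whole Parquet files, a DELETE or UPDATE can just record which rows are logically removed. A small hedged sketch, again assuming a Databricks/Spark notebook with `spark` defined and using the illustrative table from earlier; the table property shown is the documented Delta flag, but names here are placeholders.

```python
# Turn deletion vectors on for a table (they are on by default on recent Databricks
# runtimes; shown explicitly here for clarity).
spark.sql("""
    ALTER TABLE main.sales.orders
    SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')
""")

# This delete only marks rows as removed in a deletion vector; the underlying
# Parquet files get rewritten later (e.g., by OPTIMIZE), not at DML time.
spark.sql("DELETE FROM main.sales.orders WHERE amount < 0")
```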
And it's through these features that we've been able to unlock access to this amazing ecosystem of tools that support Delta, and with UniForm now GA, we get the same access to the Hudi and Iceberg ecosystems. So if you have tools, SDKs, and applications that work there, they're all part of the Delta family now, thanks to UniForm. There have also been some great improvements to a lot of the connectors, the Trino and Rust connectors, lots of awesome innovation happening here, and a lot of that is thanks to this new thing we've developed called Delta Kernel. Essentially, at the core of all of this there's a small library that you can plug into your applications or SDKs, and it contains all the logic for the Delta format, all the version changes and new features. It's making it so much easier for people to integrate and adopt Delta and, most importantly, stay up to date with the latest features. And we've been seeing this: the Delta Rust connector is community supported and has amazing traction; just a few weeks ago at Google's I/O conference, I believe, BigQuery introduced complete support for Delta, and very recently DuckDB added full support for Delta. And the best part is we have Hannes here, one of the co-creators of DuckDB, CEO of DuckDB Labs, and a professor of computer science, who's going to talk to us a little bit about how they integrated Delta into DuckDB. Hannes, get over here. [Applause] [Music]

Hey, thank you so much. Yes, hello, and a very good morning; it's wonderful to see all of you here. I have to adjust my eyes a bit to the number of people. As Shant said, I'm one of the people behind DuckDB. For those of you who don't know, DuckDB is a small, in-process analytical data management system; it speaks SQL, has zero dependencies, and I'm having a lot of fun working on it with a growing team. Last year I talked about DuckDB on this very stage for the first time, and it was very exciting, but a lot has happened since then in DuckDB land. There's been incredible growth in adoption; we're seeing all sorts of crazy things, and as an example, the stars on GitHub have doubled within a year to almost 20,000. In fact, we're so close to 20,000 that if you want to star it today, maybe we'll hit it. And just last week we actually released DuckDB 1.0, and that was a big moment for us; it was the culmination of six years of R&D in data management systems. What does 1.0 mean? It means we now have a stable SQL dialect and various stable APIs, and most importantly, the DuckDB storage format is going to be backwards compatible from now on.

But maybe taking a step back: how does DuckDB fit into the general ecosystem? If we look at the world's most widely used data tool, Excel, and we look at a very capable system like Spark, there's still a pretty big gap: there are a lot of datasets that are not going to work in Excel but are maybe a bit too small to actually throw Spark at them. DuckDB is really perfect for this last mile of data analysis, where you may not need a whole data center to compute something. For example, you've already gone through your log files in Spark, and now it's time to do some last-mile analysis with DuckDB, making some plots, what have you; that's where DuckDB fits into the big picture. But now we have to somehow get the data from Spark to DuckDB, so how are we going to do that? Obviously we're going to use the best tool for the job, right? CSV files? Maybe not. Typically people use Parquet files for this; both Spark and DuckDB can read and write Parquet files, so that works really well, but we've all heard about the issues that appear with updates, schema evolution, these kinds of things, which is why we have lakehouse formats. So today we are announcing official DuckDB support for Delta Lake. It's going to be available completely out of the box, with zero configuration or anything like that.
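A minimal sketch of what that looks like from the DuckDB side, assuming a recent DuckDB build where the `delta` extension is available; the table path is a placeholder, and remote object-store locations would additionally need the usual DuckDB credential setup.

```python
# Read a Delta table directly from DuckDB via the delta extension (built on Delta Kernel).
import duckdb

con = duckdb.connect()
con.execute("INSTALL delta")
con.execute("LOAD delta")

# delta_scan reads the Delta transaction log plus Parquet files; no Spark required.
df = con.sql("""
    SELECT order_id, amount
    FROM delta_scan('./data/orders_delta')   -- placeholder local path
    WHERE amount > 100
""").df()
print(df.head())
```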
Now, we have done a bunch of these integrations, and one thing that's really special about the Delta Lake integration is that we use the Delta Kernel that Databricks is building with the community. That's really exciting, because it means we don't have to build this from scratch, like we used to do with, for example, the Parquet reader; we can delegate a lot of the hard work of reading Delta tables to the Delta Kernel while keeping our own operators within the engine, and so on and so forth. So it's really exciting. We also made an extension for DuckDB that can talk to Unity Catalog. With this extension we can find the Delta Lake tables in the catalog and then actually interact with them from DuckDB itself. Here you can see a script that actually works if you install DuckDB now: you install this Unity Catalog extension, you create your secret, which holds credentials, and then you can basically just read these tables as if they were local tables. If you want to hear more about this, there's going to be a talk this afternoon at 1:40; just look for DuckDB in the title.

So the Delta extension joins a growing list of DuckDB extensions; for example, there are others for Iceberg, vector search, spatial, and so on. And as an open source project with a small team, we're really excited about Tabular and Databricks bringing Delta Lake and Iceberg closer together, because for us it means we don't have to maintain two different things for essentially the same problem. That means less work for us, and I think everyone wins. I just want to plug one small thing that we're actually launching today: I've mentioned DuckDB extensions, and we've seen a lot of uptake in them, and from now on we are launching community extensions, which means that everyone can make DuckDB extensions and publish them, and installing them is as easy as typing INSTALL into a DuckDB near you. So that's all for today; thank you very much, and I'll hand it back to Shant. [Applause]

That integration is super awesome; it's very exciting. Okay, so how do we top that? Not by going back, but by going forward, forward to Delta 4.0. We just had the branch cut, and it's available. Delta 4.0 is the biggest change in Delta's history. It's jam-packed with new features and functionality, things like coordinated commits and collations, all sorts of new functionality that makes it easier to work with different types of datasets. We won't have time to go through all of it, so I'm going to pick a couple and dive into why these are such amazing features.

So, liquid clustering is generally available now as part of Delta 4.0. With liquid clustering we set out to solve a challenge that so many people have brought up: partitioning. It's good for performance, but it's so complicated; you get over-partitioning, small files, you pick the wrong thing and it's a pain to resolve. Liquid solves this with a novel data layout strategy that's so easy to use that we hope all of you will say goodbye to PARTITIONED BY; you never need to say that again when you define a table. Not only is it easy to use, we found it's up to seven times faster for writes and 12x faster for reads, so the performance benefits are amazing, and of course it's easy to evolve the schema, make changes, and define anything without having to worry about all your data being rewritten and transformed.
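As an illustration of the "goodbye PARTITIONED BY" point, a hedged sketch of liquid clustering from a Databricks notebook (`spark` available) on a runtime that supports it; table and column names are made up.

```python
# Declare clustering keys instead of partitions; the layout is managed for you.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.web.events (
        event_ts  TIMESTAMP,
        user_id   BIGINT,
        country   STRING,
        payload   STRING
    )
    USING DELTA
    CLUSTER BY (event_ts, country)
""")

# Clustering keys can evolve as query patterns change, without a manual rewrite.
spark.sql("ALTER TABLE main.web.events CLUSTER BY (user_id)")

# OPTIMIZE incrementally re-clusters data on demand.
spark.sql("OPTIMIZE main.web.events")
```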
There are about 1,500 customers actively using this; the adoption has been insane, with over 1.2 zettabytes of data skipped. And you don't have to take my word for it: even Shell, when they started using it for their time-series workloads, saw over an order of magnitude improvement in performance, and it was just so easy to use.

Next, the open variant data type, and this one's really important; that first word, open, is really exciting. What happens now, in this world of AI, is that you have more and more semi-structured text data and alternative data sources coming into the lakehouse, and we wanted to make it easier for people to store and work with these types of data in Delta. Usually, when you're stuck with semi-structured data, most data engineers have to make a compromise, and none of us likes to make compromises; usually it's about being open, flexible, or fast, and often you can only pick two out of the three. For example, one approach is to just store everything as a string: that's open and gives you tons of flexibility, but parsing strings is slow; why would you store a number as a string and have to re-parse it every time? Of course, there's the option to pick the fields out of your semi-structured data and make them concrete types, and you get amazing performance and fast access, but if you have sparse data you lose a lot of the flexibility to modify the schema. And relational databases have had special variant-style data types for a while, but those have always been proprietary: if you wanted to use them to get a balance of not storing everything as a string and not shredding out every single column, you got locked in. That's why we're very excited that variant hits that sweet spot in the middle: you can take your JSON data and store it with flexibility, fully open, with amazing performance. It's very easy to use, it works even with complex JSON, here's an example of the syntax, and we found it's eight times faster than storing your JSON data as raw strings, which is just tremendous. So if you're storing JSON in a string field today, go back to work or home and start using variant. It's available in DBR 15.3, but most importantly, all of the code for variant is already checked into Apache Spark; there's a common subdirectory in the 4.0 branch right now that has all of the implementation details for variant and all of the operators, and there's a binary format specification and library that we've made available open source, so all the other data engines can use variant too. We really want this to be an official open format that everyone adopts, so that finally we have a non-proprietary way of storing semi-structured data reliably.
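A hedged sketch of what the variant syntax mentioned above looks like, assuming a Databricks notebook with `spark` on DBR 15.3 or later; table, column, and field names are illustrative, and the path-extraction details may differ slightly from your runtime's docs.

```python
# Store arbitrary JSON as VARIANT instead of raw strings.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.web.raw_events (
        event_id BIGINT,
        body     VARIANT
    )
    USING DELTA
""")

spark.sql("""
    INSERT INTO main.web.raw_events
    SELECT 1, PARSE_JSON('{"user": {"id": 42, "email": "a@b.com"}, "items": [1, 2, 3]}')
""")

# Path extraction keeps the schema flexible but avoids re-parsing strings on every read.
spark.sql("""
    SELECT body:user.id::BIGINT AS user_id,
           body:items[0]::INT   AS first_item
    FROM main.web.raw_events
""").show()
```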
And with that, I just want to summarize Delta Lake 4.0: it's interoperable, we have this amazing ecosystem with people like Hannes working together making it better and stronger, you get amazing performance benefits, and all of this is so much easier to use now than it ever was before. Thank you. [Music] [Applause] [Music]

Awesome, all right. Okay, so back to the Data Intelligence Platform slide. That was awesome: we heard about Delta and UniForm, we heard from Ryan Blue, and isn't it cool that DuckDB will now support working natively with Delta, UniForm, and UC? Super cool. And then we have Delta 4.0, so that's awesome. All right, so next, the original creator of the Apache Spark project, Matei Zaharia, is going to tell us about Unity Catalog. We have a lot of announcements here, and this is a long talk, so it's going to be super exciting. Let's welcome Matei, and also pay close attention to his T-shirt and his shoes. Welcome Matei to the stage. [Music]

All right, hi everyone, thanks Ali. Yes, I have the new Unity Catalog t-shirt; you'll be able to get one soon, I think, somewhere. All right, so I have a somewhat longer session for you today, because I'm talking about governance with open source Unity Catalog as well as data sharing. If you're familiar, we have another open source project we launched, Delta Sharing, that's really making waves in the open data collaboration space, and we have a lot of exciting announcements around that. So I'll start by talking about what's new in Unity Catalog and what it means to open-source it, why we did it, and what's in there. Ali announced that we're open-sourcing it yesterday, but he let me keep one more thing to announce today, which I'll talk about; that's the next big direction for Unity Catalog. And then finally I'll switch gears to sharing and collaboration, and we'll have some cool demos of all these things too.

So let's start with Unity Catalog. I think everyone who works in data and AI knows that governance, whether it's for security, quality, or compliance, remains a huge challenge for many applications, and it's getting even harder with generative AI. There are new regulations being written all the time; I heard that in California alone there are 80 bills proposed that regulate AI in some form. You also need to really understand where your data is coming from if you're going to create models, deploy them, and run these applications. So we hear from our customers all the time that they would love to use AI, but they can't really govern it with their existing frameworks, and even in the data space, which is complex enough, all the rules are changing and people are really worried about how best to do it.

So we wanted to step back and think: it's 2024, and if you had to design an ideal governance solution from scratch today, what would you want it to have? We asked a bunch of CIOs, and we think you really want three things. The first is what we call open connectivity: you should be able to take any data source in your organization and plug it into the governance solution, because no one's going to migrate everything into just one data system; over time, most organizations end up with hundreds or thousands of them, so you really want a governance solution that can cover all this data wherever it lives, in any format, even in other platforms. Then we also think you need unified governance across data and AI. I think it's clearer than ever with generative AI that you have to manage these together: you can't manage AI without knowing what data went into it, and all the output of AI, as you do serving, is itself data about how your application is doing, with the same problems of quality and security. So we really need it to be unified. And finally, we heard everyone asking for open access from any compute engine or client,
because there are so many great solutions out there, and they'll keep coming out; there'll be the next data processing engine, the next machine learning library, and you want them to work with your data. So this is what we're building with Unity Catalog, especially with the open source launch today.

First of all, open connectivity. Unity Catalog on Databricks, and Databricks as a platform, uniquely lets you connect data in other systems as well and process it together in a very efficient way, through a feature we call Lakehouse Federation. So you can really connect all your data and give users a single place where you set security policies, manage quality, and make the data available. It's also the only governance system in the industry, really, that provides unified governance for data and AI. Since the beginning, since we launched this three years ago, we've had support for tables, models, and files, and we're adding new concepts as they come out in the AI world, like tools, which we talked about yesterday with the tool catalog concept for gen AI agents. For all of these, you get the same capabilities on top, ranging from access control to lineage to monitoring and discovery. And finally, one of the big things that's possible today through the open API and the open source project we just launched is open access to all your tables through a wide range of applications. I'll talk more about that in a bit, but the cool thing here is, again, it's not just data systems like DuckDB; a lot of the leading machine learning tools, like LangChain, can also integrate with Unity Catalog. And since we launched this, it's been extremely widely adopted: most of our customers use Unity Catalog now, and some of them are managing tens of petabytes of data on it, like GM, or have thousands of active users, like Pepsi here.

So I'm going to briefly go through some of the new things, some quick announcements in each of these areas, and then we're going to see a demo of how all these things fit together, including with the open source project. Let me start with open connectivity. I'm really excited to announce today the GA of Lakehouse Federation, the ability to connect and manage external data sources in Unity. Thanks, everyone. This is a feature we launched a year ago at Summit. It builds on Apache Spark's unique ability to combine many data sources efficiently, and it lets you mount these data sources in Databricks, set governance rules on top, and get the same experience managing quality, tracking lineage, and so on as people work on them that you get with your Delta tables. And it's been growing extremely quickly; we now have 5,000 monthly active customers, and if you look at that graph of usage on Lakehouse Federation, it's still growing exponentially. So we see a lot of customers that can finally bring all this disparate data together, which is a reality in every company; as much as every data vendor would love for you to have all your stuff in one system, that's just not the case, and now you can actually work with it together. Another really cool thing we're announcing is Lakehouse Federation for Apache Hive and Glue: if you've got an existing Hive Metastore or Glue catalog with lots of tables in it, you can now connect it efficiently to Unity Catalog and manage that data as well, and that's rolling out later this year. I'm very excited about this; I think it's a defining feature of Databricks as a data platform.
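For a sense of what mounting an external source looks like, here is a hedged sketch of Lakehouse Federation from a Databricks notebook (`spark` available). The connection options vary by source and the Snowflake values below are placeholders; in practice you would use Databricks secrets rather than literal credentials, and you should check the Federation docs for the exact option names your source expects.

```python
# Create a connection to an external system, then mount it as a foreign catalog
# governed by Unity Catalog.
spark.sql("""
    CREATE CONNECTION snowflake_conn
    TYPE snowflake
    OPTIONS (
        host        'acme.snowflakecomputing.com',  -- placeholder
        port        '443',
        sfWarehouse 'ANALYTICS_WH',
        user        'SVC_USER',                     -- use a secret in practice
        password    '********'                      -- use a secret in practice
    )
""")

spark.sql("""
    CREATE FOREIGN CATALOG snowflake_sales
    USING CONNECTION snowflake_conn
    OPTIONS (database 'SALES')
""")

# Federated tables can now be queried in place and joined with native tables.
spark.sql("""
    SELECT s.store_id, s.report_total, r.return_total
    FROM snowflake_sales.public.store_report s
    JOIN main.retail.store_returns r USING (store_id)
""").show()
```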
Okay, so what about unified governance across data and AI? There's so much happening in this space, and our team has been working hard to launch a whole range of new features here. I won't have time to go through all of them, but just in the past year we've shipped everything from AI-assisted documentation and tagging to a lot of improvements to lineage, sharing, monitoring, and so on. I'm just going to highlight two announcements here, around ABAC and around Lakehouse Monitoring.

First, Lakehouse Monitoring is going GA. Lakehouse Monitoring is the ability to take any table or machine learning model in Unity Catalog and automatically check its quality. There are a lot of built-in quality checks, like: are there a lot of nulls, has it stopped being updated, and so on, plus you can do custom checks. The great thing is that since it's integrated into the data platform, we know exactly when the data changes or when the model is called, so it's very efficient and does all the computation incrementally. It gives you rich dashboards about quality, classification of the data discovered, and so on, plus all these reports go into tables, so you can programmatically query the quality of your entire data and AI estate. So that's going GA today, and it's already in use at thousands of customers. The second thing, which we're launching in preview soon, is attribute-based access control. We've developed a policy builder and tagging system where you can set tags on your data at any level in the catalog; we'll also auto-populate tags based on patterns we discover in the data, and you can then propagate masking policies across all of them. This works easily through the UI or through SQL. So those are just two of the things. Thanks.

And then, last but not least, the thing I'm probably most excited about is the launch of open source Unity Catalog and open access. A lot of people asked me yesterday why we are open-sourcing Unity Catalog, and really it's because customers want it, customers need it. Customers are looking to design their data platform architecture for the next few decades, and they want to build on an open foundation. Today, even though you see a lot of cloud data platforms that claim support for openness in some way, when you dig into them they don't truly follow through on that. There are a lot of cloud data warehouses out there, for example, that can read tables in, say, Delta Lake or Iceberg, but most of them also have native tables that are more efficient and more optimized, and they really nudge you to use those and to have your data locked in. And then there are other platforms, even some of the lake platforms, where it seems like everything's in an open format, but you have to pay for always-on compute and pay a high fee if you ever read the data from outside their engine. Customers don't want that. Everyone is saying they want an open lakehouse where they own the data, no vendor owns the data, there's no lock-in, and they can use any compute engine in the future. We've been big fans of this approach for a while;
we think it's where the world is going, so that's why we design everything we do to support it. Already today in Databricks all your data is in an open format; there's no concept of a native table. We also pioneered this cross-format compatibility last year with UniForm, so that all these ecosystems of clients that only understand Apache Iceberg or Hudi can still read your tables. And the next logical step is to also provide an open source catalog. So this is what we have in the first release of Unity Catalog: we'll be gradually releasing the things we built in our platform, removing the dependencies on Databricks code, and putting them in the open source project, but even in the first release you'll be able to govern tables, unstructured data, and AI. And we're really excited: we proposed the project to the Linux Foundation and it was accepted this morning, so it will live there. Thanks. Another cornerstone of Unity Catalog is that we're doubling down on the cross-format compatibility approach, so even the first release implements the Iceberg REST API; it's one of the first open source catalogs that implements it, so you can connect to it from any engine that understands that, and we hope that means a lot of the ecosystem out there will work with it.

All right, so you might be asking: is this for real? When are you actually releasing it? Maybe in 90 days? Maybe 89 days, because Ali announced it yesterday? I'm just going to walk over to my laptop right here. So this is Unity Catalog on GitHub; looks solid to me, people are working hard on it. I'm just going to go into the settings here, scroll down to the danger zone, and make this thing public. Yep, I want to make it public; I understand; make it public. [Music] And I think it's public now. So yeah, take a look: github.com/unitycatalog. Thanks, everyone. That wasn't that hard. And of course we invite all of you to contribute; we'll be working hard to expand the project, and we want to do it all in the open; we're not going to keep it closed for a while to build this stuff up.

All right, so it is now available; my slide is right. We just released version 0.1. This version supports tables, volumes for unstructured data, and AI tools and functions, so it implements that tool catalog concept we talked about yesterday. It has the Iceberg support, and if you look at our website, it has open source server APIs and clients, and these work just as well with your instance of Unity Catalog on Databricks, so everything built there you can just connect to your current data. We're also really excited to have a great array of launch partners, everyone from the cloud vendors, some of which have already been contributing a lot to open standards like the Iceberg REST API, to leading companies in AI and in governance. Microsoft, AWS, Google, they're all excited to see this happening, and we hope to work closely with them to contribute to Apache Iceberg as well, and to help define the standards for this so customers get the interoperability they want.
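As a heavily hedged sketch of what "open server APIs and clients" can mean in practice, here is a minimal Python snippet poking a locally running open source Unity Catalog server over REST. The port, API path, and sample catalog/schema names are assumptions based on the project's early quickstart, not confirmed by the talk; check the repository at github.com/unitycatalog for the current endpoints before relying on any of this.

```python
# Assumed: a local open-source Unity Catalog server started from the repo's quickstart.
import requests

BASE = "http://localhost:8080/api/2.1/unity-catalog"   # assumed default base URL

# List catalogs known to the local server.
resp = requests.get(f"{BASE}/catalogs", timeout=10)
resp.raise_for_status()
for cat in resp.json().get("catalogs", []):
    print(cat.get("name"))

# List tables in a schema (names here match the repo's sample data, if present).
resp = requests.get(
    f"{BASE}/tables",
    params={"catalog_name": "unity", "schema_name": "default"},
    timeout=10,
)
print(resp.json())
```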
And of course there is a lot more coming. We're working on bringing a lot of the nice things you have in Databricks and Unity out here, including Delta Sharing, models, MLflow integration, views, and other things, and we invite you again to collaborate with us. So that's an overview of Unity Catalog and some of our launches in it. It's great to hear about them, but even better to see a demo, and for that I'd like to invite Zeashan Pappa, one of our product managers, to walk you through all these new features. [Applause] [Music]

Thanks, Matei. I'm glad to be here. I'll walk you through each of the features that Matei talked about. Let's start with the Catalog Explorer, over here on the left-hand side. This offers a unified interface for browsing, applying access controls, and querying tables. It enables you to navigate and organize catalogs, tables, functions, models, volumes, and other data systems, both within and outside of Databricks. In some cases only part of your data will reside in the lakehouse; to address this, we've simplified and secured access control to systems such as BigQuery, catalogs such as Glue and Hive, MySQL, Postgres, Redshift, Snowflake, and Azure SQL, all powered by Lakehouse Federation. Switching over here to a SQL editor, I'll show you how to query an external data system by running some SQL that joins a store report table federated from Snowflake with a lakehouse-native source that contains data on retail store returns. Once this table is created, the store report table becomes a Unity Catalog managed object, which means the platform now handles all of your table management challenges for you, including automatic data layout, automatic performance tuning, and predictive optimizations.

But managed doesn't mean locked in. This table, or any Unity Catalog object, is accessible outside of Databricks via Unity Catalog's open API. Let me show you how easy it is to query this newly created object using DuckDB. First, I'll opt this table in for external access, as I've done for the other tables in this catalog. Next, I'll switch over to DuckDB, the same nightly build that you can access right now. I'll attach this catalog, accounting prod, to DuckDB, and now I'll run a quick query to see all of the tables in the catalog. As you can see, the store report table I just created is right there. Next, I'll run a quick query to select from this table. I can take tables created in Unity Catalog and quickly query them using DuckDB's native Delta reader. This is Databricks' commitment to open source and open interoperability, here and now. [Applause]
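A hedged reconstruction of the DuckDB side of that demo, using the `uc_catalog` and `delta` extensions from a nightly build as mentioned on stage. The secret fields, endpoint, catalog name, and schema/table names are placeholders, and the exact option names may differ across DuckDB versions, so treat this as a sketch rather than copy-paste-ready code.

```python
import duckdb

con = duckdb.connect()
for ext in ("uc_catalog", "delta"):
    con.execute(f"INSTALL {ext} FROM core_nightly")
    con.execute(f"LOAD {ext}")

# Credentials for the Unity Catalog endpoint (placeholder values).
con.execute("""
    CREATE SECRET uc_secret (
        TYPE UC,
        TOKEN 'dapi-xxxx',
        ENDPOINT 'https://my-workspace.cloud.databricks.com'
    )
""")

# Attach the catalog, then query its tables as if they were local.
con.execute("ATTACH 'accounting_prod' AS accounting_prod (TYPE UC_CATALOG)")
print(con.sql("SHOW ALL TABLES").df())
print(con.sql("SELECT * FROM accounting_prod.reports.store_report LIMIT 10").df())
```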
So far I've walked through Unity Catalog's Explorer, Lakehouse Federation, and our new open API. However, a major challenge for many organizations is ensuring consistent and scalable policy enforcement across a diverse range of data and AI assets. Let me show you how easy it is to scale your governance using tags and ABAC policies combined with proactive PII monitoring. Let's switch over to the online sales prod catalog and take a look at this table called web logs. One of the features that's been enabled in Unity Catalog is Lakehouse Monitoring, which allows for simplified and proactive detection of PII and anomalies in your data and models. Within this dashboard, over here on the left-hand side, you can explore columns and rows, and you can see that PII has been detected in the user input column. Now, this is obviously a problem: before this dataset can actually be used, the data must be masked and appropriate policies must be applied.

Let's switch back to the Catalog Explorer. Back in the Explorer, over here in the rules tab, a new rule can be created to express policy across all data; it's now so much easier to mask all email columns across all tables with a single rule. Let's give this rule a name; let's call it mask email. We'll give it a quick description: let's mask some emails. We want to apply it to all account users, it's a rule type of column mask, and we'll select a column mask function I previously created, conveniently called mask email. We want it to match on a condition: when a column is detected that has a tag with the value PII email. Let's go ahead and create that rule, and that's it. Now, to validate the mask, we go back to the web logs table, and we can observe in the sample data, in the user input column over here on the right, that the data has now been masked. [Applause] Since this applies to the entire catalog, let's go to a different table; somewhere in here, there we go. As you can see, we've got an email address column in here as well, tagged PII email. Let's go up to the sample data, and as you can see, this table has also been masked by that one rule. So, Matei, I've shown how Unity Catalog enables organizations to have open access to their data seamlessly, no matter where it resides, while applying unified governance to ensure its integrity and security. Thank you. [Applause] [Music]
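The rule in the demo is built in the UI, but the underlying primitive is a column mask function attached to columns that carry the matching tag. A hedged sketch of doing the equivalent by hand in SQL, assuming a Databricks notebook with `spark`; the function, tag, table, and group names are all illustrative.

```python
# A SQL UDF that masks emails unless the caller is in an allowed group.
spark.sql("""
    CREATE FUNCTION IF NOT EXISTS main.governance.mask_email(email STRING)
    RETURNS STRING
    RETURN CASE
        WHEN is_account_group_member('pii_readers') THEN email
        ELSE '***MASKED***'
    END
""")

# Tag the column so tag-driven (ABAC-style) policies can find it ...
spark.sql("""
    ALTER TABLE main.online_sales.web_logs
    ALTER COLUMN user_input SET TAGS ('pii' = 'email')
""")

# ... and attach the mask to the column directly.
spark.sql("""
    ALTER TABLE main.online_sales.web_logs
    ALTER COLUMN user_input SET MASK main.governance.mask_email
""")
```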
Thanks so much, Zeeshan. So that's Unity Catalog in action. As I said, Ali let me keep one thing to announce today, which I'm really excited about. What we just saw is that you can set up a catalog that's open, with access control and monitoring, that you can reach from any engine, and that you can federate other systems into. As an engineer you might say, this is pretty good, are we done? What happens, unfortunately, is that someone comes in and asks a business question, for example: how is my ARR trending in EMEA? To answer that kind of question there isn't enough information in the catalog alone, in the table schemas and things like that. You have to understand how ARR is defined, which is some calculation unique to your business; there may be many tables that mention ARR, so which one is actually the right one to use; how EMEA is defined and which countries are really part of it; and so on. So the question is how you bridge this gap. Typically that's done in some kind of metrics layer, and we're really excited to announce that we're adding first-class support for metrics as a concept in Unity Catalog. [Applause]

This is something we'll be rolling out later this year: Unity Catalog metrics. The idea is that you can define metrics inside Unity Catalog and manage them together with all your other assets, so you can set governance rules on them, find them in search, get audit events and lineage for them, and so on. Like the other parts of Unity, we're taking an open approach: we want you to be able to use the metrics in any downstream tool, so we'll expose them to multiple BI tools and you can pick the BI tool of your choice. We'll of course integrate them with AI/BI, and one of the things we're excited about is that we're designing this from the beginning to be AI friendly, so AI/BI and similar tools can really understand how to use these metrics and give you great results. You'll also be able to use them directly through SQL, via table functions you can compute on. And we're partnering with dbt, Cube, and AtScale as external metrics providers, to make it easy to bring in, govern, and manage metrics from those tools inside Unity. To show this in action, I'd like to invite Zeeshan back to the stage for a quick demo of metrics. [Music]

Thanks, Matei. As you mentioned, metrics enhance business users' ability to ask questions, understand their organization's data, and ultimately make better decisions. Rather than sifting through all of the data, certified business metrics can be governed, discovered, and queried efficiently through Unity Catalog. Having already looked at Catalog Explorer, let's dive into business metrics. In this overview tab you see a list of all available sales metrics, and a few of them are marked certified, which indicates a higher level of trustworthiness. We'll select the web revenue metric; as you can see, it's also marked certified. Clicking into it, you can see the metadata associated with this metric and the predefined dimensions, such as date or location, that are used when querying it. It's like having a built-in instruction manual for your data. On the right-hand side is the metric overview section: the description of the metric, who edited it, who certified it, where the metric came from, and where it's used, such as notebooks, dashboards, and Genie spaces.

Let's click into this dashboard. On the right-hand side, in the x-axis column, I have all the interesting dimensions (country, city, state, and so on), which lets you slice and dice without needing to fully understand the data model. This metric isn't just usable in a dashboard; it's also queryable from external tools and notebooks. In this notebook we're using the get metric function to pull aggregated data, and it's that simple. Finally, in a Genie space, the web revenue metric can be used to answer natural language questions; this visualization was created by asking about the revenue generated across states using the metric. This approach extends the reach of these metrics throughout the organization, making them accessible to business users. As you can see, Matei, Unity Catalog metrics make it easy for any user to discover and use trusted data to make better decisions. [Applause] [Music]

All right, thanks Zeeshan, super excited about this. After doing two demos this morning, I think Zeeshan can have the rest of the day off.
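One caveat on the notebook step above: Unity Catalog metrics were announced here rather than shipped, so there is no public API to quote yet. Purely as an illustration of the "query a governed metric through a SQL table function" idea, with the function name taken from the demo narration and every argument below an assumption rather than a real signature, it might eventually look something like this:

    # Hypothetical only: illustrates the metric-as-table-function idea from the
    # demo. The function name comes from the narration ("get metric"); the
    # signature, metric name, and dimensions are assumptions, not a real API.
    result = spark.sql("""
        SELECT *
        FROM get_metric(
            'main.sales.web_revenue',   -- assumed fully qualified metric name
            'date', 'state'             -- assumed dimension arguments
        )
    """)
    result.show()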
All right, so the final portion of the talk is about what we're doing in sharing and cross-organization collaboration. Depending on what industry you're in, you've probably seen that data sharing and collaboration between companies and organizations is becoming a really important part of the modern data space. It can help providers and suppliers coordinate better and streamline a lot of business processes; just yesterday I met a customer who thought they could speed up launching new drugs by a factor of two by implementing these kinds of technologies. So it's a really powerful way for many industries to move forward.

We started looking at this area about three years ago. We wanted to provide great support for it, and we started by talking to a lot of data providers who collaborate. What they told us is that many of the data platforms out there support some kind of sharing between different instances, but it's always closed: you can only share within that same data platform, with other customers of that data warehouse. As a provider, or any company that wants to collaborate with a lot of partners, that's very restrictive. Amperity, for example, which is a customer data platform, said they would prefer to invest in open solutions that let them set up data collaboration once and then reach anyone, regardless of what platform they're computing on. That's the approach we've taken with all our sharing and collaboration infrastructure: an open collaboration ecosystem based on open standards. The core of that is Delta Sharing, a feature of Delta Lake that lets you securely share tables across clouds and across data platforms, and we've built on it with Databricks Marketplace and Databricks Clean Rooms.

If you're not familiar with Delta Sharing: it's a core part of the Delta Lake project where, if you have a table (and increasingly other kinds of assets as well), you can run a server with an open protocol and serve out just the parts of your table that other parties are authorized to access. Because the protocol is open and very simple, based on handing out access to Parquet files, it's really easy to implement a lot of consumers. You can of course use Databricks to access these shares, but you can also just use pandas or Apache Spark, and even BI products like Tableau and Power BI let you load shared data right in. And that makes a lot of sense: if you're a data provider and you want to publish something, why should the other party even need to install a data warehouse in the first place? Why not deliver that data straight to Tableau, or straight to Excel?

Delta Sharing went GA two years ago and it's continuing to grow extremely quickly. Just now we have over 16,000 recipients receiving data through Delta Sharing from our customers on the Databricks platform, and that's growing by a factor of four year on year, so there's no end in sight. The other thing I'm really proud of is that 40% of those recipients are not on Databricks. This idea of cross-platform collaboration is real, and our customers are able to deliver data and have real-time data exchange with anyone, regardless of what data platform they're using.
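To make the "any consumer, no data warehouse required" point concrete, this is roughly what the recipient side looks like with the open delta-sharing Python client (pip install delta-sharing). The profile file and the share, schema, and table names below are placeholders for whatever a provider actually grants you.

    # Reading a Delta Sharing table as a recipient, using the open Python client.
    # `open-datasets.share` stands in for the credential profile file a provider
    # sends you; the share/schema/table names are likewise placeholders.
    import delta_sharing

    profile_file = "open-datasets.share"
    client = delta_sharing.SharingClient(profile_file)
    print(client.list_all_tables())          # what has been shared with me?

    table_url = f"{profile_file}#retail_share.sales.orders"   # <profile>#<share>.<schema>.<table>
    pdf = delta_sharing.load_as_pandas(table_url)              # small tables: straight into pandas
    print(pdf.head())

    # On a Spark cluster, the same table can be loaded as a DataFrame instead:
    # df = delta_sharing.load_as_spark(table_url)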
This year we're continuing to expand Delta Sharing, and one really exciting announcement is that we're hooking together two of the best features of the platform, Lakehouse Federation and sharing, to let you share data automatically from other data sources as well. We talk to a lot of companies that have data in another data warehouse, or have a partner who isn't on Databricks and is on another platform, but still want to collaborate. Since we've built Federation technology that can efficiently query that data, push down filters, and get it out, we're connecting it to Delta Sharing so you can do this seamlessly: you can now share data from essentially any data warehouse or database with any app that understands the Delta Sharing protocol. So that's Delta Sharing. [Applause]

One of the things that builds on Delta Sharing is Databricks Marketplace. We launched it about two years ago and it has also been growing extremely quickly: it's now up to over 2,000 listings, again more than 4x year on year, which puts it up there with the largest data marketplaces on any cloud or platform, and it keeps growing. Our team has been adding a whole bunch of new functionality that providers are asking for, like private exchanges, sharing of non-tabular assets such as models and volumes, usage analytics, and even support for non-Databricks clients, so if you put data in there you can reach those other platforms as well. We're also super excited to welcome twelve new partners to the sharing and Marketplace ecosystems. Some of these announcements went out last week; from Acxiom to Amperity to Atlassian, industry leaders in many different domains are now connecting to these ecosystems and making data available to users on Databricks, or really on any platform that implements the open sharing protocol. They join our existing ecosystem of partners, so thanks to all of them for participating.

The final thing I want to talk about is that we're soon launching the public preview of Databricks Clean Rooms. Clean rooms are a way to do a private computation with another party: you each bring in your assets, which can be tables, code, unstructured data, AI models, any kind of asset you can have on the Databricks platform, and you agree on a computation that you run, sending the results to just one recipient. It can be as simple as each of you having some tables and wanting to figure out how many records you have in common, or as complicated as one party having a machine learning model they want to keep private, the other having a dataset they want to keep private, and applying the two together to get predictions or compare models. Two things really distinguish Databricks Clean Rooms from other clean room solutions out there. First, because you have the complete data and AI platform, you can run essentially any computation: machine learning, SQL, Python, R, versus just SQL in many other clean room solutions. Second, it's built on Delta Sharing and integrates with Lakehouse Federation, so it's very easy to do cross-cloud and even cross-platform collaboration: if someone's primary data store is not Databricks, they can still seamlessly connect it to the clean room and do work on it. This is going into public preview a little later this summer, and we've already seen some really awesome use cases.
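The simplest clean-room computation mentioned here, two parties counting how many records they have in common, is just ordinary Spark code once the clean room has made both tables visible to an approved notebook. A sketch of that computation alone (not the clean-room setup itself), with made-up table and column names:

    # Sketch of the "how many records do we have in common" computation described
    # above, as plain PySpark. In a real clean room this would run in the approved
    # notebook against the two shared tables; every name here is illustrative.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    retailer = spark.table("cleanroom.retailer.customers").select("hashed_email")
    supplier = spark.table("cleanroom.supplier.loyalty_members").select("hashed_email")

    # Join on a hashed identifier and count the overlap; only this aggregate
    # leaves the clean room, never the underlying rows.
    overlap = retailer.join(supplier, on="hashed_email", how="inner").distinct().count()
    print(f"Records in common: {overlap}")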
One company we've been working closely with is Mastercard, which has so many exciting use cases with a whole range of different partners. You can imagine the different kinds of things they can do with their data, and they're looking for the best way to run private versions of state-of-the-art algorithms and techniques on it. So I want to show you all this collaboration work in action, and for that we have our third demo. I'd like to invite Darana, our product manager for Clean Rooms. [Music]

Thanks, Matei. Picture this: I'm part of a media team at a large retailer, and we're teaming up with a supplier to run a joint advertising campaign to grow our sales. For this we need to identify our target audience, which means collaborating on joint customer data. I, as the retailer, have data on my customers and their shopping behavior; my supplier has their customer loyalty data. But we have some challenges. First, we cannot share any sensitive information about our customers with each other. Second, our data is on different clouds, regions, and data platforms. And finally, we want to use machine learning and Python for our analysis, not just SQL. Databricks Clean Rooms can help with all of this in a privacy-safe environment. Let's see how.

Here I am in the Databricks Clean Rooms console as the retailer, and I create a clean room in a few simple steps. I'm using Azure and East US 2 as my cloud and region, and what's amazing is that it doesn't matter that my supplier and I are on different regions and clouds. I then specify my supplier as a collaborator, and once the clean room is created I bring in the data. I add my audience table, and what's awesome is that I can also bring in unstructured data and AI models to collaborate with; here I add an ad-science private library I've created that invokes an AI function to help me with my audience segmentation task. Now it's time to add a notebook, and I add one I've pre-created for audience segmentation, preconfigured to use my private library. The best part about this notebook: I can use Python for machine learning. Now my clean room is ready for my supplier to join.

So let me flip hats: I'm now the supplier, hence dark mode, and I join the clean room my retail counterpart added me to. I see all the assets they brought to the clean room, and when I click into the audience table I see the metadata associated with the table, but not the actual data. That's perfect context for good collaboration while ensuring I'm not privy to any sensitive information. Now it's my turn to bring in my customer data, but my customer data is in a Snowflake warehouse outside Databricks, and I don't want to build a custom ETL pipeline to bring it in. And I don't have to, because I can directly specify Lakehouse Federated tables as sources for this clean room, with no copy and no ETL. These clean rooms truly scale for cross-platform collaboration. And now my favorite part: I inspect the notebook, the code looks good, and I run it. The job run starts successfully, and in a few seconds it's done and I'm presented with delightful visual results telling us we can target 1.2 million households for our campaign, based on factors such as customer age, income bracket, and household size. Thank you. [Applause]

Let's go back to the slides to summarize what we just saw. Our retailer and supplier were able to bring their respective customer data into a privacy-safe environment, the clean room, and collaborate
without sharing any sensitive information with each other. It didn't matter that they were on different clouds, regions, or data platforms; they could collaborate on more than just structured data; and they were able to use Python for machine learning. Thank you all so much, and back to you, Matei. [Applause] [Music]

Awesome demo. So, super excited about clean rooms, and especially cross-platform clean rooms; I think it's really going to transform a lot of industries. It just makes sense to be able to collaborate on data and AI in real time, in a secure fashion. Overall, I hope I've given you a good sense of our approach. We really believe that picking the right governance and sharing foundation for your company is essential for the future, and we think it needs to be an open, cross-platform approach. We've been thrilled to see both Unity Catalog and Delta Sharing go from just an idea to being used by virtually all our customers in a few years, we're excited that both of them are open, we're excited about the partners, and we invite you to join the open ecosystem. That was a lot of tech in this keynote, but the exciting thing is what you can do with the tech, and for that I'm super thrilled to invite our next speaker, an actual sports star, for the first time on the Data + AI Summit stage: Alexander Booth from the Texas Rangers.

"...and the Texas Rangers are one win away from their first world championship... Texas takes the lead... it's happened: the Texas Rangers win the World Series, champions in 2023!" [Music]

That was a huge moment for us as a baseball organization, moving from the bottom of the league to winning our first ever World Series. All credit must go to the players and coaches who made it happen. It was also a huge moment for our community, as over 500,000 people attended our World Series parade, and for me, growing up a lifelong Rangers fan, it was a dream come true. But it was also a win for the data team that I lead at the Rangers, and I'm here today to talk to you about how we use data intelligence to drive competitive advantage and transform how the modern game is played.

Most of you may know this, but baseball has always been a data-driven sport, whether that's comparing statistics on the backs of baseball cards or the modern age of Moneyball. How data is used in decision-making, though, has changed dramatically in this modern age of AI: data used to be descriptive, evaluating past performance; now data is predictive, optimizing our understanding of future player performance. One example is how we're using data and AI for biomechanical insights. We build predictive models of how the body's motion affects how a ball is thrown, leading to designed pitches, guided by AI, that are personalized for each unique pitcher. And with a better understanding of how players move when swinging the bat, we can provide biomechanical recommendations optimized for specific types of hits. With these insights we can advise our players: if you're trying to hit for power, get those legs into it and try to get it out of the ballpark; if you just want to hit for contact, square up the ball. In Little League my coach would always tell me to choke up on the bat and bend my knees; we now measure that at a high frame rate, 300 frames a second. This pose tracking also gives us further insight into injuries and workload management. Data and AI help us make the most of our players' athletic talents, leading to those incredible clutch hits that many of you
saw during the World Series. Pose tracking isn't our only new data source. We also track every player's position continuously, at 30 frames a second, for every Major League game. This gives us unprecedented ways of measuring defensive capabilities: by understanding tendencies, reaction times, and the way our fielders move when trying to catch a fly ball, we can optimize our defensive placement using AI to maximize the likelihood of a player making the out. And yeah, maybe we got a little too good at that and Major League Baseball changed the rules a couple of years ago, but we still use it to this day. This culminated in a playoff run where we went 11-0 on the road, highlighted by impressive defensive plays such as home-run-robbing catches and clutch double plays.

How did we change this data and AI game? It wasn't always like this. Getting to the point where we could realize these successes was not easy; there were so many challenges just a few years ago when we began our data modernization journey. Stop me if any of this sounds familiar. Our on-prem stack could not scale to these new data sources, and the rise in IT cost and maintenance of those on-prem servers led to an untenable ROI on AI investments. Further, as our data team grew, supporting minor league operations all around the country as well as scouting initiatives all around the world, governance and permissions became difficult to manage. We lacked governance and ran into fragmented silos: our data teams were split between minor league player development, amateur analytics, international, and advanced scouting. Slow and disjointed processing within those silos led to delays in the reports our players and coaches needed; in some cases we weren't delivering reports until the next day, well after the game had already finished. And while we don't have a live link to the dugout, perhaps because of a certain trash-can-banging incident a few years ago, it's still imperative that our players receive that information post-game, to understand what happened and how to be successful tomorrow. With 162 games, baseball is a marathon, and quick feedback is a necessity for our players.

To solve these problems we unified and simplified our data and AI stack on the lakehouse with Databricks. Unity Catalog unites our data silos under one roof. We have a variety of data with sensitive information, such as player addresses and financial contract information, and biomechanical and medical records that should not be widely accessible throughout the org. Unity Catalog gives us a single shared platform with appropriate permissions in place to comply with both internal and external regulations such as FERPA and HIPAA. It also gives us the ability to manage clusters and ETL pipelines, all within JSON metadata. Once our data is loaded, the data intelligence platform can comment on and provide AI-generated summaries of what that data actually is, which democratizes its use for our analysts, who sometimes struggle to figure out where the correct data source lives. Finally, we've built hundreds of ML and AI models on this data; the model registry governed by Unity Catalog gives us a great platform to organize and search those models, and Unity Catalog also lets us govern which users and data teams can access models and features from the feature store for their own projects. Data lineage across all of this gives us great insight into how data flows from source, to modeling, to the final BI reports
that our players need, and transparency builds trust. And of course, data sharing allows us to connect with other data verticals and vendors inside our organization: ballpark operations, concessions, even live data on how fans are engaging with the team, everything with the appropriate permissions in place.

The net result: we now have four times more data ingested and used for AI at the same cost as our legacy systems; we have hundreds of users scattered around the country and the globe with secure, governed access to these data and ML KPIs; and we have ten times faster data insights after games and workouts, getting reports into our players' hands quickly so they can be the best they can be. And of course, all of this contributed to our first ever World Series win. I tried to have a spotlight on my ring for the whole time up here, but they said no to that. Databricks is really helping our organization win by empowering our team with data intelligence, and we're just getting started. With the rise of generative AI, we've invested time and effort to find innovation in this new space, and I have a quick demo where I'll use Databricks AI/BI Genie to provide a natural language interface into our data.

With the trade deadline coming up, and being in San Francisco, I thought it would be fun to see whether any players on the San Francisco Giants might have future trade value. In this application we're using public data from Baseball Savant. Notice that these tables, as well as the application itself, are governed through Unity Catalog: users need the correct permissions to access both. Comments and summaries describe and help teach the Genie application what our internal KPIs, the ones that mean something to me but maybe not to you, actually are, and all of this is shared and governed within the workspace. Analysts can ask broad questions of this data; here we're asking who on the Giants has any trade value. I type super slowly, so the team asked if I wanted to pre-record this part; you can pretend I'm typing it out. The Genie doesn't know how to answer the question of what trade value is; it just brings back statistics about players on the Giants. So what we can do now is instruct the Genie what I mean by trade value: I want to look at the difference between expected and observed performance to find undervalued players. We quickly see that the Genie application identifies Luis Matos and Matt Chapman as having had significant underperformance this season for the Giants, but maybe they'll perform better over the rest of the season, if they ever call Luis Matos back up from Triple-A, but that's a side note. We can give this a thumbs up and save it as an instruction for easy access later, and we can also visualize the data for quick consumption. Since we saved it as an instruction, it's now trivial to do the same analysis for other clubs: here I'm asking it to do the same analysis for the Chicago White Sox, and after some time thinking (that's my double-fast-forward click), we see that Martín Maldonado and Andrew Benintendi have been underperforming for the White Sox.
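The instruction given to Genie here, compare expected to observed performance and surface the biggest gaps, is also easy to express directly in PySpark. A sketch under assumed names: the table and the observed/expected columns below are stand-ins, not the actual Baseball Savant schema used on stage.

    # Illustrative only: the "expected vs. observed performance" idea from the
    # Genie demo as a plain PySpark query. The table and column names are
    # assumptions, not the schema shown in the demo.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    batting = spark.table("baseball.savant.batting_stats")

    undervalued = (
        batting
        .where(F.col("team") == "SF")                                  # e.g. the Giants
        .withColumn("gap", F.col("expected_woba") - F.col("observed_woba"))
        .orderBy(F.col("gap").desc())                                  # biggest underperformance first
        .select("player_name", "observed_woba", "expected_woba", "gap")
        .limit(10)
    )
    undervalued.show()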
What this has allowed us to do is give our analysts, SQL developers, and less technical stakeholders unprecedented, democratized access to the raw data in our database. It lets them ask the questions they need and creates an efficient starting point for targeted, further decision-making heading into the trade deadline. Thank you so much to the Databricks team that supports us, Michelle, Hussein, and Chris, who onboarded us, and thank you for the opportunity to speak with you all this morning. Finally, we're always looking to keep pushing the boundaries of data and AI in sport, so if you're interested, please reach out. Baseball is a team sport, after all, and we do a lot of our hiring in the offseason, so best of luck if you use that QR code; you can always find me on LinkedIn and I'm happy to talk about this further. Thanks so much. [Applause] [Music]

Wow, that's so cool. Did you see the ring he was wearing? It's gigantic. I call it Moneyball 2.0. You've got to go check out their booth in the expo hall: they can actually analyze your swing and everything, collect all the data, give you a score, and help you improve it. So check that out.

Okay, I'm going to introduce my co-founder next. I actually said backstage, "they say he's the number one committer on Apache Spark," but we looked, and he's actually no longer number one, he's number three. For seven years, though, he was the number one committer on the Apache Spark project, and he's going to tell us about Spark. One of the cool things is that this project is now more than ten years old, so you'd think we know what Spark is, but it has dramatically changed in just the last two or three years. The project has been completely transformed, and he's going to tell us how that happened, what the changes are, and how the community pulled it off. Let's welcome Reynold Xin to the stage. [Music]

All right, thank you Ali for that "number three" speech. Good morning again. As many of you know, this conference actually started out named the Spark Summit, and then the Spark + AI Summit, and in this talk we're going back to the roots of the original conference: Apache Spark. Three years ago at this conference we polled about a hundred of you and asked what your biggest challenges with Apache Spark were, and here's what you told us. By far the number one was: I have a bunch of Scala users, they're in love with Spark, it's great, but I also have a whole bunch of Python users, as a matter of fact way more of them, and they really don't get Spark; it's kind of clunky and difficult to use in Python, and it's not a great tool for them. Number two, almost everybody else said: I love Spark, I've been using it with Scala too, but dependency management for my Spark applications is a nightmare, and version upgrades take six months, one year, three years, you name it. And then there was a consensus among the language and framework developers out there, not a huge population but a very important part of the Apache Spark community, who told us that because of Spark's tight coupling to JVM languages, it's very difficult to interact with Spark from outside the JVM as a framework developer, not just as an end user. So we got to work. Let's talk about the first one: Spark is Scala-native, but my users mostly write Python.
If you've been to this conference in the past, you know this is not the first time we've talked about Python. But I found a video from about three years ago, just the other day as I was preparing this talk, from Zach Wilson, who used to be a data engineer at Airbnb, and here's what Zach had to say: "Another one is, Spark is actually native in Scala, so writing Spark jobs in Scala is the native way of writing it. That's the way Spark is most likely to understand your job, and it's not going to be as buggy." I believe Zach is actually sitting somewhere here today. So it wasn't just people at this conference saying that Scala is the native way of writing Spark and that Python is buggier, and we got to work.

Three years ago at this conference, which I think might still have been named Spark + AI Summit back then, when the theme of all the slides was a white background instead of a dark one, we talked about the Project Zen initiative in the Apache Spark community. It focused on a holistic approach to making Python a first-class citizen, and that includes API changes, better error messages, debuggability, and performance improvements, covering almost every aspect of the development experience. Two years ago we gave a progress report and talked about all the improvements in those two Spark releases, and last year we showed a concrete example of how much auto-complete had changed, out of the box, from Spark 2 all the way to Spark 3.4. This slide summarizes a lot of the key PySpark features in Spark 3 and Spark 4, and if you look at them, it really tells you that Python is no longer just bolted onto Spark; it's a first-class language. There are actually Python features that aren't even available in Scala: for example, you can define Python user-defined table functions these days and use them to connect to arbitrary data sources, which is a much harder thing to do in Scala. At this conference alone this year we had more than eight talks on various features of PySpark itself. So a lot of work has gone into it.

How much benefit are users seeing? This is one of those moments where I could tell you about it non-stop, but it's best if you try it out yourself; it's a completely different experience. Over the last twelve months alone, PySpark has been downloaded in over 200 countries and regions around the world, according to PyPI stats. And I was doing some analysis the other day and was really surprised to find this number: on Databricks alone, for Spark versions 3.3 and above, so not counting any of the earlier Spark versions, of which there are a lot out there, our customers run more than five billion PySpark queries every day. To give you a sense of that scale: I believe the leading cloud data warehouse runs about five billion SQL queries a day. This is matching that number, and it's only a small portion of the overall PySpark workloads out there.
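Since Python user-defined table functions come up as an example of a Python-only capability, here is a minimal sketch of one. It relies on the UDTF support added in recent Spark releases (Spark 3.5 or later, with PyArrow installed):

    # Minimal Python user-defined table function (UDTF), one of the
    # Python-first features mentioned above. Requires Spark 3.5+ with PyArrow.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit, udtf

    spark = SparkSession.builder.getOrCreate()

    @udtf(returnType="word: string, length: int")
    class SplitWords:
        # eval() may yield any number of output rows per input,
        # which is what makes it a *table* function rather than a scalar UDF.
        def eval(self, text: str):
            for word in text.split():
                yield word, len(word)

    # Call it directly on a literal...
    SplitWords(lit("the quick brown fox")).show()

    # ...or register it and use it from SQL like any other table function.
    spark.udtf.register("split_words", SplitWords)
    spark.sql("SELECT * FROM split_words('hello pyspark world')").show()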
But the coolest thing was that, after finding that earlier video in which Zach said Scala was the native way of doing it, I found another video he published just about three months ago. By the way, I had never met Zach until last week, when I reached out to ask whether he'd be okay with me showing it. Let me play you this year's video from Zach: "Things have changed in the data engineering space. The Spark community has gotten a lot better about supporting Python, so if you're using Spark, the differences between PySpark and Scala Spark in Spark 3... there really isn't very much difference at all." So thank you for the endorsement, Zach. If your impression of Spark was that Spark is written natively in Scala, that's still true, and we love Scala. But if your impression was that using Python means crazy JVM stack traces, terrible error messages, and an unidiomatic API, try it out again: it looks completely different from three years ago. Of course the job is never done and we'll continue improving Python support in Spark, but I think it's fair to declare that Python is a first-class language of Spark.

Now let's talk about the other two problems: version upgrades, dependency management, and the JVM-only nature of Spark. Let me dive a little into why these problems exist. The way Spark is designed, all the Spark applications you write, your ETL pipelines, your data science analyses, your notebook logic, run in a single monolithic process called the driver, which also includes all the core server side of Spark. The applications don't run independently on their own clients or servers; they run inside the same monolithic server process, and that is really the essence of the problem. Because they all run in the same process, the applications have to share the same dependencies, both with each other and with Spark itself. Debugging is difficult, because to attach a debugger you have to attach to the very process that runs everything. And last but not least, if you want to upgrade Spark, you have to upgrade the server and every single application running on it in one shot; it's all or nothing, which is very difficult when everything is so tightly coupled.

So two years ago, at this very conference, Matei and I introduced Spark Connect. The idea is very simple at a high level: take the DataFrame and SQL API of Spark, which is Python- and Scala-centric, and create a language-agnostic binding for it based on gRPC and Apache Arrow. It sounds like a small change, because it's just a new language binding and a language-agnostic API, but it's really the largest architectural change to Spark since the introduction of the DataFrame APIs themselves. With this language-agnostic API, everything else runs as clients connecting to it; we're breaking the monolith down into something you can think of as microservices running everywhere. How does that affect end-to-end applications? Different applications now run as clients connecting to the server, each in its own isolated environment. That makes upgrades much easier, because the bindings are designed to be language-agnostic and backward compatible from an API perspective: you can upgrade the Spark server side, say from Spark 3.5 to Spark 4.0, without upgrading any of the individual applications, and then upgrade the applications one by one, at your own pace.
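Mechanically, the client side of Spark Connect is small: install the thin client (pip install "pyspark[connect]") and point the session builder at a remote endpoint instead of a local master. The host name below is a placeholder; 15002 is the default Spark Connect port.

    # Connecting to a remote Spark server over Spark Connect. DataFrame code is
    # unchanged; it is translated into the language-agnostic gRPC/Arrow protocol
    # and executed on the server, so client and server can be upgraded independently.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .remote("sc://spark-connect.example.com:15002")   # instead of .master(...)
        .getOrCreate()
    )

    spark.range(1_000_000).selectExpr("sum(id) AS total").show()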
The same goes for debuggability: you can now attach a debugger to an individual application, which runs in its own separate process wherever you want, without impacting the server or the rest of the applications. And for all the language developers out there, this language-agnostic API makes it substantially easier to build new language bindings. Just in the last few months we've seen community projects building Go bindings, Rust bindings, and C# bindings, all of it built entirely outside the project, with their own release cadence.

Two of the most popular programming languages for data science are R and Python. Spark has built-in Python support, and it also has built-in R support, SparkR, but the most popular R library for Spark is actually not the built-in SparkR; it's a separate project called sparklyr, made by a company called Posit. I was talking to the Posit folks backstage and told them I think Posit is the coolest open source company this audience has never heard of, and the reason you haven't heard of them is that they renamed themselves fairly recently to Posit. The people at Posit created some of the most foundational open source projects: dplyr, the very project that defined the grammar for data frames we're all enjoying today; ggplot2, the grammar of graphics for visualization; RStudio, the most popular R IDE. Wes McKinney, who created pandas and co-created Apache Arrow, works at Posit. So I'd like to welcome Tareef, president of Posit, onto the stage to talk more about sparklyr. [Applause] [Music]

Good morning, everyone, and thank you for the very kind introduction. I'm very excited to be here, and thank you, Databricks, for giving us the opportunity to speak to this audience. As a company we're probably somebody you don't know; you'd never heard of us until that little update. We are a public benefit corporation, we've been around for about fifteen years, and our focus is very much code-first data science. Our governance structure allows us to think about things for the very long term, so our ambition is to be around for the long haul and to continue investing in these open source tools. We maintain hundreds of R packages and the RStudio IDE, and if you've been watching us for a while, you may have noticed that over the last five years we've added a lot of capabilities for the Python ecosystem as well; in some cases these are multilingual solutions, things like Quarto, Shiny for Python, and Great Tables, with more coming over the next few years.

In 2016 we released a package called sparklyr, because we wanted an idiomatic implementation for R users that is more aligned with the tidyverse. For those of you who don't know, the tidyverse is a philosophy of how you write packages and the patterns that go along with it. The original design of Spark meant that for users in corporations in particular to use it, they had to run RStudio and R on the servers themselves. So you can imagine that when Spark Connect became available last year, we were very
excited, because it finally solved one of the key problems we saw: how do you make it so the end user, through a client, doesn't have to get into a JVM at all and can just access Spark directly? I'm happy to say we started on this last year, and by the end of the year we had support for Spark Connect, and for Unity Catalog; we worked with the Databricks team to make sure sparklyr and the IDE support it cleanly. One of the most interesting things is that we added support for R user-defined functions, which is actually a really big deal, because now the R users in your organizations can participate in using Spark to solve the really hard problems and collaborate with other people in the Spark ecosystem. We're very excited about that, and we're interested in your feedback if you get a chance to try it out.

Now, this demo is very anticlimactic; those of you who were there yesterday saw Casey, and the world stopped, so we decided to make life easy, because some of these things are hard to demo live. This is the open source desktop RStudio IDE, and you can see it's a one-line change to connect to Spark Connect; now this user on the desktop can go ahead and access the Spark cluster and leverage its full capabilities. This is one of the key things we think makes a big difference in people's ability to contribute to and adopt Spark. You've probably noticed that over the last year we've been announcing all kinds of things with Databricks, sparklyr and Spark Connect support obviously, but we've also been making changes to our commercial products. The first commercial product supporting this is Posit Workbench, which gives you a server-based authoring environment supporting RStudio, Jupyter Notebook, JupyterLab, and VS Code, and ties into the authentication and authorization of your systems, so you get the full power of the governance you have in Databricks surfaced to your data scientists. You can expect that over the coming year more commercial products and open source tools will have those tighter integrations with the Databricks stack. If you're at all curious or interested, feel free to check out any of these links to learn more about how we're working with Spark, Spark Connect, and Databricks. Thank you very much. [Applause] [Music]

All right, thank you, Tareef. The reason I'm so excited about Spark Connect is that it makes frameworks like sparklyr possible: easy to use, easy to adopt, easy to upgrade, easy to build. That ultimately benefits all the developers, data scientists, and data engineers out there, because now they can use whatever language they're most comfortable with, and it doesn't require all of those languages to be built into Spark; you get idiomatic R on Spark. With Spark Connect we're really trying to solve those last two problems: version upgrades and dependency management, and making it easier to build non-JVM language bindings. And that brings us to Spark 4.0. This is not a conference where we announce Spark 4.0; it's an upstream open source project working at its own pace, but it is coming later this year.
To give you a preview of some of the features: just like previous major releases of Apache Spark, there will be thousands of features, and I can't possibly go into all of them today. Spark Connect will GA and become the standard in Spark 4; ANSI SQL mode will become the default in Spark 4; and there are a lot of other features we're looking forward to. One thing I'm particularly excited about, especially at this conference, is the opportunity for different open source communities to collaborate with each other, particularly around compute and storage. Many features require co-designing the compute stack, which is where Apache Spark comes in, and the storage stack, which is where Linux Foundation Delta Lake and Apache Iceberg come in. As a matter of fact, many of the features you've heard about at this conference, in session talks and keynotes, such as collations, row tracking, merge performance, the variant data type that Sean told you about, and type widening, are not just features of Delta, or Iceberg, or Spark; they require thinking across all three projects for them to work. That is really the spirit of open source and of collaboration in open source. And last week, even though Spark 4.0 is not officially released yet, the Apache Spark community released the Spark 4.0 preview. It's not the final release, but it gives you a glimpse into what Spark 4 will look like. Please go to the website, check it out, download it, give it a spin, and let us know your feedback. Thank you very much. [Applause] [Music]

Awesome, super excited about Spark 4.0. I've got to say, you should check it out: PySpark is amazing these days, and all the version management and installation of Spark is just so much simpler. I tried it a week ago; you can go to any terminal and just say pip install pyspark, and that's it, it installs the whole thing and it just works. It's hugely different from, say, ten years ago, when you had to set up the servers and the daemons, configure everything, and use it in local mode. Now it's just pip install pyspark.
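That installation point is easy to try. A rough local quickstart, assuming Python and a Java runtime are already on the machine, is just the pip install followed by a few lines of PySpark:

    # After `pip install pyspark` (a local Java runtime is still required),
    # a throwaway local session is a few lines; no servers or daemons to set up.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")          # run Spark inside this Python process
        .appName("quickstart")
        .getOrCreate()
    )

    df = spark.createDataFrame([("cookies", 12), ("brownies", 7)], ["product", "orders"])
    df.groupBy("product").sum("orders").show()

    spark.stop()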
Okay, so back to our data intelligence platform roadmap. We're now reaching the end, but this is the most exciting part for me. A couple of years ago we asked the top CIOs who use Databricks what the number one thing they wanted us to do was, and it was a really surprising answer, something we didn't expect them to say, and since then we've been super focused on nailing this problem. So I'm very, very excited to welcome on stage Bilal Aslam, who's going to take us through what we've done. [Music]

All right, good morning. As it turns out, there are five Bilals at Databricks; I asked all five to give me a little cheer, but that sounded like more than five, so thank you. And thank you, Ali, for the introduction. We've heard about machine learning, we've heard about BI, we've heard about analytics and all these amazing things, and I'm here to tell you that every single one of them starts with good data. How do you get to good data? There are three steps to follow, and every single one of us, including me, is traditionally cobbling together lots of different tools into an ever-increasing tool chain that gets more and more expensive and more and more complex. Let's go through that real quick.

Spark, and especially Databricks, is already very good at big data; as Reynold was telling you, this is the world's biggest big data processing framework. But it turns out that a lot of your really valuable data is in smaller systems: MySQL, Oracle, Postgres, all these different databases that are incredibly valuable. So you might be setting up Debezium and Kafka, plus a monitoring stack and a cost-management stack, just to get the changes from those systems into Databricks. And I'm pretty confident that every single one of us is using a CRM of some kind, maybe Salesforce or NetSuite, or an HRMS like Workday: tons of valuable data in there, just waiting to get into Databricks so you can start using it. Then, once your data is in a data platform like Databricks, the very next step is to transform it. Newly ingested data is almost never ready for use by the business: you have to filter it, aggregate it, join it, and clean it. Lots of technology choices: dbt, a great open source project you might have heard of; Delta Live Tables; PySpark, which Reynold just told you how popular it is. Which one do you use, and again, how do you monitor it, how do you maintain it? And once your data is transformed, that's really not even half the battle: you get the value out of data by actually running your pipelines in production, and I don't like waking up at two in the morning to an alert. So now you have to orchestrate. Maybe you're using Airflow; great, but now your tool chain just expanded a little more, and you're responsible for managing Airflow and its authentication stack and so forth. And then, of course, you might have to monitor all of these things in CloudWatch. This is unnecessarily complex, it's inefficient, and it's actually very expensive, which is why I am extremely proud to unveil what we're calling Databricks LakeFlow. [Applause]

Thank you. This is a new product built on the foundation of Databricks Workflows and Delta Live Tables, with a little bit of magic sauce added on, and I'm actually going to start with the magic sauce. It gives you one simple, intelligent product for ingestion, transformation, and orchestration. All right, let's get into it. The first of the three components is something we call LakeFlow Connect. LakeFlow Connect is native to the lakehouse: native, high-performance, simple connectors for all these different enterprise applications and databases. If you're in the audience today using SQL Server, Postgres, a legacy data warehouse, or these enterprise applications, we're on a mission to make it really simple to get all of that data into Databricks. This is powered by Arcion technology, from a company we acquired last year. I'll give you a quick demo in a moment, but first I want to talk about one of our customers, Insulet. Insulet manufactures a very innovative insulin management system called the Omnipod, and they had a lot of customer support data locked up in Salesforce. They're one of our LakeFlow Connect customers, and insights that used to take them days to get now take minutes. It's super
exciting. All right, actions speak louder than words, so let's take a look. I'm in LakeFlow here, and I'm going to click on Ingest. You can see it's point-and-click, which is pretty awesome, and it's designed for everybody. I click on Salesforce, and my friend Eric has already set up a connection. By the way, everything in LakeFlow is governed by Unity Catalog and secured by Unity, so you can manage and govern it very easily, and there are three steps. Okay, great: now I see these objects from Salesforce, and I'm going to choose Orders. I actually work for Casey, I don't know if you remember her cookie company from yesterday; I'm building the data pipeline and she's my CEO, so I'm bringing in some order information for our ever-growing cookie business into this catalog and schema. Hang on a second... there we go, and within seconds the data shows up in our lakehouse. Excellent. That's it, that's all it took; there are no more steps. Let's get back to the slides.

I want to give you a peek behind the curtain here, since we're all engineers, because there's actually something pretty magical happening inside LakeFlow Connect. You might think: how hard could it be to connect to these APIs and these databases, can't you just run a SQL query? It turns out that what you actually want is to obtain only the new data, the change data capture from these source systems, from these databases and enterprise applications, and that is a really, really hard problem. You don't want to tip over the source database, you don't want to exhaust API limits, the data has to arrive in order and be applied in order, things go wrong, it's the real world, you're coupling systems together, and you have to be able to recover. All of this is undifferentiated heavy lifting, and I'm really glad we're doing it, because with Arcion's technology CDC is no longer something to dread: it's point-and-click, it's automated operations, and it's consistent and reliable. Super exciting. [Applause]

All right, let's go to the second component of this product. Once you've brought in data from these databases and enterprise applications, the very next thing you have to do is transform it, which is to prepare it: remember, you have to filter, aggregate, join, and clean it. Typically this involves writing really complicated programs that have to deal with a lot of error conditions, and I'll show you that in a moment. The magic trick behind LakeFlow Pipelines, because it's built on and is the evolution of Delta Live Tables, is that it lets you write plain SQL to express both batch and streaming, and we turn that into an incremental, efficient, and cost-effective pipeline. Okay, so let's go back. Remember, I just did some ingestion of data, and what I'm going to show you is that I've also pulled in data from SQL Server; I won't show you that flow. So I have data from Salesforce, I have data from SQL Server, and I now need to create a little bit of an aggregation out of that. Let me show you how simple that is within LakeFlow. One of my favorite features here, by the way, is that it's one single unified canvas, so this little DAG at the bottom is always there; you can hide it if you want. I'm going to click here on the Salesforce table and write a transformation.
That's simple. Now, this is an intelligent application; it's built on the Data Intelligence Platform, so I might just go ahead and ask the Assistant what it thinks I should join. Okay, it comes up with a pretty reasonable join: it says you can join these tables, and I'm just going to let it figure out how to join them and figure out the key for me, which is pretty awesome. Okay, that looks about right, it found the customer ID key, so I'm going to go ahead and accept that and run this transformation real quick. I don't have to deploy it; I can run it in development, which actually gives me the ability to debug it right away. Okay, perfect: I can see that orders, dates, products, and customers all came together really nicely, and I have a nice little sample of the data. Perfect, so we can go back to the slides now. Thank you; that was one of the Bilals.

Okay, great. Again, let me give you a little peek behind the curtain here: why is this pretty amazing? Notice that there was no cluster, no compute; I didn't have to set up infrastructure, I didn't have to write a long script, I just wrote SQL. This is the power of declarative transformations. This transformation is the valuable part, and without LakeFlow Pipelines you'd also have to do table management, incrementalization, many times over, and even deal with schema evolution. I've spoken with some of our customers and they've written entire frameworks just to do schema evolution and schema management. Again, that's undifferentiated heavy lifting; why should you spend time on that? And this beast just grows and grows. LakeFlow Pipelines are powered by something called materialized views, and they're magical because they automatically handle the bookkeeping for you: they handle schema evolution, they handle failures, retries, and backfills, and they magically choose the right way to incrementalize.
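Since LakeFlow Pipelines is described as the evolution of Delta Live Tables, the existing DLT Python API gives a reasonable feel for the declarative style being demonstrated: you state what each table should contain and the engine owns orchestration, retries, and incremental refresh. A sketch; the dlt module only exists inside a Databricks pipeline run, and the table and column names are illustrative:

    # Declarative pipeline sketch in the existing Delta Live Tables Python API,
    # the foundation LakeFlow Pipelines evolves. Runs only inside a Databricks
    # pipeline (which provides the `dlt` module); names are illustrative.
    import dlt
    from pyspark.sql import functions as F

    @dlt.table(comment="Salesforce orders joined with SQL Server customer records")
    def orders_enriched():
        orders = dlt.read("salesforce_orders")        # ingested by LakeFlow Connect
        customers = dlt.read("sqlserver_customers")
        return orders.join(customers, orders.customer_id == customers.id, "left")

    @dlt.table(comment="Daily revenue, kept fresh incrementally by the engine")
    def daily_revenue():
        return (
            dlt.read("orders_enriched")
            .groupBy(F.to_date("order_ts").alias("order_date"))
            .agg(F.sum("amount").alias("revenue"))
        )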
So, in my world, my cookie CEO is really demanding: our e-commerce website is just taking off, and now we need to be able to take real-time actions on it. So from this pipeline, which looks like batch, I'm going to add some streaming: I'm going to write the joined and enriched records into Kafka. Let me show you how easy that is, and let's take a look. Remember, this is my pipeline, and I just built that materialized view transformation; from the same unified canvas I add a pipeline step, go live here, and write a little code. I create something called a sink, and you can think of a sink as a destination; I'll call it kafka, because I'm writing to Kafka, and all I'm doing is writing SQL and pointing it at my Kafka broker. That's enough to create the sink, and all the credentials come through Unity Catalog, so this is again governed. Then I create something called a flow; think of it as an edge that writes changes into Kafka. I target the kafka sink and select from the sales table I just created, using the table_changes table-valued function. Okay, something wasn't right there for a second... great, this looks good. And remember, this is what looked like a batch pipeline, and I'm turning it into streaming. There we go, and just like that, our data is in Kafka. Let's go back to the slides. [Applause]

One more super exciting thing: we're doing something called real-time mode for streaming. Real-time mode makes streaming go really, really fast, and the magic trick is that it's not fast just once or twice, it's consistently fast. So if you have an operational streaming use case where you have to deliver data and insights, just turn it on and the pipeline will go really fast. We have talks about it, Ryan is doing a session on it, so please go check it out.
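The sink-and-flow syntax in the demo was brand new at the time, so rather than guess at it, here is the same idea, streaming enriched change records into Kafka, written in standard Structured Streaming, which is the engine underneath. The broker address, topic, checkpoint path, and table name are placeholders:

    # The demo's Kafka step expressed in standard Structured Streaming rather
    # than the just-announced LakeFlow sink/flow syntax. Reads the Delta change
    # feed (similar in spirit to table_changes()) and writes JSON records to Kafka.
    # Broker, topic, checkpoint path, and table name are placeholders.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    changes = (
        spark.readStream
        .option("readChangeFeed", "true")      # table must have change data feed enabled
        .table("main.sales.orders_enriched")
    )

    query = (
        changes
        .select(
            F.col("order_id").cast("string").alias("key"),
            F.to_json(F.struct("*")).alias("value"),
        )
        .writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker.example.com:9092")
        .option("topic", "enriched_orders")
        .option("checkpointLocation", "/tmp/checkpoints/enriched_orders")
        .start()
    )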
One of the coolest things we're doing is something called real-time mode for streaming. You can think of it this way: real-time mode makes streaming go really, really fast, and the magic trick is that it's not fast once or twice, it's consistently fast. So if you have an operational streaming use case where you have to deliver data and insights, just turn it on and this pipeline will go really, really fast. We have talks about it; Ryan is doing a talk on it, so please go check it out.

Perfect. So now I've ingested data from SQL Server and Salesforce, and I've very quickly built a pipeline that delivers batch and streaming results that are always fresh, and I didn't have to do manual orchestration. But my CEO is very demanding: the cookie business continues to grow, and Casey wants insights. She wants a dashboard that she can use to figure out how her business is doing, and this is where orchestration comes in. Orchestration is really about how I do all the other things that are not loading data and transforming data, such as building a dashboard and running it, or refreshing a dashboard. One of my favorite capabilities in Databricks is something called Databricks Workflows, and we've evolved it into the next generation. Workflows is a complete, drop-in orchestrator; no need to use Airflow. Airflow is great, but Workflows is completely included in Databricks. This is just a list of innovations; it has lots of capabilities that you might be used to in traditional orchestrators. So what I'm going to do now is walk over and start building a dashboard, and I'm going to run it after my pipeline is done. Let's take a look.

Okay, so remember, I have data going into Kafka, I have all of this, and I'm going to just add another step. I love this unified canvas; it gives me a really nice context on where I am. And this is super cool: the assistant just suggests a dashboard. That's pretty cool and actually useful: revenue and product insights. I like that; that's what I would have wanted. Let me hide that a little bit, and there it is, that's our dashboard. So hey, good news: our cookie business continues to grow. And this is super cool: we actually have a really interesting insight here that sugar cookies tend to sell in the month of December. So that's it, you don't have to do anything else. Let's get back to slides. [Applause]

So I'm going to wrap up really quickly. I'm super excited about one innovation that I think will make our lives as data teams and data engineers much, much better. Look, it's great to create DAGs, things that run one after another, and it's great to have schedules for when something should run. But as your organization grows, what you really want are triggers. Think of triggers as work happening when new data arrives or data is updated. This is actually what allows us to do another magic trick, which is running your pipeline exactly when it's needed: when upstream tables are changed, when upstream files are ready. This is super cool, it's completely available in the product, and it's actually a foundational building block of Lakeflow Jobs.

Perfect. So now everything is running: I've ingested data, I've transformed it, I've built a dashboard, and my pipeline is running in production. Like I said, I hate waking up in the middle of the night, and typically I have to glue together a lot of different tools to see cost and performance. Lakeflow includes unified monitoring for data health, data freshness, cost, and runs; you can debug to your heart's content, but it has that single pane of glass so you don't have to if you don't want to. Lakeflow is built on the Databricks Data Intelligence Platform, it's native to it, and this gives us a bunch of superpowers. You get full lineage through Unity Catalog, and that includes ingested data, so all the data upstream from Salesforce or Workday or MySQL, we've already captured the lineage. It includes federated data, it includes dashboards, even ML models, with not a single line of code needed.
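As an aside on what full lineage through Unity Catalog can look like in practice, lineage events are queryable from system tables; a minimal sketch, assuming the workspace has the system.access.table_lineage system table enabled and using an illustrative three-level table name:

    -- What reads from and writes to the enriched sales table?
    -- The target table name below is a hypothetical example.
    SELECT
      source_table_full_name,
      target_table_full_name,
      entity_type,
      event_time
    FROM system.access.table_lineage
    WHERE target_table_full_name = 'main.sales.sales_enriched'
    ORDER BY event_time DESC
    LIMIT 20;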
Lakeflow is built on top of serverless compute, which frees you up from managing clusters and instances: how many executors, what type of instance should I use? It's serverless, it's secure, and it's completely connected to Unity Catalog, so it frees you up from that hassle. What's also really cool is that we did a benchmark, and this is real data for streaming ingest: it's three and a half times faster and 30% more cost effective. So that's, you know, have your cake and eat it too; it's super exciting. [Applause]

Data intelligence is not just a buzzword. As you have seen over the last couple of days, it's foundational to Databricks, and it's also foundational to Lakeflow. Lakeflow includes a complete integration with Databricks IQ and the assistant, so every time you're writing code, every time you're building a DAG, every time you're ingesting data, we're here to help you author, monitor, and diagnose. And one last thing: this is actually an evolution of our existing products, so you can confidently keep using Delta Live Tables and Workflows. We'll make sure that everything is backwards compatible; all your jobs and pipelines will continue to work, and you can start enjoying Lakeflow. So Lakeflow is here. Elise and Peter are doing a talk on Lakeflow Connect, I think very soon; Lakeflow Connect is in preview, so please join us and give us feedback on what connectors you want. We're very excited about it. Pipelines and Jobs are coming soon. All right, I think that's it. Thank you. [Applause] [Music]

Awesome. All right, that was awesome; Bilal, sounds like a king. That was super, super awesome. What I really loved about it, and I don't know if you noticed, is that this is actually a big deal: Spark has a micro-batch architecture, so when you're trying to stream things it takes a couple of seconds, sometimes five or six seconds. What he showed you is real-time mode. Now that we have real-time mode, it gets that down to 10 to 20 milliseconds, so it's like a 100x improvement; the P99 latency is around 100 milliseconds, so it's kind of a game changer. And then of course we saw Connect: you can get your data in there, you can do incrementalization, and you don't need to worry about getting the logic right; it'll just do it for you. So, super exciting.

Okay, awesome. So I just want to wrap this up quickly. On the top row there you see the announcements from yesterday; I'm not going to bore you and go through those again. On the bottom row you can see what we did today. We just heard about data engineering, and you saw Unity Catalog open sourced live on stage by Matei, which was super cool. But also metrics: I'm excited about metrics. Every company has KPIs, so how do we get certified KPIs that we can rely on, that we know are semantically correct, and that we know how to compute? That's also a big deal. And then we heard about Delta Lake 4.0 and Project UniForm going GA. So lots and lots of great stuff, and that's it for today. Hope you enjoy your lunch, and then please go to the sessions; they're super, super awesome. Thank you, everyone. Thanks.
Info
Channel: Databricks
Views: 14,536
Keywords: Databricks
Id: uB0n4IZmS34
Length: 135min 37sec (8137 seconds)
Published: Fri Jun 14 2024