Day 1 Morning Keynote | Data + AI Summit 2022

Captions
The world has changed. You don't need data to know that the world has changed. Some changes are for the better, some not so much, but it's definitely a new world out there. Less of the same old, more once-in-a-century. Less certainty, more unpredictability. Less margin for error, more urgency to get it right. There's so much riding on our shoulders. It's not like the fate of the world rests in our hands; actually, it kind of is. The answers are out there. All you need to do is start your journey. No one knows what's coming next or where all this is going. [Music] But if anyone can predict the future, it's us. Wherever you look, the challenges are daunting: from forecasting the next hurricane to forecasting this year's hottest fashion trend, from solving supply chain disruptions to solving what to watch tonight, from building sustainable solutions to building his college fund. They all have one thing in common: you. Because data and AI are essential to solving all of them, and that puts you at the center. But you can't do it alone. It takes a team. It takes a community. We don't all come from the same place, but we're all heading in the same direction, side by side, toward the same destination: lakehouse. [Music] In hindsight, the future was always clear. This is where we were always heading: to a place that's open to everyone, built for the way you work, so we can not only predict the future but shape it, together. [Music]

Please welcome Ali Ghodsi, co-founder and CEO, Databricks. [Music]

Hey everyone, welcome! Super excited to be here. This is awesome, finally in person; the last time we could do this in person was three years ago. I also want to thank everyone who's watching online. We have about 5,000 people here at Moscone Center, fully sold out, and about 60,000 people checking in virtually. In the last two years this community of open source projects has kept growing tremendously, and it's really awesome to see. I'm really looking forward to an awesome event this week. We have over 160 countries represented at this event, so we have people from all over the world calling in and watching these videos: developers, data scientists, analytics engineers, software engineers, thought leaders, industry leaders. It's a true community event where we get together, learn from each other, and move our industry forward.

Before I move on with our program, I want to take a minute to thank our sponsors. Without the companies that sponsored this, the event would not have been possible, so let's give them a round of applause. I also want to congratulate everyone who took training here. We actually trained 30,000 people at this event, and 120 of them are getting certified, which is a really special thing. This is how we grow the community and get more and more people learning these technologies and growing this open source community, so we're super excited about that.

What I'm here to talk about today is how data, analytics, and AI are disrupting whole industries. The companies you see on this slide, the FAANGs, all use data, analytics, and AI in a really strategic way. They used it in every use case, in every department, and that way they were able to completely change the world. If you think about it, a company like Google wouldn't even be around today if it wasn't for AI; we'd be using AltaVista. Twitter wouldn't even work if it wasn't for how it picks those tweets and shows you what you want to see. So these companies did it.
But nowadays more and more large enterprises are able to do the same thing. One of my favorite examples is AT&T. AT&T takes data that previously sat in a data warehouse, the subscriber data, and joins it with real-time event data coming in: someone wants to change their phone number, or they sign up for something. They can join these in a lakehouse and, in real time, provide 182 million subscribers with fraud alerts, so in real time you'll know that someone is trying to hack your account. That would not have been possible without data, analytics, and AI.

So how are these companies doing it? What is it they've cracked the code on? We really think this data and AI maturity curve explains what they were able to do. On the x-axis you see how mature an organization's data and AI is; on the y-axis you see how much value they're getting out of it, how competitive it makes them. On the left-hand side it starts with just getting the raw data, cleaning it, starting to ask questions about it, making reports, running some queries. As you move to the right-hand side, that's when you start using predictive technologies, when you start using that crystal ball and asking questions about the future: what's my revenue going to be next quarter, which of my customers are going to churn, which product do I think you want. That's when you really get competitive advantage out of data and AI, and that's when it becomes really disruptive, at the far right there.

Today, unfortunately, most organizations aren't on the right-hand side of this curve. They're somewhere on that spectrum, and they're struggling. The reason they're struggling is that there's a big technology divide in this data and AI maturity curve. On the left-hand side, as you get started, you store your data and start asking some basic questions. The simplest way to do that, the best stack in the past, has been to put your data in a data warehouse, plug in a BI tool like Tableau or Power BI, and get your dashboards. That was the easiest way. But as you move to the right-hand side and want to ask about the future, that stack suddenly doesn't work. You have to completely redo it with a different technology stack: you store your data in a data lake, you use AI technologies on top of it, and you hire a different type of person, data scientists, whereas on the left-hand side you hire analytics engineers or analytics folks. So there's this big divide between the two.

This divide is pretty bad, because what it actually looks like in most organizations I see is that all of their data first lands in what we call a data lake. On the right-hand side you see all the raw data coming in; it could be logs, things their software produces, images, audio, text. But to do anything on the left-hand side, you have to copy it over to the data warehouse, and then you can plug in your BI tool. And if you want to do any real-time streaming or machine learning, you have to access it directly in the data lake in the raw file format. This is really problematic, and it's really slowing down organizations, because you now have two copies of your data: it's duplicative and siloed.
If new data arrives in the data lake, it's not going to be reflected in the warehouse, and if you update data in the warehouse, it's not going to be reflected in the lake. So this whole industry of DataOps has been created, and you have to hire all these folks whose job is just to reconcile the different copies of data lying around. That's problem number one.

Problem number two is securing the data and making sure it's private, which is really hard because you now have two incompatible models: on one side you have files, on the other side you have tables and columns. If you configure one wrong, you might prevent someone from getting access to data they should have access to, or even worse, someone might get access to data they should not have access to. So this divide is a big problem on the security side. And then of course there are the use cases: you can't really run the dashboards on the lake, because it won't perform well, and you can't really run the data science and real-time applications on the warehouse; you have to pull the data out and copy it. That's what's happening today.

But what if we could unify these two and have just one repository for all of our data? We store all our data in the lake, but what if we cracked the code and figured out how to make it really reliable and really fast? That's the first layer. What if, on top of that, we could also figure out one security model and one governance approach for all of that data, whether it's files, tables, machine learning models, dashboards, whatever it is? And what if we could do all these use cases in one place on top of this data, so whether you're doing real-time streaming, machine learning, analytics, or BI, you just do it in this one stack? This is what we call the lakehouse paradigm, and that's what we're going to be talking about a lot this week. In the lakehouse, you have one unified, simple platform for all of these use cases.

At Databricks, the way we accomplish this is with the following technology stack. We have Delta Lake, which is the core technology that makes this happen; Delta Lake is really the secret sauce that lets you take all that data and make it reliable and fast, and we're going to hear a lot about that from the creator of Delta, Michael Armbrust, here today. On top of that, Unity Catalog is the technology we use to govern and secure all the data assets in your lakehouse, and there are going to be lots of talks on that as well. And then on top of it we have all the technologies: our data warehouse, Databricks SQL; our workflows; our streaming engine, Structured Streaming; and our end-to-end machine learning platform. All of this can be done in this tech stack, and that's what we call the Databricks Lakehouse Platform.

This week we have a lot of exciting announcements on this lakehouse. On the governance layer we have lots of talks, so it's going to be really exciting to hear what's happening on the Unity Catalog side. On the data warehousing side, I'm going to share some benchmark results that we haven't published before, and we'll also tell you how we implemented it; there's going to be a talk on exactly how we made it so fast. On the data engineering side we're going to hear from Michael Armbrust on Delta, and there's a really exciting announcement there.
On streaming, we're going to tell you about a new effort we've started that specifically targets streaming workloads, and then of course in data science and machine learning we're going to hear about how we can integrate all of this into the platform. So it's going to be really exciting.

Why do people want this lakehouse? We spent a lot of time asking customers what makes them excited about the lakehouse approach, and three themes came up over and over again. Number one, it's simpler: I don't need lots of different data repositories, moving things around and making sure they're consistent, and I don't need lots of infighting in my organization between data scientists and DBAs and others who can't agree on the technology stack. It just makes things simpler. Number two, eighty percent of our customers say they're on more than one cloud, and that number keeps increasing every year. They don't want to build this again and again on different clouds; it's hard enough building it on one cloud, let alone making it work on another with a different set of technologies. They want one multi-cloud approach that just works across all of them. And the third thing we found really matters to most organizations is that they want it to be open, with open source and open standards, because they don't want to get locked into one vendor. They've done that before. Most CIOs I talk to say, we don't want to be in this game of spending ten years trying to get rid of legacy software; we don't want to be locked in.

This openness is really important when it comes to the ecosystem. We're going to have lots of talks this week by other vendors and other open source projects, and they can really integrate with this lakehouse approach because all the core elements are open source. You can use Fivetran for ingestion and connect it to Databricks, or use dbt to build your data pipelines, or use Immuta or Privacera or Collibra for your governance. You can use Tableau, Power BI, or Looker for your dashboards, and we have lots of partners for machine learning and data science as well. And then of course there are all the SIs we work with who can help build solutions for organizations that need help. This open ecosystem is really important; it helps people get educated with classes, books, and MOOCs, and that's what's creating this movement.

We've also now started focusing on verticals, on different industries and the use cases in each of them. We launched Lakehouse for Healthcare and Life Sciences, and lakehouses for Retail, Media and Entertainment, and Fintech. Each of these lakehouses comes with templates for how to solve the problem if you haven't done it before: here's a template for how to do risk analysis in finance, and you can get started with it immediately. We found this really helped many organizations.

So this is great, the lakehouse is awesome, and we're super excited to see lots of other people taking notice of it as well. Across the industry, lots of folks are talking about the lakehouse. Here's Microsoft talking about building lakehouses on top of Microsoft Azure, and here's Google talking about open lakehouses on Google Cloud.
You can see Google recently has an offering that helps you build these lakehouses, and AWS has been talking about this for a while as well; they just spell "lake house" with a space in the middle. We also have Oracle talking about it now (yes, these vendors are still around), and even classic data warehouses are saying they're a lakehouse. So everybody has a lakehouse. If you google "data lakehouse" you'll see lots of ads bought by lots of vendors, all wanting to tell you what the lakehouse is.

The fact that there are so many of them is actually awesome. I really do think this is how people will want to build data and AI, this is how you move to the right-hand side of that curve, and I really think the future of data is AI, and the way you get there is with the lakehouse. But when so many vendors say "hey, we have this too," it becomes a little confusing. How do they stack up? How do they compare? So I'm going to share benchmark results on how they actually compare.

What I'm going to show you we haven't published before. It's called the lakehouse benchmark, and in it we did the following: we took all these different vendors that claim to have a lakehouse and we used them as a lakehouse. What does that mean? It means we stored the data on a data lake in an open source format, Parquet, and then we ran a data warehousing benchmark on that data. Just to be clear, we did not use them as a data warehouse; we did not ingest the data into a data warehouse. We stored it on the lake itself and used them as a lakehouse.

Here's what you'll see in the results. On the left-hand side you have the Databricks warehouse engine as part of the lakehouse, and then you have cloud data warehouse one, cloud data warehouse two, cloud data warehouse three, and the fourth one is a multi-cloud data warehouse. What you can see is that we could do this for eight dollars, whereas vendor four is 30 times more expensive; there's a 30x price differential when you try to do this as a lakehouse with those other approaches. You might say, hey, how did you do this benchmark, it's TPC-DS at 3 terabytes, tell me exactly what you did, maybe you got the numbers wrong, maybe this 800-pound gorilla is actually 750 pounds, or 600 pounds. It's still a huge gorilla in the room; it's still a giant difference between these, and we didn't get it 30x wrong, I hope. That means this actually matters, and we're seeing more and more organizations say: if I can do this 30x cheaper, and I can also do machine learning and real-time streaming, and I'm not getting locked in, maybe I should move to this. We're seeing that movement in the numbers, with people migrating over and using this instead. It really matters to people, especially now that there's more and more cost pressure on CIOs: they get less budget and have to do more with it, so they think, yeah, maybe I'll pick this.
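As a concrete illustration of what "using them as a lakehouse" means in the setup above (querying open Parquet files in place on cloud storage rather than ingesting them into a proprietary warehouse), here is a minimal PySpark sketch. The bucket path is a placeholder, and the columns come from the TPC-DS store_sales table; this is not the benchmark harness itself, just the access pattern it describes.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Register Parquet files that already sit on the data lake as an external table.
    # The S3 path is hypothetical; the data stays in the open Parquet format.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS store_sales
        USING PARQUET
        LOCATION 's3://my-lake/tpcds-3tb/store_sales/'
    """)

    # Run a warehouse-style aggregation directly against the lake data.
    spark.sql("""
        SELECT ss_store_sk, SUM(ss_net_paid) AS revenue
        FROM store_sales
        GROUP BY ss_store_sk
        ORDER BY revenue DESC
        LIMIT 10
    """).show()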
So that's the lakehouse benchmark. But you might be thinking, okay, this is kind of cheating, this is using these systems in an unintended way. What if you use them the way they were optimized? We did that as well. Again we used cloud data warehouse four, the blue one there, and here we compare them apples to apples. Here's what we're doing: we're loading the data in, so we're taking into account that you have to load the data into the system; we're also counting how much it costs when they optimize the tables, which turns out to be a significant cost (it's called auto-clustering); and then on top of that we run all 99 TPC-DS queries. You can see there's still a huge difference. So this is apples to apples: a data warehouse, with the data stored in the data warehouse, versus the lakehouse that Databricks provides on the left-hand side. There's a big difference between the two approaches, especially if you want to use the enterprise tier that most companies want. And if you're using a lakehouse it's actually even better than this, because if you already have your data on the lake you don't need the load portion; the dark green portion goes away, though you still have to pay for the dark blue, assuming your data is already in the lake.

So those are benchmarks. You could say, hey, they're synthetic, who cares about TPC-DS; all that matters is customers and what people are doing with this, tell us about use cases, anyone can cook up some numbers. So I want to show you three examples of customers doing this at scale.

The first one is a global media company you all know about. They took a data warehouse where they had subscriber data, and they joined it with the real-time events coming in from people watching their streaming videos online (it's a very popular streaming video provider). Using this, they were able to build personalized models, cut their data warehousing costs by 30 million dollars, and create 40 million dollars of accelerated revenue, because they kept people online by improving video quality. That's one company using the lakehouse.

The second one is a Fortune 10 retailer, a huge retailer we all know about. They combined the traditional supply chain data they had in the warehouse with real-time IoT sensor data coming in, and that way they were able to reduce food spoilage. They claim they can save 100 million dollars' worth of food spoilage with this approach, and they have ten times faster time to insight.

The final one is Atlassian, one of my favorite companies; they've spoken on this stage multiple times. They got rid of their data warehouses completely and adopted the lakehouse approach, and they use it for all kinds of things. They've democratized data inside their organization. One of my favorite use cases: if you go into Jira and start typing, it will predict what you're about to write, and that's all created in a lakehouse. This helped them lower their warehousing costs by 60 percent and deliver much faster insights to the organization.

So this is awesome, and that's why we're so excited. We want to democratize this; we want to bring this technology, this revolution, to everyone, and our approach is to do it with the lakehouse. We have a really exciting show for you this week, with a fantastic lineup: Spark, Delta Lake, and MLflow, with announcements around each of these projects.
We're going to hear about a new streaming project that we're launching, and you're going to hear about trends like the MDS (modern data stack) and the data mesh in the lakehouse. We have data-centric AI from Andrew Ng, Peter Norvig, and Hilary Mason, and we also have a session on how we can use AI in education to change people's lives. And we're going to have lots and lots of innovations that Databricks customers are going to talk about this week. It's going to be super exciting, so let's get started.

All right, that brings us to our first speaker this week. I'm super excited to welcome Reynold Xin to the stage. Reynold is Chief Architect at Databricks, he's also a co-founder, and he's the number-one committer on the Apache Spark project. Before he gets on stage, I want to share a funny story. He was a PhD student at UC Berkeley when I met him for the first time, and believe it or not, his PhD thesis was about how you can use SQL to ask questions of human beings. Using SQL to ask questions of humans: you can google it, it's called CrowdDB. So without further ado, I'm super excited to welcome Reynold to the stage. Reynold, come on up. [Music]

Thank you, Ali. Good morning. It's so nice to see all of you here in person, especially in this very room where, about a year ago, I got my first COVID-19 vaccine shot. Good memories. Today we have two exciting announcements representing some of the largest changes to the Spark project since the project's inception, but before I dive into those, I want to share some community updates.

The world has changed forever in the last two years. In the middle of the pandemic, if you're like me, you picked up cooking, baking, and playing with big data, and that's why it's no surprise that Apache Spark's growth has kept going up: it's now downloaded more than 45 million times a month on Maven and PyPI alone. That's 45 million times a month. This is actually the 13th year of the Apache Spark project's history, and it's really astonishing how much continued success the project is seeing. In 2020, about two years ago, the HackerRank developer survey indicated that for people with Apache Spark skills, wages went up by almost 30 percent between 2019 and 2020.
That survey was echoed by another survey from Stack Overflow just last week: the 2022 Stack Overflow developer survey showed Spark as one of the top-paying technologies, right near the top. So congratulations to all of you, especially the ones who received their certificates earlier today. The survey really indicates the demand in the labor market for Spark skills, and now you know what to talk to your manager about in your next one-on-one. I also know it's kind of annoying to see Ruby on Rails sitting right above Spark.

But it's not just the stats that are growing. At the ACM SIGMOD conference, which happened just two weeks ago in Philadelphia, Spark was awarded the ACM SIGMOD Systems Award. This is a very big deal, because SIGMOD is the most prestigious database community; it is actually where relational databases were born. The ACM SIGMOD Systems Award recognizes systems with significant impact on the theory and practice of large-scale data systems, and this is really a testament to Spark's adoption as well as its influence on the generation of systems to come.

But Spark is not just resting on its past glories. There are so many changes happening across the board, from Python to SQL to performance. Usually a decade into a project you see it start stagnating and not really changing that much, but there's so much change happening right now that I won't have time to go into the details; there are some examples on the screen here, and I encourage you to check out the deep-dive talk tomorrow at 2 p.m. to learn more about them.

For all of you here, you know that Spark is often associated with big compute: large clusters, thousands of machines, big applications. But the reality is that data applications don't just live in data centers anymore. They can be everywhere: in interactive environments like notebooks and IDEs, in web applications, even on edge devices such as Raspberry Pis, HomePods, and your iPhones. Probably a lot of applications could benefit from the power of data, but you don't see Spark being ubiquitous in all these environments. Why is that?

If you zoom in, you realize Spark has a monolithic driver. This monolithic driver runs both the user's application and all of the Spark code, which includes the analyzer, the optimizer, and the full-fledged distributed execution engine. This coupling of the user's application and the Spark code in one monolith makes it very difficult to embed Spark in any of these more remote environments; as a matter of fact, the opposite happens: you have to embed the application in Spark. Some people work around this by sending SQL strings from their application over to some sort of gateway that runs on Spark, but SQL doesn't capture the full expressiveness of Spark; it's a much more limited subset. It's great, but it doesn't run, for example, your machine learning applications. Other projects come up with custom protocols to send code over, like Jupyter notebooks with Jupyter kernels, but that's very difficult to debug, and for a lot of programming languages that simply don't have JVM interop, you can barely get it working. Now let's say you actually figured it all out, got some custom protocol going, and it's really working: you then run into a whole suite of multi-tenancy and operational issues, and the fundamental issue here is a lack of isolation.
One application that consumes too much memory and misbehaves might trigger an out-of-memory exception, and it doesn't just take down that application itself; it takes down every other application in the same driver. There are a lot of other problems with dependency conflicts. It's also very difficult to upgrade your application or upgrade Spark, because you have to upgrade all the applications and Spark at the same time. The same goes for debuggability, because you might not have network connectivity to the Spark cluster for security reasons, and for observability as well.

So with that, I'm super excited today to announce Spark Connect, which in my mind is the largest change to the project since its inception. Spark Connect enables applications anywhere to leverage the full power of Spark. What is Spark Connect? The idea is very simple: we want to create a decoupled architecture and a stable client-server protocol, so we can embed a thin client in any application. The client API is designed to be thin, so you can embed it anywhere you want, even on devices with very little computation power, and it's designed to be language agnostic, so you can use it from any programming language you want.

Let's take a look at how this works. We're creating a protocol between the client and the server. This protocol is language agnostic, it's quite simple, and it's stable; the prototype uses gRPC and protobuf. All the client needs to do is expose the DataFrame API in whatever native language you want, generate unresolved logical query plans encoded in protobuf, and send those unresolved plans over to the server, the Spark cluster, using gRPC, all in a language-agnostic way. The server executes the query plan using the standard query optimization and execution pipeline and sends the results back, again using gRPC and Arrow, to the client, which can interpret the result and do whatever it wants with it. This is similar to sending SQL strings over JDBC or ODBC or any of the client drivers you might use with a database, but it's so much more than just sending SQL, because you have the full power of Spark's DataFrame API.

With this protocol and these thin clients, we can embed Spark in all of these devices, including ones with very limited computation power, and such devices can drive and orchestrate programs while offloading the heavy-lifting execution to the cloud in the back end. This architectural change also mitigates a lot of the operational issues we just talked about. For example, an application that uses too much memory only OOMs its own client; it has no impact on any other application. Dependency conflicts are now isolated to the client itself. It's much easier to upgrade, because you can upgrade applications one at a time or upgrade the server; in fact, if we design it right, you could even upgrade the server without changing the application at all and still gain the benefit of the latest performance and security patches. It's very easy to debug, because you just use your native programming language toolchain to attach to the application and do your step-by-step debugging, and the same goes for observability.
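To make the thin-client idea concrete, here is a rough Python sketch of what connecting through Spark Connect looks like. The endpoint URI and port are placeholders, and the builder.remote() entry point shown is the form the Python client eventually took; at the time of this talk Spark Connect was still a proposal, so treat the exact API as illustrative rather than definitive.

    from pyspark.sql import SparkSession

    # The client holds no Spark internals; it only builds unresolved logical plans
    # and ships them to a remote Spark Connect endpoint over gRPC.
    # "sc://my-spark-cluster:15002" is a placeholder endpoint.
    spark = SparkSession.builder.remote("sc://my-spark-cluster:15002").getOrCreate()

    # DataFrame operations are only recorded locally as a plan...
    df = spark.range(1000).filter("id % 2 = 0")

    # ...and execution happens on the server when a result is requested;
    # only the result crosses the wire, encoded as Arrow batches.
    print(df.count())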
Rather than me talking more about it, I'd like to invite Martin Grund, the project lead from Databricks responsible for this contribution, onto the stage to show you a demo. Welcome, Martin.

Thanks, Reynold. All right, so we talked a lot about how we could demo Spark Connect, and all I got was a secret query and an iPad, so let's see what we can do. I open this app here, where I have a pre-configured Spark shell, and I will now create a DataFrame based on a secret query that I've previously copied to my clipboard. Typing on an iPad is hard, so please bear with me a little bit. Here we have the secret query I got. I don't really know what it's doing, so the next thing I'll do is print the metadata of the projected columns to get an idea of what's in there. What happens now is that Spark Connect goes to the Spark cluster, resolves the plan, and gives me back the projected columns. I can see a Spark version 3.3 column, so I'll create a new DataFrame in the PySpark notation that I know and love: select the date column, df.date, then select the version 3.3 column, and lastly the total number of downloads. To make it more interesting, I'll sort the result by date in descending order and only show the top 10 results.

So what happens behind the scenes? Every Spark Connect plan has a built-in unresolved logical plan, which Reynold already talked about, and I'm going to show you what it looks like; you can always inspect the composition of your Spark DataFrame at any point in time. As you can see here, there's the secret query, a projection, and a limit. Now let's just collect the data, and we get the result back as a pandas DataFrame. Since it's a pandas DataFrame, I can use pandas directly on my iPad to create a new column that calculates the ratio of Spark 3.3 downloads over the total number of downloads. Now I'll just print the result; again it goes back to the server, and what we see is that over just seven days since releasing Spark 3.3, it went from zero percent to about 35 percent of the total PyPI downloads. What you see here is that I used an iPad, I used Spark, I used Python, I used pandas, all in one, to show how easy it becomes to build modern data applications. And with that, I'm giving it back to Reynold. Thanks.

Thank you, Martin. I think that was actually his own personal iPad, so thank you for donating it. In case you haven't noticed, what Martin was doing was driving and orchestrating all this computation from his iPad, while offloading most of the heavy lifting, crunching through literally billions of records, onto the cloud.
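Here is an approximate reconstruction of what Martin's demo did, in plain PySpark plus pandas, assuming the remote `spark` session from the sketch above. The query text, table, and column names are guesses made for illustration (the actual "secret query" was not shown); the point is simply that ordinary DataFrame code runs unchanged through the Spark Connect client.

    import pyspark.sql.functions as F

    # Hypothetical stand-in for the "secret query" over PyPI download statistics.
    df = spark.sql("""
        SELECT date, `3.3` AS v33, total
        FROM pypi_spark_downloads
    """)

    # Inspect the projected columns (the plan is resolved on the server).
    print(df.columns)

    # Select, sort by date descending, and keep the top 10 rows.
    top10 = df.select("date", "v33", "total").orderBy(F.col("date").desc()).limit(10)

    # Collect to a pandas DataFrame and continue locally with pandas.
    pdf = top10.toPandas()
    pdf["ratio"] = pdf["v33"] / pdf["total"]
    print(pdf)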
Seven years ago, when we had just started Databricks, we thought it would be so far out of the realm of possibility to run Spark on mobile that I personally wrote a blog post for April Fools' Day called "Rearchitecting Spark for Mobile Platforms." That was before Spark 2.0 even came out; now we're at Spark 3.3. The Wall Street Journal actually fell for it and called us to try to verify its authenticity. But we were wrong: we didn't know this would become possible, and with Spark Connect it could actually become a reality. Spark Connect was officially voted in as a SPIP, a Spark Project Improvement Proposal, just a couple of weeks ago. The community is now working hard to bring it forward, and very likely you'll see experimental support in the next few releases, starting with Python and eventually rolling out to all the other programming languages as well.

The second thing I want to talk to you about today is streaming. We have seen explosive growth in streaming applications over just the past few years. Right now, on Databricks alone, we have more than 1,200 customers building and running streaming applications every day, and this has led to over 150 percent year-over-year growth in streaming jobs. So today I want to tell you more about what we're doing with Spark Structured Streaming, and with that I'd like to invite Karthik onto the stage. Karthik is a well-known figure in the streaming community; he's the co-creator of Pulsar and Storm, and among other things he was the former CEO of Streamlio, the streaming company behind the Pulsar project. He also happens to be the most technical CEO I know on the planet. Welcome, Karthik. [Applause] [Music]

Thanks for the intro, Reynold, although I think the title of most technical CEO goes to Ali; he gives a lot of great, deep feedback on design documents. So I started at Databricks last year, and one of the things I wanted to find out was why Spark Structured Streaming is popular. We talked to a lot of customers and users and found the reasons they use Structured Streaming. Here are the top five.

First is unification: the batch APIs and streaming APIs are the same, so if you're already using Spark, all you have to do is change the configuration a little and you can turn a batch job into a streaming job. Second is robust fault tolerance and recovery: Spark automatically checkpoints at every stage of the processing and stores the state, and whenever a failure occurs it recovers from the previous state. Failure recovery is also very fast, in the sense that only the failed tasks are recovered, as opposed to streaming systems like Flink, which require you to restart the entire pipeline. Third is performance: Structured Streaming provides very high throughput with seconds of latency at a lower cost, taking full advantage of the performance optimizations of the Spark SQL engine. In one of our deployments we handle around 14 million events per second, which corresponds to about 1.2 trillion events per day, for the most challenging workloads. Fourth is flexible operations: the ability to run arbitrary logic on the output of a streaming query using the foreachBatch operation, which lets you do things like upserts and writing output to multiple sinks. And finally there's stateful streaming, which is built into Structured Streaming itself: it supports many built-in stateful aggregations, joins, and other operators, with watermarks for bounding state, and in addition you can do arbitrary stateful stream processing, where you write your own stateful operators in Java and Scala.
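A small PySpark sketch of the first, fourth, and fifth points above: the same DataFrame code read as a stream instead of a batch, a watermarked stateful aggregation, and a foreachBatch hook for arbitrary output logic. The paths, schema, and sink are hypothetical placeholders, not from the talk.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Batch version would be: spark.read.format("json").load("s3://my-bucket/events/")
    # Streaming version uses the same API surface, just readStream instead of read.
    events = (spark.readStream
              .format("json")
              .schema("user STRING, amount DOUBLE, ts TIMESTAMP")
              .load("s3://my-bucket/events/"))

    # Built-in stateful aggregation, with a watermark to bound the state kept.
    counts = (events
              .withWatermark("ts", "10 minutes")
              .groupBy(F.window("ts", "5 minutes"), "user")
              .agg(F.sum("amount").alias("total")))

    # foreachBatch: run arbitrary logic on each micro-batch output
    # (in practice this is often a MERGE/upsert or a write to multiple sinks).
    def write_batch(batch_df, batch_id):
        batch_df.write.mode("append").format("delta").save("s3://my-bucket/out/")

    query = (counts.writeStream
             .outputMode("update")
             .foreachBatch(write_batch)
             .option("checkpointLocation", "s3://my-bucket/checkpoints/agg/")
             .start())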
As Structured Streaming grew, developers started using it to build new kinds of emerging applications. Let's look at a few of them. First, proactive maintenance in oil drilling, where you continuously monitor the data so that the drill bits don't strike a hot surface; you have to monitor this data proactively. Second, elevator dispatch, where you continuously monitor elevator data for emergencies and alert the dispatchers to any situation. Third, tracing microservices: stitching together the requests and responses from the microservices that serve a web request, for tracing and troubleshooting purposes. These applications expose some of the shortcomings of Structured Streaming, and if we address those shortcomings we can see even more skyrocketing growth. So what are we doing about it?

Before that, I wanted to find out what requirements these emerging applications are driving. First, consistent sub-second latency. Second, ease of expressing the processing logic for complex use cases. Third, integration with a variety of cloud sources and sinks, so that Spark Structured Streaming can operate with the many sources and sinks of the new era. Structured Streaming needs to evolve to satisfy these new requirements, so what are we doing about it? I'm very excited to announce Project Lightspeed, which takes Spark Structured Streaming to the next generation.

What is Project Lightspeed all about? There are four pillars. First, predictable low latency: we're targeting a reduction in tail latency of up to 2x. The second pillar is enhanced functionality: introducing advanced capabilities so you can capture your processing logic easily using new operators and APIs. Third, operations and troubleshooting: simplified deployment, operations, and troubleshooting, so you can get to issues quickly and get the pipeline back up and running. And finally, improved connectors and ecosystem, so you can connect to different systems in the cloud and write data into cloud sinks, along with improvements to features like authentication and authorization. In the next few slides I'll go through some of the highlights of Project Lightspeed.

The first one is about how we're going to achieve predictable low latency. Structured Streaming does several pieces of bookkeeping: one at the start of the micro-batch and another at the end. At the start of the micro-batch, it plans which records constitute the micro-batch, and that is persisted to external storage. Second, the micro-batch is started and executed. Third, the batch is marked as done; that is the second piece of bookkeeping, at the end of the micro-batch. These bookkeeping records have to be written to storage at the start and end of the micro-batch, executed in sequence, and that causes a lot of latency. When we did an implementation and measured it, we clocked around 440 milliseconds. Now, when continuous streaming is happening, with micro-batches executed one after the other, you don't need all of this bookkeeping; you can simplify it further. We can eliminate the mark-batch-done step, because the offset ranges themselves serve as the marker, and the offset ranges no longer have to be persisted to disk up front; instead they can be written while the micro-batch is executing, overlapped with its execution. When we did a proof-of-concept implementation, this came in at around 120 milliseconds, which is close to a 4x, or 73 percent, improvement in latency for stateless pipelines. [Applause]

The second improvement for predictable low latency is improved state checkpointing.
Currently, after a micro-batch executes, its state is propagated to external storage in a synchronous fashion before the next micro-batch is started. With the asynchronous checkpointing we've implemented, we overlap the micro-batch execution with the propagation of the state, done asynchronously. This achieved another 20 to 30 percent improvement for stateful pipelines, which is another major gain.

Next, let's look at some of the advanced functionality we're building, where we treat Python as a first-class citizen. Structured Streaming pipelines can be programmed in multiple languages, Java, Scala, Python, and SQL, and Python is a popular choice for a lot of developers. Python provides APIs for DDL and DML operations like filtering, joins, aggregation, grouping, and windowing, but there's a gap we're going to address: arbitrary stateful processing. There are two APIs, mapGroupsWithState and flatMapGroupsWithState, which let users write their own stateful operators, such as an exponentially weighted average for the financial community, among other operations. The challenge with these APIs is that arbitrary Python code either needs to be executed inside the JVM or executed out of process with the data brought back in, so there are implementation challenges as well.

Next, let's talk about how to improve debuggability. Streaming pipelines are brittle in any streaming system, and there can be several reasons: a surge in data when resources aren't adequately provisioned, or a bug in the user's code. The whole idea is to pinpoint the problem quickly and get the pipeline back up, especially since most streaming pipelines are real-time. Spark Structured Streaming provides a lot of metrics and logs at every micro-batch level, but what it doesn't provide is a timeline view of metrics for the operators, and operators exist at different levels: the operator as expressed at the logical level, how it's optimized at the physical level, and so on. So we need to provide a timeline view of metrics for operators at the physical level, group operators by executor, and incorporate source and sink metrics, so you have a single pane of glass to see how the source is producing records, how Structured Streaming is processing them, and how the sinks are receiving the data.

Finally, connectors and the ecosystem. We're going to add more connectors, for Amazon DynamoDB and Google Cloud Pub/Sub, plus a sink for Amazon Kinesis so operational pipelines can send alerts and other output, and we're going to improve existing connectors: performance improvements for the Delta connector so we can write data into Delta faster, AWS IAM authentication for Amazon's managed Kafka service, and enhanced fan-out (EFO) support for Amazon Kinesis, which gives you higher bandwidth. And there are many more features. If you're interested in more about Lightspeed, a blog post will go live at 9:30.
Please look it up; it lists all the features we're going to roll out in Lightspeed. And if you're interested in collaborating with us in the community, there are a lot of JIRAs we've opened in open source, and we'd very much welcome working with the community. Come and talk to us, or come and talk to me. That's all I had, thank you.

That's super exciting. We actually saw two-and-a-half-times growth in revenue for our streaming workloads last year, so I think streaming is finally happening. Every year we wait for the year when streaming workloads take off, and I think last year was it. I think it's because people are moving to the right of this data and AI maturity curve, and they have more and more AI use cases that just need to be real-time, like real-time fraud detection.

So I am really excited to introduce the next speaker. He really doesn't need an introduction, but Michael Armbrust is the creator of Spark SQL, he's also the creator of Structured Streaming in Spark, and he's the creator of Delta. Before I welcome him to the stage, I want to tell you a story about him at Berkeley. At UC Berkeley he was always the student who was first with everything. He was the first student to have an iPhone, so everybody would gather around his desk and say, oh wow. He was also the first one to start using Git for version control at UC Berkeley, so can you imagine: if he didn't exist, we might not have version control in Spark today. Without further ado, welcome Michael Armbrust. [Music]

Thank you so much, Ali. I am really excited to be here, in person and with all of you virtually, to talk about Delta Lake. In this talk I'm going to do three things: I'm going to explain what Delta Lake is and why it's the foundation of the lakehouse; I'm going to tell the story of Delta Lake's history, which is deeply entwined with this conference we're all at today; and finally I have a really exciting announcement about Delta's future.

First of all, what is Delta, and why is everyone talking about it? As Ali said this morning, there's always been this big divide between data lakes and data warehouses. Data warehouses were the traditional technology: really easy to use, really fast, but expensive and not very scalable. Data lakes were the young upstart: they came in and let you store tons of data, but they were kind of slow and clunky and difficult to configure. Delta was created to unify these two worlds. It brings ACID transactions to the data lake, it brings speed and indexing, it doesn't sacrifice scalability or elasticity, and it's what enables the lakehouse.

When Delta first started, it was mostly a Spark technology, but that couldn't be further from the truth today. We have connectors for everything, from old-school technologies like Hive to fancy new ones like dbt, which you'll hear about here in just a little bit. This year has been no different: we've added a ton of connectors, with support for Flink, Trino, and Presto, and we're working on support for Pulsar and Google's BigQuery as well. As the ecosystem has expanded, so has our user base. About a year and a half ago, when we released Delta 1.0, we were getting about half a million downloads per month; today it's over seven million downloads per month, which is pretty cool. The other thing that's really been changing in the project is the health of the contributor community.
The graph I'm showing here is a metric from the Linux Foundation that looks at the health of contributions in any given open source project: how many unique people are fixing bugs, responding to pull requests, and merging code. You can see just how much momentum there is behind the project; it has increased by over 600 percent in the last three years.

Now I want to rewind and take a trip down memory lane to talk about the history of Delta. Let's go back to the year 2017, a much simpler time. I was working on Structured Streaming and Spark SQL, as Ali just said, and I was talking to a bunch of users at this conference about how they were processing tons of data from a variety of data sources, all in parallel, in the cloud. It was all great, and pretty much every single one of them, when they were done processing data, was writing it out as Parquet files to S3. Parquet is a pretty cool open columnar format, one piece of what a database gives you, but there were still a bunch of problems. It turns out that a big collection of files is not a database. I was constantly fielding bug reports from users saying "Spark is broken," when what was really happening was that they were corrupting their own tables, because there were no transactions: when their jobs failed because a machine was lost, nothing cleaned up after itself, and multiple people would write to a table and corrupt it. There was no schema enforcement, so if you dropped data with any schema into the folder, it could make the table impossible to read. There was a bunch of added complexity in working with the cloud; the Hadoop file system just wasn't built for it, and I'm sure people in this room remember setting the direct output committer and how, if you got it wrong, things would break. Even just working with large tables was slow: simply listing all the files could take up to an hour.

So it was here, at what used to be known as Spark Summit, that I talked to a bunch of people and thought there had to be a better way. I always take a couple of days off after the summit to decompress, and I was so excited that I wrote a design doc during that vacation. In 2018 we came back on this stage and announced Databricks Delta, one of the first fully transactional storage systems that preserved all the best parts of the cloud, and even better, it was battle-tested by hundreds of our users at massive scale. In fact, one of my friends, Dom, got up here and told us about his use case, where he had been using Delta for a year to process petabytes of data in real time, with hundreds of analysts around the globe, for a critical information security use case. If you haven't seen that video, I suggest you go to YouTube and check it out.

But Delta was too good to keep just for Databricks, so in 2019 we came back and announced the open source Delta Lake. We didn't just open source the protocol, the description of how different systems can connect and make transactions; we also open sourced our battle-tested Spark reference implementation and put all of that code up on GitHub. But we weren't done with Delta. We believed in this so much that Databricks started to commit its business to it, and in addition to all the exciting things happening in open source Delta Lake, we were busy building features to make Delta even better.
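To connect this history to code: the 2017 pattern Michael describes was dumping raw Parquet files to object storage, whereas a Delta write records the same data through a transactional log. A minimal sketch, assuming an existing SparkSession `spark` and DataFrame `df`, with placeholder paths:

    # The old pattern: a directory of Parquet files, no transactions, no schema enforcement.
    # df.write.mode("append").parquet("s3://my-bucket/events/")

    # The Delta pattern: the same write, recorded as an ACID transaction in the Delta log.
    df.write.format("delta").mode("append").save("s3://my-bucket/events_delta/")

    # An existing Parquet directory can also be converted in place.
    spark.sql("CONVERT TO DELTA parquet.`s3://my-bucket/events/`")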
We added a command called OPTIMIZE, which transactionally takes all of your tiny files and compacts them into larger ones, so you get dramatically better performance. We built a companion to it called OPTIMIZE ZORDER, which maps your data onto a multi-dimensional space-filling curve so you can filter efficiently on multiple dimensions. That works really well with a trick called data skipping based on statistics, which is basically a coarse-grained index for the cloud. We added the ability to write to these tables from multiple clusters, and a whole bunch of other things I don't have time to talk about.

But there was a problem: you could read and write Delta from anywhere, yet all of these advanced features were only available inside Databricks. That's why today I am really excited to announce Delta Lake 2.0: all of Delta is now open source. If you look at this feature matrix, you can see we're already well on our way. If you've been following the project closely, you might have noticed something was up: we've been opening tickets on GitHub and rapidly open sourcing all of these features and bringing them out for the community. What we've seen is that this dramatically improves the performance of the open source project. The baseline is Delta 1.0; with the OPTIMIZE command, performance improves a bit, but when you add in Z-ordering and data skipping, performance gets really good, which is super exciting. This is the same TPC-DS query we've been showing all day today.

The other really exciting thing is that Delta is now one of the most featureful open source transactional storage systems in the world. We're the only one you can run directly against cloud storage systems like ADLS without any extra infrastructure, we're the only one with data sharing, and there's a whole bunch of other differentiated features. We also did some performance comparison across the different open source projects, and Delta is somewhere between two and four times faster than the next competitors. And we're not the only ones who have noticed this performance difference: this chart is from the folks at Databeans, users of the open source project, and as you can see, they found we are dramatically faster than Iceberg, not only at loading data but at processing it.

But we're not done yet. There's some really cool technology waiting in the wings, and I want to give you a quick preview of one of those things. There's always been a problem with columnar formats like Parquet: because of the encoding, when you want to update even a single value, you have to rewrite the entire file. This is called write amplification in databases, because it takes a single tiny write and turns it into a big copy of a lot of unchanged data. So we're very excited to add a new technology to the Delta protocol called deletion vectors, which lets you mark a row as deleted so you only write out the data that changed. This will dramatically speed up deletes, updates, and merges. We've already started on this effort: there are a couple of JIRAs, some already resolved, where we've been adding the groundwork to both Parquet and Apache Spark.
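For reference, here is roughly what the maintenance commands discussed above look like through Spark SQL with the Delta features being open sourced in Delta Lake 2.0. The table and column names are illustrative, and `spark` is an existing SparkSession. The row-level DELETE at the end is the kind of operation deletion vectors are meant to make cheap, since today it rewrites whole Parquet files.

    # Compact small files into larger ones, transactionally.
    spark.sql("OPTIMIZE events_delta")

    # Compact and co-locate data along a space-filling curve for multi-column filtering.
    spark.sql("OPTIMIZE events_delta ZORDER BY (user_id, event_date)")

    # A single-row delete: without deletion vectors this rewrites every affected file,
    # which is the write amplification described above.
    spark.sql("DELETE FROM events_delta WHERE user_id = 'u-123'")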
I would be remiss if I didn't recognize all of the great members of the community, so could we give a round of applause for all the people who've made Delta what it is today? If you'd like to learn more, there are a ton of exciting things going on at this conference: come join us for our AMAs or our deep dives into various topics. If you want to get involved in actually coding on the project, come to our meetup, talk to some committers, and find some good projects to get started on. If you're joining us virtually, you can still join the community: check us out on GitHub, on Slack, or on Twitter. With that, thank you very much. Now I'd like to introduce Dave Weinstein, Vice President of Engineering at Adobe's Experience Platform, who's going to tell you about their journey into the Delta Lake. Thanks. [Music]

Good morning, everyone. I hope you're as excited as I am about the big announcement we just heard from Michael. Let's take one more moment to applaud the entire Databricks company for taking this step to open source Delta Lake. Like Databricks, Adobe has a deep heritage of contribution to open source and open standards, from three weeks ago, when we were part of proposing the OpenCost project to the Cloud Native Computing Foundation, back to 1992, when we invented PDF, the most widely used open document standard in the world. Adobe has always believed that open source and open standards are critical to delivering on our mission.

At Adobe, our mission is to change the world through digital experiences. We unleash creativity by empowering creative people to connect in myriad ways. From the media we consume to the ads we see to the websites we browse and shop on throughout the day, chances are those experiences were touched by Adobe technology. Adobe gives everyone, from emerging artists and influencers to the largest global brands, what they need to design and deliver exceptional experiences. With Creative Cloud tools like Photoshop and Illustrator, we empower creative people everywhere to build beautiful and engaging visual experiences. With Document Cloud tools like PDF, Acrobat, and Sign, we accelerate document productivity, because sometimes you can't meet in person. And with the Adobe Experience Cloud, we power digital businesses, helping companies design and deliver experiences that drive profitability as well as loyalty. While most people know what Adobe does for document and creative productivity, far fewer are aware of the scale and impact we have with the Experience Cloud on some of the biggest brands in the world. Our tools enable them to deliver exceptional experiences to their own customers, impacting all of us here today and everyone watching online.

So what kind of scale are we talking about? It's a lot. Every day, our real-time data collection systems take in over 150 billion discrete requests from our customers' websites, mobile applications, and real-time streaming systems. We manage well over 32 billion distinct customer profiles across a broad swath of the largest brands in the world, and every day we compute over 24 trillion distinct segment realizations; our system is continuously creating actionable audiences, allowing customers to personalize their customers' digital experiences.

So how does Databricks help us solve the big challenges of understanding and optimizing customer experience at this kind of scale? We'll focus on two big challenges. The first is what I might call the merging of the hemispheres.
built to optimize customer experience suffer from what we would call a split brain personality disorder where the left and the right brains struggle to effectively work together to deliver on the promise of digital customer experience they're separated between the left brain the place where analytical workloads are run and their right brain where real time decisions about personalized experiences are made the left brain in this equation is typically a customer data lake or data warehouse it's where data is explored where reports are generated where new hypotheses are tested and usually where ai and ml algorithms are deployed the right brain is real-time experience delivery which requires the use of low latency access patterns point reads with latencies counted in milliseconds keeping these systems in sync is a big challenge we see this same pattern in customer after customer they struggle to keep their analytical and operational systems coordinated and they suffer from this split brain personality disorder so when we built the adobe experience platform we set out to solve this problem for our customers at the center of the adobe experience cloud we've built a single common platform to integrate our applications and our customers' experience data it's here at the foundational layer of our tech stack where we leverage databricks and apply the lake house concept and the delta lake table format we provide easy ways to bring in data whether it's streaming in real time or onboarding massive batches and we've created a common paradigm for activating data seamlessly into whatever systems are needed to deliver personalization at scale the system brings the left and right brains together allowing the analytical and operational brains to work as one unified whole at the core is the architecture for which databricks has coined the term lake house it has the scale and cost characteristics of a traditional data lake with the transactional and change data feed capabilities of a traditional data warehouse or database we can leverage those change data feeds so that every record being inserted updated or deleted can be propagated continuously in real time into those real-time operational systems this ensures that the operational brain can drive creative actions taken in the analytical hemisphere the second challenge we faced was to address the needs of customers in the most challenging and highly regulated environments with respect to data governance and privacy we heard from customers in healthcare financial services and telecoms that they need specialized capabilities in this area to effectively satisfy the needs of enterprises for data governance at scale it became clear we also needed to support acid-style transactions at scale this is where the capabilities of a lake house came into play for access control we worked with databricks to add the ability to apply a filtering schema at query time so we can control which users and groups have access to the data in the case of user activity audits we can leverage the time travel and versioning capabilities in delta lake to support auditing of changes over time and finally data hygiene is where we saw some of the biggest benefits from delta lake the change data feed allows us to manage record level deletes and updates at scale and cost effectively without delta lake in the past we had one illustrative example where when we were leveraging iceberg we had to update 27 billion records in order to delete just 27 rows
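A short sketch of the two Delta features leaned on here, the change data feed and time travel; the table name `profiles`, its columns, and the version numbers are invented, and a Delta-enabled SparkSession is assumed.

```python
# Illustrative only: table/column names and versions are made up.

# Enable the change data feed on an existing Delta table.
spark.sql("ALTER TABLE profiles SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Incrementally read every insert/update/delete since version 12, so a downstream
# operational store can stay in sync without reloading the whole table.
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 12)
           .table("profiles"))
changes.select("_change_type", "_commit_version", "profile_id").show()

# Time travel for audits: read the table exactly as it looked at an earlier version.
old_snapshot = (spark.read.format("delta")
                .option("versionAsOf", 12)
                .table("profiles"))
```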
not only has this accelerated our ability to cost effectively deliver those types of capabilities at scale it's also accelerated the delivery for the main architect who worked on the data lake portion of this data hygiene project he estimated that we shaved two years of development time with our move from iceberg to delta lake so how did we take advantage of lake house architectural patterns in order to support the new set of capabilities we partnered very closely with databricks every step of the way first in early 2021 we sat down with databricks shared our roadmap and they shared their roadmap and we aligned on what were the major challenges that we would have to overcome to be successful in this journey second we had weekly meetings with databricks and we executed on 50 distinct pocs to verify support for our existing workloads and make sure that this transition would only bring positive benefits we did find a few areas where there were feature gaps and we partnered very closely with databricks to close those feature gaps and we're really grateful for the partnership that they brought to bear there and then finally just a month ago at the end of may we launched the first offering built on top of this new architecture healthcare shield for our real-time customer data platform product and i'd really like to thank not only databricks but the entire community around spark and delta lake for the support that they've shown us in being able to achieve this thank you so much if you'd like to learn more about adobe experience cloud or adobe experience platform please visit the link above or scan the qr code and it'll take you there and i'd just like to thank everybody here today for letting me share our lake house story with you [Applause] [Music] all right adobe experience cloud is really really cool actually you guys should check it out it's powering most websites and now at the core it has a lake house and we're super excited about the delta 2.0 open sourcing announcement so we're very excited to build a huge community around that project but next i want to talk about data governance security privacy and we put the smartest person we had at databricks to work on this so i'm going to introduce matei zaharia who of course needs no introduction in this audience he created apache spark but before he comes on i want to tell you back in berkeley 10 years ago we were at the gym and we're trying to lift weights and he's telling me hey i think we can actually do streaming inside spark and that's how spark streaming came about at the gym in berkeley so matei welcome to stage [Applause] all right thanks everyone really excited to be here you can see ali kept going to the gym after that and i didn't but you know it was a lot of fun when we were going there so yeah so i'm going to talk today about data governance and sharing for the lake house i'll talk about some things we're doing on the databricks platform as well as a lot of work we're doing in open source with the open source delta sharing project so let me start by talking about governance so anyone who's had to do it knows that governance for data and ai today is very complex because of the very complicated technology stacks involved with multiple different systems and by governance just to be clear i mean controlling who has access to data auditing that and in general understanding how data is used in your organization and so one of the big challenges is that there are so many different technologies
involved with different ways of doing governance for example in a typical enterprise today you'll have a lot of data in your data lake think something like amazon s3 and if you want to control permissions on that you can set permissions at the level of files and directories now this is already a bit of a problem because it means you can't set fine grained row and column permissions and it's also very hard to change your model for how you organize data you have to move all the files around if you want to change your directory structure so that's already kind of awkward now on top of that you probably want to think of your data as tables and views so you might have something like hive metastore where you set permissions on tables and views and it sounds great but the problem is those permissions can be out of sync with the underlying data and so that leads to a lot of confusion and then you also have your data warehouse you have richer ways of setting permissions there but it's just a different governance model you set it all up in sql with grant statements and then you have many other systems like your machine learning platform dashboards and so on that each have their own way of doing permissions and you have to somehow make sure your policies are consistent across all of these so a year ago at data and ai summit we announced a major new component of the databricks platform unity catalog which gives you a unified governance layer for data and ai assets and the idea is really simple for all the kinds of data assets you can create in databricks we have one interface for managing permissions managing auditing and so on and that's unity catalog and when designing this we wanted to make the interface to set the permissions and to manage it very open so we actually chose to base everything on ansi sql grants so anyone who's administered a database any tool that knows how to set permissions in a database can use unity catalog to manage all these assets and we also set up centralized auditing and lineage so last year we were just starting to roll out this product we're still developing it very actively and since then we've added a lot of functionality to unity catalog so i want to tell you a little bit about some of the new things we added and what's coming next so first of all just how do you use unity catalog so this is the most basic thing you can do with it you can set up access controls you can set them up using standard sql you can use a rest api or ui as well and we've extended this beyond tables to other kinds of objects like for example files in your cloud object store and we do that in a manner that's consistent with the permissions you set on tables and we also have really easy access to audit information it's just a system table that you can read with all the actions so very easy to see everything happening on your lake house now beyond that we have built-in search and discovery that allows you to quickly search and document things about all the data in your organization in the ui and we also have a really powerful lineage feature that we just launched this allows you to track lineage on tables columns dashboards notebooks jobs basically anything that you can run in the databricks platform and see what kind of data fed into each one and who's using it downstream so it's very useful for understanding how data is used fixing issues with data and so on
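A hedged sketch of the model just described: permissions set with ANSI-style SQL GRANTs, and audit events exposed as a queryable system table. The catalog, table, location, and group names are invented, and the audit table name (`system.access.audit`) is an assumption about how the log is surfaced.

```python
# Illustrative only: object and principal names are made up.

# Grant a group read access to one table using plain SQL.
spark.sql("GRANT SELECT ON TABLE main.sales.transactions TO `data-analysts`")

# The same model extends beyond tables, e.g. to files in an external location.
spark.sql("GRANT READ FILES ON EXTERNAL LOCATION landing_zone TO `data-engineers`")

# Auditing: recent actions, read like any other table (table name is an assumption).
spark.sql("""
    SELECT event_time, user_identity.email, action_name
    FROM system.access.audit
    ORDER BY event_time DESC
    LIMIT 20
""").show(truncate=False)
```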
so to show you a little bit of this in action i'm actually going to do a short demo of just the lineage feature so i'm going to head over here and log in to do that let's see all right so i think if you can show that on the screen okay great so i'm here in databricks this is just one of our actual deployed environments and i have a little notebook that's going to process some wind turbine data and i'm just going to get this going so you can see this has a python command at the top and then a sql command so the python command is taking data from three tables and joining it together to create this table called turbine master and then the sql command is creating a new table turbine features based on some computation from this command over here and so you know they're just arbitrary code the lineage feature we have works with any computation you do in spark in any programming language so now let's look at the data explorer here and try to find this data so this is the table i created turbine master and you can see here all this information about it you can set permissions on it and there's also a tab for lineage so one of the things you can do is see the lineage graph and this is actually computed in real time as you do work on databricks so what you see here this table was created by combining three other tables and then it was used to produce one thing and you can even click on these guys and see what's producing each of these upstream so we connect all this information as you do computation on the platform and beyond that the lineage also extends to individual columns so for example for this column i can see where it came from upstream and i can also see downstream it's actually used to compute two columns in this features table and it even extends to notebooks so i can see this notebook was used to create the table and also a whole bunch of notebooks that are using it and that's useful for example if i ever change the table i want to know which users i'm going to impact by doing that so that's lineage in unity catalog so switching back let's switch back to the slides here all right so that's just one feature of unity catalog we've also been working to integrate unity catalog with partners across the modern data stack and we have integrations with many best in class tools so this includes advanced governance tools that can set sophisticated policies across databricks as well as other platforms and also leading products in data ingestion bi and data pipelines and we also have a bunch of customers using unity catalog and giving us great feedback so here are just two examples block is using us for financial data that they process and milliman is using us for healthcare data and in both cases it's really simplified the way they manage data at scale so today i'm really excited to announce that unity catalog is going ga we'll be rolling out ga in the coming weeks for unity catalog so yeah very excited to see what everyone does with it and we also have an exciting roadmap ahead some of the big things we're working on include attribute based access control this allows you to set tags on all the kinds of objects you can have in databricks and then set a policy that applies to anything with a specific tag like say all your data tagged finance including dashboards models all that stuff and we're also working on easy row and column filtering within a table to show people just different pieces of it so that's unity catalog
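The demo uses the Data Explorer UI; for completeness, here is a hedged sketch of how the same upstream and downstream questions could be asked programmatically, assuming lineage is also exposed as system tables. The `system.access.table_lineage` name and its columns are assumptions, and the table names are invented.

```python
# Assumption: lineage is queryable via a system table; names below are illustrative.
upstream = spark.sql("""
    SELECT DISTINCT source_table_full_name
    FROM system.access.table_lineage
    WHERE target_table_full_name = 'main.wind.turbine_master'
""")
upstream.show(truncate=False)

# Downstream impact analysis: what is built on top of turbine_master,
# useful to check before changing its schema.
downstream = spark.sql("""
    SELECT DISTINCT target_table_full_name
    FROM system.access.table_lineage
    WHERE source_table_full_name = 'main.wind.turbine_master'
""")
downstream.show(truncate=False)
```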
one key step towards maximizing the value of data in an organization is actually being able to govern it but with databricks and the lake house you want to look at all the ways you can maximize the value of data and there's a lot more that you can do so for example a second important pillar to really using data effectively is sharing data between organizations and we support that through the open source delta sharing project that we started last year at this conference so i'll tell you a little bit about what's new with that so first of all why is data sharing important many organizations are now starting to share data either with partners just to improve a common business process you can share details about what you're doing together but also many are starting to monetize data and gartner for example says that they expect three times better economic performance from companies that share data and are part of an ecosystem and they also see 50 percent more of these ecosystems starting just by next year so today you can do data sharing in a number of proprietary platforms mostly data warehouses but there are a lot of problems with using these so we talked with many data providers many data consumers and they all had some challenges with these and the issue is that the sharing is only within one technology platform for example in bigquery you can share data with other bigquery customers and in redshift you can share data with other redshift customers but if you ever want to share with someone on a different technology platform it's a problem and this is a big issue for data providers because you work hard to create a data set and now you have to copy it into you know 5 10 20 different systems just to reach all your users and you want to reach as many users as possible with minimal maintenance overhead so the problems are because you have vendor lock-in here the only way to actually reach a lot of users is through expensive replication and maintenance of all these data sets so when we looked at data sharing we took a very different approach based on open source so last year at the summit we created delta sharing an open standard for data sharing and it's a very simple rest api that any platform can implement and basically any system that can process parquet can read data through delta sharing the way it works is the provider has a delta table in cloud storage they can run the server in front of it and add users and then the users can connect with any client all the ones i'm showing there like pandas apache spark all have plugins power bi has a plugin to read from delta sharing and then those can process it anywhere they are they don't have to be on the same software platform and the transfer is efficient because it uses a feature of cloud object stores that allows you to give someone temporary access to read just one file so you don't have to stream all the data through the server it's actually really fast so now you get cross-platform sharing you publish your data once and people can consume it from anywhere and you can share your existing large-scale tables without copying them into a different system so at the last summit when we announced this we were just putting up the github repo it was literally empty two or three months before the conference so it was a brand new thing so we didn't know how this would do and we've been very excited with the growth of the community since then actually today there are petabytes of data exchanged using delta sharing every day
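Both halves of the flow just described, as a minimal sketch: the provider side shown as Databricks SQL (the open-source reference server is configured with a YAML file instead), and the consumer side using the open delta-sharing Python client. The share, recipient, table names, and the profile file name are all invented.

```python
# Illustrative only: all names are made up.

# --- Provider side (on Databricks; the OSS reference server uses a YAML config) ---
spark.sql("CREATE SHARE trips_share")
spark.sql("ALTER SHARE trips_share ADD TABLE main.nyc.taxi_trips")
spark.sql("CREATE RECIPIENT partner_co")   # represents the consuming organization
spark.sql("GRANT SELECT ON SHARE trips_share TO RECIPIENT partner_co")

# --- Consumer side (any platform; only needs the delta-sharing client) ---
import delta_sharing

profile = "nyc.share"                      # profile file sent by the provider
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())            # discover what has been shared

# Read a shared table straight into pandas; the client fetches Parquet files via
# short-lived pre-signed URLs, so nothing is copied into another warehouse first.
table_url = f"{profile}#trips_share.nyc.taxi_trips"
pdf = delta_sharing.load_as_pandas(table_url)

# Or, on a cluster with the Spark connector installed, as a DataFrame:
sdf = delta_sharing.load_as_spark(table_url)
```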
just on databricks alone and this is data that's actually read and processed across organizations it's not just people publishing stuff and here are just two examples of customers that are using this so nasdaq is using delta sharing to monetize very large data sets they have that they couldn't share through legacy platforms they were just too big too expensive but they were just sitting there in amazon s3 in an open storage format so now they can actually monetize those and then shell is using it to exchange massive data sets within the energy industry where a lot of companies want to coordinate on fine grained data to improve their processes together we've also added a lot of new things in the open source project you can see a lot of new connectors we started with just pandas and spark and there are many that have been released and are in progress we added a feature called change data feed a very popular feature from delta lake that lets the consumer see what's changed in each release of a table which makes it much easier to consume shared data sets and also for databricks users we added very easy one-click sharing with another databricks account just put in their account id and you can share a table with them and we're also excited to announce that delta sharing is going ga and it's going to roll out in the coming weeks [Applause] and like unity catalog we have an exciting roadmap ahead two of the big things we're working on are sharing views which allow you to do fine-grained data filters and other computation before you send the data to someone and also sharing streams so someone can consume a table as a stream to do real-time processing on data across organizations okay so that's data sharing we think it's really important for maximizing the value of data and so today at the conference we're also announcing two brand new efforts that build on delta sharing to further expand how organizations can use their data in today's ecosystem and these are the databricks marketplace for commercializing your data and databricks clean rooms for private computing so let me just briefly explain each of these so i'll start with marketplace so we looked around and you know many cloud providers offer data marketplaces but when we asked users about these marketplaces such as data providers they actually don't use them that much and they said there were a couple of limitations one limitation is that each marketplace is closed it's for a specific cloud or a specific data warehouse or software platform because the goal of these from the vendors is to get more people computing on their platform and paying them money and so that's nice for those vendors but if you're a data provider and you worked hard to create a data set it's really annoying to have to publish it to 10 different platforms just to reach all the users who want to use your data set so it's a problem and then from the user side one of the challenges is these are just limited to publishing data sets so you go in there you see a table or a file and you can pay you know fifty thousand dollars to get it or whatever but what are you gonna do with that what if it's not useful to you like all you get is that table then you have to figure out how to use it so it would be nice to share more than that to share kind of entire applications or solutions so we wanted to rethink the concept of data marketplace
we think what people are looking for is a bit more general it's sort of a solution marketplace and we also thought it's really important to be open so as a publisher you publish stuff once and then people can consume it everywhere and that's what we're doing with databricks marketplace it's an open marketplace for data solutions that's built on delta sharing so any client that can read delta sharing can actually access this marketplace and this has some really nice benefits for both providers reaching more users and publishing more complete applications and for consumers who can actually get started with something that includes not just data but code you know notebooks ml models dashboards examples of how to use the data and we've set it up so pretty much anything you can build on the databricks platform you can publish on the marketplace to give someone a complete application so to demonstrate databricks marketplace i'd like to invite zaheera valani our senior director of engineering who's going to give you a demo hey thank you matei i'm so excited to introduce the databricks marketplace an open marketplace where all the data and ai assets are in one place to help you get to insights faster so how does it work let me walk you through it from the perspective of an end consumer let's imagine that i'm a data analyst working on an acquisition in the retail space i need some data on purchasing trends and i need it right away because we're in a competitive situation and we need to make a decision soon here in the databricks marketplace i can already see a variety of data products ranging from financial data products from providers like nasdaq to healthcare products from providers like iqvia below the featured providers each tile represents a data product a data product includes not just packaged data sets but it can also include dashboards notebooks and even machine learning models and i can search across all these data products right here so let's put in retail let's take a look at this data product from yipitdata a leading market research data and insights firm it looks like something that i can leverage but i want to learn more let's take a look in the product details page yipitdata has provided an overview of the data product and some potential use cases to help me learn more about the offering i really need to speed up my understanding of this data set so how do i do that yipitdata has also included a notebook with some examples of working with the data let's take a look oh wow here are some example visualizations and analysis of the data this is so helpful before i would have spent days trying to get these types of insights this notebook has really helped accelerate my understanding of the data set and it's given me a head start to start thinking about how this can apply to my use case but there's more here on the top right i can navigate to a live dashboard of the full data set i'm getting to do all this exploratory analysis before i get the data that way i can be confident that it's what i need for my use case and this dashboard could be really useful to my organization on an ongoing basis now that i'm confident that i want to use this data let's go back to the marketplace and we'll get the data there's a catalog telling the marketplace where to provision the data product and i'll also get the dashboard and the notebook and i'll get the data and in less than a second i now have access to the data [Applause] behind the scenes when i clicked on get data the provider is
automatically provisioning the delta share in my workspace and the dashboard and notebook are deployed into my workspace and so now the provider will also be notified that i requested access to the delta share and they can follow up with me as part of the evaluation i can click on the notebook and continue on with my analysis even if you're not a databricks customer you can still go through this entire workflow the databricks marketplace is not just for databricks customers it's open to all what this means is that if you register for an account you can log in and browse and access the marketplace and because it's powered by the open delta sharing protocol you can use the tool of your choice so for example if you're a power bi user power bi can connect directly to the data provider using delta sharing consider what we just did with the databricks marketplace you can search across a variety of data products you can use descriptions notebooks and dashboards to quickly evaluate if it meets your needs and you can get the data and related assets delivered to your workspace without having to configure or build any data ingestion stay tuned for more details on the databricks marketplace later this year [Applause] thank you all right thanks so much zaheera that was really cool to see so this is the first open marketplace for data and ai in the cloud we already have some awesome partners with the marketplace as you saw and we're excited to see what everyone else will do with it so if you can switch back to the slides okay there we go okay so the last thing i want to cover here is private computing with databricks clean rooms so what are clean rooms if you're not aware of the term they're an emerging concept that a lot of organizations are starting to use to run computations on joint data and basically they're a way to set up a secure environment where different owners of data sets can put in data sets of their choice and run computations that are mutually approved by everyone involved so for example if you think of a retailer and an advertiser they may have a lot of questions about how campaigns are doing together but neither of them wants to give its full data set to the other one that's very risky but they could set up a clean room to run a specific computation and the same thing applies say to two banks that are trying to detect fraud together and many other use cases now clean rooms have existed for a while but the existing solutions have a couple of drawbacks a lot of them are just built into sql only systems so the only things you can do in the clean room are sql and this isn't enough for more advanced analytics like machine learning graph processing the things you need for a lot of the use cases i talked about and also these are proprietary platforms so you have to copy all your data in so for small data sets maybe you can copy them in and do this stuff but if you have massive petabyte scale data sets in cloud storage you can't do that it's very expensive so at databricks we actually happen to have technology that can address both these problems first of all we have a cloud platform that lets you run the very best open source data and ai tools and we can run it in a serverless automated and secure fashion and second we have delta sharing which lets you securely share access to some of these large data sets
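Databricks Clean Rooms is only being previewed here, so there is no public API to show; the sketch below is purely conceptual and illustrates the pattern described: each party shares one table into a neutral environment and only a mutually approved aggregate job runs there, so neither side ever sees the other's row-level data. All names are invented.

```python
# Conceptual sketch only -- not the Databricks Clean Rooms API.
# Assume each party has shared one table (e.g. via Delta Sharing) into a neutral
# workspace, and the only computation both sides approved is this aggregate
# overlap query; no row-level data from either side is returned.
approved_query = """
    SELECT a.campaign_id,
           COUNT(DISTINCT r.customer_id) AS converted_customers
    FROM advertiser_share.impressions AS a
    JOIN retailer_share.purchases     AS r
      ON a.hashed_email = r.hashed_email
    GROUP BY a.campaign_id
"""
spark.sql(approved_query).show()
```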
so this is what we're using in databricks clean rooms it's a simple clean room solution that allows you to run any kind of computation that the databricks platform supports on existing lake house data so if you have parties involved in a clean room they can mutually agree to create a clean room they can each share some tables into it and they can each approve jobs and it's got three benefits the jobs can be arbitrary computations you can run on databricks they can use python r sql gpus you know tensorflow whatever you want you can run it in your clean room computation you don't need to replicate your data and it's scalable to a large number of collaborators and large data sizes so we're just getting started today with this feature but we're very excited and we'd love to hear about your use cases for it so that's it for governance and sharing on the lake house we now support four really important ways of maximizing the value of the data and the really cool thing to me is that we're doing all of these following the lake house philosophy so they all work with existing data at massive scale in open format and they all have open interfaces that means it's not just databricks but a wide variety of computing platforms that can participate in these and we're very excited to see what you do with them [Applause] all right that was amazing so we now have unity catalog that lets you govern all your resources not just tables not just files any data asset machine learning models so that's fantastic you can lock things down you saw the marketplace i'm super excited about that because it's an open marketplace it's actually the only open marketplace where you don't have to be a databricks customer to use any of the data assets that are shared there and you can actually share all kinds of data assets not just you know tables it could be dashboards it could be apps all kinds of things so super exciting and then the clean rooms for people who don't want to share the data set but they want to have a way in which you can actually run computations on the shared data sets so that's super exciting okay so next i want to welcome to stage tristan handy from dbt labs i'm very excited about this talk because tristan essentially created a new persona the analytics engineer didn't really exist before dbt labs and dbt and tristan he basically enabled a whole generation of data analysts to become engineers and do what otherwise only data engineers could do and the reason i'm really excited about this is that i think these worlds are merging i think the worlds of data analytics and analytics engineering and data engineering and data science are getting closer and closer together people will do more python more machine learning more data engineering so with that really excited to welcome tristan on stage tristan [Music] wow it's great to be here i come to you from a far away land the world of the data analyst this is a world that i know and love as a data analyst i deeply understand the businesses that i work at i understand how to observe their performance and diagnose their problems using data i write a lot of sql and a little python i brainstorm in dashboards and sometimes i even make a spreadsheet or two i'm not very opinionated about my tooling it's all about getting the job done but over the past six years i've become very opinionated about my workflow let me take a minute to tell you that story six years ago data analysts created a ton of one-off artifacts sql queries that got saved on a laptop
spreadsheets that were fragile and broke and dashboards that became out of date as schemas changed this didn't really feel like anyone's fault it was just the way that things had always been but i hated this i i really hated this it meant that data analysts just weren't as effective as i knew that they could be i had worked in software companies for a long long time and i understood how software engineers created leverage they wrote modular tested scalable code they checked it into source control they deployed it with ci cd pipelines they treated it like an asset not like some waste product they also built cultural norms around the stewardship of this code they practiced agile and devops and sre the result is that software engineers operate with leverage experienced software engineers are some of the best paid professionals in the world and it's all because of this set of tools norms and practices i felt left out i wanted to work like this as a data analyst so back in 2016 my co-founder drew and i built this little open source tool called dbt our goal was to help people like me author and deploy modular code inside of cloud native data platforms at the outset really truly nobody cared i wrote this big manifesto back in 2016 about how analysts work needed to change and it was received with crickets it's not that anyone disagreed it's just that this entire train of thought was like completely foreign to the analytics ecosystem back then but slowly slowly a small community of like-minded folks came together by the end of 2016 there were a whole hundred of us and year after year new practitioners got exposed to the power of bringing software engineering practices into analytics they got excited about how this could impact their work and their careers today there are over 30 000 of us in the dbt community and there are over 12 000 companies using dbt in production somehow something happened an entire ecosystem turned on to the idea that their work could be done differently early in this journey we realized that we needed a name for this thing that we were doing it wasn't data analysis but it also wasn't really data engineering or at least it wasn't data engineering done the way that data engineers had traditionally thought about it the community settled on the term analytics engineering here's the definition that claire carroll uses in her 2019 article analytics engineers provide clean data sets to end users modeling data in a way that empowers them to answer their own questions empowers them to answer their own questions i think of analytics engineers like librarians a librarian's job is to make sure that knowledge is well organized that folks with questions can find answers what we realized was that this hybrid persona the analytics engineer was the missing link on the modern data team and the entire industry is waking up to that as of today there are over a half million analytics engineers in the world up from up from basically none in 2018 here's how we see the most high functioning data teams working today data engineers are responsible for building and maintaining scalable systems and they also are responsible for landing data in the lake house analytics engineers are responsible for taking that source data and transforming it into usable data sets data analysts are responsible for insight generation but really this is an oversimplification anyone who works in data knows that these lines are never this clear these roles overlap with each other all the time the real secret to high 
functioning data team is making sure the team members with all kinds of skills and backgrounds can collaborate together which is why dbt was designed from the ground up to feel comfortable for the data analyst while scaling to meet the needs of data engineers it's the one tool that all data practitioners can collaborate in to build their knowledge graphs before wrapping there's one more thought that i want to explore with you the world that i've been talking about so far is a world of counts and sums and standard deviations it's descriptive it asks what already happened but that's only half of what this conference is about one of the biggest problems alluded to this in data today i think is the lack of collaboration between my world analytics and the world of ai and ml i'm excited to be here with you because i think these worlds are converging ali kind of stole my thunder i think that and you know honestly i think databricks is doing a ton of great work to make this happen with the launch of delta whether you're looking for the acid compliance of a warehouse or the unstructured qualities of a lake you've got the best of both worlds with the launch of databrick sql you now have high performance managed way for your data analysts and your analytics engineers to access all the data in your like house so databricks thank you for inviting me and data analysts like me to the cool kids party at dbt labs we are also pushing to unify the worlds of analytics and ml this year we're making huge investments in supporting python-based data science workloads inside of dbt thank you i'm i'm excited too um i'm not i'm not quite ready to show a demo but we are innovating in the open and we'd love your input ultimately analytics and machine learning are just two methods of problem solving we believe that all data practitioners should be able to use consistent tooling throughout this entire spectrum we're excited to partner with the entire databricks ecosystem to bring this to reality thank you so much [Music] thank you so much tristan that's super cool so i presume we're gonna see more and more ml python and other things inside dbt and that's super super exciting okay so i wanna now welcome to stage a man who's been doing data warehousing essentially his whole life he's essentially our cto of our data warehousing solution so welcome sean to be on the stage [Applause] hi everyone it's really great to be back here in person i'm sean i'm a software engineer working on databrick sql and later on i'll be joined by miranda luna who's a staff product manager working on databrick sql so what has happened in the three years since databricks announced the lake house petabytes of locked in data all alone in the data warehouse were set free to be shared and interacted with in the lake house millions of business questions hopelessly unanswered in the data warehouse were brought to the lake house where machine learning models found the answers so it should be no surprise to anyone here that the lake house is great for open access to data and sharing and of course integrating predictive analytics but as ali pointed out earlier lake houses are great at data warehousing workloads as well so how is it that we got the value of the lake house to surpass that of the data warehouse databrick sql this is our sql warehouse offering on top of the lake house where we've been focusing on connecting to your data stack bringing together the best of data lakes and data warehouses and obsessing over value with performance so let's talk a 
little bit of how we connect your data stack with databrick sql we wanted a first class sql experience that you can use for ingestion transformation and consuming your data with any tool out there for data ingestion we work great with 5tran to bring in data wherever it lives this can be google marketo salesforce data or just file sitting in object storage you heard tristan just talk about analytics engineering and we've been working really hard to make sure that dbt integrates amazingly well with databrick sql so you can run your dbt projects and pipelines and of course data consumption because what good is data that you're just loading and transforming if there's no way to get it into the user's hands so we've gone and worked with all of these bi vendors to make sure they're all lake house certified but data is critical and sometimes just looking at a chart isn't enough you need to integrate the data into your business and applications so today i'm very excited to announce connect from anywhere this is our initiative to make it simpler to connect and bring the data and the lake house into your custom applications this will be a sql rest api in preview open source go python and node.js clients so that you can connect this data with the applications or languages that you love the most let's talk about how we brought together data warehousing and data lakes so databricks had always been wonderful at letting you put python and sql together for your machine learning models but from sql directly the times where you want the flexibility of python and so with python user-defined functions we're now in preview bringing that functionality to all the sql users so they can take the power of python and bring it into their sql environments you can sign up for the preview at this link uh query federation right the lake house is home to all of your data sources no matter where they live and now that's true more than ever with the introduction of query federation and databrick sql you can connect data sources directly join them with other data sources join them with the data in the lake house all of this is automatic and transparent and our optimizer will make sure it's efficient execution materialized views these are essential for accelerating common query patterns and these aren't your grandmother's materialized views mind you they're powered by delta live tables using the underlying streaming architecture that karthik explained earlier to make sure your data is always fresh and up to date data modeling with constraints now you can have informational primary and foreign key constraints to define the relationships between your data you'll have identity columns that are automatically generated for you and very importantly enforced check constraints so that you don't ever have to worry about data quality or data correctness issues again when working with your data so let's talk about performance um the journey to photon you see all of this that we're talking about is in the cloud and in the cloud time is money the faster something runs the more optimized it is that's savings and with photon it's our next generation query execution engine it's mpp execution written in native code leveraging vector cpu primitives which is just a fancy way of saying it's really fast in the chart here you can see with every release of the dvr runtime since 1.0 we got faster and faster in the early days but eventually it just became hard to get faster and we knew that we needed something new to get to that next level of 
performance and that was photon since we've introduced it it's processed exabytes of data and billions of queries our customers have seen three to eight x interactive workload performance improvements they've run etl jobs at one-fifth the cost and on average a total cost of ownership savings of 30 percent and in some cases up to 80 percent and we talked about sigmod earlier in the conference where we got the systems award for spark we also shared all of the details and inner workings of photon in a paper a couple weeks ago and were really humbled to receive the best industry paper award at the conference yeah it's great so our customers love photon and it's winning awards and i think it's about time very excited to announce that photon is now generally available on the entire databricks platform with spark compatible apis to accelerate all of your workloads [Applause] all right but what else in performance this chart on the left here shows tpcds at 100 terabytes tpcds is an industry standard benchmark for comparing the value of data warehouses you can see that databricks sql compared to all of the other cloud data warehouses out there has up to a 12x performance improvement this is huge you can read more details about it on the link below but i really want to focus on the chart on the right because this is something where we've shown that databricks sql can be amazing for large workloads large data sets but this is a variation of the tpcds benchmark an official variation where you can run 32 concurrent streams of users constantly hammering queries against the system and what you do is you measure the throughput of that and you can see over the last two years we didn't start great but it's gotten 3x better meeting if not now exceeding all of the other cloud data warehouses and so let's focus here for a second because what is it about queries on small data sets that's so important well query compilation when you're running a query one of the things you first have to do is compile a plan of how to run it so let's look at this simple query over here select a b from t where c between 5 and 23 limit 1000.
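As a concrete companion to the compile-versus-execute walkthrough that follows, here is a minimal sketch using the same toy query; the table `t` and its columns are invented, and EXPLAIN is used to surface the plan that analysis and optimization produce (on a Photon-enabled warehouse the formatted plan typically also indicates which operators run in the native engine).

```python
# Illustrative only: toy table and columns are made up.
spark.sql("CREATE TABLE IF NOT EXISTS t (a INT, b STRING, c INT) USING DELTA")

# Compilation only: resolve columns and types (analysis) and pick a plan
# (optimization), without touching any data.
spark.sql(
    "EXPLAIN FORMATTED SELECT a, b FROM t WHERE c BETWEEN 5 AND 23 LIMIT 1000"
).show(truncate=False)

# Execution: actually scan the files and return up to 1000 rows.
rows = spark.sql("SELECT a, b FROM t WHERE c BETWEEN 5 AND 23 LIMIT 1000").collect()
```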
before we run this query we sort of need to figure out hey what's the best way to run it and in this case we ran this query on a small data set and this is kind of what the timeline looks like you can see the compilation phase has something pretty involved in it called analysis this is where we check all the column names references and types make sure that it's correct and there's valid metadata there's the optimization phases where we check the transactional semantics make sure everything's correct figure out an optimal way of distributing the work across the nodes but it's very unintuitive sometimes just generating this plan took longer than actually executing and processing the bytes so here you can see it was about 1.8 seconds to compile it and just 300 milliseconds to execute the query so we've worked a lot on this problem and very soon we're going to be releasing the next version where we've optimized delta reduced some of the redundant operations in this pipeline and even some of those operations now run directly in photon leveraging the power there and so we got this query to run 2x faster but that wasn't enough we didn't want to stop there we've got another feature working on called metadata pipelining where we get that box about file listing because it's a lake house an open lake house after all so data can be coming in from any data source at any time and we always need to check to make sure we're working on valid sets of data instead of waiting for the entire file list to get generated which can take a long time with things like parquet with partition tables lots of files what we're doing is we're overlapping the execution and compute while we're streaming the file results out and then in this experimental system with this query we further reduced the runtime by another 300 milliseconds so it's very exciting and that's not all there are tons more performance features we're working on we have a long history in obsessing over sort in 2014 there's this thing called the greysort challenge which is measuring which system can sort 100 terabytes the fastest and we won in that year and since then you know we've been like okay we got to come out with a better version of sort and we're doing that soon but everyone's like don't you just need sort for order buy like what's the big deal here and while that's true in a data warehouse when it comes to the lake house machine learning and streaming workloads rely on sort for a core primitive so when we went to make it go faster we also had to make sure we could handle different size keys variety of data unstructured systems we've put a lot of work in here and really excited that this will be in preview soon to accelerate all sort workloads regardless of the use case we're adding accelerated window functions to make these operations run faster and when it comes to reading and scanning data we've added vectorized kernels to process all the open formats quickly and adaptively deal with all the cloud systems but i just bombarded you guys with features the lake house is great it's time to show you why it's so great and i'd like to welcome miranda to do a quick demonstration hello hello i am very excited to be here today to show you how a the lake house is your best data warehouse and also we can just double check and make sure sean's not lying to us about any of these new announcements so today we're going to do three things we're going to start with some nyc taxi data it's just data that has some information about trips how long they start how long 
they were what the fair amount was starting ending zip codes and we're going to start by doing some classic data warehousing we're going to look at our favorite bi tools we're going to make sure that we can explore and visualize the data but then we're going to transition to some of the new and exciting things that sean just announced we're going to look at query federation we're going to look at some oltp transactions we're going to show how we can actually leverage those in the same query as uh one against our historical data in the lake house and then last but certainly not least we're going to take a look at using our colleagues in data science gradient boosting model right alongside my my query of historical data via a python udf so let's go ahead and get started awesome so right here what you see we're starting with a stopped serverless sql warehouse this serverless sql warehouse i went ahead and pre-configured it to have scaling up to five different clusters that'll come up a little bit later and what we're going to start with here is actually taking a look at this data using this serverless warehouse in tableau so let me go ahead and switch over okay perfect i will enter my password and as this is logging in let me just hit sign in um what's actually happening under the hood is this is going to go ahead and load up a dashboard that i already had here and what that's going to let me do is actually start that serverless sql warehouse under the hood it should be up and running in seconds yep there we go and now we've got our data i'm going to go ahead and do a quick filter just to make sure that we can actually query it live our data in the lake house yep that went ahead and adjusted perfect um and again i'm using tableau right now but let's just take a quick peek because certainly you can use any of your favorite bi tools whether that is i mean we can see it started whether that is power bi whether that is if i just swap over looker uh we get to benefit from a couple different color themes here uh certainly also we have a connector for thoughtspot have a number of customers using that and sigma now of course me doing you know one user one dashboard is exciting but not realistic so why don't we actually try to scale that up and see if we can support hundreds of concurrent users i'm going to go ahead and switch over to this notebook where i'm going to go ahead and pop in 200 concurrent users what this is going to do is essentially the query that we had in that dashboard before and that's going to essentially change all the parameters for those 200 users i entered so that we have unique queries nothing's cached nothing is speeding up and we're going to see if we can get those clusters to pop up to what we looked at the max 5 that we had set on our auto scale so yup nothing too fancy just a little scale test so let me switch back over and we'll go to the monitoring page we kick that off we can see we're still at kind of that one cluster normal size great let me kind of flip this timing all right great now we're up to five and we can see that we're running many more queries concurrently so from here we can see that we've already scaled to support 200 concurrent users whether those are in tableau power bi etc but why don't we actually take a look at some of this data directly into lake house databrick sql provides a built-in sql workbench where we can actually query this data and visualize it directly without having to go to a bi tool just so we can do some quick spot checks so here i'm 
going to take a look at some actual fares all trips you know under 10 miles from a couple different originating zip codes let's do a quick spot check and make that scatter plot again so we can make sure that the data looks exactly how we expect so we'll flip to scatter we will quickly go ahead and pick distance and we'll also do our fair amount and then just so we get some different colors let's go ahead and break that out by day of week should look pretty similar yep check so this is great but you know i promised i would show you a little bit more about what sean kind of just announced so actually why don't we start by kind of looking at some recent transaction data that's stored in postgres in a different system so i'm going to switch over to this other query so in this query i've actually already set up the an external table that's going to allow me to query a postgres postgres in an external system this is just going to be kind of recent oltp transaction data but what this is going to allow me to do in a second and that's just my secrets it's going to actually allow me to use the data stored in the like house in the same query as data external so let's just do a quick gut check and make sure we have data in that table that external postgres table and then the next thing we're going to do in that second statement you see here after this finishes is we're going to actually filter the data in the lake house based on what we see in that postgres table so here i'm going to say you know only show me the trips where that meet under the minimum distance from what i've seen recently and just like that we're going to have now updated our view of data in the lake house based on what we see in our most recent transactions in postgres pretty exciting all right and now again i want to switch over we're a data and ai company right so we have to do some ai this is a gradient boosting model that i'm just going to scroll through real quick one of my colleagues on the data science team built but i'm going to go ahead and invoke that in the sql query directly via python udf so you can see here we're just kind of building a couple data frames and if i switch back over to the sql editor you know this is something that i as an analyst would have not been able to do before is to leverage that gradient boosting model so this first statement that i'm highlighting just shows the actual fares same thing we just looked at trips under 10 10 miles but this down here you're going to see we're actually using it to predict fares for trips we don't have historical data for so trips 10 to 12 miles i'm invoking that python udf all it needs is the pickup date time is a float and the trip distance for me and i'm going to go ahead and run that i've also appended a column so we can quickly tell what's actual and predicted and do a gut check that everything came out as we expected so let's just take a quick peek it looks like we have actual and predicted so we have now been able to generate forecasted fares directly in the sql workbench based on a python udf my colleague wrote and over here i just kind of updated our tried and true scatter plot and you'll see i'm highlighting the predicted fares in red and those fares are again generated by virtue of using that python udf and the gradient boosting model and then here i have my classic view of the the scatter plot we did to start so seamlessly i was just able to leverage not just my historical data but also use that to invoke a model that has been written by one of my 
colleagues and i can forecast fares for trips that i haven't had historical data for just yet so hopefully that was a pretty exciting way to see that sean is actually telling us the truth we like to do a small gut check on mr shant but yes we are so so excited about all the announcements today and all the progress in the past year and thank you for coming along with us on this journey there are even more exciting things to come so with that i will let sean take a little bit more thunder thank you all thanks miranda [Applause] oh so if you guys noticed miranda was able to quickly start up that endpoint run her queries and most importantly when those 200 simulated users came in that cluster auto scaled nearly instantly and that was with databrick sql serverless and so with serverless your users get instant auto scaling capability your admins don't need to worry about management pools reserved instances anymore they get simple and predictable pricing and of course for budget it's a reduced total cost of ownership so very excited to announce today the preview of databricks sql serverless on aws this means every single databrick sql user out there on aws can log in right now and enable serverless and benefit from the amazing features we showed here and don't worry everyone on azure or google it's coming very soon to you as well so this is the lake house where the best data warehouse is always a lake house you can visit this website to get started thank you very much everybody [Applause] [Music] all right that's awesome so you saw the benchmark results earlier that i shared sean showed you how that's done okay so under the hood all the optimizations all the things that they did to make that happen so that's super cool and we saw that miranda also verified that everything he said is true i always do that with him as well so cool so we're running a little bit late but we have one last speaker so please stay tuned so i'm super excited to welcome on stage kirby johnson from amgen amgen is a really exciting partner of ours i actually remember one qbr that we had where kirby was there and we were talking and they said that for them the thing that was so exciting was that as an it organization they were able to actually start impacting the business in a really strategic way and that that was actually helping you know save lives and actually make it much more interesting than you know what things had been like you know say 10-20 years ago so i'm super excited to all come into stage kirby welcome thank you thank you i'm excited to be here and i know um right before your break so i will be hopefully short and sweet but interesting and so let me tell you a little bit about amgen and our journey towards the lake house so at amgen our mission is to treat grievous illness in technology terms you can view us as a full stack tech company we do discovery of new drugs we take them to development we manufacture and we market it and the focus is on grievous illness and we are one of the oldest individual independent biotech companies were global scale 25 000 people 25 billion in revenue we have six therapeutic areas we focus on so cancer research which is oncology bone health inflammation etc and we have 25 products that are very focused on that in the grievous illness and we have millions of patients worldwide so what does that mean for us data's at the core of how we discover develop and deliver life-changing treatment and data is very is incredibly variant so in discovery you're working with molecular data or 
All right, that's awesome. So you saw the benchmark results I shared earlier, and Sean showed you how that's done: under the hood, all the optimizations, all the things they did to make that happen. So that's super cool, and we saw that Miranda also verified that everything he said is true; I always do that with him as well. Cool. So we're running a little bit late, but we have one last speaker, so please stay tuned. I'm super excited to welcome on stage Kirby Johnson from Amgen. Amgen is a really exciting partner of ours. I actually remember one QBR where Kirby was there, and they said that for them the thing that was so exciting was that, as an IT organization, they were able to start impacting the business in a really strategic way, and that was actually helping save lives and making things much more interesting than what they had been like, say, 10 or 20 years ago. So I'm super excited to welcome Kirby to the stage. Welcome.

Thank you, thank you. I'm excited to be here, and I know I'm right before your break, so I will hopefully be short and sweet, but interesting. Let me tell you a little bit about Amgen and our journey toward the lakehouse. At Amgen our mission is to treat grievous illness. In technology terms, you can view us as a full-stack tech company: we do discovery of new drugs, we take them to development, we manufacture, and we market them, and the focus is on grievous illness. We are one of the oldest independent biotech companies, at global scale: 25,000 people and 25 billion dollars in revenue. We focus on six therapeutic areas, so cancer research, which is oncology, bone health, inflammation, and so on, and we have 25 products that are very focused on grievous illness, with millions of patients worldwide.

So what does that mean for us? Data is at the core of how we discover, develop, and deliver life-changing treatment, and the data is incredibly variant: in discovery you're working with molecular or genomic data; in clinical trials it's about real-world evidence of how a drug is behaving in the real world; and when you move into manufacturing it's sensor data and IoT, and so on. You need to be able to combine across these variant data sources in order to actually answer questions. And what are the questions that really matter to us? A 2020 study said that development of a drug still takes 10 years and a billion dollars to go from discovery to market. We're trying to make that shorter, faster, and cheaper, just like every other drug company. But how do you do that? You do it with technology, you do it with data. We're starting to answer these critical questions, but we need to be able to answer the most important ones, which are not functionally oriented but cross-functional, so that they serve the different users, the different personas, and the different types of data. It doesn't do any good to develop a new drug if you can't actually manufacture it, so you need to be able to combine across the two, and that is where we're starting to use the Databricks lakehouse.

We've been a Databricks customer for about four years. We have 2,500 monthly active users and an extensive number of data science projects, ranging from a short project that takes a few weeks to answer a very specific question, to a team of five to ten data scientists who might work on a project for six months. What we're really trying to do is self-service enablement, with guardrails and with control, and that's where we're leveraging Databricks.

Our journey has probably been very similar to many customers'. We started with a data warehouse, then went to a Hadoop-based ecosystem, first on-prem and then on AWS. That gave us a lot of value, but then we started to hit scenarios where you need to spin up 400 machines to do weekly processing, and we couldn't do that with a Hadoop-based ecosystem, so we moved to Databricks for the elasticity. In the last few years we've been focused not just on the data lake but on harmonizing data, on enabling self-service, and on connecting data with graph technologies, tying together both the relational version and the graph version in our environment.

Like any good architecture talk, you need an architecture slide, and we've focused on an open, modular architecture: we can leverage what Databricks provides, and we can also switch out technologies as they become available. I know everyone has a different version of these; this is ours. But I think the most important question is: what are you optimizing for? What do you care about? Standardization? Best of breed? Flexibility? For us, we found we cared about the following. We cared about having a data center of gravity: once you have enough data in the same platform, more and more people volunteer to bring in their data and add to it, because what they need is already there, and internal software products keep building on top of your data lake architecture, because what they want is already there. So each new product that develops on it has a synergistic effect. That led to simpler handoffs, because the real challenge is when you switch across tools or across personas and do that handoff; that's where you end up with confusion, complexity, and friction of experience.
So then we doubled down on skill-set specialization: we wanted to be experts in one tool rather than proficient in five. We're definitely not a hundred percent lakehouse yet; there's definitely a ways to go, but we're moving in that direction. So where is it that we want to get to, what would lead to more of a hundred percent lakehouse? I think simplification and performance. We just heard the SQL talk, and I'm excited to learn more about it, because that's the area that has been causing us to replicate data off the Databricks environment: sub-two-second queries where you need an interactive experience. So I'm excited to try that out. Then usability and governance: lineage, discoverability, fine-grained access control. If you unblind a clinical trial, you're basically in a lot of trouble, so more and more governance, more and more controls. And in the long run, I think we're interested in how we unlock other personas, not just the Databricks personas: users who can't write SQL, users who need to explore data, developers who aren't really developers, like the analytics engineer we heard about from dbt, things like that.

So if this sounds interesting to you, we're hiring; we have 200-plus remote positions, and we're in the business of trying to save lives. There are also two other talks on Amgen that go much more in depth than I was able to here: one on our commercial business and one that's a deep dive into our data platform and the other platforms around it. And with that I'd like to say thank you, appreciate the time, and enjoy the conference.
Info
Channel: Databricks
Views: 48,170
Keywords: Databricks
Id: BqB7YQ1-KKc
Length: 133min 58sec (8038 seconds)
Published: Tue Jul 19 2022