Announcing Delta Sharing with Demo | Matei Zaharia | Keynote Data + AI Summit NA 2021

Captions
Open isn't just about open source; it's about access, it's about sharing. As an industry, we often talk about how data is the lifeblood of every successful organization. Increasingly, the road to success is about how data flows between organizations: data needs to be gathered, organized, exchanged, and enriched to take on bigger and more complex opportunities. But the systems in place today haven't allowed this to happen.

Historically, getting access to data has been very difficult. Data sharing solutions have been tied to a single vendor or a single commercial product, introducing vendor lock-in risks, and this approach has been the same since the '80s. Data vendors say: store all your data, ETL it into ISV-1, and share it with anyone. This is great as long as everyone buys and adopts ISV-1. But over time, ISV-2 came along with better speed, then ISV-3 and ISV-4 with better prices and easier use. So ISV-1 isn't the only vendor anymore; the data now lives in many different systems, and it's siloed. Even from the vendors' perspective it's a nightmare: what vendors want is for the data to be easy to consume, but now it's hard to consume and a nightmare to manage ten connections securely.

The world has also changed. Back in the day, the only way you could talk to these systems was JDBC, ODBC, or SQL. Today there is a plethora of data sources, video, audio, data files, yet SQL is still the only language to talk to this data, a language from the pre-big-data era. It's not built for this kind of unstructured or semi-structured data, where Python has really become the lingua franca. As a result, the community is stuck. What we need is a way to connect all the data stores and all data types, with the language flexibility to let data practitioners use the tools they prefer. An open future demands an open approach to data sharing.

This is why I'm really excited to announce Delta Sharing, the industry's first open protocol for secure data sharing. It's fully open, with no proprietary lock-in. It isn't restricted to SQL: it supports SQL, and it supports data science. And it's easily managed for privacy, security, and compliance. Delta Sharing will be part of the Delta Lake project under the Linux Foundation, so it can tap into the wide support the community already has for Delta. In fact, we've already seen tremendous support for the project: providers including AWS Data Exchange, Nasdaq, FactSet, Standard & Poor's, and others have announced that they will make over a thousand data sets accessible through Delta Sharing. Additionally, Microsoft, Google, Tableau, Starburst, and many others have announced that they will integrate support into their products. Now, to tell us more about Delta Sharing, please welcome Matei Zaharia, co-founder of Databricks and creator of Apache Spark.

Okay, I'd like to share these Legos with you. Oh, thanks Ali, I love Legos! So I'm really excited to talk today about Delta Sharing, the open protocol we're launching for secure data sharing that will make it as easy to share data securely as Ali and I just shared these Legos.

When we designed Delta Sharing, we had four big goals in mind. First, we wanted to make it easy to share existing live data that a company already has in its data lake or lakehouse, without the need to copy it out into another system. The reason for this is simple: most data in enterprises today lands and is processed in a data lake, so that's where you already have your freshest, newest data. We wanted to make it easy to share exactly that with downstream customers, and not have to move it into something else and introduce delays and potential errors and gaps in the data. That was the first goal.
Second, we wanted to make this data easy to consume in a wide range of clients, and we do that by leveraging existing open data formats such as Apache Parquet. The goal is that when you share data with another organization, the users in that organization should ideally be able to connect their analytics tools directly to the Delta Share without requiring any help from IT, for example to set up a new data warehouse and get permissions into it. We really wanted to reduce the friction of consumption when you just have a user you want to share data with, and we're doing this through the open source and commercial partners for the project, who are going to make that super easy.

Third, we wanted strong security, auditing, and governance controls built in, because we know it's very important to have tight control over how you're sharing data, so we designed the protocol to allow that. And finally, we wanted Delta Sharing to scale efficiently to sharing massive data sets. We see a lot of use cases where organizations are sharing terabytes of data with their partners, because it's fine-grained industrial data or financial data or just a large data set in general, and we wanted to make sure this is super efficient. This is one of the places where a lot of the data providers we talked with had challenges scaling legacy solutions to work with these large data sets.

So how does Delta Sharing work? There are two parties involved: a data provider and a data recipient. The data provider can start with an existing table that they have in their storage in the Delta Lake format, and if you're not using Delta Lake and just have Apache Parquet, it's also very easy to create a Delta table that points to your existing Parquet data, so you don't have to change the data itself. A hedged sketch of that conversion follows.
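As a minimal sketch of that step (not shown in the talk), assuming PySpark with the delta-spark package installed and a hypothetical S3 path:

```python
# Laying a Delta table over existing Parquet data in place, assuming PySpark
# with the delta-spark package installed; the S3 path is hypothetical.
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("convert-to-delta")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Writes Delta transaction-log metadata over the existing Parquet files;
# the data files themselves are not copied or rewritten.
DeltaTable.convertToDelta(spark, "parquet.`s3://my-bucket/sales`")
```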
In front of that table, the provider runs a Delta Sharing server. This is what implements the provider side of the protocol: in the Delta Sharing server you decide which recipients have access to which subsets of the data, and you set up the access permissions that it enforces. On the data recipient side, they can run any client that implements the Delta Sharing protocol. It could be Apache Spark, pandas, Tableau, any of a wide range of systems supporting this open protocol, and the provider doesn't need to know what they're running; the recipient just connects to the server and gets access to the subset of the data they're allowed to see.

So what actually happens under the hood in the protocol? When the data recipient wants to access a table, they start by sending a request to the Delta Sharing server. For example, imagine the provider is a retailer that wants to share real-time information about items being sold in its stores with its suppliers. It might share a table called "sales" with each supplier that shows them what's being sold, so they can do capacity planning and produce exactly the right items at the right time. So the recipient asks to query that table. The first thing the server does is check the access permissions and see whether this access is allowed; let's imagine it is allowed under the permissions you set. Next, the server looks at the contents of the underlying table, which are objects or files in a cloud storage system like Amazon S3. A key thing the Delta Sharing protocol supports is letting the recipient ask for just a subset of the table, for example "I only care about sales of this one line of products," so the server can smartly filter down which files in the cloud storage system (S3 in this case, but it could be your favorite cloud) are relevant to this specific query.

The final step is to transfer the data back to the recipient, and this is where we do something pretty interesting: we leverage the capabilities of the cloud object store to make this transfer fast, parallelizable, and low cost. The server generates short-lived URLs, such as pre-signed URLs on Amazon S3, that allow the client to request just those specific files it's allowed to get for this query, and sends them back to the client. These are just HTTPS URLs, so the client can open them, start reading, and transfer the relevant objects directly from the cloud object store. This is really nice, because it means that if you're sharing a massive data set, you don't need to stream it through a server and keep huge infrastructure sitting around that can do this serving reliably. You can use some of the best, lowest-cost, most reliable infrastructure on the planet, the cloud object stores, and just defer to them to transfer the actual data. It's heavily optimized, and it's likely the cheapest, most reliable way to get this data to many of these recipients. The sketch below shows this exchange at the wire level.
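To make the flow concrete, here is a minimal sketch of one query from the recipient's side against the open Delta Sharing REST protocol. The endpoint, token, share, and predicate are all hypothetical, and a real client (like the ones shown later) also handles the protocol and metadata lines, table versions, and error cases.

```python
# A minimal sketch of one query at the wire level; endpoint, token, and
# table names are hypothetical. Requires: requests, pandas, pyarrow.
import io
import json

import pandas as pd
import requests

ENDPOINT = "https://sharing.example.com/delta-sharing"   # hypothetical server
HEADERS = {"Authorization": "Bearer <recipient-token>"}  # from the credential file

# Ask the server for the data files behind one table, hinting at the subset
# we care about so it can skip irrelevant files.
resp = requests.post(
    f"{ENDPOINT}/shares/retail/schemas/default/tables/sales/query",
    headers=HEADERS,
    json={"predicateHints": ["product_line = 'bicycles'"]},
)
resp.raise_for_status()

# The body is newline-delimited JSON: a protocol line, a metadata line, then
# one "file" line per data file, each carrying a short-lived pre-signed URL.
frames = []
for line in resp.text.splitlines():
    if not line:
        continue
    action = json.loads(line)
    if "file" in action:
        # Fetch the Parquet object straight from cloud storage; the sharing
        # server never streams these bytes itself.
        blob = requests.get(action["file"]["url"]).content
        frames.append(pd.read_parquet(io.BytesIO(blob)))

sales = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
print(len(sales))
```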
So that's how the protocol works. This design has a number of really nice benefits for both the provider and the consumers. For the provider, first of all, you can easily share data you already have, and you can also restrict it further. For example, Delta Lake versions the files that make up each version of the table, so you can easily share just one specific version, say today's snapshot, and while you keep changing the table in real time, the recipients won't see the latest changes unless you want them to. In fact, you can even share just one partition of the table, to give access to just one subset of it, or share a view where you filter the table, and we compute those filtered data files in real time. So it's very flexible in terms of what you share. At the same time, as a provider, you can share a table live if you want to share the newest data: you can reliably update the table anytime with ACID transactions, because Delta Lake supports those, and clients will always see a consistent view of the table. So that kind of real-time, consistent sharing use case works as well.

Now for the clients, one really powerful property of this protocol is that it's very easy to implement a Delta Sharing client if you have a system that already knows how to read Parquet, because all you have to do is make one call, get that list of URLs, and then read them. This is why we already see so many open source projects and commercial systems that support reading from Delta Shares: all of them support reading Parquet as an open format, and we just leverage that and let you share subsets of a table in real time. So it's super easy for clients to implement; most data analytics systems out there already support Parquet, so we anticipate it will be very easy to support this. And finally, for both parties, the transfer is fast, cheap, reliable, and parallelizable, because it's using these massive cloud object stores. Even if you're sharing a terabyte of data and someone wants to read all of it, the transfer happens through something like Amazon S3 or GCS, the client reads all those URLs in parallel, and you get a very high transfer rate. So the scale of your table should be no limit if you want to establish sharing across organizations.

Okay, so that's a bit about the protocol. As Ali said, we're really excited about the ecosystem that has already started to build around this. We're releasing connectors with a whole bunch of open source projects, and the community is already starting to write some as well; for example, Scribd has written a connector for the R interface to Delta Lake to support Delta Sharing, and we're working with a number of other companies too. We're also seeing many of the leading commercial vendors in data management and analytics support this: a lot of the major business intelligence vendors, analytics engines, and quite a few governance products that let you centrally manage what data you're sharing, anonymize it, and so on. All of them are supporting Delta Shares, so if you share a table this way, your users can use any of these systems to connect, which is super powerful. And finally, we're really excited about the traction we're seeing from data providers, where many of the leading data providers are backing the project and will make it possible to consume their data really easily, in this very wide range of data analysis tools, through this open protocol. We really think the future of data sharing is open, and we think Delta Sharing is going to be a key part of that.

Of course, we're also implementing Delta Sharing in Databricks. If you're a Databricks customer, you'll get a secure Delta Sharing server integrated as part of our service, which will let you set permissions on who can access data, manage recipients, and get fine-grained audit logs about who is sharing data and who's consuming it, which you can use, for example, for compliance and for billing. And as the interface for this, we've made things super simple: you'll be able to create and manage shares in SQL using the new CREATE SHARE command we're adding, and we also have REST APIs that let you do this programmatically if you want an application that manages all the shares. The syntax an administrator would use in SQL is straightforward: if you have an existing Delta table, you can just run a few commands in Databricks. You create a share object, which is a collection of data that you want to share, add tables into it, and then grant permissions to recipients on individual shares using a standard GRANT statement (sketched below). Databricks will run a sharing server in front of the table, as part of our platform, that handles the secure sharing and auditing.
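As a sketch of that administrator flow, run from a Databricks notebook where `spark` is predefined; the share and recipient names are hypothetical, and the statements follow the talk's description of the feature rather than final product documentation:

```python
# A sketch of the share-management flow described above, assuming a Databricks
# notebook where `spark` is predefined; all names are hypothetical.
spark.sql("CREATE SHARE sales_data")                 # a named collection of tables
spark.sql("ALTER SHARE sales_data ADD TABLE sales")  # put an existing Delta table in it
spark.sql("CREATE RECIPIENT acme_supplier")          # returns an activation link
spark.sql("GRANT SELECT ON SHARE sales_data TO RECIPIENT acme_supplier")
```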
Clients can then connect to it from any system they want; they don't have to be Databricks customers or users. You can connect, for example, Tableau directly to this and start consuming the data without any help from an IT team or other data infrastructure to get it into the hands of that user.

All right, so that's enough about Delta Sharing in terms of talking; I also want to show you Delta Sharing in action, so you can see how easy it is to consume in a wide range of analytics tools. What I'll do in this demo is configure some data to be shared inside Databricks, and then show how to consume it from a wide range of other clients, even from organizations that are not using Databricks and just want to connect to what I'm sharing. In particular, I'm going to pretend that I'm working at some kind of health organization, and I want to share real-time data about COVID vaccinations I've collected with a whole bunch of other organizations: governments, hospitals, and so on. I've loaded a data set into my Databricks workspace that is real-world vaccination data from Our World in Data, and I'll show what it looks like so you get a sense of it, and then show how to share it. So this is the data set I have: it's a table that records, for each country at different points in time, how many COVID cases they had, how many vaccines they gave, and so on, and you can use it to plot a lot of interesting things about what's happening. It's just sitting in my data lake as a Delta table, so what do I have to do to share it?

The first thing is to create a share object. This represents a collection of tables that I want to share with someone and give a name to. I can run CREATE SHARE vaccine_data, and it just creates it. Then I can add this table into the share, and if I want, I can put multiple tables into a share and share them together; maybe I have another table here about the distributors of the vaccines, so I put that in too. Now I can DESCRIBE the share and see that I've got the correct tables inside it, and it looks like the stuff I wanted to put in.

The next step is to give credentials to the recipients I want to share the data with. I'll just show adding one recipient; let's say it's the CDC in the US. I can just run CREATE RECIPIENT, and what you'll see is that it gives me an activation link that I can email to my contact at the CDC, or otherwise transfer to them, which lets them download credentials to access this share. Let's see what that looks like for them. Going to the link, you'll see they have a big button to download a credential file, plus information on how to access this share directly in their favorite data analytics tools; they don't necessarily have to use Databricks, they can directly connect without setting up any other kind of infrastructure. When you download it, you just get a .share file that you can pass into the different tools; a sketch of what it contains follows. The final thing I have to do to actually share the data with this person is grant permission to this recipient on the share I created, which I can do with a GRANT statement, and now they have permission to access it.
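As a rough sketch, the downloaded credential ("profile") file is a small JSON document that tells clients where the sharing server is and how to authenticate; the field values below are placeholders, not real credentials.

```python
# What a .share credential ("profile") file roughly contains; the endpoint
# and token here are placeholders.
import json

profile = {
    "shareCredentialsVersion": 1,
    "endpoint": "https://sharing.example.com/delta-sharing/",
    "bearerToken": "<token-from-activation-link>",
}

# Clients are pointed at the path of this file to locate and authenticate
# against the sharing server.
with open("vaccine_data.share", "w") as f:
    json.dump(profile, f, indent=2)
```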
So what does it look like now for someone working at the CDC to access the share? It turns out they can access it from a wide variety of tools they might use, and it's really easy in each of them, so I'm going to demo a bunch of the supported ones.

The first one I'll start with is Amazon Elastic MapReduce (EMR). EMR is obviously a hosted system that can run Apache Spark in the cloud, but it doesn't have built-in support for Delta Sharing right now. Nonetheless, that isn't really a problem, because the Delta Sharing client is open source, so I can easily install it: it's just a data source library that you install into your Apache Spark cluster, and then you can start connecting to this Delta Share, even though this user is on EMR and I'm sharing the data from Databricks. I have a notebook here that shows how to do that; let's see what a user has to do in EMR to access this data. The first thing the user does is import delta_sharing, the Python client library we've built, and start accessing the share based on the credential file we gave them. Let's start by listing what has been shared with us: I build a SharingClient with the credential file I gave this user, call list all tables, and I can see the two tables I put in there, vaccinations and distributors. The other thing I can do is load a table as a Spark DataFrame. To do that, I pass in the credential file and say which table inside it I want: it's in the share called vaccine_data, so vaccine_data.vaccinations. That creates the Spark DataFrame, and now I can look at it, see all the columns, and do analytics on it; for example, I can count how many of these records are in the share. This is accessing the data I had in Databricks, directly through S3, and letting this job process and count it. So it's very easy to connect to it, as if it were any other Spark data source, in your favorite tool.

Okay, that was EMR, but it still requires someone to set up Apache Spark. What if I have an analyst who just wants to work with this on their laptop? Not a problem either: you can connect Delta Sharing to pandas as well. Here's another Jupyter notebook, just running on a single machine, and I'll show how to connect to the share from pandas. Again, I pass in the path to the .share file I have, here it's in my home directory, together with the table I want to load, and now it's given me a pandas DataFrame. If I print it, I see all the same records I was seeing before, with the different countries, all with the correct schema, and I can go ahead and do whatever analysis I want on it in pandas. So super easy to connect; the consolidated sketch below walks through these recipient steps.
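Here is a consolidated sketch of those recipient steps using the open source delta-sharing Python client (pip install delta-sharing); the profile path is hypothetical, and the schema name inside the share is assumed to be `default`.

```python
# Recipient-side access with the open source Python client
# (pip install delta-sharing). The profile path is hypothetical and the
# schema inside the share is assumed to be "default".
import delta_sharing

profile = "/home/analyst/vaccine_data.share"  # downloaded via the activation link

# Discover what has been shared with us.
client = delta_sharing.SharingClient(profile)
for table in client.list_all_tables():
    print(table)  # e.g. vaccinations and distributors

# Table URLs take the form <profile-path>#<share>.<schema>.<table>.
url = profile + "#vaccine_data.default.vaccinations"

# Load the shared table as a pandas DataFrame on a laptop...
df = delta_sharing.load_as_pandas(url)
print(len(df))

# ...or as a Spark DataFrame on a cluster (e.g. EMR) with the connector
# library installed:
# spark_df = delta_sharing.load_as_spark(url)
```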
Finally, I'm going to show how to connect some business intelligence tools directly to this, so that an analyst who doesn't want to write code at all can also work with this share, again without asking IT to set up a separate data warehouse, load the data into it, or do anything else to access the real-time data.

Let's start with accessing the Delta Share in Tableau. I have Tableau here, and I can open the connector for Delta Sharing and select the .share file I downloaded, again the one I got from the activation link. Then I can open this data set inside Tableau, and I see the same tables, vaccinations and distributors. For example, if I want to work with the vaccinations table, I just load it; it's loaded through the open protocol into my instance of Tableau, and then I can take some of the data and begin plotting it using my favorite tool. I don't need any help from IT or anyone else to set up infrastructure; I'm simply working with the latest data published by this health organization. For example, maybe I want to make a map: using this data, I plot the number of people fully vaccinated in different countries, take the maximum they've reached over time, and use that as a color in my plot. So I've made this visualization in a few clicks from data that's being shared in real time.

And finally, let's try the same thing in Power BI. Again, I open Power BI, look for the Delta Sharing connector, and put in the path to my file. Now I can browse the share, see all the tables inside it, pick one, and start working with it in Power BI in just a few clicks. We're loading the vaccinations table here, and then I can take some of its fields and start plotting them: I'll plot the date, new cases, and new vaccinations over time, and show how those trends are changing, specifically in the US. So this is new cases, this is people fully vaccinated, and let's also filter it down to where the location is the US. You can start to see that cases have been going down and vaccinations have been going up quite a bit since the beginning of the year. So it's super easy to take this data, connect my favorite analysis tools to it, and start working with it without any other infrastructure.

Okay, I hope you enjoyed that demo; it shows what's possible with Delta Sharing today. We're actually just getting started with the project, and we have a really exciting roadmap ahead for the open source project. We're working on a number of things. First of all, we envision this protocol allowing sharing of many other kinds of objects as well, not just static tables. For example, Delta tables can already be viewed as a stream, and they already do change data capture, so you can see what was added and removed at each version. We want to allow sharing a table as a stream, so you can have clients that just look at what's changed in a data set if you want that, and that's an easy extension to the protocol. We're also looking to add machine learning models from MLflow, table views, and arbitrary files as objects you can share, so you can really share any subset of your data, and any kind of object you want, with other organizations, and have great support in all the systems they can connect with.
We're also working on a bunch of exciting capabilities on the governance side. For example, we want to let you limit the amount of time something is shared for, so you'll be able to set those controls, and we also want to allow easily sharing something into a restricted, clean-room analytics environment, where the user can only work with it through a specific interface, for example to do data science or SQL, but can't exfiltrate all the data. These are all things we can support on top of the basic protocol.

So you can get started with Delta Sharing today: we open sourced it just today, and it's actually part of the Delta Lake 1.0 release. We've currently released a reference server that you can use to try out the protocol, and clients for pandas, Spark, and Rust, and we're working on a whole bunch of other open source connectors, as well as working with our partners on the commercial connectors that will directly support this in a wide range of products. You can get started with this today at delta.io/sharing; there's a tutorial there, so you can just walk through it and see what it's like to use this. We're really excited to see what use cases you enable using the Delta Sharing protocol.
Info
Channel: Databricks
Views: 6,950
Rating: 4.9712229 out of 5
Keywords: Databricks, Delta Sharing, Data Sharing, Open Source
Id: HQRusxdkwFo
Length: 27min 43sec (1663 seconds)
Published: Wed May 26 2021