Ali Ghodsi and Bill Inmon | Fireside Chat | Keynote Data + AI Summit NA 2021

Captions
Bill, super excited to be here with you. We started talking a while back, and obviously we're big believers in this lakehouse concept. We started working with you around this, and you started writing about it and evolving your thinking; you were cranking out material around this data lakehouse. You've seen this industry over many decades, and you've seen it evolve and go through different transitions. I'd love for you to walk us through the evolution from data warehouses to data lakes to the lakehouse. You've also generously shared with us some images from your upcoming book, so maybe we can look at those and you can walk us through this evolution.

In the beginning were applications. People started building applications, and then they discovered that they had data everywhere. It wasn't a matter of "can I find data?", because data was everywhere; it became a question of "can I find the right data?" That's when people discovered that they needed integrity of data, and that's where the data warehouse came in. The discipline required to build a data warehouse and to vet the data was the impetus for data warehousing: answering the question "what is the right data?" without having to go throughout the organization. After the data warehouse came along, we discovered data marts, and then quite a few other things along the line. Then the world began to discover other forms of data. We began to look at analog data, we began to look at textual data, and we found out that there is a wealth of information that hasn't ever been looked at before. I think it's interesting to note that if all you look at is classical structured, transaction-based data, you're probably only looking at five to ten percent of the data in the corporation. Now, there's nothing wrong with that, because you need that data,
but you're not looking at the full picture of what's going on in the corporation. Every now and then I'll pick up a book on data management and data architecture, and all the person talks about is structured data, and I think, this is not right, because they're only looking at a fraction of what is in the corporation. In order to look at what's in the corporation, you've got to look at structured, textual, and other unstructured data. Other unstructured data would include IoT data, analog data, a wide variety of data. That's the picture of the real data in the corporation, and that's what we are faced with today.

Now, in truth, the world is facing an avalanche of data coming from all over. The challenge is, number one, organizing the data so that you can understand what's there. Number two, merging the data together so that you can find information that goes across more than one of these environments. Number three, organizing it so that you can find anything, because there's so much data out there that organizing the information to where it can be found is a challenge. So if you take the data and just put it into a data lake, that's a good first step, but it's only the first step. After you've created the data lake, you've then got to say, okay, I need another layer of analytical information for it, such as lineage of data. It's interesting to know what a value of data is, but you also need to understand where that data came from, when it came into the system, what's been done to it, and what hasn't been done to it. All of that information is relevant to understanding how to do analytical processing in an effective fashion. So if I were to look at this diagram that's in front of us now and say where is
the industry? The industry is at the point of having built, building, or going to build a data lake, and the next step is discovering that they need an analytical infrastructure in order to make that data lake comprehensible and usable. This is where machine learning would come in, and a lot of other things. That's where I think we are today. And so here we see the analytical infrastructure that needs to be added to the data lake in order to turn it into usable data. This unlocks the data for the end user, so that not only can the data scientists use the data but the end user can use it as well. Once we do that, we've opened up the door to lots of other analytical tools, certainly machine learning, certainly data science and statistical processing, but all kinds of other people can now start to go in and find and use the data.

These are images from your book, and they go from top to bottom. So up at the top is where the data comes in, I believe, and at the bottom are the use cases where you're getting value out of that data.

That's correct.

That's fascinating. Many of us are fans of your book Building the Data Warehouse. We were taught it, and then we've been teaching it at universities, and we're big fans. Are we going to see more books from you?

Yeah. I've done 60 books in my life now. The 61st book is one on the data lakehouse, which I'm working on right now. In many ways the data lakehouse has a lot of, not all of, but a lot of the propositions that were facing the data warehouse 25 or 30 years ago. So I'm working now on the definition of what a data lakehouse is, how you go about building it, and what some of its characteristics are.

Yeah. And in the lakehouse, is data science a first-class citizen?

It's
kind of interesting. It's my observation, and I could be wrong about this, that when the first notions of the data lakehouse came about, it was primarily for the data scientists. But there is another whole community of plain old end users that need to look at that data as well. And it's really kind of interesting: the data scientist looks at one kind of data and the end user looks at another kind of data. It's the same data, but it's looked at in different ways. So in terms of the maturity of the data lakehouse, we need to start to accommodate the end user. I think the data lakehouse was originally built with the data scientists in mind.

Yeah, that makes a lot of sense. And now it's expanding to also include the traditional BI workloads and end users who might want to see simpler dashboards, or places where they can understand the data in a simpler way.

Absolutely, that's true. And actually the data scientist and the end user are probably looking for different things. They might be looking for the same thing, but in general the data scientist is looking for patterns, trends, and things that haven't been seen before, and the end user is typically looking for things like KPIs, key performance indicators, and things like that. So they're essentially using the same data, but they're looking at very different things in that data.

That's fascinating. And for the last decade or so you've been working on textual ETL yourselves.

That's correct. We've been taking on the problem of looking at text and asking ourselves the question: how do we turn this text into actionable information? Because there's actually a tremendous amount of business value buried in text that isn't being looked at today.

And these are
text records that might not even have a lot of structure to them, right? This is not tabular text that adheres to a particular schema when it comes in.

If you're waiting to have structure in text, you're going to be waiting a long time. 99 percent of the text that's out there in the world has no structure. It's like our conversation: there's nobody sitting there telling you what to say, there's nobody telling me what to say, we're just having a conversation. So in terms of structure, I don't know, we're certainly friendly with each other, but there's not really any structure to what we're talking about. Nor is there for emails or for what's on the internet. Structure, as far as I can tell, just really doesn't exist in text, or at least if it does, not very much.

How do you do that with your textual ETL? How do you understand the context?

I can tell you, having been doing this the last 10 years of my life, that managing text is 10 percent of the battle, and managing and understanding context is 90 percent of the battle. It really is difficult. Different text has different requirements: some text can be handled one way, other text is handled in a completely different fashion. The last time I looked, we had about 67 different algorithms in our technology, and our technology goes in and selects the appropriate algorithm for the given piece of text at a time.

You said you have 67 algorithms or so that understand the context. How are these built? There must be a lot of variety.

Let me give you a flavor of just some of the algorithms. One of them is something called homographic resolution. Suppose you were a doctor reading doctors' notes and you saw the term "HA". Now, what does "HA" mean? Well, if a cardiologist wrote "HA", it would mean heart attack. If a general practitioner wrote "HA", it would mean headache. If an endocrinologist had written "HA", it would mean hepatitis A. So the
interpretation of what "HA" means depends entirely on who wrote it, and that's one of the 67 algorithms that we have orchestrating how we determine context.

Got it. So you're really using these advanced statistical techniques to understand the context from these 60 or so algorithms, and they search over the data in multiple passes. Is that done on a lakehouse pattern?

Well, what we do is we read the raw text and then put it into a form that goes into the lakehouse. Once it's in the lakehouse, it can then be analyzed and mixed and merged with other kinds of data.

Speaking of languages, for the processing that your textual ETL does in the lakehouse, do you have a variety of awesome programming languages that you're using, or what is it written in? I'm curious about your technology.

The underlying technology that we have is Microsoft VB.NET.

So Bill, I'm curious about your take on ELT versus ETL.

Certainly. That's actually a knotty question, because there are advantages and disadvantages however you do it. I've always been a fan of ETL because ETL forces you to transform data before you put it into a form where you can work with it. But some organizations want to simply take the data, put it into a database, and then do the transformation. Now, when it comes to text, text is a different beast altogether, because I'm not a believer that you can do ELT with text. I'll tell you what: if you can do it, I don't know how, and we do this every day. So I think for text you don't have any choice but to do ETL. With other technologies you do have a choice, and there are some reasons for doing ELT, and I understand those. But again, I'm a fan of ETL because ETL forces the organization to do the transformation, and I've seen too many cases where the organization says, oh, we'll just put
the data in and transform it later, and guess what: six months later that data has never been touched.

The way we see it, when there's more structure to your data, you can do ELT, because you can load it in and use SQL, and with SQL you can do a lot of the transformations once it's loaded. But as you pointed out, for all these complex data types, the text and the audio and video and all these other data science workloads, it's just very hard to express that with SQL.

Yeah. With images, for example, I don't believe you can do ELT. Maybe you can, but I don't know how it's done.

Yeah. Well, a lot of these machine learning frameworks that analyze the images directly access the files and do ETL.

Yeah, that makes sense to me.

So, switching gears. The end user is really important. There was this concept of data lakes that persisted for a long while. Would you say the data lakes did a good job of addressing those end users?

No. The data lake was not thought out from an architectural standpoint. From a technical standpoint, I think the data lake was fine, nothing wrong with it. But architecturally there were many things missing from the data lake, and because they were missing, it made the data lake not useless, but very difficult to get information out of.

You've worked in this space a long time. Would you say many data lakes turn into data swamps?

Most of them do. Every now and then you see one that doesn't, but most of them do.

And the lakehouse takes that data lake but is now also adapted to end users, and helps you with the structure that you need so that you can actually make sense of that data, not just turn it into a data swamp?

That's correct. If you take your data lake and turn it into a lakehouse, you can actually start to get your money's worth out
of it. If you don't turn it into a lakehouse, it turns into a swamp.

How important are open source and the open aspects of these architectures today? That didn't exist when you came up with the concept of the data warehouse many, many decades ago. But today?

Yeah, many, many decades ago Oracle and IBM and Microsoft had secrets. They tried to keep everybody from openness. I think the world that we live in today is, first off, a different world, and I think it's a positive thing that it's different. The data warehouse would have accelerated way back when had we had openness in that day and age, and I'm glad that we have it today.

Yeah. So is it a good summary to say the lakehouse gets the data science focus from the data lakes, but it also gets the application and the BI focus from the warehouse, and blends them for all the variety of data sets that you referred to earlier? Is that a good summary?

I would absolutely agree with that statement.

That's awesome. I'm going to get a little bit personal here. When we started talking, you started writing this blog series about the lakehouse, and it seems you're writing them in the middle of the night. Tell us a little bit about what's going on.

I'm a writer. I've written 60 books now, and to me writing is a form of relaxation. For most people writing is hard work, but at this point in my life I would rather sit down and write a good book than go out and play a game of golf. It's like doing a crossword puzzle: it's an intellectual challenge, and I enjoy doing it. There's nothing that I would rather do than sit down and write a good book. God has made different people different ways, and God made me a writer. And so I write very
quickly. Some painters take a long time to make a painting, and some painters paint very quickly. I would be one of those painters that paints very quickly.

We've been big fans of the lakehouse, but you've helped us actually evolve that through your writing. We couldn't keep up with your writing, because you keep cranking it out at a very fast pace. At Databricks we would have whole groups of people sit there and read these and say, wow, we actually hadn't thought about that.

Wait till you see the book, which by the way is now on chapter eight. I hope to have the book finished by next week. But wait till you see the book. There's a lot of information that I think Databricks and the world are going to find useful.

On the lakehouse: you said when you wrote the book Building the Data Warehouse you wanted everybody to go off and build their data warehouse. You're now writing a book on the lakehouse. Is the intention that you want people to run off and build lakehouses?

Absolutely. I'll tell you what, it's almost a different proposition: what if people don't go and build data lakehouses? They're going to end up with this flood, this avalanche of data that they're not going to know what to do with, and that's going to be tragic, because you can do so many things with the data. So it's not so much a problem of "are they going to build a data warehouse"; it's going to be what happens if they don't build a lakehouse. Because if they don't build a lakehouse, they're going to have this mountain of data that sits there, and nobody's going to be able to do anything with it.

In the long run, what do you think the impact of the lakehouse will be, if you put it in perspective with the kind of impact on the industry that the data warehouse has had?

Well, I believe
the lakehouse is going to unlock the data that is there, and it's going to present opportunities like we've never seen before. That's going to be the effect of creating the lakehouse.
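
The "HA" example in the conversation above (the same abbreviation meaning heart attack, headache, or hepatitis A depending on who wrote the note) can be illustrated with a small lookup. This is only a minimal sketch of the idea, not the textual ETL product's actual implementation, which the conversation says is written in VB.NET and is not described in detail. The names `HA_MEANINGS` and `resolve_homograph` are hypothetical, and the only table entries are the three cases mentioned in the chat.

```python
# Hypothetical lookup table: (abbreviation, author specialty) -> meaning.
# The three entries are the examples given in the conversation; a real
# system would need a far larger, curated vocabulary.
HA_MEANINGS = {
    ("ha", "cardiologist"): "heart attack",
    ("ha", "general practitioner"): "headache",
    ("ha", "endocrinologist"): "hepatitis A",
}


def resolve_homograph(term, author_specialty, table=HA_MEANINGS):
    """Return the meaning of `term` given who wrote it, or None if unknown."""
    # Normalize so that "H A", "HA" and "ha" are treated as the same term.
    key = (
        term.strip().lower().replace(" ", ""),
        author_specialty.strip().lower(),
    )
    return table.get(key)


if __name__ == "__main__":
    for specialty in ("Cardiologist", "General Practitioner", "Endocrinologist"):
        print(specialty, "->", resolve_homograph("H A", specialty))
```

The design point is that the term alone is not enough to pick a meaning; the author's context is part of the lookup key, and an unknown context yields no answer rather than a guess.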
Info
Channel: Databricks
Views: 3,075
Rating: 5 out of 5
Keywords: Databricks, Lakehouse, Data Architecture, Data Warehouse
Id: ylawoga4Z2M
Length: 21min 8sec (1268 seconds)
Published: Wed May 26 2021