Data Lakes in the Cloud

Captions
Hello, this is Torsten Steinbach, an architect at IBM for Data and Analytics in the cloud, and I'm going to talk to you about data lakes in the cloud. The center of a data lake in the cloud is the data persistency itself. So we talk about persistency of data, and the data itself in the cloud data lake is persisted in object storage. But we don't just persist the data itself, we also persist information about the data. On one side that is indexes: we need to index the data so that we can make use of it efficiently in the cloud data lake. And we also need to store metadata about the data in a catalog. So this is the persistency of the data lake.

Now the question is: how do we get data into the data lake? There are different types of data that we can ingest, so we need to talk about ingestion. Some of your data may already be persisted in databases. These can be relational databases, and they can also be other operational databases, NoSQL databases and so on. We get this data into a data lake via two fundamental mechanisms. One is ETL, which stands for "Extract, Transform, Load", and this is done in a batch fashion. The typical mechanism to do ETL is SQL, and since we're talking about cloud data lakes, this is SQL-as-a-service now. In addition, and often combined with that, there is the mechanism of replication, which is more about change feeds: after you have done a batch ETL on the initial data set, how do you replicate all of the changes that come in after that initial batch ETL?

Next, we may have data that has not been persisted at all yet, data that is generated as we speak, for instance from devices. We may have things like IoT devices, cars, and the like, and they are producing a lot of IoT messages, all the time, continuously, and these also need to land in, and stream into, the data lake. So here we're talking about a streaming mechanism. In a very similar manner, we also have data that originates from applications running in the cloud, or from services that are used by your applications. They are all producing logs, and that's very valuable information, especially if you're talking about operational optimizations and getting business insights into your user behavior and these kinds of things. This is very important data that we need to get hold of. So we're talking about logs, and these also need a streaming mechanism to get streamed and stored in object storage.

And finally, you may already have data sitting on local disks, maybe on your own machine. You may even have a local data lake, a classical data lake, not in a cloud; typically these are Hadoop clusters that you have on premises in your enterprise. Or it can be as simple as NFS shares, which you find very frequently, used in your team or your enterprise to store certain data. If you want to get that data into a data lake, you also need a mechanism, and that is basically an upload mechanism. So a data lake needs to provide you with an efficient way to upload data from the ground, on premises, to the cloud, into the object storage.

Now, the next thing we need to do, once the data is here, is process it.
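As a concrete illustration of the batch ETL path just described, here is a minimal sketch (my own illustration, not the speaker's tooling): extract rows from an operational database, apply a light transformation, and land the result as Parquet in object storage. The source database, bucket, object key, and S3-compatible endpoint are assumed names, and writing Parquet this way needs pyarrow installed.

```python
# Minimal batch-ETL sketch: operational database -> object storage (Parquet).
# All names (orders.db, bucket, endpoint) are illustrative assumptions.
import io
import sqlite3

import boto3
import pandas as pd

# Extract: pull the initial data set from an operational database.
conn = sqlite3.connect("orders.db")                  # stand-in source database
df = pd.read_sql_query("SELECT * FROM orders", conn)

# Transform: light cleanup before landing the data in the lake.
df["order_ts"] = pd.to_datetime(df["order_ts"], utc=True)
df = df.dropna(subset=["customer_id"])

# Load: write Parquet into object storage with a partition-style key layout.
buf = io.BytesIO()
df.to_parquet(buf, index=False)                      # requires pyarrow
s3 = boto3.client("s3", endpoint_url="https://s3.example-cloud.com")  # assumed endpoint
s3.put_object(
    Bucket="my-data-lake",
    Key="raw/orders/ingest_date=2019-08-09/orders.parquet",
    Body=buf.getvalue(),
)
```

A replication or streaming path would follow the same pattern, but triggered per change feed or per message batch rather than as a single batch job.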
This is especially important for data that hasn't gone through any initial processing, like device data or application data. This is pretty raw data: it has a very raw format, it is very volatile, it has very different structures and changing schemas, and sometimes it doesn't have a real structure at all. It can be binary data, let's say images taken by a device's cameras, and you need to extract features from it. So we're talking about feature extraction from this data. But even once you have extracted structure, the data might still need a lot of cleansing: you may have to normalize it to certain units, round it to certain time boundaries, get rid of null values, and these kinds of things. So there is a lot to do in terms of transformation: you need to transform the data. Once you have transformed the data, you have something you can potentially use for other analytics, but one additional thing is advisable: you should create indexes, so that we know more about the data and can do performant analytics. And finally, you should also leverage the catalog: you need to tell the data lake about this data by cataloging it. So there are multiple steps, and often we talk about a pipeline of data transformations that need to happen here.

Now the question is: what do we use for this? There are two mechanisms, two types of services, that are especially suited for this kind of processing. One is Functions-as-a-Service (FaaS) and the other is SQL-as-a-service again. With SQL and Functions-as-a-Service you can do this whole range of things: you can create indexes through SQL DDLs, you can create tables through SQL DDLs, and you can transform data using functions with custom libraries and custom code to do feature extraction from whatever format of data you need to process. A sketch of such a transformation and cataloging step follows below.

Once we have gone through this pipeline, the question is: what's next? We have prepared and processed all of this data, and we have cataloged it, so we know what data we have. Now it comes to the point where we really harvest all of this work by generating insights. Generating insights is, on one side, the whole area of business intelligence: things like reporting or creating dashboards, typically referred to as BI (Business Intelligence). One option is to simply do BI directly against the data in the data lake. But it turns out that this is mainly suited to batch-oriented BI, like creating reports in a batch fashion. When it comes to more interactive requirements, where you're sitting in front of the screen and need, let's say, a dashboard to refresh in subseconds, there is another very important mechanism that is very well established as part of this whole data lake ecosystem, and this is the data warehouse. A data warehouse, or a database more generally, is highly optimized and has a lot of mechanisms for giving you low latency and guaranteed response times for your queries. So the question is: how do we do that?
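Before moving on to the warehouse step, here is a minimal sketch of the transformation and cataloging stage described above. It is my own illustration under assumptions: the handler signature mimics a generic serverless function (real FaaS platforms differ), and the DDL is a generic Hive-style statement with made-up table and path names.

```python
# Sketch of a serverless transformation step for raw IoT messages,
# followed by an illustrative Hive-style DDL that catalogs the result.
from datetime import datetime, timezone

def handler(event, context=None):
    """Normalize one batch of raw device messages (illustrative FaaS entry point)."""
    cleaned = []
    for msg in event.get("messages", []):
        if msg.get("temperature") is None:          # cleanse: drop null readings
            continue
        cleaned.append({
            "device_id": str(msg["device_id"]),
            # normalize units and precision
            "temperature_c": round(float(msg["temperature"]), 1),
            # round timestamps down to a minute boundary
            "minute": datetime.fromtimestamp(msg["ts"], tz=timezone.utc)
                              .replace(second=0, microsecond=0)
                              .isoformat(),
        })
    return {"cleaned": cleaned}

# Cataloging step: a Hive-style DDL (assumed table name and location) that
# tells the data lake about the transformed data so SQL engines can query it.
CATALOG_DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS iot_readings (
  device_id     STRING,
  temperature_c DOUBLE,
  minute        TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/cleaned/iot_readings/'
"""
```

In practice the function would also write the cleaned records back to object storage, and the DDL would be submitted through the SQL-as-a-service engine mentioned above.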
Now, we obviously need to move this data one step further after it has gone through all of the data preparation in the data lake, with an ETL again. And it happens that SQL-as-a-service is again a useful mechanism: it's a service we have available in the cloud, we already use it to ETL data into the data lake, and now we can also use it to ETL data out of the data lake into a data warehouse. There it sits in the more traditional, established stack of doing BI, where it can be used by your BI tools, reporting tools, and dashboarding tools to do interactive BI with performance and response-time SLAs.

So that's one end-to-end flow, but, very obviously, insight is more than just reporting and dashboarding. There is a whole domain of tools and frameworks out there for more advanced types of analytics, such as machine learning, or simply data science tools and frameworks, that can also do analytics and AI (artificial intelligence) against the data we have prepared and cataloged here. Machine learning tools and data science tools all have very strong support for accessing data in object storage, so it's a good fit to let them connect directly to this data lake.

Now, that is the end-to-end process: getting from your data, with the help of a data lake, to insights. One of the big problems today is for people to prove and explain how they got to an insight. How can you trust this insight? How can you reproduce it? So one of the key things that needs to be part of this picture is data governance. Data governance, in this context, has two main aspects. One is that we need to be able to track the lineage of data, because, as you've seen, the data travels from different sources, through preparation, into some insight in the form of a report. You always need to be able to track back: where did this report come from? Why does it look like this? What data produced it? The other aspect is that a data lake needs to be able to enforce governance policies. Who is able to access what? Who is able to see personal information, and can I access it directly, or only in an anonymized and masked form? These are all governance rules, and there are governance services available, also in the cloud, that a data lake needs to apply and use in order to track all of this.

We're almost done with this overall data lake introduction, but there is one more thing I want to highlight, since we're talking about the cloud: how can I deploy my entire pipeline of data traveling through this whole infrastructure, and how can I automate that? Here, Functions-as-a-Service plays a special role, because it has a lot of mechanisms that I can use to schedule and automate things like, for instance, a batch ETL step or generating a report (a sketch of such a scheduled ETL step follows at the end of these captions). So this is the final thing that we need in our data lake in order to automate and operationalize, eventually, the entire data and analytics pipeline using a data lake. Thank you very much.
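To make the lake-to-warehouse ETL and the scheduling point concrete, here is a minimal sketch of a scheduled serverless function that runs the ETL as a SQL statement. It is my own illustration and assumes a PostgreSQL-compatible warehouse, a scheduler (for example a daily cron trigger) that invokes the function, and made-up table and environment-variable names; it is not a specific cloud provider's API.

```python
# Sketch of the automation step: a scheduled function that moves curated
# lake data into a warehouse table optimized for interactive BI.
import os

import psycopg2  # many cloud warehouses speak the PostgreSQL wire protocol

# ETL statement: read the curated lake table (assumed to be exposed to the
# warehouse as an external table) and load yesterday's summary.
ETL_SQL = """
INSERT INTO warehouse.daily_device_summary
SELECT device_id,
       CAST(minute AS DATE)  AS day,
       AVG(temperature_c)    AS avg_temp_c
FROM lake.iot_readings
WHERE CAST(minute AS DATE) = CURRENT_DATE - 1
GROUP BY device_id, CAST(minute AS DATE);
"""

def handler(event=None, context=None):
    """Entry point a scheduler would invoke, e.g. once per day."""
    with psycopg2.connect(os.environ["WAREHOUSE_DSN"]) as conn:
        with conn.cursor() as cur:
            cur.execute(ETL_SQL)   # transaction is committed on block exit
    return {"status": "etl-complete"}
```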
Info
Channel: IBM Technology
Views: 20,394
Rating: 4.9645233 out of 5
Keywords: data lake, cloud storage, database, data warehouse, machine learning, analytics, SQL queries, big data analytics, predictive analytics, data discovery, relational data, non-relational data, object storage, hybrid cloud, devops, public cloud, private cloud, data center, bare metal, virtual server, GPUs, cloud infrastructure, scalability, data scientist, cloud, IBM Cloud
Id: IPkQpBdde5Y
Length: 14min 40sec (880 seconds)
Published: Fri Aug 09 2019