Hello friends! First there were databases, then came the data warehouse, and now we have the data lake. The data lake is an emerging technology that has redefined data extraction, storage and analysis. So today we will understand the data lake in very easy language. Don't go anywhere, as there is a lot to come...

[Intro Music]

Thank you so much, friends, for showing your love and support. Because of this I am getting more confident about making more such videos. Please subscribe to the ITKFUNDE channel if you have not done so yet!

So let's start. This is actually a very wide topic, but I have tried to summarize it, so let's see if I am able to get my message across.

First we have to understand what a data warehouse is. Only after understanding the data warehouse can we figure out what a data lake is. I won't cover databases, as you already know what they are: a database stores the data for an application.

So what is a data warehouse? A data warehouse is a centralized location where all the data of an enterprise is stored. But before storing the data we need to define the business need: why do we want to store this data in a data warehouse? Take Walmart, the biggest retail chain in the U.S., as an example. They have a business need to understand the points shown above. Customer, product and sales data all get stored in different source systems. Similarly, when you visit a retail store or a shopping mall, you reach the POS (point of sale). The POS is the place where you make the purchase. At this point your data enters the systems and gets stored in various transactional systems.

Now, if you want to merge and analyze this data from a central location, you first extract it. The data you need for analysis can reside in any of these systems: it can be in the sales, HR or inventory source systems. Once the data is loaded into the data warehouse, business analysis and reporting happen on top of it. Hence the data warehouse works on the philosophy "THINK FIRST, LOAD LATER". This means you first need to know what data you need
and based on your requirements, you fetch the required data. The data warehouse came in the 1980s and is very common today. Today when you talk about business intelligence or analysis, you are mainly talking about a data warehouse. So friends, this is a very summarized explanation of what a data warehouse is.

Now, with fast-growing technology, data volumes are increasing exponentially. Today every small system, sensor or machine is generating insightful data. But this is not all structured data; it includes unstructured data such as sensor data and logs. Businesses today need this unstructured data because they don't want to analyze ONLY their operational data. At one time it was impossible to store and analyze such data, but now it is possible with the advent of "Cloud Computing". We have made a separate series on cloud computing; please check it out to understand it in easy language.

Loading unstructured data into a data warehouse is challenging, as it supports structured data with predefined business requirements. Now if we bring the data lake into the picture, its philosophy says "LOAD FIRST, THINK LATER". This means you first load whatever data you have and only later analyze how you can use it.

Friends, I studied engineering in Bhopal. Bhopal is the capital of Madhya Pradesh, a state in central India, and it is called the "City of Lakes". So let's compare a data lake with a real lake. Water in a real lake can come from various sources, like rain, rivers, tributaries and sewers. Similarly, data from various source systems comes and resides in the data lake in its "RAW", natural form. A data lake accepts all kinds of data: structured, unstructured, logs, images, everything.
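To make the two philosophies concrete, here is a minimal Python sketch. It is only an illustration and not how any real warehouse or lake product works: the schema, the function names and the sample records are all assumptions invented for this example. The warehouse loader checks every record against a schema agreed in advance, while the lake loader keeps whatever arrives and applies structure only when a question is asked.

```python
import json

# Hypothetical fixed schema that a warehouse table expects ("think first").
WAREHOUSE_SCHEMA = {"customer_id": int, "product": str, "amount": float}


def load_into_warehouse(table, record):
    """Think first, load later: reject anything that does not match the schema."""
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected_type in WAREHOUSE_SCHEMA.items():
        if not isinstance(record[field], expected_type):
            raise ValueError(f"{field} must be {expected_type.__name__}")
    table.append(record)


def load_into_lake(lake, source, payload):
    """Load first: keep the payload exactly as it arrived, tagged with its source."""
    lake.append({"source": source, "raw": payload})


def read_sales_from_lake(lake):
    """Think later: apply structure only when a question is asked."""
    sales = []
    for item in lake:
        if item["source"] != "pos":
            continue
        try:
            sales.append(json.loads(item["raw"]))
        except json.JSONDecodeError:
            pass  # not usable for this question, but it stays in the lake
    return sales


warehouse_table, lake = [], []
load_into_warehouse(warehouse_table, {"customer_id": 1, "product": "tea", "amount": 4.5})
load_into_lake(lake, "pos", '{"customer_id": 1, "product": "tea", "amount": 4.5}')
load_into_lake(lake, "sensor", "temp=21.4;unit=C")   # unstructured sensor reading
load_into_lake(lake, "weblog", "GET /cart 200 12ms")  # log line
print(len(warehouse_table), len(lake), read_sales_from_lake(lake))
```

The sensor reading and the log line would be rejected by the warehouse loader, but the lake accepts them unchanged, so they remain available for any analysis that comes up later.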
A data lake stores everything in its raw state, so it acts as a central reservoir that stores data without any prior analysis. Once data is in the data lake you may use it for multiple purposes. Just as water in a real lake can be used for irrigation, drinking and industry, a data lake allows you to use the stored data as per your needs and business use case. The data lake has been conceptualized around this idea because we now have enough power to process and use such data.

In order to better understand the data lake, let's compare it with a data warehouse. If a data lake is like a lake, then a data warehouse is like a water tank fitted on the terrace of a house. A lake can be used for multiple purposes, whereas a water tank is solely used for fulfilling the water needs of that house. Whatever water need the house has is fulfilled by that tank. So you can clearly see the difference here. A data warehouse fulfills all the water needs of the house, but first you need to define that you need water in your house, and then you fit the water tank. So a data warehouse, like a water tank, is designed in a very clear way and works with structured data. On the other hand, a data lake needs no pre-planning for loading any specific data.

A data lake is a very good platform for advanced data analytics. Now with technologies like machine learning and AI you can process any kind of data, and for this you now have data scientists.

So friends, the data lake and the data warehouse are different but can work together as well. There is a common MYTH that we have to go for either a data lake or a data warehouse. Friends, both should coexist, as all your requirements cannot be fulfilled solely by a data lake or a data warehouse. So ideally you should first build a data lake and then derive various data warehouses or data marts from it. If we relate this to our previous example, we can take water from the lake, purify it and supply it to the water tank.

A data lake has many advantages; however, it can also have a potential downside. The data lake came about because the data warehouse was too rigid.
Sometimes, after spending a year building a data warehouse, we used to realize that it had not achieved all the business requirements and was not useful. A data lake, on the other hand, works in an AGILE manner and allows flexibility in data storage and analysis.

But a data lake has one downside if you do not have proper metadata management. Metadata management, in very basic terms, means that you should know what data is coming into your lake and how it is being managed. Data duplication should not occur: similar data should not be loaded again and again. Otherwise your data lake could very soon become a data swamp. If you do not know what data is coming into the lake, it will very soon become a dumping ground. To avoid this you need very strict data governance; there needs to be a strict vigil on the data coming into the data lake.

Friends, I hope you understood these concepts. The data lake is new, but various cloud providers have their products: Google has BigQuery, and Microsoft Azure and Amazon Web Services (AWS) also offer data lake implementations. The data lake is a wide topic, but I tried explaining it in easy-to-understand language. Friends, now you tell me in the comments section what you would prefer: a data lake, a data warehouse, or both. I hope this video helped you understand what a data lake is.

Thank you all. Please like, share and subscribe. I work really hard on this, so please support. Goodbye!
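As a small illustration of the metadata management point above, here is a minimal Python sketch of a catalog that records what arrives in the lake and refuses exact duplicates. The `LakeCatalog` class, its fields and the sample payloads are hypothetical and exist only for this example; real data lake platforms provide their own catalog and governance services, and this sketch only catches byte-for-byte identical loads.

```python
import hashlib
from datetime import datetime, timezone


class LakeCatalog:
    """Hypothetical in-memory catalog: one metadata entry per stored object."""

    def __init__(self):
        self.entries = {}  # content hash -> metadata about that object

    def register(self, source, description, payload: bytes):
        """Record what arrived and from where; skip exact duplicates."""
        digest = hashlib.sha256(payload).hexdigest()
        if digest in self.entries:
            return False  # the same bytes are already in the lake
        self.entries[digest] = {
            "source": source,
            "description": description,
            "size_bytes": len(payload),
            "loaded_at": datetime.now(timezone.utc).isoformat(),
        }
        return True


catalog = LakeCatalog()
first = catalog.register("pos", "daily sales export", b"id,amount\n1,4.5\n")
again = catalog.register("pos", "daily sales export", b"id,amount\n1,4.5\n")
other = catalog.register("sensor", "freezer temperatures", b"temp=21.4;unit=C")
print(first, again, other, len(catalog.entries))
```

Every object in the lake then has a known source and description, which is the minimum needed to keep the lake from turning into a dumping ground.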
