Data Pipelines Explained

Let's talk about data pipelines: what they are, and when and how they're used. I want to start with a simple idea. Most of us are fortunate enough to turn on the tap whenever we like, and fresh, clean water comes out. However, have you thought about how that water actually gets to you? Well, water starts out in our lakes, our oceans, and even our rivers, but most of us probably wouldn't drink straight from the lake, right? We have to treat and transform this water into something that's safe for us to use, and we do this using treatment facilities. And we get the water from where it is to where it needs to go using water pipelines.

Once that water has gotten from the source to the treatment plants, it's cleansed and made safe to use, and then it's sent out using even more pipelines to where we need it. We use it in a couple of different places: we need it for drinking water, we need it for cleaning, and we also need it for agriculture. So, as you can see, water pipelines take water from where it is to where it's needed.

Now, we can start to think about data in organizations in a very similar way. Data in an organization starts out in data lakes. It's in different databases that are part of different SaaS applications, and some applications are on-prem. Then we also have streaming data, which is kind of like our river here. This can be data that is coming in in real time. An example of that could be sensor data from factories, where data is being collected every second and sent back up to our repositories. Just like our water sources, this data is dirty, it's contaminated, and it must be cleaned and transformed before it's useful in helping us make business decisions. So how do we do this work? We do it using not water pipelines, but data pipelines.

When we talk about data pipelines, we have a few different processes that we can use to help us handle the task of transforming and cleaning this data. We can use processes like ETL, we can use data replication, and we can also use something called data virtualization.

One of the most common processes is ETL, which stands for extract, transform, and load, and it does exactly what it sounds like. It extracts data from where it is. It transforms it by cleaning up mismatching data, taking care of missing values, getting rid of duplicated data, and making sure the right columns are there. Then it loads it into a landing repository for ready-to-use business data. An example of one of these repositories could be an enterprise data warehouse. Most of the time we use something called batch processing, which means that on a given schedule we load data into our ETL tool and then load it to where it needs to be. But we could also have stream ingestion, which would support the streaming data that I mentioned earlier: it's continuously taking data in, transforming it, and then continuously loading it to where it needs to be.

Another tool that we might see is data replication. This involves continuously replicating and copying data into another repository before it's loaded or used by our use case. So we could have a repository here in the middle that copies data from our source. Why would we do that? Well, one reason could be that the application or use case where we need this data needs a really high-performance back end, and it's possible that our source data can't support something like that. Another reason could be backup and disaster recovery: in a situation where our source data goes offline for some reason, we still have this backup to keep running our business processes against.

The last one I want to touch on is data virtualization. All of the methods that I've described so far require you to copy data from where it is and move it into another repository. But what if we want to test out a new data use case and don't want to go through a large data transformation project? In that case, we can use a technology called data virtualization to simply virtualize access to our data sources and only query them in real time when we need them, without copying them over. Data virtualization technology allows us to access all these disparate data sources without having to build out permanent data pipelines. Once we're satisfied with the results of our data virtualization project, we can go back and build a formal data pipeline that can support the massive amounts of data that we need in a production use case. Now, unfortunately, we haven't figured out a way to virtualize water, but we can definitely do it with data in our organizations.

After we've used all these different processes to get data ready for analysis or different applications, we can start using it. So what are the different ways in which we can use this data? Well, we might need it for our business intelligence platforms, which are needed for different types of reporting. We might also need it for machine learning use cases. Machine learning requires tons and tons of high-quality data, so we need to use these data pipeline tools to feed our machine learning algorithms. This clean data can be fed into our machine learning models to help us start making better and smarter decisions in our business.

So, as we can see, data pipelines take data from data producers and give it to data consumers. Thank you. If you have questions, please drop us a line below, and if you want to see more videos like this in the future, please like and subscribe.
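
The ETL steps described in the talk (extract, transform by fixing mismatched values, missing values, and duplicates, then load into a landing repository) can be illustrated with a small batch job. This is only a minimal sketch under stated assumptions, not how any particular ETL tool works: the sample records, the `orders` table, the function names, and the choice to fill a missing amount with 0.0 are all hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Hypothetical raw records standing in for rows pulled from a source system.
RAW_ROWS = [
    {"id": "1", "customer": " Alice ", "amount": "10.50"},
    {"id": "2", "customer": "bob", "amount": ""},            # missing value
    {"id": "1", "customer": " Alice ", "amount": "10.50"},   # duplicate record
    {"id": "3", "customer": "CAROL", "amount": "7"},
]


def extract(source):
    """Extract: read records from the source (here, an in-memory list)."""
    return list(source)


def transform(rows):
    """Transform: fix mismatched formatting, fill missing values, drop duplicates."""
    seen_ids = set()
    clean = []
    for row in rows:
        record_id = int(row["id"])
        if record_id in seen_ids:
            continue  # skip records whose id was already processed
        seen_ids.add(record_id)
        name = row["customer"].strip().title()  # normalize whitespace and casing
        # Assumption for illustration: a missing amount becomes 0.0.
        amount = float(row["amount"]) if row["amount"] else 0.0
        clean.append((record_id, name, amount))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into the landing table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_batch(source, conn):
    """Run one batch: extract, then transform, then load."""
    load(transform(extract(source)), conn)


if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")  # stand-in for a data warehouse
    run_batch(RAW_ROWS, warehouse)
    for row in warehouse.execute("SELECT * FROM orders ORDER BY id"):
        print(row)
```

In a batch setup, something like `run_batch` would be triggered on a schedule. A stream-ingestion variant would apply the same transform to each record as it arrives rather than to a whole batch at once.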
Channel: IBM Technology
Views: 50,695
Keywords: IBM, IBM Cloud, data pipelines, data, data science, data analyses, data integration, data replication, data virtualization, business intelligence, data mining, ETL, Extract transform load, machine learning, batch processing, Luv Aggarwal, Stream injection, data producers, data consumers
Length: 8min 28sec (508 seconds)
Published: Thu Jun 16 2022