3 - ETL Tutorial | Extract Transform and Load

Video Statistics and Information

Captions
Hi guys, my name is Vikram Takkar. Welcome to this video on ETL. In the previous videos I gave an overview of data warehousing and its architecture; in this video we are going to talk about one of the most important aspects of data warehousing, which is ETL. So let's get started.

ETL stands for Extract, Transform and Load. As we mentioned in the previous videos, data is read from heterogeneous source systems and loaded into staging tables, and then, after the transformations are complete, it is finally written into the data warehouse. This complete process of extracting the data, transforming it and loading it into the warehouse is called Extract, Transform and Load. It is not a one-off process for a data warehouse, because the data in the operational and source systems keeps on changing; it is a regular process in which data is periodically fetched from the source systems and loaded into the warehouse after the transformations are complete. This is an important aspect of data warehousing, because if the source data from the operational and source systems is not extracted, cleansed and integrated in a proper format in the warehouse, it will be difficult to perform query processing efficiently, which is the ultimate purpose of the data warehouse.

ETL has three phases: first is the extraction phase, second is the transformation phase, and third is the loading of the extracted and transformed data into the warehouse. Let's look into each of these three stages in detail.

First, data extraction. As we have already mentioned, data from the various sources is extracted into the staging area as a first step in data warehousing. There are a couple of data extraction strategies. First is full extraction, wherein all the data from the operational and source systems gets extracted into the staging area. This generally happens in two scenarios: for the initial load, when the data warehouse is being populated for the first time, or in scenarios where we do not have any strategy for identifying the changed records, so we extract the full data, do all the transformations and identify all the changed or modified records in the staging area. Next is partial extraction with update notification. Sometimes we get a notification from the source systems about which data has been updated, which has been deleted and which data is new; this is also called the delta. In this strategy we only extract the data which has been modified, which is easy and quick compared to full extraction. The next strategy is partial extraction without update notification. In this strategy we do not extract the full data set from the source system; we extract the data based on certain keys or certain strategies. For example, if we have already extracted data up to yesterday, we extract only today's data and then identify all the updates within it.

One thing we need to remember while extracting data from a source system is that the process should be designed so that it does not negatively affect source system performance, because the source system is generally where all the actual business transactions happen. So we should design or schedule our extraction jobs to run at a time when they do not degrade source system performance.
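To make partial extraction concrete, here is a minimal Python sketch of timestamp-based incremental extraction (the "without update notification" strategy), assuming SQLite connections and a hypothetical source table `orders` with a `last_modified` column; all table and column names are illustrative, not from the video.

```python
import sqlite3

# A minimal sketch of "partial extraction without update notification":
# pull only the rows changed since the last run, keyed on a timestamp.
# The tables orders and stg_orders are hypothetical examples.

def extract_incremental(source_conn, staging_conn, high_water_mark):
    # Read only the rows modified after the previous extraction window.
    rows = source_conn.execute(
        "SELECT id, customer, amount, last_modified "
        "FROM orders WHERE last_modified > ?",
        (high_water_mark,),
    ).fetchall()

    # Land the raw rows in staging; transformations happen later.
    staging_conn.executemany(
        "INSERT INTO stg_orders (id, customer, amount, last_modified) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    staging_conn.commit()

    # Advance the high-water mark so the next run starts where this ended.
    return max((r[3] for r in rows), default=high_water_mark)
```

Scheduling a job like this during off-peak hours keeps the extra read load on the source system low, in line with the caution above.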
Let's move on to the next phase, which is the transformation phase. The data extracted into the staging server from the source systems is in a raw format, and we cannot use that data as it is; it has to be cleansed, mapped to the requirements and transformed before it is finally ready for loading into the data warehouse. In this stage all the transformations, all the cleansing and all the mappings happen.

Let us look at the basic transformation tasks. First is selection, wherein we select the data which is required to be loaded into the data warehouse, or which is actually meant to be transformed. Next is matching, in which we look up data from various lookup files and match the data that needs to be transformed. Next comes cleansing and enrichment: the data in our source systems is not clean and not standardized, because we are fetching data from more than one source system, so it has to be standardized or normalized; hence we do data cleansing and enrichment. And last is consolidation and summarization: as mentioned earlier, we consolidate and aggregate the data from the source systems, because we do not want to load the same data from the source systems into our warehouse; hence we consolidate, summarize and aggregate the data in this phase.

Let's look at some of the sample transformations that we apply on the data. First is standardizing data: as I mentioned earlier, data is fetched from various sources and needs to be standardized before loading into the warehouse. Next is character set conversion and encoding handling: we need to convert the data into a defined encoding in the data warehouse, because the source systems may or may not have the same encoding. We calculate and derive new columns from the existing columns. We split and merge fields: we split a single field into multiple fields and also combine fields, all based on the requirements. We convert units, because a source system may store a particular measurement in a different way; for example, we convert date/time values and units of measurement into a standard format. Summarization, aggregation and consolidation of data are also done. Deduplication is done as part of transformation as well, in which we delete the duplicate data that we receive from multiple sources. And finally we do key restructuring. Key restructuring is important because we do not use the same keys in our data warehouse that we use in our operational systems; a data warehouse has the concept of surrogate keys. Surrogate keys are non-meaningful keys, generated irrespective of the primary keys defined in the source systems, so we design our data warehouse to populate the surrogate keys while loading the data from the source systems. A small sketch of these transformation tasks follows below.
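As a rough illustration of cleansing, unit conversion, deduplication and surrogate-key assignment, here is a small Python sketch; the record layout, the fixed INR-to-USD rate and the deduplication key are assumptions made for the example, not details from the video.

```python
from itertools import count

# Hypothetical staged records from two source systems, with inconsistent
# formatting, mixed currencies and one duplicate.
staged = [
    {"src_key": "A-1", "customer": "acme ",  "amount": 10.0,  "currency": "USD"},
    {"src_key": "B-7", "customer": "ACME",   "amount": 10.0,  "currency": "USD"},
    {"src_key": "B-9", "customer": "Globex", "amount": 900.0, "currency": "INR"},
]

INR_TO_USD = 0.012  # illustrative fixed rate for unit conversion

surrogate = count(1)  # surrogate keys are generated, ignoring source keys

def transform(records):
    seen, out = set(), []
    for r in records:
        # Cleansing / standardization: trim and normalize text fields.
        customer = r["customer"].strip().upper()
        # Unit conversion into a standard currency.
        amount = r["amount"] * (INR_TO_USD if r["currency"] == "INR" else 1.0)
        # Deduplication on the business attributes, not the source keys.
        key = (customer, round(amount, 2))
        if key in seen:
            continue
        seen.add(key)
        # Key restructuring: assign a warehouse surrogate key.
        out.append({"sk": next(surrogate), "customer": customer,
                    "amount_usd": amount})
    return out

print(transform(staged))  # the duplicate ACME row is dropped
```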
Next is loading the data into the warehouse from the staging server: we take the data prepared in the transformation phase and load it into the data warehouse. There are a few loading strategies we use. First is the initial load: as I mentioned, when we load the data for the very first time, we do not care about identifying new or modified records; we generally take the whole data set from the staging server and load it into the data warehouse. This is a one-off process and it is generally done when the data warehouse is being populated for the very first time. Next is the incremental load, where we apply all the ongoing changes from the source systems to the data warehouse periodically; in this we only load the records which have either changed or been newly inserted into the source systems. And the last is the full refresh. We are sometimes required to do a full refresh as well, which is basically completely erasing the contents of one or more tables and loading fresh data. For certain tables we do not want to keep the history of changes, so it is better to completely erase those tables and load all the data extracted from the source systems. This completes the loading strategies; a small sketch of an incremental load and a full refresh follows at the end of this transcript.

Next is ETL tools. We have a couple of licensed ETL tools, which are Informatica, IBM DataStage, Microsoft Integration Services, Ab Initio and others; these are full-fledged, priced softwares. We have a couple of open source and community tools as well, which are Pentaho Kettle and Talend. CloverETL also has an open source version, but it is limited to certain transformations only.

Now let's look at the features and capabilities of ETL tools. First is data connectivity with the source and target systems: ETL tools should allow us to connect to both the source systems and the target data systems efficiently and seamlessly. Next is scalability and performance: we should not have to change a job to make it perform better whenever the data volume increases or the business makes an acquisition; we should be able to scale ETL jobs by adding more clusters. An ETL tool should have pre-built transformations and connectors, which allow us to do the transformations easily and quickly. It should also have data profiling and data cleansing components, because the data from source systems may or may not be clean or in the same format; data profiling and data cleansing allow us to standardize the data and improve data quality. ETL tools should also have the capability to easily perform logging and exception handling. They should have robust administration features which allow us to schedule jobs, assign jobs to particular users and let multiple developers work simultaneously on the same project. They should have connectors to integrate easily with web services. And ETL tools should give us the capability to perform operations in batch as well as in real time.

This completes our overview of ETL. I hope you guys have liked this video; if yes, don't forget to hit the like button and subscribe if you want to see more videos.
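As promised above, here is a minimal sketch of two loading strategies from the transcript, an incremental load written as an upsert plus a full refresh, using SQLite for brevity; the `dw_orders` table, its columns and the assumption that `sk` is the primary key are all hypothetical.

```python
# Minimal loading sketch. Assumes dw_orders(sk PRIMARY KEY, customer,
# amount_usd) exists in the warehouse database; names are illustrative.

def incremental_load(warehouse, transformed_rows):
    # Apply only changed/new rows: insert, or update on key collision.
    warehouse.executemany(
        "INSERT INTO dw_orders (sk, customer, amount_usd) VALUES (?, ?, ?) "
        "ON CONFLICT(sk) DO UPDATE SET "
        "customer = excluded.customer, amount_usd = excluded.amount_usd",
        transformed_rows,
    )
    warehouse.commit()

def full_refresh(warehouse, all_rows):
    # Erase the table and reload everything from staging, discarding
    # history, as described in the transcript.
    warehouse.execute("DELETE FROM dw_orders")
    warehouse.executemany(
        "INSERT INTO dw_orders (sk, customer, amount_usd) VALUES (?, ?, ?)",
        all_rows,
    )
    warehouse.commit()
```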
Info
Channel: Vikram Takkar
Views: 198,565
Keywords: ETL, ETL Tutorial, ETL Basics, ETL Explained, Learn ETL, Learn Data warehousing, Data warehousing tutorial, Extraction, Transformation ETL, Loading ETL, Talend ETL, Pentaho ETL, Kettle ETL, Informatica ETL, Datastage ETL, ETL Training, data load in Data warehouse, ETL tasks, ETL Strategies, ETL tools, Informatica, Abinitio, Talend, Clover ETL, Datastage, Data Transformations, Data Loading Strategies, What is ETL, Why ETL
Id: WZw0OTgCBOY
Length: 12min 20sec (740 seconds)
Published: Tue Sep 08 2015