Olympic Data Analytics | Azure End-To-End Data Engineering Project

Captions
In this project we will extract data from an API using Azure Data Factory, which is the data-pipeline tool available on Azure. We will build a flow that loads the raw data onto Azure Data Lake Storage, then use Azure Databricks to write Spark code, transform the data, and load it back into a transformed layer of the data lake. Once that is done, we will use Synapse Analytics to run SQL queries on top of the transformed data so we can find insights and build visualizations on top of it.

Hey guys, welcome back to another project video where we will execute an entire data engineering project from start to end: extracting data, doing the transformation, loading the data, and doing analysis on top of it. In this video we will be using Azure Cloud. We have already done a few projects on AWS and GCP, so I wanted to introduce a new cloud and show you how data engineering is done on Azure. We will learn about many different services: the basics of Azure Cloud, Databricks, Data Factory, how to write Spark code, Synapse Analytics, how to build a data lake, and more. Before you start executing this project, I will highly suggest you like the video, comment your thoughts below, and hit that subscribe button if you enjoy this type of content; it helps the channel grow and reach more people.

In this project we will take the Olympic data that is available on Kaggle and build an end-to-end Azure data engineering project. We will use the major services available on Azure Cloud and build a simple data pipeline: copy the data, do some basic transformation, and then load the data onto a target location.

This is the architecture diagram. We have a data source, then an ingestion mechanism, we store that data onto a storage location, then we run some transformation code using Apache Spark, store the transformed output, run analytics on it, and finally build a dashboard at the end. Let's walk through the architecture diagram step by step, and then we will deep dive into the individual services and what they mean on Azure Cloud.

To understand this architecture you need to know the basics of ETL: extract, transform, load. In ETL we extract data from multiple sources, apply some transformation logic — which can be anything, such as cleaning the data, removing duplicates and null values, or applying business logic — and then load that data onto a target location. This architecture diagram is built on top of exactly that idea, using Azure cloud services.
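To make the ETL idea concrete before we touch any Azure service, here is a minimal PySpark sketch of the same extract–transform–load shape we are about to build; the paths and the cleaning steps are placeholders for illustration, not the actual project values:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl-skeleton").getOrCreate()

# Extract: read raw data from a source location (placeholder path)
raw_df = spark.read.csv("/data/raw/athletes.csv", header=True, inferSchema=True)

# Transform: basic cleaning - drop duplicates and rows with null values
clean_df = raw_df.dropDuplicates().na.drop()

# Load: write the transformed data to a target location (placeholder path)
clean_df.write.mode("overwrite").csv("/data/transformed/athletes", header=True)
```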
First of all, we have the data source: the data we will use comes from Kaggle, and we will go through the files and columns it contains as we go forward in this video. Then we will use a service available on Azure Cloud called Data Factory. Data Factory is a data integration service: you can build data pipelines, connect to multiple sources, and even do some basic transformation if you want. It has a lot of functionality, and it is the service on Azure you use to build a simple pipeline or to do more complex orchestration. We will use Azure Data Factory to extract data from our data source and load it onto our data lake storage.

Data lake storage is basically object storage where you store your files. On Azure we have something called a storage account; inside the storage account you can create containers where you store blob files, and you can also create tables and a few other things — you can read the documentation for the details. The overall goal of the data lake is to store structured and unstructured data. So we will extract the raw data and store it as-is in one location. After that we want to run some transformation code on top of it: we take the raw data, write Apache Spark code in the Databricks environment, do some basic transformation, and load the result into the transformed layer. So we will have both the raw data and the transformed data stored on our data lake storage.

Once we have that, we can use the analytics service called Azure Synapse Analytics. In Synapse you can use notebooks or SQL to run analytics queries on top of the data — for example, which country won the most gold medals, or which player won the most — and at the end you can build a dashboard using tools such as Power BI, Looker Studio, or Tableau.

So that is the overall architecture: we have a data source, we use an integration service provided by Azure to build a simple pipeline, we load the data onto a storage location, we write some transformation logic, we store the transformed data, and then with Azure Synapse Analytics we run analytics on top and optionally build a dashboard. Now let's understand each service in a little more detail.

Before we go forward, here are the basic prerequisites for this project. First, you need a stable internet connection and a laptop — nothing high-end, because everything we do runs on the cloud platform; all you need is a browser to access the Azure portal. Second, you should know the basics of Python and SQL, because I will not spend time teaching those things here.
If you don't know them, you can check the courses linked below — I have courses on Python for data engineering and SQL for data engineering that teach you everything you need, and once you complete them you will have a solid foundation in Python and SQL. Finally, you need an Azure account. I will show you how to create one; Azure provides a 12-month free trial and 200 dollars of credit, so this project will be completely free — you won't have to pay anything. We'll go through account creation later in this video. Those are the only prerequisites; everything else I will teach you in this single video.

So let's start by understanding our data source. This is the Tokyo 2021 Olympics dataset available on Kaggle — you will find the link in the description. It contains information about 11,000 athletes, 47 disciplines, and 743 teams that took part in the Tokyo 2021 Olympics, spread across multiple files. The original files are in Excel format, but I have converted them to CSV for easier processing, so we have athletes, coaches, entries gender, medals, and teams. You can download the data yourself if you have a Kaggle account, but I have already done that and uploaded the same data to my GitHub repository, which is also linked in the description.

Let's quickly look through what the data contains. First, athletes: we have the person name (the athlete's name), the country they are from, and the discipline, such as basketball, rowing, wrestling, karate, and all the other disciplines performed at the Olympics. I suggest you spend some time looking at this data yourself so you get a better feel for it. Then we have the coaches of those athletes: name, country, discipline, and event — four columns, easy to understand. You don't need to memorize every column; just get an idea of what the data looks like. Next is entries gender: for each discipline it has the number of female entries, male entries, and the total — for example, archery had 64 female and 64 male entries, badminton around 86 per gender, and so on. Then we have medals: for each country, the number of gold, silver, and bronze medals, the total, and the rank — the USA at number one, then China, Japan, Great Britain, and so on. So the medals file holds country-level counts of gold, silver, and bronze, the total, and the rank. And finally we have teams, which is basically the team name for each individual team in a discipline, along with the country and the event.
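If you want to peek at these files before touching any Azure service, a minimal pandas sketch works; the raw-URL base and the file names below are placeholders — substitute the actual repository link from the description:

```python
import pandas as pd

# Placeholder base URL - replace with the raw URL of the actual repository
RAW_BASE = "https://raw.githubusercontent.com/<user>/<repo>/main"

# File names assumed; adjust to match the repository
files = ["Athletes", "Coaches", "EntriesGender", "Medals", "Teams"]
for name in files:
    df = pd.read_csv(f"{RAW_BASE}/{name}.csv")
    print(name, df.shape)      # number of rows and columns per file
    print(df.head(3), "\n")    # a quick look at the first few records
```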
That is the high-level overview of our data: five files — athletes, coaches, entries gender, medals, and teams — all in CSV format, uploaded to my GitHub repository. You will find the link in the description. We will be using this repository as our data source and extracting the data from it. If you want, you can star the repository to keep track of it, fork it to pull the data from your own copy, or simply download the zip file — three options, whichever you prefer. So that was all about our data set.

Now let's spend some time on Azure itself, because a lot of you might be new to Azure Cloud. I want to give you a basic overview of how the Azure portal is structured so that everything is easier to follow once we get started. This is what an Azure account structure looks like. When you sign up on the Azure portal you get your own account, tied to your email — in my case, Darshil Parmar's account. Inside the account you can create multiple subscriptions. On the free trial you get a free-trial subscription that you can use to run services and do your work; once the trial expires you have to buy a subscription, and you can purchase multiple subscriptions based on need — for example an engineering subscription and a marketing subscription, allocating different subscription plans and budgets to different departments. In a big company you will typically see multiple subscriptions set up according to requirements.

Inside a subscription you have resource groups. A resource group is a logical grouping of resources — the resources here being things like Data Factory, data lakes, Synapse Analytics, or Azure Databricks. It is like a folder: just as you store multiple files inside a folder, you create a resource group and put multiple resources inside it. For example, a dev resource group might contain virtual machines and an Azure SQL database, a prod resource group might have its own virtual machines and Azure SQL database, and a marketing resource group might contain only an Azure SQL database. So under a subscription you create different resource groups, and inside each resource group you have the different resources.
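As a side note, everything we will do through the portal — including creating the resource group later on — can also be scripted. A minimal sketch with the Azure SDK for Python, assuming you have a subscription ID and are logged in via the Azure CLI or environment credentials; the names are illustrative only:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

subscription_id = "<your-subscription-id>"  # placeholder
client = ResourceManagementClient(DefaultAzureCredential(), subscription_id)

# Create (or update) a resource group in the chosen region
rg = client.resource_groups.create_or_update(
    "tokyo-olympic", {"location": "southeastasia"}
)
print(rg.name, rg.location)
```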
That is the basic structure of the Azure cloud portal. I wanted to cover it because, as we go forward, we will create our own resource group and assign different resources to it under our subscription.

Now let's look at the Azure services themselves; we already saw them in the architecture diagram. First is Data Factory. Data Factory is a data integration service that lets you create, schedule, and manage data pipelines for efficient data movement and transformation between various sources and destinations. It is a service specifically designed for building data pipelines: you connect to multiple sources, create a simple flow, optionally transform some data along the way, and load it onto a target location — a data lake, a data warehouse such as Snowflake, or an RDBMS. That is the overall idea behind Data Factory.

Then we have Data Lake Storage Gen2. This is a solution that combines the capabilities of a data lake with the power of Azure Blob storage. Azure Blob is object storage: when you store a file in blob storage it is treated as a single object, the same as Amazon S3 or Google Cloud Storage, and inside a bucket there is no true hierarchical layout — every file is just an object. With the data lake solution you get a hierarchical file system just like the one on your local computer: folders inside folders, with your files stored inside them. There is more functionality on top of this that you can read about, but the core idea behind Azure Data Lake Gen2 is that it can handle structured and unstructured data and lets you run analysis and SQL queries on top of the stored files.

Next is Azure Databricks. To understand this, start with Apache Spark, the open-source framework we use for data processing and transformation. Databricks is the company that provides a managed platform around Apache Spark — not only Spark, but a whole set of solutions. It is a managed service where you write your Spark code without worrying about managing clusters and everything else; it is an online platform where you get everything in one place. If you try to install Apache Spark on your local PC you have to do everything yourself — download it, install it, set the environment variables — but here all of that is managed by Databricks, and they provide additional services on top, such as Delta Lake.
So many different services come bundled on top, which makes it a good solution. What Azure did is partner with Databricks and integrate the whole service on Azure Cloud — that is Azure Databricks. Databricks is a unified analytics platform built on top of Apache Spark, designed to help data engineers and data scientists collaborate on big data processing and machine learning work. It provides tools for exploration, processing, and building machine learning models in a scalable environment — everything we just talked about.

Then we have Synapse Analytics. This is essentially the data warehouse offering on Azure Cloud, just like Google BigQuery, Amazon Redshift, or Snowflake. It is a cloud-based SQL data warehousing and analytics service provided by Microsoft Azure that combines big data and data warehousing in a single platform. You can also do a lot more with it, such as building data transformation pipelines and writing Spark code; we will look at the Synapse part as we go forward.

Now, the fun thing about Azure is that these individual services overlap quite a bit. Inside Synapse Analytics you can create Data Factory-style workflows — they have essentially integrated the services together. So instead of using Data Factory and Azure Databricks separately, you could use Synapse Analytics alone and do the entire project there: build the pipeline, write the Spark code, and run the SQL. But to learn about several services in one video, I have deliberately split the flow across different services. Always remember, though, that you could do the whole thing in Synapse Analytics, in Databricks, or largely in Data Factory; for this tutorial we will use the individual services so you get a better understanding of each, and this will become much clearer once we get to the actual execution.
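For example, the kind of question we said the analytics layer will answer at the end — which country won the most gold medals — comes down to a short query. A hedged sketch using Spark SQL syntax, with the path, table, and column names assumed to mirror the medals file rather than taken from the project itself:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("medal-insights").getOrCreate()

# Placeholder path to the transformed medals data
medals = spark.read.csv("/data/transformed/medals", header=True, inferSchema=True)
medals.createOrReplaceTempView("medals")

# Column names assumed (country, gold medal count)
top_gold = spark.sql("""
    SELECT Country, Gold
    FROM medals
    ORDER BY Gold DESC
    LIMIT 10
""")
top_gold.show()
```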
Let's also have a quick look at what these tools look like. Data Factory gives you a portal with multiple activities in a drag-and-drop format — copy data, data flows, integrating a Databricks notebook or some Synapse SQL — where you connect a source to a target and add transformation steps on top. Then there is the Databricks notebook: you read a file, display it, and you have compute, a SQL editor, and queries available — again, the point about being able to do almost everything inside Databricks, just as you can do a lot inside Data Factory. And then there is Synapse Analytics, where you can write SQL queries, see your tables, and even create Data Factory pipelines — there is an Integrate button on the side that opens exactly the same pipeline UI inside Synapse. So they have integrated everything into one service while also offering separate services for the same work — a kind of redundancy. Don't get confused if that last part about the overlap between services isn't fully clear; it looked difficult to me too when I was starting out, but you get the hang of it once you start executing these things yourself.

Now let's create our Azure account; I will show you the steps, and after that we will start the execution. Creating an Azure account is pretty simple. Search for "Azure cloud", open the Azure site, and click on the free account option. You will be redirected to a page explaining that Azure provides 12 months of popular services for free. For the data services we need, not much is included in that free tier, so Azure also provides 200 USD of credit. Using that credit, we can easily complete this data engineering project for roughly 10 to 15 dollars at most if we use it carefully. The credit is valid for 30 days, so make sure that once you create your account you finish this project within that window; otherwise you would have to create a new account with a new phone number and card details. So yes, you can complete this project for free with the 200-dollar credit — just finish it within 30 days.

Click Start free. It will redirect you to create a Microsoft/Azure account — I will open it in incognito mode because I already have an account. Create an account, or log in if you already have one. Once you log in you will be redirected to a form asking for information about yourself, including your phone number; fill that in, click Agree, and go to the next step, where you will be asked for your card details.
This is only to verify that you are a real user, so don't worry — you will not get charged. Add any credit or debit card details you have so they can verify you, complete the OTP, click Next, and sign up. Once your account is created you will be redirected to the portal; if not, just go to portal.azure.com and log in. If the portal asks you to subscribe to the free trial, click that button, add your payment information, and your free trial will start. That is all there is to creating the Azure account; once you have it, we can start executing the project.

The project you are about to do was made possible by today's sponsor, ProjectPro — I have taken this project from their website, so you can also find the entire project on projectpro.io. ProjectPro is a platform with top big data and data science projects; if you want to build your own portfolio, visit the website — they have more than 200 projects in the big data domain, created by industry experts, with guided videos, course material, and slides for each project. They are also offering a flat 10% discount, so if you are interested in doing more of these projects, check the link in the description, sign up, mention my name, and you will get the discount then and there. Now let's start with our project execution.

We will start by looking at the architecture diagram in a little more detail. We have our data source — the Kaggle Tokyo Olympics data we already discussed — and we will copy that data from the source to Azure Data Lake Storage using Data Factory. We will build a simple pipeline that takes the raw data and uploads it onto the data lake storage, which is Data Lake Gen2.
Once our data is available in raw storage, we will use Azure Databricks and write PySpark code to do some basic transformation. Again, the goal of this project is to build a data pipeline and show you how to put a data engineering project together end to end; we are not mainly concerned with making the transformation logic sophisticated. We will simply remove some duplicates, drop and rename a few columns, and upload the result back to the data lake storage as the transformed data. That is the initial pipeline: land the raw data, do some basic transformation, and load it back into the data lake. Then we will use Synapse Analytics to understand the data — we can write SQL queries on top of the transformed data and extract insights from it — and, if you want, build a dashboard using Power BI, Looker Studio, or Tableau. We will do everything step by step.
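Since the transformation itself is deliberately simple, here is roughly what that Databricks notebook step will look like in PySpark — a sketch under assumptions: the abfss path and the column being renamed are placeholders, and in the actual walkthrough access to the storage account has to be configured in Databricks first:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tokyo-olympic-transform").getOrCreate()

# Placeholder ADLS Gen2 path: abfss://<container>@<account>.dfs.core.windows.net
# Access (account key, service principal, or mount) must be set up separately.
base = "abfss://<container>@<account>.dfs.core.windows.net"

athletes = spark.read.csv(f"{base}/raw-data/athletes.csv", header=True, inferSchema=True)

# Basic cleaning: drop exact duplicates and rename a column (column name assumed)
athletes_clean = (
    athletes.dropDuplicates()
            .withColumnRenamed("PersonName", "person_name")
)

# Write back to the transformed layer of the same data lake
(athletes_clean.write
    .mode("overwrite")
    .option("header", True)
    .csv(f"{base}/transformed-data/athletes"))
```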
So the first thing to do is log in to your Azure portal. As per the architecture diagram, to ingest the data into the data lake we need two things: a data factory to create the simple pipeline, and Data Lake Gen2 storage for the raw data, so we will create these two resources first.

In the search bar (or the sidebar) type "storage account" and open Storage accounts, then click Create storage account. First select the subscription — if you are on the free trial, use that; if your trial is over you may have to pay, but if you just created your account you should have the 200 dollars of free credit. For the resource group, create a new one and name it tokyo-olympic; click OK. This is our resource group, and all the different resources for this project will live inside it — the concept we covered at the start of the video. Next give the storage account a name, for example tokyoolympicdata; it has to be unique across all of Azure, so if it is taken, add your own name at the end. Then select a region — pick the one nearest to you; for me that is Southeast Asia, but choose whatever is closest to where you live. For performance keep Standard, and for redundancy — which controls whether your data is replicated across data centers or regions — keep the default for this tutorial; in the real world you would choose a different option if you had a requirement to make the data available across regions.

Click Next to the Advanced tab. Here there is one setting we do need: enable the hierarchical namespace. That means the files you store inside the containers of this storage account are organized hierarchically, the way they are on your local PC. Normally, when you store data in object storage such as S3, Google Cloud Storage, or plain Azure Blob storage, every file is stored as a flat object; with the hierarchical namespace enabled you can work with your data as simple directories, just like on your local computer. So make sure you tick hierarchical namespace when creating the storage account. Everything else can stay default: click through Networking, Data protection, Tags, and Review; it takes a few seconds to validate the settings, and once validation passes, click Create. The storage account takes a couple of minutes to deploy.

When the deployment is complete, click Go to resource, and you will land on the storage account page. There is a lot here — don't be intimidated; you will get used to it as you work with it. The Overview shows the essentials: the resource group this storage account belongs to, the location with its primary and secondary regions, the subscription ID, the replication setting, and so on, along with security and networking information. This layout is fairly common across cloud providers, and you don't need to understand every item right now; you can pick these things up as the need arises. On the side panel there is also the activity log (to track what has happened), tags, access control (IAM), data migration, and more. The thing we actually care about is Containers. As discussed earlier, an Azure storage account offers four things: containers, where data is stored as objects; file shares; queues; and tables. We will store our data inside a container, so click Containers, then the plus icon to create one, give it the name tokyo-olympic-data, and click Create.

Click into the container; this is where all of our data will live, and inside it we want two folders. Click Add directory: the first folder is raw-data, where we will store the raw data exactly as we extract it from the data source, and the second is transformed-data. With these two folders the plan is easy to see: Azure Data Factory will ingest the raw data from the data source into the raw-data folder, and after the Azure Databricks transformation we will load the output into the transformed-data folder.
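For reference, the container and the two folders we just set up through the portal could also have been created with a few lines of Python using the azure-storage-file-datalake package — a hedged sketch, assuming you have the storage account name and one of its access keys:

```python
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_NAME = "tokyoolympicdata"       # the storage account created above (yours may differ)
ACCOUNT_KEY = "<storage-account-key>"   # placeholder - taken from the portal's Access keys page

service = DataLakeServiceClient(
    account_url=f"https://{ACCOUNT_NAME}.dfs.core.windows.net",
    credential=ACCOUNT_KEY,
)

# Container (file system) plus the two zones of our little data lake
fs = service.create_file_system("tokyo-olympic-data")
fs.create_directory("raw-data")
fs.create_directory("transformed-data")
```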
This is generally how things work in the real world: you extract the data and land it in one location, and once you transform it you put it in another location. So our Azure data lake is set up; the storage account, the container, and the data lake side are done. The next thing we want is to copy the data from the data source into that target location, and for that we need Azure Data Factory.

Search for "data factory" in the portal; you can right-click the result and open it in a new tab so we can easily navigate between the different services. This is the Azure Data Factory console. Just like with the storage account, click Create data factory, and you will be taken to a similar creation page. For the resource group we do not create a new one — we already have our Tokyo Olympic resource group, so select that. Give the data factory a name, say tokyo-olympic-df (df for data factory), and select the nearest region, Southeast Asia in my case. Note that there are sometimes limits on how many resources you can create in a particular region; if you get an error saying the quota was exceeded in that region, just pick a different region — it doesn't matter which. That is all we need: click Next, skip the Git configuration, click through the networking tab, leave the data encryption settings alone, let it review your settings, and click Create. The deployment of the data factory resources starts, and once it is completed, click Go to resource, just like we did for the storage account. You will see the panel for the new Data Factory resource, and just like the storage account, the side panel has a lot of options — IAM, the activity log, networking, and so on.
You can always dig into the configuration side if you want, but as a data engineer you are mainly interested in working with the data; in most companies the networking and security aspects are handled by a dedicated team. Understanding them still helps you become a better engineer, so it is recommended, but it is not mandatory, and for this tutorial we are not deep diving into networking or security — we mainly want to build the data pipeline.

Now that the Data Factory resource exists, click Launch Studio, which opens the Azure Data Factory Studio; it takes a few moments to load. Our data factory, tokyo-olympic-df, is ready, and this is where we create the pipeline that extracts data from the sources and loads it onto the target location. You will see options to ingest, orchestrate, transform, and configure — we already covered the theory behind these, so I will focus on execution; feel free to click through the options yourself. Let me close a couple of panels so the screen is cleaner. On the left we have Home, where your data factory work is surfaced; Author, which is where you actually create your pipelines; Monitor, where you can track running pipelines and investigate any errors; and Manage, which is the configuration side — linked services (which we will create shortly), integration runtimes, and other parameters. You can explore these, but let's move on and start ingesting our data.

Click on Author, then the plus icon, and choose Pipeline. There are other options such as change data capture, datasets, and data flows, but we want a simple pipeline, so just click Pipeline and give it the name data-ingestion. Naming it clearly means that if we end up with multiple pipelines we can easily identify this one — that is just good practice. This opens the pipeline canvas; you can collapse the side panel and the Properties pane using the arrows so the workspace is cleaner. Inside Data Factory we have Activities, which are the operations you want to perform on the data — copy data under Move & transform, notebook activities, and more.
For example, you can run a notebook from Synapse Analytics, call Azure Data Explorer functions, and integrate plenty of other services directly inside Data Factory. This is the fun part about Azure: inside Synapse Analytics you can create Data Factory pipelines, inside Data Factory you can call Azure Databricks — the services are compatible with each other and share similar features, so you could use Synapse Analytics alone and build the pipeline and the Spark code inside it. As I said, I have deliberately picked separate services so you get exposure to each of them.

In our case we want to copy data, so either expand Move & transform and drag the Copy data activity onto the canvas, or just search for "copy data". We want to copy data from our data source to our data lake storage. Inside the copy activity we have to configure a source, which is where the data actually lives, and a sink, which is where we want to load it.

So where is our data stored? I uploaded all the data we need for this tutorial to my GitHub repository. Why? Because accessing the data directly from Kaggle takes several extra steps — you have to get API access, authentication keys, and so on — so I downloaded the files and pushed them to GitHub; the link to the repository is in the description. Using that repository as our data source, we will extract the data and load it onto our Azure storage location. We have the five files we already walked through, so let's start with athletes.csv. Click the file, and then click the Raw button; it redirects you to a page whose URL is the raw URL of that CSV file. That is the first step: for each file you want to extract, grab its raw URL. Copy it — we will use this URL inside Data Factory to pull the file and land it on the target location.

Back in Data Factory, click on Source at the bottom of the copy activity. First we need to create the link from Data Factory to the GitHub repository for this particular file, and we will have to do this one file at a time — for each of the five files in the repository we will grab the raw URL and wire it up as a source. Click the new source option, and you will see that you can extract data from many places: Amazon RDS, Amazon S3, Azure Blob storage (if your data is already sitting in blob storage you can pull it from there), Azure Data Lake Gen2, and plenty more. What we are mainly interested in is HTTP, because the raw file is served over a plain HTTP endpoint — you could hit this same URL with any HTTP client and the file would download to your computer as well.
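That claim is easy to verify outside Data Factory; a minimal sketch with the requests library, where the URL is a placeholder for the raw link you copied:

```python
import requests

# Placeholder - paste the raw URL you copied from the GitHub "Raw" button
raw_url = "https://raw.githubusercontent.com/<user>/<repo>/main/Athletes.csv"

resp = requests.get(raw_url, timeout=30)
resp.raise_for_status()            # fail loudly if the URL is wrong

# The response body is the CSV text itself - the same bytes the
# Data Factory HTTP connector will pull for us.
print(resp.text.splitlines()[0])   # header row (person name, country, discipline)
```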
So click HTTP and Continue. Next we specify the file format; Data Factory supports several formats, such as Avro, binary, delimited text, Excel, JSON, XML, and Parquet. Our file is a CSV, so choose DelimitedText (CSV) and click Continue. Now give this dataset a logical name — we are copying the athletes CSV, so call it athletes, copying the name from the repository.

Next we create something called a linked service, which is the actual link from our data factory to the GitHub repository. Click New, and give it a name like athletes-http so it is clear this is our source; the description is optional. The main thing is the base URL, which is the raw GitHub URL — make sure you paste the raw URL you got by clicking the Raw button, not the normal page URL, because those are different. For the authentication type choose Anonymous; you could provide GitHub credentials (username and password), but since the repository is public no authentication is needed. Click Create, and the HTTP linked service between Data Factory and the GitHub repository is created. Leave the relative URL empty, and make sure "First row as header" is checked, because the first row of our file contains the column headings — person name, country, and discipline. For import schema choose None and click OK.

We have now linked our source to Data Factory. Click Preview data and you should see the data exactly as it is in the GitHub repository — person name, country, discipline, and all the rows. If you can see this, the connection from Data Factory to GitHub was created successfully. Leave everything else as it is. Before setting up the sink, give the copy activity itself a name: click the General tab and rename it to athletes. The source is done; now for the sink, which is where we load the data onto our Azure Data Lake Storage.
Here we have to create a new sink dataset and connect it to Azure Data Lake Gen2: click New, select Azure Data Lake Storage Gen2, and Continue. You get the option to convert the data from CSV to ORC or Parquet on the way in, but we will stick with CSV for ease of understanding; click Continue. We do the same thing we did for GitHub, only now the target is ADLS, so name the dataset something like adls (for Azure Data Lake Storage Gen2). For the linked service click Create new; you can keep the default name. For the subscription select your free trial, leave everything else default, and for the storage account select the one we created, tokyoolympicdata (yours may differ depending on the name you chose). It should now make sense why we created the storage account first — without it, there would be nothing to select as the target location. As per the architecture diagram, we are extracting data from our data source using Data Factory and loading it to Data Lake Gen2: GitHub is the source we already linked, and the sink is our storage account. Click Create, and the linked service from Azure Data Factory to Azure Data Lake Storage is created.

Now select the path where the data should land: click the browse icon, open the tokyo-olympic-data container, choose the raw-data folder, and click OK, so the file path points at the container's raw-data directory. Also give the output a file name — if you don't, Data Factory will generate a random one — so keep it as athletes.csv, in lowercase. First row as header stays checked, import schema is None, and click OK.

Everything looks good: we have a source and a sink. Click Validate, which checks all the settings; if no errors show up, click Debug. Debug simply runs the pipeline once so you can test whether it works. The run is queued at first and takes a minute or two, and once it succeeds you can go to the raw-data folder and see athletes.csv sitting in the storage account. How cool is that? We took data from GitHub, stored somewhere out in the world, created a link using Azure Data Factory, used a simple copy activity, and landed the data in our storage account. And this is what generally happens in the real world — instead of Azure Data Factory you might use a different tool, but the process is the same: data is stored across different systems, you integrate with it, pull it in, land it in raw storage, and then you transform it.
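Just to demystify what the copy activity is doing under the hood, here is a rough by-hand equivalent in Python using requests plus the azure-storage-file-datalake SDK. The account name, key, and URL are placeholders, and this is only an illustration — in the project the copy is done by Data Factory itself:

```python
import requests
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholders - substitute your own storage account name/key and raw URL
ACCOUNT_NAME = "tokyoolympicdata"
ACCOUNT_KEY = "<storage-account-key>"
RAW_URL = "https://raw.githubusercontent.com/<user>/<repo>/main/Athletes.csv"

# 1. "Source": fetch the CSV over HTTP, exactly like the HTTP linked service
csv_bytes = requests.get(RAW_URL, timeout=30).content

# 2. "Sink": write it into the raw-data folder of the ADLS Gen2 container
service = DataLakeServiceClient(
    account_url=f"https://{ACCOUNT_NAME}.dfs.core.windows.net",
    credential=ACCOUNT_KEY,
)
fs = service.get_file_system_client("tokyo-olympic-data")
fs.get_file_client("raw-data/athletes.csv").upload_data(csv_bytes, overwrite=True)
```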
That was one file; now we want to repeat this for all the other files in our GitHub repository. Drag in another copy data activity. Back in the repository, open the coaches file and click Raw. Copy the name, and rename this activity to coaches. For the source you have to create a new source dataset, because the raw URL for coaches is different from the athletes one: click New, search for HTTP (I will fast-forward a little, since it is the same procedure), pick CSV, Continue, and name it coaches. For the linked service, click Create new, name it coaches-http, paste the base URL (the raw URL for coaches), set authentication to Anonymous, and click Create — this creates the connection to the coaches file. Click OK and then Preview data, and you should see the coaches data.

For the sink, click New, search for Azure Data Lake Gen2, Continue, and select CSV. This time you can reuse the existing linked service — there is no need to create a new one, because we are already connected to our storage account and everything will be stored there. Rename the dataset to coaches-sink, browse to the raw-data folder, click OK, set the file name to coaches.csv, import schema None, and OK. Coaches is done, so connect the two activities with the arrow, click Validate, and if there are no errors click Debug. The activities run in sequence: first the athletes copy completes, then the coaches copy. Once both succeed, refresh the storage account and you will see two files. I actually made a typo in the coaches file name, so I opened the sink dataset again and fixed it to coaches.csv — it looks good now, and you can re-run to pick up the corrected file.

Now we repeat the same steps for the remaining three files. I will fast-forward through them because it is exactly the same thing; you can follow along and do it with me. For the next file, entries gender, I drag in a copy data activity, create a new HTTP source dataset and linked service, paste the raw URL as the base URL, and keep authentication as Anonymous.
Set import schema to None, keep first row as header, and the source is created; Preview data looks good. I am fast-forwarding because we have already been through these steps. For the sink: New, Azure Data Lake Storage Gen2, CSV, name it entries-gender-sink, select the existing ADLS linked service, browse to tokyo-olympic-data/raw-data, click OK, set the file name to entriesgender.csv, import schema None, and OK. That is entries gender done.

Next is medals. Click Raw on the medals file and copy the name. For the source: New, HTTP, Continue, CSV, name it medals, then create a new linked service medals-http with the raw URL as the base URL, Anonymous authentication, and schema None. Preview data looks good. For the sink: New, Azure Data Lake Gen2, CSV, name it medals-sink, reuse the existing ADLS linked service, point it at raw-data, set the file name to medals.csv, schema None, OK. Medals is done.

The last one is teams. Drag in a copy data activity named teams; for the source, create a new HTTP dataset (CSV) called teams with a new linked service teams-http, the GitHub raw URL as the base URL, Anonymous authentication, and Create; Preview data looks fine. For the sink: New, Azure Data Lake Gen2, CSV, name it teams-sink, reuse the ADLS linked service, choose the raw-data folder, and set the file name — here I accidentally typed medals.csv instead of teams.csv, which we will catch in a minute — with first row as header, schema None, OK. Now connect all the activities in a chain. Pretty simple — people charge thousands of rupees to teach this, and I am doing it for free, so I would appreciate it if you like the video, comment, and subscribe.

So this is what the pipeline looks like: copy activities for athletes, coaches, entries gender, medals, and teams. Click Validate, close the validation pane, and click Debug; the entire pipeline starts running, and we should end up with all of the data on Azure Data Lake Gen2.
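Clicking through five nearly identical copy activities also hints at how you would script this if you ever had many more files: the per-file work is just a loop over names and raw URLs. A small sketch building on the by-hand copy from earlier, with the same placeholder account details and assumed source file names — purely illustrative, since in this project Data Factory does the copying:

```python
import requests
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_NAME = "tokyoolympicdata"          # placeholder
ACCOUNT_KEY = "<storage-account-key>"      # placeholder
RAW_BASE = "https://raw.githubusercontent.com/<user>/<repo>/main"  # placeholder

# The five files of the Tokyo Olympics dataset and their target names
FILES = {
    "Athletes.csv": "athletes.csv",
    "Coaches.csv": "coaches.csv",
    "EntriesGender.csv": "entriesgender.csv",
    "Medals.csv": "medals.csv",
    "Teams.csv": "teams.csv",
}

service = DataLakeServiceClient(
    account_url=f"https://{ACCOUNT_NAME}.dfs.core.windows.net",
    credential=ACCOUNT_KEY,
)
fs = service.get_file_system_client("tokyo-olympic-data")

for source_name, target_name in FILES.items():
    data = requests.get(f"{RAW_BASE}/{source_name}", timeout=30).content
    fs.get_file_client(f"raw-data/{target_name}").upload_data(data, overwrite=True)
    print(f"copied {source_name} -> raw-data/{target_name}")
```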
So the pipeline is running now; let's wait for it to complete and see how long it takes to copy the data from the HTTP server into our Azure Data Lake Storage Gen2. While the activities run you can refresh the storage account: the athletes and coaches files are available, entries gender is complete, medals is available, and we are waiting on teams. Everything shows as succeeded, but the teams file did not show up. Let me check the teams sink — yes, I made a mistake: I named the teams output file medals.csv instead of teams.csv because I was going too fast, so make sure you don't make the same mistake. I fix the file name to teams.csv, run the pipeline again, and once it completes we should see all of the files, so I'll skip ahead to when it finishes. Okay, the pipeline ran successfully: if I refresh, everything shows as succeeded, and in the container all five files are available. If you want, download one of the files and verify that it really makes sense; everything looks good and nothing is broken. So we have completed these two parts of the architecture: data ingestion and the raw data store. If you reached this section, congratulations — you just completed the first part. I'd really appreciate it if you like this video and comment below that you got this far; that way I know you actually worked through the project and whether you found it helpful. Now we will jump to the Azure Databricks part, where we will write our basic transformation code. Go back to your Azure portal (you can close Data Factory, we won't need it for now). We want to create a Databricks workspace, so search for Databricks at the top, click Azure Databricks, and repeat the same kind of setup we did at the initial stage: click Create, set the subscription to your free trial, choose the Tokyo Olympics resource group you already created, and give the workspace a name, say tokyo-olympic-db (we used "df" for the Data Factory, "db" for Databricks). Select the region nearest to your location, and for the pricing tier you can choose Standard, Premium, or Trial; we will go with Premium. Click Next through the remaining tabs, then Review + create; validation takes a few moments, and once it is done click Create and the Azure Databricks workspace starts deploying. It's the same process as before and pretty simple: give it a name, select a few options, and you're done. You're probably getting used to this pattern by now, because the first thing you always do when working with any Azure service is
create the resource for it: pick your resource group, give it a name, select a few options, and you're done. You can also check the documentation if you want to understand the different options in more depth, but for tutorial purposes this is good enough. We'll wait for the deployment to complete and then move on. Okay, the deployment of our Azure Databricks workspace is complete. Click "Go to resource" and you will be redirected to the resource page; it has a similar layout, with multiple options on the left side, but the thing we are mainly interested in is the workspace itself, so click Launch Workspace and you will be redirected to the Azure Databricks workspace, where you can create notebooks and everything else you need to write your code. Once login and authentication finish, you will see the Databricks workspace. It's nothing complicated — when you only hear about these tools from the outside, "Azure Databricks" can sound intimidating, and some parts of it genuinely are complex, but overall it is pretty simple, so don't be afraid to try these things; whenever you want to learn something, go to the platform, create an account, and get started — that is the best way to learn. In the sidebar we again have multiple options: Workspace is where everything you create appears, Recents is your recent activity, Data shows the databases and tables you create in this workspace, and there are Workflows, Compute — the compute resources where your code actually runs — and a lot more. The first thing we want is compute, because the Spark code we write has to run on something, so click Create compute. The first setting you see is the policy, which defines the type of cluster you are allowed to create; there are options like Personal Compute and shared compute for multiple users, but we will go with Unrestricted. Then there are two modes: multi node and single node. With multi node you can set a minimum and maximum number of workers, i.e. how many machines run in the background, and the cluster scales automatically depending on the code you run in your notebook — if you want to understand how Spark workers operate, check my "Learn Apache Spark in 10 Minutes" video linked in the description. For tutorial purposes we will go with a single node so we don't waste resources; you can try multi node, but you might hit an error because of the quota limits Azure puts on trial accounts. So choose Single node, single user access, keep the default runtime (you can pick other runtimes, but the default is fine), and for the node type choose based on the memory and cores you want — we'll go with the standard one with 14 GB of memory
and 4 cores, because we don't have very large data; in the real world you would size the compute according to your requirements. Once everything looks good, click Create compute. It takes a few minutes to spin the cluster up, and once it's ready we will continue. Okay, the cluster creation for Azure Databricks is complete; if you click on Compute you will see everything is ready. Now we want to write our Spark code to read the data, transform it, and store it back to the target location. As per the architecture diagram, the data is sitting in the data lake, we have created our Azure Databricks workspace, and now we want to write some simple code that gets the data from the raw layer, transforms it, and puts it into the transformed layer. Click New, then Notebook; you can close the pop-ups, and let's name the notebook "Tokyo Olympic Transformation". Next we need to create a connection from Azure Databricks to our Azure Data Lake Storage so that we can easily access the data. The steps to connect these two services are a little bit different, so let's go through them in a simple way. What we need to do is mount the Azure Data Lake Storage onto Databricks so we can access its files. Mounting is basically attaching — just like you use a USB cable to physically attach a hard disk, we attach this storage to Databricks, providing some authentication in the background, so the data becomes easy to access. There are a few steps involved, and the notebook is already available on GitHub, so you can copy the code and follow along with me. First, make sure the notebook is attached to your Spark cluster — in my case it already is; if it isn't in yours, attach it. After that, go to the Azure portal and search for "App registrations". This step exists because we need to create an application identity and get credentials for it in order to connect Databricks to ADLS; you can read more about it if you like, but for now just follow along and by the end you will understand why we did all of this. Give the app a name, say app01, and click Register — you don't have to change anything else. Once the app is registered we need two things from this page: the client ID and the tenant ID. Open Notepad or any document, copy the Application (client) ID and paste it there, then copy the Directory (tenant) ID and paste it there as well; we will need both as we go forward. After that, click on "Certificates & secrets", because we also need a client secret to access ADLS. Click "New client secret", give it a name like secret-key, click Add, and copy the secret value
into your notes as well, labelled as the secret key. For tutorial purposes I am exposing all of these values on screen, but I will delete these resources afterwards so nothing leaks; just follow along and focus on understanding what I'm doing. So now you have three things — the application (client) ID, the tenant ID, and the client secret value — and these three credentials are what we use to create the connection from Databricks to ADLS. Once we have them we can start writing the code and integrating; we will hit some errors along the way and see how to solve them. So let me pull up the code. The first thing we need is the configuration dictionary, which you can take from the notebook in the GitHub repository. In it we replace a few placeholders: put your client ID where the client ID goes, your secret value where the secret goes, and your tenant ID inside the OAuth endpoint URL. This is the basic authentication setup required to connect Databricks to the Data Lake Storage. In a real project you would not hard-code these values: Azure has a service called Key Vault where you store secrets, keys, and passwords for your resources, and your code reads them from Key Vault so nothing sensitive lives in the notebook. Setting up Key Vault and its access policies involves a few extra steps, so for this tutorial we will put the credentials directly in the code — it is easier to understand this way — but remember this is not best practice; in the real world you should use Key Vault to access keys and IDs securely. So with that, we have the configs ready.
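For reference, here is a minimal sketch of what that configuration dictionary typically looks like when authenticating to ADLS Gen2 with a service principal over OAuth; the placeholders <application-client-id>, <client-secret>, and <tenant-id> are assumptions you replace with the values copied from your own app registration.

# Minimal sketch of the service-principal OAuth configuration used for the mount.
# Replace the three placeholders with your own app registration values.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-client-id>",
    "fs.azure.account.oauth2.client.secret": "<client-secret>",
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}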
After the configs, the next piece is the mount call itself, dbutils.fs.mount, which needs three things. The first is the source: an abfss path made up of the container name and the storage account name — container first, then the @ sign, then the storage account name. Go to your storage account and copy both carefully; this is a little fiddly, and honestly I disliked this part when I started with Azure because of how much authentication plumbing a basic connection needs, but it is what we have to do. Do not confuse the container with the storage account — I'll put a comment in the code, container@storageaccountname, because if you mix them up you will get an error. Optionally you can append a slash and a folder name if you only want to mount one folder (for example raw), but we will leave that out and mount the whole container. The second thing is the mount point, the path we want to mount it as, say /mnt/tokyoolympic, and the third is the configs we just prepared. Run it — and the first error is a typo on my side, a single quote where a double quote should be. Fix that and run again; if we get another error it will most likely be about permissions. After a moment the mount succeeds, and we can verify it by listing the mount location with dbutils.fs.ls. If the connection were fully working we would see all of our files there, but instead we get an error saying we do not have permission to access the files stored in the data lake, so we have to grant that access explicitly. What is really happening is that we are accessing the data with the credentials of the app registration we created, but we never gave that app permission to read the data lake. To fix it, go to your storage account, open the container, click Access Control (IAM), then "Add role assignment", and choose the Storage Blob Data Contributor role — this lets the app read, write, and delete objects in this container. Click Next, select Members, search for app01 (our app name), select it, then Next, Review + assign; this adds a new role assignment so the app can access the Olympic data. Wait a few minutes, run the listing again, and now you can see the raw and transformed folders and access the data inside them. You will probably hit a few errors while doing this (congratulations if you don't): make sure you pasted the secret value properly, that the container name and storage account name are in the right places, and that you assigned the Storage Blob Data Contributor role to the app. Once that is done and you have waited a few minutes, you should see the listing output, which means the entire data lake is now mounted at this location.
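A minimal sketch of the mount and the verification step follows, assuming the container is named tokyo-olympic-data, the storage account is named tokyoolympicdata, and the mount point is /mnt/tokyoolympic — adjust all three to whatever names you actually used.

# Minimal sketch: mount the container using the configs defined above, then list it.
# The container, storage account, and mount point names here are assumptions.
dbutils.fs.mount(
    source="abfss://tokyo-olympic-data@tokyoolympicdata.dfs.core.windows.net",
    mount_point="/mnt/tokyoolympic",
    extra_configs=configs,
)

# This listing only succeeds once the app registration has been granted the
# Storage Blob Data Contributor role on the container.
display(dbutils.fs.ls("/mnt/tokyoolympic"))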
Now let's try reading one file, say the athletes file, from the raw folder of the mount. Normally, when you write Spark code you first have to create a SparkSession — the entry point to the Spark application — and then write your code, but on Databricks that is already set up for you: if you inspect spark you can see the app name and everything is already available, so you don't have to build the session from scratch. In general, though, if you start Spark code on any other platform you create it yourself: in PySpark you import SparkSession from pyspark.sql, call SparkSession.builder, give it an app name, and call getOrCreate(). I just wanted to point this out because a lot of people get confused about why we don't create the Spark object here — we already have it. So we can write athletes = spark.read.format("csv").load(...) with the mount location plus the raw folder and athletes.csv. If I run this and call show(), we get a DataFrame, but there is a problem: the actual header row is being treated as data. To fix that we use option("header", True) — options are a basic part of Apache Spark — and now when we run it and call show() we get a proper DataFrame. Next we read all of the DataFrames available in the data lake: copy the same line five times and go one by one — coaches, entries gender (dropping the .csv typo), medals, and teams. Note a concept in Apache Spark called lazy evaluation: loading the data, like any transformation, is a transformation block, and Spark does not actually execute it when you define it; a transformation only runs when you perform an action, and at that point Spark executes the whole chain from start to end to give you the final output (if you have watched my "Learn Apache Spark in 10 Minutes" video you will recognize this). We will see it again when we do the actual transformations; I just wanted to point out the basic concept as we go. For the transformation itself we will rename some columns and change some data types — there isn't really any garbage data, because this dataset is pretty clean — and we will also do some basic analytics with Apache Spark code so you get a feel for the different functions available. Let's start by understanding the dataset: there is a function called printSchema, and calling it on athletes prints the entire schema of the DataFrame. PersonName, Country, and Discipline are all strings, which makes sense, so the athletes DataFrame looks good.
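Here is a minimal sketch of that read step; the spark session already exists on Databricks, and the builder line is only included for completeness if you run this somewhere else. The raw-data folder name and the /mnt/tokyoolympic mount point are assumptions — use whatever you named them.

# Minimal sketch of reading the five raw CSV files; folder and mount names are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TokyoOlympicTransformation").getOrCreate()  # already provided on Databricks

base = "/mnt/tokyoolympic/raw-data"
athletes      = spark.read.format("csv").option("header", True).load(f"{base}/athletes.csv")
coaches       = spark.read.format("csv").option("header", True).load(f"{base}/coaches.csv")
entriesgender = spark.read.format("csv").option("header", True).load(f"{base}/entriesgender.csv")
medals        = spark.read.format("csv").option("header", True).load(f"{base}/medals.csv")
teams         = spark.read.format("csv").option("header", True).load(f"{base}/teams.csv")

athletes.show(5)        # show() is an action, so this is where Spark actually reads the file
athletes.printSchema()  # every column comes back as string until we fix the types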
We can move on to the coaches DataFrame: call coaches.show() to print its contents and printSchema() as well, and again all of the columns are strings, which matches the data, so nothing is wrong there. Then we have entries gender: call show() and printSchema() on it, and here the Discipline column is rightly a string, but the number of females, the number of males, and the total of the two should be integers, while in our schema they are treated as strings. So we need to change the data type from string to integer, and that is what basic transformation really means: getting the data into the proper format so you can pass it forward. When you work with multiple data sources you will get a lot of data that is not in the proper data type, and you will have to apply the correct types manually so that quality data flows downstream for analysis or for building machine learning models. To do this, Apache Spark has a function called withColumn: it takes a column name as a string and an expression to apply. We use "Female" as the column name — effectively overwriting the existing Female column — and for the expression we use col("Female").cast(IntegerType()). For this we have to import two things at the top of the notebook: the col function from pyspark.sql.functions, and the types (IntegerType, DoubleType, BooleanType, and so on) from pyspark.sql.types; you can add a new cell with the plus icon and put the imports there. We want to do the same for Male and Total, so you can either chain more withColumn calls on the same line or continue on new lines — in Python you continue a statement on the next line with a backslash — adding .withColumn("Male", col("Male").cast(IntegerType())) and .withColumn("Total", col("Total").cast(IntegerType())). I run it and get an "unexpected type" error because I missed the parentheses on IntegerType; after adding them it runs successfully, and printSchema() now shows Discipline as a string and Female, Male, and Total as integers. If you had tried to sum these columns while they were strings you would have hit an error, but now they are proper integers, so we were able to successfully convert the Female, Male, and Total columns from string to integer.
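A minimal sketch of that cast, assuming the columns are named Female, Male, and Total as in the entriesgender file:

# Minimal sketch of casting the count columns from string to integer.
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

entriesgender = entriesgender \
    .withColumn("Female", col("Female").cast(IntegerType())) \
    .withColumn("Male", col("Male").cast(IntegerType())) \
    .withColumn("Total", col("Total").cast(IntegerType()))

entriesgender.printSchema()  # Female, Male, and Total now show up as integer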
Next we have the medals DataFrame: call show() first to look at it — it has a lot of numeric values — and then medals.printSchema(), and again we see a lot of string columns that should be integers. We could convert these manually with the same withColumn code I just showed you, but there is another way: you can do it while reading the file, using the inferSchema option set to true. If you add that option to the read (and you can apply it to all of the reads), Apache Spark will scan the CSV and try to work out the type of each and every column automatically, so now medals.printSchema() shows the numeric columns as integers without any manual casting. I showed you both ways on purpose, because I want you to see the different ways you can shape this data while writing basic Apache Spark code. So medals is done, and you can do the same check for teams: show() and printSchema() reveal that everything in teams is a string, which is fine for that file. That covers the reading part; now we can also do some basic transformations on top of it. In the real world you might have to filter data, group it, and build new DataFrames so the business gets data at a specific granularity, but here we are doing it for tutorial purposes only. For example, say we want to find the top countries with the highest number of gold medals. The medals DataFrame has a country column and a Gold column, so you can write medals.orderBy("Gold", ascending=False) and call show(): you will see the United States with 39 golds, Japan with 27, Great Britain with 22, and so on. If you only want the two relevant columns, add a select with the country column and Gold, and the output shows just those two. Or say you want to calculate the average number of entries by gender for each discipline: I already have this code written to save time, but what it does is take the entriesgender DataFrame and use withColumn to add a new column, Avg_Female, which divides Female by Total, and likewise Avg_Male, which divides Male by Total; run it and you get the average female and average male share for each discipline. You can do this kind of analysis with Spark code and also with SQL.
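A minimal sketch of those two examples, assuming the medals file has Team_Country and Gold columns (check the exact column names in your own data) and that entriesgender has the Female, Male, and Total columns cast above:

# Minimal sketch: re-read medals letting Spark infer the column types,
# then run the two small analyses. Paths and column names are assumptions.
from pyspark.sql.functions import col

medals = spark.read.format("csv") \
    .option("header", True) \
    .option("inferSchema", True) \
    .load("/mnt/tokyoolympic/raw-data/medals.csv")

# Top countries by number of gold medals
medals.orderBy("Gold", ascending=False).select("Team_Country", "Gold").show()

# Average number of entries by gender for each discipline
entriesgender \
    .withColumn("Avg_Female", col("Female") / col("Total")) \
    .withColumn("Avg_Male", col("Male") / col("Total")) \
    .show()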
There are many more functions available in Apache Spark — go to the Apache Spark documentation and you will find literally every transformation function you might need: map, filter, union, and so on. I have plans to launch a detailed Apache Spark course in the future, so you can register for that if you want; I already have Python, SQL, and data warehouse courses available, and you can check the link in the description — by the time you watch this video it might be live, so keep an eye out for new courses if you want a detailed guide on each of these topics. That is what you can do with Databricks for this part. The reason I added this section is to show you the connection and how to read data, because all of these pieces matter when you build the pipeline; the goal of this code was not to transform anything heavy, just to show you how to read data, do some basic transformation along the way, and then load the data to the target location, which is what we will do right now. So we read the data and did some basic transformation; now I want to write it back. Take the DataFrame, say athletes, and call write with a mode — we might be writing this data again and again, so we use "overwrite" — then option("header", True) because I want the header, and then .csv() with the destination. Where do we want to put the file? In our mount location, under the transformed folder, so I paste that path and give the name athletes; Spark will write it as CSV. If I run this, it writes to our storage: this is where the transformed data lives, and Spark created a folder named athletes with the actual file inside. This is simply how Apache Spark stores data — it writes a folder containing some metadata files plus the data files. If you want a single plain file you could convert the DataFrame to a pandas DataFrame and write that, but we will keep it as it is for understanding purposes; in general, when Apache Spark writes output it creates a folder and writes the part files along with the metadata. If I try to run the write again I get an error because the path already exists; using mode("overwrite") takes care of that, and now it simply overwrites the existing output. If your data is large, Apache Spark splits the output into multiple smaller part files — part-00000, part-00001, and so on. In the storage you can see the part file that holds the actual CSV; download and open it, and it is the same data as the athletes file in the raw storage, only this time written by our Apache Spark code. If you want to control the number of output files, there is repartition: you pass the partition count, so if I use repartition(3) and run it, the data is split across three files, and you can also partition by keys and explore the other options — I just wanted to show you the concept that you can split your data into multiple files while writing Spark code.
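A minimal sketch of that write step, assuming a transformed-data folder on the same mount; the repartition(1) is optional and just forces a single output part file:

# Minimal sketch of writing a DataFrame back to the transformed layer.
# The transformed-data folder name and the mount point are assumptions.
athletes.repartition(1).write \
    .mode("overwrite") \
    .option("header", True) \
    .csv("/mnt/tokyoolympic/transformed-data/athletes")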
In this case I just want a single partition — one file — so I delete the previous output first, run the code again with repartition(1), refresh the storage, and now the athletes folder contains a single part file. That is how you write the athletes data, and you can do the same for all of the other DataFrames; I already have that code written, so I just point each write at the transformed folder, run it, and if I go back to the storage you can see the five folders we loaded using the Apache Spark code. Let's recap this part against the architecture diagram: we had our data stored in the GitHub repository, we used Data Factory to extract it and load it into the target location, our data lake; then we created the Azure Databricks workspace, created the compute, and started writing code — we read the data, did a few transformations, and learned some Apache Spark basics such as printSchema, group by, and filtering — and then we loaded that data into the transformed data folder. Next we want to load this data into Synapse Analytics, which is where you can use SQL to run the analysis; typically a data analyst or data scientist would use this data to pull insights or build a dashboard on top of it. Again, you will have to create a Synapse workspace, set up the connections, and so on. So far we have done a lot with Data Factory, which is essentially a workflow management tool, and we have done things with code; next we will do it through the UI, so you will get an understanding of all three approaches. This is the end of part one of this project — I hope you had fun executing it. If you made it this far, let me know in the comments, and don't forget to hit the like and subscribe buttons. If you want to watch part two, just click on the card; part two is uploaded on the Project Pro channel, so you can go there and finish the project. All the best!
Info
Channel: Darshil Parmar
Views: 283,325
Keywords: darshil parmar, darshil parmar data engineer project, data engineer, data engineering, azure data engineering project, azure end to end data engineering project, azure data factory, azure data engineer, azure databricks, azure project, free azure project, end-to-end data engineering project, azure data engineer tutorial, azure data engineer tutorial for beginners, free azure data engineering project, azure data lake gen 2, databricks, pyspark, spark project using databricks
Id: IaA9YNlg5hM
Length: 95min 59sec (5759 seconds)
Published: Sat Aug 12 2023