Seamless Data Integration: ETL from Google Cloud Storage Bucket to BigQuery with Cloud Functions

Captions
Hi. In this video I'm going to show you how to do an ETL operation from a Google Cloud Storage bucket to BigQuery using Cloud Functions. This slide depicts my demo: I'll set up the ETL environment and then watch the automated extraction, transformation, and loading of data into a BigQuery table, from where we can run analytics queries. The video focuses purely on how to set this up so that it becomes a working ETL environment.

Before we get there, let me walk you through the solution and explain how it works. The raw data is stored in a Google Cloud Storage bucket, and into that bucket we upload a few sample raw data files in newline-delimited JSON format. In front of that bucket sits a Cloud Function, Google Cloud's event-driven, platform-as-a-service offering for running code. The function is invoked by an event: whenever we upload a file into the Cloud Storage bucket, an event is created, and that event invokes the Cloud Function. We write the logic so that the function reads the file that was placed in Cloud Storage and, based on the file's contents and its naming convention, detects what kind of data the file carries and what BigQuery table, with what schema, it has to create. That intelligence is written into the Cloud Function.

So whenever the event happens, the function does a step-by-step job. First it identifies the file, then it determines what schema it needs to create in Google Cloud BigQuery. How does it decide? Purely based on the file naming convention — I'll show you how I did that. It then creates a table in BigQuery with the schema defined in the schemas.yaml file, which I'll also show you. Once that happens, the next job is to load the content of the newline-delimited JSON file stored in the storage bucket into the table dynamically. That's the first run. On the second run, it should not create the schema again; it just has to append whatever data is present in the new file. That is how this scenario is constructed. The main focus here is how to set up the environment and how the automation behaves in both cases. The code required for this demo was already explained in my previous video, so please watch that to learn how the code actually works.
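To make the flow concrete, here is a minimal sketch of the event-driven handler described above — not the exact code from the repository. The entry point name hello_gcs matches the demo's default, and the step comments map to the logic explained later in the video.

```python
# Minimal sketch of the event-driven ETL flow (illustrative, not the repo code).
import logging

import functions_framework


@functions_framework.cloud_event
def hello_gcs(cloud_event):
    """Runs whenever an object is uploaded (finalized) in the bucket."""
    data = cloud_event.data
    bucket_name = data["bucket"]
    file_name = data["name"]
    logging.info("New object %s landed in bucket %s", file_name, bucket_name)

    # Step 1: read the clues in the file name (target table, data format).
    # Step 2: create the BigQuery table from schemas.yaml if it doesn't exist.
    # Step 3: load the newline-delimited JSON object into the table;
    #         later uploads simply append rows.
```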
For this demo I will be sharing my code in my GitHub repository, and the repository link can be found in this video's description. With that, let's move to the next part of the demo: creating the infrastructure in Google Cloud to achieve this scenario.

Let me go back to my previous slide. This is my Google Cloud subscription, and I'm currently in a project called Cloud Quick Labs, which is my channel name. Under IAM I have several accounts, in particular the default compute service account, which is the one I'm using. Remember that this compute service account needs certain permissions: I have given it BigQuery Admin, because we are dealing with BigQuery, plus Editor, and it also needs the Eventarc Event Receiver role. With those three permissions, the Cloud Function can do the required job.

Now to the first resource of the ETL environment, the storage bucket. Say you have an application that creates raw data files in a storage bucket — that is the bucket I'm going to create here. Let me call it demo-cloud-quick-labs. For the location I choose a specific region, because I want the storage bucket in the same region as the other resources I'm going to create, and I've chosen us-central1. I keep all other configurations as default, click Create, and confirm. So what we just did is create a Cloud Storage bucket in the us-central1 region, and we are done with the first part of our ETL environment.
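As an aside, the same bucket could also be created from Python with the google-cloud-storage client rather than the console — a small sketch, with the bucket name being illustrative:

```python
# Create the demo bucket programmatically (console is used in the video).
from google.cloud import storage

client = storage.Client()
bucket = client.create_bucket("demo-cloud-quick-labs", location="us-central1")
print(f"Created bucket {bucket.name} in {bucket.location}")
```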
Next I switch to another tab — don't get confused, all the tabs in my browser belong to the same Google Cloud subscription and the same project, Cloud Quick Labs. Here I'm going to create the Cloud Function from scratch: I click Create function, give it a name, cloud-quick-labs-demo, keep the region as us-central1, and then configure the trigger. I add an Eventarc trigger, and the event provider should be Cloud Storage, not Pub/Sub, because it is the storage bucket that will create the event for us. Once I choose that, the options change: the trigger type is a Google source, the service is Cloud Storage, and now I need to pick the event type.

The first event in the list is the object "archived" event, which fires when the live version of an object becomes a noncurrent version. If you click it, you get further options, such as choosing the bucket in which the object events should be captured. Why point at a bucket? Because this is the bucket into which your application or consumer will upload the raw data, so I choose it and it is identified. But archived is not the event we want — we need "finalized". Its subheading says it fires when a new object is successfully created in the bucket, and that is exactly the trigger point we want, not the others. In other words: you have a bucket, somebody uploads an object into it, that creates an event, and the event invokes this Cloud Function. So I select the finalized event and save the trigger.

With these options set, let's go to the core part of the Cloud Function: the code. For the runtime I choose Python, specifically the latest version, Python 3.11. I keep the entry point as hello_gcs — you can change it to match your own conventions. In main.py I upload the code I created and explained in my previous video, with a few small changes; the file is shared in my GitHub repository, and the source code link is in this video's description. So I copy the functions I created to do the job and paste them below the import of functions_framework. The function at the bottom is the initiation point, where the Cloud Function starts executing, and it calls the other functions that belong to the job — extracting the data and uploading it into the BigQuery table. Let me walk you through the code as well. Whenever the event happens, this function gets called, and the first thing it does is check whether the table already exists in the BigQuery dataset (we'll create that dataset in a moment).
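That existence check can be sketched with the BigQuery Python client along these lines — an assumption about how it might look, with an illustrative table name, not the exact code from the repo:

```python
# Check whether a BigQuery table already exists before creating it.
from google.api_core.exceptions import NotFound
from google.cloud import bigquery


def table_exists(client: bigquery.Client, table_id: str) -> bool:
    try:
        client.get_table(table_id)  # raises NotFound if the table is missing
        return True
    except NotFound:
        return False


client = bigquery.Client()
print(table_exists(client, f"{client.project}.staging.table_1"))
```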
If the table already exists underneath the dataset, it does nothing; if it does not, it creates it. Then it checks the format. How does it know the format? All the format information is provided in the schemas.yaml file, which I'm going to upload as well. The code reads the schemas, matches them against the file name, and decides what kind of file it is and how it needs to be loaded. So those operations happen in sequence: check whether the table exists, create a new table if needed, and then load the data. For the load it uses a native BigQuery API operation, load from URI: it identifies the URI of the object that was uploaded to the Cloud Storage bucket and invokes the BigQuery API with it. One part of the code creates the schema — I'll minimize that — and the main part of the function uploads the data into the BigQuery dataset table by referencing the Cloud Storage object URI. That's how the function works; you can study it in detail yourself from the code I'm uploading to my GitHub repository, linked in this video's description.

Next I need to create a file called schemas.yaml, which is referenced from main.py; to add it, just click the plus button. In this file I put the schema definitions I have keyed to the file naming convention. In effect we give clues to the code: this is my table name, the table name format is this, and the schema of the table has to be that. That data is passed to the code in main.py — if you look there, there is a piece of code starting with `with open` that reads the YAML file and loads its contents into a dictionary, and that config data is used further down.

Then there is one more important file, requirements.txt, which lists the Python modules required to run the code. How do you know what to put there? Look at the imports: I import logging, os, and traceback, which are standard library modules available in Python by default and don't need to be listed; functions-framework is already there by default, but the BigQuery and Storage client modules have to be added. With that configuration we are done setting up the Cloud Function, so I start the deployment. I'm showing this with the UI, but keep in mind that you can do the same with infrastructure as code — using the APIs, Terraform, or the native tooling that Google Cloud provides.
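Before moving on, here is a sketch of the two pieces just described — reading schemas.yaml into a dictionary and loading the newline-delimited JSON object with load_table_from_uri. The YAML structure, field names, and table/file names are assumptions for illustration, not the exact ones in the repository:

```python
# Read schemas.yaml and load an NDJSON object from the bucket into BigQuery.
import yaml
from google.cloud import bigquery

with open("schemas.yaml") as f:
    # Assumed structure: {"table_1": [{"name": "id", "type": "STRING"}, ...]}
    config = yaml.safe_load(f)

client = bigquery.Client()
table_id = f"{client.project}.staging.table_1"

schema = [bigquery.SchemaField(col["name"], col["type"]) for col in config["table_1"]]
job_config = bigquery.LoadJobConfig(
    schema=schema,
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,  # append on later runs
)

# Illustrative object name following the demo's naming convention.
uri = "gs://demo-cloud-quick-labs/table_1_data_new_line_delimited_json.json"
load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load to finish
```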
While the function is deploying, let me go back and look at the file I'm going to upload to my bucket — I explained it in another video, so check that for more detail. The file name carries clues. I'm using this mechanism, but in your case it could be something different; you can innovate yourself. Here, table_1 is the table that has to be created in the dataset, and the type of data present in the file is mentioned after the underscore: new_line_delimited_json. What does that mean? The file contains data in JSON format, but as newline-delimited JSON: every line is one JSON object, and the lines are not separated by commas — which is why the editor shows an error, and that's fine.

So much for the code; back to the Cloud Function deployment. The function is deployed successfully — you can see the green mark — and on the main Cloud Functions page our cloud-quick-labs-demo function is up and running, which is a good sign. Now we need to do one more job. This is the BigQuery page of my Google Cloud subscription; if I refresh it I see the Cloud Quick Labs project I belong to, and I need to create a dataset here. The dataset name is hard-coded in the script: I simply called it staging. So I click Create dataset, give the dataset ID, keep the rest of the configuration as default, and click Create dataset. The dataset is created, and we are all set.

Now let's do the final demo and check whether our ETL environment actually works. I go ahead and upload the same file I showed and explained: I click Upload, the upload starts, and the file is uploaded successfully. Then we go back to the logs to see whether entries have been generated for triggering the function and loading the data into the BigQuery table. Here we go — the function has been triggered. At first glance it looks like there might be a problem in the code, but looking closer the logs say the table was created, so everything is going well, and the job completes.
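As an aside before digging into the logs: newline-delimited JSON looks like the snippet in the comment below (one object per line, no commas between lines), and the staging dataset created in the console could equally be created with the Python client — a sketch, with the location being an assumption:

```python
# NDJSON reminder: one JSON object per line, no commas between lines, e.g.
#   {"id": "1", "name": "alpha"}
#   {"id": "2", "name": "beta"}
from google.cloud import bigquery

client = bigquery.Client()
dataset = bigquery.Dataset(f"{client.project}.staging")  # dataset name from the demo
dataset.location = "US"  # assumption: pick the location that matches your resources
client.create_dataset(dataset, exists_ok=True)  # no error if it already exists
```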
What are these logs? The Cloud Function we have written leaves a trail of the operations happening in the source code, and that trail can be visualized in the logs: you can add print statements or logging statements, those statements get printed here, and they give you a clue about how your code executed, one step after the other. In our case it is working fine: the bucket has been identified, the file has been identified, the metageneration has been identified, the table has been created, and it says the job has finished as well.

Let's go to the staging dataset. Our table has been created, and it has the schema I declared — if you open the schemas.yaml file (you can also find it in the Cloud Function's source), these are the fields that were identified and used to create the table. Now let's query the table from a new tab: I run a SELECT * on it to check whether data has been loaded, and here we go, the data has been uploaded successfully.

Now let me try a second operation. At first I thought of copying the file and giving it a different name, but to upload the same file again I have to clean up first, because I already know it would just overwrite the existing object: uploading into the same bucket with the same name updates the object rather than creating a new one. So for demo purposes I delete the object and re-upload it. The file is uploaded, and if we go to the logs and refresh, we see the iteration happening one more time. Why do it a second time? To show that the function will not create the schema again; it just goes ahead and appends the data in the new file to the existing table. Let's wait for the job to complete — and it's done. The logs say the bucket was found and the metadata was found, but this time it did not create the table; it identified everything properly. If we run the same query again, where earlier there were two records, now there are four. The four records have been uploaded successfully, which means this ETL environment is working as expected, and we also queried the table that was loaded.
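The same final check can be run from Python instead of the BigQuery console — a small sketch with an illustrative table name:

```python
# Query the loaded table and print its rows.
from google.cloud import bigquery

client = bigquery.Client()
query = f"SELECT * FROM `{client.project}.staging.table_1`"
for row in client.query(query).result():
    print(dict(row))
```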
So, in essence, you had raw data you could do nothing with, and it has been converted into structured data against which you can run many queries and bring out insights. That is the power of an ETL operation. With that, I have shown you everything I wanted to show in this video. Finally, may I request that you please subscribe to my channel — it really encourages me a lot. Thank you, thanks a lot, and see you in the next video.
Info
Channel: Cloud Quick Labs
Views: 13,895
Keywords: #cloudquicklabs
Id: CHk_jlS_A8I
Length: 20min 10sec (1210 seconds)
Published: Sun Jun 11 2023