GCP Cloud Functions for GCS object events | Load data into BigQuery tables on GCS events

Video Statistics and Information

Captions
Hi, welcome to Anjan GCP Data Engineering. This is another video on Google Cloud Functions. In this video we will see how to write a Cloud Function that executes against Google Cloud Storage events.

In common practice, whenever files are placed in a Google Cloud Storage bucket and need to be processed through an ETL pipeline, you schedule that pipeline on a periodic basis, and the files are picked up by the ETL service or tool at the scheduled time. But suppose we don't know when exactly the files or objects will be placed in Cloud Storage; how do we process them in such cases? This is where Cloud Functions triggered by Cloud Storage object events come in very handy.

Let's go straight to the demo slide. In this video we are concentrating on a use case where a file is placed in a Cloud Storage bucket. That file can be anything: a CSV file, a text file, a JSON file, or any other format. Cloud Storage supports different event types: an object being placed and finalized is one event type, an object being archived in the bucket is another, and there are a few more, which we will see during the demo. Here we concentrate on the object finalize event, which fires whenever an object is placed in Cloud Storage. We will write a Cloud Function in Python that reads the metadata of that file as well as the metadata of the event, and then loads the data contained in the file into a BigQuery table. That is what we are going to see in this demo.
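For reference, the four Cloud Storage trigger event types that Gen 1 Cloud Functions support (they come up again below when configuring the trigger) correspond, as far as I know, to these identifiers:

google.storage.object.finalize
google.storage.object.delete
google.storage.object.archive
google.storage.object.metadataUpdate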
So what I'll do is place a few CSV files in the Cloud Storage bucket. As soon as a file lands in the bucket, the Cloud Function will be triggered by the event, capture the metadata related to the event and the file, and load the actual file data into BigQuery. In BigQuery we are going to have multiple tables. The metadata about the file and the event goes into one common table: every time you upload a file into the bucket, the function captures that metadata and appends it to that table. The file data itself goes into a table per file: if I place file one, a table is created with file one's schema; if I then upload file two, its metadata is appended to the existing metadata table and one more table is created with file two's schema. Because the CSV files carry a header row, the schema is auto-detected and each table is created to match whatever CSV was uploaded into the bucket.

Now let us quickly go to the demo. This is our Cloud Functions environment; click on Create Function. To create the Cloud Function I am again using the Google Cloud console, and I already have the code written, so I'll walk through it at a high level. Let me use first generation for this demo and name the function gcs-to-bq; I am not changing the region. In the trigger section you have multiple trigger types. As I mentioned in my previous video, Gen 1 Cloud Functions support a limited set of events, while Gen 2 supports more. Here I am selecting Cloud Storage as the event provider; this is the event type I was talking about. In Gen 1 it currently supports four event types: you can write a function for an object being archived in the bucket, for a delete, for a finalize (which is what we are using in this demo), or for an update to an object's metadata. Let me pick finalize, i.e. creating an object in the bucket. We then have to select a bucket; I have already created one as part of this project, so let me select it and save.

Next we configure the runtime. There are different options for compute and memory; 128 MB is the minimum RAM, but let me select 256 MB. The timeout is 60 seconds: if the function does not respond within that time it is timed out. It runs as the App Engine default service account, and for autoscaling let me allow only three instances. Click Next. For the runtime I am using Python 3.10. A template main.py is already provided, along with a requirements file; if the function has Python dependencies to install, we have to specify those package names there. Now let us look at the dependencies this code needs: because I am writing data into a BigQuery table using a pandas DataFrame, I need pandas and pandas-gbq, the library that writes DataFrames into BigQuery, plus a couple of supporting packages, one to read files from the Cloud Storage file system and one more helper package used alongside it. All of these go into the requirements file.
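The video does not show the exact package list, so this requirements.txt is only my reconstruction: pandas and pandas-gbq are named explicitly, while gcsfs and fsspec are my guesses for the two Cloud Storage file-system helpers it alludes to.

pandas
pandas-gbq
gcsfs
fsspec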
In main.py I paste my code; let me go through it at a high level. It starts from the standard template provided by Google Cloud Functions, which takes two input arguments called event and context. The event object provides the metadata about the file (the bucket where it was placed, the file name, when it was created and when it was last updated), while the context object provides the metadata about the event itself (the event ID and the event type; as mentioned, there are different event types). I capture all of that information into a dictionary: event ID, event type, bucket name, file name, and the created and updated timestamps.

Then I write that metadata into a BigQuery table. If a table with that name already exists in the dataset, the rows are appended; if no such table is found, the table is created and the metadata is loaded into it.

In the next step I load the data from the file that was just placed in the Cloud Storage bucket. The code uses a pandas DataFrame to read the CSV data and then pandas-gbq (you can see gbq used in both places) to write it into the same project and dataset. The table name is simply the file name: whatever file we place in Cloud Storage, that name becomes the table name. If a table with that name already exists, the data is appended; otherwise the table is created.

This is very useful whenever you have to perform some lightweight ETL: if a file lands in Cloud Storage and you need its data loaded into BigQuery automatically, based on the event, you can definitely use this approach.
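Pieced together from that description, the function might look roughly like the sketch below. This is not the author's exact code; the project ID, dataset, and metadata table name are placeholders I made up for illustration.

import os

import pandas as pd
import pandas_gbq

PROJECT_ID = "my-gcp-project"                 # hypothetical project id
DATASET = "gcs_events"                        # hypothetical BigQuery dataset
METADATA_TABLE = f"{DATASET}.file_metadata"   # hypothetical metadata table


def gcs_to_bq(event, context):
    """Background function triggered by a google.storage.object.finalize event.

    `event` carries the object metadata (bucket, name, timestamps);
    `context` carries the event metadata (event id, event type).
    """
    # 1. Capture event + file metadata as a one-row DataFrame and append it to
    #    a common metadata table (created automatically on the first write).
    metadata = {
        "event_id": context.event_id,
        "event_type": context.event_type,
        "bucket": event["bucket"],
        "file_name": event["name"],
        "created": event["timeCreated"],
        "updated": event["updated"],
    }
    pandas_gbq.to_gbq(
        pd.DataFrame([metadata]),
        METADATA_TABLE,
        project_id=PROJECT_ID,
        if_exists="append",
    )

    # 2. Read the uploaded CSV straight from GCS (needs gcsfs installed) and
    #    load it into a table named after the file, e.g. beam_sql.csv -> beam_sql.
    gcs_path = f"gs://{event['bucket']}/{event['name']}"
    df = pd.read_csv(gcs_path)
    table_name = os.path.splitext(event["name"])[0]
    pandas_gbq.to_gbq(
        df,
        f"{DATASET}.{table_name}",
        project_id=PROJECT_ID,
        if_exists="append",
    )
    print(f"Loaded {len(df)} rows from {gcs_path} into {DATASET}.{table_name}")

The entry point configured in the console would then be gcs_to_bq; pandas-gbq handles both creating a table on the first write and appending on later ones.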
Now let us copy this code into the editor and deploy the function. We have the requirements file with all the Python dependencies to install, and this is our main.py, so click Deploy. It will take some time, maybe a couple of minutes. Once it is successfully deployed we will place some files in the bucket and see whether tables get created with the file names and whether the metadata shows up in its table. If we go to the BigQuery editor in the meantime, you can see our project and dataset; there is nothing in it as of now. Once the function is deployed and we place files into the bucket, the tables will appear here. You can see it is still deploying, so we wait. Now the function is deployed successfully; you can see the green tick mark. In the logs you can find information about this function, and if there were errors you would see them here; right now there are two "create function" entries in the log section.

Now we will upload a few files one by one into the bucket and see whether the function is triggered automatically and loads the metadata and the actual data into BigQuery tables. Let us use some files already available on my local machine. The first is a simple CSV file with four columns and data up to 16 rows. The function will write the metadata about this file into the separate metadata table and create the actual data table with the name beam_sql. Upload the file into the bucket; you can see it has been uploaded. Go back to the Cloud Function and refresh the logs: it should show that the function was triggered and completed. It takes a moment, but you can see "function execution started" and then "finished with status OK", which means it finished successfully.

Now go to BigQuery and examine the dataset. Two tables were created: one with the file name beam_sql (the .csv extension is ignored, only the file name is used) and the metadata table. Click the metadata table: the details show it was created just now, and previewing the data you can see the event ID, the event type (object finalize, as discussed), the bucket where the file was placed, the file name beam_sql, and when the file was created and updated. That is the metadata related to the file and the event. Then look at the actual data table: it was created with the auto-detected schema, and the preview shows all 16 rows.

In the same way, let us place one more file, a CSV with greenhouse details, and the Cloud Function is triggered again. The logs show it started and finished; back in BigQuery the metadata table now has two rows (the same kind of metadata for the second object finalize event), and after a refresh there is one more table named after the new file. Previewing it, you can see about 20,000 rows were loaded.

So this is how you can write Cloud Functions against Google Cloud Storage object events and use them for lightweight ETL. That's it for this video; thank you very much for watching.
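As a quick sanity check outside the console, the loaded tables can also be previewed from Python with pandas-gbq; the project, dataset, and table names below are the same placeholders assumed in the sketch above, not values shown in the video.

import pandas_gbq

preview = pandas_gbq.read_gbq(
    "SELECT * FROM `my-gcp-project.gcs_events.beam_sql` LIMIT 10",
    project_id="my-gcp-project",  # hypothetical project id
)
print(preview)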
Info
Channel: Anjan GCP Data Engineering
Views: 13,577
Keywords: GCP, cloud functions, GCS, trigger, python
Id: n3dMMgUdRdk
Length: 17min 35sec (1055 seconds)
Published: Fri Oct 21 2022