Using AWS Lambda As A Data Engineer - Automating An API Extract With AWS Lambda And EventBridge

Video Statistics and Information

Captions
Hey there everyone, welcome back to another video with me, Ben Rogojan, aka the Seattle Data Guy. Today we're going to talk about data engineering and AWS, or how you can actually use AWS as a data engineer. In this video we're mostly going to focus on Lambdas, as well as a few other components, and how you can use them in your everyday workflows. Specifically we're going to dive into a few different things: one, we're going to talk about what an AWS Lambda function is; two, we're going to talk about some of the pros and cons of picking it, as well as some other options you might decide to use; and then we're going to go through a real-life coding example. We're going to go on AWS, build a function, automate it in a few different ways, and see the results. With that, let's just dive into using AWS as a data engineer.

First of all, what is an AWS Lambda function? You've probably heard about it, and you may have heard the term serverless, which is what it's attached to. It is a serverless computing service, which essentially means you can have this code, this script you want to run, and not have to set up a server to run it. Now, I know that doesn't quite make sense; there is a server somewhere, right? Somewhere, virtually, there is a server that will eventually run your function. Let's say you're trying to automate an API call to pull data, like extracting data from a source at midnight, so you set up a Lambda to run (I'll show you how to do that later). Instead of having to set up an EC2 instance with something scheduled with cron, which you're going to pay for all the time it's running, you can set up this serverless function that gets triggered, usually by some sort of external event: something calling it, a time period passing (it's 1:00, got to run it), or an external event like data being dropped into S3. You can almost see this the same way people used to set up basic scripts in the past and have cron call them. It might not be a fancy, complex function or module or whole code base; it might just be one little script that you need to run at a certain time. And that's really all it is: there's no real server you have to allocate, you just have this script that gets spun up at certain times.

With that you get a few benefits. Like I referenced, it's cost effective; that's one benefit. You don't have to run it all the time. If you're only running this script once every two months, that's going to be very cheap versus something like MWAA, Amazon's managed Apache Airflow instance, which could also run these tasks at certain times using its scheduler. Lambda is far cheaper: MWAA easily costs $300 to $500 a month, whereas this Lambda is going to cost you a dollar, a few pennies, whatever it ends up being; I've seen people pay pretty low costs on their AWS Lambdas. It's also easy to scale: for every Lambda you end up calling, you don't need to figure out where it's going to run, and you don't need to spend time setting up distributed systems. You've got this basic Lambda and it'll end up running on its own.
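As a rough mental model (this is a hypothetical sketch, not the exact function built later in the video), a Lambda is essentially a Python script with a handler entry point that AWS invokes for you, the same way cron would invoke a standalone script:

import json
from datetime import datetime, timezone

def lambda_handler(event, context):
    # AWS invokes this function for you; "event" carries whatever data
    # the trigger (a schedule, an S3 notification, etc.) passes in.
    run_time = datetime.now(timezone.utc).isoformat()
    print(f"Extract triggered at {run_time} with event: {json.dumps(event)}")
    # ...call your API and land the data somewhere here...
    return {"statusCode": 200, "body": "extract complete"}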
wherever it's going if it's going into some database you might have to figure out that side but in terms of the actual lambas and running the the scripts that's not a problem also if you're in AWS lambas are heavily integrated into everything else right you can literally set it up so that if something happens in S3 or Dynamo DB or somewhere a lambda's fired right whether that's because you set up a step function which we'll talk a little bit about or just because you've set it up that it triggers based on something else happening using a vent Bridge which we'll dig into it's very easy to do right it's not complex you don't have to set up code to do it you just you know drop down and you select the event and another benefit I didn't reference about serverless is you don't have to spend time updating your server or or or managing the server right instead you just have this serverless instance that you don't have to think too much about now with anything there are pros and cons with any architecture decision if you decide to use a Lambda there's a tradeoff uh one you could argue there's a cold start problem right uh if you've got a server running all the time it's far easier for that code to run because it doesn't have to you know turn on or set up some instance somewhere another issue is you often could run into some runtime limitations like there's only so much storage it can only run so long it might only have so much compute and so there might be some limitations there obviously there's also vender lock in right like if you set up all your infrastructure like I talked about earlier with event bridge and you have you know Dam modb connected to Lambda S3 Etc it's going to be much harder to move because you're going to have to Now set up similar things in a different Venture so there's some lock in if you use AWS Lam D but that goes with any solution you pick another thing that I didn't reference is obviously it's cost effective on one side if you are using this script very you know minimally but on the flip side if you're hitting this Lambda all the time eventually it will likely become cost prohibitive right um depending on how you set it up you're likely paying a margin um on top of lambdas right versus owning your own server so there is this trade-off where it's like well if you use it enough eventually you will spend more on it then it's actually probably worth to just spin up your own server and some people don't like having their hands tied right like logging is a little more limited uh whereas if you own your own server you've got a little more control so logging and monitoring can feel a little less uh is a little more limited because you can only use kind of what AWS provides you but it's not bad and especially for scripts I think it's usually good enough and so those are some of the pros and cons of using an AWS Lambda now again there are other Solutions you might use I referenced mwaa earlier or Apache air flows managed instance on AWS which could be for more complex tasks right in this case we're going to do a very simple task right with uh we're going to automate some basic events that basically cause uh Lambda to fire off but if you have more complex dags you might want to do something like use mwaa I'll put in another shout out for mage which I am an adviser for them so there are plenty of options of how you can automate your tasks especially if you're talking about ingesting data or maybe doing a basic transform you can also use things like snowpipe which also has a trigger 
There are honestly tons of other alternatives; if I think of a few more I'll add them here. But with that, guys, that's my intro, so let's actually do the thing you came here for, which is code. Let's dive into coding with AWS Lambdas and automating some basic events; in this case we're going to scrape my YouTube analytics and show it to you guys. Let's dive into that section next.

All right, let's dive into AWS Lambda and actually look at how you can set up a function that pulls data from an API and puts it into an S3 bucket. First let's go over creating a new Lambda, then we'll go over this run data test example, the code, and what it does. If I hit Create function, this is where you create your AWS Lambda. You can either create it from scratch, which is generally what I do, or use one of the blueprints, which give you a baseline: maybe you've got a microservice you need to build, or you want to get something from S3, or similar functions you do all the time. Those do exist, and you can find one in the right language so you don't have to build it from scratch, but for today we're going to author from scratch.

What you'll see here is that you first need to pick what you're going to run it in: is it going to be Node, is it going to be Python? In this case we'll do Python, but there are tons of options for what you can build your serverless function in. I'm less concerned in this case about architecture, but if that's something you're interested in, you can definitely pick something there. From there, there are a few advanced settings you may or may not need. In particular there is Enable VPC. If, for example, you have an external source you're trying to hit, like an API I was working with that needed to whitelist my IP address, you might have to create a static IP address, which is arguably a video in itself, but I'll at least put a link below if you need to create your own static IP address on a VPC. You create a VPC that has a static IP address so that, as your requests go out, you know where they're coming from and you can give whoever needs to whitelist you the right IP address. You'll see here I have a Lambda VPC which is acting as that static IP address. There are a few other things I haven't had to use: you might have a function URL, which is for when you want an external piece of code to be able to call the function, almost like an API. There is also a way to set up an API Gateway, but that's a different project altogether: if you need to build an API, you can create a function and then put an API Gateway in front of it that knows where the function is. That's one way to just ping your serverless function to run it in some cases.

Obviously you need to give it a name. We're going to be pulling from an API, a weather API, so it's just run weather API or run weather extract, something like that; it's a pretty straightforward script. You can then hit Create function. Actually, I'm going to back out of this Create function screen just so we can look at the existing one.
Now, what you'll see here is that you can put in whatever code you want to write, but you will likely run into a few specific problems. For example, if you want to get something like pandas added into your Python script, essentially if you want to import pandas, you're going to have to add a layer. Layers essentially allow you to add external libraries. You can hit Add a layer, and there are a few that exist out of the box; you can see pandas is one of them. But you'll likely have to create your own layer here and there for some of the functions you're building, and I'll link how to do that. I generally go through CloudShell, which is right here, just because it always seems to be the most consistent way for me to build a layer correctly; I'll put a link on how you can do that below.

So you'll end up creating your function, which we'll go over in the actual code here in a second, but the other thing you'll likely need to change is something in Configuration. Generally I do have to mess around with the general configuration. You might have to increase memory: there was a database I was recently pulling from where it was taking somewhere in the range of two or three minutes to pull, and I ended up increasing both the memory and the ephemeral storage, which pretty quickly turned that into something like a 30-second pull. You do end up paying for the increased size: the amount you pay is connected to how long the function runs plus the size of the instance you're setting up. You can hit View pricing here; bigger sizes cost more, so do keep in mind that you pay for that. And then you usually have to increase the timeout: if your function keeps timing out, say you run it and in 3 seconds it dies, you likely should increase how long it has until it times out. It maxes out at 15 minutes, so if you're running longer than that, it's going to kick your function out regardless. But that's the basic configuration: you can change memory, ephemeral storage, and how long it runs until it times out.

Now, if we go through this, I already have run data test set up, and it's a pretty straightforward function. If we look at this example and scroll down, the main function that essentially needs to exist is this lambda_handler. You'll see it has event and context. Event is where you pass data to your function, and context is more about the actual execution context, giving high-level information about, as the name suggests, the environment the function is running in. So event is really where, if you pass data of some kind, it gets passed in and you can pull it out: for example, if you passed it the API key, or maybe some sort of filter or information you want to give to the API itself, it would be in event. From there, we're giving it an API key, in this case for the OpenWeather API, to pull the weather for a latitude and longitude. We're only pulling one row of data, so we're keeping it simple. You can also use environment variables, which is one way to store secret information; that's not the best way, the best way would be to use AWS's key management service, but we're just doing this as an example, so we're keeping this simple.
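The code isn't shown line by line on screen, but a minimal sketch of this piece might look like the following; the environment variable name, the latitude/longitude values, and the exact OpenWeather parameters are assumptions for illustration:

import os
import requests

# Assumed env var name; the real function may pass the key in via the event instead.
API_KEY = os.environ.get("OPENWEATHER_API_KEY")

def get_weather_data(lat, lon, api_key):
    # Call the OpenWeather current-weather endpoint for a single location.
    url = "https://api.openweathermap.org/data/2.5/weather"
    response = requests.get(url, params={"lat": lat, "lon": lon, "appid": api_key})
    if response.status_code == 200:
        return response.json()
    raise RuntimeError(f"Weather API call failed with status {response.status_code}")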
The key service adds a little bit more code, so I didn't want to do that here. After that, you'll see I have a function that actually gets the weather data using the API key, so I pass that information in, and then, using the data we get back, we write it to a CSV. Again, we're just writing one row, so we're keeping this pretty straightforward. From there, after we write that CSV, we create a file name: we have this sg.csv, and I add a timestamp to the file name just so we know when it came in, and then we put it into an S3 bucket and folder that I have out there. So this is just a basic example of inserting that data into an S3 bucket, or putting it there, which I guess is the better term. You might then afterwards load it into something like a Snowflake instance or Databricks; you might have a Snowpipe that automatically triggers as soon as the file arrives. So that's pretty much the straightforward script we're using.

If we scroll up, we can actually see the get weather data function. This is, again, a pretty straightforward function; it's only pulling one row of data, so you're not going to see a loop here, I'm just parsing this one data set and returning it. We're doing your classic requests.get to fetch the API data; if it comes back good, we parse it out and return it, and then we write it into a CSV, keeping it pretty simple from there. Like I said, we're also adding a timestamp to the file name: we get the current datetime, split up the original file name, in this case sg, and add an extension with the timestamp, so year, month, day, hour, minute, that kind of setup, and then .csv.
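Putting that together, a hedged sketch of those remaining steps might look like this; the bucket name, folder, sg.csv base file name, and CSV columns are stand-ins, and the real code may structure them differently:

import csv
import boto3
from datetime import datetime

s3 = boto3.client("s3")

def build_file_name(base_name="sg.csv"):
    # Split "sg.csv" and insert a YYYYMMDDHHMM timestamp so each run
    # lands as a new object instead of overwriting the previous one.
    stem, ext = base_name.rsplit(".", 1)
    stamp = datetime.now().strftime("%Y%m%d%H%M")
    return f"{stem}_{stamp}.{ext}"

def write_and_upload(weather_json, bucket="my-weather-bucket", folder="weather"):
    file_name = build_file_name()
    local_path = f"/tmp/{file_name}"  # /tmp is Lambda's writable scratch space
    with open(local_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["city", "temperature", "conditions"])
        writer.writerow([
            weather_json.get("name"),
            weather_json.get("main", {}).get("temp"),
            weather_json.get("weather", [{}])[0].get("description"),
        ])
    # Put the one-row CSV into the bucket/folder for downstream loading.
    s3.upload_file(local_path, bucket, f"{folder}/{file_name}")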
CSV so we're keeping it pretty straightforward there that way we don't have to think about all the configuration of that file name we're just letting the function do it for us so yeah it gets the file name it gets the data and then from there again we're just taking that data uh and running it and loading it into essentially S3 so if you want to test this function right you can literally just hit uh the the test function so I can hit test and you'll notice it's essentially run successfully and I've already checked that these are loading into S3 so it's loading into my S3 bucket and from here you might have a tri trigger again in EV vent Bridge which we'll go over here in a second but there's a few different ways you can actually set up things uh so that they run uh maybe afterwards or maybe something that actually triggers this Lambda in the first place so next we're going to kind of go over how you can trigger this this Lambda so for that we're going to go to event Bridge Amazon event Bridge so you've got this Lambda that you've built you can go to event bridge and there's a few things you can have a schedule so this is very similar to Kon I create a schedule you know test schedule we can literally go down and say recurring schedule and use cron based scheduling so uh and'll it'll give an example of how it's going to run so if I say like day the month and then you can use some this it'll actually tell you when it's going to run here so if we go below it'll say okay it's going to run on March 1st uh at one and then basically it's going to go every every then it's going to go every minute because I've set it up to every minute so if I want to change it to only one at the 1 minute and only the first day of the month you can kind of see that it's set up to run this is going to run on March on March 1st April 1st at 101 um if you want to change that yeah okay so if you want this just to run essentially every day versus a specific day of the month um that's how you'd set up your cron schedule they also have rate based scheduling which basically means like if you just want to be very simple and say like every 12 hours and not think about it too much but those are those are the ways you can essentially set up a schedule and then they also let you say like during certain Windows of time oh let's just say this is off uh and then from there you choose what you want to actually do again you don't just have to do lambdas you can have start a pipeline here uh for sage maker or a ton of other things but we'll just do Lambda you'll say which one and you'll you know we'll do run data test and then you could actually give it a certain payload going back to that event you can actually give it information for that event to to parse from so you can give Jon on there you hit next and then from here there's a ton of extra configuration so do you want to retry things after a certain period of time in case something fails um do you want to send like certain messages Etc keep going next and essentially from there it's going to check hey are you good with this okay create the schedule so that's event Bridge it's it's pretty nifty again if you have KRON jobs anyways that you're running and you just need migrate them over they're very simple it's really easy to set out there are a few other options you can definitely use with Amazon event Bridge for now we'll just go over event bridge and schedules cuz I think that's the one most people commonly use uh hopefully this was a great intro in lambdas and how you can 
Next, I'd like to do a video on S3 and how you can take that data and load it automatically into something like Snowflake; there are a few different ways, including using Snowpipes, and then maybe we'll dig into Athena and running queries on top of S3. Anyway, thanks so much for watching this video, and I'll see you guys in the next one. Thanks all, goodbye.
Info
Channel: Seattle Data Guy
Views: 4,348
Keywords: AWS, aws data engineering, aws solutions architect, big data, how to become a data engineer, data engineering skills, data engineering skills 2024, what is a data engineer, aws lambda, aws eventbridge, automating aws lambda, automation, automation engineer, how to automate your data pipelines, data pipelines, api, python, using python on an api, ben rogojan, seattle data guy, analytics, data engineering cloud skills, data analytics cloud skills
Id: AXpOnpNg3cQ
Length: 18min 0sec (1080 seconds)
Published: Wed Mar 20 2024