Deploying a Modern Cloud-Based Data Engineering Architecture w/ AWS, Airflow & Snowflake - Wes Sankey

Captions
We opened up in Charlotte about a year and a half ago, but we're headquartered in Seattle and have offices all over the place, and we're hiring, so we are looking for folks to join our data and analytics practice; if you're interested, feel free to reach out to me. I specialize in data warehousing, streaming systems, software engineering, and cloud. I'm super interested in machine learning and hope to delve more into that; I've been trying to learn about it a little more in my free time. Those are some of the topics that I try to focus on in my work and in my free time. But enough about me, let's talk about some more interesting things.

A brief agenda: we're going to talk about some traditional data warehousing patterns that I hope you are all familiar with, go into some of the shortcomings of those patterns, talk about a newish paradigm, and then go into a demo. What I hope you primarily take away from this presentation is less about the specific implementations of whatever technology we choose. What I'm really trying to drive here is showing this new paradigm, showing how we can plug and play different technologies into this architecture, and talking about where data engineering as a career is going, where we've been, and what this transition has looked like.

With that, I'm going to start the presentation with a contextual concept. Imagine you are a data engineer. You've started working at some new company, and they have a fairly straightforward on-premise architecture. It might be running SQL Server, it could be Oracle; the particular implementation doesn't matter. In this really simple use case, which I'm sure you've all interacted with some version of, we have some application databases. These are just MySQL databases, not really under our control, but we bring that data over via some ETL process, say SQL Server Integration Services if we're working with SQL Server. We bring it over maybe a couple of times daily into our SQL Server data warehouse, we model that data in a dimensional data model, we perhaps have some Analysis Services cubes that pre-aggregate that data, and then our end users access it via Excel workbooks. This was an extremely common pattern, albeit simplified probably far beyond what any enterprise actually worked with, but I think it illustrates the point of a simple architecture that we can understand.

In terms of the health of our system, I want to track it using three metrics. In terms of storage, we're not looking at a ton of storage; it's a little bit of application data, so we're definitely good on storage, we've got plenty. Our on-prem data warehouse doesn't take up a ton of compute capacity; the aggregations that we're applying aren't that complex because we don't have a ton of data, so there's not a lot of work to be done. And our ETL is fairly straightforward: we have really just one ETL process to load that application transaction data into our data warehouse.

Things are moving along great with this architecture, but then we as a company start to grow and expand, and we have some new data sources to bring in. We start integrating with Salesforce, and that complicates our architecture a little bit more, but again, nothing too complex. Here we need to set up another ETL pipeline to bring that Salesforce data into our data warehouse, but all things considered, this doesn't significantly impact our architecture too much.
When we look back at our health metrics, our storage has increased; Salesforce is going to add a lot of data, but we're still feeling pretty comfortable in our system's capacity. Compute is going to increase as well; we have some more complex modeling to do to integrate our Salesforce data with our OLTP data, but it's nothing too wild. And our ETL complexity has increased, because we now have two separate ETL processes to maintain.

We're running along fine with this, but all of a sudden our company decides that we need to start ingesting some clickstream data, and if you've ever worked with clickstream data from the web, you will know that there's a lot of it to get value from, and it can be complex to deal with. So we have our clickstream data coming in, which of course requires some ETL process, so now we've got three ETL processes and a lot more data coming in, and that clickstream data is really pushing our server to the limit. If we look at our new health metrics, things are looking not so great: the clickstream data really pushed us over the edge as far as our storage goes, our compute is barely keeping up, and our ETL complexity is high now.

So we do what any good data engineers and database administrators would do, which is to add another instance to our SQL Server cluster to help distribute that workload a little bit better. This works; it gets us back to a mid-range where storage is not critical but still something we need to be cautious of, and the same goes for compute and ETL complexity. But then the big one hits: our company is featured on the front page of Hacker News, we get an influx of new users, and our clickstream data, our Salesforce data, and our application data go through the roof. We can't keep up, and our on-premise architecture is hosed.

So we've got some problems with this traditional on-premise architecture. Key among them is scalability. It's difficult to scale when you're dealing with things like web-volume data and you can't reliably predict the amount of traffic you're going to get, and it takes a lot of time and man-hours to add new servers to that on-premise architecture; it's a difficult task. Another key problem is the tight coupling of storage and compute. These on-premise systems tightly couple the storage system in the data warehouse to the compute, so, for example, if your data volumes are increasing significantly but your compute requirements are not, you still have to increase your compute capacity as you add more storage to that system; you can't have one without the other. And finally, it can be difficult to deal with independent, isolated ETL systems that often have proprietary formats that aren't portable: it's not easy to take, for example, an SSIS package and convert it to some other system, and they're often not very source-control friendly because they're oftentimes point-and-click, GUI-based applications. Then there are the scheduling and orchestration troubles that can come up when you have a whole bunch of different systems that you're trying to independently orchestrate; it can become a nightmare, and you spend pretty much all your time putting out fires.

So what is the fix for this? This is kind of a joking slide, because in the last 10 years I think it has been said a lot that, hey, Hadoop is the answer to all our problems with our on-prem database: it's going to scale everything for us, it's going to do everything.
But I think, as we've seen in the last 10 years, Hadoop has not lived up to the promise it claimed. This is a Google Trends chart showing interest in Hadoop over time, and it's steadily decreasing. There are certainly implementations out there, but "Hadoop is the answer" has proven not to be the right answer. That is the lazy answer, and I think we as an industry decided we can do better, and that's what I'm going to show you today: how we can do better than constrained on-premise architectures, and how we can do better than something like Hadoop, which, while it served its purpose at certain places for a certain amount of time, follows patterns I don't think still apply today.

Hopefully it doesn't come as any surprise that we can leverage the cloud, and that the cloud can solve a lot of our problems. It of course gives us infinite (not quite infinite, but effectively infinite) scalability: if we need more servers, we can easily provision them in minutes. You don't have to order physical servers, rack and stack them, and then deal with the weeks of time and the sometimes prohibitive cost it takes to add more servers. Similar with storage: you can use object storage, things like S3, and that is effectively limitless as well. Another key component, which is going to be central to today's topic, is hybrid architectures. You don't have to commit 100% to the cloud; you can use the cloud for the components that you want. Additionally, you don't have to commit 100% to one cloud: you could use some services in the AWS ecosystem, some services in the Azure ecosystem, and some services in the GCP ecosystem. A lot of the core architectures are based on non-proprietary, open-source platforms that you can swap in and out, and APIs are exposed through these platforms so that you can plug and play the architecture as you need to.

And then finally, and I think this is a subtle yet extremely profound sentiment that I've seen in the last five or so years: around 2014-2015 people were claiming Hadoop was the next big thing and NoSQL was going to overtake SQL. But what I've seen, and what I imagine a lot of you have seen as well, is that SQL is still an incredibly reliable tool to do the job we need to do. Over the 40-plus years since its inception, it has proven to be an extremely useful tool that has stood the test of time, and while there are a lot of interesting things going on in the NoSQL world, for a lot of what we're doing, pure SQL is exactly what we need and what's going to work best for us, and it's what we know. So we want to take the tools and concepts that we already know and apply them to these new architectures, so that this is not a fundamental shift in our thinking; we still have some grounding in the architectures of the past.

With all that said, I'm going to try to make this a little more concrete and show you an example of one particular implementation of an architecture that solves a lot of the problems we discussed. Hopefully some of these concepts are new, or maybe you have familiarity with similar tools or have worked with them, but again, I encourage you not to focus too much on the implementation of these particular tools. I'm going to have the repo publicly available for you to dig into, because I'm not going to dive too deep into all the code and every specific element of the architecture; I'd say that's probably an exercise that would be great to do outside of this.
For this session, I think it'd be great to follow along, see how these things connect, and get a high-level understanding. With that, we're going to do three main demonstrations. First is infrastructure as code, which, if you're not familiar with it, is a way to provision resources in the cloud using configuration files. We don't have an administrator logging in and pointing and clicking to provision everything for us; we have our architecture defined in repeatable files that we can put in source control. I'll talk about that, and the particular implementation here is CloudFormation. Then we're going to orchestrate some ETL processes using Apache Airflow, which is a really interesting tool, and we're going to run that orchestration process on AWS infrastructure. Finally, we're going to show how that ETL process can manipulate and load data into a Snowflake data warehouse.

So let's get started. I'm in the AWS console, and I'm going to go into the CloudFormation tab, where you'll see this concept called stacks. A stack is just a discrete set of components that are deployed as a single unit. I already have a VPC, which, if you're not familiar with the AWS vernacular, stands for virtual private cloud; it's basically just the networking components required to spin up some other architecture, and the CloudFormation script to provision it is also included in the repo if you want to take a look. I have this already spun up, but I'm going to show you in the repository what the script we're going to be deploying looks like. Again, I'm not going to go into excruciating detail because it's a couple hundred lines long, but basically it's a YAML file that defines the resources we're going to be deploying. What's great about this is that we have a deterministic manifest that outlines all the components in our architecture. We can check this into source control, and if we need to tear down our architecture and spin it back up again, we can do so knowing that all the components are going to come back up exactly as they were. So configuration as code is the first framework or pattern that I would encourage you to look into if you haven't done so already.

I have a bunch of things that are going to get spun up here, which I'll describe as we go, but this takes about 10 to 15 minutes to actually run, so I'm going to kick it off and then go into a little more description of what's happening. To provision this CloudFormation stack, I'm going to click "create a stack with new resources", and we have a couple of options. You can specify your template, the file that you upload, by an S3 URL; I have this stack in an S3 bucket of mine, so I'm just going to specify that URL and click Next. Then it prompts us for some options: what should we name our stack? This is going to be my Airflow stack, and it's running on ECS, which I'll talk about momentarily. These parameters here are custom parameters that I have defined in this stack; whenever somebody spins up this stack, they need to provide some additional information so that it knows what work to do. The first one is the IP address range: when we run Apache Airflow on ECS we need to allow inbound traffic, so for now, and in big red letters here, don't do this in production, I'm just going to allow access from everywhere. Again, don't actually do this in a production setting, because everybody will have access to your application.
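Incidentally, this whole create-stack flow (template URL, parameters, and all) can also be scripted rather than clicked through in the console. Below is a minimal, hypothetical boto3 sketch of that idea; the stack name, template URL, and parameter keys are placeholders, not the actual names from the repo.

```python
# Hypothetical sketch: deploy the CloudFormation stack from Python instead of the console.
# Stack name, template URL, and parameter keys are placeholders for illustration only.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

cfn.create_stack(
    StackName="airflow-ecs-demo",
    TemplateURL="https://my-bucket.s3.amazonaws.com/airflow-ecs.yml",
    Parameters=[
        {"ParameterKey": "AllowedCidr", "ParameterValue": "0.0.0.0/0"},  # demo only, never in production
        {"ParameterKey": "VpcStackName", "ParameterValue": "demo-vpc"},
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],  # required if the template creates IAM roles
)

# Block until all resources are provisioned (roughly the 10-15 minutes mentioned above).
cfn.get_waiter("stack_create_complete").wait(StackName="airflow-ecs-demo")
```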
We have some pre-populated parameters here that pertain to the container configuration: Apache Airflow is going to be running as a Docker container, so we have to specify how much CPU and memory to allocate to that container when it's running, and additionally we want to specify which port to expose on the container. Then we want to specify, down here, our Snowflake account ID. I have my Snowflake interface open in this tab, which we'll get into in a minute, and I'm going to pull the account ID from the URL. You can sign up for a free trial of Snowflake and get 30 days to experiment with it, so I encourage you to do that. The region of my Snowflake account is us-east-1, and finally the stack needs to know which stack was used to provision my VPC, because it's going to use some resources from that, so I'm going to tell it I used the demo VPC stack.

Now we have some boilerplate options. We can add tags to specify properties of the resource that we're deploying; this can be really useful to, for example, record the department, the owner, or an email address, so that your resources have metadata on them that you can search on. We're going to skip over this, we don't need to add anything extra, and finally it gives us a page where we can review everything we told it to do. I feel comfortable with this, so I'm going to click "create stack", and now, based on all the resources defined in that CloudFormation script, it's going to spin them up one by one.

Going back to the presentation while we wait for that to run, let's talk about what's actually being provisioned here and go over some of the possibly new terms or technologies you haven't been exposed to. This is a high-level architecture diagram of what we're deploying. On the left we have CloudFormation, which is the tool that allows us to provision our resources using this configuration-as-code approach. Systems Manager is a tool within the AWS ecosystem that allows you to store certain parameters; we're storing the username and password for Snowflake in Parameter Store, among some other properties. It's just an easily accessible key-value store within AWS. So that's what is doing the deployment for us.

Then, within our VPC, this top box here, we are deploying into a public subnet a service called ECS. ECS stands for Elastic Container Service, and all it is is a serverless tool that you can use to run Docker containers, and Docker containers are fantastic; they're a great way to deploy applications. That's what we're doing here: we have defined a Docker container that has Apache Airflow pre-installed, which makes it really easy for us to test our application locally, if we have Docker running locally, and then deploy it up to Elastic Container Service. AWS Fargate is just the serverless flavor of Elastic Container Service; if you want, you can provision your own EC2 instances for your containers to run on, but I would recommend the serverless model because it means less maintenance and less overhead. Apache Airflow requires a database to store its metadata, so we are spinning up an RDS instance. RDS is Amazon's managed relational database service; in RDS you can spin up MySQL, Postgres, or even SQL Server databases. We're just doing a small Postgres database here because it's only storing some metadata.
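As a rough illustration of the Parameter Store piece mentioned a moment ago, a process could read those stored values with a few lines of boto3. The parameter names below are invented for the example and are not the ones the stack actually creates.

```python
# Hypothetical sketch: read Snowflake credentials from AWS Systems Manager Parameter Store.
# The parameter names below are illustrative, not the ones defined in the actual stack.
import boto3

ssm = boto3.client("ssm", region_name="us-east-1")

def get_param(name: str) -> str:
    """Fetch a (possibly encrypted) parameter value by name."""
    resp = ssm.get_parameter(Name=name, WithDecryption=True)
    return resp["Parameter"]["Value"]

snowflake_user = get_param("/airflow/snowflake/user")
snowflake_password = get_param("/airflow/snowflake/password")  # stored as a SecureString
```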
Our Docker container running Airflow is going to connect to that RDS instance running the Airflow metadata database, and that's how our ETL orchestration is going to run within AWS. Down here we have the Snowflake infrastructure. Snowflake will run in AWS, and I believe it'll run in Azure as well; don't quote me on that, but I believe you can do multi-cloud with Snowflake. Anyway, Snowflake is a completely cloud-based data warehouse environment that has a lot of the attributes we look for in a modern data warehousing solution: it separates storage and compute, it's scalable, and it's cloud-first. We could spend an entire day talking about Snowflake, but for now let's just cover some of the core concepts and then go into the demonstration.

The first core concept is that the compute and storage layers are completely separate and independently scalable. Under the covers, Snowflake stores data in an optimized, compressed columnar format, so it's going to be similar to other columnar data warehouses like Redshift or BigQuery (and I believe Synapse Analytics does this as well on the Azure side), which store data in a columnar fashion that is advantageous for analytical workloads. The beauty of separating the storage and compute layers is that your storage can scale independently of compute: you could store terabytes or petabytes of data, but if you don't really need to do much processing on that data, you don't have to spend inordinate amounts of money on processing.

Now, Snowflake has this concept called a virtual warehouse, which I don't think is a very good name, because to me "warehouse" implies storage, but in this context a virtual warehouse describes the compute resources. A virtual warehouse is an independent cluster of compute resources. When you run operations in Snowflake that require compute, things like SELECT queries, deletes, and inserts, you have to have a virtual warehouse running, and you can configure the size of your virtual warehouse depending on the requirements of your workload. If you have a really small workload that's not going to require massive compute power, you can spin up a really small warehouse, leave it running for the hours you need to run your workload, and then spin it down, and you only pay for the time that warehouse was running. Additionally, multiple concurrent warehouses can run and query the same data at the same time. This is one of the advantages of an architecture where compute and storage are separate: the warehouses are all hitting the same object storage under the covers. You could, for example, have one workload, your ETL workload, that requires significant compute, so you dedicate a warehouse to it, and you can of course set permissions so that not just anybody can use that warehouse to run their queries. It's a great way to federate the use of resources in your data warehouse and control costs.

Another core concept is loading data. Snowflake provides some great utilities for loading all sorts of data formats: CSV, JSON, Parquet. The pattern you apply when loading data is generally two steps: you upload files to a Snowflake-accessible stage location, which can generally be any cloud object storage (S3, and I believe Azure Blob Storage would work), and then you load data from the staged files into a table.
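As a rough sketch of that two-step pattern, the load might look something like the following. The object names, the S3 URL, and the table columns are invented for illustration, and the Snowflake Python connector is just one way to execute the statements.

```python
# Hypothetical sketch of the stage-and-copy loading pattern described above.
# Object names, the S3 URL, and the table columns are invented for illustration.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345", user="snowflake_user", password="********",
    role="SYSADMIN", warehouse="LOADING_WH", database="RAW", schema="PUBLIC",
)
cur = conn.cursor()

# 1) An external stage pointing at an S3 bucket of JSON files.
cur.execute("""
    CREATE STAGE IF NOT EXISTS airline_stage
      URL = 's3://some-public-bucket/airline/'
      FILE_FORMAT = (TYPE = 'JSON')
""")

# 2) A raw table that keeps the whole JSON document plus load metadata for lineage.
cur.execute("""
    CREATE TABLE IF NOT EXISTS airline_raw (
      src               VARIANT,
      source_file_name  STRING,
      source_file_row   NUMBER,
      create_timestamp  TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP()
    )
""")

# 3) COPY the staged files into the raw table, capturing file name and row number.
cur.execute("""
    COPY INTO airline_raw (src, source_file_name, source_file_row)
    FROM (
      SELECT $1, METADATA$FILENAME, METADATA$FILE_ROW_NUMBER
      FROM @airline_stage
    )
""")

cur.close()
conn.close()
```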
Now, this is the approach you would use for batch loads; however, there is a component of Snowflake called Snowpipe, which is a persistent data-loading mechanism where you are not charged as much as you would be with this standard, ad hoc style of loading. We're not going to go into Snowpipe now, that's beyond the scope of this talk, but I just wanted to make you aware that there are alternative ways to load data into Snowflake.

Let's talk a little bit about Airflow. If you're not familiar with it, Airflow is a Python-based workflow orchestration engine. I've talked about it as an ETL tool, but really it's more than that: it is a workflow engine, so you can write really any Python code, or any job that requires a directed sequence of events with dependencies on each other. But Airflow is built for data engineering workloads, so it has native database integrations with Redshift, BigQuery, Snowflake, MySQL, and SQL Server, big data integrations with things like EMR, Spark, and Databricks, and all these great plugins that make it really easy to create and orchestrate data-focused workloads.

Advantages of Airflow: it's got, like I just mentioned, all these fantastic integrations, but I think the greatest advantage is that it's purely code-based; it's all Python. One of the disadvantages of the legacy architecture we reviewed in the beginning was its reliance on legacy ETL processes that are GUI-based and proprietary, but Airflow is open source, it's written in Python, and all your workflows are written in Python, so it's easy to put in source control and collaborate with others. It's portable: if you want to run Airflow in AWS, you can do that; you can run Airflow in GCP; you can run it on your own server if you'd like. And, like I mentioned, it's open-source software, which I definitely prefer to use in cases like this.

Finally, the last new term we'll go over, which I've already touched on, is ECS, the Elastic Container Service. We've got Airflow, Airflow's awesome, we have it in a container, but it needs to run somewhere; we need a server for it to run on, and ECS provides that for us. ECS is a managed service for running containerized applications. It is quote-unquote serverless using the Fargate model, which means we don't need to deal with provisioning and operating EC2 instances; however, if you have a use case that requires managing your own EC2 instances, you can absolutely bring your own. ECS is tangentially related to Kubernetes; in an alternate universe we could do this exact same thing but run our Airflow container on Kubernetes. It's generally the same concept, just an orchestration platform for running containerized applications.

Let's step back from the slides a little and check on the progress of our stack. Our stack is complete, that's fantastic, but we still have a little bit of work to do before we run everything. The first thing we're going to do is some setup in Snowflake. This repository, which again I will provide the link to so that you can play with this as much as you like in your free time, has some scripts that we'll use to configure everything for this demonstration. The first script is going to deploy some objects, so I'll copy that in and just briefly talk about what's going on. I talked about how we load data via stages; we're going to create some stages
pointed at an S3 location that holds some public data sets we're going to work with. Here are the two stages we're creating that contain the data we want; these are S3 buckets, and the first one contains, I think, JSON files while the second one contains CSV files. Then we are going to create some raw tables that we'll load the data into. We can run all this right in the browser... and I got a failure. Okay, I actually ran this earlier while doing some testing, so I think we're okay; I think there's just a collision with an object that already exists, so let's cross our fingers and hope that's the case. What we did do is create a user just to run the ETL process; as you can see up here, I'm the sysadmin, and you really don't want your sysadmin to be the service account for everything. So we'll log out and log in with the snowflake_user, and we're now logged into a separate user account.

We're going to deploy a couple more things... you know what, I ran the wrong script, so let me go back for continuity. The script we actually need to run first is the one that creates our databases and roles and everything, so let's go ahead and run that. It's going to create a database for our raw data, a database for the analytics data we're going to model out, and some roles; it's just doing its admin work. I'm not going to go through it line by line, but I will point out, since we talked about virtual warehouses, here's the DDL to create one of those virtual warehouses: we're creating a medium warehouse and an extra-small warehouse as well. Then we create a user, snowflake_user. We'll log out of my user and log in as this snowflake_user; it prompts us to change the password, and I'm going to use the password that I've configured in my Amazon Parameter Store so that Airflow knows how to connect. Now we are logged in as our service account user for the ETL process, and we can run the script I mistakenly tried to run before, which caused our error; it's the one that creates our stages and the raw data tables we're going to load into. I'll go ahead and run this script, and it completed successfully, great. So we've got everything staged to start loading data into our Snowflake data warehouse.
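The warehouse DDL in that script boils down to statements along these lines. This is a minimal, hypothetical sketch via the Snowflake Python connector, with placeholder account, credentials, and warehouse names; the point is that a warehouse can auto-suspend so you only pay while it is actually running.

```python
# Hypothetical sketch of the virtual-warehouse lifecycle via the Snowflake Python connector.
# Account, credentials, and object names are placeholders for illustration only.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345",
    user="snowflake_user",
    password="********",
    role="SYSADMIN",
)
cur = conn.cursor()

# Create a small warehouse that suspends itself after 60 idle seconds,
# so you only pay for the time it is actually running.
cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS loading_wh
      WAREHOUSE_SIZE = 'XSMALL'
      AUTO_SUSPEND = 60
      AUTO_RESUME = TRUE
      INITIALLY_SUSPENDED = TRUE
""")

# Scale it up for a heavier batch load, then suspend it explicitly when finished.
cur.execute("ALTER WAREHOUSE loading_wh SET WAREHOUSE_SIZE = 'MEDIUM'")
# ... run the load ...
cur.execute("ALTER WAREHOUSE loading_wh SUSPEND")

cur.close()
conn.close()
```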
Now I'm going to go back into AWS, and we're going to go to ECS, our container service. In ECS you have to create a cluster to run your containers, so we have this cluster, which was created by our CloudFormation script. We then have a service, and a service can contain one or more running containers. Ours is fairly simple in that we just have a single running container task, but in many application workloads, less so in the data engineering space and more so if you have an application composed of several microservices, you'd have them all running within the same service on the same cluster, and they could communicate with one another. Our use case is simple, and we just have this one running task; you can think of a task as a running container in ECS. We'll click on that and get some status; great, it is in a running state, which is what we want. We have some properties that describe our task definition, which is what we deployed when we ran our CloudFormation script. I'm not going to go into all that right now; I'm just going to look at this container. We can see some details about its properties, but the important thing to look at is the public IP address, because this public IP address is how we're going to access the Airflow interface.

So we go there, and we have Airflow running in a container. Airflow gives us this really nice user interface that allows us to see our DAGs. A DAG in Airflow is a directed acyclic graph; it's really just an application that describes a data flow. We have two in here, snowflake_raw and snowflake_analytics, and I'll show you what the code looks like to demystify that. Our snowflake_analytics DAG is actually really simple, just a little over 50 lines of code. All we do is define our DAG, give it a name and a description, provide a schedule, and then add some operators. You can think of these operators as the actions we need to execute, and we have three Snowflake operators. Like I mentioned, Snowflake, among many other cloud data warehouses, is a first-class citizen within Airflow, so these operators tell Airflow what it needs to do within Snowflake. The important thing to look at is that we are executing a SQL script. Again, SQL is king, we love SQL, so we just want to keep things in SQL and not use any funky proprietary language or pseudo-SQL syntax. I like SQL, so let's stick with that. The SQL script that we're going to run is specified there, and we have some other properties: the connection ID we're going to use, which warehouse we want to use (that's the virtual warehouse, so we're using our loading warehouse), the database, et cetera. Finally, down here we take these operators and define the dependencies between them. This says that the first operator runs, then the second operator runs, then the third one runs. This is effectively our directed acyclic graph, our DAG, and if you're familiar with Spark you should be familiar with the concept of a DAG, but I won't belabor that point any more; I'll get to the fun stuff that I hope works.

I'm going to turn on the DAG for our snowflake_raw load. I can refresh here, and right now we have these statuses: two operations are queued, and we can continually refresh to check what's going on. We have two things running, and if I click on one, it provides some information about what's running; I can look at logs and do a whole lot more, but I generally just stick to this main page and watch what's going on. So far things are still running, which is great.

Then we can go into Snowflake. Snowflake has this really cool history feature, which lets us see what queries have run, what queries are currently running, how long they took, and what warehouse was used to run them, really powerful metadata that a database administrator would love. We can refresh this and see that we have a COPY command running; it is copying into our airline_raw table. If you remember when we looked at our DAG, it was just running SQL, so let's take a look in here... there we go, there's the command that's running. Again, nothing too magical going on: we just have a SQL command running, Airflow is what orchestrated that command, and it is loading our airline_raw table. That's going to take probably a few minutes to run; I think it's 20 gigs of data, so it's not a trivial amount of data.
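For reference, a DAG along the lines of the one just described might look roughly like this. It's a hedged sketch rather than the actual code from the repo: the task IDs, SQL file paths, and connection ID are made up, and the SnowflakeOperator import path shown assumes the Airflow 2.x Snowflake provider package (Airflow 1.10 used a contrib path instead).

```python
# Rough sketch of a DAG like the one described above; names, schedule, and SQL file
# paths are placeholders, not the actual code from the presenter's repository.
from datetime import datetime

from airflow import DAG
# Assumes the Airflow 2.x Snowflake provider package is installed.
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG(
    dag_id="snowflake_raw",
    description="Load raw airline and taxi data into Snowflake",
    schedule_interval="@daily",
    start_date=datetime(2020, 6, 1),
    catchup=False,
) as dag:

    load_airline_raw = SnowflakeOperator(
        task_id="load_airline_raw",
        sql="sql/load_airline_raw.sql",          # plain SQL file checked into the repo
        snowflake_conn_id="snowflake_default",   # connection configured in Airflow
        warehouse="LOADING_WH",                  # the virtual warehouse to use
        database="RAW",
        schema="PUBLIC",
    )

    load_taxi_raw = SnowflakeOperator(
        task_id="load_taxi_raw",
        sql="sql/load_taxi_raw.sql",
        snowflake_conn_id="snowflake_default",
        warehouse="LOADING_WH",
        database="RAW",
        schema="PUBLIC",
    )

    row_count_check = SnowflakeOperator(
        task_id="row_count_check",
        sql="sql/row_count_check.sql",
        snowflake_conn_id="snowflake_default",
        warehouse="LOADING_WH",
        database="RAW",
        schema="PUBLIC",
    )

    # The dependency chain: first operator runs, then the second, then the third.
    load_airline_raw >> load_taxi_raw >> row_count_check
```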
It's maybe 20 million rows, or 100 million rows, I forget the exact metrics, but we can let that run and go back into Airflow to refresh our status. You can see we have three successful tasks that ran. We can look at a task, do all kinds of things, and look at the log from its execution, so we have complete visibility into what happens when our tasks run. Admittedly, Airflow has a slightly steep learning curve as far as navigating the user interface and understanding how scheduling works; it's slightly confusing and not the most intuitive thing in the world. There is a really good Udemy course that goes in depth into Airflow, and I can provide that information if you're interested in learning more. Airflow was actually created by Maxime Beauchemin (I don't know the exact pronunciation), who built it while at Airbnb, hence the name Airflow. He was a data engineer at Airbnb; he's not there anymore, but he's worked at Airbnb and Facebook, and he has some really incredible content and some cool ideas around the future of data engineering and data engineering workloads, which was definitely a motivation for a lot of this presentation and a lot of the work that I'm currently doing.

Things are still looking good; our raw copy operations are still running. If we look at what we're going to run next, once that's complete, it's the snowflake_analytics DAG, so let's take a look at it. This should look fairly familiar; it looks a lot like the snowflake_raw DAG. Or actually, we already looked at the analytics DAG; the raw DAG is what's running now, so apologies, I mixed up my DAGs. Anyway, this one is going to run some analytical queries: it runs the "create analytics calendar" task, which is really just creating a dimension table with some date information we can use, and then it runs the aggregations for the analytics airline and NYC taxi data sets. Those are the operators we're working with; again, you can do quite a bit with Airflow.

Let's check back on how our data load is running. It looks like our NYC taxi raw table has been loaded, so let's take a quick look: if we refresh and look at our raw public schema, we should see airline_raw and nyc_taxi_raw. Oh, and a note about Snowflake: I am doing this all through their browser interface, which is actually pretty solid as far as browser interfaces and browser IDEs go, but you of course have the ability to connect using DataGrip or any of the JDBC-compliant database IDEs. You are not constrained to running this in the browser; you can connect to it as you would to Redshift or BigQuery (obviously you can't use SQL Server Management Studio, but there are other options). They also have a command-line utility for running administrative and operational commands within Snowflake, so there are a lot of ways to run commands and set up automation workflows.

But let's take a look at this data. Here's our raw data; I'm only going to select the top 10 rows because there are 100 million. We loaded our raw data and we captured some metadata about it: here's the record, and here's the file name that record came from. This could be really useful for a data lineage strategy; it's always great, in your downstream dimensional models, to be able to trace a record back to the exact immutable source from which it was derived.
That's one feature I really like about Snowflake: when you load data, you're provided with a lot of great metadata that can really help support a strong data lineage strategy. You have the row number that the record came from in its source file; we hard-coded this create_process field, and you could obviously put whatever you want there; and there's a create timestamp, so you know when the data was loaded. And of course, importantly, you have the source data itself, so we can look at this and see that we've loaded JSON. That's another great feature of Snowflake: you can query nested data, you can query JSON, you can query CSVs; it's very flexible in the data formats it supports. And in our SQL command to query... what is this? This is the airline data. In our airline query (we already ran that... here we go) you can see how we actually extract the fields from that airline raw data: we just pull each field out of the source column, casting it to the appropriate type, so you have a lot of flexibility in how you manage that.

Let's take a look at the taxi data just for completeness. This looks slightly different because we loaded from CSV, so when we provided the CSV load format it already delimited our data appropriately into these columns. We have all the raw data there, but we still have our source file name and our source file row, so we get that really nice data lineage we want. So there's our raw data loaded; let's refresh and see that everything has completed running. Great, so we can go back to Airflow and kick off our analytics job, which won't take quite as long; it just runs the queries that load our airline table, calendar table, and so on.

Similar to the first DAG we ran, we can see that we've already got one successful task. If we look at the DAG, we can view it in a different way; hopefully this isn't too hard to see, but it shows a graphical representation of the DAG. We have this create_analytics_calendar_date node, and if we click on it, it expands and tells us information, and then there are the next nodes in the DAG, which describe the operations that are going to run. It's just a nice visual representation, and you can see the chart fill out with statistics around the DAG runs, so there's a lot of power in this UI; it definitely takes some getting used to, but it's pretty great.

So that's running. We can go back to our history tab and refresh, and we can see that we created our NYC taxi table, we created our airline table, and we created our calendar date table, so everything is done in here. We go back to Airflow, refresh, and we have three successfully completed tasks, which is fantastic. So we can finally, after all that hard work, go take a look at our analytics database, and in the public schema we see our airline table, our calendar date table, and our NYC taxi table. Let's select the top ten from... let's do airline, actually, because that's the one that was nested JSON. There we go: our data has been flattened from that JSON, and we can go on and model it any way we want in further downstream processes. At this point our data is in the database, and Airflow has accomplished its job of orchestrating loading data into our data warehouse.
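As a rough illustration of that kind of field extraction, a query like the following pulls typed columns out of the raw JSON alongside the lineage metadata. The field and column names are invented for the example; Snowflake's `src:field::type` syntax is what does the extraction and casting.

```python
# Hypothetical sketch of extracting typed fields from the raw JSON column.
# Field names (carrier, flight_date, ...) and object names are invented for illustration.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345", user="snowflake_user", password="********",
    warehouse="TRANSFORMING_WH", database="RAW", schema="PUBLIC",
)
cur = conn.cursor()

cur.execute("""
    SELECT
        src:carrier::string    AS carrier,
        src:flight_date::date  AS flight_date,
        src:dep_delay::number  AS dep_delay,
        source_file_name,              -- lineage: which file this row came from
        source_file_row                -- lineage: which row within that file
    FROM airline_raw
    LIMIT 10
""")

for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```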
You can certainly configure Airflow to run your further downstream transformations, taking this raw data and transforming it into a data model. Another tool that is really great for that, and that I've been really interested in lately, is called dbt. If you have not heard of dbt, definitely check it out; it's one of the more exciting products in the analytics engineering industry that I've come across, and it presents a really cool way to build analytical models and ETL processes. But I'm not going to get ahead of myself; we have enough technologies to focus on today.

So to recap: we've done a ton, there's a lot of work that went into this, and we've done some really cool stuff. To zoom back out from the details: we've taken Airflow, which is an orchestration tool, and wrapped it in a container, and you can see down here the Dockerfile for that container. No magic here, it's a pretty short Dockerfile; we're just installing Airflow on an Amazon Linux image, and we can run this container anywhere we want. We can run it locally, which is great for testing (that's one of the benefits of Docker), and we can be confident that if it works as we expect locally in Docker, it's probably going to work as expected when we run it on ECS or Kubernetes, because that's kind of the point of containers; it's great. So we've got Airflow running in a container, it's running on ECS, which is our serverless container runtime, Amazon provides the networking infrastructure to run all that, and we can navigate to our Airflow UI through there. Then we talked about Snowflake, which is a cloud-first data warehouse with some key advantages, such as separation of compute from storage, elastic scalability of storage and compute, a nice web-based UI, and integration with Airflow. So, a lot of great things there.

That pretty much concludes the demonstration. Again, there's a lot we went over, and I could have spent several days going into the nitty-gritty details. Like I mentioned, I'll definitely provide this repository, and in it there are some good instructions for how to run and configure all of this. This is all based on a project that myself and a few other data engineers at Slalom started working on last year; we presented a lot of this at an Amazon Web Services demo day. So I definitely don't want to take even half the credit for all this great work; there are some really talented folks at Slalom who put in a ton of legwork, and I get to stand on their shoulders and show a lot of it off.

Hopefully this was a useful and engaging presentation for you. If you have not explored some of these tools in the AWS ecosystem, or Snowflake, or some of the more modern paradigms of orchestration and ETL like Airflow, maybe you can use this as a starting-off point. I'm happy to field any immediate questions right now; I'll also provide my contact information, and please reach out to me at any time with questions around this or any related topic. I'd be more than happy to engage, jump on a call, and talk through any of it, and if you want to work on getting this running locally and run into issues, hit me up and we'll jump on a Zoom call and get it running. With that, I will stop sharing and open the floor to any questions. I think Michelle was going to monitor chat, was that happening?
I see a couple. "Do you use a YAML template or AWS CDK for CloudFormation?" I'm using a YAML template; in this repo that I'll provide we've got these YAML files for CloudFormation, and you can write the files in YAML or JSON, whichever you are most comfortable with.

"Any concerns or issues regarding the Snowflake database, like the cost, or anything else you came across during the project?" That's a great question, and this is huge; it's definitely one of the main considerations you should focus on before you implement any cloud solution. The barrier to entry is very low: it's easy to get started in AWS or any of these cloud services, run a whole bunch of applications and data warehouses, and then at the end of the month see a really nice fat bill that you weren't expecting. With Snowflake specifically, if you're just running this in a demonstration environment, you have a free trial account that you can use for 30 days, so everything I did today in Snowflake is totally free. In Amazon, a lot of it is going to fall under the free tier if you haven't used AWS before: you sign up for an account and you're eligible for a year of, not entirely free everything, but there are services that fall under the free tier, so just running this demonstration doesn't really cost anything. If you're running in production, of course... with Snowflake, I've never actually run it in production, I've only used it for demonstration purposes, but I've worked with other cloud data warehouses like Redshift, and there my advice would be to monitor your costs really closely, set up alerts, and set up budgets. Most major cloud platforms (I know Amazon does at least, and I'm sure Snowflake has this ability in some way) let you set up budgets, notifications, and alerts around your cost for the month. Costing all this out could be a whole other presentation, so I won't go into it too deeply, but there are great options there, and one of the beauties of something like Snowflake is that you can get your feet wet without spending thousands of dollars; you can provision some small workloads and some small virtual warehouses, see how it fits your workload, and scale up from there if needed.

All right, another question: "Airflow over Step Functions, any comments?" Yeah, that's a really great question. If you're not familiar with Step Functions, it's an Amazon service that effectively allows you to orchestrate certain kinds of activity within Amazon, so you can use Step Functions for a lot of what we just did with Airflow; you can execute scripts with it; it's kind of a generalized interface for creating workflows. I would say that I would prefer Airflow, mostly because you avoid some vendor lock-in. Now, you're going to have to weigh your trade-offs: if avoiding vendor lock-in is more important to you, then definitely look at Airflow; if what's more important is the overhead it takes to manage, then consider that, as we saw with Airflow running in AWS, you've got to spin up a bit of infrastructure, it wasn't as simple to do, and there's a non-negligible monetary and technological cost to provision that. So I'd say it really comes down to what's important to you: the ease of Step Functions being totally serverless, where you can spin something up pretty much instantly,
versus using Airflow, where you have maybe more control and more configuration and it's a little more open, but there's more to manage. It's the classic trade-off of really any serverless versus server-based solution: vendor lock-in versus technical overhead and complexity. But Step Functions are really cool; I think there are a ton of use cases for them, and I don't think you need to use exclusively Step Functions or exclusively Airflow. Depending on your use case, Step Functions might be the right tool for the job; if it's something really simple and you're just running one script, Airflow is probably overkill. So those are my thoughts on that.

Thank you so much, we really appreciate what you've presented; it was highly educational, and I'll definitely make use of your materials. Thank you, everyone, for attending our online meeting. We would also like some feedback: if there are other ways we can improve, we'd be glad to hear from you. If you would like to present as well, please feel free to reach out to any of us; you can reach out on LinkedIn or send an email to the group, or to Brad or Michelle. Okay, thank you once again, and we hope to see you in July, on July 7. That's it, thanks again, Wes. Thank you. I just posted the information in the chat, and I can also distribute it so we can get it communicated to everyone.
Info
Channel: Charlotte BI Group
Views: 2,551
Id: 924V6edYW_w
Length: 78min 52sec (4732 seconds)
Published: Thu Jun 04 2020