Vocabulary for Data Engineers - Data Engineering 101

Video Statistics and Information

Captions
We have probably all done it: we've assumed that the people we're talking to understand the words and terms we're using, terms that might be very specific to our role at a company, even if it's their first day on the job. I think we're all guilty of this, so I wanted to put together a video for people who are just starting out in the industry, or who are just trying to understand what their data engineers are talking about, by assembling a core set of vocabulary that a lot of data engineers tend to use.

To kick this off, let's use the term DAG, or directed acyclic graph. This is often an abstraction of tasks that need to happen in a certain series of steps. "Directed" means that there are multiple tasks, some upstream and some downstream, so the work flows in a certain direction. "Acyclic" refers to the fact that no task can cycle back on itself, so the DAG will never get into an infinite loop. And "graph" means there's a finite set of nodes connected by edges, where the nodes are essentially tasks. In practice it ends up looking like this: you'll have a task A that runs first, then a task B that depends on A, then maybe C and D that both depend on B but can run at the same time, and finally both of those feed into a task E to finish things off. It's a very simple chart of tasks that flows from A to E. When someone references the term DAG, that's what they mean: a set of tasks that need to occur in a certain order. And in the data engineering context, when we use the term DAG, we're often also connecting it to the concept of a data pipeline, which is the next term.
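The A-through-E example above can be sketched in a few lines of Python. This is just an illustration of the dependency structure, not any particular orchestrator's API; the task names are the ones from the example.

```python
# A sketch of the DAG described above: "a" runs first, "b" depends on "a",
# "c" and "d" both depend on "b" (and could run in parallel), and "e" waits
# on both "c" and "d".
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on (its upstream tasks).
dag = {
    "a": set(),
    "b": {"a"},
    "c": {"b"},
    "d": {"b"},
    "e": {"c", "d"},
}

# A topological order is one valid execution order that respects every edge.
order = list(TopologicalSorter(dag).static_order())
print(order)  # e.g. ['a', 'b', 'c', 'd', 'e']
```

Tools like Airflow express the same idea: you declare the dependencies, and the scheduler figures out a valid run order (and which tasks can run concurrently).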
The interesting thing to me about the term data pipeline is that it can honestly cover a lot of different things, because there are a lot of different ways we use and design data pipelines. There are the more standard pipelines, often referred to as batch data pipelines, which run at certain time intervals and usually follow a pattern of either ETL or ELT: extract, transform, load, or extract, load, transform. These are your standard ingestion pipelines, which take data from source systems and push it into an analytical storage system (which we'll get to later). On top of that, you also have streaming data pipelines, which push events directly into that analytical storage system, whether that's a data lake or a data warehouse (also coming up). And then, to me personally, there are also integration pipelines, often managed by iPaaS systems, where iPaaS stands for integration platform as a service. These pipelines integrate across multiple systems: whereas a standard ETL job pushes data from source systems like Salesforce or HubSpot into your data lake or data warehouse, an iPaaS-style pipeline connects two source systems together, say, pulling data from Salesforce and integrating it with HubSpot.

Data engineers generally don't develop these integration pipelines. That work tends to be done by enterprise engineers or by, say, a Salesforce expert, because unlike ingesting data and storing it in a data warehouse, you need to understand how the underlying systems actually work. If you're pulling data from HubSpot into Salesforce, you need to understand what's happening inside Salesforce, and there are a lot of Salesforce admins, experts, and developers who understand that kind of work, so they're often either involved or the ones who build those pipelines.

And look at that: I've unavoidably added iPaaS, ETL, and ELT alongside my data pipeline terminology, because that's just how conversations work in any field. It's almost hard not to bring in other terms you rely on while describing the one you're focused on. So that's the interesting thing about data pipelines: they can mean a lot of different things, from standard batch pipelines (usually ETL- or ELT-based, with some other variations as well), to streaming, to integration-style pipelines that connect different systems.

Now, these more standard data pipelines are usually used to extract data from some source system like Salesforce and push it into an analytical storage system: a data lake, a data warehouse, or a data lakehouse. So now you've got even more new terms. Let's start with the classic data warehouse, which is basically a place to store your company's information in a way that, first of all, attempts to integrate all of your various entities. You might have finance data, hiring data, and so on, and you try to create a model in the data warehouse that represents your business. Essentially, it tries to replicate everything you're doing in a way that, one, tracks historical information (one of the key things), and two, integrates data across all of these complex systems. That way, when I ask a question that involves multiple departments, I can answer it, because all this data is connected. And the fact that it tracks historical information, unlike a lot of source systems, means that when I ask a question that goes back over years, I can actually answer it.
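The batch ETL pattern described above can be sketched very minimally. Everything here is a stand-in: the "source" is an in-memory list rather than a real Salesforce or HubSpot API, and the "warehouse" is just a dict, but the extract → transform → load shape is the point.

```python
# A minimal sketch of a batch ETL pipeline, under the assumption that the
# source and destination are plain in-memory structures (hypothetical data).

def extract(source_rows):
    """Pull raw records from the source system (here, just a list)."""
    return list(source_rows)

def transform(rows):
    """Clean and reshape: normalize names, cast amounts, drop incomplete rows."""
    return [
        {"customer": r["customer"].strip().title(), "amount": float(r["amount"])}
        for r in rows
        if r.get("customer") and r.get("amount") is not None
    ]

def load(rows, warehouse):
    """Append the transformed records to the destination table."""
    warehouse.setdefault("sales", []).extend(rows)
    return warehouse

raw = [
    {"customer": "  ada lovelace ", "amount": "120.50"},
    {"customer": "grace hopper", "amount": "75"},
    {"customer": "", "amount": "10"},  # incomplete row, dropped in transform
]

warehouse = load(transform(extract(raw)), {})
print(warehouse["sales"])
```

An ELT pipeline would just swap the last two steps: load the raw rows first, then run the transformation inside the warehouse itself.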
You can select data using something we call slowly changing dimensions, often abbreviated SCD, to look at data from a historical point of view, and we'll get to SCDs in a second. But first, let's jump into the data lake. Obviously I'm not going into crazy depth explaining exactly what a data warehouse is, but one thing I didn't mention is that data warehouses tend to have a pretty rigid schema that's predefined up front, so the data types required and used in a data warehouse are limited. A data lake, by contrast, is often a place where you can store raw data. It might be schema-less when you store it, and you might do something we call schema-on-read, which means defining the schema as you process the data in the lake. So it's a place to store data that maybe doesn't fit naturally into your data warehouse, or that's just too expensive to store there in the traditional way. It could also be things like images, free-form text, XML, or JSON: data that's less structured, or semi-structured as we often say, and that fits well in a data lake.

There are multiple reasons people might store data in a data lake versus a data warehouse. Some have to do with cost or the size of the data, and some with time: maybe you don't have time to figure out whether the data belongs in the warehouse, whether it could even provide value, or how to model it properly. So you store it in the data lake as a place where you can do some work on it and give data scientists or others the ability to access it, although working with it often requires a little more technical know-how, because you have to write code or work with it in ways that aren't SQL-friendly like a data warehouse is, unless of course you put something like Hive or Presto on top. There are definitely a lot of nuances here, and I'm speeding through this to help people understand what a data lake is from high above. But generally, a data lake holds semi-structured or even unstructured data, stored in a way that should stay organized (try not to create a "data swamp," as many of us describe it) so it can be used in the future.

Now let's dig into the data warehouse, because there are a lot of terms you'll hear the first time you work in one. The first time I worked in a data warehouse, I didn't even realize it was a different thing from a regular database. I had learned relational models, so when I was working in the warehouse I thought, okay, there are things called facts, things called dimensions, and there are keys, which are like IDs, so you can connect tables that way. I just didn't understand it was a totally different kind of system. So let's dive into some of the key components you'll hear about, starting with the tables. You're going to hear tables referred to as dimensions. Dimensions are generally categories or descriptors: product might be a dimension, person or customer is generally a dimension, you might have things like category, and even date plays a role as a dimension, because these describe things like interactions or sales. The sales themselves, like total sales, are what we reference as facts. Facts generally record business transactions, and everything around a fact that's used to describe that business transaction, think location, is generally a dimension.
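The fact/dimension split above can be shown with a toy star schema: a fact table of sales transactions holding foreign keys into a location dimension. The table and column names here are made up for illustration, not from any real warehouse.

```python
# A toy star schema: fact_sales records transactions; dim_location describes
# them. Summing the fact by a dimension is the typical warehouse query shape.
from collections import defaultdict

dim_location = {1: "New York", 2: "Seattle"}  # location_id -> name (dimension)

fact_sales = [  # each row is one business transaction (fact)
    {"location_id": 1, "amount": 100.0},
    {"location_id": 2, "amount": 250.0},
    {"location_id": 1, "amount": 50.0},
]

# "Total sales by location": join each fact row to its dimension, then sum.
totals = defaultdict(float)
for row in fact_sales:
    totals[dim_location[row["location_id"]]] += row["amount"]

print(dict(totals))  # {'New York': 150.0, 'Seattle': 250.0}
```

In SQL this would be a `JOIN` from the fact table to the dimension table with a `GROUP BY` on the dimension column, which is exactly the pivot-table pattern described next.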
Another way to think about this: when you're creating a pivot table, a lot of the things you pivot on are generally dimensions, because most of the time someone's going to ask for something like average sales by location and quarter. In that case, location and quarter are both dimensions, and the total sales figure is a summed-up version of data coming from a fact table: the sales amounts. That's what people are referencing when they say fact or dimension. They're just different types of tables, and often you'll be querying around the fact table, because that's the table with the numbers you're trying to calculate.

Now, within that, there's something I referenced earlier: slowly changing dimensions, or SCDs for short. These are techniques that allow you to track historical data. The problem with a lot of source systems is that they don't track history. Sometimes they do in logs, which is why we have things like change data capture, to pull in and track every change that occurs in a source system. But slowly changing dimensions let data warehouses track how dimensions change over time, using a few different methods. You'll hear people say things like "we're using SCD2" or "SCD6"; these are just different ways to capture historical information.

Let me give you some context. Say you're creating a report and you want to count the number of customers in New York City over the last three years. Well, over the last three years people have moved, and if you just overwrite where someone currently lives in your data warehouse (they lived in New York and moved to Connecticut, or lived in California and moved to New York), then when you run that query, especially over time, you're going to report the wrong numbers. Whatever you're reporting is really just who currently lives in New York, applied across multiple years, which isn't accurate: if someone moved, they shouldn't be counted in the years after they left. That's why slowly changing dimensions are beneficial.

If you were to actually look at one, you'd see something like two or three rows for the same entity, say an employee's job title. Along with that you'll see a start date and an end date (this is a very simplistic way of looking at it), which give you a date range. When the employee changes roles, the end date on the old row gets filled in, or there's often also the option of a flag marking that row as no longer active, and a new row is created with the new job title, capturing when they started the current job. Now you've tracked when they worked in each role, so when someone asks you to report historical information on employee roles, you have it.

Another term, one I recall tripping me up in an Amazon interview, came from someone who I think did project management: they asked me about SLAs, and I had no idea what in the world an SLA was, because I had never worked with the concept. SLA just means service level agreement. When you're building all this stuff, these data pipelines and data warehouses, people rely on these systems, and if you don't deliver data on time, if your data doesn't have a certain level of quality, or if you don't know who's responsible for all of these things, problems happen.
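Before moving on, the SCD Type 2 mechanics described above (a start date, an end date, and an active flag per row version) can be sketched like this. The employee name, titles, and column names are hypothetical.

```python
# A sketch of SCD Type 2: when an attribute changes, close out the current
# row (fill end_date, clear the active flag) and append a new current row,
# so the full history is preserved.
from datetime import date

history = [
    {"employee": "sam", "title": "Analyst",
     "start_date": date(2020, 1, 1), "end_date": None, "is_current": True},
]

def change_title(history, employee, new_title, effective):
    for row in history:
        if row["employee"] == employee and row["is_current"]:
            row["end_date"] = effective      # close the old version
            row["is_current"] = False
    history.append({"employee": employee, "title": new_title,
                    "start_date": effective, "end_date": None,
                    "is_current": True})     # open the new version

change_title(history, "sam", "Data Engineer", date(2022, 6, 1))

# Two rows now exist for the same employee: the old title with a closed
# date range, and the new title as the current row.
current = [r for r in history if r["is_current"]]
print(len(history), current[0]["title"])  # 2 Data Engineer
```

A point-in-time query ("what was sam's title in 2021?") then just filters on the date range instead of the flag, which is exactly what the New York customer example needs.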
Throughout the process, there will be days when something happens upstream that you can't predict, and now your data is wrong, or maybe it's late. An SLA (and I'm going to borrow this from Locally Optimistic) is basically a commitment between a service provider and a client: particular aspects of the service, things like quality, availability, and responsibilities, are agreed upon between the service provider and the service user. The goal is that in situations where data is required, say for a report every Monday at 8 a.m., everyone fully understands, one, that there's a due date for this data to exist, and two, that there will be people on call and ready, so if at 7:30 you figure out the data isn't there, you know who to call. It makes sure everyone is in agreement on what services need to occur.

Because here's what happens if you don't have an SLA, and I've seen this happen. Say I made a change to a data pipeline last night and it broke some things. In general, nobody ever complains when this particular pipeline breaks, but it was Tuesday evening, and on Wednesday there's a board meeting, and in that board meeting they look at a report that depends on this pipeline succeeding. If it doesn't, the analysts look bad, everyone looks bad. But if I don't know that's happening and that I should be extra careful on Tuesday, I'm going to keep pushing data, and then everyone's going to be upset. Obviously I shouldn't be breaking data pipelines in the first place, but the point is that everyone should know what matters and when, so that when software engineers make changes to their systems upstream, those conversations can happen. That's the goal of an SLA: to build trust by making sure everyone understands what needs to occur and when, because it's hard to hold, say, data engineering responsible if they're unaware that there's an end goal that has to be met at a certain time.
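In practice, teams often back an SLA like "data ready by Monday 8 a.m." with an automated freshness check. A minimal sketch, with hard-coded stand-in timestamps rather than a real metadata query, might look like:

```python
# A minimal data-freshness check behind a hypothetical SLA: compare when the
# table was last loaded against the agreed maximum staleness, and flag a
# breach so the on-call person can be paged before the 8 a.m. report.
from datetime import datetime, timedelta

sla_max_staleness = timedelta(hours=24)       # agreed freshness window
last_loaded = datetime(2022, 6, 1, 7, 30)     # when the pipeline last wrote
checked_at = datetime(2022, 6, 2, 8, 0)       # Monday 8 a.m. report time

breached = (checked_at - last_loaded) > sla_max_staleness
print("SLA breached" if breached else "Within SLA")  # here: SLA breached
```

Real setups wire a check like this into the orchestrator or a data-quality tool so the right people get alerted automatically, which is the "who to call at 7:30" part of the agreement.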
And so that's why you'll hear things like SLAs: oftentimes you need to bridge the gap between the engineering side and the more analytical and business sides, so everyone understands when things are supposed to exist or happen, and there's a certain level of trust, a level of knowing this data will be here 99.999 percent of the time.

With that, this will end video one of data engineering vocab. I really want to do some others where we'll talk about things like idempotent and ACID, and some of the SQL terms that people just throw around, where you're left wondering, why did this person say CTE, what is that? So I hopefully have more of these planned, depending on how this video goes. I really appreciate all of your time, and I will see you next time. Thank you and goodbye.
Info
Channel: Seattle Data Guy
Views: 36,774
Keywords: learn data engineering step by step, learn data engineering free, seattle data guy, data engineering 101, data engineering, ben rogojan, how to become a data engineer, data engineering skills, data engineer for beginners, data science vocabulary, data analysts, how to become a data analyst, data analysts skills, key data vocabulary, programming, tech
Id: TDbjd6Wl6TI
Length: 15min 10sec (910 seconds)
Published: Wed Jun 01 2022