What "REALLY" is Data Engineering? By a Data Engineer

Video Statistics and Information

Captions
Data engineering is not about creating large-scale data pipelines on the cloud. It's not about writing code in Python or Spark. It's also not about creating beautiful reports and visualizations. Data engineering is about using data to drive as much value as possible for your company or your client. Now, this value can take multiple forms: figuring out which metrics in your data are important, transforming or aggregating your data so that it makes sense, or making recommendations based on historical data points.

I get a lot of questions like: should I learn Spark in Python or in Scala? Should I learn Snowflake or BigQuery? Well, guess what: it doesn't really matter. As long as you are able to solve complex data problems, you can use Excel for all I care.

So when I Google or search "what is data engineering" on YouTube, this is what I get. [Music] Now, you might end up more confused after the search than you were before, because there's a lot of misunderstanding here. There's a difference between what catches more eyeballs or gets more views and what the reality of data engineering is.

I am Jash, and I work as a data engineer at Google. Companies like Google use a lot of data to improve their products or help other companies do the same. So today I am going to talk about what data engineering really is. [Applause] [Music]

Before we move on to the rest of this video: this is a new channel and I need your support as much as possible, so drop a like and subscribe if you have not already.

First, let's start by looking at the Google Trends chart for the term "data engineering". It lets us look back as far as 2004, and we can see that there was minor interest in this term during that time as well. We will find out why, but before we look into it, we need to understand that the concept of data engineering is not new. It has been around for decades, and in fact for centuries. Even in ancient civilizations, they needed to
keep track of data like trade goods, crop yields, or tax records. Some fantastic examples of early data recording go back to around 3100 BCE: Sumerian clay tablets, on which data was captured with different symbols. The clay tablets contain valuable data documenting information such as the distribution and delivery of grains like barley or wheat. Now, if you think about it, the information about grains on these tablets was data, and when somebody moves a tablet from one place to another, from one person to another, and the chain continues, this becomes a data pipeline. And if there is somebody governing all of this data movement, then that person kind of becomes a data engineer.

Now let's fast forward a little to World War II. In World War II, they calculated the number of submarines that Coastal Command airplanes should expect to spot per flying hour, and the most efficient scouting system. When the results fell far short of what was predicted, it became apparent that the planes were being detected earlier than expected by U-boat crews, and the paint on the bottom of the wings had to be changed from black to white. So the concept of predicting something out of data also goes back to World War II.

Fast forward a little more and look at one of the most fundamental concepts of data engineering: SQL and the DBMS. In 1970 at IBM, a computer scientist named Edgar F. Codd invented the relational model for database systems. Four years after that, SQL was invented, and at that time it was called SEQUEL, which is how many data engineers pronounce it even now, including me. All that is fine.

Now, going back to the trend chart for "data engineering", we saw that there were some occurrences from 2004 onward. That's because companies like Facebook and YouTube were founded around this time; they had started capturing a ton of data, and they obviously needed to process that data and eventually improve their products. So that is where I would say this entire race
of big data started.

Now, since we are talking about big data, let's look at the trend chart for the term "big data". We can see it also existed way back in 2004. Well, this was because in 2003 Google published a paper describing the architecture of its distributed file system for storing large data sets, called GFS. In 2004, Google also published a white paper on MapReduce. Something to note here is that these were only white papers; no implementation was publicly released. So around 2006, a couple of people started implementing them, and the result was named Apache Hadoop, which we all know now. In the time frame from 2006 to 2014, we also got Hive in 2010, and in 2014 we saw a certain Spark in the world of data engineering, which makes sense because the term "big data" started becoming super popular during this time, after 2012.

Now, when we compare the trend chart for "big data" with the one for "data engineer", you will notice that big data became popular much earlier. There was an interesting reason for this as well. In 2017, Gartner came up with a study saying that 85 percent of big data projects failed. In 2019, VentureBeat also came up with a study saying that 87 percent of data science projects never make it into production. And in 2019, Gartner came up with a similar study saying that, through 2022, most big data or data science projects will fail to deliver the actual business outcomes.

So this is very surprising. Why is that? People are making a lot of technological advancements, we have distributed systems, but then we see these kinds of articles saying that big data projects or data science projects are failing. Well, that's because people started hiring a lot of data scientists to do the work. One main reason for this is: let's say you have a very good model with very high accuracy, but your data itself is trash. If you put trash in on one side, you are going to get trash out on the other side as the answer.
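As a rough illustration of the "trash in, trash out" point, here is a minimal Python sketch of a data quality check. The sample readings, the field name `temp_c`, the valid range, and the helper `quality_report` are all hypothetical and do not come from the video; the sketch only shows how a few bad records can distort a simple statistic before any model is involved.

```python
from statistics import mean

def quality_report(rows, field, lo, hi):
    """Count missing, out-of-range, and duplicate records for one numeric field."""
    seen = set()
    report = {"missing": 0, "out_of_range": 0, "duplicates": 0, "clean": []}
    for row in rows:
        key = row["id"]
        if key in seen:
            # Same id seen before: treat as a duplicate and skip it.
            report["duplicates"] += 1
            continue
        seen.add(key)
        value = row.get(field)
        if value is None:
            report["missing"] += 1
        elif not lo <= value <= hi:
            report["out_of_range"] += 1
        else:
            report["clean"].append(value)
    return report

# Hypothetical sensor readings: one missing value, one faulty value, one duplicate.
readings = [
    {"id": 1, "temp_c": 21.5},
    {"id": 2, "temp_c": None},
    {"id": 3, "temp_c": 999.0},
    {"id": 3, "temp_c": 999.0},
    {"id": 4, "temp_c": 22.5},
]

report = quality_report(readings, "temp_c", lo=-50.0, hi=60.0)

# Mean over everything that is not None versus mean over validated values only.
raw_mean = mean(r["temp_c"] for r in readings if r["temp_c"] is not None)
clean_mean = mean(report["clean"])
```

With this sample, the unvalidated mean is dominated by the faulty 999.0 readings, while the validated mean stays close to the plausible values. Any model trained on the raw numbers would inherit that distortion.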
So it was very important to first reform or refine the data that we were getting. There were lots of data quality issues: the data was not correctly sampled, there were faulty values, there were consistency issues, and maybe the data was not in the correct aggregations or transformations were not being applied. This is where the value of data engineering suddenly rose, and you can see that after 2017 the term "data engineering" becomes super popular. Even now, people have started to realize that you need a solid data engineering foundation before you jump in and invest in multiple data scientists. Even the simplest questions about data require complex pipelines with sophisticated infrastructure setup, and some might argue that this is one of the most important parts of being a data engineer.

Now, I have also received a lot of questions about what exactly the difference is between a data engineer, a data scientist, and a data analyst. I tried to explain this in my previous video, when I was talking about system design and different user personas, but I'm going to dive a little deeper in this video.

When you go to this website, Quant Hub, they have a very good pyramid diagram. At the bottom of the pyramid we have "collect": collecting data from, let's say, different logs, different sensors, external data, or user-generated content, such as when you scroll through Instagram or like anything. After collecting data, we have to move it from, let's say, the end devices or edge devices to a storage solution, which ideally should have unlimited scalability. There is a need for reliable data flow, the infrastructure, the pipelines, the entire ETL, and structured and unstructured data storage. Then comes "explore and transform": cleaning up the data that you have stored, detecting any anomalies, and maybe doing some joins between different tables, doing some group-bys, or aggregating data to a particular grain that will be useful later for data scientists or analysts.

Now, all of the levels up to this point, "collect", "move/store", and "explore/transform", are the responsibility of a data engineer, and "aggregate" also kind of falls under a data engineer's responsibility. But when we start talking about labeling your data, which means figuring out what features you have and creating segments like training data and test data, this kind of split usually falls within the purview of a data scientist. Once a data scientist has done that, they should be able to create a model and train it, which comes under "learn", and then ultimately optimize the model using different methods. After this, at the very top of the pyramid, come AI and deep learning.

Now, this is not fixed. It's not that, just because you are a data engineer, you are never going to be training a model or optimizing it. In fact, it changes a lot. If you are in a startup, even though you are hired as a data engineer, you will be doing a lot of things like creating the model, deploying it, and maintaining it, and the same goes for an ML engineer. But in a big organization, sometimes even the first three layers do not entirely fall within a data engineer's purview. There is a separate role, called an infrastructure engineer, that takes care of setting up the infrastructure and the initial pipeline. So it really depends on what organization you are working in, but overall this is how you differentiate between data engineers and data scientists.

What about data analysts? For that, let's take a look at this diagram. This is an example data pipeline architecture that I had created for the system design video, but it fits perfectly here as well. You have different data sources, you do some processing, and then ultimately, on the right-hand side, you can see I have BI users. They can be considered data analysts, who have
access to either fire some SQL queries on something like BigQuery or get final reports. There's also a data science workspace here, and that's where data scientists interact most of the time. The data engineer is the one who has created this pipeline, especially on the left-hand side. There is also an operational user. So what is an ops user? Well, this is also a type of data engineer.

There are different types of data engineers. One is, let's say, a data architect, who creates an infrastructure diagram or designs these entire templates. The second is, for example, someone focused on data warehousing architecture; you would have seen job postings like "Snowflake data engineer" or "BigQuery data engineer", and they would fall into this category. There are also Spark-based data engineers, who are good at Spark workloads. There are real-time data engineers as well, and cloud data engineers who work on the cloud, which is where I fall most of the time. The point is that there are many different types of data engineering roles available, including the operational role that we saw in the previous diagram. These roles are not mutually exclusive; you can play more than one role at a time. Your goal should not be to become a jack of all trades who can do everything, but someone who knows about all the things and is an expert in one of them.

Now, to become a good data engineer, or at least a basic data engineer, there are certain skills that you should learn first, so let's take a look at them. These skills are SQL, Python, and Spark. AWS is mentioned here, but you can take up any cloud platform that you want to learn. Java is kind of optional if you have already learned Python. Hadoop is also optional if you are very good at Spark. Hive is kind of outdated nowadays. Scala is, again, I would say optional if you're good with Python. And then Kafka and NoSQL. Now, I'm not going to go into the depth of each and every skill and how you can develop it, because for that
I've created a data engineering roadmap video. I'm going to link it in the description below, so if you want to check it out, you can do so after watching this. Oh, and there's one more thing: don't forget to like this video and subscribe to the channel if you have not already.

What do you think is the difference between a data engineer, a data scientist, and a data analyst? Were you right after watching this video, or were you wrong? Let's discuss it in the comment section below. If you have any questions, I'm going to give instant answers for the next two hours, so just drop them below in the comments. Also, if you have any suggestions for upcoming videos, I would love to hear from you. Thank you so much for watching, and see you guys next time. [Music]
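The collect, store, transform, and aggregate layers described in the captions above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the event format, the sample lines, and the helper names `parse` and `aggregate` are hypothetical, and an in-memory list stands in for a real storage system.

```python
from collections import defaultdict

# Collect: raw events as they might arrive from logs (hypothetical sample data).
raw_events = [
    "2022-11-01,alice,like",
    "2022-11-01,bob,like",
    "2022-11-01,alice,scroll",
    "2022-11-02,alice,like",
    "bad line",
]

def parse(line):
    """Transform: turn one raw line into a record, or None if it is malformed."""
    parts = line.split(",")
    if len(parts) != 3:
        return None
    day, user, action = parts
    return {"day": day, "user": user, "action": action}

def aggregate(records):
    """Aggregate: count actions per (day, action), similar to a SQL GROUP BY."""
    counts = defaultdict(int)
    for rec in records:
        counts[(rec["day"], rec["action"])] += 1
    return dict(counts)

# Store: an in-memory list stands in for a real storage layer.
store = [rec for rec in map(parse, raw_events) if rec is not None]

# The aggregated grain (per day, per action) is what analysts or
# data scientists would typically consume downstream.
daily_counts = aggregate(store)
```

The malformed line is dropped during the transform step, so it never reaches the aggregated counts. In a real pipeline each stage would usually be a separate system rather than a function in one script.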
Info
Channel: Jash Radia
Views: 23,406
Keywords: what is data engineering, what is data engineer, what does a data engineer do, what is the need of data engineer, data engineer vs data scientist, data engineering vs data science, learn data engineering, learn data engineer, learn data engineering from scratch, google data engineer, what really is data engineering, job of a data engineer
Id: TDrIBFgb6hE
Length: 13min 5sec (785 seconds)
Published: Fri Nov 11 2022