The Harsh Reality of Being a Data Engineer

Captions
"Welcome to our startup. I am the CEO, and I can see that you're here for the data engineering position. So tell me, what can you do for us?"
"I can basically set up your entire data infrastructure so you can do the processing. I'm planning to gather all the data from the APIs and store it on GCS first, and then expose it on BigQuery, so that even if you don't know how to code, you can just write a SQL query to look at the data underneath."
"Who told you that I can code?"
"Sure. And for the scalable nature of the pipeline, I'll mostly end up using Spark, but I don't see how that is an issue right now, since you don't really have any users."
"Not right now, but we will, because we will be using machine learning to make sure that our product reaches everyone who needs it and take it to the next level."
"Wait, but you do realize that I'm not a machine learning engineer, right? I'm a data engineer."
"What? I thought it was the same thing."
"What?"
"What?"

Hello everyone, I'm Josh, and I'm a data engineer at Google. Data engineering roles have been growing drastically over the past four or five years, the main reason being that people are starting to realize the importance of data engineering before data science or data analytics, and that's why companies have started hiring a lot of data engineers. In fact, the year-over-year growth of the data engineering job profile is about 50 percent, and the pay that you get as a data engineer is pretty good.

All these things are obviously pretty awesome, and that's why we have seen a lot of people trying to move to data engineering. But there are some caveats that you need to be aware of. These are not necessarily the cons of data engineering; they are just points you should know if you are thinking about moving to data engineering. You may face some of them, and you should only move if you are okay with them. And if you are already a data engineer, you might find some of these points pretty relevant. But before we get started with the rest
of this video, I want you to leave a like and subscribe to the channel if you have not already. Here are three things that data engineers face that definitely make data engineering not a walk in the park.

All right, so number one: data engineering is always something that is supporting the core business. It's never the core business, especially in product-based companies. I know that this might trigger a lot of aspiring or existing data engineers, but let me come out and just say it: data engineers are usually the second-class citizens of any company, especially in a product-heavy culture. What happens is that product managers, data scientists, and even business analysts are closer to the product, or to the end goal they are trying to achieve from a business point of view. Data engineers are the ones enabling everything from the back end. They are not delivering a direct impact to the business, but making sure that the people who do deliver direct impact, like data scientists, product owners, or business analysts, are well equipped. In most cases in product companies, data engineering is not a revenue-generating function.

Now, when I say data engineering is not a directly revenue-generating job profile, I say it from a product company's point of view. For example, before I joined Google I used to work at GS, and there are so many service-based companies like Deloitte, KPMG, ZS, etc. The role of these companies is to provide data services or data solutions to somebody else. In that case, if you are working on data projects, you are a revenue-generating function, because you are billed by the hour and the company is directly generating revenue based on how many hours you have put in as a data engineer.

That obviously doesn't mean that data engineers are not important. It just means that you are usually the one far away from the business outcome of any product. But if you are okay with, you know, staying a little bit
away from the core business work but focusing completely on technologies, because as data engineers we have to learn a lot of technologies, we have to be tech savvy, and we are the ones who usually design the architecture from a technology point of view; so if you like all those things and you are okay with staying a little further away from the business outcome of a product, then data engineering is obviously a great fit. Ultimately, data engineers do earn a lot of money, almost as much as software engineers, if not more. And because data engineers are not that close to the business, work-life balance is usually pretty good, unless you are, as I said, at a service-based company. Then you become the integral revenue-generating function, and in that case your work-life balance can be a little bad.

Usually the biggest challenge when you are transferring from any other domain to data engineering is that you have to find an end-to-end project to showcase your data proficiency. If you are new to data engineering, you might feel lost at it. That's where today's video sponsor, ProjectPro, comes into the picture, because I genuinely believe it's a great place to create end-to-end projects and get selected for your job interviews, or at least upskill yourself. ProjectPro is a curated library of verified, solved end-to-end project solutions in data science, machine learning, and big data. All the projects are created by top industry experts from top global tech companies. Here you will get end-to-end project solutions, reusable code, guided videos, and 24/7 customer support. Get access to 3,000+ code recipes, which are absolutely free, and buy the subscription plan to get access to 250+ solved projects. Don't forget to check out the ProjectPro website mentioned in the description and in the first comment below, and if you use the link that I've mentioned, you will definitely get a lot of additional discounts.

Now, back to the video. The second thing that sucks when you are a data engineer
is dependencies. There are usually a lot of dependencies when you are a data engineer, because you are basically a bridging function between producers and consumers. Producers, in some cases, can be software engineers who are creating their APIs and storing user data somewhere in a database or any other platform. Another producer can be a data team itself. The third type of producer is usually third-party data: for example, I've worked with pharmaceutical data, and I worked with third-party data sources like IQVIA and MMIT. So there are mainly these three types of producers.

On the right-hand side you have different consumers: data scientists, business analysts, product owners, even CEOs of companies are the ones consuming the data that you have processed. So there is a core dependency between the producer and you, and from you to the consumer.

Every time something goes wrong, let's say a data scientist comes to me and says, "I don't find this data good, there is definitely some issue in this data," it's my headache as a data engineer to find out what went wrong. Debugging my pipeline obviously takes a lot of time and effort, and even after all that I have to figure out what type of data caused the issue, pinpoint it to one particular data source, and then talk with the stakeholder who produced that data to make sure that the data is fixed. When they fix it, I have to make sure that it passes all the quality expectations and then inform the data scientist again. This is obviously a long process, and because of dependencies your work sometimes also increases.

Another issue that can come from these types of dependencies is that we have to standardize a lot of data. For example, one data source can call the customer ID column cust_id, another data source might call it CID, and a third data source might just call it ID, so
we have to standardize a lot. Now, obviously you would say this is one of the core points of a data engineer's job description, so why do you have a problem with that? The problem arises when one of the data sources completely misses some data points. For example, I might need a column X to calculate a KPI Y. Three out of four data sources have that column X, but the fourth one completely misses out on it. That becomes a huge issue, where you have to talk to the data source team, and you have to talk with the final consumers and make them understand: "I know you're looking at the central dashboard or the central machine learning model, but you might not find this type of data, because the column itself doesn't exist."

So there are many issues that come with dependencies, and you have to be okay with being in that kind of environment, and you have to be the one solving all these questions. Being a data engineer is not the kind of job where you go to a Starbucks cafe, open up your laptop, and just code all day. Data engineering is about talking to a lot of different stakeholders when something goes wrong, so you should be comfortable with your communication skills and with your skills for identifying these kinds of problems as well.

The third type of issue that data engineers generally face is technologies. There are so many technologies that data engineers deal with. If you look at job descriptions for data engineering profiles from five years ago, they would be completely different: they would be about MapReduce, Hive, or Hadoop. Nowadays it's generally about Spark, and it's also about Snowflake or BigQuery. And if you skip forward five more years, I'm pretty sure Spark is also going to become obsolete at some point, and people are going to move towards Snowflake or BigQuery types of systems. So as a data engineer you have to deal with a lot of evolving technologies. In fact, half of your job will consist of updating an older pipeline because that technology
went obsolete. For example, five years ago most data engineering projects were about moving data from Teradata or Oracle to a file-based system on the cloud, like Spark or Hive, and processing it there. In fact, some projects even right now follow this scenario. But now a lot of projects are moving away from file-based systems like Spark or Hive and going directly to Snowflake or BigQuery. So a lot of your job will consist of updating the existing pipeline just because the technology went obsolete, and you have to keep yourself updated with new technologies all the time.

I mentioned Hive just now. Iceberg is also a new type of file storage management system that allows you to do a lot of things that Hive-based systems were not able to do. Another such example is the Parquet file format versus the Delta Lake file format. Parquet is a columnar file format that allows you to store distributed data, and it has super fast performance when it comes to processing your data in Spark, but it had a lot of limitations. With Delta Lake, you can now do upserts. An upsert is basically a type of operation that allows you to update existing data even though your underlying storage is a file-based system.

How does this upsert affect your pipeline? Let's assume a simple SCD scenario; let's talk about SCD2. SCD generally means slowly changing dimensions. Let's say you have data like this, where I have my name and I have my address, that's it. One name is Josh and the address is ABC; another name is Not Josh and the address is XYZ. Now let's say Josh's address is updated from ABC to PQR. If I update it in place, I lose track of my older address. So what I essentially need is to keep my older address as it is, and then write another row against Josh with PQR, which is the new address. I also need a couple more columns, like start time and end time, to track it more efficiently. So, for example, the start time here
can be December 2022 and the end time here can be January 2023. For Not Josh, we don't really have an end time, so it is set to null. And for my new address, I'm moving in starting January 2023 and I don't have an end time. Now if you filter on end time equal to null, you'll get only these two records, which means these two records hold the latest, current address of each person. If I want to look back at the history of address changes, I can do it with the start time and end time columns. This is a slowly changing dimension: your address here is a dimension that keeps on changing.

Doing something like this in Parquet was extremely challenging and complex. That format does not allow you to update existing records, only to write them to a new location. With Delta Lake, that limitation is now gone. So imagine you get a scenario where you have to implement an SCD type of logic in your project and you end up using Parquet instead of Delta Lake. Somebody in your organization who does know about Delta Lake might come up to you and ask a lot of questions: "Why did you choose Parquet? Delta Lake was the best solution in this case." In that case you might just get caught like a deer in the headlights, and it would look pretty bad. So my point is that your knowledge will get outdated very quickly, and you have to learn the new technologies that are coming every year.

People are also talking about no-code solutions. I have my own point of view on that. I'm going to link a video in the description where I answer some of your questions, and in that video I have a section where I specifically talk about no-code problems. But the point is that even if no-code tools are successful and mature so well in the future that you really don't need to code, even in that case there will be 10 to 20 different tools to choose from, so as data engineers, our job will be to figure out what is the
right one for a particular use case. So even in that case you have to keep on evolving. If you are somebody who gets tired of, you know, learning new technologies, then data engineering might not be for you.

That's what I wanted to cover in today's video. Are you a data engineer? Have you run across some of these problems? Let's talk about them in the comment section below. If you have any questions, feel free to drop them there too; I respond to all of the comments. And as I said, don't forget to leave a like and subscribe to the channel. See you guys next time.
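
The SCD2 walkthrough in the captions (Josh's address changing from ABC to PQR, with start time and end time columns) can be sketched in a few lines of plain Python. This is only an in-memory illustration of the logic under stated assumptions, not Delta Lake's or Spark's API: the names `AddressRow`, `apply_scd2`, and `current_rows` are hypothetical, and dates are assumed to be simple "YYYY-MM" strings.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class AddressRow:
    name: str
    address: str
    start_time: str            # assumed "YYYY-MM" string
    end_time: Optional[str]    # None plays the role of null: the current row


def apply_scd2(rows: List[AddressRow], name: str, new_address: str,
               change_time: str) -> List[AddressRow]:
    """Close the current row for `name` and append a new current row (SCD type 2)."""
    result = []
    already_current = False
    for row in rows:
        is_current = row.name == name and row.end_time is None
        if is_current and row.address == new_address:
            already_current = True          # nothing changed, keep the row
            result.append(row)
        elif is_current:
            # Keep the old address, but mark when it stopped being valid.
            result.append(replace(row, end_time=change_time))
        else:
            result.append(row)
    if not already_current:
        result.append(AddressRow(name, new_address, change_time, None))
    return result


def current_rows(rows: List[AddressRow]) -> List[AddressRow]:
    """Filter on end_time being null to get each person's latest address."""
    return [row for row in rows if row.end_time is None]


history = [
    AddressRow("Josh", "ABC", "2022-12", None),
    AddressRow("Not Josh", "XYZ", "2022-12", None),
]
history = apply_scd2(history, "Josh", "PQR", "2023-01")

for row in history:
    print(row)
```

After the update, the history holds three rows: the closed ABC row for Josh, the untouched row for Not Josh, and the new PQR row for Josh. Calling `current_rows(history)` returns only the two rows whose end time is null, which matches the filter described in the captions.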
Info
Channel: Jash Radia
Views: 221,483
Keywords: cons of data engineering, harsh reality of being a data engineer, reality of data engineering, data engineer job experience, data engineering job experience, why data engineering, is data engineering good, data engineering pros and cons, data engineer pros and cons, why not to do data engineering, why not to become a data engineer, google data engineer
Id: PZMHwRb5AAw
Length: 14min 20sec (860 seconds)
Published: Sat Jan 21 2023