Things I Wish I Knew When I Started As A Data Engineer

Reddit Comments

Pig ... that was a thing lol

👍︎︎ 2 👤︎︎ u/[deleted] 📅︎︎ Nov 08 2021 🗫︎ replies
Captions
good day and welcome back to another video with me ben rogojan aka the seattle data guy today we're going to talk about things that i wish i would have known before i got into data as 2022 is barreling towards us it's hard not to feel like we're still back in december 2020 but alas we are a mere two months away from january 1st 2022 and i hope you're taking some time to reflect back on the last two years since i am in this reflective state i kind of wanted to look at my career in data and see the lessons that i've picked up over the last near decade and share them with you so that way if you're thinking about jumping into data you can hopefully go through these lessons much faster as you realistically come up against the exact same things that i'm coming up against or came up against in the last few years so i'm hoping that this video will help people out or for those of you who are familiar with these problems we'll maybe hit that little bit of nostalgia button maybe some ptsd just a little bit so that you can all kind of understand the pains that i've gone through as well as anyone else who's been in the data field over the last few years and really sit down and ask yourself do you really want to join in on all of the fun that is in the data field so without any more interruptions let's go into the things i wish i would have known when i first started in the data field the first thing i learned when i jumped into the data field was to not try to learn every new technology all at once more importantly focus on learning your core skills and building that foundation before learning specific tools and again new technologies and skill sets it can be very tempting there are so many things coming out every day it feels like new technologies new startups new solutions that you want to try out but if you don't have solid sql python and just general data skills all these other fancy tools are kind of pointless so trying to learn everything all at 
once is going to get you nowhere building a solid foundation instead in your data skill sets is what's going to take you very far because regardless of what tools come up in the next few years if you don't have your core skills like data modeling or having some form of data analysis process all these tools are kind of pointless and oftentimes they change very rapidly i look back just in the last decade when things like redshift and hadoop became very popularized and a lot of companies were trying to implement them and people were learning things like pig and flume and all these other very specific tools that floated around things like hadoop that are often kind of floating away in today's datascape there's just such a need for a company to simplify their overall data infrastructure that creating complex hadoop clusters and systems is often not sustainable and this is why i think a lot of companies have switched to tools like bigquery and snowflake to try to manage a lot of these analytical storage systems or companies are switching over to tools like databricks that are trying to sell up the idea of a data lakehouse which is a combination or a hybrid of a data warehouse and a data lake and trying to figure out how we can answer all of the use cases whether it be analytics or ml from one data analytics storage layer versus two and the truth is in the next five years a totally new solution might come out that might blow all of these other solutions out of the water and everyone might switch over to that and suddenly whatever kind of more upper or higher level skills that you've learned on specific tools are again kind of pointless and go out with the trend so it's more important just to have your core data skills and coding and sql and then you can put it on top of these different tools so my tip number one is avoid getting caught up in the hype and trying to learn every new tool that exists of course one of those core skills you need to develop is understanding 
how to develop maintainable code which will be my tip number two which is learn how to develop maintainable code or just maintainable systems in general so you're not just putting things together by you know putting a few pieces of vba here and some python there and some bash here and creating some system that's only really maintainable by you because you're young and naive and think that someone is going to be willing to wake up at 5am every morning to make sure that the system is still functioning properly which is not a sustainable system you need to think more like the meme that i've heard so imagine the next engineer after you is some borderline psychopath that has your address and knows exactly where you live so that if they find a single mistake in your code they're gonna find you and make sure you fix the problem for them the point is when we're young we kind of just develop code that works and is not often maintainable maintainability is far from an easy concept to i think fully understand i think one of the ways that i kind of understand it is that code should be easy to understand when you look at it obviously comments are always necessary but putting a paragraph long comment in just a line of code might say that your code's a little too hard to understand and maybe too abstract but that's always a fight in technology you know how far is too abstract the truth is you're not going to work at that job forever and if you build a system that is unmaintainable it will likely either disappear after you leave or become a very frustrating stumbling block for whoever picks up the system later so i recommend that you build a system that is easy to maintain rather than just trying to pump out tons and tons of code and work and dashboards or whatever you're doing because you think that's how you are productive it's far more productive to build one dashboard that's very maintainable that produces clear insights as well as clear actions from leadership than it is to 
produce 10 dashboards that have no one utilizing them and eventually go into the dashboard graveyard yes that is a real place that exists at most companies because there are thousands and thousands of dashboards at most companies many of which are basically just copy paste versions of the same thing and one of the ways to avoid this situation is by creating maintainable systems if you've been in the data field long enough then you've probably heard every marketing term that exists and one of those terms that i've heard more than enough in the last few years is creating some form of source of truth now here's the thing about source of truth it's less of a final destination and more of a constant goal because every new application that you bring on requires some form of governance or integration regardless of what kind of data is coming from that system there's always this balance between trying to have all of your source data systems integrated well and making sure you pull in data that's actually the source of truth and for those of you who don't understand kind of what source of truth means and why it's so challenging source of truth is one of kind of the pillars or reasons people develop a data warehouse which is to create a single place that people can go to for specific fields of information for example if you have something like workday and salesforce oftentimes these systems will be heavily integrated meaning you might be pulling information like job titles for employees from workday and putting them into salesforce which can lead to issues if people try to report off salesforce and workday because if the sync isn't perfectly timed someone's going to pull the report and a person that was once a junior position will be a senior position in one report and vice versa meaning you have a syncing issue and this is where sources of truth are kind of meant to come into play by only pulling this data from one location and making sure that that's what's used for 
reporting everyone is at least reporting on the same information so at the very least everyone is wrong at the same time which is much better than people having different information because no one likes being in a meeting where a manager or director happens to point out that one little field that happens to be off because of a syncing issue between two reports and this continues to happen today because it's just so hard to get a single source of truth and here's what makes it worse i can tell you from one of my first experiences where i was informed that a data warehouse that i was using was a source of truth and being young naive and my first job i believed it wholeheartedly so i started creating tons of reports off of this data warehouse only to find out that the previous senior analyst at that company had kind of had a side deal with bi where he created his own kind of copy of the actual data warehouse and so we just had a copy of that data warehouse that was honestly often not timed correctly so whenever i created a report we were always somewhere in the range of like one to six hours behind meaning that my data was always off which wasn't great because we were part of a finance team and anytime our numbers looked off compared to the company roll up we just looked bad but again coming from school where you only have like one excel file or maybe like a single sql server database where you're doing all of your work from and there aren't a bunch of different teams pulling the exact same information and all this political and business complexity it's very easy to understand why when you go to your first job if you've never interned anywhere you kind of trust whatever data someone puts in front of you because that's the way it's always been but that's not the way the business world works there are rarely pure data sources of truth but instead just kind of a constant goal to get a source of truth all that needs to happen at a company is that a crm or some other 
piece in the workflow changes a little bit and the concept of source of truth honestly is almost obliterated that's why source of truth is such a hard thing to get and it's important to remember that source of truth is merely a moving target and rarely a final destination now before going on it's important to take a moment to like and subscribe to this channel if you're enjoying this video also it's important to remember that as you're writing sql and a lesson i wish i would have known earlier in my career you need to take a moment to save your sql as well as version control it sql for those of you who are unfamiliar has a lot of complex business logic that can be very hard to remember if you make even the smallest change and i can recall multiple times early in my career where i would save a query and not be able to find it again or maybe not save it at all just write it and think i would never use that query again only to need it multiple times over and over again and it is highly frustrating to try to remember the exact queries you wrote in the past because business logic is complex and it's not easy to remember just off the top of your head unless you're some sort of 200 iq point genius you're likely going to be scratching your head and honestly just angry at yourself for not just taking the time to save your sql with a better named file what's great about today is luckily for you a lot of modern tools have sql history that's very easy to access whether it be bigquery or snowflake but you should still save your queries in some form of version controlled way whether that's using something like dbt or github i think it's very important to try putting some sort of habit in place where you start using version control to save your queries because queries can have a lot of complex logic that is very difficult to remember and you do not want to have to try to remember how to write the exact same query for some report that was pulled for a vp so for the love 
of everything that's beautiful in this world just version control your sql another mistake a lot of us make when we first jump into our data careers is saying yes to every request we get from anyone whether it be writing a query a data pipeline doing some sort of analysis here's the thing you do not have infinite time and i know you're eager to make everyone happy and prove that you're worth paying the 50,000 a year that they're paying you or let's hope that inflation has brought that up to closer to 80,000 but the truth of the matter is you don't have time to do every task people ask and tasks take a lot longer than you think even doing a quick analysis requires finding the right data doing a quick quality check doing some analysis doing some quick you know just exploratory work writing down your questions doing the analysis writing it up and then returning it over to whoever asked so make sure that whatever work you're doing is high value obviously when you first start out it's probably not a bad idea to say yes to a lot just to get it out of the way and just to get as much practice as possible but i think it's important to learn when to say no and again this is something i'm dealing with today whether it be in consulting or my full-time work it's just so easy to say yes to everyone because you want to be liked you want to prove that you're worth whatever they're paying you but the truth of the matter is a skill that's even more important is learning what to spend time on you can't complete every task and we are only really paid for 40 hours of work a week and i know some people want to work more than that and i'm glad that that's what you want to do and when it's your first job we just feel like we have so much we want to prove but i think it's important to try to think about what impact that work has down the line this is something that i discuss when people ask me what skills are required to succeed at a faang and one of the skills that's hard to master is 
learning what work is important and only doing that and not trying to get too distracted by work that maybe people are just asking you because maybe they're just curious maybe they don't have any actual stakeholders behind their own question and because you constantly say yes they're just going to keep asking you various ad-hoc questions especially if you know sql and know how to actually work with data people are going to kind of take advantage of that and not in a bad way just in terms of the fact that hey if you're able to answer the questions why aren't they going to ask and in this world where we're trying to create a self-service bi system which similar to the concept of source of truth is a marketing term that i've heard way too much for the last decade and rarely do people actually get it right but the point being that in a world where in theory most people can get access to data either through sql or other abstraction layers even someone that doesn't know sql should hopefully be able to pull data and answer these questions on their own so you need to kind of take account of your skills and figure out what you can do to be most impactful rather than doing every quick ad hoc task along your whole journey so learn to say no once in a while it's okay there's a lot of other people that can possibly do the work including the people who are asking the question so don't get too caught up in trying to please everyone by saying yes all the time focus on being impactful and not just doing every quick task because you feel like you have to also i think it's important that we share all the lessons that we've learned over the last few years so if you have lessons that you've learned being some form of data professional i would love for you to share them below because everyone who's in the comments can then learn from them so please take a moment to comment below what lessons you've learned being a data professional whether that's being a data analyst a data engineer an 
ml engineer whatever it might be i would love to hear from you guys do take a moment to like subscribe and share this video if you feel like you learned a ton it means a lot to me and i love seeing this community grow we're about to hit hopefully 20,000 maybe in the next two or three months and i'm really excited for that thank you everyone so much for watching this video and i will see you guys next time goodbye
Info
Channel: Seattle Data Guy
Views: 30,819
Keywords: aws data engineering projects, software engineer day in life, data engineer interview questions, data engineer, data engineer day in the life, how to become a data engineer, how to become a data analyst, career growth in infosys, data engineering tutorials, data science projects, learning data engineering, data analyst interview, should i become a data engineer, data engineering, sql, sql interview questions
Id: FvCInKiLJVg
Length: 14min 41sec (881 seconds)
Published: Fri Nov 05 2021