What Tools Should Data Engineers Know In 2024 - 100 Days Of Data Engineering

Video Statistics and Information

Captions
There are what feel like an infinite number of tools you can pick from as a data engineer, and if you've worked in the industry for a while you've probably worked with some, heard of others, and are always wondering what you actually need to know to be successful. The funny thing, and I think the challenge, is that whatever we're working on today will probably change a little bit tomorrow. When I first broke into the data engineering world, Hadoop and Spark were all the rage, and you'd have to figure out how to host them yourself, spin up ZooKeeper, and wire together what felt like 30 different solutions just to get things working. Now we're just running things on Databricks or Athena rather than manually managing all of that ourselves. The tools we use have changed drastically over the years, so I wanted to create a video, in conjunction with my 100 days of data engineering video, that helps you understand what tools you need to know as a data engineer. We're taking a quick pause from the AWS cloud videos, but I'll be back on those shortly; if you haven't watched them, give them a look later if you'd like to learn more about how data engineers can work with the cloud. For now, let's talk about tools from a high level.

Let's first cover the basics, and this is one of the challenges: where do tools start and where do they end? I think it's fair to say that programming languages and basic solutions fit into the tool world; they are tools we've built, as humans the tool builders, to help us automate and build processes. With that, the tools you'll definitely need as a data engineer, even with things like ChatGPT around, are SQL, Python, and Linux. I say Linux broadly: most likely you're going to have to write your fair share of bash scripts, or at least interact with servers. You might not need to be an expert, but you need to be able to work with those systems. So Python, SQL, Linux, and some level of understanding of how to work with networks will all likely come into play. These baseline skills seem basic, but you can't get around them. Yes, there are lots of drag-and-drop tools, but I recall recently working with someone using SSIS who said, "Oh, I don't do the C# blocks in SSIS because I don't know how to do it," which to me is a bit of a cop-out. You should be able to do at least some baseline coding. It doesn't have to be fancy, but you need some understanding of object-oriented programming, how to write functions, just your baseline understanding of Python.

Along with those basics come your other basic technical solutions and tools, things like git. You probably think of it as GitHub, but git is a broader tool that you will need to know. You will need to version control all of the code you'll be putting into places, whether it lives in a Lambda or in a larger system you're developing in Airflow, etc. It needs to go somewhere, so at least understand the four or five git commands you'll use all the time, things like git add, git commit, and git push, and at the very least understand how they actually operate. If you want to go deeper, there are plenty of articles that explain how the tool works under the hood, but just being able to create branches and handle these small things is vital to being successful as a data engineer.
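As a rough, hedged illustration of that day-to-day workflow, here is a minimal Python sketch that shells out to git with subprocess; the branch name, file name, and commit message are made up, and it assumes you are already inside a git repository.

```python
import subprocess

def run(args):
    """Run a shell command and fail loudly if it errors."""
    subprocess.run(args, check=True)

# The handful of git commands most data engineers use day to day
# (hypothetical branch, file, and message, purely for illustration).
run(["git", "checkout", "-b", "feature/new-pipeline"])         # create and switch to a branch
run(["git", "add", "etl_job.py"])                              # stage a changed file
run(["git", "commit", "-m", "Add initial ETL job"])            # record the change locally
run(["git", "push", "-u", "origin", "feature/new-pipeline"])   # publish the branch to the remote
```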
Again, these skills really build up your baseline, and I think this is why it can be hard to break into data engineering: these are tools that can take time just in themselves to become decent at. In the 100 days I've set up you can probably get a good idea of these solutions, but becoming really good at them is hard, and honestly I'm still working on them and constantly finding new things I maybe don't fully know.

Another basic set of technical tools you'll likely need is things like SFTP and PGP. These sit in an interesting space; I haven't started talking about the "actual tools" yet, like Airflow or Snowflake, but these are baseline skills and tools you will likely have to work with. SFTP will come into play somewhere; it still exists today even with modern data sharing. At Facebook I had to use SFTP, or secure file transfer protocol, to push files out to external partners, who would then ingest that data, do analytics on it, and give us back some sort of reporting. Similarly, as we went through that process we'd often encrypt the file with a set of keys, so using something like PGP, or a similar protocol, will likely be required as well.
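As a hedged sketch of what that partner handoff can look like in Python, here's a minimal example that PGP-encrypts a file and then uploads it over SFTP. It assumes the third-party python-gnupg and paramiko packages are installed, that the partner's public key is already in your GPG keyring, and that every host name, credential, and path below is a placeholder.

```python
import gnupg      # from the python-gnupg package (assumed installed)
import paramiko   # SSH/SFTP client library (assumed installed)

# 1. Encrypt the export with the partner's public key (placeholder recipient).
gpg = gnupg.GPG()
with open("daily_export.csv", "rb") as f:
    result = gpg.encrypt_file(f, recipients=["partner@example.com"],
                              output="daily_export.csv.pgp")
assert result.ok, result.status

# 2. Push the encrypted file to the partner's SFTP server (placeholder host/credentials).
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect("sftp.partner.example.com", username="exporter", password="...")
sftp = ssh.open_sftp()
sftp.put("daily_export.csv.pgp", "/inbound/daily_export.csv.pgp")
sftp.close()
ssh.close()
```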
And now you've got a baseline set of tools. I've probably used "skills" and "tools" interchangeably here; in all fairness, some of this is tools and some of it is skills, but you can't avoid any of it. You're going to have to program, you're going to have to use SQL, and you're likely going to have to interact with a Linux box somewhere, whether it's an EC2 instance or a GCP compute instance. Now, you are a data engineer, so you can't stop there: these are tools that, depending on how you apply them, make you more of a software engineer, a data scientist, or a data engineer, which is why we break these names up. I've seen some people recently poke fun at these titles, or say you're not really an engineer, but that's beside the point to me. To me the names capture why these different jobs exist and what they do. Sure, we're data plumbers; that's fine. Plumbers still have specific sets of tools, and plumbers solve very hard problems. I've had them fix a few around my house.

So as a data engineer, there are specific tools we use heavily. First, you'll at least have to interact with databases; in particular, you'll likely pull a lot of data from source databases, and those sources tend to be traditional relational database management systems or something more on the NoSQL side, like document databases such as MongoDB, or Cassandra. You'll also need to know the traditional ones like Postgres and MySQL. You don't have to know every database that exists; there's IBM Db2, there are Oracle databases. More than likely, as long as you get two or three under your belt, each using a somewhat different dialect of SQL, you will be familiar enough to pull from various sources in the future. Yes, they might all behave slightly differently: one will handle change data capture one way, another a different way; one will have binlogs, one won't. But as long as you get three or four you're comfortable with, where you can build a basic schema, you understand how to insert data, and you understand how to update data, you'll likely be okay here. You don't have to be an expert, but you need to be familiar enough to build on them and to understand why someone might put an index somewhere. It will take time; these aren't things you have to rush through. Don't let my 100 days of data engineering make you feel like you have to run through any of this. If you don't learn it now and you land a data engineering job somewhere, you'll be learning it there, so make sure you don't run too fast, otherwise you're going to be stressed out in your actual job.
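To make the schema/insert/update/index piece concrete, here's a minimal sketch using Python's built-in sqlite3 module. The table and column names are invented, and the same statements carry over, with minor dialect differences, to Postgres or MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for illustration
cur = conn.cursor()

# Build a basic schema.
cur.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        status      TEXT    NOT NULL,
        order_date  TEXT    NOT NULL
    )
""")

# Insert data.
cur.execute(
    "INSERT INTO orders (order_id, customer_id, status, order_date) VALUES (?, ?, ?, ?)",
    (1, 42, "pending", "2024-04-01"),
)

# Update data.
cur.execute("UPDATE orders SET status = 'shipped' WHERE order_id = 1")

# Add an index on a column you filter on often, so lookups don't scan the whole table.
cur.execute("CREATE INDEX idx_orders_customer_id ON orders (customer_id)")

conn.commit()
print(cur.execute("SELECT * FROM orders").fetchall())
```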
So now that you've got your databases underway and you've built a good understanding of how they operate, that gives you the next layer of knowledge for what people now call cloud data platforms or cloud data warehouses. Honestly, there are so many different terms for these solutions, because when you look at them they aren't actually set up like traditional databases, which is why it's good to understand how traditional databases operate first. That way, when you look at Snowflake, you don't assume it operates exactly the same way as your traditional database. The same goes for Databricks, which is even further removed from traditional databases and probably harder to grasp: there's a compute engine, there's a query engine of sorts with Spark, and there is storage, but some of the traditional pieces are all piecemealed out; they don't exist in the same framework. That's why it's really important to build these steps slowly, so you understand the differences. I always remember my first data warehouse. I had taken a traditional relational database course in school while interning at the same time, and looking at that data warehouse I thought, oh, these are kind of the same thing: you've got something called a key here and an ID there, so they're the same, right? A few months later I learned that, no, these are different things, and I had to dig into that, start reading Kimball, and actually work through the differences. Again, the more you can understand and notice when things are different, the more valuable you become as a data engineer. Sorry for that diatribe.

The next set is really these data platforms: Snowflake, Databricks, and we'll throw BigQuery in there. BigQuery maybe doesn't fit the data platform label quite as well on its own, but if you add in everything else GCP has, Dataflow and the rest, it kind of does. You can also throw in Redshift and Azure Synapse Analytics, which actually fits that data platform space more closely. Those are the key data platforms you'll likely be building on, generally to build some sort of data warehouse, or data lakehouse if that's your cup of tea. It's obviously going to depend which of these solutions you pick; they all operate slightly differently. The way it often feels to me is that GCP is a little more limited in what I can fine-tune, Snowflake tends to be a happy medium, and Databricks gives you a lot of control, but then you have to understand how that control works. It's almost like the old Oracle days, where Oracle gave you a ton of control, and that's why you'd pay a lot for Oracle consultants who knew how to set up control files and fine-tune everything as you loaded data, whereas you could just use SQL Server, which I often found a little easier to work with.

As you're learning these solutions, you're again layering more and more skills on top of each other. Think about it: what are you likely writing when you work on Snowflake or BigQuery? SQL. That's how you're going to interact with these various solutions. Hopefully, on top of these tools, you also have the skills and best practices to build a data warehouse or data lakehouse, but that's where I draw the line between what's a tool and what's a skill or best practice. How you build data pipelines and data warehouses falls more into skills, best practices, and design, versus the actual tools that help you implement those designs. Along with that, if you're a Databricks fan, you will have to learn Spark: how it operates, how to best interact with it, including when you're writing SQL instead of Python or Scala, and what the best way to run joins is. It's also really important to understand why you might use an engine like Spark versus Presto or Trino. At Facebook we even had the ability to switch between Spark and Presto depending on the job, because sometimes it was more efficient to use Presto or Trino and sometimes it was more efficient to use Spark, and you'd have to know why. So it's good to know at least a little about all of these tools, if not have a deep understanding, because it will become valuable as you go along.
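For a rough feel of SQL versus the DataFrame API and simple join tuning in Spark, here's a minimal PySpark sketch. It assumes pyspark is installed, and the tiny in-memory tables and column names are made up purely for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join_example").getOrCreate()

# Two tiny made-up tables: an "orders" fact table and a small "customers" dimension.
orders = spark.createDataFrame(
    [(1, 42, 100.0), (2, 7, 55.5)], ["order_id", "customer_id", "amount"]
)
customers = spark.createDataFrame([(42, "Ada"), (7, "Grace")], ["customer_id", "name"])

# Option 1: plain SQL against temporary views.
orders.createOrReplaceTempView("orders")
customers.createOrReplaceTempView("customers")
spark.sql("""
    SELECT c.name, SUM(o.amount) AS total_spent
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.name
""").show()

# Option 2: the DataFrame API, hinting a broadcast join because customers is small,
# which avoids shuffling the larger orders table across the cluster.
orders.join(broadcast(customers), "customer_id").groupBy("name").sum("amount").show()

spark.stop()
```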
The important thing as you're going through these steps of learning is, again, don't feel rushed. You will learn all of this over time as long as you're putting in the effort. If you're putting in ten minutes a day, you probably won't learn it, but if you're putting in hours a day, like most of us have at some point if not still today, you will pick up these solutions, you will learn them, and you will feel confident actually delivering with them.

All right, so now you've built all of this baseline. The next set of tools you'll often see you need to know are things like orchestration, ETL, and data pipelines; you can throw ELT in there too. These all fit in a similar space, and I know some people will get mad at me for saying that, but I say it because Airflow, which obviously fits in the workflow orchestrator space, often gets implemented as an ETL-type solution or data pipeline that runs a very basic extraction of data and then loads it somewhere, maybe with Snowpipe or something similar added on that just picks up a trigger when a file drops into S3. You're really going to see that there are a lot of different ways to build pipelines, and what you often find is that there are a few types of tools. You can do things very custom and build it yourself; people love doing that for some reason, even though we've built a ton of these already. People also love open-source solutions; Airflow and Mage are examples, and again, they fit in that orchestrator world but often just get used as data pipelines or ETL-type flows as well. Then you have things that are fully managed and very drag-and-drop, like SSIS, Azure Data Factory, and a few others that all involve dragging, dropping, and automating tasks that way. Those are the main categories of tools you'll see. There are a few others that focus very much on just extract and load; most of those tend to be easy to work with, so I don't think you need to put a ton of effort into learning them off the bat. It's worth it at some point, and more than likely you'll pick up some of these solutions because you will use them and they tend to be the easiest to learn. There are others too; we could go on forever about orchestrators and data pipelines, including Informatica and a few others that often cost a lot of money to get access to. For now, just understanding the concept and getting a few of these tools under your belt, maybe one or two, is generally enough to make you hirable, which is your first goal: get hired into a junior position, and then eventually you can go from there.
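Here's a minimal sketch of the kind of basic extract-then-load DAG described above, assuming a recent Airflow 2.x install; the DAG name and task logic are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # Placeholder: pull data from a source system (API, database, SFTP drop, etc.).
    print("extracting data")

def load():
    # Placeholder: load the extracted data into your warehouse or lakehouse.
    print("loading data")

with DAG(
    dag_id="daily_extract_load",        # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                  # run once per day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task           # load runs only after extract succeeds
```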
And again, like I referenced earlier, the cloud is another set of tools you'll eventually learn. There are a ton of clouds, and you don't need to learn all of them. Generally I tell most people AWS is a safe bet because most companies use it. If you do want to learn Azure, understand that most of that work is going to be at large enterprises that use it, whereas AWS tends to cover a broader range, and I find most people who use GCP use it for BigQuery because they like BigQuery. So if you're going to pick, AWS tools are where I'd start, and I have an AWS video you can go through to look at the various tools you'll likely need to know; I think I go through eight or nine, because you don't need to know every solution. The cloud is always a baseline you need to know.

Then there's a ton of one-off tools you may or may not need. Honestly, I have mixed feelings, but I will say it's worth at least digging into Docker, because you'll probably have to occasionally start up a Docker container here or there, and the same with Kubernetes: at least understand how it operates and how to kill a pod occasionally. More than likely you'll have a DevOps team that manages it, and if not, you are the DevOps team and that's now your new job; it's very hard to build data pipelines and also manage a bunch of infrastructure you've developed. A similar thing can be said about Terraform: it's worth knowing, but in theory you should have a DevOps team that handles it. Obviously, nowadays companies feel like they're reducing the number of people on those teams, so maybe you will be a one-stop shop for all of this, but these tend to be the things you can learn last. You don't need to put a ton of effort in immediately; some of it will just come naturally through doing your work. That said, you don't want to be figuring out how Docker works while you're pushing code to production, so make sure you've at least run it a few times, and if it is your first time seeing it in production, try to find someone to help you out, because there are a lot of ways it can go wrong.

Those are most of the baseline tools you need to know. I'm sure there are others people feel I've missed; please comment below, and I'll pin it if I think it's a good tool I should have covered. But I think that's the baseline. It takes a little bit of time to become a data engineer, and this is part of it: you don't have to know all of these tools super in-depth, but you need to know what they do. In an interview, someone might ask where you would use one of these solutions versus another, how joins work on one solution versus another, or whether Redshift has MERGE, and if it doesn't, how you can run something similar to a merge statement. All of this is important to at least understand and have touched here and there. I don't want this to be discouraging; you have a long career ahead. It took me a few years to get to the point where I had the title of data engineer, and even now, whether the title is data engineer or data plumber is less the point; the point is how you do your work. Hopefully this was helpful, whether you're an analyst, an engineer, or a data scientist, for understanding the tools a data engineer will likely need to know. With that, thanks so much for watching, and I will see you in the next one. Thanks all, goodbye.
Info
Channel: Seattle Data Guy
Views: 31,585
Keywords: data engineering, how to become a data engineer, data engineering tools, 100 days of data engineering, seattle data guy, becoming a data engineer, data engineer skills, what tools should a data engineer know, snowflake vs databricks, best data engineering tools, top data engineering skills, ben rogojan, data analyst, data analyst tools, switching from data analyst to data engineer, airflow, python, spark
Id: nB7Lo9pGzVk
Length: 17min 30sec (1050 seconds)
Published: Tue Apr 02 2024