Data Engineer Roadmap 2023 - How To Start Your Data Engineering Career

Video Statistics and Information

Captions
Hi everybody, my name is Johannes Frey, but you can simply call me Joe. I've been working as a software engineer for over 15 years now and switched to data engineering about four years ago, and I'm here to tell you all those little things that I picked up along the way. In today's video I'll tell you what I would do if I were starting all over again and trying to land a job in data engineering, and I will try to provide you with a sort of roadmap so that you can try it too.

Hey man, I heard you talking about how to become a data engineer. What does a data engineer actually do?

Well, data engineering is a mixture of software engineering, DevOps, and some understanding of machine learning. The main job in this profession is to design and build data processing pipelines and to make everything production ready. So what do I mean by that? If we think about a super simple data processing pipeline, you have to ingest data from somewhere, then you need to store it in your storage. After that there is usually some sort of transformation step where you clean, transform, and validate the data, and then you store it in some other location, ready to be used by, for example, a machine learning model or a dashboard. You also need to automate the whole process, make it production ready, and introduce things like operational excellence: you need logging and monitoring so that if something goes wrong you get notified and can look into it.

[Music]

Hey man, I can't really get this data engineering thing out of my head. Can you please tell me what skills are needed to work as a data engineer?

Well, since data engineering is quite a versatile profession, you need many different skills. For example, you need to know how to properly develop software, so you need to know how to write good-quality code and not repeat yourself. And also, since you
need to deploy everything to, for example, a cloud provider, you also need a deep understanding of that specific cloud provider, or of how to deploy services in general, what architectures can be used, and how to build data pipelines. So you need to know some infrastructure, so that you know how to deploy your code to the cloud or to another provider to actually run it. What also helps is knowing things like Docker and Unix or Linux, since most services run on Unix or Linux machines, so you need to be able to log in there, look around, know where everything is located, and know how to deploy software to those operating systems.

Hey, I was thinking about what you told me about the skills needed to be a data engineer, and I just can't understand what steps I need to take to approach it. Should I do online classes? What is the best way to go about it? How can I learn those skills?

Well, some people might give you a list of online classes or courses that you need to do to become a data engineer. In my opinion, you need to actually do something to learn something, and you need to face real problems and try to overcome them to actually get experience. Most of those online courses or classes are kind of linear, and every problem is already solved for you. In my opinion that's not the right approach. I have some experience here: I was quite often sitting on the other side of the table, interviewing people who wanted to join the team or the company, and I often saw that many people had done these online classes but didn't actually have experience, so if you asked them questions about common problems, they were not able to answer them. So my approach is different, and it might have a steeper learning curve, but I think that it's worth it in the end. So bear with me for a second. And by the way, if you liked the video so far, it would be super awesome if you could consider going completely insane on
that like button, and you could also consider leaving a comment down below to tell me your opinion on the topic.

My proposal would be to actually come up with a demo project, for example like the data pipeline that I mentioned earlier, where you have some ingest, then you need to store the data, transform the data, and then store the transformed data again. This would be an optimal, super simple use case to see what you need to learn. So let's have a look at this process.

The most obvious thing is that we need to write some processing code so that we can transform the data. To be able to do that, you probably need to learn some sort of programming language. You could pick basically any programming language, but one that is quite common and popular in the data science and data engineering community is Python. Python has a big community, lots of good documentation, and free tutorials, so you have lots of resources to learn from. I will link some of them down in the description.

With your programming language covered, you now have something that you can run on a piece of data; for example, just pick the Titanic data set. Now you actually need to deploy it to run somewhere, right? So the next logical thing would be to get familiar with some sort of runtime or cloud provider. For this idea we could use, for example, AWS, which offers a free tier where you can play around and get familiar with it. So maybe try that, and learn only the things that you need to know, because AWS is huge and offers many different services. Stick to the problem that you're trying to solve: you want to store data somewhere, you want to transform the data, and you need to store it somewhere else, right? And let's just use AWS because I'm most familiar with it. You need storage, which is S3 in AWS terms, and you need something to run your code on, so that could be
maybe a Lambda function, or maybe some sort of Docker container that you could run, for example, in an ECS cluster. So those would be the next steps to look into: how can I create a Lambda function and execute my code in it, or how can I create a Docker image with my code and then run it on something like ECS? AWS has super awesome documentation, and everything that you need to learn is there.

I would go step by step. In the first step I would use the web interface, even though in one of my last videos I said never use the web interface. For getting familiar with AWS it's a good idea, because if you also introduce other concepts like infrastructure as code right away, that would just be too much. So first create everything using the web interface and then try to make it run. Try to build something where, when you copy a file into one S3 bucket, your Lambda function, for example, is automatically triggered and stores the transformed data in another S3 bucket. That should be a pretty good baseline, and once you have it you can build on top of it, right? Because now you are kind of familiar with the services of AWS, you know how AWS works, and you're somewhat confident in writing Python code, for example.

So the next step would be to create the resources that you just created with the web interface using infrastructure as code, for example with a tool like Terraform, which I also made a video about some time ago. Terraform is a nice way to create infrastructure in a cloud provider. So now you need to learn: how do I create an S3 bucket in Terraform, how do I create and deploy a Lambda function using Terraform, and how do I create the triggers that actually start the different steps in the pipeline?

The next step would be to use concepts like continuous integration and continuous delivery to also provision your code automatically, and whenever you push your code to a repository, for example GitHub,
then your Lambda will be automatically updated with the new and changed code, and you won't have any manual steps anymore in this whole pipeline.

When you go through all those steps, you actually learn something. I mean, you have to eat for some time, right? It won't be easy. You will really need to dig into it, and you will really need to read documentation. But guess what: for half of my day in my usual project work I'm probably reading documentation and trying to solve some sort of problem or get more familiar with something, right? You really need to be beaten down and be able to get up again and do the real work, because only that way will you really learn something.

As I said, start very simple. There are lots of demo data sets around; as I said, maybe just take the Titanic data set and then build gradually on top of that. And don't be scared, because once you have actually started, all the things will fall into place and make sense, and it will get a lot easier. But also try to tackle only one thing at a time. Don't start from the beginning, without knowing anything, by saying, "Oh yeah, I will do the programming, then I will instantly create everything using infrastructure as code, and I will also do CI/CD and everything." That will be a disaster, and you will lose your motivation to actually learn something.

Try to start with something simple. Create a transformation script where you, for example, pick the Titanic data set and just drop a few columns, just to have some sort of processing and to learn some of the Python programming language. Then get a free AWS account and just play around with the web interface. Try to get familiar with it, especially with the components that you will need for your demo project, for example S3 buckets and a Lambda function. Then, once you have that running, you can start to improve and improve and improve, and get the more complex concepts in place.

Hey man, so the things that you just said, I
mean, it seems to be super hard. Is it even worth it?

Yeah, I know that this approach might be harder than just going through some class, but it will be worth it in the end because it really teaches you something. You will need to learn to overcome problems, and you will need to learn to read the documentation. This is essential, so I would really, really encourage you to try that approach. Good things in life are often hard to get, and that's a sort of natural selection, so you really need to push through it, and it will be worth it in the end. You need to believe in the process.

I hope you got some value out of my experiences, and if so, please consider going crazy on that like button, the subscribe button, and all the other buttons you can find down there. And don't forget to leave a comment and tell me your thoughts about the topic. See you in the next video. Bye bye.
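
The demo project described in the video (a file lands in one S3 bucket, a Lambda function drops a few columns, and the result is stored in another bucket) could look roughly like the sketch below. This is only an illustration and not code from the video: the column names `Cabin` and `Ticket`, the bucket name `my-transformed-data-bucket`, and the function names are assumptions made for the example. The handler assumes the function is wired to an S3 "object created" trigger and that `boto3` is available, as it is in the AWS Lambda Python runtime.

```python
import csv
import io
from urllib.parse import unquote_plus

# Hypothetical choices for illustration only: the video does not say which
# columns to drop or what the output bucket is called.
COLUMNS_TO_DROP = {"Cabin", "Ticket"}
OUTPUT_BUCKET = "my-transformed-data-bucket"


def drop_columns(csv_text, columns_to_drop):
    """Return the CSV text with the given columns removed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [name for name in (reader.fieldnames or []) if name not in columns_to_drop]
    out = io.StringIO()
    # extrasaction="ignore" silently skips the columns we are dropping.
    writer = csv.DictWriter(
        out, fieldnames=kept, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


def lambda_handler(event, context):
    """Entry point for an S3 'object created' trigger (assumed setup)."""
    import boto3  # imported here so drop_columns can be tried without AWS

    s3 = boto3.client("s3")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        transformed = drop_columns(body, COLUMNS_TO_DROP)
        s3.put_object(Bucket=OUTPUT_BUCKET, Key=key, Body=transformed.encode("utf-8"))
```

Keeping the transformation in a plain function such as `drop_columns` means it can be run locally on a downloaded CSV file first, which matches the advice to tackle one thing at a time before moving on to Lambda, Terraform, and CI/CD.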
Info
Channel: Johannes Frey
Views: 11,668
Keywords: data engineer roadmap 2022, data science, data engineer, big data, data engineering, data engineer skills, data engineering for beginners, data engineering career path, data science roadmap, machine learning roadmap, how to become a data engineer in 2022, johannes frey, data engineer certification, data engineer courses, data engineer skills required, roadmap data engineer, roadmap data science, roadmap machine learning 2022, data engineer 2022, data engineer path
Id: kxj3PzuWAfg
Length: 12min 0sec (720 seconds)
Published: Thu Feb 17 2022