Data Engineer Talk - Workflow Engines, Why YOU need it!

Video Statistics and Information

Captions
Hi, I'm Joe, and it's super awesome to have you here today. When someone told me about workflow engines, things like BPMN, and tools like Camunda many years ago, I was like: what the...? I use if statements and things like that to control the execution flow of my program. Why do I need a workflow engine? And oh boy, was I naive back then. So let's talk about it.

I'm Johannes Frey, but you can simply call me Joe. I've been working as a software engineer for more than 15 years, and I switched to data science and data engineering about four years ago. I'm here to share with you the few little things that I picked up along the way.

So, what is a workflow engine? A workflow engine helps you orchestrate the many different parts that you use in, for example, a data processing pipeline. Maybe you need to make an API call to retrieve some data. Then this data needs to be stored. Once it is stored, you maybe need to trigger a Spark job to actually transform the data and make it readable for, let's say, your machine learning model. After that you need to trigger the retraining of your machine learning model, and when all is said and done, maybe you also need to trigger some automatic evaluation, and so on and so forth.

All those steps consist of various different services. For example, the API call could be a cloud function or an AWS Lambda. As I said, the data transformation could be done by a Spark job. The retraining of the machine learning model could be done by yet another web service using TensorFlow. Somehow you need to chain those things together, and you also need to take things like error handling, retries, failing gracefully, and so on into account. This is basically where the workflow engine comes to your help and solves those problems for you.

Back in the intro I said, yeah, I control my program flow with if statements and such. A workflow engine helps you control the execution flow of many programs, right? It's kind of an overarching thing.

Another important thing to note is that a workflow engine only helps you orchestrate infrastructure or services that are already in place. It won't create those for you. So first of all you need to create, for example, the cloud function that does the API call, you need to create the Spark script or the Spark job, and you need to create the retraining job. After that, the workflow engine helps you tie everything together and make it all work smoothly together.

But before we go into why you really should consider using a workflow engine, please go completely insane on that like button. That would really help me a lot.

So let's talk about some reasons why you really should consider using a workflow engine. Data processing pipelines are rather complex, and you need something to manage all this complexity in a sane way and tie everything smoothly together. This is exactly what workflow engines are for.

Workflow engines usually provide a graphical representation of your workflow. All of the workflow engines that I have worked with will even show you graphically, while the workflow is running, which step is currently in execution, so you know exactly what's going on.

Another very important point that I can't stress enough, and that often gets overlooked, is that workflow engines provide very sophisticated error handling and mechanisms like retries. You can also group many steps into some sort of container and have the whole group retried whenever a single step in it fails. So there are many, many possibilities. The error handling part is really good, and you will miss those things if you try to build something like that yourself. So why should you reinvent the wheel?

And as I just said, there are also other ways to accomplish this chain of process steps. For example, on AWS and on GCP there are triggers that let you start things when, for example, files get stored in cloud storage. Yet the thing with that is that you lack the graphical representation of what's actually going on. You need to get into the system and look at which triggers are defined, where they belong, and so on. And the most important thing is that you don't actually have proper error handling, so you need to build everything yourself. I mean, why should you make your life more difficult than it needs to be? If there are already solutions for those problems, just use them.

I'm not talking about one specific workflow engine here; there are many workflow engines to choose from. Some that are commonly used are, for example, Airflow, GCP Workflows, and AWS Step Functions. All of those, even though they are slightly different, have a somewhat similar feature set and basically help you solve the same problems.

So what are some of the features that those workflow engines actually bring you? I already talked a lot about error handling, so that one obviously. Then again, you can usually also create subprocesses, where you group various different steps together and make that into a kind of component that you can then reuse and parameterize for further use cases.

Usually you also have things like loop processing, where you feed the workflow engine an array and it starts the next steps for each element in that array. You can usually also run things in parallel, where you again provide some array or something like that, and it starts separate processes for each entry simultaneously, or you can parameterize it and tell it how many parallel flows it should start. And where you have some sort of forking for parallelism, you usually also have some sort of joining to bring the branches back together and get one single result out of them. So this is also usually supported by workflow engines.

Of course you can also control workflows with if conditions. You can check whether the output of one step meets a certain condition, and if so choose this execution path, and if not choose that execution path.

Workflow engines also usually come with connectors to the most used services or tools. For example, most workflow engines have connectors that help you run Spark jobs, maybe there are connectors with which you can ingest data from Kafka, and so on. So all those state-of-the-art and generally used tools usually have rather good connectors for those workflow engines, and it's rather easy to use them.

So now, yeah, everything sounds cool, but how do you actually configure that? Is it some graphical user interface? Is it some coding that you need to do? Well, as I said earlier, a workflow engine basically helps you orchestrate code or services or programs or tools that are already in place. So you already need to have some cloud functions in place, for example, and you already need to have some Spark jobs in place. The tools that I have used so far usually have some sort of configuration using YAML. You just specify the order in which the steps should occur, you can specify, as I said, the error handling, and you can use the connectors to the different other tools. Everything is usually neatly done with YAML. It's not that complex and it's actually quite good to use, at least in the tools that I have been using so far.

So I hope I could raise some awareness for workflow engines, and that you might be inclined to give one a try and see whether it improves your life and brings you some benefits in your daily project work. If so, it would be super awesome if you could leave a like and go completely insane on that subscribe button. So far so good. See you in the next one. Bye!
Info
Channel: Johannes Frey
Views: 5,801
Keywords: workflow engine, data engineer, data engineering, data science, machine learning, johannes frey, why you need worflow engine, data engineer talk, workflow engine camunda, aws workflow engine, Apache Airflow, data engineering projects, if-statements, execution flow, BPMN, data engineering toolbox, programming, data engineering for beginners
Id: RKoiZOmmrkg
Length: 8min 21sec (501 seconds)
Published: Thu Jun 02 2022