Airflow Sensors : Get started in 10 mins

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
hey glad to see you again because in that video you are going to discover a very important topic in airflow the airflow sensors so what if you need to wait for a file to land at a specific location before moving to the next task or what if you want to wait for a task in another tag to complete before executing your dag or what if you want to wait for a value to be in your sql table before executing the next task well in order to answer all of those pretty important questions you need to use an airflow sensor my name is mark lomati i'm the head of customer training at astronomer and bestselling instructor on udemy and my goal is to make sure that during the next 10 minutes you are going to discover everything you need in order to get started right now with the airflow sensors to create incredible data pipelines so you will discover what are airflow sensors what are the limitations and the things that you really have to be careful with otherwise you may end up with a deadlock then last but not least what are the different best practices around sensors before moving forward don't forget to subscribe to my youtube channel that will help me a lot and more importantly that will help you as well as you are going to receive a notification as soon as i create a new video and smash the like button it's always good for the youtube algorithm that being said let's get started as usual to better explain a concept and why this concept is important for you i think it's great to start from a very concrete example so let's imagine that you have three different partners partner a b and c where those partners will send you data at different times for example partner a might send you some data at 9am then partner b at 9 30 a.m and partner c at 10 a.m so the thing is they will send you data at different times and then as soon as you have received those data you want to extract them with other tasks then last but not least you want to process all of those data with the last task so this is a very simple data plan but it is a very common use case of airflow now here's the question how do you check that your data is arrived that partner a has sent the data you are waiting for or partner b has sent the data that you are waiting for either on your file system or on your f3 bucket how do you make sure that your data is there and that you can move to the next task extract well this is where you need to use a special kind of operators in airflow which are the sensors so at the end those tasks will be the airflow sensors and more specifically either the file sensor or the f3 key sensor let's define what exactly is a sensor so a sensor is nothing more than an operator evaluating at a given interval of time if a condition is true or false for example if a file exists at a specific location if yes then your sensor will get completed so basically each time you need to wait for something to happen before moving to the next task then you need a sensor and by default your condition will be evaluated every 30 seconds every 30 seconds the sensor will check if your condition is true or false the good thing is airflow brings a ton of different sensors so for example if you want to wait for a file to land at a specific location you can use the file sensor if you want to wait for a f3 key in your f3 bucket you can use the f3 key sensor if you want to wait for a sql record in your table you can do that with the sql sensor and if you want to wait for a task in another dac to complete you can use the external tasks sensor there is really a ton of different sensors and that's one of the many reasons why airflow is so powerful so that being said let's discover exactly how to implement your first sensor okay the first step is to import the sensor that you want to use in that case you are going to discover the file sensor in order to wait for a file to land at a specific location before moving forward so first let's import the file sensor and to do that you just need to type from airflow dot sensors dot file system import file sensor and for the file sensor you don't have to install any provider in order to access it which is great but if you want to use the f3 key sensor for example you will have to install the aws provider so once you have the file sensor you are ready to implement it you are ready to add the task in your diag so let's create a new task let's call it weighting on score 4 underscore file equals to the file sensor and as usual you need to specify a task id so let's use waiting for file as well now before showing you the specific arguments of the file sensor let me give you the arguments that all sensors share in airflow and the first one is the poke interval and the poking tower defines the interval of time at which your condition is evaluated so by default poke interval is set to 30 seconds which means with the file sensor the file sensor will check if your file exists at a specific location every 30 seconds now you may ask which value should you put there well there is no right and wrong answer for that it really depends on your use case if you expect your file to land after like five minutes that it makes sense to put five minutes for an interval of time but just keep in mind that if you put a very short interval of time there is no guarantee that everyone will be able to evaluate your condition at this interval of time for example every second okay then last but not least don't forget that depending on the sensor you use for example the sql sensor each time you are going to verify if the record exists in your database you are going to create a connection behind the scene so if you have a ton of sensors and you have a very short poke interval then you are going to create a lot of connections to your database and that is something that you might not want now what if your file never arrives what happens with your sensor in that case well this is really important because by default your sensor your file sensor in that case will check every 30 seconds during seven days before timing out that's huge and when you run a sensor you run a task which means a worker slot from the default pool is taken for that task until it gets completed which means this can lead you to a very big issue called the deadlock but before talking about the deadlock let me show you exactly how to correctly set your timeout so in order to define a timeout in your sensor in any sensor you have to put the following argument time out equals to the time in which you expect your sensor to succeed so for example if your fight should arrive between 9 am and 10 a.m then you should define a timeout of one hour if you expect that your file arrives between 9 a.m and 9 30 a.m then you should define a timeout of 30 minutes okay so that's how you should define your timeout so here let's put for example um let's say five minutes like that obviously you should define a timeout greater than the poke interval you have in addition if you are wondering what is the difference between the timeout and the execution timeout that it is available with all operators well you are going to discover this in a few seconds but first let's talk about deadlocks let's use the dag example at the beginning of the video and let's say this dag is triggered every day and you can execute at most 12 tasks at the same time in your entire airflow instance what happens if you keep the default timeout set to 7 days well the first day you will run three tasks three sensors corresponding to those tasks and then you will have nine slots available so nine other tasks that you can execute but your file never arrives so the next day you get three more tasks three more sensors running corresponding to those tasks while the sensors from day one are still running so now you have six tasks running and six slots available and then day three now you have nine tasks running and three slots available then a day four you have 12 tasks running and zero slots and that means at the fifth day you have a deadlock and so you are not able to execute any more tasks in your entire airflow instance because all of the worker slots are taken by your sensors that's why it is absolutely important to always always always define a timeout in order to avoid keeping a worker slot while your sensor is running even when your condition is not checked because your condition is checked every 30 seconds so during 30 seconds nothing happens one thing you can do is to define the mode of your sensor and by default the mode is set to puck but as a best practice when you expect that your sensor will run for let's say 30 minutes or 45 minutes i don't know more than 10 minutes it's always a good thing to define the mode to reschedule so that during the different interval of times the worker slot taken by your sensor will be released so that other tasks can get executed and then as soon as the interval of time is elapsed then your sensor will take a new worker slot okay so by doing this you avoid keeping a worker slot you avoid having deadlocks in your for instance it's really a best practice to use mode rescheduled all right the last question that i didn't answer yet is what is the difference between the timeout here and the execution timeout that it is available for all operators well with the execution timeout if your operator if your sensor is longer to execute than the time defined in this execution timeout then your operator your task fails whereas with this timeout that it is available only for sensors you can use another argument called soft underscore fail which is set to false by default and this argument allows you to say if your sensor is longer to execute than the time defined in timeout then i want to skip the sensor i don't want to put it into failure i want to skip the sensor and to skip the sensor you just need to define soft underscore fail to true so this is the difference between execution timeout and timeout with execution timeout your sensor fails whereas with timeout if soft underscore fail is set to true your sensor will be skipped alright that's it about sensors if you want to learn more about them you can check the link in the description below but now you have a truly better understanding of what a sensor is what can you do with your sensors and what are the best practices around sensor have a great day take care and see you for our next video
Info
Channel: Marc Lamberti
Views: 3,848
Rating: undefined out of 5
Keywords: airflow sensors, airflow sensor, apache airflow, sensors, filesensor, s3keysensor, externaltasksensor, sqlsensor, beginner, airflow, airflow 2.0
Id: fgm3BZ3Ubnw
Channel Id: undefined
Length: 11min 54sec (714 seconds)
Published: Thu Jul 15 2021
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.