What's coming in Airflow 2.0?

Captions
I want to start off by thanking everyone for coming, and just to introduce myself: my name is Brian Lavery, I'm a data engineer at the New York Times and I host the New York City Airflow Meetup. I want to welcome all our new members — a lot of new people registered specifically for this event, because it's such an interesting one, and we hope to see you at our upcoming meetups; we hold them regularly and will continue with virtual meetups. I'll be quick so we can get to the main event. The New York Times is hiring: we have open positions including data engineer, software engineer, machine learning engineer, backend software engineer and many more, with partially and fully remote positions available — we already have a number of partially and fully remote people working for us. For more information, check out nytco.com/careers. The last thing I want to do is thank our Airflow committers for their hard work and for giving us this preview of the fruits of their labor.

Hello everyone, and thank you, Brian, for the introduction and for hosting us. It's the first time we've done this kind of event online at the meetup, and it's special because it's in New York — we know New York was hit hard by the COVID situation — so thank you, Brian, for finding the time and helping us set it up. Let me quickly introduce the team doing the presentation: Tomek, Kamil and I are from Polidea, and Ash, Daniel and Kaxil are from Astronomer. We work together as if we were one team in one company — most of the committers are from these two companies — and it's really great to have so many committers here, because they are all instrumental in making Airflow 2.0 happen, and that's what we're going to talk about. Now I'll pass to Ash, who will tell you about the first big thing coming in Airflow 2.0.

Right now it's accepted practice that you shouldn't run more than a single scheduler at once. You can do it, but things get a little bit wrong — it's basically untested, so don't do it. That's not really great: it's 2020, cloud native is all the rage, and we should be able to run more than one scheduler, for two main reasons. One is performance: we all know that the performance of the Airflow scheduler sometimes leaves something to be desired. This is not strictly about high availability, it's also about general optimization — the lag between tasks takes too long, and if you've got lots of small tasks we all know it goes wrong, so let's fix that. Part of how we're going to do that is scalability: being able to horizontally scale your scheduler, so that if you know that at midnight UTC, or midnight local time, you kick off ten thousand jobs and it gets slow, you can set up some scaling rules, have two or three more scheduler pods come up at that time for your load, speed things up for that period, and then scale back down when it's quiet. And then there's plain resiliency: a scheduler can die and the tasks all carry on
running — as long as you're not using the LocalExecutor, but we kind of assume people don't use that in production, or if they do, they accept the trade-off. We should also be able to have more than one scheduler and just have any of them take over.

The design for how we're going to do that — this is AIP-15, I think, but anyway — the rough design is active-active schedulers: each scheduler process is fully capable of doing everything. There are no leader elections, there's no master scheduler; they're all capable of doing everything, so there's no split brain to worry about. To do this we're using the existing database, primarily because you've already got a database: with some smart use of row-level locking we think we can get the performance and the behavior we want, without having to run something else such as an etcd, a ZooKeeper or a Consul. Not needing an external system for leader election just makes it operationally simple. This is obviously the plan at the point where we've started but not finished — if it turns out not to work, we'll re-evaluate — but right now we think we can do it.

What it will look like — everyone likes a diagram, and this one is very simplistic (not pictured, for instance, is a Celery broker) — is multiple scheduler machines, pods, containers, whatever, all talking to the same database. We have plans in place to make sure they don't step on each other or fight over the same work. To get more throughput, all you have to do is turn that dial up by one — add another scheduler — and it just starts up and takes over tasks.

As part of implementing this we want to separate DAG parsing from DAG scheduling. Right now, as of 1.10.10 — and this has basically been true forever — Airflow won't queue a task until it has parsed the DAG file it came from. Putting a task into the scheduled state happens in a subprocess, but actually sending it to the executor and getting it started doesn't happen until that parsing result is returned. We want to break that link: as soon as your task is ready, it should go. That's step one to get the throughput. The other thing we're talking about doing is something we've called a fast-follow or snowball executor mode. One of the reasons the scheduler is slow is that you need the DAG definition to make scheduling decisions; but when a task has just finished on a worker, we've got that DAG definition right there, so we can look at the immediate downstream tasks of the task that just finished and ask: can we schedule any of these? We don't have to wait for the scheduler to come around and parse that DAG again — we've already got it and can already make that decision. Particularly if you have a large number of DAG files, that should be a big speed-up. Our goal is sub-second — ideally sub-100-millisecond — lag between one task finishing and the next one starting.
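As an illustration of the row-level-locking idea mentioned above — this is not Airflow's actual implementation, just a minimal sketch with a simplified stand-in table, showing how SELECT ... FOR UPDATE SKIP LOCKED lets several active-active schedulers share one database without a leader election:

```python
# Minimal sketch (not Airflow's real code): each scheduler claims a batch of
# schedulable task instances; SKIP LOCKED makes concurrent schedulers skip
# rows another scheduler has already locked, so no two pick the same task.
from sqlalchemy import Column, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class TaskInstance(Base):  # simplified stand-in for Airflow's task_instance table
    __tablename__ = "task_instance"
    dag_id = Column(String, primary_key=True)
    task_id = Column(String, primary_key=True)
    state = Column(String)

engine = create_engine("postgresql+psycopg2://airflow@localhost/airflow")
Session = sessionmaker(bind=engine)

def claim_tasks(batch_size=32):
    """Claim up to batch_size 'scheduled' task instances for this scheduler."""
    with Session.begin() as session:
        rows = (
            session.query(TaskInstance)
            .filter(TaskInstance.state == "scheduled")
            .limit(batch_size)
            .with_for_update(skip_locked=True)  # other schedulers skip these rows
            .all()
        )
        for ti in rows:
            ti.state = "queued"  # a real scheduler would now hand off to the executor
        return [(ti.dag_id, ti.task_id) for ti in rows]
```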
I don't know how long it will take to get there, but I'm going to keep working until we do. The other thing we need to do is test this to destruction: we need to know where it works, where it doesn't, and where it falls down, because it's a big architectural change, so we need to make sure we get it right. Part of the toolset we'll use for that is the subject of the next section: DAG serialization.

Currently, when you install Airflow, DAG serialization is not enabled by default, so both the webserver and the scheduler need access to the DAG files, as in the figure on the left: both components parse the DAG files and create DagBags from them. This is redundant, and if you are using Docker to deploy Airflow you need to make sure the DAG directories are in sync in both components. To reduce this complexity, and the duplication of having to parse DAG files in both places, we introduced DAG serialization. In the figure on the right you can see that the scheduler parses the DAG files, serializes them to a JSON format, and stores them in the Airflow metadata database; the webserver then reads the DAGs from the metadata database and shows them in the UI, so the webserver does not need to parse DAG files anymore. That's the main concept behind DAG serialization, but we also did some good optimizations around it.

One of them is the stateless webserver. As I said, previously the webserver had to parse the DAG files, and when you have a large number of DAGs — say a few hundred — it takes a huge amount of time to parse them before the webserver is ready to serve requests. As the number of DAG files and their complexity (nested SubDAGs, for example) increases, the webserver takes longer and longer to come up, which is far from ideal. A stateless webserver with DAG serialization solves that issue. Another optimization is lazy loading of DAGs: previously we loaded the entire DagBag — all the DAG files — whereas now the webserver boots with an empty DagBag, so it's not constrained by the number of DAG files you have, and it loads DAGs only on demand. Whenever you click on a particular DAG in the web UI, the webserver fetches that DAG from the metadata database. This definitely helps reduce webserver startup time — the reduction is notable when you have a large number of DAGs — and deploying new DAGs to Airflow no longer requires webserver restarts, because the webserver just fetches them from the database. We also have a feature to use the JSON library of your choice: by default we use the built-in json library to serialize and deserialize, but you can use ujson, for example. The DAG serialization work is also a prerequisite for DAG versioning and for scheduler HA. These are all tasks that have already been completed and are available in Airflow 1.10.10.
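A toy sketch of the split being described — this is not Airflow's real serialization format or API (in 1.10.10 the feature is opt-in via configuration), just the shape of the idea: a scheduler-side parser writes a JSON representation of each DAG into the metadata database, and the webserver only ever reads that JSON back instead of parsing .py files:

```python
# Toy illustration only: hypothetical table and JSON layout, not Airflow's own.
import json
import sqlite3

def store_serialized_dag(conn, dag_id, tasks):
    """'Scheduler' side: persist a JSON representation of a parsed DAG."""
    data = json.dumps({"dag_id": dag_id, "tasks": tasks})
    conn.execute(
        "INSERT OR REPLACE INTO serialized_dag (dag_id, data) VALUES (?, ?)",
        (dag_id, data),
    )

def load_serialized_dag(conn, dag_id):
    """'Webserver' side: rebuild the DAG view from stored JSON, no .py parsing."""
    row = conn.execute(
        "SELECT data FROM serialized_dag WHERE dag_id = ?", (dag_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE serialized_dag (dag_id TEXT PRIMARY KEY, data TEXT)")
store_serialized_dag(conn, "example_dag", ["extract", "transform", "load"])
print(load_serialized_dag(conn, "example_dag"))
```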
We initially did some of this work for 1.10.7, but there were known limitations that were solved in later versions. For Airflow 2.0 we plan to do the following. First, decouple DAG parsing and serializing from the scheduling loop as well. We're doing this work in two phases: the first phase was introducing serialization for the webserver; the second phase is introducing it into the scheduler itself. The current scheduling loop inside the scheduler does more than just scheduling — it also does DAG parsing and serializes the DAGs to the metadata database. So we're planning to have a separate component — it might be called the DAG serializer, or the DAG parser — that is responsible for parsing the DAG files and storing them in the database, and the scheduler will then use the serialized DAGs from the database instead of the raw DAG files. This should reduce the delay in scheduling tasks when the number of DAGs is large, because the scheduler should just do scheduling of tasks and nothing else; currently it does more than that.

The next thing, which I'm just starting to work on, is DAG versioning. The current problem is that if you update your DAG file — say you add a task — the DAG structure shown for previous DAG runs changes too. Here's an example: take this DAG file and its representation in the graph view, and assume it has already had one execution. If I add a task and then go back to the graph view and check that previous DAG run, it will show the new task as well, but with no status — which is far from ideal, because this task was not present when that DAG run executed. That's what we want to change with DAG versioning: a DAG run should correspond to the particular DAG file, or serialized DAG, that was used at that time, so we're planning to store multiple versions of the serialized DAGs and tie DAG runs to them. The goals for DAG versioning are: support for storing multiple versions of serialized DAGs; as part of this exercise, maintenance DAGs to help clean up old DAG runs and their associated serialized DAGs; and, as I said, the graph view should show the DAG structure that is associated with that DAG run — the structure should not change after the fact. That's about it; I'll hand over to Kamil from Polidea.

I want to talk about improvements to the scheduler. I reviewed the scheduler component to check what we can improve and how, and when I started we didn't even have tools that would let me measure the component's performance. So I began by preparing tools that allow me to check performance — I call them perf-kit. These are pytest plugins and context managers that simplify my daily work: they allow you, for example, to count the database queries a given piece of code performs. It's integrated with pytest, so if you want to trace queries in a test case, all you have to do is run pytest with an additional option and you immediately see whether there is a performance problem. Here is a real example from Airflow: a DagBag that contains 200 DAGs resulted in 200 queries, which could easily be replaced by one query to the database.
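A toy illustration of the pattern Kamil is describing, using a made-up table — one query per DAG in a loop versus a single query fetching everything at once:

```python
# N+1 pattern vs. a single query, on a hypothetical 'dag' table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dag (dag_id TEXT PRIMARY KEY, is_paused INTEGER)")
dag_ids = [f"dag_{i}" for i in range(200)]
conn.executemany("INSERT INTO dag VALUES (?, 0)", [(d,) for d in dag_ids])

# N+1 style: one query per DAG -> 200 round trips to the database.
paused = {}
for dag_id in dag_ids:
    row = conn.execute("SELECT is_paused FROM dag WHERE dag_id = ?", (dag_id,)).fetchone()
    paused[dag_id] = bool(row[0])

# Fixed: one query fetching all 200 rows at once.
placeholders = ",".join("?" for _ in dag_ids)
rows = conn.execute(
    f"SELECT dag_id, is_paused FROM dag WHERE dag_id IN ({placeholders})", dag_ids
).fetchall()
paused = {dag_id: bool(flag) for dag_id, flag in rows}
```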
This problem is called the N+1 problem; it's very common in Airflow, and I set out to fix it. I have pytest tools that detect exactly this situation, because such problems are otherwise not easy to notice. I have already optimized one component, the DAG file processor, and I got fantastic results: in a local environment we observe roughly ten times faster processing time for one file, and I reduced the number of queries by over 500 times. In a real deployment the gains can be even bigger, because of the network latency between the database and the scheduler. When I improve performance I want to avoid later changes that would undo it, so I created additional tests that protect the optimizations — a context manager that counts queries — which lets you catch changes with a potential performance impact during code review. Such tests are critical in an open project like Airflow, because people come and go and the knowledge is not always passed on. To summarize: we know what a common source of performance problems in Airflow is, we have a tool that lets us check performance during daily work, and we protect ourselves against regressions with tests. Next up is Jarek.

OK, so I will talk about the REST API, although the person mostly working on it is actually Kamil, who just spoke. The headline is that you are going to get a really stable and fully featured REST API that follows the OpenAPI 3 specification. You can see an extract from Swagger showing the APIs and example calls. What's even more interesting is that we have two interns working on the API as well — tomorrow we have a kickoff meeting with them. Outreachy is an organization that pays for internships for people who normally wouldn't be able to participate in open source projects; Kaxil and I are mentors, and we helped screen a number of applicants and chose two participants, Ephraim and Omair — welcome, if you're listening to this meetup. Tomorrow they start working together with Kamil on the API. Everyone's help is welcome as well: they will be helping, but if anyone else can contribute to the API, that would be fantastic, because there is a lot of work there and we have only just started the real work on it.
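The stable REST API was still being designed at the time of this talk, so the following is only a hedged sketch of what a call against an OpenAPI-style /api/v1 endpoint could look like once it ships; the exact path, authentication setup and response fields are assumptions for illustration, not the final specification:

```python
# Hypothetical example of listing DAGs via a stable /api/v1-style endpoint.
import requests

AIRFLOW_URL = "http://localhost:8080"   # assumed local webserver
session = requests.Session()
session.auth = ("admin", "admin")       # assumes a basic-auth API backend is configured

resp = session.get(f"{AIRFLOW_URL}/api/v1/dags", params={"limit": 10})
resp.raise_for_status()
for dag in resp.json().get("dags", []):
    print(dag.get("dag_id"), "paused" if dag.get("is_paused") else "active")
```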
Now the main topic — something that people who follow Slack and the dev list know is very close to my heart — is the development environment, which I've been working on seemingly forever, since I joined the project. We have a few changes, some of them already implemented. We are moving to GitHub Actions for CI: we've had lots of problems with Travis and we are finally moving off it; we just have a few things left, like the Kubernetes tests. We will also have an easier way to run those Kubernetes tests locally — that's coming. We also have quarantined tests, which means tests we know are flaky — and which we are going to fix soon — can fail without blocking us for now, and we will introduce a process for fixing those tests very soon. I'm also going to work on cleaning up the CI image — something Ash has always nagged me about, thank you for that — and make the image smaller; I can finally find the time for that now that we have the production image. There are some things that can be moved out, like the Hadoop mini-cluster and a few others. The last part of the testing story is that we are going to automate the system tests of Airflow. We already have a number of system tests, especially in the Google area where Polidea has been working — automated tests against the actual systems we connect to, in this case GCP — and we are implementing full automation of those tests so they can be run weekly or daily. Daily probably not, because for Google it currently takes something like seven hours to run them all — they are very, very slow — but we are going to run them often. Here is an example screenshot from GitHub Actions: some people find it less readable, some more, but I really love the timestamps — you can finally see how long each step in CI takes. I think we all prefer GitHub Actions to Travis; it has some quirks as well, but it's much better.

A few more things about the dev environment. I created Breeze and I keep adding features to it; the list of available commands is growing almost weekly. Right now I'm adding generation of the backport packages, which I'm going to talk about in a moment, so we can also use Breeze for maintenance — generating a number of useful artifacts from Airflow. Breeze covers unit testing, building packages and release preparation. I'm also going to refresh some videos: the Breeze video on the main page is completely outdated and I haven't had time to re-record it, but the screencast is done, I just have to record the voice-over. And one last thing, which is kind of my dream, something I've been longing for for a long time: I'm going to try to integrate with the recently announced Codespaces. Codespaces is an online development environment for anyone using GitHub, and I really hope that, with Breeze and what's behind it, we will be able to literally develop Airflow from the browser using the dev environment provided by GitHub — which should be a really nice way to start your adventure with Airflow.

Backport packages — this is a big thing, I think. We are bringing the Airflow 2.0 providers to 1.10. We have been developing Airflow 2.0 for months now, maybe even years — Ash, correct me if I'm wrong, but it's probably about a year since we started working on 2.0, or even more. We are continuously adding new things, but they aren't being used, because they're only in master and very few people run master. So we are now bringing all the providers — the integrations with external services — to 1.10, and each provider is a separate package, which means we have 58 providers: 58 new packages that I'm working on releasing right now. We're going to publish the release candidates today or tomorrow most likely, then vote on them, and maybe next week they will be available, so anyone will be able to use the Airflow 2.0 integrations with external services — the operators, hooks, sensors and whatnot — on Airflow 1.10.
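As a concrete illustration of how a backport provider is meant to be used on 1.10 — the package and module names below follow the naming scheme described in the talk, but since the release candidates were still being voted on, treat the exact identifiers and the connection id as assumptions:

```python
# First: pip install apache-airflow-backport-providers-ssh   (Python 3.6+, Airflow 1.10.x)
# Then use the Airflow 2.0-style import path in your DAG:
from datetime import datetime
from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator  # 2.0-style location

with DAG("backport_example", start_date=datetime(2020, 5, 1), schedule_interval=None) as dag:
    run_remote = SSHOperator(
        task_id="run_remote",
        ssh_conn_id="my_ssh_conn",   # assumed connection id
        command="echo hello from a backported provider",
    )
# The old 1.10 import (airflow.contrib.operators.ssh_operator.SSHOperator)
# keeps working alongside it, so DAGs can be migrated one at a time.
```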
The backport packages are Python 3.6+ only. They are automatically tested as much as possible: so far we build them and install them on Airflow 1.10.10 and check that they work, so they're fairly stable right now, and we also exercise them with the system tests — we tested all the Google packages that way. Once the automated system tests are in place, we will be able to test that continuously as well. And actually, when I implemented the backport packages, it turned out that AIP-8 — splitting Airflow into sub-packages, which I was afraid would be rather difficult — is almost there: what we've achieved with the backport packages is very close to AIP-8, and I hope (we will discuss it) that we're going to use the same approach for releasing Airflow 2.0 in the future.

One more thing people will find in the backport packages is release notes. This is something I worked on over the last week, with some good results: automated release notes for all the backport packages. For each backport package — these are just two examples, the Snowflake and SSH ones — you will see which requirements the package needs; the list of new operators and hooks provided by the package, that is, ones that were not in 1.10 so far; the release notes, or changes, generated automatically; and the cross-dependencies, because some backport packages depend on each other — this is the SFTP package, I believe, and it depends on the SSH package, so you have to install both to use all the features. You will also see, for operators that were moved from 1.10 to 2.0, the list of those operators with links to the new ones, so it should be super easy for you to switch to the new operators from the new packages after this is released. That's it for the automated release notes.

One more thing I'm working on is the production Docker image. It's alpha quality right now, for 1.10 and master; we have a list of issues and I'm gathering feedback. We already have a lot of feedback from people using the image, I'm still collecting it, and I'm going to apply it soon and ask for any PRs and changes people would like to see. We deliberately started with a really basic image without a lot of features, because I wanted to learn how people are using Puckel's and the other images that are floating around; I'm listening to the use cases and I'm going to apply what I hear in a structured way. We will have integration with docker-compose as well — that's something that will be worked on, so you have a nice docker-compose setup to run Airflow — and integration with the Helm chart, which is something that mostly Astronomer works on, and Daniel will tell you more about that in a moment. With that I'm passing it over to Daniel, who will tell you more about what he is working on.

Hi everybody, thank you so much for having me. I'm going to be talking about a couple of the cool new features in the Airflow-and-Kubernetes story, and probably the one we're most excited about is the KEDA autoscaler.
Am I able to share — yes. So traditionally, when Airflow users wanted an autoscaling story, they would use the KubernetesExecutor, which uses the Kubernetes API to launch a pod per task. That's great if you're trying to scale down to zero, but there are some drawbacks. The first is that in very high-scale cases, if you try to launch 1,000 tasks in parallel, that is 1,000 pods running in parallel on your Kubernetes cluster, which can get really expensive. There's also a non-zero startup time per pod: you have to start the container, then start the Python interpreter, then load the Airflow libraries and parse the DAGs, and all of that means — especially for very short tasks — there's a real amount of per-task overhead. So historically you had this comparison of the CeleryExecutor and the KubernetesExecutor, where the KubernetesExecutor had dynamic allocation, but the CeleryExecutor was faster on SLAs and more efficient, because you can run multiple tasks per worker.

We were able to figure out an autoscaling solution for the CeleryExecutor on Kubernetes which works really nicely, and we do it using a system called KEDA. KEDA stands for Kubernetes Event-Driven Autoscaler; it's a CNCF sandbox project where you can essentially create an autoscaler driven by the number of tasks in a queue — or, in this case, Tomek and I actually built scalers for Postgres and MySQL, so now we can run a Postgres query on the Airflow metadata database and handle autoscaling based on the results of that query. The rule we came to was the ceiling of the number of running tasks plus the number of queued tasks, divided by the number of tasks per worker, which is usually 16. So we start out with a scaled-to-zero Airflow cluster with no Celery workers running at all; the second we add one task, it launches a worker, and it scales back down when that task is completed. If we now have 20 running tasks and 20 tasks waiting in the queue, it scales up to match, and then once again scales to zero. So you get the multi-tenancy and the fast follow of the CeleryExecutor, but with the autoscaling of the KubernetesExecutor.

This also creates a new Airflow paradigm, because historically, with the CeleryExecutor, if you wanted to create a new Celery queue with GPUs or extra memory, you had to do resource allocation for that new queue. With KEDA, queues are cheap — they all scale to zero — so you can have fifty or a hundred different queue configurations, each of which is just a Kubernetes deployment object, and customize them as much as you like. You can have one queue with certain secrets, one queue with extra RAM for very intense jobs, one queue that's extra light for very simple tasks — and all of them scale to zero, so they're all very cheap.
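As a sketch of the scaling rule described above — in the real setup this lives as a SQL query inside a KEDA ScaledObject pointed at the metadata database; the function below is only an illustrative restatement, with the usual worker concurrency of 16 assumed:

```python
# Illustrative version of the KEDA rule: workers = ceil((running + queued) / concurrency)
import math

def desired_celery_workers(running, queued, worker_concurrency=16):
    """How many Celery worker pods the autoscaler should ask for."""
    return math.ceil((running + queued) / worker_concurrency)

print(desired_celery_workers(0, 0))     # 0 -> scale to zero when idle
print(desired_celery_workers(0, 1))     # 1 -> the first task spins up one worker
print(desired_celery_workers(20, 20))   # 3 -> 40 tasks / 16 per worker, rounded up
```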
As far as improvements to the KubernetesExecutor for Airflow 2.0 go, probably the coolest one is that we're now going to have pod templating from YAML and JSON files for the default worker pod. When we first came out with the KubernetesExecutor, the Airflow community didn't really know much about Kubernetes, and we wanted to make the process as straightforward as possible, so we abstracted the Kubernetes API away. While that made for an easier learning curve, the side effect is that the Kubernetes API is very fast-changing and very expansive, so we get a lot of PRs from people asking, "why is this feature from the Kubernetes API not built in — why can't we use it for our Airflow workers?" Now you can just take a YAML file, modify it to match your use case, and we will hook it up: in Airflow 2.0 there's going to be a config in airflow.cfg called pod_template_file, you point it at the path of that default YAML file, and the KubernetesExecutor will parse the YAML file and launch the worker pod from it. I want to give a huge thank you to David Lum for spearheading this feature — we're all really excited to see it in Airflow 2.0.

Finally, I wanted to talk a little bit about the official Airflow Helm chart. What's pretty cool about this Helm chart is that it's the chart we at Astronomer have been using for our customers for probably at least the last year, so we have tested it on a multitude of clouds and at some pretty gnarly scale, and we've been able to put a lot of customer feedback and knowledge into it. We even have KEDA autoscaling in the Helm chart, which you can turn on just by setting a variable. There's an open PR right now and we're hoping to get it merged very quickly; it will actually be usable with Airflow 1.10, and we're going to release a new chart with each Airflow release, so the charts are guaranteed to work with new versions of Airflow and are consistently tested against the official Docker image. The Airflow chart will also make it significantly easier to run Kubernetes-based tests on Airflow: as part of the CI/CD pipeline we'll be able to launch Airflow with the Helm chart and then run integration tests against it. And it significantly simplifies Airflow onboarding for Kubernetes users, because — especially with Helm 3, where you don't even need Tiller — you just take a Kubernetes cluster, run helm install apache/airflow, and you have an Airflow cluster running in your namespace. I believe next is Tomek.

Hello everyone, I want to tell you about a bigger change, which is called functional DAGs. Currently, if you want to turn a Python function into an operator, what you have to do is use the PythonOperator: we define functions and then we create PythonOperators with task IDs and python_callables, and we pass in the arguments the function will use during execution. This creates boilerplate code around the PythonOperators, because we have to define task IDs, provide the callables, and so on. What's more, we have to define two relations: in this example we first define the ordering relation — that the "get" task will be executed before the "save" task — but we also define the data relation, that the "save" task will use the result from the "get" task. Currently, to do that, you have to use Jinja templating and write the XCom pull by hand, which is quite an error-prone process.
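The slides themselves aren't reproduced in the captions, so here is a hedged reconstruction of the kind of boilerplate being described — two PythonOperators wired together with an explicit dependency and an explicit XCom pull (the talk mentions doing this via Jinja templates; this sketch uses the equivalent xcom_pull from the task context, and the task and function names are made up):

```python
# Classic pre-2.0 style: explicit PythonOperators, explicit ordering,
# and a hand-written XCom pull to pass data between tasks.
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def get_data():
    return {"price": 42}                      # return value is pushed to XCom

def save_data(**context):
    value = context["ti"].xcom_pull(task_ids="get_data")
    print("saving", value)

with DAG("classic_style", start_date=datetime(2020, 5, 1), schedule_interval=None) as dag:
    get_task = PythonOperator(task_id="get_data", python_callable=get_data)
    save_task = PythonOperator(
        task_id="save_data",
        python_callable=save_data,
        provide_context=True,                 # 1.10-era flag to get the task instance in kwargs
    )
    get_task >> save_task                     # the ordering relation, declared by hand
```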
The new approach we are currently working on makes this process much simpler: what you do is use a task decorator. We eliminate the boilerplate around creating such DAGs and, even more interestingly, we remove the need to set the downstream and upstream dependencies explicitly, because once we invoke a function decorated as a task, we can use its output directly in the next function. That output is of a new type called XComArg, and it is resolved to the XCom value during execution. So in this approach, by defining one relation we are actually setting two of them — the upstream/downstream relation between the tasks and also the data relation — so the value from the "get" task will be used in the "save" task. All of this was proposed in AIP-31, the Airflow functional DAG definition, proposed by Gerard — huge thanks for that. The main point is to make it easy to convert a function into an Airflow task and to simplify writing such DAGs; in my opinion this will help data scientists and other people who have simple tasks and want to define them at the DAG level. There is one more interesting part of this AIP, the pluggable XCom storage engine, which removes even more boilerplate: currently, if you want to download data from a GCS bucket, perform a transformation on it and upload the result back to a bucket, you have to repeat yourself in the downloading and uploading needed to persist the data. By making the XCom class pluggable we can abstract away that layer of retrieving and storing the data, and I think it's a really nice feature. Some parts of this AIP will probably be available in 1.10, but all of it will ship with Airflow 2.0. And now Ash will tell us about some smaller changes that will be part of Airflow 2.0.
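For contrast with the boilerplate version above, here is a hedged sketch of the decorator-based style proposed in AIP-31. The API was still being finalized at the time of this talk, so treat the exact import path as an assumption (in the released Airflow 2.0 the decorators ended up under airflow.decorators):

```python
# Functional / TaskFlow style: the decorated functions become tasks, and passing
# one function's return value into another sets both the ordering relation and
# the data relation (via XComArg) in a single call.
from datetime import datetime
from airflow.decorators import dag, task

@dag(start_date=datetime(2020, 5, 1), schedule_interval=None)
def functional_style():

    @task
    def get_data():
        return {"price": 42}

    @task
    def save_data(value):
        print("saving", value)

    save_data(get_data())   # one call defines get_data >> save_data AND the XCom hand-off

functional_dag = functional_style()
```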
So this is just about some smaller changes that have already been merged into Airflow's master branch. We make changes in master and then we, as release managers, backport those into 1.10; there we try not to make any breaking changes — sometimes we get it wrong, in which case sorry about that — but we don't want breaking changes until you ask for them. So here are some of the other things we've already merged into master, or have committed to doing before Airflow 2.0, because we're roughly following semver and 2.0 is allowed to break some things.

Connection IDs need to be unique. For those who aren't expecting it: connection IDs appear to be unique, but they aren't, and it occasionally trips people up — you can actually create two MySQL connections with the same ID pointing at two different MySQL servers. When Airflow was created in 2015 this was a poor man's implementation of load balancing, but there are better ways of doing it now; it's not really documented, and it's such a surprising feature that for 2.0 we're going to remove it. If you want that behaviour, there are better solutions — for Hive, for example, when you connect to a Hive metastore you can separate your hosts with commas — some kind of solution like that, which is better suited to the purpose. Making connection IDs unique makes a whole lot of things less surprising and hopefully fixes this for good.

The next one is Python 3 only. Airflow 2.0 will not run on anything less than Python 3.5.3, I believe. Python 2.7 has been officially unsupported by the Python project since January of this year, so if you are still running on Python 2.7 you need to upgrade: Airflow 2.0 will not work for you, and you will not get all these lovely new features unless you upgrade. This is your warning to start the process of upgrading to Python 3 if you haven't already. Airflow 1.10 works on Python 3.5, 3.6, 3.7 and 3.8 — and also 2.7 — but 1.10 will be the last series to support Python 2.7.

The other one is the RBAC UI. When Airflow was first released it had one webserver UI; at, I believe, 1.10.0 — maybe 1.10.1 — we added a new feature called the RBAC UI, which is role-based access control. We rebuilt the webserver UI on top of a project called Flask-AppBuilder, because it has robust access control, screens to manage permissions and all of that built in. From Airflow 2.0 this is the only option. Previously you could get it in 1.10 by setting rbac = True in the webserver section of your config, but for 2.0 it's the only UI. This won't affect many people directly unless you happen to be using plugins: if your plugin adds views to the webserver and doesn't support the RBAC UI — which is rare, but it happens — you'll need to update it. A lot of the shiny new features we backported to 1.10 were only backported to the RBAC UI, so we didn't have to do the work twice; the big example is the change in 1.10.10 that lets you change the timezone in the UI — all the dates in the UI are now localizable, so you can see them in your local timezone, which is helpful if your timezone isn't UTC for six months of the year. (The one advantage, and curse, for me is that half the year I'm on UTC.) The RBAC UI will also allow DAG-level access control, so DAGs can be shown to some users and not others.

So, time for the big question, the one you've all been dying to ask: when is Airflow 2.0 coming out? I'm not going to give you a date — we don't know. We want to get it out sometime in the coming months, but no promises. This is an open source project; the six of us on this call are pretty much the only people paid to work on Airflow full-time, there's quite a lot of work in the AIPs we've outlined, and we also have to spend some of our time on other things. It's going to come soon, but the answer, I'm afraid, has to be: when it's ready.

How do you upgrade to 2.0? We want to make the upgrade from 1.10 to 2.0 as seamless as possible. There are going to be some breaking changes — that's the whole point of making it a 2.0 release rather than another 1.10 one. If we can avoid a breaking change we will, but sometimes it's unavoidable. If it's too difficult to upgrade, people will stay on an old version — or stay on an old version and then stop using Airflow — which is not what we want; we want Airflow to keep going.
We like it, we want to be able to continue working on it, and for that we need people to continue using it, so we want to make upgrades as seamless as possible. As we mentioned earlier, we're going to release these backported provider packages, which let you use the new code layout now — but you don't need to use them. We've put in import shims, so if you import the old names in your DAGs they will still continue to work: you can go from 1.10 to 2.0 when it's released and your DAGs will still work, you'll just get a whole load of warnings about deprecated imports. And the bottom point there is that before we get to 2.0 we want to make sure we've removed or changed everything that we need to: 2.0 is not 2.0 because of shiny big new features, it's 2.0 because of the breaking changes.

So, how to upgrade to 2.0 when it is eventually released, and do it safely. Step one: upgrade to the latest 1.10 release, whatever that is — right now it's 1.10.10, and 1.10.11 will follow in a month or so; we roughly follow an every-two-months release process on the 1.10 series. As part of that there will be a command — it doesn't exist yet, but we will write it — which will give you, in one place, all the warnings we can find about what you need to change before you upgrade. So: upgrade to the latest 1.10, run that upgrade check, fix the warnings, and then you can upgrade Airflow. The usual Airflow upgrade process applies: make sure you run the database migrations and then reapply your configuration. That's the road to 2.0 and when we're hoping to get there — but, as mentioned, no promises; it's ready when we've got all the breaking changes done.
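To make the "import shims" point concrete, here is a hedged sketch of what that looks like in a DAG file. The new-style path shown is the one the provider/2.0 layout uses; the old 1.10 path is expected to keep working on 2.0 through a shim, just with a deprecation warning — the exact warning text and timing are not promised by the talk:

```python
# Old 1.10-style import: expected to keep working on Airflow 2.0 via a shim,
# emitting a DeprecationWarning that points at the new location.
from airflow.operators.bash_operator import BashOperator   # deprecated path
# New 2.0-style import of the same operator would be:
# from airflow.operators.bash import BashOperator

from datetime import datetime
from airflow import DAG

with DAG("shim_example", start_date=datetime(2020, 5, 1), schedule_interval=None) as dag:
    BashOperator(task_id="say_hi", bash_command="echo hi")
```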
And then our final thing is the Airflow Summit. Back in the halcyon days of January we were planning to put on a summit in quite a nice venue — it was going to be at the Computer History Museum in San Jose — but the global pandemic has somewhat said no to that. So instead we are going to do a virtual online summit, hosted by various meetups around the world across ten days, each meetup hosting one day at a time that works for that meetup, with the speakers largely lined up from the call for papers for the planned physical event. Check out more details on the website — and thanks to sponsorship, entry is free. So there we go; now we open up to Q&A.

Looking at the most-voted questions: there was a question about whether there is a plan to support upgrading versions, migrating the database and keeping backwards compatibility of the API — yes, that was answered in the last few slides from Ash. The additional question was whether that includes upgrading through 2.x versions — I believe so; we haven't done a 2.0 release yet, but we plan to have 2.1, 2.2, 2.3 and so on, with all the migrations in place. There's a question I cannot answer right now, maybe somebody else can: how do you deal with SubDAGs in 2.0, given the issue that they use the sequential executor? This may not be in 2.0 — it's not a breaking change, so it may be something we do later — but we want to make SubDAGs more transparent to the scheduler, so that rather than having a SubDagOperator which starts its own mini scheduler inside itself, the scheduler just looks through the SubDAG and schedules it directly.

The next question: where can I get that sweet Airflow shirt? I'll take this one — there is a question we discussed before about where to get the sweet Airflow shirt I'm wearing. Actually, if you look very closely, this is the old Airflow logo; I have it from a meetup I spoke at in Sunnyvale more than a year ago. Since this is a popular question — 17 or 18 votes and counting — we probably have to come up with something for the Airflow Summit, produce some swag so people can get those t-shirts. That's something we have to think about, and I will talk to the people organizing the summit. One more comment: the streaming for this whole meetup is handled by Software Guru, a company that works with us on the Airflow Summit as well, so we will use the same technology for streaming the Airflow Summit.

On the Kubernetes side: we're focusing on getting the Helm chart merged, and once that is in, we will begin active development on an Airflow operator. The plan right now is to use the Google Cloud Platform one more as inspiration, just because it does a lot, and we want to start with something smaller — start with just this database, just this Airflow config — and then slowly add in all the other databases and all the other configurations.

A question about how long the lag between tasks will be in 2.0: we don't know, but my goal is to get it below 0.1 seconds, and I'm not really going to stop working on it until it's less than 0.1 seconds. That should be achievable, so I'm just going to keep plugging away until we get down to that — that kind of latency should be table stakes in this day and age.

Another question: how can people get involved in development? Just start developing. We have a number of issues marked with the label "good first issue"; we recently moved from JIRA to GitHub Issues, which I think was one of the best moves ever — I'm sure everyone on the call agrees — so you can find good first issues there. There is also the CONTRIBUTING.rst documentation in Airflow, which literally guides you step by step through how to contribute: how to set up your development environment, what the steps are to create your first PR, and so on. Everything is described there, so feel free to just pick an issue and start working on it.

Can we expect features for data validation between source and destination? Not in 2.0 — that's something we can add later — but yes, it's on my list of things I want to do. Say it's an ETL job and you expect seven million rows a day, and then suddenly it drops off a cliff and the next day you get two rows, or 170 million: being able to expose that natively in Airflow would be great, both to graph it and to have some level of alerting on it — "hey, it's not within ten percent of yesterday, or of the seven-day average", something like that. I don't know what the firm plans are, but yes, we want to work on that. It's also useful for machine learning workloads: if you want to do split testing of some models, you run both at the same time and you want to know which one gets you the better results.
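Nothing like this exists in Airflow yet — the answer above is explicit that there are no firm plans — so the following is only a sketch of the kind of row-count sanity check being described, written as plain Python with made-up numbers:

```python
# Toy version of "is today's row count within 10% of the 7-day average?"
def check_row_count(today, last_seven_days, tolerance=0.10):
    average = sum(last_seven_days) / len(last_seven_days)
    if abs(today - average) > tolerance * average:
        # In a real pipeline this would fail the task and trigger alerting.
        raise ValueError(
            f"Row count {today} is outside {tolerance:.0%} of the 7-day average {average:.0f}"
        )

check_row_count(7_100_000, [7_000_000] * 7)   # fine
# check_row_count(2, [7_000_000] * 7)         # would raise -> pipeline alert
```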
So yes, that's definitely a feature that's coming, but it won't be in 2.0, because it doesn't need to be in 2.0 — it can be a 2.1 or a 2.2. As I said, 2.0 is about which breaking changes we need to get in now; we can add new features after it.

There was a nice question about Prometheus metrics and alternative options to StatsD — Ash, maybe you, since you already answered that in writing. Yes: I would love to have native Prometheus support in Airflow, but it's not trivial, because of the way Airflow uses subprocesses — subprocesses across different machines — and we'd have to work out how that behaves without just storing metrics in a database and querying it, which may end up being the answer. So maybe help us work out what it should look like and how we can do it; we haven't figured it out yet, and right now it's not a high priority for us. Again, it's not a breaking change, so it doesn't need to be in 2.0 — it can come in a later point release.

I'll answer some more questions in the meantime. Did we consider namespacing, to isolate or group DAGs, variables, connections or even plugins — kind of like namespaces in Kubernetes, but for your DAGs and connections? Nothing firm; it's something we've talked about, but it won't be coming in 2.0. It's definitely something we're aware of — in particular, limiting access to connections and variables to specific DAGs is on our radar — but there are no plans yet.

Okay, can people hear me now? Great — I'll just re-answer the two questions I answered; apologies for the technical difficulties. The first question was: I saw the Airflow operator was moved from the Google Cloud repository... I'll sign off and sign back in.

While that gets sorted, there was something around DAG serialization: does DAG serialization have any impact on the time within which a DAG can be manually triggered? It shouldn't really affect it, though the performance improvements should help. If it's a new file that you've just put in place, by default the scheduler won't notice new files until it rescans the DAG folder — I think it checks every five minutes. We don't have current plans to change that, but we could and should set up some kind of inotify watcher so that if you drop a new file in place it gets picked up straight away. So: PRs welcome, that would be a great feature — add inotify, or polling, or whatever, to the DAG parsing process so it picks up files immediately. It's kind of independent of DAG serialization, but it's a good idea.

All right, try number three: can people hear me? I'm speaking right now, I just want to make sure people can hear me — testing one two three. Okay, great. Once again, apologies for the technical difficulties. The first question was: I saw that the Airflow operator was moved from the Google Cloud Platform repository to the Apache repository — will we be supporting that soon? The first thing to say is that we're focusing on getting the Helm chart in, and that should be done very, very soon; once the Helm chart is merged to master and released with Airflow, then we're going to focus on the operator. As far as the Google Cloud one goes,
we've decided to use it more as inspiration, and the reason is that the Google Cloud Airflow operator has a lot of configuration: it works with a whole host of databases and with a lot of things that would be very hard to start out with. So we're pulling sections out and starting with a much simpler operator — I believe we're going to start with just a Postgres backend and the CeleryExecutor and then work our way up — so that we start with something much more manageable and eventually bring in all the other databases and configurations.

The second question was: when will KEDA be released? The answer is that it is already released. If you go to the Helm chart — either the Helm chart PR or the astronomer/airflow-chart repo — you'll see that we have a KEDA option working. So once the official Airflow chart is out — or if you use the Astronomer one, or wait for the official release — you just install the Airflow chart, install KEDA onto your cluster, enable the KEDA workers option, and that will start KEDA autoscaling for your CeleryExecutor.

Are there any other questions? There is one, quite controversial: what are your thoughts on that popular, controversial Medium article about only using the KubernetesPodOperator — the one about how we are all using Airflow wrong and how to fix it? I can answer for myself: I think running everything through KubernetesPodOperators doesn't let you use the flexibility of Airflow. A big part of why Airflow is useful is that you can actually write everything you want in Python — I keep repeating that Airflow is all about Python, it's Python-centric — and if you have data scientists and engineers who know Python and are very much used to the language, being able to run everything in Python is great. That's the superpower of Airflow, because otherwise you have to know Docker, prepare your image, deploy it, run it on Kubernetes, and if it fails you have to debug all of that — and that's complex enough that just running Python code is much easier. That's my view, but maybe others have other opinions.

I would say it comes down to your use case. For anything simple, anything where you're just communicating with an external service — a Spark submit, an HTTP request — there's no reason not to just use the operators built for it. Where the KubernetesPodOperator really shines is when you have something custom.

On the related question — regarding building our own operators versus using the building-block operators as much as we can, what's the best practice, also in terms of migration to 2.0? I can take that one, because it was actually one of the major parts of my PyCon talk, which hopefully is going to be posted today: using custom operators. I'm a huge proponent of custom operators, especially ones that are built on top of the ones that we provide.
One example I give in my PyCon talk: let's say you're a data engineer, you have a Kubernetes cluster with GPU nodes, and you want to give your data scientists the ability to use those GPU nodes — but you don't want them to have to know what a Kubernetes node selector is. You can take the KubernetesPodOperator and wrap it with the node selector gpu=true, and that way you abstract so much of the Kubernetes complexity away from your users that all they have to know is "I just use the GPU operator". It's a really great way to build a data science platform very simply: rather than building something from scratch, you take these existing operators and bake in the configuration. For the Spark operator, for example, you can bake in all the JAR files that your Spark jobs use by default, so your data scientists just don't have to think about that anymore.
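A hedged sketch of the wrapping pattern being described — a thin subclass that pins the node selector so users never see the Kubernetes details. The import path shown is the 1.10-era contrib one, and the node-selector argument is named node_selectors there (the 2.0 provider renames it to node_selector), so treat the exact names as assumptions tied to your Airflow version:

```python
# Thin "GPU operator": the platform team bakes in the node selector once,
# and data scientists just use GPUPodOperator like any other operator.
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator  # 1.10-era path

class GPUPodOperator(KubernetesPodOperator):
    def __init__(self, **kwargs):
        kwargs.setdefault("node_selectors", {"gpu": "true"})  # renamed node_selector in 2.0 providers
        kwargs.setdefault("namespace", "data-science")        # assumed namespace
        super().__init__(**kwargs)

# Usage inside a DAG looks like any other operator, e.g.:
# train = GPUPodOperator(task_id="train_model", name="train-model",
#                        image="my-org/trainer:latest", cmds=["python", "train.py"])
```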
There is another, maybe controversial, question: are real-time refreshes in the UI expected to be part of Airflow 2.0? I wouldn't say 2.0, and just one comment on that: when we started the Outreachy mentorship with Kaxil, the REST API was one of the projects and the second project was to improve the UI of Airflow — and I don't think we had anyone even try it. [Laughter] I think that answers the question.

On whether the DAG code itself will be available with DAG versioning: it depends too much on how you deploy your DAGs — is that code still available? We're not storing the code in the database, so is the code available, are the exact libraries you need still available? It would be amazing to be able to run the version from three weeks ago, but sometimes you want to run the DAG run from three weeks ago with the latest code, because you just fixed a bug. There are edge cases around that, so essentially we're punting on that decision, because it's a hard one to answer and we're not going to tackle it for now.

Any plans for Airflow to deploy streaming jobs — in other words, will Airflow itself become a streaming platform? I don't think we plan to be a streaming platform, but we do plan to reduce the amount of time it takes to launch a task. I'd also say that anyone with significant experience running a Kafka cluster can attest that, while there are use cases for streaming, it is a monumental infrastructure effort to maintain a streaming platform. And if you look at the Airflow web page, I think one of the first sentences is that Airflow is not a streaming solution — and I think that's right.

Sorry to run, everyone, but I have a two-year-old daughter who needs to be put to bed, so I'm going to have to go — thank you very much, everyone, for joining us.

Another question, maybe somewhat interesting: are there plans to offer Airflow on cloud marketplaces? As far as I know there is one Bitnami image on Azure; I'm not sure whether there are plans for other marketplaces, unless something around the Airflow operator brings that about. Kamil — we cannot hear you, you're muted... okay, you can talk now. I think there is a thirty-second lag on the stream. On the questions that were already answered: the Bitnami image is also available on the Google Cloud Marketplace, so we can deploy it from there, and in the case of GCP you can also use Cloud Composer. Kaxil — I believe Astronomer has something in the Amazon Marketplace, is that correct? ... Kaxil, I believe Astronomer has an offering in the Amazon Marketplace, is that correct? I cannot hear you; for whatever reason it doesn't look like it's working. Let me see if there are any other questions to answer... I think we can slowly wrap up — it's been over an hour and a half already. I'm not sure we got that last answer. Thank you, guys, and thank you all for coming — this has been really fun. Thanks a lot, and thanks again, Brian, for hosting us, that was great. One comment from my side: if it weren't for the COVID situation, we probably wouldn't be here talking like this — six people, some from Warsaw, some from London, one from Los Angeles, and Brian in New York, all on one stream. That's a kind of unexpected upside of the COVID situation, and I think it's great, actually. Thank you, thank you guys so much, thank you.
Info
Channel: Apache Airflow
Views: 5,607
Id: znowFIBK1lk
Length: 76min 18sec (4578 seconds)
Published: Thu May 14 2020