Flexible, Easy Data Pipelines on Google Cloud with Cloud Composer (Cloud Next '18)

Captions
[LOGO MUSIC PLAYING]

JAMES MALONE: Good afternoon, everyone. Welcome. I'm James Malone. I'm a product manager with Google Cloud. And I'm joined today--

FENG LU: Hi. Good afternoon. I'm Feng Lu. I'm a SWE with Google, and I've been working on Composer since its inception.

JAMES MALONE: Today we're going to cover Google Cloud Composer. Before I begin: Cloud Composer recently went GA, and there was a lot of work from Googlers to make it happen. So to all of the Googlers, thank you. But far and above that, Composer is a labor of love of the open source community. We wanted to take a second before we begin today to say thank you to everyone who has participated in the development of Apache Airflow, who's given input, who has written code, who's used it. Truly, thank you, and we're excited to be a part of the Airflow community now with Composer.

The first thing we want to do today is give an overview of Composer. Composer is a new service built on Apache Airflow, so we'll also talk about Airflow -- we want to level set and talk about Composer, why we did it, and what it does. After that, we're going to go in depth on Apache Airflow, look at a DAG, run through a demo of Cloud Composer, and then cover next steps on how you can get involved with both Airflow and Cloud Composer.

Before Composer, there were a few ways to create, schedule, and manage workflows on Google Cloud Platform, and to be totally honest, they were not the best. On one end, you had very low cost, very easy, but pretty inflexible and not very powerful solutions -- mainly just putting something in a crontab, scheduling it, and letting it run. On the more complicated end, you had customers developing really complicated frameworks to schedule, orchestrate, and manage things on Google Cloud Platform. Really powerful, but it took a team of engineers to do. In our opinion, neither of these is an ideal solution, because people end up focusing on things that are not really what they're trying to do. They're developing orchestration engines. They're developing description languages. They're not focusing on what they set out to do, which was just having a workflow run and monitoring that workflow from time to time.

So we thought it should be easy to create, manage, schedule, and monitor workflows across Google Cloud Platform. I call out all of those steps individually because they're all really important parts of the lifecycle of a workflow. It's not just about scheduling something and letting it run, and it's not just about monitoring something. We really wanted to think about the whole process holistically and find something that would let somebody create a workflow, schedule it to run, look at how it's running, and then manage that workflow based on what's happening.

So we came up with Cloud Composer. If you didn't catch it last week -- we didn't make a lot of noise, because Next was this week -- Cloud Composer recently went GA. It's generally available in a couple of regions inside of Cloud Platform, and it's based on Apache Airflow. The best summary of Cloud Composer is that it's managed Apache Airflow. There are a few things we really wanted to tackle with Composer. First, we wanted an end-to-end, GCP-wide solution to orchestration and workflows. Second, it was really important to us that Composer works both inside of Cloud Platform and outside.
To be totally blunt, if we developed something that just worked inside of Google Cloud, it would be missing the mark, because hybrid cloud, multi-cloud, and just not locking people in are all a fact of life. If we had developed something proprietary, we thought, from the outset, we would be failing. We wanted it to be really easy. We don't want a really complicated workflow system, because then people are just wrangling with the infrastructure and the description language, and that just sucks -- it's not a great use of time. It was also really important to us that it's open source, for a few reasons. Again, we don't want you to be locked in. We want people to be able to look under the hood and see what's happening. We also wanted people to be able to contribute and to be part of a larger community.

A common question we've gotten since we launched Composer into alpha, then beta, and now GA is: what's the difference between Composer as a managed Apache Airflow service and just running Airflow on my own on a VM or a set of VMs? There are a few things we tried to tackle. First, we wanted a seamless and integrated experience with Cloud Platform. That means Composer is available inside of our command line tooling, and there's a Google API -- it's not just a standalone thing that feels like a separate product. Second, we wanted security. With Cloud Composer, you have IAM and you have audit logging. It's not a separate product; it acts and feels, from a security and auditability perspective, the same as any other Cloud product. We also wanted it to be easy to use when things weren't going quite the way you expect. When you're developing or running workflows, some pieces of a workflow may not always run exactly the way you want, so Stackdriver integration was very important to us, and it's a first-class citizen with Composer. We also wanted to make it really easy to manage your Airflow deployment. That doesn't just mean creating and deleting your Airflow deployment -- it means doing things like setting environment variables, or installing and maintaining Python packages. Things you could conceivably do on your own, but that aren't a value add when you're doing them on your own.

Because we've built on Apache Airflow, there's a core set of support for several products inside of Google Cloud Platform. For example, there are Dataflow operators, BigQuery operators, and Cloud Storage operators, and the support for Google Cloud Platform products is expanding with each Airflow release. That's a core part of what the team is doing, either within the team itself or with other teams inside of Google: expanding the breadth and depth of support for Cloud Platform products inside of Airflow. Really importantly, Airflow supports a whole host of things outside of GCP. It supports services -- you can go talk to things like Slack or JIRA. It supports technologies -- you can call REST APIs, you can hit FTP servers. All of the support for non-Google things is absolutely usable within Airflow and also Cloud Composer. We didn't want to break or limit what you can do outside of GCP with Composer, so all of the cool things you can do with Airflow outside of GCP are very usable.

Since our GA happened late last week, not a lot of people may have noticed, so we wanted to call out a few things that just launched with our GA. We launched support for a couple of new regions inside of GCP.
There's expanded Stackdriver logging, which you'll see today in the demo. There are expanded Cloud IAM roles. And we took a bunch of fixes that will appear in future versions of Airflow, or new additions, and backported them to the Cloud Composer release. We try not to modify the Cloud Composer Airflow version too much from the mainline version of Airflow itself, but once in a while there are fixes, tweaks, and additions that we backport. Our general philosophy is that unless there's a JIRA associated with it, we won't inject it into Composer. Because, again, we don't want to make Composer a black box -- that's just not a value add, and it's an alternative form of lock-in.

Since not a lot of people may be familiar with Airflow, or may have just heard of it, we wanted to quickly cover Airflow and some of its core concepts, just to establish a baseline. We're really excited about Airflow. We love Airflow, and we want other people to love Airflow and get excited about it. If you are totally unfamiliar with Airflow, it's an open source project incubating in the Apache Software Foundation. It's been around for a few years, and I think it's fair to say it's become one of, if not the, leading open source packages to create and schedule workflows. Airflow is really interesting because all of your workflows are code -- they're Python code. So they're highly approachable, you can do a lot of different things, and you'll see part of that in the demo today. You can do things like programmatically generate workflows, which is really cool.

One of the questions we faced, especially when we made a bet on Airflow about a year ago, was: why Airflow? What we really wanted to do with Composer is tie the strengths of Airflow as an open source package and open source community to the strengths of Google Cloud Platform. On the Google Cloud Platform side, we were very good at running infrastructure, creating services, adding layers of security and auditability, and maintaining costs. But as I mentioned, we didn't have a workflow and orchestration solution. Airflow did have a really strong workflow orchestration solution. It had a whole bunch of connectors for services inside and outside of Google Cloud Platform, and it already had a description language and a defined set of APIs. So we wanted to join the two inside of Cloud Composer. As we've developed Cloud Composer, we've also started contributing back to the Airflow community. The KubernetesExecutor is a good example -- and I'll talk about it a little bit later in the deck -- of something that is really interesting to us.

There are a few key concepts in Airflow you may want to be aware of if you've never used Airflow in depth before. First, all your workflows are graphs. A workflow is a series of tasks, and the tasks fit into a graph. That can be a very simple graph -- it can be just one node, which is a bash script that just tells the time. It can also be a really complicated graph; our build system for Airflow, which we'll show you, is a good example of a complicated graph. A graph has a series of tasks. Tasks are essentially steps -- something that happens. Maybe you're running a SQL query, or running a BigQuery query. Within those tasks, Airflow has operators and sensors. An operator is essentially something that tells something else to do something. A good example is the BigQuery query operator, which tells BigQuery to go run a query.
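To make the "one node that just tells the time" case concrete, here is a minimal sketch of a single-task Airflow DAG using the Airflow 1.x BashOperator. The DAG name, start date, and schedule are illustrative assumptions, not anything from the talk.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator  # Airflow 1.x import path

    with DAG(
        dag_id="tell_the_time",              # hypothetical DAG name
        start_date=datetime(2018, 7, 1),
        schedule_interval="@hourly",
    ) as dag:
        # The entire workflow: a single task that prints the current time.
        tell_time = BashOperator(
            task_id="tell_time",
            bash_command="date",
        )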
There's also the concept of a sensor in Airflow, which essentially waits until some condition becomes true; once it's true, the workflow proceeds. Airflow itself also has a lot of really interesting, deep functionality that we encourage people to use inside of Composer, and it's another reason we thought Airflow was a great bet. You can define connections and have your workflows use certain connections. You can set SLAs on your workflows and see what's meeting SLA and what isn't. You can pass information between tasks. There's a lot of really interesting and advanced functionality inside of Airflow. We get a lot of questions of the form "can Airflow do x, y, or z?" -- and often the answer is yes, Airflow actually can do x, y, and z. And with that, I'm going to turn it over for a look at Airflow in depth.

FENG LU: Thank you, James. Next, we're going to review some of the product details -- how we constructed the product and some of the design decisions we made -- so you know what Composer's capabilities are. In Cloud Composer, we introduced a new concept called an environment. It's very similar to a Kubernetes cluster or a Cloud Dataproc cluster: essentially, it's a collection of managed GCP resources that gives you the functionality needed to run Apache Airflow. Inside a single GCP project, you can create multiple Composer environments. And all the environments are integrated with Google Cloud Storage, Stackdriver logs, as well as Cloud [INAUDIBLE].

There are a few ways to interact with the product: you can use the Cloud SDK, you can use the Cloud Console, or you can use the REST API. Functionality-wise, all three methods are equivalent. However, I do want to point out one thing about the Cloud SDK. To make it convenient for Composer users -- so that you don't have to manage two sets of command line tools, the Composer command line tools and the Airflow command line tools -- we tunnel Airflow commands through the Composer gcloud commands. So in a single place, you can manage your Composer environment and, at the same time, interact with the Airflow environment.

Earlier I mentioned that a Composer environment is really a collection of GCP resources. Here, I'm going to zoom in one level and explain how and why we decided to use the following GCP resources to construct the Composer service. At a very high level, you'll notice there are two projects. One is called the customer project; the other is called the tenant project. The customer project is probably the one you're familiar with: you interact with GCP by creating a GCP project, and that's the customer project. The tenant project is a new concept. It's really the same as an ordinary GCP project, except that the tenant project is managed and owned by Google. As we walk through the detailed architecture, we'll explain why we made the design decision to have a portion of the resources live inside the tenant project. Airflow itself, if you look at the way it's constructed, has a microservice-like flavor: you have an Airflow web server, an Airflow scheduler, and an Airflow metadata database. Those components naturally map to the wide range of GCP services we offer. For example, I'll start with the Airflow scheduler and the Airflow workers.
If you look at the Kubernetes Engine cluster, we decided to host both the workers and the scheduler inside that cluster. The reason we do that is that it allows you to conveniently package your workflow application dependencies, because essentially all your tasks run inside containers. The workers and the scheduler communicate through the Celery executor setup. Moving on, we have the Cloud SQL proxy. We decided to host the Airflow metadata database inside the tenant project, so that only the service account you used to create the Composer environment has access to the metadata database. This is really for enhancing security, as the Cloud SQL database -- the Airflow database -- houses all the valuable metadata about your workflows. Think about connection credentials stored in the database: you obviously don't want just anyone in your project to be able to access that credential information.

Walking down the right side, we have the Airflow web server, which interacts with the database and surfaces all the workflow information. We decided to host the web server inside App Engine so that it's publicly reachable -- you don't need a clumsy proxy setup to access it. But at the same time, you don't want to make your web server open to anyone on the internet. So we collaborated with another service in Google Cloud, called Identity-Aware Proxy, so that only authorized users are able to access the web server. Later in the session, we'll give you a sense and a feel for how that works. We also make it extremely easy to configure access to the web server -- it's exactly the same way you would configure an IAM policy.

Moving left, we have GCS. We use GCS as a convenient store for you to stage your DAGs. Deploying a new workflow to Composer is as simple as dropping a file into a GCS bucket. We understand that sometimes you also need to stage workflow artifacts, and we manage that for you as well, so your artifacts are nicely staged in the GCS bucket and you can retrieve them later. Finally, all the interactions and all the logs are streamed to Stackdriver Logging. The Airflow UI itself comes with task logs, but you can't really find out what's happening if, for example, an Airflow worker crashes or the scheduler runs into an exception -- at that point you lose visibility into what's happening. That's why we decided it makes a lot of sense to offer this additional Stackdriver logging capability.

That's the architecture. Now I'm going to explain a little bit about workflows, because not everyone is familiar with workflows or with the way Airflow expresses them. At a very high level, a workflow consists of a collection of tasks and their dependency relationships. In this example, you have files landing in HDFS. Whenever a new file is added to HDFS, you may want to kick off your workflow, which copies the file from HDFS to GCS. Once it's in GCS, or Google Cloud Storage, maybe you want to trigger a BigQuery operator that loads the data in, then runs some query, and makes the result available, maybe via a notification.
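As an editorial aside, here is a hedged sketch of the Cloud-side portion of the workflow just described, written against the contrib operators available around Airflow 1.10: a sensor waits for a file in Cloud Storage, the file is loaded into BigQuery, and a query runs over it. The bucket, object, dataset, and table names are placeholders, and the on-prem HDFS copy and the final notification step are left out.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.contrib.sensors.gcs_sensor import GoogleCloudStorageObjectSensor
    from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator
    from airflow.contrib.operators.bigquery_operator import BigQueryOperator

    default_args = {
        "start_date": datetime(2018, 7, 1),
        "retries": 1,
        "retry_delay": timedelta(minutes=5),
    }

    with DAG("hdfs_to_bq_example",            # hypothetical DAG name
             default_args=default_args,
             schedule_interval=timedelta(days=1)) as dag:

        # Sensor: wait until the exported file shows up in Cloud Storage.
        wait_for_file = GoogleCloudStorageObjectSensor(
            task_id="wait_for_file",
            bucket="my-landing-bucket",               # placeholder bucket
            object="exports/daily.csv",               # placeholder object
        )

        # Load the file into a BigQuery staging table.
        load_to_bq = GoogleCloudStorageToBigQueryOperator(
            task_id="load_to_bq",
            bucket="my-landing-bucket",
            source_objects=["exports/daily.csv"],
            destination_project_dataset_table="my_dataset.staging_table",
            autodetect=True,                          # infer the schema for this sketch
            write_disposition="WRITE_TRUNCATE",
        )

        # Run a downstream query over the freshly loaded data.
        summarize = BigQueryOperator(
            task_id="summarize",
            sql="SELECT COUNT(*) AS row_count FROM my_dataset.staging_table",
            use_legacy_sql=False,
        )

        wait_for_file >> load_to_bq >> summarize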
A few things you probably noticed in this example workflow description: there are tasks associated with workflows -- for example, you want to run a BigQuery job. There are also dependencies -- you want to wait until the data is available in GCS before you start your BigQuery job. And there's a component of triggering something: a file suddenly appears in my on-prem HDFS, and that triggers my workflow. Those are some of the elements, or building blocks, of Airflow workflows.

I'm going to start by giving everyone an example of how you can build a very simple workflow that consists of two or three tasks. Airflow, unlike other orchestration and workflow solutions, expresses workflows as code. So instead of having a giant configuration file expressing your workflows, you write your workflow as Python code. Roughly speaking, there are about five steps to define a workflow in Airflow. The first step is the imports. From the Airflow module, you import the DAG -- the acronym for directed acyclic graph, which is really another way of saying workflow. Then you import all the operators you're going to use in the workflow, in this case a BigQuery operator, and also trigger rules, which allow you to specify inter-task relationships.

Once you have all the import statements ready, the next step is to define the arguments to your tasks or your workflow. You probably noticed there's a default_dag_args. This is a very convenient thing provided by Airflow: if you have arguments common to a number of your tasks, instead of specifying those arguments on each and every Airflow task, you can specify them at the beginning of your workflow, and they're passed in automatically by the Airflow DAG model. Imagine doing that in a configuration language -- it's harder, because you'd probably have to copy and paste a lot of duplicated lines.

Once you have all the workflow data specified, you define your workflow. There you give it a name, a schedule interval, and you pass in the default args you just defined. Within the DAG, you then define your tasks. In this case, we have two tasks: a BigQuery operator task and a BigQuery-to-GCS operator task. As James mentioned earlier, Airflow has a wide range of supported GCP operators, and both the BigQuery operator and the BigQuery-to-Cloud-Storage operator are available in Airflow. Once you specify all the tasks and define all the task arguments, the next step is to chain them up by declaring dependencies. In this specific example, the line simply says: run bq_airflow_commits_query first, and after that, run the export-to-GCS task. As simple as that.

The reason we decided to go this route in Airflow and Composer is that there are a lot of nice things you can do once a DAG, or workflow, can be specified as a program. It gives you version control. It allows you to dynamically generate DAGs. It also gives you support for templates, so that you have a template of your workflow and can dynamically instantiate it. Like I said, the first nice thing with DAGs as code is that Airflow DAGs support Jinja templating. So you can specify a templated command, and on every single run you're able to reconfigure that task.
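For reference, here is a hedged reconstruction of the two-task example walked through above -- run a BigQuery query, then export the result table to Cloud Storage -- using the Airflow 1.x contrib operators. The query, dataset, table, and bucket names are placeholders rather than the exact demo values, and the templated {{ ds }} date in the query is only there to illustrate the Jinja templating point.

    # Step 1: imports -- the DAG model, the operators used, and trigger rules.
    import datetime

    from airflow import DAG
    from airflow.contrib.operators.bigquery_operator import BigQueryOperator
    from airflow.contrib.operators.bigquery_to_gcs import BigQueryToCloudStorageOperator
    from airflow.utils.trigger_rule import TriggerRule

    # Step 2: arguments shared by every task in the workflow.
    default_dag_args = {
        "start_date": datetime.datetime(2018, 7, 1),
        "retries": 1,
        "retry_delay": datetime.timedelta(minutes=5),
    }

    # Step 3: the DAG itself -- a name, a schedule interval, and the shared arguments.
    with DAG(
        "composer_bq_demo",                                   # placeholder DAG name
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_dag_args,
    ) as dag:

        # Step 4: the tasks. The sql field is Jinja-templated; {{ ds }} becomes
        # the run date on each execution.
        bq_airflow_commits_query = BigQueryOperator(
            task_id="bq_airflow_commits_query",
            sql="""
                SELECT COUNT(*) AS commits                     -- placeholder query
                FROM `my_project.my_dataset.commits`
                WHERE DATE(commit_time) <= '{{ ds }}'
            """,
            use_legacy_sql=False,
            destination_dataset_table="my_dataset.airflow_commits",  # placeholder table
            write_disposition="WRITE_TRUNCATE",
        )

        export_commits_to_gcs = BigQueryToCloudStorageOperator(
            task_id="export_commits_to_gcs",
            source_project_dataset_table="my_dataset.airflow_commits",
            destination_cloud_storage_uris=["gs://my-bucket/commits.csv"],  # placeholder bucket
            export_format="CSV",
            trigger_rule=TriggerRule.ALL_SUCCESS,
        )

        # Step 5: chain the tasks -- run the query first, then the export.
        bq_airflow_commits_query >> export_commits_to_gcs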
Likewise, think about needing to generate 1,000 tasks, or 10,000 tasks, of the same type. It's fairly tedious to do that in a configuration language. With DAGs as code, a simple couple of lines will give you 1,000 tasks. The third thing is that Airflow naturally matches the software development process, where you have modules and submodules. In Airflow, you have workflows and sub-workflows -- they're called DAGs and subDAGs. So in the example on the left side, you could have many tasks, or, if you realize you want something reusable, you can package those tasks into a subDAG and include the subDAG in the parent DAG.
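To make the thousand-tasks point concrete, here is a minimal sketch of generating many similar tasks in a loop. The DAG name, task contents, and chaining are trivial placeholders.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    with DAG("many_tasks_example",            # hypothetical DAG name
             start_date=datetime(2018, 7, 1),
             schedule_interval=None) as dag:
        previous = None
        for i in range(1000):
            # Each loop iteration stamps out another task of the same shape.
            task = BashOperator(
                task_id="task_{}".format(i),
                bash_command="echo processing shard {}".format(i),
            )
            if previous is not None:
                previous >> task              # chain them so they run in sequence
            previous = task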
So with all that, I'm going to give a Composer demo. In that demo, we're going to show you how to interact with the product: what it looks like, how you query workflow status, and how you would trigger workflow execution from an external system. All right, now we're going to switch to the demo.

JAMES MALONE: I'll turn on the screen mirroring and then switch to the demo here. This is also known as the game of how quickly do we know Chrome OS. All right, screen mirroring is on, so hopefully we can switch to the demo now. And wait. There we go. Chrome to the rescue.

FENG LU: As you can probably tell, I'm not a Chrome user. [LAUGHTER] All right.

JAMES MALONE: Sorry about that. OK. There we go. Cool.

FENG LU: So we have -- let's go back. We have a very simple and nice Cloud Console interface for you to interact with the environment. To create a Composer environment, here's what you need to do: specify a name -- Google Next -- and specify the number of nodes you want, like three nodes or ten nodes. Pick which location you want to deploy in -- that's an [INAUDIBLE] one. You have the option to define a machine type: if you feel you're going to have some CPU-intensive workflows, you can configure your Composer environment with more powerful machines. Likewise, you can also specify a network and subnetwork. This is for the case where you need a shared VPC, or you want to reuse a network you have defined in your project. One thing I want to call out in the configuration is that we allow you to provide your own service account. You don't have to rely on the default Compute Engine service account; you can supply any service account you have. This allows you to restrict the set of services your Composer environment can interact with. We also allow Airflow configuration overrides -- for example, if you want to increase your DAG load timeout value from 100 seconds to 200 seconds, this is where you specify those overrides. Once you have all the parameters in, just click Create. That's all you need to do to bring up a Composer environment.

Creating the environment takes a while right now, because, as I mentioned, we host the Airflow web server inside App Engine, and it simply takes App Engine ten-plus minutes to get the application deployed for you. So we have pre-created a Composer demo environment we can take a look at. There's the service account, there's the name -- you get a view of all the details pertaining to your environment. Earlier I mentioned that we use GCS to deploy workflows, so let's click into the DAGs folder.

We have one simple DAG, the BQ demo, which is the example I just walked everyone through. Deploying a new DAG is just as simple: copy a file into this GCS bucket. Meanwhile, if you want to interact with the service through the Airflow web UI and see what workflows are there, you don't need a clumsy proxy setup -- a single click gives you the Airflow web UI. This is the demo DAG I just mentioned; it's really just two or three tasks. I do want to mention that this is not to say you can't create more complicated DAGs. James mentioned earlier that within Google we have a complicated DAG that helps us run CI/CD, so we make sure that changes submitted to Airflow upstream will not break the GCP operators.

The other part I want to show everyone: earlier, I mentioned that the web server is protected by IAP and only authorized users can access it. So I'm going to open an incognito window. All right, just give me one second. James, I need your magical power to restore my demo page.

JAMES MALONE: No worries.

AUDIENCE: [INAUDIBLE]

FENG LU: Cool. Thank you.

JAMES MALONE: We also have somebody who knows the Chrome keyboard commands well. You'll need to copy and paste that. You should be set.

FENG LU: So what I'm trying to do is log in with my personal Gmail.

JAMES MALONE: And he uses two-factor authentication. [INAUDIBLE] idea.

FENG LU: Yeah.

JAMES MALONE: Yeah.

FENG LU: I'm sorry, I forgot to bring my phone. [LAUGHTER] But trust me, it works. I notice a few of you are taking photos of this now. Please try it offline -- I guarantee that you won't be able to access this server, this website. Cool.

So the other thing I mentioned earlier: because we have this IAP-protected web server, it guarantees that only authorized users can access your DAGs, and that opens up a lot of possibilities. Here we have a demo where you trigger DAG execution from a Google Cloud Function. Really, it doesn't matter whether it's GCF or a different computer -- as long as you have the necessary IAM credentials and you've been added to the IAM policy of Composer, you can remotely trigger the execution of a DAG. So I'm going to test this function. Behind the scenes, this function calls a URL -- Airflow hosts its API server in the same application as the web server -- so what it really does is send a RESTful request to the web server I just showed you. Let's do a page refresh. Now you see a DAG run being triggered, and this is the manual run ID generated by Airflow. This is what I mean when I say that once you have the DAG, you can trigger it from anywhere. It gives you a lot of flexibility: you don't necessarily need access to gcloud, to a proxy, or even to the Google Cloud Console. While we're waiting -- I just realized this particular DAG execution has already completed, the manual run, at this time.
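As an editorial aside, here is a hedged sketch of that pattern: a small Python client (which could run in a Cloud Function) that authenticates through Identity-Aware Proxy and asks Airflow's experimental REST API to start a DAG run. The web server URL, IAP client ID, and DAG name are placeholders, and the exact auth flow and API path depend on your google-auth and Airflow versions -- this is not the code used in the demo.

    import requests
    import google.auth.transport.requests
    import google.oauth2.id_token

    # Placeholders -- substitute your environment's values.
    WEBSERVER_URL = "https://example-tp.appspot.com"             # hypothetical Airflow web server URL
    IAP_CLIENT_ID = "1234567890-abc.apps.googleusercontent.com"  # hypothetical IAP OAuth client ID
    DAG_ID = "composer_bq_demo"                                  # hypothetical DAG name


    def trigger_dag(conf=None):
        """Ask the IAP-protected Airflow web server to start a new DAG run."""
        # Obtain an OpenID Connect token whose audience is the IAP client ID;
        # inside a Cloud Function this uses the runtime service account.
        auth_request = google.auth.transport.requests.Request()
        token = google.oauth2.id_token.fetch_id_token(auth_request, IAP_CLIENT_ID)

        # POST to the Airflow 1.x experimental API to create the DAG run.
        endpoint = "{}/api/experimental/dags/{}/dag_runs".format(WEBSERVER_URL, DAG_ID)
        response = requests.post(
            endpoint,
            headers={"Authorization": "Bearer {}".format(token)},
            json={"conf": conf or {}},
        )
        response.raise_for_status()
        return response.json()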
The other thing I want to show everyone is that in this GA release, we also support the Kubernetes pod operator. In the past, it has been a pain for users to manage dependencies. As James mentioned, we make it convenient so that you can install any Python packages. But sometimes your dependencies live outside of Python -- then what can you do? That's why, in this release, we made it very convenient: we backported the Kubernetes pod operator from Airflow 1.10, which was just released a couple of weeks back, and made it work out of the box for you. You don't have to worry about what the service [INAUDIBLE] is, what the credentials are, all that stuff. So I'm going to show you another demo of that. Let's just wait one or two minutes for the DAG to show up. This is what I mentioned: Airflow has a directory scan interval, which lets you specify how often to scan for new changes, new files, or new workflows in your DAG folder. There's a default value, but through Composer we give you the option to override that configuration value.

While waiting for the workflow to show up, we can look in more detail at what the Airflow UI offers you. You have different views of your DAG -- the graph view, the tree view. You can also see the code at any time, in case you need to switch back and forth between the graph representation of your workflow and the code representation. You can also conveniently add connections for your workflows. Remember that Airflow stores all the connection credentials in the metadata database, and it conveniently offers this web UI for you to manage your connections. At any time you can also see all the configuration that comes with this particular installation. As I mentioned, that configuration can later be changed through the managed service.

Let's come back now. We notice that the Kubernetes pod example we added has shown up, and it's pretty straightforward. Let's take a look at the code. Just to recap: you have all the import statements at the top, then all the definitions -- all the workflow data you're going to use in your workflow. Then you define your DAG, and after that you define your tasks. In this case, we have two tasks. The first one is really just a bash operator. The second one is what I mentioned -- a Kubernetes pod operator, which in this case pulls a Perl image and computes the value of pi.

At any time, in the Airflow web UI, you can inspect the log output. Sometimes there might be a delay in how soon those logs appear, and this is where Stackdriver logging is helpful. So let's look at the logs emitted at the same time by the worker. As you notice, we have this pod launcher -- that's related to launching the pod. The nice thing we added to the Stackdriver logs is that we organized them and provide labels for you to access a specific task's logs. So here's what I'm going to do: I'm going to filter down to all the logs pertaining to this specific task. There you go -- you see all the logs being generated for you, while it may take Airflow a while before it has the logs populated on its side. And now the task is done, so we can take a look at the output. As you'll notice, it starts a GKE pod and then waits for that GKE pod to be done. At some point -- wait, let's see, I'll scroll over -- there we go: the job is pending, then running, and then it prints the value of pi.
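For reference, here is a hedged sketch of the kind of two-task DAG just demoed -- a bash task followed by a KubernetesPodOperator that runs a Perl container to compute pi -- based on the operator backported from Airflow 1.10. The DAG name, image, command, and namespace are illustrative assumptions, not the exact demo code.

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
    from airflow.operators.bash_operator import BashOperator

    with DAG(
        "kubernetes_pod_example",             # hypothetical DAG name
        start_date=datetime(2018, 7, 1),
        schedule_interval=None,
    ) as dag:
        say_hello = BashOperator(
            task_id="say_hello",
            bash_command="echo 'starting pod demo'",
        )

        # Runs a one-off container on the environment's GKE cluster; in Composer
        # the cluster credentials are supplied for you, so no extra connection
        # setup is needed here.
        compute_pi = KubernetesPodOperator(
            task_id="compute_pi",
            name="compute-pi",
            namespace="default",
            image="perl",
            cmds=["perl"],
            arguments=["-Mbignum=bpi", "-wle", "print bpi(2000)"],
        )

        say_hello >> compute_pi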
And then finally, the task is marked successful. So with that, that concludes the demo session. Feel free to try the service offline -- and like I said, I promise that you can't access my web server. All right, a quick recap on the demo: we showed you how to create a Composer environment, the various ways to interact with the environment, how to deploy workflows, how to monitor workflows, how to trigger a workflow from an external source, and how to inspect your workflow status with Airflow logging as well as Stackdriver logging. Over to you, James?

JAMES MALONE: Excellent. And just to add on, the security that we use for the Identity-Aware Proxy is the same load-balancer-level security that's used for things like the Google Cloud Console. We are paranoid about security, and the two-factor authentication is a good example of that -- it's very core security. So you would have seen a rejection, and that rejection is handled very low down in the stack.

I want to talk about next steps: our involvement with Airflow, where Composer is going, and how you can get started. There are a lot of Composer questions that we get, and we want to go ahead and answer some of the most common ones. If you have questions, I'll have details on how you can bug us -- please bug us. We're here as a resource, we're not shy of questions, we love input, and we're here to collaborate with people. These just happen to be the questions that we get 99% of the time.

There are questions about whether people can install their own Python packages, Airflow operators, and Python-specific things. The answer is yes. Inside of the Cloud Storage bucket that contains your DAGs, you can add custom Python modules -- that's the interesting thing about everything being code -- and there are also specific folders to add plugins for Airflow itself. There are questions about whether we touch the environment after it's been created. The answer is no. One of the soft spots in Airflow right now is how changes are handled version to version, so right now we think it's best that we don't change your environment's version, or the version of any of its components, once you deploy it. That may change over time as the Airflow community and Composer mature. Will Cloud Composer be offered in more regions? Yes, we are actively working on it, and GA is a good example of that. Is there a graphical way to create DAGs? A very, very common question -- the answer is no. But this is something of extreme interest to us and also of interest to the Airflow community, so the answer is no today, but I would expect it to happen at some point in the future. Which version of Python can be used with Composer? This is probably the most Composer-specific question that we get. Right now it's 2.7. We are actively working on Python 3.5 support, and you can probably expect that in a future Composer release.
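As an editorial aside on the plugins point above, here is a minimal sketch of a custom operator packaged as an Airflow plugin -- the kind of file you could drop into the environment's plugins folder in the Cloud Storage bucket. The operator itself is a made-up example, written against Airflow 1.10-era APIs.

    from airflow.models import BaseOperator
    from airflow.plugins_manager import AirflowPlugin
    from airflow.utils.decorators import apply_defaults


    class HelloOperator(BaseOperator):
        """Toy operator that just logs a greeting."""

        @apply_defaults
        def __init__(self, name, *args, **kwargs):
            super(HelloOperator, self).__init__(*args, **kwargs)
            self.name = name

        def execute(self, context):
            # Runs on the worker when the task executes.
            self.log.info("Hello, %s", self.name)


    class HelloPlugin(AirflowPlugin):
        # Registers the operator with Airflow under this plugin name.
        name = "hello_plugin"
        operators = [HelloOperator]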
Future directions for Cloud Composer: we showed off some of the work we've done with Kubernetes, and the intersection of Kubernetes and Airflow is exceedingly interesting to us. Google -- the Composer team and other teams -- has been involved in work on Kubernetes and Airflow, and the KubernetesExecutor is a good example of that. It's not the last example or the end of the line, I think, in terms of Airflow and Kubernetes. We're also working on additional operators: we want to support additional API surface area coverage of the products that are already inside of Airflow, things like Dataproc, BigQuery, or Dataflow. We also want to expand support for new GCP products inside of Airflow. Third, resource usage: right now, you can create your Composer environment with a fixed size, and you can also resize that environment. We're thinking of ways we can increase the elasticity of that environment based on the workflows executing on it. Much like a lot of our managed services, we want to tightly constrain the resources you're using for an environment to what's actually going on in that environment.

There are a ton of different things you can do to get involved with either Airflow -- if you don't like Composer, that's totally OK -- or Composer itself. For Airflow, the Apache website for Airflow is a really good place to get started. There are links to the mailing lists -- they're pretty active -- and a lot of information on how to get involved in that community. In terms of Composer, we have our documentation for the product. There's also a Google Group mailing list that you can join; please ask questions on that mailing list. There's also a Stack Overflow tag that we look at as a team, so if you have Composer-specific questions, please use that tag. The more questions you ask with that tag, the less likely it is that other people will need to hunt and peck for that information over time -- and we just can't anticipate all of the questions that might come up. As a plug, there is an Airflow Meetup group if you happen to be local to the Bay Area. There's going to be an event in September hosted by the Cloud Composer team at the Google Sunnyvale office, so if you're local, please sign up -- we just put up the Meetup event for that. It's something to check out if you're curious.

With that, thank you all very much for being here. We sincerely appreciate it. And again, thank you to everyone who has used Composer and given us feedback. We're open to questions -- we have five minutes. If you have questions, please come up to the mic. Happy to answer any questions you have. [APPLAUSE] [LOGO MUSIC PLAYING]
Info
Channel: Google Cloud Tech
Views: 33,063
Rating: 4.8709679 out of 5
Keywords: type: Conference Talk (Full production); pr_pr: Google Cloud Next; purpose: Educate
Id: GeNFEtt-D4k
Length: 45min 32sec (2732 seconds)
Published: Wed Jul 25 2018