AWS re:Invent 2019: [REPEAT 1] Deep dive into Amazon Athena (ANT307-R1)

Captions
Hello everyone, my name is Janek and I'm the product manager for Amazon Athena. I'll be joined for this talk by my co-presenter Anthony, who's a principal engineer on my team, and I know I speak for both of us when I say we're absolutely delighted to be here today to talk with you about Athena: how we think about Athena, where we see Athena's future, and some of the work we have done over the last year to help realize that future. By way of a quick show of hands, can I get a sense of how many of you use Athena regularly? Fantastic, thank you for being Athena users. For those of you who have not yet used it, I'll do a quick intro.

Athena is a very powerful query service that allows you to submit SQL statements to make sense of your structured and unstructured data. Out of the box we are integrated with AWS Glue for metadata management, and we support querying data that sits in Amazon S3. Athena is also serverless, so when you submit your queries to us you don't have to manage any servers; we do all of that heavy lifting for you. That is what Athena is today, and we've received a lot of feedback that this is not where you want Athena to stay. Some of the future items we want to consider: you've told us you want to use Athena for ETL use cases, and we want to support that; you want to use Athena to query any data source, not just data that sits in S3, and we want to enable that; you want to use Athena for ML use cases, and we want to be right there to help you run your inference. That is what we expect Athena's future to be. Here are some of the related breakout sessions; I think some of these may have taken place already, but we do have a workshop later in the day and we hope to see you there.

This is how we're going to spend our time today. We'll go over some common Athena usage patterns, understand the key personas that exist in an organization, and dissect their use cases and challenges. I will then introduce some of the new work we have done over the last 12 months to help solve those challenges, and then Anthony will walk us through a demo which I know you will find as exciting as I did when I first saw it.

Without much further ado, let me start by showing you some of the most common Athena usage patterns. A lot of customers use Athena for ad hoc investigations. In this example the customer has data in various data sources: Redshift, Amazon RDS, DynamoDB, and a lot of other places. They run ETL jobs, and the data is ingested into Amazon S3. They then populate the metadata in the Glue catalog, and analysts in the organization use Athena queries to do ad hoc investigations. At times they also connect a BI tool to Athena, which they do using our JDBC and ODBC drivers.

Customers are also increasingly using Athena for SaaS use cases. In this example the customer has hot data in a relational store like Amazon Aurora, and warm and cold data in S3. When an external customer request comes in, it goes to API Gateway, and the API then joins the data in Aurora with the data in S3 via an Athena query.

We also support, and a lot of customers use us for, ETL and subsequent query analysis. That is easier because customers already have their service logs sitting in Amazon S3. They then dump their application logs, along with data they source from external partners, in raw format into Amazon S3. They use Athena's CREATE TABLE AS SELECT (CTAS) and INSERT INTO statements to update the metadata and the partitions in the Glue catalog and, at the same time, optimize and store the data in its final S3 location in a format such as Apache Parquet that is optimized for analytics; a sketch of what such a statement might look like follows below. Analysts in this organization again use Athena queries to do their investigations.

These three are not an exhaustive list of Athena use cases, just some of the most common ones, and the information I've learned from many of you here I just want to relay back to the community for the benefit of all.
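As a rough illustration of that CTAS-based ETL pattern, here is a minimal sketch using standard Athena CTAS and INSERT INTO syntax; the bucket, table, and column names are hypothetical rather than anything shown in the talk.

```sql
-- Convert raw application logs (already registered in the Glue catalog)
-- into partitioned Parquet that is cheaper and faster to query.
CREATE TABLE logs_optimized
WITH (
    format = 'PARQUET',
    external_location = 's3://my-data-lake/curated/logs/',  -- hypothetical final location
    partitioned_by = ARRAY['dt']                            -- partition column must come last in SELECT
) AS
SELECT
    request_id,
    user_id,
    status_code,
    date_format(from_iso8601_timestamp(event_time), '%Y-%m-%d') AS dt
FROM raw_logs;

-- Later loads can append new partitions to the same table.
INSERT INTO logs_optimized
SELECT request_id, user_id, status_code,
       date_format(from_iso8601_timestamp(event_time), '%Y-%m-%d') AS dt
FROM raw_logs
WHERE event_time >= '2019-12-01';
```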
So let's switch gears for a little bit and understand who the key personas are in an organization. For the purposes of this presentation, let's take the example of a fictional e-commerce company. This company has been around for a while, and over time its technology stack has evolved; teams in this organization use the data source that is the best fit for the application they're building. For instance, when a user request first comes in, the user is shown a product catalog, and that product catalog is stored in Amazon Aurora, a relational data source. They have a payment service that uses HBase running on EMR. They also have an order-processing service that processes active orders, where active orders are defined as orders that have been submitted by customers but not yet fulfilled; these have to be frequently rendered back to the application, and to enable that they use Redis. Customer-specific information such as email address, shipping address, and phone number is stored in DocumentDB. Infrastructure logs are stored in CloudWatch Logs and CloudWatch metrics, and these are used to debug any and all issues that may happen. Our e-commerce company also has a business intelligence and data analyst team that uses Amazon Redshift for long-term and historical analysis, and finally they have a shipment-tracking service that relies on DynamoDB.

So who are the key personas in our e-commerce company? We have Anna the analyst, Carlos the administrator, Richard the engineer, and Maria the scientist. Some of their key use cases are listed here. Anna runs SQL queries, schedules reports to generate the key performance metrics of the organization, and populates business dashboards. She also has to do a lot of ad hoc investigations: for example, if customers of our e-commerce company are complaining that their orders are stuck in a weird state, Anna would be tasked with aggregating information from all of these various data sources in near real time and identifying which specific orders were stuck. Carlos has the non-trivial task of creating, maintaining, and managing the data lake; he also has to specify access control policies for any and all users and roles that exist in the organization. Then we have Richard, and I personally identify the most with Richard because I used to be him in a previous life. Richard builds stuff: he builds SaaS applications using the Athena APIs, and he also helps Carlos by building ingestion pipelines and Maria by building applications that can be used to run machine learning inference. Maria, our scientist, is primarily focused on building and training the most accurate ML model possible; she also helps analysts and other users in the organization run machine learning inference.
Now, some of the main challenges these personas face: for Anna, data is spread across all of these various sources, and in order to identify and isolate all of the customers and orders that were stuck, she first has to rely on the data engineering team to complete all of their ingestion in near real time before she can run her analysis. Carlos struggles because data is stored in a variety of applications; he has to build, maintain, and scale the ingestion pipelines and workflows required to get all of that data from all of these sources into a central data lake. Additionally, teams are going to continue to want to experiment, so if a new team wants to try a new technology, Carlos has to find a way to support that. Richard, our engineer, has to support multiple formats; he has to learn how to access and retrieve data from multiple data sources, and he has to do it all over again whenever the organization adopts a new data source. Maria, in order to train her ML model, has to extract data from all of these sources, and she needs to do it as fast as possible, because otherwise her model might be accurate for a month ago but not for today. Anna and the other analysts in the organization also rely on her to run inference, so she can become an inference bottleneck for the organization.

So the next question is: we know there are all these challenges, but how do we solve them? Let me introduce some of the features we've built that help solve these challenges, and the first one I want to introduce is federated queries in Athena.

What is a federated query? A federated query is a query that you write in SQL that can scan data from data sources that could be relational or non-relational, object stores, or even your own custom data sources, regardless of whether the data source runs in the cloud or is hosted on premises; the federated query can scan data across both. You can use federated queries in complex ETL pipelines, for ad hoc investigations, and in your applications.

Why do you need a federated query? Because of this picture. Those of you with eagle eyes may have noticed that this picture includes an additional data set, advertising data, and our e-commerce company could have stored this advertising data in an entirely new data source. The point is that architectures are going to continue to evolve, engineering teams are going to continue to use the data sources they see fit for their purpose, and aggregating all of this data in a central repository is going to continue to be a challenge. That is why you need a federated query.

Here are the key concepts you need to know about federated queries. When you submit a federated query to Athena, Athena first has to connect to a data source, which we refer to as a federated data source. To connect to the federated data source, Athena uses a new construct called a data source connector. These data source connectors are connector code that runs as AWS Lambda functions in your account. When you submit a federated query to Athena, we can invoke multiple Lambda functions in parallel when connecting to your federated data source, in order to fulfill the results in the most performant manner possible.
So is a federated query simple to use? I think it is extremely simple; there are three main steps to follow. The first is to deploy the data source connector. The second is to register the data source connector; when you register your connector you can specify a friendly name, which we call a catalog name, and you can use this catalog name to refer to your federated data source in your queries. The third is simply to start writing your SQL queries: to reference a table that exists in your federated data source, use the three-part identifier syntax, which is catalog name, dot, your database, dot, your table. And finally, we hope that you're amazed, and you rejoice, when you see the results of your federated query returned in a very performant manner.

So how do you deploy a data source connector? By now we know that data source connectors are Lambda functions, and there are two easy ways to deploy them. The first is a one-click deploy mechanism using a service called the AWS Serverless Application Repository. The picture here shows a connector that we have also open sourced, the Athena CloudWatch Metrics connector, published to the Serverless Application Repository; users can simply browse to this location, fill in a couple of parameters, hit deploy, and they're good to go. The second way is to use any mechanism you see fit, whether that's the Lambda console, the Lambda APIs, or anything else, to upload the code manually to AWS Lambda. This picture shows a couple of connectors that Anthony and I uploaded to Lambda using the Lambda APIs and the Lambda console, and again, as soon as you deploy, you're good to go.

Registration of a connector is also extremely easy: fire up the Athena console, hit "Connect data source," and you'll see a list of connectors available for you to use; on the next screen, fill in a couple of parameters, including the catalog name, and then just start writing your queries. Additionally, we also support a construct called a registration-less federated query. We think this will be extremely useful for prototyping before you register your connector with Athena for good and enable everyone in your organization to use it. To use the registration-less mechanism, simply use the prefix "lambda:" followed by the function name — the name you used when you deployed your connector — as the catalog name. For instance, in the query SELECT * FROM "lambda:cmdb", you're telling Athena to run a federated query using the Lambda function called cmdb. Both forms are sketched below.
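As a rough sketch of those two ways to reference a federated source, assuming a connector function named cmdb and hypothetical database and table names; the quoting follows Athena's three-part identifier syntax:

```sql
-- Registered connector: "cmdb" was supplied as the catalog name at registration time.
SELECT *
FROM "cmdb"."ec2"."ec2_instances"
WHERE region = 'us-east-1'
LIMIT 10;

-- Registration-less form: prefix the Lambda function name with "lambda:" and use it as the catalog.
SELECT *
FROM "lambda:cmdb"."ec2"."ec2_instances"
WHERE region = 'us-east-1'
LIMIT 10;
```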
Here is the list of data source connectors that AWS has open sourced. We've open sourced connectors to HBase, DocumentDB, and DynamoDB; we've written a JDBC connector that you can use to connect to data sources such as Aurora and Redshift; and we've released connectors to Redis, CloudWatch Logs, and CloudWatch metrics, among others.

So what do you do when you want to federate into a data source that does not yet exist in this list? For that purpose we've also released a Query Federation SDK that you can use to write your own connector to your data source. Some of the features of the SDK are listed here, and I'll run through them quickly. We support S3 spill, which is useful when the response to your request exceeds the capacity of the Lambda response; you can then spill to S3 to fulfill the request. We support partition pruning, so the connector scans only the relevant bits of information your query requires and does not inflate the amount of data scanned. We have the ability to invoke multiple Lambda functions in parallel to optimize the performance of your query. The SDK maps your request and response types to Apache Arrow types, which also helps query performance. We provide you the identity of the user or entity that submitted the query, and allow you to hook in the authorization model or protocol you want. We also respect the fact that you could be running analytics, or a federated query, against a data source you use live in production, or against a federated data source that is not as scalable as Athena, and for that we allow you to specify congestion control mechanisms.

This is how a self-service ETL job could look using a federated query. You write one SQL query to get data from multiple data sources, and Athena writes all of the query results to Amazon S3, so your data lands in S3 right there. You can wrap this federated query in a CREATE TABLE AS SELECT or an INSERT INTO statement; what that does is create the right partitions in the Glue catalog, and you can also specify the format of the data you want to store in S3. A sketch of that pattern follows below. Finally, to run this repeatedly as a regular workflow, you can schedule it using a mechanism such as Lambda, or you can build your own scheduling application.
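A minimal sketch of that self-service ETL idea, assuming a hypothetical connector referenced registration-less as lambda:redis; the schema, table, and column names are made up for illustration:

```sql
-- One statement: read active orders from a federated source (a hypothetical Redis connector)
-- and write them to S3 as partitioned Parquet, registering the table and its partitions
-- in the Glue catalog along the way.
CREATE TABLE curated_active_orders
WITH (
    format = 'PARQUET',
    external_location = 's3://my-data-lake/curated/active_orders/',  -- hypothetical location
    partitioned_by = ARRAY['order_date']
) AS
SELECT
    order_id,
    customer_id,
    order_total,
    order_date
FROM "lambda:redis"."orders_db"."active_orders";

-- On subsequent runs (scheduled, for example, from a Lambda function), append the latest day only.
INSERT INTO curated_active_orders
SELECT order_id, customer_id, order_total, order_date
FROM "lambda:redis"."orders_db"."active_orders"
WHERE order_date = cast(current_date AS varchar);
```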
The next feature I want to introduce is our support for the Hive metastore. A lot of customers gave us feedback that they want to use Athena because of its simplicity, but they are unable to do so simply because Athena only supports the Glue catalog for metadata management. This feature removes that limitation: you can now use a data source connector to connect Athena to any metadata management engine. Out of the box we've provided a reference implementation for the Apache Hive metastore, and when you use this connector with Athena, you can run a single query that joins data referenced in the Hive catalog with data referenced in the Glue catalog and with data in any federated data source. This slide shows how easy it is to use our reference implementation to connect to the Hive catalog: fire up the Athena console, hit "Connect data source," select the right options, fill in a couple of parameters, and you're good to go.

So, to bring it all together, how does federated query help our personas? Anna, our analyst, is now able to select data from any data source for her analysis; she can create a real-time, data-driven narrative and identify which specific customers were affected and which orders were stuck. Carlos, the administrator, can deprecate the old ingestion pipelines and move to a serverless ingestion pipeline using just a couple of SQL queries. Richard, our engineer, does not need to learn different data access paradigms; he can use SQL queries to source data from any data source, and he no longer needs to help Carlos build and scale ingestion pipelines. And Maria, our scientist, can use data from any data source to train her model, which helps her improve the accuracy of her models.

The next feature I want to introduce is user-defined function support in Athena, or UDFs. What are the challenges without UDFs? At times our personas — Anna, Richard, or Maria — might need to pre- or post-process data before ingestion or before using it in their analysis, and that is really difficult to achieve in SQL queries without UDFs. To achieve it, they have to schedule work across, and switch between, multiple different applications, which is frankly not the ideal experience. Carlos also sometimes has to provide restricted access to Anna; for example, he might need to redact sensitive information, which could be an entire column, from her. Without UDFs, what Carlos does is duplicate the original data set minus the sensitive column that needs to be redacted, and then give Anna access to the copy, which again is not the most ideal solution in the world.

With this feature you can now invoke your own functions in Athena. UDFs are powered by AWS Lambda, and you can use the same Query Federation SDK to author and define them. Since they are powered by Lambda, we also support network calls inside these UDFs. You can invoke UDFs in the SELECT phase and/or the filter phase of your query, and you can invoke as many UDFs as you want; Athena automatically optimizes the performance of your query by batching rows together, so you only have to focus on your processing logic. This is the promise of UDFs in Athena: you write your function once, you deploy it once, and then anyone in your organization can use the UDF any number of times in their queries. The link I've pasted below highlights a use case and a blog post that shows how you can use a combination of federated queries and UDFs to build an ingestion pipeline. Here's a code snippet showing how to invoke a UDF: the bottom half of the snippet shows how you define your UDF, and the top half shows how you invoke it in SQL queries. As you can see, UDFs are simple to write and even simpler to deploy and invoke; a sketch of the invocation syntax appears below. Athena supports scalar functions as UDFs at this time.
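As a rough sketch of what invoking a UDF looks like in SQL, assuming the preview-era syntax described in this talk (a USING FUNCTION clause of type LAMBDA_INVOKE) and a hypothetical Lambda function named "udf-demo" that implements a scalar redact function:

```sql
-- Declare the UDF inline: Athena matches this signature against the Java method
-- exposed by the Lambda function named in lambda_name.
USING FUNCTION redact(value VARCHAR) RETURNS VARCHAR TYPE LAMBDA_INVOKE
WITH (lambda_name = 'udf-demo')
SELECT
    order_id,
    redact(customer_email) AS customer_email   -- UDF applied in the SELECT phase
FROM orders
WHERE order_date = date '2019-12-01';
```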
So how do UDF capabilities help our personas? Anna can now invoke custom code in her queries without having to switch between multiple applications. Carlos does not need to duplicate data for column-level access controls; he can simply provide a UDF that scrubs the data from the original data set before providing access to Anna. He can also transform data before ingestion as part of his serverless ETL workflow; for example, Carlos might want to transform a particular string into an enum before ingestion, and that is really simple to achieve using a UDF. Richard, our engineer, can also invoke custom code in his applications, and the best part is that he can build and deploy a library of UDFs once, and then anyone in the organization can invoke those UDFs in their queries any number of times. Maria can transform data she needs for model training; for instance, to optimize training performance she sometimes needs to transform certain strings into a data type like integer, and that is easy to accomplish with UDFs. She can also apply pre- or post-processing logic to her result sets.

The next feature I want to introduce is machine learning capabilities in Athena. If you've been paying attention to Andy Jassy's keynotes over the last couple of years, you know that AWS has a mission to democratize machine learning, and this feature helps us do that. Why do you need ML capabilities in Athena? Our customer conversations have revealed that if you consider the number of employees as a dimension, the number of employees who are proficient in SQL far, far exceeds the number who are proficient in ML-specific technologies. Similarly, the number of employees who are proficient in SQL far outnumbers the number who are proficient in a programming language such as Python, Java, or C#. That is why we believe the ability to invoke machine learning inference from SQL queries is a huge advantage.

Using this feature, you can now invoke your machine learning models and use them for inference in your SQL queries. You deploy your ML model once on Amazon SageMaker, and anyone in your organization can use that model any number of times to run inference. If you combine Athena's federated query with this feature, you can run inference on data in any data source, regardless of whether that data source inherently supports analysis or inference. You don't need to build applications specifically to run inference, and there is no additional setup required to use this feature.

This could be a workflow when you want to use Athena to help train your ML model: you run a federated query to select data from any data source, you transform the data set as required, and then you use that data set to train your ML model and deploy it on Amazon SageMaker. And to use Athena to run inference: your model is by now already on SageMaker, you can write a UDF to pre- or post-process your data, and then anyone in your organization can run ML inference using data from any data source.

I've listed some sample ML use cases here. For example, our scientist Maria can now help us predict whether or not a proposed new video game would be a hit, and Anna can tell us whether there are any revenue anomalies in our transactions, or whether any particular transaction looks fraudulent. The link to the blog below highlights how you can use Athena in your training workflows and in your inference workflows. Here's a sample query to invoke inference: the top third shows how you invoke SageMaker inference in your SQL query, the middle third shows how you invoke a UDF, and the SELECT statement shows how you can use a combination of ML inference and UDF invocation in the same SQL query for your analysis. A sketch of that shape follows below.
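A minimal sketch of that combined shape, again assuming the preview-era syntax (SAGEMAKER_INVOKE_ENDPOINT for the model, LAMBDA_INVOKE for the UDF); the endpoint name, Lambda function, table, and columns are hypothetical:

```sql
-- Declare both an ML inference function backed by a SageMaker endpoint
-- and a scalar UDF backed by a Lambda function, then use them together.
USING
FUNCTION predict_fraud(order_total DOUBLE, item_count INT) RETURNS DOUBLE
    TYPE SAGEMAKER_INVOKE_ENDPOINT WITH (sagemaker_endpoint = 'fraud-detector-endpoint'),
FUNCTION normalize_currency(amount VARCHAR) RETURNS DOUBLE
    TYPE LAMBDA_INVOKE WITH (lambda_name = 'udf-demo')
SELECT
    order_id,
    predict_fraud(normalize_currency(order_total_raw), item_count) AS fraud_score
FROM orders
WHERE order_date = date '2019-12-01';
```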
So how do Athena's ML capabilities help our personas? Anna can now easily run inference on data from any data source, and she can incorporate ML inference to enrich the dashboards and reports she populates. Carlos does not need to duplicate data into multiple formats simply to enable ML-specific use cases, and he does not need to build and maintain an extra application only so that other analysts in the organization can run inference. Richard can invoke ML capabilities in his SaaS applications without having to learn new technologies or new data access paradigms. And Maria is no longer an inference bottleneck for the organization, because anyone can now use SQL queries to run inference, and she can switch her focus back to creating, training, and deploying the best and most accurate ML model possible.

The next feature I want to introduce is Athena's integration with Lake Formation. As this diagram shows, creating and maintaining a data lake is certainly a non-trivial task: you first have to set up your storage, then you have to clean, prepare, and catalog your data, then you specify access control policies, and then you open the data up for analysis. You can see from the right side of the page that you can use Athena to make your data available for analysis. We've already covered how you can use Athena's federated queries, UDFs, CREATE TABLE AS SELECT, and INSERT INTO to move your data over and to catalog, clean, and prepare it; in this section we'll see how the Lake Formation integration helps you specify and configure access control policies for everyone in your organization.

So how does our Lake Formation integration help? Before this feature, if Carlos had to give Anna access to run a few queries, he had to manage access control in three separate services: first in Athena, to allow Anna to run her queries; second in the Glue catalog, to enable her to retrieve the partitions and the metadata for the table; and third in S3, where the underlying data actually lives. With the Lake Formation integration, Carlos can achieve all of this centrally using Lake Formation. Lake Formation also allows for fine-grained access control down to the column level: Carlos can simply go to the Lake Formation console, or use the Lake Formation APIs, to specify column-level access policies. Similarly, Carlos no longer needs to go to three or four different services simply to answer auditing and monitoring questions such as who accessed this data, with what permissions, with what roles, and at what time; using the Lake Formation API or console he can determine auditing and monitoring information in real time.

This is how Anna's Athena request workflow would look: Carlos first sets up user access policies and permission levels in Lake Formation, and then when Anna's request comes in to Athena, Athena requests temporary credentials from Lake Formation and uses those credentials to retrieve only the data that Anna has access to. This screenshot shows analyst one, represented by Anna, being granted access to all of the columns of a particular table, and analyst two, represented by Maria, being granted access only to the jurisdiction_name column. When both analysts run the same SQL query, a simple SELECT * over this table, then as expected Anna is able to retrieve all of the columns, whereas Maria is only able to retrieve the jurisdiction_name column; the sketch below illustrates the idea.
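To make that concrete, here is a hedged sketch of what the two analysts might see; the database and table names are hypothetical, and the column filtering is enforced by the Lake Formation grants rather than by anything in the query text:

```sql
-- Both analysts submit exactly the same statement.
SELECT * FROM "demo_db"."jurisdictions" LIMIT 10;

-- Analyst one (Anna), granted all columns, gets rows such as:
--   jurisdiction_id | jurisdiction_name | population | region
-- Analyst two (Maria), granted only jurisdiction_name, sees a single column:
--   jurisdiction_name
```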
I have more features to talk about, but I also want to be mindful of the time, because I know you're champing at the bit to see the demo. Very quickly, some of the other new and notable features: we have added advanced geospatial functions, which are also available in preview, and the steps to onboard onto the preview are listed at the bottom. Earlier this year we released Athena workgroups, which you can use to isolate your workloads and to specify cost guardrails. We've added support for INSERT INTO, added support for the Okta IdP, and updated our JDBC and ODBC drivers multiple times throughout the year.

The key point we would like you to take away is how organizations benefit when they use Athena: your teams can continue to choose the best fit-for-purpose database strategy, and Athena does not bind you to any proprietary formats. You can use any metadata store, keep your data in any format, put it in any data source, and use any IdP you want; if we don't support it today, give us that feedback, and we want to support it tomorrow. We're very confident that we can improve the velocity of your analytics and ETL when you use Athena. With that, I'll hand over to Anthony to run you through the demo.

Thanks, Janek. As Janek mentioned, my name is Anthony Virtuoso and I'm a principal engineer with Athena. Today I'm going to walk you through how easy it can be to get started with Athena Query Federation, and to do that we're going to do two things. First, we're going to deploy a ready-made connector and use it in a federated query in Athena; by ready-made connector I mean one that the Athena team has written and published for you to use in the AWS Serverless Application Repository. The second thing we'll do is write a new connector, to a custom data source with a proprietary file format, live today; it will also include a couple of UDFs, and then we'll use those in a federated query from Athena.

Up on the screen right now I have our GitHub repo. Everything I'm going to show you — in fact, pretty much everything related to federated queries in Athena — is open source and available in this repository. There's a pretty detailed tutorial as well, which I'll point out, in this athena-example folder; a lot of what I'm going to show you is essentially walking through that tutorial.

To get started, I'm going to navigate over to the AWS Serverless Application Repository and search for "Amazon Athena Federation." That is the name of the author the Athena team uses when it publishes connectors to the Serverless Application Repository, and you can see it pulled up about ten or so connectors, each with the AWS verified author badge. When you see that badge, you can rest assured that the code you're about to deploy has been authored and is maintained by the Athena team.

I'm going to scroll down to our Athena CloudWatch connector, because that's the one we're going to deploy right now and then use in a query. When you click on the connector you get some information about it, down to the very nitty-gritty details of what deploying it will do, in the form of a CloudFormation template. Further down you can see what permissions this connector needs: in the case of the CloudWatch one, it needs access to S3 for potentially spilling large responses, and it needs the ability to read from CloudWatch Logs. If you scroll down more, you'll get a detailed readme of what all the configuration options are, how you can use this connector, what tables it supports, and so on.

On the right-hand side are the configuration options we can provide for this connector. The first thing we need to fill in is an S3 bucket that the connector can use for spill; I've already created one, so I'll type that in. The second thing we need to provide is the name of this source: what we want to use in our SQL queries and what we want the Lambda function to be called, and I'll simply call it "cloudwatch" to keep it easy. Any data that spills to S3 is encrypted by default; you can optionally use KMS for key management of that spill, but by default we use a local source of randomness from your Lambda function, so I'm going to leave that as the default. I can also control how much memory and how much Lambda runtime the function is allowed to use. And lastly, I can set a spill prefix in that S3 bucket, which makes it easier to set up S3 lifecycle policies to delete the spilled data after the query finishes.
I'm going to check the box that gives the Serverless Application Repository permission to create the scoped-down IAM role this connector will need, and if I scroll down to the bottom, I can deploy the connector. Essentially what's happening right now, while the Serverless Application Repository prepares to deploy this connector, is that it's taking the CloudFormation template I showed you before, filling in the variables I just provided for my spill bucket and the name I want to give the Lambda function, and the first thing it does is create those scoped-down roles, which is what it's doing right now. Once it has created the scoped-down roles, it goes ahead and deploys the Lambda function itself. That entire process usually takes about a minute or two, depending on the complexity of the IAM role the connector needs.

While that's deploying, I'll scroll down here and show you a little more information. With each of the connectors we've published, we include a pretty detailed readme; in the case of this particular connector it also gives us an example query we can run just to get started quickly, and it gives us all the instructions we would need if we wanted to build this connector from scratch, maybe because we want to modify it for our own needs or just because we're curious how it works.

If I scroll back to the top, I can see that the deployment succeeded. Now I'm going to switch over to my Athena console and make sure that I'm running my Athena queries in a special workgroup called AmazonAthenaPreviewFunctionality. This is how you indicate that you want your query to use preview capabilities; we wanted to make it very easy for you to opt in, but not to accidentally send your production workloads through those preview capabilities. I'm already in that workgroup, so I'll switch back to the query editor, where I've already copied and pasted the query from the readme, except I've made a couple of changes: I've substituted in the name of my source everywhere it previously said "lambda function name."

What this query is going to do is grab all the details of the log lines that exist, through our CloudWatch connector, for a log group called /aws/lambda/cloudwatch — this is actually the log group for the Lambda function that runs the connector — and it queries a table called all_log_streams, which is essentially a view over all the log streams contained in that log group. We'll go ahead and run the query, and it will probably take about ten or fifteen seconds. What's happening right now is that Athena is calling the Lambda function during query planning to ask what all the different log streams are that it will need to read, and then it calls the Lambda function for each log stream to actually perform the read. You can see I got my log lines back down here; the query is sketched below.

The reason I'm showing you this is just to give you an idea of how quickly you can get started with federated queries. Obviously, depending on the source you're connecting to, you're going to need to provide more or less configuration information; for example, if you're connecting to an RDS instance for federation, you need to provide the VPC details so that the Lambda function will have network connectivity to your RDS instance.
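A hedged reconstruction of what that readme-style query might look like, assuming the connector was named cloudwatch and referenced registration-less; the exact column names exposed by the CloudWatch connector (for example log_stream, time, and message) may differ from this sketch:

```sql
-- Query the connector's own Lambda log group through the all_log_streams view,
-- which spans every log stream in the log group.
SELECT log_stream, time, message
FROM "lambda:cloudwatch"."/aws/lambda/cloudwatch"."all_log_streams"
LIMIT 100;
```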
Now that we've seen how to deploy a ready-made connector, let's take a look at what it takes to write a connector from scratch. If you remember, I mentioned before that all of this is available in the GitHub repo; I want to call that out again, because we're going to go through some code right now and I don't want you to think you have to absorb all of it in this moment. My goal is really to show you how simple it can be to integrate with a custom source, or to take one of our existing connectors and modify it for your own purposes in case the generic one we've made doesn't quite fit your need.

To do that, I'm going to switch over to an editor where I've checked out a copy of this GitHub repository. There are essentially three files we're going to modify in order to create our custom connector. We provide this example project as a way for you to get started: it has all the boilerplate taken care of for you, and it has comments in each of the important functions telling you where to add your business logic, or your logic for connecting to your proprietary source.

The first file we're going to modify is called ExampleMetadataHandler, and as the name implies, the functions in this file provide Athena with metadata about our source; during query planning, Athena calls the functions we're about to modify. The first function we're going to implement is called doListSchemaNames; Athena calls this function to get the list of schemas, or databases, that our custom source supports, most likely because somebody ran a SHOW DATABASES SQL statement. I'm just going to uncomment the code that's already there in the example, and in this particular case all we're doing is returning a hard-coded list of schemas, in this case schema1. The second function we need to implement is very similar; it's called doListTables, and it's how Athena gets the list of tables in a particular schema or database. Again we're returning a hard-coded, fully qualified table name, saying that whatever schema you requested always has a table called table1. It's a somewhat contrived example, but I'm just trying to illustrate how simple some of these functions are.
The next function is a little more involved; it's called doGetTable, and this is the function Athena calls when it wants the details of our table: what columns it has, what their types are, whether it's partitioned, and if so, on which columns. The first thing we do in this function, in this particular example, is define our partition columns, and here we're saying that this table, table1, is partitioned on year, month, and day. Just below that is our first interaction with Apache Arrow as part of this connector; if you recall, Janek mentioned earlier that a lot of this protocol is based on Apache Arrow in order to have a very easy interchange format. In this particular case we're building an Apache Arrow schema object in which we define all the columns in our table. The first three columns we define are integer fields — year, month, and day — and they happen to correspond to our partition columns. The next field is a string field called account_id, followed by another string field called encrypted_payload. This is going to be an interesting column in our example, because the sample data we're going to upload to S3 and then query has this column AES-encrypted, so we won't be able to read it without the UDF we'll write later. We then add a complex column of type struct, and we say that this struct column, called transaction, has two child columns: an id column and a completed column. That's essentially the core of the schema of our table. The next things we add are pieces of metadata, and for the most part Athena doesn't interpret metadata on your table — you can think of these as table properties. The only time Athena really understands them is when it sees a piece of metadata whose name matches the name of a column; Athena assumes that's the documentation for the column, so if somebody runs a DESCRIBE on the table, Athena will show that value in the result as documentation for your user. That takes care of our doGetTable function.

The next function we need to implement is called getPartitions, and this is our first opportunity to control how Athena parallelizes reading our table. In this particular case we're looping over our partition columns — year, month, and day — and generating some preset number of partitions with random values. The interesting part of this function comes when we return that data to Athena: when Athena calls this function it provides us with the query predicate, so we can apply partition pruning and remove the partitions Athena doesn't need to read because they won't contain any relevant data. The way we do that is with this Block object. When we call block.setValue, under the hood the SDK writes that value into an Apache Arrow response object for us and returns a boolean that tells us whether or not that value satisfied the query's constraints. We collect all of those booleans to decide whether this particular partition is one Athena needs to read, and then we return either a 1 or a 0 to tell Athena whether we wrote a valid row. The sketch below shows where those constraints come from on the SQL side.
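To connect that back to the SQL side, here is a sketch of the kind of query whose WHERE clause supplies the constraints used for partition pruning; the table layout matches the example schema described above, and the specific values are arbitrary:

```sql
-- The year/month constraints are passed to the connector during planning,
-- so the getPartitions function can drop partitions that cannot match
-- before any data is read.
SELECT account_id, transaction
FROM "lambda:example"."schema1"."table1"
WHERE year = 2017 AND month = 11;
```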
The last function we need to implement in our metadata handler is called doGetSplits. This is our second opportunity to tell Athena how it can parallelize reading our table, and as the name implies, we're going to tell Athena how it can split up the reads for a particular partition. Similar to the table properties I mentioned before, Athena doesn't really know what's inside a split; for the most part a split is there for us, because Athena will call us for each split and say, "Can you go read this for me?" But there are a couple of fields on a split that Athena does understand. The first one is the spill location, and here I'm using a function provided by the SDK, makeSpillLocation, to create one. A spill location is essentially a path in S3 that is unique per split, so that if the data represented by the split is too large to come back in the Lambda response, the Lambda function can spill it to S3 and Athena will go get it from there. The somewhat interesting thing about this is that because we tell Athena the spill location ahead of time, Athena doesn't need to wait for our Lambda function to complete before it can go get that data; Athena will start to prefetch and watch that S3 location, and as soon as data arrives it will process it. The reason that's important is that if your Lambda function were going to read, say, 10 gigabytes of data, we don't want Athena waiting around for your function to read 10 gigabytes, write it to S3, and only then begin to process it; instead we want to pipeline those reads and writes, and having the spill location known up front lets us do that, because Athena can begin to read even before you finish writing. The second piece of information on the split that Athena understands is an encryption key, which Athena uses to decrypt any data that gets spilled. In this example I'm also using a utility from the SDK called makeEncryptionKey, which under the hood will use either KMS or a local source of randomness to generate the encryption key for me, depending on how I've configured my connector. The next few pieces of information we put on our split are year, month, and day; again, these are for us — Athena doesn't know what they mean, but it does know it's important to call our Lambda function for each split — and you'll see how we use the year, month, and day properties on the split in the next section. And there you have it: in, I think, five functions we've implemented all the metadata required for Athena to be able to plan a federated query.

The next thing we're going to do is move to the file called ExampleRecordHandler. Here we only have to implement a single function, the one Athena calls when it wants to actually read a split: for each partition we gave Athena, Athena called us to generate splits, and for each of those splits, Athena calls us to read the data associated with it. So let's take a look at what we need to do to read a split. The first thing we do is figure out what the split is asking of us, so in our case we grab the three properties we set on the split — the year, the month, and the day — and that tells us where we need to go do our read. For this example we're going to upload our sample data to a specific bucket in S3, so with this piece of code we grab a Lambda environment variable called data_bucket, which is essentially just the configuration for where our data is stored in S3. Again, this is a bit contrived; in a real-world situation you would be reaching out to the actual data store and figuring out where the data lives, and it's unlikely you would be hard-coding a single bucket or a single location. Just below that we can see what this example does with the year, month, and day information: it creates a path in S3 to a specific file that represents that partition or split, it then opens that file, and as we'll see later, it loops over that CSV line by line and translates it into Apache Arrow before sending it back to Athena. The next interesting thing we do in this file is set up that translation — how do we go from a CSV to Apache Arrow? — and to do that we use a utility from our SDK that hides a lot of the nuance of Apache Arrow.
The thing I want to call out here is that even though I'm going to walk you through a utility that abstracts a lot of Apache Arrow, you can at any moment choose to drop down into a more power-user mode and interact with Apache Arrow directly. In fact, if you do, you can often achieve better performance than with some of these abstractions, but I can assure you that the abstractions will minimize the amount of code you need to write to do your integration, so you have to balance performance against maintainability there. Essentially, what this GeneratedRowWriter does is give us the opportunity to provide it with the pieces of code required to extract and translate each of the columns in our source, or in the table we're reading, and at the end we use that GeneratedRowWriter to do some on-the-fly pseudo-code generation, so that when it's translating every row from the source to the destination it's not doing a lot of metadata lookups — is this column an integer or a string, and what's the coercion for that? — so we're doing a little bit of code generation here to improve performance.

Let's take a look at just one of these extractors, since for the very simple columns they're pretty basic. For the year column, we're essentially saying that we're going to take the first column out of the CSV, parse it into an integer, and return it; there's not a whole lot there, these are very simple columns. But if we move down a bit it gets a little more interesting, so let's take a look at the extractor for the account ID. It turns out that in this example, account ID is a sensitive column for the business, and they don't actually want to grant access to the entire thing; they only want to grant access to the last four characters. So what this extractor does is grab the account ID out of the CSV, check the length of that value, and return only the last four characters of it. In this way you're essentially applying masking to that column.

The last column we need to write an extractor for is a complex value: we defined the struct column called transaction earlier, which has two child columns, an id and a field called completed, a boolean that tells us whether or not the transaction had completed. The way we handle that here is to create a map that corresponds to the fields in our struct and use a utility from the SDK called BlockUtils to set that complex value. The thing that's different here from some of the other examples we looked at is that for complex values our SDK doesn't yet support predicate pushdown, so at the bottom we return true to say the row always passes the filter, and we rely on Athena to do the filtering for any of the complex types. With the other columns we wrote extractors for, like year, month, and day, the SDK underneath automatically takes care of applying any of the query predicates, which reduces the amount of data that gets sent to Athena. The last thing we do is call build on our GeneratedRowWriter to do that code generation, and at that point we've got a translator we can use as we loop over each of the lines in the CSV.
We then use a BlockSpiller utility to actually write the rows back to Athena; under the hood this BlockSpiller automatically takes care of chunking up our response, encrypting it, and potentially spilling it to S3 if the response gets too large. And there you have it — we're now done writing our connector. The next file we're going to look at contains the two UDFs that we'll use in our example query.

Our first UDF is called extract_tx_id, and it's a fairly simple function: all we use it for is to extract the id field from our complex transaction type. The first thing the code does is check whether the input is valid — is it non-null, does it actually contain an id field — and if it doesn't satisfy those checks it returns negative one. The reason for returning negative one in this example, instead of just throwing an error, is that I didn't want to fail my query; you have complete control over what you do when you encounter data that's maybe not perfect, whether that's returning a default value or trying to recover in some other way. Assuming the field is present, we get it from the map we were provided, cast it to an integer, and return it, and that's essentially it for our first UDF. Our second UDF is a little more interesting; it's called decrypt, and inside this function we attempt to decrypt the column value we're provided. The first thing that happens is that we need to grab our encryption key. In this example, if you check it out of our GitHub repository, there's just a hard-coded key in there, but in reality you would probably reach out to something like KMS or Secrets Manager, or wherever you hold your encryption keys, to retrieve the key you can use to decrypt that column. Once we have the key, we call our symmetric decrypt function, providing the value we want to decrypt as well as the key, and it returns the clear-text version of that column. Same as in the previous function, if we encounter any errors we just return a default value, in this case null; otherwise we return the decrypted value, and we're done with our two UDFs.

The next thing we need to do is compile, test, and then deploy this connector and UDF suite, and to do that our SDK comes with a helpful utility called publish. Essentially, I give publish the name of the S3 bucket where it can upload the compiled connector as it hands off to the Serverless Application Repository, I give it the name of the module in the Maven project that I want to compile and turn into a connector, and then I tell it what region I want to deploy to. I'll go ahead and hit enter here; it gives me a more complete list of what it's going to do, but essentially, once this command finishes, it will have tested and validated my connector and UDFs and published them to the Serverless Application Repository as a private application, just for me. Later on I can choose to make it public or share it with other people in my organization. I'll go ahead and hit yes; this will probably take a minute or so to compile and upload. It just finished running through its unit tests, and it's now turning the connector into a bundle that can be deployed to the Serverless Application Repository and uploading it. Okay — at this point the application is available for us in the Serverless Application Repository, so I'll go ahead and
switch back to my browser. If I navigate to the Serverless Application Repository and go to private applications, you can now see our example connector here, so I'll click on it, we'll deploy it, and then we'll use it. I'm going to give it a name — in this case I'll just call it "example" — and that corresponds to the name of the Lambda function as well as the name I will use in my query to access this source. It's asking me for the bucket where I uploaded my sample data, which I've done beforehand, so I'll type that in. It's also asking me for a bucket where it can optionally spill data, and I'll fill that in; everything else I can leave as the default, and I'll click deploy. Same as before, this will create the IAM roles the connector needs, then create the Lambda function and deploy the code we just wrote together, so we'll give it a moment to do its thing. Actually, while it's doing that, I'll show you the test query we're going to run.

Here we have the test query we'll execute. The first thing this test query does is define our first UDF, extract_tx_id; we essentially just need to declare its parameters and return type. The reason we do that is that when Athena runs the federated query and calls the Lambda function, the Lambda function matches the signature of the UDF as it appeared in the SQL with the signature of the Java functions we wrote in that file, and if it finds a match, that's essentially how the pairing with the UDF happens. We tell Athena that this UDF is of type lambda invoke and is hosted on a Lambda function called example, and then we do the same thing for our decrypt UDF. Then we select year, month, and day, as well as the account ID; we select our encrypted_payload column but pass it through our decrypt UDF so that we'll be able to read it, and then we call the extract_tx_id UDF on the transaction column so that we get back just the id field. Then we do FROM "lambda:example", which is how we tell Athena that this table is a federated table hosted on a Lambda function called example, and finally I'm scoping my query down to a single partition: 2017, November, the first. A hedged reconstruction of that query appears below.
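Pieced together from the description above, the test query probably looks roughly like the following; this is a reconstruction using the preview-era USING FUNCTION syntax and the schema1/table1 names from the example connector, so treat the exact column types and aliases as assumptions:

```sql
USING
FUNCTION extract_tx_id(value ROW(id INT, completed BOOLEAN)) RETURNS INT
    TYPE LAMBDA_INVOKE WITH (lambda_name = 'example'),
FUNCTION decrypt(payload VARCHAR) RETURNS VARCHAR
    TYPE LAMBDA_INVOKE WITH (lambda_name = 'example')
SELECT
    year,
    month,
    day,
    account_id,
    decrypt(encrypted_payload) AS decrypted_payload,  -- AES-encrypted in the sample data
    extract_tx_id(transaction) AS tx_id               -- pulls the id out of the struct
FROM "lambda:example"."schema1"."table1"
WHERE year = 2017 AND month = 11 AND day = 1;
```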
Let's see if it has finished deploying our connector — and yep, it looks like it has, so I'll go ahead and run this query. It should take about ten seconds or so, and again, what Athena is doing under the hood is calling our Lambda function during query planning to get the details of the table, then calling the Lambda function to do the read, and then calling the Lambda function again with batches of rows to execute the UDFs. And we can see we got our data back: we have our year, month, and day columns; our account ID column has been masked down to just four characters; our encrypted payload has been decrypted, so we can now read it and see that it says "secret" followed by the transaction ID; and lastly, we've successfully extracted the transaction ID from our transaction field.

So there you have it: in about 20 to 25 minutes we deployed a ready-made connector, we wrote a connector from scratch, and we wrote a couple of UDFs from scratch. I want to thank you all for joining us today; we're pretty excited to see how you use this new functionality. Janek and I will be hanging around for a few minutes afterwards for any questions, although we ask that we do that outside so they can prepare the room for the next talk. Thanks, everyone. [Applause]
Info
Channel: AWS Events
Views: 14,922
Keywords: re:Invent 2019, Amazon, AWS re:Invent, ANT307-R1, Analytics, Amazon Athena
Id: tzoXRRCVmIQ
Length: 60min 42sec (3642 seconds)
Published: Thu Dec 05 2019