Data Lake Day | Building Your Data Lake on AWS

Video Statistics and Information

Captions
I'm very excited to be here. My name is Aditya Chawla; thank you so much for your time today, thank you for coming over, and thank you to the people watching us live on the livestream as well. As I said, I'm a Technical Account Manager with AWS Enterprise Support. And I'm Prajakta Damle, product manager for AWS Glue and Lake Formation, which is a new service that's in preview right now. Awesome, thank you.

Today we're going to talk about building your data lake on AWS, so let's start with some positioning. There is more data than people think. As we all know, data sets are getting bigger, and they are coming from diverse sources — applications like CRM, ERP, and many others — and they all go into their own silos. The velocity at which this data arrives is also growing really fast, with machine-generated data and logs growing faster than business data, thanks to things like network-connected devices, microservices architectures, or just the growing popularity of DevOps.

When we build a data platform, we need to keep in mind that data platforms need to survive a long time, typically around 15 years. So we need to think about not only today's needs but also the needs for the next 15 years. With the assumption — well, we know — that data typically grows 10 times in five years, we need to build something that can scale a thousand times in 15 years.

There are also more people accessing this data and more requirements for making it available. Many teams within organizations would like access to this data, whether it's to run business reports, ad hoc analytics, or machine learning models. They would like the flexibility to self-onboard this data, run real-time analytics, run massive-scale queries, and search this data, all while adhering to security and compliance requirements. So you might wonder: how do I democratize data access to enable all the informed decisions that the sub-organizations within my organization need to make, but at the same time apply governance and control so that there's no mismanagement or misuse of data?

The solution here is a data lake, and many of our customers are moving to this data lake architecture. A data lake is an architectural approach where many types of data are managed in one place. The data can come from different sources, it can be structured, semi-structured, or unstructured, and it's all managed with a unified set of tools. This data is readily available to be categorized, processed, analyzed, or consumed by different teams in the organization. The data that comes in is written as-is in the data lake, so you don't have to think about what the predefined schema is going to be, or decide beforehand what questions you need to ask of the data.

So let's take a look at the concepts of a data lake, how it's different from a data warehouse, and what value it's going to bring. First of all, it's all in one place — a single source of truth — so that teams within the company know exactly where to go when they need the data they're looking for. It handles structured, semi-structured, unstructured, and raw data.
Structured data is things like CSV or Parquet files; semi-structured data is things like JSON or XML files; unstructured or raw data is text files or log files coming in. A data lake also supports fast ingestion — like I said, data is coming in at a very fast pace and it's only growing, so we need something that can handle that kind of velocity — and at the same time applications need to be able to read and consume this data really fast, so the data lake needs to support that too. And, like I said, we don't think about schema at the time of writing; we let the data come in and write it as-is, and the time of reading is when we decide what schema or structure the data is going to be read in.

It has to be designed for low-cost storage, and this is something very important: decoupling storage and compute. Back in the day, as we needed more and more storage, we also had to scale our compute resources, so as more data came in, our compute resources would also grow and then sit idle because they weren't really processing the data — unnecessary resources sitting idle. If you decouple storage and compute, you can ingest data as fast as it's coming in without scaling your compute; you only scale compute up when it's time to process or analyze the data, and once that's done you can scale it back down and stop paying for those resources. And of course, very importantly, it has to support whatever protection and security rules you need.

Data lakes help you cost-effectively scale. The first and foremost part of a data lake is being able to handle exabytes of data — like I said, we need to think about how to scale for the next 15 years, a thousand times — so we need to be ready to take that kind of data, and Amazon S3 is a perfect place because it's already capable of storing that amount of data.

Another thing I should mention: data lakes typically have three stages when we store data. The first stage is the landing dock or landing zone — some people call it the swamp — where the data that's coming in is written as-is, raw, without any modifications. The second stage is called the silver zone — some people call it the pond or the refinery — which is basically staging data: you take the raw data, make some modifications, do some experimentation and exploratory analysis, and store it in that second stage. And finally there is the catalog — some people call it the lagoon or the gold zone — where the data is refined and cataloged and ready to be consumed by any team that needs it. At the same time, if there are teams within your organization who are interested in the raw data directly, without any modifications, they should be able to access that raw data as well, and a data lake makes that possible; and if teams want the data from the second stage, the silver zone or the pond, they should be able to do that too. So when we build our data lake, we need to make sure it allows applications to consume data from any stage as required. And this process of loading, transforming, and cataloging, you only do it once — which is a big part of why it scales so well.
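As a rough illustration of those three stages, here is a minimal sketch of laying them out as prefixes in a single S3 bucket with boto3; the bucket name and zone prefixes are placeholders, not something prescribed in the talk.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and zone prefixes for the three stages described above.
BUCKET = "my-company-data-lake"                 # placeholder bucket name
ZONES = ["raw/", "staging/", "curated/"]        # landing dock, silver zone, gold zone

# Outside us-east-1 you would also pass CreateBucketConfiguration={'LocationConstraint': region}.
s3.create_bucket(Bucket=BUCKET)

# S3 has no real folders; writing a zero-byte object under each prefix just makes
# the zone layout visible up front. Incoming data then lands untouched under raw/,
# e.g. raw/clickstream/2019/04/04/events.json, and later jobs write transformed
# copies under staging/ and curated/.
for zone in ZONES:
    s3.put_object(Bucket=BUCKET, Key=zone)
```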
Regardless of how many applications are writing to the data lake and how many applications are reading from it, you only do this once: you load it once, you transform it once, and you catalog it once. You don't have to move the data once it's there, and because of that it's cost-effective — you're only keeping one copy of whatever data it is. In the beginning, when you first get started, you're only paying for the storage you're actually using, and over time, as more and more data comes in, you pay for whatever that scale is. The data is then available to any kind of tool within the organization — there are some examples at the top, like Redshift, QuickSight, EMR, or Athena — and when we store this data we make sure we use open formats and interfaces, so that teams can innovate and do more engineering on that data.

Let's take a look at some architectural principles. First, build decoupled systems — this is very common these days. For example, data comes in, we store it, we process it, we store it again, we analyze it, and finally we get some answers. If we put all of that on the same machine or in the same place, you have to scale everything together; maybe the first store-and-process step needs more scale than the second. If you decouple them, so that one component is unaware of the other and they just communicate with each other, it's a better design: you can scale each of them independently.

Second, at AWS we believe it's very important to use the right tool for the right job. For example, if I have to tighten a screw, the best tool is a simple screwdriver. I could get a very expensive Swiss Army knife that would do the job, but the right tool is a screwdriver, so I'd rather use a screwdriver. Looking at the data structure, latency, throughput, and access patterns for what we're trying to do, and using the right tool for that particular job, is a very important architectural principle.

Third, leverage managed and serverless services wherever available. This goes for any managed service: you can build these resources yourself, manage and maintain them yourself, but you'd be spending a lot of time and resources on things you don't really have to do — somebody else can do that for you. If you let AWS take care of that overhead, you can focus on the application at hand.

Next, event-journal style design patterns. Like I said, once you store the data, transform it, and catalog it, the principle is that we should not touch that data — let it be immutable. If an application requires the data in a different format, or requires data from various different areas of the data lake in a particular structure, it's better to use materialized views that update themselves frequently, so that your main data — the data you originally loaded and transformed — doesn't get modified and stays where it is.

Then, cost consciousness: just because it's big data doesn't mean it has to be big costs. And just like everything else we build today, we need to make sure it's ML-enabled, so that any future ML models can consume that data.

So here are some typical steps.
First, storage. Customers need to think about where they're going to store this data, so they have to set up storage. If you're using AWS, you would start with buckets and partitions; if you're doing it on premises, you would start with large disk arrays where that data lake is going to be hosted.

Second, moving the data into the data lake. For this we have to interface with whatever the source is — whether it's on premises, a cloud platform, or IoT devices — and get that data in raw format. Then we immediately need to identify what kind of data it is, so we can organize it and send it to the right place within the data lake, and immediately crawl that data to understand its schema, so we can add some metadata to it. That way the next process that looks at that data can easily understand what the data is.

The third step is to clean this raw data and prep it. Like I talked about with the stages, you have the middle stage where you're transforming and experimenting: the second stage, or silver zone, is where you take the raw data, make some modifications, and do some exploratory analysis. When you're ready to move it to the final stage, you take that data again, move it to the final stage, and catalog it there. That's what the third step is, and it usually involves a lot of complex SQL queries, because the final place where you store that data is usually in the form of tables, so that you can query it easily.

The fourth step is to configure and enforce security and compliance policies, so that any kind of sensitive information is on a need-to-know basis. For example, you can enforce security at the table level, the row level, or the column level, encrypt the data, and audit who's accessing it — any PII or any other kind of sensitive information.

Finally, it's very important to make this data available for consumption. When we catalog this data, it's very important that we label it properly, so that teams who are trying to read it can easily trust it and know for sure that this is the single source of truth and the best place to go to get this data, rather than going back to the raw data and figuring out whether the cataloging was done properly.

So let's take a look at — yeah, sorry, go ahead. [Audience question] So the question is about: if we are creating this data lake, say, here in the United States, and the data is coming from different regions of the globe, and some of the data needs to have GDPR compliance applied, how do we solve that? Do you mean removing certain rows of data once the customer asks you to remove them — is that what you mean by GDPR compliance? Right, so the right to be forgotten would typically mean that you need to delete the rows that have PII data pertaining to that customer. Yes. So typically we see customers running jobs in the background that remove that data — they basically remove the rows. And that requires a level of service: you need to read the data, figure out which rows map to those users, and then, sure, right. Typically you have some level of lineage component that tracks where all the data is flowing, and you can also put some of this into the metadata in the catalog. I mean, the thing is that you still need to detect which fields in your organization map to PII or user information — it's not even just PII — and then you need to track that. There are products that track lineage for you, which you can use to know where all the data has been written, and then you could essentially run a job to remove that data. So it's a combination of tracking the lineage of your data and then running a job to cleanse it once a customer asks you to forget or remove them. Sure, we can definitely discuss it more during the labs. Thank you.
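As a hedged sketch of the kind of background cleanup job just described, here is what a right-to-be-forgotten pass over a curated Parquet data set might look like in PySpark; the paths, the customer_id column, and the list of IDs are all hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("forget-me-cleanup").getOrCreate()

# Hypothetical inputs: IDs from deletion requests and the data set that holds them.
FORGET_IDS = ["c-1001", "c-2042"]                        # would come from a request queue
SOURCE = "s3://my-company-data-lake/curated/orders/"     # placeholder location
TARGET = "s3://my-company-data-lake/curated/orders_cleaned/"

orders = spark.read.parquet(SOURCE)

# Drop every row that belongs to a customer who asked to be forgotten.
cleaned = orders.filter(~orders.customer_id.isin(FORGET_IDS))

# Write to a new prefix; a real job would then swap locations or update the catalog,
# and a lineage-aware version would repeat this for every data set that holds the IDs.
cleaned.write.mode("overwrite").parquet(TARGET)
```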
One more question? Yep. So the second question is about step 5 — wouldn't it break down if there's a third party trying to access data that they shouldn't be accessing? Sure, we'll definitely discuss that during the lab. And the other question is about what happens if somebody wants us to delete their data from the data lake — how would that get propagated across all three stages? We can definitely take that offline. Okay, sorry — I know there are a lot of good questions and good discussions coming out of this, but because we're doing the Twitch stream we're on a tight schedule, and it's harder for people following on Twitch to follow along. They're great questions and I think we can help with them, so come chat with us afterwards and we can dig much deeper into these. Sure, thank you.

All right, so let's dive into some of the components. The first component, obviously, is the base where we're going to store this data, and in this case Amazon S3 is a perfect place. It's secure, highly scalable, and durable object storage with millisecond latency. You can store any type of data — from websites, mobile apps, corporate applications, or IoT sensors — at any scale, and the data you store can be unstructured, like logs or any kind of dump files, semi-structured, like JSON or XML files, or structured, like CSV or Parquet files.

You can also apply lifecycle policies to this data: data that's new can be in the standard tier; data that's a few months old and not frequently accessed can be changed to a different tier; and data that's ready to be archived can be moved to Glacier. You can set these policies up so that the change of tiers happens automatically behind the scenes, so you don't have to do it for every piece of data.

Another great thing about S3 is that it's natively supported by major big data platforms like Spark, Hive, Presto, and others. And again, like I said, it decouples storage and compute: when you get new data you add it to S3 without having to scale your compute clusters, and when it's time to process the data you can spin up EMR clusters on Spot Instances and take them down once the processing is done. Also, the data is in one place, so different applications and heterogeneous analysis clusters can all read from the same place and the same data. S3 offers eleven nines of durability, and you don't need to pay extra for replication within the region — AWS takes care of it. It's secure — the data you're storing can be encrypted with either client-side or server-side encryption at rest — and it's low cost.
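A small sketch of the lifecycle tiering mentioned above, using boto3; the bucket, prefix, and transition days are placeholder assumptions, not values from the talk.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical rule: after 90 days move raw objects to Infrequent Access,
# after 365 days archive them to Glacier.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-company-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-raw-data",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```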
So let's take a look at some of the ways to move this data into the data lake. If you're moving data from on premises, we have Direct Connect, which is a dedicated network connection; we have Snowball, a secure appliance to move data — I think it takes about 80 terabytes; and then Snowmobile, so if you have a lot more data than 80 terabytes, Snowmobile can support up to a hundred petabytes. If you have databases on premises and you want to move that data into your data lake, you can use services like AWS Database Migration Service, and if you have applications running on premises that need to write to the data lake on AWS, you can use AWS Storage Gateway. If you have real-time sources like IoT devices, you can use services like AWS IoT Core, which connects to the data lake on AWS, and you can also use Kinesis Data Firehose, Data Streams, and Video Streams to write real-time streams into your data lake.

Now, before I go to the next slide, remember that we talked about those three stages: the landing dock, the transform stage, and the catalog. There has to be something that queries each of these stages and transforms the data that you need to write to the next stage, and obviously if you do it yourself it's pretty complex, so it would be nice to have a service that can do that for you. That glue between these stages is, again, AWS Glue. It's a serverless data catalog and ETL service. First of all, it can crawl through the data that's coming in, regardless of which stage it's in, and quickly catalog it; it lets you discover the data, create or detect schemas, and make the data searchable and available for ETL. Glue can also author ETL jobs: it can generate customizable code in either Python or Scala, and it can schedule these ETL jobs and run them. It's completely serverless, flexible, and built on open standards.

I mentioned crawlers: these crawlers can automatically build your catalog and keep it in sync. They automatically discover new data and extract schema definitions, they can detect changes — if new data comes in with a different schema, they can detect that and version the tables — and later this afternoon we'll talk about detecting Hive-style partitions in S3 with Gareth. They have built-in classifiers for popular file types, or you can write custom classifiers with grok expressions, and you can run them ad hoc or on schedules. They're serverless, so you pay only when they run.

All right, let's take a look at how Glue would help us with the data lake. The data coming in is in raw format, in that bucket on the left — that's the raw data — and you have crawlers looking at that raw data and creating the catalog at the top. We also have ETL jobs, created by Glue, that transform this data into the staging area, which again has crawlers on top that catalog the data; and finally there are ETL jobs to move the data into the final stage, again with crawlers on top that catalog that data. This catalog is a single view across your entire data lake, and like we talked about, the crawlers discover any new schemas or changes as new data comes in and make the data searchable. And Glue is essentially a serverless Apache Spark environment: you can use its ETL libraries or bring your own code, write your own code in Python or Scala, and even call any AWS API using the boto3 SDK.
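For a sense of what such a hand-written (or Glue-generated) Python job looks like, here is a minimal PySpark skeleton using the awsglue libraries; the database, table, field names, and output path are placeholders, not anything shown in the talk.

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions

# Standard Glue job boilerplate: resolve arguments and set up the contexts.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a raw table that a crawler cataloged earlier (names are hypothetical).
raw = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="clickstream"
)

# Example transform: keep and rename a few fields.
mapped = ApplyMapping.apply(
    frame=raw,
    mappings=[
        ("user_id", "string", "user_id", "string"),
        ("ts", "string", "event_time", "string"),
        ("page", "string", "page", "string"),
    ],
)

# Write the result to the staging zone as Parquet, ready to be crawled again.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-company-data-lake/staging/clickstream/"},
    format="parquet",
)
job.commit()
```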
Putting it all together: you have data coming in on the left, whether it's from Direct Connect, Snowball, Kinesis, IoT, Database Migration Service, or Storage Gateway, all going into that first bucket on the left, which is the bronze zone or loading dock. Then you have those crawlers and ETL jobs again, either cataloging the data or moving it to the next stage. Once your catalog is ready with the data, you can use applications like Athena, Redshift, or EMR to look at that catalog and further analyze the data, you can use tools like QuickSight to visualize data that's queried through Athena or Redshift, or you can run machine learning models using SageMaker on the data that's processed by EMR.

Question? Yes — so the question was whether Glue works only on S3. No, it can also look at EBS and other types of sources. And the next question is whether Glue replaces Macie or extends Macie. It doesn't replace or extend Macie; it could be used in conjunction with something like Macie. Macie, for those of you who don't know, uses machine learning to identify things like personally identifiable information in your files on S3 — there could be files with credit card numbers or social security numbers — so the purpose of Macie is to identify those kinds of personally identifiable information, whereas Glue is really about identifying the structure and the format of a file. So you'd probably want to use them together.

All right, let's talk about security real quick. In Amazon S3, buckets and objects are private by default, and you can secure data access with a combination of resource-based policies, bucket policies, and IAM user policies. You can also use KMS with client-side or server-side encryption for the data that you're writing to the buckets, and you can use object tagging — for example, if some of the data is personal health information, you can tag it so that, in conjunction with IAM, you can make sure access to that data is restricted and audited properly. The metadata that Glue creates in the catalog is also secure: you can restrict it with IAM policies — it's all managed by IAM, of course — and you can attach IAM principals like users or roles so that only those roles or users can read it. You can also have resource-based policies, which are managed by Glue — one policy per account or catalog, similar to bucket policies in S3 — and you can also allow cross-account access.
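As one concrete example of the encryption side of this, a hedged boto3 sketch of turning on default server-side encryption with a customer-managed KMS key for the lake bucket; the bucket name and key ARN are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical: everything written to the lake bucket is encrypted at rest
# with a customer-managed KMS key, without each writer having to request it.
s3.put_bucket_encryption(
    Bucket="my-company-data-lake",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
                }
            }
        ]
    },
)
```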
So at this point — so far we've talked about how to build this serverless data lake on AWS with S3 and Glue, but building it can still take months. Here are some sample steps. First, we have to find the sources, whether they're RDS instances or other data sources. Then we have to create the S3 locations where the stages of your data lake are going to be, configure the access policies, and map tables to the S3 locations. You have to create ETL jobs to load and clean the data — we talked about those — and even if Glue does it, you still sometimes have to look at what it's doing and write Python and Scala code when necessary. Then there's creating metadata access policies — just like for your actual data, you also need these policies for the metadata — and configuring access for the analytics services that are trying to read from this data lake. And then you have to rinse and repeat this for all the data sets you have, all the users you have, all the services you have, and many more things. So it is, of course, a manual process, and we have a service that can help: Lake Formation. I'll let Prajakta take over from here. Thanks.

Thank you. So I think what Aditya just walked you through is one example of what you typically do if you just want to query some data. But as you get more data sets that you want to bring into your data lake, and more use cases and teams that you want to onboard, this process can get more and more complicated. If you remember the diagram he showed, among the five steps we typically see customers spending a lot of time in step three, where they're prepping the data and cataloging it, and step four, where they're making sure that the right access policies are in place. Sorry, my voice has seen better days, but hopefully you can still hear me.

With Lake Formation we are really targeting three value propositions that will help you build your data lakes faster. We're making it easy for you to bring data into your data lake so that it's readily available for you to start querying. We're adding a new way to define security permissions for your data lake access, so that permissions can be defined once and enforced across multiple different AWS services. And we're enhancing the Glue Data Catalog so that it's easier for you to search and discover your data and gain new insights from it. The service is in preview right now. This is some of the early customer interest we've seen — two quotes, one from Change Healthcare, which manages a lot of healthcare payment and clinical information data, and one from Fender Digital, which builds digital experiences around Fender, the guitar brand — and we have a lot of customers who've signed up for our preview. If you want to try it out in preview, please go to our web pages and sign up, and we'll get in touch with you once we've whitelisted your account.

So what does Lake Formation do, or what does it provide? It provides you a way to quickly bring in your data without really worrying about a lot of complex ETL, which you can still do with your existing tools — you can run it in Glue or EMR or any other tool you might be using — and it registers the data with the Glue Data Catalog, so the Glue Data Catalog still remains your catalog, sort of your central source of truth. Lake Formation provides a way to apply security permissions around the catalog: instead of thinking about applying bucket policies and IAM permissions in different services to access the data, and then also figuring out metadata permissions, Lake Formation allows you to do that at the table level. It provides you a way to — sorry, are we still taking questions, or should we take them at the end? Okay. So the question is: how do you access the Glue Data Catalog information? Sure.
The Glue Data Catalog is a managed service; it has APIs like any other AWS service. It provides a way to crawl information automatically from S3, and we also support a set of databases, and DynamoDB, where we can go crawl the information and automatically catalog it, but you can use the APIs to put metadata for any data source and store it in the Glue Data Catalog — nothing is stopping you from that. It's a managed service just like any other AWS service, so the way you interact with it is through our RESTful APIs, and it can crawl on-premises databases over a JDBC connection, for the databases that we support — I can go over those.

So, just to cover the four different buckets of things that we're building capabilities in: the catalog will provide search and collaboration capabilities, and we'll also provide a way for you to audit all access, including data access that goes through Lake Formation. Lake Formation is really built on the capabilities that are already available in Glue, and we'll go through each of these, what they do, and what they mean. Essentially, Lake Formation will provision Glue resources in your account, so you have full visibility into what it's doing — how it's processing your data, ingesting your data, and managing your data. Your data still remains in S3; for Lake Formation, specifically, S3 is the storage layer we're supporting. Your data remains in S3, you again have full access to your data, and we're not adding any proprietary layer on top of it. If you already have data on S3, you can register your buckets with Lake Formation, and Lake Formation can start managing access to that data. You can take advantage of all the capabilities that Aditya went over — you know why S3 is a great storage layer for data lakes — and continue to leverage those capabilities without getting locked into a particular service or technology.

Let's look at what blueprints do. Blueprints are essentially recipes or templates that we're building that will allow you to easily bring your data into your data lake. In the first set, these are the sorts of sources we'll support: a set of databases — RDS databases, databases that you're running on EC2, or databases that we can connect to over a JDBC connection — we'll also support ingesting logs such as CloudTrail, ELB logs, and CloudFront logs, and we'll support bringing data that lands on S3 from Kinesis Data Firehose into your data lake. So what exactly do blueprints do? With a blueprint we expose a set of parameters that you provide as inputs, and these typically involve things like: where is your data, what's the connection, how do we get your data, what are the credentials, where do you want us to land the data in your data lake, and how frequently do you want us to move that data. Then, under the covers, blueprints will crawl your data, detect the source schema, automatically convert it to a recommended target format, partition that data if you've provided a partition key, and keep track of updates — so we can do incremental updates, or we can bring the entire data set over to your data lake and overwrite what's there every time. Under the covers, blueprints essentially create Glue jobs and crawlers, so you can customize the blueprint parameters, or you can actually go to the jobs themselves and edit and customize those.
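For reference, the kind of crawler a blueprint generates can also be created by hand with boto3; a minimal sketch, where the role ARN, database, path, and schedule are all placeholder assumptions.

```python
import boto3

glue = boto3.client("glue")

# A catalog database for the crawler's tables (skip if it already exists).
glue.create_database(DatabaseInput={"Name": "raw_db"})

# Hypothetical crawler over the raw zone: it samples the data, infers schemas,
# and writes versioned table definitions into raw_db on a nightly schedule.
glue.create_crawler(
    Name="raw-zone-crawler",
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",    # placeholder role
    DatabaseName="raw_db",
    Targets={"S3Targets": [{"Path": "s3://my-company-data-lake/raw/clickstream/"}]},
    Schedule="cron(0 2 * * ? *)",                             # 02:00 UTC daily
    SchemaChangePolicy={"UpdateBehavior": "UPDATE_IN_DATABASE", "DeleteBehavior": "LOG"},
)
glue.start_crawler(Name="raw-zone-crawler")
```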
[Audience question] So, the source tables could be in a MySQL database, for example; we can detect the schema and bring in the data from MySQL if you want us to, or do incremental updates based on a key — we can pick up the incremental records — and when we store the data into S3 you can also give us partitioning keys that we can partition on. Yes, the target will be S3; the sources could be multiple different sources.

In addition to blueprints, we are adding ML transforms. These transforms can be trained on the data that's in your data lake: you point us to a table in the catalog, we can generate a sample set that you can go in and label — you can also bring your own label set and provide it to these transforms — and then you can run them as part of a Glue job. What do these transforms do? Here are a couple of examples. What they typically do is use fuzzy matching logic to find out that two separate records are the same record, which goes beyond just matching a key. So if you have, say, movie listings coming from two different providers and you want to know that they refer to the same movie, you can train the FindMatches transforms that we are adding and run them on your data. Similarly, you can deduplicate a set of records — this example is from the Amazon retail catalog, where you can determine that two products are the same and deduplicate them if you like. So that's the data-ingest piece.

Moving to the second bucket of things that Lake Formation is building functionality in: the security permissions that Lake Formation is adding. This is just a view of the world today, especially with Glue and S3, where you're securing your data access separately from your metadata access, and that typically gives you table-level access. With Lake Formation, we're actually allowing you to define your security permissions based on tables and columns instead of objects, and then Lake Formation will map that to the underlying objects and vend out access to the services that integrate with Lake Formation.

This slide is basically walking you through a flow. As an admin, you can set up permissions on tables in the data catalog. Your end users, who are querying this data from a variety of AWS services, do not need direct access to S3 — they basically just need access to tables in Lake Formation — and then, based on the permissions that are defined, Lake Formation will vend out temporary credentials to the service that's querying the data. The question is: what about catalog permissions? With this permissioning model there's only one set of permissions — there's no separation between metadata permissions and data permissions. You basically give access — as you can see in the screenshots here — by granting a set of permissions, such as CREATE, SELECT, or ALTER, to a set of principals, which could be IAM principals, meaning an IAM user or role; we're also working on Active Directory users and groups, especially for use with EMR. You define these permissions through Lake Formation, and they allow you to scope down access not only to a table but to a subset of columns. So in this visual representation, user one has access to all the columns that don't have personally identifiable data, and user two has access to all the columns of this table, which is the Amazon reviews table.
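A hedged boto3 sketch of the column-scoped grant behind that picture; the role ARN, database, table, and excluded column names are hypothetical.

```python
import boto3

lf = boto3.client("lakeformation")

# Hypothetical grant: SELECT on the reviews table for an analyst role, with the
# columns that carry personal data excluded from what the role can see.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/analyst"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "curated_db",
            "Name": "amazon_reviews",
            "ColumnWildcard": {"ExcludedColumnNames": ["customer_id", "review_ip"]},
        }
    },
    Permissions=["SELECT"],
)
```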
Yes — so the question is: the restriction at the catalog level, is that propagated to the data? When you register a bucket with Lake Formation, Lake Formation then starts vending out access to the data that's stored in that bucket. The first step, typically, is to say this is the boundary of my data lake — not all your S3 buckets are your data lake buckets, typically — so you register those with Lake Formation, and then you go and catalog the data: you can either crawl it and catalog it, or you can run your own DDL statements and catalog it. The tables have a location pointing to where your data is, so the tables map to that data. Then, once you define your security permissions — say I grant SELECT access on a particular table to a user — when this user runs a query, such as a select star, through any of these services — let's take Athena as an example — Athena will request access from Lake Formation. Athena will say, user A is trying to run a select star on table T, and Lake Formation will verify the permissions and send temporary credentials back to Athena. Athena is still the one querying the data; Lake Formation is not in the data path in any way. Essentially, Athena will query the data, filter out the columns, and then return the results back to the user. The users are authorized by Lake Formation, but they're still authenticated through our IAM service — by authentication I mean that IAM will tell us that the user is actually user A — so authentication happens through IAM, but the authorization happens in Lake Formation.
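To make that flow concrete, a minimal boto3 sketch of issuing such a query through Athena, with Lake Formation handling the authorization behind the scenes; the database, table, and results location are placeholders.

```python
import time
import boto3

athena = boto3.client("athena")

# Hypothetical ad hoc query: Athena checks the caller's Lake Formation permissions
# and only returns the columns this principal is allowed to see.
query_id = athena.start_query_execution(
    QueryString="SELECT * FROM amazon_reviews LIMIT 10",
    QueryExecutionContext={"Database": "curated_db"},
    ResultConfiguration={"OutputLocation": "s3://my-company-query-results/"},
)["QueryExecutionId"]

# Poll until the query finishes, then fetch the (column-filtered) result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state not in ("QUEUED", "RUNNING"):
        break
    time.sleep(1)

rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```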
So, moving to the third bucket — actually, let's take one question, because we're running really close to time. [Audience question] Yes — so, we have crawlers that can run on a schedule, and they can detect schema changes automatically, catalog the schema changes, and alert you that there is a schema change; crawlers provide you that. You should think of the Glue Data Catalog as a replacement for the Hive metastore, which already exists in the big data world. If you have an ETL process that's already updating the Hive metastore, you can continue to use that — as part of your ingest or as part of an ETL job you could update the catalog — or, if you want to decouple that, you can run Glue crawlers to go and scan your data. Glue crawlers essentially sample your data — they don't actually read all the rows — and then they detect things like a new column that has appeared, or, if it's a JSON data set, new key-value pairs that will eventually map to a column, and they can automatically catalog that.

[Audience question] So the question is: how do your analytics tools react to your data changing if new columns are being added? It depends on what tools you use and how your queries are written. If your SQL queries are written such that they query a specific set of columns, they will run fine, because those columns are still there as new columns are added; if it's a select star, you'll get the new data. So it really depends on what your queries are. We don't see data being deleted as often in data lakes — it can happen — but if it happens and your query is asking for a specific set of columns, then unfortunately that query is going to break, or it's going to get null values back. Both crawlers and the Glue catalog send table notifications to CloudWatch, and you can build a mechanism there to consume those notifications if your downstream processes are very tightly coupled to the schema. You can also configure crawlers so that they don't automatically update and only alert — that is also possible — so crawlers have a set of configuration options you can play with, depending on how your downstream tools expect the data to be.

[Audience question] Yes — Athena works directly against S3, and there's an out-of-the-box integration with the Glue catalog. With Redshift Spectrum, if you have data in Redshift and you also have data in S3, you can use the Redshift cluster and Redshift Spectrum to query it. It depends — I think Athena is more for ad hoc querying; if you just want to do ad hoc querying and go to an interface and start running your SQL queries, then yes, it's very good. But we have ten minutes left and I have only one or two more slides, so let me just finish quickly and then we'll take questions.

We are enhancing the search capability in the catalog, making it a text-based search as part of Lake Formation. As I mentioned, the Glue catalog remains the same catalog in Lake Formation, but you get the ability to define security permissions around the tables in your catalog, and we'll have a text-based search across all the metadata, including custom metadata that can be stored in the catalog. You're already able to add attributes like data owners or data stewards at the table level using table properties in the Glue Data Catalog; we're also extending that to columns, so you'll have a bag of key-value pairs, or properties, that you can add at the column level. You can add more context to the columns — things like sensitivity levels — and then you'll be able to search on those fields and find the relevant data sets. And then, typically with CloudTrail today you can get audit logs for API access, but with Lake Formation we will also provide data access audit logs for all the data access that goes through Lake Formation. With those logs you can see things like what permissions were granted and revoked and by which user, who accessed the data from which service, what the requested query was, and what result was returned. Some of the recent alerts will be available through the console and through our APIs, but you can download all the audit logs from CloudTrail, just as you do for other logs in AWS.

I think that's all I have. This is just a quick overview slide of how our analytics portfolio stacks up for different use cases, and that's our contact information. I'll take your questions now. [Audience question] So, can Tableau be connected through Athena or JDBC right now? Yes — Athena provides a JDBC driver that you can use to connect Tableau, QuickSight, or other tools; you don't need a Redshift cluster, as far as I know. We have five more minutes, so more questions? Sure. [Audience question] So the question is: there's a Parquet file with ten columns, and five of those columns need to be obfuscated in some way for a particular user, so the user should only be seeing the other five columns.
So with EMR, we are actually building integration so that an Active Directory user can log in to an EMR cluster and query data using their Active Directory credentials. You can have multiple Active Directory users logging into the same EMR cluster; they can be federated into an IAM role to get authenticated, but we will use SAML tokens to differentiate one user from another. And in Lake Formation, just as in the example I showed with an IAM principal, you can add an Active Directory user there and say this user has SELECT on this set of columns, or has SELECT on all columns except these columns — you have the option to either exclude or include. Then EMR will come to Lake Formation, just as in the Athena example, to request access, and it will use the temporary credentials that Lake Formation passes back to query the data, filter the columns out, and return the results back to the user.

[Audience question] So the question is: do we have any comparison of performance across Redshift, Athena, and Redshift Spectrum? Yeah — I mean, I think, okay, that's fair. I think the question is really how you choose when to use EMR versus Athena versus Redshift Spectrum. Part of it depends on your data and part of it depends on who is accessing it. If your data is largely something that's sitting in Redshift, and you also have data in S3, and you want to use Redshift to go across Redshift and S3, or use the same cluster to query data in S3, then Spectrum provides that capability — it can do things like views that mix Redshift and S3 data. You can also use Spectrum for just querying S3, but Spectrum requires you to have a Redshift cluster. If your use case is, I have a lot of data in my data lake and I want to run some ad hoc analysis once in a while — and we do see customers using Athena where they're scheduling Athena queries as well — then if you just want serverless Presto, Athena is based on Presto, and Athena is a great choice. We have customers who want to use EMR because they need tight control over how they provision and manage the cluster, what kind of instances it has, and how it's fine-tuned; they share it across multiple different use cases, and it gives you a way to run Presto, Hive, or Spark, depending on what application you want to run and what kind of tools you want to use — you might want to use EMR for that. We have a lot of customers still using Hive. And then you also have Glue, which provides serverless Spark, so if you want a serverless Spark platform for data processing or for SQL-style queries, you have some options there as well. I don't know if we have any metrics, but that's good feedback — we'll take that back.

[Audience question] Yes, so we'll have APIs for all of this: the functionality that you saw will be available through APIs, and you should be able to access it. The column-level access is only available through the services that are integrated with Lake Formation, but we will be able to give you table-level access through the APIs, and other functionality — like accessing the metadata, or getting and setting permissions — all of that is available through the APIs.
Sorry, I couldn't quite get your question — do we have ML tools that work with Lake Formation that have a test environment? I'm not quite sure I understand the question. I don't know if we do, but — ah, he's giving a thumbs up — come chat with us afterwards and we can dive into it a little bit more. Thumbs up, though; and with that, it's time for labs and then lunch. Thanks so much, appreciate it. [Applause]
Info
Channel: Amazon Web Services
Views: 20,807
Keywords: AWS, Amazon Web Services, Cloud, cloud computing, AWS Cloud
Id: 25FV_2bj-Mo
Length: 60min 32sec (3632 seconds)
Published: Thu Apr 04 2019