AWS re:Invent 2019: [REPEAT 1] Data lakes and data integration with AWS Lake Formation (ANT218-R1)

Captions
My name is Mehul Shah. I'm the general manager of AWS Glue and AWS Lake Formation, and today I'm going to show you how easy it is to build a data lake using Lake Formation. We'll be giving you a demo today and getting it up and running literally within this hour. With me I have my lead product manager, Jena DeMarla, who will be running the demo, and one of our key customers, friends, and partners, Joe Super. At the end of the talk he's going to tell you their story of how they used Lake Formation to get their data lake into production and the values and benefits they got from it.

So let's get started. Here's a quick outline of my talk. First I'll cover some of the trends driving the revolution that is leading our customers to build data lakes today, and why they're building them. I'll tell you what a data lake is, why data lakes were hard to build before Lake Formation launched, and how Lake Formation makes it easy. We'll go through the demo and set one up today, and I'll also tell you a little about the technology underpinning Lake Formation and some of the features that make all of this easy.

About 20 to 30 years ago, enterprise data warehouses were all the rage. Every enterprise was building a centralized information hub called an enterprise data warehouse. They would take all their structured data from their structured databases, their OLTP systems, ERP systems, and CRM systems, extract all that data, merge it together into one big schema, and load it into the warehouse. That's where they would do all of their analysis; all of the reporting and BI happened over that data warehouse, and that's how they understood how their business was running and how to optimize it. This ETL process was a centralized process; getting that big schema put together took time, sometimes months, and of course if you ever had to change anything it was brittle, time-consuming, and cumbersome, but you did it.

Enterprise data warehouses today are no longer the center of information. They're still an important piece of the puzzle, but they're no longer the information hub for an enterprise, and there are good reasons for this. One of the biggest is that the data enterprises want to analyze no longer fits in data warehouses; there's a lot more of it. When we look across our customer base, roughly every five years the amount of data our customers store and analyze grows by an order of magnitude. If they start at terabytes, we're looking at petabytes over the lifetime of the data platform they're using, and if you're at petabytes today, you're going to end up wanting exabytes, and data warehouses are just not cost-effective at those scales. Another reason is that the data they want to analyze is much more diverse. It's no longer just structured data; you want to analyze your unstructured and semi-structured data as well: emails and chat logs, social networking feeds, network logs, application logs, clickstreams, ad tech logs, and so on. There are also many more people who want to analyze this data, many different personas going after it in an enterprise.
It's no longer just business users wanting to do SQL analytics over the data. You also have data scientists wanting to build machine learning models and run inference over it, scientists analyzing it in healthcare and life sciences, and with today's DevOps culture you want to analyze real-time streaming logs so you can alert when things go bad. There are all kinds of new things people want to do with this data. In response to the variety of workloads, analytics, and people that want to analyze it, enterprises want to democratize access to this data; they don't want to have to go through a central IT process to give access to any one person in the organization. At the same time there's a tension, because there is even more regulatory pressure these days, from governments and at the enterprise level, to make sure the right people get the right access to the right data at the right time. So new tools are needed, and this is where the cloud comes in, and thank goodness, because the cloud is a game changer; that's why you're here.

What we see in the cloud is a revolution: people are no longer building enterprise data warehouses but instead putting together data lakes in the cloud. At the center of a data lake is its storage system, and in AWS that's Amazon S3. Why S3? Because S3 is ubiquitous; you can access it from anywhere. It's highly available, cost-effective, highly scalable, and incredibly reliable: we give you eleven nines of durability, so basically once you put data in there it's going to be there forever, unless of course you delete it. We see our customers putting all of their on-premises batch data into these data lakes, and also spooling in a lot of their real-time streaming data, from IoT logs, application logs, and network logs, all into S3. The other nice thing about the cloud is that you have many more primitives, many more services, that you can use to analyze the data in situ while it sits on S3. You have Amazon Redshift, our data warehousing service, but you can also use SageMaker for machine learning, EMR for big data analytics, QuickSight for business intelligence, and Amazon Athena for ad hoc SQL analytics. These services are available on demand, they scale automatically, and you pay as you go; you can shut them down when you don't need them.

So what's needed is a way to organize all your data in S3, secure that data so it can be shared efficiently and safely, and multiplex it, making it available to all of these services, with a single locus of control. That is what a data lake is. A data lake is the new information hub in an enterprise: a centralized, secure repository that enables you, our customers, to govern, discover, share, and analyze both structured and unstructured data at any scale. On AWS we have more than 10,000 data lakes, more data lakes than anywhere else, and here's a selection of customers building large data lakes on our platform. And since we launched a few months ago, here's a selection of customers building data lakes using Lake Formation, because it makes it much easier for them: insurance companies, financial services companies, healthcare companies. It solves a lot of different use cases.
And what are they saying? Three main things: one, it's really simple and easy to set up their security, where previously it was incredibly difficult; two, it's easier to ingest the data and get it set up; and three, it's much simpler, more feasible, and less expensive to clean the data than it was before.

So what was hard about building data lakes before Lake Formation? Exactly the things Lake Formation solves: it was hard to clean your data and secure your data in a data lake, and it used to take months. Here's why. There are many steps involved in building a data lake, and more importantly, many different skill sets are needed to build and use one; different personas take care of different parts of the process. Data engineers typically do the ingestion and cleaning, the data preparation needed for the data lake: they set up storage, move or extract the data from its sources, load it into a target (S3 in this case), and clean and prep it so it's canonicalized and you can actually do something with it. You've all heard how much time and effort this takes; the adage goes that in any analytics project about 80% of the time and effort goes into cleaning and prepping your data. But one of the things that's overlooked, or not talked about as much, is the amount of time and effort it takes to secure the data and enforce security and compliance policies, which is also part of building and managing data lakes; typically data stewards do this. Finally, you have to make all of this data available and accessible through a variety of analytics engines, serving the needs of data analysts, business users, and data scientists.

I want to give you a quick sample of the steps required to build a data lake on AWS prior to Lake Formation, just so you understand how complex this could be. First, here's an account with a bunch of RDS databases, the sources you need to extract data from. You'd have to set up landing areas where the data would land, in this case in Amazon S3, setting up a bunch of locations or buckets. Then you'd have to set up access policies on those S3 buckets. These are IAM policies written in JSON, not a natural language for a lot of people, and the policies are at the level of S3 objects, API accesses, and paths, which really doesn't map to the kinds of data structures and data sets you think about when you're doing analytics: tables, databases, rows, and columns. Getting these policies to work exactly the way you want is very difficult; there are limitations on what you can express and on how big the policies can get, and you have to specify them in many different places, so it's hard to even know who has access to what. In some cases customers gave too much access to their users, and in other cases not enough, so users who had rights to data couldn't get to it. And so far all you've done is set up policies and spaces; you haven't actually loaded any data.
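As a rough illustration of that pain, here is a minimal sketch of the kind of object- and prefix-level JSON bucket policy you had to hand-author and keep in sync; the bucket, account, and role names are hypothetical and not from the talk:

```python
# Hypothetical sketch of a pre-Lake-Formation S3 bucket policy: access is
# expressed in objects and prefixes rather than tables and columns.
# Bucket, account, and role names are invented for illustration.
import json
import boto3

s3 = boto3.client("s3")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AnalystsReadSalesPrefixOnly",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:role/data-analyst"},
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::my-datalake-landing/sales/*",
        },
        {
            "Sid": "EngineersWriteLandingArea",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:role/data-engineer"},
            "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-datalake-landing",
                "arn:aws:s3:::my-datalake-landing/*",
            ],
        },
    ],
}

# Policies like this live on every bucket and have to be kept in sync by hand.
s3.put_bucket_policy(Bucket="my-datalake-landing", Policy=json.dumps(policy))
```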
Next you need to represent the data. You could use the Glue Data Catalog to represent the data sitting in your S3 buckets and set up schemas for your tables and databases. Then you have to load the data from your sources, typically using some kind of ETL system. AWS Glue is an ETL product you could use; it helps you write a PySpark program that gets data from your databases into S3, but you'd have to know how to code, in this case PySpark. Not done quite yet: remember the metadata you set up might contain sensitive information, so you have to set up metadata access policies as well, defining who can access that metadata, and these are themselves written in JSON, again not a natural language for people. Still not done: now imagine you want to analyze this data through a data warehouse like Redshift. You'd have to log in and set up the cluster, and beyond that you'd have to set up users and roles that can access those buckets and adjust the bucket policies so they allow that access. Once you're done, you have to rinse and repeat for every new data set you bring in and for new users as they come on board, and every service has a slightly different configuration than the next. Of course you'll have to manage all of this going forward and update the policies as things change. If I haven't made the point yet: this is an error-prone and time-consuming process, and it's why building a data lake used to take months.
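For context, a minimal sketch of the kind of hand-written Glue PySpark job described here, reading a cataloged source table and landing it in S3 as Parquet; the database, table, and bucket names are hypothetical:

```python
# Minimal sketch of a hand-written AWS Glue PySpark job: read a source table
# via the Data Catalog and land it in S3 as Parquet. Names are hypothetical.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source: an RDS table already registered in the Glue Data Catalog.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="oltp_replica",
    table_name="orders",
)

# Target: columnar Parquet in the data lake bucket, for fast analytics.
glue_context.write_dynamic_frame.from_options(
    frame=orders,
    connection_type="s3",
    connection_options={"path": "s3://my-datalake/raw/orders/"},
    format="parquet",
)

job.commit()
```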
Well, no longer. With Lake Formation we make this all super easy. Lake Formation is a fully managed service that enables data engineers, data stewards, and data analysts to build, clean, and secure data lakes in days. Let's look at what the solution stack with Lake Formation looks like and what benefits it provides. We already talked about the storage, which is cost-effective, durable, and globally available so you can get to it from everywhere. Lake Formation first provides a set of tools that let you ingest and clean that data so data engineers no longer have to do a lot of the undifferentiated heavy lifting; they simply use a set of templates to get the data in, which lets them build other things much faster. The main value proposition of Lake Formation is around security: you can manage all of your fine-grained permissions and security policies in a single place, alongside a data catalog, and you can also keep track of all accesses to your data lake in that single place. It lets you audit every access so you can comply with various regulations and work with auditors to meet your auditing requirements. Finally, there's a set of new features that let you discover and share these data sets in a secure and efficient way, and all of the integrated services apply these security policies uniformly, so it's not the case that one service allows more access than another.

All right, so we're going to walk through the features Lake Formation provides in the demo, one user persona at a time, starting with the data engineer: blueprints, and how Lake Formation makes it easy to simplify the ingest process. If we can switch over to the demo: what Jena is going to show first is the process of registering buckets, or locations within buckets, so that Lake Formation can manage that area. He's picking a bucket we've already set up, a landing area, selecting it so it will be managed by Lake Formation, and he's passing it a role. That role is what confers access to Lake Formation, saying you can manage this area and carve up access to it as you see fit. Register the location. Next you need to give users, the different personas, access to those locations so they can load data: go to Data locations, Grant, and pick a user, in this case the data lake admin. That data lake admin is going to use the Lake Formation role and will have the right to write into the bucket we just registered with Lake Formation.

Now we'll create a database. In this demo we're going to take CloudTrail data (CloudTrail records all of the API calls made by an account), import it into our data lake, and then query it using a variety of query engines. Jena has created a database where we're going to put the CloudTrail data, and the location for the database is the same bucket we registered. Now we need to give our user permission to create tables in that database, so let's do that: there's our Lake Formation role again, Databases, the CloudTrail demo database, and notice the permissions we're granting are the ability to create tables, modify the metadata, and drop tables in that database, granted to the user that's going to ingest the data into the data lake.

Now that things are set up and we've added a landing area, let's actually get the data in. To do that you can use blueprints. Blueprints are basically templates; if you click in, you'll see they're a collection of prepackaged templates you can use to set up workflows that grab or extract data from various sources. We have templates that extract data from databases like MySQL, PostgreSQL, Oracle, and SQL Server, as well as templates that grab popular log types like CloudTrail or ELB logs, bring them into your data lake, and organize them in an optimized fashion so you can query them fast using Athena, Redshift, and the other analytics engines. Let's use the CloudTrail blueprint. When you use a blueprint you give it a set of parameters so it knows what you're bringing in. In this case we're bringing in the Lake Formation demo trail, a day's worth of CloudTrail data, and we tell it where it's going: our target database and target location, with Parquet as the data format. Parquet is one of those columnar formats people use because it's optimized to make analytics run fast. You can run a blueprint on demand or continuously; in this case we're just going to run it once, on demand, and all it will do is take a snapshot of that data and bring it into our data lake.
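The same setup can also be scripted. Here is a hypothetical sketch of the registration and grants just shown in the console, expressed through the Lake Formation API; the ARNs, role, user, and database names are invented for illustration:

```python
# Hypothetical sketch of registering a location and granting permissions via
# the Lake Formation API. ARNs, role, user, and database names are invented.
import boto3

lf = boto3.client("lakeformation")

# 1. Register the landing bucket so Lake Formation manages access to it,
#    passing the role that confers access.
lf.register_resource(
    ResourceArn="arn:aws:s3:::my-datalake-landing",
    UseServiceLinkedRole=False,
    RoleArn="arn:aws:iam::111122223333:role/LakeFormationWorkflowRole",
)

# 2. Let the data lake admin write into that registered location.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:user/datalake-admin"},
    Resource={"DataLocation": {"ResourceArn": "arn:aws:s3:::my-datalake-landing"}},
    Permissions=["DATA_LOCATION_ACCESS"],
)

# 3. Let the ingesting user create, alter, and drop tables in the target database.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:user/datalake-admin"},
    Resource={"Database": {"Name": "cloudtrail_demo"}},
    Permissions=["CREATE_TABLE", "ALTER", "DROP"],
)
```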
When you use a blueprint it creates a workflow, and that workflow actually runs inside AWS Glue, so you need to give the workflow a name, which Jena has done, give it a role so it can read the data from those databases, locations, or services, and give it a table prefix, then create it. This demo is running live on our production service today, so it takes a few seconds to create the workflow; hopefully it doesn't take a few minutes, or we'll run out of time in the talk. OK, the demo gods are happy today, so let's run the workflow. You can see that the workflow has started and is running. Click into the workflow to see what it's actually doing: you can get more information about the individual subtasks running in that workflow by clicking View graph, and that takes you over to AWS Glue. Glue is the thing that's actually running this workflow for you; it has all the primitives necessary for ingesting the data. You can see it's a fairly complex graph, and through Glue you can monitor what's going on, how far things have run, what's left to do, and, especially if something goes wrong, where it broke. You can go into Glue and build all kinds of workflows; this is just one created from one of our templates, and the workflow varies depending on the template. We'll let this run, and while it's running I'll tell you a bit more about some of the other cleaning features we have.

One point left to make about blueprints: you can customize the workflows for your needs. If the blueprint isn't doing exactly what you want, you can have it create the workflow, then go down into Glue and edit and modify what's going on there yourself. A lot of customers ask us what the difference is between Lake Formation and Glue. Well, we just showed you: Lake Formation is a service targeted at data analysts, data stewards, and data engineers; it's a higher-level service built on the primitives that Glue provides, and a lot of developers love Glue. Let's talk about some of those primitives so you understand what's going on under the covers. Glue provides a number of scalable, serverless components that help you build data lakes. The first is the Data Catalog: fully managed and Hive metastore compatible, it's basically a metadata service that stores all the information about your tables, databases, columns, data types, schemas, and so on, and it's integrated with pretty much all of our analytics services so they can access data referenced by the catalog. Alongside the Data Catalog are crawlers, again fully managed. You can run crawlers over any data set you have in S3 and they will automatically infer its contents: they figure out the compression type and unpack it, figure out the format, infer the schema automatically (we have a number of AI techniques to do that), extract some statistics, and populate the Data Catalog. They also record how the data is laid out on S3 to make it faster and easier to query.
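A hypothetical sketch of pointing a crawler at an S3 prefix through the Glue API; the names, role, and paths are invented for illustration:

```python
# Hypothetical sketch: create a Glue crawler over an S3 prefix so it can infer
# format, compression, and schema and populate the Data Catalog.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="cloudtrail-demo-crawler",
    Role="arn:aws:iam::111122223333:role/LakeFormationWorkflowRole",
    DatabaseName="cloudtrail_demo",
    Targets={"S3Targets": [{"Path": "s3://my-datalake-landing/cloudtrail/"}]},
    TablePrefix="cloudtrail_",
)

# Crawl once now; in practice you might schedule it or run it inside a workflow.
glue.start_crawler(Name="cloudtrail-demo-crawler")
```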
At the core of Glue is a serverless ETL platform. What does that mean? We basically give you serverless Apache Spark: you give us a PySpark or Scala script that runs in Spark, we automatically provision the necessary resources for that job, run it for you, and shut it down. You manage no machines, you don't manage or configure any networking settings, we do all of that for you, and you only pay for the time the script is running. You can also interactively create those scripts using the GUI in our console. Those scripts turn into jobs, but you may have multiple jobs that have to run in succession, perhaps interspersed with crawlers, so you want to create flexible workflows that can get a larger task done; you want to orchestrate all of this. So we give you an orchestration system: you can author and monitor the workflows you create using the orchestration system in Glue, and it sends out alerts through CloudWatch so you can integrate other external services with it.

OK, we showed you the workflows; now let me tell you about some of the transforms we've added as part of Lake Formation to make cleaning your data simple. A really common task customers have is integrating multiple data sets that represent the same thing; it happens all the time. In this example you're merging two catalogs that represent the same things, but those things aren't represented exactly the same way in the records. Here we show a catalog with two entries for shoes: one entry says track saddle, the other says men's saddle penny loafer; they have roughly the same colors and options and roughly the same price. You and I can look at this and say it's probably the same product, the same shoe, so the records would look the same to us. But if you had to do this programmatically it would be a complete pain, and whatever set of heuristics or rules you coded up would have to change based on the catalog and the product; it would be really painful to get done, and you couldn't scale it to any reasonable size. We have a number of ML techniques that we used internally in Amazon retail to make this work at scale, and we have now released them to you as part of AWS Glue. We let you train models that will do this merging of data sets and finding of duplicates automatically, and then run them across your data sets. You can use it for deduplication, and you can also use it for record linking, when you have two different data sets and want to know which records are the same.

So how does this work? The naive technique is pretty simple: to do fuzzy deduplication you look at every pair of records, apply some scoring function, and then apply some kind of threshold to decide which records are the same and which are different.
The problem with that approach is twofold. One, with even a reasonable number of records, maybe ten thousand or a hundred thousand, it's going to take forever, because it's an N-squared problem. Two, depending on your scoring function, it's really not going to work that well; you're not going to get very good accuracy. Our scientists have been working pretty hard to make sure you can do this well: these algorithms can get you into the high 90s in terms of accuracy and recall, and they let you scale to very large data set sizes. The way they do this is to break the problem into three steps. The first step is to avoid looking at every one of the N-squared pairs of records and instead put the records into bins, where records within a bin are likely to have matches and records across bins are not. This step is called blocking, and when you're training your machine learning transform in Glue you can tune how it works: the bigger the bins, the more expensive the algorithm will be but the more accurate; the smaller the bins, the less accurate but the cheaper. Where you set this depends on your data set and your workload. The next part of the algorithm uses a number of features to score whether two records are the same or not, and it adjusts the weights of those features based on positive and negative examples that you give it. Typically we ask you for about a thousand examples, roughly ten sets of ten, where you tell us whether two things ought to be the same or different, and that adjusts the weights for the scoring. The last part of the algorithm uses those scores to create partitions, basically groups that tell you the things in each group are the same record. We have a number of interesting heuristics that make the partitioning step much more accurate, and you can control the precision/recall trade-off for the partitioning as you wish. Once you've done this and created a model, you can run it over and over again on any of your data sets to do the deduplication. The main point I want to make here is that this scales pretty well: it can handle about half a billion rows in under three hours, where the previous state of the art took several weeks, and it's available to you at no additional cost on top of the cost of Glue.
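A hypothetical sketch of creating such a FindMatches ML transform through the Glue API, with the precision/recall and accuracy/cost knobs mentioned above exposed as parameters; the database, table, role, and transform names are invented for illustration:

```python
# Hypothetical sketch: create a Glue FindMatches ML transform for fuzzy
# deduplication. Names, role, and parameter values are invented.
import boto3

glue = boto3.client("glue")

glue.create_ml_transform(
    Name="dedupe-product-catalog",
    Role="arn:aws:iam::111122223333:role/GlueMLTransformRole",
    GlueVersion="1.0",
    InputRecordTables=[
        {"DatabaseName": "retail", "TableName": "merged_product_catalog"}
    ],
    Parameters={
        "TransformType": "FIND_MATCHES",
        "FindMatchesParameters": {
            "PrimaryKeyColumnName": "product_id",
            # Higher values favor recall/accuracy at higher cost, as described above.
            "PrecisionRecallTradeoff": 0.8,
            "AccuracyCostTradeoff": 0.7,
            "EnforceProvidedLabels": False,
        },
    },
)
# After this you supply roughly a thousand labeled example pairs, train the
# transform, and then run it in a Glue job over the full data set.
```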
All right, now let's talk about centralized permissions: how do you actually set up permissioning, and how does it work? The data steward is the one responsible for creating the permissions on your various data lake resources, the tables and databases, down to the column level, and she specifies them using Lake Formation. You shut off direct access to the data through S3, and users only access and analyze the data through the integrated services: Athena, Redshift, Glue, and Amazon EMR. Those services consult Lake Formation to see whether a user has access to the data and to which parts of it; if they do, the data comes back to the integrated service, which does the analysis and returns the answer. Permissions in Lake Formation don't look like JSON policies; they look like simple grant and revoke statements, the kind you would have in a database system, and you can specify them down to the column level. So you might have a user, for example Jena, who is only allowed to access the bottom columns of a particular table, while Joe is only allowed to access the top columns; you can specify any subset of columns for whatever users you want. And we don't replicate the data or create materialized views or any of that; all of this happens on demand as the data is being pulled. In addition, Lake Formation lets you easily view the permissions in a single pane of glass, so you're not searching in five or six different places to see who has access to what, and finally you can audit all accesses in a single place.

Many have asked us whether this means we no longer use IAM. That's not quite the case: the Lake Formation permissions model works in conjunction with IAM. The way to think about it is that IAM still gives you the coarse-grained permissions to access the catalog, and then you provide all of the fine-grained permissions on databases, tables, and columns through Lake Formation. Under the covers, Lake Formation is really just using IAM and its APIs to broker access to the underlying objects using a technique called credential vending. Let me walk you through that. In Lake Formation, users can be IAM users, roles, or users federated through some other identity provider like Active Directory. A user comes in and issues a query, doing some analytics over, say, a table T1 through one of the analytics services, say Amazon Athena. Athena then requests access to the underlying table; many S3 objects comprise that table. Lake Formation checks whether the user is authorized to access the table and then returns short-term credentials for it, along with the columns of that table the user has access to. The service then requests the underlying objects that comprise the table directly from S3; those objects come back to the service with all columns and all rows, and the filtering happens inside the service, inside Athena, or inside Redshift if you're using Redshift. That means there's no intermediary in the data path when you run these queries, so you don't pay the cost of an additional hop or any additional cost for running extra servers that do filtering. Lake Formation is also backward compatible. What does that mean? If you've already set up a data lake using the Glue Data Catalog with all of its permissions, you don't have to give that up: leave the metadata there, leave the data there. There's a five-step process to upgrade from the Data Catalog permissions you have to Lake Formation permissions, and we can show you how that works offline; there are a number of great chalk talks and builders sessions out there that will show you how to do it.
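For reference, a hypothetical sketch of what grants like these look like through the Lake Formation API, including the column-level subset and the database-style grant option; the principals, database, table, and column names are invented for illustration:

```python
# Hypothetical sketch of column-level and table-level grants via the
# Lake Formation API. Principals, database, table, and columns are invented.
import boto3

lf = boto3.client("lakeformation")

# Data analyst: SELECT on only a few columns of the CloudTrail table.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:user/data-analyst"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "cloudtrail_demo",
            "Name": "cloudtrail",
            "ColumnNames": ["eventtime", "eventsource", "eventname", "awsregion"],
        }
    },
    Permissions=["SELECT"],
)

# Data owner: full access to the table, plus the right to grant SELECT onward
# (the database-style "grant option" mentioned above).
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:user/redshift-user"},
    Resource={"Table": {"DatabaseName": "cloudtrail_demo", "Name": "cloudtrail"}},
    Permissions=["SELECT", "INSERT", "DELETE", "ALTER", "DROP"],
    PermissionsWithGrantOption=["SELECT"],
)
```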
All right, let's see how all of this works. We set up our data lake; hopefully the workflow is done running. Has it loaded? Yes, it's completed; it worked today. Let's go to the database and take a look at the table. There's the database; open it up and you'll see under Tables that it created a CloudTrail table with Parquet data in it. That's the CloudTrail data we just ingested. Let's grant permissions on this table to two different types of users. The first user, the one we're calling the Redshift user, gets access to all the columns in the data set; perhaps this is a data admin or data engineer who needs to do some ETL and create derived data sets. The thing to notice is that Jena granted access just by selecting the database, the table, and the operations that user is allowed to do. These are high-level operations like selecting, inserting, dropping a table, or altering the metadata; each requires many API calls under the covers, but you aren't specifying any of that, it all happens for you. Go ahead and grant it.

Now let's create another permission, for a user called the data analyst. The data analyst goes to the same database and the same table, but we're only going to give it select permission on a subset of the columns. You can give it a list of columns to include or exclude; in this case we're including the columns they're allowed to access, say the first five, the order doesn't matter, as if this were a data analyst doing some simple analysis. One more thing I want you to know: in our model, not only can you grant permissions, you can also delegate them. You can say a user has the ability to grant other users select access, or drop or delete access; it's like the grant option you find in a database system. OK, go ahead and grant.

Great, now we've specified the permissions; let's see what happens. Let's go to Athena as the data analyst. The data analyst can see the database; take a look at the table he or she sees. That person can only see the four columns we specified; she doesn't even know the CloudTrail data might have another 15 columns underneath it. Run the query, and you only see the columns you were granted. Now let's go over to Redshift, same thing. One thing we do have to do in Redshift is tell it which database to connect to and which schema to associate it with, using the Lake Formation role we set up to access that database. Go to the database and you'll see the table. This is the data owner, the data admin, and in this case that user sees all the columns, and if you run the query, all of the columns come back. (This is all internet delay; the performance you see here is not indicative of what you'll see when you actually run these things.) While it's running, let's go over to the next pane. This pane shows what a data admin or data steward gets to see: all the accesses that have happened to those tables and databases, in a single place.
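A hypothetical sketch of the analyst's Athena query driven through the API; whichever credentials the client uses, Lake Formation ensures the results contain only the granted columns. The database, table, and output location are invented for illustration:

```python
# Hypothetical sketch: run an Athena query as the data analyst and inspect
# which columns come back. Database, table, and output location are invented.
import time

import boto3

athena = boto3.client("athena")

run = athena.start_query_execution(
    QueryString="SELECT * FROM cloudtrail LIMIT 10",
    QueryExecutionContext={"Database": "cloudtrail_demo"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = run["QueryExecutionId"]

# Poll until the query finishes (simplified; production code should back off
# and handle FAILED/CANCELLED states more carefully).
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

rows = athena.get_query_results(QueryExecutionId=query_id)
# Even with SELECT *, the column metadata reflects only the columns the
# analyst was granted.
print([col["Name"] for col in rows["ResultSet"]["ResultSetMetadata"]["ColumnInfo"]])
```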
You can look at the event, the principal that made the access, and when it happened, and if you double-click on it you can see many more details about the context of that access, all in a single place. It's actually put into CloudTrail for you, and you can bring that CloudTrail data back into your data lake and query it with Athena. Kind of mind-bending, huh? All right, let's keep going and see if the Redshift query has finished. There it is, and it obviously has a lot more columns than the other query. So we set up a data lake for you today on S3, showed you how to set up permissions for users with different access controls, and those users can access this from anywhere in AWS.

One more thing I want to talk about is the Data Catalog itself. We've added features that let you annotate the metadata in the Data Catalog: you were already able to annotate things like tables and databases, and now you can annotate columns. We've also provided keyword search across all of these annotations as well as the other metadata in the catalog, which enables your users to discover which data sets are relevant to them, and we're adding more and more features along these lines that are useful for business users. Lake Formation is available today in something like 13 regions, and we're adding more regions over time. We've got four integrated services; all of them are GA except for EMR, which is in public beta. We urge you to try this out, and we want to hear from you. With that, I want to thank you for coming to this presentation at the end of re:Invent, and I want to hand it over to Joe, who's going to tell you about their journey getting all of this to work. Thanks a lot.
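Before the customer story, a minimal sketch of the catalog keyword search just mentioned, driven through the Glue catalog API; the search text is only an example:

```python
# Hypothetical sketch: keyword search across Data Catalog metadata.
import boto3

glue = boto3.client("glue")

resp = glue.search_tables(SearchText="cloudtrail")
for table in resp["TableList"]:
    print(table["DatabaseName"], table["Name"])
```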
Thanks, Mehul. I just want to say that anybody who has given presentations knows that Murphy likes to show up, so I applaud Jena for doing the demo there. Who's going to re:Play tonight, show of hands? OK, I'll make sure these next 60 slides go very fast. Only kidding; we want to get there as soon as we can. So again, my name is Joe Super. I have the privilege of leading a talented group of individuals in global infrastructure and operations: engineers, architects, data engineers, DevOps, SREs. Prior to Nu Skin I was a senior consultant with Amazon Web Services on the professional services side for about two and a half years, doing cloud consulting and helping enterprises with their digital journeys and DevOps transformations. Prior to that I led successful startups through exits, and I also led R&D for telecom and aerospace, things like the Mars rovers and the James Webb Space Telescope. So let's go ahead and get started.

Nu Skin today is in 48 countries around the world; we operate in APAC, EMEA, Europe, Africa, and the Pacific. We are an MLM direct sales company, built on a foundation of over 200 quality products originating in personal care and nutrition. One of the things that really attracted me to Nu Skin is the force for good: we've donated over 650 million meals to families in need, and that count grows every year. As of 2018 we're approaching three billion dollars in revenue, and we are a publicly traded company. Our mission statement is to be a force for good throughout the world by empowering people to improve lives with rewarding business opportunities, innovative products, and an enriching, uplifting culture. We give people who want to start their own businesses and sell their own products the tools and capabilities to do that, so as you can imagine we're very customer focused, as AWS is.

One of the reasons I'm standing up here today is a transformation at Nu Skin that started back in 2018; in 2019 we began an all-in migration of our data center, and it finished on August 18th. Some of the bullets you see here: cost-effective, scalable, modern standards and architectures, but the biggest thing is faster innovation; we want to deliver insights and access to that data very quickly and efficiently. One of the key things in this next infographic is that compared to comparable migrations we went 25 percent faster: it took us eight months, fifteen two-week sprints, done in an agile manner, and you can see in the middle that we took almost a thousand servers down to about half of that, and applications down to almost a third, using the Migration Acceleration Program. One of the things I like to talk about is innovation; I've categorized where we have actual resources working on it: almost 8,000 hours to date this year on innovation, and almost 2,000 hours just on the data lake and Lake Formation. One of the things I'll talk about in upcoming slides is that we're using Lake Formation to automate our daily sales reports. We have 52 markets in 48 countries; figuring out how we did from a revenue perspective on a day-to-day basis can be very complicated when you're dealing with multiple currencies, multiple commissions, and so on. You have to be very accurate, but it takes a lot of time.

So as part of this transformation we set out to build a data lake, and through the use cases we had, we built our own mission statement around it: drive opportunities for our customers; empower our end users to make data-driven decisions, so we're not looking in the rearview mirror but through the windshield in front of us to make the decision, rather than asking afterwards why we made it; and give flexibility to our stakeholders in what they're actually going to use. We have a lot of data sources, as everybody sitting out there does: databases, third-party data sources, Oracle, SQL Server, you name it, Google Analytics, APM tools. We're a big SAP ERP environment, and one of the things we're doing in supply chain is using Amazon Forecast to help set inventory levels and plan what we should manufacture and what should be on the shelves when we run a promotion or an event. When we were building out our data lake there were three main areas I had the team focus on: it must be easy to manage, with a simple security model that saves time and resources and is cost effective; and it has to be transparent, with a low barrier of entry for anybody who wants access to that data. Here's a little high-level look at the design we started with, which has different zones. On the left-hand side is the raw area: the unmodified original source data as-is, dropped in via API calls or Kinesis streams.
Then we hit the formatted section: Parquet with Snappy compression, partitioned data, so it's compressed and efficient when we hit it with Athena or Redshift. Then we have transformed, which is anything we're doing with our business rules and ELT, and then published, which is very tightly controlled; if there's PII data or anything like that, it's handled there. The published zone is where everybody lives: if they have a visualization tool or anything like that, they can go get that data really quickly. And then obviously Glacier, with lifecycle policies to age data out. On the right, if you look at a lake: you want to create a lake, not a swamp, so we have to be very careful about what data comes in and who has access to it. When you fill up a lake it starts at the bottom, so as we start filling it up there's raw, as-is, controlled, operational data; see that trident on the right side? Not a lot of people are going to get access to that, and we have to somehow automate and manage it. Then we hit formatted, transformed, and published. We want to teach everybody to swim and dive deep in the areas they need to, but only when they need to. We can't have PII or data specific to certain stakeholders, such as financial reports (we're a publicly traded company), exposed to people who shouldn't see it; you have to protect a lot of the data that sits out there.

Now taking that from a different angle, there are three main areas I focused on: data ingestion and quality, data engineering, and data science and analytics. On the left-hand side we're taking everything from social data to sales data to inventory, and the disparate data sources, Oracle, SQL Server, MySQL, everything we have, and we're using Matillion on the front end, along with other AWS tools like Kinesis Data Streams and Kinesis Data Firehose, to drop all of that data into raw. Collibra is our data governance and data lineage tool, so we can understand where the data is coming from and have governance around it. You can see Redshift able to hit the formatted and transformed zones through Spectrum if we want to grab that data really quickly. Further to the right, we're using a tool called Alteryx for business users to do their own ETL and ELT and machine learning, SageMaker for machine learning, and the visualization tools we have today are Power BI and MicroStrategy.

Everything I said there is great, but how do you automate it all? How do you really get every stage of this pipeline to be self-sufficient? I knew of AWS Lake Formation before I joined Nu Skin in June of this year, and I was eagerly anticipating its release; the thing I really wanted to focus on using it for was administration and security. Lake Formation can span the whole pipeline you see in front of you, but the biggest thing that was tripping us up was administration and securing that data for specific stakeholders. Before Lake Formation, this is exactly what we were trying or planning to do: you can see the raw data coming in, and we were going to have to literally copy that data, removing only the elements we needed for specific stakeholders, all the data minus PII here, whatever each consumer of the data needs there. That means requests coming in, making copies, and multiple keys in that same bucket.
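As a rough, hypothetical sketch (not Nu Skin's actual pipeline, which uses Matillion and the tools above), producing the formatted zone described earlier might look like writing Snappy-compressed, partitioned Parquet with Spark; the paths and column names are invented:

```python
# Hypothetical sketch of a raw-to-formatted step: Snappy-compressed Parquet,
# partitioned so Athena and Redshift Spectrum can prune efficiently.
# Paths and column names are invented for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-to-formatted").getOrCreate()

raw = spark.read.json("s3://nuskin-datalake/raw/sales_orders/")

(
    raw.write.mode("append")
    .option("compression", "snappy")
    .partitionBy("order_date")
    .parquet("s3://nuskin-datalake/formatted/sales_orders/")
)
```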
Then, as Mehul talked about, there are the S3 bucket policies and objects you have to keep updating and keep up with, and your Glue catalog metadata policies; it's never-ending, it keeps growing, and at the end of the day it becomes very error-prone and risky. It's duplicative work, and we didn't want to do that. So here, going back to the three tenets, easy to manage and a simple security model: this is a snapshot, like the demo you saw before, but from our sales orders. It's a small table in a big database, but you can see there are multiple columns I can grant or revoke access to, and it eliminates all those S3 bucket policies that are out there today. It's a simple grant/revoke model for permissions, so there's a low barrier of entry for your data engineers; anybody who is a database engineer or has worked around databases is going to find it simple to pick up and use right away. And that last line from a couple of slides ago: it got rid of all the copying of that data, so the data duplication went away, and now everything goes through Lake Formation. Lake Formation transparently handles all of that DB grant syntax behind the scenes, and it's a familiar user interface that engineers and data scientists can use. Some of the key KPIs in the bottom right: a 43 percent reduction just in copying data with the first use case, because with that sales data, where people couldn't or shouldn't see specific data points, we would have ended up copying 43 percent more than we should have; and that led to a 37 percent, almost 40 percent, reduction in operations: building out Lambdas, building out pipelines, working out how we were going to manage S3 bucket policies. It took all of that away. What you saw in the demo is literally what I can come in and do now, and I've done it many times with the team: we can give the CFO access to what he needs to see, and a financial planner something else, with specific columns on the same table in the same database, exactly what they need to be able to do.

The last requirement we wanted to meet was transparency, and Lake Formation is transparent: it literally just issues short-term credentials to all of the services on the right, so it inherently works with every service you see there: QuickSight, SageMaker, Glue, Redshift. That's big, because we've chosen Redshift as our data warehouse, moving off of Oracle, so having that ability to reach in, with column-level permissions on top of it, is the way we wanted to go; unified controls and seamless integration were a big point for us. To wrap up across all three areas: there's a lower learning curve, because you don't have to worry about the JSON-style syntax of S3 bucket policies; it's a single dashboard interface; you get column-level permissions out of the box, something that would otherwise have you duplicating data all day long; it reduces the process of onboarding new people, both into your engineering team and into your functional areas, legal, HR, finance, and what data they want from day to day; and it's transparent, it just works out of the box. We already had our data lake set up and were waiting on Lake Formation to be released; it came out I think two weeks later, and we were able to apply it in place, in a brownfield, not a greenfield, so it works with the data lakes you have today. Now, "this is great, Joe, you've shown me a lot of slides; give me an actual use case, I'd really like to see the use case myself."
Well, we talked about the daily sales report: on a day-to-day basis, counting how many people were touching it, that was forty-plus combined hours, and it's down to 15 minutes. The big thing that use case and Lake Formation helped us with is the sensitive and critical data located in it, the data we're extracting from SAP and from various government entities about the transactions we're doing in those areas of the world. We were able to literally put the permissions in place; we were looking at about a two-and-a-half-month setup time before Lake Formation, and we got it ready in three weeks by putting Lake Formation on top and just granting the column-level access, with no copying of data. We scrapped all that architecture and went straight with Lake Formation, and it's in production today. If you ever see our CFO, he'd probably tell you this is the best thing he's ever seen, because he's able to get his reports in 15 minutes now, every day. Some of the planned use cases we're looking at: natural language processing with sentiment analysis on what we're hearing from our call center, from social, and from clickstream data; and inventory forecasting, since we're doing a lot with Amazon Forecast today. All of that data coming in still needs a lot of security around it, so our CISO and everybody I'm partnering with loves Lake Formation, because it allows us to secure data down to that level, and they can audit us at any point in time through an API endpoint if they want to. As I said, Lake Formation can go across that whole pipeline; we only used it for administration and security, but we're looking at the ingestion blueprints and how they can help build on what Matillion does today, and also at the duplicate matching, because when you have multiple data sources out there, I knew there were going to be duplicates, and there are a lot of duplicates that are just named differently from what they're actually called coming into the data lake. So that's exactly where we're at. There's a little blurb from me that I won't read to you, but the goal was an easy-to-use security layer right out of the box that allowed us to accelerate our data lake and put it into production. And last but not least, during this transformation Nu Skin went from having maybe five AWS certifications to over 70 today, and it wouldn't have been possible without AWS Training and Certification; we've done everything from big data to DevOps to security. I'd encourage you to go take a look at AWS Training and Certification and see what you could do with Lake Formation and the other services that were released here this week. Thank you very much. [Applause]
Info
Channel: AWS Events
Views: 12,787
Rating: 4.9039998 out of 5
Keywords: re:Invent 2019, Amazon, AWS re:Invent, ANT218-R1, Analytics, Nu Skin, AWS Glue, AWS Lake Formation
Id: Wk9Hf4cwUFM
Length: 58min 35sec (3515 seconds)
Published: Sun Dec 08 2019