AWS re:Invent 2018: Technology Trends: Data Lakes and Analytics (ANT205)

Captions
Hi, this is the technology trends session. My name is Anurag Gupta; I run analytics, RDS, and Aurora for AWS. This is an unusual talk for me. Normally I do some sort of deep geek dive on distributed systems and databases, and this is a more forward-looking talk on what I believe the trends are that are driving data and analytics over, let's say, the coming ten years or so. If I do it right, you'll agree with about 70% of what I say and think I'm insane about the other 30%. We'll see how that goes.

You may have seen this cover from The Economist and elsewhere, talking about the world's most valuable resource moving from oil to data. It makes sense that things are moving toward data in an information economy. And this, I think, is a very telling slide: the world's five largest companies by market capitalization from 2001 to 2018. There's been a lot of movement since I had to submit my slides, so don't pay attention to the 2018 market caps, but the sequencing is roughly correct. In 2001 there was one tech company, one bank, one retailer, one gas company, and one conglomerate. Go to 2006: two oil companies, a conglomerate, a tech company, and a bank. Go to 2011: three oil companies, a bank, and a tech company. In 2016 something happens: it's all tech companies, and a very specific type of tech company; these are all data-centric companies. In 2018 it's largely the same; not a great deal has changed. That's interesting, and I think it reflects how people value data in this world.

In particular, I'd ask: who can recommend a song to you better, Apple or your best friend from high school? Who knows your wants, needs, and desires better, Google or your spouse? It's hard to say now, right? But what do we mean by a data-centric company? What do we sell? How do we make money? It's very clear what you sell when you're selling oil. If you're collecting data, what are you selling? Here's a small exchange from C-SPAN where somebody is saying, "We believe that we need to offer a service that everyone can afford, and we're committed to doing that." "Well, if so, how do you sustain a business model in which users don't pay for your service?" "Senator, we run ads." That's correct, and I think it's an interesting point; we'll talk about it in a little more detail as we go on.

If you take one thing away from this talk, it's that if we're all turning into data-driven companies, if we really believe that's what people value, then you've got to start thinking about data as an asset to your business, not as a cost. That starts by saying: let's stop throwing data away, let's make it available to more of the users in our organizations, and let's arm them with more data processing technologies. Now, that sounds good, but it's actually super hard, because there's a lot more data than people think. I've been in this business for a long time, and my estimate is that data is growing about 10x every five years, across industries. And when you're making a data replatforming decision, in general that decision is valid for about 15 years. People often argue with me about that, and I ask them: hey, you're moving away from, let's say, Teradata. When did you make the decision to move onto it?
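To make that arithmetic concrete, here is a minimal sketch; the 10x-per-five-years rate and the 15-year horizon are the speaker's estimates, not hard data:

```python
# Back-of-the-envelope: if data grows ~10x every 5 years, a platform
# decision that has to last 15 years must absorb 10^(15/5) = 1000x growth.
growth_per_period = 10     # ~10x growth...
years_per_period = 5       # ...every 5 years (speaker's estimate)
horizon_years = 15         # typical lifetime of a platform decision

scale_factor = growth_per_period ** (horizon_years / years_per_period)
print(f"Required scale over {horizon_years} years: {scale_factor:,.0f}x")
# -> 1,000x: a terabyte today becomes a petabyte; a petabyte becomes an exabyte.
```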
It's right in that range. And how much has your data grown over the past 15 years? It's one of those things where people can predict the past really easily but can't predict the future very well; they somehow think things level out, and at least over the last 30 years, they haven't. The point is that if you multiply the 15 years by the 10x growth every five years, your platform decision needs to scale a thousand x from where it is today. If you've got a petabyte today, you're going to have an exabyte in 15 years; if you've got a terabyte today, you're going to have a petabyte in 15 years.

The second problem is that there are more ways to analyze data than ever before. Hadoop wasn't a thing eleven years ago, Elasticsearch wasn't a thing eight years ago, Presto wasn't a thing five years ago, and Spark wasn't a thing four years ago. Now think about how important those are in your data processing ecosystems today. That's interesting, because the innovation in data processing engines is almost something you need to defend against; in other words, you need to architect your way into supporting whatever is coming in the future.

And there's a third issue: we all want to democratize data and make it available to more users, and at the very same time we want to limit access, because that's how we get compliance, that's how we get audit, that's how we ensure security. All three of those are challenging, and that's what's driving people to data lakes.

What data lakes let you do, already today, is support exabytes of data, and they separate where you store data and how you transform it from how you access and manipulate it. You load it, transform it, and catalog it once, and the data is available to a multitude of tools. The second key point I'd ask you to take away from this talk is that you really have to rely on open formats and open APIs, because I don't know what the next great tool out there is going to be. I'm pretty sure it will support Parquet or ORC or JSON or whatever; it's not necessarily going to support some random binary format that's highly optimized inside some other tool. That's a path to lock-in.

Andy announced Lake Formation earlier today; you can sign up for the preview, small ad there. There are basically three components to Lake Formation. First, we have a bunch of blueprints that help you build, transform, and deploy your data. We have security policies, so that security can be applied on the data lake itself, not on the access paths into the lake, and that's another thing that's necessary: one of the reasons people are stuck in their data warehouses today is that that's where their centralized security is, that's where their centralized catalog is, and so forth. My view is that, at least for the next 15 years, data lakes are the new data warehouses and data warehouses are the new data marts; you'll be doing subject-based analysis there. That means you need to control your data and do the auditing near the center. So inside our catalog we allow you to define policies, including policies based on tags like PCI, PII, and so on.
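To ground the open-formats point, here is a minimal sketch of writing and reading Parquet with pyarrow; the file name and columns are hypothetical:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table in an open columnar format (Parquet).
# Any engine that speaks Parquet -- Spark, Presto, Athena, Redshift
# Spectrum, or whatever comes next -- can read this file back.
events = pa.table({
    "user_id": [1, 2, 3],
    "event": ["login", "purchase", "logout"],
})
pq.write_table(events, "events.parquet")

# Reading it back requires no knowledge of the tool that wrote it.
print(pq.read_table("events.parquet").to_pydict())
```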
At that point you can say: Anurag is allowed access to PII data, but only for his employees, and so on. That policy is then applied across all of the tools that access the lake, because we're wrapping the services that access it to filter out the rows and columns you're not allowed to see. We're also doing that with a JDBC or ODBC driver for your SQL-based access. Being in that query path gives us the ability to control access, to audit, and, over time, to do things like data masking and pseudonymization. We're pretty excited about it; it's obviously a service that gets better over time as we improve it.

The last big portion is that as data gets bigger, it becomes very hard for humans to manipulate it, and one of the biggest problems there is cleansing your data. So we're moving toward a machine learning-based approach, starting with deduplication of data and record linkage between datasets, and that will expand as time progresses.

Here is how that works. To build the lake quickly, we try to identify, crawl, and catalog your sources, and do that dynamically, using machines; no one is sitting there by hand. The difference between old-school Yahoo and Google was that Google crawled, while Yahoo had a bunch of editors defining that directory structure; then the data got too big, and one clearly ate the other. We also ingest and cleanse the data and transform it into optimal formats behind the scenes, based on access. We talked about security management: enforcing encryption, defining access policies, and, since we're in the path for all of the accesses, clearly auditing regardless of what tool you're using. And we're really interested in providing a self-service discovery environment, so you can just use search to find what data is where.

This slide gives you a screen grab of what our database, analytics, and now blockchain picture looks like. Looking at that rough layer cake: data movement at the bottom, the storage and lake capabilities in the middle; Glue is an important part of that, both for ETL and for cataloging; then a variety of different databases, analytic tools, and blockchain capabilities. Above that, we also think of AI and ML as a core part of your data architecture; unless you get the lowest layers clean, it's hard to do anything with automated reasoning. Those are our tools on one side. On the other side, we're retailers at heart: we believe in selection, we believe in low prices, we believe in fast delivery, and that carries over to AWS as well, so you'll see there are a ton of Marketplace offerings in each of these areas too.

Here's a picture of some of the recent announcements. Lake Formation you heard about this morning. In Redshift we've added concurrency scaling, so you can add cluster capacity as your demand grows and shrinks, and elastic resize, so you can resize in minutes as opposed to hours. We're adding embeddability of the dashboards you generate in QuickSight, as well as ML-based insights. You've heard about blockchain this morning, and DynamoDB transactions, as well as the ability to automatically scale DynamoDB up and down.
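As a flavor of the DynamoDB transactions announcement, here is a minimal boto3 sketch; the tables, items, and condition are hypothetical:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# All-or-nothing write across two items: decrement inventory and record
# the order, guarded by a condition that stock is still available.
# If any part fails, the whole transaction is rolled back.
dynamodb.transact_write_items(
    TransactItems=[
        {
            "Update": {
                "TableName": "Inventory",          # hypothetical table
                "Key": {"sku": {"S": "WIDGET-1"}},
                "UpdateExpression": "SET stock = stock - :one",
                "ConditionExpression": "stock >= :one",
                "ExpressionAttributeValues": {":one": {"N": "1"}},
            }
        },
        {
            "Put": {
                "TableName": "Orders",             # hypothetical table
                "Item": {
                    "order_id": {"S": "order-123"},
                    "sku": {"S": "WIDGET-1"},
                },
            }
        },
    ]
)
```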
You've heard about Timestream this morning, QLDB, global database capabilities in Aurora, as well as Amazon RDS on VMware. Those are the key database-centric announcements from this morning, and here are some places you can go if you want to learn a bit more. I'll hold on this slide for a minute in case you want to read it rather than have me talk through it. I think people have their phones down now.

So let's look at a few companies, how they're using data lakes, and what that's doing for them. You're familiar with Epic Games; you know Fortnite, and who doesn't play Fortnite? What they're really interested in is creating a constant feedback loop for their designers based on the things their players are doing. That's actually not specific to the gaming industry; we all benefit from regular, constant feedback, as long as we can make our delivery cycle as fast as theirs. What they're really focused on is ensuring high engagement, because that's customer satisfaction, and that yields more time spent on their game. They've moved toward a model where their game servers and clients feed a Kinesis stream into two different pipelines: one near-real-time, which they push through Spark into DynamoDB and out into Grafana for access, and one batch, going into their data lake and then out into, in their case, Tableau and ad hoc SQL. The key part of this setup, I think, is that the telemetry is all collected using Kinesis, and they've separated real-time from batch in terms of how they process. This has been a very successful outcome for them.

If we look at Equinox, here's basically what they're responsible for; I'm sure a lot of you use Equinox as well. They have a lot of clubs, a lot of studios, a lot of different offerings, as well as central capabilities to encourage people to go to the club and utilize their services, plus value-add services like their spa. A lot of their world is about supporting connected insights, from both the digital products their end users use, like my phone and Apple Health, and the equipment inside the clubs: for example, gamifying the cycling experience at SoulCycle, or digital assessments and location tracking. What they're doing is moving data from a variety of places, both their own data and the Adobe and other social information they have access to, to get processed with Informatica and EMR into Redshift, then out into a series of subject-layer data marts using Postgres and out into their presentation layers, going in and out of their data lake. That's another very common frame here: data warehouses and data lakes aren't an either/or decision; most people are using both.

Here's a picture of their overall pipeline. Adobe Analytics, which is basically how they collect information on their promotional work, goes into S3, gets transformed, gets saved to their data lake every day, is used in Athena to alter the table to add a partition, and is then made available through Glue and through Redshift Spectrum, which is how we integrate Redshift with the data lake.
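That daily add-a-partition step might look roughly like the following in boto3; the database, table, partition scheme, and bucket names are hypothetical, not Equinox's actual setup:

```python
import boto3
from datetime import date

athena = boto3.client("athena")

# Register today's drop of transformed clickstream data as a new
# partition, so Athena and Redshift Spectrum queries can see it.
today = date.today().isoformat()
athena.start_query_execution(
    QueryString=f"""
        ALTER TABLE adobe_clickstream
        ADD IF NOT EXISTS PARTITION (dt = '{today}')
        LOCATION 's3://example-data-lake/adobe/dt={today}/'
    """,
    QueryExecutionContext={"Database": "analytics"},      # hypothetical
    ResultConfiguration={
        "OutputLocation": "s3://example-athena-results/"  # hypothetical
    },
)
```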
Their basic benefit is that they've done a big replatforming from a prior solution; they built two apps in just four months, which is pretty nice, and they get a lot of the traditional benefits you can read there.

So that's the data lake story. I think you're all probably looking at data lakes; when I talk to customers, pretty much everybody is. Now let's get to the part you're probably going to disagree with, at least in part, which is that we need to rethink what we mean by data, what we mean by analysis of data, and what we use it for. What you see up there is a picture of a tablet with a set of information streaming off of operational data, and that's a very nice version of what most of us think of as analytics and as data today.

But this is data too. That's obviously an Echo device. We think about the Echo as the speaker on our desk or in our bedroom, but what really makes the Echo interesting is the questions people ask it, and more importantly, the questions we were unable to get answers for. That's how it gets better: that feedback loop of what we were able to understand, and which questions made us say, "sorry, I don't know how to do this." The very same hardware appliance gets better based on the data in that continual feedback loop.

This is also data. This is someone shopping at an Amazon Go store. There are a lot of retailers, like Safeway and Albertsons, that have loyalty cards, and they get a lot of insight into what you buy at the end of the day. What they don't get insight into is your customer journey. If you think about the Go store, we can pay attention to the path you walk through the store, what things you picked up and put back, what you looked at; did you read the nutrition label before putting it back, or something else? We don't necessarily do all of that today, but you get the concept: that's integrating video into the data experience, and it's a lot of data that we might be collecting but are throwing away in many other environments today. You can see how it's valuable. For example, I'm a Safeway shopper, and they have coupons, but they're coupons for everyone in the store; they aren't tailored to me. It doesn't say, "hey Anurag, you bought this Cabernet, here's two bucks off of it." It's the same for all Safeway shoppers, whatever I might or might not be buying; it's not very personalized.

This is also data. This is Prime Now, with one-hour delivery. I chose not to use a picture of a drone here, because that seems high-tech; that's just some guy pedaling on a bicycle, but he's still delivering product within an hour. Let's say my wife asks me to go replace a bunch of toilet paper; it takes me about an hour to get to the store and back. I find it remarkable that Amazon is able to pick, pack, and ship something not just to me within an hour, but also to several other people on my route,
and do it cost-effectively. Think about how that transforms the customer experience: I can do something else with that hour in my day, but I'm still getting the things I need. That's the customer experience point, and it's the core of what I wanted to talk to you about: we often use data to optimize our transactions, but we don't often use data to more deeply engage with our customers. The question is, how do we do that, and how does that change the data we collect and what we do with it? That doesn't mean reporting, analysis, modeling, planning, and so on are going away; they'll be around forever. It's a question of what else we do with data. I'm going to go through a couple of examples.

I was on a plane recently, an 18-hour flight from San Francisco to Dubai on my way to India, and like everyone on that flight I had a screen in front of me, and you spend a lot of time watching movies, because what else are you going to do? There's a lady to my left who's maybe 75 years old; there's a gentleman to my right who's maybe 22. We all have the same screen in front of us; not just the physical screen, but the same sequence of choices. The airline I was traveling on knows my gender, knows my age, knows where I live and where I'm traveling. Why wouldn't they use cohort analysis to decide what we should see, rather than assuming all of us want to watch The Avengers? Which is a very popular movie, and if you haven't seen it you should, but that doesn't mean all of us have the same interests. And that's just cohort analysis. We also all provide social exhaust out there: one of us is going to a wedding, one of us is going to a job interview, one of us is going to a child's or grandchild's birth. How does that change what we want to see? And I can tell you this: if it were Google or YouTube or Facebook up there, they'd be monetizing 18 hours of my undivided attention. It wouldn't just be "let's keep the guy soporifically quiet so he doesn't keep bothering the flight attendants," which is value, but there's more value there to be unlocked.

Let's take another example. The world I manage is actually fairly complicated, a little too complicated for me, so I use an analogy: what would I be doing if I were running a small coffee shop? If I were running a coffee shop, I'd know all of my regular customers. I'd know their favorite drinks. I'd know whether they just want that coffee pushed across the counter in the morning because they need to get caffeinated, or whether they want to talk about sports or politics; I'd probably have read a little before the store opened so I'd know what to talk to them about. Now let's say Anurag's coffee shop ends up becoming popular, more like Philz or Peet's, or maybe I can aspire to Starbucks someday. Would I not want that same quality of service? Of course I would. So imagine there's a camera with recognition running on it, and it knows who's coming in the door. Imagine there's some audio, so it knows what our conversations tend to be. And imagine something just pops up:
"This is Anurag, he really wants a grande latte, and he's interested in the Forty-Niners." You can say something, and to the customer it sounds like the barista in front of you knows them: it's an experience, and it's part of a relationship. I've been training myself in this new way of thinking: looking at all of the transactions I'm part of nowadays and thinking about them as relationships that I could be building, or that the other side could be building.

For example, we all checked into a hotel recently. You provide your credit card, you provide your driver's license or whatever other ID you have, they run the transaction, off you go, and they're very polite and efficient. They're optimizing the transaction flow. Let's imagine instead that there's a camera that recognized me as I came in the door, and they said: "Mr. Gupta, I see you're here; you must be here for re:Invent. I notice this is your seventh re:Invent; you're one of the old-timers, aren't you? I've already got your badge, so you don't have to check in somewhere else for that. I know you're not a gambler, but I thought maybe you'd like a show. And I've integrated with your calendar, by the way; I see there's one open spot on your schedule on Wednesday night, and I think you like sushi, so I've got a reservation ready for you at Sushi Samba, held for 24 hours." It's a totally different experience. I benefit from it as a consumer, and my hotel benefits from it. The fact is, economically we're taught that people care a lot about small differences, but I don't actually know whether Peet's or Starbucks is ten cents more or less expensive. I know which one I like, and it's much more about the service, the person behind the counter, the Wi-Fi, whether it's crowded, than it is about the particular product. I think we need to move toward that relationship model in general, and data can help us get there.

As we think about more data and different data, presumably the applications also change. One of the things one might do is manufacture events, like Singles' Day or Prime Day. We run a lot more packages through a fulfillment center on Prime Day, and you can see that DynamoDB ran 3.34 trillion requests on Prime Day, which is a pretty extraordinary number; we don't normally talk about trillions. And 12.9 million requests a second; that's crazy. So there's this basic notion of systems that seamlessly scale up and scale down while providing a consistent experience. Why is consistency so important? Because people get frustrated within a second or two, and you need to provide that experience regardless of how successful you are. You have to plan for success just as much as you plan against failure. So these sorts of systems matter a great deal, and change over time matters a great deal: increasingly we're going to be looking at IoT sensors, vehicle telematics, application logs.
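On that seamless scale-up/scale-down point, here is a minimal sketch of configuring DynamoDB auto scaling through the Application Auto Scaling API; the table name and capacity bounds are hypothetical:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Let the table's write capacity float between the bounds, tracking a
# target utilization, so it scales up for a Prime Day-style spike and
# back down afterward.
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/GameTelemetry",        # hypothetical table
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    MinCapacity=100,
    MaxCapacity=40000,
)
autoscaling.put_scaling_policy(
    PolicyName="telemetry-write-tracking",
    ServiceNamespace="dynamodb",
    ResourceId="table/GameTelemetry",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,  # keep consumed/provisioned capacity near 70%
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBWriteCapacityUtilization"
        },
    },
)
```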
A lot of the things I talked about previously reference architectures that were deeply integrated with a short-term experience: from someone coming in the door, to my having all the information in front of me on the screen, to serving that request as a relationship request in the context of the other relationships I've had. You need to be able to pull that at pace through a time-series database, and you've seen us announce Timestream today; we think it's going to be interesting.

We also see ledger databases as an emerging need, in part because of the integration across data silos and across companies. In that hotel example I gave before, I was integrating with my calendar, with Amazon's site, and maybe with the hotel's reservation system for a particular restaurant, in addition to the check-in process. Right now, you provide the same darn information to six people, possibly within the same transaction, and that's just frustrating; at least within the transaction you don't want to do that, and you want to own the totality of the experience. We have our own ledger database, but we think it's also important to integrate with whatever is available in open source, so we have a managed blockchain as well, which integrates with Ethereum and Hyperledger Fabric. The goal is to integrate the lower-level ledger capability we've been using for some time with the APIs and environments that let people build the apps they want.

So really the core of the point I'm making is that data has power. It has power if you amass it, make it available to as many people as you can, through all the tools that you can, and use it for far more than transactional optimization, because understanding things over time is how you optimize relationships. I think that's how, at the end of the day, we'll have enriched experiences, both as consumers and, hopefully, as business people and business owners. That's actually my last slide, and it looks like I went through this talk really quickly, so I'm happy to take any questions, or you can get a head start on your next session. Yes, sir?

So the question was about data privacy: how do you balance the information you're gathering and using with what people might or might not want shared? My take is that if you make your information available to Facebook or Google or whatever, you've kind of made it available to anybody willing to pay two cents a click; that's point one. And I think people have moved toward more sharing, but you have to be careful not to do it in a creepy way. The part of my example that may have felt creepy, in the reservation case, was that they knew I liked sushi. I might frame that another way: "I've got reservations at a couple of different places here that are pretty popular: Sushi Samba, or maybe an Italian place, or whatever." You kind of know where the person's going to go, but you've got backup options, and maybe I'll change my mind about what I want to eat today anyway.
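An aside on the ledger-database point above: the core structure is an append-only, hash-chained journal. A toy sketch of that idea in plain Python follows; it illustrates the concept only and is not the QLDB API:

```python
import hashlib
import json

# Append-only journal where each entry commits to the one before it via
# a hash chain, so history can be verified but not silently rewritten.
journal = []

def append(entry: dict) -> None:
    prev_hash = journal[-1]["hash"] if journal else "0" * 64
    payload = json.dumps(entry, sort_keys=True)
    digest = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    journal.append({"entry": entry, "prev": prev_hash, "hash": digest})

def verify() -> bool:
    prev_hash = "0" * 64
    for record in journal:
        payload = json.dumps(record["entry"], sort_keys=True)
        expected = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        if record["prev"] != prev_hash or record["hash"] != expected:
            return False
        prev_hash = record["hash"]
    return True

append({"guest": "A. Gupta", "action": "check-in"})
append({"guest": "A. Gupta", "action": "restaurant-reservation"})
print(verify())  # True; tampering with any earlier entry makes this False
```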
Those sorts of things help, but it has to be done with a level of delicacy, for sure. The question for me is: are you more likely to help or to offend? By and large, people like feeling that they're part of a relationship, that the people they're interacting with know and remember them, even if that memory is more institutionalized than the specific person at the desk in front of them. We all feel like VIPs at that point, and hotels do treat VIPs like that; why can't I be a VIP? Anything else? Yes, sir?

So the question was: will Lake Formation support CloudFormation templates? In time; I think there's a desire to have all services available through CloudFormation, because that's the way things are built in AWS nowadays. Lake Formation is perhaps analogous to CloudFormation, as you can see in the name, but CloudFormation I think of as something I use to complete a short-term task, while constructing a lake is something that might take days or weeks. We're trying to take away time much the same way CloudFormation does. I'll have to get more tread on the tires, seeing how people use it, to determine what I can automate and simplify, perhaps through CloudFormation. But we can talk afterwards about how you're using CloudFormation for your lake, and maybe that will help me define the service.

I'm sorry, could you say that again? So the question is what I think about data virtualization. What do you mean by data virtualization? Oh, I see: data lakes versus data virtualization, so federated data versus centralized data. I don't necessarily think of it as an either/or question. Your data catalog and your data lake, in some sense, should be able to support data that resides in a multitude of places. Lake Formation won't do that in our initial launch, but it's something we'll want to support on a long-term basis, mostly because it removes one step of data movement. The data lake itself prevents you from having a many-to-many mapping, much as data warehouses used to, but it's even better if you can say: for my hot data, let me just query it where it is. I think people will want that, and they'll also want the ability to have all of that hot data eventually make it to the lake. Sometimes people think of federated data access as a way to avoid using a lake, but in my view that's not correct, because it leads toward a world where deletions still happen in those source systems, and those systems generally weren't architected for the same scale points, or the low cost, of something like S3. That's what leads you to want your data inside a lake in addition. Anything else? Yes, ma'am, or sir, sorry?

So the question was: I can't afford to store all my data, particularly video, which is 10 to 200x the size of my audio, which is 10 to 200x the size of my transactional information. What do I do? How do I draw that line? I think the question comes down to two lines. There's one line,
which is: what's useful to me now? That line is a pretty deep cut on your data. Then there's a second line: what might be useful to someone? That line might be "I'm going to keep it for a day, a week, or whatever," and that might give someone a place to play with something. If they build something useful, then as long as its value exceeds the cost, I'm in a good place. But I know one thing: if I don't store it, no one's going to build an app to use it. It's one of those potential-energy kinds of questions; you at least need enough of it there to be able to get some value, maybe some portion of the stores or whatever.

So the question there was: you had a lot of components on those slides, and aren't there just as many opportunities for aggregation as disaggregation with respect to functionality, for example ML and SQL? I'd say you're absolutely right: there should be an integration of ML-style capabilities into SQL, starting with supporting sparse matrices as a first-class component of SQL, just as sets are today. Then you could generate a sparse matrix from a join and a filter in SQL, and manipulate that sparse matrix as a baseline data type using some set of more ML-ish or stats-ish capabilities. Now, the fact of the matter is that that might let you do some portion of what you can do in, say, SageMaker inside your data warehouse or whatever system you're using for SQL, but it's not going to have the totality of capabilities. So, as you rightly point out, it's not an either/or; it's a question of when you go to the higher-end capability versus the more baseline one. Does that answer make sense to you? Anything else? Yes, sir?

So the question was: ingestion is a dirty job, pulling from lots of different places; is there anything that simplifies it? Nothing sufficiently good would be my answer, and at the core I think that's a lack of applied machine learning. Think about it: every time Adobe Omniture changes a format, all of us go out and add a column. That should happen once for our two million customers, because it's a shared format; a lot of us are using Omniture. As a different example, we all have street addresses, and we're all trying to get to conformed street addresses. That should be easily cleansed, particularly by a company that has a pretty decent mapping of street addresses, at least in the United States. So there are capabilities you could knock off because they provide shared value. Beyond that, it comes to recognizing commonality of type and materializing that in the metadata around something: this isn't a varchar, it's a street address; this isn't a varchar, it's a person's first name, not their last name, and which one is which actually depends on where they come from, in terms of salutation; this isn't a varchar, it's actually a zip code, because it's got a dash in it, and so is this other thing you thought was a number, where somebody just has a five-digit zip code.
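A toy sketch of that type-recognition idea; the patterns and labels are illustrative and far simpler than anything production-grade:

```python
import re

# Guess a semantic type for a raw column value instead of settling for
# "varchar" or "number". The patterns here are illustrative only.
def guess_type(value: str) -> str:
    if re.fullmatch(r"\d{5}(-\d{4})?", value):
        return "zip_code"        # 90210 and 90210-1234 both qualify
    if re.fullmatch(r"\d+ [A-Za-z ]+ (St|Ave|Rd|Blvd)\.?", value):
        return "street_address"
    if re.fullmatch(r"\d+(\.\d+)?", value):
        return "number"
    return "varchar"             # fallback when nothing else matches

for v in ["90210", "90210-1234", "1200 Main St", "42", "Gupta"]:
    print(v, "->", guess_type(v))
```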
All of those things, I think, can be done, and it's one of the areas where, in time, the cloud will have access to more data and be able to democratize the learning and training we do across data, versus any individual customer. Is it there now, or close to it? No. Is it something we're aware of and need to get done someday? Yes. I think we're done, so thank you all very much.
Info
Channel: Amazon Web Services
Views: 4,450
Keywords: re:Invent 2018, Amazon, AWS re:Invent, Analytics, ANT205
Id: RxYtS0nOYow
Length: 47min 53sec (2873 seconds)
Published: Thu Nov 29 2018