Appnext: Kinesis, EMR, Athena, Redshift - Choosing the Right Tool for Your Analytics Jobs

Video Statistics and Information

Captions
Benjamin: Welcome to This is My Architecture. This week we are in Tel Aviv, Israel. I'm Benjamin from AWS, and today I'm joined by Vladimir from Appnext. Thank you for joining me, Vladimir.

Vladimir: Thank you. Hello.

Benjamin: Before we dig into your architecture, tell me a little bit about what Appnext does.

Vladimir: Appnext is a leading mobile discovery platform, helping millions of users to experience apps at the right time.

Benjamin: OK, so a mobile platform with many, many users around the world. I imagine this produces a ton of events. What kind of scale are we talking about?

Vladimir: Right now we are talking about 700 million users daily.

Benjamin: And this is really the basis of your data platform, which needs to consume, store, and process all of these data events. From what I understand, your platform evolved over time. Where did you start? What was the starting point?

Vladimir: We started with a classic OLTP architecture. We had our application servers writing all this data into MySQL running on EC2, and after that we had an ETL process running on EC2 instances that copied the data, transformed it, and loaded it into Redshift.

Benjamin: So the starting point is where a lot of customers are, and the end point was a healthy split between your transactional and your analytics data stores. Why did you need to move beyond that? Why wasn't it sufficient?

Vladimir: After we reached the point of loading something like 400 million events into MySQL, we started to see performance issues and scaling issues with the MySQL database.

Benjamin: Right, it's relational, that's the trick: it doesn't scale as easily. So you ran into scale issues. What was the next step? How did you progress?

Vladimir: We decided to move all these big data streams that were being loaded into MySQL somewhere else and load them directly into Redshift. We decided to use Kinesis Data Firehose: our application servers were just writing all these big data streams into Kinesis Firehose, which was writing them to S3, and from there they were loaded into Redshift.

Benjamin: OK, so right off the bat you split off the big data elements, which shouldn't really be stored in a transactional database to begin with, and put them in a good place: the data gets dumped into S3 and loaded back into Redshift. That's a good move. I imagine at that point your requirements from MySQL were reduced?

Vladimir: Yes, because we didn't need all this compute power and the EBS volumes we had been using. We were using Provisioned IOPS volumes for the MySQL instances, and we just moved our storage from Provisioned IOPS to gp2.

Benjamin: Immediate cost savings?

Vladimir: Yeah, it reduced the cost by something like 30%.

Benjamin: OK. And I see there are other boxes here, so I know that wasn't the endgame. What was the next step beyond that? What was the next point where you figured you needed a new tool?

Vladimir: After we moved all this data from MySQL to Redshift, at Appnext we have several teams that query the data: in addition to the analysts, there's the integration team, the data team. They all started to query the data in Redshift, and these weren't analytical queries; they were SELECT * queries running on Redshift. Any time we had a new release or a configuration change, all of these users were running concurrent queries on Redshift.
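As a rough illustration of the ingestion path Vladimir describes (application servers writing events to Kinesis Data Firehose, which delivers them to S3), a minimal sketch using boto3 might look like the following; the delivery stream name and event fields are hypothetical:

    import json
    import boto3

    # Firehose delivers batched records to its configured S3 destination;
    # "appnext-events" is a hypothetical delivery stream name.
    firehose = boto3.client("firehose")

    event = {"user_id": "u-123", "app_id": "a-456", "event_type": "impression"}

    # Firehose does not add record separators, so append a newline
    # to keep the resulting S3 objects line-delimited JSON.
    firehose.put_record(
        DeliveryStreamName="appnext-events",
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )

From S3, the accumulated objects can then be loaded into Redshift, for example with a COPY command.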
Benjamin: So these were really workloads hitting Redshift because that's where the data was, but they weren't data warehouse queries, queries optimized for a data warehouse. So you decided to peel that off?

Vladimir: Yes. Because Redshift was exhausted and all these queries were very, very slow, we decided to move all this ad hoc querying out of Redshift to somewhere else. And because we already had the data saved in S3, we decided to use Athena to query this data directly.

Benjamin: OK, so super simple: the data is already there, just add Athena and write your queries. And then, I imagine, you immediately reduced the amount of concurrency on Redshift and, more importantly, pulled away the queries that aren't really effective on a data warehouse.

Vladimir: Yes.

Benjamin: Cool. So how do we get to using EMR?

Vladimir: At this point, with all the events being loaded directly into Redshift, we had aggregation processes, all of them running on Redshift, and as we loaded more data the processes became slower and slower, running something like three to four hours each. So we decided to move all these aggregation processes out of Redshift, because they became very slow, and because we were loading more and more data into Redshift we needed to resize the Redshift cluster more and more often. Each time you resize the cluster it goes into read-only mode, so all the data we were loading from our application servers to Redshift wasn't being loaded during that time, and we needed to manually reload it from S3 and rerun all of the aggregation processes that hadn't run during that time.

Benjamin: So again, you were running things that you could run on Redshift, but it's not the right place for them going forward. What's the solution to that?

Vladimir: We decided to use Spark on EMR. Once again, because we already had all this data saved in S3 in JSON format, we are just consuming it with EMR clusters, running all the aggregation processing there, and saving the results in Parquet format, which is a columnar format and much more efficient for analytical querying.

Benjamin: I know from other customer conversations that Parquet will reduce the query processing time on Athena. Did you see that benefit as well?

Vladimir: With Parquet, not just with Athena. In addition to Athena, we are now running queries directly on S3 with Redshift Spectrum, so the same Redshift cluster where we have all this data saved is now also available for querying the S3 data.

Benjamin: OK, so Redshift Spectrum, and again cost effective, reducing the load on the compute nodes that don't really need to do this work. This is really cool. I really like this story, and I think the reason why is that it tells the story of how, when we're thinking about a data platform, there is maybe this illusion of a nirvana state where this data needs to live in this place, and if you just separate OLAP from OLTP everything will be all right. What your story tells us is: well, no. The data events and the scale continue to grow, and the consumers change their behavior; now they want more because it's available. So you continuously need to evolve your platform and find the right tool for the right job, and that's an ongoing thing. I think this really shows that. So thank you, thank you for sharing that, Vladimir, and thank you for watching This is My Architecture.
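As a rough sketch of the EMR step Vladimir describes (reading the raw JSON events from S3 with Spark, aggregating, and writing the results back as Parquet), assuming hypothetical bucket paths and column names:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("json-to-parquet-aggregation").getOrCreate()

    # Read the raw line-delimited JSON events that Firehose landed in S3.
    events = spark.read.json("s3://appnext-raw-events/2018/10/")

    # A simple daily aggregation; the real jobs would compute business metrics.
    daily_counts = events.groupBy("event_date", "app_id", "event_type").count()

    # Write the results as Parquet, a columnar format that Athena and
    # Redshift Spectrum can scan far more efficiently than raw JSON.
    daily_counts.write.mode("overwrite") \
        .partitionBy("event_date") \
        .parquet("s3://appnext-aggregated/daily-counts/")

Both Athena and Redshift Spectrum can then query the Parquet output in place by defining an external table over the S3 prefix, which is what makes the same data available to ad hoc users and to the Redshift cluster without reloading it.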
Info
Channel: Amazon Web Services
Views: 42,996
Keywords: AWS, Amazon Web Services, Cloud, cloud computing, AWS Cloud, dw, data warehouse, redshift, analytics, spectrum, parquet, big data, data lake, OLAP, OLTP, Spark, parquet format, Amazon Athena, Amazon Elastic MapReduce (EMR), Amazon EMR, Amazon Kinesis, Amazon Kinesis Data Firehose, Amazon Redshift, Amazon Simple Storage Service (S3), Amazon EC2, TMA, This is My Architecture
Id: wEOm6aiN4ww
Length: 7min 30sec (450 seconds)
Published: Thu Oct 25 2018