Processing Streaming Events at Scale with Amazon Kinesis and AWS Glue

Captions
Okay, we'll go ahead and get started. Welcome, everyone, to this webinar presented by ClearScale and AWS. Today's presentation is on processing streaming events at scale with Amazon Kinesis and AWS Glue. My name is David Ertl; I am a Solutions Architect with ClearScale, and co-presenting with me is Gandhi Raketla, a Solutions Architect at AWS.

A little bit about ClearScale to get started. ClearScale is an AWS Premier Consulting Partner. We started in Silicon Valley in 2011 and have completed over 850 projects with more than 250 Amazon customers. We are headquartered in San Francisco, with offices in New York, Texas, Florida, Colorado, North Carolina, and Arizona. As an AWS Premier Partner we are trusted AWS experts; we hold ten AWS partner competencies, such as data analytics, migrations, DevOps, and SaaS, and our engineering team holds over 100 AWS certifications.

On today's agenda, we're going to talk about how to use Amazon Kinesis to process streaming events, how to use Kinesis Data Analytics applications to transform and analyze data in real time, and how to archive streaming events to durable storage on Amazon S3, Elasticsearch, or Redshift with Kinesis Firehose. Then we'll cover how to use AWS Glue to catalog event archives and how to use Amazon Athena to query those archives, and we'll follow that up with a demonstration and a little bit of Q&A.

The first part of the presentation, how to use Amazon Kinesis to process streaming events at scale, will be presented by Gandhi Raketla from AWS. So take it away, Gandhi.

Hey, good afternoon, everyone. Can you see my screen? Today I'll be talking about data streaming and real-time analytics with Kinesis, so let's get started.

What we have seen recently is an explosion of data. We have gone from a limited number of applications generating a limited amount of data to connected devices, the Internet of Things, smartphones, and wearables, all generating huge amounts of data. Ninety percent of the data we see today was generated in the last two years, and the volume of data being produced keeps increasing exponentially every year. And it's not just about volume, it's about velocity too: every time you watch a Netflix movie, tweet something, click on an e-commerce portal, or post something on Facebook or Instagram, data is generated. Twitter alone generates about 12 terabytes of data per day. The third dimension is variety: the data can arrive as plain text, pictures, JSON, or a simple text message. So the volume, velocity, and variety of data are all changing.

Just like any commodity, data has a shelf life. As time passes, the value of the data, and the kinds of decisions you can make with it, decreases: from preventive and predictive actions, to actionable insights, to reactive responses, to purely historical analysis. Take the example of a hospital with thousands of patients. If you can constantly monitor their vital parameters and detect an anomaly in any patient's readings, you can alert the doctor or nursing staff to a possible stroke. Similarly, a cybersecurity company constantly monitoring network logs, VPC flow logs, or application logs can detect abnormal patterns, spot anomalies well in advance, and take action. Real-time analytics provides that value, and it is a real competitive advantage: the more quickly you can act on data in its early stages, the more of an edge you have.

So who are the candidates for real-time analytics? Any company that has volume, meaning lots of devices, applications, and IoT sensors; velocity, meaning data arriving at very high speed, every time someone clicks a button, a vehicle passes through a toll gate, or a product moves along a conveyor belt on a manufacturing line; and variety, because every data source is different: web click data is different from sensor data, which is different from the data coming from a patient's heart-rate monitor. All of this data has to be analyzed in real time so the business can take corrective action even before an event occurs. And it's not only about acting in the moment: you also need this data for the future, to ingest it into a near-real-time analytics platform, a data lake, or a data warehouse for deeper analysis.

So what are the challenges with streaming technologies today? First, they are difficult to set up. Any system that needs to scale to millions of data points processed per second, or even per millisecond, requires a huge number of servers, technologies, tools, processes, and monitoring to make sure you can ingest data from so many sources. The second challenge is achieving high availability. The nature of the data is that it has to be processed in near real time, reacting within sub-second latencies, which means the system has to be available all the time, 24/7.

Even one minute of downtime can mean losing a lot of critical data points. And when a system is difficult to set up and has to be highly available, it naturally becomes error-prone and complex to manage. Then there is scaling. Your business landscape can change: today you are monitoring 50 devices, and suddenly you need to monitor a thousand, or five or ten thousand. Today your application runs in one geography; tomorrow you want to make it global and monitor your website analytics across all users worldwide, for example to see which part of the world is having difficulty using your site. To scale that quickly when you are building everything yourself requires a lot of capital expenditure, a lot of servers, and a lot of time. Streaming data also cannot work standalone; it requires integration with many other enterprise systems. It might need to access a relational database, an object store like S3, or a third-party service, joining the streaming data with predefined reference data to make a decision, so you need integration with a wide range of services. And when you have a system of this scale, availability, and level of integration, it is expensive to maintain: you need the hardware, the servers, and the people to keep it running 24/7.

So how does real-time streaming data work on AWS, and how do we address these challenges? First, it is easy to use: to set up streaming you use a service called Kinesis, which I'll talk about more in the coming slides, and with a click of a button you have a service up and running into which you can easily ingest data. Second is availability: these are managed services by default, so you don't have to worry about keeping the systems up and running all the time, the data is stored durably, and there are predefined SLAs guaranteed by AWS for all of these services. Because they are fully managed, you don't need a system administrator doing patching and security upgrades. They are also elastic: if today you have a hundred thousand devices and tomorrow you want to double that, or you have an application in one geography and want to handle data coming from Europe as well, all you do is increase the number of streams or shards with a click of a button and you scale up easily. And there is seamless integration with other AWS services: you can process the data with Lambda, and you can store it in S3, Redshift, Elasticsearch, Splunk, and similar destinations.
Just like any other AWS service, it is pay-for-what-you-use. You don't have to make any upfront financial commitment: you can start small, test your applications and proof of concept, and if everything works you can scale gradually, because you pay only for what you use. If you want to retire some devices or reduce the volume of data you are sending, you can scale down and you don't pay for capacity you aren't using. There is no upfront commitment in terms of the infrastructure you provision. The key point is that you can build an application that handles the volume, variety, and velocity of data at minimal cost, because you pay as you go.

So let's see what the real-time analytics ecosystem looks like on AWS and what the different components are. The first component is the source, the systems that generate the data. That could be mobile apps: every time a user interacts with the app, you can stream data to personalize the experience. It could be web clickstreams: every time a user clicks a link or a button on your web page you might want to capture that, because it gives you critical information about how users navigate your site, whether they are having difficulty, and how much time they take to move from one part of a page to another; there is a huge variety of insights you can get from the way users access your application. It could be application or network logs that you continuously monitor to detect anomalies: are some users trying to access a service they shouldn't, is some service constantly failing in a pattern that suggests the application is about to go down, so that you can act in advance. Or it could be IoT sensors, smart buildings, or metering records, any system that generates data at this velocity and volume.

Once you have identified the sources, the next part is ingestion: how does the data get into the streaming service? For that, AWS gives you the AWS SDK, which you can use on the devices or systems producing the data, the Kinesis Producer Library, and the mobile SDKs. There is also the Kinesis Agent, which runs on a server, constantly monitors your log files, and pumps them into Kinesis. Data can also come from IoT devices and CloudWatch Logs, or through third-party tools such as Log4J appenders and Flume. These are the different libraries and integrations you can use to ingest data into Kinesis.
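To make the ingestion step concrete, here is a minimal producer sketch using the AWS SDK for Python (boto3). The stream name, region, and event fields are illustrative assumptions, not taken from the webinar:

```python
# Minimal producer sketch: send one click event to a Kinesis data stream.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Assumed event shape for illustration only.
event = {"user_id": "252", "device": "mobile", "event": "checkout", "ts": "2020-09-21T17:00:00Z"}

kinesis.put_record(
    StreamName="session-clicks",            # assumed stream name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],          # records with the same key land on the same shard
)
```

Using the user ID as the partition key keeps each user's events in order on a single shard, which matters later when the events are grouped into sessions.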
So now you have a source generating data and a toolkit to ingest it. Where does it go? That's the next part of the real-time analytics ecosystem: stream storage. This is where Amazon Kinesis Data Streams comes in, the service used to ingest and store streaming data. It is fully managed, so you don't have to set up any servers or infrastructure. You specify roughly the size of your records and how many you expect per second, and it can suggest how many shards you need. The data is stored in shards: a stream can have multiple shards, records are distributed across them by partition key, and by nature this is secure, durable storage. You can encrypt the data with encryption keys you manage yourself or with AWS-managed keys rotated through AWS KMS. Another important point is that the same stream is available to multiple real-time analytics applications. Once the data is in Kinesis, several consumers can read the same data and look at it from different angles: for the same clickstream, one application might measure how much time users spend on a particular page, while another might look at the average time users take to perform a particular action, such as completing a checkout. You can create a new stream in seconds, set the desired capacity with shards, and scale up and down. Today you might have one shard for a proof of concept; when you suddenly need to support a thousand devices or a thousand applications, you just increase the number of shards, and when the load drops you can merge shards or create a new stream with a smaller shard count.
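As a sketch of the "scale with a click of a button" point, resharding can also be done through the API. The stream name and shard counts below are assumptions for illustration:

```python
# Sketch of scaling a stream's capacity by changing its shard count.
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Double the provisioned capacity; Kinesis splits shards uniformly behind the scenes.
kinesis.update_shard_count(
    StreamName="session-clicks",
    TargetShardCount=4,
    ScalingType="UNIFORM_SCALING",
)

# Check the result once the resharding finishes.
summary = kinesis.describe_stream_summary(StreamName="session-clicks")
print(summary["StreamDescriptionSummary"]["OpenShardCount"])
```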
The next part of the problem is processing the streaming data. You have identified the source, chosen an ingestion mechanism, and the data is in Kinesis; now, how do we process it? For that we have Kinesis Data Analytics, where you build SQL or Java applications to process streaming data in real time. You can also use AWS Lambda to read from the Kinesis stream and run a function on the records, or Amazon EMR, where a Spark job reads from the stream with Spark Streaming, and there are third-party applications you can use as well.

Let me talk a little more about Kinesis Data Analytics. It is an analytics application that reads from your streaming source and processes the data in real time. It comes with a lot of built-in functions; for example, if you are continuously receiving credit card transactions from multiple applications and want to detect anomalies in those transactions, you can use the built-in anomaly detection functions in Kinesis Data Analytics. It works very simply: you connect your Kinesis streaming source, write SQL or Java to process the streaming data, and it continuously delivers results. You don't have to train your team on a new technology, because anyone who knows SQL or Java can write queries against this data, take action on it, and deliver the results continuously to other applications. You can store the output in S3 or build a real-time dashboard on top of it using QuickSight or Athena; David is going to talk more about that in the next part of the presentation, and we also have a demo showing how to do it.

So what are the key use cases for Kinesis Data Analytics? Anywhere you require sub-second end-to-end processing latencies, like the example of detecting a patient's deteriorating vital signs within seconds to prevent a stroke. Also, cases where you want to process the data with SQL and leverage pre-built functions, such as anomaly detection with the built-in random cut forest function, or where you need to continuously aggregate the data using windows. By nature, streaming data arrives continuously, so you need to be able to window it; there are different windowing mechanisms, such as stagger windows and tumbling windows, that let you say, for example, "aggregate all the clicks made by each user in the last one minute." If you have a requirement to window real-time streaming data like that, Kinesis Data Analytics will help. Some of its other features: simple programming, because if you know SQL or Java you can connect your application to the streaming source and start adding queries very quickly; high performance, with sub-second latency; stateful processing; and strong data integrity, meaning your data is processed exactly once and is in a consistent state after processing.

The next part of the analytics ecosystem comes after you have processed the streaming data and gathered insights: you want to store the results for future processing, or maybe for visualization, to build a real-time pie chart, bar chart, tree map, or some other kind of graph. For that you can use Kinesis Data Firehose. Firehose takes the data coming out of Kinesis Data Analytics, or directly out of a Kinesis data stream, and stores it in destinations like Amazon S3, Redshift, Elasticsearch, or Splunk, which are provided out of the box. You can also process and transform the data on the way through: maybe you need to anonymize it, or reshape it for the destination. You can use a Lambda function to do that transformation before the data is delivered to S3, Redshift, Splunk, or Elasticsearch.
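Here is a minimal sketch of what such a Firehose transformation Lambda can look like, following the record-transformation contract Firehose uses (a batch of base64-encoded records in, the same record IDs plus a status and re-encoded data out); the dropped field is an assumed example:

```python
# Sketch of a Kinesis Data Firehose transformation Lambda.
import base64
import json

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Example transformation: drop a field we don't want to persist
        # and add a newline so records are easy to split once they land in S3.
        payload.pop("ip_address", None)            # assumed field, for illustration
        transformed = json.dumps(payload) + "\n"

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",                        # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```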
So how does it work? You get data from Kinesis agents, Kinesis data streams, CloudWatch Logs, or Kinesis Data Analytics applications; you direct it into a Kinesis Data Firehose delivery stream; you can attach a Lambda function to transform the data; and Firehose delivers it to any of those four destinations: S3, Redshift, Splunk, or Elasticsearch.

Before we go further, when should you use Kinesis Data Firehose versus Kinesis Data Streams? If you want more custom processing, tight control over where the data is stored and how it is processed, and sub-second latency, go with Data Streams. If you are fine with a lag of around 60 seconds before the data is stored or acted upon, and you want zero administration, delivering the data into S3, Redshift, Splunk, or Elasticsearch without managing anything yourself, then Firehose is the right candidate. It all depends on the use case, and sometimes you use them in combination: Kinesis Data Streams receives the data, a Kinesis Data Analytics application does the processing, and the output goes to Firehose, which puts it into the destinations. You have all the tools end to end, from source to destination, and you can combine the services as needed.

If I put all of this together, this is what the end-to-end picture looks like. We talked about sources, such as mobile devices, clickstreams, IoT sensors, and logs, generating data at high volume, velocity, and variety. We talked about the Kinesis Agent, the Kinesis Producer Library, and the AWS SDKs for ingesting that data into Kinesis Data Streams, which provides the storage for the incoming data. For processing, we talked about Kinesis Data Analytics applications, and the Kinesis Client Library can also be used to build consumers. For the destination, we can use Kinesis Data Firehose to store the data in multiple AWS services: S3, Redshift, Splunk, or Elasticsearch.

Now the data is at its destination; do we stop there? No. We still need to run further analytics: understand the nature of the data, catalog it, run queries on it with Athena, and visualize it. That's what David is going to talk about with Glue and Athena. The first part of the story was getting the data, processing it, and landing it in storage like S3; the second part is what we do with it next, and that's where Glue and Athena come in. Over to you, David.

Great, thank you, Gandhi. If you can stop sharing your screen, I'll take over. All right, hopefully everybody can see my screen.

So now that you have all this streaming data, what are you going to do with it? Wouldn't it be nice to have an intuitive, powerful platform to transform and catalog your data so that you can search and analyze it? Well, that's what we use AWS Glue for.

So what is AWS Glue? Glue is a fully managed extract, transform, and load (ETL) service. It is a powerful platform that can transform and leverage your existing data stores to realize new business value, and because Glue is fully managed there's no need to run your own complicated EMR clusters on fleets of servers, which greatly simplifies the management and processing of your data.

AWS Glue consists of multiple components that work together seamlessly to provide a powerful data processing service. It starts by connecting the existing data stores that contain your source data. Glue can be managed through an intuitive, easy-to-use console as well as programmatic APIs; you can connect your data sources through the console or set them up through the APIs. Efficient and cost-effective Glue crawlers automatically scan and catalog your data for you, and powerful Glue jobs process and transform the data using the Apache Spark framework. At the center of it all is the unified Glue Data Catalog, which facilitates search and analysis across your data stores.

AWS Glue natively supports data stored in Amazon Aurora and other RDS engines like MySQL, PostgreSQL, and Oracle. It also integrates with Amazon Redshift and Amazon S3, as well as common databases running on EC2 in your virtual private cloud. Setup is easy, and you can use the AWS Management Console to connect all of your data sources. The console gives you the ability to define and orchestrate your ETL workloads through a web interface: you can define jobs, tables, crawlers, and connections to existing data sources; manage crawler schedules, which can be used to kick off Glue jobs; define events and triggers for starting Glue jobs; and set up complex workflows to process your data. Once you connect a data source, you can edit the transformation scripts that Glue automatically generates, in Scala or Python, directly through the console.

The Glue Data Catalog is the persistent metadata store that represents your data across multiple disparate data sources. It is a fully managed service that lets you store, annotate, and share metadata in the AWS cloud. Each AWS account has a single Glue Data Catalog per region, which gives you a uniform metadata repository across many different data stores. The catalog consists of databases and tables: databases are a logical grouping of metadata tables in AWS Glue, and tables represent the metadata for a connected data store. A table can only exist in one database, but tables from multiple different data stores can exist within the same database; for example, a Glue table representing data stored in Amazon S3 can sit alongside another table representing relational data stored in Amazon RDS.

Getting data into your Glue Data Catalog is really easy: you can set up a crawler to crawl your existing data stores and automatically construct the catalog. AWS Glue will crawl your data sources and build the catalog by inferring schemas using pre-built classifiers, and you can also set up custom classifiers if the pre-built ones don't handle your use case. This works with many popular source formats and data types, including JSON, CSV, and even Apache Parquet. When a crawler runs, it writes the metadata to the Data Catalog, automatically creating tables and schemas in the specified database.
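As a sketch of setting up such a crawler programmatically rather than through the console, here is a boto3 example; the crawler name, database, IAM role, and S3 path are assumptions:

```python
# Sketch of defining and starting a Glue crawler over an S3 prefix.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="clickstream-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",        # assumed IAM role
    DatabaseName="clickstream_db",
    Targets={"S3Targets": [{"Path": "s3://my-clickstream-bucket/sessionized/"}]},
    Schedule="cron(0 * * * ? *)",   # hourly, like the crawler in the demo later in the talk
)

glue.start_crawler(Name="clickstream-crawler")
```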
Crawlers are used to build out your data catalog based on your existing data stores. They can scan data in multiple types of repositories, automatically classify it, and extract and store the schema information and metadata in the Glue catalog. Crawlers can run on a schedule, like once a day or every few hours, or be kicked off manually through the console or the API, and they can also be configured to trigger based on the completion state of other crawlers or Glue jobs.

AWS Glue jobs provide a managed infrastructure to orchestrate your ETL workflow. Glue jobs can be used to transform and filter your data from the source format into a format that is more efficient for your BI and analytics workloads. They can be written in Python or Scala and take advantage of the Apache Spark framework to make transformations efficient and cost-effective. Within the AWS Glue console there are already some starter jobs that do basic transformations, basic ETL, that can get you going; you can modify them to fit your particular workloads. Glue jobs can be scheduled and chained with other Glue jobs to create advanced workflows that solve complex analytics problems, and they can be triggered to start when new data arrives in your data repositories, efficiently keeping your Glue catalog up to date.
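Here is a minimal sketch of a Glue ETL job script in PySpark along the lines described above; the database, table, column names, and output path are assumptions:

```python
# Minimal Glue ETL job sketch: read a cataloged table, remap a few columns,
# and write the result back to S3 as Parquet.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read through the Glue Data Catalog rather than pointing at raw files.
sessions = glue_context.create_dynamic_frame.from_catalog(
    database="clickstream_db", table_name="sessionized"
)

# Rename and retype columns on the way through.
mapped = ApplyMapping.apply(
    frame=sessions,
    mappings=[
        ("user_id", "string", "user_id", "string"),
        ("device_id", "string", "device", "string"),
        ("duration", "int", "session_seconds", "int"),
    ],
)

glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-clickstream-bucket/curated/"},
    format="parquet",
)

job.commit()
```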
In AWS Glue you can use workflows to create and visualize ETL activities involving multiple crawlers, Glue jobs, and triggers. Each workflow manages the execution and monitoring of all of its components; as a workflow runs each component, it records execution progress and status, giving you both an overview of the larger task and the details of each step. The Glue console provides a visual representation of a workflow as a graph, which makes it easy to set up and modify workflows right in the console. Event triggers within workflows can be fired by both jobs and crawlers and can in turn start other jobs and crawlers, so you can create complex chains of interdependent components. To share and manage state throughout a workflow, you can define default workflow run properties. These properties, which are name-value pairs, are available to all jobs in the workflow; using the AWS Glue API, jobs can retrieve the workflow run properties and modify them for jobs that come later in the workflow, which lets jobs pass information to one another.

You can use AWS Glue to query against an S3 data lake. Glue can catalog your Amazon S3 data, making it available for querying with Amazon Athena and Amazon Redshift Spectrum, and with crawlers your metadata stays in sync with the underlying data. Athena and Redshift Spectrum can query your S3 data lake directly using the Glue Data Catalog, so you can access and analyze data through one unified interface without digging into multiple data silos. AWS Glue can also prepare your clickstream or process-log data for analytics by cleaning, normalizing, and enriching your data sets: it generates the schema for your semi-structured data, creates ETL code to transform, flatten, and enrich it, and loads your data warehouse on a recurring basis. You can use the Glue catalog to quickly discover and search across multiple AWS data sets without moving the data; once the data is cataloged, it is immediately available for search using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum. AWS Glue also enables you to perform ETL operations on streaming data using continuously running jobs. Glue streaming ETL is built on the Apache Spark Structured Streaming engine and can ingest streams from Amazon Kinesis Data Streams and from Apache Kafka using Amazon Managed Streaming for Apache Kafka. Streaming ETL can clean and transform streaming data and load it into Amazon S3 or JDBC data stores, so you can use it to process event data like IoT streams, clickstreams, and network logs.

Okay, great. So now you have your data cataloged in AWS Glue; what can you do with it, and how can you search it? This is where Amazon Athena comes in. Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. It's easy to use: you simply point to your data in Amazon S3, define the schema or let a Glue crawler define it for you, and start querying with standard SQL; most results are delivered within seconds.

Some Amazon Athena features: as I mentioned, it can run standard SQL queries against data stored in S3. It has fast performance; queries run in parallel, so you get results in seconds even on large data sets. It is a pay-per-query model: you only pay for the queries you run, and you're charged based on the amount of data actually scanned, so you can optimize scans and costs by compressing and partitioning your data to reduce the total data scanned per query. Athena is highly available and durable, with the underlying data stored in S3, and it uses Apache Presto as the query engine. Athena also integrates with many AWS services: you can use it to query logs from CloudFront, Elastic Load Balancing, and CloudTrail, and Athena tables can even be created directly from the CloudTrail console. It integrates with Amazon QuickSight as well, for easy data visualization, giving you a powerful platform to both analyze and present your data.

Athena is composed of databases and tables. Databases are a logical grouping of tables, and tables are containers for the metadata definitions that define a schema for the underlying source data and tell Athena where that data is located in S3. Tables can be created automatically with Glue crawlers or manually through the console; crawlers are usually the easier way, because they infer the schema and set the tables up for you. Amazon Athena integrates with the AWS Glue Data Catalog, your central metadata repository. In regions where AWS Glue is supported, Athena uses the Glue Data Catalog as the central location to store and retrieve table metadata throughout an AWS account. The Athena query engine needs table metadata that tells it where to read data, how to read it, and other information necessary to process it, and the Glue Data Catalog provides that unified metadata repository across a variety of data sources and data formats, integrating not only with Athena but also with Amazon S3, Amazon RDS, Amazon Redshift, Redshift Spectrum, Amazon EMR, and any application compatible with the Apache Hive metastore.
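Here is a sketch of running a standard SQL query against a Glue-cataloged table from code; the database, table, and results bucket are hypothetical:

```python
# Sketch of querying a Glue-cataloged table with Athena via boto3.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = """
    SELECT device, COUNT(*) AS clicks
    FROM sessionized
    GROUP BY device
    ORDER BY clicks DESC
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "clickstream_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then print the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```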
The Athena console is easy to use: you can manage your queries, see your query history, and manage your connected data sources through the console. You can run queries against existing tables using standard SQL, create new tables or views from the results, and save those queries to run as needed. The console also allows you to preview data in tables and to delete existing tables.

So when should you use Amazon Athena? Athena helps you analyze unstructured, semi-structured, and structured data stored in Amazon S3. You can use it to run ad-hoc queries using standard SQL without having to aggregate and load the data into Athena first. Its integration with Amazon QuickSight lets you build powerful visual representations of your data, and you can use Athena to generate reports or explore data with business intelligence tools or SQL clients.

Okay, so now what we're going to do is walk through a simple demonstration. Gandhi is going to present it, and it will show how you can tie all of this together and actually use it to get business value. Go ahead, Gandhi.

Yeah, thanks, David. Can you see my screen? (It is loading; give it a second.) So this is going to be our demo. Before I go into it, here is a quick walkthrough of the architecture: data ingestion happens from multiple sources; clickstream data is generated and analyzed with a Kinesis Data Analytics application; using Kinesis Data Firehose we store the results in S3; we catalog the data using Glue; we write queries in Athena; and we build a dashboard in QuickSight. Let me walk you through each of these steps in detail.

The first thing, as I said in my presentation, is that we need a source that generates data. For demonstration purposes, I have written a simple Lambda function that generates clickstream data for us. (Gandhi, we can see your presentation but not the console.) Can you see my screen now? (There we go, now we can see the console, thank you.) So, I have written a simple Lambda function that simulates real-time clicks. Typically, clickstream data contains the user who clicked, the device they clicked from, and what they clicked, for example whether they are making a selection or checking out. For this demo we pretend we have a simple beer selection application, where a user can search for a beer, add it, check out, and pay. We run this Lambda function continuously to generate the click data, and that data gets ingested into a Kinesis data stream called session-clicks.

So we'll start from there. I go to Kinesis, and there is the data stream I created, called session-clicks. All the clicks coming from the user devices, whether a laptop, mobile, or tablet, every click the users make, are ingested into this stream. You can open the stream and monitor it: it provides detailed metrics on how many records are being ingested, the record counts, the total data volume, and the latency. Once the data is being ingested, this is where you do all of your monitoring. So now the data has landed in Kinesis Data Streams, and the next step is to process it.
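The generator function itself isn't shown in the webinar; a minimal sketch of what such a simulator Lambda might look like is below (field names, values, and batch size are assumptions):

```python
# Sketch of a Lambda that simulates click events and pushes them into Kinesis.
import json
import random
import datetime
import boto3

kinesis = boto3.client("kinesis")

DEVICES = ["laptop", "mobile", "tablet"]
EVENTS = ["begin_navigation", "search_beer", "select_beer", "checkout", "payment"]

def lambda_handler(event, context):
    records = []
    for _ in range(25):
        click = {
            "user_id": str(random.randint(1, 500)),
            "device": random.choice(DEVICES),
            "event": random.choice(EVENTS),
            "ts": datetime.datetime.utcnow().isoformat(),
        }
        records.append({
            "Data": json.dumps(click).encode("utf-8"),
            "PartitionKey": click["user_id"],
        })

    # Batch write; Kinesis accepts up to 500 records per put_records call.
    kinesis.put_records(StreamName="session-clicks", Records=records)
    return {"sent": len(records)}
```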
For processing, as we discussed in the presentation, we use a Kinesis Data Analytics application. It's a simple application where you connect your streaming source; in this case I've connected the application to the session-clicks stream into which all the clicks are being pumped. Optionally you can also connect reference data, for example master data that you keep in S3, and use it in your queries.

Then we run analytics on this data using SQL. This is the source data: since my Lambda is running, clicks are continuously being ingested into the stream, and the Kinesis Data Analytics application automatically discovers the schema of the incoming JSON and builds a simple table structure so you can view the data, for example user ID 252, on a computer, clicked on this particular event at this particular timestamp. We can keep watching what the same user is doing.

Next we sessionize the data. I have many users, and I want to see each individual user's behavior: how much time they take navigating from one page to another, and which pages they spend more time on. To get that kind of insight, I write a simple SQL function against the Kinesis stream source using a windowing function, in this case a stagger window, which groups all the clicks by a particular user and device ID over a range of one minute. A user might browse the site for a minute or two and then leave, so we group everything that user did on that device in the last one minute; that is all this simple SQL function does. That's why, when you go to the results view, you see continuous analytics running on top of the stream: what you're seeing is the output of that SQL function, executed continuously on the incoming data. It says that this user ID, on a mobile device, had this many events, began navigation here, ended here, started at this time, finished at that time, and the session duration was 14 seconds. The data refreshes continuously, so you will see the results changing. So we have taken raw clickstream data coming from multiple sources, devices, and users and derived a meaningful aggregation from it. And we don't have to stop there; we could run several such queries in parallel, for example which users are clicking the most links, or which devices they are using, just by writing a few more simple queries.
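The exact SQL isn't shown in the transcript; a sketch of the kind of stagger-window sessionization query described here, written as it would be pasted into the Kinesis Data Analytics SQL editor, might look like the following (the column names and types are assumptions about the discovered schema):

```python
# The SQL below would be pasted into the Kinesis Data Analytics SQL editor.
# "SOURCE_SQL_STREAM_001" is the default name KDA gives the in-application
# input stream; the lowercase column names are assumptions.
SESSIONIZE_SQL = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    user_id       VARCHAR(16),
    device        VARCHAR(16),
    event_count   INTEGER,
    session_start TIMESTAMP,
    session_end   TIMESTAMP);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS
  INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM
        "user_id",
        "device",
        COUNT(*)        AS event_count,
        MIN("event_ts") AS session_start,
        MAX("event_ts") AS session_end
    FROM "SOURCE_SQL_STREAM_001"
    -- Stagger window: group each user/device's clicks into a one-minute session window.
    WINDOWED BY STAGGER (
        PARTITION BY "user_id", "device", FLOOR("event_ts" TO MINUTE)
        RANGE INTERVAL '1' MINUTE);
"""
```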
Now I can also send these results to a destination. I have written a simple Lambda function that takes the sessionized data you just saw on this page, stores the raw data in an S3 bucket, writes it to CloudWatch at the same time, and also ingests it into a Kinesis Data Firehose, from where we store the data in S3. You can see here that I have configured a Kinesis Data Firehose delivery stream that takes the data and stores it in an S3 bucket for the sessionized clickstream data. In that bucket I'm keeping the raw data as-is for future analysis, and I'm also storing the aggregated data; as it's written, Firehose automatically partitions it by year, month, day, and hour, and you can see all 24 hourly partitions. That means all the clicks from the last hour, aggregated by minute, end up organized under that hour's prefix in the S3 bucket.
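A sketch of what such a destination Lambda could look like follows; the delivery stream name is an assumption, and the event and response shapes follow the Kinesis Data Analytics Lambda-output convention as best understood, so treat the whole thing as illustrative:

```python
# Sketch of a KDA destination Lambda that forwards aggregated session records
# into a Firehose delivery stream.
import base64
import json
import boto3

firehose = boto3.client("firehose")

def lambda_handler(event, context):
    results = []
    for record in event["records"]:
        # KDA hands the destination function base64-encoded output rows.
        session = json.loads(base64.b64decode(record["data"]))

        firehose.put_record(
            DeliveryStreamName="clickstream-sessionized",      # assumed name
            Record={"Data": (json.dumps(session) + "\n").encode("utf-8")},
        )
        results.append({"recordId": record["recordId"], "result": "Ok"})

    # Acknowledge every record that was handed to the function.
    return {"records": results}
```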
Now that the data has landed in the S3 bucket, the next step is to catalog it. For that we use the service David just presented, AWS Glue. Glue has crawlers, which can crawl a particular S3 bucket, understand the data, and create a table from it. I have a crawler that runs every hour; it checks this S3 bucket, derives the schema from the data, and automatically creates a table with columns like the user, the device ID, the begin and end of navigation, and so on. This is really the key part: you need to be able to understand the data sitting in your S3 buckets and turn it into a table you can use later.

Once Glue has cataloged the data, I can use Athena to query it. The tables were created automatically, and I have created two views on top of them. I can preview the data, and it shows me the top results: this user started with a selection and didn't take much time at all, while another user started from the products page, made a selection, and took 17 seconds. If you want to query the data and get quick insights, you can do it right here.

You can query with Athena, but you might also want visualization, and that's where we use QuickSight. QuickSight helps you visualize the data, and the nice thing, if I quickly walk you through it, is that you can connect many different data stores. When you create a new analysis and a new data set, you can upload files, or pull from S3, Athena, Salesforce, RDS, Redshift, Presto, or even third-party sources like Snowflake or Twitter, all at the click of a button. I have chosen Athena as my data source and built my data set from it. I can refresh it continuously or on a schedule, apply row-level security, and write my own queries against it. Once the data source is set up, you can build visualizations; here I have drawn two charts that update as the data set refreshes.

Notice the difference in data freshness: what you see in QuickSight might be an hour old, while what you see in Kinesis Data Analytics is only seconds old and can be queried almost immediately. That is the difference I talked about at the beginning, between preemptive, actionable insights and more reactive analysis; different applications serve different business purposes.

So, going back to the architecture: we took the clickstream data from the users, ingested it into a Kinesis data stream, queried it with a Kinesis Data Analytics application, pumped the results into Firehose, had a Glue crawler crawl the data and create a table, wrote queries and views in Athena, and used one of those Athena views to build a dashboard in QuickSight. That is the end-to-end data pipeline we built, and everything in it is a fully managed service; I didn't launch a single EC2 instance or server to create it. You focus only on building the core business logic instead of worrying about how to manage the infrastructure. With that, I'll hand it back to David; we're happy to take questions now on Kinesis, Glue, or Athena.

Great, thanks for the demonstration, Gandhi, I appreciate it. So let's go over some Q&A. Here's a question I actually see quite often: can I use Amazon Athena and S3 to replace my transactional database? People often want to use S3 as their database, and the short answer is no, that's not really the use case for it. S3 is an object store; it is not a database that handles things like transactional updates. If you try to meet that use case with Athena and S3, it's not going to be a good technology fit; you always want to use the right technology for the business problem you're trying to solve. Athena and S3 are for business analytics and BI-type use cases, not transactional updates.

Another question I see here: how do I transform the incoming data? I'm getting data from multiple sources, but my destination requires the data in a different format; how do I do that? You have two options. One is to use the data transformation feature in Kinesis Data Firehose for simple transformations. If you want to write more complex ETL logic, Glue ETL can handle it; Glue generates default ETL code for you, and you can take that generated code and customize it.

Cool, great. Okay everyone, we are out of time. Thank you very much for joining the webinar today. As always, if you have questions, please reach out; we are always here to help. Thank you very much.
Info
Channel: ClearScale: APN Premier Consulting Partner
Views: 218
Keywords: streaming events, amazon kinesis, aws glue, clearscale, aws athena, s3, redshift, data analytics
Id: iolVB-Bq8Zc
Length: 59min 45sec (3585 seconds)
Published: Mon Sep 21 2020