Azure Stream Analytics with Event Hubs

Captions
Hey, Dustin Vannoy here. I'm going to share a bit about Azure Stream Analytics: how to get it set up, some of its features and capabilities, and then we'll create a Stream Analytics job or two.

Stream Analytics is a really easy way to do streaming within Azure. It's tightly integrated with Azure, but it has a limited set of inputs and outputs. If you're working in Azure and you find a use case that fits well, it's an easy solution: serverless, auto-scaling, with some really good features. But if you're working with a variety of sources, like a true Apache Kafka or Confluent Cloud setup (not Event Hubs, and not Event Hubs for Apache Kafka), it's not going to work for you unless you also start streaming that data into Event Hubs, which is entirely possible and reasonable. The point is: if you're within Azure, looking for an easy way to do stream analytics, or bringing data into Azure from another Kafka cluster, it might be a good fit for you.

Let's first set up a Stream Analytics job; that will give us a feel for the capabilities. Find Stream Analytics in the portal and create a brand new job. I'll name it "demo stream analytics" (very creative, I know), then select a resource group; I have a streaming resource group it will fit well in. Confirm the location and keep going. Streaming units defaults to three, which is fine for this scenario. You may want to secure private data assets, so look into your options around data protection; there's a good article in the docs that will give you a lot of useful information. If you have an edge case where you're not hosting in the cloud but on edge devices, definitely look into that as well. For most cases, at least while getting started and testing things out, you don't need to add all that data security quite yet; for production workloads at your company, take a look and make the right decisions for your use case.

While the job is deploying, there are some interesting things to note. First, it deployed really quickly, so as I said, it's easy to get started and easy to use, especially from the UI. There are also REST APIs, so you can manage Stream Analytics jobs from a client you write, but we'll stick with the UI for this demo. Now we can look at the next steps for the job.

To set up Event Hubs for our Azure streaming, jump to the Event Hubs page, choose Create, pick a valid resource group, and give the namespace a name. Then choose a location; usually you'll put it near your other resources. The pricing tier is interesting: for a demo there's a Basic option with strict limits on consumers and broker connections, but I'm going to use Event Hubs for Apache Kafka in some of my examples, so I'll choose Standard; you cannot use Event Hubs for Apache Kafka with Basic. If you're setting this up for your company's production environment, you might also look at the Premium option (in preview) and the other tiers available. Then there are throughput units:
these apply to the whole namespace, so if I create a lot of event hubs this could become something I need to scale up; for a demo, one is fine. Auto-inflate is a good feature to know exists; I'll allow it to scale from one to two units, which I probably won't even hit with this example, but it gives a little flexibility in how much throughput we can handle. You have the option to add tags, then review your selections and choose Create.

Once your Event Hubs namespace is created, you can go set various settings. Access control is something you'll always need to think about within Azure, and we can add shared access policies. I'll start with a Stream Analytics example, which will add one, and then I might just use the default root policy, which probably isn't what you'll do in production; there you'll probably want to limit it to just produce or just consume (Send or Listen). For a demo, we'll stick with that. I can change my throughput units after the fact, think about geo-recovery, and configure networking, which typically means limiting access to internal Azure resources and virtual networks. Encryption is something you certainly want to think about; I'm often comfortable not using a customer-managed key, but at a company I let them make that decision, of course. And we can always go down and view the properties.

Now we get to the good stuff under Entities. There's a schema registry, which isn't going to work exactly like the Confluent Schema Registry, but it's something you can use within Event Hubs; maybe I can come back and talk about that another time. For now, let's look at the Event Hubs page. This is where I can actually create an event hub. When you create one, what you're defining is the event hub name and the partition count. The partition count determines how much you can scale your consumers: if I have only one partition, all of my data goes to that single partition and I can really only consume from it, whereas with, say, three partitions, I could have three separate consumers sharing the same consumer group ID, and each would read only the messages it should. It splits the messages into three groups, each consumer grabs a piece, and you're running three consumers in parallel. For message retention we have the option of one to seven days; you may only need a day or two, just in case you need to catch up or replay some data. I could also turn on Capture, which stores the data in Azure Storage, so I could keep it indefinitely instead of only for seven days. Give it a name, click Create, and now you've got an event hub you can start working with. Again, you can either use the root policy with all of the permissions it carries, or create one specific to this event hub if you're really trying to lock it down to only those that need to use it.

Back in the Stream Analytics job I created, which is blank, let me add the inputs and we'll talk through it as we go. There are only three stream input options; I'll add Event Hub here, and
then we can give it a name. This alias is really just for Stream Analytics, so I'll call it eh1-input. Then we need to pick an Event Hubs namespace; I'll choose demo-eh2 and use an existing event hub that I already set up. I just find it's easier to set up the event hubs in advance. I'll let it create a new consumer group; I don't normally have problems with that. I do have trouble getting managed identity to work; you'd typically have to set that up yourself, and at least in my environment it fails quite a bit, so I just use a connection string, for demos at least. Then you can either create a new event hub policy for this or use an existing one. I like to create a new one if I have the permissions; it gets a default name that makes it obvious where it came from, and everything is taken care of. You'd want to think about the partition key if you're using Event Hubs, especially if you have a lot of data coming through, but we won't get into that right now; let's focus on setting up this technology for a basic scenario. For the serialization format you have a few options to choose from; we'll stick with JSON. This describes what the input data coming from Event Hubs looks like, so you may not have the choice in the real world: someone else may be producing the data for you, and you just need to find out the format they're using to serialize it. Encoding is always going to be UTF-8 at this point, and for event compression type you have a few options if you're going to deal with compressed data. Click Save; that should create the policy, assuming I have all the right permissions, and then we have an input we can use in our Stream Analytics job.

Let's decide what our output will be. We have an input and an output, and then we get to define the query, plus functions and lookup data if we choose. We'll write the results back to Event Hubs. A data engineering practice I want to show you: if we're doing some processing, maybe some aggregation, we'd often want to keep that data streaming for multiple consumers. If everything were going to be done by us in Stream Analytics, we could just have multiple outputs here, but let's pretend we have additional consumers, maybe some kind of reporting microservice that goes back to the customers, and we want to make sure they can get this exact same data. So for this piece of the Stream Analytics job, we're going to write it back to Event Hubs. I'll call the output eh-out, use my Event Hubs namespace, and create a new event hub topic named demo-out. So we have the event hub name demo-out; I'll use a connection string and let it create a brand new policy for us. I'm not really going to worry about the partition key for this example, and I'm not going to work with custom property columns. I'll have it write the data out as JSON, so it's the most portable for the different consumers I expect to have. Typically, analytics and streaming systems use line-separated JSON, so you'll have multiple JSON objects within a file, or within a batch that gets sent from Stream Analytics. The encoding is still UTF-8.

Within Stream Analytics, I'm mostly going to be working on adding a query that will transform the data, but I do want to point out very quickly that there's a Functions
feature where you can plug in an Azure ML service, ML Studio, or a JavaScript function. The really important piece of the Stream Analytics job, though, is the query itself, and this is where the Stream Analytics query language comes into play. Notice you can jump to the docs right from the Stream Analytics job, which is really handy, because if you're new to this you'll probably need to check them for the exact functionality you're looking for. SELECT * means select all of the fields that will be in the message; INTO says where we want that to go, our output event hub eh-out-1; and FROM is eh1-input. If I were to run this as is, all I'd be doing is taking data from one event hub and writing it to another event hub. There's a chance we'd want to do that, though probably not with Event Hubs; I don't think that's very likely to be our case. Let's start looking at some of the other capabilities. First, before we can really do much here, I need to have data flowing, and then I have some handy features like input preview that we can use. I'll go kick off my producer, and once we have some data we can start to build out this query a little more. For the time being, I'll save it.

Let's take a look at Azure Stream Analytics hands-on, using the Stream Analytics job. Here we are in the query pane, and I've already set up an output topic and an input topic; both are event hub topics at first. If I kick this off, we will select every record from my input data and send it along to the output topic. If we look at the sample data toward the bottom, we can see the New York taxi trip data, our column names, and a glimpse of the data types. From there, instead of doing SELECT *, I could start to choose just a few of the fields to work with. For example, if I select trip distance, passenger count, and vendor ID and run a test query, it shows me a much smaller number of columns and the first 50 rows from the recent data that's streamed in. In addition, I can add filters. A lot of the things you'd expect to do with SQL, especially things that would work in T-SQL, are available here; the more you get into functions and more advanced features, the less likely it is to match up, but that's why the query language docs are so helpful, and they're right here to open up and check for whatever you care about. For example, if I wanted only vendor 2, I could easily add a filter on vendor ID equals 2.
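As a rough sketch, the pass-through query and the filtered projection described above would look something like this in the Stream Analytics query language. The input and output aliases (eh1-input, eh-out-1) come from the walkthrough; the column names (VendorID, TripDistance, PassengerCount) are assumptions based on the New York taxi dataset and would need to match whatever fields the producer actually sends.

```sql
-- Version 1: pass-through, sending every field from the input event hub to the output
SELECT *
INTO [eh-out-1]
FROM [eh1-input]

-- Version 2 (replaces Version 1): project a few columns and filter to a single vendor
-- (column names are assumed; match them to the actual JSON fields)
SELECT
    TripDistance,
    PassengerCount,
    VendorID
INTO [eh-out-1]
FROM [eh1-input]
WHERE VendorID = 2
```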
So here I'm querying only for VendorID equals 2, and just by adding that WHERE clause you can see it has narrowed down the results. That's pretty good. I can also do aggregation, so let's take a look at an aggregate. As you probably know from running SQL queries, we need to use aggregate functions: I can sum passenger count for a total number of passengers, sum trip distance, and then do a fairly simple calculation like the average of the tip amount and the total amount. Let's continue on to the GROUP BY. If I hover over the warning, it tells me I need a GROUP BY statement or the OVER clause, so let's add GROUP BY vendor ID. It's also looking for a window: data is going to continue to stream, so we need a way to define what window of time, what section of time, we want to aggregate over. Let's use a tumbling window and give it a duration, and that takes care of the warnings. There's an event time that comes in with this data from Event Hubs, and by default that's used for my window. We'd probably really want to go back and use the pickup time you may have seen in the data instead; you'd define that on the FROM clause. We'll just let it use the default event timestamp, run this test query, and see what we get.

Now I'll produce some more data and we'll see how that changes things. If we refresh the input preview, that changes the data we're looking at, and when I run my test query I get numbers that are different from before; likewise, if I keep refreshing the input, those numbers keep changing. That means I could save this and run it, and we should see results every minute. This is okay, but there are only two vendors right now, and the vendor ID itself doesn't give me a lot of information. What I'd rather do is use the taxi zone, so let's see how we can take some reference data as input, look up the right value, and use that for the grouping.

I have my query based on vendor ID, with the GROUP BY and SELECT clauses using it, but with only two vendors I really want to join in the taxi zone data so I can see what zone the pickup happened in. To do that, we need to add a reference data set, then add a join to get my zone data and change the grouping from vendor to zone. For the reference input, the easiest option I've seen is a SQL database, so we'll get that set up. I have a serverless SQL instance running, so I enter my connection information here and then update the SQL statement so it's valid for my database. Now I can save, and I have reference data I can actually join in. To get zone-sql joined in, I alias my first input, add zone-sql, set up the join clause, and then put that t1 alias in front of each field that comes from the initial input. Rather than vendor ID, I replace it with t2.Zone, the name of my taxi zone. Now I can test this query and see if I got all of the syntax correct, and there we go: we now have many more records each time the tumbling window fires, and we can see our data by zone, which I think is a little more realistic than just the vendor ID (a sketch of both stages of this query follows below).
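A minimal sketch of the two stages described above, assuming the walkthrough's aliases (eh1-input, eh-out-1, zone-sql). The column names (PassengerCount, TripDistance, TipAmount, TotalAmount, VendorID) and the join key (PULocationID to LocationID) are assumptions based on the taxi trip and zone lookup data, not confirmed by the video; adjust them to the actual schema.

```sql
-- Stage 1: aggregate by vendor over a one-minute tumbling window.
-- To window on pickup time instead of the enqueued event time, add
-- TIMESTAMP BY <pickup time column> to the FROM clause.
SELECT
    VendorID,
    SUM(PassengerCount) AS TotalPassengers,
    SUM(TripDistance)   AS TotalDistance,
    AVG(TipAmount)      AS AvgTip,
    AVG(TotalAmount)    AS AvgTotal
INTO [eh-out-1]
FROM [eh1-input]
GROUP BY VendorID, TumblingWindow(minute, 1)

-- Stage 2 (replaces Stage 1): join the SQL reference input and group by taxi zone.
-- Reference-data joins do not need a temporal bound; the join key is an assumption.
SELECT
    t2.Zone,
    SUM(t1.PassengerCount) AS TotalPassengers,
    SUM(t1.TripDistance)   AS TotalDistance,
    AVG(t1.TipAmount)      AS AvgTip,
    AVG(t1.TotalAmount)    AS AvgTotal
INTO [eh-out-1]
FROM [eh1-input] t1
JOIN [zone-sql] t2
    ON t1.PULocationID = t2.LocationID
GROUP BY t2.Zone, TumblingWindow(minute, 1)
```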
Then you can save that query and kick off your job, and you'll see data in the output event hub. To start the job, go to the overview page and click Start; it'll take just a little bit of time to spin up the resources for you. I've got the three streaming units; check this before you run it, because this is where it starts to cost you a little bit of money. I kick off my job, wait a little while, and then check that I'm getting output in my event hub. Now that it's run for a bit, it has produced some events: it hit a one-minute tumbling window and output 23 events. Let's go take a look at our output topic and see those results.

From my output event hub I can scroll down to Monitoring, which shows that some messages have been received, and then I can use Process Data, which takes me to another way of getting a Stream Analytics job: essentially a Stream Analytics query built into viewing the events coming out of the event hub. We get that same view, and the same Stream Analytics query language will work. We'll keep it simple and just run it to see the test results, and like before, there we go. There might be a little lag, but the data should come in pretty soon; feel free to run it once or twice. Looking through the results, I have an event enqueued time, and I should see that the same zone can show up multiple times for different tumbling windows; the event enqueued time is what helps me catch that.

In our example I was outputting to another event hub topic, which I think is fairly typical: data engineers often have some parts of their processing that read a stream from Event Hubs or a similar broker and write right back out to that broker on a different topic, so that multiple consumers can read the data. However, you may have other use cases where you need to write directly to storage with Stream Analytics, and there are quite a few outputs to choose from. One I find really interesting is Power BI: I can set this up to feed a live report in my Power BI service. Obviously you'll need Power BI set up and you'll need to authorize it, but just so you know that capability is out there; if you're trying to do real-time reports, this might be a way you pull that off.

That's our hands-on look at Azure Stream Analytics. I hope you learned something along the way. Don't forget to subscribe to this channel or check out dustinvannoy.com for more content. I'll see you next time.
Info
Channel: Dustin Vannoy
Views: 215
Keywords: Azure, Stream Analytics, Event Hubs, Data Engineering, DataKickstart
Id: 83e0HCmLFfY
Length: 21min 15sec (1275 seconds)
Published: Fri Nov 05 2021