Kinesis versus MSK

Captions
[Music] Welcome to cloudonaut plus, your source for exclusive videos and online events for AWS professionals. My name is Michael Wittig, and today I'm going to talk to you about the differences between Managed Streaming for Apache Kafka (MSK) and Kinesis Data Streams. Those are two services that offer similar functionality, but in very different ways. This topic was requested by the community: one of our cloudonaut plus members, Thomas (not the same Thomas as last week, a different one who shares the name; this time the username is remote coffee), asked us to present this topic, and we deliver it this week. I hope you like it, Thomas, and I hope you learn something new from this video, as will everyone else watching.

Let me dive into the agenda for today. We'll look at the most important differences between both services, which helps you quickly understand what each service is good at. We'll also go through the different pricing dimensions of both services, because when you create an architecture on AWS, it is very important to understand what implications your building blocks have on the overall bill at the end of the month. Then I have a short demo where I set up an MSK cluster and show you how to connect to it, how to produce messages, and how to consume them. Last but not least, I'll talk about a pitfall when it comes to auto scaling, and this pitfall applies to both solutions.

So let's dive into the most important differences, starting with how you access the data: what's the concept behind these services? In Kinesis you always access the data as a stream: you can read from a specific point in time in the stream, or you can read from now on. That is basically what you can decide in Kinesis.
In Kafka, which is the software that MSK provides, you can either look at the data in a stream fashion or in a table fashion, which is very handy because you get both worlds: the table world you know from databases, and the stream approach from the newer streaming data stores. That's pretty cool and very helpful at times.

When we talk about delivery guarantees: with Kinesis you can only achieve at-least-once delivery, because on the producer side there is no way to ensure that we don't add an element to the stream again when we retry after a failure. There is no ID that gives us a uniqueness constraint, so with Kinesis you cannot achieve exactly-once delivery; even if your client implements everything that's needed, the producer can still cause duplicates. You basically have to ensure that the whole chain is idempotent, and then you achieve the same result as you would with exactly-once delivery. That's the downside here. With Managed Streaming for Kafka, exactly-once is possible, again depending on the client and the producer side; they all have to implement things correctly, but there are libraries out there to achieve that.

When we look at data processing, so once the data is in, how can we process it? With Kinesis there are basically two options besides reading the raw data from the stream. There is Kinesis Data Analytics, which can connect to a data stream; you can run an application on that data using a SQL-like language with which you can do window aggregations and things like that. We also now have the Lambda tumbling window feature, which helps you aggregate stream data into chunks; that's a similar approach, but you have more control and you can program in your language of choice. For Kafka we have the Kafka Streams domain-specific language, which is very powerful: you can do all kinds of things with it.
Again, this is more powerful than what we have on the Kinesis side. There's also an interesting difference when we talk about persistence of data. With Kinesis you can persist data for up to 365 days, and that's it; there is no way to keep the data forever. With Kafka, if you have the storage, you can keep the data in the stream forever, which is very helpful if you want to read all the data again from scratch, for example because you changed your processing logic and therefore have to reprocess everything. This is possible with Kafka; it's not possible with Kinesis.

When we look at operational overhead, Kinesis is fully managed, so there is not much to tweak, not much to configure. That's why the operational overhead is very low compared to Managed Streaming for Kafka, where you still care about the brokers in a cluster: you can add brokers, you have to partition your topics, so there are a couple of things to do when you scale out, for example. You also have to take care of storage, making sure the cluster has enough to keep all your data. That's why the operational overhead is a little higher on the Kafka side.

One thing where I think the two services really differ is fan-out, and by fan-out I mean putting one event into the stream and reading it with many consumers, maybe 10 or maybe 100. With Kinesis the intention is that you don't have that many consumers: by default you can support up to two, and you can increase that with additional features or capabilities that you have to pay extra for. With Managed Streaming for Kafka you can subscribe to the same stream many times, and it's not an issue.

Last but not least, we talk about costs, and we'll go into detail here. This is a little hard to generalize, but in the scenario that I will show you, Kinesis is cheaper than Kafka.
It always depends on many things, for example on the throughput of your system, but I think most use cases fall into the category where Kinesis is cheaper than Kafka. There will be use cases where it's the other way around, which is why we'll go into the pricing details now.

We start with Kinesis Data Streams because it's a little easier to understand. There are two main pricing dimensions. The first is the shard hour: a stream consists of one or multiple shards, and we pay for every shard that we use, in this case 1.5 dollar cents per shard hour. Multiply that by 24 hours and by 30 days (assuming a month has 30 days) and it will cost us around $10.80 per month. That's the shard cost. In Kinesis we also pay for the records that we ingest (that's what a message is called in Kinesis). Assume a payload of 5 kilobytes and 100 messages per second; if you do the math, that's around 43 gigabytes a day. We pay 1.4 dollar cents per 1 million PUT payload units, so I take 1.4 cents, divide by one million, multiply by 100, by 60, by 60, by 24, and by 30, and I get the monthly cost, which is around $3.60. In total we pay around $14.43.

A couple of things to keep in mind: a shard has a capacity limit of up to 1 megabyte per second or up to 1,000 records per second. In our use case we utilize the shard at around 50%, because we produce around 500 kilobytes of data per second, which is half of the 1 MB/s throughput limit. There are other pricing dimensions in Kinesis Data Streams that can optionally come into the picture. For example, if you want to retain data for longer than a day, for up to seven days, you pay an additional 2 dollar cents per shard hour.
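The back-of-the-envelope Kinesis math above can be checked with a few lines of shell. The prices are the ones quoted in the video; verify them against the current Kinesis Data Streams pricing page before relying on them.

```shell
#!/bin/sh
# Kinesis Data Streams monthly estimate: 1 shard, 100 records/s, 5 KB payload.
# Prices as quoted in the video; check current AWS pricing before reuse.
shard=$(awk 'BEGIN { printf "%.2f", 0.015 * 24 * 30 }')                            # $/shard-hour * hours/month
puts=$(awk 'BEGIN { printf "%.2f", 0.014 / 1000000 * 100 * 60 * 60 * 24 * 30 }')   # $/1M PUT units * units/month
total=$(awk "BEGIN { printf \"%.2f\", $shard + $puts }")
echo "shards: \$$shard, puts: \$$puts, total: \$$total"
# prints: shards: $10.80, puts: $3.63, total: $14.43
```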
If you retain data even longer than seven days, for up to one year, you pay per gigabyte of storage, and when you read that data you pay again per gigabyte. And if you have more than two consumers, you additionally pay per consumer shard hour and per gigabyte read. That's a little complicated, so you really have to do the math precisely once you get into these optional categories, but I think the default example here is simple enough.

Now we switch over to Managed Streaming for Kafka, and as you can see just by quickly skimming it, the total comes to $332, which is significantly more expensive than the Kinesis Data Streams option; I used exactly the same throughput in this example to calculate the costs. Kinesis Data Streams is multi-AZ by default, which is why I chose two brokers in this example, so that we have replication in the system. I used an m5.large instance because that's the smallest supported instance type that is not burstable. We do not recommend running burstable instances (the t3s, t2s, and so on) in production, because it's just too complicated to keep them under control: if you run out of burst capacity you will run into issues, and even with the unlimited option on t3 you then basically trade that for an unlimited cost problem. That's why I chose an m5.large here, and you pay 21 dollar cents per broker hour. I multiply by two because we have two brokers, times 24 hours a day, times 30 days, so the brokers alone cost over 300 US dollars.

Then we have to pay for storage. In this scenario I know that I ingest 43 gigabytes a day, and a Kinesis data stream keeps data for one day by default, so I configure Kafka the same way: after one day I drop the data. That's why I think 50 gigabytes should be enough to store all the data, depending on the overhead.
The payload is 43 gigabytes; I don't know how much overhead is added here, but assuming it's 7 gigabytes, we end up with roughly 10 US dollars for the storage, so it's not the significant cost driver. An additional dimension to keep in mind is that you pay for traffic as well in some situations, and "some situations" means cross-AZ traffic: if your broker is in zone A but your client is in zone B, you do cross-zone traffic and pay 1 cent per gigabyte of traffic in and out. Assuming you have clients in three zones, the chance that a request crosses an availability zone is around 66.7%. Again assuming some protocol overhead, I assume 100 gigabytes of traffic, which ends up at around 20 dollars in traffic costs. If we sum up all the costs, we are at around $330.

One thing to keep in mind here is that we barely utilize the m5.large instances with this workload. An m5.large comes with a network throughput of up to 10 gigabits, but according to our network performance cheat sheet, you can expect around 0.74 gigabits per second of stable throughput on the network layer. With our workload we utilize that at around one percent, so you could run a workload that is 100 times heavier on these two broker instances, at least as far as ingestion is concerned. So keep that in mind: the comparison is not completely fair. The result is that for smaller applications Kinesis is probably a better fit than Kafka, while for large workloads Kafka might be more cost-efficient than Kinesis Data Streams. All right, those are the costs. It is complicated, but I hope you still learned something from this detailed comparison.
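The same kind of estimate for MSK, using the figures from the video (broker price per the video; the storage and traffic numbers are the rough estimates given above, not derived):

```shell
#!/bin/sh
# MSK monthly estimate: 2 x m5.large brokers, ~50 GB storage, cross-AZ traffic.
# Broker price as quoted in the video; storage/traffic are the rough estimates above.
brokers=$(awk 'BEGIN { printf "%.2f", 0.21 * 2 * 24 * 30 }')  # $/broker-hour * 2 brokers * hours/month
storage=10   # rough storage estimate from the video
traffic=20   # rough cross-AZ traffic estimate from the video
total=$(awk "BEGIN { printf \"%.2f\", $brokers + $storage + $traffic }")
echo "brokers: \$$brokers, total: \$$total"
# prints: brokers: $302.40, total: $332.40
```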
All right, that's enough theory; let's look at how this works in AWS. How does such a cluster look? I'm going to switch into the AWS console and then into the terminal.

This is the MSK console. As you can see, I created a cluster, and I will provide the code examples in the video description. I created this setup based on our CloudFormation modules, so you can easily recreate it; and if I say "easily", it still takes quite some time to spin up such a cluster, maybe half an hour or so until it's up and running. Clusters themselves are not so exciting. If you click on the cluster, the most important information is the broker endpoints; that is what we need if we want to ingest data into or consume data out of the cluster. Keep in mind that Kafka relies on ZooKeeper nodes to orchestrate the metadata of the system: topics, for example, are created in ZooKeeper and then distributed from there, so ZooKeeper is also running in this cluster for us, all managed by AWS. This cluster accepts TLS-encrypted traffic only, so I'm not allowed to connect to it in plain text. That makes a couple of things a little harder, but I think it's still better to understand how it works in a secure way.

So let's look at the terminal. I'm connected to an EC2 instance here over Systems Manager; let me increase the font size a little to make it easier for you to read. What I did (and I will list these steps in the video description as well, because they're not super interesting): I installed Java 1.8, and I installed the Apache Kafka client in the same version as the cluster runs; make sure the versions match. Then I created a file that is quite important for the TLS part: it basically advises Kafka to create a TLS connection, and it uses a truststore that I copied into my temp folder from the JRE.
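That client setup could look roughly like this; the JRE path varies by distribution and Java package, so treat it as an illustrative sketch rather than the exact commands from the video.

```shell
#!/bin/sh
# Sketch of the TLS client setup. The truststore is copied out of the JRE;
# the path below is typical for Java 1.8 but varies by distribution.
# cp /usr/lib/jvm/jre-1.8.0/lib/security/cacerts /tmp/kafka.client.truststore.jks

# Tell the Kafka CLI tools to use TLS and where to find the truststore.
cat > /tmp/client.properties <<'EOF'
security.protocol=SSL
ssl.truststore.location=/tmp/kafka.client.truststore.jks
EOF
cat /tmp/client.properties
```

The resulting properties file is passed to the console producer and consumer later in the demo.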
The truststore is copied out of the JRE folder; you will see all of this in the video description, and the paths are not that interesting, but this is something you need to have in place, otherwise you cannot connect to your cluster and the error will just be a timeout, which is kind of annoying. With those things in place you will be able to connect to the cluster without any problems.

What you also need are the ZooKeeper nodes. This is the variable where I stored all my ZooKeeper nodes; as you can see, it is a comma-separated list, each node on port 2181, and I have three ZooKeeper nodes here. The same goes for the brokers: I have the broker endpoints in a similar list. Depending on what you do, you might need to connect to the ZooKeeper nodes or to the Kafka brokers, and sometimes you can use either, but at least one of them is always required.

The first thing we need to do is create a topic, and to create a topic we need to enter a name. I already created the topic "cloudonaut", so now I create the topic "cloudonaut-demo", with a dash. This command creates the topic, and the important part is that it will be replicated three times: if I send a message to this cloudonaut-demo topic, each message will be stored three times. Partitions matter when a topic receives a lot of traffic: adding partitions is the way to scale out a topic, because partitions are spread over multiple brokers.
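A sketch of that topic-creation command. The ZooKeeper endpoints are placeholders for the ones shown in your MSK console, and the partition count is illustrative; note that older Kafka CLI versions address ZooKeeper like this, while newer ones take `--bootstrap-server` instead.

```shell
#!/bin/sh
# Hypothetical ZooKeeper connection string; use the one from your MSK console.
ZOOKEEPER_CONNECT="z-1.example.com:2181,z-2.example.com:2181,z-3.example.com:2181"

# Create the demo topic, replicated three times as in the demo.
kafka-topics.sh --create \
  --zookeeper "$ZOOKEEPER_CONNECT" \
  --topic cloudonaut-demo \
  --partitions 1 \
  --replication-factor 3
```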
Now that we have created our topic, we can actually use the system. I'm going to produce a message, and to do that I'll use the Kafka console producer. That's definitely not what you'd use in production, where you'd embed a producer into your application somehow, but it's a good tool for demos: whatever I type here is sent as a string message. As you can see, the first parameter is the broker list, which basically tells the tool where to connect to; then I specify the topic (I use the cloudonaut topic, the one that was created already), and then we need the producer config. I think there is a little mistake here, so let me add this dash. All right, and then I send the messages "test three", "test four", and "test five"; I already inserted "test one" and "test two", which is why I start at three. So this is working: I can send messages into the cluster.

Now let's look at the other side: how do we get messages out of the cluster? We use a similar command, this time the console consumer. We again specify the server list, we say we want to consume from the beginning (from the time the topic was created), we specify the topic, and we again specify the client properties file. As you can see, our messages are available: "test one" and "test two", which I created in preparation for this demo, and three, four, and five, which were added just a couple of seconds ago. So that's working, and that's basically all you need to know about Kafka from an operational perspective: you can spin up a cluster, you can connect to it, and if you use the TLS-only option, make sure you create this properties file so that a TLS connection is opened.

Now let's look at a pitfall. There are a couple of things to keep in mind when operating Kinesis or MSK. With Kinesis, the only thing we have to specify is how many shards we want to provision, and we can change the number of shards. The first idea here is: can we also scale the number of shards with the workload? That would make a lot of sense. The answer is that this is somewhat supported, but there is no native support.
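The produce and consume round trip from the demo above could look roughly like this. The broker endpoints are placeholders (MSK exposes TLS on port 9094), and the client.properties file is the TLS config created earlier.

```shell
#!/bin/sh
# Hypothetical TLS broker list; use the endpoints from your MSK console.
BOOTSTRAP_BROKERS="b-1.example.com:9094,b-2.example.com:9094,b-3.example.com:9094"

# Produce: every line typed on stdin is sent as one message.
kafka-console-producer.sh \
  --broker-list "$BOOTSTRAP_BROKERS" \
  --topic cloudonaut \
  --producer.config /tmp/client.properties

# Consume: read all messages from the beginning of the topic.
kafka-console-consumer.sh \
  --bootstrap-server "$BOOTSTRAP_BROKERS" \
  --topic cloudonaut \
  --from-beginning \
  --consumer.config /tmp/client.properties
```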
There is what AWS calls a "solution" available, which integrates with Application Auto Scaling, but it is not natively supported by Application Auto Scaling: you need to implement your own logic on top of it, and then you can modify the number of shards in your Kinesis stream. There are a couple of things to keep in mind here: you cannot increase or decrease the number of shards in a stream as often as you wish; there are limits, for example by default 10 resharding operations per 24 hours. So keep that in mind when you use Kinesis: the auto scaling part is not easy, it's something you have to build, and you also have to monitor the throughput of your system yourself.

For MSK, things look a little different. At first I was very happy, because I thought auto scaling was sorted: Application Auto Scaling supports MSK. But then I figured out it can only scale the storage of the cluster; it does not add brokers. So storage scaling in MSK is natively supported, and that's good, but broker scaling is not supported at all, and there is no solution for that: you'd have to write your own Lambda functions or whatever you wish. Also keep in mind that you can only expand your cluster: you can add nodes, but you cannot remove nodes at the moment. That's something to keep in mind as well, and it will affect your bill, probably in bad ways. So as you can see, there is no good solution for either option, and I'm not very happy about that. I still think Kinesis is a little easier to auto scale than MSK, but both ways are painful. I'm hoping this gets better; Kinesis has been around for quite some time, so I'm not very confident that much will change there soon, but Managed Streaming for Kafka is very new, so I think we can expect new features here in the near future.
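For reference, the Kinesis resharding mentioned above boils down to a single API call that your own scaling logic would invoke; the stream name and target count here are illustrative.

```shell
#!/bin/sh
# Reshard a Kinesis stream to 4 shards. Subject to the default limit of
# 10 resharding operations per 24 hours mentioned above.
aws kinesis update-shard-count \
  --stream-name my-stream \
  --target-shard-count 4 \
  --scaling-type UNIFORM_SCALING
```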
So let's wait for that; at the moment, as I said, it's a little painful.

Okay, let's end with a quick summary. One thing to keep in mind, and this relates to both Kinesis and MSK/Kafka: there is no global ordering in such systems. Kinesis calls it a stream, Kafka calls it a topic, but ordering only exists within what Kinesis calls a shard and Kafka calls a partition. There is no order on top of that, no global stream or topic order; only messages within a partition or shard are ordered. Keep that in mind when designing software for these systems. The other key takeaway is that there basically is no auto scaling: to be fair, it's not really possible to auto scale either solution, and even if you add something yourself, you are very limited by the restrictions. So if you have dynamic workloads, you likely have to over-provision this layer, and keep that in mind when you forecast your costs, because this layer is actually quite expensive. The last thing I want to mention in the summary is that I think MSK is more expensive for smaller workloads. There is basically no reason to do the math if you only have a couple of shards; I'm very certain that Kinesis Data Streams is more cost-efficient there. But if you have a really big system with many shards, it could be cheaper to use Kafka, and if you have lots of consumers, it is also likely cheaper to use Kafka. So keep that in mind, and do the math if required.

That's the summary of the topic. You can reach out to us or the community about this topic, or any other AWS-related topic, at any time: visit community.cloudonaut.io and ask your question. We are looking forward to hearing from you, and I hope the community will join in answering your questions as well. Thank you for watching, and don't forget to like this video if you learned something new today.
Your feedback helps us to produce relevant videos. If you have any ideas for topics, feel free to reach out to us via Twitter, email, or the community; you will find all the details in the video description below. We are back in one week. Thanks for watching, bye!
Info
Channel: cloudonaut
Views: 252
Keywords: aws, amazon web service, cloudonaut, cloud, cloudcomputing, cloud computing, aws training, aws cloud, aws tutorial, aws tutorial for beginners, amazon aws tutorial, aws kinesis tutorial, aws msk tutorial, aws kinesis vs. msk, msk vs kinesis, aws kinesis, aws msk, aws kinesis vs msk, aws msk vs kinesis, aws kafka vs kinesis, aws kafka, aws kinesis vs kafka, kafka kinesis, Kinesis Data Streams versus Managed Streaming for Apache Kafka, kinesis data stream, apache kafka
Id: kcBAKz0MPf8
Length: 26min 34sec (1594 seconds)
Published: Sun Oct 31 2021