Ask Confluent #6: Kafka, Partitions, and Exactly Once ft. Jason Gustafson

Captions
Welcome back to Ask Confluent, where we answer questions from the internet. I'm your host, Gwen Shapira, and with me today is special guest Jason Gustafson. Jason is a member of the Apache Kafka PMC, which is a pretty big deal, and a committer, which everyone knows is a big deal, and he is an engineer here at Confluent. We got him on the show to answer some big questions about Kafka and exactly-once and generally more advanced topics, since we got a request from the audience for more advanced topics on the show. So here we are, let's get started.

Before we start, something I'm kind of curious about: you work on Confluent's Core Kafka team, right? That's a pretty cool team. You're working on the open source Apache Kafka brokers, and there are probably a hundred thousand engineers around the world using what you write, which is kind of incredible. Are you still hiring for the team?

Yeah, we do, absolutely, and for all experience levels. Maybe it's useful if I describe the Core team a little bit. On the Core team we're focused on the broker internals: the storage layer, the replication layer, the core clients, and exactly-once semantics, which we'll be talking about in this session.

So what does it take to join? Like, if I want to get accepted, am I good enough?

Of course. As I said, we're looking for all experience levels, so whether you're a junior engineer or you have a lot of experience, we look for both. In general we're looking for smart people who are enthusiastic about Kafka or enthusiastic about distributed systems. If you already have some experience in distributed systems or a storage layer, it's a plus, but it's not a prerequisite. Really we just look for smart people who are good at communication. We're hiring, and we want to grow the team quite a bit.

That's amazing. So if you're watching the show and you're an engineer who is enthusiastic about Kafka, consider the idea of actually writing Kafka itself.

Okay, our first question came in through the podcast, and they gave us five stars, so we like them; that's why we picked the question. It's a two-part question, and the first part is: what's your suggestion on how many partitions a topic can have? Is there a limit on the number of partitions?

That's a good question; there's a lot of confusion about this topic. The limit that most people should be concerned about is really the total number of partitions inside a single cluster. At the topic level, you want to choose enough partitions that you can parallelize the load, so people are often concerned about parallelization either at the producer level or on the consumption side. A typical thing we recommend is to keep in mind the growth of your application over time and over-partition: if you do some back-of-the-napkin calculations and you think you need, say, 16 processes in the long term, consider doubling that and doing something like 32. Overall there's no specific limit at the per-topic level; as I said, it's more at the cluster level.

What's the largest topic you've seen?

Thousands of partitions, but that's not typical; those use cases tend to be a little bit rarer. More common, considering the throughput requirements of your application, is something like 32 to 64 partitions, and that gives you quite a lot of room for growth.
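To make the over-partitioning advice concrete, here is a minimal sketch using Kafka's Java AdminClient to create a topic with headroom. The broker address, topic name, and replication factor are placeholder assumptions, not anything from the episode:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // You expect ~16 consumer processes long term, so double it for headroom.
            NewTopic topic = new NewTopic("orders", 32, (short) 3); // name, partitions, replication
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```

Doubling the expected consumer count up front matters because adding partitions later changes the key-to-partition mapping for keyed data.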
And then I've seen cases where someone shows up and says: hey, I'm a big bank, we have 100,000 accounts, clearly we need one partition per account so they won't mix together. That's kind of an anti-pattern, right? You don't need that level of parallelism.

Yeah, some of these use cases, where you have some kind of interconnection that you're trying to represent through partitions... you just need to keep in mind the overall limit, so I don't know if I'd call it an anti-pattern necessarily.

I mean, there's no reason not to have multiple data types in the same partition and fan them out later, if the overall parallelism works for you.

That's true, you're absolutely right. The main thing you're trying to get is good utilization of your partitions, so if combining them allows you to get more out of a single partition, you should absolutely do so.

And now for the, to me, more interesting part of the question. The same listener also asked: could you talk about the exactly-once feature and the pros and cons of using it? That's a pretty good question: why shouldn't everyone just go and use exactly-once for everything?

We can talk about that a little bit, and I think the short answer is that nothing comes for free. Exactly-once is an improvement over the normal semantics, but there are some costs. Maybe it's useful to first describe what problem exactly-once is actually solving. Really, it was designed for Kafka Streams. Imagine you have a Kafka Streams application; this is going to be a very simplified version, but typically you have a number of input partitions feeding into the streams application, and you have some output partitions. What you're concerned about, when it comes to exactly-once semantics, is the guarantee you get inside this processing: as we take data from the input partitions, do the processing and aggregation, and write to the outputs, what guarantees can we actually get?

The default that Kafka comes with is at-least-once. If there are network glitches, there are error cases in which duplicates can be introduced during the process: it could be when you're committing offsets, if one of your streams processes dies, or on the producer side, if a network timeout causes a response to be missed. In any of those cases you can have duplicates in the stream. The problem is, if you're doing any kind of aggregation, or anything which depends on having correct results, those duplicates corrupt whatever you report downstream. Exactly-once is an improvement over this, where we're able to guarantee that for this processing from inputs to outputs, the input is reflected only once in the output.

So what's the downside? As I said, it doesn't come for free; there is a cost, and the cost comes from two dimensions. The first is latency. I'll describe it in a little more detail in a minute, but as we take the data and write it to the output partitions, we're basically committing a transaction, and the transaction incurs some latency cost. That's usually not so bad; by default we're talking about something like 100 milliseconds of additional latency. Off the top of my head I don't remember what the parameter is in Streams.
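For reference, here is a minimal sketch of how exactly-once is enabled in a Kafka Streams application, using the config names from the clients of this era (newer clients call the guarantee exactly_once_v2). The application id, bootstrap address, and topic names are placeholders; the commit interval in the comment is likely the parameter in question, since under exactly-once it defaults to 100 ms:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class ExactlyOnceStreamsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "eos-demo-app");      // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // The switch: consumed offsets, state updates, and produced records
        // are committed together in one transaction.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
        // Transactions commit once per commit interval; under exactly-once this
        // defaults to 100 ms, which is where the extra end-to-end latency comes from.

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic"); // placeholder topic
        input.to("output-topic");                                      // placeholder topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```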
The other thing to keep in mind is that exactly-once was introduced only about a year ago. Usage has been growing over time and it has been stabilizing, but it's still fairly new, so keep that in mind and do some amount of testing. In particular, the main performance drawback at the moment is when you have a large number of partitions: the requirements for exactly-once at the per-process level in the Streams layer are more intense, so if you're trying to handle a streams application with thousands and thousands of partitions, exactly-once may be difficult.

Now, if we break exactly-once down, it's really composed of two building blocks. The first is idempotence. Say we're not talking about Streams, just a regular producer application writing to several partitions. By default, Kafka gives you at-least-once guarantees, and as I said, the problem is basically this: when we do a write to the broker and we don't get the acknowledgment back, because that connection was lost for whatever reason, then we may have to retry, and in that case we may introduce duplicates. Idempotence ensures that even if we retry, we won't have duplicates, and additionally it ensures that when we retry, the ordering is preserved. So the first part is really about ensuring ordering and deduplication at the per-partition level.
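As a sketch of that first building block: enabling the idempotent producer really is a one-line config change. Everything here except the enable.idempotence setting is a placeholder assumption:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class IdempotentProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // The one-line change: broker-side sequence numbers now deduplicate
        // retried batches and preserve per-partition ordering.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Send as usual; a retry after a lost acknowledgment no longer
            // produces a duplicate record.
            producer.send(new ProducerRecord<>("events", "key", "value")); // placeholder topic
        }
    }
}
```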
The second part is transactions. Transactions give you the ability to write to multiple partitions atomically: when you write to the partitions in a transaction, the writes will either all be committed or all be aborted. You get all of the data or none of it.

So it's just like a database: I begin a transaction, I produce a bunch of events, and then I commit or abort?

That's exactly right, on the producer side. Then on the consumption side, similar to databases, we have an isolation level, which describes the visibility of this transactional data. The default out of the box is what we call read_uncommitted, which means you see data that's part of aborted transactions as well as committed transactions. With read_committed, on the other hand, we skip over all of the aborted data and you only see the committed data. A downside, as we talked about with Streams, is some additional latency: while a transaction is being prepared, consumers in read_committed mode aren't allowed to see the data until it has been completed.

Is that data buffered by the consumer?

It's not buffered by the consumer, actually. The broker has some internal logic which allows it to see ahead and know which data is going to be aborted, so when you do a fetch on the consumer side, it will only return data after it has been completed, and no buffering is needed on the consumer.

That's fantastic. And like I said, the downside is really the additional latency, but from a throughput perspective you can still get very good throughput.

Did you measure it?

I think we saw maybe a 5% degradation, something on that order.

That's nothing; you lose more just by enabling SSL.

Yeah. And in particular, the thing about transactions is that they do require some additional logic in the application to take advantage of them. Idempotence, on the other hand, you can use right out of the box: you just set the config enable.idempotence=true, and there's really no cost; you can start using it without making any changes to your application.

I'm kind of shocked I don't see more people using it. I've seen more than we had maybe six months ago, but it sounds so simple and saves you so much headache with deduplication, so why not? Thanks, this has been fascinating.
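Putting both halves together, here is a minimal sketch of a transactional producer paired with a read_committed consumer, assuming a local broker; the topic names, transactional id, and group id are placeholders:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalSketch {
    public static void main(String[] args) {
        Properties pprops = new Properties();
        pprops.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        pprops.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        pprops.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        pprops.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "demo-txn-app");    // placeholder

        KafkaProducer<String, String> producer = new KafkaProducer<>(pprops);
        producer.initTransactions(); // registers the transactional id with the broker

        try {
            producer.beginTransaction();
            // Writes to multiple topics/partitions commit or abort as a unit.
            producer.send(new ProducerRecord<>("topic-a", "key", "value-1"));
            producer.send(new ProducerRecord<>("topic-b", "key", "value-2"));
            producer.commitTransaction();
        } catch (KafkaException e) {
            // For fatal errors (e.g. a fenced producer) you would close instead.
            producer.abortTransaction(); // aborted writes stay invisible to read_committed readers
        } finally {
            producer.close();
        }

        Properties cprops = new Properties();
        cprops.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        cprops.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");              // placeholder
        cprops.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cprops.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Only committed transactional data is returned; aborted records are
        // filtered out by the broker before the fetch response is sent.
        cprops.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cprops)) {
            consumer.subscribe(Collections.singletonList("topic-a"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.println(r.value());
            }
        }
    }
}
```

Note that setting transactional.id implicitly enables idempotence as well, so the transactional producer gets per-partition deduplication for free.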
Okay, the next question is from our old friend of the show, Scarface. I think we get a question from Scarface in every single episode, and we appreciate that. A few weeks back I published a demo showing how to build a data pipeline on Google Cloud: I put some data in Google Cloud, used KSQL to massage it a bit, then streamed the data to BigQuery and ran some queries on it. Scarface had the very reasonable question: why do you do some SQL over here and some SQL over there? How does that even make sense?

It is very reasonable. But KSQL, even though it's SQL, is different from what BigQuery gives you as a full-fledged database. This is something a lot of people get wrong about SQL: they imagine that if they use KSQL, it turns Kafka into a database, but KSQL is really streaming; it processes the data as it flies by, and usually what you do with it is more ETL-like: you denormalize, you do some simple aggregations. But if you want to do really deep analysis and compare the performance of this quarter to the performance of every quarter in the last 25 years across 15 different dimensions, that's where you bring out BigQuery; you don't want to do that on the fly, especially since you're processing all those quarters' worth of data while you're doing it. So basically: KSQL for the ETL-like stuff, BigQuery when you need long-term, big-data analysis. I hope that clears up the confusion. You can see in the demo that I used KSQL for some enrichment of the data, and then BigQuery to actually aggregate and see how many events we eventually had of this type versus that type over the entire history.

The next question is kind of new. The podcast has only been around for maybe a month or two, and in between we had Kafka Summit. You spoke there; it was even one of the highest-rated talks of the entire show.

Yeah, I gave a talk on Kafka's replication protocol: some of the problems that we had with it, as well as how we found those problems and how we fixed them.

And if you like getting nerdy about replication protocols, you want to watch it. And if you want to learn more about TLA+, which apparently is a trend now: after you gave your talk, I started hearing from so many people saying "I bought the TLA+ book and I'm going to learn all about TLA+."

I never saw TLA+ as a trend either, but it's a welcome one, and I think it's actually a great thing for the stability of software. We found problems that we didn't know existed, so it was a huge value.

I definitely thank you for starting this trend; I did not expect it to be one, but I credit you for getting people to realize how practical it is. It used to be seen as "oh, just theory, and once I'm done with school I never have to look at it again."

I think Leslie Lamport maybe contributes to some of that, because he kind of makes it sound a little scarier than it is; he likes to talk about the mathematical aspect of it. That's definitely there, but there's also a very pragmatic side, where you can see it as more like regular programming.

That's amazing. Another very popular talk was Martin Kleppmann's keynote. Have you listened to it?

I did, yes. It was quite impressive.

It was impressive, and he introduced a bunch of patterns in the talk. The title was very controversial, but the talk itself presented a bunch of patterns that I thought were very useful. Abhishek Gupta watched it and had a question about the eventual consistency model. Basically, Martin Kleppmann said: let's turn the database inside out, use the stream of events at the core, and use it to hydrate and drive a lot of different applications and different databases. Abhishek made the reasonable observation that at any given point in time, my application may react to an event that doesn't exist in the database yet, because those two consumers run in parallel and will be unsynchronized. The idea is that the application might return a result to a user, and then the user goes to check something in the database and sees something totally different. I think it is a valid concern, and he asked how to solve the problem. I'm wondering: that's not really something that exactly-once solves for you, right? Because the transactions in exactly-once are inside Kafka; they don't extend to external databases at all.

Yeah, that's right. The exactly-once guarantee covers the Streams transformation: you're reading from an input topic and writing to an output topic, and it guarantees that the input, and the transformation you've made to it, will be reflected only once in the output. But if you've got something that's outside of the system, then you need something else.

And I'm thinking one of the things you could do is take the principles of what we implemented and make your application level very idempotent, so it wouldn't matter much if things diverge for a while, because you know that eventually they'll converge to something consistent.
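One hypothetical way to act on that advice (our illustration, not something from the episode): key every application-level write by a stable event id, so replaying an event or processing it twice converges to the same state instead of double-counting. The store shape below is invented purely for illustration:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class IdempotentEventApplier {
    // Stand-in for any external store that supports upserts keyed by event id;
    // invented for illustration, not an API from Kafka or the episode.
    private final Map<String, String> store = new ConcurrentHashMap<>();

    // Applying the same (eventId, value) once or five times yields the same
    // state, so a reader that is temporarily behind another reader still
    // converges once both have consumed the same events.
    public void apply(String eventId, String value) {
        store.put(eventId, value); // upsert: the write for a given id is repeatable
    }
}
```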
And I think the only way to really get around it is to write your applications in a way that tolerates it: accept that they will diverge, and look at design patterns that deal with it better. So, try to avoid having an application that reads from Kafka immediately look something up in a database that also reads from Kafka. Depend on just one of those sources and you'll always be consistent; you'll always get the data in order, because you're always reading it from Kafka in the right order, instead of going to look for the same data in an external database at the same time. Picking one source and sticking with it seems like a good solution to me, but there could be other solutions; if you solved it in another way, write to us and we'll feature your solution in the next episode.

We can do that. And I think that's it for questions. I want to show off some compliments we've got. Marco Rosie wrote, as a comment on the intro to Kafka Streams video: "very well explained, young man, you made my day." Thank you! The young man in question is Tim Berglund, and I'm just impressed that we probably made Tim's day, because you called him a young man. Let's put it that way. And I'm glad that you now know all about Kafka Streams.

Then Kai Waehner, who is a beloved Confluent employee from the region of Germany, wrote in commenting on Viktor Gamov's video. Viktor did a video, "Helm 101", on running Confluent Platform on Kubernetes, and Kai said that it was a great presentation, that the Kubernetes operator is going to be very useful, and that he's sure Viktor will create another video about this topic. So, Kai, we're happy to inform you that Viktor recorded a new video on this exact topic just this morning, so you can expect it very, very soon.

And that's it for today. That was lots of fun. Thank you so much for being here, Jason.

It was my pleasure; great questions from the community, as usual.

Don't forget to subscribe to the channel, and don't forget to give exactly-once a try. As we learned, if you use idempotent producers there are basically no drawbacks, and you'll never miss your duplicates. And if you find any issues, make sure to report them at issues.apache.org, project Kafka.
Info
Channel: Confluent
Views: 12,665
Keywords: confluent, apache kafka, streams, logs, hadoop, free, open source, streaming, platform, applications, apps, real-time, processing, data, Multi-Datacenter, monitoring, developers, operations, ops, java, .net, messaging queue, distributed, commit, fault-tolerant, publish, subscribe, pipelines, ask confluent, comments, Q&A, questions, answers, gwen shapira, discuss, twitter, Jason Gustafson, Kafka PMC, engineer, number of partitions, exactly once, semantics
Id: CeDivZQvdcs
Length: 21min 20sec (1280 seconds)
Published: Mon Nov 05 2018