Kafka overview and fundamentals explained for beginners, architects and system designers

Captions
Hey, welcome back. In this video I'm going to show you how to quickly get started with Apache Kafka. If you don't know what Apache Kafka is, it's a highly distributed, highly scalable, fault-tolerant event streaming platform, and you can use it for decoupling backend services, asynchronous message processing, and event streaming. If none of that means anything to you yet, don't worry, because I'm going to explain it all in a second.

Okay, the first thing we're going to do is walk through some of the key concepts of what Kafka is and how it works. As I said, it's an event streaming platform, or a messaging platform, so what I'm trying to do is asynchronously process messages: I write a message of some sort onto the platform (it may be an event, it may be a message, it doesn't really matter), and later on something else reads that message and does some processing on it. So if I draw Kafka in the centre here on the iPad, that's our starting point. There will be things that write to the platform and things that read from it. In Kafka terms, writers are known as producers, and as you can imagine there can be multiple producers in the system, all writing to Kafka just as they would to any other queue or stream. Once messages are written to Kafka, something will read them, and those are the consumers. So those are the key concepts: producers write to Kafka, consumers read from Kafka.

Now that we have the terms producers and consumers, let's pull back the layers a little and look at what's going on inside Kafka. The first thing to understand is the concept of a broker, and a broker is really just a server; that's the easiest way of thinking about it. Producers talk directly to brokers: the brokers host the APIs, so anything exposed to the outside world is exposed via a broker, and that's what the producer talks to. When a producer wants to send a message into Kafka, it is talking to a broker. Drawing that on the iPad again: a producer sends a message m, and it talks to a broker. We'll get into the scalability of Kafka in a second, but you can probably already guess that, in order to be scalable and fault tolerant, you're going to have multiple brokers (multiple servers) within your Kafka estate, or, as Kafka terms it, your Kafka cluster. All a Kafka cluster is is multiple brokers that together form the overall Kafka service.

So we've got an individual broker within this cluster, and the message has to go somewhere. What the broker does is write the message into a thing called a topic (this isn't the best diagram in the world, because I ran out of space): it takes the message that came in from the producer and appends it onto the topic. A topic is really like a big log file; the nicest description of Kafka I've seen is to think of it as a log, where producing appends to the log and consuming is like tailing it. So if I draw a log with slots (which also looks a bit like a queue, and that's fine), I'll append m into the first slot on the topic. When you hear the word topic, think of it as your message log: the storage area where messages are held. Flip that around, and later on, when a consumer wants to read messages, it just reads from that topic, from that log, any messages that have been written. Putting the fancy graphic up on the screen, it shows what we've got: a producer sends a message to a broker, the broker is just a server within your cluster, the broker appends the message onto a topic, and once it's on the topic a consumer can read it later. So now we've got producers, consumers, brokers, and topics. As I said, Kafka is a highly scalable platform, and you're going to have multiple brokers.
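The append-log idea can be sketched in a few lines of code. This is a toy in-memory model for illustration only; the class and method names are my own, not Kafka's API. A topic is an append-only log: producers append to the end, and consumers read forward from an offset.

```python
# Toy model of a Kafka topic as an append-only log (illustration only;
# real Kafka brokers persist the log to disk and serve it over the network).
class Topic:
    def __init__(self, name):
        self.name = name
        self.log = []                  # the append-only message log

    def append(self, message):
        """What a broker does with a produced message; returns its offset."""
        self.log.append(message)
        return len(self.log) - 1

    def read_from(self, offset):
        """What a consumer does: read every message from `offset` onward."""
        return self.log[offset:]

topic = Topic("events")
topic.append("m1")                     # a producer writes...
topic.append("m2")
print(topic.read_from(0))              # ...and a consumer later reads them, in order
```

Note that reading is non-destructive: unlike a traditional queue, consuming a message doesn't remove it from the log, which is why a consumer only needs to track its own offset.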
We'll look at that in a second, but in order to be both scalable and fault tolerant, the topics need to be scalable and fault tolerant too, and the way that's achieved is that a topic is, in some regards, a logical construct: what you actually have, from a physical perspective, is what's known as a topic partition. What do I mean by that? The topic partition is the thing that resides on the broker: for a given topic, a broker holds a single partition representing that topic, so if I only had one broker, I'd only have one partition. Why is that important? Because the partition is the unit of scale for topics. It's what gives me fault tolerance, and it's what lets me scale a topic out across multiple servers to increase throughput. Coming back to the earlier example of a producer sending a message to a broker: because I haven't scaled out, I have one broker in my Kafka cluster and therefore one topic partition. The producer sends a message to the broker, and the broker appends the message onto the topic partition. Later, when we look at scaling, we'll introduce more topic partitions as the unit of scale, and when we look at fault tolerance, partitions will appear across different brokers, but we'll come to that slightly later. For now, in a single-broker, single-topic world, I have a single topic partition: my producer sends a message m to my broker, my broker appends it onto the topic partition, and m appears there. That's how producers work: pretty simple stuff.

Okay, now let's dive a little deeper and look at the Kafka architecture as a whole. In my previous example I had one broker in my Kafka cluster, but of course I can have many more brokers, many more servers, because I want the ability to scale and to have fault tolerance, and this is where the idea of a Kafka cluster comes in. A Kafka cluster is simply something that contains one or more servers (one or more brokers, in this case); that's all it is. Let me draw that out to bring the architecture to life: producer 1, producer 2, and producer 3 talk to the Kafka cluster; the cluster contains brokers b1, b2, all the way up to bn; and then our consumers read from the cluster. That's a Kafka cluster. But I also need something to manage that cluster: something responsible for making sure the brokers are up and running, and for restarting brokers when they fail. Today that's done with a tool called ZooKeeper, which is responsible for keeping the brokers up and running. In future versions of Kafka they're trying to get rid of ZooKeeper and manage clusters with Kafka's own protocol (KRaft), but for now we're using ZooKeeper, which is a piece of cluster management software. The diagram brings the architecture together: a Kafka cluster consisting of brokers, ZooKeeper to manage the cluster, and then our producers and consumers. In this architecture, the broker handles all requests from clients and keeps data replicated within the cluster, while ZooKeeper manages the Kafka cluster and its state, i.e. the topics, brokers, and users. Those are the shared responsibilities between the two.

Of course, if I'm scaling up my brokers, over time I'm also going to need to scale out my ZooKeeper instances, because ZooKeeper is a point of scale too: I can't run lots and lots of brokers against a single ZooKeeper instance; I'd need multiple instances of ZooKeeper to manage the cluster. That gives me scale, and it gives me fault tolerance. From a ZooKeeper perspective, you would usually have three to five ZooKeepers in the cluster: you keep the numbers low to conserve overhead, and you always use an odd number so a majority can be maintained. So for the Kafka cluster as a whole, I'll have somewhere between three and five ZooKeeper instances managing the cluster, while the brokers can scale to thousands of instances, whatever is needed, because they run all my topics; ZooKeeper stays at about three to five, and that's its point of fault tolerance and scale.
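The "odd numbers to maintain a majority" point is worth making concrete. A ZooKeeper ensemble stays available only while a strict majority (a quorum) of its nodes are up, so an even-sized ensemble buys you nothing extra. A quick sketch of the arithmetic:

```python
# Why ZooKeeper ensembles are sized 3 or 5, never 4: availability requires
# a strict majority of nodes, so adding a fourth node does not raise the
# number of failures you can survive.
def quorum(ensemble_size):
    return ensemble_size // 2 + 1      # smallest strict majority

def failures_tolerated(ensemble_size):
    return ensemble_size - quorum(ensemble_size)

for n in (3, 4, 5):
    print(f"{n} nodes: quorum={quorum(n)}, can lose {failures_tolerated(n)}")
# 3 nodes can lose 1; 4 nodes can still only lose 1; 5 nodes can lose 2
```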
Okay, so we understand our Kafka cluster: we've got brokers, and we've got ZooKeeper to manage the cluster. The last thing I want to talk about, before we crack open the code and actually get something up and running on a machine, is how topic partitions scale with brokers. We'll go into this in more detail later, but the nice way of thinking about it is that you're going to have a topic partition per broker. Say I have broker 1, broker 2, and broker 3. In order to have fault tolerance, I need a partition to be replicated across multiple brokers, and from a fault tolerance perspective the minimum number of replicas (copies) you'd want is about three. So take our topic partition, say a topic called chris: the partition lives on broker 1, and for fault tolerance you'd also have a copy of chris on broker 2 as a replica, and another copy on broker 3 as a replica. Because each broker has a copy of that topic partition, replication starts to give you fault tolerance: if broker 1 dies (put an x through it), broker 2 or broker 3 can pick it up and become what's known as the leader, the one responsible for that partition. Think about this for a second: as soon as you're down to two brokers for that topic, losing one more node means you can no longer maintain the cluster, because you're running out of nodes to stay fault tolerant. You can deal with losing one node, but you certainly can't lose two: beyond that you can't guarantee your consistency. It's just something to be aware of: you want enough brokers around to maintain a minimum of three replicas (we'll look at that when we talk about the settings). If you do drop to two brokers, that's where ZooKeeper comes in: it will try to restart the failed broker and get it back up and running, or, if another broker is available, the data can be replicated onto that one. That's how brokers work from an availability point of view. So we have a topic and a topic partition, and as we introduce multiple brokers, the partition is replicated across them, meaning I can lose a broker and the messages stay persisted.

Now, from a broker's perspective, a particular topic has this concept of the leader, and here's what that means. Think about writing messages to a topic: say I write messages m1, m2, m3, m4. Because I want to maintain order, they go onto the log as m1, m2, m3, m4, and when the consumer comes along and reads them, it wants to read them in that order. If I allowed writing to broker 1, broker 2, and broker 3 in parallel, I couldn't maintain the ordering of those messages: I could write m1 to broker 1, m2 to broker 3, and m3 to broker 2, and depending on the order they got written to the logs and then replicated, the ordering wouldn't be guaranteed; it wouldn't be consistent. The way Kafka maintains ordering is by saying that, for a topic partition, only one broker can be the leader, and the leader is the broker responsible for writes. So when the producer writes, it writes to the lead broker: for the chris topic, only one of those brokers is considered the leader, the producer's message goes to that leader, and from there it gets replicated to all the other brokers.

That becomes super interesting from an acknowledgement perspective, because you can then have settings on the producer that say what sort of mode you want to be in for consistency, or even for availability. You could decide you want all of the replicas to confirm they've received the message, and only once every replica has confirmed it do you consider the message written. You could acknowledge writes with just the leader's confirmation. Or you could fire and forget, saying you don't really care whether it's written: as soon as you've handed it to the broker, you're done, whether or not it ever lands on the partition. So there are different modes on the producer, and by default all replicas have to acknowledge that they've got the data before the producer is told the write succeeded, which gives you a sort of transactional consistency. You have the ability in your client to decide, but I'd recommend keeping the default: there are some reasons you might want fire-and-forget, but in general, having all replicas acknowledge the data is a good place to be.
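Those three acknowledgement modes map onto Kafka's producer `acks` setting, whose real config values are 0, 1, and "all". The function below is an illustrative sketch of the decision, not the actual broker protocol (and in real Kafka, "all" means all *in-sync* replicas, which I've simplified here).

```python
# Sketch of the producer `acks` setting. The values 0, 1 and "all" are the
# real Kafka config values; the decision logic is a simplified illustration.
def write_acknowledged(acks, leader_has_it, replicas_acked, replica_count):
    if acks == 0:                 # fire-and-forget: don't wait at all
        return True
    if acks == 1:                 # leader-only: leader's append is enough
        return leader_has_it
    if acks == "all":             # wait for the whole replica set
        return leader_has_it and replicas_acked == replica_count
    raise ValueError(f"unknown acks mode: {acks}")

print(write_acknowledged(0, False, 0, 3))       # True: never waits
print(write_acknowledged(1, True, 0, 3))        # True: leader confirmed
print(write_acknowledged("all", True, 2, 3))    # False: one replica still behind
```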
Okay, just to recap where we are. We've got producers, and we've got our cluster, which is all of the Kafka brokers (servers) in our system. The producer writes to a broker; there can be multiple brokers, all the way up to broker n; and we have topics, where each broker can hold one partition for a particular topic. Say we've got a topic called chris: one of those brokers is the leader, so any time the producer writes, it writes directly to that lead broker. It talks to the cluster and asks, "okay, who is the lead broker for topic chris?", goes to, say, broker 1, and does the writes, and every other broker in the replica set then has the data replicated to it. In this case, for the chris topic, brokers 2 and 3 would each hold a copy: they are the followers. If the leader, say broker 1, dies, ZooKeeper will try to spin up a new broker, and within Kafka's own protocols a new leader is elected from among the followers: a broker that was a follower for chris becomes the leader, and ZooKeeper tries to spin up another broker (or gets the previous one back up and running) so you maintain the replica set. To maintain fault tolerance you really want three brokers running; the cluster can survive on two, but if you're down to one broker you can't guarantee the consistency of the data, and things get into a kind of flux.

The good news, from a Kafka perspective, is that leadership is split across different topics. Say I had another topic called fred: like the chris topic, its replicas sit across the brokers, and leaders and followers can land on different brokers for different topics. In the case of fred, broker 3 could be the leader while broker 1 is a follower. That's really important from a cluster management perspective, because the load gets split across all the brokers. If I've only got three brokers, I'm only going to have three replicas anyway, so with three topics, each broker might be the leader for a different topic: no single broker ends up the leader for every topic partition that exists; leadership is distributed evenly, so you get the most usage out of your cluster. Where that really matters is at scale: if I've got a thousand topics and something like 10,000 servers, those topics are spread across that massive cluster, so no broker takes too much load. The more topics I have, the more they move around the cluster. So as it stands, we've got a pretty good understanding of how brokers work and of how Kafka maintains high availability by keeping replicas of your topic partitions: you have more than one copy, three copies, of your data residing on different brokers throughout your Kafka cluster.
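The even spread of leadership can be sketched very simply: hand out partition leadership round-robin across the brokers so no single broker leads everything. This is an illustration of the principle only, not Kafka's actual leader assignment algorithm.

```python
# Illustrative round-robin spread of partition leadership across brokers
# (not Kafka's real assignment logic): each topic's leader lands on a
# different broker, so write load is shared across the cluster.
def assign_leaders(topics, brokers):
    return {topic: brokers[i % len(brokers)]
            for i, topic in enumerate(topics)}

leaders = assign_leaders(["chris", "fred", "jemima"], ["b1", "b2", "b3"])
print(leaders)   # {'chris': 'b1', 'fred': 'b2', 'jemima': 'b3'}
```

With three topics and three brokers, every broker leads exactly one topic, which is the even-distribution behaviour described above.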
If one broker goes down, another broker picks things up. Now, the problem you might be thinking of at this point is: how do I ensure my brokers are not co-located in the same place? Let me explain what I mean. In an on-premise data centre, I could end up with all of them residing on the same rack, or even the same machine. Bring that up to a cloud scenario, maybe AWS or Google, and think about the concept of cloud availability zones within a region. If you don't know what an availability zone is, think of it as roughly equivalent to a data centre that sits close to other data centres within that region. Taking AWS as the example: you may have three availability zones, and if you lose one, AWS can fail you over to another linked availability zone in the region, giving you high availability within that region. (We could get into an argument about multi-region availability and losing an entire region, but that's well outside the scope of this video.) Now apply the availability zone concept to the brokers in our cluster. What we wouldn't want is, for a particular topic partition, to have its replica partitions residing on brokers in the same availability zone, because if you lost that entire zone you'd risk the integrity of your cluster: essentially everything in that zone is gone.

Let me get the iPad out and draw this up. Imagine three availability zones: zone a, zone b, and zone c, and we want our Kafka cluster to spread across all three. I'll have three brokers: broker 1 residing in zone a, broker 2 in zone b, and broker 3 in zone c. Remember, a topic is really a logical construct and the topic partition is the unit of scale, the physical thing: a topic partition resides on a particular broker, and you can't have multiple copies of the same topic partition on one broker; it's one per broker. So for my chris topic (I'll write it as ct, because I'm being lazy), I'll put one partition on each of the three brokers: broker 1 is the leader, and brokers 2 and 3 hold the replicas. As you can start to see, that works well when my brokers and topic partitions are spread evenly across the zones. By the way, as I said before, different topics don't all have the same layout. If we had another topic, say the jemima topic, broker 2 could be its leader, and the copies on the other brokers would be replicas, or followers. Thinking about this logically: if I lose zone a completely, I lose that broker and all of its topic partitions, and one of the other zones picks them up. That's the nicely spread-out scenario I want to maintain: brokers spread across the zones, so if I lose a zone, the others take over. What I wouldn't want is broker 2 also residing down in zone a alongside its partitions. That would be a bad scenario, because if I lost zone a I'd lose broker 1 and its topic partition, but I'd also lose broker 2's topic partitions. I need everything evenly spread across my availability zones.

Achieving that is really simple in Kafka, and we'll look at it in one of the other videos when we do the implementations: you use the rack awareness configuration and set a property called broker.rack. When you're creating broker 1, broker 2, and broker 3, you give each an identifier that tells Kafka how to treat rack awareness. On AWS you would probably align broker.rack with the availability zone: set it to zone a for broker 1, zone b for broker 2, and so on. Kafka then looks at the broker.rack setting and makes sure replicas for a partition are spread across the racks, aligned to the availability zones you specify. That means, from a replica perspective, you've got rack awareness: you won't place replicas or followers for those partitions in the same zone and risk an outage from losing a particular zone. That's really cool.
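Concretely, broker.rack is a real per-broker setting in each broker's server.properties file; the zone identifiers below are hypothetical examples matching the three-zone layout above.

```properties
# Broker 1's server.properties: this broker runs in availability zone a
broker.id=1
broker.rack=zone-a

# Broker 2's server.properties would use its own zone:
# broker.id=2
# broker.rack=zone-b
```

With broker.rack set on every broker, Kafka's rack-aware replica placement spreads each partition's replicas across as many distinct racks (here, zones) as possible.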
not going to go into because you you're now in a position where producers are uh and even consumers and we'll come on to in a second where producers may be uh writing to something that's in a different availability zone so the producer might be living in a different availability zone from the broker so there's a lot of complexity that comes in because that would that would uh create latency and you wanted to be really close um so they there starts to become a little bit of complexity that's involved and and again it's outside of the scope of this video but one of the things that you can do is you can write a little custom routine a custom selector and such which allows you to then uh be able to use the right uh producer uh to talk to that that's gonna reside in an availability zone to allow you to do the right right so there's there's a whole set of things that you can do around that but it's quite complicated and probably well outside of the scope of what we're doing here okay so that is uh rack awareness within kafka and i think it's a really cool thing okay so we've got a pretty good idea of how high availability works and we've got a good idea of how scaling occurs within a cluster for different topics but what about an individual topic so as i said before the topic partition there's only one version you know per broker now and as i said before order is guaranteed within that partition for a producer so if i write to my topic partition and let's say i write m1 m2 m3 m4 and 5 and m6 then that order is sort of guaranteed there because i've got one producer who is writing to one broker who's acting as the leader so in order to maintain that ordering then if you think about it logically i can only have one consumer able to read from that topic partition yeah because if i have multiple consumers who are able to read from that topic partition then i can't guarantee the order anymore because consumer one could read from m2 consumer two could read uh n3 consumer four 
could read m4 so therefore they would be reading things and processing things in a different order and therefore the order can't be maintained so the only way that you can guarantee that message ordering is by having one consumer a single consumer being able to read a topic partition at any time so that sort of completes that loop so we have producer and we have consumer we have the logical topic for producers and consumers so we can have multiple producers and consumers logically within a topic but when we bring it down to an individual level i can have one producer writing to one topic partition and one consumer instance reading from top one topic partition at once but that leads into an obvious problem there which is that if i have too many messages going on to an individual topic or topic partition in this case then i i might not be able to keep up with those messages and that's exactly why partitioning exists in the first place because partitioning allows you to then say you know what i want to be able to have a different way of scaling so what what do i mean by this so what you would essentially want to be able to do is shard or have multiple partitions per topic and then you want a way of separating that out so what that could mean if you think about this logically is that i might have my chris topic yeah um but then i might want to partition it on something like user id or i want to partition it on you know a system id or something like that so some sort of way of being able to distinguish in it and therefore by doing partitioning then messages can be written in order to a particular petition so what do i what do i mean by that so let's let's take a sort of more concrete example let's imagine i was writing something like um facebook for example or twitter so let's let's imagine i i'm gonna tweet on my twitter uh timeline for a second so i write a tweet so tweet one um and then i say hello and then i go hello world and then say walk to uh you know the 
kitchen', 'ate a sandwich', 'walked back to my room': three tweets. Now, I wouldn't want them out of order for my user id. In that timeline you'd want 'walked to kitchen', 'ate sandwich', 'walked back to room' in that particular order, because 'ate sandwich', 'walked to kitchen', 'walked back to room' wouldn't make sense. So user id could be a good logical partitioning key. In that sense, the only order I can guarantee is the order of messages within a partition, and that's how Kafka scales. If I partition on something like user id, Kafka looks at the number of partitions I've set; I can tell Kafka how many partitions I want for my topic, maybe 10, maybe 20 or 100, it doesn't really matter. If I've got, say, 10 partitions, that means I could have up to 10 consumer instances, and Kafka takes the partition key, say user id, hashes it, takes the modulo, and uses that to split messages across those 10 partitions. So messages for my user id are guaranteed to go to a particular partition and stay in order, and somebody else's, say your tweets, are also in order on their partition; mine and yours could be processed at different rates relative to each other, but that's okay. So we could have partition one and partition two here for the chris topic, and on partition one I'll have message one, message two and message three.
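The hash-and-mod idea above can be sketched in a few lines of JavaScript. This is just an illustration: Kafka's default partitioner actually uses a murmur2 hash, and the function and key names here are made up for the example.

```javascript
// Sketch of keyed partition selection: hash the key, then take it modulo
// the partition count, so every message with the same key lands on the
// same partition. Kafka's real default partitioner uses murmur2; this
// simple string hash is only for illustration.
function hashKey(key) {
  let h = 0;
  for (let i = 0; i < key.length; i++) {
    h = (h * 31 + key.charCodeAt(i)) >>> 0; // keep it an unsigned 32-bit int
  }
  return h;
}

function partitionFor(key, numPartitions) {
  return hashKey(key) % numPartitions;
}

// All tweets for one user id map to the same partition, so their order
// within that partition is preserved.
const p1 = partitionFor("user-12345", 10);
const p2 = partitionFor("user-12345", 10);
// p1 === p2, always
```

The important property is determinism: the same key always maps to the same partition, which is what makes per-user ordering work.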
Over on partition two I could have message one-one, message one-two, message one-three and message one-four, and that's okay from an ordering perspective: you can only guarantee order within the partition. So when a producer writes to the cluster (we'll put a producer in here), the producer talks to the broker, and because of the partition key for the topic, say user id 12345, it knows which partition the message goes to. Again, it talks to the leader for that topic partition. If it was your tweet it might be topic partition p2, which could be on broker three; my tweets could go to topic partition one, which could be on broker one. So the producer writes to different brokers depending on where each topic partition lives; it writes the message and then the message gets replicated, because each topic partition is replicated to other brokers as well. If you think about that for a second, that allows massive scale, as long as you can partition correctly and you understand what your partitioning keys are going to be and how you specify your topics. You've got replication, but you've also got massive scale, because as long as you're not worried about ordering across partitions, you can scale at that point. Now, if you don't specify a partition key but you do specify a number of partitions, say 10 partitions for my chris topic, Kafka will just round robin, evenly distributing your messages across the different partitions, which is fine for scenarios where you don't care about message ordering. But if you do care about message
ordering, such as wanting a certain user's tweets to stay in the same order, then that's something you'd care about. So that's how partitioning works and how things scale. It's pretty cool stuff, but the key thing to think about is: what order do I want my messages to be in? Okay, now that we understand how ordering within partitions works, and the three delivery acknowledgement modes, fire and forget, leader acknowledgement, or acknowledgement from all replicas in the in-sync set, it's worth spending a second on how Kafka handles message delivery, because that acknowledgement message is really important, and it opens us up to understanding how transactionality works within Kafka. Let me draw this up. Here's our producer, as before. The first thing we do is send a message, m1, and the broker accepts that message. What the broker does at that point is append the message onto the topic partition, that's the physical side, so we'll draw our topic partition here and put m1 onto it; we'll label that step two, 'append m1'. What happens now is the producer waits for an acknowledgement back from the broker. In fire and forget mode it doesn't care: as soon as it does the send, it doesn't wait for an acknowledgement, it just moves on. Outside of fire and forget, though, the producer waits for that acknowledgement.
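The three acknowledgement modes map onto the `acks` option in KafkaJS, the client we'll use in the next video. This is a hedged sketch rather than a runnable program: it assumes a broker at localhost:9092, and the client id, topic and key names are placeholders.

```javascript
// Sketch of the three acknowledgement modes via KafkaJS's per-send
// `acks` option. Requires a live broker, so treat it as configuration.
const { Kafka } = require("kafkajs");

const kafka = new Kafka({ clientId: "my-app", brokers: ["localhost:9092"] });
const producer = kafka.producer();

async function sendWithAcks() {
  await producer.connect();
  await producer.send({
    topic: "chris-topic",
    messages: [{ key: "user-12345", value: "m1" }],
    acks: 0, // fire and forget: don't wait for any acknowledgement
  });
  await producer.send({
    topic: "chris-topic",
    messages: [{ key: "user-12345", value: "m2" }],
    acks: 1, // leader acknowledgement only
  });
  await producer.send({
    topic: "chris-topic",
    messages: [{ key: "user-12345", value: "m3" }],
    acks: -1, // wait for all in-sync replicas to confirm the write
  });
}
```

The stronger the acknowledgement mode, the longer each send waits, which is the durability-versus-latency trade-off described above.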
So let's draw that in: the producer is expecting an acknowledgement of m1 here. If the producer gets that acknowledgement, it's fine: it knows the message has been committed and is on the broker. As I said before, if only the leader acknowledges the write, then as soon as that append occurs the broker sends the acknowledgement straight away. If, however, you've configured it so that every replica must confirm having a copy, then every follower replica of that topic partition in the cluster needs to do its write as well, and only once all the replicas have confirmed their writes does the broker send the acknowledgement. That's the difference between those modes. Now, what happens if an acknowledgement isn't sent? The producer will essentially resend the message, so it just sends m1 again. If for whatever reason there was some issue with the broker, maybe it didn't get a chance to send out that acknowledgement, you can see there's a scenario where you end up with duplicates, and that m1 message gets written twice. We've drawn that on the iPad, but I'll show the same thing on a graphic here: the exact same standard flow, where we send the message to the broker, the broker appends it onto the topic partition, and the producer expects the acknowledgement message. And just as I drew a second ago, let's re-show the scenario of how we end up with message duplication. In this particular scenario we
are sending a message m2 from the producer to the broker. m2 gets appended after m1, but for whatever reason the acknowledgement doesn't come back, so the producer retries m2, the broker appends m2 again, and you end up with m2 appearing twice before the acknowledgement for m2 finally happens. So there's a lot going on there: we now understand how acknowledgement happens within Kafka, the various acknowledgement modes for brokers and replicas (fire and forget, leader acknowledgement, or all replicas acknowledging), and we also understand how duplication can occur: if that acknowledgement doesn't come across, you can end up with duplicate messages on the queue. Why is that important? Because there are cases where you wouldn't want duplication. If you're handling financial transactions, that's a scenario where duplication is unacceptable. A good example is Uber: I believe they run real-time pricing off Kafka, and they wouldn't want to double charge people. Banks are another example, or replicating database logs using Kafka, another scenario where you wouldn't want duplicates. But if you're running a social network, maybe a Twitter or a Facebook, and you're processing likes, maybe you don't care about a bit of duplication, because you're not worrying about consistency so much in that sense. In that case it's okay: I can deal with a bit of message loss and a bit of duplication.
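The lost-ack retry above can be simulated in a few lines of JavaScript. The "broker" here is just an array we append to, and the lost-ack flag is invented for the illustration; real producers use timeouts and sequence numbers rather than a boolean.

```javascript
// Simulation of how a lost acknowledgement produces a duplicate: the
// broker appends the message, the ack never arrives, so the producer
// resends and the broker appends the same message a second time.
function produceWithRetry(log, message, ackIsLost) {
  log.push(message); // broker appends the message
  if (ackIsLost) {
    // no ack arrived, so the producer retries the same message
    log.push(message); // broker appends it again: a duplicate
    return { acked: true, retried: true };
  }
  return { acked: true, retried: false };
}

const log = [];
produceWithRetry(log, "m1", false); // normal flow
produceWithRetry(log, "m2", true);  // ack lost, m2 is retried
// log is now ["m1", "m2", "m2"]: m2 appears twice
```

This is exactly the failure mode that idempotence and transactions, covered next, are designed to eliminate.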
In those financial situations, though, what you can do is use a transaction. I've got a little bit of code showing on the screen, and we'll do this properly in one of the other videos where we code it up in KafkaJS, but I want you to be aware of it from a transactional point of view. You can configure Kafka so that exactly one copy of the message lands on the queue when you produce. There are a couple of settings involved: the max in-flight requests setting would be set to one, so you haven't got multiple in-flight requests going on, and you'd set the idempotent flag to true, which is what enables that. The last thing you'd use is the concept of transactions: when the producer sends m1, you wrap that send in a transaction, and then you send a commit message at the end once you've got the acknowledgement back. So you get into a sort of two-phase commit scenario, which is completely supported by Kafka, and it means you don't get duplicates: because you've wrapped the transaction around the send, m1 is written once and you get exactly-once delivery. Of course, the downside of the transactional approach is that there's a bit more processing for Kafka to do, so it will slow your throughput down a little. There's a trade-off of performance against that level of transactional consistency.
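Put together, those settings look something like this in KafkaJS. This is a sketch under assumptions: the broker address, client id, transactional id and topic name are placeholders, and it needs a running Kafka cluster, so it isn't runnable standalone.

```javascript
// Sketch of an exactly-once producer in KafkaJS: one in-flight request,
// idempotence on, and the send wrapped in a transaction.
const { Kafka } = require("kafkajs");

const kafka = new Kafka({ clientId: "my-app", brokers: ["localhost:9092"] });

const producer = kafka.producer({
  maxInFlightRequests: 1,      // no overlapping requests
  idempotent: true,            // broker de-duplicates retried sends
  transactionalId: "my-transactional-producer",
});

async function sendExactlyOnce() {
  await producer.connect();
  const transaction = await producer.transaction();
  try {
    await transaction.send({
      topic: "chris-topic",
      messages: [{ key: "user-12345", value: "m1" }],
    });
    await transaction.commit(); // completes the two-phase flow
  } catch (err) {
    await transaction.abort();  // roll back so no partial writes are visible
    throw err;
  }
}
```

The commit/abort pair is what gives you the two-phase behaviour described above, at the cost of the extra round trips mentioned.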
Okay, so now we've got a pretty thorough understanding of how Kafka works: how Kafka clusters work, how producers work, what topics and topic partitions are. I want to round this off with a deep dive on how consumers work, and the simple way of thinking about it is that consumers are really just the opposite of producers. Let's draw this out on the iPad again. If you remember, we have a producer, p, writing to the broker, b, and the broker writes to the topic partition. We'll draw that here, this is everything within my broker, we'll call this broker one, and as you remember we also have broker two through broker n, with partitions aligned across them. Now, in the same way I've got a producer writing to a broker and onto a topic partition, I want a consumer reading from it, so here's my consumer, c1. That's cool, but there are some rules of the road: consumers are assigned to a topic partition, and only one consumer can read messages from a topic partition at any time. Let's think about that logically for a second, because remember, topic partitions are a unit of scale. If multiple consumers were able to read from the same topic partition, I'd have another message-ordering problem: consumer one reads message one, consumer two tries to read message two, consumer three tries to read message three, and they all get in each other's way. It's just chaos: they're all trying to keep in sync and figure out who's read what. So what actually happens is only a single consumer
instance can read from a single topic partition. That might seem inefficient, but it's what allows us, on the consumer side, to maintain message ordering and fast throughput, and it gives you that performance, availability and consistency from the consuming system. So if a consumer can only read from one topic partition at a time, how do we scale our consumers? This is where the concept of consumer groups comes in. In Kafka you specify your consumers as part of a consumer group: a group of consumers that are there to perform a single task, scaling and distributing that work across all consumers within the group. Let me draw that out. Let's have a consumer group here; it could be processing bookings, or processing likes, or processing transactions, it doesn't really matter, and let's say I have five consumers within my group: c1, c2, c3, c4 and c5. Now let's look at this from a topic partition perspective. I'll bring back my chris topic: we'll have chris topic partition one residing on broker one, and chris topic partition two on broker two. Only one consumer can read from a topic partition at any one time, so let's imagine consumer one is reading from chris topic partition one, that's fine, and from a scaling perspective, if I want to read and process data from chris topic partition two, consumer two can pick that up. That's
how I start to scale, and that's why partitioning is important: if my chris topic only had one partition, only one consumer would be able to read from it, and from a scalability and throughput perspective I could only go as fast as that single consumer. So to increase my throughput, I increase the number of consumers in the consumer group, and that is my scaling point for read throughput; in this case I can double the amount of messages I can process because I've got two partitions. If I want to scale a bit more, I introduce broker three and chris topic partition three, and consumer three can pick that up, so now I've got three. In the same way as we talked about on the write side with producers, consumer instances are our unit of scale. Logically, though, there's no point having more consumer instances than the number of partitions: the extra consumers just sit idle, because they've got no work to process for that topic; three partitions can feed at most three consumers. If I want to scale further, I need to introduce more topic partitions, and then more consumer instances can pick up that work. That's the work-starvation point: more instances than topic partitions won't add any scale. So that's how topic partitions and consumer groups work together, and of course, as you saw from the producer perspective, I can add more partitions at any time, or scale up and introduce more consumer instances.
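The partition-to-consumer assignment logic can be mimicked with a simple round-robin in JavaScript. This imitates the idea only: Kafka's real assignors (range, round robin, sticky) are more involved, and all names here are invented.

```javascript
// Sketch of how a consumer group spreads partitions over its members.
// With more partitions than consumers, some consumers get two; with more
// consumers than partitions, the extras sit idle with no work.
function assignPartitions(partitions, consumers) {
  const assignment = {};
  consumers.forEach((c) => (assignment[c] = []));
  partitions.forEach((p, i) => {
    assignment[consumers[i % consumers.length]].push(p);
  });
  return assignment;
}

// 3 partitions, 2 consumers: c1 is doubled up
assignPartitions(["p1", "p2", "p3"], ["c1", "c2"]);
// → { c1: ["p1", "p3"], c2: ["p2"] }

// 3 partitions, 5 consumers: c4 and c5 get no partitions and sit idle
assignPartitions(["p1", "p2", "p3"], ["c1", "c2", "c3", "c4", "c5"]);
```

Adding or removing a consumer just means re-running the assignment, which is essentially what a rebalance does, with the pause in processing that the next section describes.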
Now, there is a link here, as you can see: consumer one is linked to topic partition one. What if I lose a consumer instance for whatever reason? Or I could have the opposite scenario, where I've got a topic with three partitions but only two consumer instances within my group, which means a double allocation: consumer one could be handling chris topic partitions one and two, while consumer two deals with chris topic partition three. That means consumer instance one is maybe a little overloaded compared to consumer instance two, so I might decide to increase my scale and introduce another instance, c3. At that point the consumer group rebalances and essentially reassigns that work to consumer instance three. So you get this rebalancing, and at that point I'm pretty good, because I've now got three instances and can increase my throughput. But a little bit of work has to happen to reallocate consumer instances, so you want to have some idea of your consumer instance count in advance: if I keep scaling consumer instances up and down, the consumer group has to rebalance the work across all the instances, and while it does so it stops processing messages, which causes a little delay in processing. You want to be a bit thoughtful about that, but as you can see, the mechanism exists both to scale your consumer instances up and down and to redistribute the partitions when you do. That's how consumer groups work. One more thing to be aware of: previously we talked about rack awareness, with producers writing to your broker, the brokers
being spread across the various availability zones, and the topic partitions spread across those availability zones as well, so the data isn't all residing in one place. As I said, the problem with that scenario is latency across multiple data centers, and one of the things that's pretty cool about consumer groups and consumer instances is that you can also align a consumer instance to an availability zone. There's a setting called client.rack, where you can say this client resides in this availability zone, and the consumer group will then align your consumer instances to be close to the availability zone their broker resides in. So if I've got a topic partition in availability zone a, consumer instances that live in availability zone a will read from it, rather than communicating across zones. Of course there's more complexity than that, but it's useful to be aware of. That leads into another thing: as I said, producers can only write to the leader of a topic partition. Consumer instances used to have to read from leaders as well, but now they have the ability to read from followers, so they don't necessarily need to read from the leader. That's really useful because it gives you scale, and it also means that in the availability zone scenario you can configure Kafka to read from the closest follower, the closest replica, so you avoid cross data center traffic. I think that's something super cool as well. Okay, so that's how consumer groups work, and that is my overview of Kafka: a little bit of a long one, but I'm hoping you really understand these topics.
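For reference, the rack-awareness settings mentioned above are plain Kafka configuration: client.rack on the consumer, and on the broker a replica selector (from KIP-392) plus the broker's own rack. A sketch, assuming availability zone names like az-a:

```properties
# broker configuration (server.properties): declare which rack/AZ this
# broker is in, and enable a selector that lets consumers fetch from the
# nearest replica instead of always the leader
broker.rack=az-a
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector

# consumer configuration: declare which rack/AZ this consumer is in
client.rack=az-a
```

When the consumer's client.rack matches a follower's broker.rack, fetches are served from that follower, which is what avoids the cross data center traffic described above.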
In the next video we'll get KafkaJS out and actually start interacting with Kafka: get it up and running on a server, write to some topics, and so on. Anyway, I hope this has been useful, and we'll catch you in the next video.
Info
Channel: Chris Hay
Views: 195
Keywords: chris hay, chrishayuk, apache kafka tutorial, kafka explained for beginners, exactly once semantics kafka, kafka introduction, kafka tutorial, apache kafka, kafka explained, kafka topics, system design, messaging, systems architecture, integration architexture, event streaming, messaging queue, how does kafka work, kafka fundamentals, kafka introduction tutorial, kafka introduction for beginners, apache kafka explained, apache kafka tutorial for beginners, confluent
Id: wKbxNk5X3bg
Length: 63min 26sec (3806 seconds)
Published: Thu Dec 02 2021