A Deep Dive into Apache Kafka: This is Event Streaming, by Andrew Dunnings & Katherine Stanley

Captions
Hello everyone, welcome to A Deep Dive into Apache Kafka, and thank you so much for coming; we know there are a lot of exciting talks on at this time, so we're very grateful you've come to this one. My name is Kate Stanley, this is Andrew Dunnings, and today we're going to be doing a deep dive into Apache Kafka. First, just to get an idea of the audience: a reasonable number of you have already had a go at running a basic Kafka with a producer and consumer, some of you have gone beyond that and written Java code or done something more complicated with streams, and a few of you are running Kafka in production.

Here is what we're planning to cover today, broken into two parts with a break in the middle. I'm going to talk about event-driven applications, what Kafka is, and then go into the Kafka cluster and some of the details there. Andrew will then talk about producers, consumers, and considerations for running Apache Kafka in production. After a half-hour break he'll talk about Kafka Connect and change data capture, and I'll finish off with Kafka Streams.

So to get started: what is Kafka, and why are people going event-driven? More and more I'm seeing people move from microservices, or even existing systems that weren't microservices-based, towards wanting events at the centre of their applications. Being event-based and event-driven gives you better insight into the data flowing around your system, because an event represents more than a static piece of data: it tells you what happened before as well as what is happening now.

Let's look at the components of an event streaming application; throughout this deep dive we'll see how the different elements come in and how Kafka makes a great event backbone. Generally an event streaming application has the event backbone plus some event sources. Those could be sensor data, clicks on a website, or even an existing database, as we'll see with change data capture later. Then there's an option to do stream processing: taking events off the event backbone, doing some sort of processing, and feeding the result into an app or back into the backbone; we'll see that with Kafka Streams. Systems like Kafka aren't designed to store events forever, so if you really want a persistent store of those events you'll want to put them into an event archive. The other big element that moving to events gives you over being purely data-driven is notifications. We're all used to notifications pinging up on a phone, whether from Twitter or WhatsApp, and an event translates very naturally into a notification. These are the elements we'll talk about today, focused around Kafka as the event backbone.

Whenever I talk about Kafka, since I work at IBM, I always get asked about message queuing versus event streaming. At IBM we have IBM MQ, which is a message queuing system, and IBM Event Streams, which is a fully supported Kafka, so that's event streaming. I like to show this slide and give the differences so everyone is on the same page.
A message queuing system has transient data persistence: when you put a message onto a queue, as soon as it's read it's no longer on the queue, so no one else can read that same message. It's generally aimed at targeted, reliable delivery; MQ has a lot of features built in to help with reliably sending a message to a specific recipient, and because of that it's often used for request-reply patterns. In comparison, event streaming provides stream history, and that's one of the big differences: the events that go onto your event store stay there and can be reread. It also provides very scalable consumption, because you can have multiple different consumers at once, and we'll see how that works in Kafka. And the data is immutable: once the events are on the event store, in that order, that's what happened. An event is a statement of something that has happened, not a message asking someone to do something, so it should be immutable. On top of these, the properties you want in an event backbone are for it to be scalable and highly available, so that it meets all of your needs.

This is where Apache Kafka comes in. Kafka is an open-source distributed streaming platform, and its key characteristics are publishing and subscribing to streams of events, storing events durably, and processing streams of events as they occur. You can see that matches up well with the properties on the previous slide. One of the other really nice things about Kafka is its rapidly growing community. These are some pictures from the Kafka Summits I've been speaking at this year; the picture at the top is a big board they had at Kafka Summit in San Francisco with lots of the core contributors, and all those names are just a tiny fraction of everyone who has contributed to Kafka and makes up the community. By choosing Kafka as the event backbone for your event-based applications, you're buying into a community that is creating new ways to integrate with Kafka every single day, and you'll see that later when we talk about Kafka Connect.

The first thing to do is show you a quick getting started with Kafka, so this is where we pray to the demo gods. If you haven't had a go with Kafka before, the place I always send people is the quickstart page in the Kafka docs at kafka.apache.org, which has a set of instructions to help you get started. The first thing you have to know about Kafka is that you also need ZooKeeper, so we start ZooKeeper first (the zookeeper-shell script you'll see later, when I talk more about ZooKeeper), and then we start Kafka itself. Notice that I'm just running shell scripts with some properties files; I haven't had to set anything up in advance, because when you download Kafka it comes with shell scripts and default config files to get you started. Once both ZooKeeper and Kafka are up, I need a topic, because in Kafka we store events in topics. I can list the topics I've already got (the answer is none), then create one called test, and we'll go through what options like replication factor and partitions mean as we go through the talk; note that I had to tell the tool where ZooKeeper is. Then I can start a console producer and, while that's running, a console consumer, again just shell scripts. The producer tab gives me a place to type messages, so I can send "hello Devoxx" and see it appear on the consumer side.
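For reference, this is roughly the sequence of commands from the Kafka quickstart that the demo follows; it's a sketch using the default ports and the example topic name "test", and the exact flags vary a little between Kafka versions.

```sh
# Start ZooKeeper, then a single Kafka broker, using the config files that ship
# with the Kafka download (defaults: ZooKeeper on 2181, Kafka on 9092).
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties

# Create the "test" topic used in the demo. Newer Kafka versions accept
# --bootstrap-server; older ones used --zookeeper localhost:2181 instead.
bin/kafka-topics.sh --create --topic test \
  --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1

# Produce and consume from the console, in two separate terminals.
# (Older console producers take --broker-list, newer ones --bootstrap-server.)
bin/kafka-console-producer.sh --topic test --broker-list localhost:9092
bin/kafka-console-consumer.sh --topic test --bootstrap-server localhost:9092 --from-beginning
```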
So getting started with Kafka is pretty straightforward, but we've already seen some terminology you have to know about: topics, replication factor, and partitions. One of the nice things about Kafka is that it's very configurable, which is great, but it does mean there's initially a bit of a learning curve. So let's dive into some of the Kafka internals and how it works.

I'm going to focus first on the Kafka cluster itself and what you have to stand up to start Kafka. When I ran kafka-server-start in my shell, that was actually starting what's called a Kafka broker, and generally you should run with three or more brokers; we'll see why as we go through how they work, but three is a good starting point. A particular topic in Kafka is broken down into partitions, and the reason for this is that we can spread the partitions across the different brokers. If lots of applications want to send events to topic A, they're not all overwhelming one broker: they can talk to broker one, broker two, or broker three, and the events go into the different partitions. So straight away we can see that Kafka has scalability built in, because we have multiple partitions spread across the different brokers.

Kafka is also highly available, and it does this using replication. For a particular topic and a particular partition there is one broker which is the leader, and all of the other brokers are called followers. When you want to send events to that topic and partition you only talk to the leader, so all of the Kafka clients go straight to the leader, but under the covers Kafka replicates that data across to the other brokers. That means if for some reason the leader goes down, we haven't lost the data; it's already on the followers, and Kafka will run a leader election, elect a new leader for that topic and partition, and everyone just moves over and starts talking to the new one. When you connect to Kafka you can generally talk to any broker, and Kafka will tell you which broker to come back to, and that will be the leader.
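You can see this leader and follower layout for yourself by describing a topic; a sketch using the demo's "test" topic (the Isr column lists the in-sync replicas, which the next section explains):

```sh
# Describe a topic to see, per partition, which broker is the leader, which
# brokers hold replicas, and which replicas are currently in sync (Isr).
bin/kafka-topics.sh --describe --topic test --bootstrap-server localhost:9092
# Example output on a single-broker cluster:
#   Topic: test  Partition: 0  Leader: 0  Replicas: 0  Isr: 0
```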
That's fine, we've got leaders and followers, but what happens if the replication under the covers isn't keeping up? That's the difference between an out-of-sync and an in-sync replica. The leader for a particular topic and partition tracks the point each of the followers has got up to, so it knows how far behind everyone is. A replica is called out of sync if it hasn't requested a message in more than 10 seconds, or if it hasn't caught up to the most recent message within 10 seconds, where 10 seconds is the default value and you can change it using replica.lag.time.max.ms.

Each partition also has a preferred leader. When you initially create the partition it gets assigned a leader, and Kafka makes sure that when all the partitions are created the leaders are spread out across the different brokers. Of course, if leader elections happen because something goes down, that might change, but it's worth bearing in mind that initially they're all spread out; we'll see later, when we talk about running in production, what challenges you can face if there are too many leader elections and your partitions aren't using their preferred leaders.

So we've got this idea of replication, but what does it actually mean for you when you're running Kafka? There is a very important setting to know called min.insync.replicas: the number of replicas, declared by you, that must be in sync for a partition of the topic to be considered available. It's up to you, when you create the topic, to decide how many replicas you're going to have and what your min.insync.replicas will be, and this has different effects. Say we've got three brokers and min.insync.replicas is set to two, and we have a leader, one in-sync replica, and one out-of-sync replica: all of your producers and consumers can happily produce and consume events. But if another of your brokers goes out of sync, you no longer meet min.insync.replicas: the leader counts as one, and the other in-sync replica has now gone. Kafka will then prevent your producers from sending events; consumers can still continue consuming what was already there. The key thing is we don't want a situation where the producer sends more events to the leader, the leader goes down, and suddenly you've got inconsistent events before they were replicated. Your producers will get an error back from Kafka saying there aren't enough in-sync replicas. It also means that if you set up your replicas and min.insync.replicas incorrectly, you could end up in a scenario where you can never send events into your topic, so you do have to make sure you set these two correctly.
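As a concrete sketch, this is how those two settings might be declared together at topic creation time (the topic name "orders" is just an example):

```sh
# Create a topic where availability requires the leader plus at least one
# in-sync follower: 3 replicas, min.insync.replicas=2.
bin/kafka-topics.sh --create --topic orders \
  --bootstrap-server localhost:9092 \
  --partitions 3 --replication-factor 3 \
  --config min.insync.replicas=2

# replica.lag.time.max.ms (broker config, default 10000 ms) controls how long a
# follower can go without fetching or catching up before it is marked out of sync.
```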
OK, so we've got all of our different brokers and followers, but there is one particular broker which is given an additional role on top of potentially being leader for different partitions, and that's the controller. The controller is in charge of deciding who is going to be the leader for the different partitions and telling them what they're doing. It's a special broker that gets assigned, and to make sure only one broker thinks it's currently the controller, it uses ZooKeeper. Right at the beginning I said the first thing you have to know when getting started with Kafka is that you have to run both Kafka and ZooKeeper, so let's have a look at what's in ZooKeeper. If you're running ZooKeeper in a container, you can exec into the container and run something called zkCli.sh, which lets you run commands against your ZooKeeper; if you're running locally, Kafka provides a zookeeper-shell script that does the same thing.

If I run that, it connects to my ZooKeeper on localhost:2181; all of these ports are just the defaults. I can do an `ls /` to see my current ZooKeeper tree, and you can see some of the things being stored in there, specifically the controller, the controller epoch, and the brokers. If I list the broker IDs you can see I'm only running one broker currently, because I'm running locally and standing up more takes time. If I do a get on broker ID zero, this is one of my brokers. When you run more than one broker in Kafka, they all have to have a unique ID; if you're interested in how that works with persistence on Kubernetes, I'm doing a talk tomorrow about Kafka on Kubernetes. The key thing is that when the broker starts up it has this unique ID, and there's other information stored there too, such as endpoints. By storing the information about each broker ID in the ZooKeeper tree, ZooKeeper knows about the right number of brokers, and each has its unique name.

We also have the controller in here: if I get the controller, you can see which broker is currently the controller; it says version 1, broker ID 0, and a timestamp. So what happens when a broker decides it's going to be the controller? It tries to write into this node in your ZooKeeper tree, and assuming it writes successfully, all is good. Other brokers will then try to become the controller and do the same thing, but they'll see there's already an entry and accept that. Of course, you could get a scenario where one of the brokers thinks it's the controller but it can't talk to ZooKeeper any more (without fully going down); at that point ZooKeeper thinks there's no controller running, and another broker might come up and think it's now the controller. The way we get around that is the controller epoch, which you can see is currently one: every time you get a new controller it counts up, and that's how you make sure you don't end up with split brain and two different brokers thinking they're the controller at the same time, because whichever one has the latest number is the correct controller. So when you're starting to look after your Kafka and ZooKeeper and understand what's going on, it's worth knowing how to look into ZooKeeper, look at this tree, and see what's actually happening under the covers.
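A sketch of the ZooKeeper commands used in this part of the demo; the znode paths are standard Kafka ones, and the port is the default:

```sh
# Inside zkCli.sh (or bin/zookeeper-shell.sh localhost:2181 from the Kafka download):
ls /                   # top of the tree: brokers, controller, controller_epoch, ...
ls /brokers/ids        # one entry per live broker, keyed by broker.id
get /brokers/ids/0     # endpoints and metadata registered by broker 0
get /controller        # which broker currently believes it is the controller
get /controller_epoch  # increments on every controller change, guarding against split brain
```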
ZooKeeper actually has an interesting history within the Kafka community; for those of you who keep up with Kafka news this might not be new to you, but in previous versions quite a lot of things talked to ZooKeeper directly. You might have noticed at the beginning that I made a slight error in the command I ran to create the topic: I was pointing at ZooKeeper, not Kafka. That's because all of the admin tools used to talk directly to ZooKeeper to create topics and that kind of thing, and all of the consumers would also talk to ZooKeeper to store their offsets (we'll learn about offsets in a bit). With all these different things talking to ZooKeeper, if you're the person running Kafka it can be a bit annoying, because you're having to expose not only endpoints for Kafka but also endpoints for ZooKeeper to anybody who wants to use the system. So over time the Kafka community has been moving everything across. This is what it currently looks like: there's an admin client, everything talks directly to Kafka, and Kafka reads and writes to ZooKeeper under the covers. You'll notice that older articles often talk about running commands directly against ZooKeeper; that's now not the case for a lot of the commands. You go to Kafka, and Kafka will talk to ZooKeeper if it needs to.

As for future plans, there's a KIP in progress; KIP stands for Kafka Improvement Proposal, and if you have a look on the internet you can find KIP-500. Currently what's in ZooKeeper is metadata; previously we also had other data in there, such as some of the metadata for consumers, but the community has been moving everything across and storing things in Kafka topics instead. The aim of this KIP is to take away the reliance that Kafka has on ZooKeeper and use Kafka itself to store everything. The main motivations are, first, that people won't have to maintain both a Kafka and a ZooKeeper, which can be frustrating when the first thing you have to do to start Kafka is stand up ZooKeeper; and second, that ZooKeeper is a general-purpose tool that wasn't created for Kafka, so by moving this functionality into Kafka, the way everything is stored can be custom-built for how Kafka works going forwards. This KIP is still in progress, and I don't expect it to land in Kafka until sometime next year, hopefully. There's a really good talk on this if you're interested in the journey the Kafka community has been on with ZooKeeper: it was given at Kafka Summit San Francisco, all the videos are online, and it's called "Kafka Needs No Keeper". But that's a high-level overview of what's happening with ZooKeeper and why, if you look back at tutorials, there are differences between when you used to talk to ZooKeeper and now, when pretty much everything apart from the Kafka brokers talks directly to Kafka rather than to ZooKeeper.

OK, I'm now going to hand over to Andrew, who's going to talk about producers and consumers in a bit more detail, tell you more about how the partitions work beyond basic scalability, and how you make use of that in your producers and consumers.
Right, hello everyone. Now that Kate has familiarised you with the Kafka cluster itself, I'm going to talk a bit more about the external clients that connect into the cluster to get data into and out of Kafka.

First, producers. Producers are essentially what get your data into Kafka: a producer sends a record, or event, to a given topic in Kafka, and to a certain partition of that topic. A bit more about topics before we go further into producers: topics are an ordered sequence of records, and they are immutable. This is important because it gives us an immutable stream history that we can replay afterwards, and we don't want any changes to occur to that stream. Producers append records to topics, and each record gets a monotonically increasing number called an offset: offsets start at 0, increment by 1 each time, and can't be changed after the record has been produced.

So how do you actually start a producer? I won't demo this, as Kate has already shown you that Kafka includes a console producer script you can use to quickly start a producer and send messages. But if you want to do something more in-depth, the producer has a Java API. Here we create a Kafka producer: we instantiate a simple Properties object and add in all the properties we're going to use. The most important one is the bootstrap servers config, which tells the producer where Kafka actually is; the other properties range from security settings to how we're going to serialise our data. Then we instantiate a new instance of the Kafka producer with these properties. There are a couple of other methods we want: a produce method that takes a message and attempts to send it, so we call the producer's send with the record and, because it returns a Future, we call get to wait for the response to come back and then return it. Similarly, a simple shutdown method closes the producer and makes sure it has flushed all of its messages.

It's not just a Java API. librdkafka is the underlying C library for the Kafka protocol, and a lot of the other language clients are built from it: Go, Node.js, and Python are some of the most popular. They're not quite as developed as the Java API, because the Java API has the admin client built in, which allows you to do all the admin operations for your cluster. These aren't the only ones; there are about 10 or 15 of them, so chances are whatever language you're developing in has a Kafka library available. Another thing to mention for producers: Spring has a Spring Boot starter for Kafka producers, so you can easily integrate this into your existing Spring applications.
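A minimal sketch along the lines of the Java API example on the slides; the topic name, key, and value are from the earlier demo, and the string serialisers and localhost address are assumptions:

```java
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class SimpleProducer {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        // Where Kafka actually is -- the key required setting.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // How to serialise keys and values before sending.
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() returns a Future; calling get() waits for the acknowledgement.
            RecordMetadata metadata =
                producer.send(new ProducerRecord<>("test", "my-key", "hello Devoxx")).get();
            System.out.printf("Sent to partition %d at offset %d%n",
                              metadata.partition(), metadata.offset());
        } // close() flushes any buffered records before shutting down
    }
}
```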
Back to producers. Every record that we produce has a key and a value. The value is the important part: it's what you actually want to send to Kafka, which could be sensor data, a simple message, a log line, or a transaction. The key is optional; you always have to have a value, but you can choose whether or not to have a key. If you don't pick a key, Kafka just appends the record to a partition using whatever scheme it's configured to use, and the default is round robin: produce records to the Kafka cluster and they'll be split between the partitions in a round-robin fashion. However, you can be more specific and specify a key. What the key allows you to do is send records to a specific partition: a hash of the key is computed, and that hash determines which partition the record lands on, so records with the same key always end up on the same partition. In the example with a value and no key, the records are appended round robin and there's an equal number of records on each partition; if we specify a key, we can see that all of the messages we've just produced land on the specific partition 0.

While we're talking about partitions, a few more in-depth concepts. You might be wondering how long records persist in a topic. For a given partition we have a retention period, which is how long records stay in that partition before they're cleaned up, and we can specify this in either time or space. If you don't care about your records after a day or so, you can set log.retention.minutes to a day's worth of minutes; or if you have a certain amount of storage, you can set log.retention.bytes so that records expire once that size has been reached. It's a good way to bound the size of your partitions if you don't care about some of the data.

There's also the concept of compaction. Here we've got a partition with records that each have keys and values; some of the keys are the same but with different values. Compaction lets the data store evolve, keeping only certain records and disposing of the others: after compacting the topic we only have three records, all with unique keys, because compaction keeps the latest value for a given key. So with key A, value a1 and key A, value a2, only a2 is kept. This is useful if, say, you've got user information and you only care about the latest action a user has taken; you can get rid of all the previous records for that key.

Kate mentioned that partitions are balanced across the brokers, which ensures that when you produce to Kafka the load is spread between them. However, sometimes your partitions can become unbalanced. Imagine you've got three brokers with a partition on each, then one of your brokers goes down, and while it's down you create a new topic with three partitions. You'd end up with two partitions on one broker and one on the other, and when your downed broker comes back up it has no partitions for that topic. Rebalancing your partitions using the script built into Kafka lets you spread the partitions evenly across the brokers again.
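For the retention and compaction settings mentioned above, here is a hedged sketch of how they can be set per topic; the topic names are examples, the values are illustrative, and depending on your Kafka version kafka-configs.sh connects via --bootstrap-server or via --zookeeper (the broker-wide equivalents are log.retention.minutes and log.retention.bytes):

```sh
# Topic-level retention: keep records for 24 hours or until the partition
# reaches ~1 GiB, whichever is hit first.
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name sensor-readings --alter \
  --add-config retention.ms=86400000,retention.bytes=1073741824

# Compaction: keep only the latest value per key instead of expiring by time or size.
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name user-profiles --alter \
  --add-config cleanup.policy=compact
```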
Back to producers: what sort of configuration can we provide? There's an array of options, but I'll go into some of the more important ones.

The most important one is the acknowledgement level: what level of acknowledgement does the producer expect to receive from Kafka before considering a record to have been committed? In other words, how does it know the record has actually made it into the Kafka topic? You can choose 0, which means the producer won't wait for any acknowledgement back from Kafka: you send a record, Kafka may or may not have committed it, but you move on to the next one and keep producing. It's risky, because you don't know the record actually made it into Kafka, but the flip side is that it's very fast, as you don't wait for any response. The middle ground is 1: you wait for the leader of the given partition to acknowledge that it's received the record; I think this is also the default. At the other end of the spectrum is all: as Kate described with in-sync replicas, you wait until all of the in-sync replicas have acknowledged that they received the record. The record goes to the leader, the leader acknowledges it, but we also have to wait for the replicas to acknowledge as well, so the reply takes longer to come back, but we can be sure the record actually made it into Kafka and has been replicated across the brokers. You choose this setting if you really care about the data being persisted.

Similarly, there's an option for retries. Your producer could not retry at all, which might be suitable if, say, you're using an IoT sensor that produces an event every second and you don't really care if one second of data gets lost; you might do this just to reduce the load on your system. However, if you really care about a record getting into Kafka, you can choose a given number of retries: set it very high if you want to ensure records get in, or just one or two to be safe. We can also use idempotence. This works by the producer sending a producer ID and a monotonically increasing sequence number with each record, starting from zero, and the broker acknowledging with the combination of producer ID and sequence number, so the broker knows which acknowledgement is for which message and won't write the same message twice.

More config: we can batch records together. The producer can send each record as soon as it's ready, or it can wait until it has a few records together, batch them, and send them all at once. This is useful for reducing network traffic: one set of packets sent across the network rather than a steady stream. And we've got compression, to reduce the amount of data being sent across the network; there are many different compression types available, gzip being one example.
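Pulling those settings together, a hedged sketch of a properties helper (values are illustrative choices, not recommendations; serialisers are omitted, see the earlier producer sketch):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ProducerTuning {
    // Producer properties tuned for reliability plus batching and compression.
    static Properties reliableProducerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.ACKS_CONFIG, "all");               // wait for all in-sync replicas
        props.put(ProducerConfig.RETRIES_CONFIG, 5);                // retry transient send failures
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);  // broker de-duplicates retried sends
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);             // wait up to 10 ms to fill a batch
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32 * 1024);     // batch size in bytes
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip");  // compress the batches
        return props;
    }
}
```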
So that's how you get data into your system using a Kafka producer; how about reading data from it? This is where consumers come in. Consumers read records from a given topic, and you'll see in a minute that they can work in parallel with each other, so consumption is very scalable. A consumer consumes from a given offset: typically a consumer would start at 0 and read up through 1, 2, 3, 4, 5, and so on, but it can choose to start reading wherever it likes, for example from offset 2 or from offset 5. The key thing to realise is that "consume" doesn't mean the data has been taken off the topic: the data remains on the topic, and the consumer reads it and makes its own copy. That enables multiple consumers to read from the same topic at the same time, and it means your immutable log is preserved. It's not a transaction, it's publish-subscribe: you view the data and it stays where it is.

Similarly to the producer, there's a Java API. I won't go through it because it's basically the same: we have the Properties object and then we instantiate the consumer, and the methods are very similar to the producer's, but there's an extra one to set or generate a consumer group ID. This is to do with consumer groups, which I'll go into in detail in a minute, but basically consumers can work together: if they have the same group ID they're in a group and they share the load of consuming from the topic. The same languages are available for consumers, and there's a Spring Boot template for consumers as well.

So, consumer groups. Consumer groups are a way for consumers to work together to consume in parallel from a topic. In a consumer group you have multiple consumers, and each one consumes from a specific partition of the topic: one consumer consuming from the first partition, another from the second, and so on. I'll show you an example so it makes sense. In consumer group A we've got three consumers, all reading from different partitions and at different offsets, which allows them to consume in parallel; with just one consumer it would have to read from the first partition, then the second, then the third, so the group is much faster. Consumer group B, however, only has two consumers, so when we split the load between them, one of the consumers ends up with two partitions. That isn't ideal: it won't consume as fast, and the second consumer is a larger point of failure. Ideally you want your consumer group to contain as many consumers as there are partitions on the topic you're consuming from. We can add another consumer to this group, but right now it won't do anything because there's no extra partition for it to consume from; it's still useful, though, because if one of the other consumers were to go down, this new consumer could pick up where it left off.

How do the consumers actually coordinate and know where each one has read up to, and where a new consumer needs to read from? Here we've got three brokers, and the way this works is that there's an internal topic called __consumer_offsets (all of Kafka's internal topics start with a double underscore, so you don't want to create topics with names like that). This topic has a leader, say broker 0, and the leader of the __consumer_offsets topic is said to be the group coordinator. The consumers commit their offsets to this broker; committing an offset means "here is where I've read up to". So if the first consumer has read up to offset 7, it commits that to the group coordinator, and if that consumer were to die and a new one come up, it can pick up where the other left off. The consumers consume and then commit their offsets back to broker 0, and that's what allows us to add new consumers as existing ones fail.
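Since the committed offsets live in that internal topic, the easiest way to inspect them is to describe the group; a sketch, with "my-consumer-group" as an assumed group name:

```sh
# Per partition, this shows CURRENT-OFFSET (last committed), LOG-END-OFFSET, LAG,
# and which consumer in the group currently owns that partition.
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-consumer-group
```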
Now let's talk about some config for consumers. First, a question from the audience: why do you actually need consumer groups? If you have a single consumer consuming from a topic with multiple partitions, it can only read from one partition at a time, so it has to read from the first one, then the second, then the third, which takes time; and if that consumer dies, you're not consuming any records any more. With a consumer group of multiple consumers you split the partitions between the consumers, each consumes its own partition, and between them every partition gets consumed. So it's scalability, because you consume in parallel, and it's fault tolerance, because if one dies you can start up another consumer.

Committing offsets is the consumer saying "I've read up to this point", and it can be automatic: by default, after so many records (I'm not sure what the default value is), the consumer reports how far it has read. The problem is that commits might run ahead of processing. When a consumer consumes a record it typically performs some operation on it, which we call the processing, but if that takes a certain amount of time, the automatic commit might happen before you've actually processed the record. Say I consume a message and I'm processing it, but I've already told Kafka I've read it, and then my consumer dies: when the new consumer comes back up it starts after that point, and the message never actually gets processed. You can instead commit manually and asynchronously: the consumer consumes a record, starts processing it, and asynchronously tells Kafka that it has consumed it. This is still risky, because you could still fail to process the message after you've told Kafka you consumed it. To be safe you can commit synchronously: consume the message, process it, and only then tell Kafka you've finished with it. That's obviously safe, but it's much slower, because you have to wait until you've processed the message before you send Kafka the acknowledgement. Typically you'll see offsets committed on a timer, say every 30 seconds.

Any more questions about consumers? The question was how we scale if a lot of records are coming into a partition but we've only got one consumer on it. You can't have two consumers in the same group reading from the same partition, so in that case you just have to allocate more resources to your consuming app. You could have multiple consumers that aren't in the same group, but then you risk consuming the same message multiple times. And on the follow-up about synchronous committing: if your processing takes ten seconds and you want to speed things up, you'd consider using manual asynchronous commits instead; it's a balance between what you actually want from your application.
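A minimal sketch of a consumer that uses the safer manual, synchronous commit described above; the group ID, topic name, and deserialisers are assumptions:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group"); // consumers sharing this id share the partitions
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");   // we commit manually, after processing
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("test"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // "Processing" happens here; only after it succeeds do we commit.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                                      record.partition(), record.offset(), record.value());
                }
                consumer.commitSync(); // safer but slower than automatic or async commits
            }
        }
    }
}
```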
The next section I'm going to talk about is the Kafka features which make it useful in production. We've talked about the basics, consumers and producers; what are the features that make Kafka so powerful?

Kafka has built-in monitoring. The Kafka code produces metrics internally, tracking certain operations performed in the code, and exposes them via JMX. Java Management Extensions allows us to hook external applications into the metrics Kafka is producing: JMX runs at the JVM level, and you can hook into it with something like JConsole, or use more advanced external monitoring tools, for example Grafana, to plot some of these metrics. What sort of metrics does Kafka actually give you? Some of the basics: bytes in, so are our producers getting messages into Kafka, and is there too much load; bytes out, so how much are our consumers pulling from the cluster, which tells us whether the consumers are working properly; under-replicated partitions, an important one, because if a given partition is under-replicated the replicas aren't keeping up with the leader, which might indicate too much load coming into your cluster for your Kafka to handle; and network metrics, if you want to measure how much traffic is going through the network. There's a variety of metrics Kafka produces that you can hook into and analyse for your production systems.

Another brilliant feature of Kafka is that it has security built in; you may or may not choose to use it in production, but it's useful nonetheless. There are two types of communication to secure. There's what we call internal communication, between the Kafka brokers, for example replication and setting up partitions; by default the brokers don't encrypt this, but you can. Then there's the traffic going into Kafka from producers, consumers, and also ZooKeeper, which is external communication. These two things can be configured separately in the Kafka config.

Internal SSL first. Before I start I'll just say that SSL has been superseded by TLS, but in the Kafka world the settings are still referred to as SSL, which might seem a bit confusing, but I'll refer to it as SSL. This allows you to encrypt communication between your Kafka brokers. You set it up by specifying the security.inter.broker.protocol setting and setting it to SSL. You'll have to provide certificates for your Kafka brokers, and also a truststore for each broker; the reason is that each truststore needs the certificates of the other brokers, so the brokers all trust each other, and once you've set that up you're good to go. You can also optionally use a certificate authority. The advantage is that if each broker trusts the certificate authority, then every broker with a certificate signed by that authority will be trusted, so if one of your brokers goes down and needs a new certificate, as long as it's signed by the certificate authority the other brokers will trust it.
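A minimal sketch of what that looks like in a broker's server.properties; the hostname, paths, and passwords are placeholders:

```properties
# Encrypt broker-to-broker traffic with SSL/TLS.
listeners=SSL://broker1.example.com:9093
security.inter.broker.protocol=SSL
ssl.keystore.location=/var/private/ssl/broker1.keystore.jks
ssl.keystore.password=changeit
ssl.key.password=changeit
# The truststore must contain the other brokers' certificates,
# or the certificate authority that signed them.
ssl.truststore.location=/var/private/ssl/broker1.truststore.jks
ssl.truststore.password=changeit
```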
Similarly, we have external SSL, which requires all of your clients to authenticate using TLS. This just requires a broker setting to require it, and then every client that connects must provide the corresponding properties: it must say it's using SSL and provide a truststore and password, otherwise Kafka will reject the traffic. This is obviously useful: you don't want just anyone producing into your production Kafka cluster, and you don't want just anyone consuming your data from it, so security lets you lock it down, and it's built into Kafka.

Another thing built into Kafka is authorization. In Kafka you can enable simple authorization, which by default is done with access control lists. It's quite simple: you have a list of given users and what actions they can perform. Say we've got a user, Bob, who can write to topics; that gets written to the access control list, and you can write to it using scripts provided with Kafka. Additionally, Kafka has various authorizer classes: if you want, you can implement or extend these classes and provide them as a jar on the Kafka classpath, so you could have whatever authorization method you want underneath the covers rather than the default, because the default Kafka authorization is pretty simple.

OK, so that's securing your Kafka cluster; now let's talk about debugging it. The good news is that Kafka provides very good logging. I won't go through any logs because they're quite long, but I'll talk a bit about the internal workings. Kafka, being a Java project, uses log4j, which lets you configure the logging for your application without having to change the code. You do this by modifying a properties file, in this case log4j.properties, and the same goes for ZooKeeper, because that's also a Java project. This lets you change the log level of the various packages within Kafka to tailor what you actually see in the logs. In the log4j properties file we've got appenders and layouts, which basically say what we want the logs to look like, but more importantly we've got loggers, which let you specify a package and say what level of logging you want for it. On the right are all the levels of logging, from OFF through WARN up to TRACE. Obviously you don't want everything at TRACE or your logs will be hundreds of gigabytes long; you want to tailor it so only the packages you're interested in are at TRACE. I've left off the appenders here, but this is a sample Kafka log4j properties file with various packages at different levels of logging, and you can tailor it to your own needs.

Some useful debugging commands. One thing you want to check is that records are actually getting into your topics, and Kate has already shown you that Kafka comes with a console consumer, so you can use that as a quick and easy way to check your records are actually on the topic. You can also check what consumers are connected: run the consumer groups script and it will tell you which consumer groups are currently reading from your cluster, which is useful for checking whether your clients are connected or hanging. Similarly for producers, I mentioned the monitoring Kafka has built in with JMX: we can hook into that and see how many bytes are going into the cluster, which tells us whether the producers are actually running.
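A sketch of those two quick checks from the command line, using the demo's defaults:

```sh
# Check that records are actually landing on a topic.
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic test --from-beginning

# Check which consumer groups are currently connected to the cluster.
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
```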
So what should you look out for when you're running Kafka in production? Kate mentioned min.insync.replicas: if your replicas aren't in sync with the leader, your data isn't being spread between the brokers, so if one of the brokers goes down you're going to be in trouble and your records might be lost. An under-replicated partition means the replicas for that partition aren't meeting the minimum in-sync replicas, which is bad because, as Kate said, data can't be produced to that topic any more, so you end up in a bit of a deadlock. And offline partitions don't have any leader at all, so you can't produce to those either.

What warnings does Kafka give you? You'll often see partitions going in and out of the fully-replicated state in the logs. This is normal, but if you see it a lot it might be something to worry about: it means data is going into your Kafka leader too fast for replication to keep up. It might just be a sudden spike in traffic, and replication might catch up after a while; what you need to watch out for is when it doesn't catch up, because then you may need to give your Kafka more resources or brokers. Brokers can also restart, and garbage collection can run, both of which show up in the logging. You can also run the kafka-topics.sh command that comes with Kafka to check which partitions are under-replicated.

Kafka produces metrics you can monitor and set up alerts on. What might you want to alert on? As I mentioned, one of the most important things is in-sync replicas: if our partitions aren't keeping up with replication, we're not going to be able to produce to our topics, and you might see something like a "not enough replicas" exception. That could indicate failed brokers, not enough network between your brokers, or not enough resources for them; basically it's not a good state to be in, and you need to do something about it.

So what can you do about it? Ensure leaders are preferred: Kate mentioned there's a preferred leader for each partition, which ensures that leaders are spread across the brokers in your cluster. You don't want the leader of every partition sitting on broker 0, because if broker 0 goes down you'll have that many leader elections at once, which slows things down a lot. You can run a preferred leader election, which triggers elections in Kafka so that leadership switches back to whatever the preferred leader is. You also want to make sure your partitions are evenly distributed across the brokers in your cluster, otherwise most of the partitions for a given topic might sit on a single broker, and if that broker goes down you can lose all of them; you can reassign partitions using the script built into Kafka to spread them evenly over the brokers. And you can add partitions and consumers: if a topic isn't keeping up with the load you're putting into it, you can increase the number of partitions and therefore get more consumers.

What about ZooKeeper? ZooKeeper has these things called four-letter words, which let you query its state: you echo the four-letter word and netcat it to your ZooKeeper. For example, echoing srvr gives me a full summary of the status of the ZooKeeper server, similar to what Kate showed you before by browsing the ZooKeeper tree. There are other commands too: ruok is the basic "are you running OK, or in an error state", a quick and easy health check; envi tells you what environment it's running in; stat is a short summary of the srvr command; and conf shows what config my ZooKeeper was started with.
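A sketch of those operational checks; exact flags and script names vary by Kafka version (older tools talked to ZooKeeper directly, and older releases ship kafka-preferred-replica-election.sh rather than kafka-leader-election.sh), and the ports are the defaults:

```sh
# List partitions whose replicas are not all in sync.
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions

# Trigger a preferred leader election so leadership moves back to the preferred replicas.
bin/kafka-leader-election.sh --bootstrap-server localhost:9092 \
  --election-type PREFERRED --all-topic-partitions

# ZooKeeper four-letter words: echo the word and netcat it to the ZooKeeper port.
echo ruok | nc localhost 2181   # replies "imok" if ZooKeeper is running OK
echo srvr | nc localhost 2181   # summary of the server's status
```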
These are great tools for checking the state of your ZooKeeper, but you can also, as Kate showed you, navigate the ZooKeeper tree using zkCli.sh, which is built into ZooKeeper. That lets you check what ZooKeeper thinks the internal state of your Kafka is: for example, the advertised listeners ZooKeeper has for my broker 1, and you want to make sure that lines up with what Kafka actually has; if there's a mismatch between the two, you might need to take some action.

What if you find a bug in Kafka, or you want to contribute? If you want to raise an issue with the Kafka community you can raise a JIRA, which is an issue management platform: when you've found some sort of bug you raise a JIRA, and someone in the Kafka community will eventually look at it and perhaps fix it. Kate mentioned Kafka Improvement Proposals: if you have a feature you want to suggest to the Kafka community, you can raise a KIP. KIPs are for the bigger changes that will actually change how Kafka functions, and every so often the Kafka community meets up and votes on which KIPs go into Kafka; some of them don't make it in, so it's quite tightly regulated. If it's not such a big change, you can just make a pull request against Kafka. Kafka has people called committers, who have contributed to the Kafka source code over a period of time and are trusted to review PRs and merge them into the source code, so you'd need a committer to look at your PR and agree it's all right to go in. Right, that's everything for running Kafka in production, so we're now going to take a 30-minute break; if you have any questions you can always come and ask during the break, and we'll see you back here at 11:00.

OK, let's get started again. Before the break we talked through the Kafka cluster itself, different ways to configure it, and how to run producers and consumers, and as you may have noticed there are quite a lot of configuration options, so running your own Kafka cluster does come with overhead, both in terms of learning and in terms of day-to-day management. There are plenty of ways to get somebody else to run it for you. At IBM we have a fully supported Kafka, both in IBM Cloud as a managed service and as something you can run yourself with support and some additional value-add capabilities, and there are also plenty of options in the community: for example Strimzi, which is a Kubernetes operator for Kafka. If you want to know more about Kafka on Kubernetes, again you can come to my talk tomorrow. What we wanted to do in the second half of this talk is look at how to get data into and out of Kafka through other systems.
Because you can get other people to run Kafka for you, a lot of your time might actually be spent working out what data you're going to flow into Kafka from other systems, and then taking advantage of the stream processing Kafka has in Kafka Streams. So Andrew is going to start with Kafka Connect, and then I'll finish off talking about Kafka Streams, hopefully with a demo of Streams as well.

All right, let's start with Kafka Connect. We've talked about producers, consumers, and the Kafka cluster itself, but now we're focusing on the external part: how do we actually link Kafka to our existing external systems and integrate Kafka into our architectures? This is where Kafka Connect comes in. Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems; essentially, it's a tool you can use to get data into Kafka from your external systems and to take data from Kafka into your other external systems.

Why would you want to use it? Kafka Connect is focused on streaming data to and from Kafka, you can pick it up and use it straight away, and there are many open-source connectors already, so it's an easy way to start integrating Kafka into your existing architectures. It also offers guarantees that are difficult to achieve with other frameworks: Kafka Connect runs on top of Kafka and is built by the same community that built Kafka, so it works very closely with Kafka under the covers. When you run it, it will create topics for you, interact with Kafka, and generally ensure best practices are followed, so you don't have to worry about a lot of things you would have to worry about with some other framework. It also has built-in fault tolerance and scalability, because it uses Kafka's own features under the covers to make connectors fault tolerant and scalable.

This is the typical anatomy of a system using connectors. We have an external system at the top and what we call a source connector: the source connector links Kafka to an external system that data is coming from, the source, and brings it into Kafka. Conversely, we have the sink connector, which takes data from your Kafka cluster and sends it to some external system. People often mix up source and sink, so try to remember which way round they go. As I mentioned, Kafka Connect is open source and there are over 80 open-source connectors available for you to download and use as you want; some of the key ones are Elasticsearch, JDBC, and HDFS, but there are many more, and you can just download them off the internet and run them. It's unlikely you'll actually have to write your own connector, although if you have a special use case you might consider it; most of you will probably just download a connector and plug it in.

Let's move on to a couple of example use cases, starting with a source connector. Say I've got some sort of database that most of my applications write to, and I'm using it as the backbone of my application, but I want to be able to stream this data so clients can read it as a stream. In comes Kafka Connect: I can run a source connector that takes data from my database and streams it into Kafka, so the clients can scalably consume the data.
And what about the sink? I've got data in my Kafka cluster, but I want to take it somewhere else, for example Elasticsearch: maybe I want to search this data, which Kafka doesn't easily let you do, and that's exactly what Elasticsearch is for. So we use a sink connector to take data from the Kafka cluster and stream it into Elasticsearch, where we can run our searches. As you can see, it's very powerful for linking Kafka to your existing architectures.

How do you use it? When you download Kafka, you'll see in the libs directory a load of Connect-specific jars, which let you run Kafka Connect out of the box, and a couple of scripts in bin which start the different types of Connect worker. When we start a connector, taking the connect-standalone script as an example, we provide a worker config, which configures the Connect worker that everything runs inside, and the actual connector config; I'll explain both in detail in a minute. So we start our Connect worker and then start our connector in it: everything runs on the worker and all the processes within it are managed by the worker. When the connector finds data that needs moving from the source or to the sink, it starts what we call tasks. Tasks are the workhorses of the system; the connector spawns one or many of them to actually move the data. When you configure your connector you specify a maximum number of tasks, and you can increase that number to allow more parallelism.

That's standalone mode: one worker, with everything running inside it. It's easy to get started with, and if you're just giving Kafka Connect a go you could start a standalone worker and run a connector in it, but for production it doesn't really work, because you've got a single point of failure and you can't scale. This is where distributed mode comes in. There's another script in the Kafka bin directory, connect-distributed, and for this one we only provide the worker config — you'll see why in a minute. The key point is that we start multiple workers, say three distributed workers. Now when I start a connector it gets allocated to a worker; I can start two connectors, which spawn their own tasks, and Kafka Connect ensures the load is balanced across the workers, dynamically, under the covers, on its own. If one of the workers dies, the tasks that were running on it have nowhere to go, so Kafka Connect reallocates them to the other workers, and when the failed worker comes back the tasks rebalance across all of them again. Running in distributed mode therefore gives you inherent fault tolerance.

So you've got standalone mode and distributed mode, and distributed mode is the one you really want in production. You'll have noticed, though, that I started the distributed workers with only a worker config and no connector config.
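For reference, this is roughly how the two modes are started using the scripts that ship in the Kafka bin directory; the config file names here are the sample files included in the Kafka download.

    # Standalone mode: one worker, given a worker config plus one or more connector configs
    bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties

    # Distributed mode: start several workers with only a worker config;
    # connectors are then submitted over the REST API
    bin/connect-distributed.sh config/connect-distributed.properties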
That's because in distributed mode you provide the connector config through a REST API, which I'll talk about more in a minute.

Let's talk a bit about configuring connectors. There are a few important properties in the worker config: where my Kafka cluster is (the bootstrap servers), where the REST API should listen, and where offsets should be stored. As I mentioned, that state is stored in Kafka, on topics specific to Kafka Connect, so all the Connect workers can keep track of what work they've done and where they've got to. If you're using distributed mode you also have to set a group ID; similar to consumer groups, where consumers with the same group ID act as one group, distributed workers with the same group ID work together to perform the overall operation the connector is trying to do. So the worker config holds the overall Kafka-level properties, whereas the connector config is specific to the particular connector you're running. connector.class says exactly which connector you want to run; FileStreamSource, for example, is a connector that takes data from a file into Kafka, and it's one of only two connectors built into Kafka by default — you get the file stream source and the file stream sink out of the box, and everything else you download yourself. tasks.max sets the maximum number of tasks, as I mentioned before, and the rest of the config is connector-specific: for the file stream connector I say which file I want to stream into Kafka, and for a sink connector you specify which topics to read, because the sink connector has to consume from Kafka to know what to write out.

I mentioned the REST API. Kafka Connect starts a REST API when it comes up, on whatever port you choose, and you can call all sorts of operations on it, from listing which connectors exist to starting a connector; it's all documented in the Apache Kafka docs. When I started my distributed worker earlier I only supplied a worker config, because in distributed mode you start connectors through the Connect REST API. As a couple of examples: I can do a simple curl to the connector-plugins endpoint, which tells me which connector plugins are available on that worker — as I said, the ones that come with Kafka by default are the file stream sink and the file stream source. I can also POST to start a connector: I give the name of the connector I want to create, and in the config I set the connector class, which says which connector to run, plus the file and the topic, which are specific to that connector. Now I've started a file stream source connector, and if I curl the connectors endpoint, which lists the connectors that actually exist, I can see that my connector has started.
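Those calls look roughly like this; the connector name and file path are just examples, and 8083 is the default Connect REST port.

    # list the connector plugins available on this worker
    curl http://localhost:8083/connector-plugins

    # create a file stream source connector
    curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors \
      -d '{ "name": "local-file-source",
            "config": { "connector.class": "FileStreamSource",
                        "tasks.max": "1",
                        "file": "/tmp/test.txt",
                        "topic": "connect-test" } }'

    # list the connectors that are now running
    curl http://localhost:8083/connectors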
Some key considerations for connectors. First, where should you run your connector? You don't want to run it on the same server as your Kafka, because if that server goes down you've now got no Kafka and no connector. Instead, run it somewhere else, and a good place is a Docker container: because Connect exposes a REST API, you can interact with it through that port. In this example I'm simply building an image from the Kafka image, adding my config and starting it with the scripts, and the fact that we build from Kafka shows that Kafka Connect really does run on exactly the same classes as Kafka itself.

Another thing to consider is partitions. Say I've got a file whose contents are numbered one to six, in order, and I use the file stream source connector to stream it into a topic with two partitions. By default, in standalone mode, the records end up with the odd-numbered lines in the first partition and the even-numbered lines in the second. If I then use the corresponding file stream sink connector, also in standalone mode, the data comes out jumbled, no longer in its original order. This just illustrates that before you run Kafka Connect in earnest you need to understand exactly what's happening with your partitions, and whether you're using standalone or distributed mode — it's worth thinking it through before diving straight in.

Another consideration is data formats. You obviously won't have the same data format across all of your systems: you might have a database with a format that Kafka can't represent directly, so you use converters. There's the external system's format, the Kafka record format, and Connect's internal format, which is basically Java objects, and your connector might need to specify how to get data from the external format into that internal format. This isn't always necessary, but sometimes — for example with the MQ connector — we have to provide specific record builders that tell Kafka Connect how to transform records from one format to another. On the other end, a converter is required to get data from the internal format into how you want it to appear on your Kafka topic; typically you'd use byte, string or JSON converters.

Where can you actually get these connectors? We have the Event Streams connector catalog; the connectors in it aren't specific to Event Streams, you can run them on your Kafka anywhere, so you can download connectors from there. I mentioned that you probably won't need to write your own connector, because there are so many open source ones available, but if you do, Kafka Connect provides an open source Java API for implementing connectors: there are source and sink connector classes, and to create your own connector all you have to do is extend those classes and override certain methods. You do that with a simple class, bundle it up into a jar, and then configure your worker to pick that jar up.
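As a very rough sketch of what that looks like — the class names and topic are made up, and a real connector would need proper config validation and offset tracking — a minimal source connector is essentially two classes:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Map;

    import org.apache.kafka.common.config.ConfigDef;
    import org.apache.kafka.connect.connector.Task;
    import org.apache.kafka.connect.data.Schema;
    import org.apache.kafka.connect.source.SourceConnector;
    import org.apache.kafka.connect.source.SourceRecord;
    import org.apache.kafka.connect.source.SourceTask;

    public class MySourceConnector extends SourceConnector {
        private Map<String, String> config;

        @Override public void start(Map<String, String> props) { this.config = props; }
        @Override public Class<? extends Task> taskClass() { return MySourceTask.class; }
        @Override public List<Map<String, String>> taskConfigs(int maxTasks) {
            // every task gets the same config in this sketch
            return Collections.nCopies(maxTasks, config);
        }
        @Override public void stop() { }
        @Override public ConfigDef config() { return new ConfigDef(); }
        @Override public String version() { return "0.0.1"; }

        public static class MySourceTask extends SourceTask {
            @Override public void start(Map<String, String> props) { }
            @Override public List<SourceRecord> poll() throws InterruptedException {
                // a real task would read from the external system here;
                // this sketch just emits one hard-coded record per second
                List<SourceRecord> records = new ArrayList<>();
                records.add(new SourceRecord(
                        Collections.singletonMap("source", "demo"),   // source partition
                        Collections.singletonMap("position", 0L),     // source offset
                        "my-topic", Schema.STRING_SCHEMA, "hello from my connector"));
                Thread.sleep(1000);
                return records;
            }
            @Override public void stop() { }
            @Override public String version() { return "0.0.1"; }
        }
    }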
You then make that jar available to the worker so it's on the worker's classpath (or plugin path), and when a connector runs on that worker it can make use of it: you've got a connector plugin, and the actual connector running on the worker. So as you can see, Kafka Connect provides an easy way to get Kafka connected to all the pieces of your architecture, and it also lets you write your own connectors if you have a very specific use case.

The other thing I want to talk about is change data capture. Change data capture identifies and captures the changes made to a data store, and we can expose those changes as a stream of Kafka events. Consider a scenario where you have a master database and you need to get data out of it using an application, because the data needs to go to several places. You might have an audit log, so you're replicating data from the master database to the audit log with some application. You might also want the data in a recovery database in case your database goes down, so you replicate it across with the same application. You might want a query cache so you can query the data more quickly, and again you move the data over with an application. The problem is that if that application goes down, everything stops: it's a single point of failure, and it reads directly from your master database, which can slow the database down. Here you can use Kafka to decouple the two sides of the system: whenever a change is made in the master database, it's replicated into Kafka using change data capture via Kafka Connect, and all the applications on the other side just read from Kafka, which we know allows for scalable consumption.

There are different approaches for generating the change events that go into Kafka: the data store itself can drive the changes, you can repeatedly query the database with optimisations or restrictions, or, as most of these tools work, you can scan the database's logs to find the changes and make Kafka aware of them. Why use Kafka with CDC? As you've just seen, Kafka has lots of connectors to other systems, over 80 of them, so there's probably one for whatever you want to connect. Kafka acts as a buffer, so rather than everything reading directly from the database, everything interacts with Kafka, scalably. It's publish/subscribe rather than point-to-point, so multiple clients can consume in parallel, and it makes it easy for clients to process the records as a stream of events.
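To make that concrete, log-scanning CDC connectors such as the open source Debezium project are submitted to Connect like any other connector. This is an illustrative sketch only — the hostname, credentials and table name are invented, and exact property names vary between Debezium versions — but it shows the general shape of a CDC source connector config:

    curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors \
      -d '{ "name": "inventory-cdc",
            "config": { "connector.class": "io.debezium.connector.mysql.MySqlConnector",
                        "database.hostname": "mysql.internal",
                        "database.port": "3306",
                        "database.user": "cdc-user",
                        "database.password": "cdc-password",
                        "database.server.id": "184054",
                        "database.server.name": "inventory",
                        "table.whitelist": "inventory.orders",
                        "database.history.kafka.bootstrap.servers": "localhost:9092",
                        "database.history.kafka.topic": "schema-changes.inventory" } }'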
All right, that's it from me on connectors; now Kate is going to talk to you about Kafka Streams.

OK, so we've talked about the Kafka cluster itself and about producers and consumers — and of course when Kafka started out that was all we had, simple producers and consumers — and then we've got all of these connectors you can run; as Andrew said, you can go to a connector catalog, download some and have a go. Now I'm going to talk about stream processing and how you can do it with Kafka Streams. I'll start by defining what I think an event stream is, because if you look at Kafka as a distributed streaming platform you might wonder why you'd want Kafka Streams on top of all those producers and consumers, and hopefully by the end of this you'll see why: Kafka Streams has things built on top that you wouldn't necessarily want to build yourself with basic producers and consumers. It's very similar to Connect in that respect. In the same way that you could probably write your own code to pull data from a database into Kafka, or vice versa, using a plain producer and consumer, Kafka Connect builds on top of Kafka and uses the fault tolerance that's built in to give you a better experience — the Kafka community has used its experience to build something that gets you started really quickly. Kafka Streams is a little different in that, unlike Connect, where you run a shell script and a REST command and you're done, with Kafka Streams you do have to write a bit more code, but it's still making use of that same underlying platform.

So when I talk about an event stream — and this is what you'd use Kafka Streams for — I mean an abstraction representing an infinite and ever-growing data set, and if you think about it, that describes a lot of the data in the world today. Kafka was originally created at LinkedIn before they open sourced it, and they have an infinite, ever-growing data set: every click on the LinkedIn website is another event, and Twitter is a good example as well. That's what we're calling an event stream. You can flow data through Kafka that isn't an event stream, but if you are flowing an event stream then Kafka Streams is something to look at. Characteristically it's ordered, it's immutable records — which obviously fits Kafka — and, if possible, it's replayable, which means you don't have to process events exactly as they arrive: you can wait and process things that happened last month. From a processing point of view that's really useful, because if I want to do new processing on data going forward I can potentially train or test it on old data, so particularly for machine learning, having replayable data to use as training data is a really good idea. Stream processing, then, is the ongoing processing of one or more events in that event stream.

So what is Kafka Streams specifically? It provides a processor API, which is quite low level, and a streams DSL built on top, and I'll explain what that involves in a minute. The key thing is that Kafka Streams is Java based. With the connectors it doesn't really matter what technologies you use, because you just run shell scripts and the REST API, but at the moment Kafka Streams is a Java-only library. Whether other languages will be introduced I'm not sure, but within Kafka itself, Java is the only language built into the actual release, so anyone developing an equivalent for a different language has to update it separately every time there's a new Kafka version, whereas when you get a new release of Kafka you automatically get a new release of Kafka Streams, because it comes built in: all of the jar files are in your normal Kafka download. One of the key things that makes Kafka Streams different from other stream processing technologies is that the processing happens in the app. Kafka Streams provides the APIs for you to write a Kafka Streams app, and all of the processing happens inside that app; you don't have a separate processing engine running, and that's one of its advantages.
It also supports per-record processing. If you're comparing technologies, it's worth checking whether they genuinely process record by record or whether they're really doing batch processing, because true stream processing handles every record as it appears, not in batches. I've hinted at this already, but Kafka Streams also stores its state in Kafka, and we'll see more of how it does that in the demo. That allows for stateful processing — Kafka is a good place to store state, so we may as well use it — and because it builds on Kafka and all the nice things we've talked about today, it's scalable and highly available too.

The processor API lets you build pretty much any processing app you want, but it means writing quite a lot of Java code, so the streams DSL is designed to give you abstractions that let you write Kafka Streams apps very quickly with a small amount of code. One example of an abstraction it gives you is the KStream. It's an interface — this is straight from the Javadoc — and a KStream represents an abstraction of a record stream of key-value pairs. If you remember, all records in Kafka are key-value pairs, so this is a stream of them coming in. In your Java code you instantiate a new KStream either directly from one or more Kafka topics — there's a function that says take all of the records from this particular topic and put them in my KStream object — or by performing a transformation on an existing KStream or some other object and having the result be a KStream. KStream is one you'll definitely come across if you use the DSL, because it's how you get data off Kafka initially and put it back on at the end.

The second one I want to talk about is the KTable. We're seeing more and more, particularly with architecture choices like CQRS, that people want to get data out of a database and interact with it in a more event-based way. A KTable is an abstraction of a changelog stream from a primary-keyed table. Think back to the Kafka Connect section and change data capture: projects like Debezium provide CDC connectors for all sorts of databases, and if you run those connectors they create a stream of records where each record represents a change to a table — a create, an update, a delete — and you can represent that inside your Kafka Streams app as a KTable. Again, you can build a KTable from a single Kafka topic, if the topic is already set up as that changelog; it can also be created as the result of a KTable transformation, by calling a function on an existing KTable; or by aggregating a KStream — for example, joins or counts over KStreams can give you a KTable.
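In code, the two abstractions look something like this. This is a minimal sketch that sits inside an app's setup code with default string serdes configured; the topic names are made up.

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;

    StreamsBuilder builder = new StreamsBuilder();

    // a KStream read directly from a topic of key-value records
    KStream<String, String> clicks = builder.stream("page-clicks");

    // a KTable built from a changelog-style topic, e.g. one fed by a CDC connector
    KTable<String, String> customers = builder.table("customer-changelog");

    // a KTable produced by aggregating a KStream
    KTable<String, Long> clicksPerUser = clicks.groupByKey().count();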
What I'm going to do now is show a visual example before we look at more substantial examples with actual code. At a conceptual level, this is the kind of thing you can do once you're running Kafka Streams apps. Say I've got an input topic with keys and values, and for now let's assume there's only one partition, so all the records are on the same partition: a few different keys and all sorts of different values. With just four lines of code I can process this stream and get the result shown at the bottom. You can see the KStream coming in: I do builder.stream from my input topic, then a filter over the keys and values where I only keep records whose key equals "bingo" — that picks out the fourth and fifth records — so at that point I've got an internal KStream that looks like that. Then I map it, keeping the key the same, so I still get "bingo" at the other end, but upper-casing the value, and then I write everything to an output topic, which is how I get those two events out. It reads like quite nice functional-style Java — because Kafka Streams is more recent, it favours that style — and although the transformation is fairly straightforward, because we're using the DSL and the KStream with all the functions that come with it, it's a very small amount of code to do something quick and useful.
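Written out, those four lines look roughly like this. The topic names are placeholders, and since only the value changes I've used mapValues here.

    import java.util.Locale;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Produced;

    StreamsBuilder builder = new StreamsBuilder();
    builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
           .filter((key, value) -> "bingo".equals(key))        // keep only records keyed "bingo"
           .mapValues(value -> value.toUpperCase(Locale.ROOT)) // keep the key, upper-case the value
           .to("output-topic", Produced.with(Serdes.String(), Serdes.String()));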
Now I'm going to look at two specific examples, taken from the Kafka docs: if you look at the Kafka Streams documentation there's a tutorial to help you write an app, with a few different examples, and hopefully this will give you a better idea of how to run them and what's involved in real life. When you first come to write a Kafka Streams app, the first thing to work out is what your topology is going to look like. The first app is a basic pipe: it takes all the data from an input topic and writes it to an output topic. It's not very exciting, but it's a good first step. We have just two topics, the input and the output, the data flows between them, and the app runs in the middle. The nice thing about running on Kafka is that we have partitions, so between those two topics I could run multiple copies of the same Kafka Streams app and they'd take one partition each; between them they'd get all of the events, the events would flow to the output topic with no duplicates, and I'd get better scale. Because Kafka Streams is built into Kafka and makes good use of it, you can already start to imagine how this scales.

The second app is a bit more complicated: a word count. We'll have an input topic with records that each contain one or more words. I'm going to lowercase them, split them into individual words, and then repartition them onto a new topic. The reason we do this — and the reason it's worth thinking about how you build your topology — is that we can make use of how Kafka does things, and of Kafka partitions, to make our Kafka Streams apps run faster. It doesn't make sense for one app to read an event and lowercase it and then hand it elsewhere just to be split; the lowercasing and splitting belong together in one step. But once we have all of these individual words, if more than one Kafka Streams app is running, each one ends up with a mixture of different words. So you get Kafka Streams to repartition: it writes to a new topic, and what you end up with is that all the records for the same word land on the same partition — if I put "hello" in five times, by the time we reach the repartition topic all of the "hello" records are on one partition. Then the Kafka Streams apps doing the counting take a partition each, they cooperate because of consumer groups, and the first one can count all the hellos, the second the worlds, the last one the his. You'll see little database icons in the diagram: wherever I've drawn a blue box, data goes onto a topic partition, and the green boxes are local state, because for performance reasons it doesn't make sense to put everything on a topic — Kafka Streams uses both real topics and local storage inside the app for its state.

The last thing to say before the demo is that this diagram shows the topology you build up as part of your app. From a Kafka Streams perspective you write one app with this topology and then run multiple instances of it, and because Kafka Streams is doing the work under the covers, the instances coordinate to share the data between them. Although I've only drawn multiple boxes at the repartition point, you could have multiple apps reading from the input topic as well, lowercasing and splitting, and multiple apps doing the counting. You don't have to coordinate any of this yourself: you just run multiple Kafka Streams apps with the same application ID and Kafka Streams does it for you.

So let's have a look at some Kafka Streams apps. If you go to the documentation and click on Kafka Streams, there's a demo app, and the tutorial I'm talking about is the "write your own streams application" one, where you get started by generating a project from the Maven archetype, and there are three examples: pipe, line split and word count. I'm going to show pipe and word count, so let's switch to pipe first because it's simpler.
Here's the code for pipe. It's a basic example — the code isn't massive — but there are a few things you have to do in every Kafka Streams app to set it up, and it's written in a very similar way to the producers and consumers we saw earlier. The first thing is to create a properties object, a standard java.util.Properties. In my properties I first set an application ID; this is how I tell Kafka Streams which apps work together as instances of the same application, and in this case it's called streams-pipe. The other key thing — we'll see it in the next example — is that Kafka Streams creates its own topics to handle things under the covers and uses this ID as part of their names, so you want the ID to be unique among the streams apps in your cluster. Then, just like with producers, consumers and Connect, you tell it where to find Kafka; localhost:9092 is the default. Then you provide serializers and deserializers. We haven't really touched on this, because with basic producers and consumers you don't tend to think about it much, but with Kafka Streams you have to as soon as you go off the beaten path of using strings everywhere. Kafka has a class called Serdes — serializer/deserializer, which is where the name comes from — that provides standard serdes for the standard types. Here we're using String, and we provide two defaults, one for the key and one for the value; you set this on producers and consumers as well, and the key and value types don't have to match. So I have to tell my Kafka Streams app how to serialize and deserialize when it reads and writes events from Kafka.

Those are all my properties, and then I create a StreamsBuilder, which creates my first KStream. You can see builder.stream, very similar to what we saw earlier: my input topic is streams-plaintext-input, and all I do is call .to on the result, which writes straight back out to streams-pipe-output. From the builder you create your topology — the diagram I showed earlier — just by calling builder.build. It's quite nice to print the topology out, probably not with System.out.println but to your log, because then you can see exactly what topology it has built. Then you start it, and the code at the bottom of the example just keeps the app running continuously.
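Condensed, the pipe app from the tutorial looks roughly like this; the tutorial version wraps the latch in a try/catch, and this sketch keeps just the essentials.

    import java.util.Properties;
    import java.util.concurrent.CountDownLatch;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.Topology;

    public class Pipe {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-pipe");      // unique per streams app
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            builder.stream("streams-plaintext-input").to("streams-pipe-output"); // the whole pipe

            Topology topology = builder.build();
            System.out.println(topology.describe());                             // print the topology

            KafkaStreams streams = new KafkaStreams(topology, props);
            CountDownLatch latch = new CountDownLatch(1);
            Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                streams.close();                                                 // clean shutdown on Ctrl-C
                latch.countDown();
            }));
            streams.start();
            latch.await();                                                       // keep the app running
        }
    }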
So that's my app; let's have a go at running it and see what happens. I already created some topics in the break — when I list them you can see we've now got streams-pipe-output and streams-plaintext-input. If I want to produce, I point the console producer at the new input topic, streams-plaintext-input, and it lets me type in a bunch of text: hello Devoxx, how are things, hi again, and so on. Then I run my consumer, and I've added a couple of properties to it. We run with from-beginning, which means it always starts from offset zero, because by default a new consumer starts from the end. I've also set the print.key property, because by default the console consumer doesn't print the key; for this example we maybe don't care about the key, but for a filtering example we would, and we'll care in a minute for the word count. The topic I'm consuming from is the pipe output topic. Then I start my pipe app, and the first thing it does is print out the topology, which in this case is very simple: just our source and sink, and — this is really handy — it tells you exactly which topics it's going to use, so here it's reading from streams-plaintext-input, the topic I created, and pushing to streams-pipe-output. This topology is straightforward, but we'll see with word count that printing it out is very useful. And then in the consumer you can see all of the data has come through: hello Devoxx, hi there, how are things, hi. That's our basic pipe.
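For reference, the console tools used in this demo look roughly like this; the flags match the Kafka versions current at the time of the talk, and newer console producers also accept --bootstrap-server.

    bin/kafka-console-producer.sh --broker-list localhost:9092 \
        --topic streams-plaintext-input

    bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
        --topic streams-pipe-output \
        --from-beginning \
        --property print.key=true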
Writing a pipe streams app is straightforward but not particularly exciting, because Kafka already lets you have multiple consumers on the same topic, so you could have done much the same with plain consumers. Let's look at one that's more interesting: word count. Again, this example is from the docs. This time I have a different application ID, streams-wordcount, and this time we'll see it creating intermediary topics — the pipe didn't need any, but this one will. I've again given it localhost:9092 and default key and value serdes of String; that serde is the default, so unless told otherwise Kafka Streams assumes it should use it for everything. Then I create my StreamsBuilder and a KStream from the same input topic, streams-plaintext-input, and here's where the more interesting code starts. First I flat-map the values: I lowercase each value and split it on whitespace, and the reason for the flat map is that without it each record would become a list of individual words; since I don't actually care which record each word arrived on, I want them flattened out into individual word records. Then I do the group-by, grouping by the value — at this point the value is a single word, because of the flat map and the split — so after the group-by we end up with lots of records where the key is the word and the value is that same word, and they all get put onto a new topic. You can see another of the DSL abstractions here: the group-by gives you a KGroupedStream. Then we do the count, and count takes Materialized.as("counts-store"), which tells it which local store to use — that's a local state store, not a new topic. At that point we've got a KTable, and since I want to send the result onwards rather than do more KTable-style transformations, I call toStream and send it to streams-wordcount-output. Count gives you a KTable with a String key and a Long value, which makes sense for a count — IntelliJ helpfully shows the types — and when I convert it to a KStream I've still got String and Long. Now, at the top I told Kafka Streams to use String serdes by default, so if it tried to serialize that Long value it would give me back an error saying it doesn't know how to serialize the object. What I can do is tell it, within the code, which serde to use at that specific step. I could of course have called toString on my count, but in a lot of cases you won't want to stringify things before sending them to Kafka, so it makes sense to keep the default at the top and override the serializer on the specific line where you need to. So this .to call is slightly different from the pipe one: as well as the topic name, I pass Produced.with(Serdes.String(), Serdes.Long()) — the first one is the key, the second the value; with everything in Kafka it goes key then value. Again I print out the topology, and the same code at the bottom starts it up.
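The core of the word count app, as in the tutorial, looks roughly like this; the properties and start-up code are the same as in the pipe sketch above, with the application ID changed to streams-wordcount, and this topology-building part drops into that same main method.

    import java.util.Arrays;
    import java.util.Locale;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.common.utils.Bytes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.kstream.Produced;
    import org.apache.kafka.streams.state.KeyValueStore;

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> source = builder.stream("streams-plaintext-input");

    KTable<String, Long> counts = source
        // lowercase each line, split it into words, and flatten into one record per word
        .flatMapValues(value -> Arrays.asList(value.toLowerCase(Locale.getDefault()).split("\\W+")))
        // group by the word itself; under the covers this repartitions onto an internal topic
        .groupBy((key, word) -> word)
        // count per word, materialized into a local state store (not a topic)
        .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("counts-store"));

    // convert the KTable back to a stream and write it out, overriding the value serde to Long
    counts.toStream()
          .to("streams-wordcount-output", Produced.with(Serdes.String(), Serdes.Long()));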
I've already created the word count output topic, I've got a producer pointing at streams-plaintext-input, and I restart my consumer pointing at streams-wordcount-output — it has no data yet because nothing is running — and I kill the pipe app so we don't tempt fate, although the two are quite happy to run side by side. Now we get a much more interesting topology printed out: there are two sub-topologies. The first one goes from streams-plaintext-input — the topic I created, so I know about that one — and you can see it does the flat map of the values and then a bunch of processor steps, key selecting and filtering, before it reaches the sink. I only wrote one or two lines of code for that, the flatMapValues and the groupBy, but under the covers it's doing quite a lot, which is the nice thing about these abstractions. It ends up in something called counts-store-repartition. Then in the second sub-topology, counts-store-repartition feeds counts-store, which is just an internal store — that's the difference: the repartition is a real topic in Kafka, whereas the store is not a topic — and from there it goes to streams-wordcount-output.

So my events come out, and you'll notice the key is printed because of the print.key property from before, but the value needs a different deserializer: by default the console consumer assumes everything is a string, and there are plenty of configuration options, so I add a value deserializer for Long — the key is still a string, because it's the word, and the value is the count. With the right property in place — the full command is sketched just after this passage — the consumer prints everything out properly: a set of records where the key is the word and the value is how many times it has appeared, and to get to this point I had to know I'd end up with a Long at the other end.

It's also worth looking at what's happening in between. If I list my topics again, two new topics have turned up: a changelog and a repartition topic, and if you remember, the topology print told us counts-store-repartition would be created by Kafka Streams itself. So be aware: if you start using Kafka Streams it will create topics for its own use — not randomly, on purpose — but the point is that extra topics will appear, which matters particularly if you've done things like turning off automatic topic creation. We can also consume from the repartition topic to see what's on it — no Long deserializer needed there, but I'll keep the print.key property — and if I throw in a few more events with some duplicate words, what we see on the repartition topic is exactly what you'd expect: the key and the value are the same word. That's what lets us run multiple Kafka Streams apps for the counting step, each reading particular keys and counting how many turn up. You'll also notice it isn't every single word, and that's because this internal topic is created with compaction turned on — but you don't have to worry about that, because the key point is that Kafka Streams is doing all of this for you under the covers and managing what goes onto these topics.
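The consumer command for the word-count output, with the Long value deserializer, is roughly:

    bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
        --topic streams-wordcount-output \
        --from-beginning \
        --property print.key=true \
        --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer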
The thing worth bearing in mind with all of this is that if you look at your Kafka Streams apps and find a lot of latency between one end and the other, there are other things going on under the covers. So first look at your topology and work out which topics are being created, then go and look at those topics — you can consume from them, it's not disruptive in any way, a consumer can run quite happily — and start to see how long it takes events to get from here to there and what you might change. The other thing worth knowing is that the properties I set are just a fraction of what you can set: Kafka Streams lets you tune the underlying producer and consumer config, so if there are particular producer or consumer settings you want to change to improve latency and so on, you can do that from your Kafka Streams app — there's a small sketch of this after this section. So that's the word count: all of this data flows through quite happily and comes out the other end, as long as we remember our deserializers.

I think Kafka Streams can take a bit of time to get your head around — what happens where — so I wanted to put the diagrams back up against the code so you can map back and forth. We created our StreamsBuilder and did builder.stream, which pulls from the input topic; that's the point at which we actually talk to Kafka. The lowercasing and the splitting into words are done within the Kafka Streams app itself; it's not talking to Kafka at that point. There are probably nicer ways to write that bit of code, but it's the example the docs give, so I stuck to it so it looks familiar when you go and look; and if you've read the definitive guide to Kafka book, which I'd highly recommend, there's a slightly different version of the same example that's laid out a bit differently and removes stop words and so on. Then you do the group-by, at which point — whether you realise it or not — you're creating a new topic under the covers and repartitioning all of that data; then the count, then toStream, and then back out the other end to your output topic. The nice thing is you haven't had to think about storing the repartitioned data, but if a word has been processed onto the repartition topic and your Kafka Streams app then dies, when it comes back up it just carries on from the repartition topic — by using Kafka topics under the covers it gets all of that resiliency.

To review Kafka Streams: I'd highly recommend starting with the streams DSL, because it makes it really easy to get going, and as you've seen, you can do fairly complex processing of events without writing much code. If that doesn't give you exactly what you want, there's the underlying processor API, which is what the DSL itself uses, so you can build things directly with that. It's different from other processing engines and processing technologies because it doesn't have a separate processing engine: as you saw, I was just running my app, and that is literally it.
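Coming back to the earlier point about tuning the embedded producer and consumer from your streams app: that looks roughly like this. The property names are standard Kafka client settings, but the values here are arbitrary examples.

    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.streams.StreamsConfig;

    Properties props = new Properties();
    // ...the usual application.id, bootstrap.servers and serde settings...

    // settings for the producer that Kafka Streams runs internally
    props.put(StreamsConfig.producerPrefix(ProducerConfig.LINGER_MS_CONFIG), "5");
    // settings for the consumer that Kafka Streams runs internally
    props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_RECORDS_CONFIG), "500");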
Kafka Streams uses Kafka topics for resiliency and scalability, which means you can be sure that if your Kafka Streams app dies and comes back, it just carries on where it left off, because it makes use of all of those nice things in Kafka — and unlike Kafka Connect, you don't have to run anything separate; you just run your one app.

Tomorrow I'm doing a talk on Kafka and Kubernetes, but I wanted to give it one slide here as a preview. In the past people have asked me whether you should run Kafka on Kubernetes, and I think the answer is definitely yes; more and more people are doing it, and there's no reason not to, especially if you're already running all of your apps on Kubernetes — it makes sense to run everything in one place. But there are things to consider. Throughout today we've talked about all the ways Kafka is resilient and available, and if you run it on Kubernetes in not quite the right way you can quickly undo all the work the Kafka community has done. So tomorrow I'll talk about exactly which parts of Kubernetes and which parts of Kafka you need to be aware of, and how they fit together, so you can build a system that makes use of the availability features of both. Some of the things I'll touch on: liveness and readiness probes; managing state, because even though Kafka is highly available you probably still want to persist the data on your brokers somewhere that isn't just inside the container — if the container dies, the data isn't there any more; node affinity; external access — we didn't go into the difference between listeners and advertised listeners much today, and when you're running locally you don't really need to think about it, but in containers it's definitely a topic to understand; and automation, because there are plenty of projects already helping you run Kafka on Kubernetes — our product, Event Streams, runs on Kubernetes and uses Helm for its automation, and open source projects like Strimzi use operators. So you definitely can run Kafka on Kubernetes, and if you want the details, come to my talk tomorrow.

We've covered all sorts of things in the tour today. We've looked at the Kafka cluster itself: the brokers and how they work, the idea of replication, and the scalability you get through partitions. We've talked about ZooKeeper, how it's used as part of Kafka, and the fact that it's going away at some point, so keep an eye out for that. We then looked at producers and consumers, particularly at all the configuration options — between the broker-level settings and the producer and consumer settings you really can configure Kafka to do exactly what you want, so you need to be aware of that when you start using it. And then we looked at Kafka Connect and Kafka Streams: if you're going to flow data from an external system into Kafka, a lot of the time it makes sense to use Connect, and if you want to process that data and do stream processing on it, you can use Kafka Streams.
In fact, in a lot of places where I've spoken to people and they've said they're not going to use Kafka Connect because they want to do some special processing, what they could have done is use Kafka Connect and Kafka Streams together. Kafka Connect lets you apply some transformations to the data as it flows in, but Kafka Streams is so much more powerful that it often makes sense to flow the data straight from the external system into Kafka — you don't have to write any additional code, you just run some scripts and make use of all the connectors other people in the community are building, connectors for all the different JDBC databases, or MQ, or Elasticsearch — and then, once the data is in Kafka, do your stream processing there.

If you're going to take away five things from this talk, this is what I'd like them to be. Kafka is the de facto event streaming platform — I originally wrote "becoming the de facto" on this slide, but at this point I think it simply is; the number of people here speaks to that, and the community grows every year. I've been to four of the last six Kafka Summits, I think, and every time there are more people and more people contributing, so you really are adopting something with a lot of momentum and a lot being built around it. It has scalability and availability built in. It's very configurable, so you can set it up exactly how you want, with the caveat that you have to make sure you've set it up correctly — so do look at the options for having someone host it for you if you don't want to spend your time tuning Kafka. It depends on your use case. And you can use Kafka Connect for connecting to other systems and Kafka Streams for event stream processing.

That's all we had. I've put up a few links to the Kafka quickstart guide, to Connect, to Streams and to the connectors, and if you want to know more about Event Streams — as an option for running Kafka for you on our cloud, or as a Kubernetes-based package that's easier to deploy and adds things like a UI and a schema registry for versioning your events — then have a look; I'll have some Event Streams cards at the front if you're interested. Thank you very much — are there any questions?

Yes — event archiving. For event archiving, if you're just trying to get your data out of Kafka and into a system for the long term, I'd say looking at a Kafka Connect connector is a good way to do it. There are a lot of connectors around, and many of the database ones go both ways, so you can use change data capture to get data in but also flow the data back out again — that's definitely a good place to start. The one caveat with Kafka Connect is that, depending on the system, you sometimes find that the way Connect is built means you just can't write a connector that works for it; in that case you can use a standard consumer, but if you can use a connector I'd highly recommend it, because it means less work for you.

Yes — so there are connectors for things like MQTT and other devices that run outside of Kafka. Your options are basically writing a producer or consumer; you can use Node, for example, if your front end is written in Node.
I can't remember off the top of my head which Node library we use — node-rdkafka, which is built on top of librdkafka — so you can make use of that. Depending on what's creating the events on the browser side, there may already be a system that integrates: when LinkedIn originally created Kafka they did it for collecting data from their front-end systems, and they were basically just using a standard producer. So you could have Kafka with a producer pushing data from your front end, but I don't know specifically which technologies are available for doing notifications in the browser — I haven't looked at that; I'm more of a back-end developer. What I would say is that it's definitely worth looking at the talks from the Kafka Summit, because there were loads of them and I think there were some front-end ones; I just didn't go to those because I don't do much front end any more.

Any more questions? This is where we test my eyesight... OK, well, you can always come and ask questions afterwards. Andrew has to head home today, but I'll be here tomorrow as well, and I'll generally be around the IBM booth if you have questions. Thank you very much.
Info
Channel: Devoxx
Views: 78,201
Keywords: DevoxxBe, DevoxxBE19
Id: X40EozwK75s
Length: 124min 57sec (7497 seconds)
Published: Tue Nov 05 2019