How To Fail At Kafka by Peter Godfrey

Captions
[Music] Thanks for coming. This is a very quick How To Fail At Kafka session; obviously it's a massive topic, so we're going to try to compress it all into fifteen minutes.

Firstly, how many people here are Kafka users already? Okay, you don't need me then; I think you've already solved all these problems, or at least you're aware of them. For those of you who don't know what Kafka is, or have never used it: basically, it's designed to solve the spaghetti architecture that enterprises often end up with, lots of disparate point-to-point connections. Kafka is designed to slice through that and decouple everything, so everything can communicate nicely with everything else. But it's more than just messaging: it's got storage, ETL, and real-time stream processing, plus a whole library of connectors, a whole ecosystem of things you can connect to Kafka.

At its heart it's very simple: it's just a log. Messages come in and are written to the end of a log file; dead easy. Other things can come along and read those messages from any point. And out of that you get a whole host of great features: scalability, durability, resilience, massive throughput, low latency, all that good stuff.

So what could possibly go wrong when you put Kafka into production? It's obviously something a lot of companies rely on; it's being used all over the place to solve all kinds of different problems. I'm going to run through the top five things that people don't think about when they're deploying Kafka, that they stumble onto, or that they get wrong. Hopefully we've got time for all five, but if not, I'll miss some out and you can come and see me afterwards.

Number one: durability. Kafka is designed to be clustered. If you have one Kafka broker, you can have multiple partitions where your data comes in and is stored, but if that broker goes down, your cluster is gone; you have no Kafka cluster, and that's not what you want. Kafka achieves durability through replication: the cluster copies the data to the other cluster members. Here, each of those grey blocks is a partition; the ones marked with an L are the leaders, which is where your data gets written, and the data then gets copied across to the other partitions in the cluster. This way, if one cluster member goes down, the leader automatically moves to another node and everything carries on working as you'd want it to.

So in this setup, is your data safe? Well, it depends on your configuration. As your producers write data into Kafka, they get an acknowledgment: this producer sends its message and gets an acknowledgment back as soon as it's written to the leader partition. Your producer might then think, "great, my data is in the Kafka cluster, I can go off and do something else", but the data has not yet been replicated to the other cluster members. There's a danger point here: if your producer thinks it has written the message but that node on the left goes down, you've actually lost your data; it hasn't been replicated. So there's a setting you need to be aware of, which is acks. By default, acks is 1, which means the producer waits for an acknowledgment from one Kafka cluster member before moving on. You probably want to set acks=all, which means the write has to be replicated to the other cluster members before it counts as acknowledged. This way your data maintains its durability.
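To make that concrete, here's a minimal sketch of a durable producer configuration, assuming the standard Apache Kafka Java client; the broker address and the "payments" topic are placeholders, not from the talk.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Wait until the write has been replicated to the in-sync replicas,
        // not just the leader, before treating the send as acknowledged.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("payments", "key", "value")); // hypothetical topic
            producer.flush();
        }
    }
}
```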
But there's a little hidden gotcha here, because acks=all won't necessarily wait for acknowledgments from all the cluster members; it relies on another setting, which is min.insync.replicas. If your producer writes a message to one Kafka node with acks=all, it will wait for acknowledgments from cluster members up to the count of min.insync.replicas. Let's make sure we're clear on this one: if I've set min.insync.replicas to two but I've only got one node in my cluster, that won't count as the data being written. So you need to make sure you set min.insync.replicas appropriately. The default is 1, but if, for example, you set it to 3 on a three-node cluster, then when your producer writes a message, Kafka makes sure the data has been written to all three members of the cluster before it counts as acknowledged. That means when a node goes down, you can be sure your data has been copied to the other cluster members. A note on the way Kafka has been designed: by default it's optimised for availability and latency. If durability is more important to you, you need to change these parameters and make sure they're set appropriately for your particular use case. So that's number one.

Number two is assuming everything will always work. In this case, you've got a producer and it sends a message into your Kafka cluster, but the node it's writing to is down, so it fails to get an acknowledgment, and the producer thinks it hasn't written the message at all. Here that happened because it tried to write while the leader was being moved to another broker. So another setting you need to play with is retries. The default retry value is actually zero, which means your producers won't retry sending the message; they'll just say "oh well, that didn't work". You need to change that to a much bigger number. If, for example, you set it to two, then in this case the producer will try to send its message, fail to get an acknowledgment, and try again, and this time it will hit the leader on the right, the node that's up.

You might say: why not just put retries up to infinity? Then your producers will keep trying to send a message until it gets successfully written. But that exposes you to another problem, which is duplicate messages. If we look here, the producer writes a message to Kafka, and if it fails to get an acknowledgment, it tries again and sends the message again, and you end up with two messages, because actually the first message was successfully written; it was the acknowledgment that failed. So another setting you need to be aware of is idempotence. On Kafka producers you can simply enable idempotence; the default is false, so you need to change it to true. This means each message gets a unique ID, so if a producer successfully writes a message but doesn't know that it has, and tries again, the Kafka brokers will spot that it's the same message because of its unique ID and count it as one message, so you don't end up with duplicates. Tip number two, then, is to use the built-in idempotence. I'm really skimming these at a high level; there's a whole world you can uncover around idempotent consumers and exactly-once semantics. Come and see us at the Confluent booth if you'd like more details.
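As a sketch of tips one and two combined, the same Java producer with retries and idempotence enabled might look like this; note that enabling idempotence requires acks=all, and the topic name is again a placeholder.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class IdempotentProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Retry failed sends instead of silently dropping the message...
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        // ...and tag writes with a producer ID and sequence number so the
        // brokers can de-duplicate retried sends that actually succeeded.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // required for idempotence

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("payments", "key", "value")); // hypothetical topic
        }
    }
}
```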
Number three: be aware of your exception handling. Here I'm just going to talk about exception handling on the producer side, but obviously things can go wrong in lots of different places; you need to keep thinking about where things might go wrong and how you might deal with those situations. On the producer side you've got a few options. I just mentioned the infinite-retry idea: just keep trying to write the message until it gets written. The second option is to flag the message as not having been written: have, say, ten retries and then send it to some kind of dead letter queue or topic, where you can keep track of all the messages that have failed to be written, and another process can come along later and deal with them. Option three is to just ignore the error and carry on. Again, this will depend on the kind of data you've got and what you need to do with it. So unfortunately, for this one there's no silver bullet; you need to think about what your data is, what happens if it goes wrong at each stage of the process or in each component, and deal with it accordingly.
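As a hedged sketch of that second option, you can attach a callback to each send and route failures to a dead-letter topic; the "orders.dead-letter" topic and the separate deadLetterProducer are illustrative choices, not something prescribed in the talk.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DeadLetterSketch {
    // Assumes both producers are configured elsewhere, e.g. as in the earlier
    // sketches; using a second producer for dead letters is one possible design.
    static void sendWithDeadLetter(KafkaProducer<String, String> producer,
                                   KafkaProducer<String, String> deadLetterProducer,
                                   ProducerRecord<String, String> record) {
        producer.send(record, (metadata, exception) -> {
            if (exception != null) {
                // The send ultimately failed (retries exhausted): park the record
                // on a dead-letter topic so a later process can inspect it.
                deadLetterProducer.send(new ProducerRecord<>(
                        "orders.dead-letter", record.key(), record.value()));
            }
        });
    }
}
```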
I'm really rattling through these in fifteen minutes, so apologies for the speed. Number four is no data governance; this is a common mistake. With Kafka, you can put any data you like into it; it's just bytes. Any message can be any collection of bytes, they all get moved around, and Kafka doesn't care. A consumer can come along and consume those bytes, and at that point it probably will care what those bytes are. A common thing we see: people have a message, like the one going in there, with a name and a date of birth in a particular format; you could say everything here is on version 1 of the schema. If you then change that schema, so the date of birth that was a string is now a timestamp, but there's a consumer that's been left behind and hasn't been upgraded to understand the latest schema, it's going to fail. You've got to be aware that changes to what your producers are doing will impact your consumers further down the line.

There's a solution to this, and that's to use a schema registry. This gives you a central point where your producers and consumers can look up their data types and make sure they all understand the same messages. Confluent Schema Registry is based on Avro and allows you to make non-breaking changes, so you can actually evolve the schema: a producer can produce messages with a new schema and your consumers are fine with that, because they can look up the schema in the schema registry, or ignore messages they don't understand. It's important to have some central mechanism in place where your producers and consumers can be sure they're all talking the same language, all sharing the same data.
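For illustration, wiring a Java producer to a schema registry might look roughly like this; it assumes Confluent's kafka-avro-serializer dependency is on the classpath, and the registry URL, topic name, and record contents are placeholders.

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "io.confluent.kafka.serializers.KafkaAvroSerializer");
        // The serializer registers and looks up schemas here, so producers and
        // consumers agree on the message format.
        props.put("schema.registry.url", "http://schema-registry:8081");      // placeholder

        // Version 1 of the schema from the talk: a name and a date of birth.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Person\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"dateOfBirth\",\"type\":\"string\"}]}");

        GenericRecord person = new GenericData.Record(schema);
        person.put("name", "Jane");
        person.put("dateOfBirth", "1990-01-01");

        try (KafkaProducer<Object, Object> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("people", person)); // hypothetical topic
        }
    }
}
```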
Number five: you've got to pay attention to the network bandwidth within your cluster; this one crops up quite a lot. Say you have a Kafka cluster with four nodes, and the topics are partitioned across the nodes like that, with the ones in orange being the leaders, so there's a leader for the topics on each node. If one node goes down, this is all fine, because your data is all replicated: something else takes over as leader, you're now running on a three-node cluster, and it all works fine. But you probably want to put that fourth node back up, because that's what you designed, that's what you had to start with. As soon as you do, those topic partitions will start to be copied across, because we want our data to be replicated across our nodes. If those partitions are 50 or 100 gigabytes, there's a big impact on the network infrastructure in your cluster, and it's also going to take time: if these are massive partitions, it might be hours, it might be days, before all your data is copied back onto the new Kafka broker that spins up. So you need to be aware of how big your partitions are and think about what happens when they move around. Maybe you need smaller partitions but more of them, maybe you need more Kafka nodes, maybe you need better network infrastructure in your cluster; you need to be aware that things like this can happen.
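As a back-of-envelope sketch (the numbers here are my illustrative assumptions, not the speaker's), you can estimate how long re-replication takes from the data size and the bandwidth you can spare for it:

```java
public class ReplicationTimeEstimate {
    public static void main(String[] args) {
        // Illustrative assumptions, not from the talk.
        double partitionGiB = 100.0;  // size of one partition replica
        int partitionsToMove = 20;    // replicas hosted on the returning broker
        double linkGbps = 1.0;        // bandwidth spared for re-replication

        double totalGigabits = partitionGiB * partitionsToMove * 8.589934592; // GiB -> Gbit
        double hours = totalGigabits / linkGbps / 3600.0;
        System.out.printf("~%.1f hours for the broker to catch up%n", hours);
    }
}
```

With these numbers that's roughly 4.8 hours, and that assumes the link isn't also carrying production traffic.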
The last one I've got here is no monitoring. It's all very well having your Kafka cluster up and running, but you need to find out whether it's working correctly, whether everything is going according to plan. There's a collection of JMX metrics you can gather from the brokers and from the clients, and we have a nice web page that explains how to do all this. The kinds of things you probably want to be able to answer are application-level questions: are my applications receiving all my data, does everything have the latest data, which applications are running slowly, are there areas we need to scale up, can the data get lost anywhere, will there be any interruptions? So as well as monitoring system- and OS-level things (CPU usage, memory usage, disk usage), it's important to keep an eye on these application-level things as well.
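One concrete application-level check, sketched with the standard Java clients: compare a consumer group's committed offsets against the log-end offsets to estimate how far behind it is. The bootstrap address and the "my-app" group ID are placeholders.

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class LagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // placeholder
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (AdminClient admin = AdminClient.create(props);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Where the group has got to...
            Map<TopicPartition, OffsetAndMetadata> committed = admin
                .listConsumerGroupOffsets("my-app") // placeholder group ID
                .partitionsToOffsetAndMetadata().get();
            // ...versus the end of each partition's log.
            Map<TopicPartition, Long> ends = consumer.endOffsets(committed.keySet());
            committed.forEach((tp, offset) -> {
                if (offset != null) { // partitions with no committed offset come back null
                    System.out.printf("%s lag=%d%n", tp, ends.get(tp) - offset.offset());
                }
            });
        }
    }
}
```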
We actually have Confluent Control Center, which helps you monitor these kinds of things. It gives you a nice web-based user interface and will show you whether messages are being produced and consumed, end-to-end latency, whether the partitions are balanced, and all kinds of metrics and information about throughput and bandwidth usage.

That, I think, is my time up; I really rattled through it. A quick plug at the end for Kafka Summit, which is on Monday and Tuesday next week; there are some tickets available, so come and see us at the conference stand if you'd like to attend. And if you're lucky enough to be in San Francisco at the end of September, pop along there as well. Any questions in the last thirty seconds? There will be a quiz afterwards. Thank you very much, everyone. [Applause] [Music]

Info
Channel: Devoxx
Views: 5,258
Rating: 4.9358287 out of 5
Keywords: DevoxxUK, DevoxxUK2019
Id: xsdoQkoao2U
Length: 15min 17sec (917 seconds)
Published: Thu May 16 2019