Apache Kafka Explained (Comprehensive Overview)

Video Statistics and Information

Captions
What is Kafka? Before we dive into Kafka, let's start with a quick recap of what publish/subscribe messaging is. Publish/subscribe is a messaging pattern where the sender doesn't send data directly to a specific receiver. Instead, the publisher classifies the messages without knowing whether there are any subscribers interested in a particular type of message. Similarly, the receiver subscribes to receive a certain class of messages without knowing whether there are any senders producing those messages. Pub/sub systems usually have a broker where all messages are published. This decouples publishers from subscribers and allows for greater flexibility in the type of data that subscribers want to receive. It also reduces the number of potential connections between publishers and subscribers. A bulletin board comes in handy as a good analogy for the pub/sub messaging pattern: people can publish information in a central place without knowing who the recipients are.

Okay, so what is Kafka then? Apache Kafka is an open-source publish/subscribe messaging system, also very often described as a distributed event log where all new records are immutable and appended to the end of the log. In Kafka, messages are persisted on disk for a certain period of time known as the retention policy. This is usually the main difference between Kafka and other messaging systems, and it makes Kafka, in some ways, a hybrid between a messaging system and a database. The main concepts behind Kafka are producers producing messages to different topics and consumers consuming those messages while maintaining their position in the stream of data. You can think of producers as publishers or senders of messages; consumers, on the other hand, are analogous to receivers or subscribers. Kafka aims to provide a reliable, high-throughput platform for handling real-time data streams and building data pipelines. It also provides a single place for storing and distributing events that can be fed into multiple downstream systems, which helps to fight the ever-growing problem of integration complexity. Besides all of that, Kafka can also be easily used to build modern and scalable ETL, change data capture (CDC), or big data ingest systems. Kafka is used across multiple industries, from companies like Twitter and Netflix to Goldman Sachs and PayPal. It was originally developed by LinkedIn and open-sourced in 2011.

Now let's dive a little deeper into the Kafka architecture. At a high level, a usual Kafka architecture consists of a Kafka cluster, producers, and consumers. A single Kafka server within the cluster is called a broker. A Kafka cluster usually consists of at least three brokers to provide a sufficient level of redundancy. The broker is responsible for receiving messages from producers, assigning offsets, and committing messages to disk. It is also responsible for responding to consumers' fetch requests and serving messages. In Kafka, when messages are sent to a broker, they are sent to a particular topic. Topics provide a way of categorizing the data being sent, and they can be further broken down into a number of partitions; for example, a system can have separate topics for processing new users and for processing metrics. Each partition acts as a separate commit log, and the order of messages is guaranteed only within the same partition. Being able to split a topic into multiple partitions makes scaling easy, as each partition can be read by a separate consumer. This allows for high throughput, since both partitions and consumers can be split across multiple servers.
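To make this concrete, here is a minimal sketch of a producer using the Java client, assuming a broker running at localhost:9092 and a topic named "metrics" (the broker address, topic, key, and value are all hypothetical, chosen only for illustration):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MetricsProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The default partitioner hashes the key, so all records with the
            // same key (here, the same user) land on the same partition and
            // therefore stay ordered relative to each other.
            producer.send(new ProducerRecord<>("metrics", "user-42", "cpu=0.93"));
        }
    }
}
```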
Producers are usually other applications producing data; this can be, for example, our application producing metrics and sending them to our Kafka cluster. Similarly, consumers are usually other applications consuming data from Kafka. As we mentioned before, Kafka very often acts as a central hub for all the events in the system, which means it's a perfect place to connect to if we are interested in a particular type of data. A good example would be a database that can consume and persist messages, or an Elasticsearch cluster that can consume certain events and provide full-text search capabilities for other applications.

Now that we've been through the general overview of Kafka, let's jump into the nitty-gritty details. In Kafka, a message is a single unit of data that can be sent or received. As far as Kafka is concerned, a message is just a byte array, so the data doesn't have any special meaning to Kafka. A message can also have an optional key, also a byte array, that can be used to write data in a more controlled way to multiple partitions within the same topic. As an example, let's assume we want to write our data to multiple partitions, as it will make it easier to scale the system later. We then realize that certain messages, let's say those for each user, have to be written in order. If our topic has multiple partitions, there is no guarantee which messages will be written to which partitions; most likely, new messages will be written to partitions in a round-robin fashion. To avoid that situation, we can define a consistent way of choosing a partition based on the message key. One way of doing that can be as simple as using the user ID modulo the number of partitions, which would always assign the same partition to the same user.

Sending single messages over the network creates a lot of overhead, which is why messages are written into Kafka in batches. A batch is a collection of messages produced for the same topic and partition. Sending messages in batches provides a trade-off between latency and throughput, and it can be further controlled by adjusting a few Kafka settings. Additionally, batches can be compressed, which provides even more efficient data transfer. Even though we already established that Kafka messages are just simple byte arrays, in most cases it makes sense to provide additional structure to the message content. There are multiple schema options available; the most popular ones are JSON, XML, Avro, and Protobuf.

We already described what topics and partitions are, but let's emphasize again the importance of not having any guarantees when it comes to message time ordering across multiple partitions of the same topic. The only way to achieve ordering for all messages is to have only one partition; by doing that, we can be sure that events are always ordered by the time they were written into Kafka. Another important concept when it comes to partitions is the fact that each partition can be hosted on a different server, which means that a single topic can be scaled horizontally across multiple servers to improve throughput.

A Kafka cluster wouldn't be very useful without its clients, which are the producers and consumers of the messages. Producers create new messages and send them to a specific topic. If a partition is not specified and the topic has multiple partitions, messages will be written to the partitions evenly; this can be further controlled by using a consistent message key, as described earlier. Consumers, on the other hand, read messages: they subscribe to one or multiple topics and read messages in the order they were produced.
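Below is a matching consumer sketch in Java, again assuming a broker at localhost:9092 and the hypothetical "metrics" topic; the group.id value is also made up. The printed fields preview the offset and consumer-group mechanics described next:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class MetricsConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        props.put("group.id", "metrics-dashboard");       // consumers sharing this id split the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("metrics"));
            while (true) {
                // poll() fetches records starting from the group's current
                // offset in each partition assigned to this consumer.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```

With the client's default enable.auto.commit=true, the consumer periodically commits the offsets it has read back to Kafka, which is what lets it resume where it left off after a restart.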
The consumer keeps track of its position in the stream of data by remembering which offsets were already consumed. Offsets are assigned at the time a message is written to Kafka, and they correspond to a specific message in a specific partition. Within the same topic, multiple partitions can have different offsets, and it's up to the consumer to remember what offset each partition is at. By storing offsets in ZooKeeper or in Kafka itself, a consumer can stop and restart without losing its position in the stream of data.

Consumers always belong to a specific consumer group. Consumers within a consumer group work together to consume a topic; the group makes sure that each partition is only consumed by one member of the group. This way, consumers can scale horizontally to consume topics with a large number of messages. Additionally, if a single consumer fails, the remaining members of the group will rebalance the partitions to make up for the missing member. In case we want to consume the same messages multiple times, we have to make sure the consumers belong to different consumer groups. This can be useful if we have multiple applications that have to process the same data separately.

As we mentioned before, a Kafka cluster consists of multiple servers called brokers. Depending on the specific hardware, a single broker can easily handle thousands of partitions and millions of messages per second. Kafka brokers are designed to operate as part of a cluster. Within the cluster of brokers, one broker also acts as the cluster controller. The controller is responsible for administrative operations, including assigning partitions to brokers and monitoring for broker failures. A partition is always owned by a single broker in the cluster, which is called the leader of the partition. The partition may be assigned to multiple brokers, which results in the partition being replicated. This provides redundancy of the messages in the partition, such that another broker can take over leadership if there is a broker failure. However, all consumers and producers operating on that partition must connect to the leader.

One of the key features of Kafka is retention, which provides durable storage of messages for some period of time. Kafka brokers are configured with a default retention setting for topics, either retaining messages for some period of time (seven days by default) or until the topic reaches a certain size in bytes (for example, one gigabyte). Once these limits are reached, the oldest messages are expired and deleted, so the retention configuration defines a minimum amount of data available at any time. Individual topics can also have their own retention settings; for example, a topic for storing metrics might have a very short retention of a few hours, while a topic containing bank transfers might have a retention policy of a few months.

Reliability is often discussed in terms of guarantees: certain behaviors of a system that should be preserved under different circumstances. Understanding those guarantees is critical for anyone trying to build reliable applications on top of Kafka. These are the most common Kafka reliability guarantees. Kafka guarantees the order of messages in a partition: if message A was written before message B using the same producer in the same partition, then Kafka guarantees that the offset of message B will be higher than that of message A, which means that consumers will read message A before message B. Messages are considered committed when they are written to the leader and all in-sync replicas; both the number of in-sync replicas and the number of required acknowledgements (acks) can be configured.
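As a rough illustration of how those knobs surface on the producer side, here is a hedged sketch in Java (the broker address is again hypothetical; acks is a real producer setting, while min.insync.replicas is configured on the broker or topic and is only referenced in a comment here):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class ReliableProducerFactory {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // "all" waits until the leader and every in-sync replica have
        // acknowledged the write before the send is considered successful.
        props.put("acks", "all");
        // On the topic/broker side, min.insync.replicas (e.g. 2) defines how
        // many replicas must be in sync for such a write to be accepted.
        return new KafkaProducer<>(props);
    }
}
```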
Committed messages won't be lost as long as at least one replica remains alive and the retention policy holds, and consumers can only read committed messages. Kafka provides at-least-once message delivery semantics, which doesn't prevent duplicate messages from being produced. The important thing to note is that even though Kafka provides at-least-once delivery semantics, it does not provide exactly-once semantics; to achieve that, we have to either rely on an external system with some support for unique keys or use Kafka Streams. Let's also remember that even though these basic guarantees can be used to build a reliable system, there is much more to it: in Kafka, there are a lot of trade-offs involved in building a reliable system, the usual ones being reliability and consistency versus availability, high throughput, and low latency.

Let's review both the pros and cons of choosing Kafka.

Pros:
- tackles integration complexity
- great tool for ETL or CDC
- great for big data ingestion
- high throughput
- disk-based retention
- supports multiple producers and consumers
- highly scalable
- fault tolerant
- fairly low latency
- highly configurable and provides back pressure

Cons:
- requires a fair amount of time to understand and to not shoot yourself in the foot by accident
- might not be the best solution for really low-latency systems

A lot of things in Kafka were purposely named to resemble JMS-like messaging systems, which makes people wonder what the actual differences are between Kafka and standard JMS systems like RabbitMQ or ActiveMQ. First of all, the main difference is that Kafka consumers pull messages from the brokers, which allows for buffering messages for as long as the retention period holds. In most other JMS systems, messages are instead pushed to the consumers, and pushing messages to the consumers makes things like back pressure really hard to achieve. Kafka also makes replaying events easy, as messages are stored on disk and can be replayed at any time. Besides that, Kafka guarantees the ordering of messages within one partition, and it provides an easy way of building scalable and fault-tolerant systems.

Time for a quick summary. In the era of ever-growing data integration complexity, having a reliable, high-throughput messaging system that can be easily scaled is a must. Kafka seems to be one of the best available options that meets those criteria, and it has been battle-tested for years by some of the biggest companies in the world. We have to remember that Kafka is a fairly complex messaging system, and there is a lot to learn to make use of its full potential without shooting ourselves in the foot. There are also multiple libraries and frameworks that make using Kafka even easier; some of the most notable ones are Kafka Streams and Kafka Connect. If you want to learn even more about Kafka, I can recommend the book Kafka: The Definitive Guide, which I found very useful. You can find the book and other useful links in the description box below, and also visit my website, finematics.com. If you liked the video, hit the like button and subscribe to my channel. Thanks for watching!
Info
Channel: Finematics
Views: 197,115
Keywords: kafka, apache kafka, messaging, jms, coding, software development, tech, java
Id: JalUUBKdcA0
Length: 19min 17sec (1157 seconds)
Published: Mon Sep 16 2019