Understanding Kafka

Captions
So far we've talked a lot, and I mean a lot, about processing big data that might already be sitting on your Hadoop cluster or in a database, but we haven't really talked about how that data gets in there in the first place. So now we're going to talk about streaming: the process of publishing data from some source, like web logs or sensor data, and getting it into your cluster in a scalable manner so you can do some processing on it, perhaps even in real time. In this section we'll look at how you publish that data from your data sources into your cluster in real time so that you can then process it. Let's dive right in.

The first technology we're going to talk about for streaming data into your cluster is called Kafka. It's a publish/subscribe messaging system. So what is streaming, anyway? We've been talking so far about processing data on your cluster using tools like Hive, Pig, and Spark, but we've been assuming that your data is already on your cluster somewhere. It had to come from somewhere, though, right? It didn't just magically appear on your HDFS file system or in a database. Sometimes you want to process new data as it comes in, without having to load it manually in big chunks all the time. That's where streaming comes in. With streaming technologies such as Kafka, you can process new data as it's generated into your cluster: maybe you'll save it into HDFS, maybe into HBase or some other database, or maybe you'll actually process it in real time as it arrives. You can do all of that with streaming.

There are many applications of this. You might be monitoring customer behavior data coming from the logs on your web servers and transforming those logs into more structured forms in a database. You might have sensor data coming in from some big Internet of Things deployment, or stock trades arriving in real time; it could be anything. Usually when we're talking about big data, there's a big flow of it coming in all the time, and you want to deal with it as it comes instead of storing it up and handling it in batches. With streaming, you publish that data in real time to your cluster, and you can even process it as it comes in if you want.

There are two distinct problems in this whole scenario of streaming. One is how to get the data from your data sources into your cluster: you might have a very widely distributed fleet of web servers or sensors or what have you, and you need some mechanism for publishing their data to your cluster in a scalable and reliable manner. The second problem is what you do with that data once it gets there. In this section we're going to focus on the first problem: how do I actually publish data from my data sources into my cluster at scale?

Kafka is one popular solution for this. It's not just a Hadoop thing; it's a more general-purpose publish/subscribe messaging system. You can set up a cluster of Kafka servers whose entire job is to store all incoming messages from publishers, which might be a bunch of web servers or sensors or who knows what, for some period of time, and publish them to anyone who wants to consume them.
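To make the publish side concrete, here is a minimal sketch of a producer using Kafka's Java client library. The broker address, topic name, key, and message contents are all hypothetical placeholders for illustration, not anything from the video:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogLineProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Address of one (or more) brokers in the Kafka cluster.
        props.put("bootstrap.servers", "localhost:9092");
        // Keys and values are sent as plain strings in this sketch.
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one message to a hypothetical "weblogs" topic.
            producer.send(new ProducerRecord<>("weblogs",
                    "server-01", "GET /index.html 200"));
        }
    }
}
```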
Now, these messages are associated with something called a topic, and a topic represents a specific stream. For example, you might have a topic of web logs from a given application, or a topic of sensor data from a given system. Consumers subscribe to one or more topics, and they receive data as it's published. And because Kafka stores the messages, your consumers can catch up from wherever they last left off: Kafka maintains the point where each consumer left off and lets it pick up again whenever it wants. So Kafka can publish data in real time to your consumers, but if a consumer goes offline, or just wants to catch up from some point in the past, it can do that too. It's very flexible in how it manages these sorts of things, and that's one thing Kafka is especially good at that other systems aren't: managing multiple consumers that might be at different points in the same stream, and doing so very efficiently. And again, Kafka is not just for Hadoop; you can use it for any application that requires a scalable and reliable publish/subscribe mechanism, and I can tell you from firsthand experience that that is actually a very, very hard problem, especially in the face of an unreliable network.

Architecturally (this is basically my retooling of an image from the Kafka website), you can think of a Kafka cluster as sitting at the center of this entire world of streaming data. That cluster might be many processes running on many servers, which distribute Kafka's storage and processing. Producers are the things that generate the data: they might be individual apps listening for new log lines or new sensor data, but in any case they are applications written to communicate both with the thing that's generating your data and with Kafka. These apps are responsible for pushing data into your Kafka cluster. On the other end, you have consumers that receive that data as it comes out: as producers publish messages to topics, the consumer apps subscribed to those topics receive that data as it arrives. These consumer apps also link in the Kafka libraries to read and process that data in some way. For example, you might have a Spark Streaming app configured to talk to a specific topic on a specific Kafka cluster, receiving data in real time from those producer apps. Usually there are existing apps you can use off the shelf, so you don't always have to write them from scratch; even though the diagram says "app", that doesn't necessarily mean you'll be developing one in order to use Kafka. Often there are ones you can just use off the shelf, and Kafka even comes with some built in that might serve your purposes, so don't be too scared about that.

You can also have connectors in Kafka. These are plug-in modules for various databases that will automatically either publish new rows in a database as messages on a Kafka topic, or receive new messages on a topic from Kafka. So you can set up a database to automatically publish its changes into Kafka, or to automatically receive changes that come in from Kafka as new rows in a database table. You can see that can be pretty powerful stuff.
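To make the consumer side concrete as well, here is a matching sketch with the same Java client library. The broker address, group id, and topic name are again placeholders; setting auto.offset.reset to "earliest" is one way to illustrate the catch-up behavior described above:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class LogLineConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Consumers sharing the same group.id split the work of reading a topic;
        // Kafka tracks each group's position so it can resume where it left off.
        props.put("group.id", "weblog-readers");
        // If this group has no saved position yet, replay from the oldest message.
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("weblogs"));
            while (true) {
                // poll() returns whatever was published since the last call.
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s: %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```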
You can monitor an existing database, throw its changes onto a Kafka topic, and have some other application listening for those changes and processing them as they come in. Or you might have some producer publishing data into Kafka that you want to store more persistently, and you can have a database connected to Kafka that listens for all that new data and automatically saves it as new rows in some database table.

One last thing you can do with Kafka is what are called stream processors, which can transform data as it comes in. Your producer, for example, might be producing unstructured raw log lines out of a web server. You might have a stream processor that listens for new log lines from that log data, extracts the information you care about into a more succinct and structured format, and then republishes it on a new topic back into Kafka. Think about how you might set this up: say you have a bunch of producers that just listen for new logs on a web server, every access your web servers get across your entire fleet for some giant website. You can have a stream processor that processes those log entries in real time, extracts the few fields you care about, and republishes them on a new topic, which could go to a database connector to be stored more persistently. That's an example of the power of Kafka's architecture: you can build fancy systems that are very scalable and very fast, that can transform data, store it, and do whatever you want with it as it comes in.

How does Kafka itself scale? Well, like we said, Kafka can be spread out on a cluster of its own, so you definitely don't have a single point of failure with Kafka. You can have multiple servers, each running multiple processes, and those will distribute the storage of the data on your topics and the handling of all the publishers and subscribers (producers and consumers, in Kafka terminology) connected to Kafka. You can also distribute the consumers themselves. Say you have a consumer group: a bunch of consumer servers all configured with the same consumer group name. When Kafka publishes data to the topic that consumer group is subscribed to, it distributes the processing throughout the entire group. So you might have four servers set up to process data on some given topic, and Kafka will automatically distribute messages across that entire cluster of computers: as new data comes in, it might send one message to C4, another to C6, and the next to C3, and that way you can scale out the processing, the consumption, of that data. You can also set things up so that each consumer has its own group, and in that case each individual group gets its own copy of every message. That's what this image (which I also borrowed from the Kafka website) is trying to show. The point is, you can distribute the processing of data across a consumer group, or replicate that data to as many groups as you want; it's up to you.
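As one way to implement the stream-processor idea described above, here is a minimal sketch using the Kafka Streams Java API, which ships with Kafka. The topic names and the field extraction are made up for the example:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class LogStructurer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "log-structurer");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG,
                Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG,
                Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read raw log lines, keep just the first two whitespace-separated
        // fields (a made-up "structured" format), and republish the result
        // on a new topic that a database connector could then consume.
        KStream<String, String> rawLogs = builder.stream("raw-weblogs");
        rawLogs.mapValues(line -> {
                    String[] fields = line.split("\\s+");
                    return fields.length >= 2 ? fields[0] + "," + fields[1] : line;
                })
                .to("structured-weblogs");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```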
So with that background, let's play around for a bit. Let's kick off the Kafka service on our Hortonworks sandbox. First we'll set up a topic and, using some command-line tools, publish some data into it and watch it get consumed by another consumer that's also running from a command line. Then, for a more interesting example, we'll set up a file connector: it will listen for changes to a given text file on our sandbox, publish those through Kafka, and write them out to another file somewhere else, and we can listen in on that topic from another consumer as well. So let's see if Kafka really works, get more familiar with it, and get our hands dirty. Let's dive in.
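For reference, the command-line portion of that workflow generally looks something like the following. The exact script paths, topic name, and ZooKeeper/broker addresses are assumptions that vary by installation (this matches the 2017-era Kafka that shipped with the Hortonworks sandbox), so treat it as a sketch rather than the exact commands from the video:

```sh
# Create a topic (Kafka 0.10-era syntax; newer Kafka versions take
# --bootstrap-server here instead of --zookeeper).
./kafka-topics.sh --create --zookeeper localhost:2181 \
    --replication-factor 1 --partitions 1 --topic test-topic

# Publish: each line typed on stdin becomes a message on the topic.
./kafka-console-producer.sh --broker-list localhost:9092 --topic test-topic

# Consume in another terminal, replaying the topic from the beginning.
./kafka-console-consumer.sh --bootstrap-server localhost:9092 \
    --topic test-topic --from-beginning
```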
Info
Channel: Frank Kane
Views: 159,604
Rating: 4.9050846 out of 5
Keywords: kafka
Id: k-7lz6Ex354
Length: 9min 48sec (588 seconds)
Published: Thu Feb 23 2017