From Zero to Hero with Kafka Connect by Robin Moffatt

Captions
Okay, good afternoon — hello everyone. My name is Robin, I work for Confluent. You may have heard of Confluent: they're one of the companies that contributes to the open source Apache Kafka project, and we have our own Apache Kafka distribution called Confluent Platform. So I want to tell you a bunch of good things about how you can use Apache Kafka Connect, but I want you to tell me two things. The first is: am I speaking too fast? Not right now, but I get more and more excited, so if I get too fast please start waving at me — because, I'm sorry, I don't speak French, so we have to do this in English, but if I get too fast please just start waving. The second thing: who's using Apache Kafka today? OK, about a third of you. Is anyone using Kafka Connect today? One, and maybe one thinking about it. OK. So hopefully you're here not just because it was a choice between this talk and one or two others and this one looked alright, but because you want to know about Kafka Connect — and that's what I'm going to tell you all about.

Kafka Connect is part of Apache Kafka. You've probably heard of Kafka — everyone's been talking about it for the last few years — and Kafka Connect is part of the Apache Kafka API, so if you're using Apache Kafka, you have Kafka Connect. What Kafka Connect is, is a way of doing streaming integration between systems: upstream from Kafka into Kafka, and from Kafka out to other systems. Kafka Connect is fantastic because it avoids you having to reinvent the wheel a lot of the time, and lets you spend your time writing your application rather than writing frameworks and plumbing to do the integration. You can use it to get data from systems upstream — databases and message queues and flat files and JMS and who knows what else; there are hundreds of different systems — and stream it into Kafka using Kafka Connect. You can take data from Kafka and push it down to numerous other systems: out to databases, the same database or different databases, or to HDFS or BigQuery or S3, or wherever you want to put your data. And the beautiful thing about Kafka Connect is that it's just configuration files, which means you can build end-to-end pipelines with Kafka, integrating different systems, with just JSON.

So, this being Devoxx, I suspect a lot of people here are hardcore Java programmers, and that's fantastic. I'm a bit of a charlatan — I'm not actually a Java coder, but I can read and write a mean bit of configuration — and that's why Kafka Connect is interesting to me. But it should be interesting to you as well, because it means the tricky thing of "how do I get data from this place into Kafka, or from Kafka to this place" is something you don't have to write. I mean, it's fun solving problems that someone else has solved already, but it's not a great use of your time. Kafka Connect does that for you, so you can use it to build end-to-end pipelines, or just to take data from Kafka and push it somewhere, without having to write that piece yourself.

People use Kafka Connect in different ways. Some people start off with Apache Kafka simply as this idea of a message bus: we're just going to shunt data from here, through an integration pipeline, over there. And that's fine — Apache Kafka is fantastic for doing that, and Kafka Connect is the piece that lets you do it. We've got data sat in this expensive transactional database over here; we're going to stream it, we're going to offload it into Kafka, and then push it out to wherever we need it.
Say we've pushed it into S3 — but we also have that same data, and we say, well, let's also go and put it into HDFS, because we have two different teams who want the same data, or we're using S3 but we also want an on-premises replica, or whatever the reason: you can have the same data pushed out to different places. Not everyone realizes that Kafka persists data. When you read a message out of Kafka, it's still there until you expire it out — you say, well, it's been two weeks, or we've got ten gigabytes, so we start to age out the old stuff — but you can also persist data forever. You can keep it in Kafka forever if you want to, depending on the business case. The point is, as the data comes into Kafka you can use it multiple times: you can push it up to S3, you can push it to HDFS, you can consume it in your application, and someone else can consume it too, because when it's read from Kafka it's not deleted. So we can use the data multiple times to build out these pipelines of different types.

We can use Kafka Connect to take data that applications write into our broker and push it down to a data store. So instead of our application doing its application thing and then also carrying a bunch of application code that connects to Elasticsearch and works out "if Elasticsearch is offline then I'll have to buffer that data myself, and when it comes back online I'll push it down, and then we need to scale out, and how do I manage all of that when something breaks, and how do I handle restarts, and schemas, and all that kind of stuff" — Kafka Connect does that. Kafka Connect is a framework that is distributed, it's scalable, and it's smart: it manages what the schema of the data is, and where it got to before something went bang so it can restart — all of those tricky things. On day one it's "oh, I'll just push out to this data store"; on day two it's "oh, the data store went offline, I guess we need some retry logic"; then "oh, we need to scale out", and "oh crap, this is a bit more complicated than we thought". Then you get really, really clever, you spend a month building a solution — but it's now entirely bespoke to your system and that data store, and then someone says "well, we also need it over there, and can you pull data in from over there, and by the way, this new person joined the team, can you bring them up to speed on your bespoke solution?" There's a framework that does all of that. It's called Kafka Connect, and hopefully everyone will know it after this.

So Kafka Connect gives you the ability to push data down to data stores from a Kafka topic, and to pull data in from places as well, which means we can use Kafka Connect as a way of bridging from existing systems into new ones. We've got an existing system — the big scary naughty monolith sat on top of a database, doing all its monolithic things, writing down into a database as its state store — and we say, well, when something happens in that original application, we want to know about it, because we want to start chipping away bits of functionality into our nice new shiny services. We could start building hooks into that application — I suppose we could — but much neater is to say the database is our source of events. We can use Kafka Connect to hook up to the database, to the transaction log, and when something changes in the database as a result of our original application doing something, we can capture it: an insert, an update, a delete — any change in that database is an event.
We can capture those events using Kafka Connect and push them into a Kafka topic, and from there our application can subscribe to that topic and use them. So those are some of the ways in which people use Kafka Connect, but any time you've got data somewhere that you want to pull into Kafka, and any time you've got data in Kafka that you want to push somewhere else, you should be instinctively reaching for Kafka Connect. There may be a genuine edge case — here's the reason I'm going to write it myself — but a lot of people end up writing a bit of code that polls a database and pulls rows into a Kafka topic, just so they can then use the data and do something useful. That bit before you get to the useful thing — that's what Kafka Connect is for.

So I want to show you how easy it is to use and what it can actually do. Let me mirror my screen. All of this code is on GitHub, so I'll share the links afterwards, or tweet them if you do the Twitters, and you can try it out for yourself. What we've got here — is that big enough at the back? If I do this, can you see the text? You still awake? Good stuff — is MySQL. It's a relational database. It could be any relational database; I'm using MySQL just because I am. In this database we've got tables, because that's what you do with databases, and we've got a table called orders. You can probably guess what's in the orders table: order IDs, who placed the order, what did they buy, what was the value, where do they live — stuff about orders in a database. This is not revolutionary. We've got an application writing data to that database. It could be a third party, it could be our own application; in my case it's a data generator. So we run that, we go back to the database, we keep running that query, and the data is changing, because there's new data being written to the database. What I'm going to do now is leave something running that just sits there, polls the database, and shows us that new data is arriving — we've got a create timestamp on there, which is UTC, so a couple of hours behind local time — new data being created all the time.

What I'm going to do now, in a new screen, is create our connector. Let me paste that and then talk about it. What we've got up here — let me show it on the slide, because it's clearer — is that I'm going to send some configuration to the REST endpoint: Kafka Connect is configured using REST. We say: here's our connector class, here's what we're going to connect to — we're connecting to MySQL. With Kafka Connect you simply give it configuration. Which kind of connection is it? Well, this one is a MySQL one; it could be a JMS one, or a flat file one, but this is MySQL. How do we connect to it? That's how we connect to it. This is the particular table we want from it, and there's a bit of other config and stuff down the bottom. We run that, and it says, OK, I've created it. And we say: well, have you? Let's just double-check, because this is a live demo and things like to break in live demos. So we run another call against the REST API, which asks: what connectors have you got, and what's their status? There's our connector, it's a source connector, it's called that, and it's RUNNING — which is marvellous news, because it means that hopefully, when we do the next bit, we'll actually see some data.
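For reference, that REST call looks roughly like the sketch below — assuming the connector in the demo is Debezium's MySQL source connector, which is one common way to stream a MySQL table into Kafka; the hostnames, credentials, server name and table name here are illustrative placeholders, not the exact values from the demo.

  curl -X POST http://localhost:8083/connectors \
       -H "Content-Type: application/json" \
       -d '{
         "name": "source-mysql-orders",
         "config": {
           "connector.class": "io.debezium.connector.mysql.MySqlConnector",
           "database.hostname": "mysql",
           "database.port": "3306",
           "database.user": "debezium",
           "database.password": "dbz",
           "database.server.id": "42",
           "database.server.name": "demo",
           "table.whitelist": "demo.orders",
           "database.history.kafka.bootstrap.servers": "kafka:29092",
           "database.history.kafka.topic": "dbhistory.demo"
         }
       }'

  # The double-check against the REST API: list the connectors, then ask for status
  curl -s http://localhost:8083/connectors
  curl -s http://localhost:8083/connectors/source-mysql-orders/status

Whatever source connector you use, the shape is the same: a name, plus a config block containing the connector class and its connection details.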
So let's go and have a look at the data over here. This is just polling — consuming from, rather — the Kafka topic, so it's showing us a continuous stream. If I switch my windows to something useful like this: on the right-hand side we've got Kafka, on the left-hand side we've got MySQL. This is our application writing to a database; this is just a database with data arriving; and this is a Kafka topic showing basically a near real-time feed of that data — a full snapshot of what was in the database, plus every single subsequent change, and not only inserts but any updates and any deletes as well. So we have a full copy of all the events happening in the database, in a Kafka topic. That's useful, because we can take that data in the Kafka topic — which is a copy of what's in the database — and shove it down to another data store, which is what we're going to do in a moment (to two data stores, in fact, because that's twice as good). But we can also use that data to drive our applications. Say we've got a service that wants to know when someone creates an order. We could go to the original application and say, "when someone creates an order and you write it to the database, can you tell me about it as well?" — that's a bit icky. We could start polling the database ourselves directly — that's also a bit crappy. Or we could simply say: when something changes in that application, let's generate an event out of the database, and now anyone who wants to know about that order has the full payload available. That's the whole point of an event: it's a notification that something happened, and it's the state — here's what happened.

So we've got our data in a Kafka topic from running one configuration file. What we can do now is push the data down elsewhere, and I'm going to send it to a couple of places. First I'm going to send it over to Elasticsearch: we've got the data about orders, and we want to do some search on it, or some real-time analytics, so let's go and put it into Elasticsearch. Again we've got the connector class — we're using the Elasticsearch sink, because we want to sink the data to Elasticsearch. How do we connect to Elasticsearch? Here's the connection URL. Which topic do we want? That topic. It's fairly self-evident how it works. We run that, and it says it's been created. Then we run another one, and this one is going to send the data to Neo4j.

One of the reasons you'd want to put data into Kafka and then route it to other places is that you can start to use the most appropriate tool for the job you're trying to do, rather than shoving everything into one generic box that does most things quite well but none of them particularly well. We can use the best technology for the job at hand: I want to do some search indexing, I'll use Elasticsearch; I want to do some graph analysis, I'll put it into a graph database like Neo4j. But it's the same stream of data, a near real-time replica from the database. We're not connecting each of these things together with bits of string, where one thing changes and the whole lot goes bang; we're saying we've got the stream of events in Kafka and we can route it to here, here and here, and if one of those things changes or goes offline, the others aren't impacted. So we've got the data flowing between them all. If we run the status check now, we've got two sinks: one for Elasticsearch and one for Neo4j.
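The Elasticsearch sink created here follows exactly the same pattern. A minimal sketch, with the connection URL and topic name as illustrative values — the key.ignore and schema.ignore settings are typical for this connector but depend on your data, and the Neo4j sink is configured the same way, just with its own connector class and connection properties:

  curl -X POST http://localhost:8083/connectors \
       -H "Content-Type: application/json" \
       -d '{
         "name": "sink-elastic-orders",
         "config": {
           "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
           "connection.url": "http://elasticsearch:9200",
           "topics": "orders",
           "type.name": "_doc",
           "key.ignore": "true",
           "schema.ignore": "true"
         }
       }'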
And all of them are running — the demo gods are smiling on me indeed. So what we're going to do now is actually have a look at this data. Whilst I open up Kibana and Neo4j, we can just poll the Elasticsearch REST API and say: show me the order information. Let's put that on the right-hand side; on the left-hand side we've still got the database, with new data still arriving. So in the database we've got order IDs 454, 457 and so on — and that's this one here in Elasticsearch. If we keep polling it: 465 in the database, 465 here; 468, 471 — it's exactly as it is in the database, streamed into Kafka and now, in this case, into Elasticsearch. So we could have an application that decides to use Elasticsearch as its state store. We can also use Kibana on top of Elasticsearch to give us a real-time visualisation of the data; we can start noodling around in it, inspecting it, analysing it, digging into it — all the good stuff that Kibana and Elasticsearch are fantastic at. We can also go and use a graph database and start poking around the data that way: looking at the data from a different slant, where instead of the rows-and-columns idea of data we define entities and how they relate, and explore it like that. I quite like this one just because it's visually appealing — you look at it and go "oh wow, that's cool" — but graph analysis is also genuinely useful, because it lets you answer questions that relational doesn't always handle so easily. You can say: here's the customer, where do they live, what kind of car did they buy? Then dig into that: who else bought that same model, and where do they live? And you can start relating things — people who live in this place tend to buy that type of car — and all that kind of cool stuff.

So this is the kind of thing you can do with Kafka Connect: one configuration file streams the data in from a database, another configuration file streams the data from that topic over to Elasticsearch, or over to Neo4j — and you use the data wherever you need it, to do the particular job at hand. Kafka Connect pulls the data from the source system, and Kafka Connect pushes the data down to the target system.

So the demo worked, which is always nice — it doesn't always, but it did today — and I've walked through setting it up. Now, when you're using Kafka Connect there are a few little speed bumps, shall I say, along the way. In England we call them sleeping policemen — do you call them that over here? Those bumps in the road that the car goes over — yeah, the same, OK. Kafka Connect is fantastic, and it's what you should be using most of the time when you're doing integration, but there are a few of these little sleeping policemen, these little speed bumps, that you might come up against when you start using it, and that's why I wrote this talk: because Kafka Connect is so important and so useful, but it can sometimes be a bit of a pain to understand what's going on. I've shown you the happy path, the perfect path — "oh, isn't that simple" — but now I want to explain some of the bits under the covers, so that when you are using it you can understand some of the things you might run into. Kafka Connect, in a sense, abstracts away for data engineers this idea of having to write scary code — it's just configuration, which makes people like me happy, because I can use it — but understanding what's happening under the covers is useful, because it lets you make more sense of how you deploy it and how you configure it.
So you've got your source system, you've got your Kafka Connect box in the middle, and then your Kafka brokers. If you're dealing with a sink it's the reverse: you've got Kafka, then Kafka Connect, and then your target system. I'll talk about how you actually deploy Kafka Connect later on, but an important point to reiterate is that Kafka Connect does not run on your Kafka brokers. The only thing that runs on your Kafka brokers is your Kafka broker — maybe ZooKeeper, though a lot of people would disagree even with that. Your Kafka brokers are for Kafka brokering; everything else — your streams applications, your Connect, your whatever — runs on separate hardware, or containers, or pods, or whatever.

Within Kafka Connect, things break down into pieces. The first piece is the connector plugin, and the terminology gets a bit confusing, because you've got Kafka Connect, you've got connectors, you've got configuration, and you've got converters as well — I'm English and I find it confusing. The connector is the only technology-specific piece of the puzzle: it defines how to integrate with a particular technology. If you're pulling from a database, that connector JAR understands how to talk JDBC, or how to read a binlog, or whatever; if you're pushing down to Elasticsearch, it understands the Elasticsearch API — how do I take this piece of data and put it into this specific system. After the connector plugin, everything is abstracted: you specify the particular connector class you want to use, and the connector then passes around internally — this is under the covers; you don't really see it unless you're writing your own — a connect record, which is a generic representation of the payload plus the schema. Schemas are this dirty word that some people don't like, but you kind of have to deal with them, because a lot of the time the data you're pulling in from a source system has a schema: if it comes from a relational database it's got a schema, and if it comes from something else it almost always has at least a loose schema around it. So you've got a payload and you've got a schema, and that gets passed around internally as this generic representation.

From there on in, it's all reusable components. The next one is the converter. The converter deals with taking that generic connect record and actually writing it to Kafka — serializing it. Kafka messages are just key-value bytes: Kafka doesn't care whether it's Avro or JSON or Protobuf or who knows what; it's just bytes. But as users of Kafka, and as developers with Kafka, you should care very, very deeply about how you serialize your data, because there are good ways to serialize your data and some really crappy ways, and a lot of the time people go "eeny meeny miny moe — that's the one I recognise, so I'll just use that". But schemas matter — they matter an awful lot. My colleague Gwen Shapira has this great quotation: schemas are your contract, they're your API between your services. If you're using events to integrate between your services — and a lot of the time you will be — schemas define that contract, and if you don't care about your schemas, if you're not loving and cherishing them and looking after them, things go badly wrong. But they go badly wrong six months later, after you've already committed to "well, that one looks easiest".
"We'll just use that" — and then it's one of those "oh bother, we should maybe have done something different" moments, and now you have to backtrack and unpick things and do them the proper way. So if you're not using Kafka yet, my strong recommendation is: use something like Avro. If you're not sure, use Avro. If you're strongly opinionated, use something else, but at least know why you're using it, rather than "oh well, JSON sounds good enough". JSON itself doesn't have an innate representation of schema, so when you're passing these messages around you have to keep that schema somewhere else — you're emailing it to each other, you're checking it into GitHub, you're doing something bespoke and specific. If you use something like Avro, it has explicit support for schemas. So whether you're using Kafka Connect, or writing your own Kafka producer, or using something like KSQL, the data gets serialized into Avro, which gives you a nice small binary message that goes onto the Kafka topic, and the schema — which we need, and know, and love — goes into the Schema Registry, while the Avro message gets a little byte on the front carrying an identifier for that schema. Then, when we come to use the data — whether it's Kafka Connect as a sink, or our own application with a Kafka consumer, or KSQL, or whatever — the deserializer says: I'll read my Avro message, there's my little byte with the identifier, I'll go and fetch that schema, and now I can deserialize it.

So schemas are super important. Schemas give you a kind of quality control over your pipelines, because nobody can start putting crap onto the topic: the serializer will reject it, or throw an exception, or say "that doesn't match the schema of the other stuff". If you're just using JSON, you as a consumer have to cross your fingers that the people upstream don't muck things around, because if they do, your stuff falls over — and then it's "why the hell did that happen?". Most solutions to that end up with some kind of human coupling: "I know, we'll have a change committee, and if you want to change that then we'll have a sign-off process", and all that kind of crap. If you use Avro, or something like it, it actually enforces that governance of the schema: you cannot put something on the topic if it doesn't match, if it's not compatible with the schema you've declared. That's enough about schemas — but they matter an awful lot. If you're starting out with Kafka, or you're using Kafka and you're not giving much thought to how you manage your schemas and serialize your data, please, please do so, because it will save you a lot of trouble later on.

So: you've heard the riot act, you've understood that schemas are important, and you've said "I know, we'll use Avro for our data" — or "no, screw that guy, we'll use JSON". Whatever you're going to use, you have to specify it in the configuration. You can specify it globally — your Kafka Connect workers, which I'll talk about later, can have a default — and then when you're consuming from a particular topic, perhaps because you're writing a sink and someone else wrote to that topic and they've been a naughty boy or girl and used JSON, you can override it and say: OK, this particular topic we have to read as JSON. So: connectors are the technology-specific bit, which pulls data in from the specific source system or pushes it down to the specific target system; converters take the generic message from that connector and serialize it into Kafka — or, on the sink side of things, take the bytes out of Kafka and deserialize them into a generic connect record.
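Concretely, that looks something like the sketch below: worker-level defaults go in the worker's properties file, and an individual connector can override them in its own JSON config. The Schema Registry URL is an illustrative value.

  # Worker defaults (e.g. in connect-distributed.properties)
  key.converter=io.confluent.connect.avro.AvroConverter
  key.converter.schema.registry.url=http://schema-registry:8081
  value.converter=io.confluent.connect.avro.AvroConverter
  value.converter.schema.registry.url=http://schema-registry:8081

  # Per-connector override, inside a connector's "config" block, for a sink
  # that has to read a topic someone else wrote as schemaless JSON
  "value.converter": "org.apache.kafka.connect.json.JsonConverter",
  "value.converter.schemas.enable": "false"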
There's a third bit in the middle: transforms. Single message transforms are optional — you don't have to use them — but they're useful if you're doing some light transformation in your pipeline: data is coming in and you want to mask this field, or drop this field, or rename a field, or change a data type, or change the topic name, that kind of thing. You can use single message transforms for that. They operate on a single message at a time: you can't build aggregates, and they're not designed for doing lookups against other systems. When you get to that level of transformation work, that's when you crack out Kafka Streams, or KSQL, or another stream processing library, and say, right, now we need to do stream processing. Single message transforms are part of your pipeline: as shown here, the data comes in, passes through a transform — one, or many, or none — then through the converter, and into Kafka.

For example, say you're pulling in data from an upstream source system: it's an HR database, and it's got loads of sensitive information in it. You're pulling from a table, and HR say there is no way you're storing that chunk of the message in Kafka — it cannot even hit Kafka. You can use a single message transform to say: we've got the full payload, we're going to drop fields x, y and z, and only the remaining columns go through to Kafka. Or, the other way around: in the Kafka topic we've got this nice wide message with a bunch of stuff in it that's useful for lots of consumers, but we're now pushing it down to a target system that only wants a handful of fields. We could make them do the work of processing the full message, or we could simply say, on the sink side, as it passes through in reverse before we write it to the target: drop these particular fields, or mask these fields, and only push the others down. It's a very lightweight but very elegant way of managing particular fields within your data.

To configure them — and it's not particularly pretty, I'll be the first to admit — you add to your configuration with the transforms prefix. You label each of your transforms: here we've got two, one labelled fubar and one called addDateToTopic, which is a much more sensible name. Then each transform has its own configuration, under transforms dot label: it's got a type — in this case a TimestampRouter — and the TimestampRouter in turn has its own configuration, the topic format and the timestamp format, so it takes the topic name and appends the timestamp to it. When we're streaming data down to Elasticsearch, for example, the Elasticsearch index will then carry the timestamp, and we can do timestamp-based partitioning. The one labelled fubar — it's got a silly name — has the type ReplaceField, which renames delivery address to shipping address. It's those little changes that let you normalize your schemas: your business terminology says the field should be called this, but the source system called it that. You could bring in the source system's copy as-is and make everyone who uses it remember to switch the names over; or you could write a new application that consumes it, renames the field, and writes it back to a new topic; or you could just fix it at the point of ingest, so the corrected record is what lands in Kafka.
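The two transforms described there would look roughly like this inside the connector's config block — an annotated fragment rather than complete JSON, with the field names (delivery_address, shipping_address) and the date format as illustrative choices:

  "transforms": "fubar,addDateToTopic",

  "transforms.fubar.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
  "transforms.fubar.renames": "delivery_address:shipping_address",

  "transforms.addDateToTopic.type": "org.apache.kafka.connect.transforms.TimestampRouter",
  "transforms.addDateToTopic.topic.format": "${topic}-${timestamp}",
  "transforms.addDateToTopic.timestamp.format": "yyyyMMdd"

The first renames a field in the message value; the second rewrites the topic name to include the message timestamp, which is what drives the timestamp-based Elasticsearch index names mentioned above.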
All of this is extensible. I can't, but I'm sure all of you could crack open the API and write your own connector, your own transforms, your own converter — and people do. There's a huge number of existing connectors and transforms and converters: you can get them off Confluent Hub, you can find them on GitHub, you can find them all over the interwebs. But if there isn't one already written for a particular technology, you can write your own, and the brilliant thing is that all you have to do, if it's a source, is answer "how do I pull a record from this funky database and expose it?" — and Kafka Connect does all the rest. Kafka Connect can apply transforms, Kafka Connect will serialize it, Kafka Connect will write it to Kafka, Kafka Connect will handle failures. Kafka Connect does all of it, except for that little interface between Kafka Connect and the source system — or, going the other way, between Kafka Connect and your target system.

So that's how you go about configuring it, and it's important to understand those pieces — I lectured you at length about schemas and converters because they matter so much. Probably the biggest problem I see people having with Kafka Connect is going to Google, going to Stack Overflow, finding some configuration and running it — "yeah, I've got data in a Kafka topic, because I pulled it in from here" — then finding another one and running that: "I'm going to take data from this Kafka topic and push it down to there... why is it not working?" It's because you're trying to get a round shape through a square hole: you pulled it in as Avro and you're trying to write it out as JSON. You can't just pick and choose like that — you have to be conscious of how you're serializing your data.

Once you've configured your connectors, you can start deploying and running them. Like I said, Kafka Connect runs on its own instance / bare metal / container / pod / whatever, but before we even get to that we have to understand logically how it runs. When you create a connector — say an S3 sink connector — the connector is your logical definition, and it runs using one or more tasks. If we're writing to S3, maybe we only have a single task, because we can't parallelize it. If we've got a connector pulling in data from a database, maybe that single connector is pulling multiple tables, so why not pull from those tables concurrently? Then you'd have multiple tasks. This is specific to each connector: the person who wrote it will have decided whether it makes sense to parallelize the work. If someone says "do a select star from every single table in the database", technically could I parallelize that? Yes. Would my DBA hate me? Probably. But that's up to the user to configure, so Kafka Connect will scale out the work if you let it. Those tasks execute within workers, and now we get down to the nitty-gritty: workers are the actual JVM processes. When you actually run this thing, where does it reside? It's a JVM process called Kafka Connect, and within that JVM process you've got your different tasks being executed.
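How far a given connector fans out into tasks is controlled with the tasks.max setting alongside the connector's own configuration. A sketch, assuming the Confluent JDBC source connector pulling several tables, with illustrative connection details and table names:

  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "connection.url": "jdbc:mysql://mysql:3306/demo?user=connect&password=connect-secret",
  "table.whitelist": "orders,customers,products",
  "mode": "incrementing",
  "incrementing.column.name": "id",
  "tasks.max": "3"

With three tables and tasks.max set to 3, the connector can run one task per table; set tasks.max to 1 and the same connector does all the work in a single task. tasks.max is only ever an upper bound — the connector decides how many tasks it can actually make use of.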
But it doesn't stop there. Kafka Connect — just to confuse you, and I have to say it does feel slightly like that — has two different deployment modes. So the first speed bump, the first sleeping policeman to watch out for, is: how am I serializing and deserializing my data? The second one is: how am I running Kafka Connect? Because there are two modes, standalone and distributed, and the question is which one to pick. Most people say: well, I'm just starting out with Kafka Connect, I've just got a little laptop; standalone sounds safer, distributed sounds big and scary — like Hacker News, like web scale — which maybe I'll get to eventually, but I'll start off with standalone because that sounds simpler. Unfortunately that's not the best choice. The best choice, almost always, is distributed. Distributed runs perfectly well on a single node; the point is that it can be run distributed.

When you run standalone, it's just a standalone worker. The offsets — where your connectors have got to — are stored in a flat file, and its configuration is stored in a flat file. If you need to scale it out — say you've got a JDBC connector pulling data in from a database and an S3 connector pushing that data to S3, and you reach the capacity of a single machine — you can't really scale it out, other than a sort-of scale-out, which is to deploy two workers with one connector on each. So now we've "scaled out": two JVMs, one connector on each. But our company gets bigger and bigger, there's more and more work on the database we're ingesting from, we're ingesting so much we can't keep up with the throughput, we've saturated the network — and now we've got nowhere to go, because in standalone mode you can only have a single JVM per connector; a connector cannot be spread over multiple workers. The other problem is that it's not fault tolerant: if that machine goes bang — or that pod, or whatever we're running on — we've lost it, and we have to hope we persisted that offsets file somewhere, because that's important runtime state we would otherwise have lost.

This is where distributed mode comes in. Distributed is exactly the same — a JVM process with tasks running within it — but the offsets, the configuration, the status, the tasks: they all get persisted in Kafka itself. Remember right at the beginning I said Kafka persists data, and that not everyone realizes Kafka can persist data forever? Kafka has something called compacted topics, where it's guaranteed to always have available the latest value for every single key, regardless of when that message was written — ten years ago, a thousand years ago, whenever; it always keeps the latest value for every key. So Kafka Connect uses Kafka itself as its own store of state, which is pretty clever, because Kafka itself is a distributed system — it's fault tolerant, it's resilient, and so on — and we can rely on Kafka storing that data for us safely. That means we can scale out: we add another worker because we need more capacity, Kafka Connect says "oh hey, we've got a second worker", and it forms a cluster — and we do that simply by saying the new worker is on the same group ID.
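That group ID, and the Kafka topics that hold the state, are declared in the worker's properties. A sketch of a distributed worker configuration (connect-distributed.properties), with an illustrative broker address, group ID and topic names:

  bootstrap.servers=kafka:29092

  # Workers that share the same group.id join the same Connect cluster
  group.id=connect-cluster-01

  # Connect's own state lives in Kafka itself, in these compacted topics
  config.storage.topic=_connect-configs
  offset.storage.topic=_connect-offsets
  status.storage.topic=_connect-status

  # Default converters, which individual connectors can override
  key.converter=io.confluent.connect.avro.AvroConverter
  key.converter.schema.registry.url=http://schema-registry:8081
  value.converter=io.confluent.connect.avro.AvroConverter
  value.converter.schema.registry.url=http://schema-registry:8081

Start a second worker anywhere else with the same properties and it joins the cluster; there are no local files to copy around.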
And it just does it — we don't have to start copying files around and asking "where's that status file, where's that offsets file?", because it's all in Kafka. So it's a distributed system built on top of a distributed system, and it's also fault tolerant: if a worker goes bang, Kafka Connect just runs the workloads on the remaining workers, and we can scale out and back in as we want to. Kafka Connect manages that for us.

With Kafka Connect in distributed mode we can also still partition things: we can still have independent workers for different connectors. It gets a little bit confusing here, but you don't have to have one great big cluster. Don't take away the idea that distributed mode means your company runs a thousand-node Kafka Connect cluster and everything runs in it. You could do that — some people do — but a lot of the time people deploy separate Kafka Connect clusters per connector type, or per team, or per function, or per whatever. So you get the distributed nature of it, the resilient nature of it, and the workload isolation as well. In Kafka prior to 2.3 there was an issue with rebalancing: if a task changed, or a worker went bang, it did basically a stop-the-world rebalance — if you've read about Kafka Connect you may have come across this. As of Kafka 2.3 that's no longer a problem, so you can have bigger clusters and worry less about the impact of rebalancing.

For deploying Kafka Connect you can run it containerized if you want: there are Connect images on Docker Hub. If you want to create your own image, all you need is the particular plugin JAR to drop into it — the plugin JAR you get off Confluent Hub — and you can either install it at runtime, using something like Docker Compose to pull it down and then launch the worker, or build your own Dockerfile from it and store the image in your own repository. You can also automate the creation of the connector itself: that was about building an image, but once you've spun it up you've just got a Connect worker with no configuration, and you can automate the creation of that configuration too — launch the worker, wait for it to start and for the REST API to be available, and then just say "here's a lump of JSON, go and create it".

Troubleshooting. I could lie and say there's no need for this because it's all simple, but I've already alluded to the fact that it's not always happiness and roses. Sometimes with Kafka Connect you do need to do a bit of debugging, and Kafka Connect, again, likes to mess with you a little bit. The first thing you'll do is ask Kafka Connect: which connectors have I got, and what's their status? And it will say: you've got a connector, and it's RUNNING. And you say: yes, my connector is running — so where the hell is my data? I've got no data. And Kafka Connect says: well, hello — the connector is running; your tasks aren't. Your tasks are screwed, but the connector is running. That's down to its nature: the connector is your logical entity, and the work is carried out by one or more tasks underneath it. As Kafka Connect sees it, a connector can be created and running while one or more of its tasks have failed — and the aggregate of those failed tasks is that nothing happens. So you always have to look at the status of both.
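Checking both levels is a couple of calls against the standard Connect REST API; only the connector name below is illustrative:

  # Which connectors exist on this cluster?
  curl -s http://localhost:8083/connectors

  # The connector's state AND the state of each of its tasks; a FAILED task
  # carries its stack trace in the "trace" field of this response
  curl -s http://localhost:8083/connectors/sink-elastic-orders/status

  # Once the underlying problem is fixed, a failed task can be restarted
  curl -s -X POST http://localhost:8083/connectors/sink-elastic-orders/tasks/0/restart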
You can use the REST API to get a stack trace out of a task that has failed and start poking around there, which is usually quite useful, but almost always what you'll need to do is dive down into the Connect worker log to work out what's actually going on: what failed, and what caused the task to abort. Kafka Connect does have some error handling built in, and it has improved in recent releases: it has the ability to skip over bad messages, and the ability to send messages to a dead letter queue.

There's one error message you'll definitely see if you take the Stack Overflow approach of randomly jiggling things until they work — and then they don't work — which is what happens when you try to use the Avro converter to read data that is not Avro. Remember, when I was talking about schemas, the little magic byte on the front of the Kafka message that identifies the schema? If the Avro converter tries to read a message that's not Avro, it can't read it. So if you have a topic with JSON data in it and you tell Kafka Connect to use the Avro converter, Kafka Connect will say "huh, that's not Avro" and throw. You'll also get that message if you've got a topic with both — which is what you end up with when you're doing the random jiggling of "I'll try this, I'll try that, oh crap, it's broken" — because you'll have bits of JSON, bits of Avro, bits of who knows what, and your connector will try to deserialize everything using the converter it's been told to use, and as soon as it hits something it can't make sense of, it falls over.

Which is where the error handling and the dead letter queues come in. By default Kafka Connect will fail fast, which sounds good, but it means it falls over and cries: it says "I can deserialize those first two messages; that third one, I don't know what it is", so it writes the first two down to the target sink and then aborts, and you have to go along and pick up the pieces. That might be exactly what you want: maybe you're writing to a target system that cannot have missing data or corrupt data, and you would much rather stop processing than send down anything you shouldn't. Or it might be a case of "just send me some damn data — I don't mind if you skip some of the bad records". In that case you can set the error tolerance to all: if it can't deserialize something, that's OK, it keeps going. You can also say: if you hit a message you can't process, send it to another topic. That way all of your good messages get down to your sink as they should, and your bad messages get written to a dead letter queue topic — which is just a Kafka topic, so you can go and inspect it with something like kafkacat. And say you've got a source topic that's written to by different systems: it's supposed to be Avro, but someone didn't get the memo and is writing valid data, just in JSON. You can use this pattern to say: we'll consume the Avro data, we won't fall over when we hit JSON — we'll write that to the dead letter queue topic — and then we can consume from that topic with a JSON converter and try to read it that way. So you get two passes at the same data without dropping any of it.
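Those error-handling options are just more connector configuration. A sketch of the relevant settings for a sink connector — the dead letter queue settings only apply to sinks, and the topic name here is illustrative:

  "errors.tolerance": "all",
  "errors.log.enable": "true",
  "errors.log.include.messages": "true",
  "errors.deadletterqueue.topic.name": "dlq_orders",
  "errors.deadletterqueue.context.headers.enable": "true"

Left at the default tolerance of none, the connector fails fast as described above; with tolerance set to all plus a dead letter queue topic, bad records are diverted there instead of killing the task.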
To monitor what's going on you've got the REST API, which I showed you in the demo, so you can basically iterate through the connectors that exist and look at their status. You can also use Confluent Control Center to look at the throughput you're getting and the lag across the different consumers — the connector tasks — and there are JMX metrics as well, so you can dig down into things as much as you want to.

So I hope that's been useful, and I hope it's opened your eyes to what Kafka can do in terms of integration. I hope that next time you need to do integration you won't just write it yourself — you'll at least think of Kafka Connect. And if you want to try out Kafka and you haven't already, Confluent have a managed, hosted Kafka solution with KSQL and Schema Registry managed for you in the cloud, so you can go along and give that a try. With that, thank you very much for your time. I think I've got three minutes for questions, if I may — but before that I have a question for you, which is: may I take a speaker selfie with you? I like to collect these, and sometimes they turn out alright. So if you want to look like you had fun and it was interesting — I'll not take a picture of the screen, I'll take a picture of you lovely people. Thank you very much.

Any questions? One, yeah. So the question is: what would you use to aggregate data, given that I said not to use single message transforms for that? I would recommend something called KSQL, which is a SQL language you can use for stream processing: just as you would write an aggregate statement in a database, you can write the same SQL, but it runs as a continuous application. I can show you some links afterwards that demonstrate it. If you want to get down into the Java libraries there's Kafka Streams, and there are other stream processing libraries as well, but I would start off with KSQL. Has anyone tried KSQL, by the way? Oh — that's my talk for next year, then. KSQL is really good. Any other questions? OK, thank you very much. [Applause]
Info
Channel: Devoxx
Views: 26,715
Keywords: conference, developers, microservices, paris, Apache Kafka, Integration, Data Streaming
Id: Jkcp28ki82k
Length: 44min 40sec (2680 seconds)
Published: Thu Nov 14 2019