Processing Streaming Data with KSQL - Tim Berglund

Captions
Good morning, thanks for being here. We're going to talk about KSQL. KSQL is a thing to help us process streaming data that is in Apache Kafka. Now, in a room like this, with big bright lights up here and no lights on you, I can just barely see you, so when I ask for hands I'm going to need enthusiastic hands held high. Who here is using KSQL right now, in development or in production? No hands. Using Kafka right now? That looks like fifteen or twenty percent. Very new to Kafka, not so sure about it, just a little bit familiar? All right, good. I just wanted to get to know you a little.

Let me tell you something about me. My name is Tim, that's true. I come from the United States; I live in Denver, a city by the mountains, and I work for a company called Confluent, where I run the developer relations team. My team's responsibility is, among other things, things like this: talking about the stuff that we do and trying to make it easy for people to adopt Kafka and elements of the Confluent Platform. I'd love to talk with you later in the day if we bump into each other and you have more questions.

It's probably a good idea to give everybody a little overview of Kafka. Most of us have the sense that it is a distributed messaging platform, and some of us are still holding on to that, which is very understandable. The Kafka of a few years ago really started life as a giant, scalable messaging queue, and under the covers, at its core, it is that. But it is a little more than that, and it has crept into a much larger architectural role. For something like KSQL to make sense, for why you would even want a technology like KSQL, you need to understand architecturally what Kafka is all about. So I'm going to give you a tiny overview of Kafka and a sense of what it wants from you architecturally, how it wants you to build systems. Then we'll dive into KSQL, look at the syntax, talk through some of the concepts in it, and of course there will be live coding, because who doesn't want live coding? And if you have questions, I always say put your hand up, but not in this room; if you have a question, see me afterward.

Right here you have the simplest Kafka architecture diagram you will ever see. It's also the most general: every system built on Kafka, every piece of the Kafka and Confluent platform, basically conforms to this diagram. Kafka itself is the thing in the middle; that's what we'd call the Kafka cluster. That's composed of a number of machines we call brokers, and brokers are responsible for the pub/sub functionality: for receiving new messages, for serving up messages to consumers, and for storing them. That's what a broker does. It stores messages, you can publish messages to it, and you can subscribe to messages in it, and that's kind of it. Obviously there are hours and hours of things we could talk about in there, how Kafka manages replication and failover and consistency and all kinds of cool things like that; it has a story. But for now, let's just say there is a cluster of brokers and they cooperate in the task of being a messaging system.
Every application that puts messages into Kafka is a producer. The producer is an API, a public API and even a public wire protocol, so you can write your own language binding for it if you want, for getting messages into topics in Kafka. So when I say producer, I'm referring both to an API that we can look at on its own and to an application, a program that you write, that puts messages into Kafka. Likewise with the consumer: that's an API and a wire protocol, and it also refers to the program you write that subscribes to messages in Kafka and does something with them, does some kind of computation on them and puts the computed results somewhere else. Producer, consumer: everything you ever do with Kafka is going to be either a producer or a consumer.

Now, those APIs, producer and consumer, are super low-level. If you've programmed against them, you know; if you haven't, you can google them and it'll take you about five minutes to look at the example code and get it. Let me tell you a little about Kafka's data model without even leaving this slide, so you have a sense of what I mean when I say the producer and the consumer are low-level APIs. A message in Kafka, when I say write a message to Kafka, is a key-value pair. That has some sort of typing in whatever language you're using; Kafka under the covers doesn't care about the type, but our language probably will, so the key might be a string and the value some blob of JSON or whatever. The message is a key-value pair, and it's probably pretty small: a few tens of kilobytes is already getting into the big-ish range for Kafka. You don't usually put media in messages; messages usually correspond to events, which I'll talk about in some other talks later today.

You put that key-value pair into a thing called a topic. Inside the brokers, inside that Kafka cluster, there can be many, many topics, and topics are just named queues of messages. That's the data model; that's what there is to know about how data is organized in Kafka. The producer API, all it knows how to do is say: there's a topic named movies, here's my little movie key-value pair, and I'm going to write it into that topic. That's it; that's all you get. The consumer is the same thing. Say there's a topic called movie ratings, borrowing from the example later on. In the consumer you say, hey mister broker, let me know any time a message shows up in the movie ratings topic, and when one does, there you get it: a key-value pair. Here's your key-value pair, do some kind of computation with it.
Very simple. I don't even have any code illustrating that, but I think you get it; even if you've never thought about that API before, you can understand it in two minutes of talking. That's the good side. The downside is that the API is not doing much for you, and that's going to get to be a pain later on. We'll come back to that; it's really what KSQL is all about. So: producer and consumer, very simple, easy to understand, and also not helping you out a lot.

Let's look under the covers of Kafka a tiny bit more so you can appreciate what it is doing for us. The big box in the middle is a topic. Like I said, topics are the fundamental unit of organization in Kafka; when I write messages into Kafka, I'm writing them into a topic, and you see the producer over there on the left is writing to the topic. The topic, though, in order to scale, is split up into partitions. When you create a topic, you get to specify how many partitions you'd like it to have. In the demo later on I'm not going to do that; they're all going to auto-create with a single partition, because that fits nicely on a laptop. But in real life you don't want a single broker, a single machine, to have to do all the work of storing messages and all the pub/sub, because these topics might get big; the system might scale to large sizes. So we take topics and split them up into partitions, and each partition can live on its own broker.

When the producer produces, it says: I've got this key-value pair and I want to write it into the movie ratings topic, and it has to decide which actual partition to write it to. It usually does that by taking the key and hashing it, so messages with the same key will always land in the same partition by default, which means messages with the same key are always consumed in order. But in general we don't know whether we're consuming all of the produced messages in order; once you partition, you give up on global ordering, and that becomes an interesting topic we could say more about in a different discussion. Basically, partitioning is what we do.

Consumers, over on the right: there is some interesting functionality packed into consumers to help us scale, because we don't really know in general how expensive the work of consumption is going to be. Some consume processes might be really lightweight, some might be really expensive, and for some expensive consumer computation you're going to need to scale that thing out, and you might want to scale it out for fault tolerance anyway. In this diagram we have three partitions and two different consumer applications, called consumer group A and consumer group B. They're doing completely different things; we don't know what they are, they're just different applications consuming those messages in different ways. Both of them have two consumers, and as you can see if you follow the arrows, the very top consumer group A instance is consuming partitions one and two, and the second consumer group A instance is consuming partition three. That assignment is automatic.
So if I started up a consumer group C and deployed only one instance of that consumer application, it would be consuming all three partitions, and I can elastically scale that application just by adding new instances: spinning up new containers or however it is I deploy code, I can spin up a second instance, a third instance, and for as many partitions as I have, I'm able to scale out that consumer group. That, by the way, is automatic: when you're writing your consumer code, you don't have to do anything directly to enable it. By virtue of being an application that consumes messages, you have access to elastic scalability and fault tolerance right out of the box, which is pretty cool. You've built a little elastically scalable, fault-tolerant distributed system without even trying.

That's the basics: topics; topics are partitioned; producers hash the key to figure out which partition to write to; consumers consume from partitions, and depending on how many consumers you have, the cluster will assign partitions to consumers as needed. The fundamental data structure, the abstraction at the very bottom of all of Kafka, is inside each partition: each partition is simply a log. This is what Kafka is. The producer puts new messages on the end of the log; that's the only place it can write them. Once it has written them, they do not change. I cannot go back to the message at offset 3 and alter it; it is immutable once written, and that is a very important principle that enables a whole bunch of functionality. Whole architectural paradigms are now possible because that data is immutable. Producers produce to the end, and I can have, in theory, any number of consumers reading from this topic. Here I have two of them, and they're at different offsets; that's okay. Consumers can seek backwards to a particular offset or a particular timestamp in a partition, or if they're caught up, they can just consume the most recent message.

There are two important things there that make Kafka a little different from legacy messaging systems. One is that it stores data: the act of reading a message, of "consuming" a message, doesn't make the message go away. It just means you've read it; it's still stored in the topic until the data retention policy says we can get rid of it. As a result, we can have multiple consumers consuming from different places, so I can have one topic full of data and deploy many applications to do computation over that data in different ways.

What this means, and I'll admit I haven't quite made this argument yet (I've got a couple of talks later in the day that make it more coherently), is that Kafka doesn't just want to be a big giant pipe to get some events from over there into some system that stores them. What it really wants to be is a distributed commit log that's the backbone of your entire system. It wants, as much as possible, to be the system of record. It says: really, what you're doing is processing events, and that's all any software does. Things happen, we become aware of those things, we do computations over those things and produce useful results for people. Kafka is saying: everything's an event, and I'll store the events for you.
And I'll give you ways of doing processing over those events. I'll give you ways of integrating legacy databases that aren't evented systems but regular old update-in-place databases, like relational databases: ways of capturing change log streams from those and making them just data in topics, and ways of integrating applications, microservices really, because Kafka as a messaging backbone becomes an intelligent way to integrate microservices as well. There are all kinds of things we can do with this, and if it seems like I'm just saying that and not backing it up, there are a couple of talks on that theme I'm giving later today; come to one of those and I'll make that case in more detail.

That leaves stream processing, those orange boxes in the diagram. Once I've got data in topics, what do I want to do with it? If all you had was the consumer API, where you can read a message and do stuff with it, certain things are going to happen. If I set you loose and you go build things, you're going to find out over the next year and a half that you're going to have to group things by key and count them; that's probably going to happen. You're probably going to have one topic with a key, something like a foreign key, that points to some lookup table somewhere, and you're going to have to join that topic with a table, or maybe join it with another topic. You're going to have to chunk the data up into windows. And if all I give you is the consumer API, you're going to have to go write all of that yourself. And it's fun, right? It's fun to write framework code; it's better than business code, because regular business code has stakeholders who want weird exceptions. You build some beautiful data model that says we've got one promotional activity per day, the schema won't let us have two, so we'll never double up, and it works, and then you demo it and the product owner says, great, what about Two-for-Tuesday? What's Two-for-Tuesday? Well, on Tuesdays we have to send two. Business code is always messy and weird and full of exceptions; framework code isn't like that. So it's tempting to want to go write the stream processing engine yourself, because you can make it beautiful and perfect, but then you get fired because you're not actually delivering features, you're building frameworks.

That is where KSQL enters the picture. These things that you're going to have to do with data in topics, KSQL is there to do them. It's a declarative stream processing language. It is not a subset of ANSI SQL, but you're going to find that it looks suspiciously like SQL. Let me show you what it looks like. The code on the screen right now, you could be forgiven for thinking it was just SQL, because it sure could be. The only thing that makes it not SQL is that clickstream, right here, is not a table: clickstream is a stream, and this is one of the two primitives that KSQL works with. KSQL also has a concept of tables, but first and foremost there is the stream. A stream is a lightweight abstraction on top of a Kafka topic. Underneath clickstream there's a topic called something, we don't know what, and we've told KSQL: there's that topic down there, we want to call it this stream, clickstream, and every time a message arrives in the topic underlying clickstream, run this query and emit a result. This is just a stateless filtering operation: it's getting clickstream events from some surely GDPR-compliant click measurement system we've installed on our website, and filtering it to show only the clicks from IE6 users. It would look roughly like the sketch below.
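The slide itself isn't captured in the captions, so here is a hedged reconstruction of that first query. The column names (userid, page, action, useragent) and the exact user-agent pattern are assumptions for illustration, not the literal slide:

    -- A non-persistent, stateless filter over a stream: every time a message
    -- arrives on the topic behind clickstream, evaluate it and emit matches.
    -- Column names and the IE6 match pattern are illustrative assumptions.
    SELECT userid, page, action
      FROM clickstream
      WHERE useragent LIKE '%MSIE 6%';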
Maybe you want to know what those IE6 people are up to. I think we're in this weird in-between stage where IE6 jokes are almost not funny anymore: four years ago it was still "they make us support it, it hurts so bad," and now that's kind of done. I'm going to keep it alive, though, I'm not letting it go; in another few years we'll be at the point where we're ready for the JavaScript IE6 emulator that somebody is going to write, and that will become an interesting project, once we're out of this weird IE6 trough where it's not nostalgic yet and the pain is still fading.

Anyway, some more KSQL. This query introduces a couple of things, and for now I want you to ignore the top line and just look at the second line. It says select user id, page, action from clickstream left join users. I don't even need to explain what's happening there, because you know SQL and it's fairly intuitive, but there is something tricky here. Clickstream we just talked about: that's a topic of clicks that some external click measurement thing is producing into, and we are consuming them here. Users isn't a stream; users is a table. This is the other kind of primitive that KSQL has: KSQL supports streams, and it also has a concept of tables.

Now, where are those users stored? You have one option for where you can store things, and that's Kafka, and Kafka has one kind of thing it can store data in, and that's a topic. It's fairly straightforward to understand how clickstream, this thing we're calling a stream, is an abstraction on top of a Kafka topic; it's a little weird for me to tell you that users is a table and it's also an abstraction on top of a Kafka topic. So let me explain. Remember I said that the things that go into topics are messages, and messages are key-value pairs, and that's almost all you really need to know about the Kafka data model: messages are key-value pairs, topics contain messages. Imagine we had a topic that contained updates to user records, like a change log: every time somebody edits their profile, we produce a new message into this topic where the key is the ID of the user and the value is the updated user record, some JSON blob or Avro-serialized thing or whatever we want. Now, if somebody's crazy and edits their profile every minute, there will be a new message every minute; that would be kind of weird and frankly unlikely, most people don't edit their profiles all that often, but every time you edit your profile, a new record goes into that topic. KSQL, in creating a table, is simply saying: I'm going to take the most recent value for every key in that topic and materialize it as an in-memory view of the data in that topic.
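Registering that kind of changelog topic as a table is shown on the DDL slide later in the talk; as a preview, it might look something like the sketch below, where the topic name, columns, and key column are assumptions for illustration:

    -- A table is the latest value per key, materialized from a changelog topic.
    -- Topic name, columns, and key column here are illustrative assumptions.
    CREATE TABLE users
      (userid VARCHAR, name VARCHAR, level VARCHAR)
      WITH (KAFKA_TOPIC='users', VALUE_FORMAT='JSON', KEY='userid');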
And that's it. Maybe there are a billion messages in that topic from all the profile updates over some number of years, but I'm only going to have the current version of each one in this table, and that's actually going to be kept in memory; there are important reasons for that. That's what a table is: a materialized view of the data in a topic, and as long as that is changelog-style data, it works just fine.

Now, a question some of you are thinking is: oh no, in-memory? Too bad, that's not going to scale. Let's stop there for a second. I hear your skepticism, but remember how Kafka wants to scale consumers: topics get partitioned, and a consumer group can have as many instances of the consumer application as there are partitions in the topic. Well, what is KSQL but, under the covers, a producer and a consumer? I said everything that talks to Kafka is fundamentally a producer or a consumer or both, and KSQL is no exception. The engine that executes this code, and we'll get to that at the very end after the demo, is a Kafka consumer group. So if I have a very, very large collection of users, I'll want to partition it, and then I'll want to scale out my KSQL server group so that each node can fit its little chunk, its partition of users, in memory. And that is off-heap memory: KSQL is written in Java, but the state store, if you're interested, comes from a Facebook project called RocksDB. Inside KSQL there's a RocksDB instance keeping that state off-heap and dealing with memory in a not-crazy way. Anyway, you basically get what's happening here: this is a join, and stream and table are the concepts you have to know.

Now, I told you to ignore the first line of this query, and I would like you to stop ignoring it and pay attention to it. CREATE STREAM AS SELECT: internally and informally we call this a CSAS. What we're telling KSQL is: don't just run the SELECT, but spawn a continuously running, persistent stream processing program. You'll see this in the demo; I'll do a plain SELECT and a CREATE STREAM AS SELECT and you'll see how they behave differently. Threads are spun up and a thing is running inside the engine when I do this, and the results of the SELECT are being produced to a new stream called vip_actions, which by default, if I don't tell it otherwise, will overlay a topic of the same name. So this KSQL query will be producing messages to a topic called vip_actions, and I can consume that directly, Kafka Connect it out to a relational database or S3 or whatever I want to do; it's a Kafka topic, it's part of the Kafka platform, and I can do anything I need to with it, but I'm producing it with this KSQL query. Put together, it might look like the sketch below.
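A hedged reconstruction of that CSAS follows. The WHERE clause on user level is an assumption added to motivate the name vip_actions, and the join and column names are also assumptions:

    -- CSAS: a persistent query. KSQL spawns a continuously running stream
    -- processing program whose results land in the vip_actions stream, which
    -- by default overlays a Kafka topic of the same name.
    -- The WHERE clause and column names are illustrative assumptions.
    CREATE STREAM vip_actions AS
      SELECT c.userid, c.page, c.action
      FROM clickstream c
      LEFT JOIN users u ON c.userid = u.userid
      WHERE u.level = 'Platinum';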
All right, almost done with static code and ready to look at code that I type. One more here. This one introduces two new concepts, one of them being creating a table with a CREATE TABLE AS statement, but let's walk through the rest of the query first: select card_number and count from authorization_attempts. This is a fraud detection algorithm, and anybody who works in financial services, you are welcome to have it, no extra charge, very sophisticated; get out your phone and take a picture of the slide right now. Wait, why are they laughing? Authorization_attempts is apparently a topic full of credit card authorization attempts, and we're going to group that by card number (notice I'm skipping the window for now), and any time I see the same card number appearing more than three times, I would like to emit that card number and the count into the table, flagging it as potential fraud. Of course, I don't want that to be over all time; that would be a rather unfortunate credit card that I could only use three times, although I have one whose fraud detection algorithm seems to be something like that, because it feels like I'm getting a new one every couple of months and I don't know what I'm doing wrong.

The WINDOW statement in the middle is key. This aggregation, this counting, takes place only inside five-second windows, so I need to see more than three uses of that card within five seconds. If there are, say, four authorization attempts in 50 milliseconds, then right away that card number and the number four will be emitted into possible_fraud, into the table; we don't wait for the window to close to produce results, we're always producing results. But if I see two authorizations in the first 50 milliseconds and then another one six seconds later, that's defined as not fraud, at least not by this algorithm, because it's not within the same five-second window. There's a much longer story to tell about windowing, and there are a couple of other windowing schemes you can use, but I just want to introduce the concept; this is the basics.

Let me also show you the DDL very quickly. For KSQL to know about a topic, you have to register the topic as a stream, and that looks a lot like a CREATE TABLE statement with a little vendor extension on the end: CREATE STREAM clickstream with the column metadata, and then a WITH clause saying the Kafka topic is called my-clickstream-topic and it contains JSON data. I could also say the value format is Avro; KSQL supports any serialization format you want, as long as it's Avro or JSON. The table creation DDL is almost exactly the same, except I also have to specify the key column to use. Put together, the registration DDL and the windowed fraud example might look roughly like the sketch below.
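Since the slides aren't in the captions, here is a hedged reconstruction of both statements. The topic name, column names, and types are assumptions; the shape of the windowed query follows what the talk describes:

    -- DDL only: register an existing topic as a stream. Nothing runs yet.
    -- Topic name, columns, and types are illustrative assumptions.
    CREATE STREAM authorization_attempts
      (card_number VARCHAR, amount DOUBLE)
      WITH (KAFKA_TOPIC='authorization-attempts', VALUE_FORMAT='JSON');

    -- CTAS: a persistent, windowed aggregation. A card number seen more than
    -- three times inside a five-second window is written to possible_fraud.
    CREATE TABLE possible_fraud AS
      SELECT card_number, COUNT(*) AS attempts
      FROM authorization_attempts
      WINDOW TUMBLING (SIZE 5 SECONDS)
      GROUP BY card_number
      HAVING COUNT(*) > 3;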
And I will actually do these things now, I think. Let's do a demo. I have the Confluent Open Source 4.1 distribution running here, and KSQL is a part of the Confluent Open Source distribution. It is at present licensed under the Apache License 2.0, same as Kafka itself, and you can go to GitHub and build it from source, which is actually not very hard to do, or download it from confluent.io/download, whatever you want to do. It has this handy-dandy command-line tool that's good for development and demos and things like that. It's not a production tool; in fact, in the next release, every time you start it, it says hey, don't use this for production, please, because some people were and it was making people sad. Let's go ahead and run KSQL. This is the command-line version, perfect for experimenting and running locally; you can connect to production instances with it, but we're not going to do that.

Let's see here. I'm going to shrink the font a little bit so it doesn't wrap. When we list the topics, there we are: those topics are just the internal ones that come with the platform when it starts up, some Kafka Connect stuff and Schema Registry stuff and one KSQL-related topic. Let's just ignore those. We do need some data, though. We're going to rate some movies. I have an idea for this movie-rating app where, instead of just saying what you think of a film after you watch it, you'd actually go and use the app during the movie, even though the theater kind of hectors you: whatever you do, don't open your phone, because the light will blind people. You could actually get your phone out (the movie theater will hate you for this, this is just my totally fictional app idea) and rate the movie while you're watching it. That would be kind of interesting: you could get rating data over time, like wow, that opening action scene was really great, but things kind of bogged down in the second act, and okay, the fight scene was good, but the ending... You'd get an idea over time; it would be valuable data. So that's what we're doing here. I have a couple of JSON files. This one is all these movies, all kinds of information about those movies, and I have a script I'm going to run that creates data that looks like this: a movie ID and a number. I'm not going to use this static ratings file, because that would be boring; I want real-time, random, streamed ratings.

Let's take a look at how to deal with this. I want to take just one record from my movies JSON file and produce it into Kafka, into the topic movies-raw. Is that going to work for me? I think it is. Excellent. Let's go over here and list my topics: look, I have a topic called movies-raw now. Now, KSQL will give me some primitive debugging capabilities for peeking into a topic and seeing what's there, but what I'd really like to do is something like a SELECT from a stream. That stream doesn't exist yet, so things are a lot more fun if I create it: let me create a stream called movies_raw that has the following metadata: an ID, a title, and a release year. The value format is JSON, because that's much more pleasant to look at, and the Kafka topic was called, I believe, movies-raw, because I like to do that. Now I have a stream, and now I can select from that stream. Class dismissed. No, you can't go yet; it's not cool enough yet. Let's put another record in. I'm going to grab a record from the end of that file, and I won't be able to get over to the other tab fast enough, but you see now we have two things in there, and I can keep doing that for a little while, just keep putting records in. Those are duplicates now, but we'll see what happens; we'll see if we can get rid of those duplicates in a minute.

What looks bad about that? It looks like I have the ID, the title, and the release year, this is probably a timestamp, and what is this? Somebody shouted it out: the key. You're very shy; somebody knew that and didn't shout it, but that's the key, and it's null, because that's what kafkacat does. kafkacat is a little third-party, open-source Kafka command-line utility written by a great guy named Magnus, and it's a super useful utility, you should use it if you don't, but by default it leaves the key null here, and I need the key to be the movie ID in order to do that join. The stream registration I just typed looked roughly like this sketch.
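A hedged reconstruction of that registration and the ad-hoc SELECT; the exact column names and types are assumptions based on how the demo describes the JSON records:

    -- Register the raw topic as a stream (DDL only). Column names and types
    -- are assumptions based on the demo's description of the JSON records.
    CREATE STREAM movies_raw
      (movie_id BIGINT, title VARCHAR, release_year INT)
      WITH (KAFKA_TOPIC='movies-raw', VALUE_FORMAT='JSON');

    -- A non-persistent query: it streams rows until you interrupt it (Ctrl-C).
    SELECT * FROM movies_raw;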
That's fairly intuitive, right? Let's set that up. I've got movies_raw; let me create a stream called movies_rekeyed as select star from movies_raw partitioned by movie_id. I can repartition. Now, notice that before, when I created that stream, it said "stream created"; now it says "stream created and running." That's because the first CREATE STREAM was just DDL, just saying hey, there's a topic here, I want to register it. This one is a CREATE STREAM AS SELECT, so this is a program, and this stream processing program is running. It happens that my underlying topic has one partition, so it's only created one thread, but if it had ten partitions, there would be ten threads spawned, ready and waiting, consuming from the underlying topic. So now I can select star from movies_rekeyed, and it looks much better.

But that's not really what I want. This is lookup data, reference data, the movie stuff; it's not really a stream, it is a table. So I want to create a table, conventionally giving it a name like movies, with that same metadata. Since it's a table, remember, I have to give it a key, that little quirk of nature; the value format will be JSON, and the underlying Kafka topic is going to be movies_rekeyed. That movies_rekeyed stream I created is producing messages into a topic called movies_rekeyed, and with this DDL I'm saying: create a table from the messages in that topic. This is, by the way, intentionally part of the demo, because this is just real: you get things that have the wrong key, and there's a join you want to set up, and you have to repartition. It's totally fine. Of course, I am copying stuff: my CREATE STREAM AS SELECT that rekeys the movies is taking the original movies topic and copying that data to another topic. That's like denormalizing, and that's immoral, right? You're a terrible person if you do that. Except not here. There is a storage cost to copying data, and we deal with that, but usually the thing that scares us about copying data is: what if something changes, you have to go change the copy too. Well, the data is immutable; every message in Kafka is immutable, so you can copy things as much as you need to, as much as you can afford to, and you don't have any data-integrity concerns, just storage and I/O concerns to optimize for. What we're doing here, in other words, is just fine.

Now, select star from movies, and you'll notice I only have two records; those duplicates have been eliminated, and only the most recent version of each one is included. Let's get going: I'm just going to dump the rest of my movies in, and I made it over in time. There we are, we have everything. I see Zombieland, Wreck-It Ralph, Doctor Zhivago, The Day of the Beast, which I don't know if I can recommend, so that's on you. We've got our reference data all nailed down. The rekeying and the table registration looked roughly like the sketch below.
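A hedged reconstruction of the rekey-and-table step. The column list is an assumption, and so is the backing topic name for the rekeyed stream (a CSAS backs its new stream with a topic named after the stream unless told otherwise):

    -- Repartition the stream so the movie id becomes the message key.
    -- This is a persistent query: it runs continuously from here on.
    CREATE STREAM movies_rekeyed AS
      SELECT * FROM movies_raw
      PARTITION BY movie_id;

    -- Overlay a table on the rekeyed topic: the latest value per movie_id.
    -- The backing topic name and columns are illustrative assumptions.
    CREATE TABLE movies
      (movie_id BIGINT, title VARCHAR, release_year INT)
      WITH (KAFKA_TOPIC='MOVIES_REKEYED', VALUE_FORMAT='JSON', KEY='movie_id');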
Now let's get some ratings happening. It so happens that I have a little piece of Groovy code here that Gradle is going to run for me, and what it's going to do is pick about 10 or 15 movies and generate some randomized ratings based on realistic averages that someone with impeccable taste in film has put together for us, and I'll let you guess who that might be. Okay, we have a new topic called ratings, and we need to register it. I want to create a stream called ratings; it's going to have a movie ID that's a long and a rating that's a double, it's going to be JSON data, and it's going to come from a Kafka topic called ratings (it's okay to call them both the same thing). I have now registered that, and I can select star from ratings, and boy, is that not exciting. Now, I have cheated: the script producing messages into this topic is producing them with a key that is joinable. You can see it's not displaying very well in the UI here, but that second column is a key that's ready for us to join, so I don't have to mess with this anymore. However, it's not useful yet; it's just an ID and a number. Maybe I could at least select title and rating from ratings left join movies on the movie ID. Okay, that's better; let it print for a second, and you can see there are names and numbers now. It ends with Lethal Weapon, classic Christmas film, at an 8.27, which I think is a pretty honest rating for that movie.

Notice the weird state this is left in, and I want to point this out: when you type in a SELECT at the command line, this is how it stops if there's no data coming into the stream. I stopped my script just to keep things under control for now; I'll restart it before we're done. SELECTs don't finish. It feels broken: every time you do this, you have to hit Ctrl-C, and at first that seems kind of weird, but it's streaming data; when is it ever going to get the last message? It's just going to run until you stop it. These SELECTs that we type in at the command line are non-persistent queries that just run right now. To make it a persistent query, I have to do a CREATE STREAM AS SELECT or a CREATE TABLE AS SELECT, which I'll do in a little bit.

Now let's make this a little more exciting. Films are ambiguous by title; we need title plus release year, so we're going to add that. And I don't really want just the rating; I want the sum of rating divided by the count of rating, I want to name that average rating, and then I want to group this by title and release year. So now I have average ratings. To do that, if all you had were an API that said "here's your next message," there's a lot you would have to do to make that happen; it's actually pretty difficult. We haven't talked about state management, but to aggregate based on a key and then do computations over your grouped keys is a stateful operation; there's stuff you have to keep in memory. Creating the table is inherently stateful. KSQL does that for you: it's built on an open Kafka API, a Java API called the Kafka Streams API, and Kafka Streams handles all that state for you. Otherwise, if all you're doing is programming directly against the consumer API, you have some fairly thorny state management problems to deal with; all of those kind of evaporate in KSQL, handled for you by the engine. Now, the one thing I want to do after this is create a table called rated_movies as this select. The table is running, and I can select star from rated_movies and go start up my rating script again. End to end, the ratings side of the demo looks roughly like the sketch below.
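A hedged reconstruction of the ratings registration, the join, and the aggregation. The join condition and qualified column names are assumptions, and SUM divided by COUNT stands in for the average exactly as the talk phrases it:

    -- Register the ratings topic as a stream (the generator already keys
    -- messages by movie id, so no repartitioning is needed).
    CREATE STREAM ratings
      (movie_id BIGINT, rating DOUBLE)
      WITH (KAFKA_TOPIC='ratings', VALUE_FORMAT='JSON');

    -- A persistent, continuously updating average rating per film.
    -- Join condition and qualified column names are illustrative assumptions.
    CREATE TABLE rated_movies AS
      SELECT m.title, m.release_year,
             SUM(r.rating) / COUNT(r.rating) AS avg_rating
      FROM ratings r
      LEFT JOIN movies m ON r.movie_id = m.movie_id
      GROUP BY m.title, m.release_year;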
Select star is a little ugly; it turns out I don't have any remakes in my list here, so we can just do title for now. But there you go: this is going to continuously update. My fans are going to kick on, you can't hear them, but we're already at two hundred thousand movie ratings being aggregated here, and I haven't defined any windowing, so it's going to be aggregating over everything; this is going to get out of hand after a little bit. But there you go: we are successfully doing stream processing with just a few lines of SQL. And you can see my movie recommendations: Super Mario Bros. just didn't really work out, maybe it's time for a reboot, I don't know. What else have we got? Children of Men, okay. Tree of Life, which is like the ten best movies ever made; hear me now and believe me later. Anyway, you get the idea.

Let me just explain maybe one more thing. Here we are: this is something like what we were just doing. I had that CLI running, that command-line interface, and it was talking to, in this case, in debug mode, an in-process KSQL engine. That engine parses the KSQL, builds a little stream processing application, and runs it; the producer and the consumer stay internal to that thing, and it's just running the stream processing code. To run this in production, you're probably going to work out your KSQL (your KSQL really is you writing stream processing programs in this SQL-like language), put that into a file, deploy that file with the server, and scale out your KSQL cluster to whatever the need is. You could do one node, you could do three, you could do ten; it just depends on the volume and the complexity of your processing. Those nodes act as a cluster: one of them can fail, and the partitions assigned to it will be failed over to another machine, and the state will move with them; the state is all managed for you. You can elastically scale it out, add more KSQL nodes, and partitions will get assigned to them. The whole business of being an elastically scalable, fault-tolerant application is handled for you by that engine. Basically, you've got a script called ksql-server-start that you run in a container or on a VM or whatever it is, and there you are, you're a KSQL engine; typically in production you'll just run from a file of code. You can also, of course, run in production like this: have a KSQL cluster, connect to it from the command line, and issue queries that way; that's completely fine. But that's what it looks like to actually run this thing in production.

If you want to know more, you can check out the code; KSQL is open source, on GitHub at confluentinc/ksql. There's some documentation in there; most of the documentation has moved off to docs.confluent.io, but there are some docs and examples in the repo. confluent.io/ksql will give you some tutorial videos featuring a guy who looks and sounds like me, if you want to dive a little more deeply. We also have a Slack community, which is operated by my team and peopled by actual KSQL engineers, and even if you have very difficult questions about it when you're just getting started, you can probably get them answered in there. So check that out, join the Slack if you're interested, and hey, find me later this afternoon.
I might be a little jet-lagged, but I'm going to try to be around, and I would love to talk to you more about this. Thanks for hanging out with me, and for your patience and your interest. [Applause]
Info
Channel: Devoxx
Views: 42,492
Rating: 4.9268637 out of 5
Keywords: DVXPL18
Id: LM-aQQQes4Q
Length: 48min 59sec (2939 seconds)
Published: Sun Jul 15 2018