Connecting Apache Cassandra™ with Kafka and Spark

Captions
Good morning, good evening everyone — we should be live. Can you hear us? What about the sound, any echo? Okay, so far so good. Let's get rolling. Welcome everyone to yet another online workshop; today I'm with a very special guest, Rahul Singh. Hi Rahul, how are you? — Hey, I'm doing well, thanks Cedrick, thanks for having me again. — Yeah, try to put your slides in full screen; I'd like to give a little shout-out to you on slide 2 — I don't know if you saw my last-minute slide. For those who don't know Rahul: he's a very important member of the Cassandra community. First, he's the CEO of Anant Corporation, but he and his team are also the creators of cassandra.link, which is a super great resource for all the news and articles about Cassandra. He also created and maintains Awesome Cassandra, where you find all the links for tools and great articles. And not only that — on the website you can see they've cloned Awesome Kafka, Awesome Spark, Awesome Solr... everything that can contribute to the community; those guys are everywhere. He also manages multiple meetup groups on the East Coast. So thank you very much, Rahul, for being with us today. — Thank you, thank you so much, appreciate that. I just wanted to tell everybody: thank you all for being here. I'm on the East Coast, so for me it's kind of like a lunch hour, but some of you may be joining from Europe, Asia, or the Pacific West. This is part of a series showcasing an event-driven toolkit. The first presentation, which Cedrick and I hosted before, is on YouTube; in it we built a simple REST API on Cassandra. This workshop builds upon that one: today we're covering an event-sourced API on top of Kafka — you can do event-sourced APIs on different systems — and we're going to connect Kafka to Cassandra. We're actually going to be very ambitious here: we're going to do it three ways in two hours, using the consumer API, Kafka Streams, and Kafka Connect. We even thought we could fit in the Spark component, but no, that's just not possible. That means that even though you can try to follow along, I recommend just focusing on watching what I'm doing — you can always come back to the video later. The content is well documented; in fact, you'll see I'll be literally copying and pasting from the README that's online, and it's open source, so you can always come back and do this yourself. So what are we covering today? We'll talk a little bit about the event-driven API — what an event-driven API is and how it fits into the world of microservices. We'll talk a little about Kafka topic modeling; if you've done Cassandra data modeling, you'll find there are some similarities in this idea of partitioning, but it's not exactly the same. We're going to open up the code — we already have code, and there's a prerequisite: if you didn't have an Astra account before, you need one. We're going to look at two repos, cassandra.realtime and cassandra.api on GitHub; both of them open up in Gitpod, a self-contained, web-based IDE. And we're going to do a few different things: create topics and messages, consume them, and save them into Cassandra in a few different ways, so you get an idea of how to leverage Kafka with Cassandra.
The main takeaways are that you get a basic understanding of Kafka topics, schemas, and partitioning — we're not going to make you an expert today, but you'll get the basic idea. You'll understand the basics of the Kafka API; again, you're not going to walk away a Kafka architect today, because frankly we'll be dealing with a one-node Kafka cluster, whereas in a real-world scenario we're talking about potentially several types of clusters — ZooKeeper, Kafka, etc. What we want is for you to understand how Kafka can work with Cassandra. Kafka can work with many different technologies, but today we just want you to take away how it can work with Cassandra. I just wanted to give a shout-out to our sponsors: Gitpod has been very, very good to us and to DataStax in providing an amazing platform for self-contained development, DataStax has been a great sponsor, and Deloitte has also been involved in beta testing some of these workshops — so thank you, everybody, for making this possible. Now, what is an architect? I went to Google and asked, and the reality is that an architect is a chief builder; in computing, it's someone who designs or makes things. Our company, Anant, architects platforms on Cassandra and related technologies. We love Cassandra, and Cassandra is at the heart of many of the platforms we work with; we love dealing with anything that's scalable. And with DataStax it's a lot easier — that's really what I like to tell people: even though we've used all of the open source technologies — Apache Spark, Apache Cassandra, Apache Kafka — when we use things like DataStax, and for that matter Confluent, it makes life a lot easier. That's where we come from: technology is great, and it's even better with commercial partners. As for the different ways you can experience this — Cedrick, do you want to talk a little about the different platforms where people can watch? — So, we are live on YouTube, which is probably where you're watching us, but we are also live on Twitch; we use Twitch as a backup. We're not really tracking the questions on Twitch, so questions go mostly to the YouTube chat, and also to Discord. We're using a Discord room for every workshop — we are more than 7,000 people there now — so if you have deeper-dive questions, go to the Discord, where a full crew of advocates will help you. We did the Menti quiz for the workshop, and for the hands-on, here's the setup: the database is Cassandra running in Astra, cloud-based, as a managed service, and we're using one runtime for the API, provided by Gitpod, and another one dedicated to Kafka. — Absolutely. And the best part is that you don't need a powerful computer to do this; in fact, the very first workshop I did, which was just the API, I did on a tablet through the Chrome browser. It's amazing what we can do these days. So one of the first things, if you were following along, would be to go and launch Gitpod. Gitpod is a platform that's basically like Visual Studio Code — but not exactly; it's actually Eclipse Theia, the IDE. I'm just going to make sure my Gitpod is up and running. It takes a while; it's built on top of containers, and it kind of builds these images.
I'm also going to warm up my Cassandra API Gitpod. As you'll see, depending on whether you've done this before, some of these come up very quickly and others take a little longer. All I did here was go to the URL for the cassandra.realtime repository and click "Open with Gitpod" — that's it, and this is what I have running now. Same thing with cassandra.api: look for the icon, "Open with Gitpod", and it brings it up. When you're up and running you'll see a terminal — like a Linux terminal — and your folders with all your content, and if you click on a file, say this Dockerfile, you'll be able to edit it. That's when you know it's working. Back to the presentation while that warms up — depending on where we're at, some of these Gitpods become dormant and shut down, so we'll come back. What is Gitpod doing for us? Gitpod gives us an operating system; it gives us an IDE where we can edit code; and it gives us a proxy so that when we run something like a REST API, it dynamically maps that port — say port 8080 or 8000 — to a URL we can access. It's doing a lot of different things for us; I'm just going to say Gitpod is magic and move on. If you're interested, all the technology Gitpod runs on is open source. — Yes, and because we're good users of Gitpod, we'll actually be with Gitpod next week: during that workshop I'll be joined by an advocate from Gitpod and we'll run through all the small features. — Awesome, that's great. We love Gitpod; they sent us a bunch of stickers; they're an amazing organization. That's all you need to do — just give us stickers and I'll call you out. Why are we building this application? The application we're showcasing in this workshop series is not just a demo, it's not a toy: we actually use this technology as part of our company's operations. This is the larger picture of our knowledge platform, which we use to inform ourselves. We could have subscribed to something, but we really didn't like the apps out there, so we decided to use a combination of open source tools. This diagram shows where our knowledge base data is stored and indexed; it has an interface for administering content and interfaces for browsing, and we publish websites out of this knowledge base — cassandra.link is actually a site published from it. So this is a real-world use case for why we do this development: it's not just to play around and do a workshop. Where we want to go is to have everything serverless — absolutely everything. Right now all of our web interfaces are already serverless; we're using things like Zeit and Netlify — the Astra examples use Gatsby and Astra with Netlify, and we use Gatsby and Netlify right now — and we may use some of the Astra APIs later down the line. In terms of the databases, some are already managed: we have not quite 100% migrated over to Astra, but we're about to, and we're using a separate service called SearchStax for a managed Solr index. The Storage-Attached Indexing — the new SAI on Astra — is good, but Solr is a different type of technology.
What we're going to talk about today is this part right here: how to consume events and process them into Astra, because a big part of our transformation from where we are to where we want to go is making this a scalable, reactive platform that is 100 percent serverless. A quick overview of microservices, and REST versus event-driven architecture — they're related, but they're not the same. What is a microservice? It's basically a process that's loosely coupled with other loosely coupled services, and it can be implemented using asynchronous or synchronous communication. Most people interact with microservices as a REST API: they make a REST API that has one function and one database, and another system calls it. Others may be familiar with using microservices with Kafka, or say with RabbitMQ, where events drive the execution of those microservices. Normally people go to microservices because they want a system that can scale independently, so they can replace features very quickly without having to redeploy the whole system. This is a big discussion, and a lot of people use different terminology for what microservices are, but I just want you to understand that right now we're talking about microservices in terms of a simple function over REST over HTTP — and we're actually going to do a bit of that with Kafka Streams as well. Now, there's a separate subject of microservices on Cassandra. A lot of folks will say: well, if it's a microservice, then it should have a separate database — every single microservice needs its own database. When we come to the world of Cassandra, we can actually host several different microservices on one cluster: you can have a microservice powered by one table, or by a whole keyspace, or potentially even by a whole data center. And one of the biggest benefits of Cassandra is that it has a multi-model system built in, in which you can write data with CQL in tabular format and retrieve it via JSON. Then, if you're looking at DataStax, you can write data via CQL and retrieve it using Solr — arbitrary queries on an index — or, in the newer versions of DataStax, there's DataStax Graph, where you can write data via CQL and request it back via Gremlin. So Cassandra is a game changer in the microservices world, and if you've ever worked with a mature organization that's using Cassandra and microservices, you'll see they have large clusters powering hundreds and hundreds of microservices. That's the reality. This is not your father's database; this is a different type of technology that can empower you to run huge platforms on one or two clusters. Event-driven architecture is a way to do microservices — a way to do different architectures. When we talk about event-driven architecture, most of the time we're talking about how to implement something called CQRS — command query responsibility segregation — or about how to sequence a set of different microservices to do a job: a saga, for example. In our case we're using pub/sub in Kafka, essentially to say: this is an event that came in, and we want to process it in a certain way. Event-driven architecture is — I wouldn't say complicated — a different area of architecture that most organizations eventually get to if they want to scale their platforms.
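As a concrete illustration of the multi-model point above — write tabular data with CQL, read it back as JSON — here is a minimal sketch using the Python driver; the keyspace and table names are hypothetical placeholders, not the workshop's actual schema:

```python
# Write a row with plain CQL, then read the same row back as JSON using
# Cassandra's SELECT JSON syntax. Keyspace/table names are placeholders.
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('demo_ks')

session.execute(
    "INSERT INTO urls (url, title) VALUES (%s, %s)",
    ("https://cassandra.link", "Cassandra.Link"),
)

# Each row of a SELECT JSON query comes back as a single JSON string column.
for row in session.execute("SELECT JSON url, title FROM urls"):
    print(row[0])

cluster.shutdown()
```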
So, just going back: organizations that start to mature their platforms go to microservices, some of them end up using Cassandra, and eventually all of them start to use event-driven architecture, because that's really the only way to create global platforms — you just can't have everything tied together with synchronous calls. That's the beginning of distributed computing; it all came about because one big machine couldn't do everything. Now we're at Apache Kafka. We talked a bit about event sourcing — we didn't get deep into it, but we will see in Kafka how to do event-driven APIs. Before we do that, let's talk about what Kafka is and when and why to use it. Apache Kafka is actually a combination of a few different services. Most folks, when they talk about Kafka, are talking about the broker, but Kafka is really an ecosystem of separate tools and technologies. Generally speaking, a Kafka cluster has a ZooKeeper ensemble — which hopefully sooner rather than later they're going to take out; that work is already in motion — and the ZooKeeper ensemble is what coordinates the cluster and coordinates the different consumers and producers. The Kafka cluster itself has several brokers; that's the minimum viable Kafka, just a cluster of brokers. The broker's responsibility is to accept messages from producers and to allow consumers to consume messages from a particular topic. Kafka is built on the idea of a distributed message log, and the way it scales is by keeping that log replicated on several different servers, so if one particular server goes down, the message is still available on a different server. A lot of the scalable platforms out there are some implementation of a distributed log — Kafka is a literal distributed log: all it's really doing is saving messages in a log, replicating it, and making sure it's available at scale. So, remember when I said Kafka is a combination of different technologies? When people say "Kafka", most of the time they mean the broker service, which involves clusters of brokers. We have topics — I'd say a topic is like a table — then partitions, which are a way to clump data together, similar to Cassandra, and then the actual messages, which are like table rows or records. The different ways to connect to Kafka are not Kafka itself. There's the Kafka client API, which includes the consumers and producers, with client APIs for different languages — Python, Java, Scala, Node, Go, you name it; that's not Kafka, that's just the Kafka client. Kafka Connect is not Kafka either. We're going to be using all three of these today, by the way: the Kafka client, Kafka Connect, and Kafka Streams. Kafka Connect is like a framework — I'd say a very lightweight enterprise service broker — that lets you connect different things, to get data into Kafka and to take data out of Kafka and put it somewhere else. And finally, Kafka Streams is a way to do stream processing without necessarily needing Spark or, say, Flink: you can just write a little program and do basic stream processing using the Kafka Streams API. Then there are other helpers — and these are amazing helpers — but they're also not Kafka.
The Kafka REST proxy is a way to get data into Kafka without creating your own REST API — you can just stand it up and have a ready-to-go API — and it works with something called the Kafka Schema Registry; the Schema Registry also works with Kafka Connect. The registry is important in an enterprise where you may have several different systems talking to Kafka: it allows you to have a contract between the different parties connecting via Kafka. You don't need Kafka REST, you don't need Kafka schemas, you don't need any of this stuff to use Kafka — except for the broker service and a Kafka client — but all of these things make your job a lot easier, as you'll see. Why, or when, do people use Kafka? Well, it's both: just as, when you're big enough, you need to use Cassandra, in a similar way, when you're big enough, you need to use Kafka. It basically outperforms any matching technology that's out there that you can download and install — and we're not comparing against, say, Amazon SQS or Google Pub/Sub; we're talking about technologies you can download, install, and scale on your own. It beats everybody. Who uses it? Well, LinkedIn, who made it — people at LinkedIn made Kafka — and they use it; Uber, Netflix, and Comcast all use it. You can find this in their engineering blogs and their GitHub repositories; a lot of folks use Kafka. It works well with other technologies that scale well — like Cassandra, like Spark, like Flink — because, just as Spark and Cassandra scale linearly, Kafka can also scale linearly, to global scale. There are other technologies that are now Kafka-compliant: there's another Apache project called Apache Pulsar, which is engineered slightly differently, but you can actually use the Kafka client API to talk to Apache Pulsar. Whenever I see two or three different technologies using the same API, it means the technology is here to stay; it's not going away anytime soon. One of the biggest gotchas with any distributed system — whether it's Cassandra, Kafka, or Spark — is that people don't understand that distribution, or scale from distribution, only works if you distribute the work. In Cassandra, if you model the database so that data clumps on certain partitions — wide partitions — or hot-spots on certain servers, that problem comes from a design issue. The same problem can happen in Spark when some tasks are heavier than others, and in Kafka we can have a similar problem: just because the technology is very powerful and distributes, it doesn't mean your data is going to distribute. So one of the first lessons to learn about Kafka is that messages are partitioned, and there are different ways to partition messages. Partitioning does two things: one, it can group messages together so that, say, one consumer can read them — that's one reason to partition; the other is that it allows us to load balance, so that several consumers can process the same type of messages. Other technologies have mechanisms to do routing and whatnot; this is the way we do routing in Kafka: we determine our routing strategy, we determine our partitioning strategy, and that's how we spread the data and spread the load.
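To make those routing choices concrete, here is a minimal sketch (not from the workshop repo) using the confluent-kafka Python client against a local broker; the topic name and payloads are made up for illustration:

```python
# Three ways a producer can route messages to partitions.
from confluent_kafka import Producer

producer = Producer({'bootstrap.servers': 'localhost:9092'})

# 1. Explicit partition: e.g. treat partition 0 as the "high urgency" line.
producer.produce('alerts', value=b'disk full', partition=0)

# 2. Key-based: all messages with the same key hash to the same partition,
#    which is the "group by" style of routing.
producer.produce('alerts', key=b'host-42', value=b'cpu spike')

# 3. No key, no partition: the client spreads messages across partitions
#    (round-robin or sticky, depending on the client version).
producer.produce('alerts', value=b'heartbeat')

producer.flush()  # block until everything is delivered
```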
Here's an example where we have one topic and different consumers pick up from different partitions — meaning there's no particular importance attached to any message. Versus: we can have different partitions for a particular topic where we say partition 0 is the high-urgency message line and partition 3 is the low-urgency line, and those consumers can be scaled depending on urgency. The data in Kafka is replicated — it's not the same as in Cassandra, but the effect is the same, in that it allows millions and billions and sometimes trillions of messages to come in on a daily basis, and systems can then read those messages — billions of messages per second — and process them. When a server goes down, another leader gets elected. Again, I'm not going to get into the specifics of how, but ZooKeeper is what allows the system to quickly know who the new follower is and who the new leader is. — Just a question about partitions: do you know if they are static or dynamic in Kafka? — Sure, that's a good question. It's up to you, and I'll get into that in one second. Basically, either you give the partition yourself in the message — in that case you're creating this dynamically yourself — and actually, let me come back to you: what do you mean by dynamic or static partitions? — So, I just read the question, but I expect it means: can you fix it in advance — "this is the partition name, and every message with this property will end up in this partition", like a group-by — or is it dynamic: I'm getting too many messages on a topic, so I need to split it and add a partition? That's how I understood the question. — Got it, yes, that's a good question. When you set the message, you can give the partition in that message. Once it's written, it's written — we'll talk about rebalancing partitions later; there are different tools for that — but as far as pure Kafka is concerned, it's like a Cassandra table: when you set the partition key, that's what it is, and the only way to really move data is to write a new partition and move it over. There are ways to have the system automatically distribute the data: if you don't give it a partition id, Kafka can do that redirection for you — it can balance based on a percentage, and finally there's round-robin. But in each of these cases, if you want to rebalance your partitions, there are different tools to help you; Confluent has a couple, and there are open source tools out there as well. So the answer to the question "are they static or dynamic": after you've set them, they're fairly static, but there are ways to change them later if you want. I don't want to say it's easy or hard — it's just a matter of what scale you're dealing with: at a small scale it's pretty easy; bigger, it's a bigger job. So in this case, the diagram just shows that you have control over how to partition your messages. The strategy is a big part of Kafka architecture, beyond how you scale the Kafka technology itself: this is the way you scale your messages, the way you make sure the data gets spread properly and consumed properly.
There are existing partitioning algorithms, and you can also make your own — there's a class you can inherit from to build your own partitioning strategies. Random partitioning will take messages and put them on different partitions; versus, say, deciding to group by — I want to group particular messages by a key, so everything for a particular key goes in one partition; and then by guarantee — that's what I was talking about earlier: you can have one partition with a high guarantee versus a low one, and it's how you consume the messages that lets you really do that. The last reason people use Kafka is that it's web scale. If you've ever watched that little YouTube video about what's "web scale", you'll get the joke — but Kafka is web scale. There are actually two or three different technologies that allow Kafka to scale at the global level. One way is just the topology: you have several different Kafka clusters that then report into a global Kafka cluster. There are other ways — there's a technology called MirrorMaker that will replicate data, similar to how data is replicated across data centers in Cassandra — but I'd say it's still not quite there. Very few people need Kafka at a truly global web scale, and the ones who do, like Netflix and Uber, are the ones tackling it and coming up with these technologies. If you watch the mailing lists, the user groups, or some of the blogs, you'll see they're trying to do this within the core technology — meaning not necessarily using something outside of Kafka. There are lots of different use cases for Kafka. The main use case that started it all was LinkedIn wanting to connect different databases to different services and keep them consistent; they ended up making a unified distributed message log where some systems would write to that log and other systems would read from it. That's the reason they made Kafka at LinkedIn: they wanted to connect their systems together. Today people use it for everything under the sun. Folks use it for aggregating logs from hundreds and hundreds of servers; some use it for stream processing; we're actually going to use it for event sourcing — we're populating a topic with events and then we're going to process those events. Every industry I've personally worked with has at least one company I've seen use Kafka, and that just comes with the territory: if you work with Cassandra, you work with companies that are pretty big, and they end up using Kafka — for e-commerce, for IoT, for edge computing. I can't talk about specifics in this workshop, but trust me: if you use a pretty global or national brand, they're using Kafka, and probably Cassandra. All right, a quick review of Astra, which — if you've been watching these workshops with Cedrick and other team members from DataStax — you already know: Astra is Cassandra as a service. Cedrick, do you want to go over the features? — Yes. At its heart and core it's really Cassandra as a service in the cloud. It's a managed service: we do all the backup and management and scaling for you. And not only that, we provide tools for customers on top of it — development tools like the Studio — and also, now, APIs on top of Cassandra: a REST API and a GraphQL API, ready to go, and more coming.
So DataStax Astra is really now a platform to ease the usage of Cassandra as much as possible. — Excellent. And as you can see, just like with Cassandra, you have different drivers: you can use the same drivers to talk to DataStax Astra as if you were talking to Apache Cassandra. — Yeah, you talked about CQL before, and it's all about CQL here: the way you're used to interacting with Cassandra is exactly the same with Astra. You simply add a couple of parameters and certificates to secure the connectivity between you and Astra, but it's all the same. And now Astra has even more ready-to-go REST APIs: if, for instance, you're a JavaScript developer and don't want to deal with the Node.js driver anymore but simply want to make REST calls, that's now possible. — Absolutely. So for us, as app developers, why do we use Astra? Well, frankly, we know Cassandra well — we help clients scale Cassandra — so why should we use Astra? It's because we want to focus on the features of our platform, on what matters to us. The general trend with our clients, and as we've seen across industries, is to go towards serverless. We don't have the time to manage our own servers for an app like this, so it makes sense to use something like Astra: we can scale our applications and build the features we want faster, rather than having to go and check the health — well, we can still check the health; there are tools in the Astra interface — but honestly, nobody wants to manage servers. If you want to manage servers, go work for DataStax; they can put you on their Astra team [Laughter]. But our clients, and we ourselves, want to focus on our business; we don't want to have to worry about servers. Let's quickly launch Astra — I actually went ahead and warmed up my database earlier. When you log into Astra, you'll normally see an interface to create a database: you can choose the free tier — I've already used mine — or a production or high-density production workload, then you choose a cloud provider and which region you want, and it shows you how much it costs. I did that already, so I have a database running that I called "platform". It's running on Google Cloud in us-east, and right now I have one keyspace; I can add other keyspaces if I want. That's it — it's just Cassandra, with a really nice interface. If we want to see the health of the server, as I mentioned earlier, I can check, for example, how much data is in there and the general requests served — you'll see that earlier I was doing some reads and writes at high speed, and I can go see what happened. There's a CQL console, and it's basically the same as cqlsh — I haven't seen too many differences. Let me zoom in a bit... too much... there we go. I have my keyspace here, plus a couple of other system ones. Everything you could do in cqlsh you can do here, except maybe COPY commands, because you don't have file access for COPY TO and COPY FROM. There's also a Studio interface, which lets you do basic visualizations; you can run any type of CQL — insert, update, delete — and immediately see the output, and you can also browse data in here, just to make sure you're getting what you want.
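For reference, here is what "same driver, plus a bundle and credentials" looks like when connecting the Python driver to Astra — a minimal sketch in which the bundle path, keyspace, and credentials are placeholders for your own:

```python
# Connect the Cassandra Python driver to Astra using the secure connect
# bundle downloaded from the Astra UI. All values below are placeholders.
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

cluster = Cluster(
    cloud={'secure_connect_bundle': '/workspace/secure-connect-platform.zip'},
    auth_provider=PlainTextAuthProvider('my_user', 'my_password'),
)
session = cluster.connect('platform_ks')

# Quick sanity check that the session is live.
print(session.execute('SELECT release_version FROM system.local').one())
cluster.shutdown()
```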
We'll come back to the interface as we go through the demo. Just a reminder of where we're going: we want to make this a serverless platform and not have to run any code ourselves. What we're talking about today is taking data for a URL, scraping some of the content, and saving it into Cassandra. That's what we're storing: a cache of a URL, and we're tagging it and categorizing it. Right now we're focusing on the part of the diagram that holds the actual items; we have not yet implemented collections — meaning collections of links — and we have not implemented identities — meaning users — on Astra yet, but that's where we're going. We also want a cache for metadata, which we don't necessarily want to pull from the internet every single time — that's why we can use a separate database for that. As I said, anybody can go to astra.datastax.com and create an account for free; you can sign in with Google or with GitHub. These credentials are what get hard-coded into the code we're using, although I have separate credentials. Once you're logged in you'll see CQL, and if you want, you can see DataStax Studio and also the Grafana interface. More resources about all the things we're going to talk about today are available on cassandra.link; we also have a site called cassandra.tools, which is a subset of the knowledge base. DataStax is a great resource for all things Cassandra and DataStax — I'd say they have the best documentation out there for Apache Cassandra and DataStax, as you'd expect — and to find out more about Kafka, go to apache.org or to Confluent, the commercial vendor of Kafka. Now we're ready for the hands-on portion. We're going to go through the steps fairly quickly; as I mentioned, I don't expect you to follow along 100 percent. Remember, all the stuff I'm doing is online: you can copy and paste the exact same commands — except maybe the passwords — and it will work. Once we've set up our Gitpod, the four things we're really focusing on are: creating topics and schemas and verifying them; creating and consuming messages with Python using the Kafka consumer and producer API; and then finally using Kafka Streams and Kafka Connect as alternative methods to get data from Kafka into Cassandra. Amazing — let's go ahead and pull up the IDE and make sure our interface is up. See, this is what happens when you don't use a Gitpod instance for a while: it takes a couple of seconds to start up, but it's not as slow as the first time you pull it up — the first time it really has to build an image for you, and so on. While that's happening, over in DataStax Astra I'm just going to show you that there's some content in my database. What does it look like? And no, you're not supposed to be doing this count(*), but I did [Laughter] — there are about a thousand rows in here, and once we're done we expect a bunch more content. What kind of content is it? It's basically a URL, the content scraped from that URL, some incoming and outgoing links that we store, and a preview picture — it's not that many columns. I'm just going to truncate it, because I don't want you to think I baked it beforehand and that's how we did it.
I want you to see that I've actually gone through and populated the data using this code. So if we do a select, there's nothing there — now we have an empty database. As I mentioned earlier, there are two Gitpods. One is the code that exposes an API; Cedrick mentioned there are different ways to use Astra, including the ready-to-go GraphQL and REST APIs, and we intend to use some of those eventually, but the goal of this code was to show how to connect to Cassandra using Node or Python, and you can customize that code. The two versions are synonymous — they work exactly the same; same implementation, different technology. To make this work, very quickly: I go here and download the secure connect bundle — which I've already downloaded — then go up here and upload it. A cool thing about Gitpod is you can just drag and drop from your desktop, so I'll go ahead and upload this... and it's done. There's my secure connect bundle. I'm going to change my username in my credentials file — and I didn't drop my tables, they're already there — okay, and I've set up my credentials. If you want to know how I did this, how I knew what to do, you can always go to the cassandra.api repository and walk through the examples; everything I'm doing here is documented for you to check out. The first thing I need to do is get data into Cassandra, just to make sure the API is working. So in my terminal I'm installing some of the requirements — and apparently that's not a technology; not yet, maybe next year somebody will come up with something called Python [Laughter]. And this one is saying "sorry, I don't like you"... let's see... interesting. Well, I don't have to fix that now — let me just go and run the API; we'll worry about importing the data in other ways. I'm also going to install this... okay. All right, and now I'll run the Python version of the API. A cool thing about Gitpod is that when you have an API running on some port, it just exposes it. And by the way, this is working — I just don't have an endpoint defined at the root — but this is good: we know the API is up and running; I've verified it, it's showing up, we're good to go here. The next thing we'll do is in our real-time code, the Kafka code. I was going to walk through starting Kafka — in fact, I don't have to, because as part of the Docker image for this Gitpod, I've already told Confluent to go ahead and start all the components. As you can see — and I'll zoom in here so you can see as well — Confluent is like a cheat code for Kafka, just like DataStax is a cheat code for Cassandra: when you use it, life is a lot easier. So when you start Confluent, it starts ZooKeeper, the Kafka broker, the Schema Registry, my Kafka REST API, an instance of Kafka Connect, and also something called KSQL — we're not going to look at KSQL today, but it comes as part of the stack. When we start Kafka, by its very nature it's like an empty database: nothing is there; it's not configured.
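The walkthrough below creates the topic with the kafka-topics CLI; for comparison, here is a sketch of the same step done programmatically with the confluent-kafka AdminClient — the broker address, partition count, and replication factor are assumptions matching the single-node setup:

```python
# Create and verify a topic programmatically, mirroring the CLI steps below.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({'bootstrap.servers': 'localhost:9092'})

futures = admin.create_topics(
    [NewTopic('record-cassandra-leaves-avro',
              num_partitions=1, replication_factor=1)]
)
for topic, future in futures.items():
    future.result()  # raises if creation failed
    print(f'created {topic}')

# Equivalent of "kafka-topics --list": dump the cluster's topic names.
print(list(admin.list_topics(timeout=5).topics.keys()))
```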
So the first few things we have to do are create a schema and create a topic, just like with any other database. Now, if you've used GUIs before, there are some open source GUIs for Kafka; we're not using the Confluent interface today — we're going to use something called AKHQ, which I think Cedrick has also used in the past. — Yes, and the main developer is based in France and he's my friend, so please use it [Laughter]. — There you go. The very first thing we'll do, to get ready to move data from this Kafka instance into the other API, is copy the URL we exposed earlier; I'll just make a new notes file here and put my notes in it. There we go: now we have the API URL and we have Gitpod running — let's run a command to create a topic. When you see $CONFLUENT_HOME or $PROJECT_HOME, these are environment variables we set up to make things easier for you, and to make it easier to run this outside of Gitpod: you don't have to run this stuff in Gitpod — you could clone this repository and run all of the examples using Docker on your own computer — so to make it synonymous we put these environment variables in. Oops — where did my copy go... I'm going to copy and paste this line right here, and in case you can't see the text: it runs a tool called kafka-topics and creates a topic called record-cassandra-leaves-avro. Avro is a way to do schemas in Kafka — we won't get too deep into that today, but we'll briefly talk about why Avro schemas are useful. Run this — it's a Java program — and it created a topic for me. Awesome. Now I want to make sure my command actually worked on my Kafka, and if I list the topics, it should tell me: there you go, here's the topic for the Cassandra leaves we just made. There are other topics in there too: everything starting with an underscore is an internal Kafka topic that helps Kafka run itself — some of these are really for Kafka Connect and the Schema Registry — the ones prefixed with "connect" are for Kafka Connect, and this one is for KSQL. In a vanilla Apache Kafka you won't see some of these things — you really won't see anything at all; you'll just see this. Now that we have Kafka running and we have a topic, the next thing is to create a schema, and a schema for us is like a Cassandra table: it's a way to structure our information. When we generate some messages, I'll actually show you the difference between data in Kafka without a schema and with a schema, and we'll talk about why we need one. When we talk about schemas in computing, in architecture, there are two schools of thought. One school says: we don't need schemas, we don't need schemas for anything. I'm not in that school — it's because I've learned better. It's like when you make an API: why would you make an API that has absolutely no structure?
Just like you wouldn't create a REST API without a schema and then document that REST API for other people to use, I wouldn't want to make a table without a schema, and I wouldn't want to make a topic without a schema, because a schema is basically a contract between systems. The benefit of things like Cassandra and Kafka is that you can change your schema — it doesn't hurt the system — but I'm a believer in schemas, personally; we can have a conversation offline if you don't do schemas. I'm good, though. — Even the question I got said the same thing. — Exactly. So, we installed some prerequisites — they were for a Python program; you can look at that code later, but all it does is help us create schemas and manipulate these things, and for that we need the drivers; that's what the requirements are for. I'm going to run this command to create a schema, and I'll show you a preview of the schema — let's take a look at the file. We made the schema using this Avro schema file; what does that look like? There we go. An Avro schema is basically like a table schema: it says, here are my fields — here's a field called is_archive, and it's an integer. If I have structure — if I have arrays inside a field — I can put that in too; I'm not limited to a flat schema, I can have a hierarchical schema if I want. And it's important to have a schema because, when you map from Kafka to Cassandra, if they're using similar types, the system won't have to do any special parsing and concatenating and all the random stuff to make it work: if it's an integer, it gets mapped to an integer; if it's a string, it gets mapped to a string. So having a schema in Kafka lets us connect to Cassandra better down the line. (No, no — I was just complaining about the preview interface. There we go, you've got the schema.) Let's verify that the schema was created: if I curl — sorry, my mouse is doing funny things today — I can curl just localhost; I don't have to go out to the outside world, and it tells me that I have a schema available. So we're good; we did create a schema. Then I'll run a tool to verify this using AKHQ — and just a recommendation: when you run this command, run it in a different terminal — go to Terminal, open a new terminal — because I want it running while I do other things. It takes a few seconds to come up, but it'll say it's running; let's open it in the browser. This is an open source tool you can use with any Kafka instance, or with Confluent. There we go — there's our schema. There's obviously no data in this topic yet, because we haven't created any data. So that's running in one screen; that's good.
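The same register-and-verify step can also be done from Python against the local Schema Registry — a sketch in which the schema is a cut-down, hypothetical version of the leaves schema (see the repo for the real file):

```python
# Register an Avro schema under a subject and confirm it is queryable,
# the programmatic equivalent of the curl check above.
from confluent_kafka.schema_registry import SchemaRegistryClient, Schema

leaves_schema = """
{
  "type": "record",
  "name": "leaves_record",
  "fields": [
    {"name": "url", "type": "string"},
    {"name": "is_archive", "type": "int"}
  ]
}
"""

client = SchemaRegistryClient({'url': 'http://localhost:8081'})
schema_id = client.register_schema(
    'record-cassandra-leaves-avro-value', Schema(leaves_schema, 'AVRO')
)
latest = client.get_latest_version('record-cassandra-leaves-avro-value')
print(schema_id, latest.version)
```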
The next thing we'll do is import some data — we're going to import data into the topic. At first we're not doing anything with that data; we're just importing it. And this is something to note: Kafka can take messages for you and essentially keep them, without anything down the line doing anything with them yet — that's huge. Let's say you have one system using Cassandra and another system using, say, MySQL: your messages can be imported and processed into both systems if you want. Okay, most of my requirements are installed, and I'll just paste this. The data import code is fairly simple — it actually writes to a couple of different topics; you can take a look at that later. I'll double-check that the configs are correct: we've got the connection information — localhost, localhost, localhost — there's nothing here that wouldn't work. So let's run the command. What it's doing is reading a file — just a JSON file — and sending the data into Kafka using two things: the REST proxy, and a schema-less topic. When we go to AKHQ you'll see this. This is the Avro topic: if I click on it, I see it has structure — this is the data we populated through the REST proxy, which uses the schema. However, if I look at the other one, which I created without using any schema, the data is there as well — the same amount of data; it looks the same, but it's not: in the schema-less topic it's just one big string. That's it; that's all it has. So Kafka can do what you want — with a schema or without one. I like using it with a schema because it lets us do further things, as we'll see with Kafka Connect — or, if you wanted to use something called KSQL, you could do that too. So: we started Kafka, we created a topic, we created a schema, and we populated data using a program, just to dump data in there. The Python consumer API is fairly easy to use and it's all documented — again, we won't walk through the code today, but I think Python is probably the easiest way to get started with this stuff: change the code, run it, it's fast. If you want, you can run this program to see the same messages via a command-line tool; it takes a second, so I'll open a different terminal and run it. We sent about a thousand messages, and these are the messages now available inside Kafka. I've talked to folks who ask: why do we need Kafka, why can't we just write directly to Cassandra? Well, that's the whole purpose of having an event-driven API: events come in centrally, into the unified log — in this case we're using it to store data about URLs — and then, say, in our use case, we want the data persisted to Cassandra, but we also want to index it into Solr, and maybe we even want to run a machine learning API over it to extract some smart tags or smart topics. One event in Kafka can spin up several different processes. And say we make a change in our code — say a change to our schema: we can go back and run our code on the same exact topic and basically replay all the messages. That ability in Kafka allows us to do migrations — if you want to migrate from one code base to another, as long as we can read the schema from Kafka, we can repopulate a database using that information. That's just one of many reasons to go event-driven. And when people do stress tests with the next version of their software, what they'll do is say: just replay the topic — replay all of the events in our system from Kafka and see how it handles it. That's a great way to load test your system. This is going to take a second to replay all the messages, so I'm just going to go back to the code here.
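What the import script's "send with schema" path boils down to is a POST to the Kafka REST proxy with an Avro envelope — a minimal sketch, with hypothetical record fields:

```python
# Push a record through the Kafka REST proxy using the documented v2 Avro
# content type; the proxy registers/validates the schema for us.
import json
import requests

schema = json.dumps({
    "type": "record",
    "name": "leaves_record",
    "fields": [{"name": "url", "type": "string"}],
})

resp = requests.post(
    'http://localhost:8082/topics/record-cassandra-leaves-avro',
    headers={'Content-Type': 'application/vnd.kafka.avro.v2+json'},
    data=json.dumps({
        'value_schema': schema,
        'records': [{'value': {'url': 'https://cassandra.link'}}],
    }),
)
resp.raise_for_status()
print(resp.json())  # includes per-record offsets and the value_schema_id
```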
The next thing we'll do is look at our code from before, get the API endpoint — which is here — and go back to our cassandra.realtime and point our software at that URL. Let's see — excuse me. This is an example file, and I'll copy it first, before making any modification. There we go. In the properties file for our Kafka Streams example, you don't have to touch anything up here — and if you wanted to change the offset later, you could come back; this is the way we could replay the messages. Our keyspace: we change it to what we have in Astra; our table is called urls; and the API host is not localhost — it's the URL we got from Gitpod. Since this is part of a Scala application, it's not as simple as with Python, where we just change some code and run it; we're actually going to compile it. We could have done this in an interactive way, but we wanted you to get the feel of actually running a Kafka Streams example. In Kafka Streams, what we're really doing is looking at messages one item at a time and processing each into the API. The code is all Scala — if you've used Java, Scala is kind of similar; not exactly the same, but similar. Simpler to look at, more or less. — More or less! — How about this: the more you know Scala, the less you program, because you get really, really good at writing really concise code. — Which nobody understands. — Right [Laughter]. So, we compile our code and package it using something called Maven. You can use Maven, you can use sbt, you can do a lot of different things; for me, Maven is, I'd say, the lowest common denominator — every project I work on eventually ends up using Maven for everything, so why not just go with that. By the way, any time you change this config file, you'll have to rerun this, because the config file is packaged as part of the jar — but hopefully you won't have to do it too many times. There we go. And the moment of truth is when we run this program with this properties file: all the data currently sitting in Kafka, just chilling, waiting for us to do something with it, is going to make its way into Cassandra. Let's see. — I don't know if you'll get the messages into Cassandra, but you're already downloading half of the internet [Laughter]. — Yeah, I know. Since we actually processed the messages, I'm going to have to recreate them using the data import. — Yeah, they've been consumed already. — Well, the messages are still there; it's just that the consumer group has already read them. I guess I could have gone into the config, set the offset to zero, and replayed from there, but I want to see it live. So right now we have a consumer that's just sitting around, waiting for the stream — nothing's happening — and I'm going to start dumping some data into it. All right, here we go: it's doing something. And by the way, it's not perfect — some errors come up; that's part of the API processing. Let's see what's happening in DataStax Studio. Let's run — ah, we have some data coming in. — Nice. Cool. Clap clap. Then let's see, if we do a select star — no, don't do select star, man! Always put a limit on it, always [Laughter]. You want to break my machine? — Okay, okay, I agree; we're not going to do more than a thousand. — Okay, so it's processing, and data is coming in.
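The "set the offset to zero and replay" idea mentioned above is easy to see with a plain consumer: a fresh group id plus earliest offset reset re-reads the whole topic. A minimal sketch:

```python
# Replay every message in a topic by starting a consumer with a brand-new
# group id (no stored offsets) and earliest offset reset.
from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'replay-test-1',      # new group => no committed offsets
    'auto.offset.reset': 'earliest',  # so we start from the beginning
})
consumer.subscribe(['record-cassandra-leaves-avro'])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print(msg.error())
            continue
        print(msg.partition(), msg.offset(), len(msg.value()))
finally:
    consumer.close()
```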
The messages have been put into Kafka — we verified that here — we have another program, written in Scala, picking up the messages from Kafka and saving them into Cassandra, and we're verifying with CQL here that data is coming in. While that's happening, let's look briefly at the differences between the code paths. I'll minimize this a little. One part of the code here is Python: there's a program that takes data from a JSON file and sends messages to Kafka. One way, it sends them using the REST API — the REST proxy; the other way, it just sends the data itself. So, for example, here's a method, send_one_with_schema, where we use the REST proxy, and here's another that sends without a schema. If you want to see the different ways of doing it, you're more than welcome to download this code. When we send with the schema, we're basically saying: before I send the data, I want to make sure it meets the criteria of my schema, and the way we do that is by using the same schema file we used before. How? That's the reason we have a Schema Registry: when you send data to a topic that uses a schema, you first talk to the Schema Registry and ask whether your data actually meets the requirements of the schema; then, when you send the data, it can be saved properly. It's a little more involved, but it works every single time — if you have bad data, it kicks it out. To send data without a schema, you don't do anything special; you just say: put the data in here, I'm done. But that potentially causes issues in the future: if you want to do something like KSQL, or something like Kafka Connect, you won't be able to, because KSQL and Kafka Connect and Kafka Streams expect structure — they need structure to take the stream and turn it into a KTable, and so on. Scala, on the other hand — our Scala code uses Kafka Streams, which is different from the Kafka consumer and producer API — our Kafka code uses the Avro schema as well. I don't know why the Scala language interpreter isn't showing any syntax highlighting here, but what we're doing, eventually, after all the little formalities of connecting, is that we come down to processing a stream. Let me see if I can zoom so we see this better. — Yep, too small, too small. — Who said that Scala was concise code? — Well, actually, trust me, most of this is comments — these are comments. — I trust you, I know, just kidding; I like Scala. — Here we go, I'll jump right into the record processor so you get an idea. What we do here is, once we read the message, we send it using a REST API to Cassandra. You don't have to do that: you can have a Kafka Streams app written in Scala that reads from Kafka and writes directly to Cassandra using the Java driver itself — and that's actually how most people will probably end up doing things with Kafka, Scala, and Cassandra. Having an intermediary step, for us, was about respecting the idea of microservices: we have one microservice purely focused on saving data to Cassandra and retrieving it — that's what the cassandra.api code does — whereas this service's pure reason for being is to process messages from Kafka and send them to that other microservice.
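For comparison with the repo's REST-proxy path, here is a sketch of the same schema-checked produce done directly from Python with the (legacy but widely used) confluent-kafka AvroProducer; the field name is hypothetical:

```python
# Produce an Avro-encoded message: the client validates the payload against
# the Schema Registry before anything hits the broker, so bad records are
# rejected here rather than discovered downstream.
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

value_schema = avro.loads("""
{"type": "record", "name": "leaves_record",
 "fields": [{"name": "url", "type": "string"}]}
""")

producer = AvroProducer(
    {
        'bootstrap.servers': 'localhost:9092',
        'schema.registry.url': 'http://localhost:8081',
    },
    default_value_schema=value_schema,
)
producer.produce(topic='record-cassandra-leaves-avro',
                 value={'url': 'https://cassandra.link'})
producer.flush()
```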
Scala, on the other hand: in our code we're using Kafka Streams, which is different from the Kafka consumer and producer APIs, and our Kafka code is using the Avro schema as well. I don't know why the Scala language support isn't showing any syntax highlighting here, but what we're doing is, after all the little formalities of how to connect, we essentially come down to processing a stream. Let me see if I can zoom so we can see this better.
Too small, too small. And who said Scala was concise code? [Laughter]
Well, actually, trust me, most of this is comments.
I trust you. Just kidding, I like Scala.
Here we go. I'm going to jump right into the record processor so you get an idea. What we're doing here is, once we read a message, we send it using a REST API to Cassandra. You don't have to do it that way: you can have a Kafka Streams app written in Scala that reads from Kafka and writes directly to Cassandra using the Java driver itself, and that's actually how most people will probably end up doing things with Kafka, Scala, and Cassandra. Having an intermediary step, for us, was about respecting the idea of microservices: we have one microservice that is purely focused on saving data to Cassandra and retrieving it, and that's what the cassandra.api code is doing, whereas this service's pure reason for being is processing messages from Kafka and sending them to that other microservice.
Oh, I got an interesting question in the chat: the processor that extracts data from Kafka and puts it into Cassandra is written in Scala, just to showcase Scala; would it be possible to do it in Python, or is there any limitation?
Are you talking about Kafka Streams, or just the consumer?
The consumer; the code that's just reading from Kafka and putting data into Cassandra. This one is Scala, but can you do the same in Python?
Yep, you can do the same exact thing in Python, you can do it in Java, you can do it in Node, it doesn't really matter. However, the Kafka Streams API, this API right here, is slightly different.
So for the consumer you can use any client, but Kafka Streams is different?
Exactly. The consumer and producer APIs exist for pretty much every language under the sun. Let me see if the docs have a list... it doesn't show one here, but I'm pretty sure that if you're programming in any modern language, the producer and consumer APIs exist for it. Worst case scenario, and I mean absolute worst case, you can use the Kafka REST proxy to put data in and get data out, but you shouldn't need to do that. Now, the Kafka Streams API is the one where Scala definitely has native support, as well as Java; I don't think Python has it yet. I'm looking, but I don't think it's there.
Yeah, I'm looking as well. That makes sense, because it's a stream with a cursor, and you're just doing callbacks.
Right. And by the way, Kafka is written in Scala, so if any language is going to get native support for something first, it's obviously always going to be Scala. The core of our streams app boils down to something like the sketch below.
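A minimal sketch of that shape, assuming string keys and values; postToApi is a hypothetical stand-in for the repo's record processor, which sends each record to the Cassandra REST API:

```scala
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.serialization.Serdes._

object UrlsStreamApp extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "urls-stream-app")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  // Hypothetical stand-in for the repo's record processor (a POST to cassandra.api).
  def postToApi(value: String): Unit = println(s"POST to cassandra.api: $value")

  val builder = new StreamsBuilder()
  builder.stream[String, String]("urls-topic")   // hypothetical topic name
    .foreach((_, value) => postToApi(value))     // one record at a time, like the workshop's processor

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}
```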
Okay, so what's happened is that we processed 999 messages, with 871 successes and roughly 127 failures. I know why the errors are there; there are a couple of reasons. One is that some of the data had issues: since it didn't match the schema, it failed. Other times, there were issues on the side of the Cassandra API, which is running right here. So there are always some errors. And here's the perfect scenario: let's say we have to fix our code and recover those messages. We can always go recover them, replay, and not lose anything, because all 999 records are still here; in fact, if I refresh, you'll see roughly 2,000 messages. Here we go, there are roughly 2,000 messages here, because all the data we've been putting into Kafka is still there; it never gets lost. That's what I like about Kafka: you can have real-time traffic always being stored as messages, and if there's a bug, you fix it, replay the messages, and it'll be fine. So our Kafka Streams example works, for the most part; there were some errors it didn't handle well, and we can always change that later.
The next thing we're going to do is use something called Kafka Connect. Kafka Connect is like a lightweight enterprise service bus; well, it's not really an enterprise service bus, maybe I'm going too far, maybe it's more like an ETL framework. What Kafka Connect does is let you plug in an arbitrary connector to move data: a Kafka Connect source takes data from some system and puts it into a topic, and a Kafka Connect sink takes data out of Kafka and puts it somewhere else. In our example we're not using a source, we're creating the data ourselves, but we are going to use the Kafka Connect sink for Cassandra from DataStax, which lets us take the data in the topic and put it into Cassandra directly. In this case we're not using the REST API at all; we're talking directly to Astra.
Let me make sure you can get to the right part of the documentation here, and I will zoom in.
When we said we have a lot of documentation, it's true; the README is just super, super long.
Yeah, but if you copy and paste it, it'll work, trust me. I think that's the big thing about any of this code: we don't want you to reinvent the wheel. If you can copy and paste it and get it to work, you can then change it and build upon it.
Okay, so in order to run this next example, we need to upload the secure connect bundle we downloaded earlier, and I'm going to put it where it belongs: there's an astra credentials folder, and I'm just going to drag the secure connect file over there.
In the meantime, I got a question from Gerard Gulla: does Kafka Streams do any kind of transformation of messages? That's a really good question. Kafka Streams exposes your Kafka topic as two abstracted objects: one is called KStream and the other is called KTable. With KStream you're operating on one item at a time, whereas with KTable you can operate on multiple records at a time. And yes, anything you can do in Scala, you can do in KStreams; in fact, if you wanted to consume a message, run a machine learning evaluation to get a prediction, and save the result back to a topic, you could do all of that in KStreams, using whatever model you have, through the Scala API. KStreams is just Scala, or Java, whichever environment you're in, so whatever transformation you can do in Scala or Java, you can do with KStreams. Everything has a processing time, of course, but you can do anything you want to do there; a tiny sketch of that kind of transformation follows.
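Picking up that question, here is a minimal transform-and-republish sketch under the same assumptions (hypothetical topic names, string values); the uppercase call is just a stand-in for whatever per-record work you want, such as scoring with a pre-trained model:

```scala
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.serialization.Serdes._

object EnrichApp extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "enrich-app")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()
  builder.stream[String, String]("raw-topic")   // hypothetical input topic
    // Stand-in for any per-record transformation, e.g. appending an ML prediction.
    .mapValues(value => value.toUpperCase)
    .to("enriched-topic")                       // hypothetical output topic

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}
```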
So, in order for us to run the Kafka Connect example, we first need to provide the Kafka Connect sink jar; for your convenience I've already uploaded it here, but for any sink or source connector you would have to provide the jar. The other thing to note about Kafka Connect is that it doesn't require you to run a server or a service: you can run a Connect process standalone, and that's what we're going to do. If you wanted this thing to run as part of a service, you could do that too, but we're not going to cover that today.
So: from somewhere into Kafka, and then from Kafka to somewhere else. That declaration is a properties file, and we're going to copy it; if we mess up, we can always come back. Don't get scared of this file, and the reason I say that is that it can look daunting, but remember what we're doing here. We have a compiled piece of code, this Connect Cassandra sink, which can take any schema in Cassandra and any schema in Kafka and make a bridge between them; that's the kind of audacious goal this connector has. So what do we have to do? We have to tell it which Kafka instance to talk to, because that's where the data is; we have to tell it which Cassandra to talk to, where to send the data; and we have to tell it which fields are mapped to which fields. This one file has to do all of that, and on top of that it's not compiled, it's just a configuration file. That's why it looks complicated, but it really isn't; I would break it apart into roughly three sections.
The first part here just says: this is my sink connector, here's the class to use in my jar, run only one task at a time (if I had a bigger machine I could say do more tasks at a time), and here's the topic I want to subscribe to. I'm giving it a timeout for when it's talking to DataStax Astra, and I'm also giving it my secure connect bundle, which is what it's going to use to connect to my Astra instance; I have to change this, and in my case it's called platform. Then there are a couple of other lines you have to change. I'm fast-forwarding through the other sections: all of this is more refined ways to connect to Cassandra, compression, authentication, how many connections per second, all of which I can change if I want to. I need to give it a username, in my case admin, and a password; the password is very secure, and I don't care if you look at this username and password, because unless you have the secure connect bundle, you can't connect to this SaaS instance anyway. Then there's more connectivity information for Cassandra, which we'll just fast-forward past.
And then finally, here's the line that tells Connect how to map my information. The mapping says: take the is archived column in Cassandra and fill it from the is archived field in the topic, and because I used the same exact names on both sides, I didn't really have to do any special mapping; all I'm saying is this Cassandra column equals this topic field, and so on and so forth. If I wanted, I could do some basic transformation in here, not just value-to-value; you can also say, build the insert statement from the data that I have. We're not going to do that today, but you can do some basic transformation if you want to. Stripped down, the file looks something like the sketch below.
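Stripped to its three jobs, the sink config looks roughly like this; the property names follow the DataStax connector's general pattern but are reproduced from memory, so treat them as illustrative and check the connector documentation for the exact spelling:

```properties
# Illustrative sketch -- consult the DataStax Kafka connector docs for exact property names.
name=cassandra-sink
connector.class=com.datastax.oss.kafka.sink.CassandraSinkConnector
tasks.max=1                       # one task at a time; raise on a bigger machine
topics=urls-topic                 # hypothetical topic name

# Which Cassandra to talk to: Astra's secure connect bundle plus credentials.
cloud.secureConnectBundle=/path/to/secure-connect-platform.zip
auth.username=admin
auth.password=*****

# The mapping: cassandra column = topic field, one pair per column.
topic.urls-topic.my_keyspace.urls.mapping=url=value.url, is_archived=value.is_archived
```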
One other thing I'll briefly mention: you could take data that comes into Kafka, manipulate it with the Streams API, and then have that data go to another topic that's connected via Kafka Connect, essentially doing specialized transformation before it ends up in Cassandra.
In message orientation, it's all about the edge cases. One question: if one of the messages is off and you get an error, what does Kafka think? If you send a message and there's an error in processing, today it's Cassandra but in general, does the Kafka sink have some mechanism to put the message in a dead-letter queue, something like that?
So, Kafka Connect has some basic error handling: it will retry if, say, the write operation didn't work. It's not super robust, but it's there. And in terms of reading from Kafka, it's intelligent enough that if it didn't process a message properly, it won't commit it in Kafka; it'll try again until it does process it.
I'd like to add that in this kind of architecture, you want your flows to be idempotent anyway; you want to be able to replay. It's true for Kafka, it's true for Cassandra: you want to be idempotent everywhere.
Absolutely, you want your event system to be idempotent. The moment you say you have an event-sourced system, you're buying into that mentality. A couple of things to note: you don't have to do just inserts or updates, you could even do deletes. We're not going to do deletes today, but Connect is a very powerful system; just be careful with what you're doing here. A couple of other things I'm changing: these are settings related to the Kafka topic. I'm changing the keyspace and the table name, which are specific to my instance, and there are a bunch of settings related to Kafka that I'm not touching; you can always go back and read more into those. I believe that's it. So, to summarize: this file looks daunting, but it's doing three things. It tells Connect which Kafka instance to talk to, which Cassandra instance to talk to, and how to map the Kafka topic onto the Cassandra table.
Before we can run our Kafka Connect example, we need to stop the Kafka Connect service; remember I mentioned that you can run Connect as a service, and Confluent, out of the box, is running one for us. I'm going to go ahead and stop it, and I'm going to empty out my table. Okay, Connect is down. I'm going to do a couple more checks here to make sure I'm not running anything competing: my Python data importer is not running, and my Connect service is down. Good; what I don't want is a competing process. And this is the magic of Kafka Connect: with one properties file and a jar compiled by DataStax that you didn't have to touch, all I'm saying is, run Kafka Connect using this jar and this properties file, and it's now listening for messages. Let's see... that's okay, this happens sometimes. All right, it's running, and I'm going to go ahead and start producing messages.
So now we don't need any consumer anymore? It's at the Kafka Connect level that the data gets pushed into Cassandra directly?
That's right, that's right. I'm going to double-check to make sure I set everything up properly.
Let me see... okay, this is good: we have the secure connect bundle, and all of this looks like it's okay. I love it when it looks like it's okay... oh, here we go, here's my problem: I did not give it the proper mapping. The keyspace, the most important line in the file, and I didn't give it the proper keyspace name. Okay, now it's running and it should be processing. You're not going to see much, but... see how fast that went? I didn't truncate the table, so it added more rows on top, but it basically took everything in Kafka, because I had set the offset at a higher number, and put it into Cassandra. I'm actually going to stop the service so that you can believe that this is really, really fast, and I'm going to recreate the messages. Okay, after truncating there are zero records. I'm going to go back to my messages, create a thousand messages, run Connect again, and let's do a count. Within a few seconds it created 750 records... and it's done; all thousand rows have been processed and it's finished.
Kafka Connect is the least programmatic way to get data from Kafka into Cassandra, but we did have to do some planning, and while we were working through the other examples we did that homework: we created a schema in Cassandra, we created a matching schema in Kafka, and we had a data importer dumping data into that topic. So when we finally connected Kafka to Cassandra using Connect, it was just one file. That's all: here's where to connect to Kafka, here's where to connect to Cassandra, here's the mapping, and there it is, everything moves from Kafka to Cassandra.
We talked about three different ways to work with Kafka in conjunction with Cassandra today. The first is the consumer and producer APIs, which are compatible with pretty much every programming language: C#, Python, Node, and so on. The second way is Kafka Streams, which is a domain-specific language in Scala (and Java) for working with Kafka data and doing processing at the stream level, or at what's called the KTable level. And the last way is Kafka Connect.
What are the pros and cons? Well, if you have a team where people program in different languages, Kafka can provide the lingua franca, the common language between all the different systems: as long as you meet this schema, go ahead and create messages in these topics, and it doesn't matter what programming language you use. Then, if you want to do special processing and you have team members who know Scala or Java, you can use Kafka Streams. In fact, if you go to Google and look for "kafka streams topology", you'll find lots of examples of people using a topology, a shape of data flowing from one topic to another, to do everything inside Kafka Streams. What kinds of things can you do in Kafka Streams? You can do lookups from other topics; you can do merges, where two different topics get merged into another topic. How to use Kafka Streams is a whole subject on its own; there's probably a three-day workshop just in explaining what it can do for you. But to put it plainly, Streams allows you to do operations on your topics inside the cluster; a taste of the merge idea is sketched below.
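A fragment showing such a merge, reusing the same StreamsBuilder harness (props, KafkaStreams, start, shutdown hook) as the earlier sketches; topic names are hypothetical:

```scala
// Reuses the StreamsBuilder harness from the earlier sketches.
val left  = builder.stream[String, String]("topic-a")   // hypothetical topics
val right = builder.stream[String, String]("topic-b")
left.merge(right).to("topic-merged")   // interleave both streams into a third topic
```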
I do have a question, and I hope it's not a trap: when would you use Kafka Streams versus Spark Streaming?
Sure, sure, great question. Spark is very, very powerful; it's built for distributed computing and was meant to replace something like Hadoop. Spark can do not only distributed operations, it can also do single-item operations, similar to Kafka Streams, but in that case Spark is a bit more heavyweight than Streams. Kafka Streams operates on an item-by-item level; it doesn't have any knowledge of the whole record set. In Spark, you can have knowledge of the whole record set, you can do operations on the whole record set, and it breaks the work apart into discrete in-memory tasks that can work on one record, or on one column of a whole table if you want. It breaks down the work in a way you're just not getting with Kafka Streams. So where would I use Spark versus Kafka Streams? If I knew that all I needed to do was operate on the message, even machine learning, if the model is already built, I can apply that model in Kafka Streams. Whereas with Spark, I could build a new machine learning model, I could do crunching on the whole data set; there's no way I'd want to do that in Kafka Streams. So there's really just one little sliver of the Venn diagram where they're similar, which is single-item processing. You could do the same thing we did with Kafka Connect in Spark, but man, that's a lot of work; or even joins, it's a lot of work.
Okay, thank you.
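To give a feel for the Spark side of that comparison, here is a minimal Structured Streaming sketch in Scala that reads a Kafka topic as an unbounded DataFrame; it assumes the spark-sql-kafka package is on the classpath, the topic name is hypothetical, and a real pipeline would write to Cassandra via the Spark Cassandra Connector instead of the console:

```scala
import org.apache.spark.sql.SparkSession

object SparkKafkaSketch extends App {
  val spark = SparkSession.builder()
    .appName("kafka-to-console")
    .master("local[*]")
    .getOrCreate()

  // Spark sees the topic as a whole (growing) record set, so aggregations,
  // joins, and model training over it are natural here, unlike Kafka Streams.
  val df = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "urls-topic")          // hypothetical topic
    .load()

  df.selectExpr("CAST(value AS STRING)")
    .writeStream
    .format("console")                          // stand-in sink for the sketch
    .start()
    .awaitTermination()
}
```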
Okay, and then there's Kafka to Cassandra with Kafka Connect, which is the simplest, easiest way to get data from Kafka to Cassandra. And that's not all: Kafka Connect has so many different connectors, for MySQL, for Oracle, even CDC; you can take data from one Cassandra cluster and send it to another Cassandra cluster using the CDC connector. So Kafka Connect is very powerful. Why would you use it? I think it's great for getting data from one system to another, not just for synchronization but for general ETL. Say you have to do data analytics across several systems: Kafka Connect would be the perfect tool to get that data into one place. Yes, Spark can do all that work, but it's way more work. You take MySQL, you take Oracle, you take all that data and use connectors to get it into Kafka, and then from Kafka you stream it all into Cassandra; once it's in Cassandra, you can do anything with that data. So I think Kafka Connect is very powerful if you use it in the right context, but I wouldn't replace application development with it; I don't know if that's really the right place for it.
And with that, we covered everything we set out to do: we brought up the API from before, we brought up Kafka, created topics, moved messages with Python, used Kafka Streams, and finally we used Connect. And we did all of that using Gitpod, an online web-based tool, so you can do all this fast data, big data stuff without having fast, big computers. With that, I'm going to hand it back.
Okay, thank you very much, Raul, it was very, very useful. I really loved doing all the step-by-step and reviewing the content; it's very dense, you learn a lot, it's Python and Scala. Really loved the content, and really great coding as well.
Info
Channel: DataStax Developers
Views: 2,885
Rating: 4.9230771 out of 5
Id: W2RKO2Egdag
Length: 110min 6sec (6606 seconds)
Published: Tue Oct 06 2020