Real-time Tables in Apache Pinot

Captions
Hello and welcome to the Getting Started series with Apache Pinot. I'm Matt Landers, Head of Developer Relations at StarTree, and in this video we're going to talk about real-time tables.

Before we dive into the specifics of real-time tables in Pinot, let's talk about what real time actually means. Real time in Pinot is the end-to-end experience of getting data into the system and then making it queryable out of the system. We want that entire process to be as fast as possible, and that's critical to providing great experiences to our end users: we want to allow them to get insights out of the data that's being produced and make actionable decisions on that data, so we can't have a lot of delay there. The smaller the delay, the faster we can make those decisions about our business.

Pinot is really good at this. In fact, it was created at LinkedIn for this purpose. If you go on LinkedIn and view who's viewing your profile, or look at a lot of the stats they provide to you, all of that is coming out of Pinot, and it's really fast. If it took seconds to get out of the database you would never see it, because you would just move past that page and not worry about it. But since it's so readily available and it's in real time, it's pretty cool to be able to take advantage of that data.

The way Pinot works is that we can take in events, and we're talking about tens of thousands to millions of events per second, load them into the database, and have them be immediately queryable. Then the query that runs on top of Pinot needs to be really fast, under a second, so that our users stick around and gain insights from that data. We can also do things like anomaly detection, and the faster we can do that, the faster we can stop security threats and things like that.

In this demo we're going to show you the basics of getting real time working in Pinot. If you haven't already watched the previous video, you should do that: go to our YouTube channel, search for Getting Started with Pinot, and start with the very first video. In that video we talk about all the different moving parts of Pinot and get our first cluster up and running. In this video we assume you've already got a cluster up and running, and now we're going to ingest real-time events and query them.

We're going to ingest events from the Wikimedia Foundation. They have an event platform that sends all the events about Wikipedia over HTTP, and you can just subscribe to those events and start doing things with them, which is pretty cool given that it's available to the public for free. So we're going to write a little Node.js app to read those events, and then we're going to load them into Apache Kafka. Apache Kafka, if you haven't used it before, is an event streaming platform, and the reason we're using it is that Pinot already ships with the capability of reading from Kafka. So we'll use Node to read the events from Wikipedia, throw them into Kafka, and then create a schema and a real-time table on top of that schema that is hooked up directly to Kafka, loading that data into our database where it is immediately queryable. In our example we're just going to have one node, but imagine this living across thousands of nodes: the data ingestion works just the same, and the queries are just as fast. It's pretty amazing, actually. So let's dive in, create our first real-time table, and query it. All right, here we go.
All right, so the first thing we want to do is connect to the wiki event stream. We're going to use Node.js to write a little script that hooks up to the wiki events, pulls them down, writes them to files so we can inspect them for debugging, and also sends them over to Kafka.

Let's go ahead and install Node, because you're going to need it. I'm in the Docker container that we set up in the previous video, and I'm going to run a command that adds a Node.js repository to apt (you can get this command, and the link to my repo, in the YouTube comments). Once the repo is added, run apt install nodejs to install Node. To make sure it's working, type node -v, and check that you have npm as well. If you already have Node on your machine you don't need any of this; I just wanted the commands here so you can follow along if you need them. We're also going to use vim, so run apt install vim. I'm only doing this because I want to stay in the terminal and keep things simple rather than copying files back and forth to the Docker container; if you have everything locally, use whatever text editor you like.

Now that Node is installed, let's create a folder for the real-time app we're going to write. I'll create a directory called realtime, go into it, and use vim to create a wikievents.js file. I'm not going to make you watch me type it all out (it's not that long), but let's talk about what it does. Here's the whole file. We use the eventsource package and the kafkajs package: Wikimedia follows the web standard for EventSource, and that package makes it really simple to connect to the stream. We connect to Kafka, which we'll run on port 9092, the default Kafka port. Then, whenever we get a message from the wiki stream, we write it to the file system (just for debugging) and we also write it to a topic called wikipedia-events. Remember that topic name; we'll need it when we create our real-time table. We basically send each event over as-is, exactly as we get it from Wikimedia, and we'll see what that looks like in a minute. I'll save this file; you can get it from the repo as well.
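If you want to follow along without pulling the repo, here's a minimal sketch of what that script does. The stream URL (Wikimedia's recentchange stream), the topic name wikipedia-events, and the debug file path are assumptions based on what's described above, so adjust them to match your setup:

    // wikievents.js -- minimal sketch: read Wikimedia events over SSE and forward them to Kafka.
    const EventSource = require('eventsource');
    const { Kafka } = require('kafkajs');
    const fs = require('fs');

    const kafka = new Kafka({ clientId: 'wikievents', brokers: ['localhost:9092'] });
    const producer = kafka.producer();

    async function main() {
      await producer.connect();

      // Wikimedia publishes its events over HTTP using the EventSource (SSE) standard.
      const es = new EventSource('https://stream.wikimedia.org/v2/stream/recentchange');

      es.onmessage = async (event) => {
        // Keep a copy on disk for debugging, then send the raw JSON to Kafka as-is.
        fs.appendFileSync('./events/recentchange.json', event.data + '\n');
        await producer.send({
          topic: 'wikipedia-events',
          messages: [{ value: event.data }],
        });
      };
    }

    main().catch(console.error);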
Now we need to install Kafka. To install Kafka we just download the package, similar to what we did for Pinot. If we go up to the previous folder, the one that contains Pinot and our realtime folder, we'll add Kafka there as well. I'll run wget with the download link, and it takes just a second. Once the file is downloaded, extract it with tar -xvf; that extracts everything into a folder with the same name as the archive. I'm going to rename that folder to just kafka for simplicity's sake as we move around, and delete the tar file, so we have a nice clean folder.

Now we want to get Kafka running. Let's go into the kafka folder. I'm going to run it from the bin folder, but first I want to edit Kafka's config file, and I'll show you why in just a second. Open server.properties in the config folder and find the ZooKeeper address, where it says localhost:2181. As it stands, that would put all the ZooKeeper metadata that Kafka uses directly into the root of my ZooKeeper and clutter it up; we have Pinot data in there as well, and it would be hard to tell what's what. So I'm going to add /kafka to the end of that address to keep Kafka's metadata under its own path. That's the only change; I'll leave the defaults for everything else.

Now we can run kafka-server-start and give it that config file, config/server.properties, with an ampersand at the end so we get our terminal back. That connects to ZooKeeper and gets Kafka up and running. Again, Kafka is an event streaming platform. I say simple, and it can actually be pretty complex, but the way we're going to use it is simple: Kafka has this idea of a topic, you send events to that topic, and we're going to tell Pinot to subscribe to that topic and pull the events into the database once we create our schema and our table.

Let's go back to our realtime folder. Before running wikievents.js we need an events folder, since the script writes its debug files there, so create that, and we also need to install the packages: npm install eventsource kafkajs downloads those two packages into node_modules, and you get a package.json and all the fun stuff that comes with Node. Now we can run it with node wikievents.js. It opens the connection to Wikimedia, starts pulling in all these events, and loads them into files and into Kafka.
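Here's a rough recap of those terminal steps, assuming the same folder layout as above. The Kafka version and download URL are placeholders, so grab the current binary release from kafka.apache.org:

    # Download and unpack Kafka next to the Pinot and realtime folders
    # (replace <version> with a current release from kafka.apache.org).
    wget https://downloads.apache.org/kafka/<version>/kafka_2.13-<version>.tgz
    tar -xvf kafka_2.13-<version>.tgz
    mv kafka_2.13-<version> kafka && rm kafka_2.13-<version>.tgz

    # In kafka/config/server.properties, keep Kafka's metadata under its own ZooKeeper path:
    #   zookeeper.connect=localhost:2181/kafka
    cd kafka/bin
    ./kafka-server-start.sh ../config/server.properties &

    # Back in the realtime folder: install the packages and start the producer.
    cd ../../realtime
    mkdir events
    npm install eventsource kafkajs
    node wikievents.js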
This has been running for a few seconds, and because it's a very active event stream we already have quite a few files. I'll cancel it and go to the events folder, and you can see all the folders that were just created. We can go into the English one and look at all the events that happened just on English Wikipedia. If we cat one of these files out it doesn't look great, because it isn't formatted, so let me clear the screen; I have a formatted one for us to look at.

This is what the JSON that Wikimedia sends us looks like. There's an id at the top level, but I don't think that one is unique, so we're not going to use it; the unique id we want is nested under the meta property. We also have the domain (this one is English Wikipedia), the type (categorize), the title, the comment, the user, and the timestamp. The timestamp is going to be very important: whenever we're dealing with a real-time table we've got to have a timestamp, because how is it real time if we don't have a time? A time column is a requirement for a real-time table. So that's what the data looks like, and now we're going to create a schema that can pull in some of it. We won't pull in everything; we'll just create columns for some of the fields so we can read them out of Kafka.

Now we're in the Pinot UI, and this is where we'll create our schema and our table. First, I want to show you something in ZooKeeper. Remember how we appended /kafka to the ZooKeeper address in the Kafka config? That's why we now see all the Kafka config under its own kafka node in ZooKeeper. If we hadn't done that, all of these entries (they're really key-value pairs rather than folders and files) would have landed at the root, mixed in with the Pinot metadata. It would still work, but it wouldn't look very nice when you come in here, so we broke it out; I just wanted to show that.

We have our Pinot cluster up and no tables yet, so let's create a schema. We'll name the schema wikipedia. Let's add a few fields: an id, which will be a dimension. We're only going to add dimensions here, because this data isn't great for metrics, but you get the idea; you could add metric columns as well if you had data that called for it. We also have to add our timestamp, which is the date-time field, and it's an epoch timestamp in milliseconds. We can grab the user too, which is also a dimension. I'm just adding a few fields to show the flow; I'm actually going to replace this with a full schema I've already created, but this is how you would go through and create a dimension for every field you want to pull out of the JSON. Then, when we write the real-time table, we'll be able to transform and map some of the nested elements into columns as well, for instance that id under meta; I'll show you how to map that with a transform function in just a second. If we save this, we end up with our schema. The schema config is on my getting-started-with-apache-pinot repo on GitHub, and the real-time table config is there as well, so you can just pull those down and add them from the command line if you don't want to create all of this by hand, but I wanted to show how you would do it.
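For reference, a trimmed-down version of that schema might look something like this; the exact field list and the names like metaJson and stream are assumptions based on the columns discussed in this walkthrough:

    {
      "schemaName": "wikipedia",
      "dimensionFieldSpecs": [
        { "name": "id",       "dataType": "STRING" },
        { "name": "domain",   "dataType": "STRING" },
        { "name": "user",     "dataType": "STRING" },
        { "name": "stream",   "dataType": "STRING" },
        { "name": "metaJson", "dataType": "STRING" }
      ],
      "dateTimeFieldSpecs": [
        {
          "name": "timestamp",
          "dataType": "LONG",
          "format": "1:MILLISECONDS:EPOCH",
          "granularity": "1:MILLISECONDS"
        }
      ]
    }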
All right, so now let's create our real-time table. It's pretty simple: click Add Realtime Table at the top, and if we name the table the same as our schema it automatically pulls in our timestamp column. If there's more than one candidate you can pick the one you want from the dropdown, but we only have one, and it's called timestamp. We need that for a real-time table because real-time tables are based on time: the data is segmented by lengths of time, so a real-time table has to have a time column.

Now let's scroll down and look at the stream config. This is different from the offline tables we saw in the previous videos; this is where we configure how the streaming data gets into the table. We're going to use Kafka, so the stream type is kafka. The UI is already set up for Kafka because that's what most people use to stream data in, but you can use different plugins to pull data from other places as well. We need to fill in the broker list, which is localhost:9092, the port we ran Kafka on, and the topic name, which is the wikipedia-events topic we used in our JavaScript file. It uses the decoder class shown here, the KafkaJSONMessageDecoder, which is the default: it parses the JSON messages coming off the Kafka topic, and any property at the root of the JSON that matches a column name gets mapped to that column automatically, with nothing else for us to do.

There are, however, a few columns we chose that live under the meta property, which contains more nested JSON, so we need some transform configs to get at those. The other option would have been to flatten the JSON before putting it into Kafka, but you might not have that option, because other consumers are probably reading from the same stream, so we want to take the JSON as it exists and map it to the right columns.

The first thing we do is add a transform function on the metaJson field. When we created the schema in the UI we didn't add all of these fields, but I included them in the full schema; metaJson is just a string field that we're going to put JSON into. Its transform function is jsonFormat(meta): in the JSON we get from Wikimedia there's a property called meta, which is a nested object with more properties, including the id we want and a couple of other things like stream, domain, and topic. We need to land that JSON in a column of our table in order to use it in the other transform functions. One of those is the id column: its transform is jsonPath, referencing the metaJson field we just created as the place where the JSON lives, and then, in single quotes, $.id. This uses JSONPath syntax, which is a standard you can look up if you've never seen it: $ is the root of the document, and .property indexers work just like they do in most languages, or like referencing JSON in JavaScript. We have a couple of other columns to add the same way, so I'll do one more real quick: the stream column, which again uses metaJson as the field and $.stream as the path, to pull that value out of the document. Once we've defined all of our transforms, we can come down and save it, and that creates the real-time table.
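Put together, the relevant parts of the real-time table config look roughly like this. Treat it as a sketch rather than the exact config from the repo: the topic name, column names, and plugin class names should be checked against your own Kafka and Pinot versions.

    {
      "tableName": "wikipedia",
      "tableType": "REALTIME",
      "segmentsConfig": {
        "schemaName": "wikipedia",
        "timeColumnName": "timestamp",
        "replicasPerPartition": "1"
      },
      "tenants": {},
      "tableIndexConfig": {
        "loadMode": "MMAP",
        "streamConfigs": {
          "streamType": "kafka",
          "stream.kafka.topic.name": "wikipedia-events",
          "stream.kafka.broker.list": "localhost:9092",
          "stream.kafka.consumer.type": "lowlevel",
          "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
          "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
          "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder"
        }
      },
      "ingestionConfig": {
        "transformConfigs": [
          { "columnName": "metaJson", "transformFunction": "jsonFormat(meta)" },
          { "columnName": "id",       "transformFunction": "jsonPath(metaJson, '$.id')" },
          { "columnName": "stream",   "transformFunction": "jsonPath(metaJson, '$.stream')" }
        ]
      },
      "metadata": {}
    }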
Now this real-time table should go out to Kafka, start listening to the stream, and populate as we pump events to it. Let's go see if we can get some data in there. If we go to the query console and click wikipedia, we'll see that we already have data in here, 1,069 docs, even though our event stream isn't running right now. The reason is that Kafka isn't ephemeral: it keeps those topics and messages around, so when we created the table, Pinot hooked up to the topic and pulled in whatever data was already there. If we start streaming data, we'll see more arrive as well.

So let's start our stream again: go back to the terminal and run node wikievents.js. Now we're talking to the wiki event stream, pulling events in, and writing JSON files, which you should probably comment out at this point because it's going to make a huge folder; I'm going to leave it for now. Back in the browser, if we run the query again, look at this: now we're at 1,822 docs, then 1,939. These events are being loaded in real time, as they actually happen on Wikipedia, which is pretty cool.

Now we can query this data in a more meaningful way. Let's pull out the user and see who is making all of these edits: we can count the user, select the user as well, group by user, and order by the count descending. If we run that, we see that the top user has made 186 edits in a matter of seconds. A lot of the accounts editing Wikipedia are bots; you'll see some of them are literally named as bots. I don't know exactly what they all do (maybe you want to look into that), but most of them are bots, and some of the entries are actual people making changes. Wikipedia has bots running through and cleaning things up all the time, so that's where a lot of these messages come from.

You can play with this data and see that we're now at 3,403 docs. You can imagine that over the course of a day, a few days, or a month, this data could get really, really big, and we would want to spread it across multiple servers and scale out, while making sure we maintain this really fast query response time; it's 24 milliseconds right now. Obviously it's not a ton of data yet, but it isn't insignificant at this point, and it's growing as we sit here: we're at 5,000 docs now.
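The aggregation query described above looks something like this in the query console (user is quoted here since it can clash with a reserved keyword, and the LIMIT is just for display):

    SELECT "user", COUNT("user") AS edits
    FROM wikipedia
    GROUP BY "user"
    ORDER BY COUNT("user") DESC
    LIMIT 10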
So this is pretty cool, and now you get the idea of how a real-time table works. It connects to a streaming source; you can create plugins for any streaming source, and there are already plugins for Kafka and for the messaging systems of a few of the cloud providers that you can connect to as well. If you need a custom one for whatever you use, you can write it in Java as a plugin. The table connects to that streaming source, pulls the events in, decodes the messages, and maps them to columns in your database, and this happens in real time: as soon as those messages come in, they're immediately queryable.

That gives you the basic overview of how real-time tables work. If you have any questions, ask in the comments, and definitely check out the repo. If you're interested in Apache Pinot, please join our Slack channel, which you can find at pinot.apache.org, ask questions in there, get to know the community, and introduce yourself. We're really excited at how fast the community is growing and how many people are loving and adopting Pinot, so please come join us. Until then, go build some real-time tables, write some apps that run off the data you have in your organization, and happy coding.
Info
Channel: StarTree
Views: 241
Rating: 5 out of 5
Id: jxERUAzb9Eo
Length: 23min 9sec (1389 seconds)
Published: Fri Aug 20 2021