Getting Started with Apache Pinot

Captions
Hello and welcome to Getting Started with Apache Pinot. I'm Matt Landers, Head of Developer Relations at StarTree, and in this video series we're going to go over all the basics of Pinot and get you leveled up and ready to start using Pinot in your environment and with your data.

In this first video we're going to cover the anatomy of Apache Pinot: what all the different components are, how they work together, and how we get them started and run a cluster. Then we're going to do exactly that. We'll run a cluster, load some data in from a file, and query that data. We're not going to cover real-time ingestion in this video, which would be pulling in events from Kafka and having that data immediately ready to query; we'll cover that in a different video. Here the goal is just to get a cluster up and running, load some data, and query it, so we get the basics down of all the components we need in order to run Apache Pinot.

Let's dive into the different components. The first thing I want to talk about is a logical component: the table. In other databases that might not be a logical component; it might actually be the physical representation of the data, like in MySQL. OLTP databases like MySQL, SQL Server, or Postgres can only be scaled vertically because they don't have the concept of a distributed file system. The way Pinot works, though, is that one table can span lots of different servers, and that's what lets us get incredible query performance out of it: we can run hundreds of thousands of queries against a Pinot cluster and get sub-second response times. This is really important when we're talking about surfacing real-time analytics to our users. We can't have a report take five or ten seconds, or minutes, to come back; we need it immediately so we have a great user experience. Tables as the logical representation, and segments as their physical representation, are what make this happen.

A table is made up of many segments, and those segments can be distributed across many different servers. Segments can also be replicated so that you have fault tolerance: if a segment goes down on one server, queries can be served from a different server where that segment also exists. So segments are the key component, the physical representation of the data.

A table also has one of two types: it can be offline or real-time. An offline table holds data that's been loaded from, say, a batch file; it isn't ingesting live events in real time, it's just loading historical data. A real-time table is hooked up to a streaming service like Kafka, pulling events in and making them available to query immediately. One cool thing about Pinot is that even though we might have these two different table types, we can query them as if they were one, which is really important and pretty neat.

Tables adhere to a schema. In most systems you may think of the schema and the table as almost interchangeable, but in Pinot the schema is what defines our columns and data types, and a table implements that schema. So we can have an offline table and a real-time table that implement the same schema, and then when we query, we can query them as if they were one.
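To make that concrete, here is an abridged sketch of what the two table configs behind one logical table might look like. This is not a complete, valid config (real table configs need more fields, and the exact layout depends on your Pinot version); the point is only that both table types reference the same schema, so a query against the logical table name can cover both.

    {
      "tableName": "baseballStats",
      "tableType": "OFFLINE",
      "segmentsConfig": { "schemaName": "baseballStats", "replication": "1" }
    }

    {
      "tableName": "baseballStats",
      "tableType": "REALTIME",
      "segmentsConfig": { "schemaName": "baseballStats", "replication": "1" }
    }

Behind the scenes these become tables like baseballStats_OFFLINE and baseballStats_REALTIME, which the video comes back to at the end.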
This shared schema is a really important thing to understand about how tables work in Pinot, and it's different from what you might get in other OLTP databases. The table is the only logical component, and it doesn't need its own service or anything like that; it's just something you need to understand before we talk about the other components.

The other components we're going to talk about are actual services and servers that need to be up and running for the cluster to be operational. Apache ZooKeeper is what we use for configuration management and metadata. Whenever we make changes to the cluster, that data needs to be saved somewhere so the cluster understands what state it should be in; think of ZooKeeper as the file system for that configuration metadata. It can be distributed across multiple servers so it never goes down, which keeps our cluster up and running happily. In this demo we're only going to run one instance of each of these services, and we'll run them all in one container, so it's not a production environment by any stretch. Just keep in mind that in a real deployment all of these are replicated and built for fault tolerance, to keep your cluster up and running at all times.

The first major component that is actually Pinot is the controller. Controllers are the brains of the operation: they maintain the state of the cluster, which they do by writing that state to ZooKeeper, and they also use Apache Helix. Helix isn't a server we need to run separately; it's embedded in Pinot itself, and it does the cluster management, things like leader elections and all the algorithms you'd expect to need in a distributed system. The controller also provides the endpoints for the data manager, which gives us a UI to view the different components inside our cluster; the REST API endpoints we can use to query the cluster or make configuration changes; and the means of uploading segments, so your batch file processing.

The next thing we need to run is the broker service. The broker manages query execution and maintains a routing table for the tables. Remember those segments are distributed across a lot of servers; the brokers keep track of where those segments are, so that when a query comes in for a table, the broker knows which servers it needs to send that query to in order to get the results back. Say you query a table whose segments span a lot of different servers: the broker sends the query down, gets the responses from each server, then merges and rolls them up into one response that is returned as the result of the query. So brokers are very important for query performance, and they work independently of the controller, which gives us great scalability options.

The servers are the meat of everything. They hold the data; they are where the segments live. When the broker distributes a query, it goes to the servers, and the servers do the actual execution and send the results back to the broker. Servers can also be configured for real-time ingestion, because real-time ingestion happens on the servers themselves.
Most servers are configured to be either offline or real-time, because we don't want the CPU hit we take from real-time ingestion to affect query performance across all of our segments. Offline servers can just be ready and executing queries all the time, but we do have to have that ingestion mechanism somewhere, and it has to make the real-time data available immediately; that's what the real-time servers do.

So those are the major components of Apache Pinot. Now we're going to actually create a Docker container, get all of these running, load some data, and query it. Let's do it.

The first thing we want to do is create a Dockerfile so we can get a container up and running with Apache Pinot loaded. Choose your favorite text editor, create a new Dockerfile, and we'll write it from scratch. We start with FROM openjdk:11; you may be able to use JDK 16 or above by the time you see this video, but as of this recording I can't use 16 just yet because it's so new, though we are working on that. Next we set our working directory to /app. Then we set a Pinot version variable, which we'll use to build the URLs that grab the right release. As of this recording that's 0.7.1, but check the docs and make sure you grab the latest version. The docs also have the links to pull down the zip or the tar file, which we then extract; the download step uses our Pinot version variable, but don't copy what I have here. Go to the docs for the manual cluster setup and grab the current link, because it may be a little different. Finally we expose port 9000, which is where the UI and the APIs are going to live, so we can view it in a browser from our OS.

Save the file, then build the container: run docker build and tag the image; I'll call it gs-pinot, for Getting Started with Pinot. Once the image is built, we want to go into a running container and start up a cluster. So we run docker run, publish port 9000 with -p 9000:9000, add -it, give it the image name gs-pinot, and run bash with /bin/bash. That gets us to a prompt where we can see what's in our working directory. List the files and you'll notice a folder with the long name of the archive we pulled down, plus the tar file itself. I'm going to rename that folder to just pinot so it's not so long, then go into it. Inside you'll see a bunch of different folders, including a bin directory and an examples directory; those are the ones we're going to look at. In examples we have some sample files and different things to get us started playing with Pinot, and bin is where all the scripts are to get the servers up and running, so let's go in there.
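As a recap of the Docker side of this before we move on, the Dockerfile and commands from this section look roughly like the following. The image tag gs-pinot is just the name used in this walkthrough, and the download URL is deliberately left as a placeholder: grab the real link for your version from the Pinot docs, as mentioned above.

    # Dockerfile
    FROM openjdk:11
    WORKDIR /app
    ENV PINOT_VERSION=0.7.1
    # Replace the placeholders with the tar.gz link and file name from the Pinot docs for this version
    RUN wget <download-url-from-the-pinot-docs> && \
        tar -xzf <downloaded-file-name>.tar.gz
    EXPOSE 9000

    # Build the image and start an interactive container with port 9000 published
    docker build -t gs-pinot .
    docker run -p 9000:9000 -it gs-pinot /bin/bash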
In the bin folder you'll see a bunch of quick-start scripts. We could just run the batch quick start and it would do everything we're about to do, but then you wouldn't really get to know all the different components we need in order to get the cluster running and to load in the data. So let's get a cluster up and running using the admin script, pinot-admin.sh. If we just run it, we get help listing all the different commands, and you'll see all of the start commands. We're going to run the start commands for ZooKeeper, the controller, the broker, and the server. There are other commands in there that we're not going to cover right now, but these are the main components you need to get a Pinot cluster up and running. The start commands do have options you can pass in, like setting the ports, but we're just going to use the defaults for everything and get it running for demo purposes. This is just to get you up and running quickly; it definitely isn't what you would do for a production setup, but for this demo it's perfectly acceptable to play with locally.

The first thing we need to run is ZooKeeper, so we say StartZookeeper. I believe this is case sensitive, so make sure it looks like what you see in the help. Also add an ampersand at the end, because we're going to run everything in this one session. You definitely wouldn't do this normally, and it means all the logs are going to print to this session, but with the ampersand we can still get back to a command prompt, or terminal, whatever your name for it is, and type other commands. So now we have ZooKeeper running; that's our configuration and metadata service. It's not really part of Pinot, but Pinot relies on it to manage that state.

Now we run all the Pinot pieces. Run the admin script with StartController, and be sure to add the ampersand. That brings a controller up; the logs you see are the controller talking to ZooKeeper, setting up the cluster and getting all the configuration going. It runs on port 9000, so the controller is what we'll hit from our local machine to see the UI at localhost:9000. That won't quite work yet, so let's get everything else up and running. Next start a broker with StartBroker, remembering the ampersand, and let the logs go by for a second. The broker is what handles our query execution: when a query comes in, the broker knows where all the data is and which servers it's on. But right now we don't have a server, so start one with StartServer, again with the ampersand. And now you have your first Pinot cluster up and running, all in one container. Calling it a cluster is a little extreme here, but we have Pinot up and running.
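Put together, the startup sequence from this section is just four commands run from the bin directory. The command names are case sensitive, and the trailing ampersands keep everything running in this one session:

    cd /app/pinot/bin
    ./pinot-admin.sh StartZookeeper &
    ./pinot-admin.sh StartController &   # UI and REST API on port 9000
    ./pinot-admin.sh StartBroker &
    ./pinot-admin.sh StartServer &

Give each one a few seconds to settle before starting the next, the same way the video pauses between commands, so each component can register itself with ZooKeeper.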
Once the logs have stopped, we can go to a browser and actually check this out. Go to localhost:9000, and there it is: the Pinot UI. It shows us a few different things, so let's step through what they are. We see our controllers, our brokers, and our servers; we just started one of each, and if we continued to start more we would see more of them come online. We also have tenants here. We didn't talk about tenants, and we'll get into them in future videos, but tenants are there to provide isolation for different components like tables. The idea is that in your organization you would only have one Pinot cluster, and then use tenants to separate and isolate the different tables between departments, application teams, or whoever is using it. So finance might have a finance tenant with all of their tables in it, marketing might have a tenant, and other groups might have tenants, and that keeps them all separated: they can't step on each other and they can't see each other's data, which is great. That really gives us multi-tenancy for our organization, and we only have to maintain that one cluster for everybody.

Another thing here is the query console. Once we create a table and a schema and load some data, we can come in here and actually write queries. We don't have anything yet, but this is where we would write our SQL, select star from table and so on; we'll do that in just a little bit. But first let's look at ZooKeeper. From here we can browse ZooKeeper, which means we can look at all the configuration and metadata it has for our cluster. You'll see the Pinot cluster in there, and inside it things like the external view, which is the current running state, and the ideal state. As we make changes, the ideal state changes, and then Helix makes sure the cluster gets to that ideal state in the external view; ideally the two match at some point, and that's what Helix, embedded in those Pinot processes, is managing for us. We won't dig too much in here, but it's cool because you can go through and see all of your metadata: when we create a table it'll show up under the property store, along with our schemas, configs, and everything like that. It's just a nice little visual peek into ZooKeeper.

Now that we have our cluster up and running, with our controller, our broker, and our server, we need to create a table and load some data. The first thing we need to do to create our table is add the schema. I'm not going to do it in the terminal where the cluster is running, so we don't have all those logs coming through; I'll open a new one. Run docker ps and grab the container ID of the running container. I don't want to kill that container or close that session, because that's where my cluster is running; I just want a second terminal into it. So run docker exec -it with that container ID, then /bin/bash, and that gets me into a new shell. Now go into the pinot folder and install vim with apt-get install vim, so we have a way to look at some of these files; we're going to open some of the example files and then run them. First, let's look at the schema file we're going to add to our cluster: open the baseballStats schema file under examples/batch in vim. It's just a JSON file with all of the configuration information for the schema, and remember, the schema just contains the columns that a table is going to implement. Since this is an OLAP database we have different types of columns: metrics and dimensions. The metric columns are things that can be aggregated, so they're usually numbers, like home runs or number of hits. The dimension columns describe the metrics: the player ID, the player name, what league they were in, that kind of thing. So that's all a schema is, the columns a table is going to implement, and we need it as a separate thing because we have multiple table types, offline tables and real-time tables, and they need to implement the same schema.
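For reference, an abridged excerpt of that schema file looks something like this. Only a few of the columns are shown, and the exact field names and types should be checked against the file that ships with your Pinot version:

    {
      "schemaName": "baseballStats",
      "dimensionFieldSpecs": [
        { "name": "playerID",   "dataType": "STRING" },
        { "name": "playerName", "dataType": "STRING" },
        { "name": "teamID",     "dataType": "STRING" },
        { "name": "league",     "dataType": "STRING" }
      ],
      "metricFieldSpecs": [
        { "name": "homeRuns", "dataType": "INT" },
        { "name": "hits",     "dataType": "INT" }
      ]
    }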
Let's go ahead and add this schema to the cluster. We just run the admin script again and say AddSchema, give it that same schema file location, and then add -exec to make sure it actually reads the file in and applies the change. Now go over to the UI and refresh, and we can see we have a schema; click on it and you'll see all of the columns, ready for us to implement a table on top of. Pretty neat.

Now that we have the schema, let's create the table, so that we can load some data into it and start querying it, because that's what we're here to do: we're here to query data, and we're almost there. Let's take a look at the table config first. Go back to that examples/batch baseballStats folder and open the offline table config. You'll see it's a lot smaller: it just says which schema it implements, the name of the table, that it's offline, that kind of stuff. It also has the indexes in here; we don't really have a lot, just a couple of inverted index columns, but we'll get into indexing, which is a key feature of Pinot, in future Getting Started videos, so check those out. Let's add it to the database: call pinot-admin with AddTable, give it the table config file, and same thing, add -exec. Run that, refresh the UI, and there's our baseballStats table. It's connected to our schema, and you can see the table now has all those same columns; we didn't have to define any columns because it just implements that schema. If we go to the query console we can now select baseballStats, but look, nothing's here: we don't have any data. So the next thing we need to do is add some data to our table.

In order to load data we need to create an ingestion job. Back in the terminal, let's first look at the job spec that ships in the examples folder. Under examples/batch baseballStats there's an ingestion job spec, and it has a lot of documentation in it so you can see what each setting does, but basically we're telling it that we want to load a CSV file, and it points at the directory to look in for CSV files. Right now it uses a relative path, and we need to change it to an absolute path for this to work. I don't want to mess up the original, because the quick start uses this file, so let's copy it somewhere: copy the ingestion spec to the root of our working directory, as /app/job.yaml. Now open that copy and edit the input directory so the job knows where the data lives: prefix it with /app/pinot, and add the same prefix to the output directory, which is where the segments are going to be created. Then write the file and exit.
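Collected in one place, the commands from this part of the walkthrough look roughly like the following, run from the /app/pinot directory. The file paths and flag names are as shipped with the 0.7.x example and shown in the pinot-admin help; verify them with ls and with pinot-admin.sh itself if your version differs. The last command is the ingestion job launched in the next step.

    # Register the schema with the cluster (-exec actually applies the change)
    bin/pinot-admin.sh AddSchema \
      -schemaFile examples/batch/baseballStats/baseballStats_schema.json \
      -exec

    # Create the offline table from its table config
    bin/pinot-admin.sh AddTable \
      -tableConfigFile examples/batch/baseballStats/baseballStats_offline_table_config.json \
      -exec

    # In the copied /app/job.yaml, the input and output paths were made absolute, roughly:
    #   inputDirURI:  '/app/pinot/examples/batch/baseballStats/rawdata'
    #   outputDirURI: '/app/pinot/examples/batch/baseballStats/segments'

    # Launch the batch ingestion job (next step)
    bin/pinot-admin.sh LaunchDataIngestionJob \
      -jobSpecFile /app/job.yaml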
Now we run that ingestion job, and we use the pinot-admin script for it too; I know you probably could have guessed that, because we keep using it over and over, but that's how we do it. The command is LaunchDataIngestionJob, and we need to specify the job spec file, which is the file we just created at /app/job.yaml. When we run this, it goes out to the directory we specified in the file and looks for CSV files. There's one, it's pretty small, so it's already done loading, and then we can go back over to the UI and start querying.

In the query console we now have our table. Click on it and it automatically runs a quick query pulling data back, and we can see all of our data in here. We've loaded about 97,000 documents, and we can start writing some analytical queries. Let's do one real quick: select the team ID, sum the home runs, group by team ID, and order by the sum of home runs descending so the most home runs come first. Run that, and we see that New York has the most home runs, which was probably expected if you've watched any baseball. So we've got all of our data in there and we're querying it, just like that.

One thing to notice here is that baseballStats isn't actually the table name. We talked about this a little bit: the real table name is baseballStats_OFFLINE, but when we query, we just query baseballStats. What that means is that whenever we add a real-time table here, a query will get data from the offline table and the real-time table at the same time, so all of our aggregations stay correct.
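A query like the one run in the video would look something like this in the query console. The column names here are the ones from the baseballStats example schema, so double-check them against the table if you've loaded different data:

    SELECT teamID,
           SUM(homeRuns) AS totalHomeRuns
    FROM baseballStats
    GROUP BY teamID
    ORDER BY SUM(homeRuns) DESC
    LIMIT 10

Note that the query targets the logical name baseballStats rather than baseballStats_OFFLINE, which is what makes the hybrid offline plus real-time setup described above work transparently.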
Now that we have all of this running, let's walk through the different pieces we put together to make it work. We ran ZooKeeper for our configuration and metadata. We ran the controller, which gives us this interface, the APIs, the segment loading, and so on. We ran the broker, which manages the queries: when we run a query in the console, the broker gets it, uses its routing table to find which servers hold that table's segments, and routes the query to the right servers. In our case there's only the one server, because this is a really small example, so it routes the query to that server, the server responds with its result, and the broker merges the results from all the different servers and returns the final result. And we ran the server itself, which holds the segments and does the execution. That got our cluster up, then we created our table with a table config and a schema config, and we loaded data with an ingestion job.

In future videos we'll go over real-time data, creating indexes, and things like that, but this is the basics to get you up and running, so you can go load some of your own data. We also have bigger example data sets: check out some of our blogs and examples, and you can load in a lot of GitHub data or data from Wikipedia, or load your own data as well. So go out there, check out Pinot, start using it for some of your real-time applications, and please give us feedback on anything you want to see how to do on the platform. Build something cool using real-time analytics, and happy coding!
Info
Channel: StarTree
Views: 751
Rating: 4.7948718 out of 5
Keywords: Apache Pinot, OLAP, Databases, Real-Time Analytics, Apache Kafka
Id: 9B6MCv0uC1s
Length: 26min 18sec (1578 seconds)
Published: Tue Aug 03 2021